Search not picking up programmatically added doc

tl;dr - just trying to get a simple one line document uploaded via api but neither queries via api nor console are picking up on it

What I’ve done:
1.) Created a very simple corpus via API call and added a simple document (single section, one line). No errors!

2.) If I run a query about the document in the console, there are 0 results. HOWEVER, I noticed the size increased from 0 bytes to ~12, so something happened
Link: Vectara - Semantic Search (Corpus id 4)

3.) If I attempt to re-add via API it says the document has already been added, so I’m guessing it’s definitely there.
{'status': {'code': 'ALREADY_EXISTS', 'statusDetail': '', 'cause': None}, 'quotaConsumed': {'numChars': '17', 'numMetadataChars': '124'}}

4.) I tried uploading a document directly in the UI. This goes through fine and now the search picks up on things. However, the query seems to ignore the document in the answer (document says that zebras are orange; but if I ask about them, the query talks about real world zebra colors).

5.) Finally I tried running the same query via api: got this error, and not sure how to interpret:
{'code': 5, 'message': 'Not Found', 'details': }

Questions:

  1. Is there a way to get more verbose errors? What does the api call error for query mean?

  2. Do you have recommendations on how to debug why programmatically added documents are not showing up? They’re definitely not showing up in console.

  3. Will Vectara perform well for tiny documents? I’m building this for a use case where hallucinations are very costly and I just want it to answer using a simple list of facts I’ve given it, while ignoring the real world data it was trained on. Wondering why the full console

Thanks in advance for your help!! Would love to be able to use this, I’m just stumped and finding it tough to figure out what’s going on without more chatty errors

Thank you for posting @Factline_Dev and welcome to the community!

This sounds quite strange and I think we need to dive in more. We definitely need to make sure the errors are a lot clearer if you’re unable to diagnose the problem from them.

Would it be possible to share some sample indexing request that fails to produce a searchable document? That way we can have a look at the error and try to reproduce/fix it alongside an answer to “why” it’s not working?

Will Vectara perform well for tiny documents? I’m building this for a use case where hallucinations are very costly and I just want it to answer using a simple list of facts I’ve given it, while ignoring the real world data it was trained on. Wondering why the full console

It should! I’ve used it to index Tweets and similar and seen really good results. We use a generative system where we explicitly try to avoid using data that’s irrelevant.


Thanks so much for the fast response and the help!

Ok so here’s more info on question 3 - about querying short documents.

Just using the UI, no code: I uploaded a text file with a single line to a new corpus and asked the question: “What color are zebras?”

Here I’m hoping it will tell me they are purple, but it’s coming up with real world facts. If I ask directly whether they are purple, the doc comes up in the search results, but the chat AI doesn’t seem to use it in the answer.


For question 2 - I’m having trouble getting programmatically added docs to show up in queries at all.

I created a separate empty corpus and wrote a quick Python client (below), auth creds removed. tldr - it just adds a doc that says “Unicorns are blue” as section text.

import json

# BaseAPIClient and _get_jwt_token are defined elsewhere (not shown here).

class VectaraClient(BaseAPIClient):
    def __init__(
        self,
        api_key,
        customer_id,
        base_url='https://api.vectara.io/v1/',
        org_key=None,
        client_id=None
    ):
        super().__init__(api_key, base_url)
        self.customer_id = customer_id
        self.token = _get_jwt_token(
            auth_url='',
            app_client_id='',
            app_client_secret=''
        )


    def get_headers(self, additional_fields=None, token=None):
        # Every request carries the customer id alongside the auth token.
        vectara_header = {
            'customer-id': self.customer_id,
        }
        headers = super().get_headers(token=self.token, additional_fields=vectara_header)
        # headers['X-Customer-ID'] = self.customer_id
        return headers
    
    def _get_create_corpus_json(self, name, description):
        """ Returns a create corpus json. """
        corpus = {}
        corpus["name"] = name
        # Use the description passed in by the caller.
        corpus["description"] = description
        print('corpus', corpus)
        return {'corpus': corpus}
    
    def create_corpus(self, name, description=None):
        data = self._get_create_corpus_json(name, description)
        response = self.post('/create-corpus', data)
        return response
    
    def _get_index_request_json(self, corpus_id):
        """ Returns some example data to index. """
        data = {}
        document = {}
        document["metadata_json"] = json.dumps(
            {
                "book-name": "Another example title",
                "collection": "Mathematics",
                "author": "Example Author"
            }
        )
        sections = []
        section = {}
        section["text"] = "Unicorns are blue"
        sections.append(section)
        document["section"] = sections
        return {
            'customer_id': self.customer_id,
            'corpus_id': corpus_id, 
            'document': document
        }
    
    def add_document(self, corpus_id):
        data = self._get_index_request_json(corpus_id)
        response = self.post('/index', data)
        return response
    
    def run_query(self, query, corpus_id):
        data = {
            "query": [{"query": query}],
            "corpus_key": [{"customer_id": self.customer_id, "corpus_id": corpus_id}]
        }
        return self.post('/', data)

Then I tried the following queries.

response = client.add_document(6)

{'status': {'code': 'OK', 'statusDetail': '', 'cause': None}, 'quotaConsumed': {'numChars': '17', 'numMetadataChars': '124'}}

response = client.add_document(6)

{'status': {'code': 'ALREADY_EXISTS', 'statusDetail': '', 'cause': None}, 'quotaConsumed': {'numChars': '17', 'numMetadataChars': '124'}}

^^ Seems like it was added based on this last response!
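The two index responses can also be told apart in code rather than by eye. This is only a minimal sketch, assuming the response is a dict shaped like the ones printed in this post; `index_outcome` and the labels it returns are hypothetical names, not part of any Vectara client.

```python
def index_outcome(response):
    """Classify an index response by its status code.

    Assumes the dict shape printed in this post; the helper name and the
    returned labels are illustrative only.
    """
    code = response.get('status', {}).get('code')
    if code == 'OK':
        return 'indexed'
    if code == 'ALREADY_EXISTS':
        return 'duplicate'
    return 'failed'


# The two responses from the add_document calls, copied from this post.
first = {'status': {'code': 'OK', 'statusDetail': '', 'cause': None},
         'quotaConsumed': {'numChars': '17', 'numMetadataChars': '124'}}
second = {'status': {'code': 'ALREADY_EXISTS', 'statusDetail': '', 'cause': None},
          'quotaConsumed': {'numChars': '17', 'numMetadataChars': '124'}}

print(index_outcome(first), index_outcome(second))
```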

But then this query doesn’t go through:

response = client.run_query('what color are unicorns', 6)
{'code': 5, 'message': 'Not Found', 'details': }
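The numeric `code` in this error looks like a gRPC status code, where 5 means NOT_FOUND. That the API reports gRPC-style codes is an assumption based on the shape of the error, not something confirmed in this thread. A small sketch that turns the number into a name; `describe_error` is a hypothetical helper and the table is only a subset of the standard codes.

```python
# A subset of the standard gRPC status codes, keyed by numeric value.
GRPC_STATUS = {
    0: 'OK',
    3: 'INVALID_ARGUMENT',
    5: 'NOT_FOUND',
    6: 'ALREADY_EXISTS',
    7: 'PERMISSION_DENIED',
    16: 'UNAUTHENTICATED',
}


def describe_error(error):
    """Render an error dict like the one above as a readable string."""
    code = error.get('code')
    name = GRPC_STATUS.get(code, 'UNRECOGNIZED')
    return f"{name} ({code}): {error.get('message', '')}"


print(describe_error({'code': 5, 'message': 'Not Found'}))
```

If that reading is right, the request reached the server but the thing it addressed (possibly the URL path itself) was not found.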

And if I try asking the question in the UI it also isn’t finding any results.
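One thing that may be worth checking is the path passed to `post` in `run_query`: it is `'/'`, while the other calls name an endpoint (`'/create-corpus'`, `'/index'`). The sketch below assumes the v1 query endpoint is `query` under the base URL and that `corpus_key` nests inside each query item. Both are assumptions to verify against the API reference, `build_query_request` is a hypothetical helper, and the customer id `123` is a placeholder.

```python
def build_query_request(base_url, customer_id, corpus_id, query, num_results=10):
    """Return (url, payload) for a query call.

    Assumptions (not confirmed in this thread): the endpoint path is
    'query', and corpus_key belongs inside each query item.
    """
    url = base_url.rstrip('/') + '/query'
    payload = {
        'query': [{
            'query': query,
            'num_results': num_results,
            'corpus_key': [
                {'customer_id': customer_id, 'corpus_id': corpus_id},
            ],
        }]
    }
    return url, payload


url, payload = build_query_request(
    'https://api.vectara.io/v1/', 123, 6, 'what color are unicorns'
)
print(url)
print(payload)
```

Printing the URL and payload before sending makes it easier to compare the request against the documented shape. It does not explain why the console also returns no results for the indexed document, which may be a separate issue.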