Issues with using the Indexing REST API

Charbel_Tabet · April 21, 2024, 12:21am

I am having issues indexing data, which may be due to my lack of expertise in this area.

When I index data, the content I put in the JSON field document.description is taken into account when querying, but what I put in document.metadataJson is not.

Let’s take for instance this document:

{
  "id": "10",
  "metadata": {
    "website": "stcharbel.ca",
    "title": "St. Charbel - Ottawa",
    "desc": "{\"Phone Number\": \"+1 (613) 749-9494\", \"Description\": \"From humble beginnings at 87 Mann Avenue, the Maronite community in Ottawa dedicated its efforts in search of a bigger home for a growing community. In 1994, with the blessing of His Excellency Bishop Georges Abi Saber, the Maronite community in Ottawa purchased its current home located at 245 Donald St., with its additional parking facilities that accommodate more than 300 automobiles. The church was dedicated to St. Charbel, who is the first saint of the Lebanese Maronite Order.\", \"Facebook Page\": \"stcharbelottawa\", \"Address\": \"245 Donald St, Ottawa, ON K1K 1N1, Canada\", \"Geohash\": \"f244qmgv84er\", \"Coordinates\": \"Lat: 45.4280523, Lng: -75.6576875\", \"Slug\": \"st-charbel-ottawa\", \"Tagline\": \" Celebrating over 1600 years of Heritage, Faith and Resilience\", \"Active Status\": \"Inactive\", \"Created At\": \"2024-01-07 13:24:09.694000\", \"Updated At\": \"2024-01-07 13:24:16.067000\", \"Published At\": \"2024-01-07 13:24:16.063000\"}"
  }
}

Querying about anything in the description like the phone number will work:
[screenshot]

But querying about the website which is outside of the description will not work:
[screenshot]

Why is that? I have tried adding doc.website as a filter attribute (see what I circled in red), but the LLM still can’t answer questions about it.

Am I missing something? Please let me know. From what I have seen in the demos and the docs, dumping all of the document’s data into the description does not seem to be the right approach.

ofermend · April 21, 2024, 6:40pm

Hi @Charbel_Tabet -
The information inside the document metadata is generally not treated as text for matching, so it will not be returned as part of the search result or included in the summary. I realize that you did get the phone number correctly in the first example, and it’s a bit hard for me to determine exactly how that happened, but my best guess is that the LLM produced it from its own training data rather than from your corpus. To validate this, do you mind doing a quick experiment: put a “fake” phone number in that metadata, reindex the document, and retry the query? Then we will know for sure.
Generally I would recommend including all information you want to be “searchable” as part of the text component in a document. For this website, does the text in “description” include all the content captured from the website?

Charbel_Tabet · April 21, 2024, 6:53pm

Hi,

The phone number was retrieved from the content in metadata.desc:

ofermend · April 21, 2024, 7:34pm

Can you please share the details of what you put in the “section” part of the document (and esp the value associated with the “text” key) you use when indexing?

Charbel_Tabet · April 21, 2024, 8:02pm

I had not used the “section” part until now; thanks for sharing this documentation page. I have now tried adding data in the “section” part, but the bot still can’t retrieve the phone number when asked. This is the new document structure I am posting:

{
    "customerId": 0, # dummy id
    "corpusId": 1,
    "document": {
        "documentId": "10",
        "title": "St. Charbel - Ottawa",
        "description": "From humble beginnings at 87 Mann Avenue, the Maronite community in Ottawa dedicated its efforts in search of a bigger home for a growing community. In 1994, with the blessing of His Excellency Bishop Georges Abi Saber, the Maronite community in Ottawa purchased its current home located at 245 Donald St., with its additional parking facilities that accommodate more than 300 automobiles. The church was dedicated to St. Charbel, who is the first saint of the Lebanese Maronite Order.",
        "metadataJson": "",
        "section": [
            {
                "id": 1,
                "title": "Phone Number",
                "text": "+1 (613) 749-9494",
                "metadataJson": ""
            },
            {
                "id": 2,
                "title": "Website",
                "text": "stcharbel.ca",
                "metadataJson": ""
            },
            {
                "id": 3,
                "title": "Facebook Page Username",
                "text": "stcharbelottawa",
                "metadataJson": ""
            },
            {
                "id": 4,
                "title": "Address",
                "text": "245 Donald St, Ottawa, ON K1K 1N1, Canada",
                "metadataJson": ""
            },
            {
                "id": 5,
                "title": "Geohash",
                "text": "f244qmgv84er",
                "metadataJson": ""
            },
            {
                "id": 6,
                "title": "Coordinates",
                "text": "Lat: 45.4280523, Lng: -75.6576875",
                "metadataJson": ""
            },
            {
                "id": 7,
                "title": "Slug",
                "text": "st-charbel-ottawa",
                "metadataJson": ""
            },
            {
                "id": 8,
                "title": "Tagline",
                "text": "Celebrating over 1600 years of Heritage, Faith and Resilience",
                "metadataJson": ""
            },
            {
                "id": 9,
                "title": "Created At",
                "text": "2024-01-07 13:24:09.694000",
                "metadataJson": ""
            },
            {
                "id": 10,
                "title": "Updated At",
                "text": "2024-01-07 13:24:16.067000",
                "metadataJson": ""
            },
            {
                "id": 11,
                "title": "Published At",
                "text": "2024-01-07 13:24:16.063000",
                "metadataJson": ""
            }
        ]
    }
}

From the console, I cannot see my sections. Am I doing anything wrong when adding these sections?

Charbel_Tabet · April 21, 2024, 8:28pm

Update

I am now including the attribute name in document.section[].text, and I am getting better results.

{
    "customerId": 0, # dummy id
    "corpusId": 1,
    "document": {
        "documentId": "10",
        "title": "St. Charbel - Ottawa",
        "description": "From humble beginnings at 87 Mann Avenue, the Maronite community in Ottawa dedicated its efforts in search of a bigger home for a growing community. In 1994, with the blessing of His Excellency Bishop Georges Abi Saber, the Maronite community in Ottawa purchased its current home located at 245 Donald St., with its additional parking facilities that accommodate more than 300 automobiles. The church was dedicated to St. Charbel, who is the first saint of the Lebanese Maronite Order.",
        "metadataJson": "",
        "section": [
            {
                "text": "Phone Number: +1 (613) 749-9494"
            },
            {
                "text": "Website: stcharbel.ca"
            },
            {
                "text": "Facebook Page Username: stcharbelottawa"
            },
            {
                "text": "Address: 245 Donald St, Ottawa, ON K1K 1N1, Canada"
            },
            {
                "text": "Geohash: f244qmgv84er"
            },
            {
...
            }
        ]
    }
}

I receive correct answers to the “What’s the phone number” question:

Is this the optimal way to build the index? Vectara still isn’t able to answer some other basic questions where the data is in the sections.

ofermend · April 21, 2024, 11:48pm

Yes, RAG works in terms of natural language and searches the text portion. It is common practice to take a row of characteristics from a database and create an “artificial sentence” from it, and then use that as the text for indexing, so that all those attributes are discoverable. How you did it is perfectly fine. In other cases, I’ve seen folks create one single “artificial document” and index that as a single section, like “Website: stcharbel.ca. Phone number: +1 (613) 749-9494. Geohash f244qmgv84er…” (imagine it includes all the rest).
And you can add some of these characteristics as metadata in parallel, if it’s helpful for your application, but as discussed above it won’t be searchable; it is just returned with the results.
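
To make the “artificial sentence” idea concrete, here is a minimal Python sketch of how the document body could be assembled from a row of attributes. It only builds the dictionary and does not call any API. The helper names attribute_sentences and build_document are hypothetical, the field names (documentId, title, description, metadataJson, section, text) are copied from the document structure posted earlier in this thread, and passing the metadata as a JSON-encoded string is an assumption based on the metadataJson field being a string there.

```python
import json


def attribute_sentences(record):
    """Turn each non-empty attribute into a "Name: value" sentence."""
    return [
        f"{name}: {value}"
        for name, value in record.items()
        if value not in (None, "")
    ]


def build_document(doc_id, title, description, record,
                   metadata=None, single_section=False):
    """Assemble the "document" part of an indexing request (hypothetical helper).

    record   -- attributes that should be searchable, so they go into section text
    metadata -- attributes that are only returned with results, not searched
    """
    sentences = attribute_sentences(record)
    if not sentences:
        sections = []
    elif single_section:
        # One "artificial document": every attribute in a single section.
        sections = [{"text": ". ".join(sentences) + "."}]
    else:
        # One section per attribute, as in the update posted above.
        sections = [{"text": sentence} for sentence in sentences]
    return {
        "documentId": doc_id,
        "title": title,
        "description": description,
        # Assumption: metadata is sent as a JSON-encoded string.
        "metadataJson": json.dumps(metadata or {}),
        "section": sections,
    }


record = {
    "Phone Number": "+1 (613) 749-9494",
    "Website": "stcharbel.ca",
    "Geohash": "f244qmgv84er",
}
document = build_document(
    "10",
    "St. Charbel - Ottawa",
    "From humble beginnings at 87 Mann Avenue...",
    record,
    metadata={"website": "stcharbel.ca"},
)
print(json.dumps(document, indent=2))
```

Keeping the attribute name inside the text (“Website: stcharbel.ca” rather than just “stcharbel.ca”) is what gives the retriever something to match a question like “what is the website” against. Whether one section per attribute or a single combined section works better likely depends on the data, so it is worth trying both.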
