List of Words/Ngrams of a given Document indexed with Vectara

Hello,

I’m trying to use Vectara to automate internal linking between documents. Don’t hesitate to tell me if you think I’m getting it wrong, but here is how I think the application should work.

When a document is created (for example, a blog post in WordPress):

  1. The document D is indexed using Vectara
  2. An API request is made to get the list of words/n-grams that D was just indexed on AND that also match other documents
  3. Individual queries are made on the returned n-grams (on which D is indexed) to find the other relevant documents
  4. The matching words/n-grams or phrases in D are transformed into links

From what I know, during indexing a document is split into words or groups of words. How can I get the full list of those words after I have indexed a document?

Thank you for your help

Hey there @wp_semantic and welcome to the forums!

The workflow you describe is pretty common for a keyword-based recommender system, but with a vector-/LLM-based solution, you generally want to do things a bit differently. At least a little of the “why” is important because it helps you understand the “what to do,” but I’ll tell you the “what” first and you can read the “why” at the end.

What:
So at its base, what you’d do is just perform a search in Vectara and feed the document content in as the query. Vectara will convert the “search document” to a vector and find documents that are similar. There is a catch here though: by default, Vectara is set up in a “question answering” mode. That is, in their default mode, Vectara’s large language models are designed in principle to answer an end user’s question rather than to find similar documents. So what you want to do is switch the search from question answering to document similarity. You do that via the semantics key, which sits inside the corpusKey block of the query. Our documentation on this is currently weak, but I’m planning on putting more together this week. For now, the REST playground will at least show you where to plug in this info via REST, or you can look here if you’re using gRPC.

For example, if you had a document that started with All about me\n\nMy name is Shane and I'm ... and you wanted to find similar documents:

{
  "query": [
    {
      "query": "All about me\n\nMy name is Shane and I'm ...",
      "start": 0,
      "numResults": 10,
      "corpusKey": [
        {
          "customerId": 12345678,
          "corpusId": 1,
          "semantics": "RESPONSE"
        }
      ]
    }
  ]
}
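
If it helps, here is roughly what that call looks like from code. This is only a rough sketch: it assumes the v1 REST query endpoint (https://api.vectara.io/v1/query) and API-key auth headers, so adjust the endpoint, authentication, and IDs for your own setup.

# Minimal sketch: send the document text as the query, with "semantics": "RESPONSE"
# so Vectara does document-to-document similarity instead of question answering.
# Assumes the v1 REST query endpoint and API-key auth; adjust for your setup.
import requests

CUSTOMER_ID = 12345678            # your Vectara customer ID
CORPUS_ID = 1                     # the corpus to search
API_KEY = "<YOUR_QUERY_API_KEY>"  # a query-capable API key

def find_similar(document_text, num_results=10):
    body = {
        "query": [
            {
                "query": document_text,
                "start": 0,
                "numResults": num_results,
                "corpusKey": [
                    {
                        "customerId": CUSTOMER_ID,
                        "corpusId": CORPUS_ID,
                        "semantics": "RESPONSE",
                    }
                ],
            }
        ]
    }
    resp = requests.post(
        "https://api.vectara.io/v1/query",
        json=body,
        headers={"customer-id": str(CUSTOMER_ID), "x-api-key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

results = find_similar("All about me\n\nMy name is Shane and I'm ...")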

The second thing you may want to do after that is to start experimenting with different chunking and section-matching strategies. For example, if you pass in a body (either the full text or the title) as the query and add part.is_title = true to the metadataFilter, you can get recommendations based just on titles (see the sketch below). Or you could add some metadata for things like description and match on that.
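
To make that concrete, the title-only variant only changes the corpusKey entry from the example above. This is just a sketch; the exact behavior depends on which filter attributes are enabled on your corpus.

# Sketch: restrict matching to title sections. Same request body as above;
# only the corpusKey entry changes. part.is_title needs to be a filterable
# attribute on your corpus for this filter to apply.
title_only_corpus_key = {
    "customerId": 12345678,
    "corpusId": 1,
    "semantics": "RESPONSE",
    "metadataFilter": "part.is_title = true",
}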

I suspect you’ll get pretty good results overall with these, and you can then iterate beyond them. If you’re trying to do recommendations based on the overall document content (and not just the title, description, etc.), then Vectara offers more fine-grained controls, including a low-level indexing API where you can control the specific text as well as the context for it, but I’d treat those as expert APIs for now and see how far these first (relatively) “simple” strategies get you.

Why:
Vectara is able to understand the actual semantic meaning of queries and documents, but it works much better when word order, punctuation, capitalization, etc. are preserved. “I saw the movie IT”, “I saw the ad placement: it works in the movie I saw”, and “my friend in IT works in the movie I saw” all use “it” with very different meanings, even though the latter two end with the exact same word sequence apart from capitalization (“it works in the movie I saw”). Vectara has the ability to understand that context, so if you tokenized the text and sent a bag of words, you’d lose one of the key things Vectara can really do.

Vectara should understand that the last of these should match queries/documents relating to “coworkers” (from “my friend in IT”) and “technology workers,” and give good results for matching similar documents even without those exact keywords.

Let us know if we can help further!


Thank you very much for your response, it’s very interesting. I also appreciate the “why” part, which is really important.

I understand perfectly that the bag-of-words technique doesn’t make sense in the context of Vectara, where the context is crucial for extracting/understanding meaning.

It’s true that I mentioned “words,” but I also mentioned n-grams, which refer to expressions composed of 3 or 4 words that preserve even more context.

I understand that it may still be insufficient, but in that case, considering my objective, can we imagine that longer chunks would be a good alternative instead of using the entire document as a query?

That being said, I’m not at all opposed to using the entire document as a query because, firstly, I find that method simpler. You provided an example of a query to the API; could you give me an example of a response to that query?

The only doubt I have is related to the usefulness of the output. My goal is to add hyperlinks to relevant pages based on small, relevant snippets of text, rather than entire sections. Does that make sense to you?

Got it! (Apologies for the confusion: I know “n-gram” should be standard nomenclature, but I wasn’t sure whether you meant n-grams of words or n-grams of characters: Lucene has an “n-gram tokenizer” that breaks strings up into tokens of a particular size, and generally unordered bag-of-words approaches are used for most queries/recommendations on top of that.)

Anyway, yes, longer strings of words generally do better with Vectara. The reason I was suggesting the whole document was just that I was thinking it might make your life a bit easier as a developer, since you wouldn’t have to deal with tokenization.

This is super useful context, thanks for sharing! And it would change the approach (slightly). I would expect that this should work really well with Vectara with some minimal modification.

What I’d do instead of sending a string of 3–4 ordered words is to break the text up on sentence boundaries instead, and then send a whole sentence; in general, that will preserve even more context (see the sketch below). As a rough ordering, I’d expect sentence ≈ paragraph > word-order-preserved ngram[5-6] > word-order-preserved ngram[3-4] > whole document > bag of words > single word, though this may vary with your documents.
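
If you go the sentence route, the splitting itself can stay simple. Here’s a minimal sketch using a naive regex splitter (a real sentence tokenizer, e.g. nltk’s punkt, handles abbreviations and decimals better), issuing one similarity query per sentence:

import re

def split_sentences(text):
    # Split after ., ! or ? followed by whitespace; drop empty pieces.
    # Deliberately naive: abbreviations like "e.g." will over-split.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

document_text = "My name is Shane. I write about 3D printing. This post covers overhang and bridging tests."
for sentence in split_sentences(document_text):
    # One similarity query per sentence (e.g. via the find_similar() sketch above);
    # each result set gives candidate link targets for that sentence.
    print(sentence)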

You provided an example of a query to the API, could you give me an example of a response to that query?

Of course. By the way, you can use the console to see the result JSON of a query by selecting “Copy request” on the “Search” tab of a corpus, then navigating to the Response and selecting “Copy as JSON.”

We don’t yet support setting the RESPONSE similarity measure via the UI, but the overall structure of the response doesn’t vary. Here’s an example response:

{
    "responseSet": [
        {
            "generatedList": [],
            "summaryList": [],
            "futureId": 0,
            "response": [
                {
                    "text": "Now everything should be fixed in \"3D printer test fixed (3nd gen). stl\" file. \r\n\r\n- I've also changed rotation of the model, so now it should be in the correct position :-). %START_SNIPPET%This test includes : support test, scale test, overhang test, hole test, diameter test and bridging test.%END_SNIPPET% Print this with 100% Infill without supports. NOTE! : If you are using Cura and you are experiencing missing text issue, be sure to enable \"use thin wall\" setting!",
                    "score": 0.43887752294540405,
                    "documentIndex": 1,
                    "corpusKey": {
                        "customerId": 0,
                        "corpusId": 90,
                        "semantics": 2,
                        "metadataFilter": "",
                        "dim": []
                    },
                    "resultOffset": 190,
                    "resultLength": 105,
                    "metadata": [
                        {
                            "name": "lang",
                            "value": "eng"
                        },
                        {
                            "name": "offset",
                            "value": "2097"
                        },
                        {
                            "name": "len",
                            "value": "105"
                        }
                    ]
                },
                ...
            ],
            "status": [],
            "document": [
                {
                    "id": "5662572",
                    "metadata": [
                        {
                            "name": "id",
                            "value": "5662572"
                        },
                        {
                            "name": "created_time",
                            "value": "2022-11-28T00:10:06+00:00"
                        },
                        {
                            "name": "last_edited_time",
                            "value": "2022-11-28T00:10:06+00:00"
                        },
                        {
                            "name": "like_count",
                            "value": "0"
                        },
                        {
                            "name": "collect_count",
                            "value": "0"
                        },
                        {
                            "name": "comment_count",
                            "value": "0"
                        },
                        {
                            "name": "download_count",
                            "value": "6"
                        }
                    ]
                },
                {
                    "id": "2656594",
                    "metadata": [
                        {
                            "name": "id",
                            "value": "2656594"
                        },
                        {
                            "name": "created_time",
                            "value": "2017-11-19T15:47:42+00:00"
                        },
                        {
                            "name": "last_edited_time",
                            "value": "2017-11-19T15:47:42+00:00"
                        },
                        {
                            "name": "like_count",
                            "value": "51983"
                        },
                        {
                            "name": "collect_count",
                            "value": "87638"
                        },
                        {
                            "name": "comment_count",
                            "value": "424"
                        },
                        {
                            "name": "download_count",
                            "value": "571083"
                        }
                    ]
                },
                ...
            ]
        }
    ],
    "status": []
}

I’ve put ellipses in a couple of places to keep the response shorter rather than listing out all 10 results. But the general structure you’ll see is:

  • responseSet[].response[] contains the list of individual snippets that match the query (responseSet is an array because you can submit multiple queries to Vectara at the same time, but I don’t think you need to based on your use case)
  • responseSet[].response[].score is the relevance score
  • responseSet[].response[].text is the snippet text. In this case, I also passed in query[0].corpusKey[0].context set to { 'sentences_before':2, 'sentences_after':2, 'start_tag':'%START_SNIPPET%', 'end_tag':'%END_SNIPPET%' }. Setting “context” is optional, but might help you in this use case. The sentences_before tells Vectara how many sentences before the relevant snippet to also include in the text, and sentences_after does the same, but after the snippet. (There’s also chars_before and chars_after.) These can be useful to show a bit more context around the snippet, since individual sentences aren’t always super meaningful. The start_tag and end_tag are also optional, but they can help you do highlighting in your application by showing exactly where the matching snippet starts and ends. Alternatively, for highlighting you can use the resultOffset and resultLength fields if you want to do the character counting yourself.
  • Individual sections and whole documents can each have their own metadata. responseSet[].response[].metadata[] contains the metadata for the individual section. For example, we see lang, which tells you what language Vectara thought the snippet was written in (you can use this to filter for only particular languages via the metadataFilter, if that’s of interest). In responseSet[].document[], you can find the document IDs and metadata.
  • The relationship between the lists in responseSet[].response[] and responseSet[].document[] can be found in responseSet[].response[].documentIndex (which isn’t super intuitive; we’ll change that at some point in a future version of the API). See the parsing sketch below for how these pieces fit together.
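
Putting those pieces together, here is a rough sketch of how you might walk a response like the one above to build link candidates: pair each snippet with its parent document via documentIndex and pull the tagged snippet text back out. The min_score cutoff is an arbitrary value for this sketch; tune it for your corpus.

# Sketch: turn a query response (the JSON shape shown above) into link candidates.
# Each candidate pairs a matching snippet with the document it came from.
def extract_link_candidates(response_json, min_score=0.3):
    candidates = []
    for response_set in response_json.get("responseSet", []):
        documents = response_set.get("document", [])
        for hit in response_set.get("response", []):
            if hit["score"] < min_score:
                continue
            doc = documents[hit["documentIndex"]]  # map snippet -> parent document
            text = hit["text"]
            # If start/end tags were set in the context config, the exact matching
            # snippet sits between them; otherwise fall back to the full text.
            if "%START_SNIPPET%" in text and "%END_SNIPPET%" in text:
                snippet = text.split("%START_SNIPPET%", 1)[1].split("%END_SNIPPET%", 1)[0]
            else:
                snippet = text
            candidates.append({
                "document_id": doc["id"],
                "score": hit["score"],
                "snippet": snippet,
            })
    return candidates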

FYI we also have some examples of how to interpret the responses in our docs: e.g. Highlighting and Snippet Extraction | Vectara Docs