Issue with hybrid search + metadata filter

hossam · January 14, 2024, 4:36pm

I’m trying to use a metadata filter to exclude some corpora from the search. The problem is that when I use hybrid search and a metadata filter together, the system takes more than 30 seconds to return results for some queries.
If I use hybrid search alone or the metadata filter alone, results come back without any noticeable delay.

Here is a curl example for a request that took more than 30 seconds:

curl -X POST \
-H "Authorization: Bearer eyJraWQiOiJZZmRcL2p3UG9nSjhQRGZcL3c1TXZlZWRHRXl0OVRHNnRlVUJ5WDlJeHUrZ3M9IiwiYWxnIjoiUlMyNTYifQ.eyJzdWIiOiJmNmY0NGVkZi1kZWY1LTQ2YzItYTgzZi0xN2RkOWYwZGQ1ZjUiLCJlbWFpbF92ZXJpZmllZCI6dHJ1ZSwiaXNzIjoiaHR0cHM6XC9cL2NvZ25pdG8taWRwLnVzLXdlc3QtMi5hbWF6b25hd3MuY29tXC91cy13ZXN0LTJfd1lCSFg5VjNOIiwiY29nbml0bzp1c2VybmFtZSI6Imhvc3NhbSIsIm9yaWdpbl9qdGkiOiJjZTNjODRlYi03MDM4LTRmMTQtOWUwYS0xNWM4ZDI2YTY4ZDYiLCJhdWQiOiJ1N3J0Ymppa2MxYjJlZWFrMGhkdG43cWlxIiwiZXZlbnRfaWQiOiIwMzgzYjAwYS0zZTYxLTQyMDktODc3OC1mMzk5NzY0NWE0OTYiLCJ0b2tlbl91c2UiOiJpZCIsImF1dGhfdGltZSI6MTcwMjY1NjkxOCwiZXhwIjoxNzA1MjQ1MTAwLCJpYXQiOjE3MDUyNDE1MDAsImp0aSI6IjFlZGIzZjE5LWJkMDYtNGMwZC1iOGVkLWRhNjgwMmI4OTNmOCIsImVtYWlsIjoiaG9zc2FtaGFzc2FuMTg5NkBnbWFpbC5jb20ifQ.I6lEjbphuW3pqtGLWBjBab-C3t6ycF8ZsZoJSdfh54tSezxQwTl3XoUk1QAftuvrbaVAqh0L9sTOBi6ehWYP-FuaLD_Mfi9YvKBAt93dmWT8483TsZenWySn2Uw1U6yliAH_n6N5pdZ0tlUOrc2-GGyxzjn5yYfk7-hTM-6hvMZx9_WM_2o_FykBjXzB0vXAgeEPOlGcVB_LrHUFzVa_8FMP6rOKpUCcP5U23Gvs4B8GNWamC6tLHDyzXZbKwC3P6NuMCJjgBF5_YDPv_QwSEG00-5Q4YYt_EWxAu_Cw4mwhLvep1NRjHv-KvkqvWcGZkjRgMRf2_LHyEcJDzy0J5A" \
-H "customer-id: 1353344576" \
https://api.vectara.io:443/v1/query \
-d @- <<END;
{
  "query": [
    {
      "query": "Paradise",
      "queryContext": "",
      "start": 0,
      "numResults": 10,
      "contextConfig": {
        "charsBefore": 0,
        "charsAfter": 0,
        "sentencesBefore": 2,
        "sentencesAfter": 2,
        "startTag": "%START_SNIPPET%",
        "endTag": "%END_SNIPPET%"
      },
      "corpusKey": [
        {
          "customerId": 1353344576,
          "corpusId": 34,
          "semantics": 0,
          "metadataFilter": "part.corpus != 'ontology topics' and part.corpus != 'tags'",
          "lexicalInterpolationConfig": {
            "lambda": 0.005
          },
          "dim": []
        }
      ],
      "summary": []
    }
  ]
}
END
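
To make the comparison reproducible, the three cases (hybrid search plus filter, hybrid search alone, filter alone) can be timed with the same request body. The following is only a rough sketch: the helper names `build_body` and `timed_post` are hypothetical, and it assumes that omitting `lexicalInterpolationConfig` or `metadataFilter` from the corpus key switches that feature off, which should be checked against the API reference.

```python
import copy
import json
import time
import urllib.request

# Corpus key copied from the curl request above.
BASE_CORPUS_KEY = {
    "customerId": 1353344576,
    "corpusId": 34,
    "semantics": 0,
    "metadataFilter": "part.corpus != 'ontology topics' and part.corpus != 'tags'",
    "lexicalInterpolationConfig": {"lambda": 0.005},
    "dim": [],
}


def build_body(query, use_hybrid, use_filter):
    """Return a query body with hybrid search and/or the metadata filter enabled."""
    key = copy.deepcopy(BASE_CORPUS_KEY)
    if not use_hybrid:
        # Assumption: leaving this key out disables lexical interpolation.
        key.pop("lexicalInterpolationConfig")
    if not use_filter:
        # Assumption: leaving this key out means "no filter".
        key.pop("metadataFilter")
    return {
        "query": [
            {"query": query, "start": 0, "numResults": 10, "corpusKey": [key]}
        ]
    }


def timed_post(body, token, customer_id,
               url="https://api.vectara.io/v1/query", timeout=60):
    """POST the body and return the elapsed wall-clock time in seconds."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "customer-id": str(customer_id),
            "Content-Type": "application/json",
        },
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()  # drain the response so the full round trip is measured
    return time.perf_counter() - start
```

Calling `timed_post(build_body("Paradise", True, True), token, 1353344576)` and then repeating it with one of the two flags set to `False` gives three timings that can be compared directly.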

tallat · January 16, 2024, 9:22pm

Hi hossam,

It seems that when setting up the corpus, you did not create an index on the “corpus” filterable metadata attribute. Can you please try doing that? An indexed filterable metadata attribute helps queries run faster.
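
As a rough illustration of what an indexed filter attribute might look like, here is a small sketch that builds such a definition. The field names and enum strings (`filterAttributes`, `indexed`, `type`, `level`) are assumptions about the corpus definition format and should be checked against the API reference; the helper `filter_attribute` and the description text are hypothetical. The part-level setting is inferred from the `part.corpus` expression in the query.

```python
import json


def filter_attribute(name, description="", indexed=True):
    """Build one filter-attribute definition (field names are assumptions)."""
    return {
        "name": name,
        "description": description,
        "indexed": indexed,  # the setting suggested above
        "type": "FILTER_ATTRIBUTE_TYPE__TEXT",  # assumed enum value
        "level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT_PART",  # assumed enum value
    }


attr = filter_attribute("corpus", "Logical corpus that a part belongs to")
print(json.dumps({"filterAttributes": [attr]}, indent=2))
```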

Separately, looking at your query request, the filter suggests you are using filter expressions as a way to deploy several “corpora” into a single corpus instead of creating several separate corpora. Using separate corpora will help with data isolation, and to some extent, query processing. Is there a reason you aren’t using separate corpora? Can you try that?
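
For illustration, here is a minimal sketch of how the request body could look with separate corpora instead of a filter expression. It assumes the `corpusKey` array accepts more than one entry per query, as its list shape in the request suggests. The corpus IDs 35 and 36 and the helper names are hypothetical.

```python
def corpus_keys(customer_id, corpus_ids, lambda_value=0.005):
    """Build one corpusKey entry per corpus, all with the same hybrid-search lambda."""
    return [
        {
            "customerId": customer_id,
            "corpusId": corpus_id,
            "lexicalInterpolationConfig": {"lambda": lambda_value},
        }
        for corpus_id in corpus_ids
    ]


def build_query(text, customer_id, corpus_ids, num_results=10):
    """Build a query body that searches only the listed corpora, with no metadata filter."""
    return {
        "query": [
            {
                "query": text,
                "start": 0,
                "numResults": num_results,
                "corpusKey": corpus_keys(customer_id, corpus_ids),
            }
        ]
    }


# Hypothetical corpus IDs: only the corpora to be searched are listed,
# so the excluded ones need no filter expression.
body = build_query("Paradise", 1353344576, [35, 36])
```

With this layout, excluding a corpus means leaving its ID out of the list rather than evaluating a filter on every part.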

Meanwhile, we’re going to dig into this further on our side as well.

Regards,
Tallat

hossam · January 17, 2024, 2:07pm

I have edited the “corpus” filter attribute and marked it as indexed, then ran the request again, but the delay is unchanged: it still takes more than 30 seconds.

The reason for keeping multiple “corpora” in a single corpus is that all of the data is linked to the original document ID. It’s like having one table in a SQL database where all columns are always non-null, which is more efficient than having multiple tables in a one-to-one relation with the original table.

tallat · January 18, 2024, 11:55pm

Thanks for trying, and apologies for the bad performance.

> I have edited the “corpus” filter and made it as indexed, tried to run the request again but didn’t notice any difference in the delay, still taking more than 30 seconds.

I’ve opened a ticket on our end to investigate this further. The slowness seems to come from one of the internally generated queries performing poorly in this particular case. It may take a while to diagnose fully, implement a fix, and roll it out to production, but we are definitely going to work on it. I’ll post here when we have any meaningful updates.
