Issue with hybrid search + metadata filter

I’m trying to use a metadata filter to exclude some corpora from the search. The problem is that when I use hybrid search and a metadata filter together, the system takes more than 30 seconds to return results for some queries.
If I use hybrid search alone or a metadata filter alone, the results return without any noticeable delay.

Here is a curl example for a request that took more than 30 seconds:

```
curl -X POST \
-H "Authorization: Bearer <JWT_TOKEN>" \
-H "customer-id: 1353344576" \
https://api.vectara.io:443/v1/query \
-d @- <<END;
{
  "query": [
    {
      "query": "Paradise",
      "queryContext": "",
      "start": 0,
      "numResults": 10,
      "contextConfig": {
        "charsBefore": 0,
        "charsAfter": 0,
        "sentencesBefore": 2,
        "sentencesAfter": 2,
        "startTag": "%START_SNIPPET%",
        "endTag": "%END_SNIPPET%"
      },
      "corpusKey": [
        {
          "customerId": 1353344576,
          "corpusId": 34,
          "semantics": 0,
          "metadataFilter": "part.corpus != 'ontology topics' and part.corpus != 'tags'",
          "lexicalInterpolationConfig": {
            "lambda": 0.005
          },
          "dim": []
        }
      ],
      "summary": []
    }
  ]
}
END
```
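To narrow down which combination is slow, it can help to time the same query in three variants: hybrid only, filter only, and both. The following is a minimal Python sketch, not an official client. It assumes that a `lambda` of 0 disables the lexical part of hybrid search and that an empty `metadataFilter` means no filter. The `Content-Type` header and the helper names `build_body` and `timed_query` are my own additions for illustration.

```python
import copy
import json
import time
import urllib.request

QUERY_URL = "https://api.vectara.io/v1/query"

# Corpus key copied from the slow request in this thread.
BASE_CORPUS_KEY = {
    "customerId": 1353344576,
    "corpusId": 34,
    "semantics": 0,
    "metadataFilter": "part.corpus != 'ontology topics' and part.corpus != 'tags'",
    "lexicalInterpolationConfig": {"lambda": 0.005},
    "dim": [],
}


def build_body(use_hybrid: bool, use_filter: bool) -> dict:
    """Return a query body with hybrid search and/or the filter switched off."""
    key = copy.deepcopy(BASE_CORPUS_KEY)
    if not use_hybrid:
        # Assumption: lambda = 0 means purely neural retrieval.
        key["lexicalInterpolationConfig"]["lambda"] = 0.0
    if not use_filter:
        # Assumption: an empty filter string means "no filter".
        key["metadataFilter"] = ""
    return {
        "query": [
            {"query": "Paradise", "start": 0, "numResults": 10, "corpusKey": [key]}
        ]
    }


def timed_query(body: dict, token: str, customer_id: str = "1353344576") -> float:
    """POST the body and return the elapsed seconds (makes a real network call)."""
    request = urllib.request.Request(
        QUERY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "customer-id": customer_id,
            "Content-Type": "application/json",  # assumption, not in the curl call
        },
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=120) as response:
        response.read()
    return time.perf_counter() - start


VARIANTS = {
    "hybrid only": build_body(use_hybrid=True, use_filter=False),
    "filter only": build_body(use_hybrid=False, use_filter=True),
    "hybrid + filter": build_body(use_hybrid=True, use_filter=True),
}
```

With a valid token, something like `for name, body in VARIANTS.items(): print(name, timed_query(body, token))` would print one timing per variant, which makes the slow combination easy to report.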

Hi hossam,

It seems that when setting up the corpus, you did not create an index on the “corpus” filterable metadata attribute. Can you please try doing that? An indexed filterable metadata attribute helps queries run faster.
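For reference, here is a rough sketch of what an indexed, part-level filter attribute definition might look like as a request body. The field names and enum strings are written from memory of the v1 admin API and should be treated as assumptions to check against the API reference. The `corpusId` wrapper and the helper name `indexed_filter_attribute` are illustrative only.

```python
import json


def indexed_filter_attribute(name: str) -> dict:
    """Build a filter attribute definition with indexing turned on.

    The keys and enum strings below are assumptions, not confirmed
    in this thread; verify them against the API reference.
    """
    return {
        "name": name,
        "description": f"Indexed filter attribute '{name}'",
        "indexed": True,
        # Assumed enum values: a text attribute stored at the part level,
        # matching the `part.corpus` expression used in the query filter.
        "type": "FILTER_ATTRIBUTE_TYPE__TEXT",
        "level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT_PART",
    }


# Hypothetical request body for corpus 34 from the query above.
body = {"corpusId": 34, "filterAttributes": [indexed_filter_attribute("corpus")]}
print(json.dumps(body, indent=2))
```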

Separately, looking at your query request, the filter suggests you are using filter expressions as a way to deploy several “corpora” into a single corpus instead of creating several separate corpora. Using separate corpora will help with data isolation, and to some extent, query processing. Is there a reason you aren’t using separate corpora? Can you try that?
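As a sketch of the separate-corpora approach: since `corpusKey` in the query request is an array, one request could list each corpus to search and simply leave out the ones to exclude, with no filter expression needed. The corpus ID 35 and the helper name `corpus_keys` below are hypothetical, and whether a single request may target several corpora on a given account is an assumption.

```python
def corpus_keys(customer_id: int, corpus_ids: list[int], lam: float = 0.005) -> list[dict]:
    """Build one corpusKey entry per corpus, all with the same hybrid weight."""
    return [
        {
            "customerId": customer_id,
            "corpusId": corpus_id,
            "semantics": 0,
            "metadataFilter": "",  # no filter: excluded corpora are just not listed
            "lexicalInterpolationConfig": {"lambda": lam},
            "dim": [],
        }
        for corpus_id in corpus_ids
    ]


# Corpus 34 comes from the thread; corpus 35 is a made-up second corpus.
body = {
    "query": [
        {
            "query": "Paradise",
            "start": 0,
            "numResults": 10,
            "corpusKey": corpus_keys(1353344576, [34, 35]),
        }
    ]
}
```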

Meanwhile, we’re going to dig into this further on our side as well.

Regards,
Tallat

I have edited the “corpus” filter attribute and marked it as indexed, then ran the request again, but I didn’t notice any difference in the delay: it still takes more than 30 seconds.

The reason for keeping multiple “corpora” in a single corpus is that all of the data is linked to the original document ID. It’s like having one table in a SQL database where all columns are always non-null, which is more efficient than having multiple tables with a 1-to-1 relation to the original table.

Thanks for trying, and apologies for the bad performance.

I’ve opened a ticket on our end to investigate this further. The slowness seems to come from one of the generated internal queries performing poorly in this particular case. It might take a while to diagnose fully, implement a fix, and roll it out to production, but we are going to work on it for sure. I’ll post when we have any meaningful updates.