Querying custom metadata in the Vectara platform

Hi there,

I am loving Vectara but have a small issue!

How do I filter using the metadata I have added? I have read the documentation and other threads but cannot work it out. I have also added the metadata to the top of the JSON Section but it doesn’t seem to work for some reason - any help would be highly appreciated!

Am I setting the metadata wrong?
Am I querying wrong?

Please see screenshot, sorry for the weird look but I can only upload 1 piece of media!

Thank you!

Hey there @Ed_Moore – thanks for the question! I think you’ve run into a bug on our side that we’ll work to get fixed, but also I think I can get you unblocked anyway in the interim. I’ll explain what bug I think you’ve run into, and what we’re working on at the end of this post.

To get unblocked, the fastest thing is to answer one question: do you want the title to be a “hard filter” (something you can do an exact match against), something you can do semantic matching against, or both? That will determine how you want to model your data.

Exact Match Case / Both

Create a new corpus. When you’re adding metadata filters, add documentTitle instead of Title at the Document level. Likewise, in your documents, change Title to documentTitle in the metadataJson at the top level of the document. You’ll then be able to do exact matches against this field by filtering with doc.documentTitle = 'foo'.
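As a rough sketch of the renamed field, the snippet below builds a document payload with documentTitle in the top-level metadataJson, plus the matching exact-match filter string. Only metadataJson, documentTitle and the filter syntax come from this thread; the documentId, section and text keys, the helper names and the sample values are assumptions for illustration, so check them against the indexing API reference before relying on them.

```python
import json


def build_document(document_id, document_title, section_texts):
    """Build a document dict with documentTitle as document-level metadata."""
    return {
        "documentId": document_id,  # assumed key name
        # Assumption: metadataJson carries a JSON-encoded string.
        "metadataJson": json.dumps({"documentTitle": document_title}),
        "section": [{"text": text} for text in section_texts],  # assumed key names
    }


def exact_title_filter(title):
    """Return a filter expression that matches documentTitle exactly."""
    if "'" in title:
        # Quote escaping rules are not covered in this thread, so refuse to guess.
        raise ValueError("titles containing single quotes are not handled here")
    return f"doc.documentTitle = '{title}'"


doc = build_document("doc-1", "#1 - 20/07/2023", ["First section text."])
print(doc["metadataJson"])
print(exact_title_filter("#1 - 20/07/2023"))
```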

Semantic Search on Title Field

You don’t really need to do anything special in the metadata to set up the ability to do semantic searches on the title. You can/should remove the Title field from metadataJson at the Document (top) level. What you can do for semantic title matching is to perform a search and then set part.is_title = true in your filter metadata to restrict your search to just the title. That’s described here
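To make the title-only restriction concrete, here is a minimal sketch that attaches part.is_title = true to a query request body. The filter string is the one from this thread; the shape of the request dict (query, numResults, corpusKey, corpusId, metadataFilter) and the helper name are assumptions, so treat them as placeholders and confirm them in the query API reference.

```python
TITLE_ONLY_FILTER = "part.is_title = true"


def title_only_query(query_text, corpus_id, num_results=10):
    """Build a query body restricted to title parts (request shape is assumed)."""
    return {
        "query": [
            {
                "query": query_text,
                "numResults": num_results,
                "corpusKey": [
                    {
                        "corpusId": corpus_id,
                        # Restrict semantic matching to parts flagged as the title.
                        "metadataFilter": TITLE_ONLY_FILTER,
                    }
                ],
            }
        ]
    }


request_body = title_only_query("quarterly update", corpus_id=1)
print(request_body["query"][0]["corpusKey"][0]["metadataFilter"])
```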

What Happened

Unfortunately, a few different things. The first is on the Vectara side: metadata filters are case insensitive, and Vectara already has an internal field called title that’s used for the semantic title. So when you created a field called Title, it looks like Vectara couldn’t match the field you’d created, because the filter was redirected to the internal title field, which isn’t set up as metadata. There were already several protections to try to prevent this type of conflict: corpus creation won’t let you create filters on two metadata fields that differ only in casing, and the file upload API also generally has some protections to try to prevent field ambiguity. But I don’t think these took Title vs title into account correctly, so we’ll look to make a change to make that safer.
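Until that protection lands, a small client-side check can catch this kind of clash before a corpus is created. The sketch below is not how Vectara validates names internally; it only assumes what is said above, namely that filter names are compared case-insensitively and that title is taken. The function name and the reserved set are hypothetical.

```python
# Only the reserved name mentioned in this thread; the real list may be longer.
RESERVED_FIELD_NAMES = {"title"}


def find_conflicts(field_names, reserved=RESERVED_FIELD_NAMES):
    """Return (name, reason) pairs for names that clash case-insensitively."""
    reserved_lower = {name.lower() for name in reserved}
    seen = {}  # lowercased name -> first spelling encountered
    conflicts = []
    for name in field_names:
        key = name.lower()
        if key in reserved_lower:
            conflicts.append((name, "reserved"))
        elif key in seen:
            conflicts.append((name, seen[key]))
        else:
            seen[key] = name
    return conflicts


print(find_conflicts(["Title", "documentTitle", "Company", "company"]))
```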

The second thing is that you have part.Title, but that should be doc.Title. Once you create a new corpus, you’ll want doc.documentTitle if you still want strict filtering.

In the meantime, unfortunately, I think you will need to create a new corpus and put new documents with these changed metadata fields in it.

I should also mention that we’re working on an API to allow you to change which metadata is filterable after corpus creation. That’s currently pre-GA, but available if you’d like to test it out and provide us feedback. Feel free to DM me if so, and we can get a quick chat set up.

Thanks a lot for this Shane, I thought using Title might cause an issue :upside_down_face:

However, to expand on this: I added all of the items in the screenshot here and they all appear as ‘Filterable attributes’ but not as doc- or part-level metadata. Do you know why this is?

With the other items, would it be best for me to stick to single words and call them either documentXYZ or partXYZ?

I would love to have a play around with the Metachange API if that is ok, let me know how we can schedule a quick call!

Thanks
Ed

Sorry, forgot to attach!

Yes, so there are 2 different types of metadata: document metadata and part metadata. Document metadata exists on the “document” / top level. In your original screenshot, since I can see the JSON entry, the Title field holding #1 - 20/07/2023 would be document metadata. And then fields within a section are your part metadata. So Part Company, Part Title and so on would be part metadata. You should take out the spaces for those and replace them with underscores if you want to use them for now.
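The underscore renaming can be done mechanically before upload. This is a minimal sketch with a hypothetical helper name; the keys Part Company and Part Title come from the thread, while the values are made-up sample data.

```python
def normalize_keys(metadata):
    """Return a copy of metadata with spaces in key names replaced by underscores."""
    normalized = {}
    for key, value in metadata.items():
        new_key = key.strip().replace(" ", "_")
        if new_key in normalized:
            # Two original keys collapsed to the same name; fail loudly.
            raise ValueError(f"duplicate key after renaming: {new_key}")
        normalized[new_key] = value
    return normalized


part_metadata = {"Part Company": "Acme", "Part Title": "bar"}  # sample values
print(normalize_keys(part_metadata))
```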

So let’s say, looking at your first post, you replace the spaces with underscores in the Part ... metadata and you replace Title at the top level of the JSON with documentTitle. A valid filter expression would then be something like:
doc.documentTitle = 'foo' AND part.Part_Title = 'bar'
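If you build these expressions in code, a tiny helper can keep the doc. and part. prefixes straight and reject field names that still contain spaces. This is only a sketch under the assumptions in this thread: the helper names are hypothetical, and it covers equality and AND only.

```python
VALID_LEVELS = {"doc", "part"}


def equals(level, field, value):
    """Build a single equality clause such as doc.documentTitle = 'foo'."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {sorted(VALID_LEVELS)}")
    if " " in field:
        raise ValueError("field names must not contain spaces; use underscores")
    if "'" in value:
        # Quote escaping rules are not covered in this thread, so refuse to guess.
        raise ValueError("values containing single quotes are not handled here")
    return f"{level}.{field} = '{value}'"


def all_of(*clauses):
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


expression = all_of(
    equals("doc", "documentTitle", "foo"),
    equals("part", "Part_Title", "bar"),
)
print(expression)
```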

Hopefully that makes sense?

As to why they show up as filterable attributes without it being clear that they’re also part/document level, that’s something we can take away to review as a potential UX improvement. I’ll raise an internal issue and chat with some folks here to see if there’s a better way to display them, but rest assured that the filters are also document or part level, depending on where you defined them.

As for a discussion, feel free to DM me here or ping me at shane@vectara.com and we can set up some time to chat.