Specifying data type of PDF metadata

dvvilkins · October 31, 2023, 1:04pm

Hi, I'm setting up a new corpus on Vectara. Some of the document-level PDF metadata is intended to be filtered as booleans and dates. Examples would be CreationDate (a standard PDF field that will work as date/time or integer) or PublicRelease (a custom field intended to be a Boolean).

Of course, it’s stored as text in the PDF document, so a conversion is needed somewhere in the ingest pipeline.

How do I do this on Vectara? Setting these data types in the Vectara UI under ‘filter attributes’ returns a 400 error. Is it possible via the UI, or does it require the API?

Also, should the level be set to ‘Document’ or ‘Part’? In my Qdrant pipeline, I place them in the payload of each chunk. Is it the same with Vectara?

Eager to try Vectara out to see how it compares to my homegrown solution with Qdrant.

Thanks

Bahadir_Danisik · October 31, 2023, 2:52pm

Hi @dvvilkins,

There are 3 ways of indexing at Vectara.

1. Uploading a document from the console. This is probably what you are using. You upload a document and it is indexed.
2. Standard API. You can use the REST or gRPC interfaces to index a document. You have fine-grained control here. The document has to be sent as a doc with sections, and you can define metadata, including date or boolean attributes.
3. Low-level indexing API. This is more for advanced use cases.

With #2 you can define date and boolean attributes in the metadata and filter on them. You can define them at either the doc or the part level. Filter attributes need to be defined in the UI first and then set in the API calls.
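As a rough sketch of option #2, the typed metadata can be prepared client-side before the indexing call. The payload field names used here (documentId, metadataJson, section) and the helper build_document are assumptions for illustration and are not confirmed in this thread; check the indexing API reference for the exact request format.

```python
import json


def build_document(doc_id, text, creation_epoch, public_release):
    """Build a hypothetical indexing payload with typed metadata.

    The field names are assumed, not taken from this thread.
    """
    metadata = {
        "CreationDate": int(creation_epoch),    # a number, not a string
        "PublicRelease": bool(public_release),  # a real boolean, not "true"
    }
    return {
        "documentId": doc_id,
        # Metadata is serialized as a JSON string (assumed payload shape).
        "metadataJson": json.dumps(metadata),
        "section": [{"text": text}],
    }


doc = build_document("report-001", "Body text of the PDF.", 1698757440, True)
print(json.dumps(doc, indent=2))
```

The point of the sketch is only that the conversion from PDF text values to numbers and booleans happens before the document is sent, so the values match the types declared for the corpus filter attributes.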

See the documentation for more details.

dvvilkins · October 31, 2023, 6:44pm

So something like this?

1. Create the corpus and set the filter attributes on the corpus in the UI
2. Index the documents using the API, identify the metadata and set it to date or boolean here
3. Create a filter expression
4. Attach the filter expression to a query during retrieval

If so, it’s #2 that is unclear. Filtering on PDF metadata is a common use case. It would be great to see a step-by-step guide with examples.

Bahadir_Danisik · November 1, 2023, 5:27pm

Hi @dvvilkins,

If you are indexing the PDF with the standard API, you can follow the steps you have mentioned. This also gives you better control over what to put in the metadata.
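For the last two steps, a filter expression over the two attributes from this thread might be assembled as below. This is a sketch only: the `doc.` prefix and the SQL-like operators are assumptions about the filter syntax, and released_since is a hypothetical helper name.

```python
from datetime import datetime, timezone


def released_since(cutoff, level="doc"):
    """Return a filter string for public documents created at or after cutoff.

    `level` is "doc" or "part", matching where the attribute was defined.
    The expression syntax is assumed, not confirmed in this thread.
    """
    epoch = int(cutoff.timestamp())  # dates are compared as epoch seconds
    return f"{level}.CreationDate >= {epoch} AND {level}.PublicRelease = true"


expr = released_since(datetime(2023, 1, 1, tzinfo=timezone.utc))
print(expr)  # doc.CreationDate >= 1672531200 AND doc.PublicRelease = true
```

The resulting string would then be attached to the query request as its metadata filter.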

If you only want to use the upload API, we automatically extract the metadata of the PDF. You can see it on the search results page (Query data tab). If your metadata is automatically imported, you can define filters for it and query.

Hope this clarifies.

dvvilkins · November 2, 2023, 6:42pm

Thanks, yes that helps. I assume by “upload API” you are referring to the web interface, or is this an endpoint?

Where/how do I ensure the data type for each metadata field is defined properly for filtering? For example, creation date should be in a date/time format, or at least a numeric value and not a string. Do I do that as part of the “define filters for them” step you mention?

Bahadir_Danisik · November 2, 2023, 7:37pm

> Thanks, yes that helps. Assume by “upload API” you are referring to the Web interface or is this an endpoint?

Both. There is a web interface (UI) and there is an API.

> Where /how do I ensure the data types for each metadata is defined properly for filtering. For example creation date is a date/time format, or at least a value and not a string. Do I do that as part of “define filter for them” you mention?

You can define the filter attributes from the UI. See Filter Attributes in a corpus.
If the PDF has an attribute named “CreationDate” or “ModDate”, it is automatically converted to epoch seconds. Other attributes are handled as string values.
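To illustrate what "converted to epoch seconds" means for a PDF date, here is a minimal sketch that parses the standard PDF date string form (D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' offset). It is not Vectara's implementation; pdf_date_to_epoch is a hypothetical helper, and treating a missing offset as UTC is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date string: D:YYYY[MM[DD[HH[mm[SS]]]]] plus optional Z or +HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def pdf_date_to_epoch(value):
    """Convert a PDF date string to integer epoch seconds."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = m.groups()
    # A missing offset is treated as UTC here (an assumption).
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    dt = datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=timezone(offset),
    )
    return int(dt.timestamp())


print(pdf_date_to_epoch("D:20230101010000+01'00'"))  # 1672531200
```

A value produced this way can be compared numerically in a filter, which is why the attribute would be declared with a numeric type rather than as text.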
