Specifying data type of PDF metadata

Hi, I'm setting up a new corpus on Vectara. Some of the document-level PDF metadata is intended to be filtered as booleans and dates. Examples would be CreationDate (a standard PDF field that will work as a date/time or integer) and PublicRelease (a custom field intended to be a boolean).

Of course, it’s stored as text in the PDF document, so a conversion is needed somewhere in the ingest pipeline.
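A minimal sketch of that conversion step, assuming the PDF date follows the usual `D:YYYYMMDDHHmmSSOHH'mm'` layout and that a missing timezone means UTC. The function names `pdf_date_to_epoch` and `text_to_bool` are hypothetical helpers, not part of Vectara or any PDF library:

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSSOHH'mm'; everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def pdf_date_to_epoch(value: str) -> int:
    """Convert a PDF date string to epoch seconds (assumes UTC if no offset is given)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    defaults = (0, 1, 1, 0, 0, 0)  # year is always present; month and day default to 1
    year, month, day, hour, minute, second = (
        int(group) if group else default
        for group, default in zip(match.groups()[:6], defaults)
    )
    sign, off_hours, off_minutes = match.group(7), match.group(8), match.group(9)
    offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
    if sign == "-":
        offset = -offset
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))
    return int(moment.timestamp())


def text_to_bool(value: str) -> bool:
    """Convert a text flag such as 'true' or 'No' to a boolean."""
    normalized = value.strip().lower()
    if normalized in {"true", "yes", "1"}:
        return True
    if normalized in {"false", "no", "0"}:
        return False
    raise ValueError(f"not a boolean flag: {value!r}")
```

For example, `pdf_date_to_epoch("D:20230101000000Z")` gives an integer that can be compared numerically, and `text_to_bool("True")` gives a real boolean instead of a string.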

How do I do this on Vectara? Setting these data types in the Vectara UI under ‘filter attributes’ returns a 400 error. Is it possible via the UI, or does it require the API?

Also, should the level be set to ‘Document’ or ‘Part’? In my Qdrant pipeline, I place them in the payload of each chunk. Is it the same with Vectara?

Eager to try Vectara out to see how it compares to my homegrown solution with Qdrant.

Thanks

Hi @dvvilkins,

There are three ways to index documents in Vectara.

  1. Uploading a document from the console. This is probably what you are using. You upload a document and it is indexed.
  2. Standard API. You can use the REST or gRPC interfaces to index a document. You have fine-grained control here. The document has to be sent as a doc with sections, and you can define metadata, including date or boolean attributes.
  3. Low-level indexing API. This is more for advanced use cases.

#2 lets you define date and boolean attributes in the metadata and filter on them. You can define them at either the doc or the part level. Filter attributes need to be defined in the UI first and then set in the API calls.

See the documentation for more details.

So something like this?

  1. Create the corpus and set the filter attributes on the corpus in the UI
  2. Index the documents using the API, identifying the metadata and setting it to date or boolean here
  3. Create a filter expression
  4. Attach the filter expression to a query during retrieval

If so, it’s #2 that is unclear. Filtering on PDF metadata is a common use case. It would be great to see a step-by-step guide with examples.

Hi @dvvilkins,

If you are indexing the PDF with the standard API, you can follow the steps you have mentioned. This also gives you better control over what to put in the metadata.
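As a rough illustration of step 2, here is a sketch that builds an indexing request body with typed metadata at both the doc and the part level. The field names (`customerId`, `corpusId`, `documentId`, `metadataJson`, `section`) and the helper `build_index_request` are assumptions for illustration and should be checked against the Vectara API reference. The point is only that the date is sent as a number and the flag as a real boolean, not as strings:

```python
import json


def build_index_request(customer_id, corpus_id, doc_id, title, doc_metadata, sections):
    """Build a request body for a document-indexing call (field names are assumed).

    doc_metadata: dict of document-level attributes, already converted to typed values.
    sections: list of (text, part_metadata) tuples, one per section of the document.
    """
    return {
        "customerId": customer_id,
        "corpusId": corpus_id,
        "document": {
            "documentId": doc_id,
            "title": title,
            # Metadata is serialized to a JSON string; numbers and booleans keep their types.
            "metadataJson": json.dumps(doc_metadata),
            "section": [
                {"text": text, "metadataJson": json.dumps(part_metadata)}
                for text, part_metadata in sections
            ],
        },
    }


request_body = build_index_request(
    customer_id=123,  # placeholder value
    corpus_id=1,      # placeholder value
    doc_id="report-001",
    title="Example report",
    doc_metadata={"CreationDate": 1672531200, "PublicRelease": True},
    sections=[("First section text.", {"page": 1})],
)
print(json.dumps(request_body, indent=2))
```

Document-level attributes go in the document's metadata, while anything that varies per chunk goes in each section's metadata. The matching filter attributes still have to exist on the corpus with the same names, levels, and types.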

If you only want to use the upload API, we automatically extract the PDF's metadata. You can see it on the search results page (Query data tab). If your metadata is automatically imported, you can define filters for it and query.

Hope this clarifies.

Thanks, yes that helps. I assume by “upload API” you are referring to the web interface, or is this an endpoint?

Where/how do I ensure the data type for each metadata field is defined properly for filtering? For example, that the creation date is in a date/time format, or at least a numeric value and not a string. Do I do that as part of the “define filters for them” step you mention?

Thanks, yes that helps. I assume by “upload API” you are referring to the web interface, or is this an endpoint?

Both. There is a web interface (UI) and there is an API.

Where/how do I ensure the data type for each metadata field is defined properly for filtering? For example, that the creation date is in a date/time format, or at least a numeric value and not a string. Do I do that as part of the “define filters for them” step you mention?

You can define the filter attributes from the UI. See Filter Attributes in a corpus.
If the PDF has an attribute named “CreationDate” or “ModDate”, it is automatically converted to epoch seconds. Other attributes are handled as string values.
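Since the date ends up as epoch seconds, a filter on it compares against an integer. A small sketch of building such an expression follows. The `doc.` prefix and the `>=` comparison are assumptions about the filter expression syntax, and `created_on_or_after` is a hypothetical helper:

```python
from datetime import datetime, timezone


def epoch_seconds(year: int, month: int, day: int) -> int:
    """Return epoch seconds for midnight UTC on the given calendar date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def created_on_or_after(year: int, month: int, day: int,
                        attribute: str = "doc.CreationDate") -> str:
    """Build a filter expression comparing a date attribute against epoch seconds."""
    return f"{attribute} >= {epoch_seconds(year, month, day)}"


print(created_on_or_after(2023, 1, 1))  # doc.CreationDate >= 1672531200
```

For this comparison to behave numerically, the filter attribute on the corpus would need an integer type rather than text.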