Crawling: crawler_type: database

JavaScript_Camp · February 4, 2024, 5:20pm

I don’t understand how to correctly add data for indexing.

I have data in a supabase table and I receive data through the postgresql connector.

Supabase:

public.ecommerce (
    id uuid not null default gen_random_uuid (),
    created_at timestamp with time zone not null default now(),
    product text null default ''::text,
    image_url text null default ''::text,
    usd real null,
    rub real null,
    link text null default ''::text,
    category text null default ''::text,
    market text null,
    constraint ecommerce_pkey primary key (id)
  ) tablespace pg_default;

vectara-ingest:

vectara:
  corpus_id: 6
  customer_id: 2252938633
  reindex: true

crawling:
  crawler_type: database

database_crawler:
  db_url: 'postgresql+psycopg2://postgres.**************:*************@aws-0-eu-central-1.pooler.supabase.com:5432/postgres'
  db_table: ecommerce
  doc_id_columns: [product, image_url, market, usd, rub, link, category]
  text_columns: [product, image_url, market, usd, rub, link, category]

The data was added, but I think that somehow I did it incorrectly, since the information is duplicated.

There are a lot of fields in the documentation, but it’s not clear exactly where to add what.

Tell me how to properly organize the data of products from the marketplace?

ofermend · February 4, 2024, 5:49pm

Let me try to explain. The idea here is to have a way to generate a Vectara document from each row in the table. The customization is as follows:

title_column is a text field in the source table that will be used to populate the “title” of this document.
text_columns is a list of column names (1 or more) that are concatenated together to generate the main text field of the document. Those need to be primarily text fields, for for example in your case above I see “usd”, “link” and “image_url” which are typically not a good choice for this. Alternatively in a product search application I would include the product name, a description, reviews, etc. Anything that is a good text field.
you can also use “metadata_columns” to populate other fields that are not textual in the Vectara metdata.
“doc_id_columns” is a special field use to aggregate rows into a Vectara document. If it is unspecified, then vectara-ingest will aggregate every 500 rows and make a random documentID for each Vectara document. If you do specify, then each documentID becomes a concatenation of all the fields specified. In your case, I would pick something that is a good unique ID of a product - is that “product”?

JavaScript_Camp · February 5, 2024, 5:08am

db_table: ecommerce
doc_id_columns: [product]
text_columns: [product, category, market, usd, rub, link, image_url]

Why is there no information from this column?

text_columns: [product, category, market, usd, rub, link, image_url]

JavaScript_Camp · February 5, 2024, 5:18am

  db_table: ecommerce
  doc_id_columns: [product]
  text_columns: [product]
  metadata_columns: [category, market, usd, rub, link, image_url]

For some reason there are no other columns except title.

ofermend · February 5, 2024, 5:50am

Can you please share the source CSV file (or perhaps the first 10-20 rows in it)?

JavaScript_Camp · February 5, 2024, 6:03am

Yes. Document

ofermend · February 5, 2024, 7:47pm

This worked for me:
doc_id_columns: [product]
text_columns: [product]
metadata_columns: [category, usd, rub, image_url]

I did not see “market” or “link” as columns in the document.

JavaScript_Camp · February 6, 2024, 4:32am

Can you show how the information was added to Vectara?

ofermend · February 6, 2024, 3:56pm

Here’s an example of how it looks when issuing a query.

JavaScript_Camp · February 6, 2024, 4:45pm

Why I don’t see all fields?

db_table: ecommerce
  doc_id_columns: [product]
  text_columns: [product]
  metadata_columns: [category, author, usd, rub, url, image_url]

Why can’t I see all the fields in the card, but only the title field?
Where is the category, author, usd, rub, url, image_url data?

ofermend · February 7, 2024, 4:17am

Hi,
With Vectara there can be two types of metadata. (1) The doc_metadata and (2) “section” metadata.

Document level metadata is what you see in the console and attached to a whole Vectara document
A Vectara document can be split into multiple “sections” and section level metadata is attached to each section.
(see for example https://docs.vectara.com/docs/api-reference/indexing-apis/indexing#index-request-and-response)

In the csv crawler, the code currently adds “title” and “source” as the document level metadata. Each row is considered a “Section” and the metadata from each row is added to the section level metadata.
Does that make sense?

If you run a full query using the API you will see the variou sections returned and the right metadata associated with each section.

JavaScript_Camp · February 7, 2024, 1:53pm

I don’t really understand how I can add fields correctly so that they are correctly indexed and displayed?
I need to add this table

Topic		Replies	Views
Website_crawler Amazon Vectara Platform Q&A	18	681	February 9, 2024
Uploading files & Indexing Vectara Platform Q&A indexing	3	863	July 21, 2023
Uploading integers/floats not working in filters Vectara Platform Q&A indexing	1	242	May 11, 2024
Apify to Airbyte to Vectara Vectara Platform Q&A	9	621	February 13, 2024
How to create a RAG for DB with Vectara APIs or SDK? Vectara Platform Q&A query	3	34	March 19, 2025

Crawling: crawler_type: database

Related topics