Export to CSV/XLSX (or some other tabular file format) and upload that to a corpus (more my strategy) - pretty sure Vectara supports this, but I struggled with the upload.
Connect Vectara directly to my SQL database somehow, instead of going through a corpus (?)
What do you recommend? Assume I have 35 columns and 42k+ rows. Again, I can export to a CSV and then upload to a corpus, but I found that harder.
Ultimately want to be able to:
(Ask a question) “What is the most common…?”
(perhaps it translates to SQL and then executes)
(or maybe to just directly use the GPT capability against the data)
Expect:
(Some answer/response) “The most common…”
Hi @Ali_Madison - can you share a bit more detail about what is in the data? Are there text fields you want to search over, or is the data mostly numeric/categorical?
Sure Ofer! It's a mix of text and numbers (all represented as text; it's a mix of metrics/facts across dimensional data, e.g. "Bird": "50%", "Cats": "20", etc.). I ingested this data into a dataframe (i.e. a structured table of rows/columns, stored in a relational database) -> then ultimately transformed it into a JSON matching Vectara's example of what a JSON should look like, so I can upload it to a corpus:
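Roughly, the transformation step looks like this - a minimal sketch assuming pandas, with made-up column names (store/city/cats/dogs) standing in for my real 35 columns, and simplified to a single level of sections (my real JSON is more nested):

import json
import pandas as pd

# Hypothetical stand-in for the real table (35 columns, 42k+ rows).
df = pd.DataFrame([
    {"store": "Pet Store 1", "city": "Los Angeles", "cats": 100, "dogs": 1000},
    {"store": "Pet Store 2", "city": "Los Angeles", "cats": 1000, "dogs": 100},
])

# One document, one section per row; each section's text is the row's
# key/value pairs flattened into a string ("store: ..., cats: 100, ...").
doc = {
    "documentId": "community",
    "title": "Community Data",
    "metadataJson": "{}",
    "section": [
        {
            "title": str(row["store"]),
            "text": ", ".join(f"{col}: {row[col]}" for col in df.columns),
        }
        for _, row in df.iterrows()
    ],
}

print(json.dumps(doc, indent=2))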
One comment: in reviewing my own data, I do have some instances of Nulls/NoneTypes, etc. for some of the specific questions I asked (so Vectara actually did exactly what it should do - the data isn't available in certain instances), but I'll keep testing.
One item I should mention also:
Status 400: Error
Too much text for the summarizer to summarize. Please try reducing the number of search results to summarize, or the context of each search result.
I reduced the number of summarized results to 1 and it worked, but I would still like your opinion on my approach, calibration, the 400 error, the JSON, etc.
Hi @Ali_Madison -
My guess is that, yes, the max_summarized_results (Search API Definition | Vectara Docs) value is too high and goes beyond the summarizer limit. What value are you using for max_summarized_results when you get the 400 error?
BTW, our scale plan provides higher limits if that’s necessary.
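In case it helps, here is roughly where that knob sits in the query body - a sketch only, with placeholder customer id, corpus id, and API key (in the REST JSON the field is spelled maxSummarizedResults):

import requests

query_body = {
    "query": [
        {
            "query": "Which pet store has the highest number of cats?",
            "numResults": 10,
            "corpusKey": [{"customerId": 123456789, "corpusId": 1}],
            "summary": [
                {
                    "summarizerPromptName": "vectara-summary-ext-v1.2.0",
                    "responseLang": "en",
                    # Lower this value if you hit the "too much text" 400 error.
                    "maxSummarizedResults": 3,
                }
            ],
        }
    ]
}

response = requests.post(
    "https://api.vectara.io/v1/query",
    json=query_body,
    headers={"customer-id": "123456789", "x-api-key": "<YOUR_API_KEY>"},
)
print(response.json())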
I believe I was using 5 - ah, ok. What do you think of my JSON structure? I can share more context.
I am asking it a question that I know gives VERY close answers in SQL, but maybe it's the nature of LLMs (or maybe the way I structured the JSON) - the answer seems off?
Generally speaking, LLM/RAG is not meant to replace this type of analytics when you want to "count" things. For example, if you want to ask "how often does X happen" and there are, say, 1000 events, the system would have to count the number of actual matching events and divide by 1000 - that's not the strong suit of LLMs or RAG. It's much better at figuring out what the relevant facts are from your overall corpus of documents (in this case, all the JSONs).
Generally, the JSON structure makes sense to me, but for any "shorthand" you use, it's better to use more human language. For example, what does "P_O_R_CatAndAlaskaBird" mean? Can you put that as actual full text?
I've seen some NLP->SQL (or table question-answering) engines, although frankly I've found them frustrating/disappointing (e.g. Hugging Face has a bunch, and they all have limitations that make them not worth it). I understand what you are saying - I don't necessarily want to compute something (well, I do, but I understand LLM/RAG is not practical for that); more so, I want to be able to rank or compare all the pet stores, animals, etc. (in my example). I feel like I've provided enough in my data, but is my JSON poorly structured? I can give you a better example.
And let's say I want to ask "Compare Pet Store 1 and Pet Store 2" or "Which pet store has the highest number of pets?" - an LLM doesn't actually calculate anything, but the ability to just ask things in plain English makes it powerful. Analytics is limited because you have to know, say, SQL (or create pre-canned reports with drop-downs, etc.). Do you think if I restructure my JSON (or does Vectara have another option?) I can use more natural language for these counting/sorting/comparison type requests?
I see what you mean about using more human language. In my testing, I've found some trade-offs in variable naming; let's say this data is pet store data - should it be Cats_Population, Cats_Population_Absolute_Numbers, CatsPopulation, or NumberOf_Cats? Are there any general rules? Let's say the question is "Tell me which pet store has the highest number of cats" or "Compare pet store A to pet store B" - what is the best way to name variables/key/value pairs? With underscores, spaces, no spaces (or just make them as human-readable as possible)?
{
  "documentId": "Community",
  "title": "Community Data",
  "metadataJson": "{}",
  "section": [
    {
      "title": "Pet Store 1",
      "text": "Pet store 1 is in the city of Los Angeles and has a cat population of 100, and a dog population of 1000."
    },
    {
      "title": "Pet Store 2",
      "text": "Pet store 2 is in the city of Los Angeles and has a cat population of 1000, and a dog population of 100."
    }
  ]
}
I simplified this in two ways:
All data is encoded as sentences in natural language, as you see. This way the language understanding has maximum impact on retrieval.
I made it a single level of "section" (as opposed to a section within a section, as you did). That is not strictly necessary - just to simplify. I'm wondering, what is the "CODE ABC" for? Is that really a good title, or could it just be metadata on each section?
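If it helps, the row-to-sentence encoding doesn't have to be written by hand - here's a rough sketch of generating it from the table, again with placeholder column names:

import pandas as pd

def row_to_sentence(row: pd.Series) -> str:
    # Render one row as a self-contained natural-language sentence,
    # so retrieval can match plain-English questions against it.
    return (
        f"{row['store']} is in the city of {row['city']} and has a cat "
        f"population of {row['cats']}, and a dog population of {row['dogs']}."
    )

df = pd.DataFrame([
    {"store": "Pet store 1", "city": "Los Angeles", "cats": 100, "dogs": 1000},
    {"store": "Pet store 2", "city": "Los Angeles", "cats": 1000, "dogs": 100},
])

sections = [
    {"title": row["store"], "text": row_to_sentence(row)}
    for _, row in df.iterrows()
]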
#1 Tried it and did a bunch of iterations! (Encoding the data as natural-language sentences - then I realized every sentence needs the same 'dimension' detail, although I would have assumed it can be mentioned once.) It definitely worked MUCH better.
For example, attempt #1:
"title": "Pet Store 1",
"text": "Pet store 1 is in the city of Los Angeles and has a cat population of 100, and a dog population of 1000. There are 5 german shepherds, 5 poodles, and 1 afghan hound."
Attempt #150, etc.:
"text": "Pet store 1 is in the city of Los Angeles, within the state of CA, and has a cat population of 100, and a dog population of 1000. There are 5 german shepherds, 5 poodles, and 1 afghan hound in Los Angeles, CA."
I feel like I have to 'repeat' the same information in every sentence (if I search by it, e.g. "What are the details for the Los Angeles, CA pet store?"). Is that expected, or should I just make a GIANT run-on sentence?
For example:
(just make 1 giant sentence)
"Pet store 1 is in the city of Los Angeles, within the state of CA and has a cat population of 100, and a dog population of 1000, there are 5 german shepherds, 5 poodles, and 1 afghan hound at this store..."
#2 I did end up keeping different sections (and you're right, the titles should say more, like 'Pet Store located in this city, state', because I'm searching more by location - maybe I'll come back to this). I am testing 'comparing' 2 sections (is that the point of separating them?), and the section title doesn't seem to make a difference.
Is there a way to do a ‘chat’ history also (I know, big question)?
Thanks Ofer, your help with that encoding has been awesome (it made the difference), but I will keep iterating.
Let me know if there's something different I should try based on these encoding examples (and maybe I should start a new thread for 'chat'-type Vectara bot/features).