Display title or metadata to LLM

I am indexing scraped web data that I want a chatbot to be able to reference when it answers questions from my users.

As I understand it, having a ‘source’ field in the metadata does not let the LLM see this information, so I am wondering what the best way is to give the LLM the URL.

I was thinking of including the URL as the title of the document or part, but I am not sure if this is returned with every result - any advice would be appreciated!

Thanks

Hey Ed,
can you please clarify why you want the URL to be accessible to the LLM? If I understand correctly, isn’t the scraped web data already indexed in Vectara?

I want to ask the LLM a question and allow it to answer and provide the source (which is the URL) in its response.

For example it would answer:

HSBC is a large bank, source - www.hsbc.com

Thanks for your help
Ed

Is what you’re looking for similar to the user experience in our ask-news demo? There, whenever a result is referenced, the UI provides a citation to the source below it.

We built this sample application using:

  1. vectara-ingest - to crawl the website. As part of this crawl job, the code adds a metadata field called “source” which holds the URL where the text from each webpage originates.
  2. vectara-answer - this is the user interface part. You can point it at your Vectara corpus/index and set up this UX in minutes. Alternatively, you can follow the code there and do it yourself, if that’s helpful. In this UX, the “source” metadata field is checked, and if it exists, a link to the original data source is created.
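To make the two steps above concrete, here is a minimal Python sketch (not the actual vectara-ingest/vectara-answer code - the payload shape and helper names are illustrative assumptions) showing how a “source” metadata field could be attached at ingest time and then used to build a citation link when rendering a result:

```python
def make_document(doc_id, text, url):
    """Build an ingest payload carrying a 'source' metadata field
    (hypothetical shape, for illustration only)."""
    return {
        "document_id": doc_id,
        "metadata": {"source": url},
        "sections": [{"text": text}],
    }

def citation_for(result):
    """Return a link to the original page if 'source' metadata exists,
    else None - mirroring what the UI layer checks for."""
    return result.get("metadata", {}).get("source")

doc = make_document("page-1", "HSBC is a large bank.", "https://www.hsbc.com")
# Pretend this document came back as a search result:
print(citation_for(doc))  # -> https://www.hsbc.com
```

The point is simply that the URL travels with the document as metadata, so the UI (not the LLM) resolves the citation.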

Happy to help further if you need.

No, it is not, unfortunately; for my use case there is no integration with the LLM directly via a chatbot.

I am using the LLM to analyse web page data and then create a report about multiple web pages. Inside the report I would like it to be able to provide the source, but the issue I face is that the URL is not fed to the LLM, at least via LangChain (from my understanding).

One idea I had was to do an API call to Vectara and then feed the whole response to an LLM, as it would then have the source, but this does not seem ideal, so I was wondering if there is a better way!

Any input would be highly appreciated.

Thank you

Ed, sorry, I’m still a bit confused - can you please clarify what you use to “create a report about multiple web pages”? Are you using a chain in LangChain for this?
Ofer

Hey Ofer,

Apologies - basically, I have a workflow in Azure that sends an API call to a Flowise URL asking a question like ‘What is Tesco’s strategy?’, then, using the answer to this, it sends another API call asking ‘What are Tesco’s risks related to {strategy}?’

Once this process has finished, the data is posted to a database that then displays this information in my web app for my users to see.

The Flowise URL is a retrieval QA chain that is connected to Anthropic + Vectara.

So the ‘sources’ cannot simply be included in the response: the answer to the question ‘What are Tesco’s risks related to {strategy}?’ may come from just one of the retrieved results, so it doesn’t make sense to provide 4-5 URLs per answer.

I guess ultimately my question is: is it possible to display the title of a document or section regardless of how much text is retrieved from it?

Hi Ed,

IIUC, when you ask “What are Tesco’s risks related to {strategy}?”, you get 4-5 results from the retrieval-QA chain, and you would like to provide the source URL for the answer. What I’m a bit unclear about is why you are not using the “answer” from the retrieval-QA chain itself - it is supposed to respond with a single answer summarizing all of the returned results.

Now, when you get a result from Vectara, you do get the metadata of each matching document in the result set, so if the title is available as metadata (you can make sure it is during ingest), you should be able to pull the title. What I’m describing here is Vectara-specific - are you asking specifically about pulling the title with Flowise or a particular component?
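As a rough sketch of what “pulling the title from the document metadata” looks like in code - the response shape below is an assumption modelled loosely on Vectara’s v1 query API, where each match references a document whose metadata is a list of name/value pairs:

```python
def metadata_value(pairs, name):
    """Find a named value in a [{'name': ..., 'value': ...}] metadata list."""
    return next((p["value"] for p in pairs if p["name"] == name), None)

def title_for_match(response_set, match):
    """Resolve a match back to its document and read the 'title' metadata."""
    doc = response_set["document"][match["documentIndex"]]
    return metadata_value(doc["metadata"], "title")

# Assumed response shape, for illustration:
response_set = {
    "response": [{"text": "Tesco's strategy is ...", "documentIndex": 0}],
    "document": [{"id": "doc-1",
                  "metadata": [{"name": "title", "value": "Tesco Annual Report"},
                               {"name": "source", "value": "https://www.tesco.com"}]}],
}
print(title_for_match(response_set, response_set["response"][0]))
# -> Tesco Annual Report
```

The same lookup works for any field you attach at ingest time (“source”, “title”, etc.), which is why making sure the title is ingested as metadata is the key step.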

Hey Ofer,

I moved my workflow from Flowise to Azure so the LLM can see the metadata, and it works perfectly!

Thanks for your help though, much appreciated.
Ed