HTML not supported as a learning format?

IamAI · October 15, 2022, 6:13am

Am I missing something, or is HTML not supported as a learning format for documents? I saw JSON, Markdown, text, PDF, and MS Word, I believe, but no HTML.

IamAI · October 15, 2022, 7:06am

I think I answered my own question by trying it. HTML does work, though it is not explicitly mentioned in the UI. It would be great if it could ingest content directly from a URL. Bonus if it could also spider.

shane · October 17, 2022, 4:15am

Thanks for the question, @IamAI ! A couple things that may interest you:

Yes, as you noticed, HTML works fine. We actually support a wide variety of file formats (most common filetypes you’ll see should work, with the exception of “container” formats like .zip that contain lots of sub-files).
We have some documentation on the filetypes we support here, and we intend to add a link from the console UI to helpful documentation like this where it applies.
We actually have a demo application of a spider here. It’s currently a community-supported project (pull requests are welcome!) and there are some limitations with it, but especially if you’re mostly crawling a small website or you want to see how that might be done, it might serve as an interesting starting point for you.

We’d like to eventually make this type of use case a more built-in feature, so thank you for the feature request!
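
For anyone curious what crawling a small website could look like, here is a rough sketch using only the Python standard library. It is not the demo spider mentioned above and it makes no calls to the platform; the names `LinkExtractor`, `fetch_html`, and `crawl` are made up for illustration. It assumes a small site, follows only links on the same host, and ignores robots.txt, rate limiting, and non-HTML responses, all of which a real crawler should handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=20):
    """Breadth-first crawl of same-host pages; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Each HTML string returned by `crawl` could then be written to a `.html` file and uploaded like any other document, since HTML is accepted as noted earlier in the thread. The `fetch` parameter exists so the download step can be swapped out, for example to add delays or to test without network access.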
