Am I missing something, or is HTML not supported as a learning format for documents? I saw JSON, Markdown, text, PDF, and MS Word, I believe, but no HTML.
I think I answered my own question by trying it. HTML does work, though it's not explicitly mentioned in the UI. It would be great if it could ingest content directly from a URL. Bonus if it could also spider.
Thanks for the question, @IamAI ! A couple things that may interest you:
- Yes, as you noticed, HTML works fine. We actually have a wide variety of file formats we support (most common filetypes you’ll see should work with the exception of “container” formats like .zip that contain lots of sub-files)
- We have some documentation on the filetypes we support here, and we intend to add a link from the console UI to helpful documentation like this where it applies
- We actually have a demo application of a spider here. It’s currently a community-supported project (pull requests are welcome!) and there are some limitations with it, but especially if you’re mostly crawling a small website or you want to see how that might be done, it might serve as an interesting starting point for you
We’d like to eventually make this type of use case a more built-in feature, so thank you for the feature request!
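For anyone curious what a small same-site crawl might look like, here is a minimal sketch using only the Python standard library. It is not the demo application mentioned above, just an illustration of the general idea. The names `LinkExtractor`, `same_site_links`, and `crawl` are made up for this example, and `fetch` is a placeholder for whatever you use to download a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute, fragment-free links that stay on base_url's host."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        on_site = parts.scheme in ("http", "https") and parts.netloc == host
        if on_site and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of a single site.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each visited URL to its HTML.
    """
    pages = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in same_site_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Each page collected this way could then be saved as an `.html` file and uploaded like any other document. A real crawler would also want to respect `robots.txt` and pause between requests, which this sketch leaves out.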