I am getting slightly mixed messages from these two documentation pages:
The first seems to indicate that a document part can contain text spanning multiple paragraphs; however, the lower-level API docs seem to indicate that a full paragraph of text actually results in degraded performance and that a sentence should be the goal.
I am guessing the latter is more true, but I wanted to confirm.
Unrelated side note: it would have been delightful if I could have arbitrarily sized document parts and Vectara split the text appropriately for indexing for me. That way I could let the document represent how I think of the structure of the document, rather than optimising for what may be going on under the hood. For me, that's the difference between dealing with a DB vs a service.
Your question highlights a key difference between the default API and the low-level API.
In the case of the default API, the text within a section can span multiple paragraphs. Be it as short as a tweet or as long as a book, it will be automatically segmented by the platform.
it would have been delightful if I could have arbitrarily sized document parts and Vectara split the text appropriately for indexing for me. That way I could let the document represent how I think of the structure of the document, rather than optimising for what may be going on under the hood. For me, that's the difference between dealing with a DB vs a service.
The good news is that this is exactly how the default API works.
The low-level API, on the other hand, directly vectorizes each part without attempting segmentation. This can be useful in advanced scenarios where the client wants to control the segmentation process themselves.
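To make the distinction concrete, here is a rough Python sketch of the two request shapes. It assumes the v1 REST endpoints (/v1/index for the default structured documents with sections, /v1/core/index for low-level documents with parts) and an x-api-key header; treat the exact field names as illustrative and check the current API reference for the authoritative shapes.

```python
import requests

API_KEY = "zwt_..."        # placeholder API key
CUSTOMER_ID = 1234567890   # placeholder customer id
CORPUS_ID = 1              # placeholder corpus id

headers = {"x-api-key": API_KEY, "customer-id": str(CUSTOMER_ID)}

chapter_text = "Call me Ishmael. Some years ago ..."  # raw text; may span many paragraphs

# Default (structured) API: a section's text can be arbitrarily long;
# the platform segments it into snippets at indexing time.
structured_doc = {
    "customer_id": CUSTOMER_ID,
    "corpus_id": CORPUS_ID,
    "document": {
        "document_id": "moby-dick",
        "title": "Moby-Dick",
        "section": [{"id": 1, "title": "Chapter 1: Loomings", "text": chapter_text}],
    },
}
requests.post("https://api.vectara.io/v1/index", json=structured_doc, headers=headers)

# Low-level (core) API: each part is vectorized exactly as given, so the
# caller does the segmentation (ideally sentence-sized parts).
core_doc = {
    "customer_id": CUSTOMER_ID,
    "corpus_id": CORPUS_ID,
    "document": {
        "document_id": "moby-dick-core",
        "parts": [
            {"text": "Call me Ishmael."},
            {"text": "Some years ago, never mind how long precisely, ..."},
        ],
    },
}
requests.post("https://api.vectara.io/v1/core/index", json=core_doc, headers=headers)
```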
Is it possible for me to think of sections as containing a completely arbitrary amount of information then? Could I for instance store a chapter of a book in a section? Or is there a practical upper bound to be considered here (ignoring cost implications)?
Is it possible for me to think of sections as containing a completely arbitrary amount of information then?
Yes, that’s right.
Could I for instance store a chapter of a book in a section?
That will work fine.
Or is there a practical upper bound to be considered here (ignoring cost implications)?
The only other limit you’ll hit is the maximum message size in our system, which is 20MB or 50MB (I forget exactly which). However, that limit applies to raw text, so it’s pretty generous and sufficient for most use cases.
Someone recently uploaded a 3,000-page technical document as a single JSON document and it worked fine.
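If you want to stay safely under that ceiling, a quick pre-flight check on the serialized request is enough. A minimal sketch follows; the 20MB constant is just an assumed placeholder until the exact limit is confirmed.

```python
import json

MAX_MESSAGE_BYTES = 20 * 1024 * 1024  # assumed limit; confirm whether it is 20MB or 50MB

def fits_message_limit(index_request: dict) -> bool:
    """Return True if the JSON-serialized indexing request is under the assumed size limit."""
    return len(json.dumps(index_request).encode("utf-8")) <= MAX_MESSAGE_BYTES
```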
Sorry, another question on this topic. Are the sections parsed in any other way? For example, can I give it HTML and have it still work? Or is it only able to tokenise plain text?
Currently, it only works with plain text. You can feed HTML documents into the platform using the File Upload API, but the extraction quality varies depending on how the page is structured.
When uploading through this API, you can set the parameter d=true to receive the extracted document in the reply.
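For reference, a minimal upload of an HTML file could look roughly like the sketch below; treat the c (customer id) and o (corpus id) query parameters as illustrative and check the File Upload API reference for the exact names.

```python
import json
import requests

API_KEY = "zwt_..."        # placeholder API key
CUSTOMER_ID = 1234567890   # placeholder customer id
CORPUS_ID = 1              # placeholder corpus id

# d=true asks the platform to return the extracted document so you can
# inspect how the HTML was converted into sections.
params = {"c": CUSTOMER_ID, "o": CORPUS_ID, "d": "true"}

with open("page.html", "rb") as f:
    resp = requests.post(
        "https://api.vectara.io/v1/upload",
        headers={"x-api-key": API_KEY},
        params=params,
        files={"file": ("page.html", f, "text/html")},
    )

print(json.dumps(resp.json(), indent=2))  # inspect the returned extracted document
```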
I’ve just tested the upload API and it seems to be doing quite a good job with some test documents.
My upload response seems to indicate it was parsed by org.apache.tika.parser.html.HtmlParser; however, my understanding of that parser is that its output still needs some processing to produce sections.
It would be helpful to understand whether the parsing happened as expected, or whether I should expect a lower-quality parse. I can then use that to weight documents appropriately.
Are you able to comment on how it produces sections from HTML documents? I could then attempt to use those same heuristics to guess whether the sections produced are of high quality or not.
We don’t have a formal spec for how HTML documents are converted into semantic documents, and we intend to iterate on and improve the conversion in the future. At a high level, our converter uses titles and headings to infer the structure of the document. An h2 heading following an h1 heading, for example, would imply a section (h1) containing another section (h2).
Did you try setting d=true and inspecting the output document manually to see if it matches your expectations?
If you can shed a little light on the specific types of pages you’re trying to handle, that would help. Generally speaking, you can expect to get better results by writing a specialized extractor, as opposed to depending on the generic one we provide.
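If you do go down the specialized-extractor route, a very rough starting point is sketched below: it walks the page with BeautifulSoup and groups text under the nearest preceding heading. This is purely illustrative and is not how the built-in converter is implemented.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def html_to_sections(html: str) -> list[dict]:
    """Group an HTML page into flat sections keyed by the nearest preceding heading.

    Purely illustrative; real pages usually need site-specific handling.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections: list[dict] = []
    current = {"title": "", "text": []}

    for el in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        if el.name in ("h1", "h2", "h3"):
            # A heading starts a new section; flush the previous one if it has text.
            if current["text"]:
                sections.append({"title": current["title"],
                                 "text": "\n".join(current["text"])})
            current = {"title": el.get_text(strip=True), "text": []}
        else:
            text = el.get_text(" ", strip=True)
            if text:
                current["text"].append(text)

    if current["text"]:
        sections.append({"title": current["title"],
                         "text": "\n".join(current["text"])})
    return sections
```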
I think my use case definitely falls under the more “generic” case. I was curious to know whether we can predict if a document may produce lower-quality sections, e.g. if the HTML lacks any headings, we could infer that the sections would probably not be nested; or if everything is grouped under one p tag, we could assume it would all end up in one section.
I was naively guessing that there may be some well-known heuristics for this, but I understand that it may be a lot harder than that.
I’ll skip attempting to weight based on quality for now and maybe tackle this question at some point in the future.
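(If I do come back to it, a crude version of that check, counting headings and looking for one giant p tag before upload, might look something like this; just my own guess at a heuristic, nothing Vectara does internally.)

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def looks_poorly_structured(html: str) -> bool:
    """Crude pre-upload guess at whether HTML will yield flat / low-quality sections.

    Heuristic only: no headings, or almost all text crammed into a single <p>,
    suggests the converter has little structure to work with.
    """
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
    paragraphs = soup.find_all("p")
    total_text = len(soup.get_text(" ", strip=True))
    if total_text == 0:
        return True
    if not headings:
        return True
    # If one <p> holds most of the page's text, the structure is probably flat.
    longest_p = max((len(p.get_text(" ", strip=True)) for p in paragraphs), default=0)
    return longest_p / total_text > 0.8
```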