Hi,
The underlying problem remains: I want to be able to search over the metadata, and I am looking for ways to work around the current indexing structure.
I’ll try to clarify what I meant by overloading. I am using the term metadata to mean a set of data that describes or adds context to a document.
By “overloading sections”, I mean that instead of splitting the document into sections only for document content, I would also use sections to carry metadata.
So if I had a document that represents a recipe web page, the sections would contain not only the actual content of the page, but possibly also information that doesn’t appear on the page.
A full example might look something like this:
{
"documentId": "guacamole",
"title": "How to Make the Best Guacamole",
"metadataJson": "{\"origin\":\"Mexico\",\"ingredients\":[\"avocado\", \"salt\", \"lime\", \"onion\", \"chilis\", \"cilantro\"]}",
"section": [
{
"metadataJson": "{\"section\":\"background\"}",
"text": "The word \"guacamole\" and the dip, are both originally from Mexico, where avocados have been cultivated for thousands of years. The name is derived from two Aztec Nahuatl words—ahuacatl (avocado) and molli (sauce)."
},
{
"metadataJson": "{\"section\":\"ingredients\"}",
"text": "2 ripe avocados\n1/4 teaspoon kosher salt, plus more to taste\n1 tablespoon fresh lime or lemon juice\n2 to 4 tablespoons minced red onion or thinly sliced green onion\n1 to 2 serrano (or jalapeño) chilis, stems and seeds removed, minced\n2 tablespoons cilantro (leaves and tender stems), finely chopped\nPinch freshly ground black pepper\n1/2 ripe tomato, chopped (optional)\nRed radish or jicama slices for garnish (optional)\nTortilla chips, to serve"
},
{
"metadataJson": "{\"section\":\"method\"}",
"title": "Method",
"section": [
{
"metadataJson": "{\"section\":\"method\"}",
"text": "Cut the avocados in half. Remove the pit. Score the inside of the avocado with a blunt knife and scoop out the flesh with a spoon. (See How to Cut and Peel an Avocado.) Place in a bowl."
},
{
"metadataJson": "{\"section\":\"method\"}",
"text": "Using a fork, roughly mash the avocado. (Don't overdo it! The guacamole should be a little chunky.)"
},
{
"metadataJson": "{\"section\":\"method\"}",
"text": "Sprinkle with salt and lime (or lemon) juice. The acid in the lime juice will provide some balance to the richness of the avocado and will help delay the avocados from turning brown.\nAdd the chopped onion, cilantro, black pepper, and chilis. Chili peppers vary individually in their spiciness. So, start with a half of one chili pepper and add more to the guacamole to your desired degree of heat.\nRemember that much of this is done to taste because of the variability in the fresh ingredients. Start with this recipe and adjust to your taste."
},
{
"metadataJson": "{\"section\":\"method\"}",
"text": "If making a few hours ahead, place plastic wrap on the surface of the guacamole and press down to cover it to prevent air reaching it. (The oxygen in the air causes oxidation which will turn the guacamole brown.)\nGarnish with slices of red radish or jicama strips. Serve with your choice of store-bought tortilla chips or make your own homemade tortilla chips.\nRefrigerate leftover guacamole up to 3 days."
}
]
},
{
"metadataJson": "{\"section\":\"author.name\"}",
"text": "John Smith"
},
{
"metadataJson": "{\"section\":\"author.biography\"}",
"text": "John Smith only cooks spicy food"
}
]
}
Note that this differs slightly from the quick example I used above: it uses metadataJson to “tag” the sections appropriately, rather than the title, which is also searchable.
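To make the idea more concrete, here is a rough Python sketch of how I might generate a document like this. The helper names (flatten, tagged_section, overload_sections) are made up for illustration and are not part of any API; only the field names (documentId, title, section, metadataJson, text) come from the example above. It builds a flat list of sections and leaves out nesting and the document-level metadataJson for brevity.

```python
import json


def flatten(metadata, prefix=""):
    """Yield (dotted_key, text) pairs from a nested metadata dict (hypothetical helper)."""
    for key, value in metadata.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, str(value)


def tagged_section(tag, text):
    """Build one section whose metadataJson carries the tag, not the title."""
    tag_json = json.dumps({"section": tag}, separators=(",", ":"))
    return {"metadataJson": tag_json, "text": text}


def overload_sections(document_id, title, content, searchable_metadata):
    """Combine real content sections with metadata-only sections in one document."""
    sections = [tagged_section(tag, text) for tag, text in content]
    sections += [tagged_section(tag, text) for tag, text in flatten(searchable_metadata)]
    return {"documentId": document_id, "title": title, "section": sections}


doc = overload_sections(
    "guacamole",
    "How to Make the Best Guacamole",
    [("background", "Guacamole is originally from Mexico.")],
    {"author": {"name": "John Smith", "biography": "John Smith only cooks spicy food"}},
)
print(json.dumps(doc, indent=2))
```

The nested author dict ends up as two extra sections tagged author.name and author.biography, which is the same shape as the last two sections in my example.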
The potential pitfalls that immediately come to mind are:
- Sections don’t necessarily follow on from each other. Is that a problem?
- A section may contain wildly different information from the rest of the document.
Are those things a big deal? Can you think of any other pitfalls that might cause problems?