Is it possible to index metadata?

I have a scenario where a single “document” has a few layers to it that have different levels of importance. The document has:

  1. Some text that defines it
  2. Some text that gives it context
  3. Some metadata

A contrived example is a recipe. I have the contents of a recipe, some back story, and maybe metadata (for example ingredients).

What is the best way to include this information in a single document?

From the documentation it seems like sections may be a way to do this, but then I lose the categorisation of that text. Am I able to put this information in the metadata and also search over it?

Thanks for your question. I’ve tried to model a recipe for you using a combination of different sections and metadata. When creating the corpus, you can add origin and section as filterable metadata. You can then restrict searches with filter expressions such as doc.origin='Mexico' (only recipes from Mexico) or part.section='background' (only the background section within recipes).

Let me know if this is helpful. Unfortunately, there is no way to run a semantic search against document metadata. Filter expressions on metadata are restricted to the operations described in the Functions and Operations page of the documentation.

{
  "documentId": "guacamole",
  "title": "How to Make the Best Guacamole",
  "metadataJson": "{\"origin\":\"Mexico\",\"ingredients\":[\"avocado\", \"salt\", \"lime\", \"onion\", \"chilis\", \"cilantro\"]}",
  "section": [
    {
      "metadataJson": "{\"section\":\"background\"}",
      "text": "The word \"guacamole\" and the dip, are both originally from Mexico, where avocados have been cultivated for thousands of years. The name is derived from two Aztec Nahuatl words—ahuacatl (avocado) and molli (sauce)."
    },
    {
      "metadataJson": "{\"section\":\"ingredients\"}",
      "text": "2 ripe avocados\n1/4 teaspoon kosher salt, plus more to taste\n1 tablespoon fresh lime or lemon juice\n2 to 4 tablespoons minced red onion or thinly sliced green onion\n1 to 2 serrano (or jalapeño) chilis, stems and seeds removed, minced\n2 tablespoons cilantro (leaves and tender stems), finely chopped\nPinch freshly ground black pepper\n1/2 ripe tomato, chopped (optional)\nRed radish or jicama slices for garnish (optional)\nTortilla chips, to serve"
    },
    {
      "metadataJson": "{\"section\":\"method\"}",
      "title": "Method",
      "section": [
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "Cut the avocados in half. Remove the pit. Score the inside of the avocado with a blunt knife and scoop out the flesh with a spoon. (See How to Cut and Peel an Avocado.) Place in a bowl."
        },
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "Using a fork, roughly mash the avocado. (Don't overdo it! The guacamole should be a little chunky.)"
        },
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "Sprinkle with salt and lime (or lemon) juice. The acid in the lime juice will provide some balance to the richness of the avocado and will help delay the avocados from turning brown.\nAdd the chopped onion, cilantro, black pepper, and chilis. Chili peppers vary individually in their spiciness. So, start with a half of one chili pepper and add more to the guacamole to your desired degree of heat.\nRemember that much of this is done to taste because of the variability in the fresh ingredients. Start with this recipe and adjust to your taste."
        },
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "If making a few hours ahead, place plastic wrap on the surface of the guacamole and press down to cover it to prevent air reaching it. (The oxygen in the air causes oxidation which will turn the guacamole brown.)\nGarnish with slices of red radish or jicama strips. Serve with your choice of store-bought tortilla chips or make your own homemade tortilla chips.\nRefrigerate leftover guacamole up to 3 days."
        }
      ]
    }
  ]
}
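Note that metadataJson is a JSON-encoded string rather than a nested object, which is why the quotes above are escaped. A minimal Python sketch of assembling such a document follows. The helper name make_section is made up for illustration, the texts are shortened, and only the field names from the example above are assumed.

```python
import json


def make_section(section_name, text=None, title=None, children=None):
    """Build one section dict; the metadata is serialized into a string."""
    section = {"metadataJson": json.dumps({"section": section_name})}
    if title is not None:
        section["title"] = title
    if text is not None:
        section["text"] = text
    if children:
        section["section"] = children
    return section


document = {
    "documentId": "guacamole",
    "title": "How to Make the Best Guacamole",
    # Document-level metadata, also stored as a JSON string.
    "metadataJson": json.dumps(
        {"origin": "Mexico", "ingredients": ["avocado", "salt", "lime"]}
    ),
    "section": [
        make_section("background", text="Guacamole is originally from Mexico."),
        make_section(
            "method",
            title="Method",
            children=[make_section("method", text="Cut the avocados in half.")],
        ),
    ],
}

print(json.dumps(document, indent=2, ensure_ascii=False))
```

Letting json.dumps produce the metadataJson strings avoids hand-escaping mistakes.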

Thank you so much for that @aahmad, that is very useful to see.

This isn’t easy to explain through a recipes example, but one scenario I’m wary of is the metadata containing information that the section text does not contain.

For example, say a query was “recipes that are spicy”. If only the metadata contained references to spice, how could I match that document?

One option I’m thinking about is overloading the sections to also contain metadata, something like this:

[
    {
        "title": "ingredients",
        "text": "avocado\nsalt\nlime\nonion\nchilis\ncilantro"
    },
    {
        "title": "author name",
        "text": "John Smith"
    },
    {
        "title": "author biography",
        "text": "John Smith only cooks spicy food"
    }
]

This would be on top of the sections that also contain the text of the document. The challenge is that I’m abusing what a “section” actually means.

Are there any pitfalls I should be aware of if I take this approach?

I realised while writing the reply above that one approach is to split the metadata into its own corpus, query that metadata corpus first, and then use the results to search over the documents. However, I think it would be quite powerful in my case if I could have all of it in a single document (since my metadata fields have quite high cardinality).
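Roughly, the two-corpus idea could look like the sketch below. Everything in it is hypothetical: the two dicts stand in for the corpora, the pancakes entry is invented sample data, and search is a naive substring match standing in for a real semantic query.

```python
# Hypothetical in-memory stand-ins for two corpora, both keyed by documentId.
metadata_corpus = {
    "guacamole": "author biography: John Smith only cooks spicy food",
    "pancakes": "author biography: Jane Doe bakes sweet breakfasts",
}
document_corpus = {
    "guacamole": "Using a fork, roughly mash the avocado.",
    "pancakes": "Whisk the flour, milk and eggs into a smooth batter.",
}


def search(corpus, query, allowed_ids=None):
    """Naive substring match; a placeholder for a real semantic search."""
    hits = []
    for doc_id, text in corpus.items():
        if allowed_ids is not None and doc_id not in allowed_ids:
            continue
        if query.lower() in text.lower():
            hits.append(doc_id)
    return hits


def two_stage_search(metadata_query, document_query):
    # Stage 1: find the documents whose metadata matches.
    candidate_ids = set(search(metadata_corpus, metadata_query))
    # Stage 2: search the document text, restricted to those candidates.
    return search(document_corpus, document_query, allowed_ids=candidate_ids)


print(two_stage_search("spicy", "avocado"))
```

The downside is the extra round trip, plus keeping the document IDs in sync between the two corpora.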

In the example you shared, is the ingredients section what you were referring to when you said “I’m thinking about is overloading the sections to also contain metadata”? I’m a little unsure because none of the sections in your example contain any metadata (which is stored in the metadataJson field).

{
    "title": "ingredients",
    "text": "avocado\nsalt\nlime\nonion\nchilis\ncilantro"
}

As long as you set a metadata field to mark the section as containing auxiliary data (e.g. type='auxiliary'), then you can store all the data in a single corpus, and use filter expressions to control exactly what is searched:

  1. no expression: query document data and auxiliary data
  2. part.type='auxiliary': query only auxiliary data, and not document data.
  3. part.type<>'auxiliary': query only document data, and not auxiliary data.

Alternatively, you could split the auxiliary data into its own corpus, as you suggested.
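To make the three filter choices concrete, here is a small local simulation in Python. It is only a sketch and not how the platform evaluates filters: select_parts and its mode names are invented for illustration, and it assumes that a section with no type set counts as document data.

```python
import json

# Flattened parts of one document, each carrying its own metadataJson string.
parts = [
    {
        "metadataJson": json.dumps({"section": "method"}),
        "text": "Using a fork, roughly mash the avocado.",
    },
    {
        "metadataJson": json.dumps(
            {"section": "author.biography", "type": "auxiliary"}
        ),
        "text": "John Smith only cooks spicy food",
    },
]


def select_parts(parts, mode=None):
    """Local stand-in for the three filter choices listed above.

    mode=None        -> no expression: document data and auxiliary data
    mode="only_aux"  -> part.type='auxiliary'
    mode="no_aux"    -> part.type<>'auxiliary'
    """
    selected = []
    for part in parts:
        # Assumption: a part without a "type" key is treated as document data.
        is_aux = json.loads(part["metadataJson"]).get("type") == "auxiliary"
        if mode == "only_aux" and not is_aux:
            continue
        if mode == "no_aux" and is_aux:
            continue
        selected.append(part["text"])
    return selected


print(select_parts(parts, "only_aux"))
```

Whether the real filter treats sections with no type set the same way is worth verifying. Setting an explicit type on every section sidesteps the question.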

Hi,

So the problem still remains that I want to be able to search over the metadata and I am trying to find ways around the current indexing structure.

I’ll try and clarify what I meant by overloading. I am using the term metadata to mean a set of data that describes/adds context to a document.

By “overloading sections”, I mean, instead of just splitting the document into sections for document data, I will also use sections to show metadata.

So if I had a document that represents a recipe web page, not only will the sections contain the actual content of the website, but they may also contain other things that aren’t on the website.

A full example might look something like this:

{
  "documentId": "guacamole",
  "title": "How to Make the Best Guacamole",
  "metadataJson": "{\"origin\":\"Mexico\",\"ingredients\":[\"avocado\", \"salt\", \"lime\", \"onion\", \"chilis\", \"cilantro\"]}",
  "section": [
    {
      "metadataJson": "{\"section\":\"background\"}",
      "text": "The word \"guacamole\" and the dip, are both originally from Mexico, where avocados have been cultivated for thousands of years. The name is derived from two Aztec Nahuatl words—ahuacatl (avocado) and molli (sauce)."
    },
    {
      "metadataJson": "{\"section\":\"ingredients\"}",
      "text": "2 ripe avocados\n1/4 teaspoon kosher salt, plus more to taste\n1 tablespoon fresh lime or lemon juice\n2 to 4 tablespoons minced red onion or thinly sliced green onion\n1 to 2 serrano (or jalapeño) chilis, stems and seeds removed, minced\n2 tablespoons cilantro (leaves and tender stems), finely chopped\nPinch freshly ground black pepper\n1/2 ripe tomato, chopped (optional)\nRed radish or jicama slices for garnish (optional)\nTortilla chips, to serve"
    },
    {
      "metadataJson": "{\"section\":\"method\"}",
      "title": "Method",
      "section": [
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "Cut the avocados in half. Remove the pit. Score the inside of the avocado with a blunt knife and scoop out the flesh with a spoon. (See How to Cut and Peel an Avocado.) Place in a bowl."
        },
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "Using a fork, roughly mash the avocado. (Don't overdo it! The guacamole should be a little chunky.)"
        },
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "Sprinkle with salt and lime (or lemon) juice. The acid in the lime juice will provide some balance to the richness of the avocado and will help delay the avocados from turning brown.\nAdd the chopped onion, cilantro, black pepper, and chilis. Chili peppers vary individually in their spiciness. So, start with a half of one chili pepper and add more to the guacamole to your desired degree of heat.\nRemember that much of this is done to taste because of the variability in the fresh ingredients. Start with this recipe and adjust to your taste."
        },
        {
          "metadataJson": "{\"section\":\"method\"}",
          "text": "If making a few hours ahead, place plastic wrap on the surface of the guacamole and press down to cover it to prevent air reaching it. (The oxygen in the air causes oxidation which will turn the guacamole brown.)\nGarnish with slices of red radish or jicama strips. Serve with your choice of store-bought tortilla chips or make your own homemade tortilla chips.\nRefrigerate leftover guacamole up to 3 days."
        }
      ]
    },
    {
      "metadataJson": "{\"section\":\"author.name\"}",
      "text": "John Smith"
    },
    {
      "metadataJson": "{\"section\":\"author.biography\"}",
      "text": "John Smith only cooks spicy food"
    }
  ]
}

Note, it is slightly different than the quick example I used above, in that it uses metadataJson to “tag” the sections appropriately, rather than the title which is also searchable.
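In practice I would probably generate those extra sections from the metadata rather than write them by hand. A rough Python sketch of that, where metadata_to_sections is a made-up helper and the dotted naming (author.name, author.biography) is just my own convention from the example above:

```python
import json


def metadata_to_sections(metadata, prefix=""):
    """Flatten a nested metadata dict into text sections tagged via metadataJson."""
    sections = []
    for key, value in metadata.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects become dotted names, e.g. author.name.
            sections.extend(metadata_to_sections(value, prefix=f"{name}."))
            continue
        if isinstance(value, list):
            # Lists are joined one item per line, like the ingredients example.
            value = "\n".join(str(item) for item in value)
        sections.append(
            {
                "metadataJson": json.dumps({"section": name}),
                "text": str(value),
            }
        )
    return sections


extra = metadata_to_sections(
    {
        "author": {
            "name": "John Smith",
            "biography": "John Smith only cooks spicy food",
        }
    }
)
print(json.dumps(extra, indent=2))
```

The resulting list would be appended to the document’s existing section array.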

The potential pitfalls that immediately come to mind are:

  • sections don’t necessarily follow on from each other, is that a problem?
  • a section may contain wildly different information to the rest of the document

Are those things a big deal? And can you think of any other pitfalls that may cause problems?

Thank you for the clarification.

The way you’ve structured the document is fine. It’s okay that sections don’t follow on from each other. It’s also okay that sections contain wildly different information to the rest of the document.

The only caveats I see are minor: you will need to use filters when running your queries, and the document doesn’t look as natural, although, again, it won’t have any impact on the quality of the results.

I also wanted to draw your attention to the Low-Level API, which provides greater control over how the document is segmented and vectorized (using the Standard API, as you are currently doing, delegates those decisions to the platform).

I hope this helps.

Perfect, thank you, very helpful! Looking forward to testing this.