Pre-processing pipeline (PDF documents to Markdown)

Hello Vectara Community,

I’m currently developing a pre-processing pipeline for our RAG system, and our goal is to prepare complex documents to be indexed in Vectara as a knowledge base for a Contact Center Agent Assistant.

To ensure consistency and optimal structure, I’m experimenting with using a powerful LLM (like GPT-4/Claude) guided by a detailed prompt to transform our raw documents into clean Markdown and associated JSON metadata. The idea is to standardize the input before it ever reaches the Vectara indexing API.

I’d love to share the prompt I’ve designed and get your valuable feedback on this approach.

The Prompt

Here is the core of my pre-processing strategy, encapsulated in a single prompt:

## ROLE

You are a data pre-processing specialist for RAG (Retrieval-Augmented Generation) systems, with specific expertise in the **Vectara platform**. Your skill lies in transforming complex and irregularly formatted documents into clean, structured Markdown text, ideal for accurate semantic indexing.

## OBJECTIVE

Your task is to process a source document and convert it into two output artifacts:

1.  A complete and well-structured document in **Markdown format**, optimized for indexing in Vectara.
2.  A simple **JSON object** containing the essential metadata to be associated with the document in Vectara.

## CONTEXT

The transformed document will serve as the knowledge base for an "Agent Assistant" designed to support contact center agents. The goal is to enable agents to make natural language queries and receive fast, accurate answers about procedures, guides, and customer service scenarios. Proper structuring is crucial for Vectara's RAG system to find and return the most relevant information.

## TRANSFORMATION RULES (DETAILED INSTRUCTIONS)

*   **Golden Rule: Absolute Fidelity.** You must NOT omit, summarize, or simplify ANY information from the original document. The output must be a complete and faithful version of the content, merely reformatted according to the following rules.

*   **Markdown Structure and Formatting:**
    *   Use Markdown headings and subheadings (`#`, `##`, `###`) to replicate the document's logical hierarchy. This is vital for Vectara to create coherent semantic chunks.
    *   Use bulleted (`-` or `*`) or numbered (`1.`, `2.`) lists for any enumerations or step-by-step instructions.
    *   Use `**bold text**` to highlight key concepts, fields, or important titles within a paragraph.

## EXPECTED OUTPUT FORMAT

You must provide two distinct and clearly separated blocks:

1.  **Markdown Document:** The full content of the transformed document.
2.  **JSON Metadata:** The JSON object containing the metadata.
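To make the second artifact concrete, this is the rough shape of metadata I have in mind for a typical document. All field names here are my own invention for illustration — nothing in this schema is mandated by Vectara:

```json
{
  "title": "Refund and Returns Procedure",
  "doc_type": "procedure",
  "department": "customer_service",
  "language": "en",
  "last_updated": "2024-01-15",
  "source_file": "refund_procedure_v3.pdf"
}
```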

I’m particularly interested in your thoughts on a few points:

  1. General Approach: What are your initial thoughts on using an LLM-based prompt for this kind of structured data transformation? Have you tried something similar?

  2. Vectara-Specific Optimizations: Is there anything in this prompt (e.g., the use of bold text, the heading structure) that might be counter-productive or could be improved for Vectara’s specific chunking and indexing algorithms?

  3. Alternative Methods: Are there other pre-processing techniques or tools you’d recommend, perhaps in combination with this method? For example, running custom scripts before the LLM transformation.

  4. Metadata Best Practices: What best practices do you follow for metadata? Are there specific fields you’ve found particularly useful for filtering and improving retrieval quality?
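To ground question 3 a bit, here is a minimal sketch of the orchestration I'm considering around the LLM call. The model client itself is out of scope (any provider SDK would slot in); the splitting logic assumes the model returns the Markdown document first, followed by the JSON metadata in a fenced ```json block, as my prompt's output format requests:

```python
import json
import re

def split_llm_output(response_text: str) -> tuple[str, dict]:
    """Split the model's response into the Markdown document and the
    JSON metadata dict, assuming the metadata sits in a ```json fence
    after the Markdown (per the prompt's expected output format)."""
    match = re.search(r"```json\s*(\{.*?\})\s*```", response_text, re.DOTALL)
    if not match:
        raise ValueError("No JSON metadata block found in LLM output")
    metadata = json.loads(match.group(1))
    # Everything before the JSON fence is treated as the Markdown document.
    markdown = response_text[: match.start()].strip()
    return markdown, metadata
```

A validation step like this (plus a `json.loads` failure path that retries the LLM call) is exactly the kind of "custom script around the LLM" I had in mind.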

For the Vectara team, I would be very grateful to know if this approach aligns with your recommended best practices for data preparation.

Thanks in advance for any insights or suggestions you can share! I’m looking forward to learning from your experience.

Best,
Diego Prada

Hi Diego,

Using an LLM with a prompt to pre-process data is certainly a valid approach. Success really depends on the data and the LLM you use, but a frontier model from Anthropic, Google (Gemini), or OpenAI (GPT-4/5) is likely a good bet.
As an alternative, you can also check out "vectara-ingest", a package built to crawl and ingest documents into Vectara. It includes several different types of crawlers, and you can also add your own.
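Once you have the Markdown and metadata, getting a document into a corpus is a single REST call. As a rough sketch of the request shape against the v2 indexing API (corpus key, document ID, and field values are placeholders — please verify the exact payload against the current API reference):

```python
API_BASE = "https://api.vectara.io/v2"  # Vectara v2 REST API base URL

def build_index_request(corpus_key: str, doc_id: str,
                        markdown: str, metadata: dict) -> tuple[str, dict]:
    """Build the URL and JSON body for indexing one pre-processed document."""
    url = f"{API_BASE}/corpora/{corpus_key}/documents"
    body = {
        "id": doc_id,
        "type": "structured",
        # Vectara handles chunking of each section's text itself.
        "sections": [{"text": markdown}],
        "metadata": metadata,
    }
    return url, body

# To actually send it (requires the `requests` package and a real API key):
# import requests
# url, body = build_index_request("my-corpus", "refund-guide-v1", md_text, meta)
# resp = requests.post(url, json=body, headers={"x-api-key": API_KEY})
```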