Introduction
Building a Retrieval-Augmented Generation (RAG) application requires careful preparation of data before it can be effectively retrieved and used by large language models (LLMs). Preprocessing ensures that information is broken down, represented, and stored in ways that maximize efficiency and semantic coherence. The three critical components of preprocessing for RAG are chunking, embeddings, and metadata.
Step 1: Chunking
Chunking is the first and most critical preprocessing step in building RAG applications. It involves splitting documents into smaller, semantically coherent units that can be consumed by embedding models and LLMs. Effective chunking ensures that data is consumable, coherent, and contextual, enabling accurate retrieval and generation.
The Three Cs of Chunking
1. Consumable
Definition: A chunk must fit within the context window of the embedding model and the LLM.
Technical Rule: At least three chunks (or however many the top K retrieval setting returns) should fit together inside the LLM’s context window.
Practical Rule: A chunk should be readable in one go, much like a paragraph or short passage.
2. Coherent
Definition: A chunk must make sense as a complete thought.
Technical Rule: Avoid splitting in the middle of words, clauses, or sentences.
Practical Rule: Chunks should be meaningful units (e.g., “Curiosity killed the cat” is coherent; “killed the” is not).
3. Contextual
Definition: A chunk must contain enough surrounding information to preserve meaning.
Example: “Curiosity killed the cat” is coherent but incomplete without “…but satisfaction brought it back.”
Guideline: Ensure chunks retain the necessary context to answer questions accurately.
Key Chunking Parameters
1. Chunk Size
Refers to the number of characters or tokens per chunk.
Typical reference: ~100 words or ~500 characters per paragraph.
Usage: Can be a strict limit or a guideline depending on the chunking method.
2. Chunk Overlap
Purpose:
Preserves context across boundaries.
Reinforces consumability, coherence, and contextuality.
Example: Including the last sentence of one chunk at the start of the next.
3. Special Characters
Characters used to guide chunk boundaries (e.g., periods, newlines).
Purpose: Relax strict size limits to ensure coherent splits.
Example: Allowing a chunk to exceed its size limit to end at the next period or double newline.
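To make these parameters concrete, here is a minimal character-based chunker sketch; the function name, defaults, and separator set are illustrative assumptions, not rules from the guidelines above:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50,
               separators: tuple = (". ", "\n\n")) -> list[str]:
    """Split text into roughly chunk_size-character chunks whose
    boundaries fall on a separator, with overlapping tails."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Special characters relax the strict size limit: extend the
        # chunk until it ends at a separator, never mid-sentence.
        while end < len(text) and not any(text[:end].endswith(s) for s in separators):
            end += 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break
        # Chunk overlap: start the next chunk a little early so
        # boundary context is shared between neighbouring chunks.
        start = max(end - overlap, start + 1)
    return chunks
```

Calling chunk_text(document) on an essay yields roughly paragraph-sized chunks whose boundaries fall on sentence ends, with each chunk repeating a short tail of its predecessor.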
Chunking Strategies by Text Type
Chunking strategies vary depending on the type of text being processed. Different document structures require different approaches to ensure that chunks remain consumable, coherent, and contextual. This section explores three common types of text data—document data, Q/A transcripts, and chat transcripts—and how chunking applies to each.
1. Document Data
Examples:
Research papers
Essays
Blogs
Reports
Documentation
Defining Features:
Organized into regularly sized blocks (paragraphs, sections).
Semantic coherence is usually consistent across paragraphs.
Chunking Approach:
Split on paragraph or section boundaries (e.g., double newlines), treating chunk size as a guideline and using overlap to carry context across boundaries; the parameter sketch above applies directly.
Use Case:
Academic papers
Technical documentation
News articles
2. Q/A Transcripts
Examples:
Podcasts
Lectures
AMA sessions
Interviews
Defining Features:
Short-long structure: questions are short, answers are longer.
Questions and answers are semantically linked.
Chunking Approach:
Chunk on speaker turns, keeping each question together with the answer that follows it so the semantic link is preserved (see the sketch below).
Use Case:
Educational transcripts
Customer FAQs
Panel discussions
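Since a question and its answer belong together, one simple tactic is to split the transcript at each new question marker. A minimal sketch, assuming the transcript prefixes turns with hypothetical "Q:" and "A:" labels:

```python
import re

def chunk_qa_transcript(transcript: str) -> list[str]:
    """Keep each question together with the answer that follows it."""
    # Split at every line that starts a new question; each resulting
    # piece then holds one 'Q:' line plus the 'A:' block after it.
    turns = re.split(r"(?=^Q:)", transcript, flags=re.MULTILINE)
    return [turn.strip() for turn in turns if turn.strip()]
```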
3. Chat Transcripts
Examples:
Customer support logs
Text messages
Group chats
Direct messages (DMs)
Defining Features:
Irregular chunk sizes (single words, sentences, or paragraphs).
Unpredictable linking—no guaranteed Q/A structure.
Multiple consecutive messages from the same sender.
Chunking Approach:
Special characters (e.g., message delimiters, speaker tags) are essential, and consecutive messages from the same sender are often merged into a single chunk (see the sketch below).
Metadata must capture speaker identity and timestamps.
Use Case:
Customer support retrieval
Chat history search
Conversational assistants
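A minimal sketch of the merging tactic mentioned above, assuming each message arrives as a dict with hypothetical speaker, text, and ts keys:

```python
def chunk_chat(messages: list[dict]) -> list[dict]:
    """Merge consecutive messages from the same sender into one chunk,
    carrying speaker identity and timestamps along as metadata."""
    chunks: list[dict] = []
    for msg in messages:
        if chunks and chunks[-1]["speaker"] == msg["speaker"]:
            # Same sender again: extend the current chunk.
            chunks[-1]["text"] += "\n" + msg["text"]
            chunks[-1]["end_ts"] = msg["ts"]
        else:
            chunks.append({"speaker": msg["speaker"], "text": msg["text"],
                           "start_ts": msg["ts"], "end_ts": msg["ts"]})
    return chunks
```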
Step 2: Embeddings
What Are Embeddings?
Definition: Embeddings are numerical representations (vectors) of unstructured data.
Purpose: They allow us to measure similarity between data points in a mathematically consistent way.
Applications: Search engines, recommendation systems, semantic analysis, multimodal AI, and RAG.
Choosing the Right Embedding Model
Three critical considerations guide model selection:
Embedding Size (Dimensionality)
Refers to the length of the vector.
Larger embeddings capture more nuance but require more computational power.
Only embeddings of the same size, produced by the same model (or within the same embedding space), can be meaningfully compared.
Model Size
Smaller models are cheaper and faster to run.
Larger models provide fine-grained representations but at higher cost.
Embedding models are not always LLMs—specialized architectures often outperform general-purpose ones.
Training Data
Determines the domain and language coverage.
Chat-trained models excel at conversational embeddings, while essay-trained models capture formal semantics.
Specialized domains (e.g., CSVs, medical text) require tailored training datasets.
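As a concrete reference point, the sketch below embeds a batch of chunks with the sentence-transformers library; the model choice (all-MiniLM-L6-v2, a small 384-dimensional model) is an illustrative assumption, not a recommendation:

```python
from sentence_transformers import SentenceTransformer

# A small general-purpose model: cheap and fast, 384-dimensional vectors.
# Larger models trade speed and cost for more fine-grained representations.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["Curiosity killed the cat, but satisfaction brought it back."]
embeddings = model.encode(chunks)  # numpy array, shape (len(chunks), 384)
```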
Types of Embedding Models
Neural Network-Based
Examples:
Sentence-BERT (SBERT)
OpenAI text-embedding models
Cohere Embed
E5
Algorithmic Models (Non-Neural)
Examples:
TF-IDF
BM25
Latent semantic analysis (LSA)
What to Embed
Chunked Text: Embedding smaller sections for granular retrieval.
Partial Chunks: Embedding sentences or phrases for precision.
Large Sections: Embedding paragraphs or documents for broader context.
Techniques:
Basic Embeddings: Directly embedding the chunk of text (sentence, paragraph, or document) without additional structuring.
Large-to-Small: Embed large paragraphs, store sentences as metadata.
Small-to-Large: Embed sentences, store paragraphs as metadata (sketched after this list).
Non-English Embeddings: Embedding models trained primarily on English data often fail to capture nuances in other languages. Multilingual models such as GPT-4, Mixtral, or Qwen support many languages but can be computationally expensive. The MTEB leaderboard lists models trained on specific languages (French, Polish, Chinese, etc.), offering more efficient, domain-specific alternatives.
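A minimal sketch of the small-to-large technique, with an illustrative record layout and a caller-supplied embed_fn:

```python
def small_to_large_records(paragraphs: list[str], embed_fn) -> list[dict]:
    """Small-to-large: embed each sentence for retrieval precision,
    but store its parent paragraph as metadata for generation context."""
    records = []
    for paragraph in paragraphs:
        for sentence in paragraph.split(". "):
            records.append({
                "vector": embed_fn(sentence),  # similarity search runs on this
                "metadata": {"sentence": sentence, "paragraph": paragraph},
            })
    return records
```

Large-to-small simply inverts the pattern: embed the paragraph and store its sentences in the metadata.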
Comparing Embeddings
Dense Vector Metrics
Euclidean Distance: Measures spatial distance between vectors.
Cosine Similarity: Measures orientation difference (normalized dot product).
Inner Product: Measures projection of one vector onto another.
Sparse/Binary Metrics
Jaccard Similarity: Measures overlap between the sets of non-zero positions.
Hamming Distance: Counts the positions at which two binary vectors differ.
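All of these metrics are a few lines of NumPy; the sketch below computes each one for a pair of dense vectors and a pair of binary vectors:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 3.0])

euclidean = np.linalg.norm(a - b)                         # spatial distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # orientation only
inner = a @ b                                             # unnormalized projection

# Binary vectors, e.g. sparse term-presence features.
x, y = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])
hamming = np.sum(x != y)                 # positions that differ
jaccard = np.sum(x & y) / np.sum(x | y)  # overlap of non-zero positions
```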
Step 3: Metadata
What Is Metadata?
Definition: Metadata is all the information stored alongside vector embeddings that is not the embedding itself.
Purpose: It provides context, supports filtering, and enables richer retrieval beyond raw similarity scores.
Examples: Original text, author, timestamp, section headers, document titles, or any other descriptive attributes.
Types of Metadata
Metadata can be broadly divided into two categories:
1. Chunking Metadata
Origin: Produced during the chunking process.
Examples: Sentence number, section header, subtitle.
Usage:
Provides context about where a chunk comes from.
Enables filtering by document structure (e.g., retrieving only from a specific section).
2. Non-Chunking Metadata
Origin: Independent of chunking.
Examples: Author, last updated date, document title.
Usage:
Supports filtering by external attributes (e.g., only documents updated in the last month).
Useful for personalization and domain-specific queries.
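The sketch below puts both categories on a single record and applies a simple pre-filter on a non-chunking attribute; the field names are illustrative:

```python
from datetime import datetime, timedelta

record = {
    "vector": [0.12, -0.48, 0.33],  # the embedding itself
    "metadata": {
        # Chunking metadata: produced during splitting.
        "section_header": "Results",
        "sentence_number": 42,
        # Non-chunking metadata: independent of the split.
        "author": "J. Doe",
        "last_updated": datetime(2024, 5, 1),
        "title": "Quarterly Report",
    },
}

def updated_recently(records: list[dict], days: int = 30) -> list[dict]:
    """Pre-filter on a non-chunking attribute before similarity search."""
    cutoff = datetime.now() - timedelta(days=days)
    return [r for r in records if r["metadata"]["last_updated"] >= cutoff]
```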
Why Metadata Matters in RAG
Basic RAG: Metadata ensures the original text is available for retrieval and generation.
Advanced RAG: Metadata enables fine-grained filtering, contextual relevance, and domain-specific constraints.
Semantic Coherence: Metadata helps align retrieved chunks with the broader context of the source document.
Storing Metadata
There are two main approaches:
Linked Storage
Metadata lives in a separate store (e.g., a relational database), joined to each vector by ID.
Advantage: Separation of concerns, useful for complex relational queries.
Direct Storage in Vector Database
Metadata is stored in the same record as the embedding and returned with every query hit.
Advantage: Faster retrieval and simpler integration for RAG applications.
Most popular approach in modern vector databases (e.g., Pinecone, Weaviate, Milvus).
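A minimal in-memory sketch of the direct-storage pattern; real clients such as Pinecone, Weaviate, or Milvus expose analogous upsert/query calls, but the class and method names here are illustrative:

```python
import numpy as np

class TinyVectorStore:
    """Direct storage: each entry keeps embedding and metadata together,
    so every query hit comes back with its metadata attached."""

    def __init__(self):
        self.entries = {}

    def upsert(self, doc_id, vector, metadata):
        self.entries[doc_id] = {"vector": np.asarray(vector), "metadata": metadata}

    def query(self, vector, top_k=3):
        q = np.asarray(vector)
        def score(v):
            return (q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
        ranked = sorted(self.entries.items(),
                        key=lambda kv: score(kv[1]["vector"]), reverse=True)
        return ranked[:top_k]
```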
Unified Workflow
Raw Data Input → Large documents, transcripts, or datasets.
Chunking → Split into consumable, coherent, contextual units.
Embedding Generation → Convert chunks into dense vectors using appropriate embedding models.
Metadata Attachment → Store original text and contextual attributes alongside embeddings.
Vector Database Storage → Embeddings and metadata indexed for retrieval.
RAG Retrieval → Query retrieves top K embeddings and metadata.
LLM Generation → LLM uses retrieved text and metadata to produce grounded responses.
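Putting the steps together, a minimal end-to-end sketch that reuses the hypothetical chunk_text, model, and TinyVectorStore from the earlier sketches; the file name and query are illustrative:

```python
# 1-2. Raw data in, then chunking (chunk_text from the earlier sketch).
raw = open("report.txt").read()
chunks = chunk_text(raw, chunk_size=500, overlap=50)

# 3-5. Embed each chunk, attach metadata, store both together
# (model and TinyVectorStore from the earlier sketches).
store = TinyVectorStore()
for i, chunk in enumerate(chunks):
    vector = model.encode([chunk])[0]
    store.upsert(f"chunk-{i}", vector, {"text": chunk, "position": i,
                                        "source": "report.txt"})

# 6. Retrieval: top K nearest chunks for a user query.
query_vec = model.encode(["What were the key findings?"])[0]
hits = store.query(query_vec, top_k=3)

# 7. Generation: pass retrieved text to the LLM as grounding context.
context = "\n\n".join(hit[1]["metadata"]["text"] for hit in hits)
```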
Visual Summary (Conceptual Flow)
Raw Data → Chunking → Embeddings → Metadata → Vector Database → Retrieval → LLM Generation
Summary
Chunking, embeddings, and metadata form the three pillars of RAG preprocessing. Chunking ensures data is consumable, embeddings make it comparable, and metadata makes it contextual and filterable. Together, they transform raw unstructured data into a structured, retrievable knowledge base that powers effective RAG applications.