Introduction
Building a Retrieval-Augmented Generation (RAG) application requires careful preparation of data before it can be effectively retrieved and used by large language models (LLMs). Preprocessing ensures that information is broken down, represented, and stored in ways that maximize efficiency and semantic coherence. The three critical components of preprocessing for RAG are chunking, embeddings, and metadata.
Step 1: Chunking
Chunking is the first and most critical preprocessing step in building RAG applications. It involves splitting documents into smaller, semantically coherent units that can be consumed by embedding models and LLMs. Effective chunking ensures that data is consumable, coherent, and contextual, enabling accurate retrieval and generation.
The Three Cs of Chunking
1. Consumable
Definition: A chunk must fit within the context window of the embedding model and the LLM.
Technical Rule: At least three chunks (or however many the top K retrieval setting returns) should fit together inside the LLM’s context window.
Practical Rule: A chunk should be readable in one go, much like a paragraph or short passage.
2. Coherent
Definition: A chunk must make sense as a complete thought.
Technical Rule: Avoid splitting in the middle of words, clauses, or sentences.
Practical Rule: Chunks should be meaningful units (e.g., “Curiosity killed the cat” is coherent; “killed the” is not).
3. Contextual
Definition: A chunk must contain enough surrounding information to preserve meaning.
Example: “Curiosity killed the cat” is coherent but incomplete without “…but satisfaction brought it back.”
Guideline: Ensure chunks retain the necessary context to answer questions accurately.
Key Chunking Parameters
1. Chunk Size
Refers to the number of characters or tokens per chunk.
Typical reference: ~100 words or ~500 characters per paragraph.
Usage: Can be a strict limit or a guideline depending on the chunking method.
2. Chunk Overlap
Purpose:
Preserves context across boundaries.
Reinforces consumability, coherence, and contextuality.
Example: Including the last sentence of one chunk at the start of the next.
3. Special Characters
Characters used to guide chunk boundaries (e.g., periods, newlines).
Purpose: Relax strict size limits to ensure coherent splits.
Example: Allowing a chunk to exceed its size limit to end at the next period or double newline.
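To make these parameters concrete, here is a minimal character-based chunker sketch; the function name, defaults, and separator set are illustrative assumptions, not rules from the guidelines above:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50,
               separators: tuple = (". ", "\n\n")) -> list[str]:
    """Split text into roughly chunk_size-character chunks whose
    boundaries fall on a separator, with overlapping tails."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Special characters relax the strict size limit: extend the
        # chunk until it ends at a separator, never mid-sentence.
        while end < len(text) and not any(text[:end].endswith(s) for s in separators):
            end += 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break
        # Chunk overlap: start the next chunk a little early so
        # boundary context is shared between neighbouring chunks.
        start = max(end - overlap, start + 1)
    return chunks
```

Calling chunk_text(document) on an essay yields roughly paragraph-sized chunks whose boundaries fall on sentence ends, with each chunk repeating a short tail of its predecessor.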
Chunking Strategies by Text Type
Chunking strategies vary depending on the type of text being processed. Different document structures require different approaches to ensure that chunks remain consumable, coherent, and contextual. This section explores three common types of text data—document data, Q/A transcripts, and chat transcripts—and how chunking applies to each.
1. Document Data
Examples:
Research papers
Essays
Blogs
Reports
Documentation
Defining Features:
Organized into regularly sized blocks (paragraphs, sections).
Semantic coherence is usually consistent across paragraphs.
Chunking Approach:
Split on paragraph or section boundaries (e.g., double newlines), treating chunk size as a guideline and using overlap to carry context across boundaries; the parameter sketch above applies directly.
Use Case:
Academic papers
Technical documentation
News articles
2. Q/A Transcripts
Examples:
Podcasts
Lectures
AMA sessions
Interviews
Defining Features:
Short-long structure: questions are short, answers are longer.
Questions and answers are semantically linked.
Chunking Approach:
Chunk on speaker turns, keeping each question together with the answer that follows it so the semantic link is preserved (see the sketch below).
Use Case:
Educational transcripts
Customer FAQs
Panel discussions
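Since a question and its answer belong together, one simple tactic is to split the transcript at each new question marker. A minimal sketch, assuming the transcript prefixes turns with hypothetical "Q:" and "A:" labels:

```python
import re

def chunk_qa_transcript(transcript: str) -> list[str]:
    """Keep each question together with the answer that follows it."""
    # Split at every line that starts a new question; each resulting
    # piece then holds one 'Q:' line plus the 'A:' block after it.
    turns = re.split(r"(?=^Q:)", transcript, flags=re.MULTILINE)
    return [turn.strip() for turn in turns if turn.strip()]
```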
3. Chat Transcripts
Examples:
Customer support logs
Text messages
Group chats
Direct messages (DMs)
Defining Features:
Irregular chunk sizes (single words, sentences, or paragraphs).
Unpredictable linking—no guaranteed Q/A structure.
Multiple consecutive messages from the same sender.
Chunking Approach:
Special characters (e.g., message delimiters, speaker tags) are essential, and consecutive messages from the same sender are often merged into a single chunk (see the sketch below).
Metadata must capture speaker identity and timestamps.
Use Case:
Customer support retrieval
Chat history search
Conversational assistants
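A minimal sketch of the merging tactic mentioned above, assuming each message arrives as a dict with hypothetical speaker, text, and ts keys:

```python
def chunk_chat(messages: list[dict]) -> list[dict]:
    """Merge consecutive messages from the same sender into one chunk,
    carrying speaker identity and timestamps along as metadata."""
    chunks: list[dict] = []
    for msg in messages:
        if chunks and chunks[-1]["speaker"] == msg["speaker"]:
            # Same sender again: extend the current chunk.
            chunks[-1]["text"] += "\n" + msg["text"]
            chunks[-1]["end_ts"] = msg["ts"]
        else:
            chunks.append({"speaker": msg["speaker"], "text": msg["text"],
                           "start_ts": msg["ts"], "end_ts": msg["ts"]})
    return chunks
```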
Step 2: Embeddings
What Are Embeddings?
Definition: Embeddings are numerical representations (vectors) of unstructured data.
Purpose: They allow us to measure similarity between data points in a mathematically consistent way.
Applications: Search engines, recommendation systems, semantic analysis, multimodal AI, and RAG.
Choosing the Right Embedding Model
Three critical considerations guide model selection:
Embedding Size (Dimensionality)
Refers to the length of the vector.
Larger embeddings capture more nuance but require more computational power.
Only embeddings of the same size, produced by the same model (or within the same embedding space), can be meaningfully compared.
Model Size
Smaller models are cheaper and faster to run.
Larger models provide fine-grained representations but at higher cost.
Embedding models are not always LLMs—specialized architectures often outperform general-purpose ones.
Training Data
Determines the domain and language coverage.
Chat-trained models excel at conversational embeddings, while essay-trained models capture formal semantics.
Specialized domains (e.g., CSVs, medical text) require tailored training datasets.
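As a concrete reference point, the sketch below embeds a batch of chunks with the sentence-transformers library; the model choice (all-MiniLM-L6-v2, a small 384-dimensional model) is an illustrative assumption, not a recommendation:

```python
from sentence_transformers import SentenceTransformer

# A small general-purpose model: cheap and fast, 384-dimensional vectors.
# Larger models trade speed and cost for more fine-grained representations.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["Curiosity killed the cat, but satisfaction brought it back."]
embeddings = model.encode(chunks)  # numpy array, shape (len(chunks), 384)
```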
Types of Embedding Models
Neural Network-Based
Examples:
Sentence-BERT (SBERT)
OpenAI text-embedding models
Cohere Embed
E5
Algorithmic Models (Non-Neural)
Examples:
TF-IDF
BM25
Latent semantic analysis (LSA)
What to Embed
Chunked Text: Embedding smaller sections for granular retrieval.
Partial Chunks: Embedding sentences or phrases for precision.
Large Sections: Embedding paragraphs or documents for broader context.
Techniques:
Basic Embeddings: Directly embedding the chunk of text (sentence, paragraph, or document) without additional structuring.
Large-to-Small: Embed large paragraphs, store sentences as metadata.
Small-to-Large: Embed sentences, store paragraphs as metadata (sketched after this list).
Non-English Embeddings: Embedding models trained primarily on English data often fail to capture nuances in other languages. Multilingual models such as GPT-4, Mixtral, or Qwen support many languages but can be computationally expensive. The MTEB leaderboard lists models trained on specific languages (French, Polish, Chinese, etc.), offering more efficient, domain-specific alternatives.
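A minimal sketch of the small-to-large technique, with an illustrative record layout and a caller-supplied embed_fn:

```python
def small_to_large_records(paragraphs: list[str], embed_fn) -> list[dict]:
    """Small-to-large: embed each sentence for retrieval precision,
    but store its parent paragraph as metadata for generation context."""
    records = []
    for paragraph in paragraphs:
        for sentence in paragraph.split(". "):
            records.append({
                "vector": embed_fn(sentence),  # similarity search runs on this
                "metadata": {"sentence": sentence, "paragraph": paragraph},
            })
    return records
```

Large-to-small simply inverts the pattern: embed the paragraph and store its sentences in the metadata.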
Comparing Embeddings
Dense Vector Metrics
Euclidean Distance: Measures spatial distance between vectors.
Cosine Similarity: Measures orientation difference (normalized dot product).
Inner Product: Measures projection of one vector onto another.
Sparse/Binary Metrics
Jaccard Similarity: Measures overlap between the sets of non-zero positions.
Hamming Distance: Counts the positions at which two binary vectors differ.
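All of these metrics are a few lines of NumPy; the sketch below computes each one for a pair of dense vectors and a pair of binary vectors:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 3.0])

euclidean = np.linalg.norm(a - b)                         # spatial distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # orientation only
inner = a @ b                                             # unnormalized projection

# Binary vectors, e.g. sparse term-presence features.
x, y = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])
hamming = np.sum(x != y)                 # positions that differ
jaccard = np.sum(x & y) / np.sum(x | y)  # overlap of non-zero positions
```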
Step 3: Metadata
What Is Metadata?
Definition: Metadata is all the information stored alongside vector embeddings that is not the embedding itself.
Purpose: It provides context, supports filtering, and enables richer retrieval beyond raw similarity scores.
Examples: Original text, author, timestamp, section headers, document titles, or any other descriptive attributes.
Types of Metadata
Metadata can be broadly divided into two categories:
1. Chunking Metadata
Origin: Produced during the chunking process.
Examples: Sentence number, section header, subtitle.
Usage:
Provides context about where a chunk comes from.
Enables filtering by document structure (e.g., retrieving only from a specific section).
2. Non-Chunking Metadata
Origin: Independent of chunking.
Examples: Author, last updated date, document title.
Usage:
Supports filtering by external attributes (e.g., only documents updated in the last month).
Useful for personalization and domain-specific queries.
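The sketch below puts both categories on a single record and applies a simple pre-filter on a non-chunking attribute; the field names are illustrative:

```python
from datetime import datetime, timedelta

record = {
    "vector": [0.12, -0.48, 0.33],  # the embedding itself
    "metadata": {
        # Chunking metadata: produced during splitting.
        "section_header": "Results",
        "sentence_number": 42,
        # Non-chunking metadata: independent of the split.
        "author": "J. Doe",
        "last_updated": datetime(2024, 5, 1),
        "title": "Quarterly Report",
    },
}

def updated_recently(records: list[dict], days: int = 30) -> list[dict]:
    """Pre-filter on a non-chunking attribute before similarity search."""
    cutoff = datetime.now() - timedelta(days=days)
    return [r for r in records if r["metadata"]["last_updated"] >= cutoff]
```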
Why Metadata Matters in RAG
Basic RAG: Metadata ensures the original text is available for retrieval and generation.
Advanced RAG: Metadata enables fine-grained filtering, contextual relevance, and domain-specific constraints.
Semantic Coherence: Metadata helps align retrieved chunks with the broader context of the source document.
Storing Metadata
There are two main approaches:
Linked Storage
Metadata lives in a separate store (e.g., a relational database), joined to each vector by ID.
Advantage: Separation of concerns, useful for complex relational queries.
Direct Storage in Vector Database
Metadata is stored in the same record as the embedding and returned with every query hit.
Advantage: Faster retrieval and simpler integration for RAG applications.
Most popular approach in modern vector databases (e.g., Pinecone, Weaviate, Milvus).
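A minimal in-memory sketch of the direct-storage pattern; real clients such as Pinecone, Weaviate, or Milvus expose analogous upsert/query calls, but the class and method names here are illustrative:

```python
import numpy as np

class TinyVectorStore:
    """Direct storage: each entry keeps embedding and metadata together,
    so every query hit comes back with its metadata attached."""

    def __init__(self):
        self.entries = {}

    def upsert(self, doc_id, vector, metadata):
        self.entries[doc_id] = {"vector": np.asarray(vector), "metadata": metadata}

    def query(self, vector, top_k=3):
        q = np.asarray(vector)
        def score(v):
            return (q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
        ranked = sorted(self.entries.items(),
                        key=lambda kv: score(kv[1]["vector"]), reverse=True)
        return ranked[:top_k]
```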
Unified Workflow
Raw Data Input → Large documents, transcripts, or datasets.
Chunking → Split into consumable, coherent, contextual units.
Embedding Generation → Convert chunks into dense vectors using appropriate embedding models.
Metadata Attachment → Store original text and contextual attributes alongside embeddings.
Vector Database Storage → Embeddings and metadata indexed for retrieval.
RAG Retrieval → Query retrieves top K embeddings and metadata.
LLM Generation → LLM uses retrieved text and metadata to produce grounded responses.
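Putting the steps together, a minimal end-to-end sketch that reuses the hypothetical chunk_text, model, and TinyVectorStore from the earlier sketches; the file name and query are illustrative:

```python
# 1-2. Raw data in, then chunking (chunk_text from the earlier sketch).
raw = open("report.txt").read()
chunks = chunk_text(raw, chunk_size=500, overlap=50)

# 3-5. Embed each chunk, attach metadata, store both together
# (model and TinyVectorStore from the earlier sketches).
store = TinyVectorStore()
for i, chunk in enumerate(chunks):
    vector = model.encode([chunk])[0]
    store.upsert(f"chunk-{i}", vector, {"text": chunk, "position": i,
                                        "source": "report.txt"})

# 6. Retrieval: top K nearest chunks for a user query.
query_vec = model.encode(["What were the key findings?"])[0]
hits = store.query(query_vec, top_k=3)

# 7. Generation: pass retrieved text to the LLM as grounding context.
context = "\n\n".join(hit[1]["metadata"]["text"] for hit in hits)
```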
Visual Summary (Conceptual Flow)
Raw Data → Chunking → Embeddings → Metadata → Vector Database → Retrieval → LLM Generation
Summary
Chunking, embeddings, and metadata form the three pillars of RAG preprocessing. Chunking ensures data is consumable, embeddings make it comparable, and metadata makes it contextual and filterable. Together, they transform raw unstructured data into a structured, retrievable knowledge base that powers effective RAG applications.