RAG Architecture Explained

Learning Objectives

By the end of this session, you will be able to:

  • Describe the complete architecture of a RAG system

  • Identify the major components of a RAG pipeline

  • Explain how documents become searchable

  • Explain how retrieval works internally

  • Describe the role of embeddings and vector databases

  • Trace the complete journey from user query to AI response

  • Apply this foundation when implementing RAG applications

Introduction

In the previous sessions, we learned:

  • What RAG is

  • Why RAG exists

  • Why LLMs hallucinate

  • How RAG helps overcome knowledge limitations

Now it is time to understand what happens behind the scenes.

When a user asks:

What is our company's leave policy?

The answer does not magically appear.

A RAG system performs multiple steps:

  1. Search relevant documents

  2. Retrieve useful information

  3. Provide context to the LLM

  4. Generate the final answer

This process involves several specialized components working together.

Understanding the architecture is important because most production RAG systems follow a similar design pattern.

Whether you use:

  • OpenAI

  • Gemini

  • Claude

  • Llama

or any other LLM, the overall RAG architecture remains largely the same.

Why This Topic Matters

Imagine building a university assistant.

Students ask questions such as:

What is the examination schedule?

The system must:

  • Locate the correct document

  • Find the relevant section

  • Provide context

  • Generate a natural language response

If any step fails:

  • Wrong documents may be retrieved

  • Incorrect answers may be generated

  • User trust may decrease

Understanding the architecture helps developers build reliable AI systems.

High-Level RAG Architecture

A simplified RAG architecture looks like:

                DOCUMENTS
                     │
                     ↓
            Document Processing
                     │
                     ↓
               Embeddings
                     │
                     ↓
             Vector Database
                     │
                     ↓
---------------------------------
                     ↑
                     │
              User Question
                     │
                     ↓
              Query Embedding
                     │
                     ↓
             Similarity Search
                     │
                     ↓
            Retrieved Context
                     │
                     ↓
                   LLM
                     │
                     ↓
              Final Answer

At first glance this may seem complex.

However, it can be broken into manageable stages.

Two Major Parts of a RAG System

Most RAG systems contain two phases:

Phase 1: Indexing

Preparing documents for retrieval.

Phase 2: Retrieval and Generation

Using those documents to answer questions.

Think of it as:

Knowledge Preparation
          +
Question Answering

Let's explore each phase.

Phase 1: Indexing Pipeline

Before users can ask questions, documents must be prepared.

Example documents:

  • Employee handbook

  • Product manuals

  • Research papers

  • University policies

  • Internal documentation

The system processes these documents and stores them in a searchable format.

Workflow:

Documents
    ↓
Processing
    ↓
Chunking
    ↓
Embeddings
    ↓
Vector Database

This preparation step usually runs when documents are added or updated, not each time a user asks a question.

Step 1: Document Collection

Every RAG system starts with knowledge sources.

Examples:

PDFs

Employee Handbook.pdf

Word Documents

HR Policies.docx

Websites

Company Knowledge Portal

Databases

Product Information Database

These sources become the knowledge base of the system.

Step 2: Document Processing

Raw documents cannot be directly stored in a vector database.

The system first extracts text.

Example:

PDF:

Employee Leave Policy

After processing:

Plain Text Content

This extracted text is the input to the remaining indexing steps.
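As a rough illustration of what processing can involve once text has been pulled out of a file, the sketch below tidies up extracted text. The function name `clean_extracted_text` and the specific cleanup rules are assumptions made for this example. Real pipelines choose rules that suit their documents, and the extraction itself (for example, reading a PDF) is done by a separate library that is not shown here.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Normalize text extracted from a document (hypothetical helper)."""
    # Use a single newline style.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "Poli-\ncy" -> "Policy".
    # Limitation: this also joins genuine compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Employee   Leave Poli-\ncy\r\n\r\n\r\n\r\nEmployees receive 24 annual  leave days. "
print(clean_extracted_text(raw))
```

The cleaned string is what the chunking step would receive.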

Step 3: Chunking

Large documents are split into smaller sections called chunks.

Example:

Original document:

100-page Employee Handbook

Chunked into:

Chunk 1
Chunk 2
Chunk 3
...
Chunk N

Why?

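Because embedding models accept only a limited amount of text per input, and a small, focused chunk produces a vector that represents one topic instead of a blend of many. Smaller chunks also let the system pass only the relevant passages to the LLM.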

Chunking is one of the most important aspects of RAG engineering.

We will study chunking in detail later.
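Until then, here is a minimal sketch of one simple strategy: fixed-size chunks measured in words. The function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions for this example. Overlapping neighboring chunks is a common technique, but it is not something this session prescribes, and production systems often split on sentences, headings, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


handbook = " ".join(f"word{i}" for i in range(10))  # stand-in for a real document
for number, chunk in enumerate(chunk_text(handbook, chunk_size=4, overlap=1), start=1):
    print(f"Chunk {number}: {chunk}")
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.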

Step 4: Creating Embeddings

Each chunk is converted into an embedding.

Remember from Session 5:

An embedding is a numerical representation of meaning.

Example:

Leave Policy

Becomes:

[0.12, 0.34, 0.91, ...]

The exact numbers are not important.

What matters is that similar meanings create similar vectors.

Embedding Workflow

Text Chunk
     ↓
Embedding Model
     ↓
Vector Representation

Every chunk receives its own vector.
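The sketch below shows only the shape of this step: text goes in, a fixed-length list of numbers comes out. The function `toy_embed` is a hypothetical stand-in that hashes words into buckets. It captures word overlap, not meaning, so it is not a substitute for a trained embedding model. In a real system this function would call the embedding model you have chosen.

```python
import hashlib
import math


def toy_embed(text: str, dims: int = 16) -> list[float]:
    """Map text to a fixed-length unit vector by hashing its words (toy stand-in)."""
    vector = [0.0] * dims
    for word in text.lower().split():
        # hashlib gives the same bucket on every run, unlike the built-in hash().
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dims
        vector[index] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    # Scale to length 1 so vectors of long and short chunks are comparable.
    return [x / norm for x in vector] if norm else vector


chunks = [
    "Employees receive 24 annual leave days.",
    "Travel must be approved in advance.",
]
vectors = [toy_embed(chunk) for chunk in chunks]  # one vector per chunk
print(len(vectors), len(vectors[0]))
```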

Step 5: Storing in a Vector Database

After embeddings are created, they are stored in a vector database.

Example:

Chunk A → Vector A
Chunk B → Vector B
Chunk C → Vector C

Stored inside:

Vector Database

This completes the indexing phase.

The knowledge base is now searchable.
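To make the idea of storing chunks alongside their vectors concrete, here is a minimal in-memory sketch. The `Record` and `InMemoryVectorStore` names, the `metadata` field, and the three-number vectors are all invented for illustration. A real vector database adds persistence, indexing for fast search, and filtering, and its API will differ.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One stored chunk: its id, original text, vector, and optional metadata."""
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


class InMemoryVectorStore:
    """Toy store that keeps records in a dict keyed by chunk id."""

    def __init__(self, dims: int) -> None:
        self.dims = dims
        self._records: dict[str, Record] = {}

    def add(self, record: Record) -> None:
        # All vectors must come from the same model, so they share one length.
        if len(record.vector) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(record.vector)}")
        self._records[record.chunk_id] = record  # re-adding an id overwrites it

    def get(self, chunk_id: str) -> Record:
        return self._records[chunk_id]

    def __len__(self) -> int:
        return len(self._records)


store = InMemoryVectorStore(dims=3)
store.add(Record("chunk-a", "Annual Leave Policy", [0.12, 0.34, 0.91], {"source": "Employee Handbook.pdf"}))
store.add(Record("chunk-b", "Travel Policy", [0.80, 0.10, 0.05], {"source": "HR Policies.docx"}))
print(len(store), store.get("chunk-a").metadata["source"])
```

Keeping the original text and its source next to the vector matters, because the vector is only used for searching. The text is what is later handed to the LLM.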

Phase 2: Retrieval and Generation

Now the system is ready to answer questions.

Workflow:

User Question
      ↓
Embedding
      ↓
Similarity Search
      ↓
Relevant Chunks
      ↓
LLM
      ↓
Answer

This is where the retrieval process begins.

Step 6: User Submits a Question

Example:

How many annual leave days are employees entitled to?

The system receives the question.

Step 7: Query Embedding

The question is converted into an embedding.

Example:

User Question
      ↓
Embedding Model
      ↓
Query Vector

Now both:

  • Documents

  • User query

exist in the same vector space.

This enables similarity matching.

Step 8: Similarity Search

The vector database compares:

Query Vector

against

Stored Document Vectors

Goal:

Find Most Similar Chunks

Example:

Question:

Leave Policy

Retrieved chunks:

Annual Leave Policy
Vacation Policy
Employee Benefits

The most relevant content is selected.
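One common way to measure "most similar" is cosine similarity, which compares the direction of two vectors. The sketch below scores every stored vector against the query and keeps the top results. The three-number vectors are hand-written for illustration and do not come from a real embedding model, and the names `cosine_similarity` and `top_k` are assumptions for this example. Vector databases often use approximate indexes instead of scanning every vector like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(name, cosine_similarity(query, vector)) for name, vector in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-written vectors, invented for illustration only.
stored_vectors = {
    "Annual Leave Policy": [0.9, 0.1, 0.0],
    "Vacation Policy": [0.8, 0.2, 0.1],
    "Employee Benefits": [0.5, 0.5, 0.2],
    "IT Security Rules": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.1, 0.0]  # stands in for the embedded question "Leave Policy"

for name, score in top_k(query_vector, stored_vectors, k=3):
    print(f"{score:.3f}  {name}")
```

With these made-up vectors, the three leave-related chunks rank above "IT Security Rules", which points in a different direction.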

Why Similarity Search Is Powerful

Traditional search uses keywords.

Example:

Search:

Vacation

May miss:

Annual Leave

RAG retrieval compares meaning rather than exact words.

The embeddings place related phrases close together:

Vacation ≈ Annual Leave

This is a major advantage.

Step 9: Context Construction

Retrieved chunks are combined.

Example:

Chunk A
+
Chunk B
+
Chunk C

This becomes the context provided to the LLM.

Example:

Context:
Employees receive 24 annual leave days.

Question:
How many annual leave days do employees receive?

Now the model has supporting information.
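A minimal sketch of this step is shown below. The function name `build_prompt`, the instruction wording, and the `max_chars` budget are assumptions for this example. Real systems usually budget in tokens, not characters, and tune the instructions for their model.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    selected, used = [], 0
    for chunk in chunks:  # chunks are assumed to arrive most relevant first
        if used + len(chunk) > max_chars:
            break  # stop before the context exceeds the size budget
        selected.append(chunk)
        used += len(chunk)
    # Number each chunk so the answer can point back to its evidence.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )


prompt = build_prompt(
    "How many annual leave days do employees receive?",
    ["Employees receive 24 annual leave days."],
)
print(prompt)
```

The size budget is one simple guard against retrieving too much context: lower-ranked chunks are dropped first.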

Step 10: Generation

The LLM receives:

  • User question

  • Retrieved context

Workflow:

Question
      +
Context
      ↓
LLM
      ↓
Answer

Generated response:

Employees receive 24 annual leave days according to the company policy.

The answer is grounded in retrieved information.
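The sketch below shows how retrieval and generation can be wired together without committing to a particular LLM provider. The names `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical. The two fake functions are stand-ins so the example runs on its own, and in a real system they would call your vector database and your LLM API. Returning a fixed message when nothing is retrieved is a design choice made for this example.

```python
from typing import Callable


def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context for the question, then ask the LLM to answer from it."""
    chunks = retrieve(question)
    if not chunks:
        # Without evidence, decline instead of letting the model guess.
        return "I could not find this in the available documents."
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion:\n{question}"
    return generate(prompt)


# Stand-ins so the sketch runs without a vector database or an LLM API.
def fake_retrieve(question: str) -> list[str]:
    return ["Employees receive 24 annual leave days."]


def fake_generate(prompt: str) -> str:
    return "Employees receive 24 annual leave days according to the company policy."


print(answer_question(
    "How many annual leave days are employees entitled to?",
    fake_retrieve,
    fake_generate,
))
```

Passing the retriever and the LLM in as functions keeps this step independent of the provider, which matches the earlier point that the architecture stays largely the same across LLMs.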

Complete End-to-End Workflow

Documents
     ↓
Chunking
     ↓
Embeddings
     ↓
Vector Database

User Question
     ↓
Query Embedding
     ↓
Similarity Search
     ↓
Relevant Chunks
     ↓
LLM
     ↓
Answer

This is the core architecture behind modern RAG systems.
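The whole workflow can be sketched in a few lines if every component is replaced by a toy version. In the example below, the "embedding" is just a count of words, the "vector database" is a Python list, chunking is one sentence per chunk, and the final LLM call is left out. All names and document contents are assumptions for illustration. The word-count embedding only matches shared words, so it does not capture meaning the way a real embedding model does.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Indexing phase: chunk (one sentence per chunk), embed, store.
documents = {
    "Employee Handbook": "Employees receive 24 annual leave days. Leave requests need manager approval.",
    "Travel Policy": "Travel must be booked through the approved portal.",
}
index = []
for source, text in documents.items():
    for chunk in re.split(r"(?<=\.)\s+", text):
        index.append((source, chunk, embed(chunk)))


# Retrieval and generation phase: embed the query, search, build the prompt.
def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[2]), reverse=True)
    return [(source, chunk) for source, chunk, _ in ranked[:k]]


question = "How many annual leave days do employees receive?"
hits = retrieve(question)
prompt = "Context:\n" + "\n".join(chunk for _, chunk in hits) + f"\n\nQuestion:\n{question}"
print(prompt)  # this string would be sent to the LLM of your choice
```

Note that the indexing loop runs once, while `retrieve` runs for every question. That split is the two-phase structure described earlier.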

Real-World Example

Consider a university assistant.

Question:

What is the MCA admission deadline?

Workflow:

Admission Guide
      ↓
Chunked
      ↓
Embedded
      ↓
Stored

Student Question
      ↓
Search
      ↓
Retrieve Admission Deadline
      ↓
Generate Answer

The student receives an accurate response based on official documents.

Components of a Production RAG System

A production-ready system usually includes:

Document Loader

Loads files.

Chunking Engine

Splits documents.

Embedding Model

Generates vectors.

Vector Database

Stores embeddings.

Retriever

Finds relevant information.

LLM

Generates responses.

Monitoring Layer

Tracks performance.

Architecture:

Documents
    ↓
Loader
    ↓
Chunker
    ↓
Embeddings
    ↓
Vector DB
    ↓
Retriever
    ↓
LLM
    ↓
Response

Why Each Component Matters

Document Loader

Without data:

No Knowledge Base

Chunking

Poor chunking reduces retrieval quality.

Embeddings

Poor embeddings reduce semantic understanding.

Vector Database

Poor indexing slows retrieval.

Retriever

Wrong chunks produce poor answers.

LLM

Weak generation affects user experience.

Every component contributes to system quality.

Common RAG Architecture Mistakes

Large Chunks

A chunk that covers many topics produces a diluted embedding, which reduces retrieval precision.

Tiny Chunks

Important context may be lost.

Poor Embedding Model

Weak semantic matching.

Retrieving Too Much Context

Irrelevant chunks add noise to the prompt and can distract the LLM.

Retrieving Too Little Context

Important information may be missing.

These challenges become important as systems scale.

Enterprise Example

Company documents:

HR Policies
Travel Policies
Benefits Guide
IT Security Rules

Employee asks:

Can I work remotely from another country?

RAG process:

Search Documents
      ↓
Retrieve Remote Work Policy
      ↓
Provide Context
      ↓
Generate Answer

The response becomes evidence-based rather than speculative.

How RAG Improves Trust

Traditional Chatbot:

Question
 ↓
Guess

RAG Assistant:

Question
 ↓
Evidence
 ↓
Answer

This evidence-based approach increases trust and reliability.

.NET Perspective

Popular RAG technologies in .NET include:

  • Semantic Kernel

  • Azure AI Search

  • Azure OpenAI

  • ASP.NET Core

Typical enterprise architecture:

Documents
 ↓
Azure AI Search
 ↓
OpenAI
 ↓
Answer

Many enterprise copilots follow this pattern.

Python Perspective

Popular Python tools include:

  • LangChain

  • LlamaIndex

  • ChromaDB

  • Pinecone

  • Weaviate

  • OpenAI SDK

Python is one of the most widely used environments for RAG experimentation and development.

Assignment

Architecture Exercise

Design a RAG system for:

University Knowledge Assistant

Include:

  • Knowledge sources

  • Chunking process

  • Embedding generation

  • Retrieval process

  • LLM response generation

Research Activity

Compare three vector databases and identify:

  • Features

  • Strengths

  • Limitations

  • Ideal use cases

Key Takeaways

  • RAG consists of two phases: indexing, and retrieval and generation.

  • Documents must be processed, chunked, embedded, and stored before retrieval.

  • User questions are converted into embeddings for similarity search.

  • Vector databases enable semantic retrieval.

  • Retrieved context is supplied to the LLM before answer generation.

  • Every component in the pipeline affects overall response quality.

  • Understanding RAG architecture is essential before building production systems.

What's Next?

In Session 17, we will explore:

Data Ingestion Pipeline

You will learn how documents enter a RAG system, how files are processed, cleaned, transformed, validated, and prepared for embedding and retrieval.