RAG Architecture Explained
Learning Objectives
By the end of this session, you will be able to:
Describe the complete architecture of a RAG system
Identify the major components of a RAG pipeline
Explain how documents become searchable
Explain how retrieval works internally
Describe the role of embeddings and vector databases
Trace the complete journey from user query to AI response
Apply this foundation when implementing RAG applications
Introduction
In the previous sessions, we learned:
What RAG is
Why RAG exists
Why LLMs hallucinate
How RAG helps overcome knowledge limitations
Now it is time to understand what happens behind the scenes.
When a user asks:
What is our company's leave policy?
The answer does not magically appear.
A RAG system performs multiple steps:
Search relevant documents
Retrieve useful information
Provide context to the LLM
Generate the final answer
This process involves several specialized components working together.
Understanding the architecture is important because most production RAG systems follow a similar design pattern.
Whether you use:
OpenAI
Gemini
Claude
Llama
or any other LLM, the overall RAG architecture remains largely the same.
Why This Topic Matters
Imagine building a university assistant.
Students ask questions such as:
What is the examination schedule?
The system must:
Locate the correct document
Find the relevant section
Provide context
Generate a natural language response
If any step fails:
Wrong documents may be retrieved
Incorrect answers may be generated
User trust may decrease
Understanding the architecture helps developers build reliable AI systems.
High-Level RAG Architecture
A simplified RAG architecture looks like:
DOCUMENTS
↓
Document Processing
↓
Embeddings
↓
Vector Database
---------------------------------
User Question
↓
Query Embedding
↓
Similarity Search (against the Vector Database)
↓
Retrieved Context
↓
LLM
↓
Final Answer
At first glance this may seem complex.
However, it can be broken into manageable stages.
Two Major Parts of a RAG System
Most RAG systems contain two phases:
Phase 1: Indexing
Preparing documents for retrieval.
Phase 2: Retrieval and Generation
Using those documents to answer questions.
Think of it as:
Knowledge Preparation
+
Question Answering
Let's explore each phase.
Phase 1: Indexing Pipeline
Before users can ask questions, documents must be prepared.
Example documents:
Employee handbook
Product manuals
Research papers
University policies
Internal documentation
The system processes these documents and stores them in a searchable format.
Workflow:
Documents
↓
Processing
↓
Chunking
↓
Embeddings
↓
Vector Database
This preparation step usually runs when documents are first uploaded and is repeated whenever documents are added or changed.
Step 1: Document Collection
Every RAG system starts with knowledge sources.
Examples:
PDFs
Employee Handbook.pdf
Word Documents
HR Policies.docx
Websites
Company Knowledge Portal
Databases
Product Information Database
These sources become the knowledge base of the system.
Step 2: Document Processing
Raw files such as PDFs cannot be embedded directly, because embedding models operate on text.
The system first extracts the text.
Example:
PDF:
Employee Leave Policy
After processing:
Plain Text Content
This extracted text is the input to the chunking and embedding steps.
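Text extraction depends on the file format. As a minimal sketch, assuming the source is an HTML page from a company knowledge portal (PDF and Word files would need a dedicated parsing library, which is not shown here), the standard-library `html.parser` module can pull out the visible text. The names `TextExtractor` and `extract_text` and the sample page are illustrative only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def extract_text(html: str) -> str:
    """Return the plain text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page content, used only for illustration.
page = (
    "<html><body><h1>Employee Leave Policy</h1>"
    "<p>Employees receive 24 annual leave days.</p></body></html>"
)
print(extract_text(page))
```

Real pipelines usually also normalize whitespace and remove navigation menus, headers, and footers before chunking.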
Step 3: Chunking
Large documents are split into smaller sections called chunks.
Example:
Original document:
100-page Employee Handbook
Chunked into:
Chunk 1
Chunk 2
Chunk 3
...
Chunk N
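Why?
Because embedding models accept only a limited amount of text, and a single vector for a long passage blurs several topics together.
Smaller chunks let retrieval return focused passages.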
Chunking is one of the most important aspects of RAG engineering.
We will study chunking in detail later.
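Until then, one simple strategy can be sketched as follows: split the text into fixed-size word windows that overlap slightly, so a sentence cut at a boundary still appears whole in at least one chunk. The function name `chunk_words`, the overlap idea, and the default sizes are assumptions made for illustration, not values recommended by this course.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "a b c d e f g h i j"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Production systems often split on sentence, paragraph, or heading boundaries instead of raw word counts.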
Step 4: Creating Embeddings
Each chunk is converted into an embedding.
Remember from Session 5:
An embedding is a numerical representation of meaning.
Example:
Leave Policy
Becomes:
[0.12, 0.34, 0.91, ...]
The exact numbers are not important.
What matters is that similar meanings create similar vectors.
Embedding Workflow
Text Chunk
↓
Embedding Model
↓
Vector Representation
Every chunk receives its own vector.
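The idea that similar meanings produce similar vectors is commonly measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins; they are invented for illustration and do not come from any real embedding model, whose vectors typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors: the first two point in nearly the same direction.
vectors = {
    "annual leave": [0.90, 0.10, 0.05],
    "vacation": [0.85, 0.15, 0.10],
    "password reset": [0.05, 0.20, 0.95],
}

related = cosine_similarity(vectors["annual leave"], vectors["vacation"])
unrelated = cosine_similarity(vectors["annual leave"], vectors["password reset"])
print(f"annual leave vs vacation:       {related:.3f}")
print(f"annual leave vs password reset: {unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores much lower.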
Step 5: Storing in a Vector Database
After embeddings are created, they are stored in a vector database.
Example:
Chunk A → Vector A
Chunk B → Vector B
Chunk C → Vector C
Stored inside:
Vector Database
This completes the indexing phase.
The knowledge base is now searchable.
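Conceptually, a vector database stores each chunk's text together with its vector. The following is a minimal in-memory sketch of that idea; `Record` and `InMemoryVectorStore` are hypothetical names, and real vector databases add persistence, metadata filtering, and approximate nearest-neighbor indexes that are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    chunk_id: str
    text: str
    vector: list[float]


@dataclass
class InMemoryVectorStore:
    """Toy store: keeps (id, text, vector) records in a Python list."""

    dimension: int
    records: list[Record] = field(default_factory=list)

    def add(self, chunk_id: str, text: str, vector: list[float]) -> None:
        # All vectors must have the same length to be comparable later.
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )
        self.records.append(Record(chunk_id, text, list(vector)))

    def __len__(self) -> int:
        return len(self.records)


store = InMemoryVectorStore(dimension=3)
store.add("chunk-a", "Annual Leave Policy", [0.9, 0.1, 0.0])   # made-up vector
store.add("chunk-b", "IT Security Rules", [0.0, 0.1, 0.9])     # made-up vector
print(len(store), "chunks indexed")
```

Keeping the original text next to the vector matters because the retriever must return readable text to the LLM, not the numbers.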
Phase 2: Retrieval and Generation
Now the system is ready to answer questions.
Workflow:
User Question
↓
Embedding
↓
Similarity Search
↓
Relevant Chunks
↓
LLM
↓
Answer
This is where the retrieval process begins.
Step 6: User Submits a Question
Example:
How many annual leave days are employees entitled to?
The system receives the question.
Step 7: Query Embedding
The question is converted into an embedding.
Example:
User Question
↓
Embedding Model
↓
Query Vector
Now both:
Documents
User query
exist in the same vector space.
This enables similarity matching.
Step 8: Similarity Search
The vector database compares:
Query Vector
against
Stored Document Vectors
Goal:
Find Most Similar Chunks
Example:
Question:
Leave Policy
Retrieved chunks:
Annual Leave Policy
Vacation Policy
Employee Benefits
The chunks with the highest similarity scores, often called the top-k results, are selected.
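A brute-force version of this search can be sketched as: score every stored vector against the query vector, sort by score, and keep the best k. The vectors below are invented for illustration, and `top_k` is a hypothetical helper; production vector databases usually replace the full scan with an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# (text, vector) pairs with made-up vectors.
records = [
    ("Annual Leave Policy", [0.9, 0.1, 0.0]),
    ("Vacation Policy", [0.8, 0.2, 0.1]),
    ("IT Security Rules", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.1, 0.0]  # pretend embedding of "Leave Policy"

for score, text in top_k(query_vector, records, k=2):
    print(f"{score:.3f}  {text}")
```

The value of k is a tuning choice: too small risks missing information, too large adds noise to the context.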
Why Similarity Search Is Powerful
Traditional search uses keywords.
Example:
Search:
Vacation
May miss:
Annual Leave
RAG uses semantic meaning.
The system understands:
Vacation ≈ Annual Leave
This is a major advantage.
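The keyword limitation is easy to reproduce. In this small sketch, a naive substring search for "vacation" finds only the document that contains that exact word and misses the one that talks about annual leave. The documents and the `keyword_search` helper are invented for illustration.

```python
def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return documents that contain the query as a literal substring."""
    needle = query.lower()
    return [doc for doc in documents if needle in doc.lower()]


documents = [
    "Annual Leave: employees receive 24 days per year.",
    "Vacation requests must be approved by a manager.",
]

# Only the second document matches, even though the first is just as relevant.
print(keyword_search("vacation", documents))
```

An embedding-based search would be expected to rank both documents highly, because their vectors lie close together.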
Step 9: Context Construction
Retrieved chunks are combined.
Example:
Chunk A
+
Chunk B
+
Chunk C
This becomes the context provided to the LLM.
Example:
Context:
Employees receive 24 annual leave days.
Question:
How many annual leave days do employees receive?
Now the model has supporting information.
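In code, context construction is mostly string assembly. The sketch below numbers the retrieved chunks and places them before the question. The instruction wording and the `build_prompt` name are assumptions; prompt templates vary between systems.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt."""
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How many annual leave days do employees receive?",
    ["Employees receive 24 annual leave days."],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports its answer.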
Step 10: Generation
The LLM receives:
User question
Retrieved context
Workflow:
Question
+
Context
↓
LLM
↓
Answer
Generated response:
Employees receive 24 annual leave days according to the company policy.
The answer is grounded in retrieved information.
Complete End-to-End Workflow
Documents
↓
Chunking
↓
Embeddings
↓
Vector Database

User Question
↓
Query Embedding
↓
Similarity Search
↓
Relevant Chunks
↓
LLM
↓
Answer
This is the core architecture behind modern RAG systems.
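The whole flow can be wired together in a few lines. This is a toy sketch under strong assumptions: `embed` is a bag-of-words counter standing in for a real embedding model (so it matches shared words, not meaning), and `echo_llm` is a placeholder that simply returns the first context line instead of calling a real LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts, not semantics."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Indexing phase: embed every chunk once and keep text with vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, index, k=1):
    """Retrieval: embed the question the same way, rank chunks by similarity."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def answer(question, index, llm, k=1):
    """Generation: pass the retrieved context and the question to the LLM."""
    context = "\n".join(retrieve(question, index, k))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Placeholder for a real model call: returns the first context line."""
    return prompt.splitlines()[1]


index = build_index([
    "Employees receive 24 annual leave days.",
    "Travel expenses must be submitted within 30 days.",
    "Passwords must be changed every 90 days.",
])
print(answer("How many annual leave days are employees entitled to?", index, echo_llm))
```

Passing the LLM in as a function keeps the pipeline independent of any particular provider, which mirrors the point that the architecture stays the same across models.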
Real-World Example
Consider a university assistant.
Question:
What is the MCA admission deadline?
Workflow:
Admission Guide
↓
Chunked
↓
Embedded
↓
Stored

Student Question
↓
Search
↓
Retrieve Admission Deadline
↓
Generate Answer
The student receives an accurate response based on official documents.
Components of a Production RAG System
A production-ready system usually includes:
Document Loader
Loads files.
Chunking Engine
Splits documents.
Embedding Model
Generates vectors.
Vector Database
Stores embeddings.
Retriever
Finds relevant information.
LLM
Generates responses.
Monitoring Layer
Tracks performance.
Architecture:
Documents
↓
Loader
↓
Chunker
↓
Embeddings
↓
Vector DB
↓
Retriever
↓
LLM
↓
Response
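The monitoring layer listed among the components can start very small. As a sketch, assuming latency per stage is the metric of interest, a context manager can time each stage and report averages. `StageTimer` is a hypothetical helper; production systems typically export such measurements to a dedicated metrics or tracing service and also track retrieval quality.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Records how long each named pipeline stage takes."""

    def __init__(self):
        self.durations = defaultdict(list)

    @contextmanager
    def track(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised an error.
            self.durations[stage].append(time.perf_counter() - start)

    def summary(self) -> dict[str, float]:
        """Average duration in seconds for each stage."""
        return {
            stage: sum(values) / len(values)
            for stage, values in self.durations.items()
        }


timer = StageTimer()
with timer.track("retrieval"):
    sum(range(100_000))      # stand-in for a similarity search
with timer.track("generation"):
    sum(range(200_000))      # stand-in for an LLM call

for stage, seconds in timer.summary().items():
    print(f"{stage}: {seconds * 1000:.2f} ms")
```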
Why Each Component Matters
Document Loader
Without data:
No Knowledge Base
Chunking
Poor chunking reduces retrieval quality.
Embeddings
Poor embeddings reduce semantic understanding.
Vector Database
Poor indexing slows retrieval.
Retriever
Wrong chunks produce poor answers.
LLM
Weak generation affects user experience.
Every component contributes to system quality.
Common RAG Architecture Mistakes
Large Chunks
Too much information reduces precision.
Tiny Chunks
Important context may be lost.
Poor Embedding Model
Weak semantic matching.
Retrieving Too Much Context
Creates noise.
Retrieving Too Little Context
Important information may be missing.
These challenges become important as systems scale.
Enterprise Example
Company documents:
HR Policies
Travel Policies
Benefits Guide
IT Security Rules
Employee asks:
Can I work remotely from another country?
RAG process:
Search Documents
↓
Retrieve Remote Work Policy
↓
Provide Context
↓
Generate Answer
The response becomes evidence-based rather than speculative.
How RAG Improves Trust
Traditional Chatbot:
Question
↓
Guess
RAG Assistant:
Question
↓
Evidence
↓
Answer
This evidence-based approach increases trust and reliability.
.NET Perspective
Popular RAG technologies in .NET include:
Semantic Kernel
Azure AI Search
Azure OpenAI
ASP.NET Core
Typical enterprise architecture:
Documents
↓
Azure AI Search
↓
OpenAI
↓
Answer
Many enterprise copilots follow this pattern.
Python Perspective
Popular Python tools include:
LangChain
LlamaIndex
ChromaDB
Pinecone
Weaviate
OpenAI SDK
Python is widely used for RAG experimentation and development.
Assignment
Architecture Exercise
Design a RAG system for:
University Knowledge Assistant
Include:
Knowledge sources
Chunking process
Embedding generation
Retrieval process
LLM response generation
Research Activity
Compare three vector databases and identify:
Features
Strengths
Limitations
Ideal use cases
Key Takeaways
RAG consists of two phases: indexing, and retrieval and generation.
Documents must be processed, chunked, embedded, and stored before retrieval.
User questions are converted into embeddings for similarity search.
Vector databases enable semantic retrieval.
Retrieved context is supplied to the LLM before answer generation.
Every component in the pipeline affects overall response quality.
Understanding RAG architecture is essential before building production systems.
What's Next?
In Session 17, we will explore:
Data Ingestion Pipeline
You will learn how documents enter a RAG system, how files are processed, cleaned, transformed, validated, and prepared for embedding and retrieval.