Working with ChromaDB
Learning Objectives
By the end of this session, you will be able to:
Explain what ChromaDB is
Explain why ChromaDB is popular for RAG development
Create and manage ChromaDB collections
Store documents and embeddings
Perform similarity searches
Use metadata filtering
Build a simple retrieval workflow using ChromaDB
Introduction
In the previous session, we learned about vector databases and why they are essential for modern RAG systems.
We explored how vector databases:
Store embeddings
Support semantic search
Enable similarity retrieval
Scale to millions of vectors
Now it is time to work with an actual vector database.
For beginners entering the world of RAG development, one of the most popular choices is:
ChromaDB
ChromaDB has become widely adopted because it is:
Open source
Easy to learn
Lightweight
Developer friendly
Well suited for local experimentation
Many developers build their first RAG application using ChromaDB before moving to larger production platforms.
Why This Topic Matters
Imagine building a university assistant.
You have:
Admission Policies
Scholarship Documents
Course Catalogs
Student Guidelines
After generating embeddings, you need somewhere to store them.
Without a vector database:
No Semantic Search
With ChromaDB:
Documents
↓
Embeddings
↓
ChromaDB
↓
Similarity Search
Now the assistant can retrieve relevant information efficiently.
What Is ChromaDB?
ChromaDB is an open-source vector database designed specifically for AI applications.
Its primary purpose is to:
Store embeddings
Search embeddings
Retrieve relevant information
Think of ChromaDB as:
Database
+
Semantic Search Engine
Traditional databases focus on:
Find Exact Match
ChromaDB focuses on:
Find Similar Meaning
This makes it ideal for RAG systems.
Why Developers Like ChromaDB
Several factors contribute to ChromaDB's popularity.
Easy Setup
Minimal configuration required.
Open Source
Free to use.
Python Friendly
Excellent Python integration.
Local Development
Runs directly on a developer's machine.
RAG Focused
Designed specifically for embedding storage and retrieval.
These advantages make it one of the best learning platforms for vector databases.
Where ChromaDB Fits in RAG
Recall the RAG architecture:
Documents
↓
Chunking
↓
Embeddings
↓
ChromaDB
↓
Retriever
↓
LLM
↓
Answer
ChromaDB acts as the retrieval layer.
Its responsibility is to find relevant chunks based on similarity.
Installing ChromaDB
In Python:
pip install chromadb
After installation, developers can immediately begin creating collections and storing embeddings.
One reason ChromaDB became popular is its simplicity.
Creating a ChromaDB Client
The first step is creating a client.
Example:
import chromadb

# Client() creates an in-memory client; its data is lost when the process exits
client = chromadb.Client()
The client acts as the connection point for interacting with the database.
Think of it as:
Application
↓
Client
↓
ChromaDB
Understanding Collections
In ChromaDB, data is organized into collections.
Think of a collection as a container.
Example:
Student Documents Collection
or
HR Policies Collection
or
Research Papers Collection
Collections help organize information logically.
Creating a Collection
Example:
collection = client.create_collection(
    name="university_documents"
)
Now we have a dedicated storage area for university-related content.
Collection Analogy
Traditional database:
Table
ChromaDB:
Collection
Both help organize information.
The difference is that collections are designed to work with embeddings.
Adding Documents
Suppose we have:
Scholarship applications close on June 30.
This document can be stored.
Example:
collection.add(
    documents=[
        "Scholarship applications close on June 30."
    ],
    ids=["doc1"]
)
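ChromaDB stores the document and, because no embeddings were supplied, generates one with the collection's embedding function (a default model unless you configure another).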
What Happens Internally?
When documents are added:
Document
↓
Embedding
↓
Storage
↓
Indexing
The document becomes searchable through semantic similarity.
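The four steps above can be imitated in plain Python. The sketch below is a teaching toy, not ChromaDB's implementation: toy_embed and ToyCollection are hypothetical names, and the "embedding" is only a word-count vector, so it captures shared words rather than meaning. A real collection uses a neural embedding model and a dedicated vector index.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyCollection:
    """Keeps documents with their vectors and ranks them by similarity."""

    def __init__(self):
        self._records = {}

    def add(self, documents, ids):
        for doc_id, document in zip(ids, documents):
            # Document -> Embedding -> Storage (a dict stands in for the index)
            self._records[doc_id] = (document, toy_embed(document))

    def query(self, query_text, n_results=1):
        query_vector = toy_embed(query_text)
        ranked = sorted(
            self._records.items(),
            key=lambda item: cosine_similarity(query_vector, item[1][1]),
            reverse=True,
        )
        return [(doc_id, document) for doc_id, (document, _) in ranked[:n_results]]


store = ToyCollection()
store.add(
    documents=[
        "Scholarship applications close on June 30.",
        "MCA admissions begin in July.",
        "Hostel registration starts next month.",
    ],
    ids=["doc1", "doc2", "doc3"],
)
top = store.query("When do scholarship applications end?", n_results=1)
print(top)
```

The method names mirror add and query only to make the parallel with a collection easy to see.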
Adding Multiple Documents
Example:
collection.add(
    documents=[
        "Scholarship applications close on June 30.",
        "MCA admissions begin in July.",
        "Hostel registration starts next month."
    ],
    ids=["doc1", "doc2", "doc3"]
)
Now the collection contains multiple knowledge sources.
Understanding IDs
Each document requires a unique identifier.
Example:
doc1
doc2
doc3
IDs allow:
Updates
Deletions
Tracking
IDs are strings and must be unique within a collection.
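One possible way to produce unique IDs without numbering documents by hand is to derive them from the text. The helper below, make_doc_id, is a hypothetical convention and not something ChromaDB requires; any unique string works as an ID.

```python
import hashlib


def make_doc_id(text, prefix="doc"):
    """Return an ID derived from the document text (hypothetical helper)."""
    # Collapse whitespace and lowercase so trivial variations map to one ID.
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:12]}"


documents = [
    "Scholarship applications close on June 30.",
    "MCA admissions begin in July.",
    "Hostel registration starts next month.",
]
ids = [make_doc_id(doc) for doc in documents]
print(ids)
```

Content-derived IDs suit deduplication, because the same text always yields the same ID. For records whose text will be edited in place, a fixed key such as policy1 is the better fit, since a content hash changes whenever the text does.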
Similarity Search
The most important feature of ChromaDB is semantic retrieval.
Example query:
When do scholarship applications end?
Search operation:
results = collection.query(
    query_texts=[
        "When do scholarship applications end?"
    ],
    n_results=3
)
ChromaDB searches for semantically similar content.
Retrieval Workflow
Question
↓
Embedding
↓
Similarity Search
↓
Top Matches
This workflow powers modern RAG applications.
Example Result
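Top-ranked result:
Scholarship applications close on June 30.
With n_results=3, ChromaDB returns up to three documents ordered from most to least similar. Even though the question says "end" and the document says "close", semantic similarity enables successful retrieval.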
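A query result is a dictionary of nested lists: one inner list per query text, under keys such as ids, documents, and distances. The sketch below flattens that shape for a single query. The function name flatten_results is hypothetical, and sample_results is a hand-written stand-in whose distance values are invented to show the structure; they are not real output.

```python
def flatten_results(results, query_index=0):
    """Pair each returned document with its ID and distance for one query."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": doc_id, "document": document, "distance": distance}
        for doc_id, document, distance in zip(ids, documents, distances)
    ]


# Hand-written stand-in with the same nesting as a query result.
# The distances are invented for illustration only.
sample_results = {
    "ids": [["doc1", "doc2", "doc3"]],
    "documents": [[
        "Scholarship applications close on June 30.",
        "MCA admissions begin in July.",
        "Hostel registration starts next month.",
    ]],
    "distances": [[0.4, 1.3, 1.5]],
}

matches = flatten_results(sample_results)
best = matches[0]  # results arrive sorted, closest match first
print(best["id"], best["document"])
```

A smaller distance means a closer match, so the first entry is the one to pass on to the LLM when only a single chunk is needed.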
Why Semantic Search Works
Question:
When do scholarship applications end?
Document:
Scholarship applications close on June 30.
Keyword matching:
Possible Match
Semantic search:
Strong Match
because:
End
≈
Close
The meaning is understood.
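"End ≈ Close" can be made concrete with cosine similarity, a common measure of how closely two embedding vectors point in the same direction. The three-dimensional vectors below are invented by hand purely for illustration; real embedding models produce vectors with hundreds of dimensions, and their values are learned rather than chosen.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors: "end" and "close" are placed near each other on purpose,
# "hostel" is placed far away. These are not output from any real model.
toy_vectors = {
    "end": [0.9, 0.1, 0.0],
    "close": [0.8, 0.2, 0.1],
    "hostel": [0.0, 0.1, 0.9],
}

end_vs_close = cosine_similarity(toy_vectors["end"], toy_vectors["close"])
end_vs_hostel = cosine_similarity(toy_vectors["end"], toy_vectors["hostel"])
print(end_vs_close > end_vs_hostel)
```

A keyword matcher sees "end" and "close" as unrelated strings, while vectors that sit close together score near 1 on this measure.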
Using Metadata
Metadata provides additional context.
Example:
collection.add(
    documents=[
        "Employees receive 24 annual leave days."
    ],
    metadatas=[
        {
            "department": "HR"
        }
    ],
    ids=["policy1"]
)
Metadata helps organize information.
Why Metadata Matters
Consider:
10,000 Documents
Some belong to:
HR
Finance
IT
Legal
Metadata allows filtering.
Example:
Search Only HR Documents
This improves retrieval precision, because documents from other departments are never returned.
Metadata Filtering Example
Search:
Leave Policy
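Filter, passed through the where argument of query:
results = collection.query(
    query_texts=["Leave Policy"],
    n_results=1,
    where={"department": "HR"}
)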
Now ChromaDB searches only HR content.
This is extremely useful in enterprise environments.
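Filters can also use operators such as $eq, $in, $gte, and $and. The helpers below are a hypothetical sketch for building such filters as plain dictionaries; the function names and the year metadata key are assumptions made for this example, not part of ChromaDB.

```python
def department_filter(departments):
    """Build a where filter that matches one or more departments."""
    departments = list(departments)
    if len(departments) == 1:
        return {"department": {"$eq": departments[0]}}
    return {"department": {"$in": departments}}


def combine_filters(*filters):
    """Require every given filter to match; returns None when there is none."""
    filters = [f for f in filters if f]
    if not filters:
        return None
    if len(filters) == 1:
        # $and expects at least two conditions, so a lone filter is returned as-is
        return filters[0]
    return {"$and": filters}


where_filter = combine_filters(
    department_filter(["HR", "Legal"]),
    {"year": {"$gte": 2024}},  # "year" is a hypothetical metadata key
)
print(where_filter)

# Intended use with an existing collection:
# collection.query(query_texts=["Leave Policy"], n_results=3, where=where_filter)
```

Building filters in one place keeps department names and operators consistent across queries, which matters once several teams share the same collection.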
Viewing Stored Documents
Developers can retrieve stored records.
Example:
collection.get()
This returns:
Documents
IDs
Metadata
Useful for debugging and validation. Embeddings are omitted unless requested with include=["embeddings"].
Updating Documents
Policies change.
Example:
Before:
Employees receive 20 leave days.
After:
Employees receive 24 leave days.
Documents can be updated without recreating the entire collection, by calling update (or upsert) with the existing ID.
This supports evolving knowledge bases.
Deleting Documents
Example:
collection.delete(
    ids=["policy1"]
)
This removes outdated content.
Organizations frequently use this capability when policies change.
Persistence
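A client created with chromadb.Client() keeps its data in memory, so everything is lost when the process exits.
For production usage, persistent storage is preferred, for example through chromadb.PersistentClient with a path on disk.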
Workflow:
Documents
↓
ChromaDB Storage
↓
Saved to Disk
This ensures data remains available after application restarts.
Building a Simple Retrieval Workflow
Step 1:
Store documents.
Admission Policy
Scholarship Policy
Hostel Rules
Step 2:
User asks:
How do I apply for financial aid?
Step 3:
ChromaDB retrieves:
Scholarship Policy
Step 4:
Send retrieved content to the LLM.
Step 5:
Generate answer.
This is the foundation of a RAG system.
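Step 4 usually means placing the retrieved text into the prompt. The sketch below shows one possible way to do that; build_prompt and its wording are hypothetical, and the call to the LLM itself is left out because it depends on the provider you use.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into one LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n"
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do I apply for financial aid?",
    ["Scholarship applications close on June 30."],
)
print(prompt)
```

Telling the model to rely only on the supplied context is a common way to keep answers tied to the retrieved documents rather than to the model's general knowledge.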
ChromaDB Architecture
Documents
↓
Chunking
↓
Embeddings
↓
ChromaDB Collection
↓
Similarity Search
↓
Relevant Chunks
↓
LLM
↓
Answer
Many beginner RAG projects follow this exact architecture.
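The chunking step in this architecture happens before anything reaches ChromaDB. The sketch below shows one simple strategy, fixed-size word windows with overlap; chunk_words and its default sizes are assumptions for illustration, and real projects often split on sentences or headings instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "a b c d e f g"
chunks = chunk_words(sample, chunk_size=4, overlap=2)
print(chunks)
```

Each chunk would then be added to the collection as its own document with its own ID. The overlap keeps a sentence that straddles a boundary from being cut off in both neighbouring chunks.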
ChromaDB Advantages
Simple Learning Curve
Easy for beginners.
Local Development
No cloud account required.
Open Source
Community-driven ecosystem.
Lightweight
Minimal infrastructure requirements.
Fast Prototyping
Ideal for learning RAG concepts.
ChromaDB Limitations
Large-Scale Enterprise Deployments
Specialized platforms may scale more efficiently.
Distributed Systems
Additional architecture may be required.
Massive Data Volumes
Some enterprise solutions provide more advanced scaling capabilities.
For learning and medium-sized projects, ChromaDB remains an excellent choice.
Real-World Example
University Knowledge Assistant:
Knowledge Base:
Admission Guide
Scholarship Rules
Hostel Guidelines
Academic Calendar
Workflow:
Student Question
↓
ChromaDB Search
↓
Relevant Content
↓
LLM
↓
Answer
This simple architecture can support thousands of student queries.
Enterprise Use Cases
Organizations commonly use ChromaDB for:
Internal Knowledge Assistants
Document Search
Research Assistants
Educational Applications
Customer Support Systems
It is often the starting point for many RAG initiatives.
.NET Perspective
Although ChromaDB is primarily Python-focused, it can also run as a server, so .NET applications can interact with it through its HTTP API or through a service layer.
Many .NET developers use:
ASP.NET Core
Semantic Kernel
Azure OpenAI
while connecting to ChromaDB-based retrieval services.
Python Perspective
Python is the primary ecosystem for ChromaDB.
Popular integrations include:
LangChain
LlamaIndex
OpenAI SDK
Hugging Face
FastAPI
These integrations simplify RAG development significantly.
Assignment
Practical Exercise
Create a ChromaDB collection containing:
Admission Policy
Scholarship Policy
Hostel Policy
Perform five semantic search queries and record the results.
Design Activity
Design a university knowledge assistant using:
ChromaDB
Embeddings
LLM
Create a high-level architecture diagram.
Key Takeaways
ChromaDB is an open-source vector database designed for AI applications.
Collections are used to organize documents and embeddings.
ChromaDB supports semantic similarity search.
Metadata filtering improves retrieval precision.
ChromaDB is widely used for learning and prototyping RAG systems.
It integrates well with popular AI frameworks.
Understanding ChromaDB provides practical experience with vector databases.
What's Next?
In Session 24, we will explore:
Working with Pinecone
You will learn how Pinecone differs from ChromaDB, how managed vector databases work, how to build scalable retrieval systems, and why Pinecone is widely used in production-grade RAG applications.