Large Language Models (LLMs) like Gemma, GPT, LLaMA, or Mistral have changed the way we interact with text. They can transform, summarize, and generate natural language with outstanding fluency. But when you hear about Retrieval-Augmented Generation (RAG), one question naturally pops up:
“If LLMs already transform and summarize text, why do we need to transform documents into embeddings again?”
The Core Problem
LLMs can't read unlimited text. Each model has a context window (say 4k, 16k, or 32k tokens), which means you can only feed it a limited chunk of text in one go.
But what if you want your model to answer questions from a much larger body of text, say hundreds of PDFs, an internal wiki, or an entire knowledge base?
You can’t dump all of that into the model directly.
This is where RAG comes into the picture.
What is RAG?
RAG is a simple but powerful idea:
Store knowledge in a vector database
Break your documents into smaller chunks (e.g., paragraphs).
Convert each chunk into an embedding (a numerical vector that represents meaning).
Save them in a vector database like Chroma, FAISS, Pinecone, or Weaviate.
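Here is a minimal sketch of this storage step, assuming the sentence-transformers and chromadb packages are installed; the sample document, the collection name, and the paragraph-based chunking are placeholders for your own data and splitting logic.

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Embedding model (384-dimensional vectors) and an in-memory vector store.
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection(name="docs")

# Placeholder document; in practice this comes from your PDFs, wiki pages, etc.
document = (
    "Refunds are processed within 14 days of the return being received.\n\n"
    "Support is available on weekdays between 9:00 and 17:00.\n\n"
    "Enterprise customers get a dedicated account manager."
)

# Naive chunking: split on blank lines into paragraph-sized chunks.
chunks = [p.strip() for p in document.split("\n\n") if p.strip()]

# Convert each chunk into an embedding and save it with an ID.
embeddings = model.encode(chunks).tolist()
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
```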
Retrieve only the relevant parts
When a user asks a question, you embed the query.
Compare it against your document embeddings.
Fetch only the most relevant chunks.
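Continuing the sketch above (it reuses the model and collection from the storage step), retrieval is just an embedding of the question followed by a nearest-neighbour query:

```python
# Example user question; embed it with the same model used for the chunks.
question = "How long do refunds take?"
query_embedding = model.encode([question]).tolist()

# Compare it against the stored chunk embeddings and fetch the top matches.
results = collection.query(query_embeddings=query_embedding, n_results=2)
relevant_chunks = results["documents"][0]
print(relevant_chunks)
```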
Generate an answer grounded in those chunks
You send the question + the retrieved chunks into the LLM.
The LLM then summarizes, reasons, and generates a response.
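And a sketch of the final step, again reusing the variables above. The prompt format is just one reasonable choice, and the OpenAI client and model name are only examples; any LLM you actually call (a local LLaMA or Mistral, for instance) slots in the same way.

```python
from openai import OpenAI

# Ground the model in the retrieved chunks by putting them in the prompt.
context = "\n\n".join(relevant_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)

# Example call; swap in your own LLM client if you are not using OpenAI.
llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```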
Another question that often comes up is:
"For embeddings, do I need to perform explicit tokenization, or does the model handle it automatically?"
When working with LLMs and embeddings:
Embedding Generation
Most modern embedding APIs (like OpenAI, Hugging Face, or SentenceTransformers) expect raw text input.
The embedding model tokenizes the text internally, so you don't need to tokenize it yourself.
For example, if you use SentenceTransformer('all-MiniLM-L6-v2') and pass "Hello world", it handles tokenization, padding, and truncation internally before generating the embedding vector.
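As a quick illustration (assuming sentence-transformers is installed):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Raw text in, embedding out; tokenization, padding, and truncation happen internally.
vector = model.encode("Hello world")
print(vector.shape)  # (384,) for this model
```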
Explicit Tokenization
You only need to tokenize manually if:
You want to count tokens, for example to keep chunks within a context window or an embedding model's sequence limit.
You are doing custom preprocessing, training, or fine-tuning.
You are calling a lower-level model API that expects token IDs instead of raw text.
Otherwise, for embeddings or standard LLM usage, the model handles it automatically.
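For instance, a common reason to tokenize explicitly is to count tokens so a chunk stays within a model's sequence limit. Here is a hedged sketch using the Hugging Face tokenizer for the same MiniLM model; the 256-token budget is an example that reflects that model's default truncation length.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

chunk = "Some long paragraph pulled from one of your documents..."
num_tokens = len(tokenizer.encode(chunk))  # token IDs, including special tokens

# Example budget: all-MiniLM-L6-v2 truncates input beyond roughly 256 tokens.
if num_tokens > 256:
    print("Chunk is too long; split it further before embedding.")
else:
    print(f"Chunk fits: {num_tokens} tokens.")
```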
RAG / Vector Database Use Case
When you store embeddings in a vector database (like Chroma, FAISS, or Pinecone), you just pass the raw text to the embedding function.
Tokenization is implicit, so you don’t need an extra step unless your project has special requirements.
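A small sketch with Chroma (assuming chromadb is installed): when you pass documents and query text as raw strings and supply no embeddings of your own, Chroma applies its default Sentence Transformers embedding function for you, tokenization included.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="notes")

# Raw text only; Chroma's default embedding function handles embedding (and tokenization).
collection.add(
    ids=["n1", "n2"],
    documents=[
        "Refunds are processed within 14 days.",
        "Support is available on weekdays from 9 to 5.",
    ],
)

# Query with raw text as well; the most similar stored document comes back.
hits = collection.query(query_texts=["How long do refunds take?"], n_results=1)
print(hits["documents"][0][0])
```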
Analogy: The Student and the Library
Think of an LLM as a very smart student.
The student can read and summarize a book (LLM internal transformation).
But if you give them a whole library, they’ll waste time flipping through every shelf.
Now imagine a librarian with an index system (RAG embeddings + vector DB): the librarian pulls exactly the right pages off the shelf, and the student only has to read and summarize those.
Without the librarian, the student either guesses from memory or reads everything slowly.