1. Introduction
A context window defines the maximum number of tokens that a Large Language Model (LLM) can process at one time during training or inference. It represents the model’s working memory — everything the model can see, attend to, and reason over in a single forward pass. The context window is a hard architectural constraint and plays a crucial role in:
Long-document understanding
Conversational continuity
Reasoning quality
Retrieval-Augmented Generation (RAG) design
Cost, latency, and scalability of LLM systems
Anything that falls outside the context window effectively does not exist for the model.
2. Tokens and Tokenization
LLMs do not process raw text directly. They operate on tokens, produced by a tokenizer.
A token can be:
A whole word
A sub-word
A symbol or punctuation
Example:
"unbelievable"- ["un", "believ", "able"]
Context window size is measured in tokens, not words or characters.
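The snippet below is a minimal sketch of counting tokens with the tiktoken library (an assumed choice; any tokenizer works), to show that the window is budgeted in tokens rather than words or characters.

```python
# Minimal sketch: counting tokens with the tiktoken library (assumed available).
# Exact sub-word splits vary from tokenizer to tokenizer and model to model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # a common byte-pair encoding

text = "unbelievable"
token_ids = enc.encode(text)                     # list of integer token ids
pieces = [enc.decode([t]) for t in token_ids]    # the sub-word strings

print(len(token_ids), pieces)                    # the window is budgeted in these units
```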
3. Formal Definition
Context Window
The maximum number of tokens that can participate in the model’s self-attention mechanism during a single forward pass.
If:
Context window size = N
Input tokens > N
Then:
Tokens beyond N are truncated
They do not influence output
The model has no awareness of them
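A minimal sketch of this truncation rule, assuming a keep-the-most-recent-tokens policy (common in chat-style deployments):

```python
# Sketch of the truncation rule: at most N tokens ever reach the model.
# Assumes a "keep the most recent tokens" policy, typical for chat-style usage.
def fit_to_context(token_ids: list[int], n: int) -> list[int]:
    if len(token_ids) <= n:
        return token_ids
    return token_ids[-n:]          # everything earlier is dropped and never seen

tokens = list(range(1, 13))        # 12 token ids
print(fit_to_context(tokens, 8))   # [5, 6, 7, 8, 9, 10, 11, 12]
```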
4. Where the Context Window Exists (Architectural View)
The context window is not a software feature or memory buffer. It exists inside the Transformer architecture, specifically because of self-attention computation limits.
Key Insight
The context window exists because every token must attend to every other token.
This requirement creates a fundamental scalability limit.
5. Step-by-Step Internal Working
Step 1: Text → Tokens
Input:
"AI models use context windows"
Tokenized:
[AI] [models] [use] [context] [windows]
Assume each word maps to one token, giving 5 tokens in total.
Step 2: Tokens → Embeddings
Each token is converted into a dense vector.
Embedding Matrix Shape: (tokens × embedding_dimension)
Example: 5 × 4096
This matrix cannot exceed:
context_window × embedding_dimension
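A NumPy sketch of this lookup with illustrative sizes (the vocabulary size and token ids below are hypothetical):

```python
# Sketch of the embedding lookup; sizes and token ids are illustrative only.
import numpy as np

vocab_size, d_model = 32_000, 4096                    # hypothetical vocabulary and width
embedding_table = np.random.randn(vocab_size, d_model).astype(np.float32)

token_ids = np.array([17, 901, 42, 7, 311])           # 5 hypothetical token ids
x = embedding_table[token_ids]                        # shape: (tokens, embedding_dimension)

print(x.shape)    # (5, 4096), bounded above by (context_window, embedding_dimension)
```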
Step 3: Self-Attention (Why Context Window Is Limited)
For each token, the transformer creates:
Query (Q)
Key (K)
Value (V)
For N tokens:
Q = N × d
K = N × d
V = N × d
Attention is computed as:
Attention = softmax(Q × Kᵀ / √d) × V
Critical Limitation
Q × Kᵀ → produces an N × N matrix
Example: every token attends to every other token, so 1,000 tokens require 1,000 × 1,000 = 1 million attention computations.
| Tokens (N) | Attention Computations |
|---|---|
| 1,000 | 1 million |
| 8,000 | 64 million |
| 128,000 | ~16 billion |
Quadratic growth (O(N²)) is the core reason context windows must be bounded.
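The NumPy sketch below makes the N × N score matrix concrete for a single attention head; the projection matrices are random placeholders, not trained weights.

```python
# Sketch: single-head scaled dot-product attention, exposing the N x N score matrix.
import numpy as np

def attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # each of shape (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (N, N), quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (N, d)

n, d = 1_000, 64                                      # illustrative sizes
x = np.random.randn(n, d)
w_q, w_k, w_v = (np.random.randn(d, d) for _ in range(3))
out = attention(x, w_q, w_k, w_v)
print(out.shape)    # (1000, 64); the intermediate score matrix was 1,000 x 1,000
```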
6. Decoder-Only Models and Causal Masking
Decoder-only LLMs (GPT, Zephyr, Mistral) generate text left-to-right.
They use causal masking, which ensures that each token attends only to itself and to earlier tokens:
Causal Masking Diagram
Tokens: T1 T2 T3 T4
T1 sees: T1
T2 sees: T1 T2
T3 sees: T1 T2 T3
T4 sees: T1 T2 T3 T4
Matrix view:
T1 T2 T3 T4
T1 ✔ ✖ ✖ ✖
T2 ✔ ✔ ✖ ✖
T3 ✔ ✔ ✔ ✖
T4 ✔ ✔ ✔ ✔
Causal masking applies only within the context window.
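A NumPy sketch of the mask above: positions above the diagonal are set to negative infinity before the softmax, so each token attends only to itself and earlier tokens.

```python
# Sketch: building and applying a causal mask for 4 tokens.
import numpy as np

n = 4
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal

scores = np.random.randn(n, n)                     # stand-in for Q @ K^T / sqrt(d)
scores[mask] = -np.inf                             # blocked positions contribute nothing
# After a row-wise softmax, row i has non-zero weight only on tokens 0..i.
print(~mask)       # matches the ✔ / ✖ matrix above
```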
7. Sliding Context Window During Generation
During text generation, the context window slides forward.
Assume:
[T5 T6 T7 T8 T9 T10 T11 T12]
↓
Predict T13
Next step:
[T6 T7 T8 T9 T10 T11 T12 T13]
↓
Predict T14
Older tokens permanently fall out of memory.
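Below is a schematic generation loop; `predict_next_token` is a hypothetical stand-in for the real model call. The essential detail is that the input is re-clipped to the last `n_ctx` tokens at every step.

```python
# Schematic generation loop; `predict_next_token` is a hypothetical model call.
# The key line is the re-clipping of the running sequence to the last n_ctx tokens.
def generate(prompt_tokens: list[int], predict_next_token, n_ctx: int, steps: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(steps):
        window = tokens[-n_ctx:]                   # only the most recent n_ctx tokens
        tokens.append(predict_next_token(window))  # older tokens cannot influence this
    return tokens

# Toy usage: a fake "model" that just returns max(window) + 1.
print(generate(list(range(1, 13)), lambda w: max(w) + 1, n_ctx=8, steps=2))
```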
8. What Happens When Input Exceeds the Context Window
Example:
Input tokens: [1][2][3][4][5][6][7][8][9][10][11][12]
Context window = 8
Model sees:
[5][6][7][8][9][10][11][12]
Tokens [1–4] are truncated: they do not influence the output, and the model has no awareness of them.
9. Encoder vs Decoder Context Windows
| Model Type | Example | Context Behavior |
|---|---|---|
| Encoder-only | BERT | Full bidirectional attention |
| Decoder-only | GPT | Left-to-right causal attention |
| Encoder-Decoder | T5 | Encoder & decoder have separate windows |
Encoder models see the entire input at once, while decoder models see a growing prefix.
10. Why LLMs "Forget" Earlier Conversations
LLMs have no persistent memory.
They operate strictly on:
Current Context Window → Self-Attention → Output
Once tokens fall outside the window, they are gone for good; the model cannot recall them in later turns.
11. Context Window in Retrieval-Augmented Generation (RAG)
In RAG systems, the context window must hold:
System Prompt + User Query + Retrieved Document Chunks <= Context Window
RAG Context Diagram
User Query
↓
Embedding Model
↓
Vector Database
↓
Top-K Chunks
↓
LLM Context Window
↓
Generated Answer
If retrieved chunks exceed the context window, they are truncated and the model answers from incomplete evidence. For this reason, chunk size, overlap, and Top-K are chosen based on the context window size.
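A sketch of this budget check, assuming a `count_tokens` helper that wraps the model's tokenizer; all names are illustrative.

```python
# Sketch: checking that prompt + query + retrieved chunks fit the context window.
# `count_tokens` is assumed to wrap the model's tokenizer; names are illustrative.
def fits_context(system_prompt: str, query: str, chunks: list[str],
                 count_tokens, context_window: int, reserve_for_answer: int = 512) -> bool:
    used = count_tokens(system_prompt) + count_tokens(query)
    used += sum(count_tokens(c) for c in chunks)
    return used + reserve_for_answer <= context_window

# One common policy: drop the lowest-ranked chunks until fits_context(...) is True.
```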
12. Engineering Trade-offs
| Larger Context Window | Smaller Context Window |
|---|---|
| Better long-document handling | Faster inference |
| Higher GPU memory | Lower cost |
| Higher latency | More truncation of long inputs |
| More expensive | Easier deployment |
13. Key Takeaways
Context window is the working memory of an LLM
Defined by self-attention O(N²) complexity
Tokens outside the window do not exist
Larger context ≠ better reasoning by default
RAG design is tightly coupled to context size
14. One-Line Summary
The context window defines how much text an LLM can reason over at once; everything outside it is invisible to the model.