Context Window in Large Language Models (LLMs)

1. Introduction

A context window defines the maximum number of tokens that a Large Language Model (LLM) can process at one time during training or inference. It represents the model’s working memory — everything the model can see, attend to, and reason over in a single forward pass. The context window is a hard architectural constraint and plays a crucial role in:

  • Long-document understanding

  • Conversational continuity

  • Reasoning quality

  • Retrieval-Augmented Generation (RAG) design

  • Cost, latency, and scalability of LLM systems

Anything that falls outside the context window effectively does not exist for the model.

2. Tokens and Tokenization

LLMs do not process raw text directly. They operate on tokens, produced by a tokenizer.

A token can be:

  • A whole word

  • A sub-word

  • A symbol or punctuation

Example:

"unbelievable"- ["un", "believ", "able"]

Context window size is measured in tokens, not words or characters.
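
To make the token/word distinction concrete, here is a minimal sketch in Python using the tiktoken library (a real BPE tokenizer); the exact sub-word split and ids depend on the chosen vocabulary:

# Minimal sketch: counting tokens with a sub-word (BPE) tokenizer.
# Assumes tiktoken is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # a common BPE vocabulary

text = "unbelievable"
token_ids = enc.encode(text)                    # list of integer token ids
tokens = [enc.decode([t]) for t in token_ids]   # the sub-word pieces

print(len(text.split()), "word")                # 1 word
print(len(token_ids), "token(s):", tokens)      # one or more sub-word tokens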

3. Formal Definition

Context Window
The maximum number of tokens that can participate in the model’s self-attention mechanism during a single forward pass.

If:

  • Context window size = N

  • Input tokens > N

Then:

  • Tokens beyond N are truncated

  • They do not influence output

  • The model has no awareness of them
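
A minimal sketch of this truncation rule in Python (assuming the common default of keeping the most recent N tokens):

# Minimal sketch: enforcing a context window of N tokens.
# Assumption: the most recent tokens are kept (left-truncation).
def truncate_to_window(token_ids: list[int], n: int) -> list[int]:
    if len(token_ids) <= n:
        return token_ids            # everything fits, nothing is dropped
    return token_ids[-n:]           # tokens beyond N are silently discarded

tokens = list(range(1, 13))               # 12 input tokens: [1, 2, ..., 12]
print(truncate_to_window(tokens, 8))      # [5, 6, 7, 8, 9, 10, 11, 12]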

4. Where the Context Window Exists (Architectural View)

The context window is not a software feature or memory buffer. It exists inside the Transformer architecture, specifically because of self-attention computation limits.

Key Insight

The context window exists because every token must attend to every other token.

This requirement creates a fundamental scalability limit.

5. Step-by-Step Internal Working

Step 1: Text → Tokens

Input:

"AI models use context windows"

Tokenized:

[AI] [models] [use] [context] [windows]

Assume:

  • Context window = 8 tokens

  • Input tokens = 5 → Allowed

Step 2: Tokens → Embeddings

Each token is converted into a dense vector.

Embedding Matrix Shape: (tokens × embedding_dimension)
Example: 5 × 4096

This matrix cannot exceed:

context_window × embedding_dimension
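
A minimal sketch of this shape in PyTorch (the vocabulary size, embedding dimension, and token ids below are illustrative assumptions):

# Minimal sketch: token ids -> embedding matrix of shape (tokens × embedding_dimension).
import torch

vocab_size, d_model = 32_000, 4096
embedding = torch.nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([101, 2054, 2003, 1996, 3319])   # 5 tokens
x = embedding(token_ids)

print(x.shape)   # torch.Size([5, 4096]) -> (tokens, embedding_dimension)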

Step 3: Self-Attention (Why Context Window Is Limited)

For each token, the transformer creates:

  • Query (Q)

  • Key (K)

  • Value (V)

For N tokens:

Q = N × d
K = N × d
V = N × d

Attention is computed as:

Attention = softmax(Q × Kᵀ / √d) × V

Critical Limitation

Q × Kᵀ → produces an N × N matrix

Example: in full attention every token attends to every other token, so for 1,000 tokens the score matrix has 1,000 × 1,000 = 1,000,000 entries.

Tokens (N)     Attention Computations (N²)
1,000          1 million
8,000          64 million
128,000        ~16 billion

Quadratic growth (O(N²)) is the core reason context windows must be bounded.
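
The quadratic cost is visible directly in the shape of the score matrix. A minimal sketch of single-head scaled dot-product attention in PyTorch (the sequence length and head dimension are illustrative):

# Minimal sketch: scaled dot-product attention for N tokens.
# The N × N score matrix is where the quadratic cost comes from.
import torch
import torch.nn.functional as F

N, d = 1_000, 64                    # illustrative sequence length and head dimension
Q = torch.randn(N, d)
K = torch.randn(N, d)
V = torch.randn(N, d)

scores = Q @ K.T / d ** 0.5         # shape (N, N): 1,000,000 entries for N = 1,000
weights = F.softmax(scores, dim=-1)
output = weights @ V                # shape (N, d)

print(scores.shape)                 # torch.Size([1000, 1000])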

6. Decoder-Only Models and Causal Masking

Decoder-only LLMs (GPT, Zephyr, Mistral) generate text left-to-right.

They use causal masking, which ensures:

  • A token can only attend to previous tokens

  • Future tokens are hidden

Causal Masking Diagram

Tokens: T1  T2  T3  T4
T1 sees: T1
T2 sees: T1 T2
T3 sees: T1 T2 T3
T4 sees: T1 T2 T3 T4

Matrix view:

     T1 T2 T3 T4
T1   ✔  ✖  ✖  ✖
T2   ✔  ✔  ✖  ✖
T3   ✔  ✔  ✔  ✖
T4   ✔  ✔  ✔  ✔

Causal masking applies only within the context window.
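
A minimal sketch of how such a mask is typically built in PyTorch (a lower-triangular matrix, where True means "may attend"):

# Minimal sketch: a causal (lower-triangular) attention mask for 4 tokens.
import torch

N = 4
mask = torch.tril(torch.ones(N, N)).bool()
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# Masked positions are set to -inf before the softmax, so each token
# can only attend to itself and to earlier tokens.
scores = torch.randn(N, N)
scores = scores.masked_fill(~mask, float("-inf"))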

7. Sliding Context Window During Generation

During text generation, the context window slides forward.

Assume:

  • Context window = 8 tokens

[T5 T6 T7 T8 T9 T10 T11 T12]
 ↓
 Predict T13

Next step:

[T6 T7 T8 T9 T10 T11 T12 T13]
 ↓
 Predict T14

Older tokens permanently fall out of memory.
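
A minimal sketch of this sliding behaviour in Python (predict_next is a purely illustrative stand-in for a real model call):

# Minimal sketch: keeping only the last `window` tokens while generating.
def predict_next(context: list[str]) -> str:
    return f"T{int(context[-1][1:]) + 1}"    # dummy model: emit the next token name

window = 8
tokens = [f"T{i}" for i in range(1, 13)]     # T1 ... T12 already generated

for _ in range(2):
    context = tokens[-window:]               # older tokens fall out of the window
    next_token = predict_next(context)
    print(context, "->", next_token)         # [T5..T12] -> T13, then [T6..T13] -> T14
    tokens.append(next_token)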

8. What Happens When Input Exceeds the Context Window

Example:

Input tokens: [1][2][3][4][5][6][7][8][9][10][11][12]
Context window = 8

Model sees:

[5][6][7][8][9][10][11][12]

Tokens [1–4]:

  • Are not attended

  • Are not cached

  • Have zero influence on output

9. Encoder vs Decoder Context Windows

Model Type        Example   Context Behavior
Encoder-only      BERT      Full bidirectional attention
Decoder-only      GPT       Left-to-right causal attention
Encoder-Decoder   T5        Encoder and decoder have separate windows

Encoder models see the entire input at once, while decoder models see a growing prefix.

10. Why LLMs "Forget" Earlier Conversations

LLMs have no persistent memory.

They operate strictly on:

Current Context Window → Self-Attention → Output

Once tokens fall outside the window:

  • They are not remembered

  • They cannot influence reasoning

  • The model behaves as if they never existed

11. Context Window in Retrieval-Augmented Generation (RAG)

In RAG systems, the context window must hold:

System Prompt + User Query + Retrieved Document Chunks <= Context Window

RAG Context Diagram

User Query
   ↓
Embedding Model
   ↓
Vector Database
   ↓
Top-K Chunks
   ↓
LLM Context Window
   ↓
Generated Answer

If retrieved chunks exceed the context window:

  • Older or less relevant chunks are dropped

  • Answer quality degrades

Chunk size, overlap, and Top-K are chosen based on the context window size.
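
A minimal sketch of the budget check a RAG pipeline typically performs before calling the model (count_tokens is a crude placeholder for a real tokenizer, and the window sizes are illustrative assumptions):

# Minimal sketch: fitting system prompt + query + retrieved chunks into the window.
def count_tokens(text: str) -> int:
    return len(text.split())                 # stand-in for a real tokenizer

def build_context(system_prompt, query, chunks, window=8_000, reserve_for_answer=1_000):
    budget = window - reserve_for_answer     # leave room for the generated answer
    used = count_tokens(system_prompt) + count_tokens(query)
    selected = []
    for chunk in chunks:                     # chunks assumed sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > budget:
            break                            # remaining, less relevant chunks are dropped
        selected.append(chunk)
        used += cost
    return selected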

12. Engineering Trade-offs

Larger Context Window            Smaller Context Window
Better long-document handling    Faster inference
Higher GPU memory use            Lower cost
Higher latency                   More input truncation
More expensive                   Easier deployment

13. Key Takeaways

  • Context window is the working memory of an LLM

  • Defined by self-attention O(N²) complexity

  • Tokens outside the window do not exist

  • Larger context ≠ better reasoning by default

  • RAG design is tightly coupled to context size

14. One-Line Summary

The context window defines how much text an LLM can reason over at once; everything outside it is invisible to the model.