1. Introduction
A context window defines the maximum number of tokens that a Large Language Model (LLM) can process at one time during training or inference. It represents the model’s working memory — everything the model can see, attend to, and reason over in a single forward pass. The context window is a hard architectural constraint and plays a crucial role in:
Long-document understanding
Conversational continuity
Reasoning quality
Retrieval-Augmented Generation (RAG) design
Cost, latency, and scalability of LLM systems
Anything that falls outside the context window effectively does not exist for the model.
2. Tokens and Tokenization
LLMs do not process raw text directly. They operate on tokens, produced by a tokenizer.
A token can be:
A whole word
A sub-word
A symbol or punctuation
Example:
"unbelievable"- ["un", "believ", "able"]
Context window size is measured in tokens, not words or characters.
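The snippet below is a minimal sketch of counting tokens with the tiktoken library (an assumed choice; any tokenizer works), to show that the window is budgeted in tokens rather than words or characters.

```python
# Minimal sketch: counting tokens with the tiktoken library (assumed available).
# Exact sub-word splits vary from tokenizer to tokenizer and model to model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # a common byte-pair encoding

text = "unbelievable"
token_ids = enc.encode(text)                     # list of integer token ids
pieces = [enc.decode([t]) for t in token_ids]    # the sub-word strings

print(len(token_ids), pieces)                    # the window is budgeted in these units
```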
3. Formal Definition
Context Window
The maximum number of tokens that can participate in the model’s self-attention mechanism during a single forward pass.
If:
Context window size = N
Input tokens > N
Then:
Tokens beyond N are truncated
They do not influence output
The model has no awareness of them
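A minimal sketch of this truncation rule, assuming a keep-the-most-recent-tokens policy (common in chat-style deployments):

```python
# Sketch of the truncation rule: at most N tokens ever reach the model.
# Assumes a "keep the most recent tokens" policy, typical for chat-style usage.
def fit_to_context(token_ids: list[int], n: int) -> list[int]:
    if len(token_ids) <= n:
        return token_ids
    return token_ids[-n:]          # everything earlier is dropped and never seen

tokens = list(range(1, 13))        # 12 token ids
print(fit_to_context(tokens, 8))   # [5, 6, 7, 8, 9, 10, 11, 12]
```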
4. Where the Context Window Exists (Architectural View)
The context window is not a software feature or memory buffer. It exists inside the Transformer architecture, specifically because of self-attention computation limits.
Key Insight
The context window exists because every token must attend to every other token.
This requirement creates a fundamental scalability limit.
5. Step-by-Step Internal Working
Step 1: Text → Tokens
Input:
"AI models use context windows"
Tokenized:
[AI] [models] [use] [context] [windows]
Assume each word maps to one token, giving 5 tokens in total.
Step 2: Tokens → Embeddings
Each token is converted into a dense vector.
Embedding Matrix Shape: (tokens × embedding_dimension)
Example: 5 × 4096
This matrix cannot exceed:
context_window × embedding_dimension
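A NumPy sketch of this lookup with illustrative sizes (the vocabulary size and token ids below are hypothetical):

```python
# Sketch of the embedding lookup; sizes and token ids are illustrative only.
import numpy as np

vocab_size, d_model = 32_000, 4096                    # hypothetical vocabulary and width
embedding_table = np.random.randn(vocab_size, d_model).astype(np.float32)

token_ids = np.array([17, 901, 42, 7, 311])           # 5 hypothetical token ids
x = embedding_table[token_ids]                        # shape: (tokens, embedding_dimension)

print(x.shape)    # (5, 4096), bounded above by (context_window, embedding_dimension)
```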
Step 3: Self-Attention (Why Context Window Is Limited)
For each token, the transformer creates:
Query (Q)
Key (K)
Value (V)
For N tokens:
Q = N × d
K = N × d
V = N × d
Attention is computed as:
Attention = softmax(Q × Kᵀ / √d) × V
Critical Limitation
Q × Kᵀ → produces an N × N matrix
Example: every token attends to every other token, so 1,000 tokens require 1,000 × 1,000 = 1 million attention computations.
| Tokens (N) | Attention Computations |
|---|---|
| 1,000 | 1 million |
| 8,000 | 64 million |
| 128,000 | ~16 billion |
Quadratic growth (O(N²)) is the core reason context windows must be bounded.
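The NumPy sketch below makes the N × N score matrix concrete for a single attention head; the projection matrices are random placeholders, not trained weights.

```python
# Sketch: single-head scaled dot-product attention, exposing the N x N score matrix.
import numpy as np

def attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # each of shape (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (N, N), quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (N, d)

n, d = 1_000, 64                                      # illustrative sizes
x = np.random.randn(n, d)
w_q, w_k, w_v = (np.random.randn(d, d) for _ in range(3))
out = attention(x, w_q, w_k, w_v)
print(out.shape)    # (1000, 64); the intermediate score matrix was 1,000 x 1,000
```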
6. Decoder-Only Models and Causal Masking
Decoder-only LLMs (GPT, Zephyr, Mistral) generate text left-to-right.
They use causal masking, which ensures that each token attends only to itself and to earlier tokens:
Causal Masking Diagram
Tokens: T1 T2 T3 T4
T1 sees: T1
T2 sees: T1 T2
T3 sees: T1 T2 T3
T4 sees: T1 T2 T3 T4
Matrix view:
T1 T2 T3 T4
T1 ✔ ✖ ✖ ✖
T2 ✔ ✔ ✖ ✖
T3 ✔ ✔ ✔ ✖
T4 ✔ ✔ ✔ ✔
Causal masking applies only within the context window.
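A NumPy sketch of the mask above: positions above the diagonal are set to negative infinity before the softmax, so each token attends only to itself and earlier tokens.

```python
# Sketch: building and applying a causal mask for 4 tokens.
import numpy as np

n = 4
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal

scores = np.random.randn(n, n)                     # stand-in for Q @ K^T / sqrt(d)
scores[mask] = -np.inf                             # blocked positions contribute nothing
# After a row-wise softmax, row i has non-zero weight only on tokens 0..i.
print(~mask)       # matches the ✔ / ✖ matrix above
```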
7. Sliding Context Window During Generation
During text generation, the context window slides forward.
Assume:
[T5 T6 T7 T8 T9 T10 T11 T12]
↓
Predict T13
Next step:
[T6 T7 T8 T9 T10 T11 T12 T13]
↓
Predict T14
Older tokens permanently fall out of memory.
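Below is a schematic generation loop; `predict_next_token` is a hypothetical stand-in for the real model call. The essential detail is that the input is re-clipped to the last `n_ctx` tokens at every step.

```python
# Schematic generation loop; `predict_next_token` is a hypothetical model call.
# The key line is the re-clipping of the running sequence to the last n_ctx tokens.
def generate(prompt_tokens: list[int], predict_next_token, n_ctx: int, steps: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(steps):
        window = tokens[-n_ctx:]                   # only the most recent n_ctx tokens
        tokens.append(predict_next_token(window))  # older tokens cannot influence this
    return tokens

# Toy usage: a fake "model" that just returns max(window) + 1.
print(generate(list(range(1, 13)), lambda w: max(w) + 1, n_ctx=8, steps=2))
```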
8. What Happens When Input Exceeds the Context Window
Example:
Input tokens: [1][2][3][4][5][6][7][8][9][10][11][12]
Context window = 8
Model sees:
[5][6][7][8][9][10][11][12]
Tokens [1–4] are truncated: they do not influence the output, and the model has no awareness of them.
9. Encoder vs Decoder Context Windows
| Model Type | Example | Context Behavior |
|---|---|---|
| Encoder-only | BERT | Full bidirectional attention |
| Decoder-only | GPT | Left-to-right causal attention |
| Encoder-Decoder | T5 | Encoder & decoder have separate windows |
Encoder models see the entire input at once, while decoder models see a growing prefix.
10. Why LLMs "Forget" Earlier Conversations
LLMs have no persistent memory.
They operate strictly on:
Current Context Window → Self-Attention → Output
Once tokens fall outside the window, they are gone for good; the model cannot recall them in later turns.
11. Context Window in Retrieval-Augmented Generation (RAG)
In RAG systems, the context window must hold:
System Prompt + User Query + Retrieved Document Chunks <= Context Window
RAG Context Diagram
User Query
↓
Embedding Model
↓
Vector Database
↓
Top-K Chunks
↓
LLM Context Window
↓
Generated Answer
If retrieved chunks exceed the context window, they are truncated and the model answers from incomplete evidence. For this reason, chunk size, overlap, and Top-K are chosen based on the context window size.
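A sketch of this budget check, assuming a `count_tokens` helper that wraps the model's tokenizer; all names are illustrative.

```python
# Sketch: checking that prompt + query + retrieved chunks fit the context window.
# `count_tokens` is assumed to wrap the model's tokenizer; names are illustrative.
def fits_context(system_prompt: str, query: str, chunks: list[str],
                 count_tokens, context_window: int, reserve_for_answer: int = 512) -> bool:
    used = count_tokens(system_prompt) + count_tokens(query)
    used += sum(count_tokens(c) for c in chunks)
    return used + reserve_for_answer <= context_window

# One common policy: drop the lowest-ranked chunks until fits_context(...) is True.
```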
12. Engineering Trade-offs
| Larger Context Window | Smaller Context Window |
|---|---|
| Better long-document handling | Faster inference |
| Higher GPU memory | Lower cost |
| Higher latency | More truncation of long inputs |
| More expensive | Easier deployment |
13. Key Takeaways
Context window is the working memory of an LLM
Defined by self-attention O(N²) complexity
Tokens outside the window do not exist
Larger context ≠ better reasoning by default
RAG design is tightly coupled to context size
14. One-Line Summary
The context window defines how much text an LLM can reason over at once; everything outside it is invisible to the model.