The rapid advancement of Large Language Models (LLMs) such as GPT, BERT, and LLaMA is built on one key innovation: the Transformer architecture, which has become the backbone of modern natural language processing (NLP).
What is a Transformer?
The Transformer is a sequence-to-sequence architecture originally designed for machine translation. Transformers use self-attention to process all tokens in parallel.
It has two major components: an encoder and a decoder, described in detail below.
Types of Transformers
Encoder-only models → BERT (context understanding, classification).
Decoder-only models → GPT (causal language modeling, text generation).
Encoder–Decoder models → T5, BART (translation, summarization, sequence-to-sequence).
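As a rough illustration, each family maps to a different class in the Hugging Face transformers library (a minimal sketch; the checkpoint names below are just example choices):

```python
# A minimal sketch, assuming the Hugging Face `transformers` library is installed;
# the checkpoint names are illustrative, not the only options.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")        # BERT: context understanding
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")          # GPT: text generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # T5: translation, summarization
```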
1. Embeddings: Numerical Representation of Tokens
Transformers operate on continuous vectors, not raw text. Thus, text must be transformed into dense vector embeddings.
Steps
Tokenization
The user's input text is split into tokens:
Sentence 1 → [I, ate, an, apple]
Sentence 2 → [Apple, released, a, new, iPhone]
Embedding Lookup
Each token is converted to a dense vector.
"apple" → [0.12, -0.87, 0.45, ...]
"iPhone" → [1.23, 0.65, -0.44, ...]
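Putting tokenization and embedding lookup together, here is a minimal sketch using a hand-built toy vocabulary and PyTorch's nn.Embedding (real models use learned subword tokenizers and far larger embedding tables):

```python
import torch
import torch.nn as nn

# Toy whitespace tokenizer over a tiny hand-built vocabulary (illustrative only;
# real models use subword tokenizers such as BPE or WordPiece).
vocab = {"i": 0, "ate": 1, "an": 2, "apple": 3, "released": 4, "a": 5, "new": 6, "iphone": 7}

def tokenize(text):
    return [vocab[word] for word in text.lower().split()]

d_model = 8                                     # embedding size (tiny, for illustration)
embedding = nn.Embedding(len(vocab), d_model)   # lookup table: token id -> dense vector

token_ids = torch.tensor([tokenize("I ate an apple")])  # shape: (1, 4)
token_vectors = embedding(token_ids)                    # shape: (1, 4, 8)
```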
Positional Encoding
Transformers don’t know the order of words because self-attention looks at all tokens in parallel. For example: “I ate an apple” vs “Apple ate I” → same words but the meaning changes because of order. Positional Encoding (PE) gives each word a unique "position signal" so the model knows the sequence order.
Now the input vector is Embedding(tokens) + Positional Encoding
Now, the model knows both what the word is and where it appears.
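The original Transformer uses fixed sinusoidal signals, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); many newer models learn positional embeddings or use rotary embeddings instead. A minimal sketch of the sinusoidal version:

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal position signals (assumes d_model is even)."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)                # the values 2i
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * dims / d_model)  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions get cosine
    return pe

# Added to the token embeddings from the previous sketch before the first layer:
# model_input = token_vectors + sinusoidal_positional_encoding(4, 8)
```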
This embedding stage sets up word sense disambiguation: the same token "apple" starts from the same vector in both sentences, then evolves differently depending on context during subsequent encoder/decoder processing.
2. Encoder: Contextual Representation
An Encoder is the part of the architecture that reads the input text and builds a contextual understanding of it. It doesn’t generate words. Instead, it produces contextual embeddings — numerical vectors that capture the meaning of each token in relation to others.
How does the Encoder work?
An Encoder is built from N identical layers (e.g., 6 or 12). Each layer has:
Multi-Head Self-Attention (MHSA)
Every word looks at every other word to decide what’s important.
Example: In “I ate an apple”, the word “apple” will attend strongly to “ate.”
Feedforward Neural Network (FFN)
A small position-wise network applied to each token's vector independently.
Residual Connections + Layer Normalization
These stabilize training and keep information flowing through the deep stack of layers.
Example
In "I ate an apple", the output vector for "apple" now also reflects that it is something being eaten, i.e., the fruit rather than the company.
So, the encoder transforms embeddings into context-rich representations.
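To make this concrete, here is a minimal sketch of scaled dot-product self-attention, the operation at the heart of MHSA (a single head with no learned projections; real encoder layers add linear projections for Q, K, V and run several heads in parallel):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x):
    """Single-head scaled dot-product self-attention (no learned projections, for clarity)."""
    q, k, v = x, x, x                         # real layers project x into Q, K, V first
    scores = q @ k.T / math.sqrt(x.size(-1))  # how relevant each token is to every other token
    weights = F.softmax(scores, dim=-1)       # each row is an attention distribution
    return weights @ v                        # each output mixes all token vectors by relevance

x = torch.randn(4, 8)           # 4 tokens ("I", "ate", "an", "apple"), d_model = 8
contextual = self_attention(x)  # same shape, but every vector now reflects its context
```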
3. Decoder: Generating Response
The decoder also consists of N identical layers but includes an additional cross-attention module.
Decoder Components
Masked Multi-Head Self-Attention
Prevents attending to future tokens during autoregressive generation (a minimal masking sketch follows after this list).
Encoder–Decoder Cross-Attention
Queries come from the decoder’s hidden states, while Keys/Values come from the encoder outputs.
Feedforward + Normalization (same structure as in the encoder)
Autoregressive Prediction
At each step, the decoder models the probability of the next token given everything produced so far (and the input, if there is an encoder):

P(y_t | y_1, ..., y_{t-1}, x)

so the whole output sequence factorizes as P(y_1, ..., y_T | x) = ∏_t P(y_t | y_1, ..., y_{t-1}, x). In other words, the decoder generates each token based on the tokens it has already produced and, in encoder-decoder models, the encoder's output.
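A minimal sketch of how the causal mask enforces this, reusing the single-head attention idea from the encoder section (real decoder layers add learned projections, multiple heads, and cross-attention on top):

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(x):
    """Causal self-attention: position t may only attend to positions 1..t."""
    seq_len = x.size(0)
    scores = x @ x.T / math.sqrt(x.size(-1))
    # Strictly-upper-triangular mask marks "future" positions; -inf drives their softmax weight to 0.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(4, 8)             # the 4 tokens generated so far
out = masked_self_attention(x)    # each position only "sees" itself and earlier positions
```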
Think of the decoder as a storyteller:
It remembers the words it has already spoken.
It looks back at the input (if there’s an encoder).
Then it decides on the next word, typically the one with the highest probability.
Example
Decoder-only models (GPT, LLaMA, Gemma) generate using masked self-attention alone over the prompt and the tokens produced so far, while encoder-decoder models (BART, T5) additionally use cross-attention over the encoder's output.
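A minimal sketch of the generation loop for a decoder-only model, using greedy decoding (here `model` stands for any callable that maps a batch of token ids to next-token logits, and `eos_id` is an assumed end-of-sequence id; real systems usually sample with temperature or top-p instead of always taking the argmax):

```python
import torch

def greedy_generate(model, prompt_ids, eos_id, max_new_tokens=20):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))     # assumed shape: (1, len(ids), vocab_size)
        next_id = int(logits[0, -1].argmax())   # most likely next token given everything so far
        ids.append(next_id)
        if next_id == eos_id:                   # stop once the model emits end-of-sequence
            break
    return ids
```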
Output (Final Text)
The predicted tokens are mapped back to text (detokenized) and stitched together into the final response.
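A tiny sketch of this last step, assuming the toy vocabulary from the embedding sketch and the `ids` list returned by the decoding sketch (real tokenizers provide a decode method that also handles subword merging, spacing, and special tokens):

```python
# Invert the toy vocabulary and join the token strings back into text.
id_to_token = {i: tok for tok, i in vocab.items()}
output_text = " ".join(id_to_token[i] for i in ids)
```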
Summary Flow
Input → Embedding → Encoder (context) → Decoder (generate) → Output