LLMs  

The Transformer Model Architecture in LLMs

The rapid advancement of Large Language Models (LLMs) such as GPT, BERT, and LLaMA is built upon one key innovation: the Transformer architecture, which has become the backbone of modern natural language processing (NLP).

What is a Transformer?

The Transformer is a sequence-to-sequence architecture originally designed for machine translation. Unlike recurrent networks, which read tokens one at a time, it uses self-attention to process all tokens in parallel.

It has two major components:

  • Encoder → processes and contextualizes input.

  • Decoder → generates or transforms output.

Types of Transformers

  • Encoder-only models → BERT (context understanding, classification).

  • Decoder-only models → GPT (causal language modeling, text generation).

  • Encoder–Decoder models → T5, BART (translation, summarization, sequence-to-sequence).

1. Embeddings: Numerical Representation of Tokens

Transformers operate on continuous vectors, not raw text. Thus, text must be transformed into dense vector embeddings.

Steps

  1. Tokenization
    User writes:

    • “I ate an apple.”

    • “Apple released a new iPhone.”

    Text → split into tokens.

    • Sentence 1 → [I, ate, an, apple]

    • Sentence 2 → [Apple, released, a, new, iPhone]

  2. Embedding Lookup
    Each token is converted to a dense vector.

    • "apple" → [0.12, -0.87, 0.45, ...]

    • “iPhone" → [1.23, 0.65, -0.44, ...] 

  3. Positional Encoding
    Transformers don’t know the order of words because self-attention looks at all tokens in parallel. For example: “I ate an apple” vs “Apple ate I” → same words but the meaning changes because of order. Positional Encoding (PE) gives each word a unique "position signal" so the model knows the sequence order.
    Now the input vector is Embedding(tokens) + Positional Encoding.

Now, the model knows both what the word is and where it appears.
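
Here is a minimal sketch of the embedding lookup and positional encoding steps in PyTorch, using sinusoidal position signals; the dimensions, vocabulary size, and token IDs are illustrative assumptions, not values from any particular model.

```python
import torch

d_model = 8          # embedding dimension (illustrative)
vocab_size = 100     # toy vocabulary size
embedding = torch.nn.Embedding(vocab_size, d_model)

def positional_encoding(seq_len, d_model):
    """Sinusoidal position signal, as in the original Transformer paper."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000.0 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Made-up token IDs for [I, ate, an, apple].
token_ids = torch.tensor([5, 21, 9, 42])
x = embedding(token_ids) + positional_encoding(len(token_ids), d_model)
print(x.shape)  # torch.Size([4, 8]) -- one position-aware vector per token
```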

This embedding stage lays the groundwork for word sense disambiguation: the token “apple” starts from the same vector in both sentences, but its representation evolves differently depending on context during subsequent encoder/decoder processing.

2. Encoder: Contextual Representation

An Encoder is the part of the architecture that reads the input text and builds a contextual understanding of it. It doesn’t generate words. Instead, it produces contextual embeddings — numerical vectors that capture the meaning of each token in relation to others.

How does the Encoder work?

An Encoder is built from N identical layers (e.g., 6 or 12). Each layer has:

  1. Multi-Head Self-Attention (MHSA)

    • Every word looks at every other word to decide what’s important.

    • Example: In “I ate an apple”, the word “apple” will attend strongly to “ate.”

  2. Feedforward Neural Network (FFN)

    • After attention, a small MLP (neural net) processes the information further.

  3. Residual Connections + Layer Normalization

    • Keep training stable and allow deeper networks.
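
A minimal single-head sketch of what the attention step in such a layer computes, in PyTorch; all dimensions are illustrative, and a real layer uses multiple heads:

```python
import torch
import torch.nn.functional as F

d_model, seq_len = 8, 4                     # e.g. [I, ate, an, apple]
x = torch.randn(seq_len, d_model)           # stand-in for position-aware embeddings

# Learned projections that produce queries, keys, and values.
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)
Q, K, V = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention: every token scores every other token.
scores = Q @ K.T / (d_model ** 0.5)         # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)         # each row sums to 1
attended = weights @ V                      # context-mixed token representations

# The layer then applies a feedforward network, with residual connections
# and layer normalization around each sub-block (shown here only for attention).
out = torch.nn.LayerNorm(d_model)(x + attended)
```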

Example

  • In “I ate an apple”:

    • “apple” attends strongly to “ate” and “I” → meaning = fruit.

  • In “Apple released iPhone”:

    • “Apple” attends strongly to “released” and “iPhone” → meaning = company.

So, the encoder transforms embeddings into context-rich representations.
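
One way to see this concretely is to compare the vector a pretrained encoder assigns to “apple” in the two sentences. A rough sketch using the Hugging Face transformers library with bert-base-uncased (the model choice is an assumption for illustration; exact similarity values will vary):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def apple_vector(sentence):
    """Return the contextual vector the encoder assigns to the token 'apple'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]           # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("apple")]

v_fruit = apple_vector("I ate an apple.")
v_company = apple_vector("Apple released a new iPhone.")

# The same surface token gets clearly different vectors in the two contexts.
print(torch.cosine_similarity(v_fruit, v_company, dim=0).item())
```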

3. Decoder: Generating Response

The decoder also consists of N identical layers but includes an additional cross-attention module.

Decoder Components

  1. Masked Multi-Head Self-Attention
    Prevents attending to future tokens during autoregressive generation.

  2. Encoder–Decoder Cross-Attention
    Queries come from the decoder’s hidden states, while Keys/Values come from the encoder outputs.

  3. Feedforward + Normalization
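
A minimal sketch of how the mask in step 1 is typically applied (PyTorch, illustrative dimensions): positions a token must not see are set to minus infinity before the softmax, so their attention weights become zero.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 8
Q = K = V = torch.randn(seq_len, d_model)   # stand-ins for the decoder's projections

scores = Q @ K.T / (d_model ** 0.5)

# Causal mask: position t may attend only to positions <= t.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)         # upper-triangular entries are exactly 0
attended = weights @ V
print(weights)
```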

Autoregressive Prediction

The formula is:

P(y_1, ..., y_T | X) = ∏_{t=1}^{T} P(y_t | y_1, ..., y_{t-1}, X)

At each step t, the decoder generates the next token based on:

  • Previous outputs [y_1, ..., y_{t-1}].

  • Encoded input sequence.

Think of the decoder as a storyteller:

  • It remembers the words it has already spoken.

  • It looks back at the input (if there’s an encoder).

  • Then it decides the next word with the highest probability.
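
A rough sketch of that loop with a decoder-only model, here GPT-2 via the Hugging Face transformers library with greedy decoding (the model choice is illustrative, and the actual continuation depends entirely on the model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("I ate an", return_tensors="pt").input_ids

# Greedy autoregressive loop: each step conditions on everything generated so far.
for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits               # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()             # most probable next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```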

Example

  • Input: “Explain: I ate an apple”

    • Decoder outputs: “Apple → is → a → fruit.”

  • Input: “Explain: Apple released iPhone”

    • Decoder outputs: “Apple → is → a → technology → company.”

The decoder uses cross-attention (in encoder-decoder models like BART/T5) or just self-attention (in GPT, LLaMA, Gemma) to generate.

Output (Final Text)

The predicted tokens are stitched together:

  • Sentence 1 → “Apple is a fruit.”

  • Sentence 2 → “Apple is a technology company.”

Summary Flow

Input → Embedding → Encoder (context) → Decoder (generate) → Output
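
As a closing illustration of this flow with an encoder-decoder model, a small sketch using the Hugging Face pipeline API and t5-small for translation; the model and task are illustrative choices:

```python
from transformers import pipeline

# The encoder reads and contextualizes the English input;
# the decoder generates the German output token by token.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("I ate an apple.")[0]["translation_text"])
```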