AI  

Transformers in AI

AI… It's one of those things that sounds super techy and, honestly, a bit confusing. Welcome to AI for Dummies, where I'll walk you through the world of Artificial Intelligence without all the jargon and over-complicated stuff.

Feel free to explore the earlier chapters of the series first:

  1. Layers of Artificial Intelligence

  2. The ABCs of Machine Learning

  3. The ABCs of Deep Learning

  4. Foundation Models: Everything, Everywhere, All at Once!

  5. The Fascinating History of AI: From Turing to Today

So, what are we talking about again? Ah yes, Transformers.

And just to be clear, I'm not talking about Optimus Prime or electrical transformers. This is something completely different: an ingenious model architecture in AI.

By definition, a transformer is a type of model that's really good at understanding sequences: sentences, code, or music. It's like a much smarter version of autocomplete. Instead of looking at words one by one, it pays attention to all the words at once, figuring out which ones matter most for the meaning. There's a factor called "attention" at work here, which we'll learn about in a while.

If I had to give an analogy, I'd say it's like your mom tasting every ingredient in a big pot at the same time and instantly knowing if more salt or chili is needed to make the dish perfect. That's how transformers work: they suggest the next word, but mathematically. Let's see how.

Inside a Transformer

1. Self-Attention Mechanism

Instead of reading a sentence word by word, the model looks at all words at once to figure out relationships.

Sentence: "Harry told Ron that he had found the Horcrux."

The model learns that "he" = Harry, not Ron, even though they're a few words apart. This helps it understand context properly.
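
If you're curious what "looking at all the words at once" means in numbers, here's a minimal sketch of self-attention using NumPy. The 4-dimensional embeddings are made up, and a real transformer would use separate learned weight matrices to produce queries, keys, and values; this just shows how attention weights mix information between words.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: one row per word (toy embeddings), shape (num_words, dim).
    # Real transformers compute Q, K, V with learned matrices;
    # here we reuse X directly to keep the sketch minimal.
    Q, K, V = X, X, X
    scores = Q @ K.T / np.sqrt(X.shape[-1])  # how strongly each word attends to every other word
    weights = softmax(scores, axis=-1)       # attention weights; each row sums to 1
    return weights @ V                       # each word becomes a weighted mix of all words

# Toy embeddings for "Harry", "told", "Ron", "he" (made-up numbers)
X = np.array([
    [0.8, 0.1, 0.3, 0.0],  # Harry
    [0.1, 0.9, 0.2, 0.1],  # told
    [0.7, 0.2, 0.2, 0.1],  # Ron
    [0.8, 0.1, 0.3, 0.1],  # he
])
print(self_attention(X))
```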

2. Positional Encoding

Transformers read words in parallel, so they don't naturally know the order. Positional encodings tell the model the position of each word.

Sentence: "Voldemort cursed Harry" vs "Harry cursed Voldemort."
Without positional info, the meaning could get completely scrambled. Positional encoding ensures the model knows who did what to whom.
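
Here's a small sketch of the sinusoidal positional encoding from the original Transformer paper, assuming NumPy and a toy embedding size of 4. These position vectors get added to the word embeddings so the model can tell "Voldemort cursed Harry" apart from "Harry cursed Voldemort".

```python
import numpy as np

def positional_encoding(num_positions, dim):
    # Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine,
    # each pair at a different frequency, so every position gets a unique pattern.
    positions = np.arange(num_positions)[:, None]   # shape (num_positions, 1)
    div = 10000 ** (np.arange(0, dim, 2) / dim)     # one frequency per dimension pair
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# "Voldemort cursed Harry" -> 3 positions, toy embedding size of 4
pe = positional_encoding(3, 4)
print(pe)  # these rows are added to the word embeddings to mark word order
```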

3. Encoder–Decoder Structure

  • Encoder: A smart reader. It reads and understands the input, figuring out all the details and context.

  • Decoder: A smart writer (like me). It takes what the encoder understood and generates the output in the target language or format.

Example

  • Input (English, the encoder reads): "Harry needs the invisibility cloak". The encoder figures out who, what, and the overall meaning of the sentence.

  • Output (French, the decoder writes): "Harry a besoin de la cape d'invisibilité". The decoder uses the encoder's understanding to produce the correct output (see the sketch below).
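
If you want to try an encoder-decoder transformer yourself, one quick way is the Hugging Face transformers library. This assumes you have it installed along with a pretrained translation model (t5-small is my choice here, not something from the article), and the exact French wording it produces may differ slightly from the example above.

```python
# pip install transformers sentencepiece
from transformers import pipeline

# T5 is an encoder-decoder model: the encoder reads the English sentence,
# the decoder writes the French one.
translator = pipeline("translation_en_to_fr", model="t5-small")

result = translator("Harry needs the invisibility cloak")
print(result[0]["translation_text"])  # something close to "Harry a besoin de la cape d'invisibilité"
```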

Use cases

  • BERT uses only the encoder: it just reads and understands text. Perfect for questions like "Who is using the cloak?" or "What spell was cast?"

  • GPT uses only the decoder: it just generates text. It can write new dialogue, spells, or stories, because it predicts the next words based on context.

4. Feedforward Layers and Residual Connections

These are like refinement steps that polish the model's understanding and prevent it from "forgetting" or going off-track.

Example
Imagine passing a blurry photo through multiple filters to sharpen it. Residual connections make sure the original info isn't lost while enhancing it.
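
As a rough sketch (assuming NumPy, made-up weights, and leaving out the layer normalization that real transformers also apply), a feedforward block with a residual connection looks like this:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feedforward: expand the vector, apply ReLU, project it back.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def block_with_residual(x, W1, b1, W2, b2):
    # Residual (skip) connection: add the original x back in,
    # so the refinement step can't "forget" the information it started with.
    return x + feed_forward(x, W1, b1, W2, b2)

rng = np.random.default_rng(0)
dim, hidden = 4, 8                          # toy sizes; real models use hundreds or thousands
W1, b1 = rng.normal(size=(dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, dim)), np.zeros(dim)

x = np.array([0.8, 0.1, 0.3, 0.0])          # toy hidden state for one word
print(block_with_residual(x, W1, b1, W2, b2))
```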

How Does a Transformer Choose What Happens Next?

Let's take an example.

Say you ask ChatGPT: "What did Harry do?" It then starts doing something like this in the background.

(Picture a list of candidate next words with their probabilities: "cast" 0.50, "picked" 0.20, "grabbed" 0.15, and so on.)

How are these words predicted?

  1. Context understanding
    The transformer reads the entire sentence, not just the last word. It knows the subject is "Harry" and that the sentence is asking about an action he performed.

  2. Self-attention
    It looks at all the words in the sentence simultaneously to figure out relationships. For example, given "Harry" + "do", likely actions might be "cast a spell", "picked", "grabbed", etc.

  3. Feedforward & probability calculation
    The model calculates a probability distribution over its entire vocabulary for the next word. That's where the numbers come from:

  • 0.50 for "cast": a 50% chance, meaning the model thinks this is the most likely next word.

  • 0.20 for "picked": a 20% chance.

  • 0.15 for "grabbed": a 15% chance.

  • And so on…

Choosing the next word
The model can either pick the highest probability ("cast") or sample randomly based on the probabilities for more varied or creative text.

The words are the possible predictions the model considers based on the sentence context, while the numbers represent how confident the model is that each word fits next.

When generating text, the model can choose the next word in different ways:

  • Greedy: Pick the word with the highest probability.

  • Sampling / Top-k / Top-p: Randomly pick a word based on the probabilities, for more varied or creative output (see the sketch below).
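
Here's a tiny NumPy sketch of both strategies, using the illustrative probabilities from the example above (the last two words, "waved" and "ran", are filler I added just to round out the list):

```python
import numpy as np

words = ["cast", "picked", "grabbed", "waved", "ran"]   # "waved" and "ran" are made-up filler
probs = np.array([0.50, 0.20, 0.15, 0.10, 0.05])        # illustrative probabilities

# Greedy decoding: always take the single most likely word.
greedy_choice = words[int(np.argmax(probs))]

# Top-k sampling (k=3): keep only the 3 most likely words,
# renormalize their probabilities, then pick one at random.
k = 3
top_idx = np.argsort(probs)[::-1][:k]
top_probs = probs[top_idx] / probs[top_idx].sum()
rng = np.random.default_rng()
sampled_choice = words[rng.choice(top_idx, p=top_probs)]

print("greedy:", greedy_choice)    # always "cast"
print("sampled:", sampled_choice)  # varies between "cast", "picked", and "grabbed"
```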

But hold on, those numbers: what exactly are they?

Let's understand the concept of a Vector

In AI, a vector is just a list of numbers that represents something like a word, a sentence, or even an image in a way a computer can understand.

Like coordinates in space:

  • In 2D: a point is represented as [x,y] e.g., [3,5]

  • In 3D: a point is [x,y,z] e.g., [1,2,3]

In AI, the "vector space" can have hundreds or thousands of dimensions, rather than just 2D or 3D, and each word or concept gets its own vector in that space.

Example with Harry Potter Words

Suppose we have 3 dimensions (just for simplicity):

Word      | Vector (3D)
Harry     | [0.8, 0.1, 0.3]
Voldemort | [0.9, 0.2, 0.4]
Wand      | [0.1, 0.7, 0.2]

Words that are similar in meaning or context will have vectors close together in this space.

  • "Harry" and "Voldemort" are closer because they often appear in similar contexts (magic battles, Hogwarts).

  • "Wand" is a bit farther away but still related.

Why vectors matter in transformers

  • Each word gets converted into a vector (embedding).

  • Transformers work with these vectors to compute relationships using self-attention.

Vectors to Predictions

Step 1. Word as a Vector (Embedding)

Suppose “Harry” is represented as a vector: [0.8,  0.1,  0.3]

Step 2. Hidden State Transformation

That vector is processed through the transformer (self-attention + feedforward layers). After all this, we get a hidden state vector for the word's position: still just numbers, but now carrying context.

Example hidden state (say 4 dimensions for simplicity): [2.1, −0.5, 0.3, 1.7]

Step 3. Linear Layer - Logits

The hidden state is passed through a linear layer (matrix multiplication + bias). This maps the vector into a space the size of the vocabulary.

If the vocabulary has 5 words (Harry, Ron, Voldemort, wand, spell), then the linear layer outputs 5 numbers, the logits:
[3.2, 1.7, 0.5, −1.0, 2.4]

Each number corresponds to how “relevant” that word might be as the next prediction.
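
As a shapes-only sketch in NumPy: the real weight matrix and bias are learned during training, so the random values here only show the sizes involved, not meaningful scores.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_state = np.array([2.1, -0.5, 0.3, 1.7])   # the 4-dimensional hidden state from Step 2

# Linear layer: a (hidden size 4) x (vocabulary size 5) weight matrix plus a bias.
W = rng.normal(size=(4, 5))
b = np.zeros(5)

logits = hidden_state @ W + b   # five raw scores, one per vocabulary word
print(logits.round(2))          # random scores here; a trained W would give scores like [3.2, 1.7, 0.5, -1.0, 2.4]
```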

Step 4. Softmax - Probabilities

Now we convert logits into probabilities using the softmax function:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

Softmax makes all numbers positive and ensures they sum to 1.
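
Here's the same calculation as a short NumPy sketch, applied to the example logits; the rounded output is what the final result below shows.

```python
import numpy as np

vocab  = ["Harry", "Ron", "Voldemort", "wand", "spell"]
logits = np.array([3.2, 1.7, 0.5, -1.0, 2.4])   # raw scores from the linear layer

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()       # all positive, sums to 1

for word, p in zip(vocab, softmax(logits)):
    print(f"{word}: {p:.2f}")
# Harry: 0.57, Ron: 0.13, Voldemort: 0.04, wand: 0.01, spell: 0.26
```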

Final result

  • Harry - 0.57

  • Ron - 0.13

  • Voldemort - 0.04

  • wand - 0.01

  • spell - 0.26

  • Vector (embedding/hidden state) = model’s internal representation of meaning.

  • Linear layer = maps that vector to raw scores (logits) for each possible next word.

  • Softmax = converts scores into a probability distribution.

That’s how [0.8, 0.1, 0.3] eventually becomes something like: Harry (0.57), Ron (0.13), Voldemort (0.04)…

Then the word with the highest probability is chosen as the result.

Example

(Picture the model ranking candidate next words for a sentence, with "slithered" as the top choice.)

The transformer looks at the whole sentence and predicts the most likely next word based on context. Here, "slithered" is the top choice because it makes the most sense.

Summary

I hope I’ve done a better job of explaining how Transformers in AI are different from Transformers in movies. Next time you say hello to an AI, you’ll know the complex world that’s working behind the scenes.