1. What a Transformer Actually Computes
A decoder-only transformer (the backbone of most LLMs) is a stack of identical blocks. Each block does:
1. LayerNorm/RMSNorm → Multi-Head Self-Attention (MHSA)
For tokens \(x \in \mathbb{R}^{T \times d_{model}}\), each head \(h\) (head dimension \(d_h = d_{model}/H\)) projects to queries, keys, and values and attends over them:
\(Q = xW_Q,\quad K = xW_K,\quad V = xW_V\)
\(\text{Attn}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_h}} + \text{mask}\right)V\)
The head outputs are concatenated and projected by \(W_O\).
2. Residual connection (pre/post-norm depending on variant).
3. MLP (a position-wise feed-forward network, gated in most modern variants)
Classic: \( \text{GELU}(xW_1 + b_1)W_2 + b_2 \).
Modern LLMs often use SwiGLU: \(\text{SwiGLU}(x) = (\text{SiLU}(xW_{gate}) \odot (xW_{up})) W_{down}\), where \(\text{SiLU}(z) = z \cdot \sigma(z)\).
4. Positional encoding (where RoPE/ALiBi live; see §5).
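The attention step of the block can be sketched in a few lines of plain Python. This is a minimal single-head illustration, assuming lists of floats instead of tensors and omitting the learned projections; the function names are illustrative, and the causal mask is implemented by only scoring keys at or before the query position.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_h) + mask) V.

    Q, K, V are lists of T vectors. Masking future positions with -inf
    is equivalent to scoring only keys j <= i, which is what we do here.
    """
    d_h = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_h)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Y = causal_attention(Q, K, V)
```

The first output row equals the first value vector, because token 0 can only attend to itself.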
Key scaling features
GQA/MQA reduce K/V heads relative to Q heads to shrink memory bandwidth during attention.
FlashAttention computes attention with IO-aware tiling: same math, far less HBM traffic.
KV cache stores \(K, V\) from prior tokens, so each new token costs \(O(T)\) attention work per layer over the cached history, instead of recomputing keys and values for all \(T\) prior tokens.
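A toy decode loop shows what the KV cache buys. The sketch below assumes one head in one layer and plain Python lists; `KVCache` and `attend` are hypothetical names, not a library API.

```python
import math

def attend(q, keys, values):
    """Attention for one query over all cached keys/values (linear in cache length)."""
    d_h = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Append-only store of keys and values for one head in one layer."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the new token's k/v are appended; earlier entries are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
# (query, key, value) for three successive tokens.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
          ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
          ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]
outputs = [cache.step(q, k, v) for q, k, v in tokens]
```

Each call to `step` touches every cached entry once, so the per-token work grows with history length while the projections for old tokens are never redone.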
2. Weights: What They Encode and How They’re Organized
Weights are the learned parameters of the projections (Q/K/V/O), MLP matrices, embedding tables, and norm scales. They encode:
Token semantics: via input/output embedding matrices.
Relational structure: via attention projections and the statistics baked into them.
Compositional patterns: via MLPs that model nonlinear feature interactions.
Common choices
Norms: RMSNorm (learned scale only, no mean-centering or bias) vs LayerNorm (learned \(\gamma,\beta\)). RMSNorm is simpler, cheaper, and stable at scale.
Weight tying: Share input and output embedding matrices (saves params, improves sample efficiency).
Initialization: Scaled variants (e.g., µP, DeepNet) stabilize very deep stacks.
MoE (Mixture-of-Experts): Sparse activation routes tokens to a small subset of experts → more parameters without proportional compute.
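Sparse routing is easier to see in code. The sketch below assumes top-k routing with gate weights renormalized over the selected experts, which is one common variant; the toy experts and router logits are made up for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in ranked)
    exps = {e: math.exp(router_logits[e] - m) for e in ranked}
    z = sum(exps.values())
    return {e: w / z for e, w in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for e, g in gates.items():
        y = experts[e](x)  # only k of the experts are ever evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Four toy "experts" that just scale the input (stand-ins for expert MLPs).
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, gates = moe_forward([1.0, 1.0], experts,
                         router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

Here four experts hold parameters but only two run per token, which is the sense in which parameter count grows without proportional compute.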
Practical memory math (float16)
Model weights ≈ \( \text{params} \times 2\) bytes (fp16) or 1 byte (int8) after quantization.
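KV cache per token ≈ \(2 \times L \times H_{kv} \times d_h \times \text{bytes}\), where \(H_{kv}\) is the number of K/V heads (equal to \(H\) without GQA/MQA). Example: \(L=48, H_{kv}=32, d_h=128\), bf16 (2 bytes) → about 786 kB per token; at an 8,192-token context that’s ~6.4 GB (6 GiB) just for KV per batch element.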
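The arithmetic above is easy to reproduce. In this small sketch the helper names are illustrative, and the 7B-parameter model and the 8-KV-head GQA variant are hypothetical comparisons, not figures from the text.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor 2 covers one key and one value vector per head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def weight_bytes(n_params, bytes_per_param):
    return n_params * bytes_per_param

per_token = kv_cache_bytes_per_token(48, 32, 128, 2)   # 786432 bytes
per_seq = per_token * 8192                             # 6442450944 bytes (6 GiB)

# Hypothetical GQA variant with 8 K/V heads: the cache shrinks 4x.
per_seq_gqa = kv_cache_bytes_per_token(48, 8, 128, 2) * 8192

# Hypothetical 7B-parameter model: fp16 weights vs int8 weights.
fp16_weights = weight_bytes(7_000_000_000, 2)
int8_weights = weight_bytes(7_000_000_000, 1)
```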
3. Embeddings: From Tokens to Geometry
Token embeddings map discrete IDs to vectors \(e_t \in \mathbb{R}^{d}\). Positional embeddings inject order (see §5). Output embeddings (tied) convert hidden states back to vocabulary logits.
Semantic embeddings (for search/RAG) map arbitrary text to vectors where cosine or dot-product similarity correlates with semantic similarity. They are trained with contrastive/ranking losses (positive pairs close, negatives far).
Vector geometry & similarity
Cosine (angles) is scale-invariant; normalize vectors to unit length when using cosine.
Dot product benefits from learned norms but is sensitive to scale.
L2 is less common for language but fine for certain setups.
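A short sketch makes the differences between the three measures concrete. The vectors are made-up 2-D examples; real embeddings have hundreds of dimensions, but the identities are the same.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, ten times the length
c = [4.0, 3.0]

# Cosine ignores length: a and b are "identical". Dot product does not.
cos_ab, dot_ab, dot_aa = cosine(a, b), dot(a, b), dot(a, a)

# On unit vectors, dot product equals cosine, and squared L2 is 2 - 2*cosine,
# so all three measures give the same ranking.
ua, uc = normalize(a), normalize(c)
```

This is why normalizing at index time and at query time keeps cosine, dot, and L2 indexes interchangeable.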
Dimensionality trade-offs:
Low-dim (e.g., 256–512): cheaper, faster ANN; risk of lost nuance.
Mid (768–1024): good general default.
High (1536–3072): better headroom for multilingual or long passages but heavier memory/index cost.
4. Vectors in Systems: Indexing, Search, and Reranking
Chunking: 256–800 tokens is typical. Aim for semantic boundaries (paragraphs/sections). Maintain overlap (e.g., 20–30%) so that passages spanning a chunk edge remain retrievable.
Indexing algorithms (ANN)
HNSW (graph-based): strong recall/latency; tune M and efSearch (higher = better recall, slower).
IVF-PQ/OPQ (inverted file + product quantization): smaller memory; add recheck on a small candidate set to reduce PQ error.
DiskANN/ScaNN: DiskANN serves large indexes from SSD; ScaNN targets fast, high-recall search on large corpora.
Hybrid search: BM25 (lexical) + dense cosine often beats either alone; use late fusion (reciprocal rank fusion) or learned fusion.
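Reciprocal rank fusion needs only the ranked ID lists, not comparable scores. The sketch below assumes the commonly used constant k = 60; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d3", "d1", "d7"]   # lexical ranking
dense_hits = ["d1", "d5", "d3"]  # dense (cosine) ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["d1", "d3", "d5", "d7"]
```

Documents that both retrievers rank highly rise to the top, and BM25 scores never have to be calibrated against cosine scores.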
Reranking: Use a cross-encoder (bi-directional attention over [query, candidate]) on top-k ANN results for precision@k. It’s the single highest-ROI step in many RAG stacks.
Evaluation: report Recall@k, MRR, nDCG, and answer faithfulness when generation is involved. Track coverage (how often the gold passage can be retrieved at all) to separate retriever errors from generator errors.
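The retrieval metrics are simple enough to compute by hand. This is a minimal sketch assuming binary relevance and made-up document IDs; nDCG and faithfulness are left out because they need graded labels and a judge, respectively.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)

def coverage(runs):
    """Share of queries where at least one gold doc was retrieved at all."""
    return sum(1 for r, g in runs if set(r) & set(g)) / len(runs)

runs = [
    (["d2", "d9", "d4"], {"d4", "d8"}),  # first relevant hit at rank 3
    (["d1", "d5", "d6"], {"d1"}),        # first relevant hit at rank 1
]
mrr = mean_reciprocal_rank(runs)
```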
5. Positional Representations and Long Context
Absolute learned: simple, but hard to extrapolate beyond train window.
ALiBi: head-specific linear biases encourage monotonic decay over distance.
RoPE (rotary): rotates pairs of Q/K dimensions (equivalently, complex components) by position-dependent angles, which gives an efficient relative encoding: \(\text{RoPE}(q_i) = R(\theta_i)q_i,\ \text{RoPE}(k_j) = R(\theta_j)k_j\). Because \(R(\theta_i)^\top R(\theta_j) = R(\theta_j - \theta_i)\), dot products depend only on the relative position \(i-j\).
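The relative-position property of RoPE can be checked numerically. The sketch below assumes the interleaved-pair layout and the usual base of 10000; some implementations pair dimensions differently, and the vectors are arbitrary.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same offset (3) at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

# A different offset (1) gives a different score.
score_other = dot(rope(q, 5), rope(k, 4))
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` differs, which is the sense in which the attention logit sees only the offset.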
Extending context
NTK-aware RoPE scaling, YaRN, and Position Interpolation (PI) rescale RoPE positions or frequencies to cover larger windows with little or no retraining.
Segmented attention reduces the quadratic cost and ring attention distributes it across devices; both require careful engineering.
Memory tokens and state compression attempt to carry long-range info without full attention.
6. Training: Objectives and Signals
Pretraining (causal LM): maximize \( \sum_t \log p(x_t \mid x_{<t}) \). Tokenization is usually BPE or Unigram, with special tokens (system, tool, citation).
SFT (supervised fine-tuning): align on task format, tool protocols, and refusal behavior.
Preference learning (RLHF/DPO/KTO): optimize for human preferences, safety, calibration.
Tool-augmented training: include function-calling traces; reward tool correctness and citation quality.
Spec-aware training: add process rewards (schema adherence, unit tests pass, contract checks) so the model learns how to produce verifiable outputs, not just fluent text.
Compute budgeting: training FLOPs for a dense decoder-only model are roughly \(6 \times \text{tokens} \times \text{params}\). Data quality dominates once you scale; synthetic data helps if diverse, verified, and de-duplicated.
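The compute rule of thumb turns into a quick budget estimate. Every number in the example run below (model size, token count, GPU count, peak throughput, utilization) is a hypothetical input, not a measurement.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days, assuming sustained throughput = peak * utilization."""
    sustained = n_gpus * peak_flops_per_gpu * utilization
    return training_flops(n_params, n_tokens) / sustained / 86_400

# Hypothetical run: 7e9 params, 1e12 tokens,
# 256 GPUs at 3e14 peak FLOP/s each, 40% utilization.
flops = training_flops(7e9, 1e12)                 # about 4.2e22 FLOPs
days = training_days(7e9, 1e12, 256, 3e14, 0.40)  # roughly 16 days
```

Utilization is the input most worth measuring rather than guessing, since it scales the estimate directly.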
7. Inference: Latency, Throughput, and Memory
KV cache reduces per-token cost to attention between the new token and the stored history. Optimize with paged KV, int8/FP8 KV quantization, and grouped-query attention to cut memory bandwidth.
Speculative decoding: Draft model proposes tokens; verifier model confirms—often 1.5–3× speedup when acceptance rate is high.
Batching and continuous batching keep GPUs busy; tune max concurrency vs. tail latency.
Pipeline parallel + tensor parallel for very large models; balance with activation checkpointing if memory-bound.
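The speculative decoding idea listed above can be sketched with toy models. This is a simplified greedy variant under stated assumptions: both "models" are plain functions from a token list to the next token, and verification loops over positions, where a real system would score all proposals in one batched target pass and typically use sampling-based acceptance.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns the committed tokens."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed position in order.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Everything matched: the target also contributes one bonus token.
    accepted.append(target_next(ctx))
    return accepted

# Toy stand-ins: the target counts upward; the draft agrees except after a 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 0

all_match = speculative_step([10], draft, target, k=4)  # [11, 12, 13, 14, 15]
early_stop = speculative_step([1], draft, target, k=4)  # [2, 3, 4]
```

In both cases the committed tokens are exactly what the target would have produced alone, so the speedup comes purely from how many draft tokens survive each round.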
8. Quantization & Distillation (What Works Today)
Weight-only quantization:
- INT8 (LLM.int8()): safe default with minimal quality loss.
- NF4/INT4 (QLoRA/AWQ/GPTQ): 4-bit weights; add per-channel or per-group scales for stability.
KV cache quantization: often INT8 safe; INT4 is model- and task-dependent.
Activation quantization: harder, because activation outliers need calibration (e.g., SmoothQuant).
Distillation: smaller student trained on teacher traces; pair with task-specific data for best transfer.
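Per-channel scaling is the core trick behind most of the weight quantization schemes above. The sketch below shows plain symmetric absmax INT8 with round-to-nearest; it does not reproduce LLM.int8(), GPTQ, or AWQ, and the weight matrix is made up.

```python
def quantize_row_int8(row):
    """Symmetric absmax quantization of one output channel to int8 range."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

def quantize_matrix(W):
    """Per-channel: each row gets its own scale, so a large-magnitude row
    does not crush the resolution of the small-magnitude ones."""
    return [quantize_row_int8(row) for row in W]

W = [[0.02, -0.01, 0.03],   # small-magnitude channel
     [4.00, -2.50, 1.25]]   # large-magnitude channel
packed = quantize_matrix(W)
W_hat = [dequantize_row(q, s) for q, s in packed]

# Worst-case reconstruction error per row is about half that row's scale.
row_err = [max(abs(a - b) for a, b in zip(r, rh)) for r, rh in zip(W, W_hat)]
```

With a single per-tensor scale of 4/127, every entry in the first row would fall within one quantization step of zero; per-channel scales keep its error near 1e-4.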
9. Retrieval-Augmented Generation (RAG) That Holds Up
A robust RAG loop is a contract between retriever and generator:
1. Query planning (optionally multi-hop).
2. Retriever (hybrid dense+lexical), filters, and reranker.
3. Grounding: pass citations/snippets; require attribution.
4. Generator constrained to cite; verifier checks claims against sources.
5. Feedback: failure cases become hard negatives for retriever and counter-examples for the generator.
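The attribution part of that contract can be enforced mechanically. This minimal sketch assumes the generator marks sources as bracketed IDs such as `[doc1]`, a hypothetical convention; it checks only that citations exist and point at retrieved snippets, not that the cited text actually supports the claim.

```python
import re

CITE = r"\[([A-Za-z0-9_-]+)\]"

def check_citations(answer, retrieved_ids):
    """Return (cited_ids, unknown_ids, uncited_sentences) for an answer."""
    cited = set(re.findall(CITE, answer))
    # Citations that do not correspond to any retrieved snippet.
    unknown = cited - set(retrieved_ids)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(CITE, s)]
    return cited, unknown, uncited

answer = ("RoPE encodes relative position [doc1]. "
          "ALiBi adds linear biases [doc9]. "
          "Both extrapolate perfectly.")
cited, unknown, uncited = check_citations(answer, retrieved_ids=["doc1", "doc2"])
```

Here `doc9` was never retrieved and the last sentence carries no citation, so both would be flagged before a heavier claim-level verifier runs.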
Common pitfalls: over-chunking, no reranker, mixing vector spaces (don’t index cosine-normalized vectors and then use dot without renorm), ignoring temporal freshness in embeddings.
10. Security, Safety, and Governance
Prompt injection & tool abuse: validate/escape untrusted text; whitelist functions; inspect tool arguments.
Data leakage: embeddings can leak if vectors are reversible; protect with encryption at rest, tenant isolation, and careful logging redaction.
PII & compliance: schema-level redaction + policy checks at both retrieval and generation time.
Calibration & uncertainty gates: block low-confidence autonomous actions; escalate with reasons.
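A gate for tool calls can combine the allowlist, argument inspection, and the confidence threshold in one place. Everything here is hypothetical: the tool names, the schema format, the 0.8 threshold, and the assumption that a calibrated confidence score is available.

```python
ALLOWED_TOOLS = {
    # tool name -> {argument name: required Python type} (hypothetical schema)
    "search_docs": {"query": str, "top_k": int},
    "get_invoice": {"invoice_id": str},
}

def gate_tool_call(name, args, confidence, min_confidence=0.8):
    """Return ("allow" | "escalate" | "reject", reason)."""
    if name not in ALLOWED_TOOLS:
        return "reject", f"tool '{name}' is not on the allowlist"
    schema = ALLOWED_TOOLS[name]
    if set(args) != set(schema):
        return "reject", "argument names do not match the schema"
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return "reject", f"argument '{key}' has the wrong type"
    if confidence < min_confidence:
        # Low confidence is escalated with a reason rather than silently dropped.
        return "escalate", f"confidence {confidence:.2f} below {min_confidence:.2f}"
    return "allow", "ok"

decision = gate_tool_call("search_docs", {"query": "refund policy", "top_k": 5}, 0.95)
```

Structural checks run before the confidence check, so a malformed or unlisted call is rejected no matter how confident the model claims to be.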
11. Practical Tuning Playbook
Start with RMSNorm + RoPE + SwiGLU + GQA + FlashAttention.
Tie embeddings; use bfloat16 for stability.
Profile HBM bandwidth; if bound, prioritize KV quant, GQA, and FlashAttention before chasing more FLOPs.
For RAG, always add a cross-encoder reranker; it’s cheap leverage.
Track per-request cost, P90 latency, and pass@k together; optimize the bottleneck, not just one metric.
12. Mental Models for Builders
Transformers are feature routers. Attention routes which features interact; MLPs synthesize them.
Weights store priors; vectors navigate them. Pretraining builds a cognitive map; embeddings are your coordinates.
Throughput is a memory problem. Most production slowness is bandwidth/KV cache, not FLOPs.
Verification beats clever prompting. Schema checks, tests, and reranking move accuracy more than “better wording.”
Scale smartly. A smaller model with RAG, reranking, and verification often outperforms a larger, unguided one for enterprise tasks.
Conclusion
Transformers give us a programmable bias for compositional reasoning; weights encode the priors; embeddings turn knowledge into geometry; vector indexes make it searchable at scale. Shipping systems means balancing all four: efficient attention and KV memory, disciplined vector pipelines (indexing + reranking), quantization that preserves quality, and verification that converts plausible text into contract-satisfying outcomes. Build with those constraints in mind and you’ll get models that are not just impressive—but reliable, fast, and cost-effective in the real world.