Large Language Models (LLMs) Under the Hood: A Technical Deep Dive

This article walks through how modern Large Language Models (LLMs) actually work—from tokenization and transformer math to training data pipelines, optimization tricks, inference serving, and fine-tuning. It focuses on engineering details you can use when building, deploying, or evaluating LLM systems.

1) Tokens, Vocabularies, and Sequence Handling

Tokenization. Most LLMs operate on subword tokens (e.g., byte-pair encoding or unigram LM). Text is split into pieces that balance vocabulary size and coverage. A typical vocabulary ranges from 32k to 200k tokens. Multilingual and code-capable models often use byte-level schemes to guarantee any input is representable.
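
As a rough illustration of the byte-pair idea, the sketch below (a toy, not any production tokenizer) repeatedly merges the most frequent adjacent symbol pair in a tiny corpus:

    from collections import Counter

    def bpe_merges(words, num_merges):
        # Start from individual characters; learn merge rules greedily.
        corpus = [list(w) for w in words]
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for toks in corpus:
                pairs.update(zip(toks, toks[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
            merges.append((a, b))
            for toks in corpus:                        # apply the merge everywhere
                i = 0
                while i < len(toks) - 1:
                    if toks[i] == a and toks[i + 1] == b:
                        toks[i:i + 2] = [a + b]
                    else:
                        i += 1
        return merges, corpus

    merges, segmented = bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=6)
    print(merges)      # learned merge rules, e.g. ('e', 'r'), ('l', 'o'), ...
    print(segmented)   # each word split into subword tokens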

Context windows. Inputs are truncated or chunked to a fixed maximum length (e.g., 4k–200k tokens). Long-context support relies on positional encodings (see below), memory-efficient attention, and retrieval to avoid quadratic blow-ups.

Padding and masking. Sequences in a batch are padded to the same length; an attention mask prevents the model from attending to padding or to future tokens (causal mask).
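
A minimal sketch of building that combined mask (assuming right-padded batches and an additive mask that is added to the attention logits):

    import numpy as np

    def build_attention_mask(lengths, max_len):
        # lengths: true sequence lengths in a right-padded batch of size B.
        causal = np.tril(np.ones((max_len, max_len), dtype=bool))               # j <= i
        key_valid = np.arange(max_len)[None, :] < np.array(lengths)[:, None]    # (B, T) real tokens
        allowed = causal[None, :, :] & key_valid[:, None, :]                     # (B, T, T)
        # Additive form: 0 where attention is allowed, -inf where it is not.
        return np.where(allowed, 0.0, -np.inf)

    mask = build_attention_mask(lengths=[3, 5], max_len=5)
    print(mask.shape)   # (2, 5, 5); added to the logits before the softmax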

2) Transformer Architecture: What Each Layer Does

High level. An LLM is a stack of transformer decoder blocks. Each block contains:

  • LayerNorm/RMSNorm

  • Multi-Head Self-Attention (MHSA)

  • Feed-Forward Network (FFN), often with gated activations (e.g., SwiGLU)

  • Residual connections

Self-attention math.

  • Project hidden states (X \in \mathbb{R}^{B \times T \times d}) to queries (Q), keys (K), values (V).

  • Compute attention weights (A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + \text{mask} + \text{pos\_bias}\right)).

  • Aggregate outputs (O = AV), then merge heads and pass to FFN.
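
Putting those steps together, a minimal single-head version (numpy, causal additive mask; multi-head attention runs this per head and concatenates the results):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V, mask=None):
        # Q, K, V: (T, d_k) for one head; mask: (T, T) additive (0 or -inf).
        d_k = Q.shape[-1]
        logits = Q @ K.T / np.sqrt(d_k)
        if mask is not None:
            logits = logits + mask
        A = softmax(logits, axis=-1)   # attention weights
        return A @ V                   # aggregated values, shape (T, d_k)

    T, d_k = 4, 8
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
    causal = np.where(np.tril(np.ones((T, T), dtype=bool)), 0.0, -np.inf)
    print(attention(Q, K, V, mask=causal).shape)   # (4, 8)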

Positional methods.

  • RoPE (rotary) encodes relative positions by rotating (Q,K) dimension pairs through position-dependent angles (equivalently, multiplying by complex phases); with frequency or position scaling it holds up well at long lengths (see the sketch after this list).

  • ALiBi adds linear distance biases directly to attention logits; allows length extrapolation without explicit embeddings.

  • Learned absolute embeddings also exist but extrapolate poorly.
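
A small sketch of the rotary idea from the first item above: pair up dimensions and rotate each pair by a position-dependent angle (the frequency schedule below is the common (10000^{-2i/d}) choice; exact details vary by model):

    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (T, d) queries or keys, d even; rotate each (even, odd) dimension pair.
        T, d = x.shape
        half = d // 2
        freqs = base ** (-np.arange(half) / half)    # per-pair rotation frequencies
        angles = np.outer(positions, freqs)          # (T, d/2) angles
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin           # 2-D rotation of each pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    q = np.random.default_rng(0).standard_normal((6, 8))
    q_rot = rope(q, positions=np.arange(6))
    # Dot products between rotated queries and keys then depend only on relative offsets.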

Mixture-of-Experts (MoE). Sparse layers replace some FFNs with a bank of experts (e.g., 16–256). A router selects top-k experts per token. Benefits: higher parameter count at similar FLOPs. Costs: load balancing, routing jitter, and serving complexity.
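
A rough sketch of top-k routing (softmax gate over experts, keep the top k, renormalize; real systems add load-balancing losses, capacity limits, and batched expert kernels):

    import numpy as np

    def route_top_k(h, router_w, experts, k=2):
        # h: (d,) one token's hidden state; router_w: (num_experts, d); experts: callables.
        logits = router_w @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top = np.argsort(probs)[-k:]                 # indices of the k highest-scoring experts
        gate = probs[top] / probs[top].sum()         # renormalized gate weights
        return sum(g * experts[i](h) for g, i in zip(gate, top))

    d, n_experts = 16, 8
    rng = np.random.default_rng(0)
    experts = [lambda x, W=rng.standard_normal((d, d)): W @ x for _ in range(n_experts)]
    out = route_top_k(rng.standard_normal(d), rng.standard_normal((n_experts, d)), experts)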

Stabilization choices. Pre-norm vs. post-norm, RMSNorm vs. LayerNorm, QK normalization, and attention scaling tweaks reduce training instabilities in deep stacks.

3) The Training Pipeline

Objective. Next-token prediction (autoregressive). Loss = cross-entropy over the vocabulary at each time step.
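
In code, the objective is simply the negative log-likelihood of the token that actually appears next (a toy sketch with random stand-in logits):

    import numpy as np

    def next_token_loss(logits, token_ids):
        # logits: (T, V) next-token scores at each position; token_ids: (T+1,) actual sequence.
        targets = token_ids[1:]                                    # target at step t is token t+1
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(len(targets)), targets].mean()

    rng = np.random.default_rng(0)
    T, V = 8, 100
    loss = next_token_loss(rng.standard_normal((T, V)), rng.integers(0, V, size=T + 1))
    print(loss)   # about log(V) ≈ 4.6 for random logits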

Datasets. Diverse corpora (web, code, books, multilingual), heavily deduplicated and filtered. Mixing strategies assign sampling weights per shard/domain. For code models, repositories are filtered by license and quality signals; tests and docs are valuable supervision.

Curriculum & packing.

  • Token packing: concatenate documents to minimize padding while respecting boundaries.

  • Mixture schedules: gradually increase harder distributions (e.g., code, math).

  • Temperature sampling: balances domain diversity at train time.

Scaling laws. For a given compute budget (C), choose model size (N) and total tokens (D) so the loss is near compute-optimal. Under-trained large models waste parameters; smaller, well-trained ones often win.
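
As a rough worked example, using the common approximation (C \approx 6ND) and the frequently cited heuristic of roughly 20 training tokens per parameter (both are assumptions that vary by setup):

    def compute_optimal_split(flops_budget, tokens_per_param=20.0):
        # Assumes C ~= 6 * N * D and D ~= tokens_per_param * N (rule-of-thumb values only).
        n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
        return n_params, tokens_per_param * n_params

    N, D = compute_optimal_split(1e23)                        # a 1e23-FLOP training budget
    print(f"{N / 1e9:.0f}B params, {D / 1e12:.2f}T tokens")   # roughly 29B params, 0.58T tokens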

Optimization.

  • AdamW/Lion optimizers with linear warmup followed by cosine or exponential decay (a schedule sketch follows this list).

  • Mixed precision (fp16/bf16), with loss scaling when fp16 is used.

  • Gradient clipping, gradient checkpointing for memory.

  • Large-batch training with ZeRO (optimizer/state sharding) and distributed parallelism:

    • DP (data parallel): split batches across workers.

    • TP (tensor parallel): split matrix multiplies across devices.

    • PP (pipeline parallel): split layers into stages.

    • SP (sequence parallel): shard along sequence length for memory relief.

    • 3D parallel: combine DP/TP/PP.
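
A minimal sketch of the warmup-plus-cosine schedule mentioned at the top of this list (linear warmup to a peak learning rate, then cosine decay to a floor; all constants are placeholders):

    import math

    def lr_at_step(step, warmup_steps=2000, total_steps=100_000,
                   peak_lr=3e-4, min_lr=3e-5):
        # Linear warmup from 0 to peak_lr, then cosine decay down to min_lr.
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

    print(lr_at_step(1_000))     # mid-warmup: 1.5e-4
    print(lr_at_step(100_000))   # end of training: 3e-5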

Evaluation during training. Held-out perplexity curves, domain-specific dev sets (e.g., coding/math), and adversarial subsets to monitor regressions.

4) Instruction Tuning and Preference Optimization

Supervised fine-tuning (SFT). Train on instruction–response pairs to make outputs follow directions.

Preference learning.

  • RLHF: train a reward model from human preference pairs; optimize the policy (model) to maximize reward under a KL penalty.

  • DPO: optimize directly on preference pairs via a closed-form objective, avoiding the PPO loop; simpler and often more stable.
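
A sketch of the DPO objective for a single preference pair, written in terms of the (summed) log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; beta is the usual strength hyperparameter:

    import math

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Implicit rewards are the policy-vs-reference log-probability ratios.
        chosen_ratio = logp_chosen - ref_logp_chosen
        rejected_ratio = logp_rejected - ref_logp_rejected
        margin = beta * (chosen_ratio - rejected_ratio)
        return math.log(1 + math.exp(-margin))   # -log sigmoid(margin)

    # The policy prefers the chosen response a bit more than the reference does, so loss < log 2.
    print(dpo_loss(-42.0, -55.0, -44.0, -53.0))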

Safety & guardrails. Additional datasets encode refusal policies, safety taxonomies, and content filters. Classifiers and constrained decoding (e.g., safety grammars) are applied during inference.

5) Inference: Decoding, KV Caches, and Throughput

Autoregressive loop. Given a prompt, the model emits one token at a time. Key/value tensors from attention are cached to avoid recomputing attention over the prefix.

KV cache. For each layer and head, store (K,V) for every position processed so far (prompt and generated). Memory scales as (O(B \times L \times H \times d_{\text{head}} \times T)) (batch size, layers, KV heads, head dimension, sequence length), with a factor of two for storing both K and V. Engineering tricks (a sizing sketch follows the list):

  • Paged KV: manage cache in fixed-size pages to reduce fragmentation and enable efficient preemption/merging.

  • Continuous batching: accept new requests mid-batch; scheduler interleaves decoding steps.

  • Speculative decoding: a small “drafter” proposes multiple tokens; the main model verifies them in parallel and keeps the accepted prefix. With rejection-based verification this improves latency without changing the output distribution.
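
A back-of-the-envelope sizing sketch for the cache (the model shape below is an illustrative placeholder, not any specific model's configuration):

    def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        # Factor of 2 for storing both K and V; fp16/bf16 -> 2 bytes per element.
        return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

    # Hypothetical 32-layer model, 8 KV heads of dim 128, batch 16, 8k-token contexts.
    gib = kv_cache_bytes(batch=16, seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{gib:.0f} GiB of KV cache")   # 16 GiB, before any weights or activations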

Decoding strategies.

  • Greedy or beam search for determinism (risk: bland outputs).

  • Sampling with temperature, top-k, or top-p (nucleus); lower temperature gives more deterministic output (see the sampling sketch after this list).

  • Penalties (frequency/presence) reduce repetition.

  • Constrained decoding: force outputs to match a grammar or JSON schema.
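
A compact sketch of temperature plus top-k/top-p sampling over one decoding step's logits (numpy; production decoders fuse this on the accelerator):

    import numpy as np

    def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
        rng = rng or np.random.default_rng()
        logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]                     # tokens by probability, descending
        p_sorted = probs[order]
        keep = np.zeros_like(p_sorted, dtype=bool)
        keep[:top_k] = True                                 # top-k: only the k most likely tokens
        keep &= np.cumsum(p_sorted) - p_sorted < top_p      # top-p: smallest prefix reaching mass p
        p_sorted = np.where(keep, p_sorted, 0.0)
        p_sorted /= p_sorted.sum()
        return order[rng.choice(len(order), p=p_sorted)]

    token_id = sample_token(np.random.default_rng(0).standard_normal(1000))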

Streaming. Emit tokens incrementally to the client for responsive UX (server keeps the decoding loop tight; clients reassemble).

Latency and cost. Prefill is dominated by matmul FLOPs; decode is typically bound by memory bandwidth (streaming weights and the KV cache). Useful metrics:

  • TTFT (time to first token)

  • TPS (tokens per second) per request and per GPU

  • TPD (tokens per dollar), to compare deployments economically

6) Quantization, Distillation, and LoRA

Quantization. Reduce weights/activations from fp16/bf16 to int8/int4 (sometimes NF4/FP8 variants).

  • Post-training quantization (PTQ) vs. quantization-aware training (QAT).

  • Outlier handling (per-channel/per-group scales) preserves attention quality.

  • Gains: 2–4× memory reduction; sometimes modest speedups if kernels are optimized.
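
A toy illustration of post-training weight quantization with per-group absmax scales (int8 rounding; real schemes handle outliers, activation quantization, and packing with far more care):

    import numpy as np

    def quantize_per_group(w, group_size=64):
        # w: 1-D weight slice; per-group scales keep outliers from distorting everything else.
        w = w.reshape(-1, group_size)
        scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
        q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
        return q, scales

    def dequantize(q, scales):
        return q.astype(np.float32) * scales

    w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
    q, s = quantize_per_group(w)
    print(np.abs(dequantize(q, s).reshape(-1) - w).mean())   # small mean absolute error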

Distillation. Train a smaller student to match teacher logits or hidden states on large unlabeled corpora. Often combined with instruction data to create compact assistants.

Parameter-efficient fine-tuning (PEFT).

  • LoRA injects low-rank adapters into attention/FFN projections; only the adapter matrices are trained (see the sketch after this list).

  • Prefix/prompt tuning learns virtual tokens prepended to inputs.

  • Benefits: small trainable footprint, multi-task adapters, faster iteration.
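
A sketch of the LoRA item above: a frozen projection W is augmented with a trainable low-rank update (B A), scaled by alpha/r (toy numpy version; B starts at zero so training begins from the pretrained behavior):

    import numpy as np

    class LoRALinear:
        def __init__(self, W, r=8, alpha=16, rng=None):
            rng = rng or np.random.default_rng(0)
            self.W = W                                             # frozen weight, (d_out, d_in)
            self.A = rng.standard_normal((r, W.shape[1])) * 0.01   # trainable, small init
            self.B = np.zeros((W.shape[0], r))                     # trainable, zero init
            self.scale = alpha / r

        def __call__(self, x):
            # Base projection plus the low-rank adapter update.
            return self.W @ x + self.scale * (self.B @ (self.A @ x))

    layer = LoRALinear(W=np.random.default_rng(1).standard_normal((512, 512)))
    y = layer(np.random.default_rng(2).standard_normal(512))
    # Only A and B (2 * 8 * 512 values) would receive gradients; W (512 * 512) stays frozen.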

7) Long-Context and Memory Methods

Positional strategies. RoPE scaling and ALiBi support extrapolation; dynamic RoPE frequency scaling helps with very long contexts.

Sparse/linear attention. Windowed or block-sparse patterns reduce the (O(T^2)) cost to near-linear for long sequences; the trade-off is reduced global information flow.

Chunking + retrieval. Rather than force everything into the context, retrieve top-k chunks from a vector index and condition generation (RAG). This keeps contexts short and fresh while maintaining citations.

External memory. Some systems maintain summaries or key states across turns (e.g., rolling summaries, memory tokens), with periodic compaction.

8) Tool Use and Function Calling

Rationale. LLMs are probabilistic; tools are precise. Delegate math, database queries, web search, code execution, or policy checks to tools.

Interfaces. Schemas specify callable functions with arguments. The model emits a tool call; the runtime executes it and returns structured results; the model then continues with grounded information.
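
A hedged sketch of that round trip (the schema layout and function name are illustrative, not any particular vendor's API):

    import json

    # A callable the runtime exposes to the model, described by a JSON-style schema.
    weather_tool = {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }

    def run_tool_call(model_output: str) -> str:
        # The model emits a structured call; the runtime validates it, executes, and returns results.
        call = json.loads(model_output)
        assert call["name"] == "get_weather" and "city" in call["arguments"]
        result = {"city": call["arguments"]["city"], "temp_c": 21}   # stand-in for a real lookup
        return json.dumps(result)                                    # fed back as grounded context

    print(run_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))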

Validation. Downstream services should enforce schemas, rate limits, and authorization; treat the LLM's output as untrusted input.

9) Evaluation Beyond Benchmarks

Offline benchmarks. Perplexity; suite scores (reasoning, code, math, multilingual). Good for coarse comparison, not sufficient for production readiness.

Task-level tests. Exact match/F1 on templated tasks, code execution success, unit tests passing, SQL correctness.

Operational metrics. Accuracy under drift, refusal rates, escalation rates, latency/variance, throughput, cost, and incident counts (e.g., policy violations).

A/B and shadowing. Compare new models or decoding policies on real traffic with canary routes; keep rollback ready.

10) Serving Architectures and Orchestration

Engines. High-throughput servers integrate:

  • Tensor/graph compilers and fused kernels (FlashAttention-style).

  • Paged KV caches and continuous batching.

  • CUDA graphs for stable kernel launch overhead.

  • Multi-tenant schedulers with fairness/priority.

Parallelism at inference.

  • Tensor parallel for large layers.

  • Pipeline parallel across layer groups.

  • Speculative pipelines (drafter/verifier on separate GPUs) for latency wins.

Autoscaling. Queue depth, arrival rate, and active token rate drive scale-out; cold-start penalties can be mitigated with warm pools.

Caching. Prompt and prefix caching reuse attention states for repeated prefixes (e.g., system prompts, RAG boilerplate).

11) Security, Safety, and Compliance (Engineering View)

  • Prompt injection defenses: input sanitization, provenance labels, tool allow-lists, and untrusted tool sandboxes.

  • Data governance: separate fine-tune data, inference prompts, and logs; retention policies; PII redaction.

  • Response control: safety classifiers, content policies, and constrained decoding grammars.

  • Supply chain: record model/version, dataset lineage, license compliance, and SBOM-style artifacts for audits.

12) Putting It Together: A Minimal but Realistic Stack

  1. Data layer. Curated corpora → dedupe → quality filters → document store + embedding index.

  2. Base model. Pretrained transformer with RoPE/ALiBi, fused kernels, and long-context settings.

  3. Instruction layer. SFT + preference optimization (DPO/RLHF) + safety fine-tuning.

  4. Serving. Quantized weights where feasible, paged KV, continuous batching, speculative decoding, streaming output.

  5. Retrieval. RAG gateway that injects grounded passages and citations; freshness policies.

  6. Tools. Function-calling runtime with schema validation and sandboxed executors.

  7. Observability. Prompt/version lineage, metrics, traces, eval dashboards, and canary deployments.

  8. PEFT. LoRA adapters per domain/team for fast iteration without retraining the base.

13) Practical Tips and Gotchas

  • Measure token shapes. Many latency issues come from a few huge prompts; trim boilerplate with prefix caching and retrieval.

  • Stabilize decoding. Keep temperature modest for critical tasks; use grammar constraints for JSON.

  • Watch KV memory. Long generations at high batch sizes will hit HBM limits before compute—paginate caches and cap max new tokens.

  • Prefer smaller models + tools. A well-grounded 7–14B model with retrieval and tools often beats a raw 70B model for cost, latency, and reliability.

  • Version everything. System prompts, safety policies, retrieval indexes, and tool schemas should have versions tied to outputs.


Bottom Line

LLMs are scalable probabilistic sequence models wrapped in a lot of systems engineering. The ingredients that matter most in practice are predictable: clean tokenization, stable transformer blocks, compute-optimal training, careful instruction tuning, and a production-grade inference stack (paged KV, batching, speculative decoding, quantization). Surround the model with retrieval, tools, and observability, and you transform a text generator into a dependable component of real software.