Introduction
Modern LLM systems are built from a few core building blocks that often get conflated. Tokens are the budget and interface, vectors/embeddings are how meaning gets organized and retrieved, weights are the learned capability, and prompts/contracts tell the model how to use evidence. When you understand how these pieces fit, you can ship apps that are faster, cheaper, and measurably more reliable—not just more “intelligent.”
1) Tokens
What they are: The basic units an LLM reads/writes (subwords/characters/bytes).
Why they matter: Cost, latency, and context limits are all in tokens, not characters.
Mental model: Text → tokenizer → integer IDs. Models operate on IDs, not raw text.
Gotchas: Different tokenizers split the same text into different numbers of tokens, so “cost per 1K tokens” isn’t directly comparable across models.
Extra insight: Think of tokens as both currency and bandwidth. Every extra header sentence, redundant instruction, or bloated few-shot example spends budget and crowds out evidence. Well-designed apps treat token usage as an SLO: they set per-route caps, log p50/p95 usage, and refactor prompts like engineers refactor code.
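To make the text → tokenizer → integer IDs step concrete, here is a minimal Python sketch. It assumes the tiktoken library and its cl100k_base encoding; the ID counts and the per-route cap are illustrative, and your model’s tokenizer may split the same text differently.

```python
# Minimal sketch: text -> tokenizer -> integer IDs, plus a token-budget check.
# Assumes the `tiktoken` library; the encoding name and cap are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Ship apps that are faster, cheaper, and measurably more reliable."
ids = enc.encode(text)            # the integer IDs the model actually operates on
print(len(text), "characters ->", len(ids), "tokens")

# Treat token usage as a budget: flag prompts that blow their per-route cap.
PROMPT_TOKEN_CAP = 2_000          # hypothetical per-route SLO
if len(ids) > PROMPT_TOKEN_CAP:
    print("prompt exceeds its token budget; refactor before shipping")
```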
2) Vectors
What they are: Numeric arrays (ℝⁿ) representing things (tokens, sentences, images, SQL rows).
Why they matter: Models compare vectors with algebra instead of brittle string matching.
Common ops: cosine similarity, dot product, nearest-neighbor search.
Gotchas: Vector dimension and normalization change similarity behavior and index choice.
Extra insight: Vectors enable soft matching—useful when users describe concepts indirectly. But algebra is only half the story: pairing vector search with metadata filters (tenant, recency, jurisdiction) is what turns “seems similar” into “eligible and correct.” Treat filters as first-class selectors, not afterthoughts.
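A small NumPy sketch of that idea: cosine similarity handles the soft matching, but metadata filters run first so only eligible items get ranked. The tenant and year fields are hypothetical; swap in whatever eligibility metadata your app stores.

```python
# Sketch: filter for eligibility first, then rank by cosine similarity.
# The document metadata fields (tenant, year) are made-up examples.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """cos(a, b) = (a . b) / (||a|| ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.1, 0.7, 0.2])
docs = [
    {"vec": np.array([0.1, 0.6, 0.3]), "tenant": "acme", "year": 2024},
    {"vec": np.array([0.1, 0.7, 0.2]), "tenant": "other", "year": 2021},
]

# Hard eligibility first (first-class selectors), soft matching second.
eligible = [d for d in docs if d["tenant"] == "acme" and d["year"] >= 2023]
ranked = sorted(eligible, key=lambda d: cosine(query, d["vec"]), reverse=True)
print([round(cosine(query, d["vec"]), 3) for d in ranked])
```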
3) Embeddings
What they are: Vectors that capture meaning, produced by an embedding model so that semantically similar inputs land near each other.
Use cases: Search/RAG, dedupe, clustering, anomaly detection, recommendation.
Similarity: \(\text{cosine}(a,b)=\frac{a\cdot b}{\lVert a\rVert\,\lVert b\rVert}\). Higher ⇒ more similar.
Gotchas: Domain drift; chunking and metadata filters often matter more than a bigger embedding model.
Extra insight: Great retrieval is 60% preparation: chunk at semantic boundaries, keep claims atomic, attach source IDs and timestamps, and store normalized entities. When recall looks weak, first fix chunking and indexing hygiene; only then consider swapping models or dimensions.
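A rough sketch of sentence-aware chunking with source metadata attached. It assumes a naive regex sentence splitter and a word count as a stand-in for a real tokenizer; the 120-word budget and field names are illustrative, not recommendations.

```python
# Sketch: split on sentence boundaries, pack sentences up to a size budget,
# and attach source_id / effective_date to every chunk. Budgets are illustrative.
import re

def chunk(text: str, source_id: str, effective_date: str, max_words: int = 120):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        if current and len(" ".join(current + [sent]).split()) > max_words:
            chunks.append({"text": " ".join(current),
                           "source_id": source_id,
                           "effective_date": effective_date})
            current = []
        current.append(sent)
    if current:
        chunks.append({"text": " ".join(current),
                       "source_id": source_id,
                       "effective_date": effective_date})
    return chunks

doc = "Refunds are allowed within 30 days. Shipping is free over $50."
for c in chunk(doc, source_id="policy-7", effective_date="2024-06-01"):
    print(c["source_id"], c["effective_date"], "|", c["text"])
```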
4) Weights
What they are: Learned parameters inside the model’s layers.
Why they matter: Weights encode general knowledge and inductive biases.
Changing them: Fine-tuning/LoRA updates weights; prompts do not—they condition runtime behavior.
Extra insight: Use fine-tuning for style, format, or domain habits you need everywhere; use prompt and policy contracts for situational rules (freshness windows, source tiering, abstention). This separation lets you upgrade models safely while keeping live policies adjustable without another training cycle.
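A minimal sketch of what “changing weights” looks like in practice, assuming Hugging Face transformers and peft; the base model ID, target modules, and ranks are placeholders you would tune for your own stack, not a definitive recipe.

```python
# Sketch: update weights via a LoRA adapter rather than via prompts.
# Assumes `transformers` + `peft`; model name and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder ID

lora = LoraConfig(
    r=8,                                  # low-rank adapter size
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (model-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)        # only the adapter weights will be trained
model.print_trainable_parameters()

# Situational rules (freshness windows, source tiering, abstention) stay in the
# prompt/policy contract, so they can change without another training run.
```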
5) How It All Fits Together (in RAG/agent apps)
User query → tokens → vectors.
Embeddings → retrieval from a vector index + filters.
Retrieved chunks → tokens + prompt contract (rules/format).
LLM weights transform vectors → predicted tokens (answer).
Optional tools return structured data back into the loop.
Extra insight: The glue is the prompt contract: instructions that tell the model how to rank evidence, resolve conflicts, cite sources, and abstain. Without it, even perfect retrieval yields fluent but ungoverned text. With it, tokens become auditable decisions tied to sources.
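A schematic Python sketch of that loop. The embed, search, and call_llm callables are hypothetical stand-ins for your embedding model, vector index, and LLM client, and the contract text is an illustrative example rather than a canonical template.

```python
# Sketch of the RAG loop: embed the query, retrieve with filters, pack evidence
# under a prompt contract, and let the model's weights produce the answer tokens.
from typing import Callable, TypedDict

class Chunk(TypedDict):
    text: str
    source_id: str
    effective_date: str

# Illustrative prompt contract: how to use evidence, not what to know.
PROMPT_CONTRACT = (
    "Answer only from the EVIDENCE below and cite source_id for each claim. "
    "If sources conflict, prefer the newest effective_date. "
    "If the evidence is insufficient, say so instead of guessing."
)

def answer(
    query: str,
    tenant: str,
    embed: Callable[[str], list[float]],      # your embedding model (hypothetical)
    search: Callable[..., list[Chunk]],       # your vector index + filters (hypothetical)
    call_llm: Callable[[str], str],           # your LLM client (hypothetical)
) -> str:
    q_vec = embed(query)                                          # query -> embedding
    chunks = search(q_vec, top_k=10, filters={"tenant": tenant})  # ANN + metadata filters
    evidence = "\n".join(
        f"[{c['source_id']} | {c['effective_date']}] {c['text']}" for c in chunks
    )
    prompt = f"{PROMPT_CONTRACT}\n\nEVIDENCE:\n{evidence}\n\nQUESTION: {query}"
    return call_llm(prompt)                                       # weights turn tokens into the answer
```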
6) Practical Rules of Thumb
Keep prompts tight; cap generation; measure token p50/p95.
Prioritize chunking and metadata filters over swapping embedding models.
Start with 384–1024 dims; only go higher if evals improve.
Choose indexes to fit scale/recall needs; add PQ/OPQ compression only after measuring the recall loss.
Normalize consistently for cosine; use the same settings at index-write and query time.
Store source_id/effective_date/tenant with vectors.
Require minimal-span citations; avoid overstuffed context.
Extra insight: Evaluate $/successful outcome, not just retrieval@k or tokens/call. A smaller model with disciplined prompts and context shaping often beats a larger one that thrashes through irrelevant text.
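A small sketch of the token-SLO habit from the first rule, assuming NumPy for the percentile math; the route names, counts, and caps are made-up examples.

```python
# Sketch: log tokens per successful call by route, then report p50/p95 against caps.
import numpy as np

usage = {  # tokens consumed per successful call, keyed by route (illustrative)
    "search_answer": [512, 640, 598, 1210, 530, 575, 2048, 610],
    "summarize":     [900, 870, 1020, 880, 910, 3050, 895, 940],
}
caps = {"search_answer": 1500, "summarize": 2500}   # hypothetical per-route caps

for route, tokens in usage.items():
    p50, p95 = np.percentile(tokens, [50, 95])
    flag = "OVER CAP" if p95 > caps[route] else "ok"
    print(f"{route}: p50={p50:.0f} p95={p95:.0f} cap={caps[route]} [{flag}]")
```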
7) Frequent Confusions (Cleared Up)
“More tokens ⇒ better answers.” Not necessarily; irrelevant context crowds out evidence and raises cost.
“Embeddings = LLM.” Different roles: embedding models map text to vectors for retrieval; the LLM generates tokens from context.
“Fine-tuning fixes retrieval.” It doesn’t; retrieval is a data/ops problem (chunking, filters, indexing), not a weights problem.
“Vectors are interpretable.” Not individually; meaning lives in relative distances, not in single dimensions.
Extra insight: Another common mix-up: context window ≠ memory. Long prompts help, but they don’t create durable state. Use retrieval + stores for memory; use prompts to define how memory is used (priority, freshness, abstention).
8) Minimal Eval Kit
Retrieval: nDCG@k / Recall@k; conflict & staleness tests.
Grounding: Citation precision/recall; minimal-span checks.
Cost/Latency: Tokens per successful answer; p95 latency; cache hits.
Safety: Abstention quality on under-evidenced cases.
Extra insight: Record golden traces—real, anonymized tasks with fixed context packs—and replay them in CI anytime you change prompts, models, or retrieval policy. This catches regressions before users do and ties metrics to real workflows.
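A minimal sketch of two of these checks, Recall@k for retrieval and citation precision/recall for grounding, run over a hypothetical golden trace; a real suite would add nDCG, staleness, and abstention cases.

```python
# Sketch: two eval-kit metrics over a hypothetical golden trace.
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def citation_precision_recall(cited_ids, supporting_ids):
    cited, support = set(cited_ids), set(supporting_ids)
    precision = len(cited & support) / max(len(cited), 1)
    recall = len(cited & support) / max(len(support), 1)
    return precision, recall

# Golden trace: a real (anonymized) task with a fixed expected context pack.
retrieved = ["doc-3", "doc-9", "doc-1", "doc-7"]
relevant = ["doc-3", "doc-1", "doc-4"]
print("Recall@4:", round(recall_at_k(retrieved, relevant, k=4), 2))

cited = ["doc-3", "doc-9"]
supporting = ["doc-3", "doc-1"]
print("Citation P/R:", citation_precision_recall(cited, supporting))
```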
9) Quick Start Config (sane defaults)
Embedding dim: 768; chunk size: 400–700 tokens (sentence-aware).
Top-k: 8–12 with recency/tier re-rank.
Similarity: cosine with L2 normalization.
Index: HNSW (M=32, efConstruction=200; query ef=64–128).
Prompt contract: require citations, tie-break by newest, abstain on low coverage.
Extra insight: Revisit these defaults monthly. As your corpus grows or shifts (new regions, policies, product names), adjust chunk sizes, top-k, and filters. Drift is normal—treat your retrieval and prompt settings as living, versioned configuration, not one-off tweaks.
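A sketch of these defaults wired into an index, assuming the hnswlib library; the corpus size and random vectors are placeholders, and recency/tier re-ranking plus metadata filters would sit on top of the raw top-k.

```python
# Sketch: HNSW index with the defaults above (M=32, efConstruction=200, query ef=96).
# Assumes `hnswlib`; vectors and corpus size are random placeholders.
import numpy as np
import hnswlib

DIM = 768
index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=100_000, M=32, ef_construction=200)

vectors = np.random.rand(1_000, DIM).astype(np.float32)   # stand-in embeddings
ids = np.arange(len(vectors))
index.add_items(vectors, ids)

index.set_ef(96)                                          # query-time ef in the 64-128 range
query = np.random.rand(1, DIM).astype(np.float32)
labels, distances = index.knn_query(query, k=10)          # raw top-k before re-rank/filters
print(labels[0], distances[0])
```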
Conclusion
The fastest way to improve an LLM system is rarely “a bigger model.” It’s usually tighter token budgets, cleaner retrieval with embeddings and vectors, clearer contracts that tell the model how to behave, and targeted fine-tuning when habits—not policies—need to change. Align tokens, vectors/embeddings, weights, and prompts with outcome-based evaluations, and you’ll convert language capability into durable, cost-controlled products that scale.