
From Tokens to Dollars: Cost Engineering for LLM Apps

Cost isn’t a line item you reconcile at month-end; it’s a design constraint you encode from day one. Turning tokens into dollars—and dollars back into product value—means budgeting, caching, compressing, mixing models intelligently, and watching outcomes instead of raw usage. Here’s a pragmatic playbook for hard-nosed economics in LLM applications.

Token Budgets

Treat tokens like CPU cycles: allocate, measure, and enforce. Set per-request budgets for prompt, context, and generation, and per-workflow budgets for entire user journeys. Express these as hard caps and soft targets (e.g., 1.5k prompt, 2k context, 400 gen, soft target 3.2k total, hard cap 4k).
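
A minimal sketch of what such a budget can look like in code; the class, field names, and numbers are illustrative rather than any particular framework's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenBudget:
    """Illustrative per-request budget: hard caps fail fast, soft targets alert."""
    prompt_max: int = 1_500
    context_max: int = 2_000
    generation_max: int = 400
    soft_total: int = 3_200
    hard_total: int = 4_000

def enforce(budget: TokenBudget, prompt: int, context: int, generation: int) -> str:
    total = prompt + context + generation
    if (prompt > budget.prompt_max or context > budget.context_max
            or generation > budget.generation_max or total > budget.hard_total):
        raise ValueError(f"token budget exceeded: {total} tokens")
    # Soft target: emit a warning metric instead of failing the request.
    return "over_soft_target" if total > budget.soft_total else "ok"
```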

Design prompts and outputs to your budget: fixed headers, strict JSON schemas, no redundant role text, and trimmed few-shot examples. Add evidence windows so you only pass the most relevant context, not a full dossier. Track p50/p95 tokens by route to catch silent bloat. Budgets force copy to be concise, retrieval to be selective, and outputs to be machine-operable—not verbose.
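
Tracking p50/p95 by route doesn't need special tooling; a rough sketch over logged token counts, with made-up route names and numbers:

```python
from collections import defaultdict
from statistics import quantiles

# token_log: (route, total_tokens) pairs pulled from request logs.
token_log = [("qa", 2900), ("qa", 3100), ("extract", 900), ("qa", 3900), ("extract", 1100)]

by_route: dict[str, list[int]] = defaultdict(list)
for route, tokens in token_log:
    by_route[route].append(tokens)

for route, samples in by_route.items():
    p50 = quantiles(samples, n=100)[49]   # median
    p95 = quantiles(samples, n=100)[94]   # the tail is where silent bloat shows up first
    print(f"{route}: p50={p50:.0f} p95={p95:.0f} tokens")
```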

Cache Hits

Every missed cache is a self-inflicted bill. Layer caches for different reuse patterns:

  • Prompt skeletons: deterministic templates keyed by contract/version.

  • Retrieval results: query→top-k doc IDs with short TTLs.

  • Model responses: exact-match or normalized-input caches for idempotent calls (deterministic temperature, sorted params).

  • KV cache / attention cache: reuse prefill states within sessions and across steps in multi-turn flows.

Define cacheability up front: temperature ≤ 0.2, no timestamps in prompts, normalized whitespace/IDs. Track hit-rate × average saved tokens to convert cache ops into dollar savings. Misses should log why (e.g., params changed, TTL expired), so you can tighten inputs or extend lifetimes where safe.
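
A sketch of the response-cache layer with normalized keys, a cacheability guard, and miss-reason logging; the call_model and log callables are stand-ins for whatever your stack provides, and the token estimate is deliberately rough:

```python
import hashlib
import json
import time

CACHE: dict[str, tuple[float, str]] = {}   # key -> (expires_at, response)
TTL_SECONDS = 3600

def cache_key(prompt: str, params: dict) -> str:
    """Normalize inputs so trivially different requests still land on the same key."""
    normalized = " ".join(prompt.split())            # collapse whitespace
    canonical = json.dumps(params, sort_keys=True)   # sorted params
    return hashlib.sha256((normalized + canonical).encode()).hexdigest()

def cached_call(prompt: str, params: dict, call_model, log) -> str:
    if params.get("temperature", 0.0) > 0.2:
        log("miss", reason="non_deterministic_params")   # not cacheable by policy
        return call_model(prompt, params)
    key = cache_key(prompt, params)
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():
        log("hit", saved_tokens=len(prompt) // 4)        # rough token estimate
        return hit[1]
    log("miss", reason="ttl_expired" if hit else "not_seen")
    response = call_model(prompt, params)
    CACHE[key] = (time.time() + TTL_SECONDS, response)
    return response
```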

Prompt Compression

Most apps can cut prompt tokens 20–40% without hurting outcomes. Techniques that pay immediately:

  • Header hygiene: collapse role/context rules into a minimal, versioned contract referenced by ID, not repeated prose.

  • Structured few-shot: replace long examples with compact, schema-only demonstrations and counter-examples that target common failures.

  • Dictionaryization: use short aliases for long, repeated strings (e.g., [SRC_A] instead of full titles) when the model has nearby definitions; see the sketch after this list.

  • Bounded summaries: compress context into atomic claims with source IDs and timestamps; preserve citations so fidelity survives.

  • Stop sequences & max tokens: keep generations tight; tailor by route and user intent.
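
The dictionaryization bullet is the easiest of these to mechanize; a minimal sketch, assuming the aliases are defined once in a legend the model can see:

```python
def dictionaryize(context: str, sources: dict[str, str]) -> str:
    """Replace long, repeated source titles with short aliases plus a one-line legend."""
    legend_lines = []
    for alias, title in sources.items():
        context = context.replace(title, alias)
        legend_lines.append(f"{alias} = {title}")
    return "LEGEND:\n" + "\n".join(legend_lines) + "\n\n" + context

compressed = dictionaryize(
    "According to Quarterly Infrastructure Cost Review 2024, spend rose 12%. "
    "Quarterly Infrastructure Cost Review 2024 also notes cache savings.",
    {"[SRC_A]": "Quarterly Infrastructure Cost Review 2024"},
)
# Every repeat after the first now costs a couple of tokens instead of a full title.
```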

Measure tokens per successful outcome before/after compression, not just tokens per call. If success rate dips, restore specific elements rather than reverting wholesale.

Model Mix (Small/Large)

Run a portfolio, not a monolith. Route traffic by risk and difficulty:

  • Small/fast models for classification, routing, extraction, known-pattern answers, and pre-validation (cheap first pass).

  • Medium models for standard Q&A, tool orchestration, and summarization with moderate context.

  • Large models for novel reasoning, edge cases, or high-stakes decisions—with strict escalation criteria.

Layer speculative decoding (draft + verifier) to make large models feel instant, and use early-exit strategies: if the small model’s confidence ≥ threshold and the task is low risk, return; otherwise escalate with evidence attached. Track win rate vs. cost by route: you want the small model handling most volume while the large model owns the long tail. Revisit thresholds monthly; drift is real.
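
A hedged sketch of that escalation logic; the confidence score, risk labels, and model callables are placeholders for whatever your serving stack actually exposes:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85   # revisit monthly; drift moves this

@dataclass
class Task:
    prompt: str
    risk: str        # "low" or "high", set upstream by a classifier or route config
    evidence: str    # retrieved context, passed along on escalation

def route(task: Task, small_model, large_model) -> tuple[str, str]:
    """Cheap first pass on the small model; escalate only when risk or uncertainty demands it."""
    answer, confidence = small_model(task.prompt)   # assumed to return (text, confidence)
    if task.risk == "low" and confidence >= CONFIDENCE_THRESHOLD:
        return answer, "small"                      # early exit: most volume lands here
    # Escalate with the draft and evidence attached so the large model doesn't redo the work.
    escalated = f"{task.prompt}\n\nDraft answer:\n{answer}\n\nEvidence:\n{task.evidence}"
    return large_model(escalated), "large"
```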

$/Outcome Dashboards

Tokens aren’t the KPI—outcomes per dollar are. Build dashboards that show:

  • Cost per successful task (by route/tenant/region).

  • Grounded accuracy & citation precision alongside cost, to prevent “cheap but wrong” wins.

  • Latency p50/p95 vs. abandonment rates; slow is expensive in hidden ways.

  • Abstention quality (targeted follow-ups vs. guesses), because good abstentions avoid costly retries.

  • Cache hit rates and compression savings as first-class widgets.

  • Escalation mix (small→large) with thresholds and confidence curves.

Wire these to release gates: a new prompt, retrieval policy, or model is “green” only if it maintains outcome quality while holding or improving $/outcome and latency; anything that misses the gate gets rolled back. Publish weekly cost notes with three bullets: what changed, savings realized, and the next lever to pull.
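
The gate itself can be a few lines of arithmetic over the dashboard's numbers; a sketch with illustrative field names and thresholds:

```python
def release_is_green(candidate: dict, baseline: dict,
                     max_quality_drop: float = 0.0,
                     max_cost_increase: float = 0.0,
                     max_latency_increase_ms: float = 0.0) -> bool:
    """Green only if outcome quality holds while $/outcome and latency hold or improve."""
    quality_ok = candidate["grounded_accuracy"] >= baseline["grounded_accuracy"] - max_quality_drop
    cost_ok = candidate["cost_per_success"] <= baseline["cost_per_success"] + max_cost_increase
    latency_ok = candidate["latency_p95_ms"] <= baseline["latency_p95_ms"] + max_latency_increase_ms
    return quality_ok and cost_ok and latency_ok

baseline = {"grounded_accuracy": 0.92, "cost_per_success": 0.031, "latency_p95_ms": 1800}
candidate = {"grounded_accuracy": 0.92, "cost_per_success": 0.027, "latency_p95_ms": 1750}
assert release_is_green(candidate, baseline)   # holds quality, improves cost and latency
```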


Bottom line: Budget tokens like an engineer, reuse work like a systems architect, compress like a protocol designer, route like an SRE, and judge success like a product owner. When you manage tokens as a capital expense tied to outcomes—not a meter ticking in the background—you turn LLMs from a variable cost center into a predictable, compounding advantage.