Generative AI  

Generative AI, Part 7 — Cost & Latency Engineering: Decoder Policies, Caching, Batching, and $/Accepted Output

Introduction

Great content that’s slow or expensive won’t survive production. Cost and latency aren’t afterthoughts; they’re design constraints. This article turns generative pipelines into performance systems: you’ll define budgets, pick decoder policies that finish fast, cache aggressively, batch where it pays, and measure $ per accepted output (not per token) so optimization aligns with business value.


Define Budgets First (or nothing else matters)

Set per-route caps before you tune models.

  • Token budgets: header ≤ H, context ≤ C, generation ≤ G (e.g., H=200, C=800, G=220).

  • Latency SLOs: p50/p95 (e.g., p50 ≤ 600 ms, p95 ≤ 1200 ms for short-form).

  • Acceptance criteria: first-pass validator pass rate (CPR) ≥ the target (e.g., 92%).

  • Outcome unit: accepted outputs (after validation/repair). All costs roll up to $/accepted.

The budget doc lives next to the prompt contract and validator policy; CI blocks merges that exceed it.
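As a minimal sketch of that CI gate, assuming a budgets.json shaped like the Minimal Config at the end of this article and a per-route usage report with the field names shown (both are illustrative, not a standard format):

# ci_budget_gate.py: fail the build when a route's measured stats exceed its budget doc.
import json
import sys

def check_budgets(budget_path: str, report_path: str) -> list[str]:
    """Compare a route's measured stats against its budget; return human-readable violations."""
    with open(budget_path) as f:
        budget = json.load(f)
    with open(report_path) as f:
        report = json.load(f)
    violations = []
    if report["p95_latency_ms"] > budget["p95_latency_ms_max"]:
        violations.append(f"p95 latency {report['p95_latency_ms']} ms > {budget['p95_latency_ms_max']} ms")
    if report["cpr"] < budget["target_cpr"]:
        violations.append(f"CPR {report['cpr']:.3f} < target {budget['target_cpr']:.3f}")
    if report["repairs_per_accepted"] > budget["repairs_per_accepted_max"]:
        violations.append(f"repairs/accepted {report['repairs_per_accepted']:.2f} > cap {budget['repairs_per_accepted_max']:.2f}")
    if report["gen_tokens_p95"] > budget["gen_tokens_max"]:
        violations.append(f"p95 generation tokens {report['gen_tokens_p95']} > {budget['gen_tokens_max']}")
    return violations

if __name__ == "__main__":
    problems = check_budgets("budgets.json", "route_report.json")
    for p in problems:
        print("BUDGET VIOLATION:", p)
    sys.exit(1 if problems else 0)  # a non-zero exit code is what blocks the merge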


Decoder Policies that Finish Fast (and well)

Decoding is your speed dial. Defaults that work for most short/medium forms:

  • Natural prose: top_p=0.90–0.95, temperature=0.7–0.85, repetition_penalty=1.05.

  • Highly structured: top_p=0.75–0.85, temperature=0.2–0.5, or beam=3 for very short fields.

  • Repair-prone sections: lower top_p and temperature to reduce resamples.

Rule: tune for tokens per accepted output, i.e., tokens generated per attempt divided by the first-pass pass rate (CPR). A policy that reduces resamples often beats micro token savings.
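A quick way to compare candidate policies on that basis, assuming you have measured CPR and average tokens per attempt on your own eval set (the numbers below are purely illustrative):

# Expected generation cost per accepted output ≈ tokens per attempt / first-pass pass rate (CPR),
# assuming a failed attempt is fully resampled. Numbers are illustrative, not benchmarks.
policies = {
    "loose": {"cpr": 0.85, "tokens_per_attempt": 200},  # fewer constraints, more resamples
    "tight": {"cpr": 0.94, "tokens_per_attempt": 210},  # slightly longer outputs, far fewer retries
}

for name, p in policies.items():
    expected = p["tokens_per_attempt"] / p["cpr"]
    print(f"{name}: ~{expected:.0f} tokens per accepted output")
# "tight" wins (~223 vs ~235) even though each attempt generates more tokens.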


Sectioned Generation = Deterministic Latency

Generate by section with stop sequences and per-section max_tokens. Benefits:

  • Early termination (no spilling into the next section)

  • Fine-grained repair (regenerate one failed section)

  • Stable concurrency (predictable token curves per call)

Example caps: Overview ≤ 120 tokens, Benefits (3 bullets × ≤ 18 words), CTA ≤ 25 tokens.
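A sketch of the calling pattern, with complete(prompt, max_tokens, stop) standing in for whatever model client you use; the section names, caps, and stop strings are examples, not a fixed schema:

# Sectioned generation: each section gets its own token cap and stop sequence,
# so latency is bounded per call and a failed section can be regenerated alone.
SECTIONS = [
    {"name": "overview", "max_tokens": 120, "stop": ["\n## "]},
    {"name": "benefits", "max_tokens": 140, "stop": ["\n## "]},
    {"name": "cta",      "max_tokens": 30,  "stop": ["\n"]},
]

def generate_asset(complete, header: str, context: str) -> dict[str, str]:
    """complete(prompt, max_tokens, stop) -> str is a stand-in for your model client."""
    out = {}
    for sec in SECTIONS:
        prompt = f"{header}\n{context}\nWrite the {sec['name']} section:\n"
        out[sec["name"]] = complete(prompt, max_tokens=sec["max_tokens"], stop=sec["stop"])
    return out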


Caching That Actually Moves the Needle

Think three layers:

  1. Template/Frame Cache: pure config (templates, style frames, validator policies). 100% hit-rate, zero risk.

  2. Retrieval/Claim Cache: shaped claim packs for common topics/segments, keyed by topic+region+freshness_window. Invalidate on source change.

  3. Generation Cache: deterministic sections (low temp, no randomness) for repeated assets like disclosures, footers, or boilerplate intros.

Track cache hit-rate × tokens saved. If a cache doesn’t lift $/accepted measurably, delete it.
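One way to key and invalidate the retrieval/claim layer, sketched with an in-process dict as the store; topic, region, and a time-bucketed freshness window form the key as described above, and you would swap in your own store and window length:

import time

class ClaimPackCache:
    """Cache shaped claim packs keyed by (topic, region, freshness window)."""

    def __init__(self, freshness_window_s: int = 7 * 24 * 3600):
        self.freshness_window_s = freshness_window_s
        self._store: dict[tuple, tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, topic: str, region: str) -> tuple:
        # Bucket time into freshness windows so the key rolls over automatically.
        window = int(time.time() // self.freshness_window_s)
        return (topic, region, window)

    def get_or_build(self, topic: str, region: str, build_fn):
        key = self._key(topic, region)
        if key in self._store:
            self.hits += 1
            return self._store[key][1]
        self.misses += 1
        pack = build_fn(topic, region)          # run retrieval + shaping only on a miss
        self._store[key] = (time.time(), pack)
        return pack

    def invalidate(self, topic: str, region: str):
        # Call this when an underlying source document changes.
        self._store.pop(self._key(topic, region), None)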


Batching & KV-Cache Reuse

Where the runtime supports it:

  • Micro-batching: send multiple similar requests together to amortize overhead.

  • KV-cache reuse: keep shared headers/context in cache across sections or variants.

  • Speculative decoding: pair a tiny draft model with your main model for 1.5–2.5× speedups; verify tokens before commit.

Guardrail: cap parallelism to avoid contention spikes; prioritize p95 over hero p50.
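The "verify before commit" step is the heart of speculative decoding. A toy sketch of the greedy variant, with the draft and target models reduced to next-token callables; real implementations verify the whole drafted block in a single target forward pass and handle sampling, not just greedy agreement:

def speculative_step(draft_next, target_next, prefix: list[int], k: int = 4) -> list[int]:
    """One speculative step: the draft proposes k tokens, the target keeps the longest
    prefix it agrees with, then contributes one token of its own.
    draft_next/target_next map a token prefix to the next greedy token id."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):                       # cheap draft pass
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposed:                       # verification against the target model
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))        # the target always commits one token itself
    return accepted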


The Repair Loop is a Cost Center—Trim It

Repairs are necessary but expensive. Reduce them by:

  • Tight decoder policies on failure-prone sections.

  • Lexicon & banned-term clarity (less guesswork).

  • Better templates (fixed bullet counts, hard stops).

  • Pre-validated inputs (locale, brand casing hints, claim freshness).

Track repairs per accepted; set a hard budget (e.g., ≤ 0.25 sections repaired per output).
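A sketch of a section-level repair loop with a hard cap, where generate_section, validate, and the per-section retry limit are placeholders for your own pipeline pieces:

def generate_with_repairs(sections, generate_section, validate, max_repairs: int = 1):
    """Generate each section, repairing (regenerating with a tighter policy) at most
    max_repairs times per section; count repairs so repairs/accepted can be reported."""
    output, repairs = {}, 0
    for sec in sections:
        text = generate_section(sec, tightened=False)
        attempts = 0
        while not validate(sec, text) and attempts < max_repairs:
            attempts += 1
            repairs += 1
            text = generate_section(sec, tightened=True)   # lower top_p/temperature on retry
        if not validate(sec, text):
            raise ValueError(f"section {sec!r} failed validation after {attempts} repairs")
        output[sec] = text
    return output, repairs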


Routing & Model Mix

Use the smallest model that hits CPR + latency SLOs; escalate on uncertainty or risk.

  • A small language model (SLM) as the default handles 70–95% of traffic.

  • Large LLM for tail cases (long reasoning, unusual constraints).

  • Consider draft+verify to make large models feel small.

Metric: escalation rate vs. win-rate. If escalations don’t materially improve acceptance or downstream KPI, tighten thresholds.
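A routing sketch under the assumption that you can score request risk cheaply and that the small model returns some confidence signal; the thresholds and function names are placeholders:

def route(request, slm_generate, llm_generate, risk_score,
          risk_threshold: float = 0.7, confidence_floor: float = 0.6):
    """Default to the small model; escalate when the request looks risky up front
    or the small model reports low confidence. The label is returned so escalation
    rate and win-rate can be tracked per route."""
    if risk_score(request) > risk_threshold:
        return llm_generate(request), "escalated:risk"
    draft, confidence = slm_generate(request)
    if confidence < confidence_floor:
        return llm_generate(request), "escalated:low_confidence"
    return draft, "slm"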


Measure What Pays: $/Accepted Output

Forget raw token bills. Compute:

$/accepted = (LLM $ + retrieval $ + repair $ + selector $) / accepted outputs

Where:

  • LLM $ = (prompt + completion tokens) × price × attempts

  • Repair $ = extra generations from failed validations

  • Selector $ = extra variants scored beyond the chosen one

  • Accepted outputs = after validators, not drafts

Target: drive $/accepted down release over release without hurting CPR.
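In code, this is a direct transcription of the formula above (the numbers in the example call are made up):

def cost_per_accepted(llm_usd: float, retrieval_usd: float, repair_usd: float,
                      selector_usd: float, accepted_outputs: int) -> float:
    """$ per accepted output; accepted_outputs counts post-validation outputs only."""
    if accepted_outputs == 0:
        return float("inf")
    return (llm_usd + retrieval_usd + repair_usd + selector_usd) / accepted_outputs

# Illustrative: $41 LLM + $3 retrieval + $6 repair + $4 selector over 100 accepted outputs.
print(cost_per_accepted(41.0, 3.0, 6.0, 4.0, 100))   # 0.54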


Observability: What to Put on the Dashboard

Per route, by model & release:

  • CPR (first-pass) and time-to-valid p50/p95

  • Tokens per accepted (prompt, completion, total)

  • Repairs per accepted and resample rate

  • Cache hit-rate (by layer) and tokens saved

  • Speculative accept rate (if used)

  • Escalation rate and win-rate delta

  • $/accepted with a sparkline over 30 days

Add alerts: p95 latency +20%, CPR −2 pts, or $/accepted +25% → automatically pause the canary.
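The same thresholds as the canary auto-halt config later in this article, expressed as the per-window check you would run against baseline vs. canary metrics (the metric names are assumptions):

def should_halt_canary(baseline: dict, canary: dict,
                       cpr_drop_pts: float = 2.0,
                       p95_increase_pct: float = 20.0,
                       cost_increase_pct: float = 25.0) -> bool:
    """Return True if the canary breaches any alert threshold vs. the baseline release."""
    cpr_drop = (baseline["cpr"] - canary["cpr"]) * 100
    p95_up = (canary["p95_ms"] / baseline["p95_ms"] - 1) * 100
    cost_up = (canary["usd_per_accepted"] / baseline["usd_per_accepted"] - 1) * 100
    return cpr_drop >= cpr_drop_pts or p95_up >= p95_increase_pct or cost_up >= cost_increase_pct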


Optimization Playbook (in priority order)

  1. Cut wasteful tokens: shorten headers, remove vestigial instructions, compress examples.

  2. Section caps + stops: reduce overrun and long tails.

  3. Decoder tuning: lower top_p/temperature on sections that fail validators.

  4. Cache claim packs for hot topics; ensure freshness policy.

  5. Speculative decoding for long sections; evaluate real-world accept rate.

  6. Route more to SLM as CPR stabilizes; raise escalation threshold carefully.

  7. Trim variants: if selector ROI flattens, reduce N.


Worked Example (Composite)

Baseline (Part 4 pipeline, blog route):

  • CPR 91.8%, p95 1280 ms, tokens/accepted 1,850, repairs/accepted 0.42, $/accepted $0.84

Changes

  • Sectioned caps (overview 120, proof 220) + stronger stops

  • Tightened decoder on “Proof” (top_p 0.82, temp 0.45)

  • Cached claim packs for 12 hot topics (invalidate weekly)

  • Draft+Verifier enabled for long sections

Result (2-week canary)

  • CPR 92.6% (+0.8)

  • p95 980 ms (−23%)

  • tokens/accepted 1,290 (−30%)

  • repairs/accepted 0.21 (−50%)

  • $/accepted $0.53 (−37%)
    → Rollout: green-light with auto-rollback guards.


Anti-Patterns (and fixes)

  • “We’ll optimize later.” Treat budgets as gates from day one.

  • Chasing p50 wins while p95 regresses. Optimize for the tail.

  • Infinite variants “just in case.” Cap N; let the selector earn its keep.

  • Caching raw pages instead of claim packs. You’ll pay in tokens and drift.

  • One giant generation. Always section + validate + stop.


Minimal Config (copy/paste)

Decoder policy

{"top_p":0.9,"temperature":0.7,"repetition_penalty":1.05,
 "section_max_tokens":{"overview":120,"benefits":140,"proof":220,"cta":30}}

Budgets

{"header_tokens_max":200,"context_tokens_max":800,"gen_tokens_max":220,
 "p95_latency_ms_max":1200,"target_cpr":0.92,"repairs_per_accepted_max":0.25}

Canary auto-halt

{"cpr_drop_pts":2.0,"p95_increase_pct":20,"cost_increase_pct":25}

Conclusion

Cost and latency engineering is mostly discipline: budgets at the top, decoder policies that reduce retries, sectioned generation with hard stops, caches that save real tokens, and routing that keeps the big guns for the rare case. Track CPR, time-to-valid, tokens/accepted, repairs/accepted, and $/accepted—and wire auto-halts so mistakes are cheap. With these habits, your generative stack gets faster and cheaper every release without sacrificing quality.