Prompt Engineering  

Prompt Engineering, Part 8: Cost & Speed — Engineering for $/Accepted Output and Predictable Latency

Introduction

Quality that arrives late or blows the budget never survives contact with production. After contracts, decoding, style, context, tools, validators, and ops are in place, the final lever is performance economics: designing routes to meet latency SLOs and a target $/accepted output without sacrificing acceptance or safety. This article deepens the playbook—budgets as first-class config, sectioned generation with hard stops, claim-pack centric context, caching layers that actually save tokens, small→large routing, and observability wired to business metrics. We’ll move beyond generic “optimize tokens” advice and show where to cut, how to bound p95/p99, and what to measure so tradeoffs are explicit rather than accidental.

A key theme: treat cost and speed as design constraints, not postmortems. That means setting budgets before you write copy, encoding them beside your contract, and failing builds that exceed them. When speed and cost are part of the contract, tuning becomes a methodical search for a higher first-pass constraint pass-rate (CPR) at lower tokens/accepted—not an endless back-and-forth over wording.


The Problem Cost Engineering Must Solve

Teams that defer performance see the same failure curve. Runaway prompt headers creep from 200 to 600 tokens as rules accrete; context dumps balloon because retrieval isn’t eligibility-first; one-shot longform induces overrun and spiky p95/p99; retry storms appear when validators catch real problems but the system resamples entire artifacts instead of fixing a section. Costs rise linearly with words and quadratically with chaos.

The second class of failures is invisible waste. You regenerate boilerplate on every call; you re-retrieve the same facts instead of caching shaped claim packs; you escalate to a large model “just in case” without a measured win-rate; you run selectors on variants you never use. None of these look like outages, but they erode margins and breach SLOs under load. Cost engineering is the discipline of replacing these habits with explicit budgets, isolation (by section), deterministic stops, and caches with freshness contracts—so both spend and tail latency become predictable.


Design Principles

  1. Budget first, then design inside the box. Write token and latency caps alongside the contract. Features fit budgets; budgets don’t chase features.

  2. Sections over monoliths. Generate per section with caps and stop sequences. This isolates failures, flattens tails, and enables parallel validation.

  3. Optimize CPR × tokens, not vibes. The best policy is the one that increases first-pass constraint pass-rate while reducing tokens/accepted. A 10% token cut that causes 20% more resamples is a net loss.

  4. Cache by meaning, not by page. Cache templates, style/policy bundles, and shaped claims keyed by topic/region/freshness—not raw PDFs.

  5. Default small, escalate when earned. Route to a small model until uncertainty/risk justifies a larger one; measure escalation ROI explicitly.

  6. Log what you optimize. Store artifact hashes, section tokens, stop-hit flags, repair counts, and $/accepted so rollbacks and postmortems trace causes, not symptoms.

These principles shrink the design space: you’ll reach stable, cheap configurations faster because each change is constrained, measurable, and reversible.


Budgets (copy/paste baselines you can enforce)

Token budgets (per request):

  • Header ≤ 200 tokens — contract + pointers to style/policy by ID, not pasted walls of text.

  • Context ≤ 800 tokens — claim packs only; no raw passages; enforce freshness windows.

  • Generation ≤ 220 tokens per section — enforced via per-section caps (e.g., Overview ≤120; Proof ≤220; CTA ≤30).

Latency SLOs (per route):

  • p50 ≤ 600 ms, p95 ≤ 1200 ms (short/medium forms). For long routes, set p95 ≤ 2500 ms and track p99 explicitly.

  • Per-section hard stops (stop sequences + token caps) so no section can spill indefinitely.

Quality gates (release criteria):

  • First-pass CPR ≥ 92% (schema, safety, locale, citations/freshness, no implied writes).

  • Repairs/accepted ≤ 0.25 sections.

  • Implied-write incidents = 0.

Encode these as machine-readable config. CI fails with a clear reason: which budget, by how much, and in which section.
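
A minimal Python sketch of such a config and check, suitable for a CI step that reads a dry-run token report; the BUDGETS structure, section names, and check_budgets helper are illustrative, not a specific tool:

    # budgets.py: illustrative budget config plus a CI-style check (names are assumptions)
    BUDGETS = {
        "header_tokens": 200,
        "context_tokens": 800,
        "sections": {"overview": 120, "proof": 220, "cta": 30},   # per-section generation caps
        "latency_ms": {"p50": 600, "p95": 1200},
        "gates": {"first_pass_cpr": 0.92, "repairs_per_accepted": 0.25},
    }

    def check_budgets(measured, budgets=BUDGETS):
        """Return violations naming the budget, the overage, and the section."""
        violations = []
        for name in ("header_tokens", "context_tokens"):
            if measured[name] > budgets[name]:
                violations.append(
                    f"{name}: {measured[name]} > {budgets[name]} (+{measured[name] - budgets[name]})")
        for section, cap in budgets["sections"].items():
            used = measured.get("sections", {}).get(section, 0)
            if used > cap:
                violations.append(f"section '{section}': {used} > {cap} tokens (+{used - cap})")
        return violations

    if __name__ == "__main__":
        import json, sys
        measured = json.load(open(sys.argv[1]))   # token report from a dry-run harness
        problems = check_budgets(measured)
        if problems:
            print("\n".join(problems))
            sys.exit(1)                           # CI fails with a clear, specific reason

The same file can carry the latency SLOs and quality gates so canary tooling and CI read one source of truth.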


Decoding Policies and Their Cost Impact

Sampling is your throttle. Narrative wants diversity; structure wants discipline. Practical defaults that lower retries:

  • Narrative paragraphs: top_p = 0.90–0.95, temperature = 0.70–0.85, repetition penalty ≈ 1.05. Pair with sentence caps (≤18 words) to avoid run-ons that trigger repairs.

  • Bullets / tabular: top_p = 0.75–0.85, temperature = 0.35–0.55, fixed bullet counts and ≤18 words/bullet. This dramatically reduces validator failures.

  • JSON / short labels: top_p = 0.75–0.82 and temperature = 0.25–0.45, or grammar-constrained decoding if supported.

Tune where errors occur, not globally. If “Proof” fails citations often, lower its top_p/temperature only there; don’t punish “Overview.” Track tokens/accepted by section and resample rate; tighten sections that carry most failures and you’ll see cost drop without style penalties.
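
One way to encode those defaults is a small per-section policy table that the caller merges into its request parameters. A Python sketch, with illustrative content types, section names, and a hypothetical policy_for helper:

    # Per-content-type sampling defaults (values drawn from the ranges above).
    DECODING = {
        "narrative": {"top_p": 0.92, "temperature": 0.75, "repetition_penalty": 1.05},
        "bullets":   {"top_p": 0.80, "temperature": 0.45},
        "json":      {"top_p": 0.80, "temperature": 0.35},
    }

    # Tighten only where validators fail most; leave other sections alone.
    SECTION_OVERRIDES = {
        "proof": {"top_p": 0.82, "temperature": 0.45},   # e.g., frequent citation failures
    }

    def policy_for(section, content_type):
        """Start from the content-type default, then apply any section-specific tightening."""
        params = dict(DECODING[content_type])
        params.update(SECTION_OVERRIDES.get(section, {}))
        return params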


Sectioned Generation: Deterministic Latency at Scale

One-shot generation couples everything: a long sentence in “Proof” blows the whole request. Sectioning decouples.

Operational advantages:

  • Predictable p95: stop sequences and caps prevent overrun; outliers in one section no longer stretch the tail.

  • Local repairs: fix a single section with deterministic substitutions or a tighter resample; accepted sections are reused as-is.

  • Parallel validation: validate and repair sections concurrently when infra allows, reducing wall-clock time.

  • Traceability: per-section logs (params, stops, tokens, repairs) make regressions easy to bisect.

Implementation notes: enumerate section IDs/order in the contract; specify caps and stop tokens. Validators enforce the outline to catch accidental spills or missing sections before display.
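
A Python sketch of that loop, assuming a generate(prompt, max_tokens, stop, **params) client, a validate(section_id, text) callable, and the policy_for helper sketched earlier; all three stand in for your own runtime and validators:

    # Section order, per-section token caps, and stop sequences come from the contract.
    SECTIONS = [
        ("overview", 120, "\n## "),
        ("proof",    220, "\n## "),
        ("cta",       30, "\n## "),
    ]

    def render(base_prompt, generate, validate, policy_for):
        accepted = {}
        for section_id, cap, stop in SECTIONS:
            params = policy_for(section_id, "narrative")
            text = generate(f"{base_prompt}\n## {section_id}\n",
                            max_tokens=cap, stop=[stop], **params)
            if not validate(section_id, text):
                # Local repair: one tighter resample of this section only;
                # already-accepted sections are reused as-is.
                params["temperature"] = max(0.2, params["temperature"] - 0.2)
                text = generate(f"{base_prompt}\n## {section_id}\n",
                                max_tokens=cap, stop=[stop], **params)
            accepted[section_id] = text
        return accepted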


Caching That Actually Saves Money

Caching random strings helps little; caching stable structure and shaped meaning pays immediately.

  1. Template/Style/Policy cache (config): 100% hit-rate; zero risk. Store by version hash.

  2. Claim-pack cache (context): cache shaped, timestamped claims for hot topics keyed by topic+region+freshness_window. Invalidate on source updates or window expiry. This avoids re-retrieval and cuts context tokens by 30–60% vs raw passages.

  3. Boilerplate generation cache: disclosures, legal footers, standard CTAs generated with low diversity. Store by input hash; never regenerate.

Measure caches by hit-rate × tokens saved and impact on $/accepted. If a cache shows high hits but doesn’t move $/accepted, it’s in the wrong layer (or not trimming enough tokens).
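
A minimal Python sketch of the claim-pack layer, assuming claims arrive already shaped and timestamped; the key scheme, ClaimPackCache class, and tokens-saved bookkeeping are illustrative:

    import time

    class ClaimPackCache:
        """Shaped claims keyed by topic + region + freshness window, with expiry."""

        def __init__(self):
            self._store = {}   # key -> (expires_at, claims, est_tokens_saved)

        @staticmethod
        def _key(topic, region, freshness_days):
            return f"{topic}|{region}|{freshness_days}d"

        def get(self, topic, region, freshness_days):
            entry = self._store.get(self._key(topic, region, freshness_days))
            if entry and entry[0] > time.time():
                return entry[1]          # fresh hit: skip retrieval and re-shaping
            return None                  # miss or expired window: caller re-retrieves

        def put(self, topic, region, freshness_days, claims, est_tokens_saved):
            expires_at = time.time() + freshness_days * 86400
            self._store[self._key(topic, region, freshness_days)] = (expires_at, claims, est_tokens_saved)

        def invalidate(self, topic, region, freshness_days):
            """Call on source updates, independent of the freshness window."""
            self._store.pop(self._key(topic, region, freshness_days), None)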


Routing & Model Mix

The cheapest output is the one you don’t escalate. Use a small model (SLM) as default and escalate when justified.

When to escalate:

  • Uncertainty exceeds threshold (e.g., low selector confidence, many missing inputs).

  • Risk tier demands stronger reasoning (regulated claims, write actions with large dollar impact).

  • Repair loop exceeds budget (e.g., >1 section repaired twice).

How to measure:

  • Escalation rate (fraction of requests routed up).

  • Win-rate delta (acceptance/quality improvement when escalated).

  • Cost delta for escalated calls.

If escalation lifts CPR marginally but doubles cost, tighten thresholds. Re-evaluate thresholds monthly; model and data drift change the calculus.
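
A Python sketch of the escalation decision, assuming the uncertainty and risk signals are computed upstream; the Signals fields, thresholds, and model labels are illustrative:

    # Small-by-default routing with earned escalation (sketch).
    from dataclasses import dataclass

    @dataclass
    class Signals:
        selector_confidence: float    # 0..1, from the variant selector
        missing_inputs: int           # required fields absent from the request
        risk_tier: str                # "low" | "regulated" | "write_action"
        sections_repaired_twice: int  # repair-loop pressure

    def choose_model(s: Signals) -> str:
        if s.risk_tier in ("regulated", "write_action"):
            return "large"                        # risk tier demands stronger reasoning
        if s.selector_confidence < 0.55 or s.missing_inputs > 2:
            return "large"                        # uncertainty exceeds threshold
        if s.sections_repaired_twice > 1:
            return "large"                        # repair loop exceeded budget
        return "small"                            # default: cheapest model that passes gates

Logging the decision next to the outcome lets escalation rate, win-rate delta, and cost delta fall out of the same records.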


Throughput Tricks (without breaking quality)

  • KV-cache reuse: keep shared header/context in memory across sections or variants if your runtime supports it; saves compute and time.

  • Micro-batching: batch similar short requests to amortize overhead (watch for p95 regressions at high concurrency).

  • Speculative decoding (draft+verify): 1.5–2.5× speedups on long sections; confirm acceptance doesn’t drop (occasionally it does if drafts overrun caps).

  • Early abort on validator hard-fail: if “Schema” or “Implied-write” trips, quit fast and repair; don’t finish other sections.

Always monitor p95 and p99 when enabling throughput features; users experience the tail, not the median.
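
Of these, the early-abort rule is the simplest to sketch in Python; this assumes validators are registered with hard/soft tags, and all names are illustrative:

    HARD_FAILS = {"schema", "implied_write"}   # failures that make the rest of the artifact moot

    class HardFail(Exception):
        pass

    def validate_or_abort(section_id, text, validators):
        """Run validators; raise on a hard failure so the caller skips remaining sections."""
        soft_issues = []
        for name, check in validators.items():
            if check(text):
                continue
            if name in HARD_FAILS:
                raise HardFail(f"{section_id}: {name} failed; repair before generating more")
            soft_issues.append(f"{section_id}: {name}")
        return soft_issues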


Measurement & Dashboards (production-facing, actionable)

Per route and release, track:

  • CPR (first pass) — primary quality gate; segment by locale/persona.

  • Time-to-valid p50/p95/p99 — include repairs; tails tell the story.

  • Tokens per accepted — header/context/generation breakdown and by section.

  • Repairs per accepted and resample rate — leading indicators of latency and cost.

  • Cache hit-rates and tokens saved per layer (template/claims/boilerplate).

  • Escalation rate and win-rate delta — is large-model spend paying for itself?

  • $/accepted output — LLM + retrieval + selection + repairs (not $/token).

  • Stop-hit ratio — % of sections that terminated at intended stops; low ratios predict overrun.

Alerts: CPR −2 pts, p95 +20%, $/accepted +25%, cache hit-rate −15% WoW, or stop-hit ratio −10% in any section.
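
A Python sketch of the two computations teams most often conflate with cheaper proxies ($/accepted and stop-hit ratio), with the alert thresholds above captured as data; field names are illustrative:

    def dollars_per_accepted(llm_cost, retrieval_cost, selection_cost, repair_cost, accepted_count):
        """Fully loaded cost per accepted output, not $/token."""
        return (llm_cost + retrieval_cost + selection_cost + repair_cost) / max(accepted_count, 1)

    def stop_hit_ratio(section_logs):
        """Fraction of sections that terminated at an intended stop sequence."""
        hits = sum(1 for s in section_logs if s["ended_on_stop"])
        return hits / max(len(section_logs), 1)

    ALERT_THRESHOLDS = {                 # deltas vs. the last green release
        "cpr_points": -2.0,
        "p95_percent": +20.0,
        "cost_per_accepted_percent": +25.0,
        "cache_hit_rate_percent_wow": -15.0,
        "stop_hit_ratio_percent": -10.0,
    }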


Optimization Playbook (do in order; stop when metrics flatten)

  1. Trim the header. Replace verbose rules with references (style/policy IDs). Expect 10–20% token savings.

  2. Replace raw context with claim packs. Shape to atomic, dated claims. Expect 20–40% context savings and higher citation pass-rates.

  3. Add section caps + stops. Overrun disappears; p95 typically drops 15–30%.

  4. Tighten decoding on failure-prone sections. Lower top_p/temperature where validators fail; repairs/accepted falls quickly.

  5. Cache claim packs for hot topics. Invalidate by freshness windows; measure tokens saved per request.

  6. Introduce speculative decoding on long sections. Validate that CPR holds; keep an easy off-switch.

  7. Tune routing thresholds. Reduce unnecessary escalations; confirm win-rate delta remains positive.

  8. Prune variants. If your selector’s lift plateaus, reduce the variant count N; save tokens and time.

Each step should show a monotonic improvement in tokens/accepted, p95, or $/accepted with stable CPR; if not, revert.


Worked Example (composite)

Baseline (blog route): CPR 91.8%, p95 1280 ms, tokens/accepted 1,850, repairs/accepted 0.42, $0.84/accepted.

Interventions:

  • Header compression (−120 tokens) by pointing to policy/style IDs rather than pasting.

  • Section caps + stops; bullets capped at 3 × 18 words.

  • Decoding tightened for “Proof” (top_p 0.82, temp 0.45) where citations often failed.

  • Claim-pack cache for 12 hot topics with weekly invalidation.

  • Speculative decoding on Overview/Proof only (guarded by flag).

Canary (10%, stratified):

  • CPR +0.8 pts (92.6%).

  • p95 −23% (to 980 ms); p99 −19%.

  • Tokens/accepted −30% (to 1,290).

  • Repairs/accepted −50% (to 0.21).

  • $/accepted −37% (to $0.53).

  • Stop-hit ratio rose from 71% to 93% across sections (predictability gain).

Gates green; rollout complete. A follow-up analysis showed escalations to the large model dropped from 18% to 7% with no CPR regression—pure savings.


Implementation Checklist

  • Budgets encoded: token caps (header/context/generation), section caps, stop sequences, latency SLOs.

  • Per-section decoding policies tuned for CPR and validator pass-rates.

  • Claim packs replace raw passages; freshness windows enforced; conflicts surfaced.

  • Caching: template/style/policy by version; claim packs by topic+region+freshness; boilerplate generation by input hash.

  • Router with small→large thresholds; uncertainty/risk signals logged; escalation ROI measured.

  • Throughput features (KV-cache, micro-batching, speculative decoding) guarded by flags; p95/p99 monitored.

  • Dashboards & alerts: CPR, time-to-valid, tokens/accepted (by section), repairs/accepted, cache impact, escalation ROI, $/accepted.

  • Canary/rollback wired: auto-halt on CPR −2 pts, p95 +20%, $/accepted +25%; one-click revert to last green bundle.

  • Weekly quality note: KPI deltas, artifact diffs, cost drivers, next optimization step.


Conclusion

Cost and speed aren’t mysteries; they are properties you design. By declaring budgets up front, generating by section with hard stops, feeding claims not pages, caching meaningfully, and routing small-by-default, you turn $/accepted output and p95 latency into levers rather than outcomes. The result is a stack that ships quickly, scales smoothly, and remains affordable even as usage grows. Pair these mechanics with the governance from Parts 1–7, and you have a full, auditable operating model for prompt-driven systems that is robust to model changes and real-world load.