Prompt Engineering: Decoding Discipline - Sampling Policies and Sectioned Generation — Part 2

Introduction

Model weights determine what’s possible; decoding determines what you get. In production, the difference between “pretty good” and “predictably great” often comes down to how you sample: top-p, temperature, repetition controls, stop sequences, and whether you generate the whole artifact at once or by section. This article outlines a classical, implementation-first approach to decoding that raises first-pass success, stabilizes latency, and lowers cost—without changing models.

The Problem Decoding Must Solve

Unstable outputs rarely require new training. They reflect:

  • Variance from overly loose sampling (creative but off-contract).

  • Runaway generations that spill across sections and inflate p95 latency.

  • Overcorrection (greedy/beam on long text) that induces repetition.

  • All-or-nothing retries that regenerate an entire document for a local error.

A disciplined decoding strategy constrains diversity where precision matters and allows it where style benefits, all while isolating failures to the smallest unit (a section or field).

Core Principles

  1. Format before freedom. Fix the output structure and section boundaries, then tune diversity inside those rails.

  2. Per-section policies. Different parts want different settings (e.g., bullets vs. narrative vs. JSON).

  3. Fail small, repair small. Validate each section independently; repair only the failing part.

  4. Optimize for first-pass pass rate (CPR) × tokens. Small token wins are meaningless if retries surge (see the sketch after this list).

  5. Stop deterministically. Use explicit stop sequences so generations end where you intend.
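
To make principle 4 concrete, here is a back-of-envelope helper (an illustration only; it assumes retries are independent attempts with the same pass rate):

```python
def expected_tokens_per_accepted(tokens_per_attempt: float, first_pass_rate: float) -> float:
    """Expected tokens spent per accepted output, assuming each retry is an
    independent attempt with the same pass rate (geometric retries)."""
    return tokens_per_attempt / first_pass_rate

# Shaving 10% of tokens while letting the pass rate slip from 0.90 to 0.75
# makes each accepted output more expensive, not cheaper:
print(expected_tokens_per_accepted(500, 0.90))  # ~556 tokens per accepted output
print(expected_tokens_per_accepted(450, 0.75))  # 600 tokens per accepted output
```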

A Baseline Decoding Policy (Practical Defaults)

  • Narrative paragraphs: top-p 0.90–0.95, temperature 0.70–0.85, repetition penalty 1.05

  • Bulleted benefits / lists: top-p 0.80–0.88, temperature 0.40–0.60, word caps per bullet

  • Short labels / titles: beam 3 or top-p ≤0.80, temperature ≤0.45

  • JSON/structured fields: top-p 0.75–0.85, temperature 0.20–0.45, or grammar-constrained if supported

Always pair these with stop sequences at section boundaries and per-section token caps.
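
These defaults are easy to carry as data rather than prose. A minimal sketch (parameter names mirror common sampling APIs; the specific values are midpoints of the ranges above, and the shared stop and cap values are illustrative):

```python
# Illustrative per-section decoding defaults, using midpoints of the ranges above.
DECODING_POLICIES = {
    "narrative":  {"top_p": 0.92, "temperature": 0.80, "repetition_penalty": 1.05},
    "bullets":    {"top_p": 0.84, "temperature": 0.50, "max_words_per_bullet": 18},
    "label":      {"top_p": 0.80, "temperature": 0.45},
    "structured": {"top_p": 0.80, "temperature": 0.35},
}

# Regardless of kind, every section also gets a hard stop and a token cap.
COMMON_LIMITS = {"stop": ["\n\n## "], "max_tokens": 200}
```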

Sectioned Generation: Architecture and Rationale

Instead of generating the whole artifact in one pass, render an outline, then produce each section with its own decoding profile and hard stop.

Benefits

  • Predictable p95 latency (no spillover).

  • Localized repairs (regenerate only one section).

  • Cleaner observability (params and failures per section).

  • Easier gating (validators can accept/reject sections atomically).

Minimal contract hooks (sketched in code after this list)

  • Section keys and order

  • Stop tokens between sections (e.g., \n\n## )

  • Max tokens per section and max words per sentence

  • Expected bullet counts per section
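
These hooks fit comfortably on a small per-section spec object. A sketch, assuming Python; the field names and the example sections (which anticipate the launch-note preset later on) are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SectionSpec:
    key: str                          # section ID; list order fixes section order
    kind: str                         # "narrative" | "bullets" | "label" | "structured"
    stop: list = field(default_factory=lambda: ["\n\n## "])  # hard stop between sections
    max_tokens: int = 200             # per-section token cap
    max_words_per_sentence: int = 24
    bullet_count: Optional[int] = None  # expected bullets, if the section is a list

CONTRACT = [
    SectionSpec("overview", "narrative", max_tokens=120),
    SectionSpec("what_changed", "bullets", bullet_count=3),
    SectionSpec("why_it_matters", "bullets", bullet_count=3),
    SectionSpec("cta", "label", max_tokens=30),
]
```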

Validators and the Repair Loop

  1. Generate → Validate: schema/shape, max words/sentence, banned lexicon, (if grounded) citation coverage.

  2. Deterministic repair: substitutions (remove banned terms, add hedges), trimming to caps, or injecting missing labels.

  3. Tighten sampling: on second attempt, reduce top-p by ~0.05 and temperature by ~0.1 for that section only.

  4. Ask/refuse: on third failure, emit a targeted ASK_FOR_MORE or safe refusal per contract.

Track repairs per accepted output (target ≤0.25) and time-to-valid.
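
A sketch of that ladder for one section; generate, validate, and deterministic_repair are stand-ins for whatever model call and validators you already have:

```python
def generate_with_repair(spec, policy, generate, validate, deterministic_repair):
    """Escalation ladder for a single section. generate(spec, policy) returns text,
    validate(spec, text) returns a list of errors (empty means accept), and
    deterministic_repair(spec, text, errors) applies rule-based fixes such as
    trimming to caps, swapping banned terms, or injecting missing labels."""
    for attempt in range(3):
        text = generate(spec, policy)
        errors = validate(spec, text)
        if not errors:
            return text                                   # first-pass (or retried) success
        text = deterministic_repair(spec, text, errors)
        if not validate(spec, text):
            return text                                   # deterministic repair was enough
        # Still failing: tighten sampling for this section only, then try again.
        policy = {**policy,
                  "top_p": round(policy["top_p"] - 0.05, 2),
                  "temperature": max(round(policy["temperature"] - 0.10, 2), 0.0)}
    return "ASK_FOR_MORE: section could not be produced within the contract"
```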

Implementation Recipe (Step-by-Step)

  1. Define sections in your contract (IDs, order, caps, stops).

  2. Attach a per-section decoding policy object (top-p, temperature, penalties, caps).

  3. Render an outline first (1 line/section). Validate length.

  4. Generate each section from its outline line with its decoding policy.

  5. Validate & repair section-wise; never discard already-accepted sections.

  6. Stitch the artifact; run a final whole-document validator (structure, cadence, locale rules).

  7. Log per-section params, tokens, stop hits, validator outcomes, and repair steps.
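
Condensed into code, the recipe might look like this sketch; render_outline, generate_section (for example, a wrapper around the repair loop above), validate_document, and log are assumed helpers, not a specific library:

```python
def build_artifact(contract, policies, render_outline, generate_section,
                   validate_document, log):
    """Outline first, then one section at a time; accepted sections are never regenerated."""
    outline = render_outline(contract)                        # step 3: one line per section
    assert len(outline) == len(contract), "outline must cover every section"

    sections = {}
    for spec, outline_line in zip(contract, outline):
        policy = policies[spec.kind]                          # step 2: per-section decoding policy
        text = generate_section(spec, outline_line, policy)   # steps 4-5: generate, validate, repair
        sections[spec.key] = text
        log(section=spec.key, policy=policy, words=len(text.split()))  # step 7 (rough size proxy)

    artifact = "\n\n".join(sections[s.key] for s in contract) # step 6: stitch in contract order
    return artifact, validate_document(artifact)              # step 6: whole-document validation
```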

Concrete Presets (Copyable)

Launch note (blog):

  • Overview: top-p 0.92, temp 0.80, ≤120 tokens, 2 sentences, stop at \n\n##

  • What Changed (3 bullets): top-p 0.84, temp 0.50, ≤18 words/bullet

  • Why It Matters (3 bullets): top-p 0.84, temp 0.55, ≤18 words/bullet

  • CTA: top-p 0.80, temp 0.40, ≤30 tokens
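
The same preset written down as data (values copied from the bullets above; the key names are illustrative):

```python
LAUNCH_NOTE_PRESET = {
    "overview":       {"top_p": 0.92, "temperature": 0.80, "max_tokens": 120,
                       "max_sentences": 2, "stop": ["\n\n##"]},
    "what_changed":   {"top_p": 0.84, "temperature": 0.50, "bullets": 3, "max_words_per_bullet": 18},
    "why_it_matters": {"top_p": 0.84, "temperature": 0.55, "bullets": 3, "max_words_per_bullet": 18},
    "cta":            {"top_p": 0.80, "temperature": 0.40, "max_tokens": 30},
}
```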

Support macro (JSON):

  • Keys only from enum; top-p 0.78, temp 0.35; schema validation hard-fail

  • If escalation=true, require a reason field; if it is missing, repair by adding one (see the sketch below)
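
A sketch of that escalation repair (field names follow the preset above; the placeholder value is illustrative):

```python
def repair_support_macro(payload: dict) -> dict:
    """If escalation is true but no reason is present, repair in place
    rather than regenerating the whole macro."""
    if payload.get("escalation") and not payload.get("reason"):
        payload["reason"] = "UNSPECIFIED"  # placeholder flagged for review, not model-generated
    return payload
```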

Email (subject + preheader):

  • Subject: beam=3 or top-p 0.80, temp 0.40, ≤52 chars

  • Preheader: top-p 0.85, temp 0.50, ≤90 chars; no emojis if policy forbids

Observability and Tuning

Log per section: decoding params, token counts, whether the stop fired cleanly, validator errors, repair types, time-to-valid. Tune by:

  • Lowering top-p/temperature on sections that frequently fail validators.

  • Raising them slightly where outputs feel stiff, once CPR is comfortably high.

  • Watching p95 (not p50). Users feel the tail.

  • Optimizing CPR × tokens; aim for fewer retries before shaving tokens.
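
One way to shape the per-section trace (a sketch; the field names mirror the list above):

```python
import json
import time

def trace_record(section_key, policy, tokens_out, stop_hit, validator_errors, repairs, started_at):
    """One log line per generated section: enough to bisect regressions later."""
    return json.dumps({
        "section": section_key,
        "top_p": policy["top_p"],
        "temperature": policy["temperature"],
        "tokens_out": tokens_out,
        "stop_fired_cleanly": stop_hit,        # did generation end on the stop sequence?
        "validator_errors": validator_errors,  # empty list on first-pass success
        "repairs": repairs,                    # e.g. ["trim_bullets", "remove_banned_term"]
        "time_to_valid_ms": int((time.time() - started_at) * 1000),
    })
```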

Performance and Cost Tactics

  • Caps + stops cut long tails and p95 latency.

  • KV-cache reuse (where supported) across sections reduces total compute.

  • Speculative decoding (draft + verify) accelerates long sections; because the verify step preserves the target model's output distribution, CPR is unaffected.

  • Cache deterministic fragments (disclosures, boilerplate) instead of regenerating them.
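
The last point is the simplest to implement. A minimal in-process cache sketch (a hypothetical helper, not tied to any framework):

```python
_FRAGMENT_CACHE: dict = {}

def deterministic_fragment(key: str, render) -> str:
    """Return a cached deterministic fragment (disclosure, boilerplate) instead of
    regenerating it; render is any zero-argument callable that produces the text once."""
    if key not in _FRAGMENT_CACHE:
        _FRAGMENT_CACHE[key] = render()
    return _FRAGMENT_CACHE[key]
```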

Common Pitfalls (and Fixes)

  • One-shot longform: switch to outline → sections; you’ll see instant p95 improvements.

  • Greedy everywhere: acceptable for short labels; harmful for paragraphs (loops, blandness).

  • Over-diverse bullets: lower temp to ≤0.55 and enforce word caps.

  • Whole-doc retries on local errors: adopt section repair; cost will drop.

  • Untracked params: include settings in traces so you can bisect regressions.

What “Good” Looks Like in Metrics

  • First-pass CPR: +1–3 points after sectioning and tuned params.

  • p95 time-to-valid: −15–30% with stop sequences and caps.

  • Repairs per accepted output: ≤0.25 on average.

  • $/accepted output: down and to the right (fewer retries; smaller generations).

Conclusion

Decoding isn’t an afterthought; it’s a policy surface as important as the prompt itself. By defining per-section sampling, hard stops, and a surgical repair loop—and by tuning to maximize first-pass success—you transform the same model into a faster, cheaper, and more reliable system. In the next article, we’ll cover style, persona, and lexicon control—how to keep brand voice consistent without fine-tuning.