Prompt Engineering  

Prompt Engineering: Validators and Safety - Fail Closed, Repair Small — Part 6

Introduction

Once your prompt contracts (Part 1), decoding (Part 2), style controls (Part 3), and context shaping (Part 4) are in place, the last line of defense is safety enforcement. Validators are the mechanical gatekeepers that guarantee outputs satisfy policy, structure, and jurisdictional constraints before they become user-visible or trigger actions. In this article we define a compact validator architecture, show how to encode policy as data, explain “repair small” strategies that avoid costly retries, and outline the metrics and incident playbooks that keep risk low without slowing teams down.


Why Validators Exist

Generative routes fail in predictable ways: malformed JSON, banned claims (“guaranteed results”), stale citations, locale mix-ups, and—most expensive—implied write actions (“I issued your refund”) with no execution record. Validators reduce these to a handful of deterministic checks. They return machine-readable error codes, enabling targeted repairs instead of whole-document resampling. The net effect: a higher first-pass constraint pass-rate (CPR), lower latency, and fewer incidents.


The Validator Stack (four layers)

  1. Shape & Schema

    • Parseability (valid JSON/HTML/Markdown subsets)

    • Required fields present; enums/ranges respected

    • Section counts, bullet counts, sentence caps

  2. Lexicon & Tone

    • Banned terms/phrases; hedge phrases required where uncertainty remains

    • Brand casing and style constraints (e.g., sentence length ceilings, paragraph counts)

    • Channel rules (subject/preheader lengths, emoji policy, hashtags)

  3. Evidence & Freshness (when grounded)

    • Citation coverage for factual sentences (e.g., ≥70%)

    • Claim freshness within policy window (e.g., ≤540 days unless “evergreen”)

    • Conflict surfacing when claims disagree (show both with dates or abstain)

  4. Action Safety

    • Implied-write detector: blocks language asserting state change without a recorded tool result

    • Tool whitelist, argument validation, jurisdiction/tenant limits

    • Spend/rate/budget checks; idempotency requirements

Each layer yields specific error codes (e.g., SCHEMA, LEXICON, CITATION, STALE_CLAIM, IMPLIED_WRITE, LOCALE, LENGTH), making downstream remediation simple.
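As a minimal sketch (names are illustrative, anticipating the {ok, errors:[{code, span, details}]} result shape described in the implementation pattern below), the codes and per-check results can be modeled as plain data so the repair engine can dispatch on them:

# Sketch: error codes as data, plus helpers that build the result shape
# used later in this article. Names are illustrative, not a specific library.
ERROR_CODES = {"SCHEMA", "LEXICON", "CITATION", "STALE_CLAIM",
               "IMPLIED_WRITE", "LOCALE", "LENGTH"}

def error(code, span, details=""):
    assert code in ERROR_CODES
    return {"code": code, "span": span, "details": details}

def make_result(errors):
    return {"ok": not errors, "errors": errors}

# Example: a lexicon validator flagging a banned phrase at characters 10-19.
result = make_result([error("LEXICON", (10, 19), 'banned phrase "guarantee"')])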


Policy as Data (not prose)

Centralize rules in a policy bundle the validators read and the prompt references. Example (abridged):

{
  "contract_version": "brand.v3.2.1",
  "ban": ["guarantee", "only solution", "revolutionary"],
  "hedge": ["may", "typically", "in many cases"],
  "comparatives": { "allow": ["faster than baseline"], "forbid": ["#1","best"] },
  "brand_casing": [["Product X","Product X"]],
  "citations": { "min_coverage": 0.7, "max_age_days": 540 },
  "locale": "en-US",
  "channel": { "email_subject_max": 52, "preheader_max": 90 },
  "writes": { "forbid_implied_success": true }
}

Because the bundle is versioned, audits can reproduce the exact rule set used for any output, and legal/brand can edit rules without touching code.
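A minimal loading sketch, assuming a local JSON file named after the contract version; the content hash ties every logged output to the exact rule set that validated it:

# Sketch: load a versioned policy bundle and hash it for audit traces.
# The file path and field names are assumptions for illustration.
import hashlib
import json

def load_policy_bundle(path: str) -> tuple[dict, str]:
    with open(path, "r", encoding="utf-8") as f:
        bundle = json.load(f)
    # Hash the canonical JSON so traces can prove which rules were applied.
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return bundle, digest

bundle, policy_hash = load_policy_bundle("policy/brand.v3.2.1.json")
# Log {"contract_version": bundle["contract_version"], "policy_hash": policy_hash} with every output.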


Repair Small: Deterministic Fixes Before Resample

Resampling entire outputs is slow and expensive. Prefer section-level and token-light repairs:

  • Lexicon repair: deterministic substitutions for banned phrases; sentence trimming to caps.

  • Style repair: enforce casing, inject a required hedge word, split long sentences.

  • Evidence repair: swap a stale claim for a fresher equivalent; add a citation ID; hedge or remove unsupported numbers.

  • Action repair: replace implied success with a proposal (“I can issue a $25 credit—shall I proceed?”) or append the recorded tool result ID.

If a section still fails after repair, tighten decoding for that section only (lower top-p/temperature) and try once more. On a third failure, emit a targeted ASK_FOR_MORE or REFUSE naming the specific missing or forbidden condition.
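A sketch of that escalation ladder, with the validate, repair, and regenerate callables left as assumptions so only the control flow is shown:

# Sketch of the "repair small" escalation ladder. The validate/repair/regenerate
# callables are assumptions, passed in so only the control flow is fixed here.
def accept_section(section, policy, validate, repair, regenerate):
    result = validate(section, policy)
    if result["ok"]:
        return {"status": "ACCEPT", "section": section}

    # Pass 1: deterministic repairs keyed by error code; no resampling.
    section = repair(section, result["errors"], policy)
    result = validate(section, policy)
    if result["ok"]:
        return {"status": "ACCEPT", "section": section}

    # Pass 2: resample only this section with tightened decoding.
    section = regenerate(section, temperature=0.3, top_p=0.8)
    result = validate(section, policy)
    if result["ok"]:
        return {"status": "ACCEPT", "section": section}

    # Pass 3: fail closed with a targeted outcome instead of shipping a bad section.
    codes = {e["code"] for e in result["errors"]}
    if codes & {"CITATION", "STALE_CLAIM"}:
        return {"status": "ASK_FOR_MORE", "missing": sorted(codes)}
    return {"status": "REFUSE", "violations": sorted(codes)}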


Implementation Pattern

1) Contract hooks

  • Enumerate required sections/fields and stop tokens.

  • Point to policy_version and required citation_coverage.

  • Forbid success language; require tool proposals with idempotency keys.

2) Validator services

  • Stateless functions per layer, returning {ok, errors:[{code, span, details}]}.

  • Compose into a pipeline; stop early on SCHEMA/IMPLIED_WRITE (see the sketch after this list).

3) Repair engine

  • Map error codes → ordered rewrite steps (substitute, trim, inject hedge, swap claim).

  • Re-run only the failing layer(s); re-validate; cap attempts (e.g., 2).

4) Observability

  • Log policy/contract/decoder hashes, error codes, repair steps, and final status.

  • Emit adherence scores (lexicon, cadence, coverage).
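Putting the validator services and observability hooks together, a minimal sketch; the individual validator functions and hash values are assumptions, and the result shape matches the {ok, errors:[...]} format above:

# Sketch: compose stateless validators into a pipeline with early stop, then build
# the trace record. Validator functions and hash values are illustrative assumptions.
def run_pipeline(output, policy, validators):
    all_errors = []
    for validate in validators:          # e.g., [check_schema, check_lexicon, check_evidence, check_actions]
        errors = validate(output, policy)["errors"]
        all_errors.extend(errors)
        # Fail fast on structural or action-safety errors; later layers would
        # only add noise on top of an unusable draft.
        if any(e["code"] in ("SCHEMA", "IMPLIED_WRITE") for e in errors):
            break
    return {"ok": not all_errors, "errors": all_errors}

def trace_record(result, repair_steps, hashes):
    return {
        "policy_hash": hashes["policy"],
        "contract_hash": hashes["contract"],
        "decoder_hash": hashes["decoder"],
        "error_codes": [e["code"] for e in result["errors"]],
        "repairs": repair_steps,         # ordered deterministic steps that were applied
        "status": "ACCEPT" if result["ok"] else "REJECT",
    }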


Example Error → Repair Mappings

Error | Detector | Repair
SCHEMA | JSON invalid / missing fields | Fill defaults; if still invalid, resample that section
LEXICON | Banned phrase ("guarantee") | Substitute with a hedge ("aims to"); add a reason note
LENGTH | >18 words per sentence | Split at a clause boundary; trim adverbs
CITATION | Factual line with no claim_id | Attach the nearest valid claim (above threshold); else hedge or remove
STALE_CLAIM | Claim age exceeds policy window | Swap for a fresher claim; else hedge timing ("as of …")
IMPLIED_WRITE | Success language without a recorded tool result | Rewrite as a proposal; or block and surface the validator's decision
LOCALE | en-GB spellings in an en-US route | Normalize spellings; enforce brand casing

Repairs must be deterministic and logged so operators can trust them.
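One way to keep repairs deterministic and auditable is to express the code-to-repair mapping as data and log each step as it is applied; the step implementations below are illustrative, not exhaustive:

# Sketch: error code -> ordered, deterministic repair steps, expressed as data.
# Substitutions come from the repair table and worked example; other steps
# would be added per policy rule in the same pattern.
SUBSTITUTIONS = {"guarantee": "aims to", "revolutionary": "advanced"}

def substitute_banned_terms(text, policy):
    for banned in policy.get("ban", []):
        if banned in text:
            text = text.replace(banned, SUBSTITUTIONS.get(banned, ""))
    return text

def enforce_brand_casing(text, policy):
    for variant, canonical in policy.get("brand_casing", []):
        text = text.replace(variant, canonical)
    return text

REPAIR_STEPS = {
    "LEXICON": [substitute_banned_terms],
    "LOCALE": [enforce_brand_casing],
    # CITATION, STALE_CLAIM, LENGTH, IMPLIED_WRITE map to their own steps the same way.
}

def apply_repairs(text, error_codes, policy):
    applied = []
    for code in error_codes:
        for step in REPAIR_STEPS.get(code, []):
            text = step(text, policy)
            applied.append(step.__name__)   # logged so operators can audit each rewrite
    return text, applied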


Measuring Safety & Validator Efficacy

  • Constraint Pass-Rate (CPR), first pass (primary KPI)

  • Repairs per accepted output (target ≤0.25)

  • Citation precision/recall (spot-checked or via labeled sets)

  • Implied-write violations (should be ~0; anything >0 triggers incident review)

  • Stale-claim rate and freshness coverage

  • Time-to-valid p95 (repairs should lower p95 by avoiding resamples)

  • Policy adoption (% outputs using latest policy version)

Visualize by route, locale, and release. Alert on a CPR drop of ≥2 points, a p95 increase of ≥20%, or any implied-write spike.
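A small sketch of those alert rules, assuming CPR, p95 time-to-valid, and implied-write counts are already aggregated per route and release:

# Sketch: alert rules for the thresholds above (the metric plumbing is assumed).
def alerts(current, baseline):
    fired = []
    if baseline["cpr"] - current["cpr"] >= 2.0:            # CPR measured in percentage points
        fired.append("CPR dropped >= 2 pts")
    if current["p95_ms"] >= baseline["p95_ms"] * 1.20:     # time-to-valid p95 up 20%+
        fired.append("p95 time-to-valid up >= 20%")
    if current["implied_write_violations"] > 0:            # should be ~0; any spike is an incident
        fired.append("implied-write violation detected")
    return fired

print(alerts({"cpr": 91.0, "p95_ms": 2400, "implied_write_violations": 0},
             {"cpr": 93.5, "p95_ms": 1900, "implied_write_violations": 0}))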


Incident Playbooks (keep them short)

  1. Contain: Flip the route’s feature flag to safe copy; block tool execution if relevant.

  2. Triage: Classify failures by code (SCHEMA/CITATION/SAFETY/…); sample affected outputs.

  3. Root cause: Contract change, policy edit, decoder tweak, stale claims, or retrieval eligibility gap.

  4. Remediate: Patch the artifact (policy/contract/decoder/validator), add/adjust a golden case, re-canary.

  5. Postmortem: 1-pager with artifact diffs, failing samples, and a new regression guard.

Aim for MTTR in hours, not days; cheap rollback is part of safety.


Practical Tips

  • Keep the ban list small and high-signal; overblocking creates wooden prose and repair churn.

  • Treat numbers and named entities as high-risk—require claims or hedges.

  • Encode jurisdictional deltas (US vs EU ads) as separate policy bundles; the planner selects the right one (see the sketch after this list).

  • For channels with tight form factors (ads/email), prioritize shape/language validators before evidence checks to fail fast.

  • Store validator versions in traces; policy evolution should not look like model drift in dashboards.
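For the jurisdictional-bundle tip above, selection can be a simple lookup keyed by route and jurisdiction; a minimal sketch (paths and names are assumptions) that fails closed when no bundle exists:

# Sketch: planner selects a jurisdiction-specific policy bundle. Paths are assumptions.
BUNDLES = {
    ("ads", "US"): "policy/ads.us.v3.2.1.json",
    ("ads", "EU"): "policy/ads.eu.v3.2.1.json",
    ("email", "US"): "policy/email.us.v3.2.1.json",
}

def select_bundle(route: str, jurisdiction: str) -> str:
    try:
        return BUNDLES[(route, jurisdiction)]
    except KeyError:
        # Fail closed: no bundle means no generation on this route.
        raise LookupError(f"No policy bundle for route={route}, jurisdiction={jurisdiction}")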


Worked Example (Composite)

A renewal email route fails validation with errors: LEXICON (uses “revolutionary”), CITATION (one factual sentence uncovered), and IMPLIED_WRITE (“I renewed your plan”). Repairs substitute “advanced,” attach a valid claim to the uncovered sentence, and rewrite the implied write into a proposal with a confirmation link. The section then passes; time-to-valid improves versus whole-resample, and the audit log records the three repairs. Canary shows CPR rising from 91.6% → 93.0%, p95 falling 14%.


Conclusion

Validators are not an add-on; they are the enforcement engine for prompt contracts and policy bundles. By failing closed, repairing small, and logging everything, you turn safety from an after-the-fact review into a first-class, automated property of your system. In Part 7, we’ll put evaluation and operations around these controls—golden tests, canary/rollback, and the metrics that prove a change is safe to ship.