Introduction
Great generative systems don’t rely on “looks good.” They ship with tests, telemetry, and rollback. Evaluation for GenAI is different from supervised ML: labels are scarce, outputs are open-ended, and quality drifts with prompts, parameters, and data. This playbook shows how to measure and gate quality using artifacts you can build today—no massive human labeling effort required.
What “Good” Means (Define It Once, Reuse Everywhere)
Before tooling, freeze the target:
Task objective: e.g., “produce on-brand copy with valid citations” or “JSON schema for refund decisions.”
Constraints: schema, banned terms, section lengths, locale rules.
Risk posture: when to ask, refuse, or escalate.
Acceptance gates: the minimum set of checks an output must pass to be user-visible.
Write this as a policy JSON (contract version, validator settings, decoder defaults). Every eval method below reads from the same policy.
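A minimal sketch of such a policy bundle, assuming a plain Python dict serialized to JSON; the field names (contract_version, validators, decoder, risk) are illustrative rather than a fixed schema:

```python
import json

# Illustrative policy bundle; field names are examples, not a prescribed schema.
POLICY = {
    "contract_version": "2025-06-01",
    "objective": "on-brand copy with valid citations",
    "validators": {
        "schema": "refund_decision.schema.json",
        "banned_terms": ["guaranteed", "risk-free"],
        "max_sentence_words": 28,
        "citation_coverage_min": 0.85,
    },
    "decoder": {"temperature": 0.4, "top_p": 0.9, "max_tokens": 400},
    "risk": {"abstain_if_missing": ["order_id", "refund_amount"]},
}

if __name__ == "__main__":
    print(json.dumps(POLICY, indent=2))  # every eval stage reads this one artifact
```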
Golden Sets (Your Non-Negotiables)
Golden sets are small, fixed test collections (30–200 items per route) built from real, anonymized cases.
Contents
Inputs (prompts/params + any evidence packs).
Expected properties (not verbatim output): must cite X, must abstain if Y missing, must include Z section, must not contain banned terms.
Reason codes for tricky cases (conflict, stale source, missing fields).
How to use
Run goldens on every PR to prompts, templates, validators, or decoding policies.
Track first-pass pass-rate, time-to-valid (including repair loops), and failure taxonomy (schema, citation, tone, safety).
Why it works: Goldens test behavioral guarantees—the part of quality most likely to regress.
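One way a golden item and its property checks might look; the keys and the check_golden helper are hypothetical, not a prescribed format:

```python
# Hypothetical golden item: inputs plus behavioral expectations, never verbatim output.
GOLDEN_ITEM = {
    "id": "refund-0042",
    "route": "refund_decision",
    "inputs": {"prompt_params": {"locale": "en-GB"}, "evidence_pack": ["doc_17", "doc_23"]},
    "expected": {
        "must_cite": ["doc_17"],
        "must_abstain_if_missing": ["refund_amount"],
        "must_include_sections": ["decision", "rationale"],
        "must_not_contain": ["guaranteed"],
    },
    "reason_code": "conflict",  # why this case is tricky
}

def check_golden(output: dict, item: dict) -> list[str]:
    """Return the violated expectations for one golden item (empty list == pass)."""
    exp, issues = item["expected"], []
    if not set(exp["must_cite"]) <= set(output.get("citations", [])):
        issues.append("CITATION")
    if not set(exp["must_include_sections"]) <= set(output.get("sections", {})):
        issues.append("SCHEMA")
    text = output.get("text", "").lower()
    if any(term in text for term in exp["must_not_contain"]):
        issues.append("SAFETY")
    return issues
```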
Constraint Pass-Rate (CPR): Your Primary KPI
CPR = % of generations that pass all hard checks on the first attempt.
Typical checks:
Schema & structure (JSON validity, sections present, bullet counts).
Lexical & brand (banned terms, casing, locale).
Claims & citations (minimal spans present, coverage threshold).
Safety (policy-forbidden phrases, write-action implications).
Length & cadence (sentence caps, token budgets).
Track CPR by route, locale, model version, and template. If CPR drops ≥2–3 points in a canary, auto-pause the rollout.
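A sketch of how CPR and its failure breakdown might be computed, assuming each generation logs a dict of hard-check outcomes for its first attempt:

```python
from collections import Counter

def constraint_pass_rate(results: list[dict]) -> float:
    """CPR: share of generations whose first attempt passed every hard check.

    Each result is assumed to look like:
      {"attempt": 1, "checks": {"schema": True, "brand": True, "citations": False, ...}}
    """
    firsts = [r for r in results if r["attempt"] == 1]
    if not firsts:
        return 0.0
    passed = sum(all(r["checks"].values()) for r in firsts)
    return passed / len(firsts)

def failure_breakdown(results: list[dict]) -> Counter:
    """Count which hard checks fail, to feed the failure taxonomy."""
    return Counter(
        name for r in results for name, ok in r["checks"].items() if not ok
    )
```

Group results by route, locale, model version, and template before calling it to get the sliced views described above.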
Self-Consistency & n-Best Selection (Quality Without Labels)
When exact ground truth is fuzzy, generate N variants and select the best:
Majority on slots: For structured fields (e.g., tags, categories), pick values that recur across variants.
Centroid similarity: Embed variants; choose the one closest to the cluster center (reduces outliers).
Small rater model: A tiny SLM scores “clarity, concreteness, brand fit” 1–5 using your policy; pick the top.
Hybrid: First filter by hard constraints, then select via centroid or rater.
Metrics to log: n-best win-rate, variance among variants, selector disagreement (good canary signal).
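A minimal sketch of the hybrid path: filter by hard constraints, then pick the variant nearest the embedding centroid. Here passes_hard_checks and embed stand in for whatever validators and embedding model you already run:

```python
import numpy as np

def select_best(variants, passes_hard_checks, embed):
    """Hybrid n-best selection: hard-constraint filter, then centroid similarity.

    variants:           list of candidate texts
    passes_hard_checks: callable(text) -> bool        (your validators)
    embed:              callable(text) -> np.ndarray  (any embedding model)
    """
    survivors = [v for v in variants if passes_hard_checks(v)] or variants
    vecs = np.stack([embed(v) for v in survivors])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # cosine space
    centroid = vecs.mean(axis=0)
    scores = vecs @ centroid                     # similarity to cluster center
    best = survivors[int(np.argmax(scores))]
    return best, float(scores.std())             # std doubles as a disagreement signal
```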
Live Canaries (Reality Check Before Full Traffic)
Push changes to 5–10% of traffic stratified by locale/persona.
Guardrails
Auto-halt rules: CPR down ≥2 pts, citation precision < 0.85, p95 latency up ≥20%, or $/accepted up ≥25% with no quality gain.
Leading indicators: increase in repairs per output, rise in abstentions on previously easy cases, spike in selector disagreements.
Operationalize
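The halt rules above translate almost directly into code. A minimal sketch, assuming each rollout arm exposes a metrics snapshot dict (key names are illustrative):

```python
def should_halt_canary(canary: dict, baseline: dict) -> list[str]:
    """Return the guardrails breached by the canary (empty list == keep rolling)."""
    reasons = []
    if canary["cpr"] < baseline["cpr"] - 2.0:                        # CPR down >= 2 pts
        reasons.append("cpr_drop")
    if canary["citation_precision"] < 0.85:
        reasons.append("citation_precision")
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.20:
        reasons.append("latency")
    cost_up = canary["cost_per_accepted"] > baseline["cost_per_accepted"] * 1.25
    if cost_up and canary["cpr"] <= baseline["cpr"]:                 # cost up, no quality gain
        reasons.append("cost")
    return reasons
```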
Metrics That Matter (Dashboards You’ll Actually Use)
First-pass pass-rate (CPR) and time-to-valid (ms).
Citation precision / recall on factual sentences (for RAG routes).
Abstention quality: % of cases that ask or refuse when required fields are missing; % of incorrect answers on low-coverage inputs.
Diversity under control: type–token ratio, sentence length variance (early warning for bland or chaotic outputs).
Economics: $ per accepted output, cache hit-rate, resample rate.
Drift watch: conflict frequency, stale-claim usage, rater score trend.
Show by release version and traffic segment. Keep a 7-day moving window and a static “golden” panel.
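One possible shape for the per-generation record these dashboards aggregate; the fields are illustrative and map onto the metrics above:

```python
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    """Per-generation record, aggregated by release version and traffic segment."""
    route: str
    locale: str
    release_version: str
    traffic_segment: str
    passed_first_try: bool
    time_to_valid_ms: int
    citation_precision: float | None = None   # RAG routes only
    citation_recall: float | None = None
    abstained: bool = False
    repairs: int = 0
    cost_usd: float = 0.0
    failure_tags: list[str] = field(default_factory=list)
```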
Failure Taxonomy (Make Bugs Fixable)
Classify failures into a small, actionable set:
SCHEMA (shape/format)
CITATION (missing/incorrect spans)
SAFETY (banned claims/terms, implied writes)
TONE (brand/locale violations)
LENGTH (over caps)
EVIDENCE (stale/ineligible/conflict not surfaced)
Every failed sample carries one of these tags. Owners fix categories, not one-offs.
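The taxonomy fits in a single enum so every pipeline stage tags failures the same way; a sketch, with the category meanings as inline comments:

```python
from enum import Enum

class FailureTag(str, Enum):
    """Actionable failure categories; every failed sample carries one."""
    SCHEMA = "schema"       # shape/format
    CITATION = "citation"   # missing/incorrect spans
    SAFETY = "safety"       # banned claims/terms, implied writes
    TONE = "tone"           # brand/locale violations
    LENGTH = "length"       # over caps
    EVIDENCE = "evidence"   # stale/ineligible/conflict not surfaced
```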
Test Harness (Thin, But Strict)
A minimal runner that accepts:
Input bundle (prompt params + evidence pack).
Policy bundle (contract version + validator config + decoder defaults).
Generator (which model & strategy).
Selector (if n-best).
Outputs:
Pass/fail, issues[], time-to-valid, chosen variant, selector score, seed, param hash.
Use it locally (devs) and in CI (blocking). Save artifacts for diffing between versions.
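A thin runner along those lines might look like this; generate, validate, and select are placeholders for your own model call, validators, and n-best selector:

```python
import hashlib, json, time

def run_case(bundle, policy, generate, validate, select=None, n_best=1):
    """Minimal harness: generate, optionally select, validate, and report artifacts.

    bundle:   dict with prompt params + evidence pack
    policy:   dict with contract version, validator config, decoder defaults
    generate: callable(bundle, policy, seed) -> str
    validate: callable(text, policy) -> list[str]          (empty list == pass)
    select:   callable(list[str]) -> (text, score) for n-best runs
    """
    start = time.monotonic()
    seed = 1234  # fixed for reproducibility; vary per variant if desired
    variants = [generate(bundle, policy, seed + i) for i in range(n_best)]
    chosen, score = select(variants) if select else (variants[0], None)
    issues = validate(chosen, policy)
    return {
        "pass": not issues,
        "issues": issues,
        "time_to_valid_ms": int((time.monotonic() - start) * 1000),
        "chosen_variant": chosen,
        "selector_score": score,
        "seed": seed,
        "param_hash": hashlib.sha256(
            json.dumps(policy, sort_keys=True).encode()
        ).hexdigest()[:12],
    }
```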
Sampling Plans (Get Signal With Small Budgets)
Pareto sampling: most defects come from a few routes. Focus there.
Stratified by risk: high-stakes locales or personas get 2–3× samples.
Rotating challenge sets: refresh 10–20% each sprint to avoid overfitting goldens.
For new features, require N≥50 canary samples passing CPR with stable time-to-valid before widening rollout.
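A sketch of a risk-stratified plan with a rotating challenge slice, using the rules of thumb above; the weighting scheme and field names are assumptions:

```python
import random

def sample_plan(items, budget, high_risk_weight=3.0, rotate_fraction=0.15, seed=0):
    """Risk-stratified sampling: high-risk strata get ~3x the per-item rate,
    plus a rotating slice of fresh cases so goldens don't get overfit.

    items: list of dicts with at least {"id", "route", "high_risk": bool}
    """
    rng = random.Random(seed)
    high = [it for it in items if it["high_risk"]]
    low = [it for it in items if not it["high_risk"]]
    # Allocate budget proportional to weighted stratum sizes.
    total_weight = high_risk_weight * len(high) + len(low)
    n_high = min(len(high), round(budget * high_risk_weight * len(high) / total_weight))
    n_low = min(len(low), budget - n_high)
    sample = rng.sample(high, n_high) + rng.sample(low, n_low)
    # Rotating challenge slice: refresh 10-20% of the set each sprint.
    n_fresh = max(1, int(rotate_fraction * len(sample)))
    return sample, rng.sample(items, min(len(items), n_fresh))
```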
Human QA Where It Pays
Labels are expensive—spend them well:
Boundary audits: review items that barely passed (near caps) or failed a single check.
Rater calibration: spot-score 20–50 items/week to keep the small rater aligned.
Policy changes: before/after samples around new banned terms or claim rules.
Keep the rubric tiny (5–7 questions). Aim for inter-rater agreement ≥ 0.7; if it's lower, fix the rubric.
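If Cohen's kappa is used as the agreement statistic (one common choice; the playbook itself doesn't mandate a specific measure), the weekly calibration check is only a few lines:

```python
from sklearn.metrics import cohen_kappa_score

def calibration_check(rater_a: list[int], rater_b: list[int], threshold: float = 0.7) -> bool:
    """Flag the rubric for revision when two raters' scores diverge too much."""
    kappa = cohen_kappa_score(rater_a, rater_b)
    print(f"inter-rater kappa = {kappa:.2f}")
    return kappa >= threshold  # below 0.7 -> fix the rubric, not the raters
```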
Cost & Latency in the Loop
Quality that’s too slow or costly won’t ship. Track:
Resample rate and repair count per accepted output.
KV-cache reuse (for sectioned generation).
Stop-sequence effectiveness (early termination success).
Selector efficiency (N vs. incremental gain).
Target: reduce time-to-valid each release without sacrificing CPR.
Release Gates (Copy/Paste Policy)
A change may ship only if:
CPR ≥ baseline − 0.5 pts and time-to-valid ≤ baseline + 10%
Citation precision/recall ≥ 0.9 (if applicable)
Safety incidents = 0 in canary
$/accepted ≤ baseline or justified by measurable lift (e.g., reply rate +X%)
If any breach → auto-rollback, open an issue with failing samples attached.
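The gate reads as a single check, assuming the same metrics-snapshot shape as the canary guardrails; lift_justified is a hypothetical flag set by whoever approves the cost trade-off:

```python
def release_gate(canary: dict, baseline: dict, has_citations: bool) -> list[str]:
    """Return the list of gate violations; ship only if it comes back empty."""
    violations = []
    if canary["cpr"] < baseline["cpr"] - 0.5:
        violations.append("cpr")
    if canary["time_to_valid_ms"] > baseline["time_to_valid_ms"] * 1.10:
        violations.append("time_to_valid")
    if has_citations and (canary["citation_precision"] < 0.9 or canary["citation_recall"] < 0.9):
        violations.append("citations")
    if canary["safety_incidents"] > 0:
        violations.append("safety")
    cost_ok = (canary["cost_per_accepted"] <= baseline["cost_per_accepted"]
               or canary.get("lift_justified", False))
    if not cost_ok:
        violations.append("cost")
    return violations  # any violation -> auto-rollback, file an issue with failing samples
```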
Worked Example (Composite)
A team upgrades its style frame and decoder policy. In canary:
CPR dips from 92.3% → 90.6% (−1.7).
Time-to-valid improves 14%.
Citation precision unchanged at 0.93; recall +0.02.
Selector disagreement +18% (warning).
Decision: hold the rollout, refine the lexicon (an over-tight banned list caused extra repairs), and re-canary. After the tweak: CPR 92.1%, time-to-valid still 12% better than baseline → green-light.
Anti-Patterns (Don’t Do These)
Shipping on human eyeballs alone (“looks fine”).
Mixing policy between prompt text, code, and tribal memory—single policy JSON or it didn’t happen.
One giant long-form generation—section & validate.
Ignoring abstentions—count them as a win when they prevent wrong answers.
Treating goldens as static forever—refresh gradually to prevent prompt overfitting.
Conclusion
Evaluation for generative systems is about guarantees over vibes. Small, durable artifacts—golden sets, policy JSON, constraint validators, n-best selectors, and live canaries—turn creativity into governed computation with clear rollbacks. When CPR, time-to-valid, and citation quality are first-class metrics, you can change prompts, models, and templates with confidence—and ship faster, safer, cheaper.