“Make it work” isn’t enough for enterprise GenAI. What you need is “make it correct, repeatable, auditable, and cheap.” This article lays out a concrete, production-grade approach—contract-driven prompt engineering—that fuses prompts with machine-checkable specs, verifier loops, and tool orchestration. The result is a pipeline that turns LLMs from impressive demos into systems that withstand SLAs, regulators, and real users.
Why prompt-only systems fail in production
LLMs can ace examples yet still fail when:
Small spec ambiguities or edge cases appear.
The task spans multiple steps, days, or tools.
Compliance or traceability is required.
The fix is not “more prompt cleverness.” It’s moving the source of truth into a contract and letting prompts, tools, and verifiers converge on that contract.
The core idea: the contract is the product
A contract is a structured, machine-checkable spec of what a “correct” output looks like: required fields, allowed ranges, invariants, and domain rules. The LLM is an optimizer under that contract; verifiers enforce it; tools supply ground truth; governance records the trace.
Contract anatomy
Schema: JSON Schema (or Pydantic) defining fields, types, enums, and ranges (see the sketch after this list).
Constraints: Business rules (e.g., dates non-decreasing; totals equal line-item sums).
Invariants (metamorphic tests): Properties that must hold under input transformations.
References: Authoritative sources (APIs, DBs, docs) used for cross-checks.
Evidence: Citations, provenance, or calculation steps required for audit.
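A contract of this shape can live entirely in code. Here is a minimal sketch, assuming Pydantic for the schema and plain Python functions for one business rule and one metamorphic invariant; the invoice-style field names and the rounding tolerance are illustrative, not part of any standard.

```python
# A minimal contract bundle: schema + one business rule + one metamorphic invariant.
# Field names and the tolerance value are illustrative assumptions, not a standard.
from typing import Callable, List

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    sku: str
    qty: int = Field(gt=0)           # constraint: every quantity strictly positive
    unit_price: float = Field(ge=0)


class InvoiceSummary(BaseModel):
    invoice_id: str
    currency: str                    # ISO-4217 code, e.g. "USD"
    subtotal: float
    tax: float
    total: float
    line_items: List[LineItem]


def rule_totals_balance(inv: InvoiceSummary, tol: float = 0.01) -> bool:
    """Business rule: total must equal subtotal + tax within a rounding tolerance."""
    return abs(inv.total - (inv.subtotal + inv.tax)) <= tol


def invariant_whitespace_noise(extract: Callable[[str], dict], raw_text: str) -> bool:
    """Metamorphic test: output should not change when harmless whitespace is added."""
    return extract(raw_text) == extract(raw_text.replace(" ", "  "))
```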
Pipeline architecture (end-to-end)
Spec ingestion
Load a schema + constraints + references. Generate a compact “contract brief” for the model.
Spec-sandwich prompting
Prompt = [role & goal] + [immutable contract brief] + [task & inputs] + [return format & evidence rules].
Constrained decoding / JSON mode
Force structured output; reject malformed payloads early.
Verifier stack (cheap→expensive)
a) Schema validation → b) Domain rules → c) Property-based checks → d) Cross-source fact checks → e) Adversarial probes.
At any failure: invoke repair prompts or tool calls targeted at the violated rule.
Uncertainty gates & escalation
Calibrate model self-ratings + verifier scores; if thresholds fail, escalate to a human or fall back to a safer path.
Trace & governance
Persist spec version, inputs, outputs, tool calls, rule hits/misses, and confidence. This is your audit trail.
Self-improver loop
Mine failures; generate counter-examples; fine-tune or rule-tune; re-run eval packs; ship only when win-rate lifts with bounded cost/latency.
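Put together, these stages collapse into a fairly small control loop. The sketch below assumes every component (prompt builder, model call, verifier list, confidence scorer, escalation handler) is passed in as a callable; the names, signatures, retry cap, and confidence floor are illustrative, not a specific framework.

```python
# Sketch of the end-to-end loop: generate, verify cheapest-first, repair, gate, escalate.
# Every component is injected; names, signatures, and defaults are illustrative.
from typing import Callable, List, Optional

def run_task(
    task: dict,
    build_prompt: Callable[[dict], str],                # spec-sandwich renderer
    build_repair: Callable[[dict, List[str]], str],     # rule-targeted repair prompt
    call_model: Callable[[str], dict],                  # constrained / JSON-mode decoding
    verifiers: List[Callable[[dict], Optional[str]]],   # each returns a violation or None
    confidence: Callable[[dict], float],                # calibrated confidence index
    escalate: Callable[[dict, List[str]], dict],        # human hand-off
    max_repairs: int = 2,
    confidence_floor: float = 0.8,
) -> dict:
    output = call_model(build_prompt(task))
    for attempt in range(max_repairs + 1):
        violations = [msg for v in verifiers if (msg := v(output)) is not None]
        if not violations:
            break
        if attempt == max_repairs:
            return escalate(output, violations)          # retry cap reached: hand off
        output = call_model(build_repair(output, violations))
    if confidence(output) < confidence_floor:            # uncertainty gate
        return escalate(output, ["confidence below threshold"])
    return output                                         # caller persists the full trace
```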
The spec-sandwich prompt (pattern)
Header: Role, goal, safety/ethics stance.
Immutable contract brief: Summarized schema + critical rules with terse examples.
Task: Inputs with IDs and context window hints.
Return format: Exact JSON keys; require evidence blocks or rationale tags when needed.
Refusal & escalation: Clear “don’t guess; escalate if…” clauses bound to uncertainty thresholds.
Example (abbreviated)
“Produce InvoiceSummary as JSON matching Schema v3. Fields: invoice_id (string), currency (ISO-4217), subtotal, tax, total (numbers), line_items[] with qty*unit_price sums. Constraints: total = subtotal + tax; all qty > 0; dates non-decreasing. Evidence: cite page:line for each extracted value. If any field uncertain ≥ 0.3, stop and emit needs_human_review:true with reasons.”
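Each constraint in that brief maps directly onto a verifier check. A minimal sketch, assuming dict-shaped output with the field names above plus a dates list the brief implies but does not spell out; the rule IDs and the 0.3 threshold mirror the example.

```python
# The rules from the example brief as named checks. Rule IDs and the 0.3 threshold
# mirror the brief; the dict keys (including a "dates" list) are assumed field names.
from datetime import date
from typing import Dict, List, Optional

def r1_totals_balance(doc: dict, tol: float = 0.01) -> Optional[str]:
    if abs(doc["total"] - (doc["subtotal"] + doc["tax"])) > tol:
        return "R1: total must equal subtotal + tax"
    return None

def r2_positive_quantities(doc: dict) -> Optional[str]:
    if any(item["qty"] <= 0 for item in doc["line_items"]):
        return "R2: every line item qty must be > 0"
    return None

def r3_dates_non_decreasing(doc: dict) -> Optional[str]:
    dates = [date.fromisoformat(d) for d in doc.get("dates", [])]
    if any(earlier > later for earlier, later in zip(dates, dates[1:])):
        return "R3: dates must be non-decreasing"
    return None

def needs_human_review(field_uncertainty: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Fields whose uncertainty crosses the stop-and-escalate threshold from the brief."""
    return [field for field, u in field_uncertainty.items() if u >= threshold]
```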
Verifier design (what to check and in what order)
Syntactic: JSON validity, required keys present, type checks.
Algebraic: Totals balance; monotonic properties; units compatible.
Semantic: Domain rules (e.g., ICD codes valid; SKUs exist; exchange rates dated ≤24h).
Metamorphic: Invariance under harmless input changes (whitespace, section order, OCR noise).
Cross-evidence: Does the output cite sources? Do sources actually support the claim?
Adversarial: Prompt-injection filters; role separation; tool parameter whitelists.
Repair loop: Feed verifier deltas back to the model (“You violated Rule R3: total mismatches. Recalculate. Provide new JSON only.”). Cap retries; escalate on repeated failure.
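A sketch of both halves: running verifier tiers cheapest-first and turning the resulting deltas into a rule-targeted repair prompt. The tier grouping, message format, and prompt wording are assumptions, not a fixed protocol.

```python
# Sketch: verifier tiers run cheapest-first, and the resulting deltas become a
# rule-targeted repair prompt. Tier grouping and wording are illustrative.
import json
from typing import Callable, List, Optional

Verifier = Callable[[dict], Optional[str]]  # returns a violation message, or None if the check passes

def first_failing_tier(output: dict, tiers: List[List[Verifier]]) -> List[str]:
    """Run tiers in order (syntactic -> algebraic -> semantic -> ...); stop at the first failure."""
    for tier in tiers:
        violations = [msg for check in tier if (msg := check(output)) is not None]
        if violations:
            return violations
    return []

def build_repair_prompt(previous_output: dict, violations: List[str]) -> str:
    """Feed verifier deltas back as a targeted repair instruction."""
    rules = "\n".join(f"- {v}" for v in violations)
    return (
        "Your previous answer violated these contract rules:\n"
        f"{rules}\n\n"
        f"Previous JSON:\n{json.dumps(previous_output, indent=2)}\n\n"
        "Fix only the violated rules, recalculate affected fields, "
        "and return the corrected JSON only."
    )
```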
Tool & memory orchestration that pays for itself
Retrieval: Ground answers with org docs; attach citations.
Calculators/solvers: Let tools do math, regex, SQL; the model orchestrates.
Execution sandboxes: Run generated code/tests in jailed environments; import results back as evidence.
Organizational memory: Cache verified outputs and known-good exemplars; prefer reuse over regeneration.
Cost discipline: Log tool calls and token spend per repaired rule. If a rule fires often, improve prompts, add a cheap pre-check, or upgrade the training data.
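One way to keep tool use policy-bound and priced, as a sketch: a whitelist of callable tools, an argument audit for the riskier ones, and a per-call log that feeds the cost-per-rule analysis. The tool names, the SELECT-only check, and the log fields are assumptions.

```python
# Sketch of a tool policy: whitelisted functions only, audited arguments, per-call cost log.
# The tool names, the SELECT-only check, and the log fields are illustrative.
import time
from typing import Any, Callable, Dict, List

ALLOWED_TOOLS: Dict[str, Callable[..., Any]] = {
    "sql_query": lambda query: ...,         # read-only warehouse lookup (placeholder)
    "fx_rate":   lambda ccy, on_date: ...,  # dated exchange-rate lookup (placeholder)
}

def call_tool(name: str, audit_log: List[dict], **kwargs: Any) -> Any:
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' is not whitelisted")
    if name == "sql_query" and not kwargs.get("query", "").lstrip().lower().startswith("select"):
        raise PermissionError("sql_query is restricted to SELECT statements")
    start = time.monotonic()
    result = ALLOWED_TOOLS[name](**kwargs)
    audit_log.append({
        "tool": name,
        "args": kwargs,                                   # audited arguments
        "duration_s": round(time.monotonic() - start, 3), # feeds cost-per-rule analysis
    })
    return result
```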
Uncertainty you can rely on
Self-ratings alone are noisy. Calibrate using reliability diagrams on your eval set; learn thresholds that correlate with actual pass/fail. Combine model self-score, verifier failures, retrieval coverage, and tool success rates into a confidence index that gates autonomy.
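A sketch of such a confidence index; the weights, the failure penalty, and the autonomy threshold are placeholders to be fit on your own calibration data, not recommended values.

```python
# Sketch: fold model self-score, verifier failures, retrieval coverage, and tool success
# into one confidence index. Weights, penalty, and threshold are placeholders to be
# fit on a calibrated eval set, not recommended defaults.
def confidence_index(
    self_score: float,          # model self-rating in [0, 1]
    verifier_failures: int,     # failed checks on the final attempt
    retrieval_coverage: float,  # fraction of claims backed by retrieved evidence
    tool_success_rate: float,   # fraction of tool calls that returned cleanly
) -> float:
    penalty = min(1.0, 0.25 * verifier_failures)
    score = 0.4 * self_score + 0.3 * retrieval_coverage + 0.3 * tool_success_rate
    return max(0.0, min(1.0, score * (1.0 - penalty)))

def autonomy_allowed(score: float, threshold: float = 0.85) -> bool:
    """Below the learned threshold, the pipeline escalates instead of acting alone."""
    return score >= threshold
```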
Evaluation that predicts production
Task success rate (TSR) on hidden eval packs, not just public benchmarks.
Spec-adherence violations per 100 tasks.
Irrecoverable error rate after N repair attempts.
Escalation rate and mean human time per escalation.
Cost per successful task and P90 latency.
Field-level F1 for structured outputs; calibration error for uncertainty.
Promote changes only when TSR↑, violations↓, and cost/latency within SLO.
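All of these metrics fall out of per-task eval records. A minimal sketch, assuming record keys such as passed, violations, escalated, cost_usd, and latency_s:

```python
# Sketch: compute release-gating metrics from per-task eval records.
# Record keys (passed, violations, escalated, cost_usd, latency_s) are illustrative.
from statistics import quantiles
from typing import Dict, List

def eval_summary(records: List[dict]) -> Dict[str, float]:
    n = len(records)
    successes = sum(1 for r in records if r["passed"])
    latencies = sorted(r["latency_s"] for r in records)
    return {
        "task_success_rate": successes / n,
        "violations_per_100_tasks": 100 * sum(r["violations"] for r in records) / n,
        "escalation_rate": sum(1 for r in records if r["escalated"]) / n,
        "cost_per_successful_task": sum(r["cost_usd"] for r in records) / max(1, successes),
        "p90_latency_s": quantiles(latencies, n=10)[8],  # 90th percentile
    }
```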
Implementation blueprint (concise)
Author schema.json + rules.yml + refs.yml.
Build a ContractBrief generator (auto-summarize the schema + rules into at most 1–2k tokens).
Implement a PromptBuilder that renders the spec-sandwich deterministically (sketched after this list).
Add a ConstrainedDecoder (JSON mode, regex guard, or grammar-based decoding).
Chain Verifiers with structured error messages; implement Repairs with rule-targeted prompts.
Wire Tools (retrieval, math, code, APIs) behind a policy: whitelisted functions + audited arguments.
Add Uncertainty Gates and an Escalation adapter (ticketing or human-in-the-loop UI).
Log traces: inputs, prompts, outputs, tool IO, rule outcomes, costs, durations.
Run Eval Packs nightly; auto-generate counter-examples from failures; store deltas.
Ship via gated releases: dev → canary → prod with rollbacks keyed by spec version.
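For the PromptBuilder, deterministic rendering can be as simple as joining fixed sections in a fixed order. A sketch; the section wording and the version tag are illustrative.

```python
# Sketch of a deterministic spec-sandwich renderer; section order is fixed so that
# identical (contract, task) inputs always produce byte-identical prompts.
def build_spec_sandwich(contract_brief: str, task_inputs: str, spec_version: str) -> str:
    sections = [
        "ROLE: You extract structured data under a binding contract. Do not guess.",
        f"CONTRACT (v{spec_version}, immutable):\n{contract_brief}",
        f"TASK INPUTS:\n{task_inputs}",
        ("RETURN FORMAT: a single JSON object matching the contract schema, "
         "with an evidence entry (source:page:line) for every extracted value."),
        ("REFUSAL & ESCALATION: if any required field lacks supporting evidence, "
         "set needs_human_review to true and list the reasons."),
    ]
    return "\n\n".join(sections)
```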
Mini case studies (compressed)
1) Regulated data extraction (finance/health)
Contract defines fields + ICD/ISO enums; retrieval supplies policy PDFs; verifiers check totals and code sets; failures trigger targeted repairs. Outcome: TSR from 72%→94%, escalation rate 18%→5%, cost/task −32%.
2) Code migration assistant
Contract: compile, pass unit tests, preserve public API. Tools: static analyzers, test runner. Verifiers: build succeeds, tests green, API diff only in allowed set. Deployed as “autonomy with gates,” not free-running agents. Outcome: P90 latency steady; TSR ↑ with each weekly counter-example pack.
Failure modes and how to harden
Spec drift: Version your contracts; pin prompts to spec version; log migrations.
Prompt injection: Strict function whitelists; escape/strip untrusted inputs; verify tool arguments.
Schema rigidity: Allow unknown_fields[] for safe forward compatibility; audit but don't crash (see the sketch below).
Verifier flakiness: Keep tests deterministic; isolate network dependencies; cache reference data.
Cost explosions: Track cost per rule repair; add “cheap pre-checks” ahead of expensive retrieval or ensembles.
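For the schema-rigidity point, tolerant parsing can accept unknown keys while still surfacing them for audit. A sketch assuming Pydantic v2 (extra="allow" plus model_extra), reusing the illustrative invoice shape from earlier.

```python
# Sketch: accept unknown keys for forward compatibility, but keep them for audit.
# Pydantic v2 style (ConfigDict / model_extra); the invoice fields are illustrative.
from pydantic import BaseModel, ConfigDict


class InvoiceSummaryV3(BaseModel):
    model_config = ConfigDict(extra="allow")   # tolerate unknown fields instead of failing
    invoice_id: str
    currency: str
    subtotal: float
    tax: float
    total: float


def audit_unknown_fields(doc: InvoiceSummaryV3) -> dict:
    """Return unknown keys for logging and review; they are not hard failures."""
    return dict(doc.model_extra or {})
```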
Rollout strategy that works
Start with one workflow where value is high and the contract is clean. Wire full tracing from day one. Freeze the spec for an initial window; improve only via counter-example mining. Gate autonomy behind confidence thresholds. When the pipeline becomes boring—few violations, stable costs—scale sideways to adjacent workflows by cloning the pattern, not the prompts.
Conclusion
The future of enterprise GenAI belongs to systems that treat prompts as one component in a contract-driven pipeline. When specs are machine-checkable, verifiers are first-class, tools are policy-bound, and uncertainty gates decide when to escalate, LLMs stop being demo toys and start meeting production bars. Contract-driven prompt engineering is how you get there—on purpose, with proofs, and with costs you can live with.