Introduction
Large language models are flexible, but production systems cannot rely on flexibility alone. Product teams need behavior that is predictable, auditable, and cheap to operate. The practical solution is to treat a prompt not as prose but as a contract—a compact specification that fixes scope, structure, and policy at the interface between users and the model. This article describes what a prompt contract is, why it reduces risk and cost, and how to implement one without changing your model provider or stack.
The Problem the Contract Solves
Most failures in shipped LLM features have the same shape:
Outputs drift because requirements are buried in long prompts or scattered across code.
The model “promises” actions (refunds, updates) that were never executed.
Small edits cause regressions because nothing is versioned or testable.
Costs spike as teams compensate with longer prompts and retries.
All of these are symptoms of implicit behavior. A prompt contract makes that behavior explicit, short, and verifiable.
Core Idea
A prompt contract is a small artifact—typically <300 tokens of instruction plus a compact schema—that expresses five things:
Role and Scope: What the model may do, and what it must not do.
Output Structure: A JSON or well-formed layout that downstream code can validate.
Decision Rules: When to answer, when to ask for more input, and when to refuse.
Evidence Policy: If using context/evidence, how to cite, how to break ties, and what to do when sources conflict.
Tool Mediation: If actions are allowed, the model may propose a tool call; only your system executes (or rejects) it.
Everything else—style, examples, and niceties—comes after these five.
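As a compressed sketch (the layout and key names here are illustrative, not a prescribed format), the five elements can live in a single small artifact, expressed below as a Python dict:
CONTRACT = {
    "version": "1.2.0",  # semantic versioning; a schema change is a major bump
    "role_scope": [
        "Summarize only the supplied messages; do not invent facts.",
        "Never claim an action was performed; you may only propose one.",
    ],
    "output_schema": {"type": "object"},  # placeholder for the real JSON schema
    "decision_rules": [
        "If required fields are missing, ask one clarifying question.",
        "Refuse politely when the request is out of scope.",
    ],
    "evidence_policy": "Cite minimal spans; prefer the most recent snippet on conflict.",
    "tool_mediation": "Propose tool calls only; the server executes or rejects them.",
}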
Design Principles
Keep it small. Long prompts hide requirements and increase token costs. Favor short, numbered rules over paragraphs.
Structure first. Define the output schema before tone; treat the model like a function with a strict return type.
Encode refusal. Missing fields, policy conflicts, or uncertainty must result in a targeted question or a polite refusal—not a guess.
Separate policy from prompt. Place banned terms, disclosures, and locale rules in a machine-readable policy file the prompt references.
Version everything. Use semantic versioning; a schema change is a major bump. Ship with a changelog.
Test like code. Maintain a small golden set of real inputs with expected properties (e.g., “must abstain when X missing”). Run it in CI.
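A golden set can be as plain as a folder of JSON cases plus a property test. The sketch below assumes pytest and a hypothetical summarize() entry point that returns the parsed output; both names are stand-ins.
import json
import pathlib
import pytest
from my_feature import summarize  # hypothetical entry point returning parsed output

GOLDENS = [json.loads(p.read_text()) for p in pathlib.Path("goldens").glob("*.json")]

@pytest.mark.parametrize("case", GOLDENS)
def test_golden_properties(case):
    out = summarize(case["input"])
    # Assert properties, not exact wording.
    if case["expects"].get("must_abstain"):
        assert out["missing"], "must ask for the missing detail, not guess"
        assert out["proposed_tool"]["name"] is None
    if case["expects"].get("no_implied_write"):
        assert "has been created" not in out["summary"].lower()
Because the assertions are properties ("must abstain when X is missing") rather than exact strings, the goldens survive harmless wording changes.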
A Minimal Contract (Explained)
Consider a customer-support summarizer that can also propose a follow-up task.
Role & Scope (excerpt):
Provide a concise summary and next step based only on supplied messages.
If critical details are missing (customer ID, urgency), output a single clarifying question.
Do not claim that any task has been created; you may only propose one.
Output Structure:
{
  "summary": "string",
  "next_step": "string",
  "missing": ["string"],
  "proposed_tool": {
    "name": "create_task" | null,
    "args": { "title": "string", "due_date": "string" },
    "preconditions": ["string"]
  }
}
Decision Rules:
If missing is non-empty, leave proposed_tool null.
If urgency ≥ “high” and missing is empty, you may propose create_task with a 48-hour due date.
If any prohibited language appears (policy file), regenerate without it.
Evidence Policy (if using context):
Base every claim on a supplied snippet and cite the minimal span that supports it; prefer the most recent timestamped snippet when sources conflict; if nothing supports a claim, omit it or ask.
Decoder Defaults:
Use stop sequences at section boundaries and per-section token caps; tighten decoding slightly when repairing a failed section.
This is short, readable, and sufficient for a validator to enforce.
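For instance, the layout above maps directly onto a JSON Schema check. The sketch below uses the jsonschema library; the schema shown is a hand-translation of the structure above, so treat the details as illustrative.
from jsonschema import Draft202012Validator

OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["summary", "next_step", "missing", "proposed_tool"],
    "additionalProperties": False,
    "properties": {
        "summary": {"type": "string"},
        "next_step": {"type": "string"},
        "missing": {"type": "array", "items": {"type": "string"}},
        "proposed_tool": {
            "type": "object",
            "required": ["name", "args", "preconditions"],
            "properties": {
                "name": {"enum": ["create_task", None]},  # null when no tool is proposed
                "args": {"type": "object"},
                "preconditions": {"type": "array", "items": {"type": "string"}},
            },
        },
    },
}

def schema_errors(candidate: dict) -> list[str]:
    # Empty list means the output conforms; anything else fails closed.
    return [e.message for e in Draft202012Validator(OUTPUT_SCHEMA).iter_errors(candidate)]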
Implementation Steps
Write the contract as a single file stored alongside the feature (route) it serves. Keep role/scope, schema, rules, and decoder defaults together.
Attach a policy bundle—a JSON with banned terms, disclosures, claim rules, and locale settings. Your validator reads this bundle.
Build a validator that checks: valid schema, banned terms, locale/brand casing, implied-write language, and (if applicable) citation coverage/freshness. Fail closed; a sketch follows these steps.
Add a repair path: if validation fails, regenerate only the failing section with slightly tighter decoding; do not redo the entire output.
Create golden traces: 30–50 anonymized inputs with expected properties (not exact wording). Run them on every change.
Ship behind a flag and canary to 5–10% of traffic. Auto-halt on: first-pass pass-rate drop ≥2 points, p95 latency +20%, or cost per accepted output +25%. Keep one-click rollback.
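Steps 2 through 4 compress into surprisingly little code. The sketch below assumes a policy bundle already loaded as a dict, a caller-supplied generate() wrapper around the model call, and the schema_errors() helper sketched earlier; all three are stand-ins rather than a fixed API.
import re

POLICY = {  # normally loaded from the policy bundle JSON
    "banned_terms": ["guaranteed refund"],
    "implied_write": [r"\b(task|ticket) (has been|was) created\b"],
}

def policy_errors(output: dict) -> list[str]:
    text = " ".join(str(v) for v in output.values()).lower()
    errors = [f"banned term: {t}" for t in POLICY["banned_terms"] if t in text]
    # Block success language when no tool record exists (none can, at generation time).
    errors += [f"implied write: {p}" for p in POLICY["implied_write"] if re.search(p, text)]
    return errors

def produce(messages, generate, checks, max_repairs=2):
    """generate(messages, temperature) -> dict; each check(output) -> list of error strings."""
    output = generate(messages, temperature=0.7)  # first pass, default decoding
    for attempt in range(max_repairs + 1):
        errors = [e for check in checks for e in check(output)]
        if not errors:
            return output  # accepted on the first pass or after a repair
        if attempt < max_repairs:
            # Repair path: regenerate with slightly tighter decoding, aimed at the failures.
            output = generate(messages, temperature=0.3)
    raise ValueError("validation failed after repairs; abstain or escalate")
In practice, checks would be [schema_errors, policy_errors], and anything that exhausts the repair budget surfaces as an abstention rather than a low-quality answer, which is what fail closed means here.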
What Good Looks Like in Metrics
Constraint Pass-Rate (first pass) ≥ your target (many teams aim for 92–95%).
Time-to-valid p95 within your SLO—users feel p95, not p50.
Repairs per accepted output ≤ 0.25 (few sections need rework).
$/accepted output trending down (fewer retries, smaller generations).
Zero implied-write incidents (no text asserting actions that were not executed).
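All of these roll up from per-request logs. A sketch of the arithmetic, assuming each record carries the fields shown (field names are illustrative):
def rollup(records: list[dict]) -> dict:
    accepted = [r for r in records if r["accepted"]]
    n_accepted = max(len(accepted), 1)  # guard against empty windows in this sketch
    return {
        "constraint_pass_rate": sum(r["valid_first_pass"] for r in records) / max(len(records), 1),
        "repairs_per_accepted": sum(r["repairs"] for r in accepted) / n_accepted,
        "cost_per_accepted_usd": sum(r["cost_usd"] for r in records) / n_accepted,
        "implied_write_incidents": sum(r["implied_write"] for r in records),
        # p95 time-to-valid comes from the latency field of the same records (not shown).
    }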
Common Failure Modes (and Fixes)
Mega-prompt with vibes: Replace with a small contract and a schema; move policy into a JSON bundle.
Hallucinated actions: Require tool proposals; your server validates and executes. Add a validator that blocks success language without a tool record.
Context overload: If you use retrieval, shape it into atomic claims (timestamped snippets) before generation; cite minimal spans.
Flaky longform: Generate by section with stop sequences and per-section token caps; repair only the failed section.
Uncontrolled edits: Version the contract; run goldens in CI; canary every change; keep rollback cheap.
Discussion
Prompt contracts won’t remove the need for judgment, but they change where judgment lives. Instead of revising prose mid-incident, you adjust a small number of explicit rules with tests and gates. Teams discover that most “model problems” are really contract and policy problems—once fixed, behavior stabilizes even across model upgrades.
Conclusion
Turning prompts into contracts is the shortest path to reliable, auditable LLM features. By fixing scope, structure, decision rules, evidence policy, and tool mediation in a compact artifact—then validating and testing that artifact like code—you reduce variance, prevent incidents, and cut cost. In the next article, we’ll focus on decoding discipline and sectioned generation: how to set sampling policies that increase first-pass success and make latency predictable.