Introduction
“Make the prompt better” is not a strategy. In production, prompts behave like software artifacts with versions, tests, budgets, telemetry, and rollbacks. This article walks through an operational approach to prompt engineering you can ship: treat prompts as contracts, bind them to policies and tools, measure outcomes (not vibes), and manage change with flags and CI.
From words to contracts
A prompt should declare what it’s allowed to do, what it must produce, and how success is judged—before a single token is generated.
# contracts/support_triage.v7.yaml
role: "SupportTriageAgent"
capabilities:
  tools: [LookupCustomer, GetOrders, CreateTicket]
  retrieval: ["kb://support/*"]   # allowlisted sources
policies:
  safety_bundle: "policy/support.v4"
  pii_masking: true
io:
  input_schema:
    required: [user_message, channel]
  output_schema:
    required: [summary, category, priority, next_action, citations]
    types:
      category: ["billing", "shipping", "technical", "other"]
budgets:
  max_input_tokens: 4000
  max_output_tokens: 500
  latency_slo_ms: 1500
success:
  primary_metric: "CorrectRouting"
  guardrails:
    - {name: "DefectRate", threshold: 0.02, direction: "<"}
    - {name: "LatencyP95", threshold: 1.5, direction: "<"}   # seconds; matches the 1500 ms SLO
Why this helps: reviewers can diff intent safely; developers wire tools to typed interfaces; evaluators know exactly what to score.
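A minimal sketch of how a runtime or test harness might consume this contract (assuming PyYAML and the file path above; the helper names are illustrative, not from any particular framework):

# contract_check.py — illustrative sketch; helper names are not a specific framework's API.
import yaml  # PyYAML

def load_contract(path: str) -> dict:
    # One artifact shared by the runtime, the test suite, and evaluators.
    with open(path) as f:
        return yaml.safe_load(f)

def contract_violations(contract: dict, output: dict) -> list[str]:
    # Check required output fields and the enumerated category values.
    schema = contract["io"]["output_schema"]
    errors = [f"missing field: {field}" for field in schema["required"] if field not in output]
    allowed = schema["types"]["category"]
    if "category" in output and output["category"] not in allowed:
        errors.append(f"category must be one of {allowed}")
    return errors

if __name__ == "__main__":
    contract = load_contract("contracts/support_triage.v7.yaml")
    draft = {"summary": "Duplicate charge reported", "category": "billing"}
    print(contract_violations(draft and contract or contract, draft))  # -> missing priority, next_action, citations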
Prompt structure that scales
Prompts that survive production share a pattern:
System frame: scope, constraints, and JSON schema for output.
Context plan: how to fetch and rank context (retrieval policy, tool preconditions).
Reasoning frame: brief, reusable guidance (few-shot traces, error-handling).
Output contract: exact fields, types, and citation rules.
Minimal example:
System:
You are SupportTriageAgent. Use only allowed tools and sources. Always mask PII in text outputs.
Return JSON that matches the schema.
Context:
- Prefer documents tagged `kb:certified`.
- If unsure after one retrieval pass, ask exactly one clarifying question.
Reasoning rules:
- Draft → Validate (schema) → Emit.
- Cite minimal spans: doc IDs only.
Schema (must match):
{"summary": "...", "category":"billing|shipping|technical|other",
"priority":"low|medium|high", "next_action":"...", "citations":["doc:..."]}
Golden traces beat anecdotes
Create a small, representative set of golden cases with fixed inputs and expected outputs. Run them in CI for every prompt or model change.
# tests/support_triage_golden.yaml
cases:
  - id: "billing-refund-001"
    input: {"user_message": "I was charged twice last month", "channel": "email"}
    expect:
      category: "billing"
      priority: "medium"
      next_action: "CreateTicket"
      must_include: ["duplicate charge", "last month"]
  - id: "shipping-delay-002"
    input: {"user_message": "Package is late. Order #4815", "channel": "chat"}
    expect:
      category: "shipping"
      next_action: "GetOrders"
metrics:
  pass_if: "≥ 90% of cases meet expectations"
Golden traces stop regressions and make disagreements concrete (“which case failed?”).
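To make the goldens executable, a small runner can load the YAML and gate CI on the pass rate. A sketch, assuming the layout above and a triage() entry point you supply; the 90% bar mirrors pass_if:

# run_goldens.py — sketch of a CI gate over the golden cases; triage() is your own entry point.
import yaml

def run_goldens(path: str, triage) -> float:
    with open(path) as f:
        cases = yaml.safe_load(f)["cases"]
    passed = 0
    for case in cases:
        out = triage(**case["input"])          # returns the JSON dict per the output contract
        expect = case["expect"]
        ok = all(out.get(k) == v for k, v in expect.items() if k != "must_include")
        # Assumption: must_include strings are checked against the summary field.
        ok = ok and all(s in out.get("summary", "") for s in expect.get("must_include", []))
        passed += ok
    return passed / len(cases)

# CI gate mirroring pass_if: fail the build below 90%.
# assert run_goldens("tests/support_triage_golden.yaml", triage) >= 0.90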
Evaluation: outcomes over tokens
Don’t optimize for shorter prompts or higher BLEU. Optimize for business outcomes:
CorrectRouting (label agreement with human triage)
First-Contact Resolution (FCR)
Time-to-Valid (TTV)
$/Accepted (cost per successful outcome)
Log for every call: prompt version, model, retrieval set, tool calls, citations, latency, and decision outcome. That’s your prompt telemetry.
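One way to capture that telemetry is a structured per-call record; the field names below are illustrative, and the sink is whatever log pipeline you already run:

# telemetry.py — sketch of per-call prompt telemetry; field names are illustrative.
import json, time
from dataclasses import dataclass, field, asdict

@dataclass
class PromptCallRecord:
    prompt_version: str            # e.g. "support_triage.v7"
    model: str
    retrieval_doc_ids: list[str]
    tool_calls: list[dict]         # tool name, argument hash, receipt
    citations: list[str]
    latency_ms: int
    outcome: str                   # e.g. "CorrectRouting", "Escalated", "SchemaViolation"
    ts: float = field(default_factory=time.time)

def log_call(record: PromptCallRecord) -> None:
    # One JSON line per call; downstream dashboards aggregate $/Accepted, TTV, FCR.
    print(json.dumps(asdict(record)))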
Managing change safely: flags, canaries, rollback
Ship prompts like code:
Version them (support_triage.v7).
Wrap in a feature flag with variants (v6 control, v7 candidate).
Canary to 5–10% of traffic; monitor guardrails.
Promote or roll back with a single flag change that leaves an audit receipt (change ID).
This keeps UX stable while you learn.
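A minimal sketch of variant selection behind a flag, assuming a 10% canary and deterministic hashing on a request ID so a given requester stays in one variant; in production this lives in your flag service:

# flag_routing.py — sketch; hashing stands in for a real feature-flag service.
import hashlib

VARIANTS = {"v6": "support_triage.v6", "v7": "support_triage.v7"}
CANARY_PCT = 10  # percent of traffic routed to the candidate

def pick_bundle(request_id: str) -> str:
    # Deterministic bucket in [0, 100) so the same request_id always sees the same variant.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    variant = "v7" if bucket < CANARY_PCT else "v6"
    return VARIANTS[variant]   # log the variant with the change ID for the audit receipt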
Prompt + context: retrieval is a product
Context engineering complements prompt engineering; neither replaces the other. Good prompts still fail with noisy context. Set a retrieval policy:
Eligibility: define what is allowed (certified docs, recency window, role-based views).
Claim shaping: convert free text into explicit claims (IDs, fields, spans).
Minimal-span citations: link only the lines that back the answer.
Fallbacks: when context is weak, ask one targeted question or decline.
A simple retrieval policy stub:
retrieval_policy:
  k: 6
  filters: {tag: "kb:certified", lang: "en"}
  recency_days: 365
  disallow: ["draft", "sandbox"]
  cite: "minimal_span"
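Enforcing that policy in the retriever might look like the sketch below; the document fields (tag, lang, age_days, status) are assumptions about your index schema:

# retrieval_filter.py — sketch of policy enforcement over candidate docs; doc fields are assumed.
def apply_policy(candidates: list[dict], policy: dict) -> list[dict]:
    eligible = []
    for doc in candidates:
        if doc.get("tag") != policy["filters"]["tag"]:
            continue
        if doc.get("lang") != policy["filters"]["lang"]:
            continue
        if doc.get("age_days", 0) > policy["recency_days"]:
            continue
        if doc.get("status") in policy["disallow"]:
            continue
        eligible.append(doc)
    # Keep only the top-k after filtering; ranking happens upstream.
    return eligible[: policy["k"]]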
Tool use without tears
Bind tools as typed functions. Prompts reference capabilities; the runtime enforces them.
# tools.py
from pydantic import BaseModel

class LookupCustomerArgs(BaseModel):
    email: str

class CreateTicketArgs(BaseModel):
    summary: str
    priority: str

def lookup_customer(a: LookupCustomerArgs) -> dict: ...
def create_ticket(a: CreateTicketArgs) -> dict: ...
The model proposes tool calls; your executor validates arguments, runs the tool, and returns a receipt (e.g., TKT-1234). The prompt never claims success without a receipt.
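A sketch of that executor, reusing the pydantic models above; the registry is shown for one tool and the receipt shape is illustrative:

# executor.py — sketch; validates proposed tool calls and returns receipts, never implied success.
from pydantic import ValidationError
from tools import CreateTicketArgs, create_ticket   # registry shown for one tool for brevity

TOOLS = {"CreateTicket": (CreateTicketArgs, create_ticket)}

def execute(tool_name: str, raw_args: dict) -> dict:
    model_cls, fn = TOOLS[tool_name]
    try:
        args = model_cls(**raw_args)            # reject malformed arguments before any side effect
    except ValidationError as e:
        return {"ok": False, "error": str(e)}
    result = fn(args)                           # e.g. {"ticket_id": "TKT-1234"}
    return {"ok": True, "receipt": result}      # the prompt may only claim success with this receipt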
Compression and cost control
Cost isn’t just model price; it’s context length × call volume × retries. Practical levers:
Prompt compression: template constants, abbreviate system text, move long policy text behind stable IDs (fetched once and cached).
Context pruning: keep top-k with redundancy penalties; ban long, low-signal blobs.
Model mix: route by difficulty/uncertainty; use small models for easy paths, escalate only when needed.
Caching: semantic and exact-match caches keyed by normalized inputs and a context fingerprint, as sketched below.
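For the caching lever, here is a sketch of an exact-match cache key built from a normalized input and a fingerprint of the retrieved context; the normalization rules are assumptions to tune for your traffic:

# cache_key.py — sketch; normalization rules and fingerprinting choices are assumptions.
import hashlib, json

def context_fingerprint(doc_ids: list[str]) -> str:
    # Order-insensitive hash of the retrieval set, so identical context reuses the cache.
    return hashlib.sha256("|".join(sorted(doc_ids)).encode()).hexdigest()[:16]

def cache_key(prompt_version: str, user_message: str, doc_ids: list[str]) -> str:
    normalized = " ".join(user_message.lower().split())   # collapse whitespace, lowercase
    payload = json.dumps([prompt_version, normalized, context_fingerprint(doc_ids)])
    return hashlib.sha256(payload.encode()).hexdigest()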
Failure handling that users trust
Write for failure explicitly:
Schema violations: “Emit a valid JSON object; if invalid, self-correct once.”
Low confidence: “If uncertainty > 0.4, ask one clarifying question.”
Tool errors: “If a tool fails, surface the receipt and attempt once with a different parameter; otherwise escalate.”
Policy conflicts: “If policy blocks an answer, summarize why and suggest a safe alternative.”
These clauses prevent awkward silence and preserve auditability; the sketch below wires the self-correction rule into the runtime.
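The "self-correct once" clause pairs naturally with a runtime retry. A sketch, assuming a call_model() function from your own stack and the required fields from the contract:

# self_correct.py — sketch of the single-retry rule; call_model() is assumed to exist in your stack.
import json

def call_with_one_repair(call_model, prompt: str, required: list[str]) -> dict:
    raw = call_model(prompt)
    for attempt in range(2):                    # original attempt + at most one self-correction
        try:
            out = json.loads(raw)
            missing = [k for k in required if k not in out]
            if not missing:
                return out
            problem = f"missing fields: {missing}"
        except json.JSONDecodeError as e:
            problem = f"invalid JSON: {e}"
        if attempt == 0:
            raw = call_model(f"{prompt}\nYour previous output was rejected ({problem}). Emit valid JSON matching the schema.")
    raise ValueError(f"output still invalid after one repair: {problem}")   # escalate to a human path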
Example: a full prompt “bundle”
Bundle prompt text, schema, retrieval policy, and budgets in one artifact.
bundle_id: "support_triage.v7"
model: "gpt-5-large"
system_prompt: |
  You triage support messages. Use tools only as declared. Output JSON per schema.
  Mask PII in free text. Cite minimal spans.
schema: {"type": "object", "required": ["summary", "category", "priority", "next_action", "citations"], "...": "..."}
retrieval_policy: {k: 6, filters: {tag: "kb:certified"}, recency_days: 365}
tools: [LookupCustomer, GetOrders, CreateTicket]
budgets: {max_input_tokens: 4000, max_output_tokens: 500, latency_slo_ms: 1500}
flags:
  key: "support.triage.bundle"
  variants: {"v6": "support_triage.v6", "v7": "support_triage.v7"}
tests: "tests/support_triage_golden.yaml"
Ship this bundle through your CI pipeline; run goldens; deploy behind a flag; watch guardrails; promote with a receipt.
Conclusion
Production-grade prompt engineering is software engineering: contracts, tests, telemetry, and change control. Keep prompts succinct but explicit, pair them with clean context, measure outcomes, and ship with flags and receipts. Do this, and your prompts won’t just be clever—they’ll be reliable, scalable, and revenue-positive.