Executive summary
Prompt engineering is no longer clever wording; it's systems design. The goal is predictable, auditable outcomes across changing models, messy data, and strict governance. This article lays out a practical, model-agnostic approach: design prompts as modular components, run them inside a controller that plans→retrieves→executes→verifies, instrument everything, and treat prompts like production code (versioned, tested, observed).
Why prompts fail in production
- Prompt spaghetti: business logic hidden in mega-prompts; no contracts or tests.
- Model drift: upgrades change behavior without notice.
- RAG mismatch: great on toy sets, brittle on real content.
- No audit trail: hard to show how a decision was made.
Fix: separate concerns. Keep planning, retrieval, tool use, verification, and generation as distinct steps with explicit inputs/outputs.
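A minimal sketch of that separation in Python, with plain dataclasses as the explicit inputs/outputs between steps; the step bodies are deliberately left as stubs, because the point is the typed boundaries, not any particular framework:
from dataclasses import dataclass, field

@dataclass
class Plan:
    subtasks: list[str]
    success_criteria: list[str]

@dataclass
class Evidence:
    passages: list[str]
    citations: list[str]

@dataclass
class Draft:
    text: str
    citations: list[str]

@dataclass
class Verdict:
    passed: bool
    failures: list[str] = field(default_factory=list)

def plan(task: str) -> Plan: ...                           # planning prompt only
def retrieve(plan: Plan) -> Evidence: ...                  # retrieval only, no generation
def execute(plan: Plan, evidence: Evidence) -> Draft: ...  # drafting from evidence only
def verify(draft: Draft, plan: Plan) -> Verdict: ...       # checks only, no rewriting

# Each boundary is a typed contract you can log, test, and swap independently.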
Core principles
- Modularity: one prompt per responsibility (planner, executor, checker, router).
- Contracts: JSON schemas for inputs/outputs; reject non-conforming results (see the validation sketch after this list).
- Grounding before generation: “no source → no claim” for factual tasks.
- Determinism where it matters: low temperature / constrained decoding for critical steps.
- Observability: logs, traces, and metrics tied to every run.
- Safety by construction: policy and risk checks embedded in the flow, not after.
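To make the contracts principle concrete, here is a minimal validation gate, a sketch that assumes the third-party jsonschema package and an illustrative ANSWER_SCHEMA; anything non-conforming is rejected rather than silently patched:
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

ANSWER_SCHEMA = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
}

def accept_or_reject(raw_model_output: str) -> dict:
    """Parse and validate model output; reject anything non-conforming."""
    try:
        candidate = json.loads(raw_model_output)
        validate(instance=candidate, schema=ANSWER_SCHEMA)
        return {"ok": True, "value": candidate}
    except (json.JSONDecodeError, ValidationError) as err:
        return {"ok": False, "reason": str(err)}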
Reference architecture (model-agnostic)
- Controller: orchestrates steps; chooses patterns (Zero-Shot, Chain-of-Thought (CoT), Tree-of-Thoughts, tool-first, etc.).
- Retrieval layer: vector/keyword hybrid, task-aware chunking, namespaces per tenant/domain.
- Tool layer: deterministic skills (SQL, search, calculators, APIs) with strict I/O schemas.
- Verification layer: checklists, schema validation, policy tests, citation checks.
- Generation layer: final composition with placeholders filled from verified evidence.
- Telemetry: per-step logs, costs, latencies, pass/fail reasons.
Think “assembly line,” not “magic box.”
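A sketch of that assembly line as a controller loop: plan → retrieve → execute → verify, with per-step telemetry tied to one run. The step callables are placeholders for the prompts and tools described in the rest of this article, and the verify step is assumed to return a passed flag plus failure reasons:
import time
import uuid

def run_controller(task: str, steps: dict, max_revisions: int = 2) -> dict:
    """steps maps 'plan'/'retrieve'/'execute'/'verify' to callables; verify is
    assumed to return {"passed": bool, "failures": list[str]}."""
    run_id = str(uuid.uuid4())
    trace = []

    def timed(name, fn, *args):
        # Per-step telemetry: every entry carries the same run_id.
        start = time.monotonic()
        result = fn(*args)
        trace.append({"run_id": run_id, "step": name,
                      "latency_s": round(time.monotonic() - start, 3)})
        return result

    plan = timed("plan", steps["plan"], task)
    for _ in range(max_revisions + 1):
        evidence = timed("retrieve", steps["retrieve"], plan)
        draft = timed("execute", steps["execute"], plan, evidence)
        verdict = timed("verify", steps["verify"], draft, plan)
        if verdict["passed"]:
            return {"run_id": run_id, "output": draft, "trace": trace}
        plan = timed("replan", steps["plan"], f"{task}\nFix: {verdict['failures']}")
    return {"run_id": run_id, "output": None, "trace": trace, "failed": True}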
The prompt stack: from intent to governed output
- Intent & KPI: goal, audience, success criteria.
- Constraints: tone, style, latency/cost budget, risk boundaries.
- Domain context: glossaries, schemas, examples, counter-examples.
- Retrieval spec: namespaces, filters, freshness, max docs.
- Tool catalog: name, purpose, input/output schema, limits.
- Reasoning mode: e.g., plan→execute→check.
- Output contract: machine-checkable schema + citations + confidence.
When these layers are explicit, you can evolve one without breaking the rest.
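One way to keep the layers explicit is to carry them as a single structured spec instead of burying them in prose. A sketch with purely illustrative values; the field names mirror the layers above, not any particular library:
prompt_spec = {
    "intent": {"goal": "summarize contract risk", "audience": "legal ops", "kpi": "reviewer win rate"},
    "constraints": {"tone": "neutral", "latency_budget_s": 20, "cost_budget_usd": 0.05, "risk": "no legal advice"},
    "domain_context": {"glossary": "contracts_glossary_v3", "examples": 12, "counter_examples": 4},
    "retrieval_spec": {"namespaces": ["contracts/us"], "filters": {"year_gte": 2022}, "freshness_days": 90, "max_docs": 8},
    "tool_catalog": ["sql.query", "doc.search"],
    "reasoning_mode": "plan->execute->check",
    "output_contract": {"schema": "answer_v2.json", "citations": "per_section", "confidence": True},
}
# Because each layer is a named field, you can tune retrieval or constraints
# without touching the reasoning mode or the output contract.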
High-leverage prompt patterns
Planner–Executor–Checker (PEC)
- Planner decomposes the task and writes success criteria.
- Executor performs retrieval/tool calls and drafts outputs.
- Checker validates against criteria, policies, and schemas.
Router
- Classifies each incoming task and dispatches it to the right pattern or per-task specialization (see the router micro-prompt in the templates below).
Decomposer
- Splits a large request into smaller subtasks that can be retrieved, executed, and verified independently.
Critic/Refiner
- A second pass that critiques a draft against the success criteria and revises it before release.
Tool-First Grounding
- Sends math, code, and data questions to deterministic tools first, then composes the answer from verified tool output.
Anti-patterns to retire
- Mega-prompt as business logic. Untestable and fragile.
- Free-form outputs. Always emit JSON (or well-formed XML/CSV) that validates.
- Unbounded retries. Cap iterations; log deltas between attempts (see the retry sketch after this list).
- Ambiguous retrieval. Be explicit about entities, time windows, jurisdictions.
- One prompt fits all. Use routers and per-task specializations.
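A minimal sketch of capped retries with logged deltas, as flagged in the retries item above; difflib stands in for whatever "what changed between attempts" logging you prefer:
import difflib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retries")

def bounded_retry(generate, check, max_attempts: int = 3):
    """generate(attempt) returns a draft; check(draft) returns a list of failure reasons."""
    previous = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate(attempt)
        failures = check(draft)
        delta = "\n".join(difflib.unified_diff(previous.splitlines(),
                                               draft.splitlines(), lineterm=""))
        log.info("attempt=%d failures=%s delta=%r", attempt, failures, delta[:500])
        if not failures:
            return draft
        previous = draft
    return None  # cap reached: fail explicitly instead of looping forever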
Prompt-Oriented Development (POD) lifecycle
- Specify: one-page prompt spec (goal, constraints, KPIs, failure modes).
- Scaffold: implement planner/executor/checker with schemas.
- Simulate: synthetic + historical cases; collect traces.
- Evaluate: automated metrics + human review; compare to baselines.
- Ship: version prompts; canary rollouts with guardrails.
- Monitor: drift dashboards, red-team tests, incident playbooks.
Rule: one prompt/pattern = one owned artifact with version, tests, and SLOs.
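A sketch of what "one owned artifact" might look like in a repo; the file paths, thresholds, and SLO numbers are illustrative, not recommendations:
prompt_artifact = {
    "id": "claims.summarizer.checker",
    "version": "1.4.0",                      # bumped and reviewed like any dependency
    "owner": "claims-platform-team",
    "spec": "specs/claims_summarizer.md",    # the one-page prompt spec
    "template": "prompts/checker_v1_4.txt",
    "tests": {
        "golden_set": "eval/golden/claims_v3.jsonl",
        "min_win_rate": 0.85,
        "output_schema": "schemas/answer_v2.json",
    },
    "slos": {"p95_latency_s": 8, "cost_per_task_usd": 0.04, "grounded_citation_rate": 0.98},
}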
Metrics that matter (tie to every run)
- Quality: win rate vs. human baseline; grounded-citation rate; contradiction rate.
- Retrieval: hit rate, novelty, source diversity, freshness.
- Tools: precision/recall by tool; error reasons.
- Safety/Policy: pass rate, false accept/reject.
- Ops: latency (p50/p95/p99), cost per task, retry counts.
Attach all to a Run ID so you can localize failures to plan, retrieval, tools, or verification.
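A small sketch of using that Run ID to localize failures, assuming each step log carries run_id, step, and a passed flag:
from collections import defaultdict

def localize_failure(step_logs: list[dict]) -> dict:
    """Group step logs by run_id and report the first failing step for each run.

    Each log entry is assumed to look like:
    {"run_id": "...", "step": "retrieve", "passed": False, "reason": "no hits"}
    """
    runs = defaultdict(list)
    for entry in step_logs:
        runs[entry["run_id"]].append(entry)
    report = {}
    for run_id, entries in runs.items():
        failed = [e for e in entries if not e.get("passed", True)]
        report[run_id] = failed[0]["step"] if failed else "ok"
    return report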
RAG that actually scales
- Task-aware chunking: respect headings, tables, and entities; avoid splitting concepts.
- Hybrid retrieval: combine ANN vectors with keyword/regex filters (see the fusion sketch after this list).
- Caching tiers: short-TTL answers; long-TTL facts; explicit invalidation rules.
- Attribution discipline: require per-sentence or per-section citations; fail closed if missing.
- Feedback loops: learn from “no-evidence” claims and human edits.
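For the hybrid retrieval item above, one common way to merge vector and keyword results is reciprocal rank fusion; a sketch that assumes the two retrievers already returned ranked lists of document IDs:
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs into one ranking, rewarding agreement.

    ranked_lists might be [vector_hits, keyword_hits]; k damps the influence
    of any single list's top ranks (60 is the commonly cited default).
    """
    scores: dict[str, float] = {}
    for hits in ranked_lists:
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents surfaced by both retrievers float to the top of the merged list.
merged = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])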
Multimodal prompting, safely
- Ingest → reason → generate: OCR/ASR with confidence scores; gate low-confidence spans for review (see the gating sketch after this list).
- Cross-modal checks: totals, dates, captions vs. body text; verify before publishing.
- Localization & accessibility: locale formats, PII scrubbing, and WCAG considerations before generation.
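A minimal sketch of the confidence gate mentioned above, assuming the OCR/ASR engine emits (text, confidence) spans; the 0.85 threshold is illustrative:
def gate_spans(spans: list[tuple[str, float]], threshold: float = 0.85):
    """Split recognized spans into auto-approved text and spans needing human review."""
    approved, needs_review = [], []
    for text, confidence in spans:
        (approved if confidence >= threshold else needs_review).append((text, confidence))
    return approved, needs_review

approved, review_queue = gate_spans([("Total: $1,240.00", 0.97), ("Date: 03/1?/2024", 0.62)])
# Only `approved` flows into generation; `review_queue` goes to a person first.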
Governance inside the flow (not after it)
- Policy as code: encode HIPAA/GDPR/PCI rules as validators the checker must pass (see the validator sketch after this list).
- Human-in-the-loop gates: pause at sensitive actions; show diffs and evidence; require approvals.
- Provenance: store hashed sources, prompts, tool I/O, and final outputs with the Run ID.
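A sketch of policy as code as a validator the checker must pass. The two patterns below (a US SSN shape and an email address) are only illustrative; real HIPAA/GDPR/PCI validators would be far broader:
import re

POLICY_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def policy_findings(text: str) -> list[dict]:
    """Return one finding per violation; an empty list means the policy gate passes."""
    findings = []
    for name, pattern in POLICY_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"policy": name, "excerpt": match.group(0)})
    return findings

# The checker refuses to emit a final answer unless policy_findings(draft) == [].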
Templates you can adopt today
1) Controller prompt (skeleton)
[Role] You are the Controller for <Task>. Optimize for <KPI> under <Constraints>.
[Inputs]
- Intent: <...>
- Policies: <list>
- Tools: <name, input_schema, output_schema, limits>
- Retrieval: <namespaces, filters, freshness, max_docs>
[Plan]
1) Disambiguate; list assumptions.
2) Decompose into subtasks + success criteria.
3) For each subtask: choose tool/RAG with justification.
4) Execute; collect evidence with citations.
5) Verify vs. criteria & policies; note failures.
6) If failed, revise plan (max N iterations).
7) Produce final JSON per schema.
2) Output schema (example)
{
  "type": "object",
  "required": ["answer", "sources", "policy_findings", "steps", "metrics"],
  "properties": {
    "answer": {"type": "string"},
    "sources": {"type": "array", "items": {"type": "string"}},
    "policy_findings": {"type": "array"},
    "steps": {"type": "array"},
    "metrics": {"type": "object"}
  }
}
3) Tool contract (example)
{
  "name": "sql.query",
  "input_schema": {"type": "object", "required": ["sql"], "properties": {"sql": {"type": "string"}}},
  "output_schema": {"type": "object", "required": ["rows"], "properties": {"rows": {"type": "array"}}},
  "guardrails": {"deny": ["DROP", "ALTER", "TRUNCATE"], "max_rows": 5000}
}
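A sketch of enforcing this contract's guardrails at the tool boundary, before anything reaches the database; the execute callable is assumed to be a read-only client you supply:
import re

DENY = re.compile(r"\b(DROP|ALTER|TRUNCATE)\b", re.IGNORECASE)
MAX_ROWS = 5000  # mirrors the guardrails block above

def run_sql_tool(sql: str, execute) -> dict:
    """Validate the input against the contract, run it, and cap the output size."""
    if DENY.search(sql):
        return {"error": "denied_statement", "rows": []}
    rows = execute(sql)               # injected callable, e.g. a read-only DB client
    return {"rows": rows[:MAX_ROWS]}  # enforce the output-side limit as well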
4) Verification checklist (snippet)
- All factual statements cited? (Y/N)
- Totals/dates consistent across sections? (Y/N)
- Required policies satisfied? (Y/N)
- Output validates against JSON schema? (Y/N)
5) Router micro-prompt
If task requires math/code/db → route: Tool-First.
If open-ended but safety-critical → route: PEC with Checker + low temperature.
If classification/extraction → route: Deterministic + schema validation.
Else → route: CoT with Critic/Refiner.
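The same routing rules can also live as a deterministic pre-step in code. A sketch that uses naive keyword signals as stand-ins for a real task classifier:
def route(task: str, safety_critical: bool = False) -> str:
    """Map a task description to one of the patterns above."""
    t = task.lower()
    if any(signal in t for signal in ("calculate", "sql", "sum of", "query the")):
        return "tool_first"
    if safety_critical:
        return "pec_with_checker_low_temp"
    if any(signal in t for signal in ("classify", "extract", "label")):
        return "deterministic_schema_validated"
    return "cot_with_critic_refiner"

route("Calculate the sum of Q3 refunds from the orders table")  # -> "tool_first"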
Maturity model (L0→L4)
- L0 Ad-hoc: copy-paste prompts; no tests.
- L1 Versioned: prompts in repo; basic A/B.
- L2 Tested: golden sets, schema outputs, drift alerts.
- L3 Governed: controller orchestration, policy as code, full traces.
- L4 Optimized: automatic routing, feedback learning, cost/latency SLOs.
Aim for L3 before scaling across business units.
30–60–90 day plan
Days 0–30
Days 31–60
Days 61–90
Closing note
Prompt engineering is the interface between probabilistic models and deterministic business. Treat it like a product discipline: modular prompts, explicit contracts, rigorous verification, and real observability. Do that and you'll get durable, repeatable outcomes even as models, data, and requirements keep changing.