Executive summary
Prompt engineering is no longer clever wording; it's systems design. The goal is predictable, auditable outcomes across changing models, messy data, and strict governance. This article lays out a practical, model-agnostic approach: design prompts as modular components, run them inside a controller that plans→retrieves→executes→verifies, instrument everything, and treat prompts like production code (versioned, tested, observed).
Why prompts fail in production
- Prompt spaghetti: business logic hidden in mega-prompts; no contracts or tests.
- Model drift: upgrades change behavior without notice.
- RAG mismatch: great on toy sets, brittle on real content.
- No audit trail: hard to show how a decision was made.
Fix: separate concerns. Keep planning, retrieval, tool use, verification, and generation as distinct steps with explicit inputs/outputs.
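A minimal sketch of that separation in Python, with plain dataclasses as the explicit inputs/outputs between steps; the step bodies are deliberately left as stubs, because the point is the typed boundaries, not any particular framework:
from dataclasses import dataclass, field

@dataclass
class Plan:
    subtasks: list[str]
    success_criteria: list[str]

@dataclass
class Evidence:
    passages: list[str]
    citations: list[str]

@dataclass
class Draft:
    text: str
    citations: list[str]

@dataclass
class Verdict:
    passed: bool
    failures: list[str] = field(default_factory=list)

def plan(task: str) -> Plan: ...                           # planning prompt only
def retrieve(plan: Plan) -> Evidence: ...                  # retrieval only, no generation
def execute(plan: Plan, evidence: Evidence) -> Draft: ...  # drafting from evidence only
def verify(draft: Draft, plan: Plan) -> Verdict: ...       # checks only, no rewriting

# Each boundary is a typed contract you can log, test, and swap independently.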
Core principles
- Modularity: one prompt per responsibility (planner, executor, checker, router).
- Contracts: JSON schemas for inputs/outputs; reject non-conforming results (see the validation sketch after this list).
- Grounding before generation: “no source → no claim” for factual tasks.
- Determinism where it matters: low temperature / constrained decoding for critical steps.
- Observability: logs, traces, and metrics tied to every run.
- Safety by construction: policy and risk checks embedded in the flow, not after.
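To make the contracts principle concrete, here is a minimal validation gate, a sketch that assumes the third-party jsonschema package and an illustrative ANSWER_SCHEMA; anything non-conforming is rejected rather than silently patched:
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

ANSWER_SCHEMA = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
}

def accept_or_reject(raw_model_output: str) -> dict:
    """Parse and validate model output; reject anything non-conforming."""
    try:
        candidate = json.loads(raw_model_output)
        validate(instance=candidate, schema=ANSWER_SCHEMA)
        return {"ok": True, "value": candidate}
    except (json.JSONDecodeError, ValidationError) as err:
        return {"ok": False, "reason": str(err)}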
Reference architecture (model-agnostic)
- Controller: orchestrates steps; chooses patterns (Zero-Shot, Chain-of-Thought (CoT), Tree-of-Thoughts, tool-first, etc.).
- Retrieval layer: vector/keyword hybrid, task-aware chunking, namespaces per tenant/domain.
- Tool layer: deterministic skills (SQL, search, calculators, APIs) with strict I/O schemas.
- Verification layer: checklists, schema validation, policy tests, citation checks.
- Generation layer: final composition with placeholders filled from verified evidence.
- Telemetry: per-step logs, costs, latencies, pass/fail reasons.
Think “assembly line,” not “magic box.”
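A sketch of that assembly line as a controller loop: plan → retrieve → execute → verify, with per-step telemetry tied to one run. The step callables are placeholders for the prompts and tools described in the rest of this article, and the verify step is assumed to return a passed flag plus failure reasons:
import time
import uuid

def run_controller(task: str, steps: dict, max_revisions: int = 2) -> dict:
    """steps maps 'plan'/'retrieve'/'execute'/'verify' to callables; verify is
    assumed to return {"passed": bool, "failures": list[str]}."""
    run_id = str(uuid.uuid4())
    trace = []

    def timed(name, fn, *args):
        # Per-step telemetry: every entry carries the same run_id.
        start = time.monotonic()
        result = fn(*args)
        trace.append({"run_id": run_id, "step": name,
                      "latency_s": round(time.monotonic() - start, 3)})
        return result

    plan = timed("plan", steps["plan"], task)
    for _ in range(max_revisions + 1):
        evidence = timed("retrieve", steps["retrieve"], plan)
        draft = timed("execute", steps["execute"], plan, evidence)
        verdict = timed("verify", steps["verify"], draft, plan)
        if verdict["passed"]:
            return {"run_id": run_id, "output": draft, "trace": trace}
        plan = timed("replan", steps["plan"], f"{task}\nFix: {verdict['failures']}")
    return {"run_id": run_id, "output": None, "trace": trace, "failed": True}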
The prompt stack: from intent to governed output
- Intent & KPI: goal, audience, success criteria.
- Constraints: tone, style, latency/cost budget, risk boundaries.
- Domain context: glossaries, schemas, examples, counter-examples.
- Retrieval spec: namespaces, filters, freshness, max docs.
- Tool catalog: name, purpose, input/output schema, limits.
- Reasoning mode: e.g., plan→execute→check.
- Output contract: machine-checkable schema + citations + confidence.
When these layers are explicit, you can evolve one without breaking the rest.
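One way to keep the layers explicit is to carry them as a single structured spec instead of burying them in prose. A sketch with purely illustrative values; the field names mirror the layers above, not any particular library:
prompt_spec = {
    "intent": {"goal": "summarize contract risk", "audience": "legal ops", "kpi": "reviewer win rate"},
    "constraints": {"tone": "neutral", "latency_budget_s": 20, "cost_budget_usd": 0.05, "risk": "no legal advice"},
    "domain_context": {"glossary": "contracts_glossary_v3", "examples": 12, "counter_examples": 4},
    "retrieval_spec": {"namespaces": ["contracts/us"], "filters": {"year_gte": 2022}, "freshness_days": 90, "max_docs": 8},
    "tool_catalog": ["sql.query", "doc.search"],
    "reasoning_mode": "plan->execute->check",
    "output_contract": {"schema": "answer_v2.json", "citations": "per_section", "confidence": True},
}
# Because each layer is a named field, you can tune retrieval or constraints
# without touching the reasoning mode or the output contract.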
High-leverage prompt patterns
Planner–Executor–Checker (PEC)
- Planner decomposes the task and writes success criteria.
- Executor performs retrieval/tool calls and drafts outputs.
- Checker validates against criteria, policies, and schemas.
Router
- Classifies each incoming task and dispatches it to the right pattern or per-task specialization (see the router micro-prompt in the templates below).
Decomposer
- Splits a large request into smaller subtasks that can be retrieved, executed, and verified independently.
Critic/Refiner
- A second pass that critiques a draft against the success criteria and revises it before release.
Tool-First Grounding
- Sends math, code, and data questions to deterministic tools first, then composes the answer from verified tool output.
Anti-patterns to retire
- Mega-prompt as business logic. Untestable and fragile.
- Free-form outputs. Always emit JSON (or well-formed XML/CSV) that validates.
- Unbounded retries. Cap iterations; log deltas between attempts (see the retry sketch after this list).
- Ambiguous retrieval. Be explicit about entities, time windows, jurisdictions.
- One prompt fits all. Use routers and per-task specializations.
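A minimal sketch of capped retries with logged deltas, as flagged in the retries item above; difflib stands in for whatever "what changed between attempts" logging you prefer:
import difflib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retries")

def bounded_retry(generate, check, max_attempts: int = 3):
    """generate(attempt) returns a draft; check(draft) returns a list of failure reasons."""
    previous = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate(attempt)
        failures = check(draft)
        delta = "\n".join(difflib.unified_diff(previous.splitlines(),
                                               draft.splitlines(), lineterm=""))
        log.info("attempt=%d failures=%s delta=%r", attempt, failures, delta[:500])
        if not failures:
            return draft
        previous = draft
    return None  # cap reached: fail explicitly instead of looping forever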
Prompt-Oriented Development (POD) lifecycle
- Specify: one-page prompt spec (goal, constraints, KPIs, failure modes).
- Scaffold: implement planner/executor/checker with schemas.
- Simulate: synthetic + historical cases; collect traces.
- Evaluate: automated metrics + human review; compare to baselines.
- Ship: version prompts; canary rollouts with guardrails.
- Monitor: drift dashboards, red-team tests, incident playbooks.
Rule: one prompt/pattern = one owned artifact with version, tests, and SLOs.
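A sketch of what "one owned artifact" might look like in a repo; the file paths, thresholds, and SLO numbers are illustrative, not recommendations:
prompt_artifact = {
    "id": "claims.summarizer.checker",
    "version": "1.4.0",                      # bumped and reviewed like any dependency
    "owner": "claims-platform-team",
    "spec": "specs/claims_summarizer.md",    # the one-page prompt spec
    "template": "prompts/checker_v1_4.txt",
    "tests": {
        "golden_set": "eval/golden/claims_v3.jsonl",
        "min_win_rate": 0.85,
        "output_schema": "schemas/answer_v2.json",
    },
    "slos": {"p95_latency_s": 8, "cost_per_task_usd": 0.04, "grounded_citation_rate": 0.98},
}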
Metrics that matter (tie to every run)
- Quality: win rate vs. human baseline; grounded-citation rate; contradiction rate.
- Retrieval: hit rate, novelty, source diversity, freshness.
- Tools: precision/recall by tool; error reasons.
- Safety/Policy: pass rate, false accept/reject.
- Ops: latency (p50/p95/p99), cost per task, retry counts.
Attach all to a Run ID so you can localize failures to plan, retrieval, tools, or verification.
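A small sketch of using that Run ID to localize failures, assuming each step log carries run_id, step, and a passed flag:
from collections import defaultdict

def localize_failure(step_logs: list[dict]) -> dict:
    """Group step logs by run_id and report the first failing step for each run.

    Each log entry is assumed to look like:
    {"run_id": "...", "step": "retrieve", "passed": False, "reason": "no hits"}
    """
    runs = defaultdict(list)
    for entry in step_logs:
        runs[entry["run_id"]].append(entry)
    report = {}
    for run_id, entries in runs.items():
        failed = [e for e in entries if not e.get("passed", True)]
        report[run_id] = failed[0]["step"] if failed else "ok"
    return report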
RAG that actually scales
- Task-aware chunking: respect headings, tables, and entities; avoid splitting concepts.
- Hybrid retrieval: combine ANN vectors with keyword/regex filters (see the fusion sketch after this list).
- Caching tiers: short-TTL answers; long-TTL facts; explicit invalidation rules.
- Attribution discipline: require per-sentence or per-section citations; fail closed if missing.
- Feedback loops: learn from “no-evidence” claims and human edits.
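For the hybrid retrieval item above, one common way to merge vector and keyword results is reciprocal rank fusion; a sketch that assumes the two retrievers already returned ranked lists of document IDs:
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs into one ranking, rewarding agreement.

    ranked_lists might be [vector_hits, keyword_hits]; k damps the influence
    of any single list's top ranks (60 is the commonly cited default).
    """
    scores: dict[str, float] = {}
    for hits in ranked_lists:
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents surfaced by both retrievers float to the top of the merged list.
merged = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])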
Multimodal prompting, safely
- Ingest → reason → generate: OCR/ASR with confidence scores; gate low-confidence spans for review (see the gating sketch after this list).
- Cross-modal checks: totals, dates, captions vs. body text; verify before publishing.
- Localization & accessibility: locale formats, PII scrubbing, and WCAG considerations before generation.
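A minimal sketch of the confidence gate mentioned above, assuming the OCR/ASR engine emits (text, confidence) spans; the 0.85 threshold is illustrative:
def gate_spans(spans: list[tuple[str, float]], threshold: float = 0.85):
    """Split recognized spans into auto-approved text and spans needing human review."""
    approved, needs_review = [], []
    for text, confidence in spans:
        (approved if confidence >= threshold else needs_review).append((text, confidence))
    return approved, needs_review

approved, review_queue = gate_spans([("Total: $1,240.00", 0.97), ("Date: 03/1?/2024", 0.62)])
# Only `approved` flows into generation; `review_queue` goes to a person first.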
Governance inside the flow (not after it)
- Policy as code: encode HIPAA/GDPR/PCI rules as validators the checker must pass (see the validator sketch after this list).
- Human-in-the-loop gates: pause at sensitive actions; show diffs and evidence; require approvals.
- Provenance: store hashed sources, prompts, tool I/O, and final outputs with the Run ID.
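A sketch of policy as code as a validator the checker must pass. The two patterns below (a US SSN shape and an email address) are only illustrative; real HIPAA/GDPR/PCI validators would be far broader:
import re

POLICY_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def policy_findings(text: str) -> list[dict]:
    """Return one finding per violation; an empty list means the policy gate passes."""
    findings = []
    for name, pattern in POLICY_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"policy": name, "excerpt": match.group(0)})
    return findings

# The checker refuses to emit a final answer unless policy_findings(draft) == [].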
Templates you can adopt today
1) Controller prompt (skeleton)
[Role] You are the Controller for <Task>. Optimize for <KPI> under <Constraints>.
[Inputs]
- Intent: <...>
- Policies: <list>
- Tools: <name, input_schema, output_schema, limits>
- Retrieval: <namespaces, filters, freshness, max_docs>
[Plan]
1) Disambiguate; list assumptions.
2) Decompose into subtasks + success criteria.
3) For each subtask: choose tool/RAG with justification.
4) Execute; collect evidence with citations.
5) Verify vs. criteria & policies; note failures.
6) If failed, revise plan (max N iterations).
7) Produce final JSON per schema.
2) Output schema (example)
{
  "type": "object",
  "required": ["answer", "sources", "policy_findings", "steps", "metrics"],
  "properties": {
    "answer": {"type": "string"},
    "sources": {"type": "array", "items": {"type": "string"}},
    "policy_findings": {"type": "array"},
    "steps": {"type": "array"},
    "metrics": {"type": "object"}
  }
}
3) Tool contract (example)
{
  "name": "sql.query",
  "input_schema": {"type": "object", "required": ["sql"], "properties": {"sql": {"type": "string"}}},
  "output_schema": {"type": "object", "required": ["rows"], "properties": {"rows": {"type": "array"}}},
  "guardrails": {"deny": ["DROP", "ALTER", "TRUNCATE"], "max_rows": 5000}
}
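A sketch of enforcing this contract's guardrails at the tool boundary, before anything reaches the database; the execute callable is assumed to be a read-only client you supply:
import re

DENY = re.compile(r"\b(DROP|ALTER|TRUNCATE)\b", re.IGNORECASE)
MAX_ROWS = 5000  # mirrors the guardrails block above

def run_sql_tool(sql: str, execute) -> dict:
    """Validate the input against the contract, run it, and cap the output size."""
    if DENY.search(sql):
        return {"error": "denied_statement", "rows": []}
    rows = execute(sql)               # injected callable, e.g. a read-only DB client
    return {"rows": rows[:MAX_ROWS]}  # enforce the output-side limit as well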
4) Verification checklist (snippet)
- All factual statements cited? (Y/N)
- Totals/dates consistent across sections? (Y/N)
- Required policies satisfied? (Y/N)
- Output validates against JSON schema? (Y/N)
5) Router micro-prompt
If task requires math/code/db → route: Tool-First.
If open-ended but safety-critical → route: PEC with Checker + low temperature.
If classification/extraction → route: Deterministic + schema validation.
Else → route: CoT with Critic/Refiner.
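The same routing rules can also live as a deterministic pre-step in code. A sketch that uses naive keyword signals as stand-ins for a real task classifier:
def route(task: str, safety_critical: bool = False) -> str:
    """Map a task description to one of the patterns above."""
    t = task.lower()
    if any(signal in t for signal in ("calculate", "sql", "sum of", "query the")):
        return "tool_first"
    if safety_critical:
        return "pec_with_checker_low_temp"
    if any(signal in t for signal in ("classify", "extract", "label")):
        return "deterministic_schema_validated"
    return "cot_with_critic_refiner"

route("Calculate the sum of Q3 refunds from the orders table")  # -> "tool_first"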
Maturity model (L0→L4)
- L0 Ad-hoc: copy-paste prompts; no tests.
- L1 Versioned: prompts in repo; basic A/B.
- L2 Tested: golden sets, schema outputs, drift alerts.
- L3 Governed: controller orchestration, policy as code, full traces.
- L4 Optimized: automatic routing, feedback learning, cost/latency SLOs.
Aim for L3 before scaling across business units.
30–60–90 day plan
Days 0–30
Days 31–60
Days 61–90
Closing note
Prompt engineering is the interface between probabilistic models and deterministic business. Treat it like a product discipline: modular prompts, explicit contracts, rigorous verification, and real observability. Do that and you'll get durable, repeatable outcomes even as models, data, and requirements keep changing.