Executive summary
Prompt engineering has matured from clever phrasing into a disciplined, testable engineering practice. The core ideas: separate planning, retrieval, tool use, verification, and generation; express contracts with schemas; prefer determinism where accuracy matters; and instrument everything. This handbook goes in depth on patterns, anti-patterns, evaluation, decoding, routing, and governance, without relying on a single model or vendor.
1) Why prompts fail (and how to frame the problem)
Typical failures
- Mega-prompts mix business logic with prose; small edits cause regressions.
- Model updates drift behavior; nothing localizes the break.
- RAG returns “close-enough” text; generation hallucinates glue claims.
- No schema, no citations, no audit trail.
Framing
- Treat prompts as interfaces and procedures, not as one-shot instructions.
- Enforce contracts (inputs/outputs), use layered prompts, and validate each step.
- Optimize for predictability first, creativity second—by context.
2) Prompt specification (formal, portable)
Write a one-page Prompt Spec per task (a minimal data sketch follows this list):
- Intent & KPI: business goal, audience, success metrics (win rate, accuracy, cost/task).
- Constraints: tone, length, latency/cost budgets, risk boundaries.
- Inputs/Outputs: JSON schemas; define mandatory fields and allowed ranges.
- Context Pack: glossaries, style guide, negative examples, counterfactuals.
- Failure Modes: ambiguity cases, known traps, red-team prompts.
- Tests: golden set size, invariants, adversarial set, acceptance thresholds.
Rule: if it isn’t in the spec, it doesn’t exist.
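A spec can also live in the repo as structured data next to its tests. A minimal sketch in Python; the field names and values are illustrative, not a required format:
PROMPT_SPEC = {
    "task": "warranty_answering",
    "intent": "Answer US warranty questions for support agents",
    "kpi": {"target_win_rate": 0.6, "max_cost_per_task_usd": 0.02},
    "constraints": {"tone": "neutral", "max_words": 250, "p95_latency_s": 2.0},
    "input_schema": "schemas/warranty_question.json",
    "output_schema": "schemas/warranty_answer.json",
    "context_pack": ["glossary.md", "style_guide.md", "negative_examples.md"],
    "failure_modes": ["ambiguous product line", "expired policy cited as current"],
    "tests": {"golden_set_size": 200, "adversarial_set_size": 50, "min_rubric_score": 0.85},
}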
3) Pattern library that scales
Planner → Executor → Checker (PEC)
- Planner: decomposes task, writes success criteria.
- Executor: does retrieval/tool calls, drafts output.
- Checker: validates against criteria, schema, and policies; fails closed (a minimal skeleton follows this pattern list).
Router: inspects each task and sends it to the cheapest pipeline or model that meets its quality bar (see section 8).
Decomposer: splits a complex request into independent subtasks that can be executed, validated, and merged separately.
Critic/Refiner: a critique pass scores the draft against explicit criteria, and the drafting role revises it for a bounded number of rounds.
Tool-First Grounding: facts, math, and state come from deterministic tools or retrieval before generation; the model plans and composes, it does not compute.
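A minimal PEC loop, sketched in Python. planner, executor, and checker are callables you supply (prompt-backed functions or tool pipelines); the report dict with a "passed" flag is an assumed convention, not a fixed API:
def run_pec(task, planner, executor, checker, max_iterations=3):
    plan = planner(task)                        # decompose the task, write success criteria
    for _ in range(max_iterations):
        draft = executor(plan)                  # retrieval, tool calls, draft output
        report = checker(draft, plan)           # schema, criteria, and policy validation
        if report.get("passed"):
            return draft
        plan = planner(task, feedback=report)   # revise the plan from checker findings
    raise RuntimeError("fail closed: checker did not pass within the retry budget")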
4) Structured outputs & constrained decoding
- JSON schemas: reject non-conforming outputs; include enums, min/max, patterns.
- Grammar-guided decoding: limit the token space to valid structures (JSON/XML/SQL).
- Stop sequences: prevent models from wandering beyond the schema.
- Field-by-field decoding: generate critical fields independently, validate each, then compose.
Example output schema
{
  "type": "object",
  "required": ["answer", "sources", "confidence", "policy_flags"],
  "properties": {
    "answer": {"type": "string", "minLength": 1},
    "sources": {"type": "array", "items": {"type": "string", "format": "uri"}, "minItems": 1},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    "policy_flags": {"type": "array", "items": {"type": "string"}}
  }
}
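A fail-closed validator for a schema like the one above, assuming the Python jsonschema package (any equivalent validator works):
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

def parse_or_fail(raw_output: str, schema: dict) -> dict:
    """Parse model output as JSON and validate against the schema; any error fails closed."""
    candidate = json.loads(raw_output)  # raises on malformed JSON
    errors = list(Draft202012Validator(schema).iter_errors(candidate))
    if errors:
        raise ValueError("schema violations: " + "; ".join(e.message for e in errors))
    return candidate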
5) Retrieval-aware prompting (RAG done right)
- Task-aware chunking: respect sections, tables, entities; avoid breaking concepts.
- Hybrid retrieval: ANN vectors + keyword/regex + filters (time, jurisdiction, product line); a merge sketch follows this list.
- Query prompts: be explicit about entities, timespans, and required granularity.
- Attribution discipline: per-sentence or per-section citations; fail closed if missing.
- Caching tiers: short-TTL answers; long-TTL facts; event-driven invalidation.
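A hybrid retriever can be expressed as a filtered merge of vector and keyword hits. In the sketch below, vector_search and keyword_search stand in for your own retrieval clients; their names and result shapes are assumptions:
def hybrid_retrieve(query, filters, vector_search, keyword_search, k=6, alpha=0.7):
    """Blend ANN and keyword scores under shared metadata filters; return the top-k passages."""
    vec_hits = {d["uri"]: d for d in vector_search(query, filters=filters, k=3 * k)}
    kw_hits = {d["uri"]: d for d in keyword_search(query, filters=filters, k=3 * k)}
    merged = []
    for uri in set(vec_hits) | set(kw_hits):
        doc = vec_hits.get(uri) or kw_hits[uri]
        score = (alpha * vec_hits.get(uri, {}).get("score", 0.0)
                 + (1 - alpha) * kw_hits.get(uri, {}).get("score", 0.0))
        merged.append({**doc, "score": score})
    return sorted(merged, key=lambda d: d["score"], reverse=True)[:k]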
Retrieval prompt snippet
Retrieve up to 6 passages that directly answer:
- Entity: <ACME Model Q2 Warranty>
- Time window: 2023-01-01..2024-12-31
- Jurisdiction: US
Prefer policy/source-of-truth docs over derived content. Return [title, uri, excerpt].
6) Tool use & function calls (deterministic cores)
- Model calls deterministic tools (SQL, calculators, internal APIs) via strict I/O.
- Validate tool inputs before execution; sanitize and bound outputs.
- Prefer tools for math, dates, currency, and database state; let the model plan and compose.
Tool contract example
{
  "name": "sql.query",
  "input_schema": {
    "type": "object",
    "required": ["sql"],
    "properties": {"sql": {"type": "string"}}
  },
  "output_schema": {
    "type": "object",
    "required": ["rows"],
    "properties": {"rows": {"type": "array"}}
  },
  "guardrails": {"deny": ["DROP", "ALTER", "TRUNCATE"], "max_rows": 5000}
}
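The guardrails above can be enforced before and after execution. A minimal sketch, assuming run_sql is your deterministic database client:
DENYLIST = ("DROP", "ALTER", "TRUNCATE")
MAX_ROWS = 5000

def guarded_sql_query(sql, run_sql):
    """Enforce the contract's guardrails: reject denied statements, bound the result size."""
    if any(keyword in sql.upper() for keyword in DENYLIST):
        raise PermissionError("SQL contains a denied keyword; failing closed")
    rows = run_sql(sql)                # deterministic tool call with validated input
    return {"rows": rows[:MAX_ROWS]}   # matches the output_schema's required "rows" field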
7) Decoding strategies & determinism
- Deterministic where accuracy matters: temperature≈0, nucleus/top-k off, repetition penalties minimal.
- Exploratory where creativity matters: moderate temperature with bounded tokens and a checker pass (profiles for both regimes are sketched after this list).
- Constrained decoding: grammar/regex to enforce structure; beam search rarely needed for prose.
- Stochastic guards: cap retries; log deltas between attempts.
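Both regimes can be captured as versioned decoding profiles; the parameter names below follow common sampling APIs but are assumptions, not any specific vendor's:
DECODING_PROFILES = {
    "deterministic": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 512, "max_retries": 1},
    "exploratory": {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024, "max_retries": 2},
}

def profile_for(task_kind: str) -> dict:
    """Accuracy-critical tasks get the deterministic profile; creative drafting gets the exploratory one."""
    accuracy_critical = {"extraction", "classification", "sql", "math"}
    return DECODING_PROFILES["deterministic" if task_kind in accuracy_critical else "exploratory"]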
8) Routing & orchestration
- Static routing: rules on task features (contains code/math/date → tool-first); a rule sketch follows this list.
- Learned routing: small classifier on inputs (length, named entities, ambiguity).
- Cost-aware routing: easy tasks → small model; difficult tasks → larger model.
- Backoff logic: if retrieval confidence < threshold, ask a clarifying question or escalate.
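A static router is often just a handful of rules on task features. The detectors and pipeline names below are illustrative placeholders, not a tuned classifier:
def route(task: dict) -> str:
    """Rule-based routing on task features; returns the name of the pipeline to run."""
    text = task.get("text", "").lower()
    if any(marker in text for marker in ("select ", "sum(", "how many", "average", "total cost")):
        return "tool_first"              # math / code / DB state goes to deterministic tools
    if task.get("kind") in {"extraction", "classification"}:
        return "deterministic_schema"    # low temperature plus schema validation
    if task.get("retrieval_confidence", 1.0) < 0.5:
        return "clarify_or_escalate"     # backoff: ask a clarifying question or escalate
    return "pec_small_model" if len(text) < 500 else "pec_large_model"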
9) Evaluation: offline, online, and metamorphic
Offline
- Golden sets (100–1,000 items) with unambiguous answers or rubrics.
- Invariants (metamorphic tests): output shouldn’t change under irrelevant paraphrases (example after this list).
- Adversarial sets: jailbreaks, prompt injection, conflicting facts.
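A metamorphic test pairs an input with an irrelevant paraphrase and asserts the answer is stable; run_pipeline and answers_equivalent stand in for your own harness:
def test_paraphrase_invariance(run_pipeline, answers_equivalent):
    """The answer should not change under an irrelevant paraphrase of the same question."""
    original = "What is the return window for the ACME Model Q2 in the US?"
    paraphrase = "For US customers, how many days are allowed to return an ACME Model Q2?"
    assert answers_equivalent(run_pipeline(original), run_pipeline(paraphrase))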
Online
- A/B or interleaving on live traffic with guardrails.
- Human-in-the-loop review for sensitive cohorts.
- Post-hoc audits sampling by risk score.
Judging
- Prefer task-specific rubrics over generic LLM-as-judge.
- If using model judges, calibrate with human-labeled anchors and measure judge bias.
10) Safety & robustness (built-in, not bolted on)
- Prompt injection hygiene: delimit user content, never treat it as instructions, and validate outputs against policies (a sketch follows this list).
- PII/PHI rules: redact or route to restricted namespaces; log access decisions.
- Policy as validators: encode rules (HIPAA/GDPR/PCI) as executable checks the “Checker” must pass.
- Escalation gates: pause before risky actions; show diffs and evidence for approval.
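Delimiting untrusted content and running policy validators can look like the sketch below; the delimiter scheme is illustrative and is a hygiene measure, not a complete defense on its own:
import re

UNTRUSTED_OPEN, UNTRUSTED_CLOSE = "<untrusted>", "</untrusted>"

def wrap_untrusted(user_text: str) -> str:
    """Mark user content as data, never as instructions; strip any delimiter spoofing inside it."""
    cleaned = re.sub(r"</?untrusted>", "", user_text, flags=re.IGNORECASE)
    return f"{UNTRUSTED_OPEN}\n{cleaned}\n{UNTRUSTED_CLOSE}"

def passes_policies(draft: dict, validators) -> bool:
    """The Checker runs every policy validator; any failure fails closed."""
    return all(validator(draft) for validator in validators)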
11) Observability & SLOs
Instrument every run with a Run ID and capture the following (a run-record sketch appears below):
- Inputs, retrieval set, tool I/O (hash large objects), prompts used.
- Metrics: latency p50/p95/p99, cost, retries, tool precision/recall, retrieval hit rate.
- Quality: win rate vs baseline, citation coverage, contradiction rate, policy pass rate.
- Incidents: fail-closed counts, escalation reasons, human overrides.
Set SLOs (e.g., policy pass ≥ 99.5%, grounded-citation rate ≥ 98%, p95 latency ≤ 2.0s).
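The capture list above maps onto a small run record logged per Run ID; field names are illustrative:
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class RunRecord:
    """Per-run trace for SLO dashboards and audits; large payloads are stored as hashes."""
    run_id: str
    prompt_version: str
    input_hash: str
    retrieval_uris: List[str] = field(default_factory=list)
    tool_calls: List[Tuple[str, str, str]] = field(default_factory=list)  # (tool, input hash, output hash)
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    retries: int = 0
    citation_coverage: float = 0.0
    policy_passed: bool = False
    escalation_reason: Optional[str] = None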
12) Anti-patterns to retire
- One prompt to rule them all. Use layered roles and patterns.
- Free-form essays for structured tasks. Emit JSON and validate.
- Unlimited retries that mask root causes. Cap and analyze deltas.
- RAG without contracts. Always specify namespaces, freshness, and filters.
- “Trust me” citations. Require URIs and positional evidence.
13) Maturity roadmap (L0→L4)
- L0 Ad-hoc: copy/paste prompts; manual evals.
- L1 Versioned: prompts in repo; basic diff reviews.
- L2 Tested: schemas, golden sets, drift alerts.
- L3 Governed: PEC pattern, policy validators, full traces.
- L4 Optimized: learned routing, feedback learning, unit-economics dashboards.
Aim for L3 before broad rollout.
14) Ready-to-use templates
A) Controller prompt (skeleton)
[Role] You are the Controller for <Task>. Optimize for <KPI> under <Constraints>.
[Inputs]
- Intent: <...>
- Policies: <list>
- Tools: <name, input_schema, output_schema, limits>
- Retrieval: <namespaces, filters, freshness, max_docs>
[Plan]
1) Disambiguate; list assumptions.
2) Decompose into subtasks + success criteria.
3) For each subtask: choose tool/RAG with justification.
4) Execute; collect evidence with citations.
5) Verify vs criteria & policies; note failures.
6) If failed, revise plan (max N iterations).
7) Produce final JSON per schema.
B) Verification checklist
- All factual statements cited? (Y/N)
- Totals, dates, and names consistent? (Y/N)
- Output validates against JSON schema? (Y/N)
- Required policies satisfied? (Y/N)
- Tone/length constraints met? (Y/N)
C) Unit test fixture (pseudo)
case = load_case("returns_policy_q2_us.json")        # golden case: input plus reference answer
out = run_pipeline(case.input)                        # full run: retrieval, tools, generation
assert validate_json(out, schema)                     # output contract holds
assert all(uri_ok(u) for u in out["sources"])         # every citation resolves
assert rubric_score(out["answer"], case.reference) >= 0.85  # acceptance threshold from the spec
D) Router micro-prompt
If task contains math/code/DB → route Tool-First.
If extraction/classification → deterministic + schema validation.
If open-ended but sensitive → PEC with Checker, low temperature.
Else → plan→execute→check with limited tokens.
Closing note
Treat prompt engineering as systems design: modular prompts with explicit contracts, deterministic cores, verified grounding, and real observability. Do that, and your prompts become reliable components that produce governed outcomes—scaling gracefully as models, data, and business needs evolve.