1. Define the outcome, then freeze constraints
Target: what artifact do you want (e.g., 200-word email, JSON spec, SQL query)?
Constraints: max length, tone, must/forbidden items, budget (e.g., “≤ 1k input tokens, ≤ 300 output tokens”).
Pass/Fail checks: 3–5 objective rules the output must satisfy.
Mini-template
Goal: <deliverable in one line>
Constraints: length≤N, tone=<X>, format=<JSON|bullets|code>, citations=<Y/N>
Budget: max_input_tokens=N, max_output_tokens=M
Quality checks: [check1, check2, check3]
2. Route to the right model (tiering saves money)
Small/cheap: classify, extract, outline, retrieve passages.
Mid: draft simple prose/code.
Best: only for synthesis or tricky reasoning after a gate fails.
Rule of thumb: “cheap → test → escalate only if needed.”
Routing rule (plain-English)
If task ∈ {classify, extract, summarize} → Small model.
If task ∈ {draft, rewrite with style} → Mid model.
If task fails checks or requires complex reasoning → Best model for one pass only.
3. Put hard caps in the prompt
Always set max_tokens, stop sequences, and ban verbose thought:
Keep temperature low (0–0.3) for deterministic tasks.
Prefer short outputs: “Answer in ≤ 120 words” or “Return exactly one JSON object.”
4. Compact the input (token diet)
Remove greetings, repetition, and full docs. Keep IDs, short quotes, and bullet summaries of long context.
Deduplicate lists and compress passages to bullet-point abstracts before feeding.
Aim for ≤ 3–5 short relevant snippets, not the whole corpus.
Compressor helper (use before main prompt)
You are a compressor. Rewrite the following into ≤120 tokens of bullet points
preserving numbers, names, and constraints. No fluff.
5. Retrieval that doesn’t flood the context
Chunk sources (200–400 tokens), embed, and fetch top-k=3 with tight filters.
Prepend each snippet with a 1-line why-it’s-relevant tag.
Never paste entire PDFs; paste only minimal, relevant fragments.
6. Minimize few-shots; prefer schemas
Try zero-shot with explicit schema first.
If examples are needed, use one highly representative example (not five).
Replace long examples with field-by-field rules.
Output schema guard
Return JSON exactly matching:
{
"title": "string (≤60 chars)",
"summary": "string (≤120 words)",
"actions": ["string", ... (3 items max)]
}
No extra keys, no explanations.
7. Single-pass structure beats long prose
Put instructions at the top, followed by input, then schema/checks.
Use numbered directives (1–6). Short lines, no paragraphs.
Add “If unsure, say 'INSUFFICIENT_CONTEXT'.” to avoid rambling.
8. Two-pass “Draft → Verify” pattern (cheap + gated)
Draft (cheap/mid model) → produce output under tight schema.
Verify (small model) → run the 3–5 checks; return pass/fail flags.
Escalate to best model only if any check fails, with the failing flags as input.
Verifier prompt (small model)
Judge the JSON against these rules: [rule1..rule5].
Return {"pass": true|false, "failed_rules": ["..."]}. No prose.
9. Reuse to avoid paying twice
Cache system prompts and role instructions; pass IDs instead of repeating.
Reuse embeddings for retrieval; store canonical exemplars and link by name.
Keep a snippet library for common constraints (style, tone, brand).
10. Log tokens, cost, and quality—then prune
Capture per-call: input tokens, output tokens, dollar cost, latency, which step escalated.
Track a simple quality score (0–5) from validators or user thumbs.
Monthly pruning: remove low-impact instructions, shrink schemas, delete noisy examples.
Cost formula
Cost ≈ (input_tokens $/1k_in) + (output_tokens $/1k_out)
Aim to reduce input first; it scales across every call.
11. Safety & compliance (saves money by preventing reruns)
Data minimization: include only what’s required for this turn.
Forbid chain-of-thought: “Provide the answer only, no step-by-step reasoning.”
Add a PII gate: “If text seems to include personal data, return 'REDACT_NEEDED'.”
12. Ready-to-use templates
A) Cost Guardrails (System)
You are a cost-aware assistant. Rules:
1) Never exceed max_tokens or add extra commentary.
2) Follow the output schema exactly; no markdown unless requested.
3) If insufficient context, output 'INSUFFICIENT_CONTEXT' (no guessing).
4) Keep answers as short as allowed by constraints.
B) Task Prompt (User)
Task: <one sentence>
Inputs: <bulleted facts only>
Format: <schema or tight outline>
Constraints: length≤N; tone=<X>; citations=<Y/N>
Budget: max_output_tokens=M
Return only the formatted result.
C) Cheap Draft → Verify → Escalate
Draft (mid): “Produce per schema. No explanations.”
Verify (small): “Check rules [..]; return pass/fail JSON.”
Fix/Escalate (best if needed): “Here is the draft and failed_rules. Produce a corrected final.”
D) Summarizer/Compressor (before main call)
Condense the following into ≤100 tokens preserving facts, numbers, and constraints.
Output bullets only.
13. Quick troubleshooting
Long, meandering answers → lower max_tokens, add stop sequences, tighten schema.
Hallucinations → require citations or “INSUFFICIENT_CONTEXT”; feed fewer but better snippets.
Inconsistent format → add a JSON schema and a non-compliant penalty (“If invalid, score=0” in verifier).
Too costly → strip boilerplate, remove examples, switch to two-pass with mid/small models, shorten inputs.
14. Pocket checklist (printable)
Goal fits on one line.
Hard caps: max_tokens, stop, “no reasoning text.”
Minimal, deduped context (≤ 3–5 snippets).
Zero- or one-shot with strict schema.
Two-pass with cheap verify; escalate on failure only.
Log tokens & quality; prune monthly.
Micro-example (everything together)
System: Cost guardrails (A).
User
Task: Write a 120-word customer apology email for delayed shipment.
Inputs: order #78421, delay=3 days, new ETA=Sep 12, 10% coupon THANKS10.
Format: Greeting, 3 short sentences, closing + signature.
Constraints: warm, concise; no excuses; ≤120 words.
Budget: max_output_tokens=160
Verifier (small): Check length≤120 words, coupon present, ETA present, tone=warm, no excuses → pass/fail JSON.
Escalate only if fail.
Use this playbook, and your costs will drop while quality becomes repeatable and auditable.