1. Define the outcome, then freeze constraints
- Target: what artifact do you want (e.g., 200-word email, JSON spec, SQL query)? 
- Constraints: max length, tone, must/forbidden items, budget (e.g., “≤ 1k input tokens, ≤ 300 output tokens”). 
- Pass/Fail checks: 3–5 objective rules the output must satisfy. 
Mini-template
Goal: <deliverable in one line>
Constraints: length≤N, tone=<X>, format=<JSON|bullets|code>, citations=<Y/N>
Budget: max_input_tokens=N, max_output_tokens=M
Quality checks: [check1, check2, check3]
2. Route to the right model (tiering saves money)
- Small/cheap: classify, extract, outline, retrieve passages. 
- Mid: draft simple prose/code. 
- Best: only for synthesis or tricky reasoning after a gate fails. 
- Rule of thumb: “cheap → test → escalate only if needed.” 
Routing rule (plain-English)
If task ∈ {classify, extract, summarize} → Small model.
If task ∈ {draft, rewrite with style} → Mid model.
If task fails checks or requires complex reasoning → Best model for one pass only.
3. Put hard caps in the prompt
- Always set max_tokens, stop sequences, and ban verbose thought: 
- Keep temperature low (0–0.3) for deterministic tasks. 
- Prefer short outputs: “Answer in ≤ 120 words” or “Return exactly one JSON object.” 
4. Compact the input (token diet)
- Remove greetings, repetition, and full docs. Keep IDs, short quotes, and bullet summaries of long context. 
- Deduplicate lists and compress passages to bullet-point abstracts before feeding. 
- Aim for ≤ 3–5 short relevant snippets, not the whole corpus. 
Compressor helper (use before main prompt)
You are a compressor. Rewrite the following into ≤120 tokens of bullet points
preserving numbers, names, and constraints. No fluff.
5. Retrieval that doesn’t flood the context
- Chunk sources (200–400 tokens), embed, and fetch top-k=3 with tight filters. 
- Prepend each snippet with a 1-line why-it’s-relevant tag. 
- Never paste entire PDFs; paste only minimal, relevant fragments. 
6. Minimize few-shots; prefer schemas
- Try zero-shot with explicit schema first. 
- If examples are needed, use one highly representative example (not five). 
- Replace long examples with field-by-field rules. 
Output schema guard
Return JSON exactly matching:
{
  "title": "string (≤60 chars)",
  "summary": "string (≤120 words)",
  "actions": ["string", ... (3 items max)]
}
No extra keys, no explanations.
7. Single-pass structure beats long prose
- Put instructions at the top, followed by input, then schema/checks. 
- Use numbered directives (1–6). Short lines, no paragraphs. 
- Add “If unsure, say 'INSUFFICIENT_CONTEXT'.” to avoid rambling. 
8. Two-pass “Draft → Verify” pattern (cheap + gated)
- Draft (cheap/mid model) → produce output under tight schema. 
- Verify (small model) → run the 3–5 checks; return pass/fail flags. 
- Escalate to best model only if any check fails, with the failing flags as input. 
Verifier prompt (small model)
Judge the JSON against these rules: [rule1..rule5].
Return {"pass": true|false, "failed_rules": ["..."]}. No prose.
9. Reuse to avoid paying twice
- Cache system prompts and role instructions; pass IDs instead of repeating. 
- Reuse embeddings for retrieval; store canonical exemplars and link by name. 
- Keep a snippet library for common constraints (style, tone, brand). 
10. Log tokens, cost, and quality—then prune
- Capture per-call: input tokens, output tokens, dollar cost, latency, which step escalated. 
- Track a simple quality score (0–5) from validators or user thumbs. 
- Monthly pruning: remove low-impact instructions, shrink schemas, delete noisy examples. 
Cost formula
Cost ≈ (input_tokens $/1k_in) + (output_tokens $/1k_out)
Aim to reduce input first; it scales across every call.
11. Safety & compliance (saves money by preventing reruns)
- Data minimization: include only what’s required for this turn. 
- Forbid chain-of-thought: “Provide the answer only, no step-by-step reasoning.” 
- Add a PII gate: “If text seems to include personal data, return 'REDACT_NEEDED'.” 
12. Ready-to-use templates
A) Cost Guardrails (System)
You are a cost-aware assistant. Rules:
1) Never exceed max_tokens or add extra commentary.
2) Follow the output schema exactly; no markdown unless requested.
3) If insufficient context, output 'INSUFFICIENT_CONTEXT' (no guessing).
4) Keep answers as short as allowed by constraints.
B) Task Prompt (User)
Task: <one sentence>
Inputs: <bulleted facts only>
Format: <schema or tight outline>
Constraints: length≤N; tone=<X>; citations=<Y/N>
Budget: max_output_tokens=M
Return only the formatted result.
C) Cheap Draft → Verify → Escalate
- Draft (mid): “Produce per schema. No explanations.” 
- Verify (small): “Check rules [..]; return pass/fail JSON.” 
- Fix/Escalate (best if needed): “Here is the draft and failed_rules. Produce a corrected final.” 
D) Summarizer/Compressor (before main call)
Condense the following into ≤100 tokens preserving facts, numbers, and constraints.
Output bullets only.
13. Quick troubleshooting
- Long, meandering answers → lower max_tokens, add stop sequences, tighten schema. 
- Hallucinations → require citations or “INSUFFICIENT_CONTEXT”; feed fewer but better snippets. 
- Inconsistent format → add a JSON schema and a non-compliant penalty (“If invalid, score=0” in verifier). 
- Too costly → strip boilerplate, remove examples, switch to two-pass with mid/small models, shorten inputs. 
14. Pocket checklist (printable)
- Goal fits on one line. 
- Hard caps: max_tokens, stop, “no reasoning text.” 
- Minimal, deduped context (≤ 3–5 snippets). 
- Zero- or one-shot with strict schema. 
- Two-pass with cheap verify; escalate on failure only. 
- Log tokens & quality; prune monthly. 
Micro-example (everything together)
System: Cost guardrails (A).
User
Task: Write a 120-word customer apology email for delayed shipment.
Inputs: order #78421, delay=3 days, new ETA=Sep 12, 10% coupon THANKS10.
Format: Greeting, 3 short sentences, closing + signature.
Constraints: warm, concise; no excuses; ≤120 words.
Budget: max_output_tokens=160
Verifier (small): Check length≤120 words, coupon present, ETA present, tone=warm, no excuses → pass/fail JSON.
Escalate only if fail.
Use this playbook, and your costs will drop while quality becomes repeatable and auditable.