Prompt Engineering  

Step-by-Step, Cost-Efficient Prompt Engineering

1. Define the outcome, then freeze constraints

  • Target: what artifact do you want (e.g., 200-word email, JSON spec, SQL query)?

  • Constraints: max length, tone, must/forbidden items, budget (e.g., “≤ 1k input tokens, ≤ 300 output tokens”).

  • Pass/Fail checks: 3–5 objective rules the output must satisfy.

Mini-template

Goal: <deliverable in one line>

Constraints: length≤N, tone=<X>, format=<JSON|bullets|code>, citations=<Y/N>

Budget: max_input_tokens=N, max_output_tokens=M

Quality checks: [check1, check2, check3]

2. Route to the right model (tiering saves money)

  • Small/cheap: classify, extract, outline, retrieve passages.

  • Mid: draft simple prose/code.

  • Best: only for synthesis or tricky reasoning after a gate fails.

  • Rule of thumb: “cheap → test → escalate only if needed.”

Routing rule (plain-English)

If task ∈ {classify, extract, summarize} → Small model.

If task ∈ {draft, rewrite with style} → Mid model.

If task fails checks or requires complex reasoning → Best model for one pass only.

3. Put hard caps in the prompt

  • Always set max_tokens, stop sequences, and ban verbose thought:

    • Do not explain your reasoning. Return only the final result in the required format.

  • Keep temperature low (0–0.3) for deterministic tasks.

  • Prefer short outputs: “Answer in ≤ 120 words” or “Return exactly one JSON object.”

4. Compact the input (token diet)

  • Remove greetings, repetition, and full docs. Keep IDs, short quotes, and bullet summaries of long context.

  • Deduplicate lists and compress passages to bullet-point abstracts before feeding.

  • Aim for ≤ 3–5 short relevant snippets, not the whole corpus.

Compressor helper (use before main prompt)

You are a compressor. Rewrite the following into ≤120 tokens of bullet points

preserving numbers, names, and constraints. No fluff.

5. Retrieval that doesn’t flood the context

  • Chunk sources (200–400 tokens), embed, and fetch top-k=3 with tight filters.

  • Prepend each snippet with a 1-line why-it’s-relevant tag.

  • Never paste entire PDFs; paste only minimal, relevant fragments.

6. Minimize few-shots; prefer schemas

  • Try zero-shot with explicit schema first.

  • If examples are needed, use one highly representative example (not five).

  • Replace long examples with field-by-field rules.

Output schema guard

Return JSON exactly matching:

{

  "title": "string (≤60 chars)",

  "summary": "string (≤120 words)",

  "actions": ["string", ... (3 items max)]

}

No extra keys, no explanations.

7. Single-pass structure beats long prose

  • Put instructions at the top, followed by input, then schema/checks.

  • Use numbered directives (1–6). Short lines, no paragraphs.

  • Add “If unsure, say 'INSUFFICIENT_CONTEXT'.” to avoid rambling.

8. Two-pass “Draft → Verify” pattern (cheap + gated)

  1. Draft (cheap/mid model) → produce output under tight schema.

  2. Verify (small model) → run the 3–5 checks; return pass/fail flags.

  3. Escalate to best model only if any check fails, with the failing flags as input.

Verifier prompt (small model)

Judge the JSON against these rules: [rule1..rule5].

Return {"pass": true|false, "failed_rules": ["..."]}. No prose.

9. Reuse to avoid paying twice

  • Cache system prompts and role instructions; pass IDs instead of repeating.

  • Reuse embeddings for retrieval; store canonical exemplars and link by name.

  • Keep a snippet library for common constraints (style, tone, brand).

10. Log tokens, cost, and quality—then prune

  • Capture per-call: input tokens, output tokens, dollar cost, latency, which step escalated.

  • Track a simple quality score (0–5) from validators or user thumbs.

  • Monthly pruning: remove low-impact instructions, shrink schemas, delete noisy examples.

Cost formula

Cost ≈ (input_tokens $/1k_in) + (output_tokens $/1k_out)

Aim to reduce input first; it scales across every call.

11. Safety & compliance (saves money by preventing reruns)

  • Data minimization: include only what’s required for this turn.

  • Forbid chain-of-thought: “Provide the answer only, no step-by-step reasoning.”

  • Add a PII gate: “If text seems to include personal data, return 'REDACT_NEEDED'.”

12. Ready-to-use templates

A) Cost Guardrails (System)

You are a cost-aware assistant. Rules:

1) Never exceed max_tokens or add extra commentary.

2) Follow the output schema exactly; no markdown unless requested.

3) If insufficient context, output 'INSUFFICIENT_CONTEXT' (no guessing).

4) Keep answers as short as allowed by constraints.

B) Task Prompt (User)

Task: <one sentence>

Inputs: <bulleted facts only>

Format: <schema or tight outline>

Constraints: length≤N; tone=<X>; citations=<Y/N>

Budget: max_output_tokens=M

Return only the formatted result.

C) Cheap Draft → Verify → Escalate

  • Draft (mid): “Produce per schema. No explanations.”

  • Verify (small): “Check rules [..]; return pass/fail JSON.”

  • Fix/Escalate (best if needed): “Here is the draft and failed_rules. Produce a corrected final.”

D) Summarizer/Compressor (before main call)

Condense the following into ≤100 tokens preserving facts, numbers, and constraints.

Output bullets only.

13. Quick troubleshooting

  • Long, meandering answers → lower max_tokens, add stop sequences, tighten schema.

  • Hallucinations → require citations or “INSUFFICIENT_CONTEXT”; feed fewer but better snippets.

  • Inconsistent format → add a JSON schema and a non-compliant penalty (“If invalid, score=0” in verifier).

  • Too costly → strip boilerplate, remove examples, switch to two-pass with mid/small models, shorten inputs.

14. Pocket checklist (printable)

  • Goal fits on one line.

  • Hard caps: max_tokens, stop, “no reasoning text.”

  • Minimal, deduped context (≤ 3–5 snippets).

  • Zero- or one-shot with strict schema.

  • Two-pass with cheap verify; escalate on failure only.

  • Log tokens & quality; prune monthly.

Micro-example (everything together)

System: Cost guardrails (A).

User

Task: Write a 120-word customer apology email for delayed shipment.

Inputs: order #78421, delay=3 days, new ETA=Sep 12, 10% coupon THANKS10.

Format: Greeting, 3 short sentences, closing + signature.

Constraints: warm, concise; no excuses; ≤120 words.

Budget: max_output_tokens=160

Verifier (small): Check length≤120 words, coupon present, ETA present, tone=warm, no excuses → pass/fail JSON.

Escalate only if fail.

Use this playbook, and your costs will drop while quality becomes repeatable and auditable.