RAG on a Budget: Cost-Efficient Retrieval and Context Packing

Executive Summary

Most LLM cost comes from input tokens, not output tokens. The fastest way to cut spend is to retrieve less, but better, then pack only the minimum context the model needs. This article gives you a concrete, repeatable pipeline that lowers cost by 30–70% without sacrificing accuracy.

The Cost Levers (What Actually Moves the Needle)

  1. Context volume: Fewer, tighter snippets beat long pastes.

  2. Redundancy: Near-duplicate chunks silently double the cost.

  3. Shot count: Schemas beat many few-shot examples.

  4. Model tiering: Cheap models for fetch/condense; strong models for final synthesis only.

Step-by-Step Pipeline

1) Normalize Sources Once

  • Convert PDFs, slides, and HTML to clean text.

  • Strip boilerplate, headers, footers, navigation.

  • Keep stable doc_id, section_id, timestamp metadata.

Tip: Save the cleaned text; normalization is a one-time cost you don’t want to pay again.
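A minimal normalization sketch in Python, assuming HTML sources and the beautifulsoup4 package; the file layout and doc_id scheme here are illustrative, not prescriptive:

# normalize.py - one-time normalization pass (illustrative sketch)
import json
import re
from datetime import datetime, timezone
from pathlib import Path

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def clean_html(raw_html: str) -> str:
    """Strip tags, scripts, and navigation chrome; return plain text."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def normalize(path: Path, doc_id: str, out_dir: Path) -> None:
    """Write cleaned text plus stable metadata so this cost is paid once."""
    record = {
        "doc_id": doc_id,
        "source": str(path),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": clean_html(path.read_text(encoding="utf-8")),
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{doc_id}.json").write_text(json.dumps(record, ensure_ascii=False))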

2) Smart Chunking (200–400 tokens)

  • Prefer semantic boundaries (headings, bullet lists) to avoid splitting concepts.

  • Include a 1-line heading per chunk summarizing its claim.

  • Store: {doc_id, section_id, heading, text, tokens}.

Why 200–400? Long chunks dilute relevance; tiny chunks increase fetch count. This range balances both.
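A rough heading-aware chunker along these lines; the characters-per-token estimate is an assumption, so swap in your real tokenizer if you have one:

# chunk.py - pack paragraphs under a heading into ~200-400 token chunks (sketch)
from typing import Iterator

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token; replace with a real tokenizer.
    return max(1, len(text) // 4)

def chunk_section(doc_id: str, section_id: str, heading: str,
                  paragraphs: list[str], max_tokens: int = 400) -> Iterator[dict]:
    """Yield chunk records that respect the max_tokens budget."""
    buf: list[str] = []
    used = 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        if buf and used + cost > max_tokens:
            yield {"doc_id": doc_id, "section_id": section_id,
                   "heading": heading, "text": "\n".join(buf), "tokens": used}
            buf, used = [], 0
        buf.append(para)
        used += cost
    if buf:
        yield {"doc_id": doc_id, "section_id": section_id,
               "heading": heading, "text": "\n".join(buf), "tokens": used}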

3) Two-Stage Retrieval (Cheap → Precise)

Stage A: Sparse Filter (cheap)

  • BM25 / keyword filter on heading + text.

  • Narrow to candidate_k = 30–60.

Stage B: Dense Rerank (precise)

  • Embed query and candidates; rerank by cosine similarity or use a cross-encoder reranker.

  • Keep top_k = 3 (rarely 5). Use MMR to diversify.

Rule: If fewer than 3 candidates score well, return INSUFFICIENT_CONTEXT rather than padding with weak matches.
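One way to wire the two stages together, sketched with the rank_bm25 package for Stage A and plain NumPy cosine plus MMR for Stage B; the embedding vectors are assumed to come from whatever embedder you already use:

# retrieve.py - sparse filter, then dense rerank with MMR (sketch)
import numpy as np
from rank_bm25 import BM25Okapi  # assumed available; any BM25 implementation works

def sparse_candidates(query: str, chunks: list[dict], k: int = 50) -> list[dict]:
    """Stage A: cheap keyword filter over heading + text."""
    corpus = [(c["heading"] + " " + c["text"]).lower().split() for c in chunks]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.lower().split())
    order = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in order]

def mmr_rerank(query_vec, cand_vecs, candidates, top_k=3, lam=0.3):
    """Stage B: greedy MMR, trading relevance against redundancy with picked snippets."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    picked, remaining = [], list(range(len(candidates)))
    while remaining and len(picked) < top_k:
        best, best_score = None, -1e9
        for i in remaining:
            rel = cos(query_vec, cand_vecs[i])
            red = max((cos(cand_vecs[i], cand_vecs[j]) for j in picked), default=0.0)
            score = lam * rel - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        picked.append(best)
        remaining.remove(best)
    return [candidates[i] for i in picked]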

4) Pre-Compression (Small Model, Not the Big One)

Before you ever hit a powerful model, compress each retrieved chunk:

Compressor Prompt

You are a compressor. Condense the snippet to ≤70 tokens.
Preserve names, numbers, definitions, and decisions. Remove anecdotes and qualifiers.
Return 3–5 bullets. No explanations.

This step alone typically cuts 50–80% of retrieval token volume.
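A sketch of that compression pass using the OpenAI Python client as one example; the model name, temperature, and output cap are placeholders for whichever small model you actually run:

# compress.py - condense each retrieved chunk with a small model (sketch)
from openai import OpenAI  # any small-model client works; this is one example

client = OpenAI()

COMPRESSOR_PROMPT = (
    "You are a compressor. Condense the snippet to <=70 tokens. "
    "Preserve names, numbers, definitions, and decisions. "
    "Remove anecdotes and qualifiers. Return 3-5 bullets. No explanations."
)

def compress_chunk(chunk: dict, model: str = "gpt-4o-mini") -> str:
    # Model name is a placeholder; point this at your cheap tier.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": COMPRESSOR_PROMPT},
            {"role": "user", "content": chunk["text"]},
        ],
        max_tokens=120,  # hard output cap keeps the compressor itself cheap
        temperature=0,
    )
    return (resp.choices[0].message.content or "").strip()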

5) Context Packing with Budgets

Assemble the prompt in strict sections:

  1. Instructions (≤60 tokens)

  2. Schema / Output Contract (≤120 tokens)

  3. Quality Checks (3–5 rules) (≤60 tokens)

  4. Compressed Evidence (≤3 snippets)
    Each snippet as:
    • [doc_id#section_id | why relevant in ≤10 tokens]
    Then 3–5 bullets from the compressor.

Hard caps: max_tokens for output, and a visible context budget, e.g., “Total evidence ≤ 250 tokens.”
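Packing can then be mechanical. A sketch that reuses the rough token estimate from the chunking step; snippet labels follow the format above, and the 250-token evidence budget is the example cap:

# pack.py - assemble the prompt in strict, budgeted sections (sketch)
def pack_context(instructions: str, schema: str, checks: list[str],
                 snippets: list[dict], evidence_budget: int = 250) -> str:
    """snippets: [{"doc_id", "section_id", "why", "bullets": [...]}, ...]"""
    parts = [
        instructions,
        "Output Schema:\n" + schema,
        "Quality checks:\n" + "\n".join(f"- {c}" for c in checks),
    ]
    evidence_lines = ["Evidence (compressed):"]
    used = 0
    for s in snippets[:3]:  # hard cap: at most 3 snippets
        block = [f"[{s['doc_id']}#{s['section_id']} | {s['why']}]"]
        block += [f"  - {b}" for b in s["bullets"]]
        cost = sum(len(line) // 4 for line in block)  # rough token estimate
        if used + cost > evidence_budget:
            break  # respect the visible context budget
        evidence_lines += block
        used += cost
    parts.append("\n".join(evidence_lines))
    return "\n\n".join(parts)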

6) Minimal Shots → Prefer Schemas

Zero-shot plus a JSON or outline schema usually beats multi-example few-shot for cost and stability.

Schema Guard

Return JSON:
{
  "answer": "≀120 words",
  "evidence": [{"doc_id":"...", "section_id":"..."}],
  "confidence": 0.0-1.0
}
No extra keys. No reasoning text.
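The same contract can be made machine-checkable. A sketch of it as a JSON Schema dictionary; field names mirror the guard above, and minItems encodes the verifier's "evidence ≥ 1" rule:

# contract.py - the output contract as a JSON Schema (sketch)
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "evidence": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "properties": {
                    "doc_id": {"type": "string"},
                    "section_id": {"type": "string"},
                },
                "required": ["doc_id", "section_id"],
                "additionalProperties": False,
            },
        },
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
    "required": ["answer", "evidence", "confidence"],
    "additionalProperties": False,  # "No extra keys"
}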

7) Draft → Verify → Escalate (GSCP-Lite Gate)

  • Draft (mid-tier): Produce the answer using the packed context + schema.

  • Verify (small model): Check length, schema validity, evidence present, banned claims (e.g., no new facts).
    If any check fails → one corrective pass with the mid-tier.
    If it still fails → escalate once to your best model.

Verifier Prompt

Validate the JSON against rules: [schema valid, evidence ≥1, length ≤120 words, no claims absent from evidence].
Return {"pass":true|false,"failed":["..."]}.

Practical Templates

A) Retrieval Controller (Plain-English Spec)

Given user_query:
1) Sparse search (BM25) → top 50.
2) Dense rerank → top 3 with MMR λ=0.3.
3) For each of top 3, run compressor → 3–5 bullets ≤70 tokens.
4) If <2 viable snippets, return INSUFFICIENT_CONTEXT.
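As code, the controller is one short function. This sketch reuses sparse_candidates, mmr_rerank, and compress_chunk from the earlier steps, with embed() standing in for whatever embedding call you use:

# controller.py - end-to-end retrieval controller (sketch)
import numpy as np

def retrieve_evidence(user_query: str, chunks: list[dict]) -> list[dict] | str:
    candidates = sparse_candidates(user_query, chunks, k=50)        # Stage A
    query_vec = embed(user_query)                                   # placeholder embedder
    cand_vecs = np.array([embed(c["heading"] + " " + c["text"]) for c in candidates])
    top = mmr_rerank(query_vec, cand_vecs, candidates, top_k=3, lam=0.3)  # Stage B
    snippets = []
    for c in top:
        bullets = compress_chunk(c).splitlines()
        if bullets:  # drop empty compressions
            snippets.append({
                "doc_id": c["doc_id"],
                "section_id": c["section_id"],
                "why": c["heading"][:50],  # heading doubles as the "why relevant" label
                "bullets": bullets,
            })
    if len(snippets) < 2:
        return "INSUFFICIENT_CONTEXT"
    return snippets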

B) Final Prompt Skeleton

SYSTEM:
You are a precise, cost-aware assistant. Follow the schema exactly. No reasoning text.

USER:
Task: <one sentence>
Constraints: length ≤120 words; cite doc_id+section_id; no new facts.
Output Schema: { "answer": "...", "evidence":[...], "confidence":0.0-1.0 }

Evidence (compressed):
• [docA#s2 | KPI definition]
 - ...
 - ...
• [docB#s5 | 2024 policy]
 - ...
 - ...
• [docC#s1 | exception rule]
 - ...
 - ...

C) Escalation Message (Only If Needed)

Context: previous draft failed rules: ["schema invalid","missing citation"].
Action: produce a corrected final strictly matching schema.
Use only the evidence provided. No extra sources.

Metrics & Budgeting

Track per request

  • Input tokens (pre-compression vs post-compression)

  • Output tokens

  • Model tier used (draft, verify, escalate)

  • Pass rate at verifier

  • Latency per stage

Cost Back-of-Envelope

Total Cost ≈ draft_cost + verify_cost + (escalation? cost: 0)
draft_cost ∝ packed_context_tokens + output_tokens
verify_cost ∝ tiny JSON only (cheap)
escalation_rate target: ≤10%
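A quick worked example with placeholder prices; the per-million-token rates below are made up for illustration, not real quotes:

# cost.py - back-of-envelope per-request cost (illustrative prices)
PRICE_PER_M_INPUT = {"draft": 0.50, "verify": 0.10, "escalate": 5.00}    # $/1M input tokens (made up)
PRICE_PER_M_OUTPUT = {"draft": 1.50, "verify": 0.30, "escalate": 15.00}  # $/1M output tokens (made up)

def request_cost(packed_ctx: int, output_toks: int, verify_toks: int = 150,
                 escalated: bool = False) -> float:
    """Mirror the back-of-envelope formula above."""
    cost = (packed_ctx * PRICE_PER_M_INPUT["draft"]
            + output_toks * PRICE_PER_M_OUTPUT["draft"]) / 1_000_000
    cost += (verify_toks * PRICE_PER_M_INPUT["verify"]
             + 50 * PRICE_PER_M_OUTPUT["verify"]) / 1_000_000   # tiny JSON only
    if escalated:
        cost += (packed_ctx * PRICE_PER_M_INPUT["escalate"]
                 + output_toks * PRICE_PER_M_OUTPUT["escalate"]) / 1_000_000
    return cost

# Example: 600 packed tokens, 150 output tokens, 10% escalation rate
expected = 0.9 * request_cost(600, 150) + 0.1 * request_cost(600, 150, escalated=True)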

Troubleshooting

  • Still expensive → Lower top_k from 5→3; shorten compressor budget; remove examples; tighten schema.

  • Hallucinating → Fail verifier when evidence list is empty or claims lack citations; allow INSUFFICIENT_CONTEXT.

  • Format drift → Add a JSON Schema validator in the verifier step; reject non-conforming outputs to force correction.

  • Latency spikes → Cache embeddings; cache compressor outputs by (doc_id, section_id, version).
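For the latency point, a small memo cache keyed by (doc_id, section_id, version) is often enough; functools.lru_cache is one zero-dependency option, and compress_chunk is the sketch from step 4:

# cache.py - memoize compressor output per chunk version (sketch)
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_compression(doc_id: str, section_id: str, version: str, text: str) -> str:
    # Re-runs only when the chunk's version (or text) changes; otherwise a lookup.
    return compress_chunk({"doc_id": doc_id, "section_id": section_id, "text": text})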

Ops Checklist (Print-Friendly)

  • Normalize once; store metadata.

  • Chunk 200–400 tokens with headings.

  • Two-stage retrieval (BM25 → dense rerank + MMR).

  • Compress each snippet to 3–5 bullets.

  • Pack ≤3 snippets with labels and a hard token budget.

  • Zero-shot with strict schema; minimal or no few-shot.

  • Draft (mid) → Verify (small) → Escalate (best, at most once).

  • Log tokens, pass rate, and escalation rate; prune monthly.

This pipeline reduces context bloat, stabilizes outputs, and makes your spend predictable while keeping quality auditable.