RAG on a Budget: Cost-Efficient Retrieval and Context Packing
Executive Summary
Most LLM cost comes from input tokens, not output. The fastest way to cut spend is to retrieve less, but better, then pack only the minimum context the model needs. This article gives you a concrete, repeatable pipeline that lowers cost 30–70% without sacrificing accuracy.
The Cost Levers (What Actually Moves the Needle)
Context volume: Fewer, tighter snippets beat long pastes.
Redundancy: Near-duplicate chunks silently double the cost.
Shot count: Schemas beat many few-shot examples.
Model tiering: Cheap models for fetch/condense; strong models for final synthesis only.
Step-by-Step Pipeline
1) Normalize Sources Once
Convert PDFs, slides, and HTML to clean text.
Strip boilerplate, headers, footers, navigation.
Keep stable doc_id, section_id, timestamp metadata.
Tip: Save the cleaned text; normalization is a one-time cost you don't want to pay again.
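A minimal sketch of this step, assuming HTML sources and the BeautifulSoup library; the boilerplate tag list and function names are illustrative assumptions, not a prescribed implementation.

from bs4 import BeautifulSoup
from datetime import datetime, timezone

BOILERPLATE_TAGS = ["nav", "header", "footer", "script", "style", "aside"]  # assumed list; tune per source

def normalize_html(html: str, doc_id: str, section_id: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(BOILERPLATE_TAGS):                     # strip navigation, headers, footers, scripts
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())  # collapse whitespace into clean prose
    return {
        "doc_id": doc_id,
        "section_id": section_id,
        "text": text,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }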
2) Smart Chunking (200–400 tokens)
Prefer semantic boundaries (headings, bullet lists) to avoid splitting concepts.
Include a 1-line heading per chunk summarizing its claim.
Store: {doc_id, section_id, heading, text, tokens}.
Why 200–400? Long chunks dilute relevance; tiny chunks increase fetch count. This range balances both.
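A heading-aware chunking sketch, assuming the cleaned text from step 1 is already split into sections; the whitespace token count is a crude stand-in for a real tokenizer.

def chunk_section(doc_id: str, section_id: str, heading: str, text: str,
                  max_tokens: int = 400) -> list[dict]:
    # Greedily pack paragraphs so chunks land near the 200–400 token range
    # without splitting a concept mid-paragraph.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups, current, current_tokens = [], [], 0
    for para in paragraphs:
        para_tokens = len(para.split())                    # crude token estimate
        if current and current_tokens + para_tokens > max_tokens:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        groups.append(current)
    return [{
        "doc_id": doc_id,
        "section_id": f"{section_id}.{i}",
        "heading": heading,                                # 1-line claim for the chunk
        "text": "\n\n".join(parts),
        "tokens": sum(len(p.split()) for p in parts),
    } for i, parts in enumerate(groups)]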
3) Two-Stage Retrieval (Cheap → Precise)
Stage A: Sparse Filter (cheap): a BM25/keyword pass over the full corpus to produce a wide candidate set.
Stage B: Dense Rerank (precise): embedding similarity (plus MMR for diversity) over those candidates to keep only the top few.
Rule: If fewer than 3 candidates score well, return INSUFFICIENT_CONTEXT rather than padding the prompt with weak snippets (a retrieval sketch follows this step).
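A minimal sketch of both stages plus the rule, assuming the rank_bm25 and sentence-transformers libraries; the embedding model name and the 0.3 score threshold are placeholders, not recommendations.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def two_stage_retrieve(query: str, chunks: list[dict],
                       k_sparse: int = 50, k_final: int = 3, min_score: float = 0.3):
    # Stage A: cheap sparse filter over the whole corpus.
    bm25 = BM25Okapi([c["text"].lower().split() for c in chunks])
    sparse = bm25.get_scores(query.lower().split())
    candidates = sorted(range(len(chunks)), key=lambda i: sparse[i], reverse=True)[:k_sparse]

    # Stage B: precise dense rerank over the small candidate set only.
    model = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder model
    q_emb = model.encode(query, convert_to_tensor=True)
    c_embs = model.encode([chunks[i]["text"] for i in candidates], convert_to_tensor=True)
    ranked = sorted(zip(candidates, util.cos_sim(q_emb, c_embs)[0].tolist()),
                    key=lambda x: x[1], reverse=True)

    top = [(i, s) for i, s in ranked[:k_final] if s >= min_score]
    if len(top) < 3:                                       # the rule: don't pad weak context
        return "INSUFFICIENT_CONTEXT"
    return [chunks[i] for i, _ in top]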
4) Pre-Compression (Small Model, Not the Big One)
Before you ever hit a powerful model, compress each retrieved chunk:
Compressor Prompt
You are a compressor. Condense the snippet to ≤70 tokens.
Preserve names, numbers, definitions, and decisions. Remove anecdotes and qualifiers.
Return 3–5 bullets. No explanations.
This step alone typically cuts 50–80% of retrieval token volume.
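A sketch of the compression call, assuming the OpenAI Python SDK (>=1.0); the model name is a placeholder for whatever small, cheap model you actually use.

from openai import OpenAI

client = OpenAI()
COMPRESSOR_PROMPT = (
    "You are a compressor. Condense the snippet to ≤70 tokens. "
    "Preserve names, numbers, definitions, and decisions. Remove anecdotes and qualifiers. "
    "Return 3–5 bullets. No explanations."
)

def compress_chunk(chunk: dict, model: str = "gpt-4o-mini") -> dict:  # placeholder model name
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": COMPRESSOR_PROMPT},
                  {"role": "user", "content": chunk["text"]}],
        max_tokens=120,                       # hard cap on the compressor's own output
        temperature=0,
    )
    return {**chunk, "bullets": resp.choices[0].message.content.strip()}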
5) Context Packing with Budgets
Assemble the prompt in strict sections:
Instructions (≤60 tokens)
Schema / Output Contract (≤120 tokens)
Quality Checks (3–5 rules) (≤60 tokens)
Compressed Evidence (≤3 snippets)
Each snippet as:
• [doc_id#section_id | why relevant in ≤10 tokens]
Then 3–5 bullets from the compressor.
Hard caps: max_tokens for output, and a visible context budget, e.g., "Total evidence ≤ 250 tokens."
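A packing sketch that enforces the section order and budgets above, assuming compressed snippets shaped like the compressor output; the whitespace token estimate is again an assumption.

def pack_prompt(instructions: str, schema: str, checks: str,
                compressed: list[dict], evidence_budget: int = 250) -> str:
    # At most 3 snippets; stop adding once the evidence budget is spent.
    evidence, used = [], 0
    for c in compressed[:3]:
        block = f"• [{c['doc_id']}#{c['section_id']} | {c['heading']}]\n{c['bullets']}"
        cost = len(block.split())             # crude token estimate
        if used + cost > evidence_budget:
            break
        evidence.append(block)
        used += cost
    return "\n\n".join([
        instructions,                         # ≤60 tokens
        schema,                               # ≤120 tokens
        checks,                               # 3–5 rules, ≤60 tokens
        "Evidence (compressed):\n" + "\n".join(evidence),
        f"Total evidence ≤ {evidence_budget} tokens.",
    ])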
6) Minimal Shots: Prefer Schemas
Zero-shot plus a JSON or outline schema usually beats multi-example few-shot for cost and stability.
Schema Guard
Return JSON:
{
"answer": "β€120 words",
"evidence": [{"doc_id":"...", "section_id":"..."}],
"confidence": 0.0-1.0
}
No extra keys. No reasoning text.
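A structural check for this contract, assuming the jsonschema library; the word count approximates the ≤120-word rule.

import json
from jsonschema import ValidationError, validate

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "evidence": {
            "type": "array", "minItems": 1,
            "items": {"type": "object",
                      "properties": {"doc_id": {"type": "string"},
                                     "section_id": {"type": "string"}},
                      "required": ["doc_id", "section_id"]},
        },
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
    "required": ["answer", "evidence", "confidence"],
    "additionalProperties": False,            # "No extra keys."
}

def check_contract(raw: str) -> list[str]:
    # Returns the list of failed rules; an empty list means the draft passes.
    try:
        obj = json.loads(raw)
        validate(instance=obj, schema=ANSWER_SCHEMA)
    except (json.JSONDecodeError, ValidationError):
        return ["schema invalid"]
    return ["length > 120 words"] if len(obj["answer"].split()) > 120 else []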
7) Draft → Verify → Escalate (GSCP-Lite Gate)
Draft (mid-tier): Produce the answer using the packed context + schema.
Verify (small model): Check length, schema validity, evidence present, banned claims (e.g., no new facts).
If any check fails → one corrective pass with the mid-tier.
If it still fails → escalate once to your best model.
Verifier Prompt
Validate the JSON against rules: [schema valid, evidence ≥1, length ≤120 words, no claims absent from evidence].
Return {"pass":true|false,"failed":["..."]}.
Practical Templates
A) Retrieval Controller (Plain-English Spec)
Given user_query:
1) Sparse search (BM25) → top 50.
2) Dense rerank → top 3 with MMR λ=0.3.
3) For each of top 3, run compressor → 3–5 bullets, ≤70 tokens.
4) If <2 viable snippets, return INSUFFICIENT_CONTEXT.
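The same spec as glue code, assuming the two_stage_retrieve and compress_chunk sketches from steps 3 and 4; the MMR diversity pass is omitted here and would sit inside the rerank.

def retrieval_controller(user_query: str, chunks: list[dict]):
    hits = two_stage_retrieve(user_query, chunks, k_sparse=50, k_final=3)   # steps 1–2
    if hits == "INSUFFICIENT_CONTEXT":
        return "INSUFFICIENT_CONTEXT"
    compressed = [compress_chunk(c) for c in hits]                          # step 3: 3–5 bullets each
    viable = [c for c in compressed if c["bullets"].strip()]
    if len(viable) < 2:                                                     # step 4
        return "INSUFFICIENT_CONTEXT"
    return viable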
B) Final Prompt Skeleton
SYSTEM:
You are a precise, cost-aware assistant. Follow the schema exactly. No reasoning text.
USER:
Task: <one sentence>
Constraints: length ≤120 words; cite doc_id+section_id; no new facts.
Output Schema: { "answer": "...", "evidence":[...], "confidence":0.0-1.0 }
Evidence (compressed):
• [docA#s2 | KPI definition]
- ...
- ...
• [docB#s5 | 2024 policy]
- ...
- ...
• [docC#s1 | exception rule]
- ...
- ...
C) Escalation Message (Only If Needed)
Context: previous draft failed rules: ["schema invalid","missing citation"].
Action: produce a corrected final strictly matching schema.
Use only the evidence provided. No extra sources.
Metrics & Budgeting
Track per request
Input tokens (pre-compression vs post-compression)
Output tokens
Model tier used (draft, verify, escalate)
Pass rate at verifier
Latency per stage
Cost Back-of-Envelope
Total Cost ≈ draft_cost + verify_cost + (escalation_cost if escalated, else 0)
draft_cost ≈ packed_context_tokens × input price + output_tokens × output price
verify_cost ≈ tiny JSON only (cheap)
escalation_rate target: ≤10%
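A calculator for the back-of-envelope above; all prices and the escalation premium are placeholders to be replaced with your provider's current rates.

def estimate_cost(packed_context_tokens: int, output_tokens: int, escalated: bool = False,
                  input_price_per_m: float = 0.50,          # placeholder $ per 1M input tokens
                  output_price_per_m: float = 1.50,         # placeholder $ per 1M output tokens
                  verify_cost: float = 0.0001,              # tiny JSON check on a small model
                  escalation_multiplier: float = 10.0) -> float:  # assumed premium for the best tier
    draft_cost = (packed_context_tokens * input_price_per_m
                  + output_tokens * output_price_per_m) / 1_000_000
    escalation_cost = draft_cost * escalation_multiplier if escalated else 0.0
    return draft_cost + verify_cost + escalation_cost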
Troubleshooting
Still expensive → Lower top_k from 5 to 3; shorten the compressor budget; remove examples; tighten the schema.
Hallucinating → Fail the verifier when the evidence list is empty or claims lack citations; allow INSUFFICIENT_CONTEXT.
Format drift → Add a JSON Schema validator in the verifier step; reject non-conforming outputs to force correction.
Latency spikes → Cache embeddings; cache compressor outputs by (doc_id, section_id, version); a small cache sketch follows.
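A minimal cache for that last item, keyed by (doc_id, section_id, version) as suggested and reusing the compress_chunk sketch from step 4; the version value is whatever revision marker you already store in metadata.

_compressor_cache: dict[tuple[str, str, str], dict] = {}

def compress_cached(chunk: dict, version: str) -> dict:
    key = (chunk["doc_id"], chunk["section_id"], version)
    if key not in _compressor_cache:                  # pay the compressor once per document revision
        _compressor_cache[key] = compress_chunk(chunk)
    return _compressor_cache[key]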
Ops Checklist (Print-Friendly)
Normalize once; store metadata.
Chunk 200–400 tokens with headings.
Two-stage retrieval (BM25 → dense rerank + MMR).
Compress each snippet to 3–5 bullets.
Pack ≤3 snippets with labels and a hard token budget.
Zero-shot with strict schema; minimal or no few-shot.
Draft (mid) → Verify (small) → Escalate (best, at most once).
Log tokens, pass rate, and escalation rate; prune monthly.
This pipeline reduces context bloat, stabilizes outputs, and makes your spend predictable while keeping quality auditable.