Introduction
Great prompts can’t rescue bad context. If the model sees stale, off-limits, or noisy material, outputs will drift, citations will crumble, and remediation will be expensive. Context engineering is the discipline of deciding what information the model is allowed to use, how that information is shaped, and how facts are cited and audited. This article presents a practical, production-minded approach you can adopt without changing models: filter sources by eligibility (tenant, license, jurisdiction, freshness) before retrieval, shape passages into atomic claims with timestamps and IDs, and require minimal-span citations for factual sentences. The outcome is cheaper prompts, higher first-pass acceptance, and traceable answers.
The Problem Context Engineering Must Solve
Common failure modes share a root cause: ungoverned context.
Policy violations: content from the wrong tenant/region leaks into responses.
Staleness: facts are correct in general, but not for this month, market, or SKU.
Bloat and drift: long, raw passages inflate tokens and give the model room to improvise.
Unverifiable specifics: numbers or named entities appear without a source trail.
Incident opacity: when something goes wrong, no one can show which sentence came from which source.
Fixing these requires front-loading eligibility decisions and converting eligible text into a claim format that is cheap to pass, easy to cite, and trivial to audit.
Core Approach
Eligibility before similarity. Filter sources by tenant, license, region/locale, and freshness window before any vector/BM25 search.
Claim shaping. Convert passages to atomic claims with a source_id, effective_date, tier (primary/secondary), and a minimal quote span.
Small claim packs. Provide the generator with 6–20 claims relevant to the section being written, not entire documents.
Minimal-span citations. Each factual sentence references 1–2 claim IDs; when a number is quoted, include the shortest supporting span.
Fail-closed validators. Enforce claim freshness, coverage (% of factual lines with claims), and jurisdictional rules; repair or abstain on failure.
Observability & audit. Log the exact claim IDs used per sentence and the policy versions in force.
Implementation Steps (End-to-End)
1) Build Eligibility Filters
Create a policy object per route/region that answers “Is this source allowed to influence the model?”
Tenant & license: allow-list customer/department indices; exclude private or unlicensed docs.
Jurisdiction & locale: US vs EU versions; EN-US vs EN-UK spellings; industry-specific constraints.
Freshness windows: e.g., 540 days for general facts; 90 days for pricing.
Source tiering: rank primary sources (contracts, product specs) over secondary (blogs) and tertiary (press).
Only eligible documents are candidates for retrieval.
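As a concrete sketch, here is one way such a policy object might look in Python. The field names (tenant_id, effective_date, topic) and the example values are illustrative assumptions, not a required schema.

from dataclasses import dataclass
from datetime import date

@dataclass
class EligibilityPolicy:
    # All field names here are illustrative, not a fixed schema.
    allowed_tenants: set
    allowed_jurisdictions: set
    max_age_days: int = 540           # general facts
    max_age_days_pricing: int = 90    # tighter window for pricing topics
    allowed_tiers: tuple = ("primary", "secondary")

    def is_eligible(self, doc: dict, today: date) -> bool:
        """Answer 'may this source influence the model?' before any retrieval runs."""
        if doc["tenant_id"] not in self.allowed_tenants:
            return False
        if doc["jurisdiction"] not in self.allowed_jurisdictions:
            return False
        if doc["tier"] not in self.allowed_tiers:
            return False
        window = self.max_age_days_pricing if doc.get("topic") == "pricing" else self.max_age_days
        return (today - doc["effective_date"]).days <= window

# Usage: only documents that pass become retrieval candidates.
policy = EligibilityPolicy(allowed_tenants={"acme"}, allowed_jurisdictions={"US"})
docs = [{"tenant_id": "acme", "jurisdiction": "US", "tier": "primary", "topic": "pricing",
         "effective_date": date(2025, 3, 7), "source_id": "doc:pricing_guide_v9"}]
candidates = [d for d in docs if policy.is_eligible(d, today=date(2025, 5, 1))]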
2) Retrieve, Then Shape
After eligibility, run your usual retrieval (vector/BM25/hybrid) to get passages. Immediately transform them into claims:
{
  "claim_id": "kb:2025-03-07:pricing#p4",
  "text": "The Professional plan includes 5 seats by default.",
  "source_id": "doc:pricing_guide_v9",
  "effective_date": "2025-03-07",
  "tier": "primary",
  "span": "…Professional plan includes 5 seats by default…",
  "url": "https://example.com/pricing#p4",
  "jurisdiction": "US"
}
Normalize entities (plan names, SKUs), deduplicate near-identical lines, and discard claims that fail eligibility or freshness. Group the remaining 6–20 as a claim pack keyed to the user’s question/section.
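A sketch of that shaping pass, assuming each passage arrives with its source document's metadata attached; the alias map, field names, and 120-character span cap are illustrative, and is_eligible can be the policy check from step 1.

import re

# Illustrative alias map; in practice this is precomputed from your entity catalog.
ENTITY_ALIASES = {r"\bpro plan\b": "Professional plan", r"\bprofessional tier\b": "Professional plan"}

def normalize(text: str) -> str:
    out = text
    for pattern, canonical in ENTITY_ALIASES.items():
        out = re.sub(pattern, canonical, out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

def shape_claims(passages, is_eligible, today, limit=20):
    """Convert retrieved passages into a deduplicated claim pack of at most `limit` claims."""
    seen, pack = set(), []
    for p in passages:
        doc = p["doc"]
        if not is_eligible(doc, today):
            continue                                  # failed eligibility or freshness: discard
        text = normalize(p["sentence"])
        key = text.lower()
        if key in seen:
            continue                                  # drop near-identical lines after normalization
        seen.add(key)
        pack.append({
            "claim_id": f"{doc['source_id']}#{p['offset']}",
            "text": text,
            "source_id": doc["source_id"],
            "effective_date": doc["effective_date"].isoformat(),
            "tier": doc["tier"],
            "span": p["sentence"][:120],              # minimal verbatim quote, never the whole passage
            "jurisdiction": doc["jurisdiction"],
        })
        if len(pack) >= limit:
            break
    return pack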
3) Feed Sections, Not Dumps
Generate by section (from Part 2) and pass only the relevant subset of the claim pack to each section. Keep prompts lean; claims do the heavy lifting.
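One way to target sections, sketched under the assumption that each claim carries a topics tag (not part of the claim schema shown earlier):

def claims_for_section(claim_pack, section_topics, cap=8):
    """Select only the claims tagged for this section, primary sources and freshest first."""
    relevant = [c for c in claim_pack if set(c.get("topics", [])) & set(section_topics)]
    relevant.sort(key=lambda c: c["effective_date"], reverse=True)   # freshest first
    relevant.sort(key=lambda c: c["tier"] != "primary")              # stable sort keeps primary on top
    return relevant[:cap]

def build_section_prompt(section_name, claims):
    """Keep the prompt lean: short instructions, the claims carry the facts."""
    lines = [f"Write the '{section_name}' section. Cite claim IDs for every factual sentence."]
    lines += [f"[{c['claim_id']}] {c['text']} (effective {c['effective_date']})" for c in claims]
    return "\n".join(lines)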
4) Cite Minimal Spans
Instruct the generator: “Every factual sentence references 1–2 claim_ids; when quoting numbers or named entities, include the shortest supporting span.” Keep citation markup simple (e.g., [kb:2025-03-07:pricing#p4]), then post-process into your final format.
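The post-processing step can be a few lines of deterministic code; a sketch, assuming markers of the exact form [claim_id] embedded in the draft:

import re

MARKER = re.compile(r"\[([a-z]+:[^\[\]\s]+)\]")   # matches markers like [kb:2025-03-07:pricing#p4]

def strip_markers(text: str) -> str:
    """Remove citation markers and tidy the spacing left behind."""
    return re.sub(r"\s+([.,;:])", r"\1", MARKER.sub("", text)).strip()

def extract_citations(draft: str):
    """Return (clean_sentence, [claim_ids]) pairs for downstream validation and rendering."""
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    return [(strip_markers(s), MARKER.findall(s)) for s in sentences if strip_markers(s)]

draft = "The Professional plan includes 5 seats by default [kb:2025-03-07:pricing#p4]."
print(extract_citations(draft))
# [('The Professional plan includes 5 seats by default.', ['kb:2025-03-07:pricing#p4'])]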
5) Validate and Repair
Before display:
Coverage: ≥ 70% of factual sentences have 1–2 claim_ids.
Freshness: claims within window; stale → swap or hedge/remove.
Jurisdiction: claim’s jurisdiction matches the route’s policy.
Conflicts: if two claims disagree, surface both with dates or abstain.
On failure, repair the section: replace stale claims with fresher ones, add hedges (“According to the March 2025 pricing guide…”), or remove unsupported specifics. Only resample if deterministic repairs can’t satisfy policy.
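A sketch of the fail-closed check and one deterministic repair, assuming the (sentence, claim_ids) pairs produced in step 4; the thresholds, heuristics, and error-code strings are illustrative:

import re
from datetime import datetime

COVERAGE_TARGET = 0.7   # tune per route

def looks_factual(sentence: str) -> bool:
    # Crude heuristic: digits or a named plan suggest a checkable fact.
    return bool(re.search(r"\d", sentence) or re.search(r"\b[A-Z][a-z]+ plan\b", sentence))

def validate_section(mapping, claim_pack, route_jurisdiction, window_days, today):
    """Return machine-readable error codes; an empty list means the section passes."""
    errors = []
    claims = {c["claim_id"]: c for c in claim_pack}
    factual = [(s, ids) for s, ids in mapping if ids or looks_factual(s)]
    if factual and sum(1 for _, ids in factual if ids) / len(factual) < COVERAGE_TARGET:
        errors.append("coverage_below_target")
    for _, ids in mapping:
        for cid in ids:
            claim = claims.get(cid)
            if claim is None:
                errors.append("unknown_claim_id")
                continue
            effective = datetime.strptime(claim["effective_date"], "%Y-%m-%d").date()
            if (today - effective).days > window_days:
                errors.append("stale_claim")
            if claim["jurisdiction"] != route_jurisdiction:
                errors.append("jurisdiction_mismatch")
    return errors

def hedge(sentence: str, claim) -> str:
    """Deterministic repair: attribute the fact to its dated source rather than resampling."""
    return f"According to the source effective {claim['effective_date']}, " + sentence[0].lower() + sentence[1:]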
6) Log for Audit
Store per response:
Contract & policy hashes
Claim pack IDs and the subset used
Sentence → claim_id mapping
Validator outcomes and repairs
Region/locale and timestamps
This turns “trust me” into a reproducible trace.
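A sketch of what that per-response record might contain; the field names and the SHA-256 prefix are assumptions, the point is that everything needed to replay the response is captured:

import hashlib
import json
from datetime import datetime, timezone

def stable_hash(bundle: dict) -> str:
    """Short, reproducible hash of the contract or policy bundle in force."""
    return hashlib.sha256(json.dumps(bundle, sort_keys=True).encode("utf-8")).hexdigest()[:16]

def audit_record(contract, policy_bundle, claim_pack, mapping, validator_errors, repairs, region):
    """One record per response: which evidence shaped which sentence, under which policy."""
    return {
        "contract_hash": stable_hash(contract),
        "policy_hash": stable_hash(policy_bundle),
        "claim_pack_ids": [c["claim_id"] for c in claim_pack],
        "claims_used": sorted({cid for _, ids in mapping for cid in ids}),
        "sentence_to_claims": [{"sentence": s, "claim_ids": ids} for s, ids in mapping],
        "validator_errors": validator_errors,
        "repairs": repairs,
        "region": region,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }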
Data Model: What a Good Claim Encodes
A claim must be specific enough to be cited, small enough to be cheap, and rich enough for audit.
Identity: claim_id, source_id, optional url
Provenance: effective_date, tier, jurisdiction
Content: text (normalized), span (verbatim minimal quote)
Policy hooks: freshness window, region locks, licensing notes
Minimal quote spans prevent the generator from copying whole paragraphs, reducing tokens and legal risk while preserving verifiability.
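The same claim, expressed as a typed structure; a sketch, where the exact types and the freshness_days/license_note policy fields are one possible encoding:

from typing import Literal, Optional, TypedDict

class Claim(TypedDict):
    # Identity
    claim_id: str
    source_id: str
    url: Optional[str]
    # Provenance
    effective_date: str                                  # ISO date, e.g. "2025-03-07"
    tier: Literal["primary", "secondary", "tertiary"]
    jurisdiction: str
    # Content
    text: str                                            # normalized claim text
    span: str                                            # verbatim minimal quote from the source
    # Policy hooks
    freshness_days: int                                  # window within which this claim may be cited
    license_note: Optional[str]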
Citation Stitching (Lightweight, Deterministic)
After generation, run a pass that:
Detects factual sentences (numbers, entities, superlatives).
Maps each to the nearest claim in the section’s pack by lexical overlap and embeddings.
Attaches claim IDs; if none fit above a threshold, mark as uncovered.
Repairs uncovered sentences: inject a hedge + valid claim, or drop the sentence.
Flags conflicts where multiple claims disagree; require dual-citation or abstention.
This post-processor is small code, not model magic—and it pays for itself immediately in fewer incidents.
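A minimal sketch of that pass using lexical overlap only; a production version might add an embedding similarity score, and the factual-sentence heuristic and 0.35 threshold here are assumptions:

import re

def token_set(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_factual(sentence: str) -> bool:
    """Numbers, superlatives, or a capitalized word past the first position (rough entity check)."""
    words = sentence.split()
    has_entity = any(w[0].isupper() for w in words[1:] if w)
    has_superlative = bool(re.search(r"\b(best|largest|fastest|only)\b", sentence, re.IGNORECASE))
    return bool(re.search(r"\d", sentence)) or has_superlative or has_entity

def stitch(sentences, claim_pack, threshold=0.35):
    """Attach the best-overlapping claim to each factual sentence; mark the rest uncovered."""
    results = []
    for s in sentences:
        if not is_factual(s):
            results.append((s, None, "non_factual"))
            continue
        best_id, best_score = None, 0.0
        s_tokens = token_set(s)
        for c in claim_pack:
            c_tokens = token_set(c["text"])
            score = len(s_tokens & c_tokens) / max(len(c_tokens), 1)
            if score > best_score:
                best_id, best_score = c["claim_id"], score
        if best_score >= threshold:
            results.append((s, best_id, "covered"))
        else:
            results.append((s, None, "uncovered"))   # repair downstream: hedge + valid claim, or drop
    return results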
Validators: What to Enforce Every Time
Schema & structure: the section exists, bullets/fields are present.
Citation coverage: factual lines hit the ≥70% target (tune per route).
Freshness: claims inside the policy window or marked “evergreen.”
Jurisdictional fit: jurisdiction matches audience/route; otherwise swap.
Banned terms / tone: from your policy bundle (see Part 3).
Implied writes: still zero tolerance (see Part 5, later in this series).
Return machine-readable error codes to power deterministic repairs.
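An illustrative set of codes and their paired repairs; the names are assumptions, not a standard:

from enum import Enum

class ValidationError(str, Enum):
    MISSING_SECTION = "missing_section"
    COVERAGE_BELOW_TARGET = "coverage_below_target"
    STALE_CLAIM = "stale_claim"
    JURISDICTION_MISMATCH = "jurisdiction_mismatch"
    BANNED_TERM = "banned_term"
    IMPLIED_WRITE = "implied_write"

# Each code maps to one deterministic repair, tried before any resampling.
REPAIRS = {
    ValidationError.STALE_CLAIM: "swap in a fresher claim or hedge with the source date",
    ValidationError.COVERAGE_BELOW_TARGET: "stitch additional claims or drop unsupported specifics",
    ValidationError.JURISDICTION_MISMATCH: "replace with a claim matching the route's policy",
    ValidationError.IMPLIED_WRITE: "abstain; never state that an unexecuted action succeeded",
}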
Metrics That Matter
Citation precision/recall: sampled against ground truth or human spot-checks.
Coverage rate: factual sentences with ≥1 valid claim.
Stale-claim rate: % of claims beyond window encountered at validation.
Time-to-valid: ensure shaping and stitching don’t inflate p95; they usually reduce it.
$/accepted output: claim packs lower tokens vs. raw passages → costs drop.
Conflict surfacing: number of conflicting-claim cases handled explicitly (a safety win).
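Most of these metrics fall out of the audit records you are already writing; a rough aggregation sketch, assuming the record shape from step 6 and treating every logged sentence mapping as factual:

def rollup(records):
    """Aggregate per-response audit records into route-level quality metrics."""
    factual_total = covered = claims_total = stale = conflicts = 0
    for r in records:
        lines = r["sentence_to_claims"]        # approximation: logged mappings stand in for factual lines
        factual_total += len(lines)
        covered += sum(1 for line in lines if line["claim_ids"])
        claims_total += len(r["claim_pack_ids"])
        stale += r["validator_errors"].count("stale_claim")
        conflicts += r.get("conflicts_surfaced", 0)
    return {
        "coverage_rate": covered / factual_total if factual_total else 1.0,
        "stale_claim_rate": stale / claims_total if claims_total else 0.0,
        "conflicts_surfaced": conflicts,
    }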
Performance & Cost Considerations
Small packs, big wins: 6–20 claims per section are enough; more increases tokens and conflict risk.
Cache by topic/segment: reuse shaped packs for hot queries; invalidate on source updates.
Precompute normalization: entity maps and canonical IDs cut runtime and errors.
Section targeting: fetch and pass only what the section needs; don’t pay for global context.
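A sketch of such a cache, keyed by topic and segment with TTL expiry and source-based invalidation; the key scheme and six-hour TTL are assumptions:

import time

class ClaimPackCache:
    """Reuse shaped claim packs for hot topic/segment pairs; expire on TTL or source update."""
    def __init__(self, ttl_seconds=6 * 3600):
        self.ttl = ttl_seconds
        self._store = {}                          # (topic, segment) -> (expires_at, pack)

    def get(self, topic, segment):
        entry = self._store.get((topic, segment))
        if entry and entry[0] > time.time():
            return entry[1]
        self._store.pop((topic, segment), None)   # expired or missing
        return None

    def put(self, topic, segment, pack):
        self._store[(topic, segment)] = (time.time() + self.ttl, pack)

    def invalidate_source(self, source_id):
        """Drop every cached pack containing a claim from an updated source."""
        stale = [k for k, (_, pack) in self._store.items()
                 if any(c["source_id"] == source_id for c in pack)]
        for k in stale:
            del self._store[k]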
Common Pitfalls—and Remedies
Dumping PDFs into context. Shape to claims; pass spans, not pages.
Ignoring eligibility. Similarity alone is not compliance; filter first.
Over-citation. One or two claim IDs per factual sentence is plenty.
Numbers without spans. Always carry the shortest quote; numbers are high-risk.
Hidden conflicts. If claims disagree, show both with dates or abstain—don’t average them.
Stale pack reuse. Tie claim packs to freshness windows; expire them automatically.
Worked Example (Composite)
A pricing FAQ route serves US visitors in March 2025.
Eligibility: US-only, licensed internal pricing guides ≤180 days.
Retrieval & shaping: 14 claims; two about “Professional plan seats.”
Generation: “What’s included?” section cites [kb:2025-03-07:pricing#p4] for “5 seats by default.”
Validation: coverage 83%; freshness OK; no conflicts.
Audit log: contract/policy hashes; claim IDs per sentence; zero repairs.
Outcome: p95 time-to-valid drops 18% vs. raw-passage baseline; $/accepted down 22%.
Conclusion
Context engineering is not an add-on—it’s the other half of prompt engineering. By filtering for eligibility, shaping text into atomic claims, and enforcing minimal-span citations with fail-closed validators, you turn uncertain inputs into governed evidence. Outputs become shorter, safer, and easier to audit. In the next article, we’ll focus on tool mediation—how to let models propose actions without ever implying success until your system has verified and executed the change.