An airline’s customer assistant looked great in demos until flight schedules started changing daily. The model spoke confidently, but it mixed old and new refund rules, forgot to cite sources, and sometimes promised refunds the policy didn’t allow. The fix wasn’t a bigger model. It was context engineering: getting the right evidence to the model in the right shape, with the right rules.
The Real-Life Use Case
A traveler messages: “My flight was moved from 8:00am to 11:45am—am I eligible for a refund?” The correct answer depends on:
The traveler’s itinerary and fare class
The current policy for schedule change thresholds (e.g., 90+ minutes)
Jurisdiction-specific regulations (country of departure)
Whether the airline offered reaccommodation options
Everything is knowable, but only if the assistant pulls and uses the right context—today’s policy, this traveler’s booking, this route’s rules.
What Went Wrong (Before)
Stale policy: The assistant quoted a 60-minute threshold that changed to 90 minutes last week.
No provenance: It couldn’t show which policy version it used.
Gap-filling guesses: Where details were missing (e.g., fare class), it guessed.
Silent conflicts: Two sources disagreed; the assistant merged them instead of flagging a discrepancy.
The Context Engineering Fix
We didn’t retrain the model. We changed how it receives and uses evidence:
Contract (the operating rules): Define freshness windows, source tiers, tie-breaks, abstention rules, and output schema.
Retrieval (policy-aware, not just relevant): Pull only eligible policies (correct airline, region, and effective dates), the ticket record, and the schedule change event.
Shaping (atomic claims): Turn documents into timestamped facts (threshold=90, effective_date, exceptions list), each with a source ID.
Reasoning (apply the contract): Rank by score → break ties by recency → prefer primary sources → check required fields → decide answer vs. ask-for-more (a sketch follows this list).
Validation (before user sees it): Enforce JSON schema, minimal-span citations, and an uncertainty check; refuse if key fields are missing.
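Here is a minimal sketch of that reasoning step in Python, assuming claims arrive as dicts shaped like the context pack shown below; the function names (rank_claims, decide) are placeholders, not a real API:

```python
from datetime import date

# Lower number = more trusted tier; unknown tiers default to least trusted.
SOURCE_TIER = {"primary": 0, "system": 1, "secondary": 2}

def rank_claims(claims):
    # Rank by retrieval_score, break ties by newest effective_date,
    # then prefer higher-tier (e.g., primary) sources.
    return sorted(
        claims,
        key=lambda c: (
            -c["retrieval_score"],
            -date.fromisoformat(c["effective_date"]).toordinal(),
            SOURCE_TIER.get(c["source_type"], 99),
        ),
    )

def decide(claims, required_fields, known_fields):
    # Abstain when required inputs are missing; otherwise answer from ranked evidence.
    missing = [f for f in required_fields if f not in known_fields]
    if missing:
        return {"action": "ask_for_more", "missing": missing}
    ranked = rank_claims(claims)
    return {"action": "answer", "evidence": [c["id"] for c in ranked]}
```

In practice this sits between retrieval and generation, so the model only sees evidence that already passed the contract’s ordering and completeness rules.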
A One-Page Contract (Drop-In)
System role: Airline refund assistant. Use only provided context.
Rules:
- Rank evidence by retrieval_score; break ties by newest effective_date.
- Prefer primary sources: "policy.md" > email > blog.
- Quote minimal spans; include source_ids for each quoted claim.
- If key fields missing (fare_class OR schedule_change_minutes), ask for them.
- If sources conflict, do not harmonize. Flag discrepancy and present both claims with dates.
- Refuse if evidence is outside freshness window (>= 60 days) unless explicitly marked "current".
Output JSON:
{
  "answer": "...",
  "citations": ["policy:2025-09-29#threshold", "pnr:AB1234#fare"],
  "missing": [],
  "uncertainty": 0.0-1.0,
  "rationale": "one concise sentence"
}
The Context Pack We Feed the Model
{
  "task": "Refund eligibility for PNR AB1234",
  "claims": [
    {
      "id": "policy:2025-10-02#threshold",
      "text": "Refund eligible if schedule change >= 90 minutes.",
      "effective_date": "2025-10-02",
      "retrieval_score": 0.94,
      "source_type": "primary"
    },
    {
      "id": "pnr:AB1234#delta",
      "text": "Original dep 08:00 → new dep 11:45 (change=225 minutes).",
      "effective_date": "2025-10-11",
      "retrieval_score": 0.91,
      "source_type": "system"
    },
    {
      "id": "pnr:AB1234#fare",
      "text": "Fare class: Basic Economy (B).",
      "effective_date": "2025-10-11",
      "retrieval_score": 0.88,
      "source_type": "system"
    }
  ],
  "policies": {
    "freshness_days": 60,
    "tie_break": "newest",
    "max_tokens_ctx": 3500,
    "required_fields": ["fare_class", "schedule_change_minutes"]
  }
}
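On the retrieval side, here is a sketch of the policy-aware filters that run before relevance ranking. It assumes each claim carries the fields shown above plus a hypothetical meta block with airline, region, and current flags, which are not part of the pack as printed:

```python
from datetime import date, timedelta

def eligible(claim, *, airline, region, today, freshness_days=60):
    # Hard eligibility filters: wrong airline, wrong region, or stale evidence
    # never reaches the model, no matter how high its retrieval_score.
    meta = claim.get("meta", {})
    if meta.get("airline") not in (None, airline):
        return False
    if meta.get("region") not in (None, region):
        return False
    age = today - date.fromisoformat(claim["effective_date"])
    return age <= timedelta(days=freshness_days) or meta.get("current", False)

def build_pack(task, claims, **filters):
    # Filter first, then order by relevance for the packed context.
    kept = [c for c in claims if eligible(c, **filters)]
    kept.sort(key=lambda c: -c["retrieval_score"])
    return {"task": task, "claims": kept}
```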
Shaping: From Documents to Atomic Claims
Normalize: Extract threshold values, effective dates, and exceptions into discrete claims.
Timestamp: Every claim must carry an effective_date.
Deduplicate: Merge duplicates by source tier and date; keep the highest-tier, newest claim.
Compress with guarantees: Summaries must link back to originals; no “free” paraphrases.
This makes the model’s job simple: apply rules to a clean set of facts.
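Here is a sketch of what the shaper’s deduplication might look like, assuming an upstream extractor has already emitted claims in the format above; grouping by the id fragment after “#” is an illustrative convention, not a requirement:

```python
from collections import defaultdict
from datetime import date

TIER = {"primary": 0, "system": 1, "secondary": 2}

def deduplicate(claims):
    # Group claims that state the same fact (here: by the id fragment after '#'),
    # then keep the highest-tier, newest claim from each group.
    groups = defaultdict(list)
    for claim in claims:
        groups[claim["id"].split("#")[-1]].append(claim)
    return [
        min(
            group,
            key=lambda c: (
                TIER.get(c["source_type"], 99),
                -date.fromisoformat(c["effective_date"]).toordinal(),
            ),
        )
        for group in groups.values()
    ]
```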
Validation: Guardrails That Run Before Output
A thin post-processor checks:
Schema validity: All required JSON fields present.
Citation coverage: Every factual span maps to a source_id.
Freshness & conflict flags: Out-of-window sources demoted; unresolved conflicts must be surfaced.
Uncertainty threshold: If inputs are incomplete or contradictory, return “ask-for-more” or “refuse.”
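A minimal sketch of that post-processor, assuming the output schema from the contract above; the 0.4 uncertainty threshold and the action labels are illustrative choices, not fixed values:

```python
REQUIRED_KEYS = {"answer", "citations", "missing", "uncertainty", "rationale"}

def validate(output, pack, max_uncertainty=0.4):
    errors = []
    # Schema validity: every required field must be present.
    if not REQUIRED_KEYS <= output.keys():
        errors.append(f"missing keys: {sorted(REQUIRED_KEYS - output.keys())}")
    # Citation coverage: every cited source_id must exist in the context pack.
    known_ids = {c["id"] for c in pack["claims"]}
    for cid in output.get("citations", []):
        if cid not in known_ids:
            errors.append(f"unknown citation: {cid}")
    if errors:
        return {"action": "refuse", "errors": errors}
    # Incomplete or uncertain inputs: ask for more instead of answering.
    if output.get("missing") or output.get("uncertainty", 1.0) > max_uncertainty:
        return {"action": "ask_for_more", "missing": output.get("missing", [])}
    return {"action": "deliver", "payload": output}
```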
What We Measured (and Improved)
Grounded accuracy: % of answers consistent with current policy.
Citation precision/recall: Are citations correct and complete?
Abstention quality: When data was missing, did we ask the right follow-up instead of guessing?
Cost & latency: Tokens and response time per successful answer.
Discrepancy surfacing: % of conflicts flagged before reaching users.
Within two weeks, grounded accuracy rose from 78% → 95%, citation precision from 62% → 93%, and unnecessary escalations fell by a third. Token spend dropped because the model stopped thrashing between irrelevant sources.
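Citation precision and recall, for example, fall out of a labeled eval set. Here is a sketch, assuming each recorded run stores predicted and gold citation lists (field names are illustrative):

```python
def citation_metrics(runs):
    # Micro-averaged precision/recall over all evaluated answers.
    tp = fp = fn = 0
    for run in runs:
        predicted = set(run["predicted_citations"])
        gold = set(run["gold_citations"])
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}
```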
Rollout Playbook You Can Reuse
Policy inventory: Centralize refund/return/eligibility rules with versioned dates.
Contract first: Write the one-page rules + JSON schema and treat it like code.
Shaper: Build a small service that emits atomic, timestamped claims with source IDs.
Retriever: Add filters for tenant, region, and freshness before relevance.
Validator: Enforce schema, citations, and uncertainty before showing output.
CI evals: Record real tickets; replay them as “context packs” on every change (see the sketch after this list).
Dashboards: Track grounded accuracy, policy adherence, abstention quality, cost, and latency.
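For the CI step, here is a sketch of a replay harness, assuming recorded tickets are stored as JSON files containing a context pack plus the expected action; run_assistant is a placeholder for your pipeline’s entry point, not a real API:

```python
import json
from pathlib import Path

def replay(recorded_dir, run_assistant):
    # Re-run every recorded ticket against the current pipeline and collect regressions.
    failures = []
    for path in sorted(Path(recorded_dir).glob("*.json")):
        case = json.loads(path.read_text())
        result = run_assistant(case["pack"])
        if result["action"] != case["expected_action"]:
            failures.append((path.name, result["action"], case["expected_action"]))
    return failures  # fail the CI job if this list is non-empty
```

Wire this into the same CI job that gates prompt or contract changes, so a policy edit cannot silently change an answer.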
The Result
The assistant now answers, cites, and—when it must—asks for exactly the missing field (e.g., fare family or country of departure) instead of guessing. Agents trust it. Auditors can trace it. Customers get consistent outcomes. That’s context engineering: not bigger models, but better evidence, rules, and proof.
Takeaway
If you want a quick start: write the one-page contract, shape policy into atomic, dated claims, and enforce citations and uncertainty on output. Do this for any high-stakes workflow—refunds, warranties, fee disputes, benefit eligibility—and you’ll convert fluent text into reliable decisions that stand up in the real world.