Prompt Engineering in 2028: Context Engineering from Passages to Claims (Series: The Next Five Years of Prompt Engineering, Part 3)

Introduction

By 2028, most prompt failures trace back to context, not wording. Long passages inflate tokens, bury contradictions, and invite improvisation. The more mature pattern is claims-based context: pass the model small, timestamped atomic claims with source IDs and minimal quotes; require minimal-span citations for factual lines; and fail closed when freshness, jurisdiction, or licensing rules are violated. This article lays out the operating approach: eligibility before similarity, claim shaping, citation rules, validation and repair, and the KPIs that prove quality and cost improve together.


Why Passages Fail in Production

Passage dumps look convenient; in practice they cause:

  • Token bloat: higher cost and slower p95/p99 from oversized prompts.

  • Hidden conflicts: contradictory lines from different pages with no tie-break policy.

  • Weak provenance: hard to answer “Where did this come from?” with precision.

  • Policy drift: region/license restrictions get bypassed by over-broad retrieval.

  • Repair churn: validators flag issues that force whole-document retries.

Claims invert this: they are compact, dated, source-addressable, and easy to vet.


Eligibility Before Similarity

Search should not decide who is allowed to speak. Gate sources first:

  • Tenant / license: enforce allow-lists; exclude unlicensed or private docs.

  • Jurisdiction / locale: US vs EU versions; en-US vs en-GB spellings; regulated verticals.

  • Freshness windows: e.g., ≤90 days for pricing; ≤540 days for general facts.

  • Source tiering: primary (contracts/specs) > secondary (docs) > tertiary (blogs).

Only eligible material enters retrieval. This single step removes most compliance incidents.
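
A minimal sketch of such a gate, assuming a simple source record with tenant, license, jurisdiction, tier, and date fields; the field names, the "licensed" flag, and the freshness table are illustrative, not a fixed schema:

from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Source:
    source_id: str
    tenant: str
    license: str           # e.g. "licensed" vs "unlicensed"
    jurisdiction: str      # e.g. "US", "EU"
    tier: str              # "primary" | "secondary" | "tertiary"
    effective_date: date
    claim_type: str        # e.g. "pricing", "general"

# Freshness windows per claim type, mirroring the examples above.
FRESHNESS_DAYS = {"pricing": 90, "general": 540}

def is_eligible(src, tenant, jurisdiction, today):
    """Decide whether a source may enter retrieval at all."""
    if src.tenant != tenant or src.license != "licensed":
        return False
    if src.jurisdiction != jurisdiction:
        return False
    window = FRESHNESS_DAYS.get(src.claim_type, 540)
    return (today - src.effective_date) <= timedelta(days=window)

def eligible_corpus(sources, tenant, jurisdiction, today=None):
    today = today or date.today()
    return [s for s in sources if is_eligible(s, tenant, jurisdiction, today)]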


Shaping Passages into Atomic Claims

Transform each relevant passage into a small, normalized record:

claim_id: "pricing:pro:v10:p4"
text: "Professional plan includes 5 seats by default."
source_id: "doc:pricing_guide_v10"
effective_date: "2028-03-07"
tier: "primary"
span: "…includes 5 seats by default…"
jurisdiction: "US"
entities: { plan: "Professional", seats: 5 }

Design notes

  • Keep text normalized for matching; keep span as the minimal quote to support numbers and names.

  • Attach effective_date, tier, and jurisdiction for tie-breaks and routing.

  • Deduplicate near-identical claims; prefer newer effective_date for the same entity.
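
One way to shape these records in code, with deduplication that prefers the newer effective_date per entity set; the Claim type mirrors the example record above, and the dedup key is an assumption:

from dataclasses import dataclass
from datetime import date

@dataclass
class Claim:
    claim_id: str
    text: str              # normalized for matching
    source_id: str
    effective_date: date
    tier: str              # "primary" | "secondary" | "tertiary"
    span: str              # minimal quote backing numbers and names
    jurisdiction: str
    entities: dict

def dedup_newest(claims):
    """Collapse near-identical claims: keep the newest claim per (jurisdiction, entities) key."""
    best = {}
    for c in claims:
        key = (c.jurisdiction, tuple(sorted(c.entities.items())))
        prior = best.get(key)
        if prior is None or c.effective_date > prior.effective_date:
            best[key] = c
    return list(best.values())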


Minimal-Span Citations (The 1–2 Rule)

Every factual sentence should cite 1–2 claim IDs; sentences that quote numbers or named entities should cite the shortest supporting span. Benefits:

  • Precision: readers see exactly what backs the statement.

  • Cost: smaller claims shrink prompts and reduce retries.

  • Auditability: you can recreate the answer with the same claim set.

When coverage is impossible (no eligible claim), the system must hedge or abstain.
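
A rough check of the 1–2 rule, assuming citations are rendered inline as bracketed claim IDs like [pricing:pro:v10:p4]; the regex and the rendering convention are assumptions, not a standard:

import re

CITATION = re.compile(r"\[([a-z0-9_:.-]+)\]", re.IGNORECASE)

def check_citation_rule(sentence, known_claim_ids):
    """Return problems with one factual sentence under the 1–2 rule (empty list = OK)."""
    ids = CITATION.findall(sentence)
    problems = []
    if len(ids) == 0:
        problems.append("uncovered: cite a claim, hedge, or drop the line")
    elif len(ids) > 2:
        problems.append("over-cited: more than 2 claim IDs")
    problems += [f"unknown claim ID: {cid}" for cid in ids if cid not in known_claim_ids]
    return problems

claims = {"pricing:pro:v10:p4"}
print(check_citation_rule(
    "The Professional plan includes 5 seats by default [pricing:pro:v10:p4].", claims))  # []
print(check_citation_rule("The Professional plan is the most popular.", claims))         # uncovered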


Handling Conflicts and Gaps

Conflicts are inevitable; policies must say what to do:

  • Newest-wins for the same source tier and entity.

  • Tiered preference: primary > secondary > tertiary; if mixed tiers conflict, surface the primary and footnote the rest.

  • Dual-cite when two up-to-date primary sources disagree; include dates.

  • Abstain if conflict persists or eligibility/freshness is unclear.

Gaps (no claim) trigger a targeted ASK for missing inputs or a safe refusal.
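
These rules compress into a small resolver; the sketch below reuses the illustrative Claim records from earlier, and the (decision, claims) return shape is an assumption rather than a standard:

TIER_RANK = {"primary": 0, "secondary": 1, "tertiary": 2}

def resolve(conflicting_claims):
    """Apply newest-wins, tier preference, dual-cite, or abstain to one entity's claims."""
    if not conflicting_claims:
        return ("abstain", [])                 # gap: trigger a targeted ASK or a safe refusal
    best_tier = min(TIER_RANK[c.tier] for c in conflicting_claims)
    top = [c for c in conflicting_claims if TIER_RANK[c.tier] == best_tier]
    newest = max(c.effective_date for c in top)
    winners = [c for c in top if c.effective_date == newest]
    if len(winners) == 1:
        return ("newest_wins", winners)        # footnote lower-tier or older claims if useful
    return ("dual_cite", winners[:2])          # equally fresh sources disagree: cite both, with dates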


Pack Sizing and Targeting

Overfeeding claims recreates passage bloat. Practical ranges:

  • Per section: 6–20 claims tied to that section’s question.

  • Per artifact: compose from per-section packs; do not pass a global heap.

  • Selection: hybrid rank (lexical + embedding) constrained by entity filters and freshness.

Keep packs small; they should feel like evidence, not reading assignments.
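
A pack selector along these lines; lexical_score and embedding_score are assumed callables, and the 0.4/0.6 blend is a placeholder weight to tune:

def build_pack(section_question, candidates, lexical_score, embedding_score,
               required_entities=None, min_n=6, max_n=20):
    """Select roughly min_n–max_n claims targeted at one section's question."""
    required_entities = required_entities or {}
    # Entity filter: drop claims whose entities contradict the section's required entities.
    pool = [c for c in candidates
            if all(c.entities.get(k) == v
                   for k, v in required_entities.items() if k in c.entities)]
    # Hybrid rank: blend lexical and embedding similarity over the normalized claim text.
    ranked = sorted(pool,
                    key=lambda c: 0.4 * lexical_score(section_question, c.text)
                                + 0.6 * embedding_score(section_question, c.text),
                    reverse=True)
    if len(ranked) < min_n:
        # Thin evidence: widen retrieval upstream, or expect the generator to hedge or abstain.
        return ranked
    return ranked[:max_n]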


Pipeline Architecture (End-to-End)

  1. Eligibility filter: tenant/license/jurisdiction/freshness.

  2. Retrieve: hybrid search over eligible corpus.

  3. Shape: normalize → extract entities → mint claim_id with span and metadata.

  4. Deduplicate & tier: collapse near-dupes; tag tier and effective date.

  5. Pack: select 6–20 claims per section; order by utility.

  6. Generate: sectioned generation with citation requirements.

  7. Validate: coverage %, freshness window, jurisdiction, conflicts, schema/lexicon.

  8. Repair small: swap stale claim, add hedge, attach citation; only then resample.

  9. Trace: store claim set and sentence→claim mapping with artifact hashes.
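
Glued together, the flow reads roughly as below; every step is a callable supplied by the caller (for example on a SimpleNamespace or small dataclass), so the names are placeholders for the components sketched in this article, not a real library:

def answer_section(question, route, steps):
    """Run one section end to end; `steps` bundles one callable per numbered stage."""
    sources  = steps.eligibility(route)                # 1. tenant/license/jurisdiction/freshness
    passages = steps.retrieve(question, sources)       # 2. hybrid search over the eligible corpus
    claims   = steps.shape(passages)                   # 3. normalize, extract entities, mint claim IDs
    claims   = steps.dedup_and_tier(claims)            # 4. collapse near-dupes, tag tier and date
    pack     = steps.pack(question, claims)            # 5. 6–20 claims, ordered by utility
    draft    = steps.generate(question, pack)          # 6. sectioned generation with citation rules
    issues   = steps.validate(draft, pack, route)      # 7. coverage, freshness, jurisdiction, conflicts
    if issues:
        draft = steps.repair(draft, pack, issues)      # 8. small, section-local repairs before resampling
    steps.trace(draft, pack)                           # 9. store claim set and sentence→claim mapping
    return draft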


Validation and Repair (Fail Closed, Fix Cheap)

Checks

  • Coverage: ≥70% of factual lines carry 1–2 claim IDs (tune per route).

  • Freshness: all claims within window or marked evergreen.

  • Jurisdiction/tenant: claims match route policy.

  • Conflict policy: newest-wins or dual-cite applied correctly.

  • Implied writes: still zero tolerance—even with “evidence.”

Repairs

  • Attach nearest valid claim to uncovered line; if none, hedge or drop the line.

  • Replace stale claims with fresher equivalents; if unavailable, annotate “as of <effective_date>.”

  • Rewrite comparative/superlative language to match claim scope.

  • Keep repairs section-local; avoid whole-artifact resamples.
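
A minimal validator in this spirit, reusing the inline [claim_id] convention from the citation sketch; the 70% threshold and the 540-day default are stand-ins for real route policy, and every line is treated as factual for simplicity:

import re
from datetime import date, timedelta

CITATION = re.compile(r"\[([a-z0-9_:.-]+)\]", re.IGNORECASE)

def validate_section(lines, pack, jurisdiction, today=None,
                     min_coverage=0.70, freshness_days=540):
    """Return a list of (kind, subject, suggested_fix) issues; an empty list means the section passes."""
    today = today or date.today()
    known = {c.claim_id for c in pack}
    issues, covered = [], 0
    for i, line in enumerate(lines):
        ids = CITATION.findall(line)
        if 1 <= len(ids) <= 2 and all(cid in known for cid in ids):
            covered += 1
        else:
            issues.append(("coverage", i, "attach nearest valid claim, hedge, or drop the line"))
    if lines and covered / len(lines) < min_coverage:
        issues.append(("coverage_rate", None, "repair uncovered lines; only then resample"))
    for c in pack:
        if (today - c.effective_date) > timedelta(days=freshness_days):
            issues.append(("stale", c.claim_id, "swap in a fresher claim or annotate the as-of date"))
        if c.jurisdiction != jurisdiction:
            issues.append(("jurisdiction", c.claim_id, "remove the claim and re-pack"))
    return issues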


Observability and KPIs

Track evidence like you track latency:

  • Citation coverage and precision/recall (sampled)

  • Stale-claim rate and time-to-freshness after source updates

  • Conflict resolution distribution (newest-wins vs dual-cite vs abstain)

  • Tokens per accepted output: header/context/generation breakdown

  • Time-to-valid p50/p95 and repairs per accepted

  • $/accepted output (LLM + retrieval + selection + repairs)

Healthy programs show rising coverage, falling stale rate, and lower tokens/accepted.
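
Most of these roll up from per-request records; a sketch assuming each run logs token count, repair count, cost, time to valid, and an accepted flag (the field names are illustrative):

def percentile(sorted_vals, p):
    """Nearest-rank percentile over an already-sorted list."""
    idx = min(len(sorted_vals) - 1, int(round(p * (len(sorted_vals) - 1))))
    return sorted_vals[idx]

def kpi_rollup(runs):
    """runs: dicts with keys tokens, repairs, cost_usd, time_to_valid_s, accepted."""
    accepted = [r for r in runs if r["accepted"]]
    if not accepted:
        return {"acceptance_rate": 0.0}
    ttv = sorted(r["time_to_valid_s"] for r in accepted)
    n = len(accepted)
    return {
        "tokens_per_accepted":  sum(r["tokens"] for r in accepted) / n,
        "repairs_per_accepted": sum(r["repairs"] for r in accepted) / n,
        "usd_per_accepted":     sum(r["cost_usd"] for r in accepted) / n,
        "time_to_valid_p50":    percentile(ttv, 0.50),
        "time_to_valid_p95":    percentile(ttv, 0.95),
        "acceptance_rate":      n / len(runs),
    }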


Performance Economics (Why Claims Are Cheaper)

  • Shorter headers: contracts reference policy/style by ID, not prose.

  • Smaller context: claim packs typically use 20–60% fewer tokens than page-level passages.

  • Fewer retries: deterministic repairs replace full resamples.

  • Flatter tails: section caps + stops prevent spillover when claims are tight.

  • Routing lift: small models succeed more often when evidence is compact and precise.

The KPI that moves is $/accepted output, not $/token.
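
A back-of-the-envelope comparison shows the mechanics; every number below is made up for illustration, not a benchmark or a real price sheet:

def usd_per_accepted(context_tokens, gen_tokens, avg_retries, acceptance_rate,
                     usd_per_1k_in=0.003, usd_per_1k_out=0.015):
    """Expected $ per accepted output, ignoring retrieval and selection overhead."""
    attempts = 1 + avg_retries
    cost_per_request = attempts * (context_tokens / 1000 * usd_per_1k_in
                                   + gen_tokens / 1000 * usd_per_1k_out)
    return cost_per_request / acceptance_rate

# Passage dump: large context, frequent resamples, lower acceptance.
print(usd_per_accepted(context_tokens=9000, gen_tokens=800, avg_retries=0.6, acceptance_rate=0.85))
# Claim pack: smaller context, cheap section-local repairs instead of resamples.
print(usd_per_accepted(context_tokens=4000, gen_tokens=800, avg_retries=0.1, acceptance_rate=0.97))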


Implementation Patterns That Work

  • Entity maps: canonicalize product names/SKUs/regions to merge noisy sources.

  • Freshness policies per claim type (pricing vs docs) checked at validation time.

  • Inline evidence chips in UI: hover shows minimal quote + source + date.

  • Pack caches for hot topics with strict TTLs; invalidate on source change.

  • Golden tests that assert refusal on missing claims and dual-cite on conflicts.
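
The last pattern can live as ordinary unit tests; this sketch assumes the Claim and resolve helpers from the earlier sketches are importable, which is an assumption about project layout rather than part of the pattern itself:

import unittest
from datetime import date

class GoldenEvidenceTests(unittest.TestCase):
    def test_refusal_when_no_eligible_claim(self):
        # No claims in the pack: the resolver must abstain, never improvise.
        decision, cited = resolve([])
        self.assertEqual(decision, "abstain")
        self.assertEqual(cited, [])

    def test_dual_cite_when_fresh_primaries_disagree(self):
        a = Claim("pricing:pro:v10:p4", "Professional plan includes 5 seats by default.",
                  "doc:pricing_guide_v10", date(2028, 3, 7), "primary",
                  "…includes 5 seats by default…", "US", {"plan": "Professional", "seats": 5})
        b = Claim("pricing:pro:faq:p2", "Professional plan includes 10 seats by default.",
                  "doc:pricing_faq", date(2028, 3, 7), "primary",
                  "…includes 10 seats by default…", "US", {"plan": "Professional", "seats": 10})
        decision, cited = resolve([a, b])
        self.assertEqual(decision, "dual_cite")
        self.assertEqual(len(cited), 2)

if __name__ == "__main__":
    unittest.main()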


Anti-Patterns to Retire

  • Dumping entire PDFs and “letting the model figure it out.”

  • Treating similarity score as a permission system.

  • Using long quotes instead of minimal spans for numbers and names.

  • Over-citation (3–5 IDs per sentence) that bloats tokens without improving trust.

  • Whole-artifact resampling for a single stale claim.


Conclusion

In 2028, prompt engineering’s biggest gains come from context discipline. Gate eligibility before you search. Shape passages into atomic, dated claims with minimal quotes. Enforce 1–2 citations per factual line and resolve conflicts by rule. Validate hard and repair small. Log the claim set and the sentence map so you can show your work. Do this and outputs get shorter, safer, cheaper—and, crucially, provable. Next in the series (2029), we’ll focus on tool mediation and plan verification—turning language into safe action without ever letting prose become a side effect.