Context Engineering  

Context Engineering and Prompt Engineering with Gödel’s AgentOS

Context engineering is the backbone of reliable AI: it governs what evidence is retrieved, how it is shaped, and how the model must use it. Prompt engineering is the operating contract that binds the model to those rules. In Gödel’s AgentOS, the two disciplines are first-class citizens—implemented as versioned artifacts, enforced by runtime governors, and measured with policy-aware metrics. This article explains how they fit together to produce auditable, low-latency, and cost-efficient intelligence.

Over time, the strongest returns come from treating both prompts and context like evolving product assets. As your org’s policies, customers, and regulations change, AgentOS keeps the contract and the context pipeline in lockstep so behavior stays predictable even when inputs don’t—turning “model upgrades” from risky events into routine releases.


The Core Idea: Contract + Context, Natively in AgentOS

Gödel’s AgentOS treats prompts and context as programmable infrastructure:

  • Prompt Contracts define role, objectives, policies (freshness, tie-breaks, abstention), output schemas, and error modes.

  • Context Pipelines fetch, filter, and shape eligible evidence into atomic, timestamped claims with source IDs and transformation logs.

  • Runtime Governors ensure agents can only read/write context that matches tenant, license, jurisdiction, and purpose constraints.

  • Attestors & Validators check schema, citations, discrepancy flags, and uncertainty before any output leaves the system.

The result is a governed loop: Contract → Retrieval → Shaping → Reasoning → Validation, with each step observable and testable.
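The loop is easiest to see as code. Below is a minimal, self-contained sketch in Python; every function name and data shape is illustrative, not the AgentOS API:

def retrieve(task, corpus):
    """Awareness: eligibility first, relevance second (stubbed as a score cut)."""
    return [c for c in corpus if c["score"] >= 0.8]

def shape(claims):
    """Shaper: rank by score, break ties by newest effective_date."""
    return sorted(claims, key=lambda c: (c["score"], c["effective_date"]), reverse=True)

def reason(task, pack):
    """Kernel: a real system calls the model here, bound by the Prompt Contract."""
    top = pack[0]
    return {"answer": top["text"], "citations": [top["id"]], "uncertainty": 0.1}

def validate(draft):
    """Gate: schema and citation checks before anything leaves the system."""
    ok = bool(draft.get("answer")) and bool(draft.get("citations"))
    return ok, [] if ok else ["E-SCHEMA: missing answer or citations"]

corpus = [{"id": "policy:2025-10-02#threshold",
           "text": "Refund eligible if schedule change >= 90 minutes.",
           "effective_date": "2025-10-02", "score": 0.94}]
pack = shape(retrieve("refund eligibility", corpus))
ok, diagnostics = validate(reason("refund eligibility", pack))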

Because these artifacts are code, they inherit the benefits of software lifecycle: diffs, peer review, unit tests, canaries, and rollbacks. That means you can trial a stricter abstention policy on 5% of tenants, watch KPIs, then ramp safely—no big-bang migrations or guesswork.


Architecture: How AgentOS Wires It Together

  • Awareness Layer: Policy-aware retrieval (eligibility first, relevance second). Entitlements, recency, and region routing are enforced at query-time.

  • Context Shaper: Converts documents, tickets, and tables into claim objects: {id, text, source_id, effective_date, score, source_type}; deduplicates and normalizes entities.

  • Reasoning Kernel: Applies the Prompt Contract to the shaped pack; supports tool calls, function routing, and targeted follow-ups rather than guesses.

  • Validation Gate: Enforces JSON output schema, minimal-span citations, conflict surfacing, and calibrated abstention.

  • Evidence Graph & Chain of Custody: Immutable links from answers to minimal source spans with timestamps and tool traces.

The layers publish structured telemetry—coverage, freshness, discrepancy rate, and cost per stage—so you can pinpoint where quality or latency regresses. Instead of blaming “the model,” incidents map to a stage (e.g., stale sources in Awareness, entity drift in Shaper) with focused fixes.
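For concreteness, here is one way the Shaper’s claim object and deduplication step could look. The dataclass mirrors the claim schema above; the alias table is invented for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    id: str
    text: str
    source_id: str
    effective_date: str   # ISO date, used for contract tie-breaks
    score: float          # retrieval score in [0, 1]
    source_type: str      # "policy", "contract", "system", or "human_note"

ALIASES = {"SKU-0042-B": "SKU-0042"}  # illustrative entity-normalization table

def normalize(text: str) -> str:
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text

def dedupe(claims: list[Claim]) -> list[Claim]:
    # Keep the highest-scoring claim per normalized text.
    best: dict[str, Claim] = {}
    for claim in claims:
        key = normalize(claim.text)
        if key not in best or claim.score > best[key].score:
            best[key] = claim
    return sorted(best.values(), key=lambda c: c.score, reverse=True)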


Prompt Engineering as the Operating Contract

A Prompt Contract in AgentOS is versioned, diffable, and CI-tested. Its job is to turn organizational policy into machine-actionable rules the model follows at runtime.

Minimal Prompt Contract (essence):

System: You are an evidence-grounded assistant. Use only the supplied Context Pack.

Policies:
- Rank evidence by retrieval_score; break ties by newest effective_date.
- Prefer primary sources over summaries; quote minimal spans with source_id.
- If key fields are missing, ask exactly for them; do not guess.
- If sources conflict, surface both with dates; do not harmonize.
- Refuse if evidence age exceeds the freshness window (unless explicitly exempted).

Output (JSON):
{"answer":"...", "citations":["..."], "missing":["..."], "uncertainty":0-1, "rationale":"one sentence"}

Contracts can define persona variants (e.g., conservative vs. exploratory) and be A/B tested per tenant or workflow. Because outputs are schema-bound, you can evaluate each variant by grounded accuracy, abstention quality, and cost before promoting it globally.


Context Engineering as a Supply Chain with Guarantees

AgentOS expresses context policy in configuration, not tribal knowledge:

Context Spec (excerpt):

eligibility:
  tenant: ${request.tenant}
  regions: [US, EU]
  licenses: [enterprise, premium]
  freshness_days: 60
  source_tiers: [policy, contract, system, human_note]
shaping:
  claim_schema: { id, text, source_id, effective_date, score, source_type }
  deduplicate: true
  normalize_entities: ["customer_id","product_sku","jurisdiction"]
compression:
  mode: "bounded"
  loss_bound: "claims-preserved"
required_fields:
  - fare_class
  - schedule_change_minutes

The same spec can run in shadow mode: AgentOS builds a “what-would-have-happened” context pack alongside production, letting you compare outcomes and costs without user exposure. That’s how you safely tune freshness windows or compression bounds.
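At its core, a shadow run reduces to building two packs and diffing them offline. A sketch, with build_pack() standing in for the real pipeline:

from datetime import date, timedelta

def build_pack(claims, freshness_days):
    """Stand-in for the AgentOS pipeline: apply one spec's freshness window."""
    cutoff = (date.today() - timedelta(days=freshness_days)).isoformat()
    return [c for c in claims if c["effective_date"] >= cutoff]

def diff_packs(live, shadow):
    """What the candidate spec would drop or add relative to production."""
    live_ids, shadow_ids = {c["id"] for c in live}, {c["id"] for c in shadow}
    return {"dropped": sorted(live_ids - shadow_ids),
            "added": sorted(shadow_ids - live_ids)}

live = build_pack(claims=[], freshness_days=60)    # production spec
shadow = build_pack(claims=[], freshness_days=30)  # candidate spec
report = diff_packs(live, shadow)                  # compare before any rollout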


The Context Pack the Model Sees

{
  "task": "Determine refund eligibility for PNR AB1234",
  "claims": [
    {"id":"policy:2025-10-02#threshold","text":"Refund eligible if schedule change ≥ 90 minutes.","effective_date":"2025-10-02","score":0.94,"source_type":"policy"},
    {"id":"pnr:AB1234#delta","text":"Original 08:00 → new 11:45 (Δ=225 minutes).","effective_date":"2025-10-11","score":0.91,"source_type":"system"},
    {"id":"pnr:AB1234#fare","text":"Fare class: Basic Economy (B).","effective_date":"2025-10-11","score":0.88,"source_type":"system"}
  ],
  "hints":{"tie_break":"newest","max_tokens_ctx":3500}
}

Atomic claims, timestamps, and source IDs make reasoning traceable and auditable.

Packs also embed provenance hashes of source snippets. If a policy doc changes hours later, you can still prove exactly what text the model saw—ideal for audits, disputes, and reproducible debugging.
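The hash itself is ordinary content addressing. One plausible shape, using only the standard library:

import hashlib

def provenance_hash(snippet: str) -> str:
    """Digest of the exact source text at pack-build time."""
    return hashlib.sha256(snippet.encode("utf-8")).hexdigest()

# Stored alongside the claim; later edits to the source doc cannot alter it.
print(provenance_hash("Refund eligible if schedule change >= 90 minutes.")[:16])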


Multi-Agent Flow Without Context Drift

AgentOS runs multiple agents (Policy, Retrieval, Shaper, Analyst, Attestor) on a Reasoning Bus:

  1. Retrieval Agent enforces eligibility and freshness.

  2. Shaper Agent emits claim objects; compresses with guarantees.

  3. Analyst Agent applies the Prompt Contract; triggers precise follow-ups if required_fields are missing.

  4. Attestor verifies citations ↔ claims; checks discrepancy and abstention rules.

  5. Publisher emits schema-valid JSON and updates the evidence graph.

The bus deduplicates tool calls, enforces rate limits, and blocks cross-tenant leakage by design.

The bus also supports consensus strategies: when analysts disagree, a lightweight adjudicator reconciles using contract rules (e.g., prefer primary + newest) and records dissent for later review—boosting reliability without hiding ambiguity.
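An adjudicator of this kind can be almost entirely mechanical. A sketch under the contract’s own rules; the tier ranking and draft shape are invented here:

TIER_RANK = {"policy": 3, "contract": 2, "system": 1, "human_note": 0}  # higher = more primary

def adjudicate(drafts):
    """drafts: list of {"answer", "source_type", "effective_date"} dicts."""
    winner = max(drafts, key=lambda d: (TIER_RANK[d["source_type"]],
                                        d["effective_date"]))  # primary, then newest
    dissent = [d for d in drafts
               if d is not winner and d["answer"] != winner["answer"]]
    return {"decision": winner, "dissent": dissent}  # dissent is recorded, not hidden

drafts = [{"answer": "Eligible", "source_type": "policy", "effective_date": "2025-10-02"},
          {"answer": "Not eligible", "source_type": "human_note", "effective_date": "2025-10-11"}]
print(adjudicate(drafts)["decision"]["answer"])  # -> Eligible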


Validation and Attestation: Guardrails Before Output

  • Schema Validator: Ensures answer, citations[], uncertainty, rationale, and missing[] are present and well-formed.

  • Citation Checker: Every factual span must map to a source_id; precision/recall tracked per output.

  • Conflict Detector: Conflicting claims must be surfaced with dates; silent harmonization is rejected.

  • Uncertainty Gate: If coverage is weak, the output must be a targeted follow-up or a refusal, not a guess.

Failures never reach users; they are routed back to the bus with actionable diagnostics.

These gates produce machine-readable failure codes (e.g., E-CONFLICT-NEWER-POLICY), enabling dashboards, alerts, and automatic playbooks (like refreshing a stale index) instead of manual triage.
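A sketch of that codes-to-playbooks routing; the specific codes and playbook names are illustrative, styled after the example above:

from enum import Enum

class FailureCode(Enum):
    E_SCHEMA_INVALID = "output missing required fields"
    E_CITATION_UNRESOLVED = "citation does not map to an eligible claim"
    E_CONFLICT_NEWER_POLICY = "conflicting claims not surfaced"
    E_STALE_EVIDENCE = "evidence older than the freshness window"

PLAYBOOKS = {FailureCode.E_STALE_EVIDENCE: "refresh_index",
             FailureCode.E_CITATION_UNRESOLVED: "rerun_retrieval"}

def route(code: FailureCode) -> str:
    """Map a gate failure to an automatic playbook, else to manual triage."""
    return PLAYBOOKS.get(code, "manual_triage")

print(route(FailureCode.E_STALE_EVIDENCE))  # -> refresh_index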


KPIs That Matter (and Ship with AgentOS)

  • Grounded Accuracy: % of answers consistent with eligible sources.

  • Citation Precision/Recall: Correctness and completeness of citations.

  • Policy Adherence Score: Contract rules followed (structure, conflicts surfaced, abstention).

  • Abstention Quality: Targeted follow-ups vs. generic “I don’t know.”

  • Cost & Latency per Answer: Tokens and time, broken down by stage (retrieval, shaping, reasoning, validation).

  • Staleness SLA: Age of the newest policy reflected in outputs.

Dashboards show heatmaps for missing fields, conflict spikes, and spend vs. evidence budgets.
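Most of these metrics are cheap to compute per output. Citation precision/recall, for instance, falls out of two sets; the shape of the audit labels is assumed:

def citation_pr(cited: set[str], supporting: set[str]) -> tuple[float, float]:
    """Precision/recall of an output's citations against audited support."""
    if not cited or not supporting:
        return 0.0, 0.0
    hits = len(cited & supporting)
    return hits / len(cited), hits / len(supporting)

print(citation_pr({"policy:2025-10-02#threshold", "pnr:AB1234#fare"},
                  {"policy:2025-10-02#threshold", "pnr:AB1234#delta"}))  # (0.5, 0.5)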

KPI cards support per-tenant baselines and seasonality views (e.g., policy-change weeks). This helps product owners separate genuine regressions from expected spikes due to external events.


A Concrete “Before → After” Use Case

Before: An eligibility assistant blended last quarter’s rules with a blog post; it guessed when the employee’s location was missing and gave contradictory answers.

After AgentOS + Contracts:

  • Retrieval filters by tenant, region, and effective dates.

  • Shaper emits atomic claims for thresholds and exceptions.

  • Analyst applies the contract, flags a conflict across policy versions, and asks exactly for the work location.

  • Attestor rejects responses lacking minimal-span citations.

  • Output is structured JSON with citations and a single targeted follow-up when needed.

Accuracy increased, escalations dropped, and audits became trivial because every decision chained back to minimal source spans.

The same pattern transferred to returns eligibility and warranty claims with almost no code changes—swap the contract template and source tiers, keep the pipeline. This reuse is where total cost of ownership drops.


CI/CD for Intelligence: Shipping Prompts and Context Like Code

  • Contract Tests: Synthetic Context Packs replayed in CI; regressions in grounded accuracy, adherence, or abstention block merges.

  • Golden Traces: Real anonymized tasks ensure non-regression on high-value paths.

  • SemVer & Changelogs: Prompt/Context diffs are reviewed like PRs.

  • Canary & Rollback: Contracts and retrieval policies canary per tenant; automatic rollback on KPI regressions.

This makes reliability a function of governance, not guesswork.
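A contract test of this kind is just a pytest case over a synthetic pack; run_contract() below is a stub standing in for the real replay harness:

def run_contract(pack):
    """Stub harness: abstain with a targeted ask when required fields are absent."""
    if not pack["claims"]:
        return {"answer": "", "citations": [],
                "missing": ["fare_class", "schedule_change_minutes"]}
    return {"answer": "Eligible", "citations": [c["id"] for c in pack["claims"]],
            "missing": []}

def test_missing_fields_trigger_targeted_ask():
    out = run_contract({"task": "Refund eligibility", "claims": []})
    assert out["answer"] == ""             # the model must not guess
    assert "fare_class" in out["missing"]  # exact follow-up, per the contract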

AgentOS supports policy freeze windows and approval gates for regulated tenants, ensuring that contract changes cannot deploy without explicit sign-off from risk/compliance—no accidental policy drift.


Starter Kit (Copy/Paste)

One-Page Contract (short):

Use only supplied context. Rank by score; break ties by newest date.
Prefer primary sources; quote minimal spans with source_id.
If required fields missing, ask for them; do not guess.
If sources conflict, surface both with effective_date; do not harmonize.
Refuse if evidence is older than 60 days unless marked current.
Output JSON: answer, citations[], missing[], uncertainty (0–1), rationale (1 sentence).

AgentOS Policy (short):

retrieval: { freshness_days: 60, region: ${request.region}, tenant: ${request.tenant}, tiers: [policy, contract, system] }
shaping:   { claim_schema: atomic, normalize: true, deduplicate: true }
compression: { mode: bounded, loss: claims-preserved }
validation: { require_citations: true, discrepancy_flag: true, uncertainty_gate: calibrated }

Pair this with a 10-pack synthetic suite (edge cases: missing fields, conflicting policies, expired sources, cross-tenant ineligibility). Run it on every PR; only promote contracts that pass with headroom on adherence and cost.
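A slice of such a suite, expressed as (context pack, expected behavior) pairs; all claim contents are invented for illustration:

OLD_POLICY = {"id": "policy:2025-01-10#threshold", "effective_date": "2025-01-10",
              "text": "Refund eligible if change >= 120 minutes.", "source_type": "policy"}
NEW_POLICY = {"id": "policy:2025-10-02#threshold", "effective_date": "2025-10-02",
              "text": "Refund eligible if change >= 90 minutes.", "source_type": "policy"}

SUITE = [
    ({"claims": []},                       "ask_for_required_fields"),  # missing fields
    ({"claims": [OLD_POLICY, NEW_POLICY]}, "surface_both_with_dates"),  # conflicting policies
    ({"claims": [dict(OLD_POLICY, effective_date="2024-01-01")]},
                                           "refuse_stale_evidence"),    # expired source
]
# Each PR replays every pack through the contract and asserts `expected`.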


Conclusion

Gödel’s AgentOS operationalizes the partnership between prompt engineering (the contract) and context engineering (the governed evidence supply chain). By encoding rules as versioned Prompt Contracts, shaping context into auditable claims, and enforcing validation before output, AgentOS turns fluent language models into reliable, explainable systems. The payoff is compounding: fewer retries, lower cost, faster audits, and trust that scales with your product.

As models commoditize, enduring advantage shifts to governed context + provable behavior. AgentOS gives you both—so every answer is not only useful but defensible, and every change is measurable before it matters.