Agent Memory and Long-Running Workflows: Designing AI Agents That Don’t Forget, Drift, or Hallucinate

Why “memory” is the hard problem in agent systems

Most agent failures at scale trace back to memory, not reasoning. The agent forgets a constraint, confuses versions, loses track of decisions, repeats work, or silently changes direction. In short runs, this is inconvenient. In multi-hour or multi-day workflows, it becomes operationally destructive: inconsistent deliverables, duplicated spend, and untraceable outcomes.

The deeper issue is that LLM context is not memory. It is a temporary working set with hard size limits and fragile ordering effects. If you treat conversation history as the only source of truth, your system will drift. In production, you need explicit memory design: what is stored, how it is retrieved, how it is summarized, how it is validated, and how it is bounded.

This article outlines a practical, technical approach to memory for AI agents running long workflows, especially when multiple agents collaborate and when GSCP-15 style governance requires stable state and auditable decisions.


The five memory classes you should separate

A reliable agent system separates memory into distinct classes with different retention and validation rules. Mixing them into one “chat history” is a predictable failure mode.

Working Memory (ephemeral)
This is the minimal set needed for the current step: the active task, immediate inputs, and the next actions. It should be aggressively pruned and rebuilt each step. Working memory is optimized for token efficiency and correctness.

Working memory should be derived from durable state, not the other way around. If the system crashes and restarts, working memory must be reconstructible. A good discipline is: treat working memory as a view, not as storage.
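This "view, not storage" discipline can be sketched in a few lines. The function and state fields below (`build_working_memory`, `plan`, `open_questions`) are illustrative assumptions, not a fixed API; the point is that the working set is recomputed from durable state every step and survives a restart.

```python
# Hypothetical sketch: working memory is rebuilt each step from durable state,
# never persisted on its own. Field names mirror the state object described
# later in this article and are assumptions, not a standard.

def build_working_memory(state: dict) -> dict:
    """Derive the minimal working set for the current step from durable state."""
    current = next(
        (stage for stage in state["plan"] if stage["status"] == "in_progress"),
        None,
    )
    return {
        "objective": state["objective"],
        "constraints": state["constraints"],        # always re-injected, never pruned
        "current_stage": current,
        "open_questions": state["open_questions"],  # must stay visible every step
    }

state = {
    "objective": "Produce a migration plan",
    "constraints": ["no downtime", "budget <= 10k"],
    "plan": [
        {"id": "s1", "name": "inventory", "status": "done"},
        {"id": "s2", "name": "design", "status": "in_progress"},
    ],
    "open_questions": ["Which region first?"],
}

wm = build_working_memory(state)
# A crash and restart reproduces exactly this view from the same state.
```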

Decision Memory (durable)
This stores commitments: chosen architecture, approved requirements, accepted risks, tool decisions, and gate outcomes. It should be small, explicit, and immutable once approved. Decision memory is the spine of long-running workflows.

This is the memory class you show to reviewers. It should be structured (JSON) and versioned, with timestamps and rationale. If you cannot answer “why did the agent do that,” your decision memory is insufficient.
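For concreteness, one decision-memory entry might look like the sketch below. The field names (`id`, `timestamp`, `rationale`, `version`) follow the requirements stated above but are assumptions, not a fixed schema.

```python
# Illustrative shape of a single decision-memory record: structured JSON,
# versioned, timestamped, with rationale. Stored append-only and shown to
# reviewers; all field names here are assumptions consistent with the text.
import json

decision = {
    "id": "DEC-0042",
    "timestamp": "2025-01-15T10:32:00Z",
    "decision": "Use event sourcing for the order service",
    "rationale": "Audit requirements demand replayable history",
    "status": "approved",
    "version": 1,
}

record = json.dumps(decision, indent=2)  # what actually lands in the log
```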

Artifact Memory (durable)
Artifacts are the actual deliverables: specs, code files, test plans, diagrams, run manifests. They should be stored outside the prompt context and referenced by stable identifiers (hashes, paths, or versions). Agents should never “remember” full artifacts in the model context.

Artifact memory should be the authoritative source of outputs. If the agent needs a section from a spec, it retrieves it deterministically rather than relying on the model’s recollection. This is how you avoid drift and accidental rewrites.

Retrieval Memory (searchable)
This is a vector index or search index over relevant corpora: docs, code, past runs, policies. Retrieval memory is not guaranteed truth; it is a source of candidates. It must be paired with verification rules and citation requirements.

Retrieval memory should be scoped: tenant, project, environment, and classification. Unscoped retrieval is a direct path to leakage and incorrect cross-project reuse. The agent should retrieve only what it is entitled to and only what is relevant.
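Scoping can be enforced before any ranking happens, as in the sketch below. The document shape, the `ALLOWED` entitlement set, and the substring-match scoring (standing in for BM25 or vector similarity) are all assumptions for illustration:

```python
# Sketch of scope-filtered retrieval: entitlement filtering happens before
# relevance scoring, not as optional post-processing. Naive substring match
# stands in for a real BM25/vector ranker.
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    id: str
    tenant: str
    project: str
    classification: str
    text: str

ALLOWED = {"public", "internal"}  # the caller's entitlement, assumed

def search(corpus: list[Doc], query: str, tenant: str, project: str) -> list[Doc]:
    in_scope = [
        d for d in corpus
        if d.tenant == tenant
        and d.project == project
        and d.classification in ALLOWED
    ]
    return [d for d in in_scope if query.lower() in d.text.lower()]

corpus = [
    Doc("d1", "acme", "billing", "internal", "Invoice retry policy"),
    Doc("d2", "other", "billing", "internal", "Invoice retry policy"),
]
hits = search(corpus, "retry", tenant="acme", project="billing")
# Only d1 survives: the other tenant's document never reaches the ranker.
```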

Policy/Contract Memory (immutable)
Policies, entitlements, constraints, schemas, and formatting rules live here. This memory must not drift. If these rules change, they should change via versioned policy updates, not via conversational evolution.

In GSCP-15 style systems, policy memory is always injected into the working context in a controlled form. This is how you keep outputs consistent across runs, teams, and agents.


The “State as Data” pattern for agent reliability

The most effective long-run strategy is to store agent state as a small, validated JSON document. This is not optional once workflows get complex. The agent reads and writes state through a strict schema, and every step starts by reconstructing working context from that state.

A practical state object includes:

  • objective and definition_of_done

  • constraints (non-negotiables)

  • assumptions (explicit and reviewable)

  • decisions (immutable list with ids and timestamps)

  • plan (stage list with statuses)

  • open_questions (must be closed or escalated)

  • risks (with mitigation owner)

  • artifacts (list of generated deliverables with hashes/versions)

  • tool_evidence (ids/hashes for tool outputs)

This pattern changes agent behavior dramatically. The agent stops “thinking in chat” and starts “executing a workflow.” It becomes resilient to context limits because the system can always reload and continue.
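Every step can start by validating the state document before reconstructing context from it. The stdlib sketch below checks the fields listed above; a production system would use JSON Schema or a typed model, and `validate_state` is a hypothetical name:

```python
# Minimal state-validation sketch using only the stdlib; in practice this
# would be a JSON Schema or pydantic model. Field names follow the state
# object described in the article.

REQUIRED_KEYS = {
    "objective", "definition_of_done", "constraints", "assumptions",
    "decisions", "plan", "open_questions", "risks", "artifacts", "tool_evidence",
}

def validate_state(state: dict) -> list[str]:
    """Return a list of problems; an empty list means the state is loadable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - state.keys())]
    for i, d in enumerate(state.get("decisions", [])):
        if "id" not in d or "timestamp" not in d:
            problems.append(f"decisions[{i}] missing id/timestamp")
    return problems

good_state = {k: [] for k in REQUIRED_KEYS}
good_state["objective"] = "Ship the billing migration"
good_state["definition_of_done"] = "All invoices reconcile"
good_state["decisions"] = [{"id": "DEC-1", "timestamp": "2025-01-15T10:00:00Z"}]
```

A step that receives a non-empty problem list refuses to proceed and escalates, rather than continuing on a corrupt view of the world.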


Memory writes must be gated, not automatic

A common mistake is letting the model write to memory freely. That leads to polluted memory: guesses stored as facts, outdated decisions persisting, and constraints being overwritten. Instead, memory writes should be gated:

  • Write only structured updates that pass schema validation

  • Separate “proposal” from “commitment”

  • Require explicit approvals for high-impact decision commits

  • Use monotonic versioning (never overwrite without version bump)

  • Track provenance: which step/agent wrote the update and why

In practical terms: the agent can propose changes to decision memory, but only a “commit step” (or a human approval) can finalize them. This is the equivalent of pull requests for agent state.
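The proposal/commit split with monotonic versioning and provenance can be sketched as follows. The `DecisionLog` class and its method names are illustrative assumptions, not an existing library:

```python
# Sketch of gated memory writes: the model emits proposals; only an explicit
# commit step (invoked by a gate or a human) finalizes them, with a monotonic
# version bump and provenance. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DecisionLog:
    version: int = 0
    committed: list[dict] = field(default_factory=list)
    proposals: list[dict] = field(default_factory=list)

    def propose(self, decision: dict, agent_id: str, reason: str) -> None:
        """Agents may only stage changes; nothing is durable yet."""
        self.proposals.append({**decision, "agent_id": agent_id, "reason": reason})

    def commit(self, index: int, approver: str) -> int:
        """Finalize one proposal: never overwrite without a version bump."""
        entry = self.proposals.pop(index)
        self.version += 1
        self.committed.append(
            {**entry, "version": self.version, "approved_by": approver}
        )
        return self.version

log = DecisionLog()
log.propose({"id": "DEC-1", "choice": "Postgres"}, agent_id="planner",
            reason="ACID requirements")
new_version = log.commit(0, approver="tech-lead")
```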


Summarization is not compression unless it is verifiable

Summarization is unavoidable, but it is dangerous if you treat it as truth. A summary is a lossy transform. Loss is acceptable only if you preserve the ability to recover and verify details.

A robust pattern is “summary with anchors”:

  • The summary is short and structured

  • Every statement points to an anchor: artifact id + section, evidence id, or source citation

  • If a statement has no anchor, it is explicitly labeled as an assumption

This is how you prevent the “summary drift” problem where repeated summarizations compound errors. The agent can use summaries for working context, but the system always has authoritative references.
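The anchor rule can be enforced mechanically. In the sketch below (statement shape and `check_summary` are assumptions), any statement without an anchor is flagged as an assumption rather than dropped or trusted:

```python
# Sketch of "summary with anchors": every statement either carries an anchor
# (artifact id + section, evidence id, or citation) or is explicitly labeled
# an assumption. The statement shape is illustrative.

def check_summary(summary: list[dict]) -> list[dict]:
    """Pass anchored statements through; flag unanchored ones as assumptions."""
    checked = []
    for stmt in summary:
        if stmt.get("anchor"):
            checked.append(stmt)
        else:
            checked.append({**stmt, "assumption": True})
    return checked

summary = [
    {"text": "Latency target is 200ms", "anchor": "spec-v3#nfr-2"},
    {"text": "Team prefers Postgres"},  # no anchor -> must be flagged
]
checked = check_summary(summary)
```

Re-summarizing `checked` later preserves the anchors, so repeated compression cannot quietly promote a guess into a fact.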


Preventing drift: the three reconciliation loops

Long-running agent workflows need three reconciliation loops:

Constraint reconciliation
At the start of each stage, revalidate hard constraints against the current plan and artifacts. If a constraint is violated, the agent must either fix the plan or escalate. This is how you prevent silent divergence.
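One way to make this loop concrete is to model constraints as named predicates over the current plan, as in the sketch below; the predicate shapes and thresholds are invented for illustration:

```python
# Sketch: revalidate hard constraints against the current plan at stage start.
# Constraints are (name, predicate) pairs; a non-empty result means the agent
# must fix the plan or escalate. Names and thresholds are illustrative.

def reconcile_constraints(plan: dict, constraints: list) -> list[str]:
    """Return the names of violated constraints."""
    return [name for name, check in constraints if not check(plan)]

constraints = [
    ("budget_cap", lambda p: p["estimated_cost"] <= 10_000),
    ("no_new_services", lambda p: not p.get("new_services")),
]

plan = {"estimated_cost": 12_000, "new_services": []}
violations = reconcile_constraints(plan, constraints)
# violations == ["budget_cap"] -> silent divergence is impossible here
```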

Artifact reconciliation
When new artifacts are produced, verify they align with prior artifacts (requirements ↔ design ↔ code ↔ tests). This can be done with deterministic validators plus a verifier model pass. The goal is to catch contradictions early.

Decision reconciliation
Decisions should be stable. If new evidence contradicts an earlier decision, the system should not silently replace it. It should create a decision change record: what changed, why, and what impact it has downstream.

These loops are what make multi-day workflows behave like a disciplined delivery function rather than a chat session.


Multi-agent memory: the minimum viable protocol

If multiple agents collaborate, you must define a shared memory protocol. Otherwise each agent maintains its own private interpretation and reconciliation becomes impossible.

A minimal protocol:

  • Shared state JSON is the single source of truth

  • Agents can only update state via structured patches

  • Every patch has: agent_id, reason, scope, and expected_impacts

  • A coordinator reconciles patches and resolves conflicts

  • Artifact references are stable (hash + version), never pasted wholesale

This prevents the most common multi-agent failure: two agents “agree” in chat but produce incompatible outputs because they were operating on different implicit assumptions.
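A coordinator's conflict check can be sketched as below. For simplicity, state paths are flat keys and conflicting patches are rejected for arbitration rather than merged; the function name and patch shape are assumptions:

```python
# Minimal patch-protocol sketch: agents submit structured patches; the
# coordinator applies non-conflicting ones and rejects the rest for
# arbitration. Flat string paths stand in for a real JSON-pointer scheme.

def apply_patches(state: dict, patches: list[dict]) -> tuple[dict, list[dict]]:
    """Apply non-conflicting patches; return (new_state, rejected)."""
    touched: set[str] = set()
    rejected: list[dict] = []
    new_state = dict(state)
    for patch in patches:  # each patch: agent_id, reason, path, value
        path = patch["path"]
        if path in touched:
            rejected.append(patch)  # second writer to the same path loses
            continue
        touched.add(path)
        new_state[path] = patch["value"]
    return new_state, rejected

patches = [
    {"agent_id": "a1", "reason": "new design", "path": "design_doc", "value": "v2"},
    {"agent_id": "a2", "reason": "hotfix", "path": "design_doc", "value": "v3"},
]
new_state, rejected = apply_patches({"design_doc": "v1"}, patches)
# a2's patch is rejected, not silently merged: the conflict is now explicit.
```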


Practical guidance: what to implement first

If you are building this in a real system, implement memory in this order:

  • State JSON schema + validation

  • Artifact store with versioning and hashes

  • Decision log (append-only) with approvals

  • Retrieval index scoped by tenant/project

  • Summary-with-anchors mechanism

  • Conflict resolution for multi-agent patches

Once these are in place, adding more agents becomes safe. Without them, more agents typically means more drift and higher cost.


Closing perspective

“Memory” is not a feature. It is architecture. If you want AI agents that run long workflows reliably, you need explicit memory classes, state as data, gated writes, verifiable summarization, and reconciliation loops. Those mechanisms turn an LLM from a conversational interface into a durable operator that can execute over time without losing the plot.
