Introduction
An Event Replay System lets teams reprocess historical events (for bug fixes, model retraining, backfills, or new consumers) while keeping the production system and live users unaffected. Replaying events is a powerful tool — but it can be dangerous if done naively: you can duplicate side-effects, corrupt read models, or trigger external integrations twice.
This article provides a practical, senior-developer guide to designing and implementing a safe event replay system that supports:
Isolated replay environments (non-invasive)
Idempotent reprocessing and deduplication
Selective replay (time-range, aggregate, tenant, stream)
Versioned event handling and transformation
Controlled side-effect suppression or simulation
Metrics, audit, and cancelable runs
.NET worker and Angular control panel examples
The architecture assumes an event store (Kafka, EventStoreDB, Azure Event Hubs, or a durable SQL-backed log). Examples use Kafka-style terminology, but the patterns apply to other stores.
Goals and Non-Goals
Goals
Reprocess events safely without impacting live consumers.
Provide repeatable, auditable replay jobs.
Allow filtering and transformation during replay.
Support dry-run, partial replay and replay with side-effect mitigation.
Provide UI controls for engineers (start, pause, stop, inspect).
Non-Goals
This is not a transactional backfill across multiple heterogeneous systems (that requires careful coordination).
We do not attempt to reconcile already-applied side-effects through a single reconciliation engine; instead, we provide tooling patterns to avoid double-effects.
High-Level Architecture
┌──────────────┐        ┌───────────────┐        ┌──────────────┐
│ Event Store  │ ─────> │ Replay Worker │ ─────> │ Replay Sink  │
│ (Kafka/ESDB) │        │ (.NET Service)│        │ (Read Model) │
└──────────────┘        └──────┬────────┘        └──────┬───────┘
                               │                        │
                               ▼                        ▼
                         ┌───────────┐            ┌────────────┐
                         │Control API│ <--------> │ Angular UI │
                         └───────────┘            └────────────┘
Event Store: authoritative append-only log of domain events.
Replay Worker: reads events from store (earliest → latest or time window), transforms, and emits to replay sink instead of production sinks.
Replay Sink: isolated projection store, separate queues for test consumers, or a staging topic.
Control API and UI: start/stop/pause/replay runs, show metrics, configure transforms and filters.
Key Concepts
Isolated Replay
Events for replay should not be published to production sinks. Replay sinks include:
Separate projection tables (ReplayReadModel)
Separate Kafka topic namespaces (e.g., replay.*)
Temporary message queues or simulation endpoints
Mocked external service adapters
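As a concrete guard for topic isolation, routing can be centralized so replay output physically cannot reach production topics. A minimal sketch; the ReplayTopicRouter name is an assumption for illustration:

// Illustrative guard: all replay output is forced into the replay.* namespace.
public sealed class ReplayTopicRouter
{
    public string Resolve(string topic, bool isReplay) =>
        isReplay ? $"replay.{topic}" : topic;

    // Defensive check before any produce call made in replay mode.
    public void AssertReplayTopic(string topic)
    {
        if (!topic.StartsWith("replay.", StringComparison.Ordinal))
            throw new InvalidOperationException(
                $"Replay attempted to write to production topic '{topic}'.");
    }
}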
Idempotency
Replay must be idempotent. Techniques:
Use idempotency keys derived from event metadata (eventId + replayRunId)
Persist processed event ids in a replay-specific dedupe store (fast key-value DB)
Make event handlers idempotent (upserts, delta-apply with checksums)
Side-Effect Management
Side-effects (emails, payments, third-party API calls) are managed by:
Suppression: disable actual side-effects during replay and log intended actions.
Simulation: route side-effects to a sandbox or mock service.
Replay Mode Awareness: handlers get a context flag IsReplay and either avoid side-effects or call test endpoints.
Transformation & Versioning
Events may need transformation when schemas have evolved since the events were written: older versions must be upcast to the current schema, renamed fields mapped, and missing fields defaulted. The Transformation and Versioning section below covers the adapter patterns.
Controlled Replay
Allow operators to:
Specify filter: stream, aggregateId, tenantId, time window, or offset range
Run a dry-run to see read-model impacts (without persisting)
Limit throughput (events/sec) to avoid resource exhaustion (see the pacing sketch after this list)
Pause/Resume/Cancel runs
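For the throughput limit, here is a minimal pacing helper as a sketch; a token bucket or System.Threading.RateLimiting would be more precise in production. The worker calls WaitAsync once per event:

public sealed class ReplayThrottle
{
    private readonly TimeSpan _interval;
    private DateTime _next = DateTime.UtcNow;

    public ReplayThrottle(int maxEventsPerSecond) =>
        _interval = TimeSpan.FromSeconds(1.0 / maxEventsPerSecond);

    // Delays just long enough to hold the configured events/sec rate.
    public async Task WaitAsync(CancellationToken ct)
    {
        var now = DateTime.UtcNow;
        if (_next > now)
            await Task.Delay(_next - now, ct);
        _next = (_next > now ? _next : now) + _interval;
    }
}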
Audit & Safety
Every replay run must be auditable:
ReplayRun metadata: start time, operator, filters, transform script, status (Running, Paused, Cancelled, Completed), processed count
Error logs and failure handling per-event
Checkpoints for resume
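Checkpoints are what make pause/resume and crash recovery cheap. A sketch of the resume path inside the PartitionWorker shown later; LoadAsync is the assumed counterpart of the SaveAsync call used there:

// On resume, start just after the last committed checkpoint for this partition;
// fall back to the run's FromOffset when the partition was never started.
long? checkpoint = await _checkpointStore.LoadAsync(replayRunId, range.Partition);
long startOffset = checkpoint.HasValue ? checkpoint.Value + 1 : range.FromOffset;
consumer.Assign(new TopicPartitionOffset(_topic, range.Partition, startOffset));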
Event Replay Flowchart
Start
  |
  v
Create ReplayRun (filters, mode, operator)
  |
  v
Validate Filters & Acquire Isolation Token
  |
  v
Open Dedicated Replay Consumer (read-only)
  |
  v
For each Event in stream (matching filter):
  |
  +--> If Event already processed (dedupe) -> skip
  |
  +--> Map/Transform Event (if required)
  |
  +--> Execute Handler in Replay Mode
  |       - If side-effects allowed -> route to sandbox
  |       - Else suppress and log
  |
  +--> Write to Replay Sink (or Dry Run result)
  |
  +--> Commit checkpoint
  |
  v
End Loop
  |
  v
Mark ReplayRun Completed
  |
  v
End
Data Models
ReplayRun Table (SQL)
CREATE TABLE ReplayRun (
    ReplayRunId     UNIQUEIDENTIFIER PRIMARY KEY,
    CreatedBy       NVARCHAR(200),
    CreatedAt       DATETIME2,
    StreamName      NVARCHAR(200) NULL,
    FromOffset      BIGINT NULL,
    ToOffset        BIGINT NULL,
    FromTime        DATETIME2 NULL,
    ToTime          DATETIME2 NULL,
    Mode            NVARCHAR(20) NOT NULL,  -- DryRun | Replay | ReplayWithSideEffects
    Status          NVARCHAR(20) NOT NULL,  -- Created | Running | Paused | Cancelled | Completed | Failed
    ProcessedCount  BIGINT DEFAULT 0,
    ErrorCount      BIGINT DEFAULT 0,
    LastCheckpoint  BIGINT NULL,
    ConfigJson      NVARCHAR(MAX) NULL
);
ReplayEventLog (for dedupe & audit)
CREATE TABLE ReplayEventLog (
    Id            BIGINT IDENTITY PRIMARY KEY,
    ReplayRunId   UNIQUEIDENTIFIER,
    EventId       UNIQUEIDENTIFIER,
    StreamOffset  BIGINT,
    EventType     NVARCHAR(200),
    ProcessedAt   DATETIME2,
    Outcome       NVARCHAR(20),  -- Success | Skipped | Error
    Details       NVARCHAR(MAX)
);

CREATE UNIQUE INDEX UX_ReplayEventLog_Run_Event ON ReplayEventLog(ReplayRunId, EventId);
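The unique index is what makes deduplication cheap: the worker simply attempts the insert and treats a constraint violation as "already processed". A sketch of that marker insert, assuming Microsoft.Data.SqlClient (2627/2601 are SQL Server's duplicate-key error numbers); Outcome is filled in later by a MarkProcessed update:

using Microsoft.Data.SqlClient;

public sealed class ReplayEventLogStore
{
    private readonly string _connectionString;
    public ReplayEventLogStore(string connectionString) => _connectionString = connectionString;

    // Returns false when (ReplayRunId, EventId) already exists, i.e. a duplicate event.
    public async Task<bool> TryMarkProcessing(Guid replayRunId, Guid eventId, long offset)
    {
        const string sql = @"
            INSERT INTO ReplayEventLog (ReplayRunId, EventId, StreamOffset, ProcessedAt)
            VALUES (@RunId, @EventId, @Offset, SYSUTCDATETIME());";
        try
        {
            await using var conn = new SqlConnection(_connectionString);
            await conn.OpenAsync();
            await using var cmd = new SqlCommand(sql, conn);
            cmd.Parameters.AddWithValue("@RunId", replayRunId);
            cmd.Parameters.AddWithValue("@EventId", eventId);
            cmd.Parameters.AddWithValue("@Offset", offset);
            await cmd.ExecuteNonQueryAsync();
            return true;
        }
        catch (SqlException ex) when (ex.Number is 2627 or 2601)
        {
            return false; // UX_ReplayEventLog_Run_Event fired: skip this event
        }
    }
}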
Replay Worker Design (.NET)
Responsibilities
Create and manage partitions of work (ranges or time windows)
Respect concurrency and ordering where necessary
Use checkpointing per partition for resume
Support backoff and failure retry policies
Emit metrics and progress to Control API
Worker Components
ReplayCoordinator — orchestrates runs, splits workload to PartitionWorkers
PartitionWorker — reads events from event store for assigned offsets/time-range
EventTransformer — applies schema migrations or mapping rules
ReplayHandler — executes business logic in replay context (calls projections)
SideEffectAdapter — routes side-effect calls to sandbox or suppression logger
CheckpointStore — persist per-partition checkpoints
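One possible shape for these components as C# contracts; the names mirror the list above, while the signatures and supporting types are assumptions:

// Contracts implied by the component list; signatures are illustrative.
public enum ReplayMode { DryRun, Replay, Sandbox, ReplayWithSideEffects }
public enum ReplayOutcome { Success, Skipped, Error }   // matches ReplayEventLog.Outcome
public sealed record ReplayContext(Guid ReplayRunId, ReplayMode Mode, bool IsReplay = true);

public interface IEventTransformer
{
    object Transform(byte[] rawEvent);                  // applies version adapters
}

public interface IReplayHandler
{
    Task<ReplayOutcome> HandleAsync(object evt, ReplayContext ctx);
}

public interface ISideEffectAdapter
{
    Task ExecuteAsync(object sideEffectRequest, ReplayContext ctx);
}

public interface ICheckpointStore
{
    Task SaveAsync(Guid replayRunId, int partition, long offset);
    Task<long?> LoadAsync(Guid replayRunId, int partition);
}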
Partitioned Reading Example (Kafka)
Use a consumer group with a unique replay consumer group id (e.g., replay-{ReplayRunId}) so the replay does not disrupt the main consumer groups.
Seek to FromOffset and read until ToOffset or ToTime.
.NET Pseudocode: PartitionWorker
public async Task RunAsync(Guid replayRunId, PartitionRange range, CancellationToken ct)
{
    // Dedicated consumer per run/partition so replay never touches production offsets.
    using var consumer = _kafka.CreateConsumer($"replay-{replayRunId}-{range.Partition}");
    consumer.Assign(new TopicPartitionOffset(_topic, range.Partition, range.FromOffset));

    while (!ct.IsCancellationRequested)
    {
        var msg = consumer.Consume(_pollTimeout);
        if (msg == null) break;                            // no more events within the poll timeout
        if (msg.Offset.Value > range.ToOffset) break;      // end of the assigned range

        var eventId = GetEventId(msg);                     // e.g., parsed from a message header

        // Dedupe: unique index on (ReplayRunId, EventId); a failed insert means skip.
        if (!await _replayEventLog.TryMarkProcessing(replayRunId, eventId, msg.Offset.Value))
            continue;

        var evt = _transformer.Transform(msg.Message.Value);
        var outcome = await _replayHandler.HandleAsync(evt, _replayContext);

        await _replayEventLog.MarkProcessed(replayRunId, eventId, outcome);
        await _checkpointStore.SaveAsync(replayRunId, range.Partition, msg.Offset.Value);
    }
}
Ensuring Idempotency
Idempotent Write Patterns
Use UPSERT for projections (merge by key) instead of INSERT.
Persist event processing marker with unique constraint (ReplayRunId + EventId) so duplicate events are skipped.
Use event-sourced upserts where the latest processed event version is compared.
Example Upsert (SQL Server)
MERGE INTO ReplayProjection AS target
USING (VALUES(@AggregateId, @Value, @EventVersion)) AS src(AggregateId, Value, EventVersion)
ON target.AggregateId = src.AggregateId
WHEN MATCHED AND target.EventVersion < src.EventVersion THEN
UPDATE SET Value = src.Value, EventVersion = src.EventVersion
WHEN NOT MATCHED THEN
INSERT (AggregateId, Value, EventVersion) VALUES (src.AggregateId, src.Value, src.EventVersion);
This ensures safe replays even when events are processed multiple times.
Side-Effect Handling Patterns
1. Suppress Mode (Default Safe): side-effect calls are dropped, and the intended action (type, target, payload) is recorded so operators can inspect it later in EffectReview.
2. Sandbox Mode: side-effect calls are routed to a sandbox or mock endpoint, so integrations are exercised without real-world impact.
3. Replay-With-SideEffects (Controlled): real side-effects execute; reserve this for approved runs (see Security and Governance) on allowlisted streams.
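A sketch of an adapter implementing these modes for email, reusing the ReplayContext contract above; IEmailService and IEffectReviewLog are hypothetical, and Sandbox is modeled as its own mode even though the ReplayRun table tracks three:

public sealed class EmailSideEffectAdapter
{
    private readonly IEmailService _production;
    private readonly IEmailService _sandbox;
    private readonly IEffectReviewLog _effectLog;

    public EmailSideEffectAdapter(
        IEmailService production, IEmailService sandbox, IEffectReviewLog effectLog) =>
        (_production, _sandbox, _effectLog) = (production, sandbox, effectLog);

    public async Task SendAsync(EmailMessage email, ReplayContext ctx)
    {
        switch (ctx.Mode)
        {
            case ReplayMode.DryRun:
            case ReplayMode.Replay:
                // Suppress: record the intended send for EffectReview, deliver nothing.
                await _effectLog.RecordSuppressedAsync(ctx.ReplayRunId, "email", email);
                break;
            case ReplayMode.Sandbox:
                await _sandbox.SendAsync(email);        // exercised, but harmless
                break;
            case ReplayMode.ReplayWithSideEffects:
                await _production.SendAsync(email);     // real effect; approval required
                break;
        }
    }
}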
Transformation and Versioning
Maintain event adapters for each version pair (V1→V2 or Vn→Latest).
Keep migration scripts idempotent.
During replay, attach ReplayMetadata to the event passed to handlers:
{ "replayRunId":"...", "originalOffset":12345, "originalStream":"orders" }
Optionally record the transformed event for audit.
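A minimal upcaster sketch; the OrderPlaced versions and field names are hypothetical, chosen to show an idempotent Vn→Latest mapping:

// Hypothetical V1 and V2 shapes of an OrderPlaced event.
public sealed record OrderPlacedV1(Guid OrderId, decimal Amount);
public sealed record OrderPlacedV2(Guid OrderId, decimal NetAmount, decimal TaxAmount, string Currency);

public static class OrderPlacedUpcaster
{
    // Idempotent: upcasting the same V1 event twice yields the same V2 output.
    public static OrderPlacedV2 ToLatest(OrderPlacedV1 v1) =>
        new(v1.OrderId,
            NetAmount: v1.Amount,
            TaxAmount: 0m,        // recomputed by the corrected handler during replay
            Currency: "USD");     // V1 events predate the Currency field
}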
Control API and Angular UI
Control API (Endpoints)
POST /api/replay → create new ReplayRun (body: stream, fromOffset, toOffset, mode, filters, dryRun boolean)
GET /api/replay/{id} → status and stats
POST /api/replay/{id}/pause → pause
POST /api/replay/{id}/resume → resume
POST /api/replay/{id}/cancel → cancel
GET /api/replay/{id}/events → paged list of processed events and outcomes
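A minimal ASP.NET Core controller shape for these endpoints; IReplayService and the DTO names are hypothetical wrappers around the ReplayCoordinator:

[ApiController]
[Route("api/replay")]
public class ReplayController : ControllerBase
{
    private readonly IReplayService _replay;
    public ReplayController(IReplayService replay) => _replay = replay;

    [HttpPost]
    public async Task<ActionResult<ReplayRunDto>> Create(CreateReplayRequest request) =>
        Ok(await _replay.CreateRunAsync(request, User.Identity?.Name));

    [HttpGet("{id:guid}")]
    public async Task<ActionResult<ReplayRunDto>> Get(Guid id) =>
        Ok(await _replay.GetRunAsync(id));

    [HttpPost("{id:guid}/pause")]
    public async Task<IActionResult> Pause(Guid id)
    {
        await _replay.PauseAsync(id);
        return NoContent();
    }

    // resume and cancel follow the same pattern as pause
}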
Angular UI Components
ReplayCreateForm — configure filters, mode, sampling rate
ReplayList — show runs with status, processed count, error count, operator
ReplayDetail — live console, progress gauges, list of errors with stack traces
EventInspector — view original event payload, transformed payload, and projected state
EffectReview — show suppressed side-effects and allow manual approval to re-run specific calls (dangerous; restricted to admins)
Angular Sample: Start Replay
startReplay() {
  const payload = {
    streamName: this.form.stream,
    fromOffset: this.form.fromOffset,
    toOffset: this.form.toOffset,
    mode: this.form.mode,
    dryRun: this.form.dryRun
  };
  // Type the response so res.id is known to the compiler.
  this.http.post<{ id: string }>('/api/replay', payload).subscribe(res => {
    this.router.navigate(['/replay', res.id]);
  });
}
UI must clearly signal that replay can be destructive if not run in safe mode.
Monitoring, Metrics, and Observability
Track:
Events read/sec and processed/sec
Replay run duration and ETA
Error rates and error types per event
Checkpoint lag per partition
Side-effects suppressed count and sandbox call counts
Expose Prometheus metrics and logs. Correlate logs using ReplayRunId and EventId.
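A sketch of the counter and gauge definitions, assuming the prometheus-net package; the metric names are suggestions:

using Prometheus;

public static class ReplayMetrics
{
    public static readonly Counter EventsProcessed = Metrics.CreateCounter(
        "replay_events_processed_total",
        "Events processed by replay workers.",
        new CounterConfiguration { LabelNames = new[] { "replay_run_id", "outcome" } });

    public static readonly Gauge CheckpointLag = Metrics.CreateGauge(
        "replay_checkpoint_lag",
        "Offset distance between the latest read position and the last checkpoint.",
        new GaugeConfiguration { LabelNames = new[] { "replay_run_id", "partition" } });
}

// In the worker loop:
// ReplayMetrics.EventsProcessed.WithLabels(runId.ToString(), outcome.ToString()).Inc();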
Security and Governance
Only allow authorized operators to create runs (RBAC).
Require multi-person approval for ReplayWithSideEffects mode.
Log every command with operator identity and reason.
Use an allowlist of streams and tenants eligible for replay.
Validate transforms against safe schema before applying.
Testing Strategy
Unit Tests: transformer logic, dedupe marker insert, idempotent handlers.
Integration Tests: small replay runs on staging event store and isolated projection database.
Chaos Tests: simulate worker crash mid-run; verify resume from checkpoint produces expected projection.
Dry-Run Tests: ensure dry-run results match full-run results (except persistence).
Security Tests: ensure side-effects cannot be triggered unless explicitly allowed.
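A sketch of the idempotent-handler unit test (xUnit); OrderProjectionHandler and InMemoryProjectionStore are hypothetical test doubles:

[Fact]
public async Task Handling_the_same_event_twice_yields_one_projection_row()
{
    var store = new InMemoryProjectionStore();
    var handler = new OrderProjectionHandler(store);
    var ctx = new ReplayContext(Guid.NewGuid(), ReplayMode.DryRun);
    var evt = new OrderPlacedV2(Guid.NewGuid(), NetAmount: 100m, TaxAmount: 8m, Currency: "USD");

    await handler.HandleAsync(evt, ctx);
    await handler.HandleAsync(evt, ctx);            // duplicate delivery

    Assert.Equal(1, store.CountFor(evt.OrderId));   // upsert, not a second insert
}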
Operational Playbook
Pre-Run: Create a replay in DryRun mode; review EffectReview items.
Approval: If safe, move to Replay mode or Replay-With-SideEffects with approval.
Run: Start run in low-traffic window if necessary; throttle throughput.
Monitor: Watch errors and processing rate. Pause or cancel on anomalies.
Post-Run: Compare projection counts, compute diffs, run reconciliation scripts.
Rollback Plan: For destructive projection updates, have a restore procedure (point-in-time restore of projection or snapshot before run).
Example: End-to-End Replay Scenario
Problem: A bug in a projection caused incorrect tax calculations for orders placed between Jan 1 and Jan 10. The fix has been applied to the projection code. Now replay the events to correct the projection without re-sending emails or payments.
Steps
Fix handler code and deploy to a staging worker image tagged replay-v2.
Create a ReplayRun scoped to the orders stream and the Jan 1–10 time window, with side-effect suppression enabled.
Start replay in DryRun first. Inspect EffectReview (should show email sends suppressed).
Review dry-run diffs for projection; confirm corrections expected.
Run actual Replay. Projections are updated via upserts.
Verify projection data and run reconciliation queries.
Close run and archive logs.
No emails/payments were re-triggered because side-effects were suppressed.
Common Pitfalls and How to Avoid Them
Accidentally publishing to production topics: Always route replay outputs to replay.* or isolated DBs. Use strict permissions to prevent accidental writes.
Non-idempotent handlers: Convert handlers to upsert patterns before enabling replay.
Large replay windows without throttling: Throttle and batch; monitor memory and DB locks.
Hidden side-effects: Audit all handlers for hidden effects (metrics, analytics writes).
Schema drift: Keep event adapters and transformation tests in CI.
Summary
An Event Replay System is a strategic capability for mature systems — enabling recovery, data correction, and re-computation without risking live users. The key pillars are:
Isolated replay sinks and routing
Strong idempotency and deduplication
Side-effect suppression or sandboxing
Versioned transforms and migration adapters
Robust control plane: create/pause/resume/cancel with audit
Observability, tests, and safety guardrails