Introduction
An Event Replay System lets teams reprocess historical events (for bug fixes, model retraining, backfills, or new consumers) while keeping the production system and live users unaffected. Replaying events is a powerful tool — but it can be dangerous if done naively: you can duplicate side-effects, corrupt read models, or trigger external integrations twice.
This article provides a practical, senior-developer guide to designing and implementing a safe event replay system that supports:
Isolated replay environments (non-invasive)
Idempotent reprocessing and deduplication
Selective replay (time-range, aggregate, tenant, stream)
Versioned event handling and transformation
Controlled side-effect suppression or simulation
Metrics, audit, and cancelable runs
.NET worker and Angular control panel examples
The architecture assumes an event store (Kafka, EventStoreDB, Azure Event Hubs, or a durable SQL-backed log). Examples use Kafka-style terminology, but the patterns apply to other stores.
Goals and Non-Goals
Goals
Reprocess events safely without impacting live consumers.
Provide repeatable, auditable replay jobs.
Allow filtering and transformation during replay.
Support dry-run, partial replay and replay with side-effect mitigation.
Provide UI controls for engineers (start, pause, stop, inspect).
Non-Goals
This is not a transactional backfill across multiple heterogeneous systems (that requires careful coordination).
We do not attempt to reconcile already-applied side-effects through a single reconciliation engine; instead, we provide tooling patterns to avoid double-effects.
High-Level Architecture
┌──────────────┐        ┌───────────────┐        ┌──────────────┐
│ Event Store  │ ─────> │ Replay Worker │ ─────> │ Replay Sink  │
│ (Kafka/ESDB) │        │ (.NET Service)│        │ (Read Model) │
└──────────────┘        └──────┬────────┘        └──────┬───────┘
                               │                        │
                               ▼                        ▼
                         ┌───────────┐            ┌────────────┐
                         │Control API│ <--------> │ Angular UI │
                         └───────────┘            └────────────┘
Event Store: authoritative append-only log of domain events.
Replay Worker: reads events from store (earliest → latest or time window), transforms, and emits to replay sink instead of production sinks.
Replay Sink: isolated projection store, separate queues for test consumers, or a staging topic.
Control API and UI: start/stop/pause/replay runs, show metrics, configure transforms and filters.
Key Concepts
Isolated Replay
Events for replay should not be published to production sinks. Replay sinks include:
Separate projection tables (ReplayReadModel)
Separate Kafka topic namespaces (e.g., replay.*)
Temporary message queues or simulation endpoints
Mocked external service adapters
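As a concrete guard for topic isolation, routing can be centralized so replay output physically cannot reach production topics. A minimal sketch; the ReplayTopicRouter name is an assumption for illustration:

// Illustrative guard: all replay output is forced into the replay.* namespace.
public sealed class ReplayTopicRouter
{
    public string Resolve(string topic, bool isReplay) =>
        isReplay ? $"replay.{topic}" : topic;

    // Defensive check before any produce call made in replay mode.
    public void AssertReplayTopic(string topic)
    {
        if (!topic.StartsWith("replay.", StringComparison.Ordinal))
            throw new InvalidOperationException(
                $"Replay attempted to write to production topic '{topic}'.");
    }
}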
Idempotency
Replay must be idempotent. Techniques:
Use idempotency keys derived from event metadata (eventId + replayRunId)
Persist processed event ids in a replay-specific dedupe store (fast key-value DB)
Make event handlers idempotent (upserts, delta-apply with checksums)
Side-Effect Management
Side-effects (emails, payments, third-party API calls) are managed by:
Suppression: disable actual side-effects during replay and log intended actions.
Simulation: route side-effects to a sandbox or mock service.
Replay Mode Awareness: handlers get a context flag IsReplay and either avoid side-effects or call test endpoints.
Transformation & Versioning
Events may need transformation when schemas have evolved since the events were written: older versions must be upcast to the current schema, renamed fields mapped, and missing fields defaulted. The Transformation and Versioning section below covers the adapter patterns.
Controlled Replay
Allow operators to:
Specify filter: stream, aggregateId, tenantId, time window, or offset range
Run a dry-run to see read-model impacts (without persisting)
Limit throughput (events/sec) to avoid resource exhaustion (see the pacing sketch after this list)
Pause/Resume/Cancel runs
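For the throughput limit, here is a minimal pacing helper as a sketch; a token bucket or System.Threading.RateLimiting would be more precise in production. The worker calls WaitAsync once per event:

public sealed class ReplayThrottle
{
    private readonly TimeSpan _interval;
    private DateTime _next = DateTime.UtcNow;

    public ReplayThrottle(int maxEventsPerSecond) =>
        _interval = TimeSpan.FromSeconds(1.0 / maxEventsPerSecond);

    // Delays just long enough to hold the configured events/sec rate.
    public async Task WaitAsync(CancellationToken ct)
    {
        var now = DateTime.UtcNow;
        if (_next > now)
            await Task.Delay(_next - now, ct);
        _next = (_next > now ? _next : now) + _interval;
    }
}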
Audit & Safety
Every replay run must be auditable:
ReplayRun metadata: start time, operator, filters, transform script, status (Running, Paused, Cancelled, Completed), processed count
Error logs and failure handling per-event
Checkpoints for resume
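Checkpoints are what make pause/resume and crash recovery cheap. A sketch of the resume path inside the PartitionWorker shown later; LoadAsync is the assumed counterpart of the SaveAsync call used there:

// On resume, start just after the last committed checkpoint for this partition;
// fall back to the run's FromOffset when the partition was never started.
long? checkpoint = await _checkpointStore.LoadAsync(replayRunId, range.Partition);
long startOffset = checkpoint.HasValue ? checkpoint.Value + 1 : range.FromOffset;
consumer.Assign(new TopicPartitionOffset(_topic, range.Partition, startOffset));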
Event Replay Flowchart
Start
  |
  v
Create ReplayRun (filters, mode, operator)
  |
  v
Validate Filters & Acquire Isolation Token
  |
  v
Open Dedicated Replay Consumer (read-only)
  |
  v
For each Event in stream (matching filter):
  |
  +--> If Event already processed (dedupe) -> skip
  |
  +--> Map/Transform Event (if required)
  |
  +--> Execute Handler in Replay Mode
  |       - If side-effects allowed -> route to sandbox
  |       - Else suppress and log
  |
  +--> Write to Replay Sink (or Dry Run result)
  |
  +--> Commit checkpoint
  |
  v
End Loop
  |
  v
Mark ReplayRun Completed
  |
  v
End
Data Models
ReplayRun Table (SQL)
CREATE TABLE ReplayRun (
    ReplayRunId     UNIQUEIDENTIFIER PRIMARY KEY,
    CreatedBy       NVARCHAR(200),
    CreatedAt       DATETIME2,
    StreamName      NVARCHAR(200) NULL,
    FromOffset      BIGINT NULL,
    ToOffset        BIGINT NULL,
    FromTime        DATETIME2 NULL,
    ToTime          DATETIME2 NULL,
    Mode            NVARCHAR(20) NOT NULL,  -- DryRun | Replay | ReplayWithSideEffects
    Status          NVARCHAR(20) NOT NULL,  -- Created | Running | Paused | Cancelled | Completed | Failed
    ProcessedCount  BIGINT DEFAULT 0,
    ErrorCount      BIGINT DEFAULT 0,
    LastCheckpoint  BIGINT NULL,
    ConfigJson      NVARCHAR(MAX) NULL
);
ReplayEventLog (for dedupe & audit)
CREATE TABLE ReplayEventLog (
    Id            BIGINT IDENTITY PRIMARY KEY,
    ReplayRunId   UNIQUEIDENTIFIER,
    EventId       UNIQUEIDENTIFIER,
    StreamOffset  BIGINT,
    EventType     NVARCHAR(200),
    ProcessedAt   DATETIME2,
    Outcome       NVARCHAR(20),  -- Success | Skipped | Error
    Details       NVARCHAR(MAX)
);

CREATE UNIQUE INDEX UX_ReplayEventLog_Run_Event ON ReplayEventLog(ReplayRunId, EventId);
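The unique index is what makes deduplication cheap: the worker simply attempts the insert and treats a constraint violation as "already processed". A sketch of that marker insert, assuming Microsoft.Data.SqlClient (2627/2601 are SQL Server's duplicate-key error numbers); Outcome is filled in later by a MarkProcessed update:

using Microsoft.Data.SqlClient;

public sealed class ReplayEventLogStore
{
    private readonly string _connectionString;
    public ReplayEventLogStore(string connectionString) => _connectionString = connectionString;

    // Returns false when (ReplayRunId, EventId) already exists, i.e. a duplicate event.
    public async Task<bool> TryMarkProcessing(Guid replayRunId, Guid eventId, long offset)
    {
        const string sql = @"
            INSERT INTO ReplayEventLog (ReplayRunId, EventId, StreamOffset, ProcessedAt)
            VALUES (@RunId, @EventId, @Offset, SYSUTCDATETIME());";
        try
        {
            await using var conn = new SqlConnection(_connectionString);
            await conn.OpenAsync();
            await using var cmd = new SqlCommand(sql, conn);
            cmd.Parameters.AddWithValue("@RunId", replayRunId);
            cmd.Parameters.AddWithValue("@EventId", eventId);
            cmd.Parameters.AddWithValue("@Offset", offset);
            await cmd.ExecuteNonQueryAsync();
            return true;
        }
        catch (SqlException ex) when (ex.Number is 2627 or 2601)
        {
            return false; // UX_ReplayEventLog_Run_Event fired: skip this event
        }
    }
}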
Replay Worker Design (.NET)
Responsibilities
Create and manage partitions of work (ranges or time windows)
Respect concurrency and ordering where necessary
Use checkpointing per partition for resume
Support backoff and failure retry policies
Emit metrics and progress to Control API
Worker Components
ReplayCoordinator — orchestrates runs, splits workload to PartitionWorkers
PartitionWorker — reads events from event store for assigned offsets/time-range
EventTransformer — applies schema migrations or mapping rules
ReplayHandler — executes business logic in replay context (calls projections)
SideEffectAdapter — routes side-effect calls to sandbox or suppression logger
CheckpointStore — persist per-partition checkpoints
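One possible shape for these components as C# contracts; the names mirror the list above, while the signatures and supporting types are assumptions:

// Contracts implied by the component list; signatures are illustrative.
public enum ReplayMode { DryRun, Replay, Sandbox, ReplayWithSideEffects }
public enum ReplayOutcome { Success, Skipped, Error }   // matches ReplayEventLog.Outcome
public sealed record ReplayContext(Guid ReplayRunId, ReplayMode Mode, bool IsReplay = true);

public interface IEventTransformer
{
    object Transform(byte[] rawEvent);                  // applies version adapters
}

public interface IReplayHandler
{
    Task<ReplayOutcome> HandleAsync(object evt, ReplayContext ctx);
}

public interface ISideEffectAdapter
{
    Task ExecuteAsync(object sideEffectRequest, ReplayContext ctx);
}

public interface ICheckpointStore
{
    Task SaveAsync(Guid replayRunId, int partition, long offset);
    Task<long?> LoadAsync(Guid replayRunId, int partition);
}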
Partitioned Reading Example (Kafka)
Use a consumer group with a unique replay consumer group id (e.g., replay-{ReplayRunId}) so the replay does not disrupt the main consumer groups.
Seek to FromOffset and read until ToOffset or ToTime.
.NET Pseudocode: PartitionWorker
public async Task RunAsync(Guid replayRunId, PartitionRange range, CancellationToken ct)
{
    // Dedicated consumer per run/partition so replay never touches production offsets.
    using var consumer = _kafka.CreateConsumer($"replay-{replayRunId}-{range.Partition}");
    consumer.Assign(new TopicPartitionOffset(_topic, range.Partition, range.FromOffset));

    while (!ct.IsCancellationRequested)
    {
        var msg = consumer.Consume(_pollTimeout);
        if (msg == null) break;                            // no more events within the poll timeout
        if (msg.Offset.Value > range.ToOffset) break;      // end of the assigned range

        var eventId = GetEventId(msg);                     // e.g., parsed from a message header

        // Dedupe: unique index on (ReplayRunId, EventId); a failed insert means skip.
        if (!await _replayEventLog.TryMarkProcessing(replayRunId, eventId, msg.Offset.Value))
            continue;

        var evt = _transformer.Transform(msg.Message.Value);
        var outcome = await _replayHandler.HandleAsync(evt, _replayContext);

        await _replayEventLog.MarkProcessed(replayRunId, eventId, outcome);
        await _checkpointStore.SaveAsync(replayRunId, range.Partition, msg.Offset.Value);
    }
}
Ensuring Idempotency
Idempotent Write Patterns
Use UPSERT for projections (merge by key) instead of INSERT.
Persist event processing marker with unique constraint (ReplayRunId + EventId) so duplicate events are skipped.
Use event-sourced upserts where the latest processed event version is compared.
Example Upsert (SQL Server)
MERGE INTO ReplayProjection AS target
USING (VALUES(@AggregateId, @Value, @EventVersion)) AS src(AggregateId, Value, EventVersion)
ON target.AggregateId = src.AggregateId
WHEN MATCHED AND target.EventVersion < src.EventVersion THEN
UPDATE SET Value = src.Value, EventVersion = src.EventVersion
WHEN NOT MATCHED THEN
INSERT (AggregateId, Value, EventVersion) VALUES (src.AggregateId, src.Value, src.EventVersion);
This ensures safe replays even when events are processed multiple times.
Side-Effect Handling Patterns
1. Suppress Mode (Default Safe): side-effect calls are dropped, and the intended action (type, target, payload) is recorded so operators can inspect it later in EffectReview.
2. Sandbox Mode: side-effect calls are routed to a sandbox or mock endpoint, so integrations are exercised without real-world impact.
3. Replay-With-SideEffects (Controlled): real side-effects execute; reserve this for approved runs (see Security and Governance) on allowlisted streams.
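A sketch of an adapter implementing these modes for email, reusing the ReplayContext contract above; IEmailService and IEffectReviewLog are hypothetical, and Sandbox is modeled as its own mode even though the ReplayRun table tracks three:

public sealed class EmailSideEffectAdapter
{
    private readonly IEmailService _production;
    private readonly IEmailService _sandbox;
    private readonly IEffectReviewLog _effectLog;

    public EmailSideEffectAdapter(
        IEmailService production, IEmailService sandbox, IEffectReviewLog effectLog) =>
        (_production, _sandbox, _effectLog) = (production, sandbox, effectLog);

    public async Task SendAsync(EmailMessage email, ReplayContext ctx)
    {
        switch (ctx.Mode)
        {
            case ReplayMode.DryRun:
            case ReplayMode.Replay:
                // Suppress: record the intended send for EffectReview, deliver nothing.
                await _effectLog.RecordSuppressedAsync(ctx.ReplayRunId, "email", email);
                break;
            case ReplayMode.Sandbox:
                await _sandbox.SendAsync(email);        // exercised, but harmless
                break;
            case ReplayMode.ReplayWithSideEffects:
                await _production.SendAsync(email);     // real effect; approval required
                break;
        }
    }
}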
Transformation and Versioning
Maintain event adapters for each version pair (V1→V2 or Vn→Latest).
Keep migration scripts idempotent.
During replay, attach ReplayMetadata to the event passed to handlers:
{ "replayRunId":"...", "originalOffset":12345, "originalStream":"orders" }
Optionally record the transformed event for audit.
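A minimal upcaster sketch; the OrderPlaced versions and field names are hypothetical, chosen to show an idempotent Vn→Latest mapping:

// Hypothetical V1 and V2 shapes of an OrderPlaced event.
public sealed record OrderPlacedV1(Guid OrderId, decimal Amount);
public sealed record OrderPlacedV2(Guid OrderId, decimal NetAmount, decimal TaxAmount, string Currency);

public static class OrderPlacedUpcaster
{
    // Idempotent: upcasting the same V1 event twice yields the same V2 output.
    public static OrderPlacedV2 ToLatest(OrderPlacedV1 v1) =>
        new(v1.OrderId,
            NetAmount: v1.Amount,
            TaxAmount: 0m,        // recomputed by the corrected handler during replay
            Currency: "USD");     // V1 events predate the Currency field
}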
Control API and Angular UI
Control API (Endpoints)
POST /api/replay → create new ReplayRun (body: stream, fromOffset, toOffset, mode, filters, dryRun boolean)
GET /api/replay/{id} → status and stats
POST /api/replay/{id}/pause → pause
POST /api/replay/{id}/resume → resume
POST /api/replay/{id}/cancel → cancel
GET /api/replay/{id}/events → paged list of processed events and outcomes
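A minimal ASP.NET Core controller shape for these endpoints; IReplayService and the DTO names are hypothetical wrappers around the ReplayCoordinator:

[ApiController]
[Route("api/replay")]
public class ReplayController : ControllerBase
{
    private readonly IReplayService _replay;
    public ReplayController(IReplayService replay) => _replay = replay;

    [HttpPost]
    public async Task<ActionResult<ReplayRunDto>> Create(CreateReplayRequest request) =>
        Ok(await _replay.CreateRunAsync(request, User.Identity?.Name));

    [HttpGet("{id:guid}")]
    public async Task<ActionResult<ReplayRunDto>> Get(Guid id) =>
        Ok(await _replay.GetRunAsync(id));

    [HttpPost("{id:guid}/pause")]
    public async Task<IActionResult> Pause(Guid id)
    {
        await _replay.PauseAsync(id);
        return NoContent();
    }

    // resume and cancel follow the same pattern as pause
}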
Angular UI Components
ReplayCreateForm — configure filters, mode, sampling rate
ReplayList — show runs with status, processed count, error count, operator
ReplayDetail — live console, progress gauges, list of errors with stack traces
EventInspector — view original event payload, transformed payload, and projected state
EffectReview — show suppressed side-effects and allow manual approval to re-run specific calls (dangerous; restricted to admins)
Angular Sample: Start Replay
startReplay() {
  const payload = {
    streamName: this.form.stream,
    fromOffset: this.form.fromOffset,
    toOffset: this.form.toOffset,
    mode: this.form.mode,
    dryRun: this.form.dryRun
  };
  // Type the response so res.id is known to the compiler.
  this.http.post<{ id: string }>('/api/replay', payload).subscribe(res => {
    this.router.navigate(['/replay', res.id]);
  });
}
UI must clearly signal that replay can be destructive if not run in safe mode.
Monitoring, Metrics, and Observability
Track:
Events read/sec and processed/sec
Replay run duration and ETA
Error rates and error types per event
Checkpoint lag per partition
Side-effects suppressed count and sandbox call counts
Expose Prometheus metrics and logs. Correlate logs using ReplayRunId and EventId.
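A sketch of the counter and gauge definitions, assuming the prometheus-net package; the metric names are suggestions:

using Prometheus;

public static class ReplayMetrics
{
    public static readonly Counter EventsProcessed = Metrics.CreateCounter(
        "replay_events_processed_total",
        "Events processed by replay workers.",
        new CounterConfiguration { LabelNames = new[] { "replay_run_id", "outcome" } });

    public static readonly Gauge CheckpointLag = Metrics.CreateGauge(
        "replay_checkpoint_lag",
        "Offset distance between the latest read position and the last checkpoint.",
        new GaugeConfiguration { LabelNames = new[] { "replay_run_id", "partition" } });
}

// In the worker loop:
// ReplayMetrics.EventsProcessed.WithLabels(runId.ToString(), outcome.ToString()).Inc();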
Security and Governance
Only allow authorized operators to create runs (RBAC).
Require multi-person approval for ReplayWithSideEffects mode.
Log every command with operator identity and reason.
Use an allowlist of streams and tenants eligible for replay.
Validate transforms against safe schema before applying.
Testing Strategy
Unit Tests: transformer logic, dedupe marker insert, idempotent handlers.
Integration Tests: small replay runs on staging event store and isolated projection database.
Chaos Tests: simulate worker crash mid-run; verify resume from checkpoint produces expected projection.
Dry-Run Tests: ensure dry-run results match full-run results (except persistence).
Security Tests: ensure side-effects cannot be triggered unless explicitly allowed.
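A sketch of the idempotent-handler unit test (xUnit); OrderProjectionHandler and InMemoryProjectionStore are hypothetical test doubles:

[Fact]
public async Task Handling_the_same_event_twice_yields_one_projection_row()
{
    var store = new InMemoryProjectionStore();
    var handler = new OrderProjectionHandler(store);
    var ctx = new ReplayContext(Guid.NewGuid(), ReplayMode.DryRun);
    var evt = new OrderPlacedV2(Guid.NewGuid(), NetAmount: 100m, TaxAmount: 8m, Currency: "USD");

    await handler.HandleAsync(evt, ctx);
    await handler.HandleAsync(evt, ctx);            // duplicate delivery

    Assert.Equal(1, store.CountFor(evt.OrderId));   // upsert, not a second insert
}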
Operational Playbook
Pre-Run: Create a replay in DryRun mode; review EffectReview items.
Approval: If safe, move to Replay mode or Replay-With-SideEffects with approval.
Run: Start run in low-traffic window if necessary; throttle throughput.
Monitor: Watch errors and processing rate. Pause or cancel on anomalies.
Post-Run: Compare projection counts, compute diffs, run reconciliation scripts.
Rollback Plan: For destructive projection updates, have a restore procedure (point-in-time restore of projection or snapshot before run).
Example: End-to-End Replay Scenario
Problem: A bug in a projection caused incorrect tax calculations for orders placed between Jan 1 and Jan 10. The fix has been applied to the projection code. Now replay the events to correct the projection without re-sending emails or payments.
Steps
Fix handler code and deploy to a staging worker image tagged replay-v2.
Create a ReplayRun scoped to the orders stream and the Jan 1–10 time window, with side-effect suppression enabled.
Start replay in DryRun first. Inspect EffectReview (should show email sends suppressed).
Review dry-run diffs for projection; confirm corrections expected.
Run actual Replay. Projections are updated via upserts.
Verify projection data and run reconciliation queries.
Close run and archive logs.
No emails/payments were re-triggered because side-effects were suppressed.
Common Pitfalls and How to Avoid Them
Accidentally publishing to production topics: Always route replay outputs to replay.* or isolated DBs. Use strict permissions to prevent accidental writes.
Non-idempotent handlers: Convert handlers to upsert patterns before enabling replay.
Large replay windows without throttling: Throttle and batch; monitor memory and DB locks.
Hidden side-effects: Audit all handlers for hidden effects (metrics, analytics writes).
Schema drift: Keep event adapters and transformation tests in CI.
Summary
An Event Replay System is a strategic capability for mature systems — enabling recovery, data correction, and re-computation without risking live users. The key pillars are:
Isolated replay sinks and routing
Strong idempotency and deduplication
Side-effect suppression or sandboxing
Versioned transforms and migration adapters
Robust control plane: create/pause/resume/cancel with audit
Observability, tests, and safety guardrails