Introduction
PostgreSQL replication lag is a common concern for teams running production databases. Replication lag means the standby or replica database is behind the primary database and has not yet applied all recent changes. When this lag suddenly spikes in production, it can impact read replicas, analytics queries, reporting systems, and even failover reliability.
In simple terms, replication lag occurs when the primary database produces changes faster than the replica can receive and apply them. In production systems with real traffic, even small changes in workload, configuration, or infrastructure can cause sudden lag spikes. This article explains the most common reasons behind these spikes, using simple language and real-world examples.
How PostgreSQL Replication Works
PostgreSQL usually uses streaming replication in production setups. The primary database writes changes to WAL (Write-Ahead Log) files. These WAL records are sent to replica servers, which then replay them to stay in sync.
Replication lag increases when:
WAL is generated very fast on the primary
WAL is slow to transfer over the network
WAL is slow to replay on the replica
Understanding where the slowdown happens helps identify the root cause.
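As a rough sketch of how the three stages above can be told apart, the query below breaks the lag for each connected standby into bytes not yet sent, bytes sent but not yet flushed, and bytes flushed but not yet replayed. It assumes PostgreSQL 10 or later and must be run on the primary; the output column aliases are illustrative names, not part of PostgreSQL.

```sql
-- Run on the primary. Returns one row per connected standby.
SELECT application_name,
       client_addr,
       state,
       -- WAL generated on the primary but not yet sent to this standby
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS not_yet_sent,
       -- WAL sent but not yet flushed to disk on the standby
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn))            AS sent_not_flushed,
       -- WAL flushed on the standby but not yet replayed
       pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn))          AS flushed_not_replayed,
       replay_lag
FROM pg_stat_replication;
```

A large first value points at WAL generation or sending, a large second value at the network or the replica's WAL writes, and a large third value at replay on the replica.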
Sudden Increase in Write Traffic
One of the most common reasons for a replication lag spike is a sudden increase in write operations on the primary database.
Examples include:
Example scenario:
An e-commerce platform runs a nightly job to update product prices for thousands of rows. This generates a large amount of WAL data in a short time, overwhelming the replica.
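You can check the current WAL write position on the primary with the query below. Sampling it twice and comparing the two values shows how fast WAL is being generated: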
SELECT pg_current_wal_lsn();
If WAL is generated faster than the replica can replay it, lag increases quickly.
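A single WAL position says little on its own, so one possible way to turn it into a rate is to sample it twice and divide the difference by the interval. The block below is a minimal sketch of that idea; the 10-second window and the variable names are arbitrary choices for illustration, and it must be run on the primary.

```sql
-- Run on the primary. Measures WAL generated over a short window.
DO $$
DECLARE
    sample_seconds CONSTANT integer := 10;           -- illustrative sampling window
    start_lsn      pg_lsn := pg_current_wal_lsn();   -- position at the start
    wal_bytes      numeric;
BEGIN
    PERFORM pg_sleep(sample_seconds);
    -- Bytes of WAL written between the two samples
    wal_bytes := pg_wal_lsn_diff(pg_current_wal_lsn(), start_lsn);
    RAISE NOTICE 'WAL generated: % in % s (about % per second)',
        pg_size_pretty(wal_bytes),
        sample_seconds,
        pg_size_pretty(round(wal_bytes / sample_seconds));
END
$$;
```

Comparing the result during the nightly job with the result during normal hours gives a sense of how large the burst is.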
Long-Running Transactions on the Primary
Long-running transactions can silently cause replication lag issues.
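When a transaction stays open for a long time:
Vacuum cannot remove dead rows that the transaction might still see, so cleanup is deferred
The deferred cleanup later runs in bursts that generate extra WAL
With logical replication, WAL needed to decode the open transaction cannot be recycled, and subscribers receive its changes only at commit unless streaming of in-progress transactions is enabled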
Example:
A reporting session holds a transaction open for hours. During this time, dead rows pile up on the primary because vacuum cannot remove them. Once the transaction ends, the deferred cleanup produces a burst of WAL that the replica must replay.
To find long-running transactions:
-- Includes "idle in transaction" sessions, which are often the real culprits
SELECT pid, state, now() - xact_start AS transaction_duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
Slow Disk I/O on the Replica
Even if the primary is fast, replication lag can spike if the replica has slow disk performance.
Common causes include:
Example:
After scaling up traffic, the replica disk reaches high I/O utilization. WAL replay slows down, causing lag even though network transfer is healthy.
Disk bottlenecks are one of the most overlooked causes of replication lag.
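SQL cannot measure disk utilization directly, but it can hint at whether replay is the stage that is falling behind. The sketch below, run on the replica, compares the WAL position that has been received with the position that has been replayed. A gap that keeps growing while WAL is still arriving suggests the bottleneck is replay (disk or CPU) and not the network. The column aliases are illustrative.

```sql
-- Run on the replica.
SELECT pg_is_in_recovery()       AS is_replica,
       pg_last_wal_receive_lsn() AS received_lsn,
       pg_last_wal_replay_lsn()  AS replayed_lsn,
       -- WAL already on the replica's disk but not yet applied
       pg_size_pretty(
           pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
       ) AS received_not_replayed;
```

This should be read together with operating-system disk metrics, since the query only shows the symptom, not the cause.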
Network Latency or Packet Loss
Replication relies on continuous network streaming. Any network issue between the primary and replica can immediately increase lag.
Possible network problems:
Example:
A replica hosted in a different availability zone experiences intermittent network drops, causing WAL streaming delays.
If the replica uses a replication slot, the primary keeps retaining the WAL that has not been delivered, so disk usage on the primary grows while the replica falls further behind.
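One way to look at the streaming connection from the replica's side is the pg_stat_wal_receiver view. The query below is a sketch that assumes PostgreSQL 11 or later (for the sender_host column); transit_delay and since_last_message are illustrative aliases, and transit_delay is only meaningful if the clocks on both servers are synchronized.

```sql
-- Run on the replica. Returns no rows if the WAL receiver is not running,
-- which is itself a sign that streaming has stopped.
SELECT status,
       sender_host,
       latest_end_lsn,
       last_msg_send_time,
       last_msg_receipt_time,
       -- Approximate time a message spent in transit (depends on clock sync)
       last_msg_receipt_time - last_msg_send_time AS transit_delay,
       -- Time since anything at all arrived from the primary
       now() - last_msg_receipt_time              AS since_last_message
FROM pg_stat_wal_receiver;
```

A since_last_message value that keeps climbing during write activity on the primary is a hint that the connection is stalled or dropping.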
Heavy Read Queries on the Replica
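Read replicas are often used for analytics and reporting. Heavy queries can consume CPU and I/O resources, slowing WAL replay. Long-running queries can also conflict with replay directly: PostgreSQL may pause applying WAL for up to max_standby_streaming_delay before it cancels a query that still needs the affected rows.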
Example:
A dashboard runs complex joins and aggregations on the replica during peak hours. At the same time, the replica struggles to apply WAL data.
This causes lag even though the primary is functioning normally.
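Monitoring replay activity on the replica helps detect this. The value below is the time since the last replayed transaction, so it also grows when the primary is simply idle: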
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
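To check whether read queries on the replica are colliding with WAL replay, the conflict counters and the related settings can be inspected. The following is a minimal sketch to run on the replica; how to interpret the numbers depends on the workload.

```sql
-- Run on the replica. Counts queries canceled because they conflicted with replay.
SELECT datname,
       confl_snapshot,    -- query needed row versions that replay had to remove
       confl_lock,        -- query held a lock that replay needed
       confl_bufferpin,
       confl_tablespace,
       confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();

-- Settings that control how long replay waits for conflicting queries
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_standby_streaming_delay', 'hot_standby_feedback');
```

Rising conflict counters together with a high max_standby_streaming_delay suggest that replay is regularly waiting on queries.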
Vacuum and Autovacuum Pressure
Vacuum processes help maintain database health, but they also generate WAL traffic.
Lag spikes can happen when:
Autovacuum runs aggressively
Large tables are vacuumed
Vacuum and user queries compete for I/O
Example:
After a massive delete operation, autovacuum kicks in and generates heavy WAL traffic, pushing replicas behind.
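A quick way to see which tables are likely to attract heavy vacuum work is to look at dead-row counts on the primary. The query below is a sketch; the limit of 10 rows is an arbitrary choice.

```sql
-- Run on the primary. Tables with the most dead rows awaiting cleanup.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A table with a very high n_dead_tup right after a bulk delete is a likely source of the next burst of vacuum-related WAL.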
Checkpoint Spikes
Checkpoints force PostgreSQL to flush dirty pages to disk. Poorly tuned checkpoint settings can cause I/O spikes.
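When checkpoints are too frequent, each one triggers a burst of disk writes, and the first change to every page after a checkpoint logs a full-page image, which inflates WAL volume. This can cause short but noticeable replication lag spikes during peak write activity.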
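As a starting point for reviewing checkpoint tuning, the current values of the relevant settings can be listed from pg_settings. This is only a sketch of what to look at, not a tuning recommendation; appropriate values depend on the workload and hardware.

```sql
-- Run on the primary. Settings that govern checkpoint frequency and WAL volume.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout',
               'max_wal_size',
               'checkpoint_completion_target',
               'full_page_writes',
               'log_checkpoints');
```

With log_checkpoints enabled, the server log shows when each checkpoint ran, which makes it possible to line checkpoints up against lag spikes.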
Insufficient Replica Resources
Replication lag often appears after infrastructure changes.
Common issues include:
Example:
A new replica is added with lower specifications than the primary. Under load, it cannot keep up with WAL replay.
Production replicas should be sized close to the primary, especially for write-heavy systems.
Replication Slot Mismanagement
Replication slots prevent WAL cleanup until replicas consume the data.
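Problems occur when a slot's consumer falls far behind or disappears, for example after a replica is decommissioned without dropping its slot. The primary then keeps retaining WAL for that slot indefinitely.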
You can inspect replication slots using:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
Inactive or unhealthy slots can indirectly worsen lag and storage usage.
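The restart_lsn value alone is hard to interpret, so a common refinement is to convert it into the amount of WAL each slot is holding back. The query below is a sketch to run on the primary; retained_wal is an illustrative alias.

```sql
-- Run on the primary. Shows how much WAL each slot forces the primary to keep.
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```

An inactive slot with a large and growing retained_wal value is a candidate for investigation before the primary's disk fills up.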
Failovers and Restarts
Planned or unplanned restarts can also cause lag spikes.
Examples include:
Primary restarts
Replica restarts
Failover events
After restarting, replicas must catch up on accumulated WAL, which can take time depending on system load.
Why Replication Lag Spikes Appear Sudden
Replication lag often builds up silently and becomes visible only after crossing a threshold.
Reasons include:
Monitoring checks triggering alerts late
WAL backlog growing unnoticed
Combined effects of multiple small issues
What looks like a sudden spike is often the result of gradual pressure.
How Teams Typically Investigate Lag Spikes
Production teams usually follow these steps:
Check write activity on the primary
Inspect replica CPU, disk, and memory
Review long-running queries
Analyze network metrics
Review WAL generation and replay rate
This systematic approach helps isolate the real bottleneck.
Summary
PostgreSQL replication lag spikes suddenly in production due to increased write traffic, long-running transactions, slow replica disks, network issues, heavy read queries, aggressive vacuuming, checkpoint pressure, or under-provisioned replicas. In most cases, lag is not caused by a single issue but by a combination of workload and resource constraints. Understanding how WAL is generated, transferred, and replayed helps teams quickly diagnose and reduce replication lag, ensuring stable and reliable PostgreSQL production systems.