Why PostgreSQL Replication Lag Spikes Suddenly in Production

Introduction

PostgreSQL replication lag is one of those problems that looks scary but is often misunderstood. In many production systems, everything runs smoothly for days or weeks, and then suddenly the replica starts falling behind. Read queries slow down, dashboards show red alerts, and teams worry about data consistency or failover safety.

In simple words, replication lag means the primary database is producing changes faster than the replica can process them. What confuses most teams is that the lag appears suddenly, even though the real cause has usually been building quietly in the background.

Think of PostgreSQL replication like a courier system. The primary database is the warehouse that sends packages, and the replica is the delivery center that receives and unpacks them. If packages start arriving faster than they can be unpacked, a backlog forms. Nothing is broken, but delays become visible.

This article explains why PostgreSQL replication lag spikes in production, using real-life examples, plain-language explanations, and the clear advantages and disadvantages of each choice, so readers understand what happens if these issues are ignored.

What Developers Usually See in Production

Teams commonly report situations like these:

  • Replication lag jumps from a few seconds to several minutes

  • Primary database looks healthy

  • Replica CPU or disk suddenly spikes

  • Read traffic on replicas slows down

  • Lag disappears on its own later

This creates the impression that replication lag is random or unpredictable.

Wrong Assumption vs Reality

Wrong assumption: Replication lag means the replica database is broken or unreliable.

Reality: In most cases, replication lag means the replica is temporarily overloaded.

Real-life example:
Imagine a restaurant kitchen during peak dinner hours. Orders keep coming in, but the number of cooks stays the same. Food starts taking longer to reach tables. The kitchen is not broken; it is just overloaded for that period.

The same thing happens with PostgreSQL replicas. They are still working correctly, but they cannot keep up with the workload at that moment.

How PostgreSQL Replication Works (Simple View)

PostgreSQL streaming replication works by sending WAL (Write-Ahead Log) records from the primary to replicas.

In short:

  • Primary writes changes to WAL

  • WAL is streamed to replicas

  • Replicas replay WAL to stay in sync

Replication lag increases when any step in this pipeline slows down.
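
A quick way to observe each stage of this pipeline is to query pg_stat_replication on the primary; the lag columns shown here require PostgreSQL 10 or newer:

    -- Run on the primary: per-replica lag at each pipeline stage
    SELECT client_addr,
           state,
           write_lag,    -- time until WAL is written on the replica
           flush_lag,    -- time until WAL is flushed to the replica's disk
           replay_lag    -- time until WAL is applied and visible to queries
    FROM pg_stat_replication;

    -- Run on the replica: how far behind the last replayed transaction is
    SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;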

Cause 1: Sudden Spike in Write Traffic

This is the most common and most misunderstood cause of replication lag.

What actually happens

When the primary database suddenly receives a lot of write operations, such as inserts, updates, or deletes, it generates a large amount of WAL data in a short time. The replica must receive and replay all of this WAL before it can catch up.
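
One way to confirm a write burst is to sample the primary's current WAL position twice and measure how many bytes were generated in between. This is a minimal sketch, assuming PostgreSQL 10 or newer; the '0/0' literal is a placeholder for your first reading:

    -- Run on the primary. First, note the current WAL position:
    SELECT pg_current_wal_lsn();

    -- Wait a fixed interval (say 60 seconds), then compute bytes generated.
    -- Replace '0/0' with the LSN returned by the first query.
    SELECT pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'::pg_lsn)
           ) AS wal_generated;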

Real-life example

A company runs a nightly batch job to update prices for millions of products. The job finishes in 15 minutes on the primary database, but the replica takes 45 minutes to replay all the changes. During this window, replication lag keeps increasing.

Why it looks sudden

Lag does not jump instantly. It grows quietly during the write burst and becomes visible only after crossing alert thresholds.

Advantages of handling this properly

  • Replicas stay close to real-time

  • Read queries remain fast

  • Failover remains safe

Disadvantages if ignored

  • Replicas fall far behind

  • Reporting queries become outdated

  • Failover may cause data loss or inconsistencies

Cause 2: Long-Running Transactions on the Primary

Long-running transactions are silent troublemakers.

What actually happens

When a transaction stays open for a long time, PostgreSQL must keep around the old row versions that transaction might still need. Dead tuples pile up, and the vacuum passes that eventually clean them generate extra WAL, all of which lands on the replica and increases the amount of history it has to replay.
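
To spot these on the primary, a query along these lines against pg_stat_activity lists the oldest open transactions first:

    -- Run on the primary: oldest open transactions first
    SELECT pid,
           usename,
           state,
           now() - xact_start AS transaction_age,
           left(query, 60)    AS current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 10;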

Real-life example

A data analyst opens a transaction for a report that takes two hours and forgets to commit it. During that time, regular application writes continue. The replica has to process much more history than usual, which slows down replay.

Advantages of managing long transactions

  • Faster replication replay

  • Healthier vacuum operations

  • More predictable performance

Disadvantages if ignored

  • Replication lag keeps growing

  • Disk usage increases

  • Autovacuum becomes less effective

Cause 3: Slow Disk I/O on the Replica

Replication is disk-heavy on replicas.

Common scenarios

  • Replica disk is slower than primary

  • Disk is shared with other workloads

  • Storage hits IOPS limits in the cloud

What developers see

“Network looks fine, but lag keeps growing.”

The real bottleneck is WAL replay writing to disk.
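
One way to confirm this is to compare how much WAL the replica has received with how much it has actually replayed. If the backlog below keeps growing while new WAL continues to arrive, the bottleneck is local replay, usually disk, rather than the network. A minimal sketch for PostgreSQL 10 and newer:

    -- Run on the replica: WAL received vs WAL actually replayed
    SELECT pg_last_wal_receive_lsn() AS received,
           pg_last_wal_replay_lsn()  AS replayed,
           pg_size_pretty(
             pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                             pg_last_wal_replay_lsn())
           ) AS replay_backlog;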

Cause 4: Heavy Read Queries on the Replica

Read replicas are often abused.

Typical pattern

  • Analytics queries run on replicas

  • Large joins and aggregations consume CPU and I/O

  • WAL replay competes for the same resources

Real-world analogy

“Browsing customers block the staff from restocking shelves.”

Replication slows down even though writes are healthy.
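
One signal worth checking is PostgreSQL's recovery conflict counters on the replica, which record queries cancelled because they interfered with WAL replay:

    -- Run on the replica: cumulative recovery conflicts per database
    SELECT datname,
           confl_lock,      -- queries cancelled while holding needed locks
           confl_snapshot,  -- queries cancelled by vacuum cleanup records
           confl_bufferpin  -- queries cancelled while pinning needed buffers
    FROM pg_stat_database_conflicts;

Note that a generous max_standby_streaming_delay makes replay wait for conflicting queries rather than cancel them, so heavy reads can show up purely as growing lag with few recorded conflicts.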

Cause 5: Autovacuum and Vacuum Pressure

Vacuum processes generate WAL too.

When this hurts

  • Large deletes or updates happen

  • Autovacuum runs aggressively

  • WAL generation increases further

Lag spikes often appear after the large delete or update itself, when autovacuum kicks in to clean up, rather than during the original operation.
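
When a lag spike follows a big cleanup, it is worth checking whether a vacuum is still working through the table on the primary. pg_stat_progress_vacuum (available since PostgreSQL 9.6) shows any vacuum currently running:

    -- Run on the primary: progress of any vacuum currently running
    SELECT pid,
           relid::regclass AS table_name,
           phase,
           heap_blks_scanned,
           heap_blks_total
    FROM pg_stat_progress_vacuum;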

Cause 6: Network Latency or Packet Loss

Replication depends on steady network streaming.

Real-world example

“Replica is in another availability zone or region.”

Small network hiccups cause WAL delays that accumulate into noticeable lag.
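
To separate delivery delay from replay delay, compare the primary's current WAL position with what each replica has received so far. A sketch run on the primary (PostgreSQL 10 or newer):

    -- Run on the primary: WAL generated but not yet sent or written remotely
    SELECT client_addr,
           pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)
           ) AS not_yet_sent,
           pg_size_pretty(
             pg_wal_lsn_diff(sent_lsn, write_lsn)
           ) AS in_flight_to_replica,
           write_lag
    FROM pg_stat_replication;

If not_yet_sent or in_flight_to_replica keeps growing while the replica's own CPU and disk look idle, the network path deserves a closer look.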

Cause 7: Checkpoint Spikes

Poor checkpoint tuning can cause I/O bursts.

What happens

  • Checkpoint forces dirty pages to disk

  • Disk I/O spikes

  • WAL replay slows down

These spikes are short but can push replicas behind quickly.
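
A rough health check is whether checkpoints are mostly scheduled (timed) or forced by WAL volume (requested). On PostgreSQL 16 and earlier these counters live in pg_stat_bgwriter; PostgreSQL 17 moves them into pg_stat_checkpointer:

    -- Many requested checkpoints suggest max_wal_size is too small,
    -- which forces frequent I/O bursts on both primary and replica.
    SELECT checkpoints_timed,
           checkpoints_req,
           checkpoint_write_time,  -- ms spent writing dirty pages
           checkpoint_sync_time    -- ms spent syncing files to disk
    FROM pg_stat_bgwriter;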

Why Replication Lag Feels Random

Replication lag often feels sudden because:

  • Monitoring polls at intervals, so dashboards trail reality

  • Multiple small issues stack together

  • Alerts trigger only after thresholds

Real-world explanation

“Water rises slowly, but flooding feels instant.”

How Teams Should Think About Replication Lag

Instead of asking “Why is lag high?”, ask:

  • Is WAL being generated faster than usual?

  • Is the replica CPU or disk saturated?

  • Are long transactions running?

  • Are replicas doing heavy reads?

This mindset leads to faster diagnosis.

Simple Mental Checklist

When replication lag spikes in production, pause and ask these questions calmly:

  • Did write traffic increase suddenly?

  • Are batch jobs or migrations running?

  • Are there long-running transactions?

  • Is the replica CPU or disk fully utilized?

  • Are analytics queries hitting the replica?

In most real-world incidents, at least one of these answers explains the lag.

Summary

PostgreSQL replication lag spikes in production not because replication is unreliable, but because replicas temporarily cannot keep up with workload pressure. Sudden write bursts, long-running transactions, slow disks, heavy read queries, vacuum activity, and checkpoint spikes all contribute to lag in very normal, predictable ways.

The key is understanding that replication lag is a capacity and workload problem, not a mystery or random failure. By recognizing real-life patterns, understanding the advantages of proactive management, and knowing the disadvantages of ignoring lag signals, teams can respond calmly and keep PostgreSQL replication stable, reliable, and production-ready.