Why PostgreSQL Failovers Are Risky When Replication Lag Exists

Introduction

Failover sounds comforting. A primary database fails, a replica takes over, and the application keeps running. On paper, this looks clean and safe.

In real PostgreSQL production systems, failovers are often where small problems turn into major incidents. Everything works fine until a failover happens. Then data looks missing, applications behave strangely, or systems refuse to start correctly.

This article explains why PostgreSQL failovers are risky when replication lag exists, what teams usually see in production, and why these failures feel sudden and deeply confusing.

What Replication Lag Actually Means

Replication lag means the replica is behind the primary. Writes have been committed on the primary, but the replica has not yet received or replayed them. Lag is usually measured in bytes of WAL still to apply, or in seconds behind the primary's latest commit.

A real-world analogy: imagine copying files from one laptop to another over a slow network. If the first laptop crashes before the copy finishes, the second laptop simply does not have the latest files.

PostgreSQL replication works the same way. The replica can only serve what it has already received and replayed.
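You can see this directly on a standby. A minimal sketch, assuming PostgreSQL 10 or later, using the standard built-in functions for inspecting WAL receipt and replay:

  -- Run on the replica. Shows what has been received vs. what has been applied.
  SELECT
      pg_last_wal_receive_lsn()               AS received_lsn,   -- last WAL position received from the primary
      pg_last_wal_replay_lsn()                AS replayed_lsn,   -- last WAL position actually applied
      pg_last_xact_replay_timestamp()         AS last_replay_ts, -- commit time of the last replayed transaction
      now() - pg_last_xact_replay_timestamp() AS approx_delay    -- rough time behind; only meaningful while the primary keeps writing
  ;

If received_lsn and replayed_lsn match, the standby has applied everything it was sent. Anything newer than that never left the primary.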

Why Replication Lag Is Normal

Some amount of replication lag is expected.

Lag can occur due to:

  • High write volume

  • Slow disk I/O on replicas

  • Long-running queries blocking replay

  • Network latency

  • Maintenance operations

The problem is not lag itself. The problem is assuming lag does not matter during failover.
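The same delay is visible from the primary's side. A minimal monitoring sketch against the built-in pg_stat_replication view (PostgreSQL 10 or later; one row per connected standby):

  -- Run on the primary. replay_lag_bytes is committed WAL the standby has not
  -- yet applied; the *_lag columns express the same idea as time intervals.
  SELECT
      application_name,
      state,
      pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
      write_lag,
      flush_lag,
      replay_lag
  FROM pg_stat_replication;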

What Happens During a Failover

During failover, the replica is promoted to primary. From that moment on, applications write to it.

If the replica was behind:

  • Recent transactions are missing

  • Sequence values may jump backward

  • Application state becomes inconsistent

  • Users see partial or stale data

PostgreSQL does exactly what it is told. The surprise comes from what it cannot recover.
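Promotion itself is one call; the risk lives in what has not been replayed. A minimal sketch of a manual promotion that at least waits for the standby to apply everything it has already received (pg_promote() exists from PostgreSQL 12 onward; older versions use pg_ctl promote):

  -- On the standby: WAL received from the primary but not yet applied.
  SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
         AS unreplayed_bytes;

  -- Promote only once unreplayed_bytes reaches zero. Note the limit of this check:
  -- anything the old primary committed but never shipped to this standby is already gone.
  SELECT pg_promote();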

What Developers Usually See in Production

After a lagged failover, teams often observe:

  • “Missing” rows that existed minutes ago

  • Duplicate IDs or unexpected conflicts

  • Applications failing integrity checks

  • Background jobs behaving incorrectly

  • Data inconsistencies that persist

These issues are terrifying because they look like corruption, even when the database itself is healthy.

Why the Failure Feels Sudden and Severe

Replication lag is usually invisible during normal operation. Reads still work. Writes still succeed.

Failover turns a hidden delay into permanent data loss.

Because the system worked fine seconds before the failover, teams feel blindsided. The damage happens instantly, but the cause existed quietly beforehand.

Synchronous vs Asynchronous Replication

Asynchronous replication prioritizes performance. Writes return quickly, but replicas can lag.

Synchronous replication prioritizes safety. Writes wait for replicas, reducing lag but increasing latency.
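That trade-off is a configuration choice, not a fixed property of PostgreSQL. A minimal sketch of switching a primary to synchronous replication, run on the primary as a superuser; 'replica1' is a placeholder that must match the standby's application_name:

  -- Wait for one named standby before a commit is acknowledged.
  ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';

  -- 'remote_apply' waits until the standby has replayed the commit;
  -- the default 'on' waits only until the standby has flushed it to disk.
  ALTER SYSTEM SET synchronous_commit = 'remote_apply';

  SELECT pg_reload_conf();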

Many teams unknowingly accept asynchronous replication without fully understanding the failover risk.

Real-World Example

A production system runs asynchronous replication to keep latency low. Replication lag occasionally reaches a few seconds during peak traffic, but no alerts fire.

One day, the primary crashes. Automatic failover promotes a replica. Users immediately report missing recent transactions. Support tickets flood in.

The system is “up,” but trust is broken.

Advantages and Disadvantages of Replication and Failover

Advantages (When Properly Designed)

When replication lag is understood and managed:

  • Failovers are predictable

  • Data loss scenarios are known

  • Recovery procedures are clear

  • Business impact is controlled

  • Trust in the system increases

Failover becomes a planned operation, not a gamble.

Disadvantages (When Lag Is Ignored)

When teams ignore replication lag:

  • Failovers cause data loss

  • Incidents escalate rapidly

  • Manual fixes become dangerous

  • Confidence in automation drops

  • Teams hesitate to trigger failovers

At that point, failover becomes something teams fear.

How Teams Should Think About This

Failover is not just a technical switch. It is a business decision about acceptable data loss.

Teams should ask:

  • How much data loss is acceptable?

  • How visible is replication lag?

  • Do applications tolerate missing writes?

  • Is synchronous replication required for critical data?

Failover strategy must match business expectations, not just infrastructure design.
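Synchronous replication also does not have to be all or nothing. With synchronous_standby_names configured, individual transactions can opt into stronger guarantees while the bulk of the workload stays asynchronous. A minimal sketch; the payments table is a placeholder:

  -- Critical write: do not return until the standby has applied it.
  BEGIN;
  SET LOCAL synchronous_commit = 'remote_apply';
  INSERT INTO payments (order_id, amount) VALUES (42, 99.00);
  COMMIT;

  -- Routine write: accept the small risk of loss on failover for lower latency.
  SET synchronous_commit = 'local';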

Simple Mental Checklist

Before trusting failover, ask:

  • How far behind can replicas get?

  • What happens to in-flight transactions?

  • Are lag alerts tied to failover automation?

  • Do teams understand data loss scenarios?

  • Has failover been tested under real load?

These questions prevent painful surprises.
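One concrete way to tie lag visibility to failover automation is to make the tooling check candidate replicas before promoting anything. A minimal sketch of such a check, run on the primary during normal operation or before a planned switchover; the 16 MB threshold is an arbitrary placeholder, not a recommendation:

  -- Returns one row per standby lagging beyond the threshold.
  -- An empty result means every candidate is close enough to promote.
  SELECT application_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication
  WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 16 * 1024 * 1024;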

Summary

PostgreSQL failovers are risky when replication lag exists because lag turns into permanent data loss the moment a replica is promoted. The danger feels sudden because lag is invisible until it matters most. Teams that understand and plan for replication lag treat failover as a controlled risk, not a blind safety net.