Prevent Cache Stampede and Thundering Herd Problems in Redis

Cache Stampede and the Thundering Herd Problem

If you have ever seen a system that runs smoothly for days and then suddenly degrades for no apparent reason, there is a strong chance you have encountered the cache stampede problem.

Everything appears healthy. Redis is available. The database is responsive. CPU usage looks normal. Then latency spikes, the database is overwhelmed, timeouts cascade, and teams scramble to identify what changed.

In many cases, nothing changed at all. A cache key simply expired.

This phenomenon is commonly referred to as cache stampede or thundering herd. It is one of the most frequent failure modes in caching-heavy systems and one of the most misunderstood.

What a Cache Stampede Really Is

A cache stampede occurs when many requests attempt to rebuild the same cache entry at once. Instead of a single request missing the cache and repopulating it, hundreds or thousands of requests miss simultaneously and fall back to the database or an external dependency.

Redis behaves correctly in this situation. It returns a cache miss. The real problem lies in how the application responds to that miss.
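
The vulnerable pattern is ordinary cache-aside. As a minimal Python sketch using redis-py (the key name and the query_db helper are illustrative):

```python
import redis

r = redis.Redis()

def get_report(key: str, query_db):
    """Naive cache-aside: correct in isolation, dangerous under load."""
    value = r.get(key)
    if value is None:
        # Nothing coordinates concurrent misses: under heavy traffic,
        # thousands of callers run this rebuild against the database at once.
        value = query_db()
        r.set(key, value, ex=600)  # 10-minute TTL
    return value
```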

In development or low-traffic environments, this behavior rarely appears. Requests are distributed, and cache expirations occur quietly. Under real production traffic, especially for popular keys, the situation changes dramatically.

A Realistic Production Scenario

Consider a key that represents a popular homepage API response with a TTL of 10 minutes. When the key expires, the subsequent request misses and rebuilds the cache.

Under light traffic, this works as expected. Under heavy traffic, such as 10,000 requests per second, key expiration causes thousands of concurrent cache misses: if the rebuild takes 500 milliseconds, roughly 5,000 requests arrive before the cache is repopulated, and each one attempts the rebuild itself, overwhelming the database.

Even after the cache is repopulated, the backlog of queued requests and timeouts can leave a lasting impact on the system.

Why Cache Stampede Happens

Cache stampede typically appears in systems that combine three characteristics:

  • High traffic volume

  • Uniform TTL values

  • Cache-aside logic without coordination

Uniform TTLs create synchronized expiration. When many keys expire at the same time, the system effectively schedules load spikes.

TTL Jitter as the First Line of Defense

The simplest and most effective defense against cache stampede is to avoid synchronized expiration.

Adding a small amount of randomness to TTL values prevents large numbers of keys from expiring simultaneously. Instead of expiring at exactly ten minutes, keys may expire between nine and eleven minutes.

This approach, known as TTL jitter, spreads cache rebuilds over time and significantly reduces load spikes. Adding as little as thirty seconds or one minute of randomness is often sufficient.
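
As a sketch in Python with redis-py (the base TTL and jitter window are illustrative), jitter is a one-line change at write time:

```python
import random
import redis

r = redis.Redis()

BASE_TTL = 600  # ten minutes
JITTER = 60     # up to one minute in either direction

def set_with_jitter(key: str, value: str) -> None:
    # Randomize the TTL so keys written together do not expire together.
    ttl = BASE_TTL + random.randint(-JITTER, JITTER)
    r.set(key, value, ex=ttl)
```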

TTL jitter alone prevents many incidents, but hot keys that are expensive to rebuild may still require additional protection.

Coordinating Cache Rebuilds With Locks

For expensive cache rebuilds, request-level coordination is often necessary.

A common approach is to allow only one request to rebuild a missing cache entry. When a cache miss occurs, the first request acquires a short-lived lock in Redis and performs the rebuild. Other requests that miss detect the lock and either wait briefly, retry, or serve slightly stale data.

Locks must always be short-lived. If a rebuild process crashes or times out, the lock should expire quickly to avoid creating a new failure condition.
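
A minimal sketch of this pattern, assuming redis-py and an application-supplied rebuild callable (the lock key prefix and the timings are illustrative):

```python
import time
import redis

r = redis.Redis()

LOCK_TTL = 10  # seconds; a crashed rebuilder cannot block others for long

def get_or_rebuild(key: str, rebuild, ttl: int = 600, retries: int = 20):
    for _ in range(retries):
        value = r.get(key)
        if value is not None:
            return value
        # SET with nx=True and ex=LOCK_TTL is atomic: only one caller wins.
        if r.set(f"lock:{key}", "1", nx=True, ex=LOCK_TTL):
            try:
                value = rebuild()
                r.set(key, value, ex=ttl)
                return value
            finally:
                r.delete(f"lock:{key}")
        # Losers back off briefly and re-check the cache instead of
        # stampeding the database.
        time.sleep(0.1)
    # Fallback after waiting too long: rebuild anyway rather than fail.
    return rebuild()
```

A production version would store a unique token in the lock and release it only if the token still matches, so that a rebuilder whose lock has already expired cannot delete a lock acquired by someone else.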

This pattern is particularly useful for heavy database queries or slow external API calls. It is usually unnecessary for inexpensive lookups.

Stale-While-Revalidate Strategy

Another effective technique is stale-while-revalidate. Instead of allowing a key to disappear completely when it expires, the system serves slightly stale data while a background refresh occurs.

From a user perspective, slightly stale data is often preferable to slow or failed responses. From a system perspective, this smooths load and avoids spikes.

Redis does not support stale-while-revalidate natively, but it can be implemented at the application level by storing metadata alongside cached values. Logical expiration can trigger background refresh while the physical key remains available.
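
One way to sketch this in Python, assuming redis-py and JSON-encoded values (the field names and TTLs are illustrative): the physical TTL outlives a logical expires_at timestamp stored alongside the value.

```python
import json
import threading
import time

import redis

r = redis.Redis()

PHYSICAL_TTL = 900  # the key survives in Redis past its logical expiry
LOGICAL_TTL = 600   # after this, data is stale but still servable

def set_swr(key: str, value) -> None:
    wrapper = {"value": value, "expires_at": time.time() + LOGICAL_TTL}
    r.set(key, json.dumps(wrapper), ex=PHYSICAL_TTL)

def get_swr(key: str, rebuild):
    raw = r.get(key)
    if raw is None:
        # Genuine miss: rebuild inline and cache the result.
        value = rebuild()
        set_swr(key, value)
        return value
    wrapper = json.loads(raw)
    if time.time() > wrapper["expires_at"]:
        # Logically expired: refresh in the background, serve stale now.
        threading.Thread(
            target=lambda: set_swr(key, rebuild()), daemon=True
        ).start()
    return wrapper["value"]
```

In practice the background refresh should be combined with the locking pattern above, so that concurrent stale reads do not launch duplicate refreshes.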

This approach works well for dashboards, content pages, and aggregated data where minor staleness is acceptable.

Preventing Stampede on Missing Data

Another common cause of stampede is the absence of negative caching. When data does not exist, systems may repeatedly query the database for the same missing record.

If a popular identifier does not exist, every request misses and hits the database again, creating a stampede on non-existent data.

Caching negative results for a short TTL prevents repeated database hits. If an entity is not found, caching that fact briefly reduces unnecessary load while allowing new data to appear quickly if created.
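
A sketch of negative caching with redis-py (the sentinel value, key prefix, and fetch_user helper are illustrative):

```python
import redis

r = redis.Redis()

MISSING = b"__missing__"  # sentinel meaning "we checked; it does not exist"
NEGATIVE_TTL = 30         # short, so newly created data appears quickly

def get_user(user_id: str, fetch_user):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached == MISSING:
        return None       # known-missing: skip the database entirely
    if cached is not None:
        return cached

    row = fetch_user(user_id)
    if row is None:
        r.set(key, MISSING, ex=NEGATIVE_TTL)  # cache the absence, briefly
        return None
    r.set(key, row, ex=600)
    return row
```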

Application Restarts and Cold Caches

Application restarts are another overlooked contributor to cache stampede. When instances restart, in-process caches are empty. If Redis is also cold or partially populated, many services may attempt to rebuild the same keys simultaneously.

Warm-up strategies help mitigate this risk. Preloading critical keys or gradually ramping traffic after restarts can prevent sudden spikes. This is especially important in containerized environments where restarts are frequent.
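
As a sketch, a startup hook can preload the hottest keys before the instance receives traffic (the key list and builder function are illustrative):

```python
import redis

r = redis.Redis()

def build_homepage_feed() -> str:
    return "..."  # stands in for an expensive query

# Keys worth preloading, mapped to their rebuild functions (illustrative).
CRITICAL_KEYS = {"homepage:feed": build_homepage_feed}

def warm_cache() -> None:
    # Run before the instance is added to the load balancer.
    for key, build in CRITICAL_KEYS.items():
        if not r.exists(key):
            r.set(key, build(), ex=600)
```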

Choosing Where to Apply Protection

Not every cache miss requires stampede protection. Overengineering every cache path adds unnecessary complexity.

Protection should focus on keys that are both frequently accessed and expensive to rebuild. Rarely used or inexpensive lookups do not need locking or coordination.

Monitoring is essential here. Spikes in database queries that align with cache expiration times or sawtooth latency patterns are strong indicators of stampede issues.

Cache Stampede Is Not a Redis Problem

Cache stampede is not caused by Redis misbehavior. Redis is functioning correctly. The issue lies in how applications handle cache misses under load.

A resilient caching strategy assumes that keys will expire at inconvenient times, requests will pile up, and systems will restart. The goal is to ensure graceful degradation rather than catastrophic failure.

When cache stampede is handled well, it becomes a non-event. A few requests take slightly longer, the cache refreshes, and traffic continues normally.

When it is handled poorly, it turns into a full production incident that appears random until the root cause is understood.

Cache stampede is not a sign that caching is broken. It is a sign that caching is working at scale and needs guardrails.

Final Thoughts

Designing for cache stampede early ensures Redis remains stable and predictable. Ignoring it allows small expiration events to expose the weakest parts of the system at the worst possible time.

A robust caching design plans for misses, rebuilds, and failure scenarios from the beginning. When this is done correctly, Redis rewards systems with consistent performance and reliability.