Introduction
Teams that have operated Redis in production for years tend to agree on one lesson: Redis itself is rarely the source of serious problems. The real issues usually come from data that should have expired but did not.
TTL often appears trivial during early development. Teams focus on speed, reducing database calls, and improving cache hit ratios. As a result, TTL is treated as an afterthought. Over time, this leads to stale data, unpredictable bugs, and emergency Redis flushes.
TTL is not merely a configuration option. It defines the lifecycle of cached data.
At its core, TTL answers a fundamental question: how long is this data safe to use before it becomes a liability? When teams start thinking about TTL in these terms, their Redis design decisions change significantly.
Every Cache Key Must Have a TTL
The most important rule in Redis caching is also the most frequently ignored: every cache key must have a TTL, without exception.
If a key does not expire, it is no longer a cache. It becomes memory that the system hopes will never become incorrect. In real-world systems, data changes continuously. Business rules evolve, bugs surface, and deployments introduce new assumptions. Without TTL, outdated data can persist far longer than expected.
Infinite TTL often feels convenient. It inflates cache hit ratios and reduces database traffic. However, it also removes any natural recovery mechanism. When something goes wrong, the only options are flushing Redis entirely or manually hunting for keys, neither of which scales in production environments.
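One way to enforce the rule is to make TTL a required argument in the caching layer, so a non-expiring key cannot be written by accident. Here is a minimal sketch in Python with redis-py; the cache_set helper is a hypothetical wrapper, not a redis-py API.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_set(key: str, value: str, ttl_seconds: int) -> None:
    """Write a cache key; the TTL is mandatory, not optional."""
    if ttl_seconds <= 0:
        raise ValueError("every cache key must have a positive TTL")
    # SET with EX stores the value and its expiration atomically.
    r.set(key, value, ex=ttl_seconds)

cache_set("user:42:profile", '{"name": "Ada"}', ttl_seconds=600)
```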
TTL Depends on the Nature of the Data
A reliable TTL strategy starts with understanding the characteristics of the data being cached. Treating all cached data the same is a common mistake.
Some data changes frequently, such as user sessions, permissions, rate limits, and feature flags. For these, the cost of being wrong is high, so TTLs should be short.
Other data changes occasionally, including user profiles, preferences, and aggregated API responses. These can tolerate longer TTLs but still require periodic refresh.
Some data rarely changes, such as reference data, lookup tables, or configuration values. Even this data should expire eventually, as assumptions and dependencies evolve over time.
Practical TTL Ranges in Production
In practice, many production systems converge on TTL ranges similar to the following:
User profile data: 5 to 15 minutes
Permissions and authorization data: 1 to 5 minutes
API response caching: 30 seconds to a few minutes
Configuration data: 1 hour or more
Reference or lookup data: 12 to 24 hours
These values are not primarily about performance. They are about safety. Shorter TTLs reduce the impact when something goes wrong. Redis is fast enough that rebuilding cache entries is usually far cheaper than serving stale data to large numbers of users.
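One convenient way to keep these ranges consistent across a codebase is to centralize them, so individual call sites never pick ad-hoc values. A sketch, with hypothetical data-class names:

```python
# Hypothetical per-data-class TTLs (seconds), mirroring the ranges above.
TTL_SECONDS = {
    "user_profile": 10 * 60,        # 5 to 15 minutes
    "permissions": 2 * 60,          # 1 to 5 minutes
    "api_response": 60,             # 30 seconds to a few minutes
    "configuration": 60 * 60,       # 1 hour or more
    "reference_data": 24 * 60 * 60, # 12 to 24 hours
}

def ttl_for(data_class: str) -> int:
    # KeyError on an unknown class: fail loudly rather than cache forever.
    return TTL_SECONDS[data_class]
```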
Cache Hit Ratio Is Not the Goal
A common mistake is optimizing aggressively for cache hit ratio. High hit rates look impressive in dashboards, but they are not the true objective.
Correctness under change is the real goal. A cache miss costs milliseconds. Serving incorrect data damages user trust and system reliability.
Combining TTL With Explicit Invalidation
TTL works best when combined with explicit cache invalidation.
Explicit invalidation handles the changes you know about, such as updates to user data. TTL covers the changes you forget, the failures you cannot anticipate, and the edge cases in between.
Relying solely on TTL or solely on manual invalidation eventually leads to failure. Used together, they provide a resilient and self-healing caching strategy.
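In code, the combination looks roughly like this: writes invalidate explicitly, and reads repopulate with a TTL as the backstop. The database helpers below are hypothetical stubs; the Redis calls are standard redis-py.

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def load_profile_from_db(user_id: int) -> dict:
    ...  # hypothetical database read

def save_profile_to_db(user_id: int, profile: dict) -> None:
    ...  # hypothetical database write

def update_user_profile(user_id: int, profile: dict) -> None:
    save_profile_to_db(user_id, profile)
    # Explicit invalidation for the changes we know about.
    r.delete(f"user:{user_id}:profile")

def get_user_profile(user_id: int) -> dict:
    key = f"user:{user_id}:profile"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = load_profile_from_db(user_id)
    # The TTL is the safety net for the invalidations we forget.
    r.set(key, json.dumps(profile), ex=600)
    return profile
```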
Avoiding Cache Stampedes With TTL Jitter
Synchronized expiration is a common production issue. When many keys share the same TTL, they tend to expire simultaneously. This causes large numbers of requests to fall back to the database at once, leading to latency spikes and potential outages.
This phenomenon, known as a cache stampede, can be mitigated by adding small random offsets to TTL values. Instead of assigning identical TTLs, a base value is combined with a small random adjustment. This naturally spreads expirations over time and prevents sudden load spikes.
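A jittered TTL can be as simple as a base value plus a bounded random offset. A minimal sketch; the ±10% fraction is an assumption, not a universal constant:

```python
import random
import redis

r = redis.Redis(decode_responses=True)

def ttl_with_jitter(base_seconds: int, jitter_fraction: float = 0.1) -> int:
    """Return the base TTL shifted by a random offset of up to ±10%."""
    jitter = int(base_seconds * jitter_fraction)
    return base_seconds + random.randint(-jitter, jitter)

# Keys written in the same batch no longer expire in the same instant.
for user_id in range(1000):
    r.set(f"user:{user_id}:profile", "...", ex=ttl_with_jitter(600))
```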
Sliding Versus Absolute Expiration
Sliding expiration appears attractive because it keeps frequently accessed data alive. In practice, it often leads to keys that never expire, especially in distributed systems.
Redis's native expiration is absolute: EXPIRE and SET with EX fix a deadline that is not extended by reads. This is generally preferable, since absolute TTL is predictable and guarantees that data is refreshed periodically. Sliding expiration, if needed, has to be implemented by renewing the TTL on access, and should be applied selectively rather than as a default approach.
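When sliding behavior is genuinely needed, for example for sessions, it can be implemented by renewing the TTL on each read. The sketch below assumes a session keyspace; pairing it with a hard absolute cap would additionally require storing the creation time.

```python
import redis

r = redis.Redis(decode_responses=True)

SESSION_TTL = 30 * 60  # sliding window, applied to sessions only

def get_session(session_id: str) -> str | None:
    key = f"session:{session_id}"
    value = r.get(key)
    if value is not None:
        # Selective sliding expiration: each successful read renews the TTL.
        r.expire(key, SESSION_TTL)
    return value
```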
TTL and Redis Memory Management
TTL plays a critical role when Redis operates under memory pressure. Under the volatile family of eviction policies (volatile-lru, volatile-lfu, volatile-random, volatile-ttl), only keys with a TTL are eligible for eviction at all, so non-expiring keys remain in memory no matter how much pressure builds.
Mixing expiring and non-expiring keys makes eviction patterns difficult to reason about, further reinforcing why TTL should be applied consistently.
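The eviction policy makes this concrete. For instance, volatile-ttl tells Redis to evict only keys that carry a TTL, preferring those closest to expiry. A quick way to inspect or change it from redis-py (in production this usually belongs in redis.conf):

```python
import redis

r = redis.Redis(decode_responses=True)

# volatile-* policies consider only keys with a TTL; volatile-ttl prefers
# the keys whose expiration is nearest.
r.config_set("maxmemory-policy", "volatile-ttl")
print(r.config_get("maxmemory-policy"))
```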
Monitoring TTL Behavior in Production
TTL behavior should be monitored just like latency and error rates.
If the total key count grows steadily while expired keys per second remain low, TTLs may not be applied correctly. Sudden eviction spikes can indicate TTLs that are too long or memory limits that are too restrictive. These warning signs often appear in metrics well before users notice problems.
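Redis exposes the relevant counters through INFO. The fields below (expired_keys, evicted_keys, per-database key counts) are cumulative, so monitoring systems typically convert them into rates:

```python
import redis

r = redis.Redis(decode_responses=True)

stats = r.info("stats")
keyspace = r.info("keyspace")

# Warning sign: total keys climbing while expirations stay flat.
print("keys in db0:", keyspace.get("db0", {}).get("keys", 0))
print("expired keys (cumulative):", stats["expired_keys"])
print("evicted keys (cumulative):", stats["evicted_keys"])
```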
Value Size and TTL Considerations
The size of cached values matters. Large objects should generally have shorter TTLs. They consume more memory, increase serialization costs, and put additional pressure on Redis.
If data is both large and long-lived, it is worth questioning whether Redis is the appropriate storage layer.
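A size-aware write path is one way to encode this rule. The 100 KB threshold and the two TTL tiers below are assumptions for illustration:

```python
import json
import redis

r = redis.Redis()

def cache_with_size_aware_ttl(key: str, value: dict) -> None:
    payload = json.dumps(value).encode()
    # Hypothetical policy: values over 100 KB live 1 minute, others 10.
    ttl = 60 if len(payload) > 100 * 1024 else 600
    r.set(key, payload, ex=ttl)
```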
Safety TTLs for Special Purpose Keys
Even special-purpose keys such as distributed locks or coordination flags should have safety TTLs. Processes can crash and networks can partition. A lock without TTL can become permanent, creating difficult-to-diagnose system failures.
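The standard pattern is SET with NX and EX, which acquires the lock and attaches its safety TTL in one atomic step. A sketch; note that the check-then-delete in release_lock is racy and would need a Lua script to be fully safe:

```python
import uuid
import redis

r = redis.Redis(decode_responses=True)

def acquire_lock(name: str, ttl_seconds: int = 30) -> str | None:
    token = str(uuid.uuid4())
    # NX: only set if absent. EX: the safety TTL that frees a crashed holder.
    if r.set(f"lock:{name}", token, nx=True, ex=ttl_seconds):
        return token
    return None

def release_lock(name: str, token: str) -> None:
    key = f"lock:{name}"
    # Only the holder may release; a Lua script would make this atomic.
    if r.get(key) == token:
        r.delete(key)
```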
A Practical TTL Decision Framework
Before caching any data, it is useful to ask four questions:
How often can this data change?
What happens if the data is wrong?
How expensive is it to rebuild?
Who is affected when it becomes stale?
The most conservative answer across these four dimensions usually determines the correct TTL: take the shortest value any of them suggests.
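Mechanically, the framework reduces to taking the minimum of the TTLs each question suggests. A hypothetical sketch:

```python
def choose_ttl(candidate_ttls: dict[str, int]) -> int:
    """Pick the shortest TTL suggested by any of the four questions."""
    return min(candidate_ttls.values())

ttl = choose_ttl({
    "change_frequency": 15 * 60,       # may change every ~15 minutes
    "cost_of_being_wrong": 5 * 60,     # wrong data is expensive here
    "rebuild_cost": 60 * 60,           # cheap to rebuild, could go longer
    "staleness_blast_radius": 10 * 60, # many users affected when stale
})
```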
Final Thoughts
When TTL is designed deliberately, Redis behaves predictably. Cache misses are expected, refreshes are routine, and systems recover naturally from unexpected states.
When TTL is treated casually, Redis becomes difficult to understand. Teams resort to frequent flushes, and confidence in the system erodes.
TTL is not a tuning parameter to adjust later. It is a core part of system design from the first day.
When TTL is done correctly, Redis fades into the background. That is where reliable infrastructure belongs.