Operating Distributed Systems in Production

Asif Eqbal
May 21

Metrics, Alerts, and Reality

Designing distributed systems is intellectually satisfying. You draw boundaries, define contracts, and reason about behavior in controlled abstractions. Operating those systems in production is something else entirely. It is less about elegance and more about survival. Once real users arrive, theory gives way to pressure, and success depends as much on human understanding as on system design.

This is where metrics, alerts, and dashboards stop being accessories and become core infrastructure. Observability is not a layer you add after the fact; it is the interface between human decision-making and machine behavior. When it fails, the system may still be running, but the people responsible for it are effectively blind.

The difficult truth is that most production incidents are not caused by unknown unknowns. They are caused by signals that existed but were either ignored, misinterpreted, or buried under noise. Understanding why this happens requires accepting that observability is harder than architecture.

Why observability is harder than architecture

Architecture benefits from intentionality. You choose components, define responsibilities, and constrain interactions. Observability, by contrast, must deal with everything that emerges once those components start interacting at scale.

Moreover, architecture is usually evaluated in calm conditions. Observability is tested during failure, when cognitive load is high and time is limited. A design that looks reasonable on paper can become unusable under stress.

One of the most consistent lessons from operating large-scale analytics systems, including Meta live event analytics, is that volume changes everything. When traffic spikes suddenly, engineers do not want options; they want clarity. They want to know what is broken, how badly, and whether it is getting worse.

This means observability cannot be neutral; it must reflect priorities. Metrics are opinions encoded as numbers. Choosing what to measure is already an act of interpretation. Choosing how to present it determines whether that interpretation survives contact with reality.

Another challenge is that observability must serve multiple audiences. Developers want deep diagnostics. Operators want fast answers, while leadership wants impact summaries. Trying to satisfy all of them with the same views usually satisfies none of them well. Consequently, the hardest part of observability is not instrumentation. It is deciding which questions matter first when everything feels urgent.

This becomes especially visible when systems try to answer the simplest possible question: is it alive?

Heartbeat systems and liveness detection

Liveness sounds binary. A system is either up or down. In production, that distinction collapses almost immediately.

A service may respond to health checks while silently failing to process work. A data pipeline may be running while falling hours behind. A dashboard may load instantly while showing stale or misleading information. Therefore, naive health checks create false confidence.

Heartbeat systems must represent progress, not mere responsiveness. They must answer whether the system is doing useful work within acceptable bounds. That requires defining what progress actually means for each component.
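
The difference between responsiveness and progress can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the Heartbeat fields, the assess_liveness function, and the 60-second silence limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Snapshot a worker publishes about itself (hypothetical schema)."""
    emitted_at: float      # unix seconds when the heartbeat was written
    items_processed: int   # monotonically increasing work counter
    backlog: int           # items still waiting to be processed


def assess_liveness(prev: Heartbeat, curr: Heartbeat, now: float,
                    max_silence_s: float = 60.0) -> str:
    """Classify a worker as 'dead', 'stalled', or 'alive'."""
    # No recent heartbeat at all: the process is unresponsive.
    if now - curr.emitted_at > max_silence_s:
        return "dead"
    # Responsive, but work is pending and the counter has not moved.
    if curr.backlog > 0 and curr.items_processed <= prev.items_processed:
        return "stalled"
    # Either work is progressing or there is nothing to do.
    return "alive"
```

The "stalled" state is the one a naive health check cannot see: the worker still answers, yet nothing useful is happening. What counts as progress (items, bytes, committed offsets) would need to be defined per component.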

In large data processing environments, such as Microsoft PowerBI pipelines, liveness is inseparable from freshness. A report that updates late may be technically available but operationally useless. Users make decisions based on outdated information without realizing it.
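
One way to keep users from unknowingly acting on old data is to make the age of a report explicit wherever it is shown. The sketch below assumes a hypothetical freshness_label helper and a per-report freshness target; it is not tied to any particular reporting product.

```python
from datetime import datetime, timedelta, timezone


def freshness_label(last_refresh: datetime, now: datetime,
                    max_age: timedelta) -> str:
    """Describe how current a report is relative to its freshness target."""
    age = now - last_refresh
    minutes = int(age.total_seconds() // 60)
    if age <= max_age:
        return f"fresh (updated {minutes} min ago)"
    target = int(max_age.total_seconds() // 60)
    return f"STALE (updated {minutes} min ago, target {target} min)"


# Usage: a report with a 30-minute target that last refreshed 3 hours ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_label(now - timedelta(hours=3), now, timedelta(minutes=30)))
```

The same age value can feed a liveness signal, so that "available but late" is reported as a failure rather than as success.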

Moreover, heartbeat systems must be resilient to shared failure modes. If liveness checks depend on the same databases, networks, or schedulers as the workloads they observe, they will fail together. During incidents, those checks tend to lie optimistically.

Therefore, effective liveness detection requires isolation and redundancy. It also requires humility. No heartbeat is perfect. The goal is not certainty, but early warning. Once liveness is visible, teams naturally attempt to automate responses. This is where alerting enters the picture, often with unintended consequences.

Signal vs noise in alerting

Alerting systems are easy to build and difficult to live with.

At first, teams add alerts defensively. A metric looks important, so they add a threshold. Another metric fluctuates unexpectedly, so they add another alert. Over time, the alert surface grows faster than understanding.

Consequently, engineers receive notifications constantly. Many of them do not require action and some of them contradict each other. Eventually, the human response adapts. Alerts are muted, notifications are skimmed, and the system loses credibility.

This phenomenon is not theoretical. It appears consistently in real-world alert catalogs, including those structured around Google monitoring practices. The most effective alerting systems are intentionally sparse. Each alert corresponds to a decision that someone must make immediately.

Alerts should reflect direct user impact or imminent risk, not internal metrics in isolation.
High CPU usage is rarely actionable on its own. Latency and error rates are.
Queue growth only matters when it threatens deadlines or user experience.
Every alert must encode context, including ownership, recent changes, and expected response.
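
These principles can be made concrete with a small rule definition. The following is a sketch under assumptions: AlertRule, evaluate, and every field name are hypothetical, and the thresholds are placeholders. The rule fires only on user-facing symptoms (error rate and tail latency) and carries its owner, expected response, and most recent change in the payload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    name: str
    owner: str                 # team or rotation that must respond
    runbook: str               # expected response, as text or a link
    max_error_rate: float      # fraction of failed requests, 0.0 to 1.0
    max_p99_latency_ms: float  # tail latency users actually experience


def evaluate(rule: AlertRule, requests: int, errors: int,
             p99_latency_ms: float,
             recent_change: Optional[str] = None) -> Optional[dict]:
    """Return an alert payload if users are affected, otherwise None."""
    if requests == 0:
        # Zero traffic may itself be an outage; that case needs its own
        # rule and is out of scope for this sketch.
        return None
    error_rate = errors / requests
    reasons = []
    if error_rate > rule.max_error_rate:
        reasons.append(f"error rate {error_rate:.1%} > {rule.max_error_rate:.1%}")
    if p99_latency_ms > rule.max_p99_latency_ms:
        reasons.append(
            f"p99 latency {p99_latency_ms:.0f} ms > {rule.max_p99_latency_ms:.0f} ms")
    if not reasons:
        return None
    return {
        "alert": rule.name,
        "owner": rule.owner,
        "runbook": rule.runbook,
        "recent_change": recent_change or "none recorded",
        "reasons": reasons,
    }
```

Note what is absent: CPU usage never appears as a trigger. It can still be attached to the payload as diagnostic context, but it does not decide whether a human is interrupted.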

Another overlooked aspect is alert decay. Systems evolve, but alerts often do not. Thresholds that made sense six months ago quietly become irrelevant. Without regular pruning, alerting systems drift toward irrelevance.
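
Pruning is easier when the candidates are produced mechanically. The sketch below assumes the alert catalog records when each alert was last reviewed and how often a firing led to human action; AlertRecord, review_candidates, the 180-day review age, and the 0.5 action ratio are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class AlertRecord:
    name: str
    last_reviewed: datetime
    fired_count: int      # firings within the review window
    acted_on_count: int   # firings that led to a human action


def review_candidates(alerts: List[AlertRecord], now: datetime,
                      max_review_age: timedelta = timedelta(days=180),
                      min_action_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Return (alert name, reason) pairs for alerts that deserve a second look."""
    flagged = []
    for alert in alerts:
        if now - alert.last_reviewed > max_review_age:
            flagged.append((alert.name, "not reviewed recently"))
        elif (alert.fired_count > 0
              and alert.acted_on_count / alert.fired_count < min_action_ratio):
            flagged.append((alert.name, "mostly ignored when it fires"))
    return flagged
```

The output is a review list, not a deletion list. An alert that never fires may be guarding against a rare but severe failure, so the decision to remove it stays with a person.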

Once alerting noise is reduced, teams rediscover the value of dashboards, but only if those dashboards are designed for decisions rather than curiosity.

Real time dashboards for decision making

Dashboards fail when they try to show everything.

In production, especially during incidents, operators do not need comprehensive visibility. They need orientation. They need to understand direction and magnitude, not raw detail.

Therefore, effective real time dashboards emphasize trends, ratios, and deltas. They show whether conditions are stabilizing or deteriorating. They surface saturation points and failure boundaries.

Experience with Meta live event analytics highlights this clearly. During large live events, absolute numbers matter less than trajectories. Is join latency increasing or flattening? Is buffering recovering or worsening? Is error rate responding to mitigation?
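
A trajectory can be reduced to a single label that a dashboard shows next to the raw chart. The sketch below fits a least-squares slope to a recent window of samples; slope, trajectory, and the min_slope threshold are hypothetical names, and the labels assume a metric where higher is worse, such as join latency or error rate.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def trajectory(values: Sequence[float], min_slope: float) -> str:
    """Label a higher-is-worse metric as 'worsening', 'recovering', or 'flat'."""
    s = slope(values)
    if s > min_slope:
        return "worsening"
    if s < -min_slope:
        return "recovering"
    return "flat"


# Usage: latency samples in ms, treating under 1 ms per sample as noise.
print(trajectory([100, 110, 120, 130], min_slope=1.0))
```

Fitting over a window rather than comparing the last two points keeps a single noisy sample from flipping the label, which matters when an operator is deciding whether a mitigation is working.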

Moreover, dashboards must reflect how humans reason under stress. Visual hierarchy matters, labels must be unambiguous, and metrics must align with mental models. If an operator has to translate what a chart means before acting, the dashboard has failed its primary purpose.

Another common failure mode is dashboard fragmentation. Different teams build different views with inconsistent definitions. During incidents, this leads to debates about whose numbers are correct rather than what actions to take. Therefore, dashboards should encode shared truth. Definitions must be consistent. Ownership must be clear. Otherwise, visibility becomes a source of friction rather than alignment.

Still, even the best dashboards only describe the present. The future of the system is shaped elsewhere, often during the aftermath of failure.

Incident driven system evolution

Every major incident is a design review conducted by reality.

What distinguishes resilient organizations from fragile ones is not how often incidents occur, but how deeply they are integrated into system evolution. Incidents reveal where assumptions break down. They expose blind spots in observability and highlight friction in human workflows.

Therefore, incident analysis should focus on systems, not individuals. The goal is not to assign fault, but to identify mismatches between expectation and behavior. Questions that matter include which signals were missing, which alerts fired too late or too often, and which dashboards confused rather than clarified. These insights should translate into concrete changes.

Moreover, incident driven improvements must be prioritized. Not every lesson deserves immediate action, but repeated failure modes demand structural response. New metrics are added. Alerts are simplified. Recovery procedures are automated.

Over time, observability systems mature through this feedback loop. They become less noisy and more intentional. They reflect real failure patterns rather than imagined ones. Importantly, this evolution never finishes. Systems change, usage patterns shift, and new dependencies are introduced. Observability must evolve alongside them or it slowly decays into irrelevance.

Observability as part of the system

One of the most damaging misconceptions is treating observability as external to the system. In reality, it is a component with its own failure modes, dependencies, and operational costs. Metrics pipelines can lag, alerting services can fail, and dashboards can display misleading data. For that reason, observability infrastructure must itself be observable.

This recursive requirement feels excessive until the first time operators cannot trust their tools during an incident. At that point, uncertainty multiplies rapidly. Therefore, production-ready systems invest in observability reliability. They monitor metric ingestion latency, test alert delivery paths, and validate dashboard freshness. Trust is earned continuously, not assumed.
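
A minimal form of this self-monitoring is a check over timestamps of the most recent evidence that each part of the observability stack worked. The sketch below is an assumption-laden illustration: check_observability and the signal names (metric_ingestion, alert_canary_ack, dashboard_refresh) are hypothetical, as is the idea of sending a periodic synthetic "canary" alert and recording when it was acknowledged.

```python
from typing import Dict, List


def check_observability(now: float, signals: Dict[str, float],
                        limits: Dict[str, float]) -> List[str]:
    """Report observability components whose last evidence is missing or too old.

    signals maps a component name to the unix time it was last seen working;
    limits maps the same names to the maximum tolerated age in seconds.
    """
    problems = []
    for name, limit in limits.items():
        last_seen = signals.get(name)
        if last_seen is None:
            problems.append(f"{name}: no evidence recorded")
        elif now - last_seen > limit:
            problems.append(
                f"{name}: last seen {now - last_seen:.0f}s ago (limit {limit:.0f}s)")
    return problems
```

For the result to be trustworthy, a check like this would need to run somewhere that does not share storage, network, or scheduling with the tooling it inspects.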

Conclusion

To sum up, operating distributed systems is not about eliminating uncertainty. It is about reducing uncertainty fast enough for humans to act decisively.

Architecture defines what the system can do. Observability defines whether people can intervene when it cannot. Metrics, alerts, and dashboards are not secondary concerns; they are the interface between intention and reality.

Finally, production does not reward completeness or elegance; it rewards clarity under pressure. The systems that endure are not the ones with the most data, but the ones that make the right data unavoidable when it matters most.