
How to Monitor Kubernetes Clusters Using Prometheus and Grafana

Introduction

Monitoring Kubernetes clusters is not optional in modern DevOps environments—it is a critical requirement for ensuring application reliability, performance, and scalability.

In real-world production systems, simply installing monitoring tools is not enough. You must verify metric collection, visualize meaningful data, configure intelligent alerts, and handle scaling challenges such as storage and metric explosion.

Prometheus and Grafana together form a powerful, industry-standard monitoring stack:

  • Prometheus collects and stores metrics

  • Grafana visualizes those metrics through dashboards

This guide goes beyond basics and shows how to set up, verify, troubleshoot, and optimize Kubernetes monitoring in a production-ready way.

What is Kubernetes Monitoring?

Kubernetes monitoring involves tracking cluster health, performance, and resource usage in real time.

This includes:

  • CPU and memory usage

  • Pod health and restart counts

  • Node performance

  • Network traffic and latency

Without proper monitoring, identifying issues like memory leaks, node failures, or slow APIs becomes extremely difficult.
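Before installing a full monitoring stack, a quick way to spot-check resource usage is kubectl top. Note this relies on the metrics-server add-on being installed in the cluster; it is separate from Prometheus:

```shell
# Requires the metrics-server add-on (not part of Prometheus)
kubectl top nodes       # CPU and memory per node
kubectl top pods -A     # CPU and memory per pod, across all namespaces
```

This gives a point-in-time snapshot only; Prometheus adds history, queries, and alerting on top.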

Installing Prometheus and Grafana Using Helm

Step 1: Add Helm Repository

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Step 2: Install kube-prometheus-stack

helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

Expected Output (Example)

NAME: monitoring
LAST DEPLOYED: <timestamp>
NAMESPACE: monitoring
STATUS: deployed
NOTES:
Visit Grafana at http://localhost:3000

Verifying Installation (Critical Step)

Check Pods

kubectl get pods -n monitoring

Expected:

monitoring-kube-prometheus-prometheus-0     2/2 Running
monitoring-grafana-xxxxx                    3/3 Running
monitoring-alertmanager-0                   2/2 Running

Check Services

kubectl get svc -n monitoring

Port Forward Grafana

kubectl port-forward svc/monitoring-grafana -n monitoring 3000:80

Access: http://localhost:3000

Default credentials:

  • Username: admin

  • Password: prom-operator
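If the password was customized at install time, it can be read back from the Kubernetes secret the chart creates (the secret name below assumes a release named monitoring; adjust if yours differs):

```shell
kubectl get secret monitoring-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo
```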

Verify Prometheus is Scraping Metrics

Port-forward Prometheus:

kubectl port-forward svc/monitoring-kube-prometheus-prometheus -n monitoring 9090

Open: http://localhost:9090/targets

You should see targets with status:

  • UP → working correctly

  • DOWN → needs troubleshooting
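Target health can also be checked from the command line through Prometheus's HTTP API (this assumes the port-forward above is running and that jq is installed):

```shell
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```

Any entry with "health": "down" corresponds to a DOWN target in the UI.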

Real PromQL Examples (Must-Know)

CPU Usage Per Node

sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)

Pod Restart Count

increase(kube_pod_container_status_restarts_total[10m])

Memory Usage

container_memory_working_set_bytes

Note: container_memory_usage_bytes also exists, but it includes page cache; the working-set metric is what the kubelet uses for OOM decisions, so it is usually the better signal.

These queries can be used directly in Grafana dashboards.
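The same queries can also be tested outside Grafana against Prometheus's query API (assumes the port-forward to port 9090 shown earlier):

```shell
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)' \
  | jq '.data.result'
```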

Grafana Dashboards

Once logged into Grafana:

  • Navigate to Dashboards

  • Use pre-built dashboards

  • Import community dashboards if needed

Common dashboards include:

  • Cluster overview

  • Node metrics

  • Pod performance
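The dashboards shipped by the chart can also be listed through Grafana's HTTP API, using the default credentials shown earlier:

```shell
curl -s -u admin:prom-operator \
  'http://localhost:3000/api/search?type=dash-db' | jq '.[].title'
```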

Alertmanager and Alerting (Production Approach)

Prometheus uses Alertmanager to handle alerts.

Example Alert Rule

groups:
- name: kubernetes-alerts
  rules:
  - alert: HighCPUUsage
    expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) / sum by (instance) (rate(node_cpu_seconds_total[5m])) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      description: High CPU usage detected
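With kube-prometheus-stack, a rule group like the one above is normally delivered as a PrometheusRule custom resource that the operator picks up, rather than edited into prometheus.yml. A minimal sketch (the release: monitoring label assumes the Helm release name used earlier and must match the Prometheus ruleSelector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: monitoring   # must match your Helm release name
spec:
  groups:
  - name: kubernetes-alerts
    rules:
    - alert: HighCPUUsage
      expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) / sum by (instance) (rate(node_cpu_seconds_total[5m])) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        description: "High CPU usage detected on {{ $labels.instance }}"
```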

Avoiding Alert Fatigue

Do NOT alert on every small event.

Bad practice:

  • Alert on single pod restart

Good practice:

expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
for: 5m

This ensures alerts trigger only for real issues, not normal deployments.
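Put together, a restart alert that tolerates normal rollouts might look like this (the threshold and window are illustrative; tune them to your workloads):

```yaml
- alert: PodRestartingFrequently
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 10 minutes"
```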

Troubleshooting Common Issues

Issue 1: Grafana Not Accessible

Error:

connection refused

Fix:

kubectl get pods -n monitoring
kubectl describe pod <grafana-pod> -n monitoring
kubectl logs <grafana-pod> -n monitoring

Issue 2: Targets Showing DOWN

Check:

kubectl get pods -n monitoring
kubectl logs <prometheus-pod> -n monitoring -c prometheus

Common reasons:

  • Network issues

  • Incorrect service discovery

Issue 3: Port-Forward Fails

Error:

unable to forward port

Fix:

  • Ensure pod is running

  • Restart port-forward

Storage and Retention Strategy (Production Insight)

Default Prometheus retention is limited (around 15 days).

Option 1: Increase Retention

prometheus:
  prometheusSpec:
    retention: 30d

Pros:

  • Simple

Cons:

  • High disk usage
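The retention setting above is applied by upgrading the release with a values file (assuming the snippet is saved as values.yaml):

```shell
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml
```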

Option 2: Use Thanos or Cortex

Pros:

  • Scalable

  • Long-term storage

Cons:

  • More complex setup

Handling High Cardinality (Advanced Topic)

High cardinality occurs when too many unique labels create excessive time series.

Bad example:

http_requests_total{request_id="abc123"}

This creates massive memory usage.
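High-cardinality offenders can be found directly in Prometheus; this query lists the ten metric names with the most time series:

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```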

Solution: Drop Unnecessary Labels

metric_relabel_configs:
- action: labeldrop
  regex: request_id

(Note: labeldrop removes the matching label from every scraped series. A relabel_configs rule with action: drop would instead drop entire targets, which is not what we want here.)

This reduces metric explosion and improves performance.

Best Practices for Production Monitoring

  • Monitor both infrastructure and application metrics

  • Use meaningful alert thresholds

  • Avoid alert noise

  • Regularly review dashboards

  • Optimize Prometheus resource usage

Real-World Scenario

A production team noticed frequent CPU spikes at night.

Using Grafana dashboards, they identified a scheduled batch job causing high load. By optimizing the job, they reduced CPU usage and improved system stability.

Summary

Monitoring Kubernetes using Prometheus and Grafana requires more than installation—it requires validation, alert tuning, troubleshooting, and scaling strategies. By implementing proper verification steps, using real PromQL queries, configuring smart alerts, and handling storage and cardinality challenges, teams can build a reliable and production-ready monitoring system.