Introduction
Monitoring Kubernetes clusters is not optional in modern DevOps environments—it is a critical requirement for ensuring application reliability, performance, and scalability.
In real-world production systems, simply installing monitoring tools is not enough. You must verify metric collection, visualize meaningful data, configure intelligent alerts, and handle scaling challenges such as storage and metric explosion.
Prometheus and Grafana together form a powerful, industry-standard monitoring stack:
Prometheus collects and stores time-series metrics and evaluates alert rules.
Grafana visualizes those metrics through dashboards and panels.
This guide goes beyond basics and shows how to set up, verify, troubleshoot, and optimize Kubernetes monitoring in a production-ready way.
What is Kubernetes Monitoring?
Kubernetes monitoring involves tracking cluster health, performance, and resource usage in real time.
This includes:
Node and pod resource usage (CPU, memory, disk, network)
Health of control-plane components and workloads
Application-level metrics such as request rates, latency, and error rates
Without proper monitoring, identifying issues like memory leaks, node failures, or slow APIs becomes extremely difficult.
Installing Prometheus and Grafana Using Helm
Step 1: Add Helm Repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Step 2: Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
Expected Output (Example)
NAME: monitoring
LAST DEPLOYED: <timestamp>
NAMESPACE: monitoring
STATUS: deployed
NOTES:
Visit Grafana at http://localhost:3000
Verifying Installation (Critical Step)
Check Pods
kubectl get pods -n monitoring
Expected:
monitoring-kube-prometheus-prometheus-0 2/2 Running
monitoring-grafana-xxxxx 3/3 Running
monitoring-alertmanager-0 2/2 Running
Check Services
kubectl get svc -n monitoring
Port Forward Grafana
kubectl port-forward svc/monitoring-grafana -n monitoring 3000:80
Access: http://localhost:3000
Default credentials:
Username: admin
Password: prom-operator
Verify Prometheus is Scraping Metrics
Port-forward Prometheus:
kubectl port-forward svc/monitoring-kube-prometheus-prometheus -n monitoring 9090
Open: http://localhost:9090/targets
You should see all targets with status UP.
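Beyond eyeballing the targets page, the same information is available as JSON from Prometheus's `/api/v1/targets` endpoint. The sketch below is a minimal example, assuming the standard shape of that API response (the sample payload is illustrative), that extracts any active targets whose health is not `up`:

```python
import json

def find_down_targets(targets_json: str) -> list:
    """Return scrape URLs of active targets whose health is not 'up'."""
    data = json.loads(targets_json)
    return [
        t["scrapeUrl"]
        for t in data["data"]["activeTargets"]
        if t["health"] != "up"
    ]

# Sample payload shaped like the real /api/v1/targets response.
sample = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://10.0.0.1:9100/metrics", "health": "up"},
            {"scrapeUrl": "http://10.0.0.2:9100/metrics", "health": "down"},
        ]
    },
})

print(find_down_targets(sample))  # only the unhealthy target is reported
```

In practice you would feed this the body of `curl http://localhost:9090/api/v1/targets` after the port-forward above.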
Real PromQL Examples (Must-Know)
CPU Usage Per Node
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
Pod Restart Count
increase(kube_pod_container_status_restarts_total[10m])
Memory Usage
container_memory_usage_bytes
These queries can be used directly in Grafana dashboards.
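These queries can also be run outside Grafana against Prometheus's HTTP API (`GET /api/v1/query?query=...`), as long as the PromQL is URL-encoded. A small sketch, assuming the port-forwarded Prometheus at localhost:9090 from the previous section (the URL is built but not requested here):

```python
from urllib.parse import urlencode

PROM_URL = "http://localhost:9090/api/v1/query"  # assumes the port-forward above

def query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROM_URL}?{urlencode({'query': promql})}"

url = query_url('sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)')
print(url)
```

Fetching that URL returns a JSON body whose `data.result` list holds one entry per `instance`.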
Grafana Dashboards
Once logged into Grafana, browse the dashboards that ship with kube-prometheus-stack.
Common dashboards include:
Cluster overview
Node metrics
Pod performance
Alertmanager and Alerting (Production Approach)
Prometheus uses Alertmanager to handle alerts.
Example Alert Rule
groups:
- name: kubernetes-alerts
  rules:
  - alert: HighCPUUsage
    expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      description: High CPU usage detected
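With kube-prometheus-stack, custom rules are usually delivered as a PrometheusRule resource rather than edited into the Prometheus config directly. A sketch of how a HighCPUUsage rule like the one above could be wrapped, assuming the chart's default rule discovery (the `release: monitoring` label matches a release named `monitoring`, as installed earlier):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: monitoring   # must match the operator's ruleSelector
spec:
  groups:
  - name: kubernetes-alerts
    rules:
    - alert: HighCPUUsage
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        description: High CPU usage detected
```

Apply it with `kubectl apply -f` and the operator reloads Prometheus with the new rule.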
Avoiding Alert Fatigue
Do NOT alert on every small event.
Bad practice:
expr: kube_pod_container_status_restarts_total > 0
Good practice:
expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
for: 5m
This ensures alerts trigger only for real issues, not normal deployments.
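Put together, the good-practice expression might look like this as a complete rule (the alert name and annotation wording are illustrative):

```yaml
- alert: FrequentPodRestarts
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    description: Pod {{ $labels.pod }} restarted more than 3 times in 10 minutes
```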
Troubleshooting Common Issues
Issue 1: Grafana Not Accessible
Error:
connection refused
Fix:
kubectl get pods -n monitoring
kubectl describe pod <grafana-pod>
Issue 2: Targets Showing DOWN
Check:
kubectl get pods -n monitoring
kubectl logs <prometheus-pod>
Common reasons:
The target pod is not ready or has crashed
A NetworkPolicy or firewall is blocking the scrape
The ServiceMonitor selects the wrong port or path
Issue 3: Port-Forward Fails
Error:
unable to forward port
Fix:
Ensure pod is running
Restart port-forward
Storage and Retention Strategy (Production Insight)
Default Prometheus retention is limited (around 15 days).
Option 1: Increase Retention
prometheus:
  prometheusSpec:
    retention: 30d
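Retention is set through the chart's values and applied with a Helm upgrade. A sketch of a fuller values fragment, assuming the release name used above (`retentionSize` and the storage request are illustrative numbers to adjust for your workload):

```yaml
# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB   # whichever limit is hit first wins
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
```

Apply with `helm upgrade monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.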
Pros:
Simple to configure; no extra components
Cons:
Disk usage grows with retention, and storage remains local to a single Prometheus instance
Option 2: Use Thanos or Cortex
Pros:
Scalable
Long-term storage
Cons:
Additional operational complexity
Requires object storage (e.g., S3 or GCS)
Handling High Cardinality (Advanced Topic)
High cardinality occurs when too many unique labels create excessive time series.
Bad example:
http_requests_total{request_id="abc123"}
This creates massive memory usage.
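The scale of the problem is multiplicative: the number of time series is roughly the product of the cardinalities of all labels on a metric. A quick back-of-the-envelope sketch (the label counts are hypothetical):

```python
from math import prod

# Hypothetical label cardinalities for http_requests_total.
labels_ok = {"method": 5, "status": 6, "path": 50}
labels_bad = {**labels_ok, "request_id": 1_000_000}  # one value per request

series_ok = prod(labels_ok.values())
series_bad = prod(labels_bad.values())

print(series_ok)   # 1500 series: easily manageable
print(series_bad)  # 1,500,000,000 series: far beyond what one Prometheus can hold
```

A single unbounded label turns a harmless metric into millions of series, each kept in memory by Prometheus.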
Solution: Drop Unnecessary Labels
metric_relabel_configs:
- regex: request_id
  action: labeldrop
This reduces metric explosion and improves performance.
Best Practices for Production Monitoring
Monitor both infrastructure and application metrics
Use meaningful alert thresholds
Avoid alert noise
Regularly review dashboards
Optimize Prometheus resource usage
Real-World Scenario
A production team noticed frequent CPU spikes at night.
Using Grafana dashboards, they identified a scheduled batch job causing high load. By optimizing the job, they reduced CPU usage and improved system stability.
Summary
Monitoring Kubernetes using Prometheus and Grafana requires more than installation—it requires validation, alert tuning, troubleshooting, and scaling strategies. By implementing proper verification steps, using real PromQL queries, configuring smart alerts, and handling storage and cardinality challenges, teams can build a reliable and production-ready monitoring system.