Debugging a failing Kubernetes pod in production requires a structured, low-risk, and observability-driven approach. In production clusters, downtime directly impacts users, so engineers must diagnose issues without introducing further instability. Kubernetes provides built-in tooling for inspecting pod health, container logs, events, networking, configuration, and resource usage.
This step-by-step guide explains how to systematically debug a failing pod in a production-grade Kubernetes environment.
Step 1: Identify the Pod Status
Start by checking the pod state:
kubectl get pods -n <namespace>
Common failing states include:
CrashLoopBackOff
ImagePullBackOff
ErrImagePull
Pending
OOMKilled (surfaced as the container's last termination reason)
Completed unexpectedly (for workloads meant to run continuously)
Understanding the state narrows the root cause category.
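The triage in this step can be scripted. A minimal sketch, assuming a POSIX shell and the default `kubectl get pods` column layout (`failing_pods` is a hypothetical helper name):

```shell
# Filter a `kubectl get pods` listing down to pods that are not healthy.
# Reads the listing on stdin, so it works on live output or a saved sample.
failing_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Live usage (assumes cluster access):
#   kubectl get pods -n <namespace> | failing_pods
# Offline demonstration with sample output:
failing_pods <<'EOF'
NAME        READY   STATUS             RESTARTS   AGE
web-1       1/1     Running            0          3d
worker-2    0/1     CrashLoopBackOff   12         1h
batch-9     0/1     ImagePullBackOff   0          5m
EOF
```

The same filter can feed alerting or a quick loop over `kubectl describe` for each failing pod.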
Step 2: Describe the Pod for Detailed Events
Use the describe command to inspect events and conditions:
kubectl describe pod <pod-name> -n <namespace>
Key sections to analyze: Events, the container State and Last State fields, and pod Conditions.
Event messages often reveal immediate causes such as insufficient memory, failed scheduling, or failed image pulls.
Step 3: Check Container Logs
Inspect logs from the failing container:
kubectl logs <pod-name> -n <namespace>
If the container restarts frequently:
kubectl logs <pod-name> --previous -n <namespace>
Common log-based issues:
Application startup failure
Database connection errors
Missing environment variables
Dependency timeouts
Port binding conflicts
Logs are the most direct source of application-level failure information.
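A quick way to surface the issues listed above is to grep the log stream for known failure signatures. A hedged sketch (`scan_logs` is a hypothetical helper; the signature list is illustrative, not exhaustive):

```shell
# Grep pod logs for common failure signatures.
# Reads log text on stdin so it works on live logs or saved samples.
scan_logs() {
  grep -iE 'connection refused|timeout|missing|address already in use|ECONNREFUSED|permission denied'
}

# Live usage (assumes cluster access):
#   kubectl logs <pod-name> --previous -n <namespace> | scan_logs
# Offline demonstration:
scan_logs <<'EOF'
2024-05-01T10:00:01Z starting server on :3000
2024-05-01T10:00:02Z error: connect ECONNREFUSED db:5432
EOF
```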
Step 4: Investigate CrashLoopBackOff
CrashLoopBackOff indicates repeated container crashes.
Check the container's exit code and termination reason in the describe output, then review liveness and readiness probes in the Deployment manifest.
Example probe configuration:
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 15
Improper probe timing can cause premature restarts.
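The timing risk can be made concrete with a small calculation. Assuming the kubelet default failureThreshold of 3 (as in the manifest above, which leaves it unset), the rough time budget before the first liveness-triggered restart is initialDelaySeconds + failureThreshold × periodSeconds (`probe_grace` is a hypothetical helper name):

```shell
# Rough time budget before a liveness probe kills a container:
# initialDelaySeconds + failureThreshold * periodSeconds.
# failureThreshold defaults to 3 when unset.
probe_grace() {
  initial=$1 period=$2 failures=${3:-3}
  echo $(( initial + failures * period ))
}

# For a probe with initialDelaySeconds 10 and periodSeconds 15:
probe_grace 10 15   # prints 55
```

An application that takes longer than that to become healthy will crash-loop even though nothing is wrong with the code; a startupProbe or larger initialDelaySeconds is the usual fix.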
Step 5: Debug ImagePullBackOff or ErrImagePull
Common causes:
Misspelled image name or tag
Missing or invalid registry credentials
Network or firewall issues reaching the registry
Verify the image name and tag by pulling it from a machine with registry access:
docker pull <image-name>
Check image pull secrets:
kubectl get secrets -n <namespace>
Ensure the service account references the correct secret.
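A sketch of that verification, assuming a dockerconfigjson-type pull secret (the names in angle brackets are placeholders; only the decode step is demonstrated offline):

```shell
# Which pull secret does the service account reference, and what is in it?
# Live commands (require cluster access):
#   kubectl get serviceaccount default -n <namespace> \
#     -o jsonpath='{.imagePullSecrets[*].name}'
#   kubectl get secret <secret-name> -n <namespace> \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# Offline demonstration of the decode step with a sample value:
sample=$(printf '{"auths":{"registry.example.com":{}}}' | base64)
printf '%s' "$sample" | base64 -d
```

The decoded JSON should list the registry host the failing image is pulled from; a mismatch here is a common root cause.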
Step 6: Investigate Pending Pods
If the pod remains in Pending state, inspect scheduling issues:
kubectl describe pod <pod-name>
Common causes:
Insufficient CPU or memory on any schedulable node
Node selectors or affinity rules that no node satisfies
Taints without matching tolerations
Unbound PersistentVolumeClaims
Check node resources:
kubectl top nodes
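The capacity check behind a Pending pod can be sketched as a simple comparison of the pod's memory request against per-node headroom. This is a rough offline model, not a scheduler simulation (`fits_somewhere` is a hypothetical helper; real numbers would come from `kubectl describe nodes`):

```shell
# Does any node have headroom for the pod's memory request?
# Arguments: request in Mi; stdin: lines of "node allocatable_Mi used_Mi".
fits_somewhere() {
  request=$1
  awk -v req="$request" '{ if ($2 - $3 >= req) { print $1; found=1 } } END { exit !found }'
}

# Sample data: node-a has only 196Mi free, node-b has 2192Mi free.
fits_somewhere 512 <<'EOF'
node-a 4096 3900
node-b 8192 6000
EOF
```

If no node prints, the request cannot be placed and the fix is to lower the request, add capacity, or relax constraints.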
Step 7: Check Resource Limits and OOMKilled
If the pod is OOMKilled, the container exceeded its memory limit and was terminated by the kernel's OOM killer (exit code 137).
Check resource configuration in Deployment:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Increase limits cautiously and monitor memory usage:
kubectl top pod <pod-name>
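Rather than inferring an OOM kill from symptoms, the termination reason can be read from the container status. A sketch (the jsonpath is real kubectl syntax; the pod name is a placeholder, and `reason_of` is a hypothetical helper demonstrated on a sample status blob):

```shell
# Confirm the kill reason from the container status rather than guessing.
# Live command:
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Offline demonstration: extract the reason field from a sample status.
reason_of() {
  sed -n 's/.*"reason":"\([^"]*\)".*/\1/p'
}
printf '{"lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}' | reason_of
```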
Step 8: Exec into the Container for Live Debugging
If the container is running but misbehaving:
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
Inside the container, verify environment variables, mounted files, DNS resolution, and connectivity to dependent services.
Example network test:
curl http://dependent-service:8080
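The curl exit code already narrows the failure mode. A small decoder for the common codes (these exit-code meanings are standard curl behavior; `explain_curl` is a hypothetical helper, and note that minimal images may lack curl, in which case wget or nc are common fallbacks):

```shell
# Interpret common curl exit codes when testing connectivity from inside a pod.
explain_curl() {
  case $1 in
    0)  echo "reachable" ;;
    6)  echo "DNS resolution failed - check Service name and cluster DNS" ;;
    7)  echo "connection refused - nothing listening or NetworkPolicy blocking" ;;
    28) echo "timeout - likely network path or overloaded dependency" ;;
    *)  echo "other failure (curl exit $1)" ;;
  esac
}

# Usage inside the container:
#   curl -s http://dependent-service:8080; explain_curl $?
explain_curl 6
```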
Step 9: Check ConfigMaps and Secrets
Configuration errors often cause startup failures.
Verify mounted configurations:
kubectl get configmap <config-name> -n <namespace> -o yaml
kubectl get secret <secret-name> -n <namespace> -o yaml
Ensure base64 values decode correctly and match expected credentials.
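That decode-and-compare check can be scripted. A minimal sketch (`decodes_to` and the `DB_PASSWORD` key are hypothetical; only the decode step runs offline):

```shell
# Check that a single secret value decodes to the expected credential.
# Live command for the encoded value:
#   kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.DB_PASSWORD}'
decodes_to() {
  [ "$(printf '%s' "$1" | base64 -d)" = "$2" ]
}

# Offline demonstration with a sample value:
decodes_to "$(printf 's3cret' | base64)" 's3cret' && echo match
```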
Step 10: Inspect Node-Level Issues
Sometimes the issue is node-related rather than application-specific.
Check node status:
kubectl get nodes
Inspect node events:
kubectl describe node <node-name>
Node pressure conditions such as DiskPressure or MemoryPressure can affect pods.
Step 11: Use Ephemeral Debug Containers
Kubernetes supports ephemeral containers (stable since v1.25) for debugging:
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
This allows debugging without modifying the original container image.
Step 12: Analyze Deployment and ReplicaSet
Sometimes configuration drift causes repeated failures.
Check rollout status:
kubectl rollout status deployment/<deployment-name>
Review revision history:
kubectl rollout history deployment/<deployment-name>
Rollback if necessary:
kubectl rollout undo deployment/<deployment-name>
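These commands combine into a guarded-rollback pattern: attempt to wait for convergence, and roll back only if the rollout fails. A sketch (the live form uses real kubectl flags; the stub functions below are hypothetical and exist only to demonstrate the fallback logic offline):

```shell
# Guarded rollback: roll back only if the rollout fails to converge.
# Live form:
#   kubectl rollout status deployment/<deployment-name> --timeout=120s \
#     || kubectl rollout undo deployment/<deployment-name>
# Offline demonstration of the same ||-fallback logic with stubs:
rollout_status() { return 1; }   # stub: pretend the rollout is stuck
rollout_undo()   { echo "rolled back"; }
rollout_status || rollout_undo
```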
Step 13: Observability and Monitoring Tools
Production clusters should integrate:
Centralized logging (ELK stack)
Metrics (Prometheus and Grafana)
Distributed tracing (OpenTelemetry)
Cloud-native monitoring dashboards
Correlating logs, metrics, and traces accelerates root cause analysis.
Difference Between Pod Failure Types
| Failure Type | Root Cause Category | Typical Fix |
|---|---|---|
| CrashLoopBackOff | Application crash | Fix app error or probe config |
| ImagePullBackOff | Registry issue | Correct image or credentials |
| Pending | Scheduling issue | Increase resources or fix node constraints |
| OOMKilled | Memory limit exceeded | Adjust memory limits |
| Completed | Short-lived process | Use proper restart policy |
Classifying failure type helps reduce mean time to recovery.
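The table above can be encoded directly as a triage helper so classification is consistent across responders (`classify` is a hypothetical helper name; the mappings are the ones in the table):

```shell
# Triage helper mirroring the table above: map a pod status to a fix category.
classify() {
  case $1 in
    CrashLoopBackOff)              echo "application crash - fix app error or probe config" ;;
    ImagePullBackOff|ErrImagePull) echo "registry issue - correct image or credentials" ;;
    Pending)                       echo "scheduling issue - increase resources or fix node constraints" ;;
    OOMKilled)                     echo "memory limit exceeded - adjust memory limits" ;;
    Completed)                     echo "short-lived process - use proper restart policy" ;;
    *)                             echo "unknown - start with kubectl describe" ;;
  esac
}

classify CrashLoopBackOff
```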
Production Debugging Best Practices
Never debug directly on production nodes without audit logging
Use read-only commands first
Avoid modifying live containers manually
Roll back safely instead of hot-fixing inside pods
Maintain staging parity for reproduction
A systematic approach prevents cascading failures.
Summary
Debugging a failing Kubernetes pod in production requires identifying the pod state, analyzing events and logs, verifying resource limits, inspecting configuration and secrets, checking scheduling constraints, and leveraging observability tools for root cause analysis. By following a structured diagnostic workflow and using Kubernetes commands such as describe, logs, exec, rollout, and debug, engineers can minimize downtime, prevent configuration drift, and restore application stability efficiently in enterprise-grade container orchestration environments.