Kubernetes  

How to Debug a Failing Kubernetes Pod in Production?

Debugging a failing Kubernetes pod in production requires a structured, low-risk, and observability-driven approach. In production clusters, downtime directly impacts users, so engineers must diagnose issues without introducing further instability. Kubernetes provides built-in tooling for inspecting pod health, container logs, events, networking, configuration, and resource usage.

This step-by-step guide explains how to systematically debug a failing pod in a production-grade Kubernetes environment.

Step 1: Identify the Pod Status

Start by checking the pod state:

kubectl get pods -n <namespace>

Common failing states include:

  • CrashLoopBackOff

  • ImagePullBackOff

  • ErrImagePull

  • Pending

  • OOMKilled

  • Completed unexpectedly

Understanding the state narrows the root cause category.

Step 2: Describe the Pod for Detailed Events

Use the describe command to inspect events and conditions:

kubectl describe pod <pod-name> -n <namespace>

Key sections to analyze:

  • Events (image pull errors, scheduling failures)

  • Container state

  • Restart count

  • Resource limits

  • Node assignment

Event messages often reveal immediate causes such as insufficient memory or failed image pulls.

Step 3: Check Container Logs

Inspect logs from the failing container:

kubectl logs <pod-name> -n <namespace>

If the container restarts frequently, fetch logs from the previous (crashed) instance:

kubectl logs <pod-name> --previous -n <namespace>

Common log-based issues:

  • Application startup failure

  • Database connection errors

  • Missing environment variables

  • Dependency timeouts

  • Port binding conflicts

Logs are the most direct source of application-level failure information.

Step 4: Investigate CrashLoopBackOff

CrashLoopBackOff indicates repeated container crashes.

Check:

  • Incorrect startup command

  • Missing configuration

  • Secrets not mounted properly

  • Health check failures

Review liveness and readiness probes in the Deployment manifest.

Example probe configuration:

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 15

Improper probe timing can cause premature restarts.
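For slow-starting applications, a startupProbe can hold off liveness checks until the application has finished booting. A minimal sketch, reusing the /health endpoint and port from the example above; the thresholds are illustrative and should be tuned to your app's real startup time:

```yaml
# Hypothetical values: allows up to 30 x 5s = 150s for startup
# before the liveness probe takes over
startupProbe:
  httpGet:
    path: /health
    port: 3000
  failureThreshold: 30
  periodSeconds: 5
```

While the startup probe is failing, Kubernetes suppresses the liveness probe, which prevents restart loops caused purely by slow initialization.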

Step 5: Debug ImagePullBackOff or ErrImagePull

Common causes:

  • Incorrect image tag

  • Private registry authentication failure

  • Registry downtime

Verify the image name and tag are pullable (note that your local credentials may differ from the node's):

docker pull <image-name>

Check image pull secrets:

kubectl get secrets -n <namespace>

Ensure the service account references the correct secret.
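For a private registry, the pod spec (or its service account) must reference a docker-registry secret. An illustrative fragment, assuming a secret named regcred already exists in the namespace; the image path is a placeholder:

```yaml
# Pod spec fragment referencing an existing pull secret
# (secret name and image path are examples, not real values)
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3
```

Attaching the secret to the service account instead makes every pod using that service account inherit it, which avoids repeating imagePullSecrets in each manifest.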

Step 6: Investigate Pending Pods

If the pod remains in Pending state, inspect scheduling issues:

kubectl describe pod <pod-name>

Common causes:

  • Insufficient CPU or memory

  • Node selector mismatch

  • Taints and tolerations

  • Persistent volume binding failure

Check node resources:

kubectl top nodes
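If describe shows messages like "node(s) had taints that the pod didn't tolerate" or a selector mismatch, the pod spec needs matching tolerations or labels. An illustrative fragment; the keys and values here are assumptions and must match your cluster's actual node taints and labels:

```yaml
# Example scheduling constraints (values are hypothetical)
spec:
  nodeSelector:
    disktype: ssd          # must match a label on at least one node
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule" # tolerates a matching NoSchedule taint
```

A pod stays Pending until at least one node satisfies every selector, affinity rule, taint, and resource request simultaneously.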

Step 7: Check Resource Limits and OOMKilled

If a container is OOMKilled, it exceeded its memory limit and was terminated by the kernel's out-of-memory killer.

Check resource configuration in Deployment:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Increase limits cautiously and monitor memory usage:

kubectl top pod <pod-name>

Step 8: Exec into the Container for Live Debugging

If the container is running but misbehaving, open a shell inside it (minimal or distroless images may lack /bin/sh; use an ephemeral debug container instead, see Step 11):

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Inside the container, verify:

  • Environment variables

  • Mounted volumes

  • Application configuration files

  • Network connectivity

Example network test (minimal images may ship wget instead of curl):

curl http://dependent-service:8080

Step 9: Check ConfigMaps and Secrets

Configuration errors often cause startup failures.

Verify mounted configurations:

kubectl get configmap <config-name> -n <namespace> -o yaml
kubectl get secret <secret-name> -n <namespace> -o yaml

Ensure base64 values decode correctly and match expected credentials.
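Secret values are base64-encoded in the manifest, so decode them before comparing with expected credentials. A local sketch of the decode step; the encoded string below is just an example, and in practice the input comes from kubectl:

```shell
# Decode a single secret value; in a real cluster the encoded input
# comes from: kubectl get secret <secret-name> -o jsonpath='{.data.password}'
encoded="c3VwZXItc2VjcmV0"
echo "$encoded" | base64 -d   # → super-secret
```

A common pitfall is creating the secret with an accidental trailing newline (e.g. from echo without -n), which makes the decoded value differ from the expected credential by one invisible character.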

Step 10: Inspect Node-Level Issues

Sometimes the issue is node-related rather than application-specific.

Check node status:

kubectl get nodes

Inspect node events:

kubectl describe node <node-name>

Node pressure conditions such as DiskPressure or MemoryPressure can affect pods.

Step 11: Use Ephemeral Debug Containers

Kubernetes v1.25 and later support ephemeral containers as a stable feature (earlier versions require a feature gate):

kubectl debug -it <pod-name> --image=busybox --target=<container-name>

This allows debugging without modifying the original container image.

Step 12: Analyze Deployment and ReplicaSet

Sometimes configuration drift causes repeated failures.

Check rollout status:

kubectl rollout status deployment/<deployment-name>

Review revision history:

kubectl rollout history deployment/<deployment-name>

Rollback if necessary:

kubectl rollout undo deployment/<deployment-name>

Step 13: Observability and Monitoring Tools

Production clusters should integrate:

  • Centralized logging (ELK stack)

  • Metrics (Prometheus and Grafana)

  • Distributed tracing (OpenTelemetry)

  • Cloud-native monitoring dashboards

Correlating logs, metrics, and traces accelerates root cause analysis.

Difference Between Pod Failure Types

Failure Type        Root Cause Category      Typical Fix
CrashLoopBackOff    Application crash        Fix app error or probe config
ImagePullBackOff    Registry issue           Correct image or credentials
Pending             Scheduling issue         Increase resources or fix node constraints
OOMKilled           Memory limit exceeded    Adjust memory limits
Completed           Short-lived process      Use proper restart policy

Classifying failure type helps reduce mean time to recovery.
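The classification above can be sketched as a small helper for triage scripts; the status string is assumed to come from the STATUS column of kubectl get pods:

```shell
# Map a pod status to its likely root-cause category (illustrative helper)
triage() {
  case "$1" in
    CrashLoopBackOff)              echo "application crash" ;;
    ImagePullBackOff|ErrImagePull) echo "registry issue" ;;
    Pending)                       echo "scheduling issue" ;;
    OOMKilled)                     echo "memory limit exceeded" ;;
    Completed)                     echo "short-lived process" ;;
    *)                             echo "unknown - run kubectl describe" ;;
  esac
}

triage CrashLoopBackOff   # → application crash
```

Such a helper is only a first-pass filter; the describe and logs steps above remain the authoritative sources for the actual root cause.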

Production Debugging Best Practices

  • Never debug directly on production nodes without audit logging

  • Use read-only commands first

  • Avoid modifying live containers manually

  • Roll back safely instead of hot-fixing inside pods

  • Maintain staging parity for reproduction

A systematic approach prevents cascading failures.

Summary

Debugging a failing Kubernetes pod in production requires identifying the pod state, analyzing events and logs, verifying resource limits, inspecting configuration and secrets, checking scheduling constraints, and leveraging observability tools for root cause analysis. By following a structured diagnostic workflow and using Kubernetes commands such as describe, logs, exec, rollout, and debug, engineers can minimize downtime, prevent configuration drift, and restore application stability efficiently in enterprise-grade container orchestration environments.