Introduction
In real-world cloud applications, user traffic is never constant. Sometimes your application gets very high traffic (like during sales or peak hours), and sometimes it is almost idle. If you keep a fixed number of pods in Kubernetes, you either face performance issues during high traffic or waste resources during low traffic.
This is where Kubernetes autoscaling comes into play. Using the Horizontal Pod Autoscaler (HPA), you can automatically increase or decrease the number of pods based on CPU usage, memory usage, or other metrics.
In this step-by-step Kubernetes autoscaling tutorial, you will learn how to configure HPA in simple words with practical examples.
What is Horizontal Pod Autoscaler in Kubernetes?
Horizontal Pod Autoscaler (HPA) is a Kubernetes feature that automatically scales the number of pods in your application based on resource usage like CPU or memory.
Simple explanation:
When load goes up, HPA adds pods. When load goes down, it removes them.
Real-life example:
Think of a food delivery app during dinner time. When many users place orders, more delivery agents (pods) are needed. When orders drop, fewer agents are required. HPA works the same way.
Prerequisites for Kubernetes Autoscaling
Before you configure autoscaling in Kubernetes, make sure the following things are ready:
A working Kubernetes cluster (local or cloud like AWS, Azure, GCP)
kubectl installed and configured
Metrics Server installed (very important for HPA)
A running deployment to scale
Without Metrics Server, the HPA has no CPU or memory usage data to read, so resource-based autoscaling will not work.
Step 1: Install Metrics Server in Kubernetes
Metrics Server is required for Kubernetes HPA because it collects CPU and memory usage data.
Run this command:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
After installing, verify:
kubectl get deployment metrics-server -n kube-system
Simple understanding:
Metrics Server acts like a monitoring tool that tells Kubernetes how much CPU and memory your pods are using.
Step 2: Create or Verify Deployment
Before applying autoscaling, you must have a deployment running.
Example Kubernetes deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: nginx
        resources:
          requests:
            cpu: "100m"
          limits:
            cpu: "500m"
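Important concept:
requests is the amount of CPU reserved for the container; limits is the maximum it is allowed to use.
Why this matters:
HPA measures CPU utilization as a percentage of the CPU request. If you don’t define a request, HPA cannot compute that percentage, and autoscaling on CPU will not work.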
Real-world mistake:
Many beginners forget to define CPU requests, and then HPA does not scale properly.
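To make the role of the CPU request concrete, here is a minimal sketch of the percentage HPA works with. It is a simplification that assumes every pod has a single container with the same request; the function name and the usage figures are made up for illustration.

```python
from typing import List, Optional


def cpu_utilization_percent(usage_millicores: List[int],
                            request_millicores: Optional[int]) -> Optional[float]:
    """Average CPU utilization across pods, as a percentage of the CPU request.

    Simplification: each pod has one container with the same CPU request.
    """
    if not request_millicores or not usage_millicores:
        # No request (or no pods): there is nothing to divide by,
        # so no utilization percentage can be computed.
        return None
    total_usage = sum(usage_millicores)
    total_request = request_millicores * len(usage_millicores)
    return 100.0 * total_usage / total_request


# Two pods using 80m and 60m, each with the 100m request from the deployment.
print(cpu_utilization_percent([80, 60], 100))   # 70.0
# Same usage, but no CPU request defined on the container.
print(cpu_utilization_percent([80, 60], None))  # None
```

The second call shows why a missing request is a problem: without it there is no percentage to compare against the target.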
Step 3: Create HPA Using kubectl Command
You can quickly configure autoscaling using a simple command:
kubectl autoscale deployment my-app --cpu-percent=50 --min=2 --max=10
Explanation in simple words:
--cpu-percent=50 → target average CPU utilization across pods, as a percentage of the CPU request
--min=2 → minimum number of pods that always run
--max=10 → maximum number of pods allowed
Real-world example:
If the average CPU utilization across your pods goes above 50% of the requested CPU, Kubernetes will automatically add pods to handle the load, up to the maximum of 10.
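How many pods does HPA aim for? The Kubernetes documentation describes the core calculation as the current replica count multiplied by the ratio of current to target metric value, rounded up. The sketch below applies that formula with the values from the command above; the function name is illustrative, and the real controller also accounts for details such as pods that are not ready yet.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """ceil(currentReplicas * current / target), clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Values from the example: target 50%, min 2, max 10.
print(desired_replicas(2, 90, 50, 2, 10))   # 4  (ceil of 3.6)
print(desired_replicas(2, 400, 50, 2, 10))  # 10 (16 is capped at max)
print(desired_replicas(4, 10, 50, 2, 10))   # 2  (1 is raised to min)
```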
Step 4: Configure HPA Using YAML (Recommended for Production)
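For real-world projects, define the HPA in a YAML manifest so the configuration can be tracked and reapplied. If you already created an HPA in Step 3, remove it first with kubectl delete hpa my-app so that two autoscalers do not manage the same deployment.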
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Save the manifest as hpa.yaml and apply it:
kubectl apply -f hpa.yaml
Why YAML is better:
Step 5: Verify Kubernetes Autoscaling
To check if HPA is working:
kubectl get hpa
For detailed information:
kubectl describe hpa my-app-hpa
This will show:
Current CPU usage
Number of replicas
Scaling events
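If you want the same numbers in a script, the HPA object can be read as JSON (for example with kubectl get hpa my-app-hpa -o json) and summarized. The sketch below assumes the field layout of an autoscaling/v2 HPA; the sample document is hand-written for illustration and is not real cluster output.

```python
import json

# Hand-written sample shaped like an autoscaling/v2 HPA object (illustrative only).
SAMPLE_HPA_JSON = """
{
  "metadata": {"name": "my-app-hpa"},
  "spec": {"minReplicas": 2, "maxReplicas": 10},
  "status": {
    "currentReplicas": 3,
    "desiredReplicas": 4,
    "currentMetrics": [
      {"type": "Resource",
       "resource": {"name": "cpu", "current": {"averageUtilization": 72}}}
    ]
  }
}
"""


def summarize_hpa(hpa: dict) -> dict:
    """Pick out replica counts and current CPU utilization from an HPA object."""
    spec = hpa.get("spec", {})
    status = hpa.get("status", {})
    cpu_percent = None
    # currentMetrics can be missing or null right after the HPA is created.
    for metric in status.get("currentMetrics") or []:
        resource = metric.get("resource", {})
        if metric.get("type") == "Resource" and resource.get("name") == "cpu":
            cpu_percent = resource.get("current", {}).get("averageUtilization")
    return {
        "name": hpa.get("metadata", {}).get("name"),
        "cpu_percent": cpu_percent,
        "current_replicas": status.get("currentReplicas"),
        "desired_replicas": status.get("desiredReplicas"),
        "min_replicas": spec.get("minReplicas"),
        "max_replicas": spec.get("maxReplicas"),
    }


print(summarize_hpa(json.loads(SAMPLE_HPA_JSON)))
```

Scaling events are not part of this object; they are easiest to read from the kubectl describe output.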
Step 6: Test Autoscaling in Kubernetes
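To see autoscaling in action, generate load. The loop below reaches the pods through a Service named my-app, so expose the deployment first if you have not already:
kubectl expose deployment my-app --port=80
Then start a temporary pod to generate load:
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh
Inside container:
while true; do wget -q -O- http://my-app; done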
Now monitor scaling:
kubectl get hpa -w
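You will notice the REPLICAS column increasing once CPU utilization rises above the target. After you stop the load, the pod count drops again, but only after a few minutes.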
Simple understanding:
More traffic → more pods
Less traffic → fewer pods
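These two rules are not triggered by every small fluctuation. The HPA controller skips scaling while the ratio of current to target utilization stays within a tolerance, which is 10% by default. A minimal sketch of that check, with an illustrative function name:

```python
def needs_scaling(current_utilization: float,
                  target_utilization: float,
                  tolerance: float = 0.1) -> bool:
    """True when current/target differs from 1.0 by more than the tolerance."""
    ratio = current_utilization / target_utilization
    return abs(ratio - 1.0) > tolerance


# Target is 50% CPU utilization.
print(needs_scaling(52, 50))  # False: ratio 1.04, inside the 10% band
print(needs_scaling(60, 50))  # True:  ratio 1.20, scale up
print(needs_scaling(30, 50))  # True:  ratio 0.60, scale down
```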
Advanced Kubernetes Autoscaling Configuration
You can improve autoscaling using advanced options.
Memory-based autoscaling (the container must also define a memory request):
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 60
Multiple metrics:
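You can use both CPU and memory together. HPA calculates a desired replica count for each metric and uses the largest one, so the application scales up if either resource is under pressure.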
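With several metrics, HPA evaluates each one separately and keeps the highest proposed replica count. The sketch below illustrates that rule with made-up utilization figures; the function names are illustrative, and min/max clamping is left out to keep the focus on the comparison.

```python
import math
from typing import Dict, Tuple


def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    """Replica count proposed by a single metric."""
    return math.ceil(current_replicas * current / target)


def desired_for_all(current_replicas: int,
                    metrics: Dict[str, Tuple[float, float]]) -> int:
    """Largest proposal across all metrics; values are (current, target) pairs."""
    return max(desired_for_metric(current_replicas, current, target)
               for current, target in metrics.values())


metrics = {
    "cpu": (40, 50),     # below target: CPU alone proposes ceil(3.2) = 4
    "memory": (90, 60),  # above target: memory proposes ceil(6.0) = 6
}
print(desired_for_all(4, metrics))  # 6
```

Here memory drives the decision even though CPU is below its target.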
Custom metrics:
Using tools like Prometheus, together with an adapter that exposes their data through the Kubernetes custom metrics API, you can scale based on:
Requests per second
Queue length
Business metrics
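For metrics like these, the target is usually an average value per pod rather than a percentage. A rough sketch of the arithmetic, assuming the metric total is known and using hypothetical targets (100 requests per second per pod, 200 queued messages per pod):

```python
import math


def replicas_for_average_value(total_metric_value: float,
                               target_per_pod: float,
                               min_replicas: int,
                               max_replicas: int) -> int:
    """Pods needed so that each handles at most target_per_pod on average."""
    raw = math.ceil(total_metric_value / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


# 950 requests per second in total, hypothetical target of 100 per pod.
print(replicas_for_average_value(950, 100, 2, 20))   # 10
# 2500 messages waiting in a queue, hypothetical target of 200 per pod.
print(replicas_for_average_value(2500, 200, 2, 20))  # 13
```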
When to Use Kubernetes HPA
Use Kubernetes Horizontal Pod Autoscaler when:
Avoid HPA when:
Advantages of Kubernetes Autoscaling
Automatically handles traffic spikes
Improves application performance
Reduces infrastructure cost
No manual scaling required
Disadvantages and Common Mistakes
Metrics Server must be installed
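Scaling is not instant: metrics are collected periodically, new pods need time to start, and scale-down waits for a stabilization window (5 minutes by default)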
Wrong CPU requests lead to wrong scaling
Real-world mistake:
If the CPU request is too low, normal usage already looks like high utilization (utilization is measured against the request), so Kubernetes keeps scaling up unnecessarily.
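A worked example makes this mistake visible. The sketch below keeps real usage fixed at 80m per pod (a made-up figure) and only changes the CPU request; the function names are illustrative and min/max clamping is omitted.

```python
import math


def utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """CPU usage expressed as a percentage of the CPU request."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, utilization: float, target: float) -> int:
    """ceil(currentReplicas * current / target), without min/max clamping."""
    return math.ceil(current_replicas * utilization / target)


USAGE_MILLICORES = 80  # same real usage per pod in every case
CURRENT_REPLICAS = 2
TARGET_PERCENT = 50

for request in (50, 100, 200):
    util = utilization_percent(USAGE_MILLICORES, request)
    wanted = desired_replicas(CURRENT_REPLICAS, util, TARGET_PERCENT)
    print(f"request={request}m utilization={util:.0f}% desired replicas={wanted}")
# request=50m utilization=160% desired replicas=7
# request=100m utilization=80% desired replicas=4
# request=200m utilization=40% desired replicas=2
```

The workload is identical in all three cases; only the request changes how loaded the pods appear to HPA.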
Best Practices for Kubernetes Autoscaling
Always define CPU and memory requests properly
Start with small scaling limits and adjust
Monitor performance regularly
Use multiple metrics when possible
Summary
Configuring autoscaling in Kubernetes using Horizontal Pod Autoscaler helps your application automatically adjust to changing traffic conditions, improving performance and reducing costs. By properly setting CPU and memory requests, installing Metrics Server, and defining clear scaling policies using YAML or kubectl, you can build a reliable and scalable system. Understanding how HPA works and avoiding common mistakes ensures that your Kubernetes autoscaling setup remains efficient, stable, and production-ready.