Kubernetes autoscaling isn’t just about adding more pods; it’s fundamentally a continuous, reactive process of resource allocation driven by observed demand.

Let’s see it in action. Imagine a simple Nginx deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"

Now, let’s attach a Horizontal Pod Autoscaler (HPA) to it. This HPA will watch the average CPU utilization across all pods in the nginx-deployment.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

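If average CPU utilization across the nginx-deployment pods rises above 50% of the requested CPU (roughly 50m per pod, since each pod requests 100m), the HPA scales the deployment up by creating more pods, up to the maxReplicas of 10. Conversely, if utilization falls well below 50%, it scales down, but never below the minReplicas of 1. Small deviations are ignored: by default the controller tolerates a 10% difference from the target, and it waits through a 5-minute stabilization window before scaling down, so a brief dip in load does not remove pods immediately.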

This addresses two problems at once: performance degradation under high load and wasted resources during low traffic. The HPA automatically adjusts the number of application instances (pods) based on observed metrics such as CPU or memory utilization.

Internally, the Kubernetes control plane has a component called the kube-controller-manager, which runs the horizontal-pod-autoscaler controller. This controller periodically (every 15 seconds by default) queries the resource metrics API (metrics.k8s.io, normally served by the Metrics Server) for the metrics of the pods selected by the HPA's target. It calculates the current average utilization, compares it to targetCPUUtilizationPercentage, and derives a desired replica count. If that count differs from the current one, it updates the replica count of the target Deployment through its scale subresource. The Deployment controller then resizes the underlying ReplicaSet, which creates or deletes pods.

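With the autoscaling/v1 API used above, the levers you control are minReplicas, maxReplicas, and targetCPUUtilizationPercentage. There is no targetMemoryUtilizationPercentage field: scaling on memory requires the autoscaling/v2 API, which replaces the single CPU field with a metrics list. autoscaling/v2 also supports custom metrics such as requests per second, served through the custom.metrics.k8s.io API, and external metrics such as queue length, served through the external.metrics.k8s.io API. Both require a metrics adapter to be installed in the cluster.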
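As a rough illustration of how CPU and memory targets can be combined, here is a sketch of the same autoscaler written against autoscaling/v2. It assumes a cluster that serves that API version (stable since Kubernetes 1.23). The 70% memory target is an arbitrary value chosen for the example, not a recommendation, and the behavior section only spells out the default scale-down window explicitly.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # Same CPU target as the autoscaling/v1 example: 50% of requested CPU.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  # Illustrative memory target (assumed value): 70% of requested memory.
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      # 300 seconds is the default; shown here only to make the knob visible.
      stabilizationWindowSeconds: 300
```

When several metrics are listed, the controller computes a desired replica count for each one and uses the largest, so either CPU or memory pressure can trigger a scale-up.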

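Vertical Pod Autoscaler (VPA) takes a different approach. Instead of changing the number of pods, it changes the resource requests and limits of the containers within those pods. VPA is not part of core Kubernetes and has to be installed separately. It is particularly useful for applications with unpredictable resource needs, or when you are unsure what resource requests to set initially. It analyzes historical resource usage and can either publish recommendations or apply adjustments automatically, which typically means evicting pods so they restart with the new values. Avoid running VPA and HPA against the same workload on the same CPU or memory metric, because the two will work against each other.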
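A minimal sketch of a VPA for the Nginx deployment might look like the following. It assumes the VPA components and their autoscaling.k8s.io/v1 custom resource are installed in the cluster. The name nginx-vpa and the minAllowed and maxAllowed bounds are made up for this example.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    # "Off" only publishes recommendations; no pods are changed or evicted.
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      # Illustrative bounds (assumed values) to keep recommendations sane.
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 500m
        memory: 512Mi
```

Starting in "Off" mode is a cautious choice: the recommendations appear in the object's status, where they can be reviewed before any automatic update mode is enabled.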

KEDA (Kubernetes Event-Driven Autoscaling) is an extension that allows you to scale based on external event sources. Think message queues (Kafka, RabbitMQ), cloud provider queues (SQS, Azure Service Bus), or even custom metrics from Prometheus. KEDA can scale deployments down to zero replicas when there are no events, and scale them back up when events start arriving. This is ideal for event-driven architectures where you want to avoid paying for idle resources.
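To make this concrete, here is a sketch of a KEDA ScaledObject that scales the Nginx deployment on a Prometheus query. It assumes KEDA is installed in the cluster. The Prometheus address, the metric name nginx_http_requests_total, and the threshold are all hypothetical and would need to match a real setup.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx-scaledobject   # hypothetical name
spec:
  scaleTargetRef:
    name: nginx-deployment
  minReplicaCount: 0         # allow scale to zero when the query returns no load
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      # Assumed in-cluster Prometheus address; replace with your own.
      serverAddress: http://prometheus.monitoring.svc:9090
      # Hypothetical metric: total request rate over the last 2 minutes.
      query: sum(rate(nginx_http_requests_total[2m]))
      # Target value per replica (assumed): about 100 requests per second.
      threshold: "100"
```

KEDA creates and manages an HPA for the target behind the scenes, so a hand-written HPA such as nginx-hpa should not point at the same deployment at the same time.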

The most surprising thing about HPA is that it doesn’t actually collect the metrics itself. It relies on a separate component, the Kubernetes Metrics Server, which aggregates resource usage data from the kubelets on each node. If the Metrics Server isn’t running or healthy, your HPAs will stop functioning, often showing <unknown> for target utilization.

Utilization is always measured against the pods' CPU requests, not their limits or the node's capacity. For example, if you have two pods, each requesting 100m CPU, and the total actual CPU usage across both pods is 150m, the average utilization is (150m / (100m + 100m)) * 100% = 75%. Since 75% is greater than the 50% target, the HPA triggers a scale-up. The new replica count comes from desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), which here is ceil(2 * 75 / 50) = 3, so the deployment grows from two pods to three.

The next concept to wrap your head around is how to handle applications that don’t have easily measurable CPU or memory usage patterns, leading you to custom metrics and KEDA.
