The Horizontal Pod Autoscaler (HPA) in GKE isn’t just about scaling pods up and down; it’s fundamentally about managing resource contention before it impacts your users, even when the load is wildly unpredictable.

Let’s see it in action. Imagine a simple Nginx deployment serving static content. We’ll set up an HPA to react to CPU utilization.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m" # Request 0.1 CPU core
          limits:
            cpu: "200m" # Limit to 0.2 CPU cores
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50 # Target 50% of the *requested* CPU

Apply this with kubectl apply -f nginx-hpa.yaml. Now, if we send a lot of traffic to nginx-deployment, the HPA will watch the CPU usage of the pods. When the average CPU utilization across all nginx-deployment pods exceeds 50% of their requested CPU (that is, 50m per pod, since each pod requests 100m), the HPA will start creating new pods, up to maxReplicas. If utilization falls back below 50%, it will scale down toward minReplicas, though only after a stabilization window (five minutes by default) so that brief dips don’t cause replicas to flap.

The core problem the HPA solves is the manual, reactive, and often too-slow process of scaling. You could manually increase replicas in your Deployment, but that requires you to know there’s a problem and react. HPA automates this, turning resource metrics into scaling actions. On each sync (every 15 seconds by default), the HPA controller queries the metrics server (which collects resource usage data from the Kubelet on each node) for the pods behind the scaleTargetRef (our nginx-deployment). It then calculates the current average utilization of the target resource (CPU in this case) across those pods, as a percentage of their requested CPU, and computes desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). For example, 2 pods averaging 90% against a 50% target gives ceil(2 × 90 / 50) = 4 replicas. If the ratio is within a small tolerance of 1.0 (10% by default), no change is made; otherwise the controller sets the replicas count of the scaleTargetRef, within minReplicas and maxReplicas, to bring the average utilization back toward the target.

The "requested" CPU is crucial here. If your pods request 100m of CPU and you set targetCPUUtilizationPercentage to 50, the HPA will try to keep the average CPU usage at 50m per pod. If a pod’s CPU usage goes up to 150m, that’s 150% utilization relative to its request, triggering scaling. If you had set a limit of 200m but only requested 50m, the HPA would still see 150% utilization at 75m and scale aggressively. This is why setting realistic requests and limits is paramount for effective autoscaling.

Beyond CPU, the HPA can also scale based on memory utilization or custom metrics (like requests per second from Prometheus). Neither is available in autoscaling/v1, which only supports targetCPUUtilizationPercentage. Both are defined in the metrics field of the HPA spec under the autoscaling/v2 API version (the older autoscaling/v2beta2 was removed in Kubernetes 1.26). For memory, you’d add a Resource metric named memory with a utilization target. For custom metrics, you’d also need a metrics adapter configured to expose those metrics through the Kubernetes custom metrics API.
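
To make the autoscaling/v2 form concrete, here is a minimal sketch of the same HPA rewritten with a metrics list covering both CPU and memory. It reuses the name nginx-hpa so it would replace the earlier object rather than compete with it. The 70% memory target is an illustrative assumption, not a recommendation, and memory utilization can only be computed if the nginx container also declares a memory request, which the Deployment above does not yet do.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # Same CPU rule as the autoscaling/v1 example: 50% of requested CPU
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  # Assumed value for illustration; requires a memory request on the container
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```

When several metrics are listed, the HPA computes a desired replica count for each one and uses the largest.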

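The most surprising thing about HPA’s CPU scaling is how it interacts with pod limits. If a pod hits its CPU limit (e.g., 200m in our example), it will be throttled by the Linux kernel’s CFS scheduler, and its measured usage can never exceed that limit. The HPA, however, computes utilization against the requested CPU, so the highest value it can ever observe is the limit divided by the request: 200% here. A throttled pod reports only what it was allowed to consume, not what it actually needed, so the HPA can underestimate demand and scale up in smaller steps than the load calls for. A utilization target above that ratio can never be reached at all. Short bursts can also be throttled within a scheduling period while the averaged utilization still sits below the target, so pods may perform poorly without triggering any scale-up. This is why setting requests and limits that are close to each other, and representative of typical peak load, is a common best practice for CPU-based HPA, with the target kept below the limit-to-request ratio. You want the request to be the baseline for scaling, and the limit to be a hard ceiling to prevent runaway resource consumption or noisy neighbor issues.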
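
As a rough illustration of that advice, the container’s resources block might be tightened along these lines. The numbers are hypothetical placeholders rather than measured values; in practice they would come from observing the workload under typical peak load.

```yaml
# Fragment of the nginx container spec; values are assumed for illustration
resources:
  requests:
    cpu: "200m" # Baseline the HPA measures utilization against
  limits:
    cpu: "250m" # Ceiling kept close to the request
```

With a 50% target, the HPA would then aim for roughly 100m of average usage per pod, and the highest utilization it could observe would be 125% (250m / 200m).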

The next step is exploring scaling based on custom metrics, which allows for much more granular control when CPU or memory aren’t the best indicators of load.
