The GKE Cluster Autoscaler’s most surprising feature is that it can prevent you from hitting your performance targets, even if it’s scaling up.

Let’s see it in action. Imagine a deployment with 5 replicas, each requesting 1 vCPU and 2 GiB of memory.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: nginx:latest
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"

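And here’s our Cluster Autoscaler configuration. GKE does not expose these settings as a Kubernetes object you can kubectl apply; you set them with gcloud, the console, or Terraform (for example, gcloud container clusters update CLUSTER_NAME --autoscaling-profile optimize-utilization). Read the manifest below as illustrative shorthand for the profile and resource limits: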

apiVersion: autoscaling.gke.io/v1
kind: AutoscalingPolicy
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  limits:
    cpu: 100
    memory: 500
  resourceLimits:
    cpu: 200
    memory: 1000
  autoscalingProfile: "optimize-utilization"

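With optimize-utilization, the autoscaler keeps as little spare capacity as it can: it removes underutilized nodes more aggressively, and GKE prefers to place pods on nodes that are already heavily allocated. Scale-up is still triggered only by pods that cannot be scheduled, not by utilization, so a cluster can appear to have headroom while that headroom is fragmented across nodes and unusable for new pods. Every new pod in that situation has to wait for a node to be provisioned, which is how a cluster that is scaling up can still miss its performance targets. Note that this profile is opt-in to save costs; the default profile is balanced.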
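
To make "fragmented headroom" concrete, here is a minimal sketch in Python. The free-CPU figures are made up for illustration and the fits_anywhere helper is hypothetical; it models CPU requests only and is not how the real scheduler is implemented.

```python
# Hypothetical free CPU (in vCPU) left on three nodes after existing pods are placed.
free_cpu_per_node = [0.5, 0.4, 0.6]

POD_CPU_REQUEST = 1.0  # each my-app replica requests 1 vCPU


def fits_anywhere(free_cpu, request):
    """Return True if at least one single node has enough free CPU for the pod."""
    return any(free >= request for free in free_cpu)


total_free = sum(free_cpu_per_node)
print(f"total free CPU across nodes: {total_free:.1f} vCPU")
print(f"pod fits on some node: {fits_anywhere(free_cpu_per_node, POD_CPU_REQUEST)}")
```

The cluster as a whole has more than one vCPU free, yet no single node can host a 1 vCPU pod, so the pod stays pending until a new node arrives.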

Here’s the mental model: The Cluster Autoscaler’s primary job is to ensure your workloads have the resources they need by adding or removing nodes. It does this by looking at pending pods (pods that can’t be scheduled due to resource constraints) and evaluating if adding a new node would allow them to be scheduled. It also looks at underutilized nodes and scales them down if their pods can be consolidated onto other nodes.
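
The scale-up half of that mental model can be sketched as a small simulation. This is a simplified illustration, not the autoscaler’s actual algorithm: the node shape (4 vCPU, 16 GiB allocatable) is an assumption, the Pod class and nodes_needed function are hypothetical names, and placement uses a plain first-fit pass over CPU and memory requests.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    cpu: float      # requested vCPU
    mem_gib: float  # requested memory in GiB


# Assumed allocatable capacity of one new node in the pool (illustrative values).
NODE_CPU = 4.0
NODE_MEM_GIB = 16.0


def nodes_needed(pending):
    """First-fit estimate of how many new nodes the pending pods require."""
    nodes = []  # each entry is [free_cpu, free_mem_gib] for one simulated new node
    for pod in pending:
        if pod.cpu > NODE_CPU or pod.mem_gib > NODE_MEM_GIB:
            continue  # no node of this shape helps; the pod stays pending
        for node in nodes:
            if node[0] >= pod.cpu and node[1] >= pod.mem_gib:
                node[0] -= pod.cpu
                node[1] -= pod.mem_gib
                break
        else:
            # No simulated node had room, so open a new one for this pod.
            nodes.append([NODE_CPU - pod.cpu, NODE_MEM_GIB - pod.mem_gib])
    return len(nodes)


# The five my-app replicas (1 vCPU, 2 GiB each), all assumed to be pending.
pending_pods = [Pod(cpu=1.0, mem_gib=2.0) for _ in range(5)]
print(nodes_needed(pending_pods))
```

With these assumed numbers, four replicas fill the CPU of the first simulated node and the fifth forces a second one. A pod that is larger than any node the pool can create contributes nothing, which mirrors the case where scaling up would not help.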

The key levers you control are:

  • minNodes and maxNodes per Node Pool: These define the absolute boundaries for scaling. minNodes ensures you always have a baseline capacity, while maxNodes prevents runaway costs.
  • autoscalingProfile: This is crucial.
    • balanced (default): Keeps more resources readily available for incoming pods and removes nodes less eagerly, so new pods are more likely to find room without waiting for a node to be provisioned.
    • optimize-utilization: Prioritizes cost savings by removing underutilized nodes sooner and packing pods onto nodes that are already busy. The trade-off is less spare capacity, so bursts of new pods are more likely to wait on node provisioning.
  • Pod Resource Requests: The autoscaler only knows about what pods request. If requests are too low, the autoscaler might think there’s capacity when there isn’t, or it might overcommit nodes. If requests are too high, it might scale up nodes unnecessarily.
  • Resource limits: The total CPU and memory limits you set (resourceLimits.cpu, resourceLimits.memory in the configuration above) are hard caps on the cluster as a whole rather than on a single node pool. The autoscaler will never add nodes that would push the cluster beyond these limits.
  • Pod Disruption Budgets (PDBs): These can indirectly affect autoscaling. If a PDB prevents a node from being drained during a scale-down event, the autoscaler might hesitate to remove that node, impacting cost optimization.
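
The point about Pod Resource Requests is easy to check with arithmetic. The sketch below assumes a node with 4 allocatable vCPU and made-up request and usage figures; pods_that_fit and actual_demand are hypothetical helpers, and only CPU is modeled.

```python
NODE_ALLOCATABLE_CPU = 4.0  # assumed allocatable vCPU on one node


def pods_that_fit(request_cpu):
    """How many identical pods fit on the node, judged by requests alone."""
    return int(NODE_ALLOCATABLE_CPU // request_cpu)


def actual_demand(pod_count, real_usage_cpu):
    """CPU the pods really try to consume, which requests do not reveal."""
    return pod_count * real_usage_cpu


# Under-requesting: each pod asks for 0.25 vCPU but really uses 1 vCPU.
count = pods_that_fit(0.25)
print(count, "pods placed, real demand:", actual_demand(count, 1.0), "vCPU")

# Over-requesting: each pod asks for 2 vCPU but really uses 0.5 vCPU.
count = pods_that_fit(2.0)
print(count, "pods placed, real demand:", actual_demand(count, 0.5), "vCPU")
```

In the first case the node looks full by requests while the pods want four times what it can deliver, and nothing is pending, so the autoscaler sees no reason to add a node. In the second case the node looks full while three quarters of its CPU sits idle, and further replicas trigger a scale-up that was not needed.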

The autoscaler works in two stages that you can think of as "predicates" and "priority". For every pending pod, it simulates scheduling to check which node groups (node pools) could accommodate it (predicates). Then it ranks the feasible node groups and expands the best one (priority); the open-source Cluster Autoscaler calls this ranking strategy an expander. On GKE the ranking favors the least expensive node pool that fits. The autoscaling profile does not change this ranking; it mainly affects how much spare capacity is kept and how aggressively nodes are removed.
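
As a rough illustration of the predicate and priority stages, the sketch below filters node pools that could host a pod and then picks the cheapest. The pool names, sizes, prices, and the single taint flag are all invented for the example, and NodePool, passes_predicates, and choose_pool are hypothetical names rather than anything from the autoscaler’s code.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    node_cpu: float
    node_mem_gib: float
    hourly_cost: float  # made-up price, only used for ranking
    has_gpu_taint: bool = False


def passes_predicates(pool, pod_cpu, pod_mem_gib, tolerates_gpu_taint=False):
    """Filter stage: could a fresh node from this pool host the pod at all?"""
    if pool.has_gpu_taint and not tolerates_gpu_taint:
        return False
    return pool.node_cpu >= pod_cpu and pool.node_mem_gib >= pod_mem_gib


def choose_pool(pools, pod_cpu, pod_mem_gib):
    """Rank stage: among feasible pools, prefer the cheapest node."""
    feasible = [p for p in pools if passes_predicates(p, pod_cpu, pod_mem_gib)]
    if not feasible:
        return None  # no scale-up would help this pod
    return min(feasible, key=lambda p: p.hourly_cost)


pools = [
    NodePool("small", node_cpu=0.5, node_mem_gib=2, hourly_cost=0.02),
    NodePool("general", node_cpu=4, node_mem_gib=16, hourly_cost=0.15),
    NodePool("gpu", node_cpu=8, node_mem_gib=32, hourly_cost=0.90, has_gpu_taint=True),
]

chosen = choose_pool(pools, pod_cpu=1.0, pod_mem_gib=2.0)
print(chosen.name if chosen else "no pool fits")
```

For a 1 vCPU, 2 GiB pod, the small pool is filtered out because its nodes are too small, the gpu pool is filtered out by its taint, and the general pool is the only candidate left to rank.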

What most people miss is that the autoscaler doesn’t actually schedule pods; it only resizes node pools when its simulation says a new node could allow a pending pod to be scheduled. The actual scheduling is done by the Kubernetes scheduler itself. The autoscaler’s "decision" is a prediction based on its capacity calculations, not a placement. This means that adding a node is not a guarantee: the pod might still not schedule if other factors (like taints, tolerations, or node affinity rules) rule out the nodes the autoscaler is able to create.

After configuring the autoscaler for cost and performance, your next consideration will be managing node pool sizes and types based on workload characteristics.
