The Kubernetes Cluster Autoscaler failed to provision new nodes because the cloud provider’s API reported insufficient resources for the requested instance type.

This usually means one of a few things is preventing the autoscaler from doing its job. It’s not just that the cluster needs more nodes; it’s that the underlying infrastructure is saying "no" to the specific kind of node the autoscaler is trying to add.

Here are the most common culprits:

  1. Insufficient Instance Quotas: Your cloud account has a limit on how much compute you can run in a region, expressed as a count of instances or, more often, as total vCPUs (AWS, for example, counts On-Demand quotas in vCPUs per instance family). The autoscaler tries to launch a new node, but the cloud provider rejects the request because one more m5.xlarge would push you past that quota.

    • Diagnosis: Check your cloud provider’s console for "quotas," "limits," or "service quotas." Look for entries related to compute instances (e.g., Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances on AWS, or the regional CPUs quota on GCP) and compare each limit with current usage.
    • Fix: Request a quota increase that covers the instance types the autoscaler is trying to provision. On AWS, this is normally a self-service request in the Service Quotas console, with a support case as the fallback. On GCP, it’s a request from the Quotas page in the console. Example: running 50 m5.xlarge nodes at 4 vCPUs each needs at least 200 vCPUs of quota for that instance family.
    • Why it works: By increasing your allowed capacity, you remove the artificial ceiling that was preventing new nodes from being launched.
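To make the quota arithmetic concrete, here is a minimal sketch that estimates how many more nodes fit under a vCPU-based quota. The function name and the numbers are hypothetical; the real quota and usage figures have to come from your provider’s console or API.

```python
def additional_nodes_allowed(quota_vcpus: int, used_vcpus: int, vcpus_per_node: int) -> int:
    """Return how many more nodes of one size fit under a vCPU-based quota."""
    if vcpus_per_node <= 0:
        raise ValueError("vcpus_per_node must be positive")
    # Headroom cannot be negative, even if usage already exceeds the quota.
    headroom = max(quota_vcpus - used_vcpus, 0)
    return headroom // vcpus_per_node


# Hypothetical numbers: a 64 vCPU quota with 56 vCPUs already in use,
# scaling a node group of 4 vCPU instances (the size of an m5.xlarge).
print(additional_nodes_allowed(quota_vcpus=64, used_vcpus=56, vcpus_per_node=4))  # 2
```

If the result is smaller than the number of nodes the pending pods need, the scale-up will stall at the quota no matter how the node group is configured.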
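  2. Subnet IP Address Exhaustion: The autoscaler is trying to launch a new node into a specific subnet, but that subnet has run out of available IP addresses. Every node needs at least one IP address from the subnet to join the cluster, and with CNI plugins that assign pod IPs from the VPC (such as the AWS VPC CNI), pods draw from the same pool and exhaust it faster.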

    • Diagnosis: In your cloud provider’s console, check the available IP addresses within the subnets your cluster’s node groups are configured to use. For AWS VPC, this is under "Subnets." For GCP, it’s under "VPC network" -> "Subnets."
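    • Fix: Give the node group more address space. On GCP you can expand a subnet’s primary range in place. On AWS a subnet’s CIDR cannot be changed after creation, so the usual route is to create a new, larger subnet (adding a secondary CIDR to the VPC if needed) and add it to your node group configuration. Example: adding a new subnet with CIDR 10.0.2.0/24 or, where in-place expansion is supported, growing 10.0.0.0/24 to 10.0.0.0/23. Note that 10.0.1.0/23 is not a valid block, because a /23 must start on an even third octet.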
    • Why it works: Providing new IP address space allows the cloud provider to assign an IP to the new instance, enabling it to join the network and the cluster.
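The subnet sizing above can be checked with Python’s standard `ipaddress` module. This is a small sketch, assuming the AWS convention of five reserved addresses per subnet (other providers reserve a different number); the helper names are illustrative. It also shows how to find the aligned /23 that contains a given /24.

```python
import ipaddress

# Assumption: AWS reserves 5 addresses in every subnet; adjust for other providers.
RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str, reserved: int = RESERVED_PER_SUBNET) -> int:
    """Return the number of assignable addresses in a subnet."""
    # Strict parsing raises ValueError for blocks with host bits set,
    # such as 10.0.1.0/23.
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - reserved, 0)


def containing_block(cidr: str, new_prefix: int) -> str:
    """Return the aligned, larger block that contains the given subnet."""
    return str(ipaddress.ip_network(cidr).supernet(new_prefix=new_prefix))


print(usable_addresses("10.0.1.0/24"))      # 251
print(containing_block("10.0.1.0/24", 23))  # 10.0.0.0/23
print(usable_addresses("10.0.2.0/23"))      # 507
```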
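  3. Security Group/Network ACL Restrictions: A firewall rule (Security Group in AWS, Firewall Rules in GCP) is blocking the ports the new node needs to communicate with the Kubernetes control plane or other cluster components. In this case the instance usually launches but never registers as a node, so the scale-up eventually times out rather than failing with a capacity error.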

    • Diagnosis: Review the security groups, firewall rules, and network ACLs attached to your EC2 or GCE instances. Ensure that outbound traffic from the new nodes to the control plane’s API server endpoint (TCP 443) is allowed, and that inbound traffic from the control plane to the kubelet (TCP 10250) is permitted. SSH (port 22) is useful for debugging but is not required for a node to join.
    • Fix: Add or modify rules in your security group or network ACLs. For example, in AWS, ensure the node security group allows outbound traffic on port 443 to the API server endpoint and inbound traffic on port 10250 from your control plane’s security group. AWS network ACLs are stateless, so return traffic on ephemeral ports must be allowed explicitly.
    • Why it works: Correctly configured network security allows the new node to establish the vital communication channels required to become a functional part of the cluster.
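A quick way to test the network path from a node is a plain TCP connection attempt. The sketch below uses only the standard library; the function name is illustrative, and a successful connection shows only that the port is reachable, not that TLS or authentication will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run from a node that fails to join, `can_connect("<api-server-host>", 443)` returning False points at a security group, firewall, or routing problem rather than a Kubernetes misconfiguration. The host placeholder is yours to fill in.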
  4. Incorrect Instance Type or Availability Zone: The autoscaler is configured to use an instance type that is unavailable in the selected availability zones, or the availability zone itself is experiencing issues.

    • Diagnosis: Check the cloud provider’s status page for any reported issues in the specific availability zones your cluster is configured to use. Also, verify that the instanceType defined in your autoscaler configuration (or node group) is actually available in those zones. Sometimes, specific instance types are only offered in certain zones.
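    • Fix: Update your node group configuration to use an instance type that is available in your chosen availability zones, or select different availability zones for your node group. Example: Changing instanceType: m5.large to instanceType: c5.large if m5.large is unavailable in us-east-1a but c5.large is. Check that the replacement still satisfies your pods’ resource requests: a c5.large has the same 2 vCPUs as an m5.large but half the memory (4 GiB instead of 8 GiB).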
    • Why it works: By selecting a valid and available instance type in a healthy availability zone, you ensure the cloud provider can physically provision the requested hardware.
  5. Node Group Max Size Reached (and autoscaler misconfiguration): The autoscaler is trying to add nodes to a specific node group, but that node group has already reached its maxSize limit, and there are no other node groups configured to scale into.

    • Diagnosis: Examine the configuration of your node groups. For AWS EKS, this is the maximum size in the node group’s scaling configuration (maxSize). For GKE, it’s the node pool’s autoscaling maximum (--max-nodes in gcloud). Compare the current node count with the maximum for every node group the pending pods could run on.
    • Fix: Increase the maxSize parameter for the relevant node group(s) in your cluster configuration. Example: Changing maxSize: 10 to maxSize: 20 for a specific node group.
    • Why it works: This allows the autoscaler to continue adding nodes beyond the previously defined limit.
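The max-size check is a simple comparison once you have the sizes in hand. This sketch uses a hypothetical `NodeGroup` record with made-up values; in practice the numbers would come from your provider’s API or CLI output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeGroup:
    # Hypothetical record; field names are illustrative, not a provider schema.
    name: str
    current_size: int
    max_size: int


def groups_at_max(groups: List[NodeGroup]) -> List[str]:
    """Return the names of node groups that cannot scale up any further."""
    return [g.name for g in groups if g.current_size >= g.max_size]


groups = [
    NodeGroup("general", current_size=10, max_size=10),
    NodeGroup("memory-optimized", current_size=3, max_size=8),
]
print(groups_at_max(groups))  # ['general']
```

If every group that can host the pending pods appears in the result, raising a maximum is the only way the autoscaler can make progress.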
  6. Cloud Provider API Throttling or Errors: The autoscaler is making too many requests to the cloud provider’s API, and the API is throttling or returning errors. This is less common but can happen during rapid scaling events.

    • Diagnosis: Check the autoscaler’s logs for repeated "throttled" messages or specific API error codes from the cloud provider. Also, check the cloud provider’s activity logs for API call rate limits being hit.
    • Fix: Implement a backoff strategy for the autoscaler (often built-in, but can be tuned) or, if possible, adjust your cloud provider’s API rate limits. Sometimes, simply waiting for a few minutes allows the throttling to reset.
    • Why it works: By reducing the rate of API calls or increasing the allowed rate, you allow the autoscaler to successfully communicate with the cloud provider to provision resources.
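For custom scripts or tooling that call the cloud provider’s API alongside the autoscaler, exponential backoff with jitter is the usual way to stay under rate limits. The sketch below is a generic illustration, not the Cluster Autoscaler’s own implementation; `ThrottledError` is a hypothetical stand-in for whatever rate-limit error your provider’s SDK raises.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn, retrying on ThrottledError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The random jitter spreads retries from many clients over time, so they do not all hit the API again at the same instant. The `sleep` parameter exists only so the delay can be replaced in tests.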

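After fixing these, pods may still sit in Pending. If so, look for FailedScheduling events on the pods and NotTriggerScaleUp events from the autoscaler, which usually mean that maxSize is still too low for the pending pods or that no node group can satisfy their resource requests.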
