A Helm release stays stuck in pending-upgrade when an upgrade starts but never completes, typically because the Kubernetes API server rejects the changes Helm is trying to apply or the resources it creates never become ready.

Common Causes and Fixes

1. Admission Controller Denials

  • Diagnosis: Check kubectl get events -n <namespace> for events with Type: Warning and Reason: FailedCreate or FailedUpdate related to your release’s pods, deployments, or other resources. Look for messages indicating an admission controller rejected the request.
  • Cause: A validating or mutating admission webhook (e.g., OPA Gatekeeper, Kyverno, or a custom controller) is blocking the changes. This often happens if the webhook’s logic has changed or if there’s a misconfiguration in the webhook itself or the policies it enforces.
  • Fix:
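    • Temporarily disable the problematic admission controller. For OPA Gatekeeper, this might involve scaling down the gatekeeper-controller-manager deployment or removing its webhook configurations. For Kyverno, scale down the Kyverno admission controller deployment. Check the webhook’s failurePolicy first: if it is Fail, requests are rejected while the backend is unavailable, so scaling down alone will not help and the webhook configuration must be removed or patched as well. Re-enable the controller once the upgrade completes.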
    • Why it works: Admission controllers intercept API requests. Disabling them allows the Helm changes to reach the API server without being blocked, letting the upgrade proceed.
  • Diagnosis: Examine the logs of the admission controller pod itself (e.g., kubectl logs -n gatekeeper-system gatekeeper-controller-manager-xxxx -c manager, assuming Gatekeeper is installed in its default namespace). Look for specific error messages when the Helm release’s resources are being created or updated.
  • Fix: Correct the policy or configuration within the admission controller that is causing the rejection. For example, if a policy requires a specific label that Helm isn’t adding, update the policy to be more permissive or adjust your Helm chart to include the label.
    • Why it works: By fixing the underlying policy violation, the admission controller will no longer have a reason to reject the API requests from Helm.
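
The label example above can be checked before the upgrade is attempted. The following Python sketch is illustrative only: it assumes the chart has already been rendered (for example with helm template) and parsed into dictionaries, and the function name, the sample manifests, and the "team" label are hypothetical rather than part of Helm, Gatekeeper, or Kyverno.

```python
def resources_missing_labels(manifests, required_labels):
    """Return (kind, name, missing_labels) for each manifest lacking a required label."""
    findings = []
    for manifest in manifests:
        metadata = manifest.get("metadata", {})
        labels = metadata.get("labels") or {}
        missing = sorted(set(required_labels) - set(labels))
        if missing:
            findings.append((manifest.get("kind"), metadata.get("name"), missing))
    return findings


# Hypothetical rendered manifests; "team" stands in for whatever label the policy demands.
rendered = [
    {"kind": "Deployment", "metadata": {"name": "web", "labels": {"app": "web"}}},
    {"kind": "Service", "metadata": {"name": "web", "labels": {"app": "web", "team": "shop"}}},
]

for kind, name, missing in resources_missing_labels(rendered, ["team"]):
    print(f"{kind}/{name} is missing labels: {', '.join(missing)}")
```

A check like this only mirrors one simple kind of policy. The admission controller's own logs and audit results remain the authoritative source for why a request was rejected.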

2. Resource Quota Exceeded

  • Diagnosis: Run kubectl describe resourcequota -n <namespace> (quota is the short name for the same resource) to see if any quotas (CPU, memory, pods, persistent volume claims, etc.) are at or near their limit.
  • Cause: The namespace has resource quotas defined, and the upgrade would exceed one or more of these limits. For example, if you’re trying to add more pods than the pods quota allows, the API server will reject the creation of new pods.
  • Fix:
    • Increase the existing resource quota. For example, run kubectl edit resourcequota <quota-name> -n <namespace> and increase the spec.hard.pods value. Adding a second quota does not help, because every ResourceQuota in the namespace is enforced.
    • Why it works: This allows the namespace to accommodate the additional resources required by the upgraded release.
  • Diagnosis: Check the usage reported by kubectl describe resourcequota <quota-name> -n <namespace>. If the Used count is close to Hard limit for a resource like pods, cpu, or memory, this is likely the issue.
  • Fix: If increasing quotas isn’t an option, identify and delete unused resources in the namespace that are consuming quota.
    • Why it works: Freeing up existing resource allocations makes room for the new resources required by the upgrade.
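
The Used versus Hard comparison above is simple arithmetic, and it can help to work out in advance how much headroom an upgrade needs. The Python sketch below is a simplified illustration, not how the API server implements quota admission: the function names are hypothetical, the quantity parser handles only common suffixes, and the sample numbers are made up.

```python
from decimal import Decimal

# Common Kubernetes quantity suffixes (a simplified subset, not the full grammar).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"), "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
}


def parse_quantity(text):
    """Convert a quantity string such as "500m" or "2Gi" to a Decimal."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):  # try "Mi" before "m" or "M"
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def quota_shortfalls(hard, used, requested):
    """Return {resource: amount over the limit} for resources the upgrade would exceed."""
    over = {}
    for name, extra in requested.items():
        if name not in hard:
            continue  # this quota does not limit the resource
        total = parse_quantity(used.get(name, "0")) + parse_quantity(extra)
        limit = parse_quantity(hard[name])
        if total > limit:
            over[name] = total - limit
    return over


# Made-up values in the shape of a ResourceQuota's status.hard and status.used.
hard = {"pods": "10", "requests.cpu": "4"}
used = {"pods": "9", "requests.cpu": "3500m"}
# Extra resources the upgrade needs at its peak, e.g. surge pods during a rollout.
requested = {"pods": "2", "requests.cpu": "250m"}

print(quota_shortfalls(hard, used, requested))
```

Remember that a rolling update can briefly run old and new pods side by side, so the peak demand during an upgrade may be higher than the steady-state demand after it.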

3. Pod Disruption Budgets (PDBs)

  • Diagnosis: Check kubectl get pdb -n <namespace> for any PDBs that might be too restrictive. Look at the MIN AVAILABLE, MAX UNAVAILABLE, and ALLOWED DISRUPTIONS columns and compare them to the number of pods the PDB selects.
  • Cause: A Pod Disruption Budget is in place, and the upgrade depends on evicting pods in a way that would violate the PDB’s availability guarantees. For example, if a PDB requires at least 2 pods to be available and an eviction would bring the count down to 1, the eviction will be blocked. Note that PDBs only limit voluntary evictions made through the Eviction API (such as node drains); a Deployment’s or StatefulSet’s own rolling update is not limited by them, so this cause mainly applies when the upgrade coincides with or relies on such evictions.
  • Fix:
    • Temporarily adjust the PDB to allow for the planned disruption. For instance, kubectl edit pdb <pdb-name> -n <namespace> and lower spec.minAvailable or increase spec.maxUnavailable.
    • Why it works: This permits the eviction of existing pods to make way for new ones, allowing the upgrade to proceed.
  • Diagnosis: Use kubectl describe pdb <pdb-name> -n <namespace> to see the Allowed disruptions count. If this count is 0, the PDB is preventing disruptions.
  • Fix: If you cannot alter the PDB, you may need to manually coordinate the upgrade to ensure the PDB’s conditions are met during the transition. This often involves rolling updates with careful pod management.
    • Why it works: By manually managing the lifecycle of pods during the upgrade, you ensure that the PDB’s availability requirements are never breached.
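
The allowed-disruptions count reported for a PDB follows from a small calculation, which is useful when deciding how far to relax minAvailable or maxUnavailable. The Python sketch below approximates that calculation under stated assumptions: the function names are hypothetical, percentages are rounded up as the Kubernetes documentation describes, and the pod counts are supplied by hand rather than read from a cluster.

```python
def _scaled(value, total):
    """Resolve an integer or a percentage string such as "50%" against total, rounding up."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * total + 99) // 100  # integer ceiling division
    return int(value)


def disruptions_allowed(expected_pods, current_healthy, min_available=None, max_unavailable=None):
    """Approximate how many pods may be evicted right now without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _scaled(min_available, expected_pods)
    else:
        desired_healthy = max(0, expected_pods - _scaled(max_unavailable, expected_pods))
    return max(0, current_healthy - desired_healthy)


# The example from the text: 2 healthy pods and minAvailable: 2 leaves no room for eviction.
print(disruptions_allowed(expected_pods=2, current_healthy=2, min_available=2))
# Adding a third replica (or lowering minAvailable to 1) frees up one disruption.
print(disruptions_allowed(expected_pods=3, current_healthy=3, min_available=2))
```

This also shows why scaling the workload up is sometimes a gentler fix than editing the PDB: the budget stays the same while the number of healthy pods rises above it.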

4. Persistent Volume Claim (PVC) Issues

  • Diagnosis: Check kubectl get pvc -n <namespace> for any PVCs in a Pending state or with errors.
  • Cause: The underlying storage provisioner is failing to create a Persistent Volume for a PVC required by the upgraded release, or a PVC is stuck in a state that prevents its attachment to a new pod. This could be due to storage class configuration errors, storage backend issues, or insufficient capacity.
  • Fix:
    • Examine the events for the PVC: kubectl describe pvc <pvc-name> -n <namespace>. Look for messages from the storage provisioner.
    • Why it works: Understanding the specific error from the provisioner is the first step to diagnosing and fixing the storage issue.
  • Fix: Ensure the StorageClass referenced by the PVC exists and is correctly configured. If using dynamic provisioning, verify the provisioner (e.g., CSI driver) is healthy and has access to the storage backend. If it’s a static provisioning issue, ensure the PV is available and has matching access modes and capacity.
    • Why it works: A correctly configured and available StorageClass or PV is essential for the PVC to bind and be used by the application.
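
For the static provisioning case, the matching conditions mentioned above can be written down as a short checklist. The Python sketch below is a simplified illustration with hypothetical function names: it assumes the PVC and PV were fetched as dictionaries (for example from kubectl get ... -o json), handles only binary storage suffixes, and ignores selectors, volume modes, and other binding rules.

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity):
    """Convert a storage quantity such as "10Gi" to bytes (binary suffixes or plain bytes only)."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def binding_problems(pvc, pv):
    """List the reasons this PV cannot satisfy this PVC (simplified checks only)."""
    problems = []
    phase = pv["status"]["phase"]
    if phase != "Available":
        problems.append(f"PV phase is {phase}, not Available")
    wanted = set(pvc["spec"]["accessModes"])
    offered = set(pv["spec"]["accessModes"])
    if not wanted <= offered:
        problems.append(f"PV lacks access modes: {sorted(wanted - offered)}")
    requested = pvc["spec"]["resources"]["requests"]["storage"]
    capacity = pv["spec"]["capacity"]["storage"]
    if to_bytes(capacity) < to_bytes(requested):
        problems.append(f"PV capacity {capacity} is smaller than the requested {requested}")
    if (pvc["spec"].get("storageClassName") or "") != (pv["spec"].get("storageClassName") or ""):
        problems.append("storageClassName differs between PVC and PV")
    return problems


# Hypothetical objects: the claim asks for 10Gi but the volume only offers 5Gi.
pvc = {"spec": {"accessModes": ["ReadWriteOnce"], "storageClassName": "fast",
                "resources": {"requests": {"storage": "10Gi"}}}}
pv = {"spec": {"accessModes": ["ReadWriteOnce"], "storageClassName": "fast",
               "capacity": {"storage": "5Gi"}},
      "status": {"phase": "Available"}}

for problem in binding_problems(pvc, pv):
    print(problem)
```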

5. Node Taints and Tolerations

  • Diagnosis: Check which nodes have taints with kubectl describe nodes (the default kubectl get nodes output does not show taints). Then, check the spec of the pods that are stuck in Pending (e.g., kubectl get pods -n <namespace> -o yaml) to see if they have the necessary tolerations.
  • Cause: The nodes that could otherwise run the new pods have taints, but the pods themselves do not have corresponding tolerations. This prevents the scheduler from placing pods on those tainted nodes.
  • Fix: Add the appropriate tolerations to the pod’s spec in your Helm chart or directly to the resource definition. For example, to tolerate the legacy master taint (newer clusters use the key node-role.kubernetes.io/control-plane instead), add:
    tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
    
    • Why it works: Tolerations allow pods to be scheduled onto nodes that have matching taints, overriding the default scheduling behavior.
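
The matching rule between a toleration and a taint is easy to get subtly wrong, so it can be worth spelling out. The Python sketch below follows the matching rules as described in the Kubernetes documentation, but it is an illustration with hypothetical function names, not the scheduler's code, and the "dedicated=gpu" taint is a made-up example.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False  # an empty effect matches every effect
    if not key:
        return operator == "Exists"  # an empty key with Exists matches every taint key
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(tolerations, taints):
    """Return the taints that would keep the pod off the node."""
    blocking = ("NoSchedule", "NoExecute")  # PreferNoSchedule is only a soft preference
    return [
        taint for taint in taints
        if taint.get("effect") in blocking
        and not any(tolerates(toleration, taint) for toleration in tolerations)
    ]


# The toleration from the example above, checked against a node with two taints.
tolerations = [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]
node_taints = [
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},  # hypothetical taint
]

print(untolerated_taints(tolerations, node_taints))
```

Here the pod would still be kept off the node by the second taint, which is a common reason a toleration that looks correct does not unblock scheduling.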

6. Network Policy Restrictions

  • Diagnosis: Check kubectl get networkpolicy -n <namespace>. If network policies are in place, examine them to see if they might be blocking essential communication required for pod startup or health checks.
  • Cause: Network policies are preventing the newly created pods from communicating with essential services (like the Kubernetes API server, DNS, or other necessary pods) during their startup phase, which in turn keeps their readiness or liveness checks from passing. The probes themselves are run by the kubelet on each pod’s node, not by the control plane.
  • Fix: Temporarily disable or modify the network policy that is causing the blockage. For example, save the policy with kubectl get networkpolicy <policy-name> -n <namespace> -o yaml so it can be restored later, then run kubectl delete networkpolicy <policy-name> -n <namespace>.
    • Why it works: Removing the restrictive network policy allows the necessary communication channels to open, enabling pod startup and health checks.

After resolving these issues, the next error you might encounter is ErrImagePull or ImagePullBackOff if the image specified in the chart is unavailable or misspelled.
