The most surprising thing about GKE node pool upgrades is that they don’t have to be disruptive at all, even for massive clusters.

Let’s see what a blue-green upgrade looks like in practice. Imagine we have a node pool named default-pool with 10 nodes, running version 1.27.8-gke.1037. We want to upgrade it to 1.28.2-gke.1202.

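First, we create a new node pool, the "green" pool, with the target version and the same configuration as the existing "blue" pool. The cluster’s control plane must already be running 1.28.2-gke.1202 or later, because GKE does not allow a node pool to run a newer version than the control plane. We also set a surge value on the new pool; it does not speed up creation, but it governs how the pool’s own future upgrades roll out.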

gcloud container node-pools create green-pool \
  --cluster=my-cluster \
  --node-locations=us-central1-a \
  --machine-type=e2-medium \
  --num-nodes=0 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=10 \
  --max-surge-upgrade=5 \
  --node-labels=upgrade-stage=green \
  --node-version=1.28.2-gke.1202 \
  --enable-autorepair \
  --enable-autoupgrade \
  --project=my-gcp-project

Notice num-nodes=0 initially. We’ll let the autoscaler handle bringing it up to the desired capacity after we start migrating workloads.

Now, the magic happens. For comparison, here is what happens when we ask GKE to upgrade the existing "blue" pool itself: it cordons each old node, meaning the node stops accepting new pods, and then drains it.

gcloud container clusters upgrade my-cluster \
  --node-pool=default-pool \
  --cluster-version=1.28.2-gke.1202 \
  --async

This command initiates the in-place upgrade of the existing node pool. With the default surge settings (max-surge of 1, max-unavailable of 0), GKE replaces nodes one at a time; larger values let it work through the pool in bigger batches. In a blue-green strategy, by contrast, we let the new pool’s autoscaler bring green capacity online while the old pool is being drained.

Once the gcloud container clusters upgrade command has been accepted (with --async it returns immediately instead of waiting for the operation to finish), GKE begins upgrading default-pool. It creates new nodes with the target version and deletes old nodes once their pods have been evicted and rescheduled. Crucially, it manages the transition to keep disruption to a minimum.

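The key is how GKE handles pod disruption. By default, GKE’s upgrade process respects PodDisruptionBudgets (PDBs). If you have a PDB requiring at least 80% of your application pods to be available, GKE will not evict more pods than that budget allows while it drains a node. The protection is not unlimited: during surge upgrades GKE honors a PDB for up to one hour while draining a node, after which the remaining pods are removed anyway. Within that window, the PDB keeps the upgrade from taking your application offline.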
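To make the 80% example concrete, here is a minimal sketch of a helper that prints a matching PodDisruptionBudget manifest. The function name pdb_manifest, the app name web, and the app label key are assumptions made for illustration; none of them come from the cluster described above.

```shell
# Print a PodDisruptionBudget manifest for one application.
# Assumption: the application's pods carry the label app=<name>.
pdb_manifest() {
  local app="$1" min_available="$2"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${app}-pdb
spec:
  minAvailable: ${min_available}
  selector:
    matchLabels:
      app: ${app}
EOF
}

# Print the manifest; pipe it to "kubectl apply -f -" to create it in a cluster.
pdb_manifest web 80%
```

With 10 replicas, minAvailable: 80% means 8 pods must stay available, so a drain can evict at most 2 of them at a time.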

Let’s refine the blue-green approach. Instead of an in-place upgrade, we explicitly create the new pool first, then migrate workloads.

  1. Create the new "green" node pool:

    gcloud container node-pools create green-pool \
      --cluster=my-cluster \
      --node-locations=us-central1-a \
      --machine-type=e2-medium \
      --num-nodes=5 \
      --enable-autoscaling \
      --min-nodes=5 \
      --max-nodes=10 \
      --node-labels=upgrade-stage=green \
      --node-version=1.28.2-gke.1202 \
      --enable-autorepair \
      --enable-autoupgrade \
      --project=my-gcp-project
    

    Here, we start with num-nodes=5 to get some capacity online.

  2. Gradually shift workloads: You’d use nodeSelector or nodeAffinity in your Kubernetes Deployments or StatefulSets to direct new pods to the green-pool.

    Example nodeAffinity in a Deployment spec:

    spec:
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: upgrade-stage
                    operator: In
                    values:
                    - green
    

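    Applying this changes the pod template, so each Deployment performs a rolling update and its replacement pods are scheduled onto the green-pool. Pods that are already running are not moved by the affinity rule itself; they are replaced as the rollout proceeds.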

  3. Drain the old "blue" pool: Once you’re confident that your workloads are running on the new pool, cordon and drain any blue nodes that still host pods, then scale the old default-pool down to zero and delete it. Node pools are resized with gcloud container clusters resize.

    # Scale down the old pool
    gcloud container clusters resize my-cluster \
      --node-pool=default-pool \
      --num-nodes=0 \
      --project=my-gcp-project
    
    # Delete the old pool (after ensuring all workloads are migrated and it's empty)
    gcloud container node-pools delete default-pool \
      --cluster=my-cluster \
      --project=my-gcp-project
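
Before the blue pool is scaled to zero in step 3, its nodes can be cordoned and drained so that pods are evicted gracefully and PodDisruptionBudgets are honored. The following is a sketch under stated assumptions, not a definitive procedure: the function name drain_pool and the 10 minute timeout are illustrative choices, and the script relies on the cloud.google.com/gke-nodepool label that GKE sets on each node.

```shell
# Cordon every node in a pool, then drain the nodes one by one.
# Assumption: kubectl is already pointed at the right cluster.
drain_pool() {
  local pool="$1"
  local selector="cloud.google.com/gke-nodepool=${pool}"
  local nodes node
  nodes="$(kubectl get nodes -l "${selector}" -o name)"

  # Cordon all nodes first so evicted pods cannot land on another old node.
  for node in ${nodes}; do
    kubectl cordon "${node}"
  done

  # Drain each node; the timeout is an illustrative value, tune it to your workloads.
  for node in ${nodes}; do
    kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  done
}

# Usage (not executed here): drain_pool default-pool
```

Cordoning the whole pool before draining any node avoids pods being rescheduled onto a blue node that is about to be drained itself.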
    

The "surge" aspect comes into play if you were doing an in-place upgrade using gcloud container clusters upgrade --node-pool=.... In that scenario, GKE would create max-surge additional nodes temporarily before deleting the old ones, ensuring that the total node count doesn’t drop below your desired capacity during the upgrade. For a manual blue-green, you control the capacity by managing the num-nodes on both pools.

The actual mechanism GKE uses for in-place upgrades is to create new nodes with the target version, cordon and drain the old nodes, and then delete them. It ensures that the number of nodes available for scheduling never drops below total_nodes - max_unavailable. max-surge dictates how many extra nodes GKE can create beyond the desired capacity to facilitate the transition.
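The arithmetic behind these two settings is easy to check by hand. The sketch below uses a hypothetical helper, surge_plan, and illustrative values (10 nodes, max-surge of 2, max-unavailable of 1); it prints the peak node count, the minimum number of schedulable nodes, and how many nodes GKE can upgrade at once, which is the sum of the two settings.

```shell
# Compute the capacity envelope of a surge upgrade.
# Arguments: total nodes, max-surge, max-unavailable (all integers).
surge_plan() {
  local total="$1" max_surge="$2" max_unavailable="$3"
  echo "peak_nodes=$((total + max_surge))"
  echo "min_available=$((total - max_unavailable))"
  echo "parallel_upgrades=$((max_surge + max_unavailable))"
}

surge_plan 10 2 1
```

For these inputs the pool briefly grows to 12 nodes, never has fewer than 9 nodes available for scheduling, and has up to 3 nodes in flight at a time.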

The most counterintuitive part of GKE node pool upgrades is how seamlessly they can be managed with ordinary Kubernetes primitives like PodDisruptionBudgets and node selectors, even for very large, production-critical workloads. The system doesn’t just replace nodes; it orchestrates a controlled migration of workloads.

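Once you’ve successfully migrated to your new node pool and deleted the old one, the next logical step is to explore workload-specific configurations like taints and tolerations for more granular pod scheduling. Note that the control plane upgrade cannot be left as a follow-up: GKE requires the control plane version to be at or above the version of every node pool, so it has to be upgraded before the green pool is created.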
