Upgrading GKE clusters to new Kubernetes versions is less about compatibility risk and more about managing the blast radius of your deployment.

Let’s see what happens when we upgrade a cluster.

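Imagine a simple GKE cluster with a few deployments running. The control plane (managed by Google) gets upgraded first. This is relatively low-risk because GKE handles it: on a regional cluster the control plane replicas are upgraded one at a time so the API stays reachable, while on a zonal cluster the API server is briefly unavailable but running workloads keep running. The real action is when your worker nodes get upgraded. GKE uses surge upgrades by default, which is a rolling strategy. With the default settings (max surge of 1, max unavailable of 0) it creates one new node on the target version, cordons and drains one old node (evicting its pods gracefully), and then deletes the old node. Then it moves to the next.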
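How many nodes are replaced at once is controlled by the node pool's surge settings. The sketch below is a rough illustration, not GKE's own logic: `upgrade_rounds` and `surge_settings_cmd` are hypothetical helper names, and the estimate assumes GKE replaces up to (max surge + max unavailable) nodes at a time and ignores per-zone behavior in regional clusters.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many rounds a node pool upgrade needs.
# Assumption: up to (max_surge + max_unavailable) nodes are replaced per round.
upgrade_rounds() {
  local nodes="$1" max_surge="$2" max_unavailable="$3"
  local parallel=$(( max_surge + max_unavailable ))
  if [ "${parallel}" -le 0 ]; then
    echo "max_surge + max_unavailable must be at least 1" >&2
    return 1
  fi
  # Ceiling division: nodes / parallel, rounded up.
  echo $(( (nodes + parallel - 1) / parallel ))
}

# Hypothetical helper: print the gcloud command that would apply the settings.
# Cluster and pool names are placeholders supplied by the caller.
surge_settings_cmd() {
  local cluster="$1" pool="$2" max_surge="$3" max_unavailable="$4"
  echo "gcloud container node-pools update ${pool} --cluster=${cluster} --max-surge-upgrade=${max_surge} --max-unavailable-upgrade=${max_unavailable}"
}
```

For example, `upgrade_rounds 6 1 0` estimates six rounds for a six-node pool with the default settings, while `upgrade_rounds 6 2 1` estimates two. A larger surge finishes sooner but needs more spare quota and disrupts more pods at once.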

Here’s a peek at a typical node pool upgrade in gcloud:

gcloud container clusters upgrade <CLUSTER_NAME> \
    --node-pool=<NODE_POOL_NAME> \
    --cluster-version=<TARGET_K8S_VERSION> \
    --location=<ZONE_OR_REGION> \
    --async

This command initiates the rolling upgrade. You can monitor its progress with gcloud container operations list.
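A minimal sketch of the monitoring step follows. The function names and the `run` dry-run wrapper are illustrative assumptions. The filter fields (`operationType`, `status`) are assumed to match what `gcloud container operations list` prints, so check them against your own output first.

```shell
#!/bin/sh
# Print the command (DRY_RUN=1, the default here) or execute it (DRY_RUN=0).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List node-upgrade operations that are still running in a location.
list_running_upgrades() {
  local location="$1"
  run gcloud container operations list \
    --location="${location}" \
    --filter="operationType=UPGRADE_NODES AND status=RUNNING"
}

# Block until the given operation finishes.
wait_for_operation() {
  local operation_id="$1" location="$2"
  run gcloud container operations wait "${operation_id}" --location="${location}"
}
```

With the default `DRY_RUN=1` the functions only print what they would run, so you can review the commands before setting `DRY_RUN=0`.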

The core problem GKE solves with its upgrade process is maintaining application availability during the upgrade. It achieves this by:

  • Graceful Node Draining: Before a node is upgraded, GKE evicts the pods running on it. This means pods are given a chance to shut down cleanly, finish in-flight requests, and be rescheduled onto other healthy nodes.
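  • Rolling Updates: Nodes are replaced one at a time by default (the node pool’s surge settings control the batch size). This ensures that a significant portion of your cluster remains available throughout the process.
  • PodDisruptionBudgets (PDBs): If you’ve configured PDBs, GKE respects them while draining, for up to one hour per node. A PDB specifies either the minimum number of matching pods that must stay available (minAvailable) or the maximum number that may be unavailable (maxUnavailable). GKE won’t evict a pod if doing so would violate a PDB; only after the one-hour limit does it remove the remaining pods regardless.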
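To see the draining behavior described above for yourself, you can cordon and drain a node by hand with kubectl, which uses the same eviction mechanism and therefore also honors PDBs. This is a sketch under assumptions: `drain_node`, the `run` wrapper, and the 300s timeout are illustrative choices, not values GKE uses.

```shell
#!/bin/sh
# Print the command (DRY_RUN=1, the default here) or execute it (DRY_RUN=0).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Mark a node unschedulable, then evict its pods.
# The timeout is an illustrative default; pass a second argument to change it.
drain_node() {
  local node="$1" timeout="${2:-300s}"
  # Stop new pods from being scheduled onto the node.
  run kubectl cordon "${node}"
  # Evict pods; DaemonSet pods are skipped and emptyDir data is discarded.
  run kubectl drain "${node}" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout="${timeout}"
}
```

Trying this on a non-production node before an upgrade shows how long your pods take to terminate and whether any PDB blocks the drain.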

Consider this Deployment manifest with a PDB:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2 # At least 2 pods must be available
  selector:
    matchLabels:
      app: my-app

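With this PDB, a my-app pod can be evicted only if at least two my-app pods remain available across the cluster afterward. With all three replicas healthy, that means one pod can be evicted at a time. If only two pods are available and one is on the node being drained, the eviction is refused and the drain waits until a third pod becomes ready elsewhere or the PDB is relaxed. GKE keeps retrying for up to one hour before it stops honoring the PDB.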
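The arithmetic behind a minAvailable budget is simple enough to sketch. `allowed_disruptions` is a hypothetical helper for illustration; on a live cluster the same number is reported in the PDB's `status.disruptionsAllowed` field.

```shell
#!/bin/sh
# Hypothetical helper: how many pods a minAvailable PDB lets you evict
# right now, given the number of currently healthy matching pods.
allowed_disruptions() {
  local healthy="$1" min_available="$2"
  local allowed=$(( healthy - min_available ))
  # The budget never goes negative; zero means every eviction is refused.
  if [ "${allowed}" -lt 0 ]; then
    allowed=0
  fi
  echo "${allowed}"
}

# On a live cluster, read the value Kubernetes computed for the PDB above:
#   kubectl get pdb my-app-pdb -o jsonpath='{.status.disruptionsAllowed}'
```

For the manifest above, `allowed_disruptions 3 2` gives 1, and `allowed_disruptions 2 2` gives 0, which is the blocked-drain case.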

The mental model for safe upgrades hinges on understanding the control plane vs. worker node lifecycle and how GKE orchestrates graceful disruption. The control plane upgrade is handled entirely by Google. Worker node upgrades are iterative: GKE replaces nodes in small batches, either automatically or when you trigger an upgrade, and you control the pace through the node pool’s surge settings, with PDBs acting as guardrails.

The real "gotcha" in GKE upgrades, beyond basic PDB configuration, is understanding how volume attachment/detachment interacts with node draining. When a pod that uses a PersistentVolumeClaim is evicted, its persistent volume is detached from the old node and must be attached to whichever node the replacement pod lands on. If your application relies on a specific volume being attached to a specific node for performance or stateful reasons (though this is generally an anti-pattern in Kubernetes), or if the detach/attach process itself is slow, you might see longer application downtime than expected, because the replacement pod cannot start until its volume is attached. There is also a hard limit on how long a drain can take: during surge upgrades GKE honors PDBs and each pod’s terminationGracePeriodSeconds for at most one hour. Pods still running after that are forcefully terminated, possibly before they can save state or complete operations.
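If you suspect slow volume handling during a drain, a few read-only kubectl queries can help confirm it. This is a sketch under assumptions: the function names and the `run` wrapper are illustrative, and `FailedAttachVolume` is the event reason I would expect to see for attach problems, so your cluster may surface other reasons as well.

```shell
#!/bin/sh
# Print the command (DRY_RUN=1, the default here) or execute it (DRY_RUN=0).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show which volumes are attached to which nodes.
show_volume_attachments() {
  run kubectl get volumeattachments
}

# Show attach-failure events in a namespace (defaults to "default").
show_attach_failures() {
  local namespace="${1:-default}"
  run kubectl get events --namespace "${namespace}" \
    --field-selector reason=FailedAttachVolume
}
```

A replacement pod that sits in ContainerCreating while these events pile up is usually waiting on its volume rather than on the scheduler.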

Always ensure your applications can handle pod restarts gracefully and that your PDBs are configured correctly to reflect your actual availability requirements.
