Spot VMs are often dismissed as simply "cheap, unreliable VMs," but they run on the same Google Cloud infrastructure as standard VMs. The difference is that they are built from spare capacity that Compute Engine can reclaim, which makes them a good fit for fault-tolerant work while your critical workloads stay on standard VMs.

Let’s see how this works. Imagine you have a batch processing job that needs to run for a few hours, but it’s completely fault-tolerant. You can spin up a fleet of Spot VMs for this.

gcloud compute instances create batch-processor-1 \
  --zone=us-central1-a \
  --machine-type=n2-standard-4 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --provisioning-model=SPOT \
  --boot-disk-size=100GB \
  --tags=batch-processing

This command creates an n2-standard-4 VM with the Spot provisioning model. (The older --preemptible flag creates a legacy preemptible VM instead, which is limited to 24 hours of runtime.) The Spot model tells Compute Engine that it may reclaim this VM whenever the capacity is needed elsewhere. When that happens, the VM receives a preemption notice and has up to 30 seconds to shut down, giving your application a short window to save its state.
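
One way to notice a preemption from inside the VM is to ask the Compute Engine metadata server, which exposes an instance/preempted value (TRUE or FALSE). The following is a minimal sketch, assuming curl is installed on the VM; the function names and the save_state placeholder are hypothetical and stand in for your own logic.

```shell
#!/bin/sh
# Metadata value that becomes TRUE once the instance has been preempted.
PREEMPTED_URL="http://metadata.google.internal/computeMetadata/v1/instance/preempted"

# Hypothetical placeholder: replace with your application's own state saving.
save_state() {
  echo "saving state"
}

# React to the value returned by the metadata server.
on_preempted_value() {
  if [ "$1" = "TRUE" ]; then
    save_state
  else
    echo "still running"
  fi
}

# Block until the value changes, then react.
# This call only works from inside a Compute Engine VM.
watch_for_preemption() {
  value="$(curl -s -H 'Metadata-Flavor: Google' "${PREEMPTED_URL}?wait_for_change=true")"
  on_preempted_value "$value"
}
```

A startup script could run watch_for_preemption in the background so the check does not block the batch job itself.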

The core problem Spot VMs solve is the tension between cost optimization and resource utilization in cloud computing. Standard VMs keep their capacity for as long as they run, and you pay for that guarantee even while other hardware in the same data center sits idle. Spot VMs are built from that idle capacity and sell it at dramatically lower prices (up to 91% off standard pricing) without compromising the quality of the compute: the machine types and performance are the same. What you give up is certainty about when capacity is available and how long you get to keep it.

The mental model to build is one of "eventual compute" rather than "guaranteed compute." Your application needs to be designed to handle interruptions gracefully. This means:

  • Idempotency: Operations should be safe to re-run multiple times without unintended side effects.
  • Checkpointing/State Saving: Regularly save the progress of long-running tasks so they can resume from the last saved point.
  • Fault Tolerance: Design your system so that the failure of one or more nodes doesn’t cascade into a total outage.
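
The checkpointing and idempotency points above can be sketched as a small resumable loop. This is an illustration under stated assumptions, not a production pattern: CHECKPOINT_FILE, process_item, and run_batch are hypothetical names, and on a real Spot VM the checkpoint should live on storage that outlives the instance, such as a Cloud Storage bucket or a persistent disk.

```shell
#!/bin/sh
# Hypothetical checkpoint location; override it via the environment.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-/tmp/batch.checkpoint}"

# Hypothetical unit of work. It must be idempotent, because an item can be
# re-run if the VM is reclaimed before the checkpoint is written.
process_item() {
  echo "processed item $1"
}

# Process items 0..total-1, resuming from the last recorded position.
run_batch() {
  total="$1"
  next=0
  if [ -f "$CHECKPOINT_FILE" ]; then
    next="$(cat "$CHECKPOINT_FILE")"
  fi
  while [ "$next" -lt "$total" ]; do
    process_item "$next"
    next=$((next + 1))
    echo "$next" > "$CHECKPOINT_FILE"   # record the next item to process
  done
}
```

If the job is interrupted and restarted, run_batch picks up at the item recorded in the checkpoint file instead of starting over.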

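For GKE, this translates to using node pools configured with Spot VMs. The sketch below illustrates the relevant node pool settings. Treat it as pseudo-configuration rather than a manifest to apply as-is, since the exact schema depends on the tool you use (gcloud, Terraform, Config Connector, or the GKE API):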

apiVersion: container.googleapis.com/v1
kind: Cluster
metadata:
  name: my-gke-cluster
spec:
  # ... other cluster config ...
  nodePools:
  - name: spot-pool
    initialNodeCount: 3
    autoscaling:
      minNodeCount: 1
      maxNodeCount: 10
    nodeConfig:
      machineType: n2-standard-4
      spot: true # This is the key for Spot VMs (preemptible: true would create legacy preemptible nodes)
      diskSizeGb: 100
      oauthScopes:
        - https://www.googleapis.com/auth/cloud-platform
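
With the gcloud CLI, a comparable node pool can be created with the --spot flag. The sketch below wraps the command in a function so that nothing runs until you call it; create_spot_pool is a hypothetical name, and the cluster name, zone, and sizes are simply carried over from the examples in this article.

```shell
#!/bin/sh
# Sketch only: adjust the cluster, zone, and sizing for your environment.
create_spot_pool() {
  gcloud container node-pools create spot-pool \
    --cluster=my-gke-cluster \
    --zone=us-central1-a \
    --machine-type=n2-standard-4 \
    --disk-size=100 \
    --num-nodes=3 \
    --enable-autoscaling --min-nodes=1 --max-nodes=10 \
    --spot
}
```

GKE labels the nodes in a Spot pool with cloud.google.com/gke-spot=true, so after calling create_spot_pool you can list them with kubectl get nodes -l cloud.google.com/gke-spot=true.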

When you deploy your workloads, you can use node selectors or taints/tolerations to ensure that your fault-tolerant applications land on these cost-effective Spot nodes. For example, to target pods to the spot-pool:

apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  containers:
  - name: worker
    image: your-batch-image
  nodeSelector:
    cloud.google.com/gke-nodepool: spot-pool # Matches the node pool name

Or using taints and tolerations if you’ve tainted the Spot nodes:

# On your Spot Node Pool definition in GKE:
# ...
#   nodeConfig:
#     machineType: n2-standard-4
#     spot: true
#     taints:
#       - key: "preemptible"
#         value: "true"
#         effect: "NoSchedule"

# On your Pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  containers:
  - name: worker
    image: your-batch-image
  tolerations:
  - key: "preemptible"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

The most surprising aspect for many is how little actual disruption preemption causes for well-architected applications. Spot VM interruptions are triggered by capacity needs elsewhere in the system, not by the health of an individual VM. This means a Spot VM behaves just like a standard VM of the same machine type until its capacity is reclaimed. The 30-second warning is a grace period for cleanup, not a sign that the hardware is faulty. Many jobs can finish their last unit of work or save their state within this window.
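
To make use of that window, a worker can trap the termination signal and write out its state before exiting. The following is a minimal sketch, assuming the worker process receives SIGTERM when the VM or node starts shutting down (as Kubernetes does when it terminates a pod); STATE_FILE, on_term, and worker_loop are hypothetical names.

```shell
#!/bin/sh
# Hypothetical state file; point it at durable storage in real use.
STATE_FILE="${STATE_FILE:-/tmp/worker.state}"
progress=0

# Signal handler: persist the current progress, then exit cleanly.
on_term() {
  echo "$progress" > "$STATE_FILE"
  exit 0
}

# Main loop: install the handler, then do one unit of work per iteration.
worker_loop() {
  trap on_term TERM
  while :; do
    progress=$((progress + 1))   # stand-in for real work
    sleep 1
  done
}
```

Keeping each unit of work short matters here: the handler only runs between commands, so a single long step could use up the whole grace period.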

The next step is understanding how to manage the lifecycle of these Spot VMs, especially in the context of autoscaling and ensuring your critical workloads are not unduly affected by the inherent variability.
