Live Migration is how Google Cloud Platform (GCP) keeps your virtual machines (VMs) running when the underlying physical hardware needs maintenance, without you noticing a blip.
Let’s see it in action. Imagine you have an n1-standard-1 VM named my-live-migration-test running in GCP. You’ve set up a simple web server on it, maybe Apache, serving a page that shows the VM’s hostname and the time the page was written.
# On your VM:
sudo apt update && sudo apt install -y apache2
echo "<h1>Hello from $(hostname)! Current time: $(date)</h1>" | sudo tee /var/www/html/index.html
Now, you want to see a host maintenance event. Real maintenance happens on GCP’s schedule, but for testing you can trigger a simulated event on a specific VM with gcloud compute instances simulate-maintenance-event my-live-migration-test --zone=YOUR_ZONE. Whether the event is real or simulated, a VM whose host maintenance policy is set to migrate is moved to another host with Live Migration.
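One way to watch for a migration is to poll the page from a machine outside the VM and look for gaps. The sketch below is only an illustration: TARGET_URL is a placeholder address, the function names are made up for this example, and the operation type in list_migrations is the one I believe Compute Engine uses for live migrations, so check it against your own operation log.

```shell
#!/usr/bin/env bash
# Placeholder address (documentation range); replace with your VM's external IP.
TARGET_URL="${TARGET_URL:-http://203.0.113.10/}"

# Print "<epoch-seconds> ok" or "<epoch-seconds> fail" for a single request.
probe_once() {
  local url="$1"
  if curl --silent --fail --max-time 1 --output /dev/null "$url"; then
    echo "$(date +%s) ok"
  else
    echo "$(date +%s) fail"
  fi
}

# Probe roughly five times a second until interrupted with Ctrl-C.
probe_forever() {
  while true; do
    probe_once "$1"
    sleep 0.2
  done
}

# Read probe lines on stdin and print the longest run of consecutive failures.
longest_failure_run() {
  awk '$2 == "fail" { run++; if (run > max) max = run; next }
       { run = 0 }
       END { print max + 0 }'
}

# List past live-migration operations in the current project.
# Assumption: the operationType value matches what your project logs.
list_migrations() {
  gcloud compute operations list \
    --filter="operationType=compute.instances.migrateOnHostMaintenance"
}

# Usage:
#   probe_forever "$TARGET_URL" | tee probe.log
#   longest_failure_run < probe.log
```

If the migration behaves as described here, the longest failure run should be zero or very small, and list_migrations should show an entry around the same time.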
Here’s what happens under the hood:
The core problem Live Migration solves is avoiding VM downtime during hardware maintenance. Instead of shutting down the VM, moving it to another host, and then starting it back up (which would cause a noticeable interruption), Live Migration moves the entire running state of the VM (memory contents, CPU state, network connections) from one physical server to another, with minimal packet loss and no perceived interruption to the user or application.
When GCP schedules host maintenance, it identifies VMs running on that host that are eligible for Live Migration. It then selects a suitable destination host in the same zone. The process begins by establishing a connection between the source and destination hosts.
The source host starts copying the VM’s memory pages to the destination host. This is a staged process. Initially, it copies all memory pages. As the VM continues to run on the source host, some memory pages will change. These "dirty" pages are then copied over in subsequent rounds. The goal is to minimize the amount of memory that needs to be transferred during the final switchover.
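The round-by-round shrinking is easier to see with numbers. Here is a toy model, not Google’s actual algorithm: it assumes that while a round copies some number of pages, the VM dirties a fixed percentage of that number, and those dirty pages become the next round’s work. The function name and all parameters are hypothetical.

```shell
#!/usr/bin/env bash
# precopy_rounds TOTAL_PAGES DIRTY_PERCENT THRESHOLD MAX_ROUNDS
# Toy model of iterative pre-copy: keep copying until the set of dirty
# pages is at or below THRESHOLD, or until MAX_ROUNDS is reached.
precopy_rounds() {
  local to_copy="$1" dirty_pct="$2" threshold="$3" max_rounds="$4"
  local round=0
  while (( to_copy > threshold && round < max_rounds )); do
    round=$(( round + 1 ))
    echo "round ${round}: copied ${to_copy} pages"
    # Pages dirtied while this round was in flight are next round's work.
    to_copy=$(( to_copy * dirty_pct / 100 ))
  done
  echo "pause: ${to_copy} pages left for the final copy after ${round} rounds"
}

# Usage:
#   precopy_rounds 1000000 10 2000 30
```

With a 10% dirty rate the work shrinks tenfold each round, so a million pages drop below a 2000-page threshold after three rounds. With a dirty rate of 100% the loop never converges and only the round limit stops it, which is why a write-heavy VM is harder to migrate.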
The critical phase is the handoff from iterative pre-copy to the final switchover, which Google’s documentation calls the blackout. GCP has to judge when the remaining set of dirty pages is small enough to transfer within the acceptable downtime window. When that threshold is met, the VM is briefly paused on the source host, the remaining dirty pages and the CPU state are copied, and the VM is resumed on the destination host. Network traffic is then rapidly switched over to the new host.
The pause itself is designed to be extremely short, typically under a second; the copy rounds before it take longer, but the VM keeps running and serving traffic throughout them. For most applications, especially those with reasonable network connection timeouts (e.g., 5-10 seconds), this brief pause is imperceptible. A connection might lose a packet or two, which TCP retransmits without the application noticing.
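For clients that sit above TCP, such as a script calling an HTTP endpoint, a small retry wrapper is usually enough to ride out a pause of this size. The helper below is a generic sketch; the name retry and its arguments are made up for this example, and the address in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, waiting DELAY_SECONDS between attempts.
# Returns 1 if COMMAND still fails on the last allowed attempt.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Usage (placeholder address):
#   retry 5 1 curl --silent --fail --max-time 5 --output /dev/null "http://203.0.113.10/"
```

Five attempts one second apart cover a window several times longer than the sub-second pause described above, so a request that lands in the middle of the switchover should succeed on a later attempt.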
The levers you control are your VM’s configuration and its scheduling options. Real maintenance happens on GCP’s schedule, and Live Migration is applied automatically to eligible VMs. Most standard and high-memory machine types are eligible; VMs with attached GPUs and Spot (preemptible) VMs are not, and are stopped instead. The setting that governs this is the host maintenance policy (onHostMaintenance), which is either MIGRATE or TERMINATE. MIGRATE is the default for most VMs. If you have specific requirements, you can choose TERMINATE, but this means your VM will be stopped during host maintenance and, if automatic restart is enabled, started again afterwards.
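In gcloud, this choice is the instance’s host maintenance policy, which takes the value MIGRATE or TERMINATE. The sketch below wraps the two relevant commands, gcloud compute instances describe and gcloud compute instances set-scheduling, in small helper functions. The helper names and the zone are assumptions for illustration; use the zone your VM actually runs in.

```shell
#!/usr/bin/env bash
INSTANCE="my-live-migration-test"
ZONE="us-central1-a"   # assumption: replace with your VM's zone

# Print the current host maintenance policy: MIGRATE or TERMINATE.
show_maintenance_policy() {
  gcloud compute instances describe "$INSTANCE" --zone="$ZONE" \
    --format="value(scheduling.onHostMaintenance)"
}

# Set the policy. Accepts MIGRATE or TERMINATE and rejects anything else
# before calling gcloud.
set_maintenance_policy() {
  case "$1" in
    MIGRATE|TERMINATE)
      gcloud compute instances set-scheduling "$INSTANCE" --zone="$ZONE" \
        --maintenance-policy="$1"
      ;;
    *)
      echo "unknown policy: $1 (expected MIGRATE or TERMINATE)" >&2
      return 2
      ;;
  esac
}

# Usage:
#   show_maintenance_policy
#   set_maintenance_policy MIGRATE
```

Validating the argument locally is a small convenience: a typo fails immediately instead of after a round trip to the API.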
One thing most people don’t know is how GCP manages network state during the transition. When the VM resumes on the new host, its IP address and MAC address remain the same. The network fabric, which is software-defined, is updated extremely rapidly to direct incoming traffic to the new physical location of the VM. This is crucial because it means established TCP connections, even though they might experience a single dropped packet, can often be resumed without application-level failure because the IP and MAC addresses haven’t changed from the network’s perspective.
The next concept you’ll likely encounter is understanding how to monitor and manage your applications’ resilience to brief network interruptions, even with Live Migration in place.
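A starting point for that monitoring is the metadata server, which exposes a maintenance-event value to the VM itself. The sketch below assumes it runs on a Compute Engine VM; the function names are made up for this example, and what you do in response (drain connections, checkpoint work) is up to your application.

```shell
#!/usr/bin/env bash
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Turn a maintenance-event value into a human-readable message.
describe_event() {
  case "$1" in
    NONE) echo "no maintenance pending" ;;
    MIGRATE_ON_HOST_MAINTENANCE) echo "live migration imminent" ;;
    TERMINATE_ON_HOST_MAINTENANCE) echo "VM will be stopped for maintenance" ;;
    *) echo "unrecognized event: $1" ;;
  esac
}

# Block until the value changes, log it, and wait again.
# Only works from inside a Compute Engine VM.
watch_maintenance_events() {
  local event
  while true; do
    event="$(curl --silent --header "Metadata-Flavor: Google" \
      "${METADATA_URL}?wait_for_change=true")" || { sleep 5; continue; }
    echo "$(date -u +%FT%TZ) $(describe_event "$event")"
  done
}

# Usage (on the VM):
#   watch_maintenance_events
```

Pairing this log with an external probe of your service gives you both sides of the story: when the platform announced the event, and what your clients actually saw.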