Blue-green deployments on Fly.io don’t actually involve two separate identical environments; instead, they leverage a single, dynamically re-routed production environment.
Let’s see it in action. Imagine we have a simple web app deployed to Fly.io.
Here’s our fly.toml:
app = "my-blue-green-app"
primary_region = "ord"
[experimental]
auto_rollback_on_error = true
And a basic app.py for a Flask app:
from flask import Flask
import os

app = Flask(__name__)

@app.route('/')
def hello():
    version = os.environ.get("APP_VERSION", "unknown")
    return f"Hello from version {version}!\n"

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
We deploy the initial version:
fly deploy --image mydockerhub/my-app:v1 --yes
Now, my-blue-green-app.fly.dev (or your custom domain) points to this version.
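To perform a blue-green deployment, we deploy the new version with the bluegreen strategy; a plain fly deploy uses the default rolling strategy. You can pass the strategy on the command line, as below, or set strategy = "bluegreen" under [deploy] in fly.toml. Fly.io doesn’t need a second app or environment for this: the new version comes up inside the same app. The strategy depends on health checks, which are covered further down.
fly deploy --image mydockerhub/my-app:v2 --strategy bluegreen --yes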
Fly.io’s routing layer is what makes the switch possible. When you run fly deploy with the bluegreen strategy, the platform starts the new image (v2), the "green" side, alongside the running v1 inside the same app. Once the new version passes its health checks, Fly.io shifts traffic from the old version (v1) to the new one (v2) at its proxy layer. There’s no manual DNS change or load balancer configuration. The "blue" side is the previous version, which keeps serving requests until green is healthy, so a failed deployment leaves it in place.
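One way to watch the traffic shift is to poll the app while fly deploy runs and tally which version answers. The sketch below assumes the response format from app.py above and that each image sets APP_VERSION; parse_version, fetch, and watch_cutover are hypothetical helper names, not part of any Fly.io tooling.

```python
import re
import time
import urllib.request
from collections import Counter
from typing import Callable, Optional

# Matches the body produced by app.py, e.g. "Hello from version v2!"
VERSION_RE = re.compile(r"Hello from version (\S+)!")


def parse_version(body: str) -> Optional[str]:
    """Extract the version string from a response body, or None if absent."""
    match = VERSION_RE.search(body)
    return match.group(1) if match else None


def fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the decoded body (raises on network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def watch_cutover(
    url: str,
    samples: int = 30,
    delay: float = 1.0,
    fetcher: Callable[[str], str] = fetch,
) -> Counter:
    """Poll the app and tally which version answered each request.

    Failed requests are tallied under the key "error".
    """
    seen: Counter = Counter()
    for _ in range(samples):
        try:
            version = parse_version(fetcher(url)) or "unparsed"
        except OSError:  # urllib.error.URLError is a subclass of OSError
            version = "error"
        seen[version] += 1
        time.sleep(delay)
    return seen
```

Calling watch_cutover("https://my-blue-green-app.fly.dev/") from a second terminal during the deploy should show the tally moving from v1 to v2 (the hostname is assumed from the app name in fly.toml). Any count under "error" means some requests failed during the switch.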
The core mechanism is Fly.io’s global Anycast network and its internal routing. Anycast delivers each request to a nearby edge, and Fly’s proxy forwards it to a healthy instance of your app. During a deployment, the control plane registers the instances running v2, and the proxy starts sending requests to them instead of the v1 instances. That routing change then propagates across their network. The health checks are crucial here: Fly.io won’t switch traffic until the new version passes them. If the new version fails its health checks, or if auto_rollback_on_error is enabled and errors spike, Fly.io rolls the deployment back and the previous stable version keeps serving.
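The gating logic described above can be modeled in a few lines. This is a toy model under stated assumptions, not Fly.io’s implementation: Instance, bluegreen_deploy, and the is_healthy callback are hypothetical names, and real health checks run over the network with grace periods and intervals.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """One running copy of the app at a given version (illustrative only)."""
    name: str
    version: str


def bluegreen_deploy(
    blue: List[Instance],
    new_version: str,
    is_healthy: Callable[[Instance], bool],
) -> List[Instance]:
    """Return the instances that should receive traffic after a deploy.

    Green candidates are created for the new version, and traffic moves to
    them only if every candidate passes its health check. Otherwise the
    blue set keeps serving, which is the rollback path.
    """
    green = [Instance(f"{inst.name}-green", new_version) for inst in blue]
    if all(is_healthy(inst) for inst in green):
        return green  # cut over: routing now targets the new version
    return blue       # rollback: routing never left the old version
```

The point of the model is the all-or-nothing gate: a single unhealthy green instance is enough to keep every request on blue.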
The "blue" environment is simply the previous version, still running in the same app. If v2 fails its health checks, traffic stays on v1, so users never see the broken release. This is what makes it a zero-downtime release. Once a deployment has completed, going back to v1 means deploying the previous image again. You aren’t managing two separate sets of infrastructure; you’re managing the lifecycle of your application code on Fly’s shared, dynamic infrastructure.
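The key levers you control are your application’s health checks and the deployment process itself. A robust health check in your fly.toml is paramount. Because the check requests a path over HTTP, it belongs under [[services.http_checks]] rather than [[services.tcp_checks]], and the port the app listens on is declared once as internal_port:
[[services]]
internal_port = 8080 # The port app.py listens on
protocol = "tcp"

[services.concurrency]
hard_limit = 25
soft_limit = 20
type = "connections"

[[services.ports]]
handlers = ["http"]
port = 80 # Public port; requests are forwarded to internal_port

[[services.http_checks]]
grace_period = "1s" # Increase this if the app needs longer to boot
interval = "10s"
method = "GET"
path = "/health" # Make sure your app has a /health endpoint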
Your application also needs to expose a /health endpoint that returns 200 OK once it’s ready to serve traffic. This endpoint should check critical dependencies such as database connections.
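The app.py shown earlier has no /health route, so here is a minimal sketch of the logic behind one. It is kept free of Flask so it can run on its own; health_report, database_reachable, and CHECKS are hypothetical names, and an in-memory SQLite database stands in for whatever database the real app uses.

```python
import sqlite3
from typing import Callable, Dict, Tuple


def health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[dict, int]:
    """Run each dependency check and build a (JSON body, HTTP status) pair.

    Returns 200 only when every check passes. Any failure or exception
    yields 503 so the platform treats the instance as not ready.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "unavailable", "checks": results}
    return body, 200 if healthy else 503


def database_reachable() -> bool:
    """Placeholder dependency check: run a trivial query.

    An in-memory SQLite database stands in for the real database here.
    """
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT 1").fetchone() == (1,)
    finally:
        conn.close()


CHECKS = {"database": database_reachable}

# Wiring into the Flask app from app.py (Flask 1.1+ turns a dict into JSON):
#
# @app.route("/health")
# def health():
#     return health_report(CHECKS)
```

Returning 503 rather than raising keeps the failure visible to the health check without crashing the process, and the per-check results in the body make it easier to see which dependency blocked a deployment.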
One common point of confusion is thinking you need separate fly.toml files or distinct app names for blue and green. You don’t. The "blue" state is simply the previously deployed version, which keeps serving until the new "green" version passes its health checks and traffic is shifted. If green never becomes healthy, the deployment is rolled back and blue carries on serving.
The next challenge you’ll face is managing database schema migrations alongside these zero-downtime deployments.