Fly.io’s PostgreSQL service can be deceptively simple to set up, but its multi-region failover isn’t a magic bullet; it’s a carefully orchestrated ballet of distributed consensus and network latency.

Let’s see it in action. Imagine you have two PostgreSQL clusters, one in ord (Chicago) and another in lhr (London).

# Create the primary cluster in ord
fly postgres create --name my-pg-primary --region ord --vm-size shared-cpu-1x --volume-size 10GB

# Get the primary cluster's internal IP
PRIMARY_IP=$(fly postgres show --name my-pg-primary --json | jq -r '.services[0].addresses.ipv4[0]')

# Create the replica cluster in lhr, pointing to the primary
fly postgres create --name my-pg-replica --region lhr --vm-size shared-cpu-1x --volume-size 10GB --primary-region ord --initial-cluster-size 3 --attach-volume my-pg-primary:data

Now, if my-pg-primary in ord goes down, my-pg-replica in lhr will eventually take over. This isn’t instantaneous. The magic happens via Distributed Transaction Log Replication and Consensus.

Here’s the mental model:

  1. Primary/Replica Architecture: You have a designated "primary" region that handles all writes. Other regions host "replicas" that continuously stream changes from the primary. Fly.io manages this streaming via PostgreSQL’s built-in logical replication.

  2. Consensus for Failover: When the primary region becomes unreachable, Fly.io’s control plane initiates a failover. This isn’t just about promoting a replica; it’s about ensuring data consistency. The system uses a distributed consensus protocol (like Raft or Paxos, underpinning etcd which Fly.io uses for its control plane) to agree on which replica should become the new primary. This prevents split-brain scenarios.

  3. Replication Lag is Key: The time it takes for a replica to catch up to the primary is crucial. If a failover occurs and a replica hasn’t yet received all the transactions from the primary, data can be lost. Fly.io tries to minimize this lag through efficient network routing and replication mechanisms.

  4. Automatic Promotion: Once consensus is reached, the chosen replica is promoted to become the new primary. This involves reconfiguring its PostgreSQL instance to accept writes and updating DNS records so your application can connect to the new primary.

  5. Application Reconnection: Your application needs to be configured to connect to the Fly.io PostgreSQL service endpoint (e.g., my-pg-primary.internal). Fly.io automatically updates the DNS for this endpoint to point to the new primary after a failover. However, your application’s database client library might hold onto old DNS records or TCP connections, so a graceful retry or reconnect mechanism is essential.

The distributed transaction log replication is the engine, but the consensus protocol is the steering wheel that ensures a safe and consistent failover. Without consensus, you could end up with two nodes accepting writes simultaneously, leading to data corruption.

What most people don’t realize is that the initial-cluster-size parameter on a replica cluster doesn’t just mean more read replicas; it means more potential voters in the consensus process for failover. A larger initial-cluster-size can improve the resilience of the failover decision itself, but it also increases the overhead for write acknowledgments and replication.

The next concept you’ll grapple with is managing application state during a failover.

Want structured learning?

Take the full Fly-io course →