Fly.io health checks are often misunderstood as purely an uptime indicator, but their real power lies in their ability to prevent bad deployments from ever reaching production traffic.
Let’s see them in action. Imagine a simple Go web server that listens on port 8080.
package main

import (
	"fmt"
	"net/http"
	"os"
)

// handler answers every request on "/" with a fixed greeting.
func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "Hello from Fly.io!")
}

func main() {
	http.HandleFunc("/", handler)

	port := os.Getenv("PORT")
	if port == "" {
		port = "8080" // Default port if not set
	}

	fmt.Printf("Server listening on port %s\n", port)
	err := http.ListenAndServe(":"+port, nil)
	if err != nil {
		panic(err)
	}
}
In your fly.toml, you’d configure a health check like this:
app = "my-go-app"
primary_region = "lhr"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_deploy = true

[[services]]
  protocol = "tcp"
  port = 80

  [services.concurrency]
    type = "connections"
    hard_limit = 50
    soft_limit = 40

  # This is the health check configuration
  [services.http_options]
    external_url = "https://my-go-app.fly.dev" # This is optional, but good practice
    # The path Fly.io will hit to check if your app is healthy
    health_check_path = "/health"
    # How long Fly.io waits for a response before considering it a failure
    response_timeout = 5000 # milliseconds
    # How often Fly.io probes your app
    interval = 10000 # milliseconds
    # How many consecutive failures before marking the instance unhealthy
    failures_to_fail = 2
Now, let’s add a simple /health endpoint to our Go app. This endpoint should respond with 200 OK if the application is ready to serve traffic, and something else (or nothing) if it’s not.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "Hello from Fly.io!")
}

// healthHandler reports whether the app is ready to serve traffic.
// For this example it returns OK immediately. In a real app, you'd check
// database connections, cache readiness, or whether a critical background
// process has started, and return a non-200 status until they are ready.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, "OK")
}

func main() {
	http.HandleFunc("/", handler)
	http.HandleFunc("/health", healthHandler) // Add the health handler

	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}

	fmt.Printf("Server listening on port %s\n", port)
	err := http.ListenAndServe(":"+port, nil)
	if err != nil {
		panic(err)
	}
}
When you deploy this, Fly.io starts the new machines and, before sending them any production traffic, requests /health from the application on each one. If the response isn’t 200 OK within 5 seconds, or if the connection fails, the check fails, and a machine with failing checks doesn’t receive user requests. This is crucial: Fly.io doesn’t just check whether a machine is running; it checks whether it’s ready.
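To make "ready" rather than merely "running" concrete, here is a minimal sketch of a health handler that reports 503 until startup work has finished. The `ready` flag and the `probe` helper are hypothetical names introduced for illustration, and `probe` calls the handler in-process, so it only stands in for a real check. It assumes Go 1.19 or later for `atomic.Bool`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready is a hypothetical flag that the application flips to true once its
// startup work (connections, caches, background workers) has completed.
var ready atomic.Bool

// healthHandler returns 503 while the app is still starting and 200 once ready.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if !ready.Load() {
		http.Error(w, "starting", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "OK")
}

// probe calls the handler in-process and returns the status code it produced.
func probe() int {
	rec := httptest.NewRecorder()
	healthHandler(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe()) // 503: the process is running but not ready
	ready.Store(true)    // startup work has finished
	fmt.Println(probe()) // 200: now eligible for traffic
}
```

The process is alive for both probes; only the second would count as a passing check.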
The health_check_path tells Fly.io which URL to probe. The interval determines how often it probes, and response_timeout is the longest it will wait for a response before counting the probe as a failure. failures_to_fail is the failure threshold: the number of consecutive failed probes after which Fly.io stops treating the machine as able to serve traffic.
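The pass/fail rule for a single probe can be sketched in a few lines of Go. This is an illustration of the rule as described here (200 OK within the timeout passes, anything else fails), not Fly.io's own prober; `checkOnce` and `probeLocal` are hypothetical helpers, and the test server is a local stand-in for the application.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkOnce performs a single probe. It passes only if the endpoint answers
// 200 OK before the timeout expires; errors and timeouts count as failures.
func checkOnce(url string, timeout time.Duration) bool {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// probeLocal starts a throwaway local server that waits delayMs and then
// replies with status, and probes it once with a timeout of timeoutMs.
func probeLocal(status, delayMs, timeoutMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		w.WriteHeader(status)
	}))
	defer srv.Close()
	return checkOnce(srv.URL+"/health", time.Duration(timeoutMs)*time.Millisecond)
}

func main() {
	fmt.Println(probeLocal(http.StatusOK, 0, 1000))                  // true: fast 200
	fmt.Println(probeLocal(http.StatusInternalServerError, 0, 1000)) // false: non-200 status
	fmt.Println(probeLocal(http.StatusOK, 300, 50))                  // false: slower than the timeout
}
```

A real prober would repeat `checkOnce` every interval and count consecutive failures against the threshold.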
Health checks are therefore the primary mechanism by which Fly.io decides whether a newly launched machine is ready to receive production traffic. They don’t just detect failures after the fact; they keep a bad deployment from reaching users in the first place. If your health_check_path always returns a non-200 status, or times out, Fly.io will never route traffic to that machine, effectively quarantining it.
This system is what makes zero-downtime deployments possible. When you deploy a new version, Fly.io spins up new machines and waits for them to pass their health checks before gradually shifting traffic away from the old machines and onto the new ones. A new machine that fails its health check is never used, and if the new machines cannot become healthy, the deployment fails instead of replacing the old, working machines.
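The gating idea can be shown with a toy model. The sketch below is not Fly.io's scheduler: it assumes one replacement per old machine, performed one at a time, and it assumes the rollout stops at the first failed check. The `rollout` function and the machine names are hypothetical.

```go
package main

import "fmt"

// rollout models a one-at-a-time replacement. newPasses[i] says whether the
// replacement for old machine i passes its health check. A replacement takes
// over only after passing; at the first failure the rollout stops and the
// remaining old machines keep serving.
func rollout(newPasses []bool) (serving []string, completed bool) {
	serving = make([]string, len(newPasses))
	for i := range serving {
		serving[i] = fmt.Sprintf("old-%d", i)
	}
	for i, passed := range newPasses {
		if !passed {
			return serving, false
		}
		serving[i] = fmt.Sprintf("new-%d", i)
	}
	return serving, true
}

func main() {
	fmt.Println(rollout([]bool{true, true, true}))  // [new-0 new-1 new-2] true
	fmt.Println(rollout([]bool{true, false, true})) // [new-0 old-1 old-2] false
}
```

In both runs every slot is served by a machine that has passed its check, which is the property that keeps users away from a bad release.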
The internal_port in http_service is the port your application listens on inside the Fly.io machine. The port in [[services]] is the external port that Fly.io exposes to the internet. Although the health check is configured within the [[services]] block, the probe does not go out through the external port: Fly.io sends it to your application’s internal_port inside the machine.
You can also configure TCP health checks if your application doesn’t serve HTTP. In this case, you’d omit http_options and Fly.io would simply try to establish a TCP connection to the specified port. This is less sophisticated as it only checks if the port is open and listening, not if the application logic is healthy.
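What a TCP check can and cannot tell you is easy to see in code. The following sketch treats "a connection could be opened" as healthy, which is the whole test; `tcpCheck` and `checkLocal` are hypothetical helpers, and the listener on a local port stands in for the application.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// tcpCheck passes if a TCP connection to addr can be opened within timeout.
// It says nothing about whether the application behind the port works.
func tcpCheck(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// checkLocal opens a listener on a free local port and probes it. When
// listening is false, the listener is closed before the probe is sent.
func checkLocal(listening bool) bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	addr := ln.Addr().String()
	if listening {
		defer ln.Close()
	} else {
		ln.Close()
	}
	return tcpCheck(addr, time.Second)
}

func main() {
	fmt.Println(checkLocal(true))  // true: the port accepts connections
	fmt.Println(checkLocal(false)) // false: nothing is listening any more
}
```

Note that the first probe passes even though the listener never reads or answers anything, which is exactly the limitation of a TCP-only check.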
A common pitfall is setting health_check_path to a resource that takes a long time to become available. For instance, if your application needs to connect to a database and that connection can be slow on startup, your health check might fail even if the application will eventually work. In such cases, you’d either optimize your application’s startup path, or ensure your /health endpoint specifically checks for the critical readiness signals, not every single dependency.
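One way to keep a slow dependency from stalling the probe is to give the readiness check its own time budget that is shorter than the probe timeout. The sketch below assumes a single critical dependency; `newHealthHandler`, `fakeDependency`, and `statusFor` are hypothetical names, and the fake dependency simply sleeps to imitate a slow connection.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newHealthHandler runs check with a deadline of budget and answers 200 only
// if it succeeds in time. check must honor context cancellation.
func newHealthHandler(check func(context.Context) error, budget time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), budget)
		defer cancel()
		if err := check(ctx); err != nil {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "OK")
	}
}

// fakeDependency imitates a critical dependency that answers after delay,
// optionally with an error, and gives up when the context is cancelled.
func fakeDependency(delay time.Duration, fail bool) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(delay):
			if fail {
				return errors.New("dependency unavailable")
			}
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// statusFor returns the status code /health produces for the given scenario.
func statusFor(delayMs, budgetMs int, fail bool) int {
	h := newHealthHandler(
		fakeDependency(time.Duration(delayMs)*time.Millisecond, fail),
		time.Duration(budgetMs)*time.Millisecond,
	)
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(0, 200, false))  // 200: dependency answers in time
	fmt.Println(statusFor(500, 20, false)) // 503: dependency slower than the budget
	fmt.Println(statusFor(0, 200, true))   // 503: dependency reports an error
}
```

With a budget like this the endpoint answers quickly with a clear "not ready" instead of letting the probe time out.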
The failures_to_fail setting is important for resilience. If you set it too low (e.g., 1), transient network blips could cause machines to be marked unhealthy prematurely. Setting it too high might delay the detection of a genuinely unhealthy machine.
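The trade-off is easier to judge with a small model of the counting rule. The sketch below assumes that a passing probe resets the failure count, which is what "consecutive" implies; `unhealthyAt` is a hypothetical helper, not part of any Fly.io tooling.

```go
package main

import "fmt"

// unhealthyAt returns the index of the probe at which a machine would be
// marked unhealthy, given a sequence of probe results (true = passed) and a
// threshold of consecutive failures. It returns -1 if that never happens.
func unhealthyAt(results []bool, threshold int) int {
	consecutive := 0
	for i, ok := range results {
		if ok {
			consecutive = 0 // a passing probe resets the count
			continue
		}
		consecutive++
		if consecutive >= threshold {
			return i
		}
	}
	return -1
}

func main() {
	blip := []bool{true, false, true, true}     // one transient failure
	outage := []bool{true, false, false, false} // a real outage

	fmt.Println(unhealthyAt(blip, 1))   // 1: a single blip takes the machine out
	fmt.Println(unhealthyAt(blip, 2))   // -1: the blip is tolerated
	fmt.Println(unhealthyAt(outage, 2)) // 2: detected on the second failed probe
}
```

With the 10-second interval from the configuration above, each extra unit of threshold adds roughly one more interval before a real outage is acted on.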
The next concept you’ll encounter is how to leverage these health checks for graceful shutdown, ensuring that a machine still processing requests doesn’t immediately terminate when Fly.io signals it to stop.