Fly.io’s soft and hard concurrency limits on machines are a surprisingly effective way to prevent cascading failures and manage resource contention without resorting to manual scaling.

Let’s see this in action with a simple Go web server. Imagine this main.go:

package main

import (
	"fmt"
	"log"
	"net/http"
	"runtime"
	"sync"
	"time"
)

var (
	activeRequests int
	mu             sync.Mutex
)

func handler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	activeRequests++
	currentRequests := activeRequests
	mu.Unlock()

	log.Printf("Request received. Active requests: %d", currentRequests)

	// Simulate some work
	time.Sleep(5 * time.Second)

	mu.Lock()
	activeRequests--
	mu.Unlock()

	fmt.Fprintf(w, "Hello from Fly! Processed request. Active: %d\n", currentRequests)
}

func main() {
	http.HandleFunc("/", handler)
	log.Println("Starting server on :8080")
	log.Printf("GOMAXPROCS: %d", runtime.GOMAXPROCS(0))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

We’ll deploy this to Fly.io. First, create a fly.toml:

app = "my-concurrency-test"
primary_region = "ord"

[experimental]
    auto_rollback = true

[[services]]
    internal_port = 8080
    protocol = "tcp"

    [[services.ports]]
        handlers = ["http"]
        port = 80

    [services.concurrency]
        type = "requests"
        hard_limit = 10
        soft_limit = 8

The hard_limit (10) is the maximum number of concurrent requests that Fly.io’s proxy will route to a single machine. Once a machine reaches it, the proxy stops sending it new requests and looks for another machine with capacity. If every machine is at its hard limit, new requests wait at the proxy, and those that can’t be placed in time fail with an error such as 503 Service Unavailable, even if the machine could technically handle more. The soft_limit (8) is a gentler signal: once a machine reaches it, the proxy prefers machines that are still below their soft limit. If there are none, the machine keeps receiving requests until it reaches its hard limit. Treat a fleet that sits at its soft limit as a strong hint to add capacity or distribute load.
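As a rough mental model, the two limits can be read as a two-tier preference. The sketch below is a simplification for illustration, not Fly.io's actual routing algorithm; the Machine type and the pick function are hypothetical names, and real routing also weighs factors such as region and health.

```go
package main

import "fmt"

// Machine is a hypothetical stand-in for one machine as a router might see it.
type Machine struct {
	Name     string
	InFlight int // requests currently being served
}

const (
	softLimit = 8
	hardLimit = 10
)

// pick returns the machine a new request would go to in this simplified
// model, or false when every machine is at its hard limit.
func pick(machines []Machine) (Machine, bool) {
	// Pass 1: prefer the least-loaded machine below the soft limit.
	// Pass 2: fall back to the least-loaded machine below the hard limit.
	for _, limit := range []int{softLimit, hardLimit} {
		best, found := Machine{}, false
		for _, m := range machines {
			if m.InFlight < limit && (!found || m.InFlight < best.InFlight) {
				best, found = m, true
			}
		}
		if found {
			return best, true
		}
	}
	return Machine{}, false
}

func main() {
	fleet := []Machine{{"a", 9}, {"b", 7}, {"c", 10}}
	m, ok := pick(fleet)
	fmt.Println(m.Name, ok) // prints "b true": b is the only machine under the soft limit
}
```

In this model a machine at its hard limit is never chosen, and a machine between the two limits is chosen only when nothing less busy is available.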

Deploy this with fly deploy.

Now, let’s hit it. If you have hey installed, you can run:

hey -z 30s -c 20 http://my-concurrency-test.fly.dev

This keeps 20 concurrent workers sending requests to your app for 30 seconds.

Observe the logs on Fly.io: fly logs -a my-concurrency-test. You’ll see logs like:

2023-10-27T10:00:00.123Z ERK log: Request received. Active requests: 1
2023-10-27T10:00:00.124Z ERK log: Request received. Active requests: 2
...
2023-10-27T10:00:00.150Z ERK log: Request received. Active requests: 8
2023-10-27T10:00:00.151Z ERK log: Request received. Active requests: 9
2023-10-27T10:00:00.152Z ERK log: Request received. Active requests: 10
2023-10-27T10:00:00.153Z ERK INFO  my-concurrency-test: Starting server on :8080
2023-10-27T10:00:00.154Z ERK INFO  my-concurrency-test: GOMAXPROCS: 8
2023-10-27T10:00:00.155Z ERK INFO  my-concurrency-test: Starting server on :8080
2023-10-27T10:00:00.156Z ERK INFO  my-concurrency-test: GOMAXPROCS: 8
2023-10-27T10:00:00.157Z ERK INFO  my-concurrency-test: Starting server on :8080
2023-10-27T10:00:00.158Z ERK INFO  my-concurrency-test: GOMAXPROCS: 8
...
2023-10-27T10:00:00.180Z ERK log: Request received. Active requests: 10 // This is the last one that gets through to the handler

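In the hey output, the requests beyond the hard_limit show up as failures or as much slower responses. Fly.io’s proxy holds them while it looks for capacity and returns an error such as 503 Service Unavailable if none frees up in time. With a single machine and 20 workers against a limit of 10, expect a significant share of requests to be affected. The requests that do reach the handler complete after their 5-second sleep.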

The fundamental problem concurrency limits solve is keeping a single machine from being overwhelmed to the point of failure. When a machine hits its hard_limit, Fly.io’s proxy stops routing new requests to it. Requests already in flight keep running, so this is not a shutdown; it is a cap on incoming traffic that protects the instance’s CPU, memory, and network sockets from being exhausted by too many in-flight requests. The soft_limit acts as an earlier signal, steering load away from a machine that is getting busy before it reaches its breaking point. Together the two limits let the fleet absorb traffic spikes by spreading them across machines, and they give you a clear signal to add capacity: if requests start failing with 503 errors, or load sits above the soft_limit on every machine, the fleet is too small.

The type setting determines what is counted. With type = "requests", Fly.io counts in-flight HTTP requests, which is the right choice for a typical web service like this one. With type = "connections", which is the default, it counts open TCP connections instead, which suits other protocols better than HTTP.

If you were to increase hard_limit to 20 in fly.toml and redeploy, you’d see far fewer 503 errors from hey, but your application logs might start showing increased latency or even application-level timeouts if the Go runtime itself becomes saturated.

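The most surprising thing about these limits is how they interact with automatic scaling. Fly.io’s built-in mechanism here is autostart/autostop (auto_start_machines and auto_stop_machines in fly.toml), and it uses the soft_limit to judge whether the running machines have spare capacity. When the running machines are at or above their soft_limit, the proxy can start a stopped machine; when there is excess capacity, it can stop one. It starts and stops machines that already exist rather than creating new ones, so you still need to provision enough machines (for example with fly scale count) for it to have something to start. Once started, the extra machines take traffic, lowering the concurrency on the existing machines and keeping them away from their hard_limit.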

What most people don’t realize is that the soft_limit and hard_limit are applied per machine, not to the entire application fleet. This per-machine granularity is what allows for fine-grained control and prevents a single overloaded instance from impacting the health of the entire service.
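Because the limits are per machine, fleet capacity is roughly the per-machine limit times the number of machines. A back-of-the-envelope estimate using Little's law (average concurrency = arrival rate × time per request) might look like the sketch below. The function name machinesNeeded and the traffic figure of 10 requests per second are assumptions for illustration; the 5-second request time comes from the example handler.

```go
package main

import (
	"fmt"
	"math"
)

// machinesNeeded estimates how many machines keep the average per-machine
// concurrency at or below limit, using Little's law:
// concurrency = arrival rate (req/s) * time per request (s).
func machinesNeeded(reqPerSec, secPerReq float64, limit int) int {
	concurrent := reqPerSec * secPerReq
	return int(math.Ceil(concurrent / float64(limit)))
}

func main() {
	// 10 req/s at 5 s each means about 50 requests in flight on average.
	fmt.Println(machinesNeeded(10, 5, 8))  // soft_limit 8: prints 7
	fmt.Println(machinesNeeded(10, 5, 10)) // hard_limit 10: prints 5
}
```

This is an average, so real traffic with bursts needs headroom on top of it. Sizing against the soft_limit rather than the hard_limit provides some of that headroom.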

The next logical step after fine-tuning these concurrency limits is to explore how they interact with application-level health checks, which can provide an even more nuanced view of machine health to the Fly.io router.
