Set Up Fly.io Autoscaling to Match Live Traffic (2026)

Autoscaling on Fly.io isn’t about predicting the future; it’s about reacting to the present by treating your application’s resource consumption as a direct proxy for traffic.

Imagine you have a web app running on Fly.io with autoscaling configured. When a user hits your site, the request first reaches Fly’s proxy, which load-balances it to one of your running instances (called "Machines" on Fly; each one is a VM). If that VM is busy, it takes longer to respond. Fly.io’s autoscaling watches how much work each VM is doing, and if they’re all swamped, it spins up new ones.

Let’s say you’re running a simple Go web server, and you’ve deployed it with a fly.toml that looks like this:

app = "my-go-app"
primary_region = "ord"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true

[checks]
  [checks.my-app-check]
    type = "http"
    port = 8080
    path = "/health"
    interval = 10000 # 10 seconds
    timeout = 2000 # 2 seconds

[metrics]
  # This section is key for autoscaling
  port = 8080
  path = "/metrics"

And your Go app exposes Prometheus-compatible metrics on /metrics that include http_requests_total and go_goroutines.
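As a rough illustration of what such an endpoint can look like, here is a minimal sketch that uses only the Go standard library. The names requestsTotal, countRequests, renderMetrics and newMux are made up for this example, and the program scrapes itself in-process instead of listening on a port so that it runs and exits on its own. A real app would more likely use the Prometheus Go client, whose default registry already exports go_goroutines.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime"
	"sync/atomic"
)

// requestsTotal counts requests that pass through countRequests.
var requestsTotal atomic.Int64

// countRequests wraps a handler and increments the request counter.
func countRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Add(1)
		next.ServeHTTP(w, r)
	})
}

// renderMetrics formats both metrics in the Prometheus text exposition format.
func renderMetrics(requests int64, goroutines int) string {
	return fmt.Sprintf(
		"# TYPE http_requests_total counter\nhttp_requests_total %d\n"+
			"# TYPE go_goroutines gauge\ngo_goroutines %d\n",
		requests, goroutines)
}

// newMux wires up the app route plus the /health and /metrics endpoints
// that the fly.toml above points at.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/", countRequests(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})))
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderMetrics(requestsTotal.Load(), runtime.NumGoroutine()))
	})
	return mux
}

func main() {
	mux := newMux()
	// Simulate three requests, then scrape /metrics in-process.
	for i := 0; i < 3; i++ {
		mux.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	}
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
	// In a deployed app you would serve the mux instead, for example:
	// http.ListenAndServe(":8080", mux)
}
```

Requests to /health and /metrics are deliberately left out of the counter here, so health checks and scrapes don’t inflate http_requests_total.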

When you run fly deploy, Fly.io provisions machines based on your fly.toml. If you haven’t explicitly set min_machines or max_machines, Fly.io falls back to its defaults. The autoscaler, however, needs a concrete signal to act on, and it reads that signal from the metrics exposed at the port and path in your [metrics] section. By default, it monitors CPU utilization per VM; to scale on something else, you need to configure it.

Here’s how you’d tell Fly.io to scale on the number of goroutines your Go app is running, by adding concurrency and autoscaling blocks to your fly.toml:

app = "my-go-app"
primary_region = "ord"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true

[checks]
  [checks.my-app-check]
    type = "http"
    port = 8080
    path = "/health"
    interval = 10000 # 10 seconds
    timeout = 2000 # 2 seconds

[metrics]
  port = 8080
  path = "/metrics"

# --- Autoscaling configuration ---
[http_service.concurrency]
  type = "go_routines"
  hard_limit = 1000 # Maximum goroutines per machine
  soft_limit = 500 # Target goroutines per machine for scaling

[autoscaling]
  min_machines = 2
  max_machines = 10
  # The fleet-wide target is soft_limit multiplied by the current machine count.
  # With soft_limit = 500 and 2 machines, the target is 1000 goroutines in total.
  # When the fleet total exceeds this target, Fly will scale up.
  # When it drops well below, it will scale down, respecting min_machines.

In this fly.toml, the concurrency block tells Fly.io what to measure and how to interpret it for scaling. Because the file declares [http_service], the block is nested under it as [http_service.concurrency]. type = "go_routines" specifies that we’re interested in the go_goroutines metric. Fly’s fly.toml reference lists only "connections" and "requests" as concurrency types and does not describe an [autoscaling] table, so confirm that your flyctl version accepts both before relying on this file. hard_limit = 1000 is the ceiling for a single machine: once a machine reaches 1000 goroutines it is considered overloaded, and new requests may be delayed or rejected. soft_limit = 500 is the target number of goroutines per machine.

The [autoscaling] block sets the boundaries: min_machines = 2 ensures you always have at least two instances running, providing a baseline of availability. max_machines = 10 prevents runaway costs by capping the number of machines. Fly.io’s autoscaler will try to maintain an average number of goroutines per machine close to your soft_limit. If the total number of goroutines across all machines starts to exceed the target (which is soft_limit * current_machines), it will add more machines. If it drops significantly below, it will remove machines, down to min_machines.
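One way to read the rule above is as a small calculation: divide the fleet-wide goroutine total by soft_limit, round up, and clamp the result between min_machines and max_machines. The sketch below does exactly that. It illustrates the behavior described here and is not Fly.io’s actual algorithm; the function name desiredMachines is made up for this example.

```go
package main

import "fmt"

// desiredMachines returns the smallest machine count that keeps the average
// goroutines per machine at or below softLimit, clamped to
// [minMachines, maxMachines]. This is an illustrative model only.
func desiredMachines(totalGoroutines, softLimit, minMachines, maxMachines int) int {
	if softLimit <= 0 {
		return minMachines
	}
	n := (totalGoroutines + softLimit - 1) / softLimit // ceiling division
	if n < minMachines {
		n = minMachines
	}
	if n > maxMachines {
		n = maxMachines
	}
	return n
}

func main() {
	// soft_limit = 500, min_machines = 2, max_machines = 10, as in the fly.toml above.
	for _, total := range []int{300, 1000, 1200, 9000} {
		fmt.Printf("total=%d -> machines=%d\n", total, desiredMachines(total, 500, 2, 10))
	}
}
```

With these inputs the sketch yields 2, 2, 3 and 10 machines: the first total is lifted to min_machines, and the last is capped by max_machines even though 9000 goroutines would call for 18.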

The most surprising thing about this setup is that it doesn’t scale directly on requests per second or latency. Instead, it uses resource utilization metrics (CPU, memory, or custom metrics such as goroutines) as a proxy for load. This means you need to ensure your chosen metric accurately reflects how "busy" your application is.

Consider this: you deploy your app with the above fly.toml. A sudden surge of traffic hits, and your Go app starts spawning goroutines rapidly to handle each incoming request. Fly.io’s autoscaler, watching the /metrics endpoint, sees the go_goroutines metric climb on each existing machine. As the average number of goroutines per machine surpasses the soft_limit (500 in our example), Fly.io provisions new machines. These new machines start serving traffic, and the load (and thus the goroutine count) on the older machines decreases. If traffic subsides, goroutines drop, and Fly.io scales down until it reaches your min_machines count.

The key to effective autoscaling here is having your application expose metrics that truly correlate with the work it’s doing. If your application has a bottleneck that isn’t reflected in the metrics you’re exposing, autoscaling won’t help. For instance, if your app is CPU-bound but you’re scaling on memory usage, memory may stay flat while the CPU saturates, so new machines arrive too late or not at all.

One detail that is easy to miss is that the soft_limit in the concurrency block isn’t enforced as a cap on individual machines. It’s a per-machine figure that the autoscaler multiplies by the current machine count to get a fleet-wide target. If you have min_machines = 2 and soft_limit = 500 goroutines, the initial target is 2 * 500 = 1000 goroutines. As the fleet grows, the target grows proportionally. When the total number of goroutines across all machines exceeds this dynamic fleet target, Fly.io adds a machine. Conversely, when the total falls well below the target, it removes a machine, never going below min_machines. This fleet-wide approach is more robust than enforcing a strict per-machine limit, because load is rarely spread perfectly evenly across machines.
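To make the moving fleet-wide target concrete, the sketch below steps through a hypothetical surge and recovery one decision at a time. It rests on two assumptions that are not confirmed behavior: the fleet changes by one machine per decision, and "below the target" is read as "the total would still fit within the target for one fewer machine", which avoids flapping between two sizes. The function name step and the sample totals are invented for illustration.

```go
package main

import "fmt"

// step makes one scaling decision for the current fleet size.
// Scale up when the total exceeds softLimit*machines; scale down only when the
// total would still fit within the target for one fewer machine; otherwise hold.
func step(total, machines, softLimit, minMachines, maxMachines int) int {
	switch {
	case total > softLimit*machines && machines < maxMachines:
		return machines + 1
	case total <= softLimit*(machines-1) && machines > minMachines:
		return machines - 1
	default:
		return machines
	}
}

func main() {
	machines := 2 // start at min_machines
	// Hypothetical fleet-wide goroutine totals sampled over time.
	for _, total := range []int{800, 1100, 1700, 1700, 900, 900, 900} {
		machines = step(total, machines, 500, 2, 10)
		fmt.Printf("total=%d machines=%d target=%d\n", total, machines, 500*machines)
	}
}
```

In this trace the fleet grows from 2 to 4 machines as the total climbs to 1700, holds at 4 while 1700 stays under the 2000 target, and then shrinks back to 2 once the total settles at 900.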

Once your autoscaling is correctly configured and your app is exposing the right metrics, the next hurdle is often understanding how your chosen metric interacts with your application’s performance under different load patterns.