Fly.io’s built-in metrics offer a surprisingly deep dive into your app’s performance without needing to install anything extra.
Let’s see what that looks like. Imagine you have a simple Go app deployed to Fly.io.

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "time"
    )

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintf(w, "Hello from Fly!")
        })
        http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(5 * time.Second) // Simulate a slow request
            fmt.Fprintf(w, "This was slow!")
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

After deploying this with fly deploy, Fly.io automatically starts collecting metrics. You can access these metrics through the Fly.io dashboard or, more powerfully, by connecting Grafana.
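The core of Fly.io’s monitoring is a Prometheus-compatible metrics interface. This walkthrough assumes an endpoint at http://<your-app-ip>:8080/metrics or a similar internal address, but note that the sample app above registers no /metrics handler of its own: the built-in metrics are collected by the platform, not served by your code, so confirm the actual endpoint in Fly.io’s metrics documentation. The metric names below are the ones used throughout this walkthrough; Fly.io’s documentation groups its built-in series under prefixes such as fly_instance_, fly_app_, and fly_edge_, so check the exact names in Grafana’s metric browser before relying on them. The standard metrics covered here include: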
- fly_cpu_percent: The percentage of CPU utilization for the VM.
- fly_memory_percent: The percentage of RAM utilization for the VM.
- fly_request_count: The number of HTTP requests processed by your app.
- fly_request_duration_seconds: The duration of HTTP requests.
- fly_response_size_bytes: The size of HTTP responses.
This allows you to build dashboards that visualize not just resource usage but also application-level performance indicators like request latency and throughput.
To get this into Grafana, you’ll need to:
- Add Fly.io as a Prometheus data source in Grafana:
    - Go to Configuration -> Data sources -> Add data source.
    - Select Prometheus.
    - For the URL, enter the base URL of a Prometheus-compatible query API. Grafana’s Prometheus data source queries a server; it does not read a raw /metrics page, so an address such as http://<your-vm-ip>:8080/metrics only helps if something in between scrapes it. Fly.io’s documentation describes a hosted Prometheus endpoint per organization (https://api.fly.io/prometheus/<org-slug>/ at the time of writing, authenticated with a Fly access token), which is usually the simplest choice. If you run your own Prometheus, a reverse proxy, or a dedicated agent that scrapes your VMs (reachable over Fly’s private network or through the addresses shown by fly ips list), point Grafana at that server instead.
    - Set Scrape Interval to 15s, or to the interval at which your metrics are actually collected if you know it.
- Create dashboards in Grafana:
    - Once the data source is configured, you can start querying these metrics.

For example, to visualize CPU usage per VM (a percentage is a gauge, so it is averaged directly and not wrapped in rate()):

    avg by (instance) (fly_cpu_percent{job="your-app-name"})

To visualize request latency (95th percentile):

    histogram_quantile(0.95, sum by (le, instance) (rate(fly_request_duration_seconds_bucket{job="your-app-name"}[5m])))

And total request throughput, in requests per second:

    sum(rate(fly_request_count{job="your-app-name"}[5m]))

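Grafana sends queries like these to the data source over the standard Prometheus HTTP API, and it can be useful to run one by hand when a panel shows no data. The following is a minimal sketch, not a definitive client: instantQuery, fakePrometheus, the bearer-token header, and the vm-1/vm-2 sample values are all assumptions made for illustration. The fake server only exists so the example runs offline; against a real endpoint you would pass its base URL and whatever credentials it requires.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// queryResponse mirrors the fields of a Prometheus /api/v1/query reply used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "sample value"]
		} `json:"result"`
	} `json:"data"`
}

// instantQuery runs one PromQL instant query and returns the sample values
// keyed by the "instance" label. baseURL and token are placeholders.
func instantQuery(baseURL, token, promQL string) (map[string]string, error) {
	endpoint := baseURL + "/api/v1/query?query=" + url.QueryEscape(promQL)
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		// Assumption: the server accepts a bearer token; check your provider's docs.
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var qr queryResponse
	if err := json.NewDecoder(resp.Body).Decode(&qr); err != nil {
		return nil, err
	}
	if qr.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", qr.Status)
	}
	values := make(map[string]string)
	for _, r := range qr.Data.Result {
		if s, ok := r.Value[1].(string); ok {
			values[r.Metric["instance"]] = s
		}
	}
	return values, nil
}

// fakePrometheus stands in for a real server so the sketch runs offline.
func fakePrometheus() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1/query" || r.URL.Query().Get("query") == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status":"success","data":{"resultType":"vector","result":[`+
			`{"metric":{"instance":"vm-1"},"value":[1700000000,"0.25"]},`+
			`{"metric":{"instance":"vm-2"},"value":[1700000000,"0.75"]}]}}`)
	}))
}

func main() {
	srv := fakePrometheus()
	defer srv.Close()

	query := `sum by (instance) (rate(fly_request_count{job="your-app-name"}[5m]))`
	values, err := instantQuery(srv.URL, "", query)
	if err != nil {
		fmt.Println("query error:", err)
		return
	}
	for instance, v := range values {
		fmt.Printf("%s: %s req/s\n", instance, v)
	}
}
```

Sample values arrive as strings in the API response, which is why the sketch leaves them unparsed; convert them with strconv.ParseFloat if you need arithmetic.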
The mental model here is that Fly.io acts as a managed Prometheus server for your application’s VMs. It collects the metrics, and you point Grafana (or another Prometheus-compatible tool) at it to query them. You’re essentially bringing your own visualization layer to Fly’s automatically collected telemetry. The queries in this walkthrough assume a job label that matches your Fly.io app name; if your series identify the app with a different label (for example app), adjust the selectors accordingly.
The key to unlocking advanced monitoring is understanding that fly_request_duration_seconds is a histogram. This means it doesn’t just give you average durations; it provides buckets of observed durations. Using histogram_quantile allows you to calculate percentiles like the 95th or 99th percentile, giving you a much more accurate picture of user experience than a simple average, which can be skewed by outliers.
When you’re troubleshooting performance issues, looking at fly_cpu_percent and fly_memory_percent alongside fly_request_duration_seconds helps you tell whether your application is bottlenecked by resources or whether the inefficiency lies in your code. Latency that climbs together with CPU or memory usage points to resource pressure; latency that climbs while resources stay flat points to the code or to a slow dependency.
A common pitfall is forgetting that these metrics are per-VM. If you have auto-scaling enabled, your dashboard needs to aggregate metrics across all instances of your app to show the overall health: use sum(rate(...)) for counters such as request counts, and avg(...) for gauges such as CPU or memory percentages. The instance label in Prometheus is usually the VM’s internal IP or a unique identifier Fly assigns.
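The difference between summing and averaging is easy to get wrong, so here is a small Go sketch of the same roll-up done by hand. The vmSample type, the fleetView function, and the three sample VMs are hypothetical; in practice PromQL performs this aggregation for you.

```go
package main

import "fmt"

// vmSample is one VM's current readings, as a per-instance query might return them.
type vmSample struct {
	Instance       string
	RequestsPerSec float64
	CPUPercent     float64
}

// fleetView sums throughput across VMs (each VM handles different requests),
// averages CPU (a utilization figure, not a quantity), and names the busiest VM.
func fleetView(samples []vmSample) (totalRPS, avgCPU float64, hottest string) {
	if len(samples) == 0 {
		return 0, 0, ""
	}
	var cpuSum, maxCPU float64
	for i, s := range samples {
		totalRPS += s.RequestsPerSec
		cpuSum += s.CPUPercent
		if i == 0 || s.CPUPercent > maxCPU {
			maxCPU, hottest = s.CPUPercent, s.Instance
		}
	}
	return totalRPS, cpuSum / float64(len(samples)), hottest
}

func main() {
	samples := []vmSample{
		{"vm-1", 10, 20},
		{"vm-2", 30, 80},
		{"vm-3", 20, 50},
	}
	total, avg, hottest := fleetView(samples)
	fmt.Printf("total throughput: %.0f req/s\n", total)
	fmt.Printf("average CPU: %.0f%%\n", avg)
	fmt.Printf("busiest VM: %s\n", hottest)
}
```

Note that the average hides vm-2 running at 80% CPU, which is why a per-instance panel is worth keeping next to the aggregated one.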
The next step in advanced monitoring involves integrating application-specific metrics, often using libraries like OpenTelemetry or Prometheus client libraries within your application code itself.