Grafana + Prometheus: From Data Chaos to Insight

Grafana dashboards can display Prometheus metrics, but the real magic is in how they transform raw time-series data into actionable insights.

Here’s a Prometheus metric for HTTP request duration:

http_request_duration_seconds_bucket{handler="/api/v1/users",instance="localhost:8080",job="my-app",le="0.1"} 15
http_request_duration_seconds_bucket{handler="/api/v1/users",instance="localhost:8080",job="my-app",le="0.5"} 42
http_request_duration_seconds_bucket{handler="/api/v1/users",instance="localhost:8080",job="my-app",le="1.0"} 55
http_request_duration_seconds_bucket{handler="/api/v1/users",instance="localhost:8080",job="my-app",le="+Inf"} 60

This isn’t a single value; it’s a histogram. The le (less than or equal to) label gives the upper bound of each bucket, and each count is cumulative: it includes every request that finished within that duration. So, 15 requests took at most 0.1 seconds, 42 took at most 0.5 seconds, 55 took at most 1.0 second, and the +Inf bucket holds all 60 observed requests. To get the count of requests within a specific range (e.g., between 0.1 and 0.5 seconds), you’d subtract adjacent counts: 42 - 15 = 27.
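The subtraction above can be applied to every bucket at once. The following is a minimal sketch in Python, using only the four sample values shown earlier; the function name per_bucket_counts is made up for illustration and is not part of Prometheus or Grafana.

```python
# Cumulative bucket counts copied from the sample output above, keyed by
# the `le` upper bound (float("inf") stands in for "+Inf").
cumulative = [(0.1, 15), (0.5, 42), (1.0, 55), (float("inf"), 60)]


def per_bucket_counts(buckets):
    """Turn cumulative (le, count) pairs into (lower, upper, count) ranges.

    Hypothetical helper for illustration; assumes non-negative bounds so
    the first range can start at 0.
    """
    ranges = []
    prev_bound, prev_count = 0.0, 0
    for upper, count in sorted(buckets):
        # Requests in this range = this cumulative count minus the previous one.
        ranges.append((prev_bound, upper, count - prev_count))
        prev_bound, prev_count = upper, count
    return ranges


for lower, upper, count in per_bucket_counts(cumulative):
    print(f"({lower}, {upper}]: {count} requests")
```

The four per-range counts add back up to the 60 requests reported by the +Inf bucket, which is a quick sanity check when reading histogram output by hand.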

Let’s see how this looks in Grafana. Imagine you’re creating a new panel in a Grafana dashboard.

First, select your Prometheus data source.

Then, in the query editor, you’d write a PromQL query to visualize this. For instance, to show the 95th percentile request duration for the /api/v1/users handler, you might use:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{handler="/api/v1/users"}[5m])) by (le, job, instance))

This query calculates the 95th percentile of request durations over the last 5 minutes, with one result per job and instance. To get a single percentile across all instances, you’d group by le alone.

Here’s a breakdown of what’s happening:

rate(http_request_duration_seconds_bucket{handler="/api/v1/users"}[5m]): This calculates the per-second rate of increase for each histogram bucket over the last 5 minutes. It essentially tells you how many requests are entering each bucket per second.
sum(...) by (le, job, instance): This adds up the rates of all series that share the same le, job, and instance values, dropping any other labels. Keeping le is crucial because histogram_quantile uses it to find the bucket boundaries; keeping job and instance gives you one histogram per instance.
histogram_quantile(0.95, ...): This function takes the aggregated histogram data and calculates the value at the 95th percentile. This means 95% of requests finished at or below this duration.
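To make the last step less of a black box, here is a simplified Python sketch of the linear interpolation that histogram_quantile performs on classic (bucketed) histograms. It is an approximation written for illustration, not the Prometheus source: it assumes at least two buckets, a final +Inf bucket, non-negative bounds, and a non-empty first bucket, and it skips several edge cases the real function handles. For readability it is fed the raw cumulative counts from the sample above rather than per-second rates; the calculation has the same shape either way.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative (le, count) pairs.

    Simplified sketch; see the assumptions listed in the text above.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket, which has no upper bound, so
        # report the upper bound of the highest finite bucket instead.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = buckets[idx - 1] if idx > 0 else (0.0, 0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


sample = [(0.1, 15), (0.5, 42), (1.0, 55), (float("inf"), 60)]
print(histogram_quantile(0.5, sample))   # median estimate, inside (0.1, 0.5]
print(histogram_quantile(0.95, sample))  # falls in +Inf, so capped at 1.0
```

With the sample data, the 95th percentile lands in the +Inf bucket, so the estimate is capped at 1.0 seconds. That illustrates why bucket boundaries should extend past the latencies you care about: a percentile can never be reported more precisely than the buckets allow.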

In Grafana, you’d choose a visualization type, like a "Time series" panel (called "Graph" in older Grafana versions) or a "Stat" panel. For a Time series panel, you’d see a line representing the 95th percentile latency over time. For a Stat panel, you might display the current 95th percentile latency.

You can add more queries to the same panel. For example, to show the total request rate:

sum(rate(http_requests_total{job="my-app"}[5m]))

This query sums the per-second rate of all HTTP requests for the my-app job (assuming you have an http_requests_total counter metric), averaged over the last 5 minutes.
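If it is unclear what a per-second rate of a counter means, the following Python sketch shows the core idea on made-up samples. It is deliberately simplified: it accounts for counter resets, but it omits the extrapolation to the window boundaries that Prometheus’s rate() performs, so its results will differ slightly from real query output. The function name simple_rate and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate the per-second rate of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs
    from one series. Simplified sketch: no boundary extrapolation.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. the process restarted),
        # so everything counted since the reset is new.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrape results, 60 seconds apart, with a reset at t=180.
scrapes = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
print(simple_rate(scrapes))  # increase of 190 requests over 240 seconds
```

The outer sum(...) in the query then simply adds these per-series rates together, giving one total request rate for the job.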

The power of Grafana with Prometheus lies in its ability to combine these metrics. You could overlay the 95th percentile latency with the request rate to see if increased traffic correlates with higher latency. You can also use Grafana’s transformation features. For instance, you could add a "Reduce" transformation to calculate the average of a metric or a "Filter data by values" transformation to only show data above a certain threshold.

A common mistake is trying to visualize the raw _bucket metrics directly. The buckets are cumulative counters, so plotted as-is they only show ever-growing totals that say little about current latency. Functions like rate(), sum(), and histogram_quantile() are your best friends for turning buckets into meaningful percentiles and rates.

The _count and _sum series that every histogram exposes alongside its buckets are also incredibly useful. http_request_duration_seconds_count gives you the total number of requests observed, and http_request_duration_seconds_sum gives you the total duration of all requests in seconds. You can use these to calculate the average duration over the last 5 minutes: sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m])).
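The division works because both rates cover the same window, so the window length cancels out and what remains is the increase in total duration divided by the increase in request count. A small Python sketch with made-up numbers shows the arithmetic; the function name and values are hypothetical, and unlike PromQL (which yields NaN for 0/0) this version returns None when no requests arrived.

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Average seconds per request between two scrapes of _sum and _count."""
    requests = count_end - count_start
    if requests == 0:
        return None  # no requests in the window, so no average to report
    return (sum_end - sum_start) / requests


# Hypothetical values: _sum grew from 12.0 s to 18.0 s while _count grew
# from 60 to 90 requests during the window.
print(average_duration(12.0, 18.0, 60, 90))  # 6 s spread over 30 requests
```

Keep in mind that an average hides outliers: a handful of very slow requests barely moves it, which is why the percentile query is usually shown alongside it.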

When you’re building dashboards, remember that Prometheus stores data as time series. Each series is identified by a metric name and a set of labels, and each data point in it has a timestamp and a value. Grafana excels at querying and visualizing these series, allowing you to slice and dice your data by any label. This means you can easily filter by job, instance, handler, or any other label you’ve instrumented.

The next step after visualizing latency and throughput is often correlating these with system resource usage, like CPU or memory.