Grafana dashboards can feel sluggish because the underlying Prometheus queries are taking too long to execute.

Let’s see this in action. Imagine a dashboard showing how idle each instance’s CPUs are across your fleet (the inverse of CPU usage). The query might look like this:

avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

This query, on its own, isn’t inherently slow. The problem arises when it runs across thousands of instances, when the [5m] window is widened to [1h] or [24h], or when the dashboard’s time range stretches back days to show historical trends. In each case Prometheus has to read far more time-series samples to compute the average idle rate for each instance.

The core issue is that Prometheus stores data in blocks of time, and when you query a range, it has to decompress and process data from potentially many blocks. For long-range queries or queries across a vast number of series, this can become a significant bottleneck.
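
As a rough back-of-the-envelope sketch of how this cost scales, the number of raw samples a query has to read is on the order of series × (time span ÷ scrape interval). The 15-second scrape interval and the fleet size below are assumptions chosen for illustration, not figures from a real deployment.

```python
def estimate_samples(series_count: int, span_seconds: int,
                     scrape_interval_seconds: int = 15) -> int:
    """Rough count of raw samples stored for `series_count` series over `span_seconds`."""
    return series_count * (span_seconds // scrape_interval_seconds)


HOUR = 3600

# Hypothetical fleet: 1,000 instances, each exposing 8 idle-mode CPU series.
series = 1_000 * 8

last_hour = estimate_samples(series, 1 * HOUR)
last_day = estimate_samples(series, 24 * HOUR)

print(f"last 1h:  {last_hour:,} samples")
print(f"last 24h: {last_day:,} samples")
print(f"ratio:    {last_day // last_hour}x")
```

The estimate ignores compression and block layout entirely; it is only meant to show that widening the time span or the series count multiplies the work.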

The Fixes, From Most To Least Impactful:

  1. Reduce the Time Range: This is the most obvious, but often overlooked, fix. If your dashboard is showing Last 24 hours by default, and you only need to see the last hour for day-to-day operations, change the dashboard’s default time picker.

    • Diagnosis: Check the Grafana dashboard’s time range selector.
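    • Fix: Set the time picker to the range you actually need (e.g., Last 1 hour) and save the dashboard with the option to keep the current time range as the dashboard default. In the dashboard JSON this is the time field, e.g., from now-1h to now. Note that the "Now delay" setting under "Time options" in the dashboard settings (gear icon) only shifts the end of the range back to hide the most recent, possibly incomplete data; it does not shorten the range.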
    • Why it works: You’re asking Prometheus to scan a vastly smaller dataset. Less data scanned means less I/O, less decompression, and faster query execution.
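
If you manage dashboards as JSON, the default range lives in the dashboard model's time field. The following is a minimal sketch of changing it programmatically; the helper name and the trimmed-down dashboard dictionary are hypothetical, and a real dashboard export contains many more fields.

```python
import copy
import json


def with_default_time_range(dashboard: dict, time_from: str = "now-1h",
                            time_to: str = "now") -> dict:
    """Return a copy of the dashboard model with a new default time range."""
    updated = copy.deepcopy(dashboard)  # leave the caller's dict untouched
    updated["time"] = {"from": time_from, "to": time_to}
    return updated


# Hypothetical, heavily trimmed dashboard model.
dashboard = {"title": "Fleet CPU", "time": {"from": "now-24h", "to": "now"}}

narrowed = with_default_time_range(dashboard)
print(json.dumps(narrowed, indent=2))
```
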
  2. Filter Series Early with Labels: Add more specific labels to your queries to reduce the number of series Prometheus needs to evaluate. Instead of node_cpu_seconds_total, if you only care about application servers, use node_cpu_seconds_total{env="production", role="app"}.

    • Diagnosis: Examine your Grafana panels and their PromQL queries. Look for selectors that name only a metric with no label matchers, or that rely on catch-all regex matchers such as instance=~".*".
    • Fix: Modify queries to include specific labels. For example, change sum(rate(node_cpu_seconds_total{mode="idle"}[5m])) to sum(rate(node_cpu_seconds_total{mode="idle", job="my-app-server"}[5m])).
    • Why it works: Prometheus’s index allows it to quickly locate series matching specific label sets, avoiding a full scan of all series.
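
To make the index argument concrete, here is a toy inverted index: each label pair maps to the set of series that carry it, and a selector is answered by intersecting those sets. This is only an illustrative model under simplified assumptions, not Prometheus's actual TSDB implementation; the function names and sample series are made up.

```python
from collections import defaultdict


def build_index(series: list) -> dict:
    """Map each (label name, label value) pair to the IDs of series that have it."""
    index = defaultdict(set)
    for series_id, labels in enumerate(series):
        for name, value in labels.items():
            index[(name, value)].add(series_id)
    return index


def select(index: dict, matchers: dict) -> set:
    """Return IDs of series matching every equality matcher."""
    postings = [index.get(pair, set()) for pair in matchers.items()]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical series; "__name__" holds the metric name.
series = [
    {"__name__": "node_cpu_seconds_total", "mode": "idle", "job": "my-app-server"},
    {"__name__": "node_cpu_seconds_total", "mode": "user", "job": "my-app-server"},
    {"__name__": "node_cpu_seconds_total", "mode": "idle", "job": "batch"},
    {"__name__": "node_memory_MemFree_bytes", "job": "my-app-server"},
]
index = build_index(series)

broad = select(index, {"__name__": "node_cpu_seconds_total"})
narrow = select(index, {"__name__": "node_cpu_seconds_total",
                        "mode": "idle", "job": "my-app-server"})
print(len(broad), "series for the broad selector,", len(narrow), "for the narrow one")
```

Every extra matcher shrinks the candidate set before any sample data is read, which is the effect the fix relies on.
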
  3. Use Recording Rules for Expensive Queries: If you have dashboard panels that consistently run complex or long-range queries, pre-compute their results.

    • Diagnosis: Identify queries in Grafana that take more than a few seconds to load and are used frequently.
    • Fix: Create a Prometheus recording rule. For example, to pre-compute average CPU usage per instance over a 5-minute window:
      groups:
      - name: host_metrics
        rules:
        - record: instance:node_cpu_seconds_total:avg_idle_5m
          expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      
      Then, in Grafana, query instance:node_cpu_seconds_total:avg_idle_5m instead.
    • Why it works: Prometheus evaluates the rule at the rule group’s evaluation interval (e.g., every minute) and stores the result as a new time series. Your Grafana dashboard then reads this pre-computed series instead of recomputing the expression over every raw series on each refresh, which is typically much faster.
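
For the diagnosis step, one way to find candidates for recording rules is to time an expression through Prometheus's HTTP API, using the query_range endpoint. The sketch below assumes a server at http://localhost:9090; the helper names are hypothetical, and the fetch argument exists only so the timing logic can be exercised without a live server.

```python
import time
import urllib.parse
import urllib.request


def build_range_query_url(base_url: str, query: str, start: float,
                          end: float, step: int) -> str:
    """Build a /api/v1/query_range URL for the given PromQL expression."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def time_query(url: str, fetch=None) -> float:
    """Return the wall-clock seconds taken to fetch `url`."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
    started = time.perf_counter()
    fetch(url)
    return time.perf_counter() - started


now = time.time()
url = build_range_query_url(
    "http://localhost:9090",  # assumed address of the Prometheus server
    'avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    start=now - 24 * 3600,
    end=now,
    step=60,
)
# Against a running server: print(f"{time_query(url):.2f}s")
```

Timing the raw expression and the recorded series over the same range gives a direct before-and-after comparison.
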
  4. Optimize rate() and increase() Aggregations: rate() and increase() read every raw counter sample inside their window for every matching series, so their cost grows with both the window length and the number of series. Apply them to the raw counters first, then aggregate the results in PromQL so that only the series the panel actually needs are returned.

    • Diagnosis: Look for panels that fetch one series per instance, such as sum by (instance) (rate(node_network_receive_bytes_total[5m])), only to add them up again with a Grafana transformation.
    • Fix: Let Prometheus do the aggregation. If you want the total network traffic across all instances, query sum(rate(node_network_receive_bytes_total[5m])) instead of sum by (instance) (rate(node_network_receive_bytes_total[5m])) followed by a sum in Grafana.
    • Why it works: Prometheus calculates the rate for each series and sums those rates server-side, so a single series travels to Grafana instead of one per instance. The order also matters for correctness: rate() has to see each raw counter so it can account for counter resets, which is why the aggregation wraps rate() and not the other way around.
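
The "rate first, then aggregate" order is also what keeps results correct when a counter resets. The sketch below uses a deliberately simplified increase function (it handles resets but skips the extrapolation that Prometheus's rate() and increase() perform) and made-up sample values to show how summing raw counters first distorts the result.

```python
def increase(samples: list) -> float:
    """Simplified counter increase: a drop in value is treated as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical raw counter samples for two instances; instance_b restarts mid-window.
instance_a = [100, 110, 120, 130]
instance_b = [500, 510, 5, 15]

# Correct order: increase per series, then sum.
per_series_then_sum = increase(instance_a) + increase(instance_b)

# Wrong order: sum the raw counters, then take the increase.
summed_raw = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_increase = increase(summed_raw)

print("increase per series, then sum:", per_series_then_sum)
print("sum raw counters, then increase:", sum_then_increase)
```

In the second case the reset of one instance makes the combined counter drop, and the reset handling then counts the other instance's entire running total as new traffic.
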
  5. Leverage unless and or Sparingly: While powerful, unless and or clauses can sometimes force Prometheus to evaluate larger sets of series than necessary, especially if the left-hand side of the unless or or is very broad.

    • Diagnosis: Review queries that use unless or or and have a broad selector on one side.
    • Fix: Try to make both sides of the operator as specific as possible using labels. If node_cpu_seconds_total{mode="idle"} unless node_cpu_seconds_total{mode="user"} is slow, and you only care about idle CPU on app servers, try node_cpu_seconds_total{mode="idle", job="my-app-server"} unless on(instance) node_cpu_seconds_total{mode="user", job="my-app-server"}. Note that without on(instance) the two sides never match, because their mode labels differ, so the unless filters nothing while still forcing Prometheus to load both sides.
    • Why it works: By constraining both operands with specific labels, Prometheus can more efficiently prune the series it needs to consider for the set operation.
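
The matching behaviour of unless can be modelled in a few lines. This is a simplified sketch of the set semantics only (series are plain label dictionaries, the metric name is omitted, and values and timestamps are ignored); the function name and sample data are hypothetical.

```python
def unless(left: list, right: list, on=None) -> list:
    """Keep left-hand series that have no match on the right.

    With `on`, only the listed labels are compared; otherwise all labels are.
    """
    def key(labels: dict) -> tuple:
        names = on if on is not None else sorted(labels)
        return tuple((name, labels.get(name)) for name in names)

    right_keys = {key(labels) for labels in right}
    return [labels for labels in left if key(labels) not in right_keys]


idle = [
    {"instance": "a", "mode": "idle"},
    {"instance": "b", "mode": "idle"},
]
user = [
    {"instance": "a", "mode": "user"},
]

# Matching on all labels: the mode label differs, so nothing is removed.
print(unless(idle, user))

# Matching on instance only: instance "a" is removed.
print(unless(idle, user, on=["instance"]))
```

Both operands still have to be loaded before the comparison, which is why narrowing each side with label matchers reduces the work.
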
  6. Check Prometheus Server Resources: A struggling Prometheus server itself will make all queries slow.

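    • Diagnosis: Monitor your Prometheus server’s CPU, memory, and disk I/O. Look for high load averages, swapping, or disk saturation. Check the Status pages in Prometheus’s web UI (e.g., TSDB Status) for head series and chunk counts, and Prometheus’s own metrics, such as prometheus_tsdb_head_samples_appended_total, for the ingestion rate.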
    • Fix:
      • CPU/Memory: Increase resources for the Prometheus process or node.
      • Disk I/O: Move Prometheus’s data directory to faster storage (e.g., SSDs).
      • Ingestion Rate: If Prometheus is being overwhelmed by too many metrics, consider adjusting scrape intervals, reducing the number of metrics scraped per target, or scaling out Prometheus (e.g., using Thanos or Cortex).
    • Why it works: A healthy Prometheus server can efficiently read from its TSDB blocks and perform query computations. Resource starvation directly cripples its ability to do so.
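
When ingestion volume is the suspect, the TSDB status endpoint of the HTTP API (/api/v1/status/tsdb) reports which metric names own the most series. The sketch below ranks them from an already-parsed response; the payload is a hypothetical, trimmed example, and the field names follow my reading of the documented response shape, so verify them against your Prometheus version.

```python
def top_metrics_by_series(tsdb_status: dict, limit: int = 3) -> list:
    """Return (metric name, series count) pairs, largest first."""
    entries = tsdb_status.get("data", {}).get("seriesCountByMetricName", [])
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]


# Hypothetical, trimmed response body from /api/v1/status/tsdb.
sample_status = {
    "status": "success",
    "data": {
        "headStats": {"numSeries": 61_500, "chunkCount": 120_000},
        "seriesCountByMetricName": [
            {"name": "node_cpu_seconds_total", "value": 16_000},
            {"name": "http_request_duration_seconds_bucket", "value": 42_000},
            {"name": "node_network_receive_bytes_total", "value": 3_500},
        ],
    },
}

for name, count in top_metrics_by_series(sample_status, limit=2):
    print(f"{name}: {count:,} series")
```

Metrics at the top of this list are the first candidates for dropping unused labels or scraping less often.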

After applying these, you might encounter issues with max_over_time or quantile_over_time on very large datasets, which often require similar strategies of data reduction or pre-aggregation.
