The Grafana state history query failed because the datasource backend couldn’t materialize the requested time series data within the configured timeout.

This usually means one of a few things is happening:

  • Datasource Timeout Too Low: The most frequent culprit. Grafana’s default query timeout is 30 seconds, but complex state history queries, especially over long time ranges or with many series, can easily exceed this. The datasource backend receives the query and starts processing it, but Grafana gives up waiting before the results come back.

    • Diagnosis: Check the Grafana server logs (/var/log/grafana/grafana.log or similar). You’ll see entries like t=... lvl=eror msg="Query failed" error="context deadline exceeded" or error="context canceled".
    • Fix: Increase the datasource timeout. In Grafana’s UI, go to Configuration -> Data Sources (Connections -> Data sources in newer versions), select your datasource, and find the "Timeout" setting. The field takes a value in seconds, so raise it from the default of 30 to 60 or 120.
    • Why it works: This gives the datasource backend more time to compute the state history and return the data before Grafana considers the query a failure.
  • Inefficient Query Logic: The query itself is asking for too much data or is structured in a way that’s computationally expensive for the underlying database. This is common with last() or count_over_time() on very high-cardinality series or over vast time ranges.

    • Diagnosis: Examine the exact query being run in the Grafana alert rule. Look at the time range and the number of series it’s trying to process. Try running the same query directly against the datasource’s native query interface (if available) with a smaller time range to see if it’s slow.
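    • Fix: Optimize the query. Shorten range-vector windows and aggregate as early as possible: for example, sum by (job) (count_over_time(metric[5m])) returns one series per job instead of one per underlying series. If you only care whether the value changed, changes(metric[5m]) > 0 expresses that directly, but it is not a drop-in replacement for count_over_time(metric[5m]) > 0, which is true whenever any sample exists in the window. Note that label_replace only rewrites labels and does not reduce the series count; use aggregation operators such as sum by (...) for that. Limit the time range if the alert doesn’t require historical data from months ago.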
    • Why it works: A more efficient query reduces the processing load on the datasource, allowing it to complete faster and return results within the timeout.
  • High Cardinality / Too Many Series: The query is trying to fetch data for an extremely large number of distinct series. Even if the query logic is simple, the sheer volume of data points and series metadata can overwhelm the datasource and the network connection.

    • Diagnosis: In Grafana, when editing the alert rule, look at the "Query" tab. If you see hundreds or thousands of series listed as results (even if filtered), this is likely the issue.
    • Fix: Filter at the datasource level so that fewer series are selected. In Prometheus, use label matchers; other datasources have similar mechanisms. For example, instead of my_metric, query my_metric{job="my_app", env="production"}. Use sum by (label1, label2) (my_metric) if you only need aggregated data.
    • Why it works: By reducing the number of series the datasource has to scan and aggregate, the query execution time is drastically reduced.
  • Datasource Backend Resource Constraints: The server hosting your datasource (e.g., Prometheus, InfluxDB) is under-resourced. It might be running out of CPU, RAM, or I/O capacity, preventing it from executing queries quickly.

    • Diagnosis: Monitor the resource utilization of your datasource server(s). Look for high CPU load, low available memory, high disk I/O wait times, or network saturation during the times the alerts are failing.
    • Fix: Scale up your datasource infrastructure. This could mean adding more CPU cores, increasing RAM, upgrading to faster storage, or distributing the load across multiple instances.
    • Why it works: Adequate resources allow the datasource to process queries efficiently and respond within Grafana’s timeout, including any increase you have configured.
  • Network Latency or Bandwidth Issues: Network problems more often show up as connection errors than as query timeouts. Still, high latency or low bandwidth between Grafana and the datasource can push a query past the timeout when the result set is large and data transfer is slow.

    • Diagnosis: Use ping and traceroute from the Grafana server to the datasource server. Check network monitoring tools for packet loss or congestion.
    • Fix: Improve network connectivity. This might involve optimizing network routes, increasing bandwidth, or ensuring the Grafana and datasource servers are geographically closer or on the same high-speed network segment.
    • Why it works: Faster data transfer between Grafana and the datasource means more data can be sent back within the timeout period.
  • Datasource Instance Unhealthy/Restarting: The datasource instance that Grafana is trying to query might be in a bad state, restarting, or undergoing maintenance, causing it to be unresponsive.

    • Diagnosis: Check the status of your datasource service (e.g., systemctl status prometheus or systemctl status influxdb, depending on how the service is installed). Look for recent restarts or error messages in the datasource’s own logs.
    • Fix: Ensure your datasource instances are healthy and stable. Address any underlying issues causing restarts or unresponsiveness. If you have multiple instances, ensure Grafana is configured to query healthy ones.
    • Why it works: A healthy, responsive datasource can execute and return query results reliably.
  • Grafana Server Load: In rare cases, the Grafana server itself might be overloaded, struggling to process incoming requests or manage its own internal operations, contributing to timeouts.

    • Diagnosis: Monitor the Grafana server’s CPU, memory, and network usage. Check Grafana’s own request latency metrics.
    • Fix: Scale up the Grafana server resources or optimize its configuration. Consider adjusting settings related to concurrent requests or panel rendering if those are identified as bottlenecks.
    • Why it works: A Grafana server with sufficient capacity can efficiently manage its connections and process query results from datasources.
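
The log check from the first diagnosis step can be scripted so you can see how often queries are timing out. The following is a minimal sketch, not an official Grafana tool: it assumes the log uses key=value lines like the example quoted above, and the names parse_line, count_timeouts, and report are hypothetical helpers introduced here for illustration.

```python
import re
from collections import Counter

# Error strings quoted in the diagnosis step above.
TIMEOUT_ERRORS = ("context deadline exceeded", "context canceled")

# Matches key=value and key="quoted value" pairs in a logfmt-style line.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return the key/value pairs of one logfmt-style log line as a dict."""
    return {key: value.strip('"') for key, value in _PAIR.findall(line)}


def count_timeouts(lines):
    """Count log entries whose error field is one of TIMEOUT_ERRORS."""
    counts = Counter()
    for line in lines:
        error = parse_line(line).get("error", "")
        if error in TIMEOUT_ERRORS:
            counts[error] += 1
    return counts


def report(path):
    """Print timeout counts for a log file (the path is an assumption;
    adjust it to wherever your Grafana instance writes its log)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for error, total in count_timeouts(handle).most_common():
            print(f"{total:6d}  {error}")
```

Calling report("/var/log/grafana/grafana.log") before and after a change (a higher timeout, a narrower query) gives a rough signal of whether the fix reduced the number of timed-out queries. If your log lines use different field names, adjust the "error" key accordingly.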

After fixing these, you’ll likely encounter the next common alerting issue: "Alert is flapping between Pending and Firing."
