InfluxDB’s internal metrics work as a built-in monitoring system: the server publishes them in the Prometheus text format, so Prometheus can scrape InfluxDB directly rather than treating it only as a passive data source.
Let’s look at a running InfluxDB instance. Imagine we’re tracking user login attempts and request latency. Here’s a snippet of what InfluxDB might expose via its /metrics endpoint, which Prometheus is configured to pull:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="GET",path="/query",handler="query"} 12345
http_requests_total{code="404",method="GET",path="/query",handler="query"} 5
# HELP http_request_duration_seconds Duration of HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="query",le="0.05"} 10000
http_request_duration_seconds_bucket{handler="query",le="0.1"} 11000
http_request_duration_seconds_bucket{handler="query",le="0.25"} 12000
http_request_duration_seconds_bucket{handler="query",le="+Inf"} 12345
http_request_duration_seconds_sum{handler="query"} 1234.56
http_request_duration_seconds_count{handler="query"} 12345
# HELP go_goroutines Number of goroutines that currently exist
# TYPE go_goroutines gauge
go_goroutines 50
# HELP influxdb_build_info Build information about InfluxDB
# TYPE influxdb_build_info gauge
influxdb_build_info{version="1.8.10"} 1
This output, when scraped by Prometheus, allows us to build dashboards and alerts on InfluxDB’s own performance. The http_requests_total counter tells us how many requests InfluxDB is serving, broken down by HTTP status code, method, path, and the handler that processed them. The http_request_duration_seconds histogram provides latency insights, crucial for understanding user experience. go_goroutines reports how many goroutines the InfluxDB process is running, which is a direct indicator of internal concurrency; a count that keeps climbing can point to a leak or a backlog. influxdb_build_info is a simple gauge that confirms the version being scraped.
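To make the label breakdown concrete, here is a minimal Python sketch that parses a few lines of the snippet above and totals http_requests_total by status code. The helper names (parse_metrics, requests_by_code) are hypothetical and exist only for this illustration. The parser is deliberately simplified: it assumes one sample per line with no trailing timestamp, which matches the snippet but not every valid exposition.

```python
import re
from collections import defaultdict

# Sample lines copied from the /metrics snippet above.
SAMPLE = '''\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="GET",path="/query",handler="query"} 12345
http_requests_total{code="404",method="GET",path="/query",handler="query"} 5
# TYPE go_goroutines gauge
go_goroutines 50
'''

# metric_name{optional labels} value  (timestamps are not handled here)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # blank line, or a HELP/TYPE comment
        match = LINE_RE.match(line)
        if match is None:
            continue  # shape this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        yield name, labels, float(value)


def requests_by_code(text):
    """Sum http_requests_total across all other labels, keyed by status code."""
    totals = defaultdict(float)
    for name, labels, value in parse_metrics(text):
        if name == 'http_requests_total':
            totals[labels.get('code', '')] += value
    return dict(totals)


if __name__ == '__main__':
    print(requests_by_code(SAMPLE))
```

In practice Prometheus does this parsing for you. If you need it in Python, the prometheus_client package ships a parser (text_string_to_metric_families) that covers the full format.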
The problem InfluxDB’s internal metrics solve is the "black box" issue. Without them, you’d only see whether InfluxDB was responding at the network level or whether your application could write to it. You wouldn’t know why it might be slow or failing internally. These metrics expose the internal workings, such as request handling, goroutine counts, and memory usage (not shown in the snippet above, but other metrics cover it), so you can identify bottlenecks proactively.
To get this working, you need to:
- Enable the HTTP API and metrics endpoint: In your influxdb.conf (or equivalent configuration), ensure the http section is configured. For InfluxDB v1.x, it’s often enabled by default if you have an http section. InfluxDB v2.x has it enabled by default on port 8086.

- Configure Prometheus to scrape: In your prometheus.yml file, add a job to scrape the InfluxDB metrics endpoint. For a typical InfluxDB v1.x setup on localhost:8086, this would look like:

      scrape_configs:
        - job_name: 'influxdb'
          static_configs:
            - targets: ['localhost:8086']
          metrics_path: '/metrics'  # This is the default, but good to be explicit

  For InfluxDB v2.x, the metrics are often available at the same /metrics endpoint on the primary HTTP port (usually 8086).

- Utilize the metrics: Once Prometheus is scraping, you can query these metrics. For example, to see the rate of 5xx errors from InfluxDB:

      rate(http_requests_total{code=~"5..", job="influxdb"}[5m])

  To calculate the 95th percentile of query request durations:

      histogram_quantile(0.95, sum by (le, handler) (rate(http_request_duration_seconds_bucket{handler="query", job="influxdb"}[5m])))
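To see roughly what histogram_quantile does with those buckets, the following Python sketch estimates a quantile by linear interpolation inside the bucket that contains the target rank. It is a simplification under stated assumptions: Prometheus works on per-second bucket rates produced by rate(), whereas this sketch feeds in the raw cumulative counts from the snippet above as a stand-in, and the function name estimate_quantile is hypothetical.

```python
import math

# Cumulative bucket counts from the snippet above: (upper bound "le", count).
BUCKETS = [(0.05, 10000), (0.1, 11000), (0.25, 12000), (math.inf, 12345)]


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    Finds the first bucket whose cumulative count reaches the target rank
    and interpolates linearly between that bucket's lower and upper bound.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: no upper bound to interpolate
                # toward, so report the highest finite bound instead.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == '__main__':
    print(estimate_quantile(0.95, BUCKETS))
```

With these counts the 95th percentile lands in the 0.1 to 0.25 second bucket, so the estimate can be no more precise than that bucket is wide. That is why bucket boundaries matter near the latencies you care about.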
The most surprising thing is how many different types of internal metrics InfluxDB exposes, from garbage collection pauses and memory allocation patterns to cache hit rates and query execution details. Many users only scrape the most obvious ones like HTTP request counts and durations, missing the deeper insights into how the database’s Go runtime and internal components are behaving. For instance, Go runtime metrics such as go_memstats_alloc_bytes and go_gc_duration_seconds (the names used by the standard Prometheus Go collector, which also produces go_goroutines) can reveal memory pressure or inefficient garbage collection cycles that directly impact query performance, long before user-facing errors appear.
The next concept you’ll likely explore is using these metrics to build sophisticated alerting rules, such as detecting a sustained increase in goroutines or a degradation in query latency percentiles.