Kafka lag exporter is a useful tool for visualizing Kafka consumer lag in Grafana, but its real value lies in exposing Kafka’s internal offset state to Prometheus, a system designed for metrics rather than message queues.

Here is a minimal setup: a Prometheus scrape job pointed at the exporter, followed by the kind of samples the exporter serves for one partition:

# prometheus.yml
scrape_configs:
  - job_name: 'kafka_lag_exporter'
    static_configs:
      - targets: ['kafka-lag-exporter:9308'] # The address where kafka-lag-exporter is listening

# Example samples served by kafka-lag-exporter (before Prometheus scrapes them),
# shown in the Prometheus text exposition format
kafka_consumergroup_lag{consumergroup="my-app-consumer",topic="user-events",partition="0"} 155
kafka_consumergroup_offset{consumergroup="my-app-consumer",topic="user-events",partition="0"} 12345
kafka_topic_partition_current_offset{topic="user-events",partition="0"} 12500

This data, once scraped by Prometheus, allows you to build dashboards in Grafana that show not just the lag (the difference between the latest offset and the consumer’s offset), but also the underlying offsets themselves. This is crucial because lag is a derived metric. Understanding the absolute offsets on both the topic partition and the consumer group tells you why the lag is what it is. If the topic partition offset is climbing much faster than usual, producers are writing faster than the consumers can absorb, for example during a traffic spike. If the consumer group offset is flat while the topic offset keeps advancing, the consumers are stalled or processing too slowly.
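To make the "lag is derived" point concrete, here is a minimal Python sketch that computes lag from two offsets and gives a rough diagnosis from two consecutive samples. The names OffsetSample and diagnose are hypothetical and chosen for illustration; they are not part of the exporter, and the classification rules are a simplification of the reasoning above.

```python
from dataclasses import dataclass


@dataclass
class OffsetSample:
    """One observation of a single partition (hypothetical helper type)."""

    log_end_offset: int    # latest offset written to the topic partition
    committed_offset: int  # offset last committed by the consumer group

    @property
    def lag(self) -> int:
        # Lag is derived: messages written but not yet consumed.
        return max(self.log_end_offset - self.committed_offset, 0)


def diagnose(earlier: OffsetSample, later: OffsetSample) -> str:
    """Roughly classify how lag changed between two samples."""
    consumed = later.committed_offset - earlier.committed_offset
    if later.lag <= earlier.lag:
        return "keeping up"
    if consumed == 0:
        # Lag grew and the consumer offset did not move at all.
        return "consumer stalled"
    # Lag grew even though the consumer made progress.
    return "producers outpacing consumers"


if __name__ == "__main__":
    before = OffsetSample(log_end_offset=12000, committed_offset=11900)
    after = OffsetSample(log_end_offset=12500, committed_offset=12345)
    print(after.lag, diagnose(before, after))
```

The same lag value can come from very different offset movements, which is why plotting both offsets alongside the lag is worthwhile.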

The core problem Kafka lag exporter solves is making Kafka’s internal offset management visible to external monitoring systems. Kafka itself tracks these offsets, but they aren’t easily queryable in a way that Prometheus can understand without an intermediary. The exporter acts as that intermediary, translating Kafka’s broker-level offset information into Prometheus-compatible metrics. You configure it with your Kafka broker addresses, and it continuously queries for consumer group offsets and topic partition offsets.
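The translation step described above can be sketched in a few lines of Python. This is not the exporter’s actual code: it skips the Kafka queries entirely and only shows how a pair of offsets might be turned into Prometheus-style samples. The function names render_metric and export_partition are hypothetical, and the metric names are the ones used elsewhere in this article.

```python
def render_metric(name: str, labels: dict[str, str], value: int) -> str:
    """Format one sample in the Prometheus text exposition format.

    Simplified: label values are assumed not to need escaping.
    """
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


def export_partition(
    group: str,
    topic: str,
    partition: int,
    log_end_offset: int,
    committed_offset: int,
) -> list[str]:
    """Turn raw offsets for one partition into three metric lines."""
    topic_labels = {"topic": topic, "partition": str(partition)}
    group_labels = {**topic_labels, "consumergroup": group}
    return [
        render_metric("kafka_topic_partition_current_offset", topic_labels, log_end_offset),
        render_metric("kafka_consumergroup_offset", group_labels, committed_offset),
        render_metric("kafka_consumergroup_lag", group_labels, log_end_offset - committed_offset),
    ]


if __name__ == "__main__":
    for line in export_partition("my-app-consumer", "user-events", 0, 12500, 12345):
        print(line)
```

A real exporter would obtain the two offsets from the brokers on a timer and serve the rendered lines over HTTP for Prometheus to scrape.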

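The key levers you control are listed below. Flag names differ between exporter projects and versions, so confirm them against the --help output of the build you run: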

  • --kafka.bootstrap-servers: The comma-separated list of your Kafka broker addresses. This is how the exporter connects to Kafka. For example: kafka-broker-1:9092,kafka-broker-2:9092.
  • --kafka.version: The Kafka protocol version to use. If you’re unsure, setting it to 2.0.0 or higher is usually safe for modern Kafka clusters.
  • --group.ids: A comma-separated list of specific consumer group IDs you want to monitor. If omitted, it attempts to discover all consumer groups, which can be resource-intensive on large clusters.
  • --topics: Similar to --group.ids, this allows you to filter for specific topics.

When you set up a Grafana dashboard, you’ll typically query Prometheus using PromQL. A common query to visualize consumer lag for a specific consumer group and topic might look like this:

sum by (consumergroup, topic) (
  kafka_consumergroup_lag{consumergroup="my-app-consumer", topic="user-events"}
)

This aggregates the lag across all partitions for that consumer group and topic. You can also visualize the raw offsets:

sum by (consumergroup, topic) (
  kafka_consumergroup_offset{consumergroup="my-app-consumer", topic="user-events"}
)

And compare them against the latest topic offsets:

sum by (topic) (
  kafka_topic_partition_current_offset{topic="user-events"}
)

By plotting these together, you get a clear picture of whether your consumers are keeping up with the producers.
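If you want to pull the same three series outside Grafana, for example in a quick script, the queries above can be sent to the Prometheus HTTP API’s instant-query endpoint (/api/v1/query). The sketch below only builds the PromQL strings and request URLs and does not perform any network calls. The base URL http://prometheus:9090 and the helper names are assumptions made for illustration.

```python
from urllib.parse import urlencode


def selector(metric: str, **labels: str) -> str:
    """Build a PromQL selector such as metric{topic="x"}."""
    matchers = ", ".join(f'{key}="{val}"' for key, val in labels.items())
    return f"{metric}{{{matchers}}}"


def comparison_queries(group: str, topic: str) -> dict[str, str]:
    """Return the three PromQL queries used to compare lag and offsets."""
    lag = selector("kafka_consumergroup_lag", consumergroup=group, topic=topic)
    committed = selector("kafka_consumergroup_offset", consumergroup=group, topic=topic)
    log_end = selector("kafka_topic_partition_current_offset", topic=topic)
    return {
        "lag": f"sum by (consumergroup, topic) ({lag})",
        "committed_offset": f"sum by (consumergroup, topic) ({committed})",
        "log_end_offset": f"sum by (topic) ({log_end})",
    }


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Assumed Prometheus address; replace with your own.
    for name, promql in comparison_queries("my-app-consumer", "user-events").items():
        print(name, instant_query_url("http://prometheus:9090", promql))
```

Fetching each URL returns a JSON document whose data.result field carries the current value of that series.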

The one thing most people don’t realize is that kafka-lag-exporter can also report the size of a consumer group (how many members are in it) and the number of partitions for a given topic. These metrics, kafka_consumergroup_members and kafka_topic_partitions, describe the topology of your Kafka setup: a fluctuating member count points to consumer rebalances, and a rising partition count flags unexpected topic growth.

Once you have lag visualization, the next logical step is to set up alerting based on that lag exceeding certain thresholds.
