Kafka Streams applications can fall behind their input topics, a state known as "lag," and this lag can grow silently until it causes downstream issues.

Imagine a Kafka Streams app processing an order stream. Each new order hits Kafka, and your app consumes it, updates a database, and perhaps publishes a "shipped" event. If your app slows down—maybe due to a database bottleneck or a bug—orders will start piling up in the input topic. Kafka Streams keeps track of how far behind it is for each partition using offsets. Monitoring this offset lag is crucial.

Here’s a simplified Kafka Streams topology:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");

KTable<String, CustomerOrderSummary> customerOrderSummaries = orders
    .groupByKey()
    .aggregate(
        () -> new CustomerOrderSummary(0, 0.0), // Initializer
        (key, order, summary) -> summary.addOrder(order), // Aggregator
        Materialized.as("customer-order-summary-store")
    );

customerOrderSummaries
    .toStream()
    .to("customer-order-summaries", Produced.with(Serdes.String(), orderSummarySerde));

In this example, orders is the input topic, and customer-order-summaries is the output topic. The customer-order-summary-store is a local state store that Kafka Streams manages.

The core idea behind lag monitoring is comparing the "current" offset of a topic partition (the end of the log) with the "processed" offset by your Kafka Streams application for that partition. If current_offset - processed_offset > 0, you have lag.

The primary tool for observing this is Kafka’s own command-line utilities, specifically kafka-consumer-groups.sh. You’ll use it to inspect the consumer group your Kafka Streams application belongs to.

First, identify your Kafka Streams application’s consumer group ID. This is usually configured in your StreamsConfig with StreamsConfig.APPLICATION_ID_CONFIG. Let’s say it’s my-order-processor.

Now, run the following command to see the lag for all topics consumed by this group:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-order-processor --describe

The output will look something like this:

GROUP           TOPIC           PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-HOST CONSUMER-CLIENT-ID
my-order-processor orders          0         1500           1650           150   localhost/127.0.0.1:xxxx my-order-processor-xxxx
my-order-processor orders          1         1200           1200           0     localhost/127.0.0.1:xxxx my-order-processor-xxxx
my-order-processor orders          2         2100           2300           200   localhost/127.0.0.1:xxxx my-order-processor-xxxx

Here, CURRENT-OFFSET is the offset your Streams app has processed up to for that partition. LOG-END-OFFSET is the latest offset available in the topic partition. LAG is the difference.

A LAG of 0 means your application is caught up on that partition. A LAG greater than 0 indicates a processing delay. The higher the LAG, the further behind your application is.

To make lag monitoring actionable, you need to set up alerts. Most monitoring systems (Prometheus, Datadog, etc.) can scrape metrics. Kafka exposes consumer group information, including lag, which these systems can ingest. You’d typically set a threshold (e.g., if LAG > 1000 for more than 5 minutes) to trigger an alert.

When you see significant lag, the first step is to understand why. Is it a general processing slowdown, or is it specific to certain partitions? If it’s partition-specific, examine the data in those partitions. Are there unusually large messages, complex operations, or malformed data causing your processing logic to take longer?

For a Kafka Streams application, a common cause of lag is resource contention: insufficient CPU, memory, or network bandwidth for the processing threads. Another frequent culprit is a slow downstream system your Streams app writes to or queries. If your aggregate or process methods are doing blocking I/O (e.g., slow database calls without proper asynchronous handling), lag will inevitably build up.

Sometimes, the issue is within Kafka Streams itself. For instance, if you’re using KTable.toStream() and then re-partitioning with groupByKey() on a very high-cardinality key, the repartitioning topic can become a bottleneck. The default num.stream.threads might also be too low for your workload, meaning not enough parallel processing threads are available.

The most direct way to increase processing capacity is to increase the number of Kafka Streams threads. This is controlled by StreamsConfig.NUM_STREAM_THREADS_CONFIG. For example, to use 4 threads:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-order-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "4"); // Increase threads
// ... other configurations

This tells Kafka Streams to create 4 instances of your topology, each processing a subset of the partitions independently.

If your lag is caused by a slow database, consider optimizing your queries, adding indexes, or using a connection pool. For Kafka Streams applications, it’s also common to implement a "dead letter queue" (DLQ) pattern. If a message consistently fails processing after several retries, send it to a separate DLQ topic for later inspection, rather than letting it block the main processing pipeline.

You can configure the number of Kafka Streams threads to be equal to the number of partitions in your input topics. This ensures that each partition can be processed by a dedicated thread if available, maximizing parallelism. However, if you have significantly more partitions than available CPU cores, this might not be beneficial and could even lead to excessive context switching.

The next hurdle you’ll likely encounter is managing state store performance when dealing with high throughput and large amounts of data.

Want structured learning?

Take the full Kafka-streams course →