Kafka consumers get stuck, processing messages slower than producers send them, causing a backlog.

Common Causes and Fixes for Kafka Consumer Lag

1. Insufficient Consumer Resources:

  • Diagnosis: Check consumer CPU and memory usage. If consistently high, the consumer instance is struggling. Use docker stats <consumer_container_id> or htop on the host.
  • Fix: Increase the resources allocated to your consumer instances. For Docker, this might be docker run --cpus=2 --memory=4g .... For Kubernetes, adjust resource requests/limits in the deployment YAML. This gives the consumer more processing power and memory to handle the message throughput.
  • Why it works: More CPU cycles and RAM allow the consumer application to deserialize, process, and commit offsets more quickly.

2. Inefficient Consumer Processing Logic:

  • Diagnosis: Profile your consumer application to identify bottlenecks. Is there a slow database query, an external API call with high latency, or complex data transformations? Use application performance monitoring (APM) tools or built-in profiling libraries.
  • Fix: Optimize the slow parts of your processing logic. Batch database inserts instead of single inserts. Implement caching for frequently accessed external data. Parallelize independent processing tasks within a single consumer instance using thread pools.
  • Why it works: Reducing the time spent on each message directly increases the rate at which messages can be processed.

3. Incorrect fetch.min.bytes and fetch.max.wait.ms Configuration:

  • Diagnosis: Examine your consumer’s fetch.min.bytes and fetch.max.wait.ms settings. If fetch.min.bytes is too high, the consumer waits for a large batch, potentially delaying processing. If fetch.max.wait.ms is too short, it might fetch small batches frequently, leading to increased network overhead.
  • Fix: Tune these parameters. A common starting point is fetch.min.bytes=1 (to get messages as soon as they are available) and fetch.max.wait.ms=500 (to allow some batching without excessive delay). Adjust based on your message size and desired latency.
  • Why it works: These settings control how aggressively the consumer requests data from Kafka. Lowering fetch.min.bytes ensures quicker retrieval, while fetch.max.wait.ms balances latency with fetch efficiency.

4. Network Latency Between Consumer and Broker:

  • Diagnosis: Measure the network latency from your consumer instances to your Kafka brokers. Use ping or traceroute from the consumer environment to the broker IPs/hostnames. High latency (e.g., > 50ms) can significantly slow down fetch requests.
  • Fix: Co-locate consumers and brokers in the same network region or availability zone if possible. Optimize network routing. Ensure sufficient network bandwidth between the consumer and broker clusters.
  • Why it works: Reduced latency means faster data retrieval from Kafka, allowing the consumer to receive messages more quickly.

5. Kafka Broker Under-provisioning:

  • Diagnosis: Monitor your Kafka brokers for high CPU, I/O wait, or network saturation. If brokers are struggling to serve fetch requests, consumers will naturally lag. Use broker-side metrics (e.g., kafka.network.RequestMetrics.TotalTimeMs, kafka.server.BrokerTopicMetrics.BytesInPerSec).
  • Fix: Scale up your Kafka broker cluster by adding more brokers or increasing the resources (CPU, RAM, disk I/O) of existing brokers.
  • Why it works: Healthy brokers can respond to consumer fetch requests quickly and efficiently, preventing a bottleneck at the source.

6. Inefficient Offset Committing:

  • Diagnosis: Observe the frequency and success rate of offset commits. If commits are failing or happening too infrequently, the consumer might be re-processing messages after a restart, or the commit process itself is a bottleneck. Check consumer logs for commit errors or long commit durations.
  • Fix: Ensure enable.auto.commit is set to false and implement manual commits after successful processing. Commit offsets periodically (e.g., every 1000 messages or every 30 seconds) using consumer.commitSync() or consumer.commitAsync() to balance throughput and fault tolerance.
  • Why it works: Manual commits give you control over when offsets are committed, ensuring they are only committed for fully processed messages and avoiding redundant work.

7. Message Duplication and Re-processing:

  • Diagnosis: If your processing logic is not idempotent (meaning processing a message multiple times has the same effect as processing it once), and you have consumer restarts or Kafka rebalances, you might be reprocessing messages. This artificially inflates the perceived lag.
  • Fix: Make your message processing logic idempotent. This can involve checking if a record has already been processed (e.g., by checking a database) before applying changes, or using unique transaction IDs.
  • Why it works: Idempotency ensures that even if a message is re-processed due to a rebalance or failure, it doesn’t lead to incorrect state, effectively reducing the "work" the consumer needs to do.

After fixing these issues, you might encounter java.net.SocketTimeoutException: Read timed out if your consumer is still too slow to acknowledge messages within the broker’s configured replica.lag.time.max.ms.

Want structured learning?

Take the full Kafka course →