The Kafka broker kafka-01.example.com:9092 is actively rejecting connections from your consumer my-consumer-group because the consumer is failing its heartbeat checks, indicating it’s either too slow to process messages or has crashed.

This usually means your consumer is struggling to keep up with the message ingestion rate. The broker, trying to maintain cluster stability, assumes a stalled consumer is a dead one and prunes it from the consumer group.

Here are the most common reasons and how to fix them:

1. Consumer Processing Lag

The consumer is taking longer than session.timeout.ms to process a batch of messages.

  • Diagnosis: Check consumer lag using kafka-tools.sh (or a similar tool like kafka-lag-checker.sh). Look for consistently high lag values for your consumer group.

    kafka-tools.sh --broker-list kafka-01.example.com:9092 --group-id my-consumer-group --topic my-topic --describe
    

    Examine the LAG column. If it’s consistently in the thousands or tens of thousands and not decreasing, you have a processing problem. Also, check your consumer logs for messages indicating long processing times or OutOfMemoryError.

  • Fix:

    • Increase session.timeout.ms and heartbeat.interval.ms: This gives the consumer more time to process. In your consumer configuration:
      session.timeout.ms=300000 # 5 minutes, default is 10s
      heartbeat.interval.ms=30000 # 30 seconds, default is 3s
      
      Why it works: session.timeout.ms is the maximum time a broker will wait without a heartbeat before considering a consumer dead. heartbeat.interval.ms is how often the consumer sends heartbeats. Increasing these values makes the broker more tolerant of slower processing, provided the consumer eventually catches up. Make sure session.timeout.ms is at least 3x heartbeat.interval.ms.
    • Scale up consumer instances: If a single consumer instance can’t handle the load, add more instances to your consumer group. Kafka will automatically rebalance partitions among the new instances. How to do it: Simply start more identical consumer applications with the same group.id. Why it works: More consumers mean partitions are distributed across more processing units, reducing the load on each individual consumer.
    • Optimize message processing logic: Profile your consumer application to identify bottlenecks. Are you performing expensive I/O operations, complex computations, or blocking calls within your message processing loop? How to do it: Use profiling tools (e.g., jvisualvm, async-profiler for Java) to pinpoint slow code sections. Refactor to use asynchronous operations, batching, or more efficient algorithms. Why it works: Faster processing means each message is handled within the session.timeout.ms window, allowing the consumer to send heartbeats regularly.
    • Increase fetch.max.wait.ms and fetch.min.bytes: While not directly related to processing, these affect how quickly data is fetched. In your consumer configuration:
      fetch.max.wait.ms=500 # default is 500ms
      fetch.min.bytes=102400 # 100KB, default is 1 byte
      
      Why it works: fetch.max.wait.ms is the maximum time a broker will wait for fetch.min.bytes to be met before returning a partial batch. Increasing fetch.min.bytes can lead to larger, more efficient fetches, reducing the number of network round trips. However, if processing is the bottleneck, this can exacerbate lag if batches become too large to process in time. Use with caution and monitor lag.

2. Network Issues or Latency

The consumer is sending heartbeats, but they are not reaching the broker in time due to network problems.

  • Diagnosis:

    • Use ping and traceroute from the consumer instance to the Kafka broker. Look for high latency or packet loss.
    • Check consumer logs for messages like "Connection timed out" or "SocketException."
    • Monitor network traffic on the consumer and broker nodes.
  • Fix:

    • Improve network connectivity: Address any underlying network infrastructure issues between your consumers and Kafka brokers. This might involve working with your network team to resolve routing problems, upgrade bandwidth, or reduce congestion. Why it works: Reliable, low-latency network connections ensure heartbeats are sent and received promptly, preventing the broker from timing out the consumer.
    • Adjust session.timeout.ms and heartbeat.interval.ms: As a temporary workaround or if minor network latency is unavoidable, you can increase these timeouts. In your consumer configuration:
      session.timeout.ms=300000 # 5 minutes
      heartbeat.interval.ms=120000 # 2 minutes
      
      Why it works: This provides a larger buffer for network delays, allowing heartbeats to arrive within the acceptable window even with elevated latency.

3. Consumer Application Crashes or Unresponsibly Slow Threads

The consumer application might be crashing, or specific threads within it are becoming unresponsive (e.g., due to deadlocks, infinite loops, or excessive garbage collection pauses).

  • Diagnosis:

    • Check consumer application logs for exceptions, stack traces, or repeated restarts.
    • Use monitoring tools to observe CPU, memory, and thread activity on the consumer instances. Look for sustained high CPU, excessive garbage collection cycles, or a large number of blocked threads.
    • If using Java, check JVM heap dumps or GC logs for signs of memory leaks or long GC pauses.
  • Fix:

    • Fix application bugs: Debug and resolve any exceptions, deadlocks, infinite loops, or memory leaks identified in your consumer application. Why it works: A stable, responsive application will consistently process messages and send heartbeats.
    • Tune JVM garbage collection: If GC pauses are too long, they can exceed the heartbeat interval. Tune your JVM GC settings (e.g., switch to G1GC, adjust heap size, use GC logging) to minimize pause times. How to do it: Add JVM options like -XX:+UseG1GC -XX:MaxGCPauseMillis=200 to your consumer’s startup script. Why it works: Shorter GC pauses ensure the consumer’s main thread remains responsive and can send heartbeats.
    • Increase max.poll.records: If you’re processing too many records in a single poll() call, it might be taking too long. In your consumer configuration:
      max.poll.records=100 # default is 500
      
      Why it works: Reducing the number of records fetched per poll call limits the amount of work done in a single iteration, making it easier to stay within the processing time limits.

4. Broker-Side Configuration Issues (Less Common)

While less common, broker-side configurations can indirectly impact consumer heartbeats.

  • Diagnosis: Check broker logs for any errors related to consumer groups or connection management.
  • Fix:
    • Ensure zookeeper.connect is reachable and healthy: Kafka brokers rely on ZooKeeper for consumer group coordination. If ZooKeeper is down or unreachable, brokers cannot properly manage consumer group state, leading to issues. How to do it: Verify ZooKeeper connectivity from the broker. Check ZooKeeper logs. Why it works: A healthy ZooKeeper ensemble is crucial for Kafka’s internal coordination mechanisms, including consumer group management.
    • Check controlled.shutdown.enable: While unlikely to cause session timeouts directly, an improperly configured shutdown can lead to unexpected consumer group rebalances. Why it works: Ensures graceful shutdown and rebalancing, minimizing disruption.

5. Kafka Version Incompatibilities or Bugs

Rarely, specific Kafka versions might have bugs related to session management or heartbeat handling.

  • Diagnosis: Consult Kafka release notes and JIRA issues for known bugs related to consumer session timeouts in your specific Kafka version.
  • Fix: Upgrade to a stable, patched version of Kafka. Why it works: Resolves known bugs that might be causing erratic behavior.

After applying these fixes, your next potential issue might be UnknownTopicOrPartitionException if your auto.create.topics.enable is false and a new topic is referenced.

Want structured learning?

Take the full Kafka course →