The Kafka broker kafka-01.example.com:9092 is actively rejecting connections from your consumer my-consumer-group because the consumer is failing its heartbeat checks, indicating it’s either too slow to process messages or has crashed.
This usually means your consumer is struggling to keep up with the message ingestion rate. The broker, trying to maintain cluster stability, assumes a stalled consumer is a dead one and prunes it from the consumer group.
Here are the most common reasons and how to fix them:
1. Consumer Processing Lag
The consumer is taking longer than session.timeout.ms to process a batch of messages.
-
Diagnosis: Check consumer lag using
kafka-tools.sh(or a similar tool likekafka-lag-checker.sh). Look for consistently high lag values for your consumer group.kafka-tools.sh --broker-list kafka-01.example.com:9092 --group-id my-consumer-group --topic my-topic --describeExamine the
LAGcolumn. If it’s consistently in the thousands or tens of thousands and not decreasing, you have a processing problem. Also, check your consumer logs for messages indicating long processing times orOutOfMemoryError. -
Fix:
- Increase
session.timeout.msandheartbeat.interval.ms: This gives the consumer more time to process. In your consumer configuration:
Why it works:session.timeout.ms=300000 # 5 minutes, default is 10s heartbeat.interval.ms=30000 # 30 seconds, default is 3ssession.timeout.msis the maximum time a broker will wait without a heartbeat before considering a consumer dead.heartbeat.interval.msis how often the consumer sends heartbeats. Increasing these values makes the broker more tolerant of slower processing, provided the consumer eventually catches up. Make suresession.timeout.msis at least 3xheartbeat.interval.ms. - Scale up consumer instances: If a single consumer instance can’t handle the load, add more instances to your consumer group. Kafka will automatically rebalance partitions among the new instances.
How to do it: Simply start more identical consumer applications with the same
group.id. Why it works: More consumers mean partitions are distributed across more processing units, reducing the load on each individual consumer. - Optimize message processing logic: Profile your consumer application to identify bottlenecks. Are you performing expensive I/O operations, complex computations, or blocking calls within your message processing loop?
How to do it: Use profiling tools (e.g.,
jvisualvm,async-profilerfor Java) to pinpoint slow code sections. Refactor to use asynchronous operations, batching, or more efficient algorithms. Why it works: Faster processing means each message is handled within thesession.timeout.mswindow, allowing the consumer to send heartbeats regularly. - Increase
fetch.max.wait.msandfetch.min.bytes: While not directly related to processing, these affect how quickly data is fetched. In your consumer configuration:
Why it works:fetch.max.wait.ms=500 # default is 500ms fetch.min.bytes=102400 # 100KB, default is 1 bytefetch.max.wait.msis the maximum time a broker will wait forfetch.min.bytesto be met before returning a partial batch. Increasingfetch.min.bytescan lead to larger, more efficient fetches, reducing the number of network round trips. However, if processing is the bottleneck, this can exacerbate lag if batches become too large to process in time. Use with caution and monitor lag.
- Increase
2. Network Issues or Latency
The consumer is sending heartbeats, but they are not reaching the broker in time due to network problems.
-
Diagnosis:
- Use
pingandtraceroutefrom the consumer instance to the Kafka broker. Look for high latency or packet loss. - Check consumer logs for messages like "Connection timed out" or "SocketException."
- Monitor network traffic on the consumer and broker nodes.
- Use
-
Fix:
- Improve network connectivity: Address any underlying network infrastructure issues between your consumers and Kafka brokers. This might involve working with your network team to resolve routing problems, upgrade bandwidth, or reduce congestion. Why it works: Reliable, low-latency network connections ensure heartbeats are sent and received promptly, preventing the broker from timing out the consumer.
- Adjust
session.timeout.msandheartbeat.interval.ms: As a temporary workaround or if minor network latency is unavoidable, you can increase these timeouts. In your consumer configuration:
Why it works: This provides a larger buffer for network delays, allowing heartbeats to arrive within the acceptable window even with elevated latency.session.timeout.ms=300000 # 5 minutes heartbeat.interval.ms=120000 # 2 minutes
3. Consumer Application Crashes or Unresponsibly Slow Threads
The consumer application might be crashing, or specific threads within it are becoming unresponsive (e.g., due to deadlocks, infinite loops, or excessive garbage collection pauses).
-
Diagnosis:
- Check consumer application logs for exceptions, stack traces, or repeated restarts.
- Use monitoring tools to observe CPU, memory, and thread activity on the consumer instances. Look for sustained high CPU, excessive garbage collection cycles, or a large number of blocked threads.
- If using Java, check JVM heap dumps or GC logs for signs of memory leaks or long GC pauses.
-
Fix:
- Fix application bugs: Debug and resolve any exceptions, deadlocks, infinite loops, or memory leaks identified in your consumer application. Why it works: A stable, responsive application will consistently process messages and send heartbeats.
- Tune JVM garbage collection: If GC pauses are too long, they can exceed the heartbeat interval. Tune your JVM GC settings (e.g., switch to G1GC, adjust heap size, use GC logging) to minimize pause times.
How to do it: Add JVM options like
-XX:+UseG1GC -XX:MaxGCPauseMillis=200to your consumer’s startup script. Why it works: Shorter GC pauses ensure the consumer’s main thread remains responsive and can send heartbeats. - Increase
max.poll.records: If you’re processing too many records in a singlepoll()call, it might be taking too long. In your consumer configuration:
Why it works: Reducing the number of records fetched per poll call limits the amount of work done in a single iteration, making it easier to stay within the processing time limits.max.poll.records=100 # default is 500
4. Broker-Side Configuration Issues (Less Common)
While less common, broker-side configurations can indirectly impact consumer heartbeats.
- Diagnosis: Check broker logs for any errors related to consumer groups or connection management.
- Fix:
- Ensure
zookeeper.connectis reachable and healthy: Kafka brokers rely on ZooKeeper for consumer group coordination. If ZooKeeper is down or unreachable, brokers cannot properly manage consumer group state, leading to issues. How to do it: Verify ZooKeeper connectivity from the broker. Check ZooKeeper logs. Why it works: A healthy ZooKeeper ensemble is crucial for Kafka’s internal coordination mechanisms, including consumer group management. - Check
controlled.shutdown.enable: While unlikely to cause session timeouts directly, an improperly configured shutdown can lead to unexpected consumer group rebalances. Why it works: Ensures graceful shutdown and rebalancing, minimizing disruption.
- Ensure
5. Kafka Version Incompatibilities or Bugs
Rarely, specific Kafka versions might have bugs related to session management or heartbeat handling.
- Diagnosis: Consult Kafka release notes and JIRA issues for known bugs related to consumer session timeouts in your specific Kafka version.
- Fix: Upgrade to a stable, patched version of Kafka. Why it works: Resolves known bugs that might be causing erratic behavior.
After applying these fixes, your next potential issue might be UnknownTopicOrPartitionException if your auto.create.topics.enable is false and a new topic is referenced.