The Kafka broker, broker-0.example.com:9092, stopped responding to heartbeats from consumer group my-consumer-group, causing consumers to be deemed dead and triggering a rebalance.
Common Causes and Fixes
1. Network Latency/Packet Loss Between Consumer and Broker:
- Diagnosis: Run `ping` and `mtr` from a consumer instance against the broker and look for high or fluctuating round-trip time (RTT) or packet loss, for example `ping broker-0.example.com` and `mtr broker-0.example.com`.
- Fix: Address the underlying network issue. This might involve optimizing routing, upgrading network hardware, or consulting your network administrator. As a temporary workaround, increase `session.timeout.ms` to a value safely above the longest stall you observe (e.g., `session.timeout.ms=60000`, 60 seconds, if connectivity stalls for 30-40 seconds at a time).
- Why it works: A longer session timeout gives the consumer a larger grace period in which to deliver a heartbeat while the network is temporarily degraded, preventing premature session expiration.
2. Consumer Processing is Too Slow:
- Diagnosis: Monitor consumer lag with `kafka-consumer-groups.sh --describe`, for example `kafka-consumer-groups.sh --bootstrap-server broker-0.example.com:9092 --describe --group my-consumer-group`. High lag and consistently slow commit times indicate the consumer cannot keep up with the data rate. Also check the consumer application logs for processing bottlenecks.
- Fix: Optimize the consumer application's processing logic. This could involve parallelizing processing, improving database queries, or increasing the number of consumer instances in the group (if the processing is indeed CPU-bound and scalable). For a quick fix, increase `max.poll.interval.ms` to allow more time between `poll()` calls (e.g., `max.poll.interval.ms=300000`, 5 minutes, if processing a large batch takes up to 4 minutes).
- Why it works: `max.poll.interval.ms` defines the maximum time allowed between two `poll()` calls. If processing a batch takes longer than this, the consumer is considered failed and leaves the group, which triggers a rebalance. This happens even if heartbeats are still arriving, because Kafka clients since 0.10.1 send heartbeats from a background thread rather than from `poll()`. Extending the interval gives the consumer more time to finish its batch before it needs to call `poll()` again.
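As a rough sanity check on the quick fix above, the worst-case gap between two `poll()` calls can be estimated as the number of records returned per poll multiplied by the per-record processing time. The sketch below is only that arithmetic, not a Kafka API. The function names are hypothetical, `max_poll_records` stands for the consumer's `max.poll.records` setting, and the per-record timing and the 20% headroom are assumed values to replace with your own measurements.

```python
def worst_case_batch_ms(max_poll_records, per_record_ms):
    """Estimate the longest gap between two poll() calls for one full batch."""
    return max_poll_records * per_record_ms


def fits_poll_interval(max_poll_records, per_record_ms,
                       max_poll_interval_ms, headroom=0.2):
    """Return True if the estimated batch time stays within
    max.poll.interval.ms after reserving a fraction of it as headroom."""
    budget_ms = max_poll_interval_ms * (1.0 - headroom)
    return worst_case_batch_ms(max_poll_records, per_record_ms) <= budget_ms


# Example call (assumed numbers): 500 records at 400 ms each, 5 minute limit.
# fits_poll_interval(500, 400, 300_000)
```

If the check fails, the options are the ones already listed: speed up processing, raise `max.poll.interval.ms`, or lower `max.poll.records` so each batch is smaller.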
3. Insufficient Broker Resources (CPU/Memory):
- Diagnosis: Monitor broker CPU and memory utilization with tools like `top`, `htop`, or Prometheus/Grafana. High CPU or sustained high memory usage on the broker can delay its response to consumer heartbeats.
- Fix: Scale up broker resources (more CPU/RAM) or scale out by adding more brokers to the cluster. If the load is concentrated on a single hot topic, consider adding partitions to that topic.
- Why it works: When a broker is overloaded, its ability to process network requests, including heartbeat checks, is degraded, leading to timeouts.
4. Insufficient Consumer Resources (CPU/Memory):
- Diagnosis: Monitor consumer CPU and memory utilization. If consumers are maxing out their CPU or running out of memory, their processing will slow down, and they may not be able to send heartbeats reliably.
- Fix: Increase the resources allocated to consumer instances (e.g., more CPU/RAM in your Kubernetes pod, EC2 instance, etc.). Alternatively, add more consumer instances to the group to distribute the load.
- Why it works: Similar to broker resource issues, if a consumer is struggling for resources, its application threads will become unresponsive, preventing timely heartbeat submissions.
5. heartbeat.interval.ms Mismatch or Too High:
- Diagnosis: Check `heartbeat.interval.ms` in your consumer configuration. It must be significantly lower than the consumer's `session.timeout.ms`; a common rule of thumb is that `session.timeout.ms` should be at least 3 times `heartbeat.interval.ms`. If `heartbeat.interval.ms` is set too high (e.g., `heartbeat.interval.ms=30000`) and `session.timeout.ms` is high but not sufficiently larger, one or two delayed heartbeats are enough to expire the session.
- Fix: Set `heartbeat.interval.ms` to a reasonable value, typically `1000` (1 second) or `3000` (3 seconds). Crucially, set `session.timeout.ms` appropriately higher, e.g., `session.timeout.ms=10000` (10 seconds) with `heartbeat.interval.ms=3000` in `consumer.properties`. Do not set `heartbeat.interval.ms` too high; heartbeats are meant to be frequent.
- Why it works: The consumer sends a heartbeat every `heartbeat.interval.ms`. If the broker doesn't receive one within `session.timeout.ms`, it declares the consumer dead. A `heartbeat.interval.ms` that is too close to `session.timeout.ms` leaves no room for network jitter or slight processing delays.
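The rule of thumb above is easy to check mechanically before deploying a configuration. The following is a minimal sketch, assuming the two values have already been read from the consumer configuration as integers; `check_heartbeat_config` is a hypothetical helper, not part of any Kafka client.

```python
def check_heartbeat_config(session_timeout_ms, heartbeat_interval_ms):
    """Return a list of warnings; an empty list means the pair follows
    the 'session timeout at least 3x heartbeat interval' rule of thumb."""
    warnings = []
    if heartbeat_interval_ms >= session_timeout_ms:
        warnings.append(
            "heartbeat.interval.ms must be lower than session.timeout.ms")
    elif heartbeat_interval_ms * 3 > session_timeout_ms:
        warnings.append(
            "heartbeat.interval.ms should be at most one third of "
            "session.timeout.ms")
    return warnings


# Example calls using the values from the text:
# check_heartbeat_config(10000, 3000)   # follows the rule of thumb
# check_heartbeat_config(40000, 30000)  # heartbeat interval too high
```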
6. Broker Network Interface Issues or Firewall Blocking:
- Diagnosis: Check broker logs for network-related errors or dropped connections. Use `tcpdump` on the broker to see whether heartbeats from the consumer's IP are arriving. Verify that firewall rules on the broker's host and on any intermediary network devices are not blocking the Kafka port (9092) or specific consumer IPs.
- Fix: Resolve network interface problems and update firewall rules to explicitly allow traffic from consumer IPs to the broker on port 9092.
- Why it works: If the broker can’t receive the heartbeat packets due to network misconfiguration or hardware issues, it will naturally assume the consumer is gone.
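A quick way to confirm from the consumer's side that the Kafka port is reachable at all is to attempt a plain TCP connection. The sketch below uses only the Python standard library; `tcp_connect_ms` is a hypothetical helper name and the 3 second timeout is an assumed default. A successful connect only shows that the port accepts connections. It does not prove that Kafka requests, including heartbeats, are being processed.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout_s=3.0):
    """Try to open a TCP connection to host:port.

    Returns the time taken to connect in milliseconds, or None if the
    connection is refused, times out, or the host name cannot be resolved.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # socket.timeout and socket.gaierror are both subclasses of OSError.
        return None


# Example call, run from a consumer host:
# tcp_connect_ms("broker-0.example.com", 9092)
```

A `None` result from a consumer host, combined with a successful connect from the broker host itself, points toward a firewall or routing problem rather than the broker process.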
7. Kafka Broker Configuration (group.min.session.timeout.ms, group.max.session.timeout.ms):
- Diagnosis: Examine `server.properties` on the Kafka brokers for `group.min.session.timeout.ms` and `group.max.session.timeout.ms`. If the consumer's `session.timeout.ms` falls outside this range, the broker rejects the consumer's request to join the group, and the Java client reports an `InvalidSessionTimeoutException`.
- Fix: Adjust `group.min.session.timeout.ms` and `group.max.session.timeout.ms` on the brokers to encompass the `session.timeout.ms` configured on your consumers. For example, if consumers use `session.timeout.ms=60000`, ensure the broker's `group.max.session.timeout.ms` is at least `60000`. Restart the brokers for the changes to take effect.
- Why it works: These broker-side settings enforce a minimum and maximum session timeout for consumer groups, acting as a safety net. A consumer whose timeout lies outside these bounds cannot join the group, so bringing the two sides into agreement is what allows a longer session timeout to take effect.
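The range check described above can be sketched as a small script that compares the two property files. This is an illustration only: `parse_properties` and `session_timeout_in_range` are hypothetical helpers, the parser handles only simple `key=value` lines, and the sketch assumes all three settings are set explicitly (when a setting is absent the broker or client default applies, which this code does not know about).

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def session_timeout_in_range(consumer_props, broker_props):
    """Return True if the consumer's session.timeout.ms lies within the
    broker's [group.min.session.timeout.ms, group.max.session.timeout.ms]."""
    timeout = int(consumer_props["session.timeout.ms"])
    low = int(broker_props["group.min.session.timeout.ms"])
    high = int(broker_props["group.max.session.timeout.ms"])
    return low <= timeout <= high
```

In practice the two arguments would come from reading `consumer.properties` and `server.properties` and passing their contents through `parse_properties`.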
The next error you’ll likely encounter is a LeaderNotAvailable error when trying to produce or consume from a partition whose leader is currently unavailable due to a rebalance or broker failure.