The group coordinator on Kafka broker broker-0.example.com:9092 stopped receiving timely heartbeats from members of consumer group my-consumer-group, so it marked those consumers as dead and triggered a rebalance.

Common Causes and Fixes

1. Network Latency/Packet Loss Between Consumer and Broker:

  • Diagnosis: Use ping and mtr from a consumer instance to the broker. Look for high, fluctuating RTT (Round Trip Time) or packet loss.
    ping broker-0.example.com
    mtr broker-0.example.com
    
  • Fix: Address the underlying network issues. This might involve optimizing routing, upgrading network hardware, or consulting your network administrator. As a temporary workaround, increase session.timeout.ms to a value safely above the longest heartbeat delay you observe (e.g., session.timeout.ms=60000, or 60 seconds, if heartbeats are occasionally delayed by 30-40 seconds). The value must stay within the broker’s group.min.session.timeout.ms and group.max.session.timeout.ms range (see cause 7).
  • Why it works: A longer session timeout gives the consumer a longer grace period in which to deliver a heartbeat while the network is temporarily degraded, preventing premature expiration.

2. Consumer Processing is Too Slow:

  • Diagnosis: Monitor consumer lag using kafka-consumer-groups.sh --describe. High lag and consistently slow commit times indicate the consumer cannot keep up with the data rate. Also, check consumer application logs for processing bottlenecks.
    kafka-consumer-groups.sh --bootstrap-server broker-0.example.com:9092 --describe --group my-consumer-group
    
  • Fix: Optimize the consumer application’s processing logic. This could involve parallelizing processing, improving database queries, or increasing the number of consumer instances in the group (if processing is CPU-bound and the topic has enough partitions to give each instance work). For a quick fix, increase max.poll.interval.ms to allow more time between poll() calls (e.g., max.poll.interval.ms=300000, or 5 minutes, if processing a large batch takes up to 4 minutes). Note that 300000 is already the client default, so this helps only if your configuration lowered it; otherwise choose a larger value or reduce max.poll.records so each batch is smaller.
  • Why it works: max.poll.interval.ms defines the maximum time allowed between two poll() calls. Since Kafka 0.10.1, heartbeats are sent from a background thread, so slow processing does not stop them by itself. Instead, when the interval is exceeded, the consumer is considered stuck, leaves the group, and its partitions are rebalanced to other members. Extending the interval gives the consumer more time to finish its batch before it must call poll() again.
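The timing relationship above comes down to simple arithmetic. The following is a minimal Python sketch, not part of any Kafka client: the function names and the 50% safety factor are assumptions chosen for illustration, and the per-record processing time is something you would have to measure in your own application. max.poll.records is the standard consumer setting that caps how many records a single poll() returns.

```python
def batch_fits_poll_interval(max_poll_records, per_record_ms, max_poll_interval_ms):
    """Return True if a full batch can be processed before the next poll() is due."""
    worst_case_ms = max_poll_records * per_record_ms
    return worst_case_ms < max_poll_interval_ms


def largest_safe_batch(per_record_ms, max_poll_interval_ms, safety_factor=0.5):
    """Suggest a max.poll.records value that uses only part of the poll interval.

    safety_factor is a hypothetical headroom knob: 0.5 means "spend at most
    half of max.poll.interval.ms on processing".
    """
    budget_ms = max_poll_interval_ms * safety_factor
    # Always allow at least one record per poll, even for very slow records.
    return max(1, int(budget_ms // per_record_ms))
```

For example, 500 records at 480 ms each take 240 seconds, which fits inside a 300000 ms interval; at 700 ms per record the same batch takes 350 seconds and does not.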

3. Insufficient Broker Resources (CPU/Memory):

  • Diagnosis: Monitor broker CPU and memory utilization. High CPU or sustained high memory usage on the broker can delay its response to consumer heartbeats. Use tools like top, htop, or Prometheus/Grafana.
  • Fix: Scale up broker resources (more CPU/RAM) or scale out by adding more brokers to the cluster. If the load comes from a single hot topic, consider adding partitions so it can be spread across more brokers.
  • Why it works: When a broker is overloaded, its ability to process network requests, including heartbeat checks, is degraded, leading to timeouts.

4. Insufficient Consumer Resources (CPU/Memory):

  • Diagnosis: Monitor consumer CPU and memory utilization. If consumers are maxing out their CPU or running out of memory, their processing will slow down, and they may not be able to send heartbeats reliably.
  • Fix: Increase the resources allocated to consumer instances (e.g., more CPU/RAM in your Kubernetes pod, EC2 instance, etc.). Alternatively, add more consumer instances to the group to distribute the load.
  • Why it works: Similar to broker resource issues, if a consumer is struggling for resources, its application threads will become unresponsive, preventing timely heartbeat submissions.

5. heartbeat.interval.ms Mismatch or Too High:

  • Diagnosis: Check heartbeat.interval.ms in your consumer configuration. It must be lower than the consumer’s session.timeout.ms, and a common rule of thumb is that session.timeout.ms should be at least 3 times heartbeat.interval.ms. If heartbeat.interval.ms is set too high (e.g., heartbeat.interval.ms=30000) and session.timeout.ms is high but not sufficiently larger, a single delayed heartbeat can expire the session.
  • Fix: Ensure heartbeat.interval.ms is set to a reasonable value, typically 1000 (1 second) or 3000 (3 seconds). Crucially, ensure session.timeout.ms is set appropriately higher, e.g., session.timeout.ms=10000 (10 seconds) if heartbeat.interval.ms=3000. Do not set heartbeat.interval.ms too high; it’s meant to be frequent.
    # consumer.properties
    session.timeout.ms=10000
    heartbeat.interval.ms=3000
    
  • Why it works: The consumer sends a heartbeat every heartbeat.interval.ms. If the group coordinator receives none within session.timeout.ms, it declares the consumer dead. A heartbeat.interval.ms that is too close to session.timeout.ms leaves no room for network jitter or slight processing delays.
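The rule of thumb above is easy to check mechanically before deploying a configuration. This is a small Python sketch under stated assumptions: check_heartbeat_config and heartbeats_per_session are hypothetical helper names, and the default ratio of 3 simply encodes the rule of thumb from this section rather than anything Kafka enforces.

```python
def check_heartbeat_config(session_timeout_ms, heartbeat_interval_ms, min_ratio=3):
    """Return a list of problems; an empty list means the pair looks sane."""
    problems = []
    if heartbeat_interval_ms >= session_timeout_ms:
        problems.append("heartbeat.interval.ms must be lower than session.timeout.ms")
    elif heartbeat_interval_ms * min_ratio > session_timeout_ms:
        problems.append(
            "session.timeout.ms should be at least "
            f"{min_ratio}x heartbeat.interval.ms"
        )
    return problems


def heartbeats_per_session(session_timeout_ms, heartbeat_interval_ms):
    """How many heartbeat attempts fit inside one session timeout window."""
    return session_timeout_ms // heartbeat_interval_ms
```

With session.timeout.ms=10000 and heartbeat.interval.ms=3000, the check passes and three heartbeat attempts fit in one session window. With 40000 and 30000, only one attempt fits, so a single lost heartbeat is enough to expire the session.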

6. Broker Network Interface Issues or Firewall Blocking:

  • Diagnosis: Check broker logs for any network-related errors or dropped connections. Use tcpdump on the broker to see if heartbeats from the consumer’s IP are arriving. Verify firewall rules on the broker’s host and any intermediary network devices are not blocking the Kafka port (9092) or specific consumer IPs.
  • Fix: Resolve any network interface problems and update firewall rules to explicitly allow traffic from consumer IPs to the broker on port 9092.
  • Why it works: If the broker can’t receive the heartbeat packets due to network misconfiguration or hardware issues, it will naturally assume the consumer is gone.
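Before reaching for tcpdump, a quick reachability check from the consumer host can rule out the simplest firewall problems. The sketch below uses only the Python standard library; tcp_connect_ms is a hypothetical helper name, and the 3-second timeout is an arbitrary assumption. It shows only that a TCP connection to the port can be opened, not that Kafka heartbeats are actually being processed.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout_s=3.0):
    """Return the TCP connect time in milliseconds, or None if the port is unreachable."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return None
```

Calling tcp_connect_ms("broker-0.example.com", 9092) from a consumer instance returns None when the connection is refused, filtered, or the name does not resolve. A number that swings widely between runs points back to the latency problems described in cause 1.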

7. Kafka Broker Configuration (group.min.session.timeout.ms, group.max.session.timeout.ms):

  • Diagnosis: Examine server.properties on the Kafka brokers for group.min.session.timeout.ms and group.max.session.timeout.ms. If the consumer’s session.timeout.ms falls outside this range, the broker rejects the consumer’s request to join the group with an InvalidSessionTimeoutException instead of adjusting the value, so the consumer never becomes a stable member.
  • Fix: Adjust group.min.session.timeout.ms and group.max.session.timeout.ms on the brokers to encompass the session.timeout.ms configured on your consumers. For example, if consumers use session.timeout.ms=60000, ensure the broker’s group.max.session.timeout.ms is at least 60000. Restart brokers for changes to take effect.
  • Why it works: These broker-side settings enforce a minimum and maximum session timeout for every consumer group. A consumer whose timeout is outside these bounds cannot join the group, so widening the range (or bringing the consumer’s value back inside it) lets the configured timeout actually take effect.
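The range check described above can be run against the two configuration files before a deployment. This Python sketch makes several assumptions: parse_properties and session_timeout_in_range are hypothetical helpers, the parser handles only simple key=value lines (not the full Java properties syntax), and all three settings are expected to be set explicitly rather than left to their defaults.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def session_timeout_in_range(consumer_props, broker_props):
    """Return True if the consumer's session timeout is inside the broker's bounds."""
    timeout = int(consumer_props["session.timeout.ms"])
    low = int(broker_props["group.min.session.timeout.ms"])
    high = int(broker_props["group.max.session.timeout.ms"])
    return low <= timeout <= high
```

A missing key raises KeyError, which is deliberate here: it forces you to look up the effective default for your Kafka version instead of relying on a guess.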

The next error you’ll likely encounter is a LeaderNotAvailable error when trying to produce or consume from a partition whose leader is currently unavailable due to a rebalance or broker failure.
