Kafka partition rebalancing is failing and causing downtime because the brokers are unable to agree on a new partition leader, leading to a deadlock where no partitions are available.
Common Causes and Fixes
- Insufficient `zookeeper.session.timeout.ms`:
  - Diagnosis: Look for `ConnectionLossException` or `Broker may not be available` messages in broker logs, especially during rebalancing. Also check `zookeeper.connect` in `server.properties` to confirm that it lists your ZooKeeper ensemble.
  - Fix: Increase `zookeeper.session.timeout.ms` in `server.properties` on all brokers. A common fix is to set it to `60000` (60 seconds). ZooKeeper caps the negotiated value at its own `maxSessionTimeout` (by default 20 × `tickTime`), so raise that limit as well if it is lower.
  - Why it works: This timeout dictates how long a Kafka broker can go without contact with ZooKeeper before ZooKeeper expires its session and considers it dead. During heavy rebalancing, network blips or long broker pauses (for example garbage collection) can exceed the default (6 seconds in older releases, 18 seconds since Kafka 2.5). A longer timeout prevents ZooKeeper from prematurely marking a momentarily unresponsive broker as dead, which would otherwise trigger a cascade of leadership elections.
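One caveat about the 60-second value: the timeout a broker requests is only a proposal. ZooKeeper clamps it to the range between `minSessionTimeout` and `maxSessionTimeout`, which default to 2 and 20 times `tickTime`. The sketch below reproduces that arithmetic. The function name is hypothetical, and the `tickTime` of 2000 ms in the usage lines is an assumption (it matches ZooKeeper’s sample `zoo.cfg`), so check your own configuration.

```python
from typing import Optional


def negotiated_session_timeout(
    requested_ms: int,
    tick_time_ms: int,
    min_session_timeout_ms: Optional[int] = None,
    max_session_timeout_ms: Optional[int] = None,
) -> int:
    """Return the session timeout ZooKeeper would grant for a requested value.

    Hypothetical helper: it mirrors the documented rule that the server clamps
    the client's request to [minSessionTimeout, maxSessionTimeout], which
    default to 2 * tickTime and 20 * tickTime when not set explicitly.
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Assumed tickTime of 2000 ms: a 60000 ms request is capped at 40000 ms.
    print(negotiated_session_timeout(60000, tick_time_ms=2000))
    # Raising maxSessionTimeout on the ZooKeeper side lets the full value through.
    print(negotiated_session_timeout(60000, tick_time_ms=2000, max_session_timeout_ms=60000))
```

If the granted value comes out lower than what the brokers request, the broker-side change alone has no effect beyond the cap.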
- Insufficient `zookeeper.connection.timeout.ms`:
  - Diagnosis: Similar to the session timeout, but look for initial connection errors or `ZooKeeper client obtained a null connection` messages when brokers start up or try to re-establish connections.
  - Fix: Increase `zookeeper.connection.timeout.ms` in `server.properties` on all brokers, for example to `15000` (15 seconds).
  - Why it works: This is the maximum time a broker waits to establish its connection to ZooKeeper. If network latency is high or ZooKeeper is slow to accept new connections, a short value makes brokers give up before they can register with the ensemble. When the property is unset, Kafka falls back to the value of `zookeeper.session.timeout.ms`; the sample `server.properties` shipped with Kafka sets it explicitly, so check what your file actually contains.
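To see which timeouts a broker will actually use, it helps to resolve them from `server.properties` rather than assume. The following is a minimal sketch, not a full Java properties parser: it handles plain `key=value` lines only. The helper names and the sample hostnames are hypothetical, and the 18000 ms fallback is the Kafka 2.5+ default (older releases used 6000 ms).

```python
from typing import Dict

# Default since Kafka 2.5; older releases used 6000. Treat this as an
# assumption and adjust it for the version you run.
ASSUMED_SESSION_TIMEOUT_DEFAULT_MS = 18000


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' or '!' comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def effective_zk_timeouts(props: Dict[str, str]) -> Dict[str, int]:
    """Resolve the session and connection timeouts a broker would use."""
    session = int(props.get("zookeeper.session.timeout.ms", ASSUMED_SESSION_TIMEOUT_DEFAULT_MS))
    # An unset connection timeout falls back to the session timeout.
    connection = int(props.get("zookeeper.connection.timeout.ms", session))
    return {"session_ms": session, "connection_ms": connection}


if __name__ == "__main__":
    # Hypothetical excerpt from server.properties.
    sample = """
    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
    zookeeper.session.timeout.ms=60000
    """
    print(effective_zk_timeouts(parse_properties(sample)))
```

Running the same check against every broker’s file is a quick way to catch a node that was missed during a configuration rollout.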
- Under-provisioned ZooKeeper Ensemble:
  - Diagnosis: Check the ZooKeeper server logs for `zkServer.util.ThreadCleanup` errors or `OutOfMemoryError`, and watch for high CPU or disk I/O on the ZooKeeper nodes. Use `echo stat | nc <zookeeper_host> 2181` and `echo mntr | nc <zookeeper_host> 2181` to check ZooKeeper’s health, latency, and request counts (on ZooKeeper 3.5 and later, these four-letter commands must be enabled through `4lw.commands.whitelist`). If outstanding requests are piling up or latency is climbing, ZooKeeper is likely a bottleneck.
  - Fix: Scale up your ZooKeeper ensemble. Give the existing nodes more resources (CPU, RAM, faster disks, ideally a dedicated disk for the transaction log) and run an odd number of nodes, such as 3 or 5. Keep in mind that extra voting nodes add fault tolerance and read capacity rather than write throughput, because every write must be acknowledged by a majority. Run ZooKeeper on dedicated hardware, not shared with Kafka brokers.
  - Why it works: ZooKeeper is the metadata store for Kafka. During rebalancing, it is under heavy load as brokers register, unregister, and update partition leadership information. An overloaded ZooKeeper cannot keep up with these requests, leading to timeouts and client disconnections, which in turn break the rebalancing process.
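The `mntr` check can be scripted so it runs against every ensemble member. This is a rough sketch under a few assumptions: the four-letter command is enabled on the server, the reply uses the usual tab-separated `key<TAB>value` lines, and the two thresholds are illustrative placeholders rather than recommended limits. The function names are hypothetical.

```python
import socket
from typing import Dict, List


def fetch_mntr(host: str, port: int = 2181, timeout_s: float = 5.0) -> str:
    """Send the 'mntr' four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text: str) -> Dict[str, str]:
    """Turn tab-separated 'key<TAB>value' lines into a dict."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def find_warnings(
    stats: Dict[str, str],
    max_avg_latency_ms: float = 50.0,  # placeholder threshold, tune for your cluster
    max_outstanding: int = 10,         # placeholder threshold, tune for your cluster
) -> List[str]:
    """Flag metrics that suggest ZooKeeper is falling behind."""
    warnings: List[str] = []
    if float(stats.get("zk_avg_latency", 0)) > max_avg_latency_ms:
        warnings.append("average latency is high")
    if int(stats.get("zk_outstanding_requests", 0)) > max_outstanding:
        warnings.append("outstanding requests are queuing up")
    return warnings


if __name__ == "__main__":
    # Replace the hostname with one of your ZooKeeper nodes.
    print(find_warnings(parse_mntr(fetch_mntr("localhost"))))
```

Sampling these values before and during a rebalance makes it easier to tell a ZooKeeper bottleneck from a broker-side one.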
- Insufficient Broker Network Bandwidth or Latency:
  - Diagnosis: Monitor network traffic on Kafka brokers during rebalancing. Look for saturated links, dropped packets, or high latency between brokers and ZooKeeper, and between the brokers themselves. Tools like `iftop`, `nload`, and `ping` are useful.
  - Fix: Increase network bandwidth for Kafka brokers. If possible, place Kafka brokers and ZooKeeper nodes in the same availability zone or rack to minimize latency. Kafka’s `replica.fetch.max.bytes` and `message.max.bytes` bound how much data a single replication fetch or message can carry; to limit the bandwidth a reassignment consumes overall, use the `--throttle` option of `kafka-reassign-partitions.sh`.
  - Why it works: Rebalancing moves significant amounts of data (log segments) between brokers as partitions are reassigned. If the network is saturated or latency is too high, replicas fall behind and ZooKeeper heartbeats are delayed, so broker sessions expire and the rebalance stalls or is aborted.
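A back-of-the-envelope estimate shows whether a reassignment can realistically finish on the available network. The sketch below simply divides the data to move by the sustained rate; it ignores protocol overhead and competing producer and consumer traffic, and the figures in the usage lines are made-up examples, not measurements.

```python
def estimate_transfer_seconds(bytes_to_move: int, rate_bytes_per_sec: float) -> float:
    """Lower bound on the time needed to copy partition data between brokers.

    rate_bytes_per_sec is whatever sustained rate replication can use: the
    spare link capacity, or a replication throttle if one is configured.
    """
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate_bytes_per_sec must be positive")
    return bytes_to_move / rate_bytes_per_sec


if __name__ == "__main__":
    # Hypothetical example: 200 GB of log segments at a sustained 50 MB/s.
    seconds = estimate_transfer_seconds(200 * 10**9, 50 * 10**6)
    print(f"{seconds:.0f} s (about {seconds / 60:.0f} minutes)")
```

If the estimate runs to hours, moving fewer partitions per reassignment is usually safer than hoping a single large move completes.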
- Incorrect `controlled.shutdown.enable` and Controlled Shutdown Timing:
  - Diagnosis: If brokers are stopped abruptly (not via controlled shutdown), partition leadership is not handed off before the broker disappears, which leaves stale leadership information and failed rebalances upon restart. Check broker logs for messages indicating that partitions are unavailable or leaders are missing after a restart.
  - Fix: Ensure `controlled.shutdown.enable=true` in `server.properties` on all brokers. Stop brokers with `kafka-server-stop.sh` or `systemctl stop kafka`, both of which normally send SIGTERM and let the broker transfer leadership gracefully; avoid `kill -9`. Allow ample time for the leadership transfer, e.g., 5 minutes: Kafka retries the controlled shutdown according to `controlled.shutdown.max.retries` and `controlled.shutdown.retry.backoff.ms`, and under systemd the unit’s `TimeoutStopSec` must be long enough (for example `300`) that the process is not killed mid-handoff. Note that `controlled.shutdown.timeout.ms` is not among the broker settings documented by Apache Kafka, so verify any such key against the configuration reference for your version.
  - Why it works: Controlled shutdown allows a broker to gracefully hand off its partition leadership to another replica before it shuts down. This ensures that ZooKeeper is updated with the new leader information immediately, preventing partitions from becoming unavailable while other brokers, or the restarted broker, start up.
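When broker restarts are automated, the stop step should wait for the process to exit on its own instead of escalating to a hard kill. The sketch below illustrates that pattern with a stand-in child process. The function name is hypothetical, and a real deployment would normally go through `kafka-server-stop.sh` or the service manager rather than manage the process directly.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, timeout_s: float) -> bool:
    """Send SIGTERM and wait; return True if the process exited in time.

    Deliberately never escalates to SIGKILL: killing a broker mid-handoff is
    exactly the abrupt shutdown this step is meant to avoid. On timeout the
    caller should investigate or keep waiting.
    """
    proc.terminate()  # SIGTERM on POSIX systems
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in for a broker process so the sketch runs anywhere.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_gracefully(child, timeout_s=10))
```

For a real broker, the wait would be sized to the time leadership transfer needs, such as the 5 minutes suggested above.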
- Under-provisioned Broker Resources (CPU/RAM/Disk I/O):
  - Diagnosis: Monitor CPU, RAM, and disk I/O on Kafka brokers. Sustained high CPU utilization, excessive swapping (a sign of too little RAM), or disk I/O wait above 80-90% during rebalancing are strong indicators.
  - Fix: Scale up broker resources. This might involve adding more CPU cores, increasing RAM, or migrating to faster storage (SSDs). Ensure `num.io.threads` and `num.network.threads` in `server.properties` are adequately tuned for your hardware (e.g., `16` and `32` respectively, but this is highly workload-dependent; the defaults are `8` and `3`).
  - Why it works: Rebalancing is resource-intensive. Brokers need CPU to process requests, RAM to buffer data and manage connections, and fast disk I/O to write replicated data and serve fetch requests. If any of these is a bottleneck, brokers will be slow to respond, leading to timeouts and rebalancing failures.
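On Linux brokers, the I/O wait share can be derived from two snapshots of `/proc/stat`, which is essentially what tools like `top` report. This is a minimal sketch that assumes the standard field order of the aggregate `cpu` line; the helper names and the five-second sampling interval are illustrative choices.

```python
from typing import List


def cpu_fields(stat_text: str) -> List[int]:
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(p) for p in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    a, b = cpu_fields(before), cpu_fields(after)
    # Fields: user nice system idle iowait irq softirq steal. Guest time is
    # already included in user/nice, so only the first eight are summed.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    import time

    # Linux only: sample /proc/stat twice, five seconds apart.
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(5)
    with open("/proc/stat") as f:
        second = f.read()
    print(f"iowait: {iowait_percent(first, second):.1f}%")
```

A single sample says little; what matters is how the value behaves across the whole rebalance window.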
The next error you’ll likely hit after fixing these issues is a `No leader for partition` error for a specific topic, indicating that even after rebalancing, a stable leader for that partition could not be established.