Kafka partition rebalancing is failing and causing downtime because the cluster cannot settle on new partition leaders, so the affected partitions stay leaderless and unavailable to producers and consumers.

Common Causes and Fixes

  1. Insufficient zookeeper.session.timeout.ms:

    • Diagnosis: Look for ConnectionLossException or Broker may not be available messages in broker logs, especially during rebalancing. Also check zookeeper.connect in server.properties to confirm that every broker points at the intended Zookeeper ensemble.
    • Fix: Increase zookeeper.session.timeout.ms in server.properties on all brokers. A common fix is to set it to 60000 (60 seconds).
    • Why it works: This timeout dictates how long a Kafka broker can be disconnected from Zookeeper before Zookeeper considers it dead. During heavy rebalancing, network blips or slow broker restarts can exceed the default (6 seconds in older releases, 18 seconds since Kafka 2.5). A longer timeout prevents Zookeeper from prematurely marking a momentarily unresponsive broker as dead, which would then trigger a cascading failure during leadership election. The trade-off is that a broker that has really died is also detected later.
  2. Insufficient zookeeper.connection.timeout.ms:

    • Diagnosis: Similar to the session timeout, but look for initial connection errors such as ZooKeeperClientTimeoutException (Timed out waiting for connection while in state: CONNECTING) when brokers start up or try to re-establish connections.
    • Fix: Increase zookeeper.connection.timeout.ms in server.properties on all brokers. Set it to 15000 (15 seconds).
    • Why it works: This is the timeout for establishing an initial connection to Zookeeper. If network latency is high or Zookeeper is slow to accept new connections, a short value can expire before the connection is established, so brokers fail to start or reconnect reliably. When the property is not set, the broker falls back to the value of zookeeper.session.timeout.ms, so check both settings together.
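
The two timeouts above interact, so it can help to check what a broker will actually use. The following is a minimal sketch, not an official tool: it parses the text of a server.properties file and reports the effective values, assuming the connection timeout falls back to the session timeout when unset and assuming a session default of 18000 ms (Kafka 2.5 and later; older releases used 6000). The function names and the sample excerpt are hypothetical.

```python
# Assumption: default session timeout for Kafka 2.5+; use 6000 for older releases.
ASSUMED_DEFAULT_SESSION_MS = 18000


def parse_properties(text):
    """Parse Java-style key=value lines, skipping blank lines and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_zk_timeouts(text, default_session_ms=ASSUMED_DEFAULT_SESSION_MS):
    """Return the ZooKeeper timeouts a broker would use for this config text."""
    props = parse_properties(text)
    session = int(props.get("zookeeper.session.timeout.ms", default_session_ms))
    # Assumption: an unset connection timeout falls back to the session timeout.
    connection = int(props.get("zookeeper.connection.timeout.ms", session))
    return {"session_ms": session, "connection_ms": connection}


# Hypothetical server.properties excerpt, used only for illustration.
sample = """
# ZooKeeper settings
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.session.timeout.ms=60000
"""

print(effective_zk_timeouts(sample))
```

Running the same check against every broker's file is a quick way to spot a broker that was missed when the timeouts were raised.
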
  3. Under-provisioned Zookeeper Ensemble:

    • Diagnosis: Observe Zookeeper server logs for zkServer.util.ThreadCleanup errors, OutOfMemoryError, or high CPU/disk I/O on Zookeeper nodes. Use echo stat | nc <zookeeper_host> 2181 and echo mntr | nc <zookeeper_host> 2181 to check Zookeeper’s health and request counts (on ZooKeeper 3.5.3 and later, these four-letter commands must first be allowed through 4lw.commands.whitelist). If outstanding requests stay high or average latency keeps climbing, Zookeeper is likely a bottleneck.
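    • Fix: Scale up your Zookeeper ensemble. Providing more resources (CPU, RAM, faster disks) to existing nodes is usually the most direct fix. Adding more Zookeeper nodes (aim for an odd number, like 3 or 5) improves fault tolerance and read capacity, but every write must still be acknowledged by a quorum, so extra nodes do not raise write throughput. Ensure Zookeeper is on dedicated hardware, not shared with Kafka brokers.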
    • Why it works: Zookeeper is the metadata store for Kafka. During rebalancing, it’s under heavy load as brokers register, unregister, and update partition leadership information. An overloaded Zookeeper cannot keep up with these requests, leading to timeouts and client disconnections, which in turn breaks the rebalancing process.
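
The mntr check above can be scripted so it runs on a schedule rather than by hand. This is a rough sketch under a few assumptions: the helper names are made up for illustration, the warning thresholds (50 ms average latency, 10 outstanding requests) are arbitrary placeholders to tune for your cluster, and the sample reply is fabricated input rather than output from a real server. The keys zk_avg_latency and zk_outstanding_requests are the ones mntr reports.

```python
import socket


def fetch_four_letter(host, port=2181, command="mntr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn tab-separated mntr lines into a dict, converting numeric values."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue
        value = value.strip()
        try:
            metrics[key] = float(value) if "." in value else int(value)
        except ValueError:
            metrics[key] = value  # non-numeric, e.g. version or server state
    return metrics


def overload_warnings(metrics, max_avg_latency_ms=50, max_outstanding=10):
    """Return human-readable warnings; the thresholds are placeholder assumptions."""
    warnings = []
    latency = metrics.get("zk_avg_latency")
    outstanding = metrics.get("zk_outstanding_requests")
    if isinstance(latency, (int, float)) and latency > max_avg_latency_ms:
        warnings.append(f"avg latency {latency} ms exceeds {max_avg_latency_ms} ms")
    if isinstance(outstanding, (int, float)) and outstanding > max_outstanding:
        warnings.append(f"{outstanding} outstanding requests exceed {max_outstanding}")
    return warnings


# Fabricated reply for illustration; a real one comes from fetch_four_letter(host).
sample_reply = (
    "zk_version\t3.6.3\n"
    "zk_avg_latency\t120\n"
    "zk_outstanding_requests\t42\n"
    "zk_server_state\tleader\n"
)
metrics = parse_mntr(sample_reply)
for warning in overload_warnings(metrics):
    print(warning)
```

Against a live node you would call parse_mntr(fetch_four_letter("<zookeeper_host>")) instead of using the sample string, once per ensemble member.
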
  4. Insufficient Broker Network Bandwidth or Latency:

    • Diagnosis: Monitor network traffic on Kafka brokers during rebalancing. Look for high network I/O, dropped packets, or high latency between brokers and Zookeeper, and between brokers themselves. Tools like iftop, nload, and ping are useful.
    • Fix: Increase network bandwidth for Kafka brokers. If possible, ensure Kafka brokers and Zookeeper nodes are in the same network availability zone/rack to minimize latency. Consider using Kafka’s replica.fetch.max.bytes and message.max.bytes to control the size of data transferred during replication, which can impact rebalancing time.
    • Why it works: Rebalancing involves significant data transfer (log segments) between brokers as partitions are reassigned. If the network is saturated or latency is too high, these transfers will time out, and Zookeeper will detect a broker as unresponsive, aborting the rebalance.
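
Where ping is blocked or you want to measure the path a broker actually uses, timing a TCP connect to the Zookeeper or broker port gives a rough latency figure. The sketch below is only an approximation of round-trip time (it includes connection setup on both ends), and the function names are hypothetical.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connect only; the socket is closed immediately
    return (time.perf_counter() - start) * 1000.0


def connect_latency_summary(host, port, samples=5):
    """Take several connect timings and summarise them."""
    timings = [tcp_connect_ms(host, port) for _ in range(samples)]
    return {
        "min_ms": min(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }
```

For example, connect_latency_summary("<zookeeper_host>", 2181) run from each broker shows whether one broker sits on a noticeably slower path than the others. A large gap between the median and the maximum points to intermittent congestion rather than steady distance-related latency.
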
  5. Incorrect controlled.shutdown.enable and related controlled-shutdown settings:

    • Diagnosis: If brokers are stopped abruptly (for example with kill -9 rather than a controlled shutdown), partition leadership is not handed off first, leading to stale information and failed rebalances upon restart. Check broker logs for messages indicating partitions are unavailable or leaders are missing after a restart.
    • Fix: Ensure controlled.shutdown.enable=true in server.properties on all brokers. Stop brokers with kafka-server-stop.sh, or with systemctl stop kafka where the service unit stops the process with the usual termination signal; both let the broker transfer leadership gracefully before it exits. Kafka has no controlled.shutdown.timeout.ms property. How long a broker keeps attempting a controlled shutdown is governed by controlled.shutdown.max.retries and controlled.shutdown.retry.backoff.ms, so raise those if leadership transfer needs more time.
    • Why it works: Controlled shutdown allows a broker to gracefully hand off its partition leadership to another replica before it shuts down. This ensures that Zookeeper is updated with the new leader information immediately, preventing partitions from becoming unavailable during the subsequent startup of other brokers or the restarted broker.
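
The difference between a graceful and an abrupt stop comes down to which signal the broker process receives. The sketch below illustrates the pattern with a stand-in child process rather than a real broker: it sends SIGTERM, waits for the process to exit, and deliberately never escalates to SIGKILL. The helper name and the 10-second wait are assumptions for illustration; kafka-server-stop.sh already sends the termination signal for you.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, wait_seconds):
    """Send SIGTERM and wait for exit; never escalate to SIGKILL here."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        # Still shutting down: a real broker may be handing off leadership,
        # so the caller should keep waiting instead of killing it.
        return None


# Stand-in for a broker process: a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_gracefully(child, wait_seconds=10)
print("exited" if exit_code is not None else "still shutting down")
```

Any wrapper script or service manager that stops brokers should follow the same rule: give the process enough time to finish its hand-off before anything forceful is attempted.
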
  6. Under-provisioned Broker Resources (CPU/RAM/Disk I/O):

    • Diagnosis: Monitor CPU, RAM, and disk I/O on Kafka brokers. Sustained high CPU utilization, swapping (a sign of too little RAM), or disks that stay 80-90% busy during rebalancing are strong indicators.
    • Fix: Scale up broker resources. This might involve adding more CPU cores, increasing RAM, or migrating to faster storage (SSDs). Ensure num.io.threads and num.network.threads in server.properties are adequately tuned for your hardware (e.g., 16 and 32 respectively, up from the defaults of 8 and 3, but this is highly workload-dependent).
    • Why it works: Rebalancing is resource-intensive. Brokers need CPU to process Zookeeper requests, RAM to buffer data and manage connections, and fast disk I/O to write replicated data and serve fetch requests. If any of these are bottlenecks, brokers will be slow to respond, leading to timeouts and rebalancing failures.

The next error you’ll likely hit after fixing these issues is a No leader for partition error for a specific topic, indicating that even with rebalancing, a stable leader for that partition could not be established.
