The Kafka controller is refusing to assign partitions to brokers because it believes some brokers are still in a transitional state, preventing it from finalizing leadership elections.

1. Broker Configuration Mismatch:

  • Diagnosis: Check broker.id in server.properties on all brokers. Ensure they are unique and don’t overlap. Also, verify listeners and advertised.listeners are correctly set and resolvable.
  • Cause: If broker.id is duplicated, Kafka can’t distinguish between brokers. Incorrect listeners or advertised.listeners prevent the controller from connecting to brokers.
  • Fix: Edit server.properties on the offending broker(s) to assign a unique broker.id (e.g., broker.id=3). For listener issues, ensure advertised.listeners=PLAINTEXT://your-broker-hostname:9092 points to a resolvable hostname and port.
  • Why it works: Unique IDs ensure each broker is a distinct entity. Correct listeners allow the controller to establish communication channels for metadata exchange and partition assignments.
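The duplicate-ID check above can be scripted once you have collected each broker's server.properties. The following is a minimal sketch, assuming the file contents have already been read into a dictionary keyed by host name; the function names and the host names in the usage note are hypothetical, and the parser handles only plain key=value lines.

```python
from collections import defaultdict
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse the simple key=value subset of a Java .properties file."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_duplicate_broker_ids(configs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each broker.id that is used more than once to the hosts using it."""
    hosts_by_id: Dict[str, List[str]] = defaultdict(list)
    for host, text in configs.items():
        broker_id = parse_properties(text).get("broker.id")
        if broker_id is not None:
            hosts_by_id[broker_id].append(host)
    return {bid: hosts for bid, hosts in hosts_by_id.items() if len(hosts) > 1}
```

For example, passing the contents of three files where the hypothetical hosts kafka-a and kafka-c both set broker.id=1 returns a dictionary with the single key "1" listing those two hosts. An empty result means no ID is shared.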

2. ZooKeeper Session Expiration/Loss:

  • Diagnosis: Check ZooKeeper logs for "session expired" or "connection lost" messages related to Kafka brokers. You can also check the broker logs for corresponding errors like "ZooKeeper session expired."
  • Cause: If a broker loses its ZooKeeper session (due to network issues, GC pauses, or ZooKeeper instability), it’s considered "dead" by the controller. The controller will wait for it to re-register, causing CoordinatorLoadInProgress.
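  • Fix: Increase zookeeper.session.timeout.ms in server.properties on brokers (e.g., zookeeper.session.timeout.ms=60000). ZooKeeper caps the negotiated timeout at maxSessionTimeout, which defaults to 20 times tickTime, so with tickTime=2000 in zoo.cfg the effective ceiling is 40000 ms; raise maxSessionTimeout (or tickTime) on the ZooKeeper servers if you need the full 60 seconds. Also, ensure network connectivity between brokers and ZooKeeper is stable.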
  • Why it works: A longer session timeout gives brokers more time to recover from transient network glitches or GC pauses before ZooKeeper considers their sessions expired. Stable network ensures consistent communication.
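The interaction between the broker-side timeout and tickTime is easy to get wrong, so here is a small worked calculation. It is a sketch of the clamping rule described in the ZooKeeper documentation (the server bounds the session timeout between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime); the function name is hypothetical.

```python
from typing import Optional


def negotiated_session_timeout_ms(
    requested_ms: int,
    tick_time_ms: int,
    min_session_timeout_ms: Optional[int] = None,
    max_session_timeout_ms: Optional[int] = None,
) -> int:
    """Return the timeout ZooKeeper would grant for a requested session timeout."""
    # Defaults follow the documented rule: 2 * tickTime and 20 * tickTime.
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # Clamp the client's request into the server's allowed range.
    return max(lower, min(requested_ms, upper))
```

With tickTime=2000, a request for 60000 ms is granted only 40000 ms. Raising tickTime to 3000, or setting maxSessionTimeout=60000 explicitly, lets the full 60000 ms through.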

3. Insufficient Controller Quorum:

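  • Diagnosis: Check ZooKeeper status (echo stat | nc your-zk-host 2181 on each ZK node) and look at the Mode line (leader, follower, or standalone) and the number of connected clients. On ZooKeeper 3.5 and later, stat may first need to be enabled through 4lw.commands.whitelist. In a ZooKeeper-based cluster, verify that the /controller znode exists and names a live broker; in a KRaft cluster, verify that a majority of the nodes listed in controller.quorum.voters are up.
  • Cause: In KRaft mode, Kafka requires a majority of its configured controllers (the nodes listed in controller.quorum.voters, reached over the listener named by controller.listener.names) to be available to make progress. In ZooKeeper mode, the same majority rule applies to the ZooKeeper ensemble that the controller depends on. If fewer than a quorum are active, partition assignments halt.
  • Fix: Ensure the number of ZooKeeper servers in the ensemble (the hosts listed in zookeeper.connect), or of KRaft controllers, is odd and sufficient for a quorum. If a controller node is down, bring it back online. If a broker is permanently lost, its ephemeral znodes normally disappear on their own once its ZooKeeper session expires; remove leftover ZK nodes manually only as a last resort (with extreme caution).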
  • Why it works: A quorum ensures that decisions are made by a majority, preventing split-brain scenarios and ensuring consistency across the cluster.
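The preference for an odd number of voters follows from simple majority arithmetic, sketched below. The function names are hypothetical; the rule itself (a strict majority of the configured voters must be available) is the standard quorum rule.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority remains available."""
    return voters - majority(voters)
```

Three voters need 2 available and tolerate 1 failure. Four voters need 3 and still tolerate only 1, so the fourth node adds cost without adding fault tolerance. Five voters need 3 and tolerate 2.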

4. Large Number of Partitions/Topics Being Created/Deleted:

  • Diagnosis: Monitor Kafka controller logs for messages related to partition leadership elections and metadata updates. If you see a very high rate of create_topics, delete_topics, or partition reassignment operations, this could be the cause.
  • Cause: The controller has to manage metadata and state for every partition. A massive influx of changes overwhelms its ability to process and assign partitions, leading to the CoordinatorLoadInProgress state as it tries to catch up.
  • Fix: Temporarily pause or slow down topic creation/deletion operations. If this is a persistent issue on a KRaft cluster, consider increasing controller.quorum.election.timeout.ms and controller.quorum.fetch.timeout.ms in server.properties to give the controller quorum more time.
  • Why it works: Giving the controller more time and reducing the load allows it to process the pending metadata changes and eventually stabilize the partition assignments.
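One way to slow down bulk topic changes is to issue them in small batches with a pause between batches. The sketch below is deliberately client-agnostic: action is a hypothetical stand-in for whatever call your admin client or CLI wrapper uses to create or delete a single topic, and the batch size and pause are placeholder values to tune for your cluster.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_paced_batches(
    items: Iterable[T],
    action: Callable[[T], R],
    batch_size: int = 10,
    pause_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Apply action to every item, pausing between batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending = list(items)
    results: List[R] = []
    for start in range(0, len(pending), batch_size):
        for item in pending[start:start + batch_size]:
            results.append(action(item))
        # Pause only if another batch is still to come.
        if start + batch_size < len(pending):
            sleep(pause_seconds)
    return results
```

The sleep parameter exists so the pacing can be exercised without real delays. In production use you would leave it at its default.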

5. Broker Network Connectivity Issues:

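  • Diagnosis: Use ping and traceroute from the controller broker to other brokers, and vice-versa. Because ping only proves ICMP reachability, also confirm that the listener ports accept TCP connections (e.g., nc -vz your-broker-hostname 9092). Check firewall rules between brokers and ZooKeeper.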
  • Cause: If the controller cannot reach certain brokers to send them partition assignment updates or receive acknowledgments, it will consider them unavailable and stall.
  • Fix: Resolve any network routing problems, open the ports your listeners actually use in firewalls (9092 for PLAINTEXT and 9093 for SSL/SASL are common conventions, plus 2181 for ZooKeeper), and ensure DNS resolution is working correctly.
  • Why it works: Reliable network communication is fundamental for Kafka’s distributed coordination. Ensuring brokers can talk to each other and the controller is paramount.
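A TCP-level reachability check complements ping, which says nothing about whether a broker port is open. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and a successful connection shows only that something is listening on the port, not that the Kafka protocol handshake would succeed.

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name, so DNS failures
        # also surface here as an OSError subclass.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

Run it from the controller broker against each broker's advertised host and port (for example, can_connect("your-broker-hostname", 9092)), and again in the opposite direction, to catch one-way firewall rules.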

6. Under-provisioned Controller Resources (CPU/Memory):

  • Diagnosis: Monitor CPU and memory usage on the brokers acting as controllers. Look for sustained high utilization.
  • Cause: The controller process is responsible for a lot of coordination. If the broker hosting the controller is starved for CPU or memory, it can’t process requests efficiently, leading to delays in partition assignment.
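  • Fix: Increase the allocated CPU cores or RAM for the controller broker. In ZooKeeper mode the controller is elected rather than assigned, so any broker may end up with the role and all of them should be sized for it. In KRaft mode you can instead run dedicated controller nodes (process.roles=controller) on suitably sized hosts.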
  • Why it works: Providing adequate resources ensures the controller process can execute its tasks promptly and efficiently, unblocking partition assignment.

Once this is fixed, the next error you’ll likely see is NoBrokersAvailable (raised by client libraries such as kafka-python) if a client attempts to connect before the cluster is fully ready, or LEADER_NOT_AVAILABLE for specific topics whose partitions have not yet elected a leader.
