The Kafka controller is refusing to assign partitions to brokers because it believes some brokers are still in a transitional state, preventing it from finalizing leadership elections.
1. Broker Configuration Mismatch:
- Diagnosis: Check `broker.id` in `server.properties` on all brokers. Ensure they are unique and don’t overlap. Also, verify `listeners` and `advertised.listeners` are correctly set and resolvable.
- Cause: If `broker.id` is duplicated, Kafka can’t distinguish between brokers. Incorrect `listeners` or `advertised.listeners` prevent the controller from connecting to brokers.
- Fix: Edit `server.properties` on the offending broker(s) to assign a unique `broker.id` (e.g., `broker.id=3`). For listener issues, ensure `advertised.listeners=PLAINTEXT://your-broker-hostname:9092` points to a resolvable hostname and port.
- Why it works: Unique IDs ensure each broker is a distinct entity. Correct listeners allow the controller to establish communication channels for metadata exchange and partition assignments.
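The duplicate-ID check can be scripted once the configuration files are collected in one place. The following is a minimal sketch, assuming each broker's `server.properties` text has already been read into memory; the function names and the sample hosts (`kafka-a`, `kafka-b`, `kafka-c`) are hypothetical, and the parser handles only simple `key=value` lines rather than the full Java properties syntax.

```python
from collections import defaultdict


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_duplicate_broker_ids(configs):
    """Map each duplicated broker.id to the sorted list of hosts sharing it.

    configs: dict of host name -> server.properties text.
    """
    hosts_by_id = defaultdict(list)
    for host, text in configs.items():
        broker_id = parse_properties(text).get("broker.id")
        if broker_id is not None:
            hosts_by_id[broker_id].append(host)
    return {bid: sorted(hosts) for bid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Hypothetical sample data: kafka-b was cloned from kafka-a and kept its ID.
configs = {
    "kafka-a": "broker.id=1\nlisteners=PLAINTEXT://:9092\n",
    "kafka-b": "# copied from kafka-a\nbroker.id=1\n",
    "kafka-c": "broker.id=3\n",
}
print(find_duplicate_broker_ids(configs))  # {'1': ['kafka-a', 'kafka-b']}
```

An empty result means every collected file carries a distinct `broker.id`; it says nothing about listener settings, which still need to be checked separately.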
2. ZooKeeper Session Expiration/Loss:
- Diagnosis: Check ZooKeeper logs for "session expired" or "connection lost" messages related to Kafka brokers. You can also check the broker logs for corresponding errors like "ZooKeeper session expired."
- Cause: If a broker loses its ZooKeeper session (due to network issues, GC pauses, or ZooKeeper instability), it’s considered "dead" by the controller. The controller will wait for it to re-register, causing `CoordinatorLoadInProgress`.
- Fix: Increase `zookeeper.session.timeout.ms` in `server.properties` on brokers (e.g., `zookeeper.session.timeout.ms=60000`) and, if necessary, `tickTime` in `zoo.cfg` on ZooKeeper servers (e.g., `tickTime=2000`). By default ZooKeeper caps negotiated session timeouts at 20 times `tickTime`, so a 60000 ms timeout with `tickTime=2000` also requires raising `maxSessionTimeout` in `zoo.cfg`. Also, ensure network connectivity between brokers and ZooKeeper is stable.
- Why it works: A longer session timeout gives brokers more time to recover from transient network glitches or GC pauses before ZooKeeper considers their sessions expired. Stable network ensures consistent communication.
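The timeout a broker requests is not necessarily the one it gets: the ZooKeeper server clamps it between `minSessionTimeout` and `maxSessionTimeout`, which default to 2 and 20 times `tickTime`. A minimal sketch of that arithmetic follows; the function name is hypothetical and the code only models the clamping, it does not talk to ZooKeeper.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Return the session timeout ZooKeeper would grant for a request.

    When the min/max overrides are not given, the server defaults of
    2 * tickTime and 20 * tickTime are used.
    """
    low = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    high = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(low, min(requested_ms, high))


# Requesting 60 s with tickTime=2000 is capped at 20 * 2000 = 40 s.
print(negotiated_session_timeout_ms(60000, 2000))  # 40000

# Raising maxSessionTimeout in zoo.cfg lets the full 60 s through.
print(negotiated_session_timeout_ms(60000, 2000, max_session_timeout_ms=60000))  # 60000
```

If the value printed is lower than what `zookeeper.session.timeout.ms` asks for, the broker-side change alone will not have the intended effect.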
3. Insufficient Controller Quorum:
- Diagnosis: Check ZooKeeper status (`echo stat | nc your-zk-host 2181` on each ZK node; on ZooKeeper 3.5 and later, `stat` must be allowed through `4lw.commands.whitelist`) and look for the number of connected clients. Verify that an active controller is registered in ZooKeeper (the `/controller` znode) and that `zookeeper.connect` in `server.properties` points every broker at the same ensemble.
- Cause: Kafka requires a majority of its configured controllers to be available to make progress. In KRaft mode these are the nodes listed in `controller.quorum.voters`, reached over the listener named by `controller.listener.names`; in ZooKeeper mode the same majority rule applies to the ZooKeeper ensemble. If fewer than a quorum are active, partition assignments halt.
- Fix: Ensure the number of ZooKeeper servers listed in `zookeeper.connect` is odd and sufficient for a quorum. If a controller broker is down, bring it back online. If a broker is permanently lost, remove its ZK ephemeral nodes manually (with extreme caution).
- Why it works: A quorum ensures that decisions are made by a majority, preventing split-brain scenarios and ensuring consistency across the cluster.
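The reason an odd member count is recommended follows from the majority arithmetic, sketched below with hypothetical helper names. The functions apply equally to a ZooKeeper ensemble or a KRaft controller quorum, since both need a strict majority of voters.

```python
def majority(voters):
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can be lost while a majority is still available."""
    return voters - majority(voters)


def has_quorum(voters, alive):
    """True if the live voters form a majority of the configured voters."""
    return alive >= majority(voters)


for n in (3, 4, 5):
    print(f"{n} voters: majority {majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 voters: majority 2, tolerates 1 failure(s)
# 4 voters: majority 3, tolerates 1 failure(s)
# 5 voters: majority 3, tolerates 2 failure(s)
```

Going from three to four voters raises the majority threshold without tolerating any additional failure, which is why even-sized quorums add cost but no resilience.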
4. Large Number of Partitions/Topics Being Created/Deleted:
- Diagnosis: Monitor Kafka controller logs for messages related to partition leadership elections and metadata updates. If you see a very high rate of `create_topics`, `delete_topics`, or partition reassignment operations, this could be the cause.
- Cause: The controller has to manage metadata and state for every partition. A massive influx of changes overwhelms its ability to process and assign partitions, leading to the `CoordinatorLoadInProgress` state as it tries to catch up.
- Fix: Temporarily pause or slow down topic creation/deletion operations. If this is a persistent issue, consider increasing `controller.learner.session.timeout.ms` and `controller.quorum.election.timeout.ms` in `server.properties` to give the controller more time. `controller.quorum.election.timeout.ms` applies only in KRaft mode; confirm that both names exist in the configuration reference for your Kafka version before relying on them.
- Why it works: Giving the controller more time and reducing the load allows it to process the pending metadata changes and eventually stabilize the partition assignments.
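Slowing down topic creation usually means submitting topics in small groups with a pause in between. The sketch below shows one way to do that; `run_throttled` and `batches` are hypothetical helpers, and `create_batch` stands in for whatever admin-client call you already use, so no real Kafka API is invoked here. The batch size and pause are illustrative values, not recommendations.

```python
import time


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_throttled(topic_names, create_batch, batch_size=10, pause_s=5.0,
                  sleep=time.sleep):
    """Call create_batch on small groups of topics, pausing between groups.

    create_batch is a caller-supplied callable (an assumption of this sketch),
    for example a thin wrapper around your admin client.
    Returns the number of batches submitted.
    """
    submitted = 0
    for group in batches(list(topic_names), batch_size):
        if submitted:
            sleep(pause_s)  # let the controller absorb the previous batch
        create_batch(group)
        submitted += 1
    return submitted


# Dry run: record the batches instead of creating anything.
recorded = []
count = run_throttled([f"events-{i}" for i in range(25)], recorded.append,
                      batch_size=10, pause_s=0)
print(count, [len(group) for group in recorded])  # 3 [10, 10, 5]
```

Passing `sleep` as a parameter keeps the pause testable; in production the default `time.sleep` is used.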
5. Broker Network Connectivity Issues:
- Diagnosis: Use `ping` and `traceroute` from the controller broker to other brokers, and vice-versa. Check firewall rules between brokers and ZooKeeper.
- Cause: If the controller cannot reach certain brokers to send them partition assignment updates or receive acknowledgments, it will consider them unavailable and stall.
- Fix: Resolve any network routing problems, open necessary ports in firewalls (e.g., 9092 for PLAINTEXT, 9093 for SSL/SASL), and ensure DNS resolution is working correctly.
- Why it works: Reliable network communication is fundamental for Kafka’s distributed coordination. Ensuring brokers can talk to each other and the controller is paramount.
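`ping` only proves that a host answers ICMP; it does not show that the Kafka port is open. A TCP-level check is a useful complement. The following is a minimal sketch using only the Python standard library; `parse_listener` and `can_connect` are hypothetical helper names, and the parser assumes the `NAME://host:port` form with a non-bracketed host.

```python
import socket


def parse_listener(listener):
    """Split 'PLAINTEXT://host:9092' into (listener_name, host, port)."""
    name, sep, address = listener.strip().partition("://")
    host, colon, port = address.rpartition(":")
    if not sep or not colon or not port.isdigit():
        raise ValueError(f"unrecognized listener: {listener!r}")
    return name, host, int(port)


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


print(parse_listener("PLAINTEXT://your-broker-hostname:9092"))
# ('PLAINTEXT', 'your-broker-hostname', 9092)
```

Running `can_connect(host, port)` from the controller broker against each entry in the other brokers' `advertised.listeners`, and again in the opposite direction, separates firewall or DNS problems from Kafka-level ones.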
6. Under-provisioned Controller Resources (CPU/Memory):
- Diagnosis: Monitor CPU and memory usage on the brokers acting as controllers. Look for sustained high utilization.
- Cause: The controller process is responsible for a lot of coordination. If the broker hosting the controller is starved for CPU or memory, it can’t process requests efficiently, leading to delays in partition assignment.
- Fix: Increase the allocated CPU cores or RAM for the controller broker. Alternatively, move the controller role to a more powerful broker.
- Why it works: Providing adequate resources ensures the controller process can execute its tasks promptly and efficiently, unblocking partition assignment.
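"Sustained" is the key word in the diagnosis: a brief CPU spike during a leader election is normal, while a long run of high readings is not. The sketch below separates the two from a series of utilization samples. The helper names, the 85% threshold, and the five-sample window are all hypothetical; the samples would come from whatever monitoring system you already have.

```python
def longest_run_above(samples, threshold):
    """Length of the longest streak of consecutive samples above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples, threshold=85.0, min_consecutive=5):
    """True if utilization stayed above threshold for min_consecutive samples."""
    return longest_run_above(samples, threshold) >= min_consecutive


# Hypothetical CPU percentages, one sample per minute.
spiky = [40, 95, 42, 38, 97, 41, 39, 96, 40]
pinned = [60, 88, 91, 93, 90, 94, 92, 70]

print(is_sustained(spiky))   # False: each spike lasts a single sample
print(is_sustained(pinned))  # True: six consecutive samples above 85
```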
Once this is fixed, the next error you are likely to see is `NoBrokersAvailable`, if your client library attempts to connect before the cluster is fully ready, or `LEADER_NOT_AVAILABLE` for specific topics whose partitions cannot yet elect a leader.