A LeaderNotAvailableException means the Kafka cluster currently has no elected leader for one or more partitions, so requests to those partitions cannot be served until a leader is elected.

Broker Configuration Mismatch

  • Diagnosis: Check server.properties on all brokers. Look for discrepancies in zookeeper.connect, advertised.listeners, and listeners. A common issue is zookeeper.connect pointing to an inaccessible Zookeeper ensemble, or advertised.listeners not matching what clients and other brokers expect to connect to.
  • Fix: Ensure zookeeper.connect is identical across all brokers and points to a healthy Zookeeper cluster. Verify advertised.listeners uses the correct network interface and port that all other brokers and clients can reach. For example, if your Zookeeper is at zk1.example.com:2181,zk2.example.com:2181 and your brokers listen on port 9092 internally but are advertised as broker1.example.com:9092, all these values must be consistent.
  • Why it works: Kafka relies on Zookeeper for leader election and metadata management. If brokers can’t reach Zookeeper, or if they advertise themselves with incorrect addresses, they can’t participate in the cluster or elect leaders.
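
The comparison described above can be scripted once the server.properties files have been collected from each broker. This is a minimal sketch: the helper names and the sample file contents are hypothetical, and it compares values as plain strings, so the same hosts listed in a different order would also be flagged.

```python
def parse_properties(text):
    """Parse Java-style key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_mismatches(configs, keys):
    """Return {key: {broker: value}} for keys whose values differ across brokers."""
    mismatches = {}
    for key in keys:
        values = {broker: props.get(key) for broker, props in configs.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical file contents for two brokers (illustrative only).
broker1 = parse_properties("""
# broker 1
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181
advertised.listeners=PLAINTEXT://broker1.example.com:9092
""")
broker2 = parse_properties("""
# broker 2
zookeeper.connect=zk1.example.com:2181
advertised.listeners=PLAINTEXT://broker2.example.com:9092
""")

mismatches = find_mismatches(
    {"broker1": broker1, "broker2": broker2},
    ["zookeeper.connect"],
)
print(mismatches)
```

Only zookeeper.connect is checked for equality here, because advertised.listeners is expected to differ per broker; that value still needs a manual check that each address is reachable.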

Zookeeper Unavailability or Instability

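  • Diagnosis: Check Zookeeper logs for errors like "connection refused," "session expired," or "leader election failed." Use echo stat | nc <zookeeper_host> 2181 to check Zookeeper’s status; it should return a stat output with a Mode: field indicating leader or follower (or standalone for a single-node setup). On Zookeeper 3.5.3 and later, four-letter commands other than srvr must be enabled through 4lw.commands.whitelist, so use echo srvr | nc <zookeeper_host> 2181 if stat is rejected.
  • Fix: Ensure your Zookeeper ensemble is running, healthy, and accessible from all Kafka brokers. This might involve restarting Zookeeper nodes, increasing Zookeeper’s JVM heap size (e.g., export KAFKA_HEAP_OPTS="-Xmx1g -Xms1g" before running Kafka’s bundled zookeeper-server-start.sh, or JVMFLAGS in conf/java.env for a standalone Zookeeper install), or resolving network connectivity issues between Kafka and Zookeeper.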
  • Why it works: Zookeeper is the central coordination service for Kafka. If it’s down or unstable, Kafka brokers cannot perform critical operations like leader election, partition assignment, and metadata updates, leading to LeaderNotAvailableException.
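
When checking several Zookeeper nodes, the Mode: field can be pulled out of the captured command output programmatically. The sketch below assumes the output has already been captured as text; the function name and the sample output are illustrative, not taken from a real cluster.

```python
import re


def parse_zk_mode(stat_output):
    """Return the value of the 'Mode:' line, or None if no such line exists."""
    match = re.search(r"^Mode:\s*(\S+)", stat_output, flags=re.MULTILINE)
    return match.group(1) if match else None


# Illustrative output in the general shape of a stat/srvr reply.
sample = """Zookeeper version: 3.6.3
Latency min/avg/max: 0/0/1
Received: 10
Sent: 9
Connections: 1
Outstanding: 0
Zxid: 0x100000002
Mode: follower
Node count: 25
"""

mode = parse_zk_mode(sample)
serving = mode in {"leader", "follower", "standalone"}
print(mode, serving)
```

A reply with no Mode: line (for example, an empty response or an error message) yields None and should be treated as an unhealthy node.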

Under-Replicated Partitions due to Broker Failure/Restart

  • Diagnosis: Check Kafka broker logs for messages indicating a broker has stopped or restarted unexpectedly. Use kafka-topics.sh --describe --topic <topic_name> --bootstrap-server <broker_list> to see the leader and ISR (In-Sync Replicas) for each partition. An ISR smaller than the replication factor means the partition is under-replicated but can still be served. LeaderNotAvailableException occurs when the partition has no leader at all (the Leader field shows none or -1), typically because the leader’s broker went down and no other in-sync replica is available to take over.
  • Fix: Identify the failed broker and restart it. Once the broker rejoins the cluster and syncs its data, Kafka’s controller will re-evaluate partition leadership and ISRs, eventually electing a leader. If a broker is permanently removed, you’ll need to reassign partitions using kafka-reassign-partitions.sh to meet the desired replication factor.
  • Why it works: When a broker holding a partition’s leader or a significant portion of its replicas goes down, Kafka temporarily loses the ability to serve requests for that partition until a new leader can be elected from the remaining in-sync replicas.
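
To scan many partitions at once, the text printed by kafka-topics.sh --describe can be parsed for partitions that have no leader or a shrunken ISR. This is a sketch under assumptions: the sample output is hand-written in the general shape of the tool’s output, field layout can vary between Kafka versions, and the function name is hypothetical.

```python
import re

# Matches per-partition lines; the topic summary line ("PartitionCount: ...") is skipped.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(output):
    """Return one dict per partition line found in the describe output."""
    rows = []
    for line in output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        rows.append({
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": m["leader"],
            "replicas": [r for r in m["replicas"].split(",") if r],
            "isr": [r for r in m["isr"].split(",") if r],
        })
    return rows


# Hand-written sample (illustrative only).
sample = """Topic: orders  PartitionCount: 3  ReplicationFactor: 3  Configs:
  Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2
  Topic: orders  Partition: 2  Leader: none  Replicas: 3,1,2  Isr:
"""

rows = parse_describe(sample)
leaderless = [r["partition"] for r in rows if r["leader"] in ("none", "-1")]
under_replicated = [r["partition"] for r in rows if len(r["isr"]) < len(r["replicas"])]
print("leaderless:", leaderless)
print("under-replicated:", under_replicated)
```

Leaderless partitions are the ones that produce the exception; under-replicated partitions that still have a leader remain available but are at risk if another broker fails.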

Insufficient Controller Quorum

  • Diagnosis: Look in the broker logs for controller election messages, which often indicate failure to acquire a lock or a lack of consensus. The controller is responsible for partition leader election. In Zookeeper mode, a single broker becomes controller by creating the ephemeral /controller znode, so no broker quorum is involved, but the brokers must be able to reach Zookeeper. In KRaft mode, the controller nodes listed in controller.quorum.voters form a Raft quorum, and an active controller can be elected only while a majority of them, (N/2) + 1 with integer division for N voters, is available. Note that min.insync.replicas plays no part in controller election.
  • Fix: In KRaft mode, ensure a majority of the controller nodes are running and can communicate with each other; in Zookeeper mode, ensure the brokers can reach Zookeeper. This might involve restarting downed nodes or troubleshooting network issues between them. The controller.quorum.election.timeout.ms and controller.quorum.fetch.timeout.ms settings in server.properties (defaulting to 1 second and 2 seconds respectively) can sometimes be tuned, but it’s better to fix the underlying availability issue.
  • Why it works: A controller is elected to manage partition leadership and replica state. If no controller can be elected, no leader election for partitions can occur.
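
The (N/2) + 1 figure above uses integer division, which is easy to get wrong for even N. A minimal sketch of the arithmetic, assuming a majority-based quorum such as the KRaft controller quorum; the function names are hypothetical.

```python
def majority(voters):
    """Smallest number of voters that forms a majority of `voters`."""
    if voters < 1:
        raise ValueError("voters must be at least 1")
    return voters // 2 + 1


def has_quorum(voters, available):
    """True if `available` voters are enough to elect a leader."""
    return available >= majority(voters)


for n in (1, 3, 4, 5):
    print(n, "voters -> majority of", majority(n))
```

With 3 voters the quorum tolerates one failure, and with 5 it tolerates two; a 4-voter quorum still tolerates only one, which is why odd sizes are the usual choice.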

Network Partitioning Within the Kafka Cluster

  • Diagnosis: Observe broker logs for repeated connection errors between brokers, especially during periods of high network load or known network instability. Use tools like ping and traceroute between brokers to identify packet loss or high latency. Check firewall rules and network security groups.
  • Fix: Resolve the underlying network issues. This could involve fixing faulty network hardware, reconfiguring firewalls to allow traffic on the required ports (by default, 9092 for Kafka brokers, and 2181, 2888, and 3888 for Zookeeper client, peer, and leader-election traffic), or improving network stability. Ensure advertised.listeners and listeners are configured to use stable, routable IP addresses or hostnames.
  • Why it works: If brokers cannot communicate with each other due to network partitions, they cannot coordinate leader elections or maintain in-sync replica sets, leading to partitions becoming unavailable.
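
Beyond ping and traceroute, it helps to confirm that the actual TCP ports are reachable from each broker, since a firewall can drop port traffic while ICMP still works. The following is a minimal sketch using only the standard library; the function names are hypothetical, and a successful connect shows only that the port accepts connections, not that the service behind it is healthy.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): port_reachable(host, port, timeout) for host, port in endpoints}
```

Run from each broker in turn, a call such as check_endpoints([("broker2.example.com", 9092), ("zk1.example.com", 2181)]) (hostnames are placeholders) shows which links in the cluster are blocked.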

Topic Configuration: min.insync.replicas Too High

  • Diagnosis: Examine the topic configuration using kafka-configs.sh --describe --topic <topic_name> --bootstrap-server <broker_list>. If min.insync.replicas is set to a value higher than the number of available in-sync replicas for a partition, producers configured with acks=all will fail to write with NotEnoughReplicasException. Consumers can still read from the partition, but the rejected writes are easily mistaken for an unavailable partition.
  • Fix: Adjust min.insync.replicas for the affected topic to a value that can be met by the currently available replicas. For example, if a topic has a replication factor of 3 but only 2 brokers are healthy, min.insync.replicas should be set to 2 or less. This can be done via kafka-configs.sh --alter --topic <topic_name> --bootstrap-server <broker_list> --add-config min.insync.replicas=2.
  • Why it works: min.insync.replicas ensures that a minimum number of replicas must acknowledge a write before it’s considered successful. If this threshold cannot be met due to broker unavailability, Kafka may prevent writes to prevent data loss, effectively making the partition unavailable for producers.
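
The rule above reduces to a single comparison between the current ISR size and min.insync.replicas. A minimal sketch, with a hypothetical function name, using the replication-factor-3 example from the fix:

```python
def accepts_acks_all_writes(isr_size, min_insync_replicas):
    """True if a write with acks=all can be acknowledged by enough in-sync replicas."""
    return isr_size >= min_insync_replicas


# Replication factor 3, but only 2 replicas are currently in sync.
isr_size = 2
for min_isr in (1, 2, 3):
    status = "accepted" if accepts_acks_all_writes(isr_size, min_isr) else "rejected"
    print(f"min.insync.replicas={min_isr}: acks=all writes {status}")
```

Lowering min.insync.replicas restores writes but also reduces the number of copies guaranteed for each acknowledged record, so it is worth raising it again once the missing brokers are back.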

The next error you’ll likely encounter is a TimeoutException from your producers or consumers, raised when their configured timeouts expire while they wait for a response from the unavailable partition leader.
