The KafkaException: org.apache.kafka.common.errors.BrokerNotAvailableException means a Kafka client (producer or consumer) tried to talk to a broker that it believes is active, but the broker didn’t respond. This usually points to a network issue or a broker that’s actually dead but not properly registered as such.
Here are the most common reasons and how to fix them:
- ZooKeeper Issues: Kafka relies on ZooKeeper for cluster coordination. If ZooKeeper is unhealthy, brokers might not be able to register themselves properly, or clients might get stale information about broker availability.
  - Diagnosis: Check the ZooKeeper logs for errors. On the ZooKeeper server, run `echo stat | nc localhost 2181` (or your ZooKeeper port), look for `Mode: leader` or `Mode: follower`, and check the `Outstanding` request count and the `Received`/`Sent` packet counts. If the outstanding count is high or keeps growing, ZooKeeper is overloaded.
  - Fix: Restart ZooKeeper ensemble members one by one, ensuring each one is fully back up before restarting the next. If ZooKeeper is consistently overloaded, you might need to increase its heap size (for example `JVMFLAGS="-Xmx1g -Xms1g"` in ZooKeeper’s environment file, `conf/java.env` in a stock install) or add more ZooKeeper nodes.
  - Why it works: A healthy ZooKeeper ensures brokers can register their presence and clients can discover active brokers.
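The `stat` check can also be scripted. The following is a minimal sketch, not an official tool: the function names and the `max_outstanding` threshold are illustrative assumptions, and newer ZooKeeper releases may only answer `stat` if it is allowed through `4lw.commands.whitelist`.

```python
import socket


def query_stat(host="localhost", port=2181, timeout=5.0):
    """Send the four-letter `stat` command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Extract the Mode and Outstanding fields from `stat` output."""
    result = {"mode": None, "outstanding": None}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        value = value.strip()
        if key == "mode":
            result["mode"] = value
        elif key == "outstanding":
            result["outstanding"] = int(value)
    return result


def looks_overloaded(stats, max_outstanding=10):
    """Flag a node whose outstanding request count exceeds a threshold.

    The default threshold is a hypothetical starting point, not a
    ZooKeeper recommendation; tune it for your ensemble.
    """
    outstanding = stats.get("outstanding")
    return outstanding is not None and outstanding > max_outstanding
```

Calling `looks_overloaded(parse_stat(query_stat("localhost", 2181)))` at intervals shows whether the outstanding count is trending upward rather than spiking once.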
- Broker Network Connectivity: Brokers need to be able to reach each other, and clients need to reach brokers. Firewalls, incorrect `advertised.listeners` or `listeners` configurations, or general network partitions are common culprits.
  - Diagnosis: From a client machine, try to `telnet <broker_host> <broker_port>`. From a broker machine, try to `telnet <other_broker_host> <other_broker_port>`. Also, check `/etc/hosts` on all Kafka nodes and clients for correct IP-to-hostname mappings.
  - Fix: Ensure `listeners` in `server.properties` binds to an address that other brokers and clients can reach (e.g., `listeners=PLAINTEXT://0.0.0.0:9092` to bind all interfaces). Crucially, set `advertised.listeners` to the IP address or hostname that clients should use to connect (e.g., `advertised.listeners=PLAINTEXT://your_broker_public_ip:9092`); it must be resolvable from the client side. Open firewall ports between brokers and between clients and brokers.
  - Why it works: Correctly configured listeners and network access ensure that brokers and clients can establish TCP connections to each other.
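Where `telnet` is not installed, the same reachability test can be sketched in Python. This is an illustrative helper under stated assumptions: `parse_listeners` and `can_connect` are hypothetical names, and the parser only handles the simple `NAME://host:port` form used in the examples above.

```python
import socket


def parse_listeners(value):
    """Split a listeners string such as 'PLAINTEXT://host:9092,SSL://host:9093'
    into a list of (host, port) tuples."""
    endpoints = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        _, _, address = entry.partition("://")
        host, _, port = address.rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map 'host:port' to a reachability flag for each endpoint."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in endpoints
    }
```

Running `check_endpoints(parse_listeners(...))` with a broker's `advertised.listeners` value from a client machine tests exactly the addresses that clients are told to use, which is usually where the mismatch shows up.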
- Broker Crashing/Restarting: A broker might be crashing due to memory issues, disk full errors, or unrecoverable errors, and then restarting. While it’s down, it’s unavailable. If it restarts and immediately fails again, clients will see it as intermittently unavailable.
  - Diagnosis: Check the Kafka broker logs (`server.log`) for `OutOfMemoryError`, disk space warnings (`No space left on device`), or any other exceptions indicating a crash. Also, check system logs (`/var/log/messages` or `journalctl`) for OOM killer events.
  - Fix: For OOM errors, increase the broker’s JVM heap size (`KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"` in the Kafka startup script or environment file). For disk space, free up disk space or add more storage. If the crash is due to a specific error, address that underlying issue (e.g., fix corrupted log segments, increase file descriptor limits).
  - Why it works: A stable broker process that is not crashing is essential for continuous availability.
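The log check can be automated with a small scanner. This is a rough sketch: the signature list covers only the strings mentioned above, plus `Too many open files`, which is an assumed indicator of exhausted file descriptors, and the function names are illustrative.

```python
# Substrings that commonly accompany the crash causes discussed above.
# "Too many open files" is an assumption added for the file descriptor case.
CRASH_SIGNATURES = {
    "out_of_memory": "java.lang.OutOfMemoryError",
    "disk_full": "No space left on device",
    "too_many_open_files": "Too many open files",
}


def scan_log_lines(lines):
    """Return, per signature name, the 1-based line numbers that match."""
    hits = {name: [] for name in CRASH_SIGNATURES}
    for number, line in enumerate(lines, start=1):
        for name, needle in CRASH_SIGNATURES.items():
            if needle in line:
                hits[name].append(number)
    return hits


def scan_log_file(path):
    """Scan a log file on disk, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log_lines(handle)
```

Pointing `scan_log_file` at the broker's `server.log` gives a quick first pass before reading the surrounding stack traces by hand.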
- Incorrect `broker.id`: Each broker in a Kafka cluster must have a unique `broker.id`. If two brokers are configured with the same ID, one will likely fail to register or be removed by ZooKeeper, causing availability issues.
  - Diagnosis: Check the `broker.id` setting in the `server.properties` file on each broker. Also, examine ZooKeeper data for broker registrations (`/brokers/ids` in the ZooKeeper CLI).
  - Fix: Ensure every broker has a distinct `broker.id` (e.g., `broker.id=0`, `broker.id=1`, etc.). If a duplicate ID is found, stop the affected broker, correct its `broker.id` in `server.properties`, and restart it.
  - Why it works: ZooKeeper uses `broker.id` as the primary identifier for brokers; duplicate IDs confuse the cluster state.
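If you can collect each broker's `server.properties` in one place, a duplicate check is easy to script. The sketch below assumes you supply a mapping from a host label to the file's text; the host names and function names are hypothetical, and the parser handles only plain `key=value` lines.

```python
from collections import defaultdict


def read_broker_id(properties_text):
    """Return the broker.id from server.properties text, or None if absent.

    If the key appears more than once, the last occurrence wins.
    """
    broker_id = None
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):  # skip blanks and comments
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "broker.id":
            broker_id = int(value.strip())
    return broker_id


def find_duplicate_ids(configs):
    """Given {host_label: properties_text}, return {broker_id: [hosts]}
    for every ID that is configured on more than one host."""
    by_id = defaultdict(list)
    for host, text in configs.items():
        broker_id = read_broker_id(text)
        if broker_id is not None:
            by_id[broker_id].append(host)
    return {bid: sorted(hosts) for bid, hosts in by_id.items() if len(hosts) > 1}
```

An empty result means every collected file declares a distinct ID; it does not cover brokers whose files were not collected.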
- ZooKeeper Session Expiration / Network Glitches: If the network between a broker and ZooKeeper is flaky, the broker’s ZooKeeper session can expire. ZooKeeper then removes the broker from its list of active brokers, even if the broker process is still running. Clients will stop seeing it.
  - Diagnosis: Look for messages like `Expired session` or `Connection loss` in the broker’s `server.log` related to ZooKeeper. On the ZooKeeper server, you might see messages about session expirations.
  - Fix: Improve network stability between brokers and ZooKeeper. You can also tune ZooKeeper session timeouts. In ZooKeeper’s configuration file (`zoo.cfg`), adjust `tickTime` (e.g., to `2000`) and `syncLimit` (e.g., to `10`), and in Kafka’s `server.properties`, adjust `zookeeper.session.timeout.ms` (e.g., to `60000`) and `zookeeper.connection.timeout.ms` (e.g., to `60000`). Note that ZooKeeper limits negotiated session timeouts to between 2 and 20 times `tickTime` by default, so with `tickTime=2000` a requested 60000 ms is capped at 40000 ms unless `maxSessionTimeout` is raised as well.
  - Why it works: Longer timeouts give brokers more resilience against transient network hiccups, preventing premature deregistration.
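The interaction between `tickTime` and the session timeout a broker requests is easy to get wrong, so here is a small worked sketch. It assumes ZooKeeper's default bounds of 2 and 20 times `tickTime` when `minSessionTimeout` and `maxSessionTimeout` are not set; the function name is illustrative.

```python
def effective_session_timeout(
    requested_ms,
    tick_time_ms=2000,
    min_session_timeout_ms=None,
    max_session_timeout_ms=None,
):
    """Return the session timeout ZooKeeper would grant for a request.

    When the explicit bounds are not configured, the defaults are
    2 * tickTime (lower) and 20 * tickTime (upper).
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lower, min(requested_ms, upper))


# With tickTime=2000, a 60000 ms request is capped by the default upper bound.
print(effective_session_timeout(60000))  # 40000
# Raising maxSessionTimeout lets the full request through.
print(effective_session_timeout(60000, max_session_timeout_ms=60000))  # 60000
```

If the value granted is lower than what you set in `zookeeper.session.timeout.ms`, the broker still expires on the shorter server-side timeout.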
- Large Message Sizes / Producer/Consumer Overload: While less direct, if producers are sending extremely large messages or if the cluster is under heavy load, it can exacerbate underlying network or broker issues. A broker might become unresponsive under strain.
  - Diagnosis: Monitor broker CPU, memory, and network I/O. Check `NetworkProcessorAvgIdlePercent` and `RequestHandlerAvgIdlePercent` in broker metrics; both range from 0 to 1, and values consistently close to 0 (roughly below 0.3) indicate the broker is overloaded. Check producer/consumer logs for timeouts or slow throughput.
  - Fix: If large messages are the cause, increase `message.max.bytes` and `replica.fetch.max.bytes` on brokers, and the matching client limits (`max.request.size` on producers, `max.partition.fetch.bytes` on consumers). Scale up broker resources (CPU, RAM, network bandwidth) or add more brokers to the cluster. Optimize producer/consumer configurations (e.g., batching, compression).
  - Why it works: Ensuring brokers have sufficient resources and are not overwhelmed by traffic prevents them from becoming unresponsive.
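To judge whether an idle metric is "consistently" low rather than briefly dipping, you can evaluate a window of samples. The sketch below assumes you already collect the idle ratios (values from 0 to 1) from your monitoring system; the function name, the 0.3 threshold default, and the 80% fraction are illustrative choices, not Kafka settings.

```python
def consistently_low(samples, threshold=0.3, min_fraction=0.8):
    """Return True if at least `min_fraction` of the idle-ratio samples
    fall below `threshold`.

    `samples` holds idle ratios between 0 and 1, e.g. readings of
    RequestHandlerAvgIdlePercent taken at regular intervals.
    """
    if not samples:
        return False
    for value in samples:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"idle ratio out of range: {value}")
    below = sum(1 for value in samples if value < threshold)
    return below / len(samples) >= min_fraction
```

A single low reading during a traffic burst returns `False` here, while a window that is almost entirely below the threshold returns `True` and is worth acting on.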
If you fix all these, the next error you’ll likely see is NotControllerException if you try to perform an operation that requires the cluster controller and it’s temporarily unavailable during a re-election.