The KafkaException: org.apache.kafka.common.errors.BrokerNotAvailableException means a Kafka client (producer or consumer) tried to talk to a broker that it believes is active, but the broker didn’t respond. This usually points to a network issue or a broker that’s actually dead but not properly registered as such.

Here are the most common reasons and how to fix them:

  1. ZooKeeper Issues: In ZooKeeper-based clusters (clusters running in KRaft mode do not use ZooKeeper), Kafka relies on ZooKeeper for cluster coordination. If ZooKeeper is unhealthy, brokers might not be able to register themselves properly, or clients might get stale information about broker availability.

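    • Diagnosis: Check the ZooKeeper logs for errors. On the ZooKeeper server, run echo stat | nc localhost 2181 (or your ZooKeeper port). Look for Mode: leader or Mode: follower (Mode: standalone on a single-node setup), and check the Outstanding, Received, and Sent counts. If Outstanding is high or keeps growing, ZooKeeper is overloaded. On ZooKeeper 3.5.3 and later, stat only responds if it is allowed by 4lw.commands.whitelist in zoo.cfg.
    • Fix: Restart ZooKeeper ensemble members one by one, ensuring each one has fully rejoined the ensemble before restarting the next. If ZooKeeper is consistently overloaded, you might need to increase its heap size (for example, JVMFLAGS="-Xmx1g -Xms1g", typically set in conf/java.env rather than in zoo.cfg) or add more ZooKeeper nodes. Note that extra nodes help with read load and fault tolerance; they do not speed up writes.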
    • Why it works: A healthy ZooKeeper ensures brokers can register their presence and clients can discover active brokers.
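
The stat check above can also be scripted. The following is a minimal sketch, assuming the stat four-letter command is enabled on the server; the function names query_zk_stat and parse_zk_stat are hypothetical helpers for illustration, not part of any ZooKeeper client library.

```python
import socket


def query_zk_stat(host="localhost", port=2181, timeout=5.0):
    """Send the four-letter `stat` command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_zk_stat(text):
    """Extract Mode and the Outstanding/Received/Sent counters from `stat` output."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Mode":
            result["mode"] = value
        elif key in ("Outstanding", "Received", "Sent"):
            result[key.lower()] = int(value)
    return result
```

Calling parse_zk_stat(query_zk_stat("localhost", 2181)) on each ensemble member and comparing the "outstanding" value over time gives a rough picture of whether requests are queuing up.
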
  2. Broker Network Connectivity: Brokers need to be able to reach each other, and clients need to reach brokers. Firewalls, incorrect advertised.listeners or listeners configurations, or general network partitions are common culprits.

    • Diagnosis: From a client machine, try to telnet <broker_host> <broker_port>. From a broker machine, try to telnet <other_broker_host> <other_broker_port>. Also, check /etc/hosts on all Kafka nodes and clients for correct IP-to-hostname mappings.
    • Fix: In server.properties, listeners is the address the broker binds to (e.g., listeners=PLAINTEXT://0.0.0.0:9092 to listen on all interfaces), while advertised.listeners is the address the broker hands back to clients and other brokers in metadata. Crucially, set advertised.listeners to a hostname or IP address that is resolvable and reachable from every client and broker (e.g., advertised.listeners=PLAINTEXT://your_broker_public_ip:9092); the 0.0.0.0 meta-address is not valid there. Open firewall ports between brokers and between clients and brokers.
    • Why it works: Correctly configured listeners and network access ensure that brokers and clients can establish TCP connections to each other.
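
Where telnet is not installed, the same reachability test can be done with a few lines of Python. This is a minimal sketch; can_connect is a hypothetical helper name, and a successful TCP connect only shows that the port is open, not that advertised.listeners is correct.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Running can_connect("<broker_host>", 9092) from each client and from every other broker narrows down which network path is blocked.
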
  3. Broker Crashing/Restarting: A broker might be crashing due to memory issues, disk full errors, or unrecoverable errors, and then restarting. While it’s down, it’s unavailable. If it restarts and immediately fails again, clients will see it as intermittently unavailable.

    • Diagnosis: Check the Kafka broker logs (server.log) for OutOfMemoryError, disk space warnings (No space left on device), or any other exceptions indicating a crash. Also, check system logs (/var/log/messages or journalctl) for OOM killer events.
    • Fix: For OOM errors, increase the broker’s JVM heap size (KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" in the Kafka startup script or environment file). For disk space, free up disk space or add more storage. If the crash is due to a specific error, address that underlying issue (e.g., fix corrupted log segments, increase file descriptor limits).
    • Why it works: A stable broker process that is not crashing is essential for continuous availability.
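
Scanning server.log for the crash signatures mentioned above can be automated. The sketch below is illustrative only: the pattern list is an assumption covering the three failure types named here, and find_crash_indicators is a hypothetical helper, not a Kafka tool.

```python
import re

# Signatures for the crash causes discussed above (illustrative, not exhaustive).
CRASH_PATTERNS = {
    "out_of_memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "disk_full": re.compile(r"No space left on device"),
    "too_many_open_files": re.compile(r"Too many open files"),
}


def find_crash_indicators(lines):
    """Map each matched crash cause to the 1-based line numbers where it appears."""
    hits = {}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(lineno)
    return hits
```

It can be fed an open file directly, for example find_crash_indicators(open("server.log", errors="replace")), since a file object iterates line by line.
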
  4. Incorrect broker.id: Each broker in a Kafka cluster must have a unique broker.id. If two brokers are configured with the same ID, one will likely fail to register or be removed by ZooKeeper, causing availability issues.

    • Diagnosis: Check the broker.id setting in the server.properties file on each broker. Also, examine ZooKeeper data for broker registrations (/brokers/ids in ZooKeeper CLI).
    • Fix: Ensure every broker has a distinct broker.id (e.g., broker.id=0, broker.id=1, etc.). If a duplicate ID is found, stop the affected broker, correct its broker.id in server.properties, and restart it.
    • Why it works: ZooKeeper uses broker.id as the primary identifier for brokers; duplicate IDs confuse the cluster state.
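
If you can collect the server.properties contents from every broker, a short script can flag duplicate IDs. This is a minimal sketch with a simplified parser that only understands key=value lines; the helper names and the host labels in the usage note are hypothetical.

```python
def read_broker_id(properties_text):
    """Return the broker.id from properties text, or None if it is not set."""
    found = None
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):  # skip blanks and comments
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "broker.id":
            found = int(value.strip())  # a later occurrence overrides an earlier one
    return found


def find_duplicate_broker_ids(configs):
    """Given {host: properties_text}, return {broker_id: [hosts]} for reused IDs."""
    by_id = {}
    for host, text in configs.items():
        broker_id = read_broker_id(text)
        if broker_id is not None:
            by_id.setdefault(broker_id, []).append(host)
    return {bid: sorted(hosts) for bid, hosts in by_id.items() if len(hosts) > 1}
```

For example, passing {"kafka-a": "broker.id=0", "kafka-b": "broker.id=0"} reports that ID 0 is shared by both hosts.
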
  5. ZooKeeper Session Expiration / Network Glitches: If the network between a broker and ZooKeeper is flaky, the broker’s ZooKeeper session can expire. ZooKeeper then removes the broker from its list of active brokers, even if the broker process is still running. Clients will stop seeing it.

    • Diagnosis: Look for messages like Expired session or Connection loss in the broker’s server.log related to ZooKeeper. On the ZooKeeper server, you might see messages about session expirations.
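    • Fix: Improve network stability between brokers and ZooKeeper. You can also tune ZooKeeper session timeouts. In Kafka’s server.properties, raise zookeeper.session.timeout.ms (e.g., to 60000) and zookeeper.connection.timeout.ms (e.g., to 60000). The ZooKeeper server caps the session timeout it grants at maxSessionTimeout, which defaults to 20 times tickTime (40000 ms with tickTime=2000), so raise maxSessionTimeout or tickTime in zoo.cfg if you request more than that. syncLimit (e.g., 10) controls how far followers may lag behind the leader rather than broker sessions, but raising it can help an ensemble on a slow network keep its quorum.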
    • Why it works: Longer timeouts give brokers more resilience against transient network hiccups, preventing premature deregistration.
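
One detail worth checking when tuning these values: the ZooKeeper server clamps the session timeout a client requests to the range between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The sketch below reproduces that arithmetic; effective_session_timeout is a hypothetical helper for illustration.

```python
def effective_session_timeout(requested_ms, tick_time_ms=2000,
                              min_session_timeout_ms=None,
                              max_session_timeout_ms=None):
    """Return the session timeout ZooKeeper would grant for a requested value."""
    # ZooKeeper defaults: min = 2 * tickTime, max = 20 * tickTime.
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lower, min(requested_ms, upper))
```

With the default tickTime of 2000, a broker asking for 60000 ms is granted 40000 ms unless maxSessionTimeout is raised on the ZooKeeper side.
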
  6. Large Message Sizes / Producer/Consumer Overload: While less direct, if producers are sending extremely large messages or if the cluster is under heavy load, it can exacerbate underlying network or broker issues. A broker might become unresponsive under strain.

    • Diagnosis: Monitor broker CPU, memory, and network I/O. Check NetworkProcessorAvgIdlePercent and RequestHandlerAvgIdlePercent in broker metrics; both are fractions between 0 and 1, and values consistently below about 0.3 indicate the broker is overloaded. Check producer/consumer logs for timeouts or slow throughput.
    • Fix: If large messages are the cause, raise the size limits consistently: message.max.bytes and replica.fetch.max.bytes on brokers, max.request.size on producers, and max.partition.fetch.bytes (or fetch.max.bytes) on consumers. Scale up broker resources (CPU, RAM, network bandwidth) or add more brokers to the cluster. Optimize producer/consumer configurations (e.g., batching, compression).
    • Why it works: Ensuring brokers have sufficient resources and are not overwhelmed by traffic prevents them from becoming unresponsive.
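
If you export the broker idle metrics to your own tooling, a simple threshold check can flag an overloaded broker. This is a minimal sketch that treats each sample as an idle fraction between 0 and 1; the 0.3 threshold is a commonly cited rule of thumb rather than a hard limit, and is_overloaded is a hypothetical helper.

```python
def is_overloaded(idle_samples, threshold=0.3):
    """Return True when the average idle fraction falls below the threshold."""
    samples = list(idle_samples)
    if not samples:
        raise ValueError("at least one sample is required")
    for sample in samples:
        if not 0.0 <= sample <= 1.0:
            raise ValueError(f"idle fraction out of range: {sample}")
    return sum(samples) / len(samples) < threshold
```

Averaging over several samples avoids alerting on a single short spike; adjust the threshold to match your own baseline.
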

If you fix all these, the next error you’ll likely see is NotControllerException if you try to perform an operation that requires the cluster controller and it’s temporarily unavailable during a re-election.
