Kafka brokers are reporting that they have been disconnected from other nodes in the cluster. This usually means a broker has stopped communicating with the controller or with other brokers, which leaves partitions unavailable and the cluster in a degraded state.

Common Causes and Fixes

1. Network Partitioning Due to Firewall Rules

  • Diagnosis: Check firewall logs on the affected broker and on any network devices in between. Look for DENY or DROP entries for traffic on Kafka’s ports: 9092 by default for both client and broker-to-broker traffic (unless a separate inter-broker listener is configured) and, in ZooKeeper mode, 2181 for broker-to-ZooKeeper connections. Ports 2888/3888 are used only between the ZooKeeper nodes themselves.
    # On the broker, check its own firewall rules
    # (iptables -L prints the destination port as "dpt:9092")
    sudo iptables -L -n | grep 'dpt:9092'
    sudo ufw status verbose | grep 9092
    
    # If using a cloud provider, check security group rules
    # Example for AWS EC2:
    aws ec2 describe-security-groups --group-ids sg-xxxxxxxxxxxxxxxxx --query 'SecurityGroups[*].IpPermissions'
    
  • Fix: Ensure that all Kafka brokers can communicate with each other on Kafka’s port (default 9092) and that the controller broker can communicate with all other brokers. If using ZooKeeper, ensure brokers can communicate with ZooKeeper on its port (default 2181).
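    # Example: Add a rule to allow traffic from other brokers on port 9092
    # (replace <BROKER_SUBNET> with the CIDR range of your brokers)
    sudo iptables -I INPUT -p tcp -s <BROKER_SUBNET> --dport 9092 -j ACCEPT
    sudo ufw allow from <BROKER_SUBNET> to any port 9092 proto tcp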
    
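    # Example: Update AWS Security Group to allow all brokers within the group to talk to each other on 9092
    # (This modifies the SG attached to your Kafka instances; the group references itself as the source)
    aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxxxxxxxxxxx --protocol tcp --port 9092 --source-group sg-xxxxxxxxxxxxxxxxx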
    
  • Why it works: Kafka relies on continuous network connectivity between brokers and between brokers and the controller for cluster coordination and partition leadership. Firewalls blocking this traffic directly cause disconnects.
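
To confirm that a firewall change actually restored connectivity, a small reachability probe run from each broker can help. The following is a minimal sketch, not a definitive tool: the function names check_port and check_brokers are hypothetical, the hostnames in the usage comment are placeholders, and it assumes bash and the coreutils timeout command are installed.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port opens within 3 seconds.
# It invokes bash explicitly because /dev/tcp is a bash feature.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Hypothetical helper: probe the given port on every host listed after it
# and print one OK/FAIL line per host.
check_brokers() {
  port="$1"; shift
  for host in "$@"; do
    if check_port "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
}

# Example (placeholder hostnames):
# check_brokers 9092 kafka-1.example.com kafka-2.example.com kafka-3.example.com
```

A FAIL line only shows that the TCP handshake did not complete from this machine; it does not say whether the cause is a firewall, a stopped broker, or a wrong address, so treat it as a starting point for the checks above.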

2. ZooKeeper Connectivity Issues

  • Diagnosis: If your cluster runs in ZooKeeper mode (rather than KRaft), the brokers rely on a ZooKeeper ensemble for coordination. Check the Kafka broker logs for messages such as "ZooKeeper session expired" or "Connection refused" when connecting to ZooKeeper.
    # On the Kafka broker, check Kafka logs (e.g., /var/log/kafka/server.log)
    # (-i makes the match case-insensitive, so "Connection refused" is found too)
    grep -iE "zookeeper|connection refused" /var/log/kafka/server.log
    
    # On the ZooKeeper server, check its logs for connections from Kafka brokers
    # (e.g., /var/log/zookeeper/zookeeper.log)
    grep -iE "accepted (socket )?connection" /var/log/zookeeper/zookeeper.log
    
  • Fix: Ensure ZooKeeper is running, accessible on its port (default 2181), and that Kafka brokers are configured with the correct ZooKeeper connection string in server.properties (zookeeper.connect=zk1:2181,zk2:2181,zk3:2181). Restart Kafka brokers if ZooKeeper was down or unreachable.
    # In server.properties
    zookeeper.connect=kafka-zk-1.example.com:2181,kafka-zk-2.example.com:2181,kafka-zk-3.example.com:2181
    
  • Why it works: ZooKeeper is the source of truth for Kafka cluster metadata. If brokers cannot maintain a session with ZooKeeper, they cannot participate in the cluster, leading to disconnects.
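
One way to verify the connection string from the broker’s point of view is to read it out of server.properties and probe each entry. This is a sketch under assumptions: zk_endpoints and probe_zk are hypothetical helper names, the properties path in the usage comment is the one used elsewhere in this article, and a successful TCP connect only shows that the port is open, not that the ensemble is healthy.

```shell
# Hypothetical helper: print each host:port from zookeeper.connect, one per line.
# An optional chroot suffix (e.g. /kafka) at the end of the string is dropped.
zk_endpoints() {
  sed -n 's/^[[:space:]]*zookeeper\.connect[[:space:]]*=[[:space:]]*//p' "$1" |
    tail -n 1 | sed 's|/.*$||' | tr ',' '\n' | sed 's/[[:space:]]//g' | grep -v '^$'
}

# Hypothetical helper: try a TCP connection to every endpoint (3 second timeout).
# Entries without an explicit port fall back to 2181.
probe_zk() {
  zk_endpoints "$1" | while IFS=: read -r host port; do
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "${port:-2181}" 2>/dev/null; then
      echo "reachable   $host:${port:-2181}"
    else
      echo "unreachable $host:${port:-2181}"
    fi
  done
}

# Example:
# probe_zk /etc/kafka/server.properties
```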

3. Insufficient File Descriptors (ulimit)

  • Diagnosis: Kafka brokers, especially under heavy load, can open a large number of file descriptors for network connections and log files. If the system limit is too low, the broker will fail to open new connections. Check Kafka logs for Too many open files errors.
    # Check the current open file descriptor limit for the Kafka process
    # Find the PID of the Kafka broker
    ps aux | grep kafka
    # Then check limits for that PID
    cat /proc/<PID>/limits | grep "Max open files"
    
    # Or, check the soft limit of the current shell (run this as the Kafka user).
    # This is a per-process limit, not a system-wide one.
    ulimit -n
    
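  • Fix: Increase the nofile limit for the user that runs Kafka. This is typically done in /etc/security/limits.conf or a file in /etc/security/limits.d/.
    # In /etc/security/limits.conf or a custom limits file
    kafka - nofile 65536
    # Only needed if the broker runs as root
    root - nofile 65536
    
    After changing, restart the Kafka service from a fresh login session so the new limits take effect. These files are applied by PAM at login; if Kafka runs as a systemd service, set LimitNOFILE=65536 in the unit file instead, then run systemctl daemon-reload and restart the service.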
  • Why it works: Each network connection and open file requires a file descriptor. Exceeding the system’s limit prevents the OS from allocating new ones, causing connection failures.
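
Before and after raising the limit, it can be useful to see how close the broker actually is to it. The sketch below assumes a Linux host with /proc; fd_usage and fd_over_threshold are hypothetical helper names, and reading another user’s /proc/<PID>/fd requires running as that user or as root.

```shell
# Hypothetical helper: print "<open descriptors> <soft limit>" for a PID.
# In /proc/<PID>/limits the fourth field of the "Max open files" row is the soft limit.
fd_usage() {
  pid="$1"
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
  echo "$open $limit"
}

# Hypothetical helper: succeeds when usage is at or above a percentage threshold.
# $1 = open descriptors, $2 = soft limit, $3 = threshold in percent
fd_over_threshold() {
  [ $(( $1 * 100 / $2 )) -ge "$3" ]
}

# Example: warn when the broker uses 80% or more of its descriptors.
# set -- $(fd_usage <PID>)
# fd_over_threshold "$1" "$2" 80 && echo "WARNING: $1 of $2 file descriptors in use"
```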

4. High CPU or Memory Usage on Brokers

  • Diagnosis: When a broker is overloaded, its network threads and I/O operations can become starved, leading to timeouts and disconnects. Monitor CPU and memory usage on the affected brokers.
    # Use top or htop to monitor CPU and memory
    # (-d, joins multiple matching PIDs with commas, which is what -p expects)
    top -p "$(pgrep -d, -f kafka)"
    htop -p "$(pgrep -d, -f kafka)"
    
    # Check Kafka's JVM heap usage
    # You might need to use jstat or jcmd if you don't have JMX configured
    jstat -gcutil <PID>
    
  • Fix:
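    • Increase JVM Heap Size: If the broker is constantly garbage collecting or running out of heap, raise the heap size through the KAFKA_HEAP_OPTS environment variable (e.g., KAFKA_HEAP_OPTS="-Xmx8g -Xms8g"). Leave enough free memory for the OS page cache, which Kafka relies on for log reads and writes.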
    • Scale Horizontally: Add more brokers to the cluster to distribute the load.
    • Tune Kafka Settings: Adjust settings like num.io.threads and num.network.threads in server.properties (though be cautious, as incorrect tuning can worsen performance).
    • Optimize Topics/Producers/Consumers: Reduce the rate of messages, optimize message size, or scale consumers.
  • Why it works: A responsive broker is crucial for cluster health. Overload leads to unresponsiveness, which the controller and other brokers interpret as a disconnect.
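
To judge whether heap pressure is the problem, the old-generation occupancy reported by jstat is a reasonable first signal. The following sketch reads jstat -gcutil output from standard input and prints the most recent value of the O (old generation, percent used) column; old_gen_pct is a hypothetical helper name, and the column is located by its header so the position does not have to be assumed.

```shell
# Hypothetical helper: print the latest old-generation utilization (percent)
# from `jstat -gcutil` output read on stdin.
old_gen_pct() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") col = i; next }
       col     { val = $col }
       END     { if (val != "") print val }'
}

# Example:
# jstat -gcutil <PID> | old_gen_pct
```

A value that stays high across several samples, together with a rising full-GC count in the same output, suggests the heap is too small for the workload; a single high reading on its own is not conclusive.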

5. Incorrect Broker Configuration (broker.id)

  • Diagnosis: Each broker must have a unique broker.id in its server.properties file. If two brokers have the same ID, they will conflict. Check Kafka logs for errors indicating a duplicate broker.id or inability to register with ZooKeeper.
    # Check server.properties on each broker
    grep "^broker\.id=" /etc/kafka/server.properties
    
    # Check Kafka logs for ID conflicts (case-insensitive)
    grep -iE "broker.?id|already registered" /var/log/kafka/server.log
    
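  • Fix: Ensure each server.properties file has a unique, non-negative integer for broker.id. Restart any brokers that had duplicate IDs after correcting them. If a broker already holds data, its log directories contain a meta.properties file that records the old ID, and the broker refuses to start while the two disagree, so reconcile them before restarting.
    # Example: Unique IDs for three brokers, one per server.properties file
    # broker 1
    broker.id=0
    # broker 2
    broker.id=1
    # broker 3
    broker.id=2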
    
  • Why it works: The broker.id is how Kafka uniquely identifies each instance in the cluster. A conflict means the cluster cannot correctly manage its metadata.
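
With more than a handful of brokers, comparing IDs by eye gets error-prone. As a sketch, assuming you have copied each broker’s server.properties into one local directory (the path in the usage comment is a placeholder), the hypothetical helper below prints any broker.id value that appears in more than one file.

```shell
# Hypothetical helper: print every broker.id that occurs in more than one
# of the given properties files. Prints nothing when all IDs are unique.
duplicate_broker_ids() {
  for f in "$@"; do
    # Matches "broker.id=<n>" only, not "broker.id.generation.enable=..."
    sed -n 's/^[[:space:]]*broker\.id[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$f"
  done | sort -n | uniq -d
}

# Example (placeholder directory holding one file per broker):
# duplicate_broker_ids ./collected-configs/*.properties
```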

6. Network Interface or IP Address Changes

  • Diagnosis: If a broker’s network interface or IP address changes unexpectedly (e.g., due to DHCP or misconfiguration), it can lose its connection to the cluster and its advertised address might become invalid. Check server.properties for advertised.listeners and compare it with the actual IP address of the broker.
    # Check Kafka's advertised listener in server.properties
    grep "advertised.listeners" /etc/kafka/server.properties
    
    # Check the actual IP address of the broker
    ip a
    
  • Fix: Ensure advertised.listeners is correctly configured to reflect the IP address and port that other brokers and clients should use to connect to this broker. If using dynamic IPs, consider using hostnames that resolve correctly or static IPs. Restart the Kafka broker after correcting the configuration.
    # Example: If the broker's IP is 192.168.1.100
    advertised.listeners=PLAINTEXT://192.168.1.100:9092
    
  • Why it works: Brokers and clients connect to each broker using the host and port that broker advertises in the cluster metadata. If the advertised address is wrong or unreachable, other nodes cannot establish or maintain connections.
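
The comparison between the advertised address and the broker’s real addresses can be scripted. This is a minimal sketch: advertised_endpoints and host_is_local are hypothetical helper names, host_is_local assumes the iproute2 ip command and only handles literal IP addresses (a hostname would need to be resolved first).

```shell
# Hypothetical helper: print host:port for every entry in advertised.listeners,
# with the listener name / protocol prefix (e.g. PLAINTEXT://) removed.
advertised_endpoints() {
  sed -n 's/^[[:space:]]*advertised\.listeners[[:space:]]*=[[:space:]]*//p' "$1" |
    tr ',' '\n' | sed 's|^.*://||; s/[[:space:]]*$//' | grep -v '^$'
}

# Hypothetical helper: succeeds if the given IP is assigned to a local interface.
host_is_local() {
  ip -o addr show | awk '{ split($4, a, "/"); print a[1] }' | grep -Fxq "$1"
}

# Example: flag advertised IPs that are not configured on this machine.
# advertised_endpoints /etc/kafka/server.properties | while IFS=: read -r host port; do
#   host_is_local "$host" || echo "advertised address $host is not on a local interface"
# done
```

A mismatch is not always an error: behind NAT or a load balancer the advertised address is intentionally different from any local interface, so the warning should prompt a check rather than an automatic change.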

Once these issues are resolved, the next thing you’ll likely see is a round of consumer group rebalances: partition leadership shifts as the broker rejoins the cluster, and consumers have their partitions reassigned.
