Kafka brokers are reporting that they have been disconnected from other nodes in the cluster. This usually means a broker has stopped communicating with the controller or with other brokers, which leaves partitions unavailable and the cluster in a degraded state.
Common Causes and Fixes
1. Network Partitioning Due to Firewall Rules
- Diagnosis: Check firewall logs on the affected broker and on any network devices in between. Look for `DENY` or `DROP` messages for traffic on Kafka’s listener port (9092 by default; broker-to-broker traffic uses the same port as clients unless a separate listener is configured) and, in ZooKeeper mode, on ZooKeeper’s client port (2181 by default). Ports 2888 and 3888 carry traffic between the ZooKeeper servers themselves, so they matter only inside the ensemble.

      # On the broker, check its own firewall rules
      # (iptables -S prints rules in command form, so '--dport 9092' matches)
      sudo iptables -S | grep -- '--dport 9092'
      sudo ufw status verbose | grep 9092

      # If using a cloud provider, check security group rules
      # Example for AWS EC2:
      aws ec2 describe-security-groups --group-ids sg-xxxxxxxxxxxxxxxxx --query 'SecurityGroups[*].IpPermissions'

- Fix: Ensure that all Kafka brokers can communicate with each other on Kafka’s port (default 9092) and that the controller broker can communicate with all other brokers. If using ZooKeeper, ensure brokers can communicate with ZooKeeper on its port (default 2181).

      # Example: Add a rule to allow traffic on port 9092
      # (add -s <broker-subnet> to restrict the rule to the other brokers)
      sudo iptables -I INPUT -p tcp --dport 9092 -j ACCEPT
      sudo ufw allow 9092/tcp

      # Example: Update AWS Security Group to allow all brokers within the group to talk to each other on 9092
      # (This requires modifying the SG attached to your Kafka instances)

- Why it works: Kafka relies on continuous network connectivity between brokers, and between brokers and the controller, for cluster coordination and partition leadership. Firewalls that block this traffic directly cause disconnects.
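Firewall rules can look correct and still drop traffic somewhere along the path, so it helps to test reachability from each broker to its peers directly. The following is a minimal sketch, assuming bash with `/dev/tcp` support and coreutils `timeout`; the function names `check_port` and `check_brokers` and the host names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# check_port HOST PORT [TIMEOUT_SECONDS]
# Exit 0 if a TCP connection to HOST:PORT can be opened, non-zero otherwise.
# Relies on bash's /dev/tcp redirection, so no extra tools are needed.
check_port() {
  local host=$1 port=$2 wait=${3:-3}
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# check_brokers PORT HOST...
# Prints one OK/FAIL line per host; the exit status is the number of failures.
check_brokers() {
  local port=$1 failed=0 host
  shift
  for host in "$@"; do
    if check_port "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}

# Usage (host names are placeholders):
#   check_brokers 9092 kafka-1.example.com kafka-2.example.com kafka-3.example.com
```

Running it from every broker, not only the one reporting disconnects, shows whether the block is one-directional.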
2. ZooKeeper Connectivity Issues
- Diagnosis: If the cluster runs in ZooKeeper mode rather than KRaft, the brokers rely on a ZooKeeper ensemble for coordination. Check Kafka broker logs for messages like `ZooKeeper session expired` or `Connection refused` to ZooKeeper.

      # On the Kafka broker, check Kafka logs (e.g., /var/log/kafka/server.log)
      # -i makes the match case-insensitive, so "Connection refused" is caught too
      grep -iE "zookeeper|connection refused" /var/log/kafka/server.log

      # On the ZooKeeper server, check its logs for connections from Kafka brokers
      # (e.g., /var/log/zookeeper/zookeeper.log; exact wording varies by ZooKeeper version)
      grep -i "accepted" /var/log/zookeeper/zookeeper.log

- Fix: Ensure ZooKeeper is running and accessible on its port (default 2181), and that Kafka brokers are configured with the correct ZooKeeper connection string (`zookeeper.connect`) in `server.properties`. Brokers normally reconnect once ZooKeeper is reachable again; restart any broker that does not rejoin the cluster on its own.

      # In server.properties
      zookeeper.connect=kafka-zk-1.example.com:2181,kafka-zk-2.example.com:2181,kafka-zk-3.example.com:2181

- Why it works: In ZooKeeper mode, ZooKeeper is the source of truth for Kafka cluster metadata. If brokers cannot maintain a session with ZooKeeper, they cannot participate in the cluster, leading to disconnects.
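To confirm that the connection string a broker actually uses points at reachable servers, the endpoints can be pulled out of `server.properties` and probed one by one. This is a sketch under stated assumptions: `zk_endpoints` is a hypothetical helper, an optional chroot suffix such as `/kafka` is dropped, and an entry without a port is assumed to use 2181.

```shell
# zk_endpoints FILE
# Prints each host:port from zookeeper.connect on its own line.
# If the key is defined more than once, the last definition is used.
zk_endpoints() {
  sed -n 's/^[[:space:]]*zookeeper\.connect[[:space:]]*=[[:space:]]*//p' "$1" |
    tail -n 1 |
    sed 's|/.*$||' |
    tr ',' '\n' |
    awk '{ gsub(/[ \t\r]/, "")
           if ($0 == "") next
           if ($0 !~ /:/) $0 = $0 ":2181"
           print }'
}

# Usage (assumes nc is installed; the path is the one used elsewhere in this article):
#   zk_endpoints /etc/kafka/server.properties | while IFS=: read -r host port; do
#     nc -z -w 3 "$host" "$port" && echo "OK   $host:$port" || echo "FAIL $host:$port"
#   done
```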
3. Insufficient File Descriptors (ulimit)
- Diagnosis: Kafka brokers, especially under heavy load, open a large number of file descriptors for network connections and log segment files. If the limit is too low, the broker fails to open new connections or files. Check Kafka logs for `Too many open files` errors.

      # Find the PID of the Kafka broker
      ps aux | grep kafka
      # Then check the limit that applies to that PID
      grep "Max open files" /proc/<PID>/limits
      # Or check the soft limit of the current shell (this is not a system-wide value)
      ulimit -n

- Fix: Increase the `nofile` limit for the Kafka user. This is typically done in `/etc/security/limits.conf` or a file in `/etc/security/limits.d/`. If Kafka runs as a systemd service, these files are not applied to it; set `LimitNOFILE=` in the unit file instead. After changing the limit, restart the Kafka service for the new value to take effect.

      # In /etc/security/limits.conf or a custom limits file
      kafka - nofile 65536
      root - nofile 65536

- Why it works: Each network connection and open file requires a file descriptor. Once the process reaches its limit, the OS refuses to allocate new ones, causing connection failures.
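The limit on its own says little until it is compared with how many descriptors the broker currently holds. The sketch below does that comparison on Linux via `/proc`; the function names and the 80% default warning threshold are illustrative assumptions, and reading another user's `/proc/<PID>/fd` typically requires root.

```shell
# soft_nofile_limit
# Reads the text of /proc/<PID>/limits on stdin and prints the soft
# "Max open files" value (fourth whitespace-separated field of that row).
soft_nofile_limit() {
  awk '/^Max open files/ { print $4 }'
}

# fd_usage_percent OPEN LIMIT
# Prints the integer percentage of the limit that is in use.
fd_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# report_fd_usage PID [WARN_PERCENT]
# Prints a summary line; exit status is non-zero once usage reaches WARN_PERCENT.
report_fd_usage() {
  local pid=$1 warn=${2:-80} open limit pct
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(soft_nofile_limit < "/proc/$pid/limits")
  pct=$(fd_usage_percent "$open" "$limit")
  echo "pid=$pid open=$open limit=$limit used=${pct}%"
  [ "$pct" -lt "$warn" ]
}

# Usage:
#   sudo bash -c '. ./fd_check.sh; report_fd_usage <PID>'
```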
4. High CPU or Memory Usage on Brokers
- Diagnosis: When a broker is overloaded, its network and I/O threads can be starved, leading to timeouts and disconnects. Monitor CPU and memory usage on the affected brokers.

      # Use top or htop to monitor CPU and memory
      # (-d, joins multiple PIDs with commas, the format top and htop expect)
      top -p "$(pgrep -d, -f kafka)"
      htop -p "$(pgrep -d, -f kafka)"

      # Check Kafka's JVM heap usage
      # jstat or jcmd work even if JMX is not configured; 1000 = sample every second
      jstat -gcutil <PID> 1000

- Fix:
  - Increase JVM Heap Size: If the broker is constantly garbage collecting or running out of heap, raise the heap size set in the `KAFKA_HEAP_OPTS` environment variable (e.g., `-Xmx8g -Xms8g`).
  - Scale Horizontally: Add more brokers to the cluster to distribute the load.
  - Tune Kafka Settings: Adjust settings like `num.io.threads` and `num.network.threads` in `server.properties` (be cautious, as incorrect tuning can worsen performance).
  - Optimize Topics/Producers/Consumers: Reduce the message rate, optimize message size, or scale consumers.
- Why it works: A responsive broker is crucial for cluster health. An overloaded broker stops answering in time, which the controller and other brokers interpret as a disconnect.
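Raw `jstat -gcutil` output is easy to misread because the column set differs between JDK versions. The sketch below locates the old-generation column (`O`) by its header name instead of its position and flags a high value; the helper names and the 90% threshold in the usage comment are assumptions, not recommended settings.

```shell
# old_gen_usage
# Reads `jstat -gcutil` output on stdin and prints the old-generation
# utilization (column "O") from every sample row.
old_gen_usage() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") col = i; next }
       col     { print $col }'
}

# old_gen_above THRESHOLD
# Exit 0 if the most recent sample on stdin is above THRESHOLD percent.
old_gen_above() {
  old_gen_usage |
    awk -v limit="$1" '{ last = $1 } END { exit !(last + 0 > limit + 0) }'
}

# Usage: take five one-second samples and warn on sustained heap pressure
#   jstat -gcutil <PID> 1000 5 | old_gen_above 90 && echo "old generation above 90%"
```

An old generation that stays near full across samples while the full GC count keeps rising is the pattern that usually justifies a larger heap; a single high reading is not.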
5. Incorrect Broker Configuration (broker.id)
- Diagnosis: Each broker must have a unique `broker.id` in its `server.properties` file. If two brokers have the same ID, they will conflict. Check Kafka logs for errors indicating a duplicate `broker.id` or an inability to register with ZooKeeper.

      # Check server.properties on each broker
      # (anchored so that broker.id.generation.enable is not matched)
      grep '^broker\.id=' /etc/kafka/server.properties

      # Check Kafka logs for ID conflicts
      grep -iE "broker[ .]?id" /var/log/kafka/server.log

- Fix: Ensure each `server.properties` file has a unique, non-negative integer for `broker.id`. After correcting the IDs, restart any brokers that had duplicates.

      # Example: unique IDs for three brokers, one line per broker's server.properties
      broker.id=0   # broker 1
      broker.id=1   # broker 2
      broker.id=2   # broker 3

- Why it works: The `broker.id` is how Kafka uniquely identifies each instance in the cluster. A conflict means the cluster cannot correctly manage its metadata.
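With more than a handful of brokers, comparing IDs by eye gets error-prone. One way to automate the comparison, assuming each broker's `server.properties` has first been copied to one machine, is sketched below; the function names and file names are hypothetical.

```shell
# broker_id_of FILE
# Prints the broker.id value from a server.properties file.
# Commented-out lines and keys such as broker.id.generation.enable are ignored.
broker_id_of() {
  sed -n 's/^[[:space:]]*broker\.id[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" |
    tail -n 1
}

# duplicate_broker_ids FILE...
# Prints every broker.id that appears in more than one of the given files.
duplicate_broker_ids() {
  local f
  for f in "$@"; do
    broker_id_of "$f"
  done | sort -n | uniq -d
}

# Usage (file names are placeholders for copies collected from each broker):
#   duplicate_broker_ids broker1.properties broker2.properties broker3.properties
```

Empty output means no two files share an ID.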
6. Network Interface or IP Address Changes
- Diagnosis: If a broker’s network interface or IP address changes unexpectedly (e.g., due to DHCP or misconfiguration), it can lose its connection to the cluster and its advertised address can become invalid. Check `server.properties` for `advertised.listeners` and compare it with the actual IP address of the broker.

      # Check Kafka's advertised listener in server.properties
      grep "advertised.listeners" /etc/kafka/server.properties

      # Check the actual IP address of the broker
      ip a

- Fix: Ensure `advertised.listeners` reflects the address and port that other brokers and clients should use to connect to this broker. If IPs are assigned dynamically, consider using static IPs or hostnames that resolve correctly. Restart the Kafka broker after correcting the configuration.

      # Example: If the broker's IP is 192.168.1.100
      advertised.listeners=PLAINTEXT://192.168.1.100:9092

- Why it works: Brokers and clients connect to the host and port each broker advertises. If the advertised address is wrong or unreachable, other nodes cannot establish or maintain connections.
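The comparison between the advertised address and the addresses actually present on the machine can be scripted. The sketch below handles only IPv4 literals: hostnames are skipped because they would need a DNS lookup, and IPv6 literals are not parsed. All function names are hypothetical, and `local_addresses` assumes the iproute2 `ip` command.

```shell
# advertised_hosts
# Reads an advertised.listeners value on stdin, e.g.
#   PLAINTEXT://192.168.1.100:9092,SSL://kafka1.example.com:9093
# and prints the host part of each listener on its own line.
advertised_hosts() {
  tr ',' '\n' |
    sed -n 's|^[[:space:]]*[A-Za-z0-9_]*://\([^:/]*\):[0-9]*[[:space:]]*$|\1|p'
}

# local_addresses
# Prints this machine's IP addresses, one per line.
local_addresses() {
  ip -o addr show | awk '{ print $4 }' | cut -d/ -f1
}

# stale_advertised_ips LISTENERS LOCAL_ADDRS
# Prints advertised IPv4 addresses that are not in LOCAL_ADDRS
# (a newline-separated list). Returns 2 if LOCAL_ADDRS is empty.
stale_advertised_ips() {
  [ -n "$2" ] || return 2
  printf '%s\n' "$1" | advertised_hosts |
    grep -E '^[0-9]+(\.[0-9]+){3}$' |
    grep -vxF -e "$2"
}

# Usage:
#   value=$(sed -n 's/^advertised\.listeners=//p' /etc/kafka/server.properties)
#   stale_advertised_ips "$value" "$(local_addresses)"
```

Any address it prints is advertised by the broker but not bound to a local interface. That is expected behind NAT or a load balancer, and a sign of a stale configuration otherwise.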
Once these issues are resolved, the next thing you’ll likely encounter is consumer group rebalancing: partitions re-elect leaders as brokers rejoin, and consumers have to rejoin their groups.