The ZooKeeper client in Kafka has lost its connection to the ZooKeeper ensemble, meaning Kafka brokers can no longer coordinate or discover each other.
Here are the common causes and how to fix them:
1. Network Latency or Packet Loss: ZooKeeper relies on a stable, low-latency network connection. High latency or dropped packets can cause the client to miss heartbeats, leading to session expiry.
- Diagnosis: Use `ping` and `traceroute` from a Kafka broker to each ZooKeeper node. Look for high round-trip times (consistently over 10 ms) or packet loss. Check network interface statistics for errors (`ifconfig eth0` or `ip -s link show eth0`).
- Fix: Address underlying network issues. This might involve optimizing routing, upgrading network hardware, or working with your network team. Ensure brokers and ZooKeeper nodes are on the same low-latency network segment.
- Why it works: Reducing latency and packet loss ensures heartbeats are delivered within ZooKeeper’s timeout window.
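As a complement to `ping`, here is a minimal Python sketch that times TCP connects to a ZooKeeper node and flags results against the 10 ms guideline above. The function names are hypothetical, and connect time only approximates network round-trip time.

```python
import socket
import time


def connect_ms(host, port=2181, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if the attempt fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers DNS failures, refusals, and timeouts alike.
        return None


def summarize(samples, limit_ms=10.0):
    """Reduce a list of samples (None = failed attempt) to loss, average, and a verdict."""
    ok = [s for s in samples if s is not None]
    loss = 1.0 - len(ok) / len(samples) if samples else 0.0
    avg = sum(ok) / len(ok) if ok else None
    # Any failed attempt or an average above the limit counts as unhealthy.
    unhealthy = loss > 0 or avg is None or avg > limit_ms
    return {"loss": loss, "avg_ms": avg, "unhealthy": unhealthy}
```

Run from a broker, something like `summarize([connect_ms("zk1.example.internal") for _ in range(10)])` gives a quick per-node picture; the host name is a placeholder.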
2. Insufficient ZooKeeper Client Threads: The Kafka broker’s ZooKeeper client uses a limited number of threads to manage its connections and requests. If these threads are overwhelmed, requests can back up, leading to missed heartbeats.
- Diagnosis: On the Kafka broker, take a thread dump of the Java process. Look for threads in `java.lang.Thread.State: BLOCKED` or `RUNNABLE` states related to ZooKeeper client operations or I/O.
- Fix: Kafka does not expose a broker setting for the ZooKeeper client thread count, so give the client more headroom instead. Edit `server.properties` and set `zookeeper.connection.timeout.ms` to a higher value (e.g., `30000` or `60000`) and `zookeeper.session.timeout.ms` to a higher value (e.g., `60000` or `120000`). You might also need to adjust `zookeeper.sync.time.ms` if it’s set very low. Restart the Kafka broker.
- Why it works: Longer timeouts give a backed-up client more time to get a heartbeat through before the session expires.
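One detail worth checking when raising the session timeout: the ZooKeeper server negotiates the value and does not accept the client's request as is. By default the result is bounded between 2 and 20 times `tickTime`, unless `minSessionTimeout` or `maxSessionTimeout` is set in `zoo.cfg`. The sketch below illustrates that clamping; the function name is hypothetical.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Return the session timeout the server would grant for a requested value.

    min_ms / max_ms stand in for minSessionTimeout / maxSessionTimeout;
    when unset, the defaults of 2x and 20x tickTime apply.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))
```

With `tickTime=2000`, a broker requesting `60000` would be granted `40000`, so the larger values only take effect if `maxSessionTimeout` is raised on the ZooKeeper side as well.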
3. ZooKeeper Server Overload: If the ZooKeeper ensemble itself is struggling with high load (too many client connections, frequent writes, or long-running leader elections), it might not be able to respond to heartbeats from Kafka brokers in time.
- Diagnosis: On each ZooKeeper server, check the ZooKeeper logs for errors like "Too many connections" or "zxid out of order." Monitor ZooKeeper’s own metrics like `zk_num_alive_connections`, `zk_outstanding_requests`, and `zk_server_state`.
- Fix: Scale up your ZooKeeper ensemble. Add more ZooKeeper nodes to distribute the load. Ensure ZooKeeper servers have sufficient CPU, memory, and fast disk I/O. Optimize `zoo.cfg` on ZooKeeper servers, particularly `tickTime` (e.g., `2000`), `initLimit` (e.g., `10`), and `syncLimit` (e.g., `5`).
- Why it works: A healthy ZooKeeper ensemble can process heartbeats and client requests promptly, maintaining stable connections.
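The metrics named above are reported by ZooKeeper's `mntr` four-letter command as tab-separated key/value lines. The following is a rough sketch of fetching and parsing them, assuming `mntr` is permitted by `4lw.commands.whitelist` on ZooKeeper 3.5 and later. The helper names and the `max_outstanding` threshold are illustrative, not ZooKeeper defaults.

```python
import socket


def fetch_mntr(host, port=2181, timeout=2.0):
    """Send the `mntr` command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse `key<TAB>value` lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def looks_overloaded(stats, max_outstanding=10):
    """Flag a server whose request queue exceeds an illustrative threshold."""
    return int(stats.get("zk_outstanding_requests", "0")) > max_outstanding
```

Calling `looks_overloaded(parse_mntr(fetch_mntr(host)))` for each ensemble member at a regular interval shows whether requests are queueing up on a particular server.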
4. Firewall or Network ACL Issues: Firewalls between Kafka brokers and ZooKeeper nodes, or Access Control Lists (ACLs) on network devices, can intermittently drop or block ZooKeeper’s heartbeat traffic (typically on port 2181).
- Diagnosis: Use `tcpdump` on both the Kafka broker and a ZooKeeper node to observe traffic on port 2181. Look for SYN/ACK packets, or lack thereof, between the client and server. Check firewall logs for dropped packets originating from Kafka broker IPs to ZooKeeper IPs on port 2181.
- Fix: Configure firewalls and network ACLs to explicitly allow TCP traffic on port 2181 between all Kafka brokers and all ZooKeeper nodes.
- Why it works: Ensuring uninterrupted communication on the ZooKeeper port prevents heartbeats from being blocked.
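Before reaching for `tcpdump`, a quick connect probe from the broker can hint at what is happening on the path. As a rule of thumb (not a guarantee), a firewall that silently drops packets shows up as a timeout, while a reachable host with nothing listening, or a firewall configured to reject, shows up as a refusal. A minimal sketch with hypothetical helper names:

```python
import socket


def classify(exc):
    """Map the outcome of a connect attempt (exception or None) to a label."""
    if exc is None:
        return "open"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"   # an RST came back: host reachable, port closed or rejected
    if isinstance(exc, socket.timeout):
        return "timeout"   # no answer at all: often a silent drop on the path
    return "error"         # DNS failure, unreachable network, and so on


def probe_port(host, port=2181, timeout=3.0):
    """Attempt a TCP connect and return 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify(None)
    except OSError as exc:
        return classify(exc)
```

Running `probe_port` from every broker against every ZooKeeper node gives a small matrix that makes a one-sided rule gap easy to spot.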
5. Time Skew Between Nodes: ZooKeeper relies on accurate time synchronization. Significant clock drift between Kafka brokers and ZooKeeper servers can cause heartbeats to be interpreted as late.
- Diagnosis: On each Kafka broker and ZooKeeper server, run `date` and compare the outputs. Use `ntpdate -q <ntp_server>` or `chronyc sources` to check synchronization status.
- Fix: Ensure all Kafka brokers and ZooKeeper servers are synchronized to a reliable NTP (Network Time Protocol) source. Configure `ntpd` or `chronyd` to maintain tight synchronization (within a few milliseconds).
- Why it works: Consistent time across all nodes ensures that the perceived time for heartbeats aligns, preventing false positives for expired sessions.
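Comparing `date` output by eye is coarse. On hosts running chrony, the `System time` line of `chronyc tracking` reports the local clock's offset from NTP time. The sketch below extracts it, assuming the line format shown in the chrony documentation; the function name and sign convention are choices made for this example.

```python
import re

# Matches e.g. "System time     : 0.000250000 seconds slow of NTP time"
_SYSTEM_TIME = re.compile(
    r"System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time"
)


def system_offset_ms(tracking_output):
    """Return the clock offset in milliseconds (negative = behind NTP time).

    Returns None when the expected line is not present.
    """
    match = _SYSTEM_TIME.search(tracking_output)
    if match is None:
        return None
    offset_ms = float(match.group(1)) * 1000.0
    return -offset_ms if match.group(2) == "slow" else offset_ms
```

The input can be collected on each host with `subprocess.run(["chronyc", "tracking"], capture_output=True, text=True).stdout` and the resulting offsets compared across brokers and ZooKeeper servers.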
6. Kafka Broker Resource Starvation: If a Kafka broker is experiencing high CPU, memory, or I/O utilization, its ZooKeeper client threads might not get enough CPU time to send heartbeats or process ZooKeeper responses in a timely manner.
- Diagnosis: Monitor CPU, memory, and I/O utilization on the Kafka broker using tools like `top`, `htop`, `vmstat`, or cloud provider monitoring dashboards. Look for sustained high usage (e.g., >90% CPU, high swap usage).
- Fix: Optimize Kafka broker configurations, increase hardware resources, or reduce the load on the specific broker. This could involve adjusting the number of request handlers, network threads, or disk performance tuning.
- Why it works: Providing sufficient system resources allows the ZooKeeper client threads to execute and maintain their connection.
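For a scriptable first check alongside `top` or `vmstat`, the 1-minute load average can be compared with the core count. This is a rough sketch only: load average is not the same as CPU utilization (on Linux it also counts tasks waiting on I/O), and the `0.9` threshold simply mirrors the 90% figure above. The function names are hypothetical.

```python
import os


def cpu_pressure(load1, cpus, threshold=0.9):
    """Return (load per core, True if it exceeds the threshold)."""
    per_core = load1 / cpus
    return per_core, per_core > threshold


def current_cpu_pressure(threshold=0.9):
    """Evaluate the local host; os.getloadavg() is available on Unix only."""
    load1, _, _ = os.getloadavg()
    return cpu_pressure(load1, os.cpu_count() or 1, threshold)
```

A broker that stays above the threshold across repeated samples is a candidate for the deeper inspection described in the diagnosis step.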
The next error you’ll likely encounter is `org.apache.kafka.common.errors.TimeoutException: Topic <topic_name> not present` as brokers struggle to agree on cluster state.