The Kafka GroupCoordinatorNotAvailable error means a consumer is trying to join or rebalance a consumer group, but the broker responsible for managing that group is unreachable or not ready. This usually happens because the coordinator broker itself is down, overloaded, or being restarted, and the consumers can’t find a new one quickly enough.
Here are the common reasons and how to fix them:
1. Coordinator Broker is Down or Unresponsive
- Diagnosis: Check the status of your Kafka brokers. Look for brokers that are not registered with ZooKeeper or are not responding to requests.

      # On a Kafka broker, check ZooKeeper registration
      /opt/kafka/bin/zookeeper-shell.sh <zookeeper_host>:2181 ls /brokers/ids

      # On a Kafka broker, check logs for connection errors
      tail -f /opt/kafka/logs/server.log | grep -i "connection refused\|failed to connect"

- Fix: If a broker is down, restart it. Ensure its configuration (`server.properties`) is correct, especially `advertised.listeners` and `zookeeper.connect`.

      # Example: Restarting a Kafka broker service
      sudo systemctl restart kafka

- Why it works: A healthy coordinator broker is essential for managing consumer groups. Restarting it brings it back into service, allowing it to resume its coordinator duties.
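
To know which broker to look at first, it helps to work out which partition of the internal `__consumer_offsets` topic a group maps to, since the leader of that partition serves as the group's coordinator. The sketch below assumes the mapping is the absolute value of the Java string hash of the group ID, modulo `offsets.topic.num.partitions` (assumed to be at its default of 50), and that the group ID is plain ASCII. The function name `coordinator_partition` is made up for this example.

```shell
#!/bin/sh
# Sketch: compute the __consumer_offsets partition for a group ID.
# Assumption: partition = abs(groupId.hashCode()) % offsets.topic.num.partitions
# Assumption: ASCII-only group ID (Java hashes UTF-16 code units).

coordinator_partition() {
  group_id=$1
  partitions=${2:-50}   # assumed default for offsets.topic.num.partitions
  hash=0
  rest=$group_id
  while [ -n "$rest" ]; do
    char=${rest%"${rest#?}"}              # first character of $rest
    rest=${rest#?}                        # drop that character
    code=$(printf '%d' "'$char")          # numeric value of the character
    # Java's String.hashCode: h = 31*h + c, truncated to 32 bits
    hash=$(( (hash * 31 + code) & 4294967295 ))
  done
  # Reinterpret the 32-bit value as a signed integer
  if [ "$hash" -ge 2147483648 ]; then hash=$((hash - 4294967296)); fi
  # Integer.MIN_VALUE has no positive counterpart; treat it as 0
  if [ "$hash" -eq -2147483648 ]; then hash=0; fi
  if [ "$hash" -lt 0 ]; then hash=$((-hash)); fi
  echo $((hash % partitions))
}

# Usage: coordinator_partition <your_group_id> [partition_count]
```

With the partition number in hand, `kafka-topics.sh --bootstrap-server <broker_host>:9092 --describe --topic __consumer_offsets` lists the leader for each partition. That leader is the broker to inspect.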
2. ZooKeeper Issues Affecting Broker Registration
- Diagnosis: In ZooKeeper-based clusters, brokers register themselves in ZooKeeper, and the controller relies on it to elect partition leaders. That includes the leader of the `__consumer_offsets` partition that acts as a group’s coordinator. If ZooKeeper is unhealthy or experiencing network issues, brokers might not register properly, or the coordinator might not be discoverable.

      # Check ZooKeeper status and logs
      sudo systemctl status zookeeper
      tail -f /opt/zookeeper/logs/zookeeper.log

- Fix: Ensure ZooKeeper is running, healthy, and accessible from all Kafka brokers. If ZooKeeper is the problem, restart it and check its network connectivity and disk I/O.

      sudo systemctl restart zookeeper

- Why it works: ZooKeeper acts as Kafka’s registry. If brokers can’t talk to ZooKeeper, they can’t register themselves, no coordinator can be assigned for the group, and consumers won’t know which broker to contact.
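
The registration check can be scripted by comparing the broker IDs you expect against what ZooKeeper reports. The sketch below assumes `zookeeper-shell.sh ... ls /brokers/ids` prints the registered IDs on a line of the form `[0, 1, 2]`, possibly after some connection chatter. The function name `missing_brokers` and the expected IDs are placeholders for this example.

```shell
#!/bin/sh
# Sketch: report expected broker IDs that are absent from /brokers/ids.
# Reads zookeeper-shell output on stdin; expected IDs are the arguments.

missing_brokers() {
  # Take the last line shaped like "[0, 1, 2]" and turn it into "0 1 2".
  registered=$(grep -E '^\[.*\]$' | tail -n 1 | sed 's/[][ ]//g; s/,/ /g')
  missing=""
  for id in "$@"; do
    case " $registered " in
      *" $id "*) ;;                       # broker is registered
      *) missing="$missing $id" ;;        # broker is absent
    esac
  done
  echo "${missing# }"
  [ -z "$missing" ]                       # exit status 0 only if none missing
}

# Usage (expected IDs 0, 1 and 2 are an assumption):
#   /opt/kafka/bin/zookeeper-shell.sh <zookeeper_host>:2181 ls /brokers/ids \
#     | missing_brokers 0 1 2
```

If ZooKeeper cannot be reached at all, no ID list is printed, so every expected broker is reported as missing.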
3. High Broker Load or Network Saturation
- Diagnosis: A heavily loaded broker, especially the one acting as the group coordinator, might be slow to respond to consumer group join requests or rebalance requests. Monitor CPU, memory, network I/O, and disk I/O on your brokers. Also, check Kafka broker logs for messages indicating high latency or I/O wait times.

      # Example using top/htop on a broker
      top -bn1 | grep "Cpu(s)\|Mem\|Tasks"

      # Check network traffic
      iftop -i eth0

- Fix:
  - Scale your Kafka cluster: Add more brokers to distribute the load.
  - Optimize Kafka configuration: Tune `num.io.threads`, `num.network.threads`, and `message.max.bytes` based on your workload.
  - Optimize consumer configuration: Increase `max.poll.interval.ms` (or lower `max.poll.records`) if consumers are taking too long to process messages, or increase `session.timeout.ms` if network latency is high.

      # server.properties example tuning
      num.io.threads=16
      num.network.threads=8

- Why it works: Reducing the load on the coordinator broker or improving network throughput allows it to process join/rebalance requests promptly, preventing timeouts.
4. Incorrect group.initial.rebalance.delay.ms Setting
- Diagnosis: This broker setting in `server.properties` controls how long the group coordinator waits for more consumers to join a new, empty group before it performs the first rebalance. If it’s set too low, consumers that start at slightly different times trigger a series of back-to-back rebalances. If it’s too high, it delays the first partition assignment for every new group.

      # Check server.properties on your brokers
      grep group.initial.rebalance.delay.ms /opt/kafka/config/server.properties

- Fix: The default is 3 seconds (3000 ms), and a low value is usually fine in normal operation. Experiment with values like 5000 ms or 10000 ms if you’re seeing issues during rapid consumer scaling or broker restarts. Remember to restart brokers after changing this setting.

      # Example: Increase delay to allow more consumers to join before rebalancing
      group.initial.rebalance.delay.ms=10000

- Why it works: A longer initial delay gives consumers time to join before the coordinator computes the first assignment, so the group settles in one rebalance instead of several. That keeps the coordinator less busy during the initial join phase, which reduces the chances of `GroupCoordinatorNotAvailable` errors.
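
A plain `grep` prints nothing when the setting is absent, which makes it easy to forget that the broker is then running with the default. The sketch below prints the effective value instead, assuming the usual Java properties behavior where the last uncommented assignment wins. The function name `effective_property` is made up for this example, and the fallback of 3000 is the default mentioned above.

```shell
#!/bin/sh
# Sketch: print the effective value of a key in a .properties file,
# falling back to a supplied default when the key is unset.

effective_property() {
  key=$1
  file=$2
  default=$3
  value=$(awk -v key="$key" '
    /^[[:space:]]*[#!]/ { next }                     # skip comment lines
    index($0, "=") == 0 { next }                     # skip lines without "="
    {
      k = substr($0, 1, index($0, "=") - 1)          # text before the first "="
      gsub(/[[:space:]]/, "", k)
    }
    k == key { v = substr($0, index($0, "=") + 1); found = 1 }
    END {
      if (found) {
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", v)   # trim the value
        print v
      }
    }
  ' "$file")
  echo "${value:-$default}"
}

# Usage:
#   effective_property group.initial.rebalance.delay.ms \
#     /opt/kafka/config/server.properties 3000
```

Running this on each broker shows whether the brokers agree on the setting.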
5. Incorrect Broker Listener Configuration
- Diagnosis: The `listeners` and `advertised.listeners` settings in `server.properties` must be correctly configured for brokers to communicate with each other and for clients to reach them. If these are misconfigured, consumers might not be able to reach the designated coordinator.

      # Check server.properties on your brokers (matches both settings)
      grep listeners /opt/kafka/config/server.properties

- Fix: Ensure `listeners` defines the interface and port the broker listens on, and `advertised.listeners` defines the address that clients and other brokers should use to connect to this broker. For example, if brokers are in different subnets or behind NAT, `advertised.listeners` must point to an address that is reachable. Restart brokers after making changes.

      # Example for a broker with a specific IP
      listeners=PLAINTEXT://0.0.0.0:9092
      advertised.listeners=PLAINTEXT://192.168.1.100:9092

- Why it works: Correctly configured listeners ensure that Kafka brokers can discover each other and that clients can establish connections to the correct broker addresses, including the group coordinator.
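
The advertised addresses are the ones clients actually dial, so it is worth extracting them and testing each one from a consumer machine. The sketch below splits an `advertised.listeners` value into host and port pairs. The function name `listener_endpoints` and the second listener in the usage example are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: turn an advertised.listeners value into "host port" lines.

listener_endpoints() {
  # $1 looks like: PLAINTEXT://192.168.1.100:9092,SSL://broker1.example.com:9093
  echo "$1" \
    | tr ',' '\n' \
    | sed -E 's#^[^:]+://##; s#:([0-9]+)$# \1#'   # drop "NAME://", split host and port
}

# Usage from a consumer machine:
#   listener_endpoints "PLAINTEXT://192.168.1.100:9092" \
#     | while read -r host port; do nc -vz "$host" "$port"; done
```

If a pair cannot be reached from where the consumers run, the advertised address is the first thing to correct.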
6. Network Partition or Firewall Issues
- Diagnosis: Network connectivity problems between consumers and brokers, or between brokers themselves, can lead to the group coordinator appearing unavailable. Check firewall rules on both consumer and broker machines.

      # From a consumer machine, try to connect to the broker's listener port
      nc -vz <broker_host> <broker_port>

      # From a broker machine, try to connect to ZooKeeper and other brokers
      nc -vz <zookeeper_host> 2181
      nc -vz <other_broker_host> 9092

- Fix: Open necessary ports (e.g., 9092 for Kafka, 2181 for ZooKeeper) in firewalls. Resolve any network routing issues.

      # Example: Allow traffic on port 9092 (ufw firewall)
      sudo ufw allow 9092/tcp

- Why it works: The `GroupCoordinatorNotAvailable` error often stems from a fundamental inability to communicate. Ensuring network paths are open and stable allows consumers and brokers to connect and interact as expected.
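
With more than a couple of brokers, the individual `nc` checks are easier to run as a loop. The sketch below probes a list of `host:port` endpoints and reports which ones fail. The function name `check_endpoints` and the `PROBE` variable are made up for this example, and the default probe assumes an `nc` that supports `-z` and `-w`.

```shell
#!/bin/sh
# Sketch: probe several host:port endpoints and summarize the result.
# PROBE can be overridden, e.g. to dry-run the loop without a network.
PROBE=${PROBE:-"nc -z -w 3"}

check_endpoints() {
  failed=0
  for endpoint in "$@"; do
    host=${endpoint%:*}                   # text before the last ":"
    port=${endpoint##*:}                  # text after the last ":"
    # $PROBE is intentionally unquoted so its flags are split into words
    if $PROBE "$host" "$port" >/dev/null 2>&1; then
      echo "OK   $endpoint"
    else
      echo "FAIL $endpoint"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]                     # exit status 0 only if all succeeded
}

# Usage:
#   check_endpoints <broker_host>:9092 <other_broker_host>:9092 <zookeeper_host>:2181
```

Run it from both a consumer machine and a broker machine, since firewall rules often differ between the two paths.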
7. Consumer Group Lag or Stuck Consumers
- Diagnosis: While less direct, a consumer group that is extremely lagged or has consumers that are stuck processing messages can indirectly contribute to rebalance issues. If consumers fail to send heartbeats to the group coordinator within `session.timeout.ms`, they are considered dead, triggering a rebalance. If this happens frequently or to many consumers, it can destabilize the group.

      # Check consumer group status and lag
      /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <broker_host>:9092 --describe --group <your_group_id>

- Fix: Address the root cause of consumer lag:
  - Increase consumer parallelism (add more consumer instances).
  - Optimize consumer processing logic.
  - Ensure `max.poll.records` is not set too high.
  - Increase `session.timeout.ms` and `heartbeat.interval.ms` if network latency is high, keeping the heartbeat interval at no more than one third of the session timeout. Be careful not to make them too large, as this delays detection of truly dead consumers.

      # consumer.properties example tuning
      session.timeout.ms=30000
      heartbeat.interval.ms=10000
      max.poll.records=500

- Why it works: Healthy, responsive consumers send regular heartbeats, indicating to the coordinator that they are alive. This prevents unnecessary rebalances caused by premature consumer timeouts.
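
To track lag over time, the `--describe` output can be reduced to a single number. The sketch below sums the `LAG` column, locating it from the header row because the column layout is assumed to vary between Kafka versions. Rows showing `-` (no committed offset) are skipped. The function name `total_lag` is made up for this example.

```shell
#!/bin/sh
# Sketch: sum the LAG column of `kafka-consumer-groups.sh --describe` output.

total_lag() {
  awk '
    # Until the header is seen, look for the column titled LAG.
    lag_col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    # Data rows: add numeric lag values, ignore "-" and repeated headers.
    $lag_col ~ /^[0-9]+$/ { total += $lag_col }
    END { print total + 0 }
  '
}

# Usage:
#   /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <broker_host>:9092 \
#     --describe --group <your_group_id> | total_lag
```

A total that keeps growing between runs suggests consumers are falling behind and are more likely to miss their poll or heartbeat deadlines.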
The next error you might encounter after fixing GroupCoordinatorNotAvailable is LeaderNotAvailable, which indicates a problem with partition leadership, often related to broker availability or ZooKeeper state.