The Kafka GroupCoordinatorNotAvailable error means a consumer is trying to find, join, or rebalance its consumer group, but the broker that acts as the group’s coordinator is unreachable or not ready. Newer Kafka versions report the same condition as COORDINATOR_NOT_AVAILABLE. It usually appears when the coordinator broker is down, overloaded, or restarting, and coordination has not yet moved to another broker.

Here are the common reasons and how to fix them:

1. Coordinator Broker is Down or Unresponsive

  • Diagnosis: Check the status of your Kafka brokers. Look for brokers that are not registered with ZooKeeper or are not responding to requests.
    # On a Kafka broker, check ZooKeeper registration
    /opt/kafka/bin/zookeeper-shell.sh <zookeeper_host>:2181 ls /brokers/ids
    # On a Kafka broker, check logs for connection errors
    tail -f /opt/kafka/logs/server.log | grep -i "connection refused\|failed to connect"
    
  • Fix: If a broker is down, restart it. Ensure its configuration (server.properties) is correct, especially advertised.listeners and zookeeper.connect.
    # Example: Restarting a Kafka broker service
    sudo systemctl restart kafka
    
  • Why it works: A healthy coordinator broker is essential for managing consumer groups. Restarting it brings it back into service, allowing it to resume its coordinator duties.
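
The registration check above can be scripted so that a missing broker stands out. This is a minimal sketch, not a Kafka tool: missing_brokers is a hypothetical helper, ZK_HOST is a placeholder variable, and the sketch assumes the last line printed by zookeeper-shell.sh for ls /brokers/ids is a bracketed list such as [0, 1, 2].

```shell
# Print each expected broker ID that is absent from the ZooKeeper reply.
# $1: space-separated expected IDs, e.g. "0 1 2"
# $2: the reply line from "ls /brokers/ids", e.g. "[0, 2]"
missing_brokers() {
    # Turn "[0, 2]" into a whitespace-separated list of IDs.
    registered=$(printf '%s' "$2" | sed 's/[][,]/ /g')
    for id in $1; do
        found=no
        for reg in $registered; do
            if [ "$reg" = "$id" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            echo "$id"
        fi
    done
    return 0
}

# Example (assumes brokers 0, 1 and 2 are expected; ZK_HOST is hypothetical):
#   reply=$(/opt/kafka/bin/zookeeper-shell.sh "$ZK_HOST:2181" ls /brokers/ids | tail -n 1)
#   missing_brokers "0 1 2" "$reply"
```

Any ID it prints is a broker to inspect first, since a group whose coordinator lived on that broker cannot be served until coordination moves elsewhere or the broker returns.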

2. ZooKeeper Issues Affecting Broker Registration

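  • Diagnosis: On ZooKeeper-based clusters, brokers register themselves in ZooKeeper, and the controller relies on it to track broker liveness and elect partition leaders. The coordinator for a group is the broker that leads one partition of the internal __consumer_offsets topic, so if ZooKeeper is unhealthy or experiencing network issues, brokers might not register properly, or a new coordinator might not be elected after a failure.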
    # Check ZooKeeper status and logs
    sudo systemctl status zookeeper
    tail -f /opt/zookeeper/logs/zookeeper.log
    
  • Fix: Ensure ZooKeeper is running, healthy, and accessible from all Kafka brokers. If ZooKeeper is the problem, restart it and check its network connectivity and disk I/O.
    sudo systemctl restart zookeeper
    
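  • Why it works: ZooKeeper acts as Kafka’s registry. If brokers can’t talk to ZooKeeper, they can’t register themselves, and the cluster can’t assign leadership for the partition that determines the coordinator, so consumers asking for their group’s coordinator get no usable answer.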

3. High Broker Load or Network Saturation

  • Diagnosis: A heavily loaded broker, especially the one acting as the group coordinator, might be slow to respond to consumer group join requests or rebalance requests. Monitor CPU, memory, network I/O, and disk I/O on your brokers.
    # Example using top/htop on a broker
    top -bn1 | grep "Cpu(s)\|Mem\|Tasks"
    # Check network traffic
    iftop -i eth0
    
    Also, check Kafka broker logs for messages indicating high latency or I/O wait times.
  • Fix:
    • Scale your Kafka cluster: Add more brokers to distribute the load.
    • Optimize Kafka configuration: Tune num.io.threads, num.network.threads, and message.max.bytes based on your workload.
    • Optimize consumer configuration: Increase max.poll.interval.ms (or lower max.poll.records) if consumers are taking too long to process messages, or increase session.timeout.ms if network latency is high.
    # server.properties example tuning
    num.io.threads=16
    num.network.threads=8
    
  • Why it works: Reducing the load on the coordinator broker or improving network throughput allows it to process join/rebalance requests promptly, preventing timeouts.

4. Incorrect group.initial.rebalance.delay.ms Setting

  • Diagnosis: This setting in server.properties determines how long the group coordinator waits for more consumers to join a new, empty group before it performs the first rebalance. If it’s set too low, consumers that start at about the same time can trigger several rebalances in a row. If it’s too high, it delays the first partition assignment.
    # Check server.properties on your brokers
    grep group.initial.rebalance.delay.ms /opt/kafka/config/server.properties
    
  • Fix: For initial setup or after significant cluster changes, a slightly higher value might be beneficial. For normal operation, a lower value is usually fine. The default is 3 seconds (3000 ms). Experiment with values like 5000 ms or 10000 ms if you’re seeing issues during rapid consumer scaling or broker restarts.
    # Example: Increase delay to allow more consumers to join before rebalancing
    group.initial.rebalance.delay.ms=10000
    
    Remember to restart brokers after changing this setting.
  • Why it works: A longer delay gives consumers that start together time to join before the coordinator performs the first rebalance, so the group settles in one rebalance instead of several. That reduces churn on the coordinator during the initial join phase, although it does not help if the coordinator itself is unreachable.
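
Because the setting may be absent from server.properties, in which case the broker uses the 3000 ms default mentioned above, a plain grep can come back empty. The sketch below prints the effective value instead. get_prop is a hypothetical helper, and the file path is the one assumed throughout this article.

```shell
# Print the value of a key from a properties file, or a fallback if it is unset.
# $1: properties file, $2: key, $3: fallback value
get_prop() {
    file=$1 key=$2 fallback=$3
    # Escape dots so the key is matched literally by sed.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Commented-out lines do not match; the last definition wins.
    value=$(sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1)
    echo "${value:-$fallback}"
}

# Example:
#   get_prop /opt/kafka/config/server.properties group.initial.rebalance.delay.ms 3000
```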

5. Incorrect Broker Listener Configuration

  • Diagnosis: The listeners and advertised.listeners in server.properties must be correctly configured for brokers to communicate with each other and for clients to reach them. If these are misconfigured, consumers might not be able to find the designated coordinator.
    # Check server.properties on your brokers
    grep -E '^(advertised\.)?listeners=' /opt/kafka/config/server.properties
    
  • Fix: Ensure listeners defines the interface and port the broker listens on, and advertised.listeners defines the address that clients and other brokers should use to connect to this broker. For example, if brokers are in different subnets or behind NAT, advertised.listeners must point to an address that is reachable from the client’s side.
    # Example for a broker with a specific IP
    listeners=PLAINTEXT://0.0.0.0:9092
    advertised.listeners=PLAINTEXT://192.168.1.100:9092
    
    Restart brokers after making changes.
  • Why it works: Correctly configured listeners ensure that Kafka brokers can discover each other and that clients can establish connections to the correct broker addresses, including the group coordinator.
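
One quick sanity check is to flag advertised addresses that other machines are unlikely to reach. The following is a rough sketch under stated assumptions: suspicious_advertised is a hypothetical helper, it only looks for wildcard and loopback hosts, and it does not handle IPv6 literals. A localhost entry can be perfectly valid on a single-machine setup.

```shell
# Print advertised listeners whose host is a wildcard or loopback address.
# $1: the advertised.listeners value,
#     e.g. "PLAINTEXT://192.168.1.100:9092,INTERNAL://localhost:9093"
suspicious_advertised() {
    old_ifs=$IFS
    IFS=,
    for listener in $1; do
        hostport=${listener#*://}   # drop the listener name and "://"
        host=${hostport%:*}         # drop the trailing ":port"
        case $host in
            0.0.0.0|localhost|127.*) echo "$listener" ;;
        esac
    done
    IFS=$old_ifs
}

# Example:
#   suspicious_advertised "PLAINTEXT://localhost:9092"
```

If it prints anything on a multi-host cluster, consumers may be handed a coordinator address that resolves to their own machine rather than to the broker.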

6. Network Partition or Firewall Issues

  • Diagnosis: Network connectivity problems between consumers and brokers, or between brokers themselves, can lead to the group coordinator appearing unavailable.
    # From a consumer machine, try to connect to the broker's listener port
    nc -vz <broker_host> <broker_port>
    # From a broker machine, try to connect to ZooKeeper and other brokers
    nc -vz <zookeeper_host> 2181
    nc -vz <other_broker_host> 9092
    
    Check firewall rules on both consumer and broker machines.
  • Fix: Open necessary ports (e.g., 9092 for Kafka, 2181 for ZooKeeper) in firewalls. Resolve any network routing issues.
    # Example: Allow traffic on port 9092 (ufw firewall)
    sudo ufw allow 9092/tcp
    
  • Why it works: The GroupCoordinatorNotAvailable error often stems from a fundamental inability to communicate. Ensuring network paths are open and stable allows consumers and brokers to connect and interact as expected.
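
When there are several brokers and ZooKeeper nodes, the nc checks above can be wrapped in a loop. This is a sketch only: check_endpoints and the PROBE variable are hypothetical names, and the default probe assumes an nc build that supports -z and -w.

```shell
# Probe each host:port in a comma-separated list and report reachability.
# Returns non-zero if any endpoint fails.
# PROBE may be set to a different command that takes "host port" arguments.
check_endpoints() {
    endpoints=$1
    probe="${PROBE:-nc -z -w 3}"
    status=0
    old_ifs=$IFS
    IFS=,
    for ep in $endpoints; do
        IFS=$old_ifs            # restore normal splitting inside the loop
        host=${ep%:*}
        port=${ep##*:}
        if $probe "$host" "$port" >/dev/null 2>&1; then
            echo "OK   $host:$port"
        else
            echo "FAIL $host:$port"
            status=1
        fi
    done
    IFS=$old_ifs
    return $status
}

# Example (hostnames are placeholders):
#   check_endpoints "broker1:9092,broker2:9092,zk1:2181"
```

Running it from both a consumer machine and a broker helps separate client-side firewall problems from broker-to-broker ones.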

7. Consumer Group Lag or Stuck Consumers

  • Diagnosis: While less direct, a consumer group that is extremely lagged or has consumers stuck processing messages can contribute to rebalance issues. If a consumer sends no heartbeat to the group coordinator within session.timeout.ms, or goes longer than max.poll.interval.ms between poll() calls, it is removed from the group, which triggers a rebalance. If this happens frequently or to many consumers, it can destabilize the group.
    # Check consumer group status and lag
    /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <broker_host>:9092 --describe --group <your_group_id>
    
  • Fix: Address the root cause of consumer lag:
    • Increase consumer parallelism (add more consumer instances).
    • Optimize consumer processing logic.
    • Ensure max.poll.records is not set too high.
    • Increase session.timeout.ms if network latency is high, and keep heartbeat.interval.ms at no more than one third of it. Be careful not to make them too large, as this delays detection of truly dead consumers.
    # consumer.properties example tuning
    session.timeout.ms=30000
    heartbeat.interval.ms=10000
    max.poll.records=500
    
  • Why it works: Healthy, responsive consumers send regular heartbeats, indicating to the coordinator that they are alive. This prevents unnecessary rebalances caused by premature consumer timeouts.
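
The relationship between the two timeouts is easy to get wrong when tuning them separately. As a small sketch, the hypothetical check_heartbeat helper below applies the usual guideline that heartbeat.interval.ms should be no more than one third of session.timeout.ms.

```shell
# Succeed if the heartbeat interval is at most one third of the session timeout.
# $1: session.timeout.ms, $2: heartbeat.interval.ms
check_heartbeat() {
    session_ms=$1 heartbeat_ms=$2
    if [ $((heartbeat_ms * 3)) -le "$session_ms" ]; then
        echo "ok: heartbeat ${heartbeat_ms} ms, session ${session_ms} ms"
    else
        echo "warn: heartbeat.interval.ms should be <= $((session_ms / 3)) ms"
        return 1
    fi
}

# Example, using the values from the tuning snippet above:
#   check_heartbeat 30000 10000
```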

The next error you might encounter after fixing GroupCoordinatorNotAvailable is LeaderNotAvailable, which indicates a problem with partition leadership, often related to broker availability or ZooKeeper state.
