Kafka brokers can get stuck in a shutdown loop when controlled shutdown is disabled, misconfigured, or unable to complete. The broker never finishes its shutdown process and ends up being restarted repeatedly.

Common Causes and Fixes

  1. controlled.shutdown.enable is false on some brokers:

    • Diagnosis: On each broker, check $KAFKA_HOME/config/server.properties for controlled.shutdown.enable. The default is true, so a broker without the property is already covered. Any broker that has it explicitly set to false skips controlled shutdown entirely.
    • Fix: Set controlled.shutdown.enable=true (or remove the override) on all brokers that you intend to shut down gracefully.
    • Why it works: When controlled.shutdown.enable is true, a broker asks the controller to move leadership of its partitions to other in-sync replicas before it stops. When it is false, the broker simply stops, and the partitions it led stay unavailable until the controller notices the failure and elects new leaders, which causes errors for clients in the meantime.
  2. Zookeeper session expiration during shutdown:

    • Diagnosis: Check Kafka broker logs (e.g., $KAFKA_HOME/logs/server.log) for messages like Zookeeper session expired or Connection closed by Zookeeper. These indicate that the broker's Zookeeper session ended during the shutdown attempt.
    • Fix:
      • Increase zookeeper.session.timeout.ms in server.properties to 30000 (30 seconds). The default is 18000 in Kafka 2.5 and later, and 6000 in older releases.
      • Make sure zookeeper.connection.timeout.ms is also long enough. If it is not set, the value of zookeeper.session.timeout.ms is used.
      • Check network connectivity between Kafka brokers and Zookeeper ensemble.
    • Why it works: A longer session timeout gives the broker more time to finish its partition leadership transfer before Zookeeper considers the session dead.
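  3. Controlled shutdown retries exhausted (controlled.shutdown.max.retries and controlled.shutdown.retry.backoff.ms):

    • Diagnosis: Observe logs for messages indicating that controlled shutdown attempts failed or were retried, followed by a warning that the broker is proceeding with an unclean shutdown. Kafka has no controlled.shutdown.max.wait.ms setting. The wait is bounded by controlled.shutdown.max.retries (default 3) and controlled.shutdown.retry.backoff.ms (default 5000, i.e. 5 seconds between attempts).
    • Fix: Increase controlled.shutdown.max.retries in server.properties, e.g., to 10, and if needed raise controlled.shutdown.retry.backoff.ms, e.g., to 10000 (10 seconds), depending on cluster size and load.
    • Why it works: These settings dictate how many times the shutting-down broker asks the controller to move leadership of its partitions to other brokers, and how long it waits between attempts. If the retry budget is too small, the broker gives up on the controlled shutdown before the leadership transfer is complete.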
  4. Network partitions or firewall issues preventing Zookeeper communication:

    • Diagnosis: Use netcat or telnet from the Kafka broker to the Zookeeper ensemble’s ports (default 2181). For example: nc -vz <zookeeper_host> 2181. Check broker logs for repeated NoRouteToHostException or Connection refused errors related to Zookeeper.
    • Fix: Ensure proper network routes and firewall rules allow persistent, low-latency communication between all Kafka brokers and all Zookeeper ensemble members on Zookeeper’s client port.
    • Why it works: Controlled shutdown relies heavily on Zookeeper for coordinating partition leadership. If brokers cannot reliably communicate with Zookeeper, they cannot signal their intention to shut down or transfer leadership.
  5. Under-replicated partitions or unbalanced leadership:

    • Diagnosis: Use kafka-topics.sh --bootstrap-server <broker_list> --describe --under-replicated-partitions to identify any under-replicated partitions. Also, examine kafka-topics.sh --bootstrap-server <broker_list> --describe output to see if leadership is heavily concentrated on a few brokers.
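    • Fix: Address any under-replicated partitions first by ensuring replicas are healthy and in sync. Then trigger a preferred leader election with kafka-leader-election.sh --bootstrap-server <broker_list> --election-type preferred --all-topic-partitions (available in Kafka 2.4 and later) to move leadership back to the preferred replicas. Use kafka-reassign-partitions.sh only if the replica assignment itself is uneven, since that tool moves replicas rather than just leadership.
    • Why it works: Controlled shutdown can only hand leadership to a replica that is in sync. If a broker is shutting down and holds leadership for many partitions, especially if replicas are unhealthy, it becomes difficult and time-consuming to transfer leadership to available, healthy brokers, and a partition whose only in-sync replica is the shutting-down broker cannot be moved at all. Rebalancing leadership beforehand can prevent this bottleneck.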
  6. Broker is still actively serving requests when shutdown command is issued:

    • Diagnosis: Check how busy the broker is when the shutdown command is issued, using the request log (kafka-request.log, if request logging is enabled) or the broker's request-rate metrics for produce and fetch requests. A heavily loaded broker might not be able to stop serving requests and transfer leadership promptly.
    • Fix: Issue the shutdown command during a period of low cluster activity. If replication traffic (for example from an in-progress partition reassignment) is competing with client traffic, limit it with replication quotas: set leader.replication.throttled.rate on the brokers and leader.replication.throttled.replicas on the affected topics, using kafka-configs.sh. These quotas limit replication bandwidth. They do not throttle leader elections, and Kafka has no setting that does.
    • Why it works: A less busy broker keeps its followers in sync and can hand over leadership quickly, so the controlled shutdown completes within its retry budget and the shutdown process proceeds more smoothly.
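
The checks for causes 1 and 4 above can be scripted. The following Python sketch is one possible way to do it, not an official Kafka tool. It reads a broker's server.properties, reports whether controlled.shutdown.enable is effectively true (a missing key counts as enabled, matching Kafka's default), and tries a TCP connection to every host listed in zookeeper.connect, roughly like nc -vz. The function names (parse_properties, preflight, and so on) are hypothetical, the parser only understands plain key=value lines, and IPv6 literals in zookeeper.connect are not handled.

```python
import socket
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def controlled_shutdown_enabled(props):
    # Kafka defaults controlled.shutdown.enable to true, so a missing key is fine.
    return props.get("controlled.shutdown.enable", "true").lower() == "true"


def zookeeper_endpoints(props):
    """Split zookeeper.connect (host:port,host:port/chroot) into (host, port) pairs."""
    hosts = props.get("zookeeper.connect", "").split("/", 1)[0]
    endpoints = []
    for item in hosts.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        if host:
            endpoints.append((host, int(port)))
        else:
            # No port given: assume Zookeeper's default client port.
            endpoints.append((item, 2181))
    return endpoints


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(properties_path, timeout=3.0):
    """Return a list of human-readable problems found for one broker."""
    props = parse_properties(Path(properties_path).read_text())
    problems = []
    if not controlled_shutdown_enabled(props):
        problems.append("controlled.shutdown.enable is false")
    endpoints = zookeeper_endpoints(props)
    if not endpoints:
        problems.append("zookeeper.connect is not set")
    for host, port in endpoints:
        if not port_open(host, port, timeout):
            problems.append(f"cannot reach Zookeeper at {host}:{port}")
    return problems
```

Calling preflight with the path to a broker's server.properties returns an empty list when both checks pass. A successful TCP connection only shows that the port is reachable, not that the Zookeeper ensemble is healthy, so treat the result as a first filter before reading the broker logs.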

Even when every broker completes its controlled shutdown successfully, client applications that were actively communicating with a broker during the shutdown phase will likely log java.io.IOException: Broken pipe errors, because their established connections are closed. Clients normally recover by reconnecting to the new partition leaders.
