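RebalanceInProgressException means a consumer tried to perform a group operation, most often an offset commit, while its consumer group was already in the middle of a rebalance. The group coordinator (a broker) rejects the request, and that’s a problem because uncommitted offsets can lead to duplicate processing or inconsistent state unless the consumer rejoins the group and retries.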

It most often shows up when group membership changes while consumers are still working, for example when a consumer restarts shortly after the coordinator has detected its failure. The coordinator starts a rebalance to reassign partitions, and until every member has rejoined, requests from consumers that still hold the old assignment are rejected.

Here are the common culprits and how to fix them:

  1. Consumer max.poll.interval.ms too short: This is the most frequent offender. If a consumer takes longer than this interval between two poll() calls, the client considers it failed, the consumer leaves the group, and a rebalance follows.

    • Diagnosis: Check your consumer logs for RebalanceInProgressException. Then, look at your consumer application’s processing time per poll() call. If it’s consistently close to or exceeding max.poll.interval.ms, that’s your problem.
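    • Fix: Increase max.poll.interval.ms so it comfortably exceeds your worst-case processing time per batch. The default is 300000 (5 minutes); if you have lowered it and your processing takes 30 seconds, set it to 60000 (60 seconds). Lowering max.poll.records also shortens each batch.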
      # In your consumer configuration
      max.poll.interval.ms=60000
      
    • Why it works: This gives your consumers more leeway to process batches of records without being considered failed and leaving the group, preventing premature rebalances.
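
The sizing in item 1 comes down to simple arithmetic, sketched below. This is a minimal illustration and not part of any Kafka API: the class PollBudget, its method names, and the 2x headroom factor (borrowed from the 30-second to 60-second example above) are assumptions. max.poll.records is the standard consumer setting that caps how many records a single poll() returns.

```java
/** Hypothetical helper: estimates whether one batch fits inside max.poll.interval.ms. */
final class PollBudget {
    private PollBudget() {}

    /** Worst-case time to process one poll() batch, assuming records are handled one by one. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return Math.multiplyExact((long) maxPollRecords, perRecordMillis);
    }

    /** Suggested max.poll.interval.ms: worst-case batch time with 2x headroom (an assumption). */
    static long suggestedMaxPollIntervalMs(int maxPollRecords, long perRecordMillis) {
        return Math.multiplyExact(worstCaseBatchMillis(maxPollRecords, perRecordMillis), 2L);
    }

    public static void main(String[] args) {
        // Example: 500 records per poll at 60 ms each is a 30-second batch.
        long batch = worstCaseBatchMillis(500, 60);
        long suggested = suggestedMaxPollIntervalMs(500, 60);
        System.out.println("worst-case batch ms = " + batch);
        System.out.println("max.poll.interval.ms=" + suggested);
    }
}
```

If the suggested value looks uncomfortably large, reducing max.poll.records is the other lever: a smaller batch needs less time between poll() calls.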
  2. Consumer session.timeout.ms too short: This timeout determines how long the group coordinator will wait for a heartbeat from a consumer before considering it dead. Heartbeats are sent from a background thread, so this timeout does not depend on processing time, but if it is too short, a long GC pause or a brief network interruption can get a healthy consumer removed from the group and trigger a rebalance.

    • Diagnosis: Examine consumer logs for rebalance triggers. Compare session.timeout.ms with max.poll.interval.ms and with the length of the pauses (GC, network) you actually observe.
    • Fix: Increase session.timeout.ms. It should generally be shorter than max.poll.interval.ms but long enough to ride out network latency and pauses, and it must fall between the broker’s group.min.session.timeout.ms and group.max.session.timeout.ms. One rule of thumb is to set it to about one-third of max.poll.interval.ms: if max.poll.interval.ms is 60000, try session.timeout.ms=20000.
      # In your consumer configuration
      session.timeout.ms=20000
      
    • Why it works: A longer session timeout allows consumers more time to send heartbeats, reducing the likelihood of them being falsely declared dead and triggering a rebalance.
  3. Consumer heartbeat.interval.ms too long: This setting controls how often consumers send heartbeats to the broker. If it’s too long, brokers might not be aware that a consumer is still alive, leading to premature timeouts.

    • Diagnosis: Check if heartbeat.interval.ms is significantly larger than session.timeout.ms divided by 3 (a common rule of thumb).
    • Fix: Decrease heartbeat.interval.ms. It should be roughly one-third of session.timeout.ms. If session.timeout.ms=20000, set heartbeat.interval.ms=6000.
      # In your consumer configuration
      heartbeat.interval.ms=6000
      
    • Why it works: More frequent heartbeats provide the broker with a clearer, more up-to-date picture of consumer liveness, preventing false positives for consumer failures.
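
The relationship between the two settings in items 2 and 3 can be checked mechanically. The sketch below is a hypothetical helper (HeartbeatCheck and its warning messages are made up for illustration) that reads plain java.util.Properties and applies two rules: heartbeat.interval.ms has to be lower than session.timeout.ms, and by the rule of thumb it should be no more than one-third of it. It assumes both properties are set explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sanity check for the heartbeat and session timeout relationship. */
final class HeartbeatCheck {
    private HeartbeatCheck() {}

    /** Returns warnings; an empty list means no issue was found. Both keys must be present. */
    static List<String> warnings(Properties consumerProps) {
        long session = Long.parseLong(consumerProps.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(consumerProps.getProperty("heartbeat.interval.ms"));
        List<String> out = new ArrayList<>();
        if (heartbeat >= session) {
            out.add("heartbeat.interval.ms must be lower than session.timeout.ms");
        } else if (heartbeat * 3 > session) {
            out.add("heartbeat.interval.ms is more than one-third of session.timeout.ms");
        }
        return out;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("session.timeout.ms", "20000");
        props.setProperty("heartbeat.interval.ms", "6000");
        // 6000 * 3 = 18000, which does not exceed 20000, so no warning is produced.
        System.out.println(warnings(props));
    }
}
```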
  4. Broker group.initial.rebalance.delay.ms too short: When the first member joins a new (empty) consumer group, the group coordinator waits for this duration so that more consumers can join before it performs the first rebalance. If it’s too short, consumers that start a moment later each trigger another rebalance right after startup.

    • Diagnosis: Observe rebalances happening very early in a consumer’s lifecycle, immediately after startup or when a new consumer group is initialized.
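    • Fix: Increase group.initial.rebalance.delay.ms on the Kafka brokers. The default is 3000 (3 seconds); a larger value such as 30000 (30 seconds) helps when many consumers start together, at the cost of delaying the first assignment by that long.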
      # In your Kafka broker configuration (server.properties)
      group.initial.rebalance.delay.ms=30000
      
    • Why it works: This delay gives all consumers a chance to join the group and be recognized by the broker before the rebalance process begins, ensuring a more stable initial assignment.
  5. Network issues or high latency: Transient network problems can cause heartbeats or poll() requests to be delayed, making consumers appear unresponsive to brokers and triggering rebalances.

    • Diagnosis: Look for patterns of RebalanceInProgressException that correlate with periods of high network latency or packet loss between your consumers and brokers.
    • Fix: Address underlying network instability. This might involve improving network infrastructure, optimizing routing, or increasing timeouts as described above to be more resilient to temporary network blips. For example, increasing request.timeout.ms on the consumer can help if brokers are slow to respond. Recent client versions default to 30000, so pick a value above that.
      # In your consumer configuration
      request.timeout.ms=60000
      
    • Why it works: By providing more time for network requests to complete, this helps prevent transient network issues from being misinterpreted as consumer failures.
  6. Consumer application logic errors: If your consumer application has bugs that cause it to block indefinitely, crash, or fail to call poll() within the configured intervals, it will inevitably lead to rebalances.

    • Diagnosis: Deep dive into your consumer application’s code. Use profiling tools to identify long-running operations or deadlocks. Check for uncaught exceptions or resource exhaustion (e.g., thread pools, memory leaks).
    • Fix: Debug and fix the application logic. Ensure poll() is called regularly, and that processing is completed within max.poll.interval.ms. Implement robust error handling and graceful shutdown mechanisms.
    • Why it works: A stable, correctly functioning consumer application will adhere to the Kafka protocol’s expectations, preventing it from being prematurely considered failed.
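
The diagnosis steps in items 1 and 6 depend on knowing how long the application actually spends between poll() calls. One way to measure that is sketched below. PollGapMonitor is a hypothetical class, not part of the Kafka client, and the 80% warning threshold is an arbitrary assumption. Timestamps are passed in explicitly so the logic can be exercised without a running consumer; in a real loop you would pass System.currentTimeMillis() right before each poll().

```java
/** Hypothetical monitor: tracks the longest gap between successive poll() calls. */
final class PollGapMonitor {
    private final long maxPollIntervalMs;
    private long lastPollMs = -1;   // -1 means no poll has been recorded yet
    private long longestGapMs = 0;

    PollGapMonitor(long maxPollIntervalMs) {
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    /** Call right before each poll(); nowMs would typically be System.currentTimeMillis(). */
    void onPoll(long nowMs) {
        if (lastPollMs >= 0) {
            longestGapMs = Math.max(longestGapMs, nowMs - lastPollMs);
        }
        lastPollMs = nowMs;
    }

    long longestGapMs() {
        return longestGapMs;
    }

    /** True once the longest observed gap reaches 80% of the limit (threshold is an assumption). */
    boolean atRisk() {
        return longestGapMs * 10 >= maxPollIntervalMs * 8;
    }

    public static void main(String[] args) {
        PollGapMonitor monitor = new PollGapMonitor(60_000);
        monitor.onPoll(0);
        monitor.onPoll(20_000); // 20 s gap: well within the limit
        monitor.onPoll(70_000); // 50 s gap: above 80% of 60 s
        System.out.println("longest gap ms = " + monitor.longestGapMs()
                + ", at risk = " + monitor.atRisk());
    }
}
```

Logging the longest gap periodically makes it easier to tell a configuration problem (gaps steadily near the limit) from an application bug (occasional very long gaps).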

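The next error you’ll likely see after fixing RebalanceInProgressException is CommitFailedException, especially if your consumers commit offsets manually after long processing: the commit is rejected because the consumer has already been removed from the group and its partitions reassigned.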
