Kafka Streams applications can get stuck in a "rebalancing" state for a long time after a Kafka broker restarts or a Streams instance crashes, preventing processing from resuming.

Common Causes and Fixes for Slow Rebalance Recovery

1. Large State Stores: If your Kafka Streams application uses stateful operations (like aggregations, joins, or windowed operations), the state needs to be restored from Kafka changelog topics. If these changelog topics are very large, it will naturally take a long time to download and rebuild the local state stores.

  • Diagnosis: Check the size of your changelog topics.

    kafka-topics --bootstrap-server localhost:9092 --describe --topic my-stream-app-store-changelog
    

    Look at the "Size" column. If it’s in gigabytes, this is likely your bottleneck.

  • Fix: Increase state.cleanup.delay.ms in your Streams configuration. This setting controls how long Kafka retains old changelog segments. By increasing this, you allow older, potentially larger state store segments to be retained for longer periods, giving your application more time to recover and write new segments before old ones are eligible for deletion. A common starting point is to set it to a value that allows for a full recovery window, e.g., state.cleanup.delay.ms: 86400000 (24 hours).

    streams:
      state:
        cleanup.delay.ms: 86400000
    

    This works because Kafka’s log compaction for changelog topics is based on retention policies. By extending the retention, you prevent premature deletion of the data needed for restoration.

2. Insufficient Disk I/O on Application Instances: Restoring state involves reading from disk (local state stores) and writing to disk. If the disks on your application instances are slow or saturated, this process will be significantly delayed.

  • Diagnosis: Monitor disk I/O performance on your application servers using tools like iostat or cloud provider monitoring dashboards. Look for high %util, high await, or low iops.

    iostat -xd 5
    

    Observe the await and %util columns for your state store disk.

  • Fix: Migrate your state stores to faster storage (e.g., SSDs) or provision application instances with better I/O capabilities. For example, if you’re using AWS, switch from gp2 to gp3 or io1/io2 EBS volumes for your state store directory.

    # Example for application configuration
    spring:
      cloud:
        stream:
          kafka:
            binder:
              brokers: "localhost:9092"
              configuration:
                default:
                  stateStore:
                    location: "/mnt/ssd/kafka-streams-state" # Ensure this path is on fast storage
    

    This speeds up the read/write operations necessary to load and update the local state store files.

3. Network Latency Between Application and Brokers: State restoration involves downloading data from Kafka brokers. High network latency or low bandwidth between your application instances and the Kafka cluster can become a significant bottleneck.

  • Diagnosis: Measure network latency and throughput between your application hosts and Kafka brokers using ping and iperf3.

    ping kafka-broker-1.example.com
    iperf3 -c kafka-broker-1.example.com
    
  • Fix: Co-locate your Kafka Streams applications with your Kafka brokers in the same availability zone or region. Ensure sufficient network bandwidth is provisioned. If using cloud providers, check network configurations and consider dedicated network links if necessary.

    # Example Kafka Streams configuration
    kafka.bootstrap.servers: kafka-broker-1.example.com:9092,kafka-broker-2.example.com:9092
    # Ensure these brokers are geographically close to your app instances
    

    Reducing the physical distance and increasing the pipe size for data transfer directly accelerates the download of changelog segments.

4. Inefficient State Store Backend: The default state store for Kafka Streams is RocksDB. While generally performant, misconfigurations or specific workload patterns can lead to suboptimal restoration performance.

  • Diagnosis: Examine RocksDB-specific metrics if available (often exposed via JMX) for compaction times, read/write amplification, or cache hit rates. Also, check for excessive disk space usage by RocksDB.

  • Fix: Tune RocksDB configuration embedded within Kafka Streams. For example, increasing rocksdb.blockCacheSize can help if your state stores are frequently accessed during restoration.

    # Example Kafka Streams configuration
    streams.rocksdb.config.override: "blockCacheSize: 134217728" # 128MB
    

    A larger block cache allows more frequently accessed data blocks to remain in memory, reducing the need for slow disk reads during the restore process.

5. Excessive Number of Partitions: Each partition requires its own state store and needs to be restored independently. If you have a very large number of partitions for your input topics, the overhead of managing and restoring all these state stores can become substantial, even if individual state stores are small.

  • Diagnosis: Count the number of partitions for your input topics.

    kafka-topics --bootstrap-server localhost:9092 --describe --topic my-input-topic
    

    If the number of partitions is in the thousands, this might be a contributing factor.

  • Fix: Consider repartitioning your topics to a smaller, more manageable number of partitions. This is a complex operation that often requires a separate Streams application or a custom migration process, potentially involving writing to new topics with fewer partitions.

    // Example of repartitioning within a Streams application
    KStream<Key, Value> repartitionedStream = originalStream
        .selectKey((k, v) -> new Key(v.getSomeId())) // Choose a key that distributes data
        .repartition(
            Repartitioned.with(Serdes.String(), Serdes.ByteArray())
                           .numberOfPartitions(100) // Target fewer partitions
        );
    

    Reducing the number of partitions directly reduces the number of independent state stores that need to be restored, decreasing the overall restoration time.

6. Outdated Kafka Streams Library Version: Older versions of Kafka Streams might have performance regressions or bugs related to state restoration that have since been fixed.

  • Diagnosis: Check the version of the kafka-streams library in your project’s dependency management (e.g., pom.xml or build.gradle).

  • Fix: Upgrade to the latest stable version of the Kafka Streams library.

    <!-- Example in Maven pom.xml -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>3.6.1</version> <!-- Update to latest stable -->
    </dependency>
    

    Newer versions often include performance optimizations and bug fixes that can directly improve rebalance recovery times.

After addressing these, you might encounter issues with the application becoming available quickly enough, leading to downstream systems timing out waiting for it.

Want structured learning?

Take the full Kafka-streams course →