Kafka Streams choked because one partition got swamped, and the rest sat idle.

This isn’t about Kafka itself being slow; it’s about your Streams application’s internal state and processing load being unevenly distributed. The core issue is that Kafka Streams assigns tasks (processing logic for a subset of partitions) to instances. If one partition has way more data or its processing is significantly more complex, the task responsible for it becomes a bottleneck, starving other tasks and instances.

Common Causes and Fixes

  1. Uneven Key Distribution in Input Topic:

    • Diagnosis: Check the distribution of keys in your input Kafka topic. Tools like kafka-console-consumer can help, but for a true picture, you’d often need to write a small consumer application to count key occurrences or analyze topic compaction logs if applicable. A quick visual check might involve looking at the first few bytes of messages if your key is structured.
    • Fix: Re-key your data. If you’re using a map or flatMap operation, introduce a KTable.groupByKey() or KStream.groupByKey() using a different key that is evenly distributed. For example, if you’re grouping by user_id and one user has millions of events, switch to grouping by user_id % 1000 (or a similar modulo operation) to distribute the load.
    • Why it works: This forces Kafka Streams to repartition the data across different partitions based on the new, more uniform key. Each instance will then receive a more balanced share of the workload.
  2. Large State Stores:

    • Diagnosis: Monitor the size of your Kafka Streams state stores on disk. If one instance has a state store that’s orders of magnitude larger than others, it’s a strong indicator. This can be observed by checking disk usage on your Kafka Streams nodes or by using JMX metrics related to StateStoreProvider and RocksDB (if you’re using RocksDB as the backend). Look for metrics like rocksdbs-num-files-at-level or rocksdbs-total-size.
    • Fix: Implement a strategy to limit the size of state stores or clean them up. For time-based data, consider setting appropriate retention policies for your state stores (e.g., withRetentionMs() in the Materialized builder). If the large state is due to a specific set of keys, you might need to introduce a mechanism to prune or archive old data from that store, perhaps by using a separate cleanup topic and consumer.
    • Why it works: Large state stores require more disk I/O, memory, and can slow down rebalancing operations. Reducing their size or ensuring they are manageable prevents a single instance from becoming bogged down.
  3. Complex Processing Logic on Specific Keys:

    • Diagnosis: Profile your Kafka Streams application. Use tools like Java Flight Recorder (JFR) or YourKit to identify which parts of your processing topology are consuming the most CPU time. If a specific key’s processing involves many joins, aggregations, or calls to external services, it will naturally take longer.
    • Fix: Optimize the processing logic for high-volume or complex keys. This might involve caching results of external calls, pre-computing some values, or simplifying the aggregation logic. If possible, denormalize data earlier in the pipeline so that complex lookups aren’t needed within the Streams app.
    • Why it works: By reducing the computational or I/O cost for processing specific keys, you bring their processing time closer to the average, thus distributing the load more evenly.
  4. Kafka Topic Partition Count Mismatch or Too Low:

    • Diagnosis: Check the number of partitions for your input and output topics. If your input topic has only a few partitions (e.g., 2 or 4) and your Streams application has many instances, you might be hitting the limit of parallelism dictated by the topic.
    • Fix: Increase the number of partitions for your input topic. This is a one-time operation that requires careful consideration of future growth. You can do this using kafka-topics.sh --alter --topic my-topic --partitions <new_count>. Ensure your output topics also have at least as many partitions as your input topics if you’re performing joins or aggregations that span partitions.
    • Why it works: More partitions allow Kafka Streams to create more tasks, which can then be distributed across more application instances, increasing the potential for parallelism.
  5. Uneven Number of Tasks per Instance (During Rebalance):

    • Diagnosis: Observe Kafka Streams rebalancing logs. If, after a rebalance, one instance consistently gets assigned more tasks than others, it’s a sign. You can also inspect the KafkaStreams object’s internal state (though this is more advanced, often requiring JMX or debugging) to see task assignments.
    • Fix: Ensure your num.standby.tasks configuration is set appropriately. While not directly fixing hot partitions, a good standby task configuration helps with recovery and can indirectly influence rebalancing behavior. More importantly, ensure your max.poll.records is not too high, which can cause a single poll cycle to process a disproportionate amount of data for a hot partition.
    • Why it works: A balanced task distribution is crucial. While you can’t force Kafka Streams to assign tasks perfectly if the underlying data or processing is inherently skewed, optimizing configuration parameters can prevent artificial imbalances.
  6. Network or Disk I/O Bottlenecks on a Single Instance:

    • Diagnosis: Monitor system-level metrics (CPU, network, disk I/O) on your Kafka Streams instances. If one instance shows consistently high disk I/O or network saturation, especially when processing a particular partition’s data, that instance is likely the bottleneck.
    • Fix: This is often a hardware or infrastructure issue. Ensure instances have sufficient resources (faster disks, more network bandwidth). If the bottleneck is consistently due to state store operations on a specific partition, re-evaluating the key distribution (point 1) is often the root cause.
    • Why it works: The application can only process data as fast as the underlying hardware allows. Resolving system-level bottlenecks ensures the application isn’t artificially constrained.

The next error you’ll likely hit after fixing skewed processing is org.apache.kafka.streams.errors.TaskMigratedException, indicating that a task has been moved to another instance, usually during a rebalance or a failure.

Want structured learning?

Take the full Kafka-streams course →