Kafka Streams apps can survive node failures without losing state or dropping messages, but it’s not magic.

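Let’s watch a simple Kafka Streams application process messages and see how it stays alive. Imagine we have a Kafka topic named clicks with incoming events like {"user_id": "alice", "timestamp": 1678886400}, keyed by user_id, and another topic user_counts where we want to write the aggregated click count per user. The key matters: groupByKey in the topology below groups on the record key, not on a field inside the JSON value.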

Here’s a basic Kafka Streams topology:

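import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
// JsonSerde is assumed to be a project-specific JSON serde; it is not part of Kafka.
// count() yields a KTable (a changelog-backed table), not a KStream.
KTable<String, Long> userCounts = builder
    .stream("clicks", Consumed.with(Serdes.String(), JsonSerde.serde()))
    .groupByKey(Grouped.with(Serdes.String(), JsonSerde.serde()))
    .count(Materialized.as("user-counts-store"));

// Emit every count update to the output topic.
userCounts.toStream().to("user_counts", Produced.with(Serdes.String(), Serdes.Long()));

// streamsConfig is a java.util.Properties holding the settings described below.
KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfig);
streams.start();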

When this runs, Kafka Streams doesn’t just read and write. It creates state stores. In our example, user-counts-store is a local key-value store (on disk by default, or in-memory if configured) holding the current count for each user_id. Every update to it is also recorded in a changelog topic in Kafka, and that combination is what lets processing resume after a failure.

High Availability (HA) in Kafka Streams relies on two main mechanisms:

  1. Task Distribution and Rebalancing: Kafka Streams partitions your application’s processing logic into "tasks." Each task is responsible for processing a subset of the input topic partitions. When you start multiple instances of your Kafka Streams application, Kafka Streams leverages Kafka’s consumer group protocol and automatically assigns these tasks to the available instances. If an instance goes down, the group coordinator on the broker side notices that its heartbeats have stopped and removes it from the group. The remaining instances then rebalance, taking over the tasks that were previously handled by the failed instance. This ensures all input partitions are still being processed.

  2. State Store Replication/Changelogging: For stateful operations (like our count), each task has an associated state store. Kafka Streams maintains a changelog topic for each state store. This changelog topic is a regular Kafka topic, compacted so that it keeps at least the latest value for every key. Every change made to the local state store (e.g., incrementing a user’s count) is also sent as a record to this changelog topic. When a new instance starts up or takes over a task from a failed instance, it can restore its state by reading from this changelog topic. This process is called "restoring from changelog." The new instance replays the recorded state changes to bring its local store up to date before it starts processing new input data. State that reached the changelog therefore survives the loss of an instance. Under the default at-least-once guarantee, input records processed since the last commit may still be processed again after a failure; avoiding that requires enabling exactly-once processing via processing.guarantee.
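To make the changelog idea concrete, here is a toy model in plain Java. It is only a sketch: a HashMap stands in for the local state store and a List stands in for the changelog topic, and the class and method names (ChangelogReplaySketch, CountingTask, onClick, restore) are made up for illustration and are not Kafka Streams APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: no Kafka involved. The List plays the role of the changelog
// topic and outlives the "instance" that wrote it.
class ChangelogReplaySketch {

    // One changelog record: a key and its latest value after an update.
    static final class Change {
        final String userId;
        final long count;

        Change(String userId, long count) {
            this.userId = userId;
            this.count = count;
        }
    }

    // Hypothetical stand-in for a task with a local store and a changelog.
    static final class CountingTask {
        final Map<String, Long> localStore = new HashMap<>();
        final List<Change> changelog;

        CountingTask(List<Change> changelog) {
            this.changelog = changelog;
        }

        // Update the local store and append the new value to the changelog.
        void onClick(String userId) {
            long updated = localStore.merge(userId, 1L, Long::sum);
            changelog.add(new Change(userId, updated));
        }

        // Rebuild the local store by replaying the changelog in order;
        // later records for a key overwrite earlier ones.
        void restore() {
            localStore.clear();
            for (Change c : changelog) {
                localStore.put(c.userId, c.count);
            }
        }
    }

    static Map<String, Long> demo() {
        List<Change> changelog = new ArrayList<>();
        CountingTask original = new CountingTask(changelog);
        original.onClick("alice");
        original.onClick("bob");
        original.onClick("alice");

        // Simulated failover: a fresh task with an empty store takes over,
        // restores from the changelog, then continues counting.
        CountingTask replacement = new CountingTask(changelog);
        replacement.restore();
        replacement.onClick("alice");
        return replacement.localStore;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

After the simulated failover the replacement task continues from alice’s restored count rather than from zero, which is the property the changelog is there to provide.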

Here’s a look at the configuration that enables this HA:

# broker list
bootstrap.servers: kafka-broker-1:9092,kafka-broker-2:9092

# application ID: crucial for consumer group management and state store naming
application.id: my-click-counter-app

# where to store local state
state.dir: /path/to/kafka-streams-state

# how often to commit processed offsets and flush state stores
# (1 minute here; the default is 30000 under at-least-once)
commit.interval.ms: 60000

# how long the group coordinator waits without a heartbeat before it treats
# an instance as dead and triggers a rebalance; lower values give faster
# failover but more spurious rebalances
# session.timeout.ms: 30000
# how long a client waits for a broker to answer a request
# request.timeout.ms: 30000

# number of threads per application instance
# more threads mean more parallel processing, up to the number of tasks,
# but also more potential for contention.
# For HA, we typically run multiple instances of the app, each with >=1 thread.
# num.stream.threads: 2
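As a sketch of how the streamsConfig object passed to new KafkaStreams(...) might be assembled, the settings above can be put into a java.util.Properties. The class name StreamsConfigSketch and the method buildConfig are hypothetical. Plain string keys are used here to keep the example free of dependencies; real applications commonly use the constants on StreamsConfig (such as StreamsConfig.APPLICATION_ID_CONFIG) instead.

```java
import java.util.Properties;

// Sketch only: builds the same settings as the config listing, using the
// literal property names as keys.
class StreamsConfigSketch {

    static Properties buildConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092");
        props.put("application.id", "my-click-counter-app");
        props.put("state.dir", "/path/to/kafka-streams-state");
        // 60000 ms = 1 minute between commits.
        props.put("commit.interval.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        buildConfig().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```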

The application.id is paramount. It’s used by Kafka Streams to:

  • Create the internal changelog topics for state stores.
  • Manage the consumer group for the input topics.
  • Identify which state stores belong to which application.

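If you change the application.id for an existing application, Kafka Streams will treat it as a completely new application. It will join a new consumer group with no committed offsets, create new changelog topics, ignore any existing state, and start processing from the position given by auto.offset.reset (which Kafka Streams sets to earliest by default).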
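The link between application.id and the changelog topics follows from the naming pattern for internal topics: application ID, then store name, then the suffix -changelog. A minimal sketch of that pattern, with a hypothetical helper name (changelogTopic is not a Kafka API):

```java
// Sketch only: reproduces the <application.id>-<store name>-changelog pattern.
class ChangelogTopicName {

    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    public static void main(String[] args) {
        // Same store, two application IDs: two unrelated changelog topics.
        System.out.println(changelogTopic("my-click-counter-app", "user-counts-store"));
        System.out.println(changelogTopic("my-click-counter-app-v2", "user-counts-store"));
    }
}
```

Because the topic name starts with the application ID, an instance running under a new ID (the -v2 value here is just an example) never reads the old changelog.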

The state.dir is where the local copy of the state stores resides. When an instance restarts, it will first check this directory for existing state. If it finds a usable copy, it loads it and replays only the changelog records written after that copy was last checkpointed. If not, or if the state is deemed corrupted or incomplete, it rebuilds the store from the changelog topic.

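The commit.interval.ms determines how frequently Kafka Streams commits the processed input offsets and flushes state stores along with any pending changelog writes. Under the default at-least-once guarantee, records processed after the last commit are read again after a crash. A smaller interval shrinks that reprocessing window, but it also means more frequent commits to Kafka. A larger interval reduces commit overhead but lets more records be replayed, and for an aggregation like count that can mean counting some records twice unless exactly-once processing is enabled.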
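A rough back-of-the-envelope estimate of that window, as a sketch: if a task processes records at a steady rate, the number of records that may be replayed after a crash is bounded by roughly rate times commit interval. The throughput figure of 500 records per second is an assumption chosen for illustration, and the helper name is hypothetical.

```java
// Sketch only: a simple upper-bound estimate, not a measurement.
class ReprocessingWindowSketch {

    // Approximate number of records processed between two commits.
    static long maxReplayedRecords(long recordsPerSecond, long commitIntervalMs) {
        return recordsPerSecond * commitIntervalMs / 1000;
    }

    public static void main(String[] args) {
        // Assumed throughput: 500 records/s for one task.
        System.out.println(maxReplayedRecords(500, 60_000)); // 1-minute interval
        System.out.println(maxReplayedRecords(500, 1_000));  // 1-second interval
    }
}
```

With these assumed numbers, a one-minute interval leaves up to about 30,000 records exposed to reprocessing per task, against about 500 for a one-second interval.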

What happens when an instance dies? Let’s say instance A is running and processing tasks for partitions clicks-0 and clicks-1. Instance B is running and processing clicks-2. If instance A crashes:

  1. The group coordinator (a Kafka broker) stops receiving heartbeats from instance A and, once session.timeout.ms has elapsed, removes it from the consumer group.
  2. Instance B (or any other running instance of my-click-counter-app) learns from the coordinator that the group membership has changed and rejoins the group.
  3. The resulting rebalance reassigns the tasks for clicks-0 and clicks-1 to instance B (or another available instance).
  4. The instance taking over clicks-0 and clicks-1 will look for its local state for these tasks in state.dir.
  5. If the state isn’t there, it rebuilds the store by reading the corresponding changelog topic (e.g., my-click-counter-app-user-counts-store-changelog) from the beginning. If a local copy with a checkpoint exists, it only replays the changelog records written after the checkpointed offset.
  6. Once its state is restored, it begins processing new messages from clicks-0 and clicks-1.
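The reassignment in step 3 can be sketched with a toy model. This is not the assignor Kafka Streams actually uses, which also takes things like existing local state into account; it only illustrates that the failed instance’s work ends up on the survivors. All names here (FailoverSketch, reassign, the instance labels) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: moves a failed instance's work to the least-loaded survivor.
class FailoverSketch {

    static Map<String, List<String>> reassign(Map<String, List<String>> assignment, String failed) {
        // Copy the surviving instances and what they already own.
        Map<String, List<String>> next = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : assignment.entrySet()) {
            if (!e.getKey().equals(failed)) {
                next.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        if (next.isEmpty()) {
            throw new IllegalStateException("no surviving instances");
        }
        // Hand each orphaned partition to whichever survivor owns the fewest.
        List<String> orphaned = assignment.getOrDefault(failed, Collections.<String>emptyList());
        for (String partition : orphaned) {
            String target = null;
            for (String instance : next.keySet()) {
                if (target == null || next.get(instance).size() < next.get(target).size()) {
                    target = instance;
                }
            }
            next.get(target).add(partition);
        }
        return next;
    }

    static Map<String, List<String>> demo() {
        Map<String, List<String>> before = new LinkedHashMap<>();
        before.put("instance-A", Arrays.asList("clicks-0", "clicks-1"));
        before.put("instance-B", Arrays.asList("clicks-2"));
        return reassign(before, "instance-A");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the two-instance scenario above, instance B ends up owning all three partitions, so throughput per instance drops until a replacement for A joins and triggers another rebalance.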

The most surprising thing about Kafka Streams HA is how seamlessly it integrates with Kafka’s core partitioning and consumer group mechanisms, abstracting away much of the complexity of distributed state management. It feels like a single-threaded application until you look under the hood at the task assignments and changelog topics.

The one thing most people don’t know is that the order of operations during a restore is critical. The Streams client first restores the state store from the changelog topic. Only after the state store has been fully restored does it resume processing new input records from the last committed offset of the input partition. If the state store restoration fails or gets stuck, the task won’t start processing: its input partition stays assigned to that instance but is not consumed, so lag on that partition grows until the issue is resolved.
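The restore-then-process ordering can be pictured as a small state machine. The sketch below is a plain-Java toy, not Kafka Streams internals: the Task class, its RESTORING and RUNNING states, and the restoreBatch and tryProcess methods are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: input is refused until the changelog has been fully replayed.
class RestoreOrderingSketch {

    enum State { RESTORING, RUNNING }

    static final class Task {
        private final List<String[]> changelog;            // each entry: {userId, count}
        private final Map<String, Long> store = new HashMap<>();
        private int nextChangelogIndex = 0;
        private State state = State.RESTORING;

        Task(List<String[]> changelog) {
            this.changelog = changelog;
        }

        // Replay up to maxRecords changelog entries; switch to RUNNING only
        // once the end of the changelog has been reached.
        void restoreBatch(int maxRecords) {
            int end = Math.min(changelog.size(), nextChangelogIndex + maxRecords);
            while (nextChangelogIndex < end) {
                String[] rec = changelog.get(nextChangelogIndex++);
                store.put(rec[0], Long.parseLong(rec[1]));
            }
            if (nextChangelogIndex == changelog.size()) {
                state = State.RUNNING;
            }
        }

        // Returns false (and changes nothing) while the task is still restoring.
        boolean tryProcess(String userId) {
            if (state != State.RUNNING) {
                return false;
            }
            store.merge(userId, 1L, Long::sum);
            return true;
        }

        long count(String userId) {
            return store.getOrDefault(userId, 0L);
        }
    }

    static Task demoTask() {
        List<String[]> changelog = new ArrayList<>();
        changelog.add(new String[] {"alice", "1"});
        changelog.add(new String[] {"bob", "1"});
        changelog.add(new String[] {"alice", "2"});
        return new Task(changelog);
    }

    public static void main(String[] args) {
        Task task = demoTask();
        task.restoreBatch(2);                              // partial restore
        System.out.println(task.tryProcess("alice"));      // refused: still restoring
        task.restoreBatch(2);                              // restore completes
        System.out.println(task.tryProcess("alice"));      // accepted
        System.out.println(task.count("alice"));
    }
}
```

If input were accepted during the partial restore, alice’s click would be applied to a stale count and then overwritten by the remaining changelog record, which is why processing waits. In a real application, restore progress can be observed by registering a StateRestoreListener through KafkaStreams#setGlobalStateRestoreListener.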

Once your application is running with HA, the next thing you’ll likely want to understand is how to scale it by adding more instances and how Kafka Streams redistributes tasks across them.
