Kafka Streams uses standby replicas to dramatically speed up recovery after a task failure.
Let’s see this in action. Imagine a Kafka Streams application processing orders. It has three instances (let’s call them app-1, app-2, app-3) and a Kafka topic orders with 10 partitions. Kafka Streams assigns partitions to instances. app-1 might get partitions 0-3, app-2 gets 4-7, and app-3 gets 8-9.
Now, app-2 suddenly crashes. Without standby replicas, whichever instance takes over partitions 4-7 would have to rebuild the local state (like aggregations or joins) by replaying the state stores’ changelog topics, from the beginning if it has no usable local copy. This could take minutes or even hours for large state stores.
But with standby replicas, app-1 and app-3 have been keeping copies of the state for partitions assigned to app-2. These are the "standby" tasks. When app-2 fails, Kafka Streams can reassign partitions 4-7 to an instance that already holds the matching standby, say app-1. app-1 doesn’t need to rebuild the state from scratch; it simply "promotes" its standby replica for partitions 4-7 to become the active task, applies any changelog records it has not yet caught up on, and takes over where app-2 left off with minimal disruption.
The core problem standby replicas solve is state recovery time. In distributed stream processing, a task often maintains internal state (e.g., counts, windowed aggregations, joined data) in a local state store. When the instance running a task fails, that local copy is no longer available. Without standbys, the new owner must replay the state store’s changelog topic from Kafka to rebuild the state, which is slow and costly for large stores. Standby replicas mitigate this by continuously maintaining a mirror of the active task’s state on another instance.
Here’s how it works internally:
- Task Assignment: Kafka Streams assigns partitions to instances as tasks. For each stateful task that is active on one instance, Kafka Streams also assigns a "standby" task for the same partition to a different instance, up to the number requested by num.standby.replicas and limited by how many other instances are running.
- State Synchronization: The active task applies each state change to its local state store and also writes it to a changelog topic in Kafka (the "state store changelog", an internal compacted topic that is distinct from a repartition topic). The standby task reads this changelog and applies the same changes to its own local store, thus mirroring the active task’s state.
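- Failover: When the instance running the active task fails, Kafka’s group coordinator detects this through the consumer group protocol (for example, missed heartbeats) and triggers a rebalance; ZooKeeper is not involved. During the rebalance, Kafka Streams prefers an instance that already holds a standby replica for the failed task’s partitions.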
- State Promotion: The selected instance promotes its standby replica to become the new active task. Since its state is already close to up-to-date from the changelog, it only has to apply the few records it has not yet consumed before processing new input, instead of replaying the full history.
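The four steps above can be illustrated with a toy model. This is a minimal sketch in plain Java, not Kafka Streams internals: the class and method names (StandbyFailoverSketch, Replica, processOrder, catchUp) are hypothetical, the changelog is an in-memory list instead of a Kafka topic, and the state is a simple per-customer order count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of changelog mirroring and standby promotion (hypothetical names throughout). */
class StandbyFailoverSketch {

    /** One changelog entry: the latest value written for a key. */
    record Change(String key, long value) {}

    /** A task replica: a local state store plus its position in the changelog. */
    static final class Replica {
        final Map<String, Long> store = new TreeMap<>();
        int changelogPosition = 0; // index of the next changelog entry to apply

        /** Apply changelog entries up to, but excluding, endExclusive. */
        void catchUp(List<Change> changelog, int endExclusive) {
            while (changelogPosition < endExclusive) {
                Change c = changelog.get(changelogPosition++);
                store.put(c.key(), c.value());
            }
        }
    }

    /** Active-side processing: update the local store, then append the change to the changelog. */
    static void processOrder(Replica active, List<Change> changelog, String customer) {
        long newCount = active.store.merge(customer, 1L, Long::sum);
        changelog.add(new Change(customer, newCount));
        active.changelogPosition = changelog.size();
    }

    public static void main(String[] args) {
        List<Change> changelog = new ArrayList<>();
        Replica active = new Replica();
        Replica standby = new Replica();

        for (String customer : List.of("alice", "bob", "alice", "carol")) {
            processOrder(active, changelog, customer);
        }
        standby.catchUp(changelog, 3); // the standby trails the active task by one entry

        // The active instance fails: the standby replays only the missing tail.
        int lag = changelog.size() - standby.changelogPosition;
        standby.catchUp(changelog, changelog.size());
        Replica promoted = standby; // promotion: the standby now acts as the active task

        processOrder(promoted, changelog, "alice");
        System.out.println("entries replayed on failover: " + lag);
        System.out.println("state after promotion: " + promoted.store);
    }
}
```

The point of the sketch is that failover cost depends on how far the standby lags behind the changelog, not on the total size of the state.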
The main levers you control are configuration settings:
- num.standby.replicas: This setting, often set to 1 or 2, determines how many standby copies Kafka Streams will maintain for each stateful task. A value of 0 (the default) disables standby replicas.
- replication.factor: This Streams setting controls the replication factor of the internal topics Kafka Streams creates, including the state store changelogs, and so determines how durable those changelogs are. A replication factor of 3 is common for production.
- state.dir: The directory where Kafka Streams stores its local state. Ensure this is on persistent storage.
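As a minimal sketch, these settings could be collected into a java.util.Properties object like the one below. The keys num.standby.replicas, replication.factor, state.dir, application.id, and bootstrap.servers are real Kafka Streams configuration names; the class name, application id, broker address, and directory path are placeholder assumptions for illustration.

```java
import java.util.Properties;

/** Builds an illustrative Kafka Streams configuration with standby replicas enabled. */
class StandbyConfigSketch {

    static Properties buildConfig() {
        Properties props = new Properties();
        props.setProperty("application.id", "orders-app");        // hypothetical application id
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.setProperty("num.standby.replicas", "1");           // one standby per stateful task
        props.setProperty("replication.factor", "3");             // replication of internal topics
        props.setProperty("state.dir", "/var/lib/kafka-streams"); // hypothetical persistent path
        return props;
    }

    public static void main(String[] args) {
        // Print the settings; a real application would hand them to Kafka Streams instead.
        buildConfig().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application the same Properties object would typically be passed to the KafkaStreams constructor together with the topology, and the StreamsConfig constants (for example StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG) can be used in place of the raw string keys.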
The surprising thing most people don’t realize is that the standby replica’s state is not a direct copy of the active task’s local file system. Instead, it is reconstructed independently from the Kafka changelog. This is crucial because it means the standby is not tied to a specific instance’s disk; if the standby instance also fails and is replaced, the new instance can still catch up by replaying the same changelog, ensuring high availability even if multiple instances are lost in succession.
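To make that concrete, here is a small sketch of state being rebuilt purely from a changelog. It assumes a simplified changelog held in memory, where a record with a null value acts as a tombstone that deletes the key; the names ChangelogReplaySketch, ChangelogRecord, and restore are hypothetical and not part of the Kafka Streams API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Rebuilds a key-value store by replaying a changelog from the beginning. */
class ChangelogReplaySketch {

    /** A changelog record; a null value is a tombstone meaning "delete this key". */
    record ChangelogRecord(String key, Long value) {}

    /** Replay every record in order; later records overwrite or delete earlier ones. */
    static Map<String, Long> restore(List<ChangelogRecord> changelog) {
        Map<String, Long> store = new TreeMap<>();
        for (ChangelogRecord r : changelog) {
            if (r.value() == null) {
                store.remove(r.key());
            } else {
                store.put(r.key(), r.value());
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<ChangelogRecord> changelog = List.of(
                new ChangelogRecord("alice", 1L),
                new ChangelogRecord("bob", 1L),
                new ChangelogRecord("alice", 2L),
                new ChangelogRecord("bob", null)); // tombstone for bob

        Map<String, Long> firstStandby = restore(changelog);
        Map<String, Long> replacement = restore(changelog); // a brand-new instance, no shared disk

        System.out.println(firstStandby.equals(replacement)); // prints true
        System.out.println(replacement);                      // prints {alice=2}
    }
}
```

Because both replicas derive their state from the same log, neither depends on files written by the other.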
The next concept you’ll likely grapple with is how Kafka Streams handles rebalancing and task migration when instances are added or removed, and how that interacts with standby replica availability.