Kafka’s cooperative rebalance is a fundamental shift in how consumers adapt to group membership changes, significantly reducing the impact of rebalances on your application.
Imagine a Kafka consumer group. When a new consumer joins, or an existing one leaves (crashes, restarts, or is intentionally stopped), the group needs to rebalance its partitions. Traditionally, this meant a "stop-the-world" event: all consumers in the group would pause fetching data until the new partition assignments were finalized and distributed. This pause, even if brief, could lead to processing delays, backlogs, and potentially trigger downstream timeouts.
Cooperative rebalancing, introduced in Kafka 0.11, changes this paradigm. Instead of a global pause, consumers in a cooperative rebalance gradually give up partitions they no longer own and gradually acquire new ones. This is achieved through a more granular communication protocol between the consumer coordinator and the consumers.
Let’s see this in action with a simplified scenario.
Consider a consumer group my-group with two consumers, consumer-1 and consumer-2, each assigned one partition from a topic my-topic with two partitions (partition 0 and partition 1).
Initial State:
consumer-1ownsmy-topic-0consumer-2ownsmy-topic-1
Now, let’s say consumer-2 restarts.
Traditional (Eager) Rebalance:
consumer-2sends a leave-group request.- The consumer coordinator marks
consumer-2as dead. - The coordinator initiates a rebalance. All consumers (
consumer-1andconsumer-2, if it were still alive and participating) would enter aREBALANCINGstate. consumer-1stops fetching data.- The coordinator assigns partitions. It might assign
my-topic-0andmy-topic-1toconsumer-1. - The coordinator sends the new assignments.
consumer-1resumes fetching data for both partitions.
During steps 3-6, no data is processed by any consumer in the group.
Cooperative Rebalance:
consumer-2sends a leave-group request.- The consumer coordinator receives the request and begins the rebalance.
consumer-1receives aRevokePartitionsnotification formy-topic-0. It stops fetching frommy-topic-0and finishes processing any records it has buffered for that partition.consumer-2(if it were still attempting to join) would receive aAssignPartitionsnotification formy-topic-0.- The coordinator determines the new assignments. In this case, it might assign both
my-topic-0andmy-topic-1toconsumer-1. consumer-1receives aAssignPartitionsnotification formy-topic-1. It begins fetching frommy-topic-1if it wasn’t already. It also starts fetching frommy-topic-0once it’s ready to take it back (or if it was already assigned).
Crucially, consumer-1 can continue processing my-topic-1 (if it was already assigned) while the rebalance for my-topic-0 is occurring. The pause is localized to the partitions being revoked and assigned, not the entire consumer group.
To enable cooperative rebalancing, you need to configure your consumers. The key setting is partition.assignment.strategy.
# In your consumer configuration
group.id=my-group
bootstrap.servers=kafka-broker-1:9092,kafka-broker-2:9092
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
# Enable cooperative rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
The CooperativeStickyAssignor is the implementation that provides this behavior. It’s designed to minimize partition movements between consumers, meaning a consumer will try to keep its existing partitions if possible during a rebalance. This "stickiness" further reduces unnecessary work.
The mental model for cooperative rebalancing involves understanding that partition ownership is now a more fluid concept. Consumers don’t just "get" partitions; they request them and relinquish them incrementally. The CooperativeStickyAssignor works by first calculating partition assignments using a sticky approach (trying to keep current assignments) and then, if that’s not feasible due to group changes, it falls back to a more traditional round-robin assignment. The "cooperative" part comes in because consumers actively participate in the rebalance by signaling their readiness to give up or accept partitions.
The CooperativeStickyAssignor has a subtle but important behavior: it favors keeping partitions with existing consumers. If a consumer is already assigned partition P, and a rebalance occurs, the assignor will try to keep P assigned to that same consumer if it remains in the group. This reduces the amount of state that needs to be reset (e.g., committed offsets, in-flight records).
A common pitfall is forgetting to update the partition.assignment.strategy. If this is not set or set to the older RangeAssignor or RoundRobinAssignor, your consumers will continue to use the eager rebalancing protocol, negating the benefits of cooperative rebalancing.
The next challenge you’ll likely encounter is managing the max.poll.records and max.poll.interval.ms settings in conjunction with cooperative rebalancing to ensure your consumers can process records fast enough to avoid triggering MaxPollIntervalExceededException.