A Kafka stretch cluster isn’t really about "stretching" Kafka; it’s about stretching ZooKeeper and accepting some Kafka-specific trade-offs to make it happen.

Let’s see what this looks like in practice. Imagine we have two regions, us-east-1 and us-west-2, which the hostnames in the snippets below call region1 and region2. We’ll deploy ZooKeeper nodes and Kafka brokers in both.

# Example ZooKeeper config snippet for a stretch cluster
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
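# ZooKeeper ensemble: two nodes in region1, three in region2
# (2888 is the follower-to-leader port, 3888 the leader-election port)
server.1=node1.region1.example.com:2888:3888
server.2=node2.region1.example.com:2888:3888
server.3=node3.region2.example.com:2888:3888
server.4=node4.region2.example.com:2888:3888
server.5=node5.region2.example.com:2888:3888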

And for Kafka:

# Example Kafka broker config snippet
broker.id=0
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://broker0.region1.example.com:9092
zookeeper.connect=node1.region1.example.com:2181,node2.region1.example.com:2181,node3.region2.example.com:2181
log.dirs=/var/lib/kafka/log
default.replication.factor=3
min.insync.replicas=2

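The core problem this solves is disaster recovery. A single logical cluster spans both regions, so brokers and replicas in the surviving region can keep serving if the other region fails, albeit with some performance implications. The catch is ZooKeeper quorum. With the five-node ensemble above (two nodes in region1, three in region2), losing region1 leaves three of five nodes and quorum holds, but losing region2 leaves only two and ZooKeeper stops accepting writes until quorum returns. Surviving the loss of either region requires at least one ZooKeeper node in a third location as a tie-breaker. Clients in both regions can use the cluster at once (active-active), or one region can sit idle as a standby (active-passive).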
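A quick majority check makes the quorum question concrete. This is a minimal sketch rather than anything ZooKeeper ships: the function names are made up for illustration, and the `three_regions` layout (with a `region3` tie-breaker) is a hypothetical alternative that does not appear in the config above.

```python
def has_quorum(total_nodes: int, surviving_nodes: int) -> bool:
    """ZooKeeper stays available only with a strict majority of the ensemble."""
    return surviving_nodes >= total_nodes // 2 + 1


def survives_loss_of(region: str, layout: dict) -> bool:
    """True if the ensemble keeps quorum after every node in `region` is lost."""
    total = sum(layout.values())
    return has_quorum(total, total - layout[region])


# Layout from the ZooKeeper snippet: two nodes in region1, three in region2.
two_regions = {"region1": 2, "region2": 3}

# Hypothetical alternative: move one node to a third, tie-breaker location.
three_regions = {"region1": 2, "region2": 2, "region3": 1}

for name, layout in [("two regions", two_regions), ("three regions", three_regions)]:
    for region in layout:
        ok = survives_loss_of(region, layout)
        print(f"{name}: lose {region} -> quorum survives: {ok}")
```

With only two locations, one of them always holds the majority, so losing that one always costs quorum. Spreading the same five nodes over three locations removes that single point of failure.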

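Internally, the magic (and the headache) comes from ZooKeeper. ZooKeeper must maintain a quorum (a strict majority) across all its nodes, regardless of region, so inter-region network latency directly affects how quickly it can elect a leader and commit state changes. ZooKeeper is not on the produce or fetch path, though. Producers and consumers see higher latency mainly because partition replicas live in different regions: a write sent with acks=all waits for followers across the inter-region link, and a client may be talking to a partition leader in the remote region. ZooKeeper latency shows up instead in controller operations such as leader election and failover.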
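To get a feel for how inter-region latency reaches ZooKeeper writes, here is a rough model, not a measurement. It assumes a write commits once the ZooKeeper leader has acknowledgements from a majority (counting itself), ignores disk and processing time, and uses made-up round-trip times of 1 ms within a region and 70 ms between regions. The function name is hypothetical.

```python
def quorum_commit_latency_ms(follower_rtts_ms: list) -> float:
    """Estimate how long the leader waits to collect a majority of acks.

    The leader counts as one vote, so it needs acks from (majority - 1)
    followers, and the slowest of those fastest followers sets the latency.
    """
    ensemble_size = len(follower_rtts_ms) + 1  # followers plus the leader
    majority = ensemble_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0  # single-node ensemble: the leader alone is a majority
    return float(sorted(follower_rtts_ms)[acks_needed - 1])


LOCAL_RTT_MS = 1    # assumed round trip within a region
REMOTE_RTT_MS = 70  # assumed round trip between regions

# Leader in region2 (three nodes): two local followers, two remote ones.
leader_in_region2 = quorum_commit_latency_ms(
    [LOCAL_RTT_MS, LOCAL_RTT_MS, REMOTE_RTT_MS, REMOTE_RTT_MS]
)

# Leader in region1 (two nodes): one local follower, three remote ones.
leader_in_region1 = quorum_commit_latency_ms(
    [LOCAL_RTT_MS, REMOTE_RTT_MS, REMOTE_RTT_MS, REMOTE_RTT_MS]
)

print(leader_in_region2, leader_in_region1)
```

Under these assumptions, a leader in the three-node region can reach a majority without leaving the region, while a leader in the two-node region pays the inter-region round trip on every write.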

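The key levers you control are the min.insync.replicas setting in Kafka and the ZooKeeper ensemble configuration. min.insync.replicas sets the minimum number of in-sync replicas (the leader included) that must be available for a write sent with acks=all to succeed; producers using acks=0 or acks=1 bypass this check. With a replication factor of 3 and min.insync.replicas=2 (as shown above), a partition keeps accepting writes with one replica down, as long as the other two (potentially in different regions) stay in sync. If only one in-sync replica remains, acks=all writes are rejected rather than risk data loss. This is crucial for durability in a multi-region setup.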
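The interaction between the producer’s acks setting and min.insync.replicas can be sketched as a small decision function. This is a simplified model of the broker’s behaviour, not Kafka code: the function name is invented, and it ignores details such as unclean leader election.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Simplified model of whether a partition leader accepts a produce request.

    isr_size counts the in-sync replicas, including the leader itself.
    """
    if isr_size < 1:
        return False  # no leader available, nothing can accept the write
    if acks == "all":
        # Only acks=all requests are checked against min.insync.replicas.
        return isr_size >= min_insync_replicas
    # acks=0 or acks=1: the leader alone is enough.
    return True


MIN_ISR = 2  # min.insync.replicas from the broker snippet

# Replication factor 3: the in-sync set shrinks as replicas fall behind or fail.
for isr in (3, 2, 1):
    print(f"ISR={isr}: acks=all -> {write_accepted('all', isr, MIN_ISR)}, "
          f"acks=1 -> {write_accepted('1', isr, MIN_ISR)}")
```

The last case is the one to watch in a stretch cluster: with one in-sync replica left, an acks=1 producer keeps writing to a single copy, while an acks=all producer gets an error and can retry once a second replica catches up.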

The most surprising thing is how much ZooKeeper’s latency budget dictates the feasibility of a stretch cluster. You can have a low-latency connection between Kafka brokers within a region, but if ZooKeeper nodes are far apart, the cluster’s control plane slows down, because every metadata write, every controller or partition leader election, and every partition reassignment has to be committed by a ZooKeeper quorum that spans regions. ZooKeeper’s timeouts are also multiples of tickTime, so the inter-region round trip has to fit comfortably within them. You’re essentially accepting higher latency in exchange for availability.
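The latency budget can be read straight off the ZooKeeper settings shown earlier, since initLimit and syncLimit are counted in ticks. The sketch below works out those budgets and the default session-timeout bounds (2 and 20 ticks when minSessionTimeout and maxSessionTimeout are not set). The requested session timeouts are example values chosen for illustration, and the helper name is hypothetical.

```python
TICK_TIME_MS = 2000  # tickTime from the ZooKeeper snippet
INIT_LIMIT = 10      # initLimit, in ticks
SYNC_LIMIT = 5       # syncLimit, in ticks

# Time a follower gets to connect to the leader and complete its initial sync.
init_timeout_ms = TICK_TIME_MS * INIT_LIMIT

# How far a follower may lag behind the leader before it is dropped.
sync_timeout_ms = TICK_TIME_MS * SYNC_LIMIT

# Default bounds ZooKeeper applies to client session timeouts.
min_session_timeout_ms = 2 * TICK_TIME_MS
max_session_timeout_ms = 20 * TICK_TIME_MS


def negotiated_session_timeout_ms(requested_ms: int) -> int:
    """Clamp a client's requested session timeout into the server's bounds."""
    return max(min_session_timeout_ms, min(requested_ms, max_session_timeout_ms))


print(init_timeout_ms, sync_timeout_ms)
# Example requests, e.g. from a broker's zookeeper.session.timeout.ms setting.
print(negotiated_session_timeout_ms(18000))
print(negotiated_session_timeout_ms(60000))
```

Raising tickTime or syncLimit buys tolerance for a slow inter-region link, at the cost of slower failure detection, so these values are worth revisiting against the measured round trip between your regions.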

The next concept to explore is how to handle client configurations for multi-region Kafka, specifically how producers and consumers decide which region to connect to.
