Kafka is surprisingly resilient, but true zero downtime during failures isn’t automatic; it requires deliberate design choices.

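Let’s look at a typical Kafka setup handling a constant stream of messages, say 10,000 messages per second, with a replication factor of 3. The summary below is illustrative rather than a literal Kafka config file: partitions, replication factor, and min.insync.replicas are topic-level settings, while acks is set on the producer.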

{
  "topic": "user-activity",
  "partitions": 10,
  "replication_factor": 3,
  "min_insync_replicas": 2,
  "acks": "all"
}

Producers are configured to wait for acknowledgments from all in-sync replicas:

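import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092,kafka-broker-3:9092");
props.put("acks", "all"); // Crucial for durability: wait for all in-sync replicas
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// try-with-resources closes the producer, which flushes buffered records first
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    for (int i = 0; i < 10000; i++) {
        String msg = "User event " + i;
        // send() is asynchronous; the callback surfaces failures that would otherwise go unnoticed
        producer.send(new ProducerRecord<>("user-activity", Integer.toString(i), msg),
            (metadata, exception) -> {
                if (exception != null) {
                    System.err.println("Send failed: " + exception);
                }
            });
    }
}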

Consumers are set up in a group to ensure parallel processing and fault tolerance:

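import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties consumerProps = new Properties();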
consumerProps.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092,kafka-broker-3:9092");
consumerProps.put("group.id", "user-activity-processor");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("auto.offset.reset", "earliest");
consumerProps.put("enable.auto.commit", "false"); // Manual commits for control

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Arrays.asList("user-activity"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
        // Process the record
    }
    consumer.commitSync(); // Commit offsets after successful processing
}

This setup addresses the core problem of data loss and service interruption when brokers fail. By replicating data across multiple brokers and requiring acknowledgments from the in-sync replicas, Kafka keeps acknowledged messages durably stored even if a broker goes offline. With acks=all, the leader waits for every replica currently in the in-sync set, and min.insync.replicas sets the floor on how small that set may be: a producer only receives a successful acknowledgment if the message has been written to at least that many replicas. This is the bedrock of durability, and high availability is built on top of it.
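
To make the arithmetic concrete, here is a minimal sketch in plain Java (no Kafka client involved) of the acceptance rule described above. The class and method names are hypothetical and only model the rule; in a real cluster the check happens inside the partition leader, which rejects an acks=all write when the in-sync replica set is smaller than min.insync.replicas.

```java
final class IsrWriteRule {
    /** Models whether a leader would accept an acks=all write for a given ISR size. */
    static boolean acceptsWrite(int inSyncReplicas, int minInSyncReplicas) {
        return inSyncReplicas >= minInSyncReplicas;
    }

    /** Number of replicas that can drop out of the ISR while acks=all writes still succeed. */
    static int tolerableFailures(int replicationFactor, int minInSyncReplicas) {
        return Math.max(0, replicationFactor - minInSyncReplicas);
    }

    public static void main(String[] args) {
        int replicationFactor = 3; // matches the example topic
        int minIsr = 2;            // matches min.insync.replicas in the example
        for (int failed = 0; failed <= replicationFactor; failed++) {
            int isr = replicationFactor - failed;
            System.out.printf("brokers down=%d, ISR=%d, write accepted=%b%n",
                    failed, isr, acceptsWrite(isr, minIsr));
        }
    }
}
```

With a replication factor of 3 and min.insync.replicas of 2, one broker can fail and writes continue. A second failure makes acks=all writes fail until a replica rejoins the in-sync set, which trades availability for durability.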

The design patterns for zero downtime revolve around ensuring that the loss of a single component (broker, network partition, or even a consumer instance) does not lead to a service outage for producers or consumers. This is achieved through:

  1. Replication: As seen with replication_factor: 3, data is copied across multiple brokers. If one broker fails, others have a complete copy.
  2. In-Sync Replicas: With acks: all, the leader acknowledges a write only after every replica currently in the in-sync set (ISR) has it, and min.insync.replicas: 2 sets the floor: if fewer than two replicas are in sync, the write is rejected rather than accepted with too few copies. This prevents an acknowledged message from living on a single broker that then fails and takes the data with it.
  3. Controller Election: Kafka has a controller responsible for managing partition state, leader elections, and broker registration. If the active controller fails, a new one is elected automatically, typically within seconds (through ZooKeeper in older deployments, or by the Raft-based controller quorum in KRaft mode). Existing partition leaders keep serving reads and writes during this window, but metadata changes, including new leader elections, wait until the new controller is active.
  4. Leader Election: For each partition, one replica is designated the leader. Producers and consumers communicate with the leader. If a leader broker fails, Kafka automatically promotes another in-sync replica to become the new leader. This failover is handled by the controller and is usually very quick.
  5. Consumer Rebalancing: When a consumer instance in a group fails or a new one joins, the group coordinator triggers a rebalance that redistributes partitions among the current members. Under the eager rebalance protocol, every consumer in the group stops fetching until the rebalance completes; cooperative assignors such as CooperativeStickyAssignor let consumers keep processing the partitions they retain. Either way, keeping rebalances short and infrequent is key.
  6. Idempotent Producers and Transactions: Retries after a failure can otherwise write the same message twice. Idempotent producers (enable.idempotence=true) let the broker discard duplicates caused by producer retries, so each message is written once per partition within a producer session. Transactions extend this to atomic writes across multiple partitions and topics, which is the basis for exactly-once processing in consume-transform-produce workflows.
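
The client-side patterns in the list above map onto a handful of producer and consumer settings. The sketch below collects them using only java.util.Properties, so it runs without a broker. The property names are standard Kafka client configs, but the class name and the specific values are illustrative assumptions, not tuned recommendations for this workload.

```java
import java.util.Properties;

final class ResilientClientConfigs {
    /** Producer settings that pair acks=all with idempotent retries. */
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("acks", "all");                        // required by idempotence
        p.put("enable.idempotence", "true");         // broker de-duplicates producer retries
        p.put("retries", Integer.toString(Integer.MAX_VALUE)); // bounded in practice by delivery.timeout.ms
        p.put("delivery.timeout.ms", "120000");      // assumed upper bound on send + retries
        p.put("max.in.flight.requests.per.connection", "5"); // idempotence requires 5 or fewer
        return p;
    }

    /** Consumer settings aimed at shorter, less disruptive rebalances. */
    static Properties consumerProps() {
        Properties c = new Properties();
        c.put("enable.auto.commit", "false");        // commit only after processing
        c.put("partition.assignment.strategy",
              "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        c.put("session.timeout.ms", "45000");        // assumed: how long before a silent consumer is evicted
        c.put("heartbeat.interval.ms", "3000");      // assumed: well below the session timeout
        c.put("max.poll.interval.ms", "300000");     // assumed: must exceed worst-case batch processing time
        return c;
    }

    public static void main(String[] args) {
        producerProps().forEach((k, v) -> System.out.println("producer " + k + "=" + v));
        consumerProps().forEach((k, v) -> System.out.println("consumer " + k + "=" + v));
    }
}
```

These maps could be merged into the props and consumerProps objects from the earlier listings with putAll. The timeout values interact, so they are best adjusted together rather than one at a time.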

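When producers use acks=all and min.insync.replicas=2, and a broker holding the leader for a partition goes down, the controller elects a new leader from the remaining in-sync replicas. With a replication factor of 3, two in-sync replicas remain, so writes are still accepted. Producers experience a brief interruption while they refresh metadata and discover the new leader, and their retries cover the gap. Consumers fetching from that partition pause in the same way until they find the new leader. A broker failure does not by itself trigger a consumer group rebalance, although the group has to locate a new coordinator if the failed broker was hosting it. The goal is that no acknowledged data is lost, and service resumes automatically.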
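
The failover rule in this paragraph can be illustrated with a toy model in plain Java. This is not how the controller is implemented; it is a hypothetical sketch (class and method names are invented here) of the idea that a new leader is chosen only from replicas that are both alive and in the in-sync set.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class LeaderFailoverSketch {
    /**
     * Picks the first replica, in assignment order, that is alive and in the ISR.
     * Returns empty when no such replica exists, meaning the partition stays
     * offline rather than promoting an out-of-sync replica.
     */
    static Optional<Integer> electLeader(List<Integer> assignedReplicas,
                                         Set<Integer> isr,
                                         Set<Integer> liveBrokers) {
        return assignedReplicas.stream()
                .filter(isr::contains)
                .filter(liveBrokers::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Integer> replicas = List.of(1, 2, 3); // broker ids, preferred leader first
        Set<Integer> isr = Set.of(1, 2, 3);        // all replicas were in sync
        Set<Integer> live = Set.of(2, 3);          // broker 1, the old leader, has failed
        System.out.println("new leader: " + electLeader(replicas, isr, live));
    }
}
```

In this run broker 2 takes over. If the in-sync set had contained only the failed broker, the method would return an empty result, which corresponds to the partition being unavailable until an in-sync replica comes back.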

The most surprising thing about Kafka’s high availability is how much it relies on a dedicated consensus layer for leader election and controller failover: ZooKeeper in older deployments, or the built-in KRaft controller quorum in newer versions. Without this coordination, the brokers would struggle to agree on the state of the cluster during failures, leading to potential data inconsistencies or prolonged outages.

The next challenge is understanding how network partitions within your cluster can lead to split-brain scenarios, and why keeping unclean.leader.election.enable=false prevents data loss in such edge cases, even at the cost of availability during the partition.
