Fix Kafka NotEnoughReplicasAfterAppendException (2026)

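The NotEnoughReplicasAfterAppendException means the partition leader appended a record to its log and then found that the in-sync replica set (ISR) had fewer members than the topic’s min.insync.replicas, so it could not guarantee the write’s durability. It is raised only for producers that use acks=all, and because the record is already in the leader’s log, producer retries can create duplicates. It typically happens when replicas drop out of the ISR, leaving fewer healthy, in-sync replicas for that partition than min.insync.replicas requires.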

Here’s a breakdown of common causes and how to fix them:

1. Broker(s) are Down or Unhealthy: The most frequent culprit is simply that one or more brokers expected to be part of the replica set are offline or unresponsive.

Diagnosis: Check the status of your Kafka brokers. On a ZooKeeper-based cluster, a quick way is to list the registered broker IDs with the ZooKeeper shell that ships with Kafka:
```
zookeeper-shell.sh <zookeeper_host>:2181 ls /brokers/ids
```
This shows the IDs of all registered brokers. Then check that these brokers are actually running and reachable. You can also use kafka-topics.sh --describe to see which brokers are listed as the leader and in-sync replicas for the affected partition. Any broker that appears under Replicas but is missing from the Isr list is likely the problem.
Fix: Bring the downed brokers back online. Ensure they are correctly configured (e.g., broker.id, zookeeper.connect, listeners) and can connect to ZooKeeper and other brokers. Once a broker restarts and rejoins the cluster, it fetches the data it missed from the partition leaders and is added back to the ISR as soon as it has caught up.
Why it works: With acks=all, the partition leader waits for every replica currently in the ISR to acknowledge a write, and it rejects writes while the ISR is smaller than min.insync.replicas. A broker that is down cannot stay in the ISR, so each failed broker shrinks the ISR of the partitions it hosts. Restoring the broker lets it catch up and rejoin the ISR.
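
The Isr check described above can be scripted. The following is a minimal sketch, assuming the usual "Topic: ... Partition: ... Replicas: ... Isr: ..." layout of kafka-topics.sh --describe output; the function name under_replicated is made up for this example and is not part of Kafka.

```shell
# Reads `kafka-topics.sh --describe` output on stdin and prints one line for
# each partition whose ISR is smaller than its assigned replica list.
under_replicated() {
  awk '
    {
      topic = ""; part = ""; replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:") topic = $(i + 1)
        else if ($i == "Partition:") part = $(i + 1)
        else if ($i == "Replicas:") replicas = $(i + 1)
        # An empty ISR is followed by the next label (or nothing), so skip labels.
        else if ($i == "Isr:" && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (part == "") next   # per-topic summary line, not a partition line
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      if (m < n)
        printf "%s-%s replicas=%s isr=%s missing=%d\n", topic, part, replicas, isr, n - m
    }
  '
}

# Example:
#   kafka-topics.sh --bootstrap-server <broker_host>:9092 --describe | under_replicated
```

Recent Kafka versions can also filter on the server side with kafka-topics.sh --describe --under-replicated-partitions; the function above is mainly useful when you want the missing-replica count per partition.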

2. Network Issues Between Brokers: Even if brokers are running, they might be unable to communicate with each other due to network partitions, firewall rules, or incorrect advertised.listeners configurations.

Diagnosis: From the affected broker, try to telnet to other brokers on their advertised listener ports (e.g., 9092).
```
telnet <other_broker_host> 9092
```
Also, examine broker logs for connection errors or timeouts when attempting to establish connections with peers. Ensure advertised.listeners in server.properties correctly reflects the network interface and port that other brokers can reach.
Fix: Correct network configurations. This might involve updating firewall rules, ensuring proper routing, or fixing the advertised.listeners property in server.properties on each broker to a resolvable and reachable address. For example, if brokers are in different subnets, advertised.listeners should point to the IP address accessible across subnets.
Why it works: Followers replicate by sending fetch requests to the partition leader. If the network path between them is broken, the followers fall behind, the leader removes them from the ISR, and the leader can no longer confirm that the data is replicated to the required number of in-sync replicas. Fixing the network path lets the followers fetch again, catch up, and rejoin the ISR.
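
To repeat the telnet check for every advertised listener, the endpoints can be pulled out of server.properties and probed in a loop. This is a rough sketch: the function names and the file path are hypothetical, and it assumes an nc build that supports -z and -w.

```shell
# Reads a server.properties file on stdin and prints one host:port per
# advertised listener, with the listener name and "://" stripped.
advertised_endpoints() {
  grep -E '^[[:space:]]*advertised\.listeners[[:space:]]*=' \
    | sed -E 's/^[^=]*=[[:space:]]*//' \
    | tr ',' '\n' \
    | sed -E 's#^[[:space:]]*[^:]+://##; s/[[:space:]]+$//'
}

# Reads host:port lines on stdin and attempts a TCP connect to each one
# with a 3-second timeout.
probe_endpoints() {
  while IFS= read -r endpoint; do
    host=${endpoint%:*}
    port=${endpoint##*:}
    if nc -z -w 3 "$host" "$port" </dev/null >/dev/null 2>&1; then
      echo "OK $endpoint"
    else
      echo "UNREACHABLE $endpoint"
    fi
  done
}

# Example (hypothetical path), run from a different broker:
#   advertised_endpoints < /etc/kafka/server.properties | probe_endpoints
```

Run the probe from the other brokers rather than from the broker that owns the file, since the question is whether its peers can reach the advertised address.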

3. Under-replicated Partitions Due to Topic Configuration: The topic’s replication factor might be set too high relative to the number of available brokers, or brokers have been removed without adjusting the topic’s replication factor.

Diagnosis: Describe the topic to see its replication factor and current ISRs.
```
kafka-topics.sh --bootstrap-server <broker_host>:9092 --describe --topic <your_topic_name>
```
Look at the ReplicationFactor and Isr columns. If ReplicationFactor is, say, 3, but the Isr list only contains 2 broker IDs, you have an under-replicated partition.
Fix: Either add brokers (or recover the failed ones) so that every partition has its full replica set, or decrease the topic’s replication factor to match the number of available, healthy brokers. kafka-topics.sh --alter cannot change the replication factor; instead, submit a partition reassignment whose JSON file lists the smaller replica set for each partition:
```
kafka-reassign-partitions.sh --bootstrap-server <broker_host>:9092 --reassignment-json-file <reassignment_file>.json --execute
```
Caution: Decreasing the replication factor deletes the replicas that are dropped from the assignment and lowers durability. It’s generally safer to add brokers or wait for failed brokers to recover.
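Why it works: The exception is triggered when the ISR size falls below min.insync.replicas, and a partition whose replica set includes dead or removed brokers has less headroom before that happens. For example, with a replication factor of 3, min.insync.replicas=2, and one assigned broker permanently gone, losing just one more replica blocks acks=all writes. Restoring the full replica set brings that headroom back. If you reduce the replication factor instead, make sure min.insync.replicas is no greater than the new factor, or writes will keep failing.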
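
Changing a replication factor is normally done with kafka-reassign-partitions.sh, which reads a JSON file with a version field and a partitions array of topic, partition, and replicas entries. The sketch below builds such a file from "partition replica-list" pairs; the function name reassignment_json and the topic and broker IDs in the example are illustrative only.

```shell
# Usage: reassignment_json <topic>
# Reads lines of the form "<partition> <comma-separated broker ids>" on stdin
# and prints a reassignment document on stdout.
reassignment_json() {
  awk -v topic="$1" '
    BEGIN { printf "{\"version\":1,\"partitions\":[" }
    NF == 2 {
      printf "%s{\"topic\":\"%s\",\"partition\":%d,\"replicas\":[%s]}", sep, topic, $1, $2
      sep = ","
    }
    END { print "]}" }
  '
}

# Example: shrink a two-partition topic to two replicas per partition.
#   printf '%s\n' '0 1,2' '1 2,3' | reassignment_json orders > reassignment.json
```

The first broker ID in each replicas list is the preferred leader, so it is worth spreading the first positions across brokers when you write the pairs.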

4. High Load and Broker Resource Constraints: When brokers are under heavy load (high CPU, memory, or disk I/O), followers may fall behind on replication. A follower that has not caught up with the leader within replica.lag.time.max.ms is removed from the ISR until it catches up again.

Diagnosis: Monitor broker resource utilization. Check CPU, memory, disk I/O, and network bandwidth on the Kafka brokers. Kafka broker logs are also crucial; look for messages indicating slow disk operations, long garbage collection pauses, or network buffer issues. JMX metrics give more detail: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions and name=IsrShrinksPerSec show when replicas are dropping out of the ISR, alongside the request latency and queue depth metrics.
Fix: Scale out your Kafka cluster by adding more brokers, or scale up the resources of existing brokers (more CPU, faster disks, more RAM). On the producer side, reduce the load you send (for example, by throttling throughput or enabling compression), and make sure request.timeout.ms and delivery.timeout.ms leave enough time for retries to succeed while the ISR recovers.
Why it works: If a broker is overloaded, its follower replicas may fetch too slowly to stay within replica.lag.time.max.ms, or the broker may miss its heartbeats and be treated as down. Either way, the number of in-sync replicas drops, which can trigger the exception. Providing more resources or reducing the load alleviates these bottlenecks.

5. Incorrect min.insync.replicas Configuration: The min.insync.replicas setting for the topic or the broker might be too high for the current cluster state.

Diagnosis: Check the topic’s configuration for min.insync.replicas.
```
kafka-configs.sh --bootstrap-server <broker_host>:9092 --describe --topic <your_topic_name>
```
Also, check the broker-level default in server.properties (min.insync.replicas). If the topic has a min.insync.replicas set, it will override the broker default. Ensure this value is less than or equal to the topic’s replication factor and the number of healthy, available brokers.
Fix: Lower the min.insync.replicas value. For a topic:
```
kafka-configs.sh --bootstrap-server <broker_host>:9092 --alter --topic <your_topic_name> --add-config min.insync.replicas=<new_lower_value>
```
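If you want to change the broker default, either update min.insync.replicas in server.properties on all brokers and restart them, or set it as a dynamic cluster-wide default with kafka-configs.sh --bootstrap-server <broker_host>:9092 --entity-type brokers --entity-default --alter --add-config min.insync.replicas=<new_lower_value>, which does not require a restart.
Why it works: For producers using acks=all, min.insync.replicas defines the minimum number of in-sync replicas that must be present for a write to be accepted. If this number is set to, say, 3, but only 2 replicas are in sync, the condition cannot be met, and the exception is thrown. Lowering the value allows writes to succeed with fewer in-sync replicas, at the cost of durability: with min.insync.replicas=1, a write can be acknowledged when only the leader holds it.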
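
The relationship between replication factor, min.insync.replicas, and the current ISR size is simple arithmetic, sketched below. check_min_isr is a hypothetical helper, not a Kafka tool; it takes the three numbers you read from the describe commands above.

```shell
# Usage: check_min_isr <replication_factor> <min_insync_replicas> <current_isr_size>
# Prints a one-line verdict; the exit status is 0 (ok), 1 (writes failing),
# or 2 (configuration can never be satisfied).
check_min_isr() {
  rf=$1
  min_isr=$2
  isr=$3
  if [ "$min_isr" -gt "$rf" ]; then
    echo "INVALID: min.insync.replicas=$min_isr exceeds replication factor $rf"
    return 2
  elif [ "$isr" -lt "$min_isr" ]; then
    echo "FAILING: ISR size $isr is below min.insync.replicas=$min_isr"
    return 1
  else
    echo "OK: can lose $((isr - min_isr)) more in-sync replica(s)"
    return 0
  fi
}

# Example: replication factor 3, min.insync.replicas=2, all replicas in sync.
#   check_min_isr 3 2 3
```

A common layout is a replication factor of 3 with min.insync.replicas=2, which tolerates one unavailable replica without rejecting acks=all writes.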

6. ZooKeeper Issues: This applies only to ZooKeeper-based clusters; Kafka 4.0 and later run in KRaft mode without ZooKeeper. While Kafka aims to be resilient, ZooKeeper is critical for cluster coordination in these deployments. If ZooKeeper is unhealthy, slow, or unavailable, it can indirectly cause brokers to be marked as down or prevent leader election, leading to ISR issues.

Diagnosis: Check the health of your ZooKeeper ensemble. Look for errors in ZooKeeper logs, and check the stat command output for each ZooKeeper node:
```
echo "stat" | nc <zookeeper_host> 2181
```
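Look for an excessively high max value on the Latency min/avg/max line, or a large Outstanding count (requests queued on the server). On recent ZooKeeper versions, stat must be listed in 4lw.commands.whitelist in the ZooKeeper configuration, otherwise the server rejects the command.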
Fix: Resolve ZooKeeper performance issues or bring downed ZooKeeper nodes back online. Ensure ZooKeeper has sufficient resources and network connectivity.
Why it works: ZooKeeper is responsible for maintaining the list of active brokers, controller leader, and partition leadership. If ZooKeeper is slow or down, Kafka brokers might not be able to update their status, leading to incorrect ISR lists and potential leadership issues that manifest as replication problems.
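
The stat check can be turned into a one-line summary per node. The sketch below assumes the "Latency min/avg/max:", "Outstanding:", and "Mode:" lines that ZooKeeper’s stat command prints; the function name zk_stat_summary and the default thresholds (100 ms, 10 outstanding requests) are arbitrary choices for illustration, not recommended limits.

```shell
# Usage: zk_stat_summary [max_latency_ms] [max_outstanding]
# Reads the output of ZooKeeper's "stat" command on stdin and prints
# OK or WARN followed by the values it found.
zk_stat_summary() {
  awk -v max_ms="${1:-100}" -v max_out="${2:-10}" '
    /^Latency min\/avg\/max:/ { split($3, lat, "/"); latency = lat[3] }
    /^Outstanding:/           { outstanding = $2 }
    /^Mode:/                  { mode = $2 }
    END {
      status = (latency + 0 > max_ms + 0 || outstanding + 0 > max_out + 0) ? "WARN" : "OK"
      printf "%s mode=%s max_latency_ms=%s outstanding=%s\n", status, mode, latency, outstanding
    }
  '
}

# Example:
#   echo "stat" | nc <zookeeper_host> 2181 | zk_stat_summary 100 10
```

Note that the max latency reported by stat is cumulative since the last reset, so a single old spike keeps the WARN status until the node’s statistics are reset or it restarts.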

After resolving these issues, you might encounter a LeaderNotAvailableException if the controller is still in the process of re-electing a leader for the affected partition, or if the cluster is still stabilizing.