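A NotEnoughReplicasException means a partition’s in-sync replica (ISR) set has fewer members than the topic’s min.insync.replicas setting, so the leader refuses a write that the producer sent with acks=all.
This usually happens because a broker is down, a replica is lagging significantly, or a network partition is preventing brokers from communicating. When a producer sends a message with acks=all (or -1), the leader waits for every replica currently in the ISR to acknowledge the write, and min.insync.replicas sets the smallest ISR size it will accept. If the ISR is already below that minimum, the broker rejects the request with this error; if the ISR shrinks after the record has been appended, the producer sees the related NotEnoughReplicasAfterAppendException.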
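On the broker side, the check behind this exception compares the size of the in-sync replica set with the min.insync.replicas setting. A minimal Python sketch of that rule follows; it is a model for illustration, not Kafka’s broker code, and the function name and parameters are hypothetical.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Model whether a partition leader would accept a produce request.

    Simplified illustration: only acks=all (alias -1) is subject to the
    min.insync.replicas check; acks=0 and acks=1 are not.
    """
    if acks in ("all", "-1"):
        return isr_size >= min_insync_replicas
    return True


# Replication factor 3, min.insync.replicas=2, one broker down: still accepted.
print(write_accepted("all", isr_size=2, min_insync_replicas=2))  # True
# A second broker drops out: this is the NotEnoughReplicasException case.
print(write_accepted("all", isr_size=1, min_insync_replicas=2))  # False
# acks=1 is not subject to the check.
print(write_accepted("1", isr_size=1, min_insync_replicas=2))    # True
```

The practical consequence is that the number of replicas you can lose before writes fail is the current ISR size minus min.insync.replicas.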
Here are the common causes and how to fix them:
Broker is Down or Unreachable
Diagnosis: Check the status of your Kafka brokers.
/opt/kafka/bin/kafka-topics.sh --bootstrap-server <your_broker_address>:9092 --describe --topic <your_topic_name>
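In the output, look for broker IDs that appear under Replicas but are missing from Isr, for a partition with no leader (shown as none or -1 depending on the version), or for connection errors from the command itself. Also, check your monitoring system for broker health.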
Fix: Identify the down broker and restart it. If it’s a persistent issue, investigate why it went down (e.g., disk full, OOM, network issues).
# On the machine where the broker is supposed to run
sudo systemctl start kafka
Why it works: Restarts the Kafka broker process, allowing it to rejoin the cluster and serve as a replica.
Replica Lagging Behind
Diagnosis:
Examine the Leader and Replicas fields for the affected partition, then compare them with the Isr (in-sync replicas) list. If the Isr list is shorter than the Replicas list while the leader is still present, some replicas are out of sync.
/opt/kafka/bin/kafka-topics.sh --bootstrap-server <your_broker_address>:9092 --describe --topic <your_topic_name>
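Look for partitions where the Isr list has fewer entries than the Replicas list; adding --under-replicated-partitions to the describe command prints only those partitions. You can also watch replica health through JMX, for example the kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions metric. Note that kafka-consumer-groups.sh reports consumer lag, not replica lag, so it does not help here.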
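When a topic has many partitions, comparing the two lists by eye gets tedious. The following Python sketch parses saved describe output and reports which broker IDs are missing from the ISR of each partition. The row layout in the comment is an assumption about the tool’s output, and the helper names are hypothetical.

```python
import re

# Assumed shape of a partition row in `kafka-topics.sh --describe` output, e.g.
#   Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2
# Field order and spacing can vary between Kafka versions.
ROW = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_ids(csv: str) -> list[int]:
    """Turn a comma-separated broker list such as '1,2,3' into [1, 2, 3]."""
    return [int(x) for x in csv.split(",") if x]


def under_replicated(describe_output: str) -> list[tuple[str, int, list[int]]]:
    """Return (topic, partition, missing broker ids) for rows where Isr < Replicas."""
    result = []
    for line in describe_output.splitlines():
        m = ROW.search(line)
        if not m:
            continue  # topic summary line or unrelated output
        replicas = parse_ids(m["replicas"])
        isr = set(parse_ids(m["isr"]))
        missing = [b for b in replicas if b not in isr]
        if missing:
            result.append((m["topic"], int(m["partition"]), missing))
    return result
```

To use it, redirect the describe command’s output to a file, read the file into a string, and pass it to under_replicated. A broker ID that shows up as missing across many partitions points at that broker rather than at a single partition.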
Fix:
Allow the lagging replicas time to catch up. If a replica is consistently lagging, investigate the broker’s performance, disk I/O, network, or Kafka configuration (replica.lag.time.max.ms). If temporary network blips are pushing replicas out of the ISR too easily, you can raise replica.lag.time.max.ms.
# In server.properties on every broker (the partition leader enforces this setting)
# Raised to 60 seconds; the default is 30000 since Kafka 2.5 and 10000 in earlier versions.
# Keep comments on their own lines: a trailing "# ..." would become part of the value.
replica.lag.time.max.ms=60000
Why it works: This setting defines how long a follower can go without catching up to the leader’s log end before the leader removes it from the ISR. Increasing it gives replicas a longer grace period after temporary issues, at the cost of slower detection of replicas that are genuinely stuck.
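The effect of the timeout can be shown with a small model. This is a simplified Python sketch of the membership rule, not Kafka’s implementation: the function and parameter names are hypothetical, and the real broker evaluates the condition on a periodic schedule, so removal is not instantaneous.

```python
def still_in_sync(now_ms: int, last_caught_up_ms: int, replica_lag_time_max_ms: int) -> bool:
    """A follower stays in the ISR while its last full catch-up is recent enough."""
    return now_ms - last_caught_up_ms <= replica_lag_time_max_ms


# A follower that last caught up with the leader 45 seconds ago:
print(still_in_sync(45_000, 0, 30_000))  # False: dropped with a 30 s limit
print(still_in_sync(45_000, 0, 60_000))  # True: kept with a 60 s limit
```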
Network Partition
Diagnosis:
Check broker logs for messages indicating connection failures or leader election events that happen frequently. Use ping or traceroute from affected brokers to other brokers in the cluster.
ping <other_broker_ip>
traceroute <other_broker_ip>
Look for packet loss or high latency.
Fix: Resolve the underlying network issue. This could involve checking firewalls, router configurations, or physical network connectivity.
# Example: If a firewall is blocking port 9092 between brokers
sudo firewall-cmd --zone=public --add-port=9092/tcp --permanent
sudo firewall-cmd --reload
Why it works: Restores communication between brokers, so followers can fetch from their leaders again and rejoin the ISR.
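ping and traceroute confirm that a host is reachable, but not that the Kafka port is open. A short Python sketch of a TCP-level check is shown here; the function name is hypothetical, and port 9092 is only the example port used above, so substitute your actual listener port.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run it from each broker against every other broker, for example can_connect("<other_broker_ip>", 9092). A host that answers ping but fails this check usually points at a firewall rule or a listener that is bound to a different interface.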
Insufficient Number of Brokers for Topic Replication Factor
Diagnosis: Check the replication factor of the topic and the total number of active brokers in your cluster.
/opt/kafka/bin/kafka-topics.sh --bootstrap-server <your_broker_address>:9092 --list
/opt/kafka/bin/kafka-topics.sh --bootstrap-server <your_broker_address>:9092 --describe --topic <your_topic_name>
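If the replication factor is 3 but only 2 brokers are alive, every partition has at most 2 in-sync replicas. Writes with acks=all then fail with NotEnoughReplicasException whenever min.insync.replicas is 3, and with min.insync.replicas=2 they start failing as soon as one more broker goes down.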
Fix: Increase the number of active brokers in your cluster by starting more Kafka broker instances. Alternatively, reduce the replication factor of the affected topic if having fewer replicas is acceptable for your use case.
# To reduce replication factor (use with caution, data loss is possible if not handled properly)
# Note: --execute takes --reassignment-json-file; --topics-to-move-json-file belongs to --generate
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server <your_broker_address>:9092 --reassignment-json-file reassign.json --execute
Where reassign.json might look like this, with the replicas reduced from [1, 2, 3] to [1, 2] (JSON does not allow comments, so keep notes out of the file):
{
  "version": 1,
  "partitions": [
    {
      "topic": "<your_topic_name>",
      "partition": 0,
      "replicas": [1, 2]
    }
  ]
}
Why it works: Ensures that the cluster has enough healthy brokers to meet the replication requirements of the topic.
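Writing the reassignment file by hand is error-prone once a topic has more than a few partitions. The Python sketch below generates it from a mapping of partition numbers to replica lists. The function name and the sample assignment are hypothetical; in practice you would fill the mapping from the describe output.

```python
import json


def shrink_replicas(topic: str, assignment: dict[int, list[int]], new_rf: int) -> str:
    """Build reassignment JSON that keeps the first new_rf replicas of each partition.

    The first replica in each list is the preferred leader, so trimming
    from the end leaves preferred leadership unchanged.
    """
    if new_rf < 1:
        raise ValueError("replication factor must be at least 1")
    partitions = [
        {"topic": topic, "partition": p, "replicas": replicas[:new_rf]}
        for p, replicas in sorted(assignment.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions}, indent=2)


# Hypothetical current assignment: partition number -> replica broker ids.
current = {0: [1, 2, 3], 1: [2, 3, 1], 2: [3, 1, 2]}
print(shrink_replicas("<your_topic_name>", current, new_rf=2))
```

Trimming from the end is a simplification. Before executing the plan, confirm that the replicas you keep are in the ISR, since dropping the only in-sync copies is how data gets lost.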
Controller Leader Election Issues
Diagnosis: Check Kafka broker logs for frequent "Controller shuffle" or "leader election" messages. This can indicate instability in the controller, which manages partition leadership.
grep "leader election" /opt/kafka/logs/server.log
Look for a high frequency of these events.
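A raw grep shows the matching lines but not how often they occur. The Python sketch below groups matches by hour so bursts stand out. It assumes each log line starts with a bracketed timestamp such as [2024-05-01 13:45:10,123], which is the usual layout of Kafka’s default logging pattern; adjust the regular expression if your logging configuration differs. The function name is hypothetical.

```python
import re
from collections import Counter

# Captures the date and hour from a leading "[YYYY-MM-DD HH:MM:SS" timestamp.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}")


def events_per_hour(lines, needle: str = "leader election") -> Counter:
    """Count lines containing `needle` (case-insensitive), grouped by hour."""
    counts = Counter()
    for line in lines:
        if needle.lower() not in line.lower():
            continue
        m = TIMESTAMP.match(line)
        if m:
            counts[m.group(1) + ":00"] += 1
    return counts


# Example usage against the log file from the grep command above:
# with open("/opt/kafka/logs/server.log") as f:
#     for hour, n in sorted(events_per_hour(f).items()):
#         print(hour, n)
```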
Fix: On ZooKeeper-based clusters, ensure your ZooKeeper ensemble is healthy and accessible, since the controller relies on it for coordination (clusters running in KRaft mode use a controller quorum instead of ZooKeeper). Restarting the controller broker (if you can identify it) or, as a last resort, the entire Kafka cluster might be necessary.
# Ensure ZooKeeper is running and accessible
sudo systemctl status zookeeper
Why it works: A stable controller is essential for electing partition leaders and managing replica states. On ZooKeeper-based clusters, ZooKeeper health is paramount for this.
Under-provisioned Resources on Brokers
Diagnosis: Monitor CPU, memory, disk I/O, and network usage on your Kafka brokers. High utilization can cause replicas to fall behind or become unresponsive.
# On a broker
top
iostat -xz 1
iftop
Look for sustained high CPU, low free memory, high disk wait times (%iowait), or saturated network interfaces.
Fix:
Scale up your broker hardware (more CPU, RAM, faster disks) or optimize Kafka configurations (e.g., num.io.threads, num.network.threads). If disk I/O is the bottleneck, consider using faster SSDs.
# In server.properties (comments go on their own lines; a trailing "# ..." would become part of the value)
# Increased from the default of 8
num.io.threads=16
# Increased from the default of 3
num.network.threads=16
Why it works: Provides sufficient resources for brokers to process incoming requests, replicate data, and maintain connections with other brokers.
After resolving these issues, you might still encounter TimeoutException if your producer’s request.timeout.ms (or the overall delivery.timeout.ms) is too low for the cluster to recover and acknowledge messages.