The NATS server is stuck in a "Leader Election In Progress" state because the Raft consensus group cannot agree on a leader.

This typically happens when a majority of nodes in a Raft group are unavailable, or when network partitions prevent nodes from communicating with each other. The Raft algorithm requires a quorum (more than half) of nodes to be available and able to communicate to elect a leader and make progress. When this quorum is lost, the system enters a degraded state and will remain in "Leader Election In Progress" until a quorum can be re-established.
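The quorum rule above can be expressed in a few lines. This is a minimal sketch of the "more than half" arithmetic only; the function names `quorum_size` and `has_quorum` are illustrative and not part of NATS.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest node count that is strictly more than half of the group."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True when enough nodes are reachable for a leader to be elected."""
    return reachable_nodes >= quorum_size(cluster_size)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(f"{size} nodes -> quorum of {quorum_size(size)}")
```

Note that a 4-node group needs 3 nodes, the same as a 5-node group, which is why odd cluster sizes are the usual choice.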

Here are the most common causes and how to address them:

1. Insufficient Node Availability

Diagnosis: Check the status of every node in the Raft group.

Command: On each server, query the HTTP monitoring endpoint, for example curl http://localhost:<port>/jsz (replace <port> with the monitoring port, which nats-server opens when started with -m <port> or configured with http_port). In the meta_cluster section of the response, check whether a leader is reported. A healthy node is either the leader or a follower. A node that remains a candidate, or several nodes that are stopped or shutting down, indicates a problem.

Cause: A majority of nodes in the Raft group are not running or are not accessible. For a 3-node cluster, at least 2 must be up. For a 5-node cluster, at least 3 must be up.

Fix: Start or restart the missing or crashed NATS server instances. Ensure they are configured with the correct cluster endpoints so they can communicate with each other.

Why it works: Restoring the minimum required number of active nodes allows the Raft group to achieve a quorum, enabling a leader to be elected.
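Checking every node by hand gets tedious, so the status check can be scripted. The sketch below assumes each server exposes the HTTP monitoring port and that the /jsz response nests the leader name under `meta_cluster.leader`; verify both against your nats-server version. The helper names and any URLs you pass in are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_jsz(base_url, timeout=3.0):
    """Fetch /jsz from one server's monitoring port, e.g. http://node1:8222."""
    with urlopen(f"{base_url}/jsz", timeout=timeout) as resp:
        return json.load(resp)


def reported_leader(jsz):
    """Return the meta leader this server reports, or None if it reports none."""
    # Assumption: the leader name is found at meta_cluster.leader.
    meta = jsz.get("meta_cluster") or {}
    return meta.get("leader") or None


def leaders_seen(base_urls, timeout=3.0):
    """Map each monitoring URL to its reported leader, or 'unreachable'."""
    result = {}
    for url in base_urls:
        try:
            result[url] = reported_leader(fetch_jsz(url, timeout))
        except (OSError, ValueError):
            # Connection errors and timeouts are OSError; bad JSON is ValueError.
            result[url] = "unreachable"
    return result
```

If every reachable server maps to None, no leader is known anywhere, and counting the 'unreachable' entries shows whether a quorum is even possible.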

2. Network Partitions

Diagnosis: Verify network connectivity between all NATS cluster nodes.

Command: On one NATS server, run nc -vz <other_node_ip> <cluster_port> against every other node in the cluster. Replace <other_node_ip> with the IP address of another NATS server and <cluster_port> with the port specified in cluster.listen in your NATS configuration. Repeat from each server, since a partition can block traffic in one direction only.

Cause: Firewalls, routing issues, or other network problems are preventing nodes from communicating with each other on their cluster ports.

Fix: Adjust firewall rules (iptables, ufw, cloud security groups) to allow traffic on the cluster.listen port between all NATS cluster members. Ensure routing is correctly configured.

Why it works: Restoring network connectivity allows nodes to exchange heartbeats and Raft messages, enabling them to discover each other and participate in leader election.
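When nc is not installed, the same TCP reachability test can be done with the Python standard library. This is a rough equivalent of nc -vz, not a NATS-aware check: it only confirms that the port accepts connections. The function names and the peer list you supply are illustrative.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, and resolution failures all land here.
        return False


def unreachable_peers(peers, timeout=3.0):
    """Return the (host, port) pairs from peers that do not accept connections."""
    return [(host, port) for host, port in peers if not can_connect(host, port, timeout)]
```

For example, `unreachable_peers([("10.0.0.2", 6222), ("10.0.0.3", 6222)])` run on one node lists the peers whose cluster port that node cannot reach (the addresses here are placeholders).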

3. Incorrect Cluster Configuration

Diagnosis: Review the cluster section of your nats-server.conf file on all nodes. Config Snippet:

cluster {
  listen = "0.0.0.0:6222"
  routes = [
    "nats://node1.example.com:6222",
    "nats://node2.example.com:6222",
    "nats://node3.example.com:6222"
  ]
}

Cause: The routes array is misconfigured, pointing to incorrect IPs/hostnames or ports, or missing nodes entirely. This prevents nodes from discovering and connecting to each other.

Fix: Ensure the routes array on each server lists the correct cluster.listen addresses of all other servers in the cluster.

Why it works: Correctly configured routes ensure that each NATS server knows how to reach its peers, facilitating the discovery and communication necessary for Raft consensus.
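A quick way to catch a missing or mistyped route is to compare each server's routes list against the peers you expect. The sketch below assumes you have already copied the route URLs out of the config file (it does not parse nats-server.conf itself), and the helper names are illustrative.

```python
from urllib.parse import urlsplit


def route_endpoints(routes):
    """Turn route URLs such as nats://host:6222 into a set of (host, port) pairs."""
    endpoints = set()
    for route in routes:
        parts = urlsplit(route)
        if parts.hostname is None or parts.port is None:
            raise ValueError(f"route is missing a host or port: {route!r}")
        endpoints.add((parts.hostname, parts.port))
    return endpoints


def missing_routes(routes, expected_peers):
    """Return the expected (host, port) peers that no route points at."""
    return sorted(set(expected_peers) - route_endpoints(routes))
```

Running `missing_routes` with the routes from each server's config and the same list of expected peers highlights the server whose configuration has drifted.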

4. Stale Raft State

Diagnosis: If nodes have been down for an extended period and are then brought back up, their Raft state may be out of sync with the rest of the cluster.

Command: Through the monitoring port (<port>) of a recently restarted node, look up the Raft term and log index it reports, then compare these values across nodes. The endpoint and field names that expose them depend on the nats-server version. Significantly different values indicate a potential state divergence.

Cause: A node may have data that is too old or too new compared to the majority of the cluster after a prolonged outage. Raft relies on consistent terms and indices for agreement.

Fix:

* Option A (full reset, for a consistent starting state): Stop all NATS servers in the cluster. Delete the Raft state from the JetStream storage directory (store_dir) of every server. The on-disk layout depends on the nats-server version, so confirm which files hold Raft state before deleting anything. Restart all servers. They will form a new cluster and elect a leader from scratch. Warning: This will lose any state stored in JetStream.

* Option B (if JetStream persistence is critical and you suspect a specific node is bad): Identify the node with the most up-to-date term and index. Stop all other nodes. Restart the "good" node first, then bring up the others. If they still don't elect a leader, you may need to reset them as in Option A.

Why it works: Deleting the Raft state forces each server to start clean, allowing a new, consistent consensus to be formed. Restarting the most up-to-date node first gives it a better chance to be elected leader and bring the others up to speed.
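For Option B, "most up-to-date" can be made concrete by ordering nodes on term first and index second, which mirrors how Raft compares logs. This is a sketch under the assumption that you have collected a (term, index) pair for each node by hand; the node names and numbers in the example are made up.

```python
def restart_order(nodes):
    """Order node names from most to least up-to-date.

    nodes maps a node name to its (term, index) pair. Tuples compare by
    term first and by index only when the terms are equal.
    """
    if not nodes:
        raise ValueError("no nodes given")
    return sorted(nodes, key=lambda name: nodes[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical values gathered from three servers.
    observed = {"n1": (7, 120), "n2": (8, 95), "n3": (8, 101)}
    print(restart_order(observed))
```

Here n1 has the highest index but an older term, so it sorts last; the first name in the result is the node to restart first.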

5. Resource Exhaustion on Nodes

Diagnosis: Check system resources (CPU, memory, disk I/O) on the NATS servers.

Command: top, htop, iostat, free -m on the server.

Cause: A NATS server might be overloaded with too many connections, messages, or other operations, preventing its Raft process from running efficiently or responding to cluster communications. This can cause it to appear unavailable to other nodes.

Fix: Optimize NATS configuration (e.g., connection limits, message throughput), scale up server resources (CPU, RAM), or distribute the load across more servers.

Why it works: Ensuring NATS servers have sufficient resources allows the Raft protocol to execute its critical, time-sensitive operations without being starved by other system processes.

6. DNS Resolution Issues

Diagnosis: Verify that all NATS cluster nodes can resolve each other's hostnames correctly.

Command: On one NATS server, run ping <other_node_hostname> and dig <other_node_hostname> for every other node in the cluster, then repeat on the remaining servers to confirm that they all resolve the same addresses.

Cause: If you are using hostnames in your cluster configuration and DNS is not resolving consistently or correctly across your network, nodes will not be able to establish connections.

Fix: Ensure your DNS server is functioning correctly and that all NATS cluster nodes have access to it. Alternatively, use static IP addresses in the routes configuration.

Why it works: Reliable DNS resolution is crucial for nodes to translate hostnames into IP addresses, which is necessary for establishing the network connections required for Raft communication.
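The resolution check can also be scripted so that it is easy to run on every node and compare the output. This sketch uses the system resolver through the Python standard library, so it reflects what the host itself sees (including /etc/hosts), which may differ from what dig reports. The function names are illustrative.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses the system resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except (socket.gaierror, UnicodeError):
        # Resolution failed or the name is not a valid hostname.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def unresolved_hosts(hostnames):
    """Return the hostnames that resolve to no address at all."""
    return [name for name in hostnames if not resolve(name)]
```

Running `resolve` for each hostname in the routes list on every server, and comparing the results side by side, exposes both failed lookups and nodes that resolve the same name to different addresses.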

After resolving these issues, leader election should complete and the NATS servers will leave the "Leader Election In Progress" state. If you were previously experiencing JetStream issues, the next errors you are likely to see are JetStream stream or consumer operations that fail as a result of the earlier leader election disruption.
