Your NATS cluster is failing because one or more of your cluster members can’t find each other over the network.
Here’s what’s actually going on: NATS cluster members communicate via a gossip protocol. Each member maintains a list of other members it expects to see. When a member hasn’t heard from another member for a certain period (defined by server_timeout), it declares that member "not found" and removes it from its internal cluster view. This can cascade, leading to a fragmented or completely offline cluster.
Common Causes and Fixes
1. Network Unreachability (Most Common)
- Diagnosis: From one NATS server, try to `ping` or `telnet` to the IP address and port of another NATS server that’s reporting as "not found."

      ping <other_server_ip>
      telnet <other_server_ip> 6222  # Default cluster port

- Fix:
  - Firewall: If `ping` or `telnet` fails, check your firewalls (e.g., `ufw`, `firewalld`, cloud security groups). Ensure port `6222` (or your configured cluster port) is open between all NATS server IPs.

        # Example for ufw on Debian/Ubuntu
        sudo ufw allow from <other_server_ip> to any port 6222 proto tcp
        sudo ufw reload

    Why it works: This explicitly allows TCP traffic on the NATS cluster port, enabling the gossip protocol to establish connections.
  - Routing: Verify that the IP addresses listed in your NATS server configurations are indeed reachable from each other. This might involve checking `/etc/hosts` files, DNS, or cloud provider network configurations. Why it works: NATS servers need to resolve and connect to the actual network interfaces of their peers. Incorrect routing prevents this.
  - Container/VM Networking: If running in Docker or Kubernetes, ensure the network overlay allows direct communication between pods/containers. For Docker, check `docker network inspect <network_name>`. For Kubernetes, ensure your CNI (Container Network Interface) is correctly configured. Why it works: The container orchestrator’s network might be isolating NATS nodes, preventing them from seeing each other’s cluster ports.
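The reachability test above can also be scripted, which helps when there are many peers to check. The following is a minimal sketch, assuming plain TCP on the cluster port; `can_reach` is a hypothetical helper name, not part of NATS, and a successful connection only shows that the port is open, not that a healthy NATS route exists.

```python
import socket


def can_reach(host: str, port: int = 6222, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    6222 is the default cluster port mentioned above; pass your own port
    if the cluster is configured differently.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Calling `can_reach("<other_server_ip>")` from each server in turn gives a quick picture of which pairs of nodes cannot talk to each other.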
2. Incorrect Cluster Configuration
- Diagnosis: Examine the `cluster` section of your `nats-server` configuration file (the one passed with `nats-server -c nats.conf`). Pay close attention to the `listen` and `routes` (or `join`) directives.

      # nats.conf example
      cluster {
        listen: "0.0.0.0:6222"
        routes = [
          "nats://node1.example.com:6222",
          "nats://node2.example.com:6222",
          "nats://node3.example.com:6222"
        ]
        # or for dynamic joining:
        # join_as: "node4.example.com:6222"
        # connect_timeout: 1s
        # server_timeout: 5s
      }

- Fix:
  - `listen` Address: Ensure the `listen` address in the cluster configuration is an IP address or interface that other servers can reach. Using `0.0.0.0` (all interfaces) is common, but it only helps if at least one of the server’s interfaces is reachable by its peers. If a server has multiple IPs, ensure it’s listening on the one advertised to the cluster. Why it works: The `listen` address determines the cluster endpoint that peers connect to. If it’s not reachable by others, they can’t connect.
  - `routes` (Static) or `join` (Dynamic):
    - Static `routes`: Each server’s `routes` list should contain the cluster listen addresses of the other expected cluster members. NATS can learn additional peers through gossip, but a server with no working route to any existing member will never connect. Why it works: The `routes` list is the explicit instruction set for which peers to connect to.
    - Dynamic `join`: If using `join` or `join_as` with a list of known peers, ensure the listed addresses are correct and resolvable/reachable. If a server cannot reach any of the join addresses, it won’t be able to discover the cluster. Why it works: The `join` mechanism relies on discovering at least one existing cluster member to bootstrap its connection.
  - Conflicting Configurations: Ensure `cluster` configurations are consistent across all servers. Mismatched ports or incorrect peer addresses prevent routes from forming. Why it works: The gossip protocol assumes a shared understanding of the cluster topology.
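Comparing `routes` lists by eye gets error-prone once there are more than a few servers. As a rough sketch, the check can be done with a small script; `route_endpoints` and `missing_peers` are hypothetical helper names, and the sketch assumes you have already extracted each server’s route URLs from its configuration file by hand.

```python
from urllib.parse import urlsplit


def route_endpoints(routes):
    """Turn route URLs such as nats://host:6222 into a set of (host, port) pairs."""
    endpoints = set()
    for url in routes:
        parts = urlsplit(url)
        endpoints.add((parts.hostname, parts.port))
    return endpoints


def missing_peers(own_endpoint, cluster_endpoints, routes):
    """Return expected peers (excluding this server) that are absent from routes."""
    expected = set(cluster_endpoints) - {own_endpoint}
    return sorted(expected - route_endpoints(routes))
```

For example, with the three `nodeN.example.com` servers from the configuration above, calling `missing_peers` for node1 with a `routes` list that only names node2 reports node3 as missing. A missing entry is not necessarily fatal, but it is worth reviewing.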
3. High Server Load / Resource Exhaustion
- Diagnosis: Check the CPU, memory, and network I/O of your NATS servers. Use tools like `top`, `htop`, `vmstat`, `iostat`, or cloud provider monitoring. Look for sustained high CPU usage (>90%) or memory pressure.
- Fix:
  - Increase Resources: Provide more CPU, RAM, or network bandwidth to the NATS server instances. Why it works: NATS needs sufficient resources to process incoming connections, handle client traffic, and perform its gossip protocol updates. Starvation causes it to drop connections or fail to respond.
  - Optimize Client Traffic: If high load is due to client activity, consider scaling your NATS cluster horizontally (adding more servers) or optimizing your publisher/subscriber logic to reduce message volume or processing time. Why it works: Reducing the overall load on individual NATS servers gives them the headroom to maintain cluster peer connections.
  - Adjust `server_timeout`: In extreme cases, you might temporarily increase `server_timeout` (e.g., from `5s` to `10s`) in the `cluster` configuration.

        cluster {
          server_timeout: 10s
        }

    Why it works: This gives servers more leeway before declaring a peer "not found," which can help in highly latent or congested networks, but it masks underlying problems.
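For a quick scripted version of the load check, one rough proxy is the system load average divided by the number of CPUs. This is only a sketch: load average is not the same thing as CPU utilization, the 0.9 threshold simply mirrors the 90% figure above, and the function names are hypothetical.

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def looks_overloaded(load_avg: float, cpu_count: int, threshold: float = 0.9) -> bool:
    """Return True if the per-CPU load exceeds the threshold."""
    return load_per_cpu(load_avg, cpu_count) > threshold


def current_host_overloaded(threshold: float = 0.9) -> bool:
    """Apply the check to this machine (os.getloadavg is Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return looks_overloaded(one_minute, os.cpu_count() or 1, threshold)
```

A single reading says little; the diagnosis above is about sustained load, so sample it repeatedly before drawing conclusions.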
4. DNS Resolution Issues
- Diagnosis: If your `routes` or `join` configuration uses hostnames (e.g., `node1.example.com`), try resolving those hostnames from the other NATS servers.

      dig node1.example.com
      nslookup node1.example.com

  Also, check `/etc/resolv.conf` on the NATS servers.
- Fix:
  - Correct DNS Records: Ensure DNS records for your NATS server hostnames point to the correct, reachable IP addresses. Why it works: NATS servers rely on DNS to find the IP addresses of their peers. Incorrect DNS means they’re trying to connect to the wrong places.
  - DNS Server Availability: Verify that the DNS servers listed in `/etc/resolv.conf` are themselves reachable and responsive. Why it works: If the NATS server can’t ask DNS where its peers are, it can’t connect.
  - Use IP Addresses: As a workaround or for simpler setups, use direct IP addresses in your `routes` or `join` configuration instead of hostnames. Why it works: Bypasses DNS entirely, ensuring NATS uses the IP directly.
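The same resolution check can be run through the system resolver from a script, which exercises `/etc/hosts` and `/etc/resolv.conf` the way most applications on the host do. A minimal sketch, with `resolve_host` as a hypothetical helper name:

```python
import socket


def resolve_host(hostname: str) -> list:
    """Return sorted unique IP addresses for hostname, or [] if resolution fails."""
    try:
        # proto=IPPROTO_TCP avoids duplicate entries for other socket types.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Running `resolve_host("node1.example.com")` on every server and comparing the results shows whether all nodes agree on where a peer lives. An empty list means the name did not resolve on that host.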
5. Incorrect Cluster Port Configuration
- Diagnosis: Verify the `listen` port specified in the `cluster` section of the NATS server configuration matches the port expected by other servers. The default is `6222`.
- Fix: Ensure the `cluster.listen` port is identical across all servers intended to be in the same cluster, and that this port is used in the `routes` or `join` configuration of other servers.

      # nats.conf on all servers
      cluster {
        listen: "0.0.0.0:6222"  # Or a specific IP
        # ... routes/join ...
      }

  Why it works: NATS servers must agree on which port to use for inter-server communication. Mismatched ports mean they’re trying to connect to different services.
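A port typo in one route URL is easy to miss. As an illustrative sketch, the route URLs can be scanned for any port that differs from the agreed cluster port; `mismatched_routes` is a hypothetical helper, and URLs with no explicit port are flagged for review rather than assumed correct.

```python
from urllib.parse import urlsplit


def mismatched_routes(routes, cluster_port=6222):
    """Return route URLs whose port differs from cluster_port or is missing."""
    # urlsplit(...).port is None when the URL carries no explicit port.
    return [url for url in routes if urlsplit(url).port != cluster_port]
```

Given `"nats://node1.example.com:6222"` and `"nats://node2.example.com:4222"`, only the second URL is returned, since `4222` does not match the cluster port.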
6. NTP / Time Synchronization Issues
- Diagnosis: Check the system time on all your NATS servers. Use `date` on each server.
- Fix: Ensure all servers in the cluster are synchronized to the same Network Time Protocol (NTP) source.

      # Example command to check NTP status on systemd systems
      timedatectl status

  Why it works: While NATS itself doesn’t strictly require perfect time sync for basic operation, significant time skew can sometimes interfere with TLS handshakes (if used) or other network-level operations that indirectly impact peer discovery and keep-alives.
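If you collect the clock reading from each server at roughly the same moment (for example as epoch seconds), the skew can be computed rather than eyeballed. This is a sketch under that assumption; the function names and the one-second tolerance are illustrative choices, not values NATS prescribes.

```python
import statistics


def max_clock_skew(timestamps):
    """Return the largest difference, in seconds, between any two readings."""
    values = list(timestamps.values())
    if len(values) < 2:
        return 0.0
    return float(max(values) - min(values))


def out_of_sync(timestamps, tolerance=1.0):
    """List servers whose reading differs from the median by more than tolerance."""
    median = statistics.median(timestamps.values())
    return sorted(
        name for name, ts in timestamps.items() if abs(ts - median) > tolerance
    )
```

Here `timestamps` maps a server name to its reading, such as `{"node1": 1700000000.0, "node2": 1700000000.4}`. Readings taken seconds apart will show artificial skew, so the collection step matters as much as the arithmetic.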
Once these are resolved, the next error you’ll likely encounter is a Slow Consumer if your message processing can’t keep up.