Your NATS cluster is failing because one or more of your cluster members can’t find each other over the network.

Here’s what’s actually going on: NATS servers in a cluster connect to each other over route connections and gossip the addresses of the members they know about, so the cluster forms a full mesh. Each server keeps its routes alive with periodic pings. When a peer stops answering those pings (governed by settings such as ping_interval and ping_max) or the route connection drops, the server treats that peer as gone and removes it from its internal cluster view. This can cascade, leading to a fragmented or completely offline cluster.

Common Causes and Fixes

1. Network Unreachability (Most Common)

  • Diagnosis: From one NATS server, try to ping or telnet to the IP address and port of another NATS server that’s reporting as "not found."
    ping <other_server_ip>
    telnet <other_server_ip> 6222 # Default cluster port
    
  • Fix:
    • Firewall: If telnet fails, check your firewalls (e.g., ufw, firewalld, cloud security groups). A failed ping on its own is not conclusive, because many networks block ICMP while still allowing TCP. Ensure port 6222 (or your configured cluster port) is open between all NATS server IPs.
      # Example for ufw on Debian/Ubuntu
      sudo ufw allow from <other_server_ip> to any port 6222 proto tcp
      sudo ufw reload
      
      Why it works: This explicitly allows TCP traffic on the NATS cluster port, enabling the gossip protocol to establish connections.
    • Routing: Verify that the IP addresses listed in your NATS server configurations are indeed reachable from each other. This might involve checking /etc/hosts files, DNS, or cloud provider network configurations. Why it works: NATS servers need to resolve and connect to the actual network interfaces of their peers. Incorrect routing prevents this.
    • Container/VM Networking: If running in Docker or Kubernetes, ensure the network overlay allows direct communication between pods/containers. For Docker, check docker network inspect <network_name>. For Kubernetes, ensure your CNI (Container Network Interface) is correctly configured. Why it works: The container orchestrator’s network might be isolating NATS nodes, preventing them from seeing each other’s cluster ports.
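
The manual telnet check above can be scripted when there are several peers to test. The following is a minimal sketch, not an official NATS tool: the function names port_open and check_peers and the addresses in the usage comment are made up for illustration, and it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are installed.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Probe one port on every listed peer and print OK/FAIL per peer.
# Returns non-zero if at least one peer could not be reached.
check_peers() {
  local port=$1 peer failed=0
  shift
  for peer in "$@"; do
    if port_open "$peer" "$port"; then
      echo "OK    ${peer}:${port}"
    else
      echo "FAIL  ${peer}:${port}"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (hypothetical peer addresses, run on each NATS server in turn):
#   check_peers 6222 10.0.0.11 10.0.0.12 10.0.0.13
```

Running it from every server matters, because firewall rules are often asymmetric: a port that is open in one direction can still be blocked in the other.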

2. Incorrect Cluster Configuration

  • Diagnosis: Examine the cluster section of the configuration file you pass to the server (nats-server -c nats.conf). Pay close attention to the listen and routes directives.
    # nats.conf example
    cluster {
      listen: "0.0.0.0:6222"
      routes = [
        "nats://node1.example.com:6222",
        "nats://node2.example.com:6222",
        "nats://node3.example.com:6222"
      ]
      # if peers must dial a different address than the bind address
      # (NAT, multiple interfaces), advertise it explicitly:
      # advertise: "node4.example.com:6222"
    }
    
  • Fix:
    • listen Address: The listen address in the cluster configuration controls which interface and port the server binds for route connections. Using 0.0.0.0 is common and binds every interface, but peers still need an address they can actually dial. If a server has multiple IPs or sits behind NAT, set the cluster advertise option to the address other servers should use. Why it works: Peers learn each server’s cluster endpoint through gossip. If the advertised endpoint isn’t reachable by others, they can’t connect.
    • routes:
      • Seed routes: Each server’s routes list must contain the cluster address of at least one member that is already reachable. Listing every expected member is the more robust choice, because startup then doesn’t depend on a single seed. Why it works: The routes list tells a server which peers to dial first. Once one route is up, it learns the remaining members through gossip and connects to them as well.
      • Unreachable seeds: Ensure every address in routes is correct and resolvable/reachable. If a server cannot reach any of its listed routes, it won’t be able to discover the cluster. Why it works: Discovery relies on reaching at least one existing cluster member to bootstrap the connection.
    • Conflicting Configurations: Ensure cluster configurations are consistent across all servers. Mismatched ports or incorrect peer addresses are fatal. Why it works: The gossip protocol assumes a shared understanding of the cluster topology.
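
One way to confirm that the configured routes actually came up is to ask a running server through its HTTP monitoring endpoint /routez. The sketch below rests on several assumptions: monitoring is enabled on port 8222 (for example with http_port: 8222), curl is installed, and the function names distinct_peers and mesh_complete are invented for this example. It counts distinct remote_id values instead of raw connections, since newer server versions may open more than one route connection per peer.

```shell
# Count distinct peers in /routez JSON read from stdin.
# Crude text matching on "remote_id", so no jq dependency is needed.
distinct_peers() {
  grep -o '"remote_id"[[:space:]]*:[[:space:]]*"[^"]*"' | sort -u | wc -l | tr -d ' '
}

# Succeed when the observed peer count equals cluster size minus one,
# i.e. this server has a route to every other member.
mesh_complete() {
  [ "$2" -eq $(( $1 - 1 )) ]
}

# Usage on a server in a 3-node cluster (assumes monitoring on 8222):
#   nats-server -c nats.conf -t     # parse-check the config and exit
#   n=$(curl -s http://127.0.0.1:8222/routez | distinct_peers)
#   mesh_complete 3 "$n" && echo "full mesh" || echo "only $n peer(s) connected"
```

If one server reports fewer peers than the others, that server's routes section and network path are the first places to look.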

3. High Server Load / Resource Exhaustion

  • Diagnosis: Check the CPU, memory, and network I/O of your NATS servers. Use tools like top, htop, vmstat, iostat, or cloud provider monitoring. Look for sustained high CPU usage (>90%) or memory pressure.
  • Fix:
    • Increase Resources: Provide more CPU, RAM, or network bandwidth to the NATS server instances. Why it works: NATS needs sufficient resources to process incoming connections, handle client traffic, and perform its gossip protocol updates. Starvation causes it to drop connections or fail to respond.
    • Optimize Client Traffic: If high load is due to client activity, consider scaling your NATS cluster horizontally (adding more servers) or optimizing your publisher/subscriber logic to reduce message volume or processing time. Why it works: Reducing the overall load on individual NATS servers gives them the headroom to maintain cluster peer connections.
    • Adjust ping tolerance: In extreme cases, you might temporarily make the server more tolerant of missed pings by raising ping_max (e.g., from the default of 2 to 4) in the server configuration.
      # top level of nats.conf, not inside the cluster block
      ping_max: 4
      
      Why it works: This gives servers more leeway before they close a silent connection as stale, which can help in highly latent or congested networks, but it masks underlying problems.

4. DNS Resolution Issues

  • Diagnosis: If your routes configuration uses hostnames (e.g., node1.example.com), try resolving those hostnames from the other NATS servers.
    dig node1.example.com
    nslookup node1.example.com
    
    Also, check /etc/resolv.conf on the NATS servers.
  • Fix:
    • Correct DNS Records: Ensure DNS records for your NATS server hostnames point to the correct, reachable IP addresses. Why it works: NATS servers rely on DNS to find the IP addresses of their peers. Incorrect DNS means they’re trying to connect to the wrong places.
    • DNS Server Availability: Verify that the DNS servers listed in /etc/resolv.conf are themselves reachable and responsive. Why it works: If the NATS server can’t ask DNS where its peers are, it can’t connect.
    • Use IP Addresses: As a workaround or for simpler setups, use direct IP addresses in your routes configuration instead of hostnames. Why it works: Bypasses DNS entirely, ensuring NATS uses the IP directly.
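
The hostname checks above can be run against every route in a configuration file at once. This is a rough sketch under stated assumptions: route_hosts and unresolved_hosts are hypothetical helper names, the config is scanned with plain text matching and not a real parser, and resolution uses getent, which follows the same resolver configuration as most Linux processes but may be missing on minimal images.

```shell
# List the unique hosts found in route URLs of a config file.
# Strips the scheme, optional user:password@ prefix, and the port.
route_hosts() {
  grep -oE '(nats|nats-route)://[^"[:space:],]+' "$1" \
    | sed -E 's#^[a-z-]+://##; s#^[^@/]*@##; s#:[0-9]+$##' \
    | sort -u
}

# Print every hostname that does not resolve. IPv4 literals are
# skipped because they need no DNS lookup.
unresolved_hosts() {
  local host
  for host in "$@"; do
    case $host in
      *[!0-9.]*) ;;        # contains a non-digit, non-dot: a hostname
      *) continue ;;       # digits and dots only: an IPv4 literal
    esac
    getent hosts "$host" >/dev/null 2>&1 || echo "$host"
  done
}

# Usage (run on each NATS server; nats.conf is your own config path):
#   unresolved_hosts $(route_hosts nats.conf)
```

An empty result means every route hostname resolved from that machine. It does not prove the resolved address is the right one, so compare the output of dig across servers when in doubt.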

5. Incorrect Cluster Port Configuration

  • Diagnosis: Verify that the port in each routes entry matches the port the target server actually listens on in its cluster section. 6222 is the conventional cluster port.
  • Fix: Using the same cluster.listen port on every server is the simplest way to stay consistent. Whatever ports you choose, each routes entry on the other servers must point at the peer’s actual cluster port.
    # nats.conf on all servers
    cluster {
      listen: "0.0.0.0:6222" # Or a specific IP
      # ... routes ...
    }
    
    Why it works: NATS servers must agree on which port to use for inter-server communication. Mismatched ports mean they’re trying to connect to different services.
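
When every node is meant to share one cluster port, a quick text scan of the config can flag route entries that use a different one. The sketch below is only a heuristic: cluster_listen_port and mismatched_routes are made-up names, the parsing assumes a listen line of the form host:port inside a block that starts with "cluster {", and a differing port is a hint to double-check, not proof of an error.

```shell
# Print the port from the first listen line at or after "cluster {".
cluster_listen_port() {
  awk '
    /^[ \t]*cluster[ \t]*[{]/ { in_cluster = 1 }
    in_cluster && /listen/ {
      if (match($0, /:[0-9]+/)) {
        print substr($0, RSTART + 1, RLENGTH - 1)
        exit
      }
    }
  ' "$1"
}

# Print route URLs whose port differs from the local cluster listen port.
mismatched_routes() {
  local conf=$1 want url
  want=$(cluster_listen_port "$conf")
  grep -oE '(nats|nats-route)://[^"[:space:],]+' "$conf" | while read -r url; do
    [ "${url##*:}" = "$want" ] || echo "$url"
  done
}

# Usage (nats.conf is your own config path):
#   mismatched_routes nats.conf
```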

6. NTP / Time Synchronization Issues

  • Diagnosis: Check the system time on all your NATS servers. Use date on each server.
  • Fix: Ensure all servers in the cluster are synchronized to the same Network Time Protocol (NTP) source.
    # Example command to check NTP status on systemd systems
    timedatectl status
    
    Why it works: While NATS itself doesn’t strictly require perfect time sync for basic operation, significant time skew can sometimes interfere with TLS handshakes (if used) or other network-level operations that indirectly impact peer discovery and keep-alives.

Once these are resolved, the next error you’ll likely encounter is a Slow Consumer if your message processing can’t keep up.
