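A stale connection is one that one side still treats as open although the other side is no longer reachable: typically the NATS server believes a client is still connected after the client has lost its network path to the server. NATS detects this with protocol-level PING/PONG messages and closes a connection that stops answering them.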

Common Causes and Fixes:

  1. Network Interruption (Firewall/Load Balancer State)

    • Diagnosis: Check firewall or load balancer logs for connection resets or idle timeouts. On the NATS server, monitor connection counts and look for sudden drops that don’t correspond to client-initiated disconnects.
      nats-server --config /path/to/nats.conf -D # -D enables debug logging, which records client connect and close events
      
    • Fix: Configure your network infrastructure (firewalls, load balancers, NAT gateways) with idle timeouts longer than the interval at which NATS exchanges PINGs, or lower the NATS ping interval so that some traffic always flows before the idle timer expires. For example, a firewall that drops idle TCP connections after 300 seconds (5 minutes) is satisfied by the server’s default 2-minute ping interval, but a load balancer with a 60-second idle timeout is not. In that case, raise the device timeout or shorten the ping interval.
    • Why it works: NATS detects dead connections with its own protocol-level PING/PONG exchange rather than relying on TCP keep-alives. If a network device silently discards the state of an idle TCP connection, neither side is notified, and NATS only notices when the next PING or message goes unanswered. Keeping the device’s idle timeout longer than the gap between PINGs prevents it from dropping the underlying TCP connection that NATS is using.
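To see whether connections are being closed as stale rather than by the clients themselves, the server's HTTP monitoring endpoint can be queried for closed connections. The following is a minimal sketch, assuming monitoring is enabled (port 8222 is only the conventional choice) and that the `/connz?state=closed` response carries a `reason` field per connection. The `samplePayload` constant is hand-written for illustration and is not real server output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// samplePayload is a hand-written stand-in for a /connz?state=closed response,
// trimmed to the fields this sketch reads. It is NOT real server output.
const samplePayload = `{"connections": [
  {"cid": 7, "ip": "10.0.0.5", "reason": "Stale Connection"},
  {"cid": 8, "ip": "10.0.0.6", "reason": "Client Closed"},
  {"cid": 9, "ip": "10.0.0.5", "reason": "Stale Connection"}
]}`

// closedConns mirrors only the fields of the response used below.
type closedConns struct {
	Connections []struct {
		CID    uint64 `json:"cid"`
		IP     string `json:"ip"`
		Reason string `json:"reason"`
	} `json:"connections"`
}

// countByReason tallies closed connections by their close reason.
func countByReason(body []byte) (map[string]int, error) {
	var c closedConns
	if err := json.Unmarshal(body, &c); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, conn := range c.Connections {
		counts[conn.Reason]++
	}
	return counts, nil
}

// fetch downloads the monitoring payload from the given URL.
func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	body := []byte(samplePayload) // used when no URL argument is given
	if len(os.Args) > 1 {         // e.g. http://localhost:8222/connz?state=closed
		var err error
		if body, err = fetch(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, "fetch failed:", err)
			os.Exit(1)
		}
	}
	counts, err := countByReason(body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse failed:", err)
		os.Exit(1)
	}
	for reason, n := range counts {
		fmt.Printf("%-20s %d\n", reason, n)
	}
}
```

A rising count of stale closures coming from one IP range, alongside few client-initiated closes, points at a network device on that path rather than at the clients.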
  2. Client-Side Network Unreliability

    • Diagnosis: On the client machine, check for network interface errors, packet loss, or high latency. Tools like ping or mtr can help identify network issues between the client and the NATS server.
      ping nats.example.com
      mtr nats.example.com
      
    • Fix: Implement robust retry logic within your NATS client application. Most NATS client libraries have options for reconnect intervals and maximum reconnect attempts. For example, in Go’s NATS client:
      nc, err := nats.Connect("nats://nats.example.com:4222",
          nats.ReconnectWait(10*time.Second),
          nats.MaxReconnects(10))
      
    • Why it works: Even if the network is intermittently flaky, the client library can attempt to re-establish a connection after a brief outage, allowing it to rejoin the NATS cluster.
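In the Go client, the reconnect options apply after an established connection is lost. A first connection attempt that finds no reachable server returns an error unless the library is configured to retry it, so an application may still want its own bounded retry around the initial connect. The sketch below shows that policy in plain Go under stated assumptions: `retry` and `flaky` are hypothetical helpers written for this example, and `flaky` stands in for a real connect call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls connect until it succeeds or maxAttempts is used up, sleeping
// for wait between failed attempts. It returns the number of attempts made.
func retry(connect func() error, wait time.Duration, maxAttempts int) (int, error) {
	if maxAttempts < 1 {
		return 0, errors.New("maxAttempts must be at least 1")
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

// flaky returns a connect function that fails its first n calls and then
// succeeds. It is a stand-in for a connect call over an unreliable network.
func flaky(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}

func main() {
	attempts, err := retry(flaky(2), 10*time.Millisecond, 10)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

In a real application the function passed to `retry` would wrap the connect call shown above. A fixed wait keeps the sketch short, and adding jitter is a common refinement when many clients reconnect at once.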
  3. NATS Server Resource Exhaustion

    • Diagnosis: Monitor CPU, memory, and file descriptor usage on the NATS server. If the server is overloaded, it might be slow to process network events, including heartbeats, leading to stale connections.
      top
      htop
      sudo lsof -p $(pgrep nats-server) | wc -l # Check file descriptor count
      
    • Fix: Increase the resources allocated to the NATS server (CPU, RAM). Also, ensure the operating system’s file descriptor limit is high enough for the expected number of concurrent connections. For example, on Linux, you might increase the limit in /etc/security/limits.conf:
      * soft nofile 65536
      * hard nofile 65536
      
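      Then restart the NATS server from a new login session so the new limit applies. Note that limits.conf only affects PAM login sessions; if nats-server runs as a systemd service, set LimitNOFILE in its unit file instead.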
    • Why it works: A NATS server starved of CPU, memory, or file descriptors may fall behind on reading from and writing to client sockets, including the PING/PONG exchange, so it can wrongly conclude that a healthy connection is dead or be slow to notice that a client has disappeared. Running out of file descriptors also prevents it from accepting new connections. Sufficient resources ensure the server can process network I/O promptly.
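A changed limit only helps if the running process actually picked it up. On Linux, the effective limit of a process can be read from `/proc/<pid>/limits` and its open descriptors counted from `/proc/<pid>/fd`. The following is a minimal sketch, assuming a Linux procfs. `sampleLimits` is a hand-written excerpt used when no real file is available, and reading another user's process usually requires root.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sampleLimits is a hand-written excerpt in the layout of /proc/<pid>/limits.
const sampleLimits = "Limit                     Soft Limit           Hard Limit           Units     \n" +
	"Max open files            1024                 65536                files     \n"

// parseMaxOpenFiles returns the soft "Max open files" limit found in the
// contents of a /proc/<pid>/limits file.
func parseMaxOpenFiles(limits string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(limits))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "Max open files") {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, "Max open files"))
		if len(fields) < 1 {
			return 0, errors.New("malformed \"Max open files\" line")
		}
		return strconv.ParseUint(fields[0], 10, 64)
	}
	return 0, errors.New("no \"Max open files\" line found")
}

func main() {
	pid := "self" // pass the nats-server PID as the first argument instead
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	raw, err := os.ReadFile("/proc/" + pid + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read limits:", err)
		os.Exit(1)
	}
	soft, err := parseMaxOpenFiles(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse limits:", err)
		os.Exit(1)
	}
	fds, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot list descriptors:", err)
		os.Exit(1)
	}
	fmt.Printf("open descriptors: %d of soft limit %d\n", len(fds), soft)
}
```

If the reported soft limit is still the old value after a restart, the new setting was not applied to the way the server is launched.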
  4. NATS Server Configuration (Ping Settings/Max Payload)

    • Diagnosis: Review nats.conf for settings that might be too restrictive. Specifically, ping_interval and ping_max on the server side (the server has no ping_timeout option), or max_payload if clients are sending very large messages.
    • Fix: Raise ping_interval or ping_max on the server. The defaults are a 2-minute interval and 2 outstanding PINGs. For example, if your network latency is high, you might set:
      # nats.conf
      ping_interval: "3m" # time between server PINGs
      ping_max: 3         # unanswered PINGs tolerated before the connection is closed as stale
      
      If max_payload is too small for your use case, increase it:
      # nats.conf
      max_payload: 10485760 # 10MB
      
    • Why it works: The ping_interval is how often the server sends a PING to the client. ping_max is how many PINGs may go unanswered before the server considers the connection stale and closes it. If these are too aggressive for the network conditions, valid connections can be dropped. max_payload limits how much data can be sent in a single publish; exceeding it causes client errors or disconnects.
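The interaction between the server's ping settings (`ping_interval` and `ping_max` in nats.conf) and a network device's idle timeout is simple arithmetic. The sketch below works through it under stated assumptions: the worst-case detection time is approximated as `ping_interval × (ping_max + 1)`, which ignores server-version details and other traffic on the connection, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Server defaults when nats.conf does not override them.
const (
	defaultPingInterval = 2 * time.Minute
	defaultPingMax      = 2
)

// staleAfter approximates the worst-case time before the server closes an
// unresponsive connection: pingMax PINGs go unanswered, and the connection is
// closed when the next PING would be due.
func staleAfter(pingInterval time.Duration, pingMax int) time.Duration {
	return pingInterval * time.Duration(pingMax+1)
}

// survivesIdleTimeout reports whether PINGs are frequent enough to keep a
// network device with the given idle timeout from dropping the connection.
func survivesIdleTimeout(pingInterval, idleTimeout time.Duration) bool {
	return pingInterval < idleTimeout
}

func main() {
	fmt.Println("detection window with defaults:",
		staleAfter(defaultPingInterval, defaultPingMax))
	fmt.Println("survives 5m firewall timeout:",
		survivesIdleTimeout(defaultPingInterval, 5*time.Minute))
	fmt.Println("survives 60s load balancer timeout:",
		survivesIdleTimeout(defaultPingInterval, 60*time.Second))
}
```

Raising the ping settings makes the server more tolerant of slow links but lengthens the time a dead client goes unnoticed, so the two goals have to be balanced against the shortest idle timeout on the path.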
  5. Client Application Logic Errors (No Heartbeat)

    • Diagnosis: If a client application hangs or becomes unresponsive without explicitly closing its NATS connection, the server will eventually time out the connection. Check client application logs for unhandled exceptions or deadlocks.
    • Fix: Ensure that your client application’s event loop or main goroutine (or equivalent in other languages) is not blocked indefinitely. If the client is performing long-running synchronous operations, consider offloading them to separate goroutines or threads.
    • Why it works: The NATS client library answers server PINGs with PONGs automatically, but only while the code that reads from the socket gets to run. In single-threaded or event-loop clients, a blocked event loop cannot process incoming PINGs or send PONGs. In Go, the library reads on its own goroutine, so the risk is a process that is frozen or starved as a whole. Either way, the server sees no PONGs and concludes the connection is dead.
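The effect of a stalled reader can be reproduced without a NATS server. The sketch below is a simulation over an in-memory `net.Pipe`, not the NATS client library: `answerPings` plays the client's read loop, `pingOnce` plays the server, and both names are hypothetical. Only the `PING` and `PONG` protocol lines are taken from NATS itself.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"strings"
	"time"
)

// answerPings reads protocol lines and replies PONG to every PING.
// It returns when the connection is closed.
func answerPings(conn io.ReadWriter) error {
	r := bufio.NewReader(conn)
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return err
		}
		if strings.TrimSpace(line) == "PING" {
			if _, err := io.WriteString(conn, "PONG\r\n"); err != nil {
				return err
			}
		}
	}
}

// pingOnce sends one PING and reports whether a PONG arrives within timeout.
func pingOnce(conn net.Conn, timeout time.Duration) (bool, error) {
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return false, err
	}
	if _, err := io.WriteString(conn, "PING\r\n"); err != nil {
		return false, err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(line) == "PONG", nil
}

// demo reports whether the simulated server gets its PONG, with the client's
// read loop either running on its own goroutine or never scheduled at all.
func demo(readerRunning bool) bool {
	server, client := net.Pipe()
	defer server.Close()
	defer client.Close()
	if readerRunning {
		go answerPings(client)
	}
	ok, _ := pingOnce(server, 200*time.Millisecond)
	return ok
}

func main() {
	fmt.Println("read loop running, PONG received:", demo(true))
	fmt.Println("read loop blocked, PONG received:", demo(false))
}
```

Keeping the read loop on its own goroutine or thread, separate from long-running application work, is what lets the PONG go out on time.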
  6. NATS Server Bug or Unexpected Behavior

    • Diagnosis: Check the NATS server release notes and issue tracker for known bugs related to connection handling or network I/O, especially if the problem started after an upgrade.
    • Fix: If a specific bug is identified, upgrade to a newer, stable NATS server version that addresses the issue.
    • Why it works: Sometimes, subtle bugs in the server’s network stack or connection management can lead to premature disconnects or stale connection states. Upgrading to a patched version resolves the underlying defect.

The next error you’ll likely encounter after fixing stale connections is related to message ordering or delivery guarantees if your application was sensitive to the dropped connections and didn’t handle re-establishment gracefully.
