NATS stops accepting new connections when you tell it to shut down, but it doesn’t automatically finish processing existing requests.

Here’s how to make NATS shut down cleanly, ensuring all in-flight messages are delivered.

The core issue is that a standard nats-server shutdown signal (SIGTERM or SIGINT) stops the server from listening for new TCP connections and processing new incoming requests. However, any clients that are already connected and have active subscriptions or pending publishes will have their work interrupted. If those messages are critical, they might be lost or require manual retry logic on the client side.

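The solution is to use nats-server’s built-in lame duck mode together with client-side draining. Lame duck mode tells the server to stop accepting new connections and then close existing client connections gradually over a configurable period, so clients can finish in-flight work and reconnect elsewhere before the server exits.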

Common Causes & Fixes

  1. Server receives SIGTERM without draining:

    • Diagnosis: Observe nats-server logs. You’ll see the server report that it is shutting down and then exit almost immediately, with no mention of lame duck mode and no time for clients to finish their work.
    • Fix: Send a SIGUSR2 signal instead of SIGTERM. (SIGUSR1 only reopens the log file for rotation.)
      # Find the process ID (PID) of your nats-server
      PID=$(pgrep -x nats-server)
      
      # Enter lame duck mode (equivalent to: nats-server --signal ldm=$PID)
      kill -USR2 "$PID"
      
      # Wait for the server to evict its clients and exit on its own
      while kill -0 "$PID" 2>/dev/null; do sleep 1; done
      
    • Why it works: The SIGUSR2 signal puts nats-server into lame duck mode. The server stops accepting new connections, waits for a short grace period (lame_duck_grace_period, 10 seconds by default), and then closes existing client connections gradually over lame_duck_duration (2 minutes by default). Clients are told the server is going away, so they can drain their subscriptions and reconnect to another server. Once the last client is gone the server shuts down by itself; a SIGTERM sent earlier would cut the process short.
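  2. Insufficient lame duck duration:

    • Diagnosis: You put the server into lame duck mode, but some clients are still cut off mid-work. This usually means the eviction window is too short for your slowest clients, or a SIGTERM arrived while clients were still connected.
    • Fix: Configure a longer lame duck period in nats-server’s configuration file. nats-server has no drain_timeout setting or --drain-timeout flag; the 30-second drain timeout is a client-side default (nats.DrainTimeout in the Go client). On the server, lame_duck_duration defaults to 2 minutes and lame_duck_grace_period to 10 seconds.
      • Config File (nats-server.conf):
        server_name: my-nats-server
        port: 4222
        lame_duck_duration: "5m"       # window over which clients are evicted
        lame_duck_grace_period: "30s"  # delay before eviction starts
        
      • Command Line: there is no dedicated flag for these settings, so pass the configuration file:
        nats-server -c nats-server.conf
        
      • Send the signal with a longer wait:
        PID=$(pgrep -x nats-server)
        kill -USR2 "$PID"
        sleep 310 # lame_duck_duration (5m) plus a small margin
        kill -TERM "$PID" 2>/dev/null # last resort; fails harmlessly if the server has already exited
        
    • Why it works: lame_duck_grace_period delays the first eviction, and lame_duck_duration sets the window over which the server closes client connections before it exits. Increasing these values gives clients more time to finish processing their messages, and spreading the evictions out avoids a reconnect storm on the remaining servers.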
  3. Client doesn’t acknowledge messages promptly:

    • Diagnosis: Even with draining enabled and a sufficient timeout, messages are lost. This often happens with services that process messages asynchronously and don’t explicitly acknowledge them back to NATS.
    • Fix: Ensure your NATS clients acknowledge messages. For JetStream, this means using a consumer with an explicit ack policy (nats.AckExplicit() in the Go client) and calling Ack() or AckSync() on the message object once processing succeeds. Core NATS has no acknowledgments: delivery is at-most-once, with or without queue groups, so a message handed to a client that crashes before processing it is gone. If that is unacceptable, use JetStream.
      • Go Client Example (JetStream):
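        // js is a nats.JetStreamContext obtained from nc.JetStream()
        sub, err := js.QueueSubscribe("my-topic", "my-qgroup", func(msg *nats.Msg) {
            // Process message here, then acknowledge it
            if err := msg.Ack(); err != nil {
                log.Printf("ack failed: %v", err)
            }
        }, nats.ManualAck(), nats.AckWait(10*time.Second)) // AckWait: how long the server waits for an ack before redelivering
        if err != nil {
            log.Fatal(err)
        }
        defer sub.Drain() // on shutdown, finish in-flight messages before unsubscribing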
        
      • Client Application Logic: Ensure your application code doesn’t exit or restart until all messages it’s responsible for have been acknowledged.
    • Why it works: For JetStream, explicit acknowledgments tell the server that a message has been successfully processed. Until an ACK is received, the server does not consider the message processed and redelivers it once AckWait expires, for example because the client disconnected before acknowledging. This gives at-least-once delivery, so handlers should tolerate duplicates.
  4. No active subscribers for a queue group:

    • Diagnosis: You see messages being published, but no subscriber is connected when they arrive, so core NATS discards them and they are effectively lost. This can happen if a subscriber crashes, or has already been evicted during shutdown, while publishers are still sending.
    • Fix: Ensure that for critical queues, you have a mechanism to keep subscribers alive or that messages are being sent to JetStream streams with appropriate retention and replication policies. If using core NATS queue groups, ensure at least one subscriber is always active or that you’re using JetStream’s durability.
      • JetStream Stream Configuration (storage may also be "memory"; retention may also be "interest" or "workqueue"; use more than one replica for HA):
        {
          "name": "my-stream",
          "subjects": ["my-topic"],
          "storage": "file",
          "retention": "limits",
          "max_msgs": 10000,
          "num_replicas": 1
        }
        
    • Why it works: JetStream streams provide message durability. Even if no subscribers are currently connected, messages published to a JetStream stream are persisted. When subscribers reconnect, they can resume processing from where they left off. Draining on the client side ensures that messages already delivered to connected subscribers are processed before those subscribers disconnect.
  5. Network partitions or client disconnects during drain:

    • Diagnosis: The server attempts to drain, but clients disconnect unexpectedly, leading to incomplete drains and potential message loss.
    • Fix: Implement robust client-side reconnection logic and graceful shutdown procedures within your client applications. Ensure clients have a mechanism to re-establish connections and resume processing after a server restart. For critical applications, consider using JetStream’s AckWait and MaxDeliver settings to control message redelivery behavior.
    • Why it works: By having clients actively manage their connections and acknowledge messages, they can signal their status to the server and ensure that messages are only considered processed when they are truly handled, even across transient network issues.
  6. Server is overwhelmed and cannot process drain requests:

    • Diagnosis: The server is experiencing high CPU or memory usage, which slows its response to the lame duck signal (SIGUSR2) and keeps it from delivering messages efficiently while clients are being evicted.
    • Fix: Scale up your NATS server resources (CPU, RAM) or optimize your message processing on the client side to reduce the load. Monitor server metrics like CPU, memory, and network I/O.
    • Why it works: A healthy, responsive server is essential for any operation, including graceful shutdown. Ensuring adequate resources prevents the server from becoming a bottleneck during critical shutdown procedures.
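
The signal-then-wait sequence used in the fixes above can be sketched as a small shell function that polls for process exit instead of sleeping for a fixed time. This is only a sketch under stated assumptions: the function name graceful_stop, its arguments, its return codes, and the 150-second default are hypothetical and not part of nats-server. It relies only on SIGUSR2 putting nats-server into lame duck mode and on the server exiting by itself afterwards.

```shell
#!/usr/bin/env bash
# graceful_stop PID [TIMEOUT_SECONDS]
# Hypothetical helper (not shipped with nats-server): sends SIGUSR2,
# polls until the process exits, and falls back to SIGTERM on timeout.
# Returns 0 if the process exited on its own, 1 if SIGTERM was needed,
# and 2 if the PID was not running to begin with.
graceful_stop() {
    local pid="$1"
    local timeout="${2:-150}"   # assumed default: a bit over the 2m lame duck default
    local waited=0

    # Nothing to do if the process does not exist
    kill -0 "$pid" 2>/dev/null || return 2

    # Ask the server to enter lame duck mode
    kill -USR2 "$pid"

    # Poll once per second until the process is gone or the timeout elapses
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -TERM "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A typical call would be graceful_stop "$(pgrep -x nats-server)" 150. Choosing a timeout somewhat longer than the configured lame_duck_duration means the SIGTERM fallback only fires when something is genuinely stuck.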

After successfully draining and stopping the server, the next error you might encounter is related to clients attempting to reconnect to a server that is temporarily unavailable or has changed its address/port if it’s part of a cluster.
