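When NATS JetStream appears to drop messages for a slow consumer, the usual cause is that the consumer’s acknowledgement (ACK) wait, AckWait, is being exceeded. The server stops waiting for the consumer to process and acknowledge a message and schedules it for redelivery, potentially to another client bound to the same consumer. If the consumer also sets MaxDeliver, a message that keeps timing out stops being delivered once that limit is reached, which is what looks like a dropped message.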

Here’s a breakdown of why this happens and how to fix it, from most to least common causes:

1. Consumer ACK Settings Do Not Match Processing Time

  • Diagnosis: Check the consumer configuration for its AckPolicy, AckWait, and MaxAckPending. With AckPolicy Explicit or All, every delivered message must be acknowledged within AckWait (30 seconds by default), or the server schedules it for redelivery. MaxAckPending caps how many delivered messages may be unacknowledged at once; when the cap is reached, the server pauses delivery until ACKs arrive. A cap that is too low throttles the consumer, while a cap that is too high lets messages queue up in the client with their AckWait timers already running.
    • Command: nats context info <your_context_name> (to confirm which server and account you are connected to)
    • Command: nats consumer info <stream_name> <consumer_name> (to get consumer details, including ACK policy, AckWait, and MaxAckPending)
  • Fix: Size MaxAckPending to what the consumer can actually finish within AckWait. Raise it if the consumer sits idle waiting for the server to resume delivery; lower it if messages time out while still queued in the client.
    • Command: nats consumer edit <stream_name> <consumer_name> --max-pending 1000 (adjust 1000 to a value that accommodates your typical processing burst; flag names can differ between CLI versions, so confirm with nats consumer edit --help)
  • Why it works: MaxAckPending controls how many messages are "in flight" to the consumer, not how long each one may take. Keeping the in-flight count within what the consumer can complete inside AckWait means messages are ACKed before their timers expire, so the server has no reason to redeliver them.
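
A rough way to sanity-check the in-flight limit against the ACK wait is to estimate how long the last delivered message sits before the consumer reaches it. The sketch below is back-of-the-envelope arithmetic, not a NATS API: the function names and sample numbers are assumptions for illustration, and it assumes a steady per-message processing time and a fixed number of parallel workers.

```python
import math


def worst_case_ack_delay(in_flight: int, seconds_per_msg: float, workers: int = 1) -> float:
    """Estimate seconds until the last of `in_flight` delivered messages is ACKed.

    Assumes every message takes `seconds_per_msg` and `workers` handlers run in parallel.
    """
    if in_flight < 0 or seconds_per_msg < 0 or workers < 1:
        raise ValueError("in_flight and seconds_per_msg must be >= 0, workers >= 1")
    rounds = math.ceil(in_flight / workers)  # sequential rounds of parallel work
    return rounds * seconds_per_msg


def max_safe_in_flight(ack_wait_s: float, seconds_per_msg: float, workers: int = 1) -> int:
    """Largest in-flight count whose worst-case delay still fits inside the ACK wait."""
    if ack_wait_s < 0 or seconds_per_msg <= 0 or workers < 1:
        raise ValueError("ack_wait_s must be >= 0, seconds_per_msg > 0, workers >= 1")
    rounds = math.floor(ack_wait_s / seconds_per_msg)
    return rounds * workers


if __name__ == "__main__":
    # Hypothetical numbers: 30 s ACK wait, 0.5 s per message, 4 workers.
    print(worst_case_ack_delay(1000, 0.5, workers=4))  # 125.0
    print(max_safe_in_flight(30.0, 0.5, workers=4))    # 240
```

With these assumed numbers, 1000 messages in flight would leave the last one waiting about 125 seconds, well past a 30-second ACK wait, so either a limit near 240 or a longer ACK wait would be the safer pairing.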

2. Insufficient Server Resources (CPU/Memory/Network)

  • Diagnosis: Monitor your NATS server’s resource utilization. High CPU usage, memory pressure, or network saturation on the server hosting the JetStream stream can delay message delivery and the handling of incoming ACKs.
    • Command: top or htop on the NATS server.
    • Command: iftop or nload for network traffic.
    • Check NATS server logs for any resource-related warnings or errors.
  • Fix: Scale up your NATS server resources (CPU, RAM) or optimize its network configuration. If running in a containerized environment, ensure resource limits are not too restrictive.
    • Action: Increase CPU/RAM allocated to the NATS server process or node.
    • Action: Investigate and resolve network bottlenecks between NATS clients and the server.
  • Why it works: A starved server cannot efficiently process incoming messages, manage consumer state, or respond to ACKs in a timely manner. Providing adequate resources ensures the NATS server can keep up with its duties.

3. Slow Consumer Application Logic

  • Diagnosis: The consumer application itself is taking too long to process messages and send ACKs. This is the most direct cause of a "slow consumer."
    • Check your consumer application’s logs for processing times, error rates, and any internal delays.
    • Use application performance monitoring (APM) tools if available.
  • Fix: Optimize the consumer application’s code. Identify bottlenecks in message processing, external dependencies (databases, APIs), or inefficient data handling.
    • Action: Profile your consumer application to find slow functions.
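    • Action: Implement batching for ACKs if your application processes multiple messages before acknowledging. This fits AckPolicy All, where acknowledging one message also acknowledges every earlier message delivered to the consumer; with AckPolicy Explicit, each message still needs its own ACK.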
  • Why it works: The faster your consumer can process messages and send ACKs, the less likely it is to exceed any timeouts or MaxAckPending limits.
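
The ACK batching idea above can be sketched as follows. This is a minimal illustration that assumes the consumer uses AckPolicy All; FakeMsg is a hypothetical stand-in for a client library’s message type, and process_in_batches is an illustrative helper, not part of any NATS client.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class FakeMsg:
    """Hypothetical stand-in for a client message; not a real NATS type."""
    seq: int
    acked: List[int] = field(default_factory=list)  # log of ACKed sequence numbers

    def ack(self) -> None:
        self.acked.append(self.seq)


def process_in_batches(
    msgs: Iterable[FakeMsg],
    handle: Callable[[FakeMsg], None],
    batch_size: int,
) -> int:
    """Handle every message but ACK only the last one of each batch.

    Under AckPolicy All that single ACK covers the earlier messages too.
    Returns the number of ACKs sent.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    acks_sent = 0
    unacked_tail = None  # last handled message not yet covered by an ACK
    for count, msg in enumerate(msgs, start=1):
        handle(msg)
        unacked_tail = msg
        if count % batch_size == 0:
            msg.ack()
            acks_sent += 1
            unacked_tail = None
    if unacked_tail is not None:  # flush a final partial batch
        unacked_tail.ack()
        acks_sent += 1
    return acks_sent


if __name__ == "__main__":
    log: List[int] = []
    messages = [FakeMsg(seq, log) for seq in range(1, 11)]
    print(process_in_batches(messages, lambda m: None, batch_size=4))  # 3
    print(log)  # [4, 8, 10]
```

Batching delays the ACK for the first message in each batch, so the whole batch has to finish within the ACK wait; a batch that is too large reintroduces the timeout it was meant to avoid.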

4. Network Latency Between Consumer and Server

  • Diagnosis: High network latency or packet loss between your consumer application and the NATS server can delay the ACK packets from reaching the server. Even if the consumer processes the message quickly, the server might not receive the ACK in time.
    • Command: ping <nats_server_ip> from the consumer’s host.
    • Command: traceroute <nats_server_ip> to identify network hops with high latency.
  • Fix: Improve network connectivity. This might involve moving the consumer closer to the NATS server, optimizing network routes, or addressing underlying network infrastructure issues.
    • Action: Deploy consumers in the same network region or availability zone as the NATS server.
    • Action: Investigate and resolve any network device issues (routers, firewalls) causing delays.
  • Why it works: Reduced latency means ACKs arrive at the server faster, allowing the server to mark messages as processed before any timeouts trigger.

5. Inefficient Message Handling in Consumer (e.g., Blocking Operations)

  • Diagnosis: The consumer application might be performing blocking I/O operations (like synchronous database calls or external API requests) that prevent it from processing the next message or sending an ACK promptly.
    • Review consumer code for synchronous network calls, disk I/O, or long-running computations that are not handled asynchronously.
  • Fix: Refactor the consumer application to use asynchronous operations. Use non-blocking I/O, worker threads, or message queues for external service calls.
    • Action: Replace synchronous database drivers with asynchronous ones.
    • Action: Offload long-running tasks to background workers.
  • Why it works: Asynchronous processing allows the consumer to handle multiple messages concurrently or to acknowledge messages while other operations are in progress, preventing a single slow operation from blocking the entire message pipeline.
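
As an illustration of moving blocking work off the message loop, here is a minimal asyncio sketch (Python 3.9+). It does not use a NATS client: blocking_lookup is a hypothetical stand-in for a synchronous database or HTTP call, and handle and handle_many are illustrative names. In a real consumer, the ACK would be sent once handle returns.

```python
import asyncio
import time
from typing import List, Sequence


def blocking_lookup(payload: str) -> str:
    """Hypothetical blocking call, standing in for a synchronous DB or HTTP request."""
    time.sleep(0.2)
    return payload.upper()


async def handle(payload: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    # to receive other messages and send ACKs in the meantime.
    return await asyncio.to_thread(blocking_lookup, payload)


async def handle_many(payloads: Sequence[str], limit: int = 8) -> List[str]:
    """Handle payloads concurrently, with at most `limit` in progress at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(payload: str) -> str:
        async with sem:
            return await handle(payload)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(p) for p in payloads))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(handle_many(["a", "b", "c", "d"]))
    print(results)  # ['A', 'B', 'C', 'D']
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

The semaphore bounds concurrency so a burst of messages does not overwhelm the downstream service; a limit at or below the consumer’s in-flight message cap is a reasonable starting point.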

6. ACK Wait (AckWait) is Too Aggressive

  • Diagnosis: While less common, the ACK wait (AckWait, defaulting to 30 seconds) might simply be too short for your workload, even with a reasonable MaxAckPending.
    • This is usually identified after addressing the above points, if messages are still being redelivered or dropped.
  • Fix: Increase AckWait. This is a per-consumer setting rather than a server-level one: nats-server.conf has no global ACK timeout option, and no server restart is needed.
    • Command: nats consumer edit <stream_name> <consumer_name> --wait 60s (confirm the flag with nats consumer edit --help for your CLI version)
    • For individual messages that occasionally run long, the consumer can also send an in-progress acknowledgement, which resets the timer for that message.
  • Why it works: A longer AckWait gives the consumer, especially one facing transient issues, a longer grace period before the server assumes a message is lost and redelivers it. Use this with caution, as it can mask underlying problems and delays redelivery when a consumer has genuinely failed.

Once this problem is fixed, the next issue you are likely to hit is out-of-order or duplicate processing, because any redelivered message reaches the consumer again. Make sure your consumer’s message handling is idempotent.
