NATS flow control is designed to prevent a fast producer from overwhelming a slow consumer, but with default settings it can sometimes cause a different problem: the slow consumer itself stalls and stops making progress.

Here’s how that looks in practice:

Say a service publishes to a NATS subject at 10,000 messages per second, while a consumer subscribed to that subject can process only 1,000 messages per second. Flow control kicks in to slow delivery down. If the consumer’s processing rate then drops even further because of internal application logic or external dependencies, the mechanism can end up signaling back and forth continuously, which effectively freezes the connection.
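
To put numbers on that mismatch, here is a rough back-of-the-envelope sketch. It is not a model of NATS internals: it assumes constant rates and a single buffer with a fixed message limit. The function names and the 9,000-message limit are illustrative and are not NATS settings.

```python
def backlog_growth_per_sec(publish_rate: float, process_rate: float) -> float:
    """Messages per second that pile up when publishing outpaces processing."""
    return max(0.0, publish_rate - process_rate)


def seconds_until_full(publish_rate: float, process_rate: float, buffer_limit: int) -> float:
    """Time until a buffer of `buffer_limit` messages fills at constant rates.

    Returns float('inf') when the consumer keeps up with the publisher.
    """
    growth = backlog_growth_per_sec(publish_rate, process_rate)
    if growth == 0:
        return float("inf")
    return buffer_limit / growth


if __name__ == "__main__":
    # Rates from the scenario above; the buffer limit is a made-up figure.
    print(backlog_growth_per_sec(10_000, 1_000))
    print(seconds_until_full(10_000, 1_000, 9_000))
```

With these rates the backlog grows by 9,000 messages every second, so any fixed buffer only postpones the point at which delivery has to pause.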

Common Causes and Fixes

  1. Insufficient Consumer Credit: The consumer has too little "credit", that is, too small an allowance of messages it may hold without acknowledging them. Once the credit is used up, delivery pauses until acknowledgments come back. If they come back slowly, the consumer sits idle and can eventually be treated as stalled or disconnected.

    • Diagnosis: Check the NATS server logs for messages indicating flow control exceeded or consumer credit exhausted. On the client side, you might see connection errors or extended periods of no message delivery.
    • Fix: Increase the consumer’s limit on outstanding (unacknowledged) messages. In JetStream this is the consumer setting max_ack_pending; for example, you might set max_ack_pending=10000 when creating or updating the consumer.
    • Why it works: This gives the consumer a larger buffer of messages it can receive before it needs to signal back to the server. It smooths out temporary dips in processing speed without immediately triggering flow control.
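
The effect of a larger allowance can be illustrated with a toy, tick-based simulation. This is a sketch under simplifying assumptions (fixed arrivals per tick, instant acknowledgment once a message is processed), not NATS’s actual protocol, and `count_blocked_ticks` is a hypothetical helper written for this example.

```python
def count_blocked_ticks(arrivals, capacities, window):
    """Count ticks in which the window was full and messages had to wait upstream.

    arrivals[i]   -- messages published during tick i
    capacities[i] -- messages the consumer can process during tick i
    window        -- maximum messages the consumer may hold unprocessed
    """
    in_flight = 0  # delivered to the consumer but not yet processed
    waiting = 0    # held back upstream because the window was full
    blocked = 0
    for arrived, capacity in zip(arrivals, capacities):
        waiting += arrived
        # Deliver as much as the remaining window allows.
        deliverable = min(waiting, window - in_flight)
        in_flight += deliverable
        waiting -= deliverable
        if waiting > 0:
            blocked += 1
        # The consumer works through what it holds, up to its capacity.
        in_flight -= min(in_flight, capacity)
    return blocked


if __name__ == "__main__":
    arrivals = [100] * 10
    # The consumer normally handles 150 per tick but dips to 20 for two ticks.
    capacities = [150, 150, 20, 20, 150, 150, 150, 150, 150, 150]
    for window in (100, 300):
        print(window, count_blocked_ticks(arrivals, capacities, window))
```

In this toy model the small window never recovers from the two-tick dip, because the window itself caps how much the consumer can be handed per tick, while the larger window absorbs the dip without holding anything back.
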
  2. Aggressive Flow Control Limits: The consumer’s outstanding-message limit (max_ack_pending in JetStream) is set so low that flow control engages too frequently, creating a tight feedback loop that is easily disrupted.

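    • Diagnosis: Inspect the consumer’s state on the server. In JetStream, the consumer info reports num_ack_pending (delivered but not yet acknowledged) and num_pending (not yet delivered). If num_ack_pending sits at the configured limit while num_pending keeps growing, the limit is throttling delivery.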
    • Fix: Increase the outstanding-message limit (max_ack_pending in JetStream) to a value that lets the consumer absorb bursts of producer activity without delivery halting immediately. A common starting point is a multiple of your expected peak processing rate, e.g., a limit of 5000 (about 2.5 seconds of work) if your consumer can handle 2000 messages/sec but experiences brief spikes.
    • Why it works: A larger buffer allows the consumer to absorb more messages during brief processing slowdowns, preventing the producer from pausing unnecessarily and thus avoiding the tight oscillation that can lead to stalls.
  3. Network Latency: High latency between the producer and consumer, or between the consumer and the NATS server, can cause acknowledgments to be delayed. This delay can be misinterpreted by the flow control mechanism as the consumer being too slow, leading to premature pausing.

    • Diagnosis: Use ping and traceroute to check latency and packet loss to the NATS server from the consumer’s host. Monitor NATS server metrics for slow_consumers.
    • Fix: Improve network infrastructure, co-locate services, or increase the outstanding-message limit (max_ack_pending in JetStream) to buffer against acknowledgment delays. For instance, if network latency adds 100ms to each acknowledgment and your consumer processes 1000 messages/sec, about 100 messages (1000 × 0.1s) must be in flight just to cover the delay. A limit such as 20000 leaves ample headroom above that floor.
    • Why it works: By providing a larger buffer, the consumer can continue to receive messages even if acknowledgments are delayed due to network issues, preventing the flow control from prematurely throttling the producer.
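
The sizing arithmetic in that example can be written out as a small helper. This is a minimal sketch assuming a steady processing rate and a fixed acknowledgment delay; `min_in_flight` and `headroom` are illustrative names, not part of any NATS client.

```python
import math


def min_in_flight(process_rate_per_sec: int, ack_delay_ms: int) -> int:
    """Messages that must be outstanding to keep the consumer busy
    while acknowledgments are in transit (rate x delay)."""
    return max(1, math.ceil(process_rate_per_sec * ack_delay_ms / 1000))


def headroom(limit: int, process_rate_per_sec: int, ack_delay_ms: int) -> float:
    """How many times larger the configured limit is than the bare minimum."""
    return limit / min_in_flight(process_rate_per_sec, ack_delay_ms)


if __name__ == "__main__":
    # Figures from the example above: 1000 messages/sec, 100ms added latency.
    print(min_in_flight(1000, 100))
    print(headroom(20000, 1000, 100))
```

Any limit below the value returned by `min_in_flight` would leave the consumer idle while it waits for acknowledgments to cross the network.
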
  4. Consumer Application Bottlenecks: The consumer application itself is experiencing internal delays (e.g., slow database queries, blocking I/O, garbage collection pauses) that prevent it from acknowledging messages promptly. Flow control is working as intended but highlighting an application-level problem.

    • Diagnosis: Profile the consumer application. Look for long-running operations, high CPU usage, or memory issues. Check application logs for any errors or warnings related to processing or external dependencies.
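    • Fix: Optimize the consumer application’s code, improve database performance, use asynchronous I/O, or scale up the consumer instances. For example, if each message needs a 50ms database write and messages are handled one at a time, the consumer tops out at 20 messages/sec, far short of a 1000 messages/sec target, so the write is the bottleneck. Cutting the write to 5ms raises the ceiling to 200 messages/sec; reaching 1000 messages/sec additionally requires about five writes in flight concurrently (or a 1ms write).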
    • Why it works: Addressing the root cause of the consumer’s slowness allows it to acknowledge messages faster, which in turn signals to the NATS server that it can grant more credit and resume normal flow.
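
The throughput arithmetic for the database example can be checked with a short sketch. It assumes each message costs one blocking write of fixed duration and that workers scale linearly; both function names are hypothetical.

```python
import math


def serial_throughput(latency_ms: float) -> float:
    """Messages per second a single worker achieves at `latency_ms` per message."""
    return 1000 / latency_ms


def workers_needed(target_rate_per_sec: int, latency_ms: int) -> int:
    """Concurrent workers required to sustain the target rate."""
    return math.ceil(target_rate_per_sec * latency_ms / 1000)


if __name__ == "__main__":
    for latency_ms in (50, 5):
        print(latency_ms, serial_throughput(latency_ms), workers_needed(1000, latency_ms))
```

Under these assumptions a 5ms write alone supports 200 messages/sec per worker, so a 1000 messages/sec target still calls for roughly five concurrent workers.
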
  5. NATS Server Resource Constraints: The NATS server itself is under heavy load (CPU, memory, network I/O), which can delay its ability to process flow control requests and grant credit efficiently.

    • Diagnosis: Monitor NATS server resource utilization. Look for high CPU, memory, or network saturation on the server hosts. Check NATS server logs for any indications of internal slowness or dropped packets.
    • Fix: Scale up the NATS server cluster (add more nodes), increase resources on existing nodes, or optimize NATS server configuration. Ensure the server is running on adequate hardware.
    • Why it works: A healthy, unburdened NATS server can respond to flow control signals and client acknowledgments much faster, maintaining a stable flow of messages.
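
One way to watch server health is to poll the server’s HTTP monitoring endpoint. The sketch below assumes monitoring is enabled and reachable at http://localhost:8222 (an assumption about your deployment) and reads a handful of fields from the /varz response; the helper names are illustrative.

```python
import json
import urllib.request


def fetch_varz(base_url: str = "http://localhost:8222") -> dict:
    """Fetch the /varz document. Assumes the monitoring port is enabled."""
    with urllib.request.urlopen(f"{base_url}/varz", timeout=5) as resp:
        return json.load(resp)


def summarize_varz(varz: dict) -> dict:
    """Pick out the fields most relevant to an overloaded server."""
    return {
        "cpu_percent": varz.get("cpu"),
        "mem_bytes": varz.get("mem"),
        "connections": varz.get("connections"),
        "slow_consumers": varz.get("slow_consumers"),
    }


def new_slow_consumers(previous: dict, current: dict) -> int:
    """Slow-consumer events recorded between two /varz snapshots."""
    return current.get("slow_consumers", 0) - previous.get("slow_consumers", 0)


if __name__ == "__main__":
    print(summarize_varz(fetch_varz()))
```

Comparing two snapshots taken a minute apart with `new_slow_consumers` shows whether slow-consumer events are still accumulating or were a one-off.
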
  6. Improper Flow Control Configuration on Server: This is less common, because most flow control is configured on the client side. Server-side limits can still affect message handling indirectly, for example max_payload (the largest message the server accepts), max_pending (bytes buffered per connection), and write_deadline (how long the server waits on a blocked write to a client).

    • Diagnosis: Review NATS server configuration files for any unusual settings related to message buffering or network limits.
    • Fix: Ensure NATS server configuration is standard and aligned with best practices for your deployment size. For example, ensure max_payload is sufficiently large for your message sizes.
    • Why it works: Standard configurations ensure the NATS server can efficiently manage message queues and flow control mechanisms without introducing artificial bottlenecks.

Once these issues are addressed, you might encounter a new, more fundamental problem: consumer message processing logic errors that lead to duplicate processing or missed messages.
