Fluentd’s buffer flush behavior is one of the most common reasons for losing data when you shut the service down.

This isn’t about Fluentd crashing; it’s about Fluentd shutting down gracefully and still losing data because its buffers weren’t emptied. When Fluentd receives events, it doesn’t send them immediately. Instead, it appends them to chunks in a buffer, which can live in memory or on disk. When a chunk fills up, or after the configured flush_interval elapses, Fluentd queues the chunk and flushes it to the configured output destination (Elasticsearch, S3, and so on).

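During a normal shutdown, Fluentd tries to flush its buffers. However, if the process is killed before the flush completes, or if the output destination is slow to acknowledge receipt, events still in the buffer can be lost. The two buffer types fail differently. A memory buffer is flushed at shutdown by default (flush_at_shutdown defaults to true for it), but anything it fails to send is gone when the process exits. A file buffer is not flushed at shutdown by default (flush_at_shutdown defaults to false for it): its chunks stay on local disk and are sent only after Fluentd restarts with the same buffer path, so they are effectively lost if that disk is ephemeral, as in many container setups.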
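As a minimal sketch of the file-buffer case, the fragment below turns on flush_at_shutdown so that Fluentd attempts to send queued chunks before it exits instead of leaving them on disk. The tag pattern and the aggregator host are hypothetical placeholders; the buffer path is borrowed from the examples in this article.

```text
<match app.**>
  @type forward
  <server>
    host aggregator.example.internal   # hypothetical destination
    port 24224
  </server>
  <buffer>
    @type file
    path /var/log/td-agent/buffer/my_app
    flush_at_shutdown true   # file buffers default to false
  </buffer>
</match>
```

With this set, shutdown takes as long as the flush does, so whatever stops Fluentd (systemd, Docker, Kubernetes) needs to allow enough time before it escalates to SIGKILL.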

Here’s how to ensure your Fluentd buffers are drained before shutdown:

1. The Graceful Shutdown Signal and shutdown_timeout

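Fluentd responds to SIGTERM (sent by systemd or docker stop) by initiating a graceful shutdown: it stops accepting new events and tries to flush pending buffers before exiting. The shutdown_timeout parameter in fluentd.conf is meant to control how long Fluentd waits for this flush to complete. Before relying on it, confirm that your Fluentd version accepts it: Fluentd logs a warning at startup for configuration parameters it does not use.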

  • Diagnosis: Check your Fluentd logs for messages indicating a shutdown is in progress. Look for lines like:

    2023-10-27 10:30:00 +0000 [info]: received SIGTERM
    2023-10-27 10:30:00 +0000 [info]: shutting down fluentd
    

    If you see these messages followed by fluentd exiting without further buffer flush messages, your shutdown_timeout might be too short, or the flush is genuinely taking too long.

  • Fix: Increase the shutdown_timeout in your fluentd.conf. For busy systems or slow outputs, a few seconds is rarely enough to drain a backlog.

    <system>
      shutdown_timeout 30s
    </system>
    

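    This gives Fluentd up to 30 seconds to flush its buffers after receiving a shutdown signal. If your output plugin is very slow, you might need to set this even higher, but whatever stops Fluentd must wait at least as long: docker stop sends SIGKILL after 10 seconds by default, Kubernetes after a 30-second grace period, and systemd after TimeoutStopSec (90 seconds by default).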

  • Why it works: This directly tells Fluentd to spend more time trying to send data before giving up. It’s the first line of defense, ensuring the attempt to flush is given sufficient time.

2. Buffer Plugin Configuration: flush_interval and retry_max_times

The behavior of the buffer itself is crucial. The buffer plugins (memory, file, and file_single) share a set of parameters, set in the <buffer> section, that control when and how data is flushed and retried.

  • Diagnosis: Examine your <buffer> configurations within your <match> directives.

    <match *.**>
      @type forward
      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/my_app
        flush_interval 5s
        retry_max_times 10
      </buffer>
      # ... other settings
    </match>
    

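    If flush_interval is too long (it defaults to 60s), data can sit in the buffer for extended periods. If retry_max_times is too low, a chunk that keeps failing is discarded after only a few attempts, unless a <secondary> output is configured to receive it.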

  • Fix (for file or memory buffers):

    • Reduce flush_interval: Set it to a shorter duration to encourage more frequent flushing.
      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/my_app
        flush_interval 1s # Reduced from 5s
        retry_max_times 10
      </buffer>
      
      This makes Fluentd attempt to send data more often, reducing the amount of data that can be in flight or pending at any given shutdown moment.
    • Increase retry_max_times: This is more relevant for transient network issues with the output destination.
      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/my_app
        flush_interval 5s
        retry_max_times 30 # Increased from 10
      </buffer>
      
      This allows Fluentd to retry sending data multiple times if the output destination is temporarily unavailable, increasing the chance that data eventually gets through even if there are minor hiccups.
  • Why it works: flush_interval directly controls how often Fluentd initiates a flush. A shorter interval means less data accumulates. retry_max_times improves the robustness of individual flush attempts, ensuring that temporary network blips don’t lead to permanent data loss.
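The retry settings work together with a few related buffer parameters. The sketch below is one possible combination, not a recommended baseline: retry_wait and retry_max_interval shape the exponential backoff, and a <secondary> section gives Fluentd a fallback output when retries against the primary keep failing. The aggregator host and the failed-chunk directory are hypothetical.

```text
<match app.**>
  @type forward
  <server>
    host aggregator.example.internal   # hypothetical destination
    port 24224
  </server>
  <buffer tag>
    @type file
    path /var/log/td-agent/buffer/my_app
    flush_interval 5s
    retry_type exponential_backoff   # the default retry strategy
    retry_wait 1s                    # wait before the first retry
    retry_max_interval 30s           # cap on the backoff interval
    retry_max_times 30
  </buffer>
  <secondary>
    # fallback output used when retries to the primary keep failing
    @type secondary_file
    directory /var/log/td-agent/failed   # hypothetical path
    basename my_app
  </secondary>
</match>
```

Capping the backoff matters at shutdown: without retry_max_interval, the wait between attempts keeps doubling, so a chunk that is mid-retry may not be attempted again before the process is stopped.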

3. Flush Threads and Queue Length: flush_thread_count and queue_limit_length

Buffered output plugins (like http, kafka, elasticsearch) rely on Fluentd’s flush threads, which take chunks from a queue and write them to the destination. The number of flush threads and the length of that queue affect how quickly a backlog drains. In Fluentd v1 both are buffer parameters set inside <buffer>; num_threads is the v0.12 name for flush_thread_count.

  • Diagnosis: Check the <buffer> section of the output, and consult the documentation for your specific output plugin for any sending options of its own.

    <match my_es>
      @type elasticsearch
      host elasticsearch.example.com
      port 9200
      logstash_format true
      logstash_prefix my-app
      include_tag_key true
      tag_key @log_name
      <buffer>
        flush_interval 5s
        flush_thread_count 4 # num_threads in v0.12-style configs
        queue_limit_length 8
      </buffer>
    </match>

    If flush_thread_count is too low (the default is 1), chunks are written one at a time and the output can’t keep up with the rate at which chunks are queued. If queue_limit_length is too small, the queue fills up and new events are rejected or blocked, depending on overflow_action.

  • Fix:

    • Increase flush_thread_count: Let Fluentd flush several chunks in parallel.
      <match my_es>
        @type elasticsearch
        # ... other settings
        <buffer>
          flush_thread_count 8 # Increased from 4
          queue_limit_length 8
        </buffer>
      </match>

      This lets the output write multiple chunks concurrently, speeding up the overall delivery of buffered events, provided the destination can absorb the extra parallel requests.
    • Increase queue_limit_length: Allow more chunks to wait in the queue before the buffer is considered full.
      <match my_es>
        @type elasticsearch
        # ... other settings
        <buffer>
          flush_thread_count 4
          queue_limit_length 16 # Increased from 8
        </buffer>
      </match>

      This provides a larger safety net for temporary bursts of data or slower downstream processing. Keep in mind that a longer queue also means more data to drain at shutdown.
  • Why it works: These settings tune how fast chunks leave Fluentd’s buffer. More flush threads raise delivery throughput, and a longer queue absorbs bursts, which reduces the likelihood that the output becomes a bottleneck during a shutdown flush.
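What happens once the buffer is full is governed by overflow_action, which this section has not shown so far. The fragment below is a sketch under assumed values: flush_thread_count is the v1 name for what older configs call num_threads, total_limit_size is the v1 way to cap buffered data, and the path and the 2g figure are hypothetical.

```text
<buffer>
  @type file
  path /var/log/td-agent/buffer/my_es   # hypothetical path
  flush_thread_count 4
  total_limit_size 2g       # hypothetical cap on total buffered data
  overflow_action block     # default is throw_exception
</buffer>
```

With block, inputs wait when the buffer is full rather than raising an error; drop_oldest_chunk is the third option and trades data loss for availability. Which one fits depends on whether your inputs can tolerate being stalled.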

4. Buffer Plugin chunk_limit_size and chunk_limit_records

The size of individual buffer chunks can significantly affect flush performance. A chunk is written to the destination as a unit, so if chunks are too large, a single flush can take a long time to complete, increasing the chance that shutdown cuts it off.

  • Diagnosis: Again, look at your <buffer> configuration.

    <buffer tag>
      @type file
      path /var/log/td-agent/buffer/my_app
      chunk_limit_size 10m # 10 MB per chunk
      chunk_limit_records 1000 # at most 1000 events per chunk
      flush_interval 5s
    </buffer>

    If your events are small, chunk_limit_records is usually the limit a chunk hits first. If events are large, chunk_limit_size is.

  • Fix: Reduce the size of individual chunks.

    <buffer tag>
      @type file
      path /var/log/td-agent/buffer/my_app
      chunk_limit_size 1m # Reduced from 10 MB
      chunk_limit_records 500 # Reduced from 1000
      flush_interval 5s
    </buffer>

    By making chunks smaller, each individual flush operation completes faster. This means Fluentd can complete more flushes within the shutdown_timeout window, reducing the amount of data at risk. The trade-off is more, smaller requests to the destination.

  • Why it works: Smaller chunks lead to quicker flush operations. A quicker flush is less likely to be interrupted by shutdown, and more flushes can be completed within the allotted timeout.

5. Using SIGUSR1 and SIGCONT for a More Controlled Shutdown (Advanced)

While SIGTERM is the standard graceful shutdown signal, two other signals are useful around it. SIGUSR1 forces Fluentd to flush its buffered events immediately (and reopen its log file) without stopping, so you can drain the buffers before sending SIGTERM. SIGCONT makes Fluentd dump its internal state through the sigdump library, which is less about preventing data loss and more about understanding what the process is doing.

  • Diagnosis: This is more about observation. If you suspect data is being lost even with shutdown_timeout and buffer tuning, you might want to see what Fluentd is doing while it flushes.

  • Fix: Send SIGUSR1 to the Fluentd process before stopping it, and SIGCONT if you need a state dump.

    # Find your Fluentd PID (match on the full command line)
    pgrep -f fluentd
    # Force an immediate flush of buffered events
    kill -s USR1 <fluentd_pid>
    # Optionally dump internal state for inspection
    kill -s CONT <fluentd_pid>

    SIGUSR1 starts a flush but does not wait for the destination to accept everything, so give it time to finish before sending SIGTERM. The SIGCONT dump is written by sigdump, by default to /tmp/sigdump-<pid>.log, and it is only a snapshot. For the shutdown itself, SIGTERM is still the primary signal.

  • Why it works: SIGUSR1 moves the flush ahead of the shutdown, so less data is pending when SIGTERM arrives. SIGCONT is a diagnostic tool rather than a direct fix for data loss during shutdown, but understanding the state can inform other tuning parameters.

6. Ensure Output Destination is Ready

This is less about Fluentd’s configuration and more about your overall system. If your output destination (e.g., Elasticsearch cluster, S3 bucket) is overloaded or unavailable, Fluentd will retry, but eventually, the buffer might fill up or the shutdown_timeout will expire.

  • Diagnosis: Monitor your output destination. Are there errors in its logs? Is it reporting high load or slow response times?

  • Fix: Ensure your output destination is healthy and can keep up with the ingestion rate. This might involve scaling up your Elasticsearch cluster, reducing S3 request throttling (for example by writing fewer, larger objects), or optimizing your database writes.

  • Why it works: Fluentd’s buffering is a mechanism to handle temporary discrepancies between ingestion and delivery rates. If the destination is permanently or for a long duration unavailable, even the best Fluentd configuration will eventually fail to deliver data.
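Fluentd itself can also show whether the destination is falling behind. As a minimal sketch, the built-in monitor_agent input exposes per-plugin metrics over HTTP; the bind address and port below are the commonly used values, chosen here as assumptions rather than requirements.

```text
<source>
  @type monitor_agent
  bind 127.0.0.1   # local access only
  port 24220
</source>
```

Querying http://127.0.0.1:24220/api/plugins.json then returns fields such as buffer_queue_length, buffer_total_queued_size, and retry_count for each output. A queue length or retry count that keeps climbing suggests the destination cannot keep up, and that a shutdown at that moment would leave a large backlog to drain.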

After fixing buffer flush issues, the next problems you’re likely to hit are resource exhaustion, if the output destination can’t keep up now that more data is actually being delivered, and configuration errors in any new plugin you introduce.
