Fluentd’s buffer flush mechanism is the primary reason you might lose data when shutting down your service.
This isn’t about Fluentd crashing and losing data; it’s about Fluentd gracefully shutting down but still losing data because the buffers weren’t emptied. When Fluentd receives events, it doesn’t send them immediately. Instead, it writes them to a buffer. This buffer can be in memory or on disk. When the buffer is full, or after a certain time interval, Fluentd flushes the buffer, sending its contents to the configured output destination (like Elasticsearch, S3, etc.).
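During a normal shutdown, Fluentd tries to flush its buffers. However, if the process is killed before the flush completes, or if the output destination is slow to acknowledge receipt, events remaining in the buffer can be lost. Disk buffers need particular attention: the data is written locally but not yet sent downstream, and by default file buffers are not flushed at shutdown at all (flush_at_shutdown defaults to false for them). Chunks left on disk are delivered only if Fluentd later restarts with the same buffer path, so they are lost if that disk is ephemeral.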
Here’s how to ensure your Fluentd buffers are drained before shutdown:
1. The Graceful Shutdown Signal and shutdown_timeout
Fluentd responds to signals like SIGTERM (sent by systemd or docker stop) by initiating a graceful shutdown. This involves trying to flush all pending buffers before exiting. The shutdown_timeout parameter in fluentd.conf controls how long Fluentd waits for this flush to complete.
- Diagnosis: Check your Fluentd logs for messages indicating a shutdown is in progress. Look for lines like:

      2023-10-27 10:30:00 +0000 [info]: received SIGTERM
      2023-10-27 10:30:00 +0000 [info]: shutting down fluentd

  If these messages are followed by Fluentd exiting without further buffer flush messages, your shutdown_timeout might be too short, or the flush is genuinely taking too long.

- Fix: Increase the shutdown_timeout in your fluentd.conf. A common default is 5 seconds. For busy systems or slow outputs, you might need much longer.

      <system>
        shutdown_timeout 30s
      </system>

  This gives Fluentd up to 30 seconds to flush its buffers after receiving a shutdown signal. If your output plugin is very slow, you might need to set this even higher, but keep it below the grace period your orchestrator allows (docker stop, for example, sends SIGKILL after 10 seconds by default).

- Why it works: This directly tells Fluentd to spend more time trying to send data before giving up. It’s the first line of defense, ensuring the flush attempt is given sufficient time.
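The log check from the diagnosis step can be scripted. The following is a minimal sketch: the function name fluentd_shutdown_events is hypothetical, the log path in the usage comment is an assumption that varies by install, and the patterns only cover the example messages quoted in the diagnosis step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list shutdown-related lines in a Fluentd log file.
# The patterns match the example messages quoted in the diagnosis step;
# adjust them if your Fluentd version words these messages differently.
fluentd_shutdown_events() {
  local logfile="$1"
  grep -nE 'received SIG(TERM|INT)|shutting down' "$logfile"
}

# Usage (log path is an assumption; adjust to your install):
#   fluentd_shutdown_events /var/log/td-agent/td-agent.log
```

If the last matching line is the final entry in the log, Fluentd likely exited without reporting any further flush activity.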
2. Buffer Plugin Configuration: flush_interval and retry_max_times
The behavior of the buffer itself is crucial. Buffer plugins (such as memory and file) have parameters that influence when and how data is flushed.
- Diagnosis: Examine the <buffer> configurations within your <match> directives.

      <match *.**>
        @type forward
        <buffer tag>
          @type file
          path /var/log/td-agent/buffer/my_app
          flush_interval 5s
          retry_max_times 10
        </buffer>
        # ... other settings
      </match>

  If flush_interval is too long, data might sit in the buffer for extended periods. If retry_max_times is too low, failed flushes are abandoned too quickly.

- Fix (for file or memory buffers):

  - Reduce flush_interval: Set it to a shorter duration to encourage more frequent flushing.

        <buffer tag>
          @type file
          path /var/log/td-agent/buffer/my_app
          # Reduced from 5s
          flush_interval 1s
          retry_max_times 10
        </buffer>

    This makes Fluentd attempt to send data more often, reducing the amount of data that can be in flight or pending at any given shutdown moment.

  - Increase retry_max_times: This is more relevant for transient network issues with the output destination.

        <buffer tag>
          @type file
          path /var/log/td-agent/buffer/my_app
          flush_interval 5s
          # Increased from 10
          retry_max_times 30
        </buffer>

    This allows Fluentd to retry sending data multiple times if the output destination is temporarily unavailable, increasing the chance that data eventually gets through even if there are minor hiccups.

- Why it works: flush_interval directly controls how often Fluentd initiates a flush. A shorter interval means less data accumulates. retry_max_times improves the robustness of individual flush attempts, ensuring that temporary network blips don’t lead to permanent data loss.
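To see whether a shutdown actually left data behind in a file buffer, you can count the chunk files remaining in the buffer directory once Fluentd has exited. This is a minimal sketch, assuming the buffer path from the example configuration is used as a directory and that chunk data files end in .log (with .meta sidecar files next to them). Verify both assumptions against your Fluentd version. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count buffer chunk files left in a file-buffer directory.
# Assumption: chunk data files end in ".log"; ".meta" sidecar files are skipped.
count_buffer_chunks() {
  local dir="$1"
  find "$dir" -type f -name '*.log' | wc -l | tr -d ' '
}

# Usage, after Fluentd has exited (path taken from the example config):
#   count_buffer_chunks /var/log/td-agent/buffer/my_app
```

A non-zero count means undelivered chunks are still on disk and will only be sent if Fluentd starts again with the same buffer path.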
3. Output Plugin num_threads and queue_limit_length
Many output plugins (like http, kafka, elasticsearch) use internal queues and threads to manage sending data. The configuration of these can impact how quickly data is acknowledged and how much is buffered within the output plugin itself.
- Diagnosis: Consult the documentation for your specific output plugin. The elasticsearch output below, for example, sets num_threads and queue_limit_length.

      <match my_es>
        @type elasticsearch
        host elasticsearch.example.com
        port 9200
        logstash_format true
        logstash_prefix my-app
        include_tag_key true
        tag_key @log_name
        # This is Fluentd's buffer flush, not the output's internal queue
        flush_interval 5s
        # Output plugin specific settings below:
        num_threads 4
        queue_limit_length 8
      </match>

  If num_threads is too low, the output can’t keep up with Fluentd’s flushes. If queue_limit_length is too small, the queue might fill up and block Fluentd. These are the older parameter names; in Fluentd v1 the flush thread count is set with flush_thread_count inside the <buffer> section.

- Fix:

  - Increase num_threads: Give the output plugin more capacity to send data concurrently.

        <match my_es>
          @type elasticsearch
          # ... other settings
          # Increased from 4
          num_threads 8
          queue_limit_length 8
        </match>

    This allows the output plugin to process multiple outgoing requests in parallel, speeding up the overall delivery of buffered events.

  - Increase queue_limit_length: Allow more data to be queued before Fluentd starts blocking.

        <match my_es>
          @type elasticsearch
          # ... other settings
          num_threads 4
          # Increased from 8
          queue_limit_length 16
        </match>

    This provides a larger safety net, allowing the output to handle temporary bursts of data or slower downstream processing without immediately signaling back pressure to Fluentd.

- Why it works: These settings tune the output plugin’s ability to consume data from Fluentd’s buffer. By increasing its processing power and queueing capacity, you reduce the likelihood that the output plugin becomes a bottleneck during a shutdown flush.
4. Buffer Plugin chunk_limit_size and chunk_limit_records
The size of individual buffer chunks can significantly affect flush performance. If chunks are too large, a single flush operation might take a long time to complete, increasing the chance of being interrupted by a shutdown signal.
- Diagnosis: Again, look at your <buffer> configuration.

      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/my_app
        # 10 MB per chunk
        chunk_limit_size 10m
        # 1000 events per chunk
        chunk_limit_records 1000
        flush_interval 5s
      </buffer>

  If your events are small, chunk_limit_records might be more relevant. If events are large, chunk_limit_size is key.

- Fix: Reduce the size of individual chunks.

      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/my_app
        # Reduced from 10 MB
        chunk_limit_size 1m
        # Reduced from 1000
        chunk_limit_records 500
        flush_interval 5s
      </buffer>

  By making chunks smaller, each individual flush operation completes faster. This means Fluentd can process more flushes within the shutdown_timeout window, reducing the amount of data at risk.

- Why it works: Smaller chunks lead to quicker flush operations. A quicker flush means less chance of a shutdown signal interrupting the process, and more flushes can be completed within the allotted timeout.
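The relationship between chunk flush time and the shutdown window can be estimated with simple arithmetic. The sketch below assumes a deliberately crude model that is not taken from Fluentd itself: every chunk takes the same time to flush, and flushes run in parallel across a fixed number of threads. The function name and all numbers are illustrative.

```shell
#!/usr/bin/env bash
# Rough estimate of how many queued chunks can be flushed before the timeout.
# Model (an assumption): chunks = floor(timeout / seconds_per_chunk) * threads
chunks_flushable() {
  local timeout_s="$1" secs_per_chunk="$2" threads="$3"
  echo $(( (timeout_s / secs_per_chunk) * threads ))
}

# Example: 30 s timeout, 4 s per chunk, 2 flush threads
chunks_flushable 30 4 2   # prints 14
```

Smaller chunks do not raise total throughput in this model. The gain is that less data sits inside a flush that gets cut off when the timeout expires.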
5. Using SIGCONT to Inspect Fluentd’s State (Advanced)
SIGTERM remains the standard graceful shutdown signal. SIGQUIT is not one of Fluentd’s documented signals and should not be relied on for flushing or diagnostics. The documented diagnostic signal is SIGCONT, which makes Fluentd dump its internal state through sigdump. This is less about preventing data loss and more about understanding what Fluentd is doing while it flushes.
- Diagnosis: This is more about observation. If you suspect data is being lost even with shutdown_timeout and buffer tuning, you might want to see what Fluentd is doing while it tries to flush.

- Fix: Send SIGCONT to the Fluentd process.

      # Find your Fluentd PID
      pgrep -f fluentd
      # Send the signal
      kill -s CONT <fluentd_pid>

  This causes Fluentd to write a dump of its internal state (thread backtraces and memory statistics), by default to /tmp/sigdump-<pid>.log. It does not flush anything. To force a flush of buffered events without stopping Fluentd, send SIGUSR1. For the shutdown itself, SIGTERM is still the primary signal.

- Why it works: The dump shows where each thread is, which reveals whether flush threads are blocked waiting on the output destination. This is a diagnostic tool rather than a direct fix for data loss during shutdown, but understanding the state can inform other tuning parameters.
6. Ensure Output Destination is Ready
This is less about Fluentd’s configuration and more about your overall system. If your output destination (e.g., Elasticsearch cluster, S3 bucket) is overloaded or unavailable, Fluentd will retry, but eventually, the buffer might fill up or the shutdown_timeout will expire.
- Diagnosis: Monitor your output destination. Are there errors in its logs? Is it reporting high load or slow response times?

- Fix: Ensure your output destination is healthy and can keep up with the ingestion rate. This might involve scaling up your Elasticsearch cluster, increasing S3 write capacity, or optimizing your database writes.

- Why it works: Fluentd’s buffering is a mechanism to handle temporary discrepancies between ingestion and delivery rates. If the destination is unavailable permanently or for a long time, even the best Fluentd configuration will eventually fail to deliver data.
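For HTTP-based destinations, a quick reachability and latency probe can be part of that monitoring. This is a sketch only: the function name is hypothetical, the URL in the usage comment combines the host and port from the example configuration with Elasticsearch’s cluster health endpoint, and the 5-second limit is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe an HTTP endpoint, print status code and latency.
# Returns non-zero if the endpoint cannot be reached within 5 seconds.
# Note: an HTTP error status (e.g. 503) still counts as "reached" here.
probe_destination() {
  local url="$1"
  curl --silent --output /dev/null --max-time 5 \
       --write-out '%{http_code} %{time_total}s\n' "$url"
}

# Usage (host and port from the example config):
#   probe_destination http://elasticsearch.example.com:9200/_cluster/health
```

A response time that approaches your flush or shutdown timeouts suggests the destination, not Fluentd, is the bottleneck.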
The most common next error you’ll encounter after fixing buffer flush issues is related to resource exhaustion if your output destination can’t keep up with the now-guaranteed delivery of all data, or configuration errors in a new plugin you’re introducing.