Fluentd’s buffer and retry mechanisms are the unsung heroes of reliable log aggregation, preventing data loss even when downstream systems hiccup.
Here’s Fluentd’s buffer in action, processing a stream of logs and periodically flushing them to a destination. Imagine we’re collecting Nginx access logs and sending them to Elasticsearch.
<source>
  @type tail
  path /var/log/nginx/access.log
  # in_tail stores its read position in pos_file
  pos_file /var/log/td-agent/nginx-access.log.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>
<match nginx.access>
  @type elasticsearch
  host localhost
  port 9200
  logstash_format true
  logstash_prefix nginx-access
  include_tag_key true
  tag_key @log_name
  <buffer>
    @type file
    path /var/log/td-agent/buffer/nginx
    flush_mode interval
    flush_interval 5s
    retry_max_times 10
    retry_wait 1s
    chunk_limit_size 2m
    chunk_limit_records 1000
  </buffer>
</match>
The core problem the buffer solves is keeping log data from being lost when the network is unstable or the destination service is down. It does this by decoupling log ingestion from log forwarding. When the elasticsearch output plugin can’t reach Elasticsearch, Fluentd doesn’t drop the logs; it holds them in the buffer and tries again later.
The <buffer> section in the configuration is where this magic happens.
@type file: The buffer is stored on disk. Other options include memory (faster, but buffered data is lost on restart) and buffer types provided by third-party plugins. Using file is the most common choice for durability.
path /var/log/td-agent/buffer/nginx: The directory where Fluentd writes its buffer files (chunks). Make sure the td-agent user can create and write to this directory.
flush_mode interval: This dictates when Fluentd attempts to send buffered data. interval means it tries once every flush_interval (here 5s). Other modes include lazy (flushes chunks once per timekey, which suits time-keyed buffers) and immediate (flushes as soon as events are appended to a chunk, which is less common in high-throughput scenarios).
retry_max_times 10: If a flush fails, Fluentd retries sending that chunk up to 10 times.
retry_wait 1s: After a failed flush, Fluentd waits 1 second before the first retry. With the default exponential backoff retry type, the wait grows with each further attempt, which keeps Fluentd from overwhelming a struggling destination.
chunk_limit_size 2m: Each buffer chunk will not exceed 2 megabytes. Once a chunk reaches this size, Fluentd starts a new one.
chunk_limit_records 1000: A chunk is also considered full once it contains 1000 records, regardless of its size in bytes. A chunk is queued for flushing when either chunk_limit_size or chunk_limit_records is reached, or when flush_interval elapses.
When Fluentd tries to flush a chunk and the destination (Elasticsearch, in this case) is unavailable, the chunk is not deleted. Instead, it stays in the buffer and is scheduled for retry. Fluentd waits retry_wait before the first retry and, with the default exponential backoff, waits progressively longer before each later attempt, up to retry_max_times retries. If all retries fail, the chunk is dropped (or handed to a <secondary> output, Fluentd’s fallback for undeliverable chunks, if one is configured), but only after the retry attempts are exhausted.
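The fallback for chunks that run out of retries is a <secondary> section inside the <match> block. The fragment below is a minimal sketch, assuming the built-in secondary_file output; the directory and basename values are hypothetical placeholders, and the other settings are copied from the configuration above.

```text
<match nginx.access>
  @type elasticsearch
  host localhost
  port 9200
  <buffer>
    @type file
    path /var/log/td-agent/buffer/nginx
    retry_max_times 10
    retry_wait 1s
  </buffer>
  <secondary>
    # Chunks the primary output could not deliver are written here
    # instead of being discarded.
    @type secondary_file
    # Hypothetical location; any directory writable by td-agent works.
    directory /var/log/td-agent/failed
    basename nginx-access
  </secondary>
</match>
```

Chunks dumped this way stay on disk in Fluentd’s chunk format, so they can be inspected or replayed once the destination is healthy again.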
The interplay between flush_interval, chunk_limit_size, and chunk_limit_records is crucial for tuning. A short flush_interval with small chunk sizes means more frequent, smaller flushes, leading to lower latency but potentially higher overhead on the destination. Larger chunks and longer intervals reduce overhead but increase latency and the amount of data buffered during an outage.
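To make the trade-off concrete, here is a sketch of two <buffer> sections at opposite ends of the range. The specific values are illustrative assumptions, not recommendations. total_limit_size and overflow_action are standard buffer parameters that do not appear in the configuration above; they are included because larger chunks and longer intervals make it more important to cap how much an outage can pile up on disk.

```text
# Option A: low latency, more frequent and smaller requests
<buffer>
  @type file
  path /var/log/td-agent/buffer/nginx
  flush_mode interval
  flush_interval 1s
  chunk_limit_size 1m
</buffer>

# Option B: lower overhead, higher latency, more data held during an outage
<buffer>
  @type file
  path /var/log/td-agent/buffer/nginx
  flush_mode interval
  flush_interval 30s
  chunk_limit_size 8m
  # Upper bound on the total size of buffered chunks (assumed value)
  total_limit_size 2g
  # When the bound is hit, pause the input rather than raise an error
  overflow_action block
</buffer>
```

Only one of the two would go inside a given <match> block. With a tail input, blocking on overflow is usually tolerable because the source file keeps the unread lines until Fluentd catches up.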
One common pitfall is not setting retry_wait and retry_max_times appropriately. If retry_wait is too short (e.g., 0s) and retry_max_times is high, Fluentd can hammer a failing service with retries, making the problem worse. A common starting point is retry_wait 5s and retry_max_times 5.
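Spelled out as configuration, that starting point might look like the sketch below. retry_type, retry_exponential_backoff_base, and retry_max_interval are standard buffer parameters that the original configuration leaves at their defaults; the 30s cap is an assumed value chosen for illustration.

```text
<buffer>
  @type file
  path /var/log/td-agent/buffer/nginx
  # exponential_backoff is the default; shown here for clarity
  retry_type exponential_backoff
  retry_wait 5s
  retry_exponential_backoff_base 2
  # Assumed cap so that the wait between retries stops growing at 30s
  retry_max_interval 30s
  retry_max_times 5
</buffer>
```

With these values the waits before the five retries come out to roughly 5s, 10s, 20s, 30s, and 30s, so a chunk is given up on after about a minute and a half of failures. Fluentd adds some random jitter to each wait by default, so the real numbers will vary slightly.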
The next challenge you’ll likely face is handling situations where the destination is permanently unavailable or data corruption occurs, leading to persistent flush failures.