Fluent Bit’s internal storage mechanism can silently drop incoming logs if not monitored, leading to data loss that’s hard to trace back.
Let’s see Fluent Bit in action, specifically how it handles buffered data and the metrics that tell us its health.
Imagine a scenario where your application is spewing logs at a high rate, but your downstream destination (like Elasticsearch or S3) is slow to ingest them. Fluent Bit, by default, buffers these logs in memory or on disk to smooth out the flow. This buffering is essential, but it has limits. If the buffer fills up faster than Fluent Bit can drain it, bad things happen.
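To put a number on "fills up faster than it can drain", here is a minimal back-of-the-envelope sketch. The function name and the rates are hypothetical, chosen only for illustration; this is not how Fluent Bit itself accounts for its buffers.

```python
def seconds_until_full(limit_bytes, ingest_bps, drain_bps, used_bytes=0):
    """Estimate how long a buffer lasts when ingest outpaces drain.

    Returns infinity when the drain rate keeps up with the ingest rate.
    All names and units here are illustrative assumptions.
    """
    net_bps = ingest_bps - drain_bps
    if net_bps <= 0:
        return float("inf")
    return max(limit_bytes - used_bytes, 0) / net_bps


if __name__ == "__main__":
    limit = 10 * 1024 * 1024          # a 10 MiB buffer
    ingest = 2 * 1024 * 1024          # logs arriving at 2 MiB/s
    drain = 1.5 * 1024 * 1024         # destination accepting 1.5 MiB/s
    print(seconds_until_full(limit, ingest, drain))
```

With these made-up rates the net inflow is 0.5 MiB/s, so a 10 MiB buffer absorbs only about 20 seconds of backlog. That is why buffer size alone rarely fixes a slow destination.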
Here’s a simplified Fluent Bit configuration showcasing buffering:

    [SERVICE]
        Flush         5
        Daemon        On
        Log_Level     info
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   127.0.0.1
        HTTP_Port     2020

    [INPUT]
        Name          tail
        Path          /var/log/app.log
        Tag           app.logs
        # In-memory buffer limit
        Mem_Buf_Limit 10MB

    [OUTPUT]
        Name              es
        Match             app.logs
        Host              localhost
        Port              9200
        Logstash_Format   On
        # Keep retrying indefinitely
        Retry_Limit       False
        # Output staging parameters; confirm that the es plugin in your
        # Fluent Bit version accepts these keys before using them.
        # Max queue size for output buffer
        Buffer_Queue_Size 10000
        # Max bytes for output buffer
        Buffer_Max_Bytes  1000000
        # Chunk size for output buffer
        Buffer_Chunk_Size 100000

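In this setup, Mem_Buf_Limit caps how much data Fluent Bit holds in RAM for the tail input. With the default in-memory storage, reaching that limit pauses the input until buffered data has been flushed. Spilling to disk happens only when filesystem buffering is enabled, which requires storage.path in [SERVICE] and storage.type filesystem on the input. The output section lists parameters (Buffer_Queue_Size, Buffer_Max_Bytes, and Buffer_Chunk_Size) that are meant to govern how Fluent Bit stages data before sending it to Elasticsearch; check them against the es plugin documentation for your version before relying on them.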
The problem arises when the rate of logs entering Fluent Bit exceeds the rate at which it can process and forward them. Fluent Bit has a few mechanisms to prevent data loss, but they rely on you observing its internal state.
The key to preventing data loss isn’t just setting large buffer sizes; it’s actively monitoring Fluent Bit’s metrics. Fluent Bit exposes a rich set of metrics via its HTTP server (enabled by HTTP_Server On and HTTP_Port 2020 in the [SERVICE] section). These metrics provide a real-time view into its internal workings.
The most basic metrics are fluentbit_input_bytes_total and its output-side counterpart, fluentbit_output_proc_bytes_total. However, these only show the cumulative bytes processed. What you really need to look at are the buffer-related metrics.
Specifically, you want to monitor:
fluentbit_mem_buffer_usage_bytes_total: This tells you how much of the in-memory buffer is currently in use. If this approaches Mem_Buf_Limit (10MB in our example), you’re getting close to a critical state for that input.
fluentbit_disk_buffer_usage_bytes_total: This indicates the usage of the on-disk buffer. If this starts to climb significantly, it means the in-memory buffer is full and Fluent Bit is spilling to disk. A persistently growing disk buffer suggests a bottleneck downstream.
fluentbit_output_queue_size: This metric, directly tied to Buffer_Queue_Size in the output configuration, shows how many records are waiting in the output queue. If this number is consistently high or growing, your output destination is struggling.
Exact metric names vary between Fluent Bit versions, so confirm them against the output of your own metrics endpoint before building dashboards or alerts on them.
To retrieve these metrics, you can use curl:

    curl http://127.0.0.1:2020/api/v1/metrics/prometheus

This will output Prometheus-formatted metrics. You’d typically scrape this endpoint with Prometheus itself and set up alerts.
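If you want to inspect the output yourself before wiring up Prometheus, the text format is simple enough to parse by hand. The following is a minimal sketch, not a full implementation of the exposition format; the sample payload is made up for illustration, and the helper name parse_metrics is hypothetical.

```python
import re

# Metric name, optional {labels}, then the value. A trailing timestamp is ignored.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Illustrative payload only; real output depends on your version and plugins.
SAMPLE = """\
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="tail.0"} 5242880 1700000000000
fluentbit_output_proc_bytes_total{name="es.0"} 4194304 1700000000000
"""


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        match = _SAMPLE_RE.match(line)
        if not match:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

To run it against a live instance, replace SAMPLE with the body returned by the curl command above.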
For example, an alert for a nearly full in-memory buffer might look like this (as a Prometheus alerting rule):

    groups:
      - name: fluentbit_alerts
        rules:
          - alert: FluentBitInputBufferFull
            expr: |
              fluentbit_mem_buffer_usage_bytes_total > (0.9 * 10 * 1024 * 1024)  # 90% of 10MB
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Fluent Bit input buffer nearing capacity"
              # $value is the sample that triggered the alert (buffer usage in bytes)
              description: "The in-memory buffer for input {{ $labels.name }} is at {{ $value | humanize1024 }}B, above 90% of its 10MB limit. This may lead to data loss if not addressed."

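The arithmetic and the `for: 5m` clause in the rule above can be sketched outside Prometheus. This is a simplified model under stated assumptions, not Prometheus's actual evaluation logic; the function names and sample data are hypothetical.

```python
def threshold_bytes(limit_bytes, ratio=0.9):
    """Bytes at which the alert condition becomes true (90% of the limit by default)."""
    return ratio * limit_bytes


def alert_fires(samples, threshold, hold_seconds):
    """Simplified model of a rule with a `for` duration.

    samples: list of (timestamp_seconds, value), sorted by time.
    Returns True only if every sample since some point at least
    hold_seconds before the latest sample is above the threshold.
    """
    if not samples:
        return False
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
        else:
            breach_start = None  # any dip resets the pending period
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


if __name__ == "__main__":
    limit = 10 * 1024 * 1024
    threshold = threshold_bytes(limit)
    # One sample per minute for five minutes, all at 9.5 MiB
    steady = [(t, 9.5 * 1024 * 1024) for t in range(0, 301, 60)]
    print(threshold, alert_fires(steady, threshold, hold_seconds=300))
```

The reset on every dip is the point of the `for` clause: a brief spike past 90% does not page anyone, while a buffer that stays nearly full for five minutes does.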
If you see the fluentbit_mem_buffer_usage_bytes_total metric consistently high, it means your input rate is too high for the current Mem_Buf_Limit, or your output is too slow. The first step is usually to increase Mem_Buf_Limit in the [INPUT] section, perhaps to 50MB. This gives Fluent Bit more breathing room in memory.
If the fluentbit_disk_buffer_usage_bytes_total metric starts to climb, it means Fluent Bit is writing to disk. While disk buffering is a safety net, a constantly filling disk buffer is a strong indicator that the output destination is the bottleneck. In this case, you’d focus on tuning the output parameters like Buffer_Queue_Size, Buffer_Max_Bytes, and Buffer_Chunk_Size in the [OUTPUT] section, or, more critically, addressing the performance of the downstream system. Increasing Buffer_Queue_Size to 20000 or Buffer_Max_Bytes to 5000000 can help absorb temporary spikes, but it’s not a long-term solution if the output is fundamentally slow.
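One way to tell a temporary spike from a "constantly filling" buffer is to look at the trend rather than the current value. The sketch below fits a least-squares line through recent samples, similar in spirit to PromQL's deriv() function. The function name and the sample values are hypothetical.

```python
def growth_rate(samples):
    """Least-squares slope, in bytes per second, of (timestamp_seconds, bytes) samples."""
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must not all be equal")
    return numerator / denominator


if __name__ == "__main__":
    # Steady growth: the buffer gains 600 bytes every minute.
    growing = [(0, 0), (60, 600), (120, 1200)]
    # A spike that has fully drained by the last sample.
    spike = [(0, 100), (60, 5000), (120, 100)]
    print(growth_rate(growing), growth_rate(spike))
```

A slope that stays positive across a long window points at the downstream system; a slope near zero after a burst suggests the buffers did their job.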
The most surprising thing about Fluent Bit’s buffering is that its "storage" mechanism, which sounds like a passive holding area, is actually an active participant in flow control. When a buffer limit is reached, Fluent Bit pauses the affected input, and data can then be lost in two ways: inputs that cannot re-read their source lose whatever arrives while they are paused, and if Retry_Limit is set to a number instead of False, Fluent Bit eventually gives up on chunks it cannot deliver and discards them. This dynamic behavior means you need to treat buffer metrics not just as indicators of load, but as direct signals of potential data loss.
If you’ve tuned your buffers and the metrics look good, but you’re still seeing data loss, the next thing to investigate is the storage.sync option in the [SERVICE] section.