Fluent Bit’s internal storage mechanism can silently drop incoming logs if not monitored, leading to data loss that’s hard to trace back.

Let’s see Fluent Bit in action, specifically how it handles buffered data and the metrics that tell us its health.

Imagine a scenario where your application is spewing logs at a high rate, but your downstream destination (like Elasticsearch or S3) is slow to ingest them. Fluent Bit buffers these logs in memory by default, and optionally on disk, to smooth out the flow. This buffering is essential, but it has limits. If the buffer fills up faster than Fluent Bit can drain it, bad things happen.

Here’s a simplified Fluent Bit configuration showcasing buffering:

[SERVICE]
    Flush        5
    Daemon       On
    Log_Level    info
    Parsers_File parsers.conf
    HTTP_Server  On
    HTTP_Listen  127.0.0.1
    HTTP_Port    2020

[INPUT]
    Name         tail
    Path         /var/log/app.log
    Tag          app.logs
    # In-memory buffer limit (comments must sit on their own line)
    Mem_Buf_Limit 10MB

[OUTPUT]
    Name         es
    Match        app.logs
    Host         localhost
    Port         9200
    Logstash_Format On
    # Keep retrying indefinitely
    Retry_Limit  False
    # Max queue size for output buffer
    Buffer_Queue_Size 10000
    # Max bytes for output buffer
    Buffer_Max_Bytes 1000000
    # Chunk size for output buffer
    Buffer_Chunk_Size 100000

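In this setup, Mem_Buf_Limit caps how much data Fluent Bit holds in RAM for the tail input. With the default memory-only buffering, reaching that limit pauses the input until buffered data has been flushed. Data is buffered on disk only when filesystem buffering is enabled, which takes both storage.path in [SERVICE] and storage.type filesystem on the input; neither is set in this example. The output section has parameters like Buffer_Queue_Size, Buffer_Max_Bytes, and Buffer_Chunk_Size that are intended to govern how Fluent Bit stages data before sending it to Elasticsearch. Supported option names differ between plugins and releases, so check them against the es output documentation for your version.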

The problem arises when the rate of logs entering Fluent Bit exceeds the rate at which it can process and forward them. Fluent Bit has a few mechanisms to prevent data loss, but they rely on you observing its internal state.
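To make that imbalance concrete, here is a back-of-the-envelope sketch in Python. The function name and all the numbers are made up for illustration; it only models a fixed-size buffer filling at the difference between a constant ingest rate and a constant drain rate, which is a simplification of what Fluent Bit actually does.

```python
def seconds_until_full(limit_bytes, used_bytes, in_rate, out_rate):
    """Estimate how long a fixed-size buffer lasts at constant rates.

    Rates are in bytes per second. Returns None when the buffer is not
    growing (drain rate >= ingest rate), since it then never fills.
    """
    net_rate = in_rate - out_rate
    if net_rate <= 0:
        return None
    free_bytes = max(limit_bytes - used_bytes, 0)
    return free_bytes / net_rate


# Hypothetical numbers: an empty 10,000,000-byte buffer, with the
# application writing 2,000,000 B/s while the output drains 1,500,000 B/s.
eta = seconds_until_full(10_000_000, 0, 2_000_000, 1_500_000)
print(eta)  # 20.0
```

Even a modest gap between the two rates exhausts a small buffer in seconds, which is why the trend matters more than the configured size.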

The key to preventing data loss isn’t just setting large buffer sizes; it’s actively monitoring Fluent Bit’s metrics. Fluent Bit exposes a rich set of metrics via its HTTP server (enabled by HTTP_Server On and HTTP_Port 2020 in the [SERVICE] section). These metrics provide a real-time view into its internal workings.

A natural starting point is fluentbit_input_bytes_total and its output-side counterpart, fluentbit_output_proc_bytes_total. Both are cumulative counters, so on their own they only show the total bytes processed. What you really need to look at are the buffer-related metrics.

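Specifically, you want to monitor the following. Metric names differ between Fluent Bit versions and between the v1 and v2 metrics endpoints, so confirm the exact names against your own endpoint’s output: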

  • fluentbit_mem_buffer_usage_bytes_total: This tells you how much of the in-memory buffer is currently in use. If this approaches Mem_Buf_Limit (10MB in our example), you’re getting close to a critical state for that input.
  • fluentbit_disk_buffer_usage_bytes_total: This indicates the usage of the on-disk buffer. If this starts to climb significantly, it means the in-memory buffer is full and Fluent Bit is spilling to disk. A persistently growing disk buffer suggests a bottleneck downstream.
  • fluentbit_output_queue_size: This metric, directly tied to Buffer_Queue_Size in the output configuration, shows how many records are waiting in the output queue. If this number is consistently high or growing, your output destination is struggling.

To retrieve these metrics, you can use curl:

curl http://127.0.0.1:2020/api/v1/metrics/prometheus

This will output Prometheus-formatted metrics. You’d typically scrape this endpoint with Prometheus itself and set up alerts.
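If you want to inspect the output by hand before wiring up Prometheus, a small parser is enough. The following is a minimal sketch, not a full implementation of the exposition format: it assumes label values contain no commas, braces, or escaped quotes. The sample text, including the `name="tail.0"` label and the values, is invented for illustration rather than captured from a real instance.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text-format lines into a list of samples.

    Each sample is a (metric_name, labels_dict, value) tuple. Comment
    lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Form: name{key="value",...} value [timestamp]
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            labels = {}
            for pair in label_part.split(","):
                if pair:
                    key, raw = pair.split("=", 1)
                    labels[key.strip()] = raw.strip().strip('"')
        else:
            # Form: name value [timestamp]
            name, value_part = line.split(None, 1)
            labels = {}
        # The first token after the name/labels is the value; an
        # optional timestamp may follow and is ignored here.
        value = float(value_part.split()[0])
        samples.append((name.strip(), labels, value))
    return samples


# Invented sample in the shape of the endpoint's output.
SAMPLE = """\
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="tail.0"} 1048576 1700000000000
"""

for name, labels, value in parse_prometheus_text(SAMPLE):
    print(name, labels, value)
```

In practice you would feed the function the body returned by the curl command above instead of the hard-coded sample.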

For example, an alert for a nearly full in-memory buffer might look like this (as a Prometheus alerting rule; Alertmanager only routes the alerts that such rules fire):

groups:
- name: fluentbit_alerts
  rules:
  - alert: FluentBitInputBufferFull
    expr: |
      fluentbit_mem_buffer_usage_bytes_total > (0.9 * 10 * 1024 * 1024) # 90% of 10MB
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Fluent Bit input buffer nearing capacity"
      description: "The in-memory buffer for input {{ $labels.name }} is using {{ $value | humanize1024 }}B, which is above 90% of the 10MB limit. This may lead to data loss if not addressed."

If you see the fluentbit_mem_buffer_usage_bytes_total metric consistently high, it means your input rate is too high for the current Mem_Buf_Limit, or your output is too slow. The first step is usually to increase Mem_Buf_Limit in the [INPUT] section, perhaps to 50MB. This gives Fluent Bit more breathing room in memory.

If the fluentbit_disk_buffer_usage_bytes_total metric starts to climb, it means Fluent Bit is writing to disk. While disk buffering is a safety net, a constantly filling disk buffer is a strong indicator that the output destination is the bottleneck. In this case, you’d focus on tuning the output parameters like Buffer_Queue_Size, Buffer_Max_Bytes, and Buffer_Chunk_Size in the [OUTPUT] section, or, more critically, addressing the performance of the downstream system. Increasing Buffer_Queue_Size to 20000 or Buffer_Max_Bytes to 5000000 can help absorb temporary spikes, but it’s not a long-term solution if the output is fundamentally slow.
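The difference between a temporary spike and a persistently growing buffer can be checked mechanically. The sketch below is one simple way to do it in Python, assuming you already have a list of buffer-usage readings taken at a fixed interval; the function name and the readings are hypothetical, and a real setup would more likely express this as a Prometheus query.

```python
def is_persistently_growing(samples, min_total_increase=0):
    """Report whether a series of gauge readings rises at every step.

    `samples` is a list of readings in time order (for example, bytes
    of buffer usage scraped at a fixed interval). A series that rises
    and then falls back is treated as a spike, not persistent growth.
    """
    if len(samples) < 2:
        return False
    rising = all(later > earlier
                 for earlier, later in zip(samples, samples[1:]))
    return rising and (samples[-1] - samples[0]) >= min_total_increase


# Made-up readings, in bytes, one per scrape interval.
steady_climb = [1_000_000, 2_500_000, 4_000_000, 6_200_000]
short_spike = [1_000_000, 5_000_000, 1_200_000, 900_000]

print(is_persistently_growing(steady_climb))  # True
print(is_persistently_growing(short_spike))   # False
```

A steady climb points at a slow destination, while a spike that drains on its own is exactly what larger buffers are meant to absorb.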

The most surprising thing about Fluent Bit’s buffering is that its "storage" mechanism, which sounds like a passive holding area, is actually an active participant in flow control. When buffers fill up, Fluent Bit doesn’t simply stall; depending on configuration, logs can be discarded. For example, if Retry_Limit is set to a number instead of False, Fluent Bit eventually gives up on sending a chunk once its retries are exhausted, and that chunk is discarded. This means you need to treat buffer metrics not just as indicators of load, but as direct signals of potential data loss.

If you’ve tuned your buffers and the metrics look good, but you’re still seeing data loss, the next thing to investigate is the storage.sync option in the [SERVICE] section.
