Fluent Bit’s default configuration can lead to surprisingly high CPU usage, especially under heavy load, because it processes incoming log streams sequentially.
Let’s see Fluent Bit in action, processing logs and sending them to a destination. Imagine you have a web server generating logs.
192.168.1.10 - - [10/Oct/2023:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0"
192.168.1.11 - - [10/Oct/2023:10:00:02 +0000] "GET /about.html HTTP/1.1" 200 567 "-" "Chrome/90.0.4430.93"
192.168.1.10 - - [10/Oct/2023:10:00:03 +0000] "POST /submit HTTP/1.1" 201 50 "-" "curl/7.68.0"
If Fluent Bit is configured to tail these logs and send them to Elasticsearch, it reads each line, parses it, optionally enriches it with filters, and appends the result to a buffer that is later flushed to the output. Unless you tune it, most of that per-line work runs on a single thread, which becomes the bottleneck.
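To make the per-line cost concrete, here is a minimal Python sketch of what "parsing" one of the sample lines above involves. The regular expression and the parse_line helper are illustrative assumptions written for this article, not Fluent Bit's own parser definition; the point is only that every line passes through a regex match plus a few type conversions.

```python
import re

# Illustrative pattern for the access-log lines shown above (an assumption,
# not the regex shipped in Fluent Bit's parsers.conf).
LINE_RE = re.compile(
    r'^(?P<remote>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Turn one raw log line into a structured record, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Numeric fields are converted so downstream consumers can aggregate on them.
    record["code"] = int(record["code"])
    record["size"] = int(record["size"])
    return record


sample = ('192.168.1.10 - - [10/Oct/2023:10:00:01 +0000] '
          '"GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0"')
print(parse_line(sample))
```

Multiply that match by tens of thousands of lines per second and it becomes clear why a single thread doing all of it can saturate a core.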
The core problem Fluent Bit solves is efficiently collecting, transforming, and routing large volumes of machine-generated data. It acts as a lightweight, high-performance log forwarder. Internally, it uses a plugin architecture. Inputs (like tail or docker) collect data, filters (like grep or lua) modify it, and outputs (like elasticsearch or stdout) send it to destinations. Each plugin can be thought of as a stage in a processing pipeline.
A natural starting point for performance tuning is the [SERVICE] section of your fluent-bit.conf, which configures the core behavior of the Fluent Bit process itself. Per-plugin options in the [INPUT] and [OUTPUT] sections then refine how individual stages behave.
Here’s a typical fluent-bit.conf snippet:
[SERVICE]
    Flush 5
    Daemon On
    Log_Level info
    Parsers_File parsers.conf
    HTTP_Server On
    HTTP_Listen 127.0.0.1
    HTTP_Port 2020
To reduce CPU usage, we’ll focus on two key areas: threading and buffer tuning.
Threading for Parallel Processing
By default, Fluent Bit uses a single thread for processing. When you have many log sources or complex processing pipelines, this becomes a bottleneck. Enabling multiple worker threads allows Fluent Bit to process different input streams or stages concurrently.
Diagnosis:
Monitor your Fluent Bit process’s CPU usage per thread, for example with top -H -p <pid>. If one thread sits near 100% of a core while the rest of the machine is mostly idle, threading is likely a good area to explore.
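If you would rather script this check, the following Linux-only Python sketch reads per-thread CPU counters from /proc. The helper names are hypothetical and written for this article; the field positions follow the proc(5) layout of the stat file, where utime and stime are fields 14 and 15.

```python
import os


def parse_task_stat(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name sits in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting naively.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    name = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 2:].split()
    # fields[0] is the state (field 3), so utime/stime (fields 14/15) are 11/12.
    utime, stime = int(fields[11]), int(fields[12])
    return name, utime + stime


def thread_cpu_ticks(pid):
    """Map thread id -> (name, cumulative CPU ticks) for every thread of a process."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as handle:
                ticks[int(tid)] = parse_task_stat(handle.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
    return ticks
```

Call thread_cpu_ticks twice a few seconds apart, subtract the tick counts, and divide by os.sysconf("SC_CLK_TCK") to get CPU seconds per thread. One thread accounting for nearly all of the difference points to a single-threaded bottleneck.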
Configuration:
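Threading is configured per plugin rather than in the [SERVICE] section. Two settings matter. Workers on an [OUTPUT] section (Fluent Bit 1.7 and later) sets how many worker threads that output uses to format and deliver data. Threaded on an [INPUT] section (Fluent Bit 2.x) runs that input, including the parser named in a tail input, in its own thread instead of on the main engine thread.
Example:
[INPUT]
    Name tail
    # Example path; adjust to where your web server writes its access log
    Path /var/log/nginx/access.log
    Parser nginx
    # Run this input in its own thread
    Threaded On
[OUTPUT]
    Name es
    Match *
    Host your-elasticsearch-host
    Port 9200
    # Use 4 worker threads to deliver data to this output
    Workers 4
Why it works: Workers gives the output a pool of threads, so serializing records and waiting on the network no longer compete with ingestion on a single thread. Threaded moves reading and parsing for that input onto its own thread. Neither setting reduces the total amount of work; they spread it across cores so that no single thread saturates, which raises throughput. Filters may still run on the main engine thread, so a heavy filter chain can remain a bottleneck. For I/O-bound tasks, you might see less dramatic CPU reduction but better overall responsiveness.
Fix:
Modify your fluent-bit.conf to add Workers to your busiest outputs and Threaded On to your heaviest inputs. A good starting point is Workers 2 or Workers 4, depending on your CPU cores. Monitor CPU usage after applying the change.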
Buffer Tuning for Efficient I/O
Fluent Bit uses internal buffers to manage data flow between input, filter, and output plugins. When these buffers are too small or not flushed frequently enough, Fluent Bit might spend more CPU cycles managing buffer overflow or waiting for data to be sent. Conversely, overly large buffers can increase memory usage and latency.
Diagnosis: If you see Fluent Bit’s CPU usage spike during periods of high log volume, or if you observe output plugins reporting slow writes or buffer full errors, buffer tuning is relevant. Check Fluent Bit’s logs for messages related to buffer pressure.
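With HTTP_Server enabled as in the [SERVICE] snippet earlier, Fluent Bit exposes counters at /api/v1/metrics that can back up this diagnosis with numbers. The Python sketch below compares two snapshots and reports per-output rates. The JSON layout it assumes (an "output" object keyed by plugin instance, with proc_records, retries, and errors counters) reflects the v1 metrics endpoint as I understand it; verify the field names against your version before relying on them.

```python
import json
import urllib.request

# Matches HTTP_Listen / HTTP_Port from the [SERVICE] example; adjust as needed.
METRICS_URL = "http://127.0.0.1:2020/api/v1/metrics"


def fetch_metrics(url=METRICS_URL, timeout=2.0):
    """Download one snapshot of Fluent Bit's built-in metrics as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def output_rates(previous, current, seconds):
    """Per-second change of selected output counters between two snapshots."""
    rates = {}
    for name, now in current.get("output", {}).items():
        before = previous.get("output", {}).get(name, {})
        rates[name] = {
            key: (now.get(key, 0) - before.get(key, 0)) / seconds
            for key in ("proc_records", "retries", "errors")
        }
    return rates
```

Take a snapshot with fetch_metrics(), wait ten seconds, take another, and pass both to output_rates. A non-zero retry or error rate during CPU spikes suggests the output is pushing back and buffers are filling.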
Configuration:
Tune the Flush interval in the [SERVICE] section and, more granularly, the buffer limits on individual input plugins. For the tail input these are Buffer_Chunk_Size, Buffer_Max_Size, and Mem_Buf_Limit.
Example (Global):
[SERVICE]
    # ... other configurations
    # Flush buffered data every 10 seconds
    Flush 10
Example (Per-Plugin - e.g., for tail input):
[INPUT]
    Name tail
    # Example path; adjust to where your web server writes its access log
    Path /var/log/nginx/access.log
    # Read file data in 64 KB chunks
    Buffer_Chunk_Size 64k
    # Let the per-file buffer grow to 256 KB to accommodate long lines
    Buffer_Max_Size 256k
    # Pause this input once it holds 50 MB of undelivered records in memory
    Mem_Buf_Limit 50MB
Why it works:
The Flush interval dictates how often Fluent Bit attempts to send buffered data to the output. Increasing this value (e.g., from 5 to 10 seconds) reduces the frequency of I/O operations, which can be CPU-intensive, at the cost of higher delivery latency. Buffer_Chunk_Size sets the initial size of the buffer used to read file data, so larger values mean fewer read operations per megabyte of logs. Buffer_Max_Size caps how far that per-file buffer may grow, which matters when individual lines are long. Mem_Buf_Limit caps the memory the input may hold in records that have not yet been delivered. When the limit is reached the input is paused until the outputs drain, which bounds memory during bursts but delays ingestion.
Fix:
Start by adjusting Flush in the [SERVICE] section. If that’s not enough, experiment with Buffer_Chunk_Size, Buffer_Max_Size, and Mem_Buf_Limit on your busiest tail inputs. For example, try Buffer_Chunk_Size 256k, Buffer_Max_Size 1M, and Mem_Buf_Limit 100MB for high-volume inputs, but monitor memory usage.
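The trade-off behind the Flush setting is simple arithmetic, sketched here in Python. The 200 KB/s ingest rate is an arbitrary assumption for illustration, and the model deliberately ignores how Fluent Bit splits data into chunks internally; it only shows how the interval trades flush frequency against the amount of data waiting in memory.

```python
def flush_profile(bytes_per_second, flush_seconds):
    """Return (flushes per hour, approximate bytes accumulated per flush)."""
    flushes_per_hour = 3600 / flush_seconds
    bytes_per_flush = bytes_per_second * flush_seconds
    return flushes_per_hour, bytes_per_flush


# Assumed steady ingest rate of 200 KB/s, compared across three intervals.
for interval in (1, 5, 10):
    per_hour, payload = flush_profile(200_000, interval)
    print(f"Flush {interval:>2}: {per_hour:6.0f} flushes/hour, "
          f"~{payload / 1e6:.1f} MB buffered per flush")
```

Doubling the interval halves the number of flush cycles but doubles the data held between them, which is why memory usage needs watching as the interval grows.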
The Hidden Cost of Parser Selection
Parsers defined in your Parsers_File are loaded once at startup, so the definitions themselves cost little. The expense is per record: each time Fluent Bit applies a parser, it runs a regular expression or other decoder against the line. If a parser filter lists several Parser entries, they are tried in order until one matches, so a long list with the most common format near the end multiplies that work and contributes to background CPU churn that’s hard to pinpoint.
If you’ve tuned threading and buffers and still see unexpected CPU usage, consider whether your parsing strategy is broader than your workload needs. Assigning one specific parser directly in the input plugin configuration (for example, the Parser option of a tail input) instead of trying a chain of candidates, and listing the most common format first when a chain is unavoidable, can sometimes yield marginal but measurable CPU improvements by reducing failed match attempts.
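A small Python sketch shows how parser selection overhead scales with the number and order of candidate patterns. The three regular expressions and the 90/10 traffic mix are assumptions made up for this illustration; the sketch models a generic "try each pattern until one matches" strategy rather than Fluent Bit's internals.

```python
import re


def count_attempts(patterns, lines):
    """Count regex match attempts when patterns are tried in order per line."""
    attempts = 0
    for line in lines:
        for pattern in patterns:
            attempts += 1
            if pattern.match(line):
                break  # first matching pattern wins, like a first-match parser chain
    return attempts


# Hypothetical candidate patterns for three log formats.
access = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "')
syslog = re.compile(r'^<\d+>')
json_like = re.compile(r'^\{')

# Assumed traffic mix: nine access-log lines for every JSON line.
access_line = ('192.168.1.10 - - [10/Oct/2023:10:00:01 +0000] '
               '"GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0"')
lines = [access_line] * 9 + ['{"level": "info"}']

rare_first = count_attempts([syslog, json_like, access], lines)
common_first = count_attempts([access, json_like, syslog], lines)
print(rare_first, common_first)
```

With the rare formats listed first, the ten lines cost 29 match attempts; with the common format first, the same lines cost 11. Every avoided attempt is a regex evaluation that never runs.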
The next error you’ll likely encounter after optimizing CPU usage is related to memory consumption if buffers are set too aggressively, or network congestion if your output destination can’t keep up with the increased throughput.