Fluentd’s buffer flush is failing to keep up, causing data to back up and potentially get lost.

Common Causes and Fixes

1. Under-provisioned Buffer Size

  • Diagnosis: Check the buffer limits configured in your Fluentd output plugin. In Fluentd v1 the relevant parameters are total_limit_size (the cap on all buffered data for that output) and chunk_limit_size (the cap on a single chunk); there is no buffer_size parameter. If total_limit_size is too small, the buffer fills up during any downstream slowdown and new events are rejected with a BufferOverflowError. Look for warnings such as "failed to write data into buffer by buffer overflow" or "buffer space has too many data".
    • Command: grep -E 'total_limit_size|chunk_limit_size' /etc/fluentd/fluentd.conf (or your specific config file)
  • Fix: Increase total_limit_size in the output plugin's <buffer> section. The documented defaults are 512MB for memory buffers and 64GB for file buffers, so first check whether your config has lowered them. Size the limit to cover the longest downstream outage you want to ride out, and keep it below the free memory or disk space available.
    • Example Config Snippet:
      <match your.tag>
        @type forward
        # ... your downstream host and port
        <buffer tag>
          @type file
          path /var/log/fluentd/buffer/your_tag # Ensure this path exists and is writable
          flush_interval 10s
          chunk_limit_size 16m # Maximum size of a single chunk
          total_limit_size 8g # Raised cap on the whole buffer
          retry_max_times 5
          retry_wait 1s
        </buffer>
      </match>
      
  • Why it works: A larger total limit gives Fluentd more room to absorb bursts and short downstream outages before it has to reject or drop events. It does not make flushing faster on its own, so combine it with the fixes below if the destination is persistently slower than the ingest rate.
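
As a rough sizing aid, the arithmetic behind "how big is big enough" can be sketched in shell. The function name buffer_headroom_seconds and the sample figures are illustrative assumptions, not values read from Fluentd; substitute your own buffer limit and measured ingest rate.

```shell
# Estimate how long a buffer of a given size can absorb a downstream outage.
# Both arguments are illustrative inputs, not values read from Fluentd.
#   $1 = total buffer limit in MiB (e.g. the total_limit_size you plan to set)
#   $2 = sustained ingest rate in MiB per second (may be fractional, must be > 0)
buffer_headroom_seconds() {
  # awk handles fractional rates; the result is truncated to whole seconds.
  awk -v t="$1" -v r="$2" 'BEGIN { printf "%d\n", t / r }'
}

buffer_headroom_seconds 256 2    # prints 128  (about two minutes)
buffer_headroom_seconds 2048 2   # prints 1024 (about seventeen minutes)
```

If the result is shorter than the outages you actually see, the buffer will overflow before the destination recovers.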

2. Slow Downstream System/Network Latency

  • Diagnosis: The most direct indicator is high latency or error rates reported by the downstream system receiving data from Fluentd. If you have a metric for how long it takes for data to appear in your Elasticsearch, Splunk, or other destination, and that metric is increasing, this is likely the cause. Fluentd’s forward plugin, for instance, will show connection refused or timeout errors if the destination is not responding.
    • Command: Monitor your destination system’s ingestion rate and latency. For Elasticsearch, check _cat/indices for document counts and store size, and _nodes/stats/indices/indexing for indexing throughput. For network issues, use ping and traceroute from the Fluentd host to the destination.
  • Fix: Optimize the downstream system for faster ingestion (e.g., scale up Elasticsearch nodes, optimize indexing templates, increase network bandwidth). If using the forward plugin, tune send_timeout (default 60s) and, when require_ack_response is enabled, ack_response_timeout (default 190s) so that a stalled destination is detected and the chunk is retried promptly. Note that linger_timeout belongs to the forward input plugin, not the forward output.
    • Example Config Snippet (for forward plugin):
      <match your.tag>
        @type forward
        <server>
          host your_destination_host
          port 24224
        </server>
        send_timeout 30s # Lower than the 60s default; adjust if downstream is consistently slow
        # ... other configs
      </match>
      
  • Why it works: By addressing the bottleneck in the destination or network, you enable Fluentd to send data faster, allowing its buffers to drain more effectively.
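
To tell whether the destination is actually falling behind, it can help to compare two readings of the buffer's queued size taken some seconds apart. The sketch below assumes the readings come from the buffer_total_queued_size field exposed by Fluentd's monitor_agent input, enabled on its conventional port 24220; the function name buffer_growth_rate and the sample numbers are hypothetical.

```shell
# Net growth rate of the buffer, in bytes per second, from two samples.
#   $1 = earlier buffer_total_queued_size sample (bytes)
#   $2 = later sample (bytes)
#   $3 = seconds between the two samples
# A positive result means ingest is outpacing flushes.
buffer_growth_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Samples could be collected like this (assumes monitor_agent, curl and jq):
#   curl -s http://localhost:24220/api/plugins.json |
#     jq '.plugins[] | select(.buffer_total_queued_size != null) | .buffer_total_queued_size'

buffer_growth_rate 50000000 110000000 60   # prints 1000000 (about 1 MB/s of backlog growth)
```

A rate that stays positive over several minutes points at the destination or network rather than at Fluentd's own settings.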

3. Excessive Chunk Creation/Small Chunk Sizes

  • Diagnosis: Even with a large total_limit_size, if chunk_limit_size is set too low, Fluentd will create many small chunks. This increases the overhead of serializing, writing, and sending these chunks, even if the total data volume isn’t massive. A consistently high buffer_queue_length reported by the monitor_agent plugin, relative to the amount of data queued, is a typical sign.
    • Command: Inspect chunk_limit_size in the output plugin’s buffer configuration: grep 'chunk_limit_size' /etc/fluentd/fluentd.conf
  • Fix: Increase chunk_limit_size. A common value for high-throughput scenarios is 64m or 128m. Ensure chunk_limit_size is less than or equal to total_limit_size, and that the destination accepts requests of that size.
    • Example Config Snippet:
      <buffer tag>
        @type file
        path /var/log/fluentd/buffer/your_tag
        flush_interval 10s
        chunk_limit_size 64m # Increased chunk size
        total_limit_size 8g # Must be >= chunk_limit_size
        # ... other configs
      </buffer>
      
  • Why it works: Larger chunks reduce the number of I/O operations and network requests required to send data, making the overall flush process more efficient.
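
The effect of chunk size on flush count is simple ceiling division, sketched below. The function name chunks_needed and the 200 MiB figure (data accumulated between flushes for one chunk key) are assumptions chosen for illustration.

```shell
# Number of chunks needed to hold a given amount of data.
#   $1 = data accumulated between flushes, in MiB
#   $2 = chunk size limit, in MiB
chunks_needed() {
  # Integer ceiling division: round up so a partial chunk still counts.
  echo $(( ($1 + $2 - 1) / $2 ))
}

chunks_needed 200 8    # prints 25
chunks_needed 200 64   # prints 4
```

Each chunk costs at least one write and one request to the destination, so in this example the larger limit cuts the per-flush request count roughly sixfold.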

4. Inefficient Output Plugin or Downstream Serialization

  • Diagnosis: Some output plugins, or the serialization format they use, can be CPU-intensive on the Fluentd side. If your Fluentd instance is consistently hitting 100% CPU, especially during buffer flushes, this is a strong indicator. Check the CPU usage of the fluentd process.
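    • Command: top -p "$(pgrep -d, -f fluentd)" or htop (Fluentd usually runs a supervisor and at least one worker process, so pass top a comma-separated PID list)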
  • Fix: If an output formats records as JSON and supports a <format> section, consider msgpack, which is more compact and often faster to serialize/deserialize. The forward plugin already sends MessagePack on the wire, so it needs no extra option. If the output plugin itself is known to be inefficient, explore alternatives or look for performance-tuned versions. For example, fluent-plugin-elasticsearch has various performance options.
    • Example Config Snippet (forward plugin, which uses msgpack by default):
      <match your.tag>
        @type forward
        # No serialization option is needed: out_forward sends MessagePack
        <buffer tag>
          @type file
          # ... buffer configs
        </buffer>
        # ... server configs
      </match>
      
  • Why it works: More efficient serialization formats reduce the CPU load on the Fluentd worker threads responsible for preparing data for transmission, freeing up resources for flushing.

5. Insufficient Flush Threads or Worker Processes

  • Diagnosis: By default each buffered output flushes with a single thread (flush_thread_count 1) and Fluentd runs a single worker process. With a high volume of data or a high-latency destination, that one flush thread can spend most of its time waiting on the network while chunks queue up. Separately, a single Ruby process cannot use much more than one CPU core for Ruby code, so a fluentd worker pinned near 100% of one core while other cores sit idle suggests that more worker processes are needed.
    • Command: Check fluentd --version to confirm you are running Fluentd v1, which supports multi-process workers through the <system> section.
  • Fix: Raise flush_thread_count in the output’s <buffer> section so that several chunks are flushed in parallel; this helps most when flushing is latency-bound. If the process is CPU-bound, set workers in the <system> section to run several worker processes. Some input plugins, such as in_tail, do not run in multiple workers and must be pinned to one with a <worker N> directive.
    • Example Config Snippet:
      <system>
        workers 4 # Or higher, depending on your CPU cores
      </system>
      <match your.tag>
        @type forward
        # ... server configs
        <buffer tag>
          @type file
          path /var/log/fluentd/buffer/your_tag
          flush_thread_count 4 # Flush up to 4 chunks in parallel
        </buffer>
      </match>
      
  • Why it works: More flush threads let Fluentd overlap the network waits of several chunks, and more worker processes spread event processing across CPU cores, reducing the chance of any single buffer becoming starved.

6. Network Congestion or MTU Mismatches

  • Diagnosis: If data is being sent over a network, especially between different subnets or over WAN links, network congestion or incorrect Maximum Transmission Unit (MTU) settings can drastically slow down transfers. You might see TCP retransmissions or packet loss.
    • Command: Use iperf3 between the Fluentd host and the destination to test raw network throughput. Check netstat -s for TCP retransmissions. On Linux, ip addr show on relevant interfaces can show MTU.
  • Fix: Address network congestion (e.g., QoS, increased bandwidth). Ensure MTU settings are consistent across the path from Fluentd to its destination. A common MTU is 1500. If using VPNs or tunnels, MTU might need to be lower (e.g., 1400 or 1350).
    • Example Command to Set MTU (Linux):
      sudo ip link set eth0 mtu 1400 # Replace eth0 with your interface
      
  • Why it works: Correct MTU sizes prevent IP fragmentation, which is inefficient and can be dropped by intermediate network devices. Consistent and high network throughput ensures data packets reach their destination without significant delays.
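
Before changing an interface's MTU, it is worth confirming what the path actually carries. A minimal sketch: the largest ICMP payload that fits in one unfragmented IPv4 packet is the MTU minus 28 bytes (20 for the IPv4 header, 8 for the ICMP header). The function name icmp_payload_for_mtu is hypothetical, and the probe command in the comment assumes the Linux iputils version of ping.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU.
#   $1 = MTU in bytes
icmp_payload_for_mtu() {
  # 20 bytes IPv4 header + 8 bytes ICMP header = 28 bytes of overhead.
  echo $(( $1 - 28 ))
}

icmp_payload_for_mtu 1500   # prints 1472
icmp_payload_for_mtu 1400   # prints 1372

# Probe the path with the Don't Fragment bit set (Linux iputils ping);
# your_destination_host is a placeholder:
#   ping -c 3 -M do -s "$(icmp_payload_for_mtu 1500)" your_destination_host
```

If the probe at 1472 bytes fails while a smaller payload succeeds, some hop on the path has an MTU below 1500.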

Once flushes are keeping up, the next errors you are likely to see are rate-limiting or connection errors from the specific output plugin, which appear when the downstream system still cannot absorb the faster flushes.
