Fluentd’s CPU usage is spiking because its processing threads cannot keep up with the work they are given. This isn’t just a performance hiccup; it’s a sign that the ingestion pipeline is either overloaded or misconfigured, preventing events from being processed efficiently.

Common Causes and Fixes for CPU Spikes in Fluentd

  1. Overwhelmed Input Plugin: The plugin receiving data is being fed more data than Fluentd can process downstream.

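    • Diagnosis: Compare the rate of records entering Fluentd with the rate leaving it. If the input rate consistently exceeds what the output plugins flush, buffers grow and the input is outrunning the pipeline. Note that fluentd --dry-run only validates the configuration; it does not measure throughput.
      # Enable the monitor_agent input, then poll its HTTP API (24220 is its default port):
      curl -s http://localhost:24220/api/plugins.json
      # Watch buffer_queue_length, buffer_total_queued_size and retry_count for each output.
      # Or, with fluent-plugin-prometheus, compare a records counter you define on the input side with the fluentd_output_status_* metrics.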
      
    • Fix:
      • Rate Limiting: If the source can be configured to send data slower, do that.
      • Buffering: Implement or increase buffer sizes in the input plugin or at the source to smooth out bursts.
      • Scaling: If the source cannot be slowed, you’ll need to scale Fluentd horizontally (more instances) or vertically (more CPU/RAM).
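      • Example Fix (input plugin configuration): In your fluent.conf, if using in_tail, set pos_file so that a restart resumes where it left off instead of re-reading files, and leave read_from_head at its default of false unless you want existing file contents ingested from the start. To throttle reads, lower read_lines_limit or set read_bytes_limit_per_second (available in recent Fluentd releases). If using in_http, ensure the upstream service is not overwhelming it.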
    • Why it works: This reduces the rate of incoming events to a manageable level for the rest of the Fluentd pipeline, preventing backpressure and CPU starvation.
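For cause 1, the in_tail advice could look roughly like the sketch below. The path, tag, and limit values are placeholders chosen for illustration, not recommendations, and read_bytes_limit_per_second only exists in recent Fluentd releases, so check it against your version.

```conf
# Sketch only: path, tag and limits are hypothetical values.
<source>
  @type tail
  path /var/log/app/*.log
  # Remember the read position across restarts
  pos_file /var/log/fluentd/app.log.pos
  tag app.access
  # Default; new files are followed from their current end
  read_from_head false
  # Fewer lines per read cycle (default is 1000)
  read_lines_limit 500
  # Cap read throughput per file
  read_bytes_limit_per_second 1m
  <parse>
    @type none
  </parse>
</source>

# Exposes per-plugin metrics at http://127.0.0.1:24220/api/plugins.json
<source>
  @type monitor_agent
  bind 127.0.0.1
  port 24220
</source>
```

Throttling in_tail does not lose data as long as the log files are retained: Fluentd simply falls behind and catches up later, which trades latency for a flatter CPU profile.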
  2. Inefficient Filter Plugins: Complex or poorly written filter plugins are consuming excessive CPU cycles per event.

    • Diagnosis: Fluentd has no built-in per-filter profiler, so narrow the problem down by sampling the process and by elimination.
      # Dump thread stacks: Fluentd bundles sigdump, so SIGCONT writes them to /tmp/sigdump-<pid>.log
      kill -CONT <pid>
      # See which threads of the process are hot:
      top -H -p <pid>
      # Or disable filters one at a time in a test environment and compare CPU usage,
      # or attach a sampling Ruby profiler such as rbspy to the worker process.
      
    • Fix:
      • Simplify Logic: Refactor complex regexes or embedded Ruby code (for example, record_transformer with enable_ruby) within filters.
      • Remove Unnecessary Filters: Eliminate filters that are no longer required.
      • Optimize Regex: Ensure regular expressions are as efficient as possible (e.g., avoid excessive backtracking).
      • Example Fix (filter configuration): If a filter uses a complex regex, try to break it down or use a simpler pattern. For instance, instead of .*(complex_pattern).*, anchor the pattern with ^ or $ or match a literal prefix where possible. If using record_transformer, leave enable_ruby off unless you need Ruby expressions, and avoid chaining several transformers that rebuild the same fields.
    • Why it works: By reducing the computational cost per event, each CPU core can process more events, alleviating the overall load.
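For cause 2, a cheaper filter chain might be sketched as follows. The tag pattern and the field names (level, hostname, source_tag) are hypothetical; the point is the ordering and the anchored pattern, not the specific fields.

```conf
# Sketch only: tag pattern and field names are hypothetical.
# Drop noisy events first so later filters see fewer records.
<filter app.**>
  @type grep
  <exclude>
    key level
    # Anchored pattern: no leading or trailing .* to scan through
    pattern /^DEBUG$/
  </exclude>
</filter>

# Placeholders only; enable_ruby stays at its default (false).
<filter app.**>
  @type record_transformer
  <record>
    # Evaluated once, when the configuration is loaded
    hostname "#{Socket.gethostname}"
    source_tag ${tag}
  </record>
</filter>
```

Putting the grep filter first means every record it drops never reaches the transformer, so the per-event cost of the later filters is paid less often.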
  3. Output Plugin Bottleneck: An output plugin is slow to write data, causing events to queue up in memory buffers and eventually leading to high CPU as Fluentd tries to manage these buffers and retry operations.

    • Diagnosis: Monitor output plugin metrics. Look for a growing buffer_queue_length or buffer_total_queued_size and a climbing retry_count. Check the network latency and throughput to the output destination.
      # Using monitor_agent: check buffer_queue_length, buffer_total_queued_size and retry_count per output.
      # Using fluent-plugin-prometheus: check fluentd_output_status_buffer_queue_length and fluentd_output_status_retry_count.
      # Also monitor network I/O and latency to the target.
      
    • Fix:
      • Increase Buffer Size: Larger buffers can absorb temporary output slowness.
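      • Tune Output Plugin Parameters: Buffered outputs share the <buffer> options chunk_limit_size, queue_limit_length (or total_limit_size), flush_interval, and retry_max_times; in v0.12 these were buffer_chunk_limit, buffer_queue_limit, flush_interval, and retry_limit. Increase these cautiously.
      • Parallelism: Increase the number of flush threads with flush_thread_count (num_threads in v0.12) so that several chunks can be written to the destination concurrently.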
      • Destination Performance: Ensure the destination (e.g., Elasticsearch, S3) is healthy and can accept data at the required rate.
      • Example Fix (buffer configuration): For out_file, set chunk_limit_size 256m and queue_limit_length 100 in its <buffer> section (buffer_chunk_limit and buffer_queue_limit in v0.12). For out_elasticsearch, ensure flush_interval 5s is not too aggressive for the cluster.
    • Why it works: This gives the output plugin more room to accumulate data and handle temporary network congestion or destination unresponsiveness without blocking the entire Fluentd process.
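For cause 3, a v1-style <buffer> section could look like the sketch below. The destination host, buffer path, and every size or interval are assumptions for illustration; out_forward stands in for whichever buffered output you actually use.

```conf
# Sketch only: host, path and all sizes are hypothetical values.
<match app.**>
  @type forward
  <server>
    host aggregator.example.internal
    port 24224
  </server>
  <buffer>
    # File buffer survives restarts and keeps queued data out of RAM
    @type file
    path /var/log/fluentd/buffer/app
    chunk_limit_size 8m
    total_limit_size 2g
    flush_interval 5s
    # Write several chunks in parallel
    flush_thread_count 4
    retry_type exponential_backoff
    retry_max_interval 30s
    retry_timeout 1h
    # Apply backpressure to inputs instead of raising when the buffer is full
    overflow_action block
  </buffer>
</match>
```

overflow_action is the main design choice here: block slows the inputs down, throw_exception (the default) surfaces an error to the input plugin, and drop_oldest_chunk discards data to keep the pipeline moving.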
  4. Excessive Tagging/Routing: A complex routing configuration (<match>) with many overlapping or deeply nested tags forces Fluentd to evaluate numerous matching rules for every incoming event.

    • Diagnosis: Review your fluent.conf for a very large number of <match> blocks or deeply nested tag structures (e.g., a.b.c.d.e.f).
    • Fix:
      • Simplify Tagging: Use broader tags and more targeted filtering within a match block.
      • Consolidate Matches: Combine multiple match blocks that point to the same output if possible.
      • Order Matters: Fluentd routes each event to the first <match> block, in file order, whose pattern fits the tag, so place more specific rules before more general ones or the general rule will capture everything.
      • Example Fix (match configuration): Instead of:
        <match a.b.c> ... </match>
        <match a.b.c.d> ... </match>
        <match a.b.c.d.e> ... </match>
        
        Use:
        # Fluentd does not route through nested <match> blocks; keep them at the top level (or inside a <label>).
        <match a.b.c.d.e>
          # Specific handling for a.b.c.d.e, listed first
          ...
        </match>
        <match a.b.c.**>
          # General handling for a.b.c and everything below it
          ...
        </match>
        
    • Why it works: Reduces the number of patterns Fluentd’s core has to test when routing a tag. Fluentd caches the routing result per tag, so the saving is most visible when many distinct tags pass through.
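For cause 4, another way to keep routing simple is to send a source straight into a <label> section, so its events skip the top-level <filter> and <match> blocks entirely. The label name @APP and the added field are hypothetical, and stdout is only a stand-in output.

```conf
# Sketch only: label name, port and record field are hypothetical.
<source>
  @type forward
  port 24224
  # Events from this source go directly to <label @APP>
  @label @APP
</source>

<label @APP>
  <filter **>
    @type record_transformer
    <record>
      pipeline app
    </record>
  </filter>
  <match **>
    @type stdout
  </match>
</label>
```

Inside the label a single catch-all pattern is enough, because only events from that source ever reach it.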
  5. High Event Volume with Insufficient Resources: The overall volume of logs is simply too high for the allocated CPU and memory resources on the Fluentd instance.

    • Diagnosis: Check system-level CPU and memory utilization (top, htop, free -m). If Fluentd is consistently using 80-100% CPU and significant RAM, this is the likely culprit.
    • Fix:
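      • Vertical Scaling: Increase CPU cores and RAM allocated to the Fluentd host or container. A single Fluentd process is mostly limited to one core by Ruby’s global VM lock, so extra cores only help if you also raise workers in the <system> section.
      • Horizontal Scaling: Run multiple identically configured Fluentd instances behind a load balancer (e.g., HAProxy, AWS ELB), or list several <server> entries in the senders’ out_forward so that they spread the load themselves.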
      • Example Fix (system config): If running in Kubernetes, increase the resources.requests.cpu and resources.limits.cpu for the Fluentd pod. If on a VM, provision a larger instance type.
    • Why it works: Provides the necessary computational power and memory to process the volume of data without becoming a bottleneck.
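For cause 5, extra CPU cores can be put to use with Fluentd’s multi-process workers. The sketch below assumes four cores are available; the worker count, paths, and tag are illustrative, and the <match> sections are omitted.

```conf
# Sketch only: worker count, paths and tag are hypothetical values.
<system>
  workers 4
</system>

# in_tail does not support multi-worker mode, so pin it to one worker.
<worker 0>
  <source>
    @type tail
    path /var/log/app/*.log
    pos_file /var/log/fluentd/app.log.pos
    tag app.access
    <parse>
      @type none
    </parse>
  </source>
</worker>

# Plugins that support multiple workers, such as in_forward, run in every worker.
<source>
  @type forward
  port 24224
</source>
```

Each worker is a separate process with its own Ruby VM, so filters and outputs that sit outside a <worker> block run in parallel across cores.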
  6. Garbage Collection Pauses (Ruby VM): Fluentd is written in Ruby, and aggressive object creation/destruction can lead to significant garbage collection (GC) pauses, which manifest as CPU spikes.

    • Diagnosis: Observe CPU usage patterns. If spikes are sudden, sharp, and short-lived, followed by a brief period of lower usage, GC is a strong candidate. You can also collect GC statistics from the Ruby VM.
      # GC::Profiler and GC.stat are built into Ruby; there is no RUBYOPT switch that turns on GC profiling.
      # One option (an assumption to verify on your version): preload a small Ruby file that calls
      # GC::Profiler.enable using fluentd's -r option, and log GC::Profiler.result or GC.stat periodically.
      # Or attach a dedicated Ruby profiler to the worker process.
      
    • Fix:
      • Reduce Object Creation: Optimize filters and plugins to create fewer temporary objects.
      • Tune Ruby GC: While more advanced, you can experiment with environment variables such as RUBY_GC_HEAP_INIT_SLOTS, RUBY_GC_HEAP_GROWTH_FACTOR, and RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR, but this is risky without deep understanding, and which variables are honored depends on the Ruby version.
      • Upgrade Fluentd/Ruby: Newer versions of Ruby and Fluentd may have GC optimizations.
      • Example Fix (environment variable): export RUBY_GC_HEAP_GROWTH_FACTOR=2.0 (default is 1.8). This makes the heap grow in larger steps, so it fills up less often and GC runs less frequently, at the cost of higher memory use. Note that RUBY_GC_HEAP_GROWTH_MAX_SLOTS works the other way: it caps each growth step and is disabled (0) by default. Use with caution.
    • Why it works: By either reducing the frequency or duration of GC pauses, the Ruby VM spends more time executing Fluentd’s core logic rather than managing memory.

After fixing these, the next error you’ll likely encounter is BufferOverflowError if your output is still not keeping up (the buffer has reached its size limit), or connection timeouts in the output plugin’s logs if network connectivity to your output destination is unstable.
