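Fluent Bit pods crash from resource starvation in two distinct ways. A container that exceeds its memory limit is OOM-killed by the kernel (the pod status shows OOMKilled), and the kubelet evicts pods when the node itself runs short of memory or disk. Exceeding a CPU limit does not kill the pod: the container is throttled, which can stall log processing and cause liveness probes to fail.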

Here’s how to diagnose and fix this:

Common Causes and Fixes

  1. Under-allocated CPU Limits:

    • Diagnosis: Check Fluent Bit pod resource usage in Kubernetes.
      kubectl top pod -n <namespace> <fluent-bit-pod-name> --containers
      
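      Look for CPU usage that sits at or near the current limit. CPU throttling does not show up in Fluent Bit's own logs; check the container's cgroup statistics (the nr_throttled counter in cpu.stat) or your cluster's container CPU throttling metrics instead.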
    • Fix: Increase the resources.limits.cpu in your Fluent Bit DaemonSet YAML. For example, change 100m to 200m.
      resources:
        limits:
          cpu: "200m" # Increased from 100m
          memory: "256Mi"
      
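    • Why it works: Kubernetes enforces CPU limits by throttling the container once it has used its quota for the current scheduling period. A throttled Fluent Bit falls behind on log processing, buffers pile up, and health checks can time out. Raising the limit gives it room to burst when log volume spikes.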
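To put a number on CPU throttling, one option is to read the container's cgroup cpu.stat file and compare nr_throttled with nr_periods. The following is a minimal sketch, assuming you have already obtained the file's text (for example from the node or a debug container); the sample counters are illustrative, not measurements from a real pod.

```python
def throttled_fraction(cpu_stat_text: str) -> float:
    """Return the fraction of CPU periods in which the container was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # cpu.stat lines have the form "<counter_name> <integer>"
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No CPU limit set, or no periods recorded yet
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Illustrative counters only (hypothetical values)
sample = """nr_periods 1200
nr_throttled 300
throttled_usec 4500000"""

print(f"{throttled_fraction(sample):.0%} of periods throttled")
```

A fraction that stays high while kubectl top shows usage pinned at the limit suggests the CPU limit is too tight for the current log volume.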
  2. Under-allocated Memory Limits:

    • Diagnosis: Check memory usage with kubectl top pod. OOMKilled is reported in the pod status, not in the Fluent Bit logs: run kubectl describe pod <fluent-bit-pod-name> -n <namespace> and look for Reason: OOMKilled under Last State.
      kubectl top pod -n <namespace> <fluent-bit-pod-name> --containers
      
      Look for memory usage that stays close to the limit.
    • Fix: Increase the resources.limits.memory in your Fluent Bit DaemonSet YAML. For example, change 256Mi to 512Mi.
      resources:
        limits:
          cpu: "200m"
          memory: "512Mi" # Increased from 256Mi
      
    • Why it works: Memory limits prevent pods from consuming unbounded amounts of RAM. Increasing the limit provides Fluent Bit with more memory to buffer log data and internal structures, preventing out-of-memory errors and subsequent pod restarts.
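To judge how close a pod is to its memory limit, it helps to convert the quantities to bytes and compare them. This is a small sketch that handles only integer quantities with common suffixes; the 230Mi usage figure is a hypothetical reading from kubectl top, not real data.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities (subset)
_FACTORS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
    "k": 1000, "M": 1000**2, "G": 1000**3,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '256Mi' to bytes (integer values only)."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def usage_ratio(usage: str, limit: str) -> float:
    """Return memory usage as a fraction of the limit."""
    return parse_memory(usage) / parse_memory(limit)


# Hypothetical usage of 230Mi against the 256Mi limit from the example above
print(f"{usage_ratio('230Mi', '256Mi'):.0%} of the memory limit in use")
```

Usage that stays above roughly 80 to 90 percent of the limit leaves little room for bursts, which is when an OOM kill becomes likely.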
  3. Insufficient CPU Requests:

    • Diagnosis: Limits set the ceiling; requests set the amount of CPU the scheduler reserves on the node and the container's share of CPU when the node is busy. If the request is too low, Fluent Bit can land on a crowded node and lose out to other pods under contention, even with a generous limit. Check kubectl describe pod <fluent-bit-pod-name> -n <namespace> for Requests and Limits.
    • Fix: Increase resources.requests.cpu to match or be slightly less than your intended limits.cpu. For instance, if your limit is 200m, set the request to 150m.
      resources:
        requests:
          cpu: "150m" # Increased from 100m
          memory: "256Mi"
        limits:
          cpu: "200m"
          memory: "512Mi"
      
    • Why it works: Kubernetes uses CPU requests to schedule pods. A higher request places Fluent Bit on a node with that much unreserved CPU and gives it a larger share when the node is contended, so it keeps up with log volume under normal load. CPU shortage alone never triggers eviction; it only slows the pod down.
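The effect of a CPU request under contention can be made concrete. On cgroup v1 nodes, Kubernetes converts the request into a cpu.shares value (1024 shares per full core, with a floor of 2); cgroup v2 nodes use an equivalent weight derived from it. The sketch below reproduces that v1 conversion; the helper names are made up for illustration.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '150m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def cpu_shares(request: str) -> int:
    """Approximate the cgroup v1 cpu.shares value for a CPU request."""
    shares = parse_cpu_millicores(request) * 1024 // 1000
    # Kubernetes clamps the result to the range [2, 262144]
    return max(2, min(shares, 262144))


for request in ("100m", "150m"):
    print(request, "->", cpu_shares(request), "shares")
```

Shares are relative: when the node is saturated, a container with 153 shares gets about 1.5 times the CPU time of one with 102 shares, regardless of either container's limit.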
  4. Insufficient Memory Requests:

    • Diagnosis: Similar to CPU requests, low memory requests can lead to Fluent Bit being scheduled on nodes with less available memory, increasing the chance of eviction when memory pressure occurs. Check kubectl describe pod <fluent-bit-pod-name> -n <namespace>.
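    • Fix: Raise resources.requests.memory so that it covers Fluent Bit's typical memory usage, at or below limits.memory. For example, with a 512Mi limit, raise the request from 128Mi to 256Mi. Setting requests equal to limits for both CPU and memory gives the pod the Guaranteed QoS class.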
      resources:
        requests:
          cpu: "150m"
          memory: "256Mi" # Increased from 128Mi
        limits:
          cpu: "200m"
          memory: "512Mi"
      
    • Why it works: The scheduler uses memory requests to pick a node with enough unreserved memory for Fluent Bit's baseline needs. Under node memory pressure, the kubelet evicts pods whose usage exceeds their request first, so a request that covers actual usage keeps Fluent Bit out of the first wave of evictions.
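How requests relate to limits also determines the pod's QoS class, which affects how it is treated under memory pressure. The sketch below applies the Kubernetes QoS rules to a single-container pod. It is a simplification: it compares quantity strings literally, so "200m" and "0.2" would count as different values.

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Return the QoS class for a single-container pod (simplified)."""
    if not requests and not limits:
        return "BestEffort"
    # Guaranteed needs CPU and memory limits, with requests equal to them.
    # A missing request defaults to the limit, as Kubernetes does.
    guaranteed = all(
        res in limits and requests.get(res, limits[res]) == limits[res]
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# The values from the example above: requests below limits
print(qos_class({"cpu": "150m", "memory": "256Mi"},
                {"cpu": "200m", "memory": "512Mi"}))
```

With the requests and limits shown in this section the pod is Burstable. Setting requests equal to limits for both resources would make it Guaranteed.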
  5. High Log Volume/Processing Load:

    • Diagnosis: If resource usage is consistently high even after increasing limits, the Fluent Bit configuration itself might be inefficient or processing an overwhelming volume of logs. Examine Fluent Bit’s internal metrics (if exposed) or its output plugin performance. Check kubectl logs <fluent-bit-pod-name> -n <namespace>. Look for messages indicating slow output, large buffer sizes, or high parsing/filtering load.
    • Fix: Optimize Fluent Bit’s configuration. This could involve:
      • Reducing the number of input plugins, or lengthening their polling intervals so they run less often.
      • Simplifying or disabling complex filters.
      • Raising Mem_Buf_Limit in the [INPUT] section to absorb bursts. This increases memory usage and may require a higher memory limit.
      • Ensuring output plugins are configured for efficient batching and retries.
      • Considering filesystem buffering (storage.path in the [SERVICE] section plus storage.type filesystem on the input) so that bursts spill to disk instead of memory; storage.max_chunks_up then caps how many chunks stay in memory.
      [INPUT]
          Name            tail
          # Typical container log path; adjust to your setup
          Path            /var/log/containers/*.log
          # Increased from 20MB
          Mem_Buf_Limit   100MB
      
    • Why it works: By adjusting how Fluent Bit buffers and processes logs, you can smooth out processing spikes and reduce the overall CPU and memory footprint per log record, making it more resilient to high traffic and less likely to hit resource limits.
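Because Mem_Buf_Limit applies per input, raising it on several inputs can add up to more than the container's memory limit. The following rough sketch sums the Mem_Buf_Limit values found in the [INPUT] sections of a classic-mode config. It assumes decimal size multipliers (confirm this against the Fluent Bit unit-size documentation), and the sample config is hypothetical.

```python
import re

# Assumption: decimal multipliers (K = 1000). Verify against the Fluent Bit docs.
_UNITS = {"": 1, "K": 1000, "M": 1000**2, "G": 1000**3}


def size_to_bytes(value: str) -> int:
    """Convert a size such as '100MB' or '5M' to bytes."""
    match = re.fullmatch(r"(\d+)\s*([KMG]?)B?", value.strip().upper())
    if not match:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def total_mem_buf_limit(config_text: str) -> int:
    """Sum Mem_Buf_Limit across all [INPUT] sections of a classic-mode config."""
    total = 0
    section = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("["):
            section = line.strip("[]").upper()
            continue
        parts = line.split(None, 1)
        if section == "INPUT" and len(parts) == 2 and parts[0].lower() == "mem_buf_limit":
            total += size_to_bytes(parts[1])
    return total


# Hypothetical config with two inputs
sample_config = """
[INPUT]
    Name           tail
    Mem_Buf_Limit  100MB
[INPUT]
    Name           systemd
    Mem_Buf_Limit  50MB
"""

print(total_mem_buf_limit(sample_config), "bytes of input buffer allowed")
```

Treat the result as a lower bound for the memory limit: filters, output buffers, and Fluent Bit's own overhead come on top of it.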
  6. Incorrectly Configured Output Plugins:

    • Diagnosis: A misconfigured output plugin (e.g., incorrect endpoint, authentication issues, or slow downstream service) can cause Fluent Bit to buffer excessively or retry continuously, leading to high resource consumption. Check Fluent Bit logs for errors related to output plugins, such as connection refused, authentication failures, or timeouts.
    • Fix: Review and correct the configuration for your output plugins (e.g., [OUTPUT] sections in fluent-bit.conf). Ensure endpoints are correct, credentials are valid, and the downstream service is healthy and responsive. For Elasticsearch, check the credential and TLS settings (HTTP_User, HTTP_Passwd, tls), and decide how long failed chunks should be retried: Retry_Limit False retries forever, which keeps failed chunks buffered for as long as the backend is down.
      [OUTPUT]
          Name        es
          Match       *
          Host        elasticsearch.example.com
          Port        9200
          Logstash_Format On
          Logstash_Prefix fluentbit
          Replace_Dots    On
          # Drop a chunk after 5 failed retries; False would retry indefinitely
          Retry_Limit     5
          # Print the Elasticsearch error response when a request fails
          Trace_Error     On
      
    • Why it works: Properly configured output plugins ensure logs are sent efficiently to their destination. Resolving connectivity or authentication issues and bounding retries prevents Fluent Bit from getting stuck in retry loops or building up massive internal buffers, thereby reducing resource strain. The trade-off of a finite Retry_Limit is that chunks are discarded once the limit is reached.
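Before digging into plugin settings, it can be worth confirming that the output endpoint is reachable at all from inside the cluster. This is a minimal sketch of a TCP reachability check using only the standard library; the function name is made up for illustration, and a successful connection says nothing about TLS or authentication.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

For the example configuration above, you might call can_reach("elasticsearch.example.com", 9200) from a pod in the same namespace. A False result points to DNS, network policy, or service health rather than to Fluent Bit itself.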

Once Fluent Bit has enough resources, the bottleneck often moves downstream: if the log storage backend cannot keep up with ingestion, expect backpressure and retry errors from the output plugins next.
