Fluent Bit pods are crashing due to resource starvation: the kernel OOM-kills containers that exceed their memory limit, CPU limits throttle them until they fall behind, and the kubelet evicts pods when the node itself runs short of resources.
Here’s how to diagnose and fix this:
Common Causes and Fixes
- Under-allocated CPU Limits:
  - Diagnosis: Check Fluent Bit pod resource usage in Kubernetes:

        kubectl top pod -n <namespace> <fluent-bit-pod-name> --containers

    Look for consistently high CPU usage, approaching the current limit. Also check the pod status and the Fluent Bit logs for `OOMKilled` or CPU throttling messages.
  - Fix: Increase `resources.limits.cpu` in your Fluent Bit DaemonSet YAML. For example, change `100m` to `200m`:

        resources:
          limits:
            cpu: "200m" # Increased from 100m
            memory: "256Mi"

  - Why it works: Kubernetes uses CPU limits to prevent a single pod from consuming excessive CPU, and it enforces them by throttling the container rather than killing it. Increasing the limit allows Fluent Bit to burst its CPU usage when processing logs, so it keeps up with its inputs instead of stalling and backing up.
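CPU throttling is recorded in the container's cgroup rather than in `kubectl top`. The following is a minimal sketch, assuming you can read the container's `cpu.stat` file (for example from the node or from a debug container, since the stock Fluent Bit image may not include a shell). The function name `throttle_pct` is illustrative, not part of any tool.

```shell
# Print the percentage of CFS scheduling periods in which the container
# was throttled. Reads cpu.stat content on stdin and relies only on the
# nr_periods / nr_throttled fields, which exist in cgroup v1 and v2.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }'
}

# Example (assumption: the cgroup path varies by node and cgroup version):
#   cat /sys/fs/cgroup/cpu.stat | throttle_pct
```

A value that stays well above zero while usage sits near the limit suggests the CPU limit, not the workload, is the constraint.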
- Under-allocated Memory Limits:
  - Diagnosis: Similar to CPU, check memory usage with `kubectl top pod`, and check the pod status and Fluent Bit logs for `OOMKilled` messages. Look for memory usage that is consistently high and close to the limit.

        kubectl top pod -n <namespace> <fluent-bit-pod-name> --containers

  - Fix: Increase `resources.limits.memory` in your Fluent Bit DaemonSet YAML. For example, change `256Mi` to `512Mi`:

        resources:
          limits:
            cpu: "200m"
            memory: "512Mi" # Increased from 256Mi

  - Why it works: Memory limits prevent pods from consuming unbounded amounts of RAM, and a container that exceeds its limit is OOM-killed. Increasing the limit gives Fluent Bit more memory for buffered log data and internal structures, preventing out-of-memory kills and the pod restarts that follow.
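An OOM kill is recorded in the pod's status, so it can be confirmed directly instead of being inferred from usage numbers. A small sketch, assuming `kubectl` access to the cluster; the function name `last_termination_reasons` is illustrative:

```shell
# Print each container in the pod with the reason it was last terminated.
# A container killed for exceeding its memory limit reports "OOMKilled".
# $1 = namespace, $2 = pod name
last_termination_reasons() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Example:
#   last_termination_reasons <namespace> <fluent-bit-pod-name>
```

An empty reason column means the container has not been terminated since the pod started.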
- Insufficient CPU Requests:
  - Diagnosis: While limits define the maximum, requests define the guaranteed amount of CPU. If requests are too low, Fluent Bit might be scheduled on nodes with little spare CPU and get starved under contention even if limits are generous. Check `kubectl describe pod <fluent-bit-pod-name> -n <namespace>` for `Requests` and `Limits`.
  - Fix: Increase `resources.requests.cpu` to match or be slightly less than your intended `limits.cpu`. For instance, if your limit is `200m`, set the request to `150m`:

        resources:
          requests:
            cpu: "150m" # Increased from 100m
            memory: "256Mi"
          limits:
            cpu: "200m"
            memory: "512Mi"

  - Why it works: Kubernetes uses CPU requests to schedule pods and to weight each container's share of CPU when the node is contended. A higher request ensures Fluent Bit is placed on nodes with sufficient guaranteed CPU, reducing the likelihood of CPU starvation under normal load.
- Insufficient Memory Requests:
  - Diagnosis: Similar to CPU requests, low memory requests can lead to Fluent Bit being scheduled on nodes with less available memory, increasing the chance of eviction when memory pressure occurs. Check `kubectl describe pod <fluent-bit-pod-name> -n <namespace>`.
  - Fix: Increase `resources.requests.memory` so that it covers Fluent Bit's typical working set, at or below your intended `limits.memory`. For example, if your limit is `512Mi`, set the request to `256Mi`:

        resources:
          requests:
            cpu: "150m"
            memory: "256Mi" # Increased from 128Mi
          limits:
            cpu: "200m"
            memory: "512Mi"

  - Why it works: Memory requests are used by the scheduler to ensure a node has enough available memory. By increasing the request, you tell Kubernetes to only schedule Fluent Bit on nodes with enough "room" for its baseline memory needs. Pods using more memory than they requested are ranked first for eviction under memory pressure, so a realistic request keeps Fluent Bit from being the first candidate when memory becomes scarce.
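Requests and limits together determine the pod's QoS class, which the kubelet takes into account when picking eviction candidates: `Guaranteed` when every container's requests equal its limits for both CPU and memory, `BestEffort` when none are set, and `Burstable` otherwise. A quick way to see which class the settings above produce, assuming `kubectl` access (the function name `qos_class` is illustrative):

```shell
# Print the QoS class Kubernetes assigned to the pod.
# $1 = namespace, $2 = pod name
qos_class() {
  kubectl get pod -n "$1" "$2" -o jsonpath='{.status.qosClass}'
}

# Example:
#   qos_class <namespace> <fluent-bit-pod-name>
```

With requests below limits, as in the examples in this article, the pod is `Burstable`; setting requests equal to limits would make it `Guaranteed`.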
- High Log Volume/Processing Load:
  - Diagnosis: If resource usage is consistently high even after increasing limits, the Fluent Bit configuration itself might be inefficient or processing an overwhelming volume of logs. Examine Fluent Bit’s internal metrics (if exposed) or its output plugin performance. Check `kubectl logs <fluent-bit-pod-name> -n <namespace>`. Look for messages indicating slow output, large buffer sizes, or high parsing/filtering load.
  - Fix: Optimize Fluent Bit’s configuration. This could involve:
    - Reducing the number of input plugins or lengthening their polling intervals.
    - Simplifying or disabling complex filters.
    - Increasing buffer sizes (e.g., `Mem_Buf_Limit` in the `[INPUT]` section) to handle bursts, but be mindful this increases memory usage and might require higher memory limits.
    - Ensuring output plugins are configured for efficient batching and retries.
    - Tuning the input's read buffers. For the `tail` input these are `Buffer_Chunk_Size` and `Buffer_Max_Size`; buffering keys are set per `[INPUT]`, while service-wide settings belong in `[SERVICE]`.

          [INPUT]
              Name              tail
              Path              /var/log/containers/*.log
              Mem_Buf_Limit     100MB   # Increased from 20MB
              Buffer_Chunk_Size 64k     # Initial per-file read buffer
              Buffer_Max_Size   256k    # Upper bound for the per-file read buffer

  - Why it works: By adjusting how Fluent Bit buffers and processes logs, you can smooth out processing spikes and reduce the overall CPU and memory footprint per log record, making it more resilient to high traffic and less likely to hit resource limits.
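Fluent Bit's internal metrics are available over its built-in HTTP server when `HTTP_Server On` is set in the `[SERVICE]` section (port 2020 by default). The sketch below filters the Prometheus-format output down to the per-output counters that usually signal delivery trouble. It assumes the standard `fluentbit_output_*` counter names; the function name `fb_output_trouble` is illustrative.

```shell
# Keep only the per-output counters for retries, failed retries, errors,
# and dropped records. Reads Prometheus-format metrics on stdin.
fb_output_trouble() {
  grep -E '^fluentbit_output_(retries|retries_failed|errors|dropped_records)_total'
}

# Example (assumption: HTTP_Server is On and listening on port 2020):
#   kubectl port-forward -n <namespace> <fluent-bit-pod-name> 2020:2020 &
#   curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus | fb_output_trouble
```

Counters that keep climbing between two samples point at the output side rather than at parsing or filtering.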
- Incorrectly Configured Output Plugins:
  - Diagnosis: A misconfigured output plugin (e.g., incorrect endpoint, authentication issues, or slow downstream service) can cause Fluent Bit to buffer excessively or retry continuously, leading to high resource consumption. Check Fluent Bit logs for errors related to output plugins, such as connection refused, authentication failures, or timeouts.
  - Fix: Review and correct the configuration for your output plugins (e.g., `[OUTPUT]` sections in `fluent-bit.conf`). Ensure endpoints are correct, credentials are valid, and the downstream service is healthy and responsive. For Elasticsearch or similar, check `HTTP_User_Agent` and `HTTP_Bulk_Size`, and confirm that each key is supported by the `es` plugin in your Fluent Bit version, since unrecognized keys can be rejected at startup.

        [OUTPUT]
            Name            es
            Match           *
            Host            elasticsearch.example.com
            Port            9200
            Logstash_Format On
            Logstash_Prefix fluentbit
            Replace_Dots    On
            Retry_Limit     False   # Set to False to retry indefinitely until successful
            HTTP_User_Agent fluent-bit/1.8.0
            HTTP_Bulk_Size  5000    # Increased for larger batches

    Note that unlimited retries keep undelivered chunks buffered for as long as the destination is unreachable, so pair this setting with a buffer limit you can afford.
  - Why it works: Properly configured output plugins ensure logs are sent efficiently to their destination. Resolving connectivity or authentication issues, and tuning batch sizes, prevents Fluent Bit from getting stuck in retry loops or building up massive internal buffers, thereby reducing resource strain.
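To see which output is struggling, the retry warnings in the Fluent Bit logs can be counted per output. This is a rough sketch that assumes the engine's `failed to flush chunk ... output=<name>` warning format, which may differ between versions; the function name `failed_flushes_by_output` is illustrative.

```shell
# Count "failed to flush chunk" warnings per output instance.
# Reads Fluent Bit log lines on stdin and prints "<output> <count>".
failed_flushes_by_output() {
  grep 'failed to flush chunk' \
    | grep -o 'output=[^ ]*' \
    | sort | uniq -c \
    | awk '{ sub(/^output=/, "", $2); print $2, $1 }'
}

# Example:
#   kubectl logs -n <namespace> <fluent-bit-pod-name> | failed_flushes_by_output
```

An output with a steadily growing count is the one whose endpoint, credentials, or downstream capacity deserves attention first.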
Once Fluent Bit itself is stable, the next problem you are likely to hit is on the storage side: if the downstream log store becomes the bottleneck, it shows up as output retries and growing buffers.