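Your pods are getting kicked off your Kubernetes nodes because the nodes are running out of memory. This happens when the kernel’s Out-Of-Memory (OOM) killer steps in and terminates processes to free up memory. In Kubernetes it usually targets processes in your pods, because the Kubelet gives containers higher OOM scores than node components such as the Kubelet itself and the container runtime.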
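Which containers the OOM killer reaches first depends on the pod’s Quality of Service (QoS) class. The sketch below derives the class and the `oom_score_adj` value the Kubelet documents for it: -997 for Guaranteed, 1000 for BestEffort, and a value between 2 and 999 for Burstable. It assumes a simplified input that mirrors `spec.containers[].resources`. The helper names `qos_class` and `oom_score_adj` are hypothetical. Quantities are compared as strings, so `"1"` and `"1000m"` CPU would wrongly count as different.

```python
# Sketch only: classify a pod's QoS and estimate its OOM score adjustment.
# Input mirrors spec.containers[].resources; quantities are compared as strings.
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Resources]) -> str:
    """Return "Guaranteed", "Burstable", or "BestEffort" for a pod."""
    has_cpu_or_memory = False
    guaranteed = True
    for res in containers:
        limits = res.get("limits", {})
        # An unset request defaults to the limit, so overlay explicit requests on the limits.
        requests = {**limits, **res.get("requests", {})}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                has_cpu_or_memory = True
            # Guaranteed needs a CPU and a memory limit, each equal to its request.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not has_cpu_or_memory:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def oom_score_adj(qos: str, memory_request_bytes: int, node_capacity_bytes: int) -> int:
    """oom_score_adj per QoS class as documented for the Kubelet (higher = killed sooner)."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: a larger memory request relative to node capacity lowers the score.
    score = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    return min(max(2, score), 999)
```

For example, a Burstable pod requesting 1 GiB on an 8 GiB node gets a score of 875, so it is killed well before a Guaranteed pod.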
Common Causes and Fixes
1. Under-provisioned Nodes: The most frequent culprit is simply not having enough RAM on your worker nodes to handle the workload.
   - Diagnosis: Check your node utilization with `kubectl top nodes` and look for nodes consistently showing high memory usage (e.g., 90%+). Then check the `kubectl top pods --all-namespaces` output for the pods on those nodes.
   - Fix: Increase the memory available to your nodes. This can mean:
     - Adding more RAM to existing nodes (if hardware allows).
     - Replacing nodes with larger instances.
     - Adding more nodes to the cluster to distribute the load.
   - Why it works: More physical or virtual memory on the node means the OOM killer has more breathing room before it needs to act.
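The node check can be scripted. This minimal sketch assumes the tabular output of `kubectl top nodes`, which needs metrics-server in the cluster. The function names and the 90% default are illustrative. The memory-percentage header is matched as either `MEMORY%` or `MEMORY(%)` because its spelling may vary between kubectl versions.

```python
# Sketch: flag nodes whose memory percentage in `kubectl top nodes` is above a threshold.
import subprocess
from typing import List, Tuple


def hot_nodes(top_output: str, threshold_pct: int = 90) -> List[Tuple[str, int]]:
    """Return (node, memory %) pairs at or above threshold_pct."""
    lines = top_output.strip().splitlines()
    header = lines[0].split()
    name_idx = header.index("NAME")
    # Accept either header spelling for the memory percentage column.
    mem_idx = next(i for i, h in enumerate(header) if h in ("MEMORY%", "MEMORY(%)"))
    flagged = []
    for line in lines[1:]:
        cols = line.split()
        value = cols[mem_idx]
        if not value.endswith("%"):
            continue  # skip nodes without metrics (non-numeric value)
        pct = int(value[:-1])
        if pct >= threshold_pct:
            flagged.append((cols[name_idx], pct))
    return flagged


def fetch_top_nodes() -> str:
    """Run kubectl locally; requires cluster access and metrics-server."""
    result = subprocess.run(["kubectl", "top", "nodes"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Against a live cluster, `hot_nodes(fetch_top_nodes())` returns the nodes worth investigating first.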
2. Memory Leaks in Pods: A single pod might consume more and more memory over time without releasing it, eventually exhausting node resources.
   - Diagnosis:
     - Identify pods with consistently increasing memory usage over extended periods. `kubectl top pods --all-namespaces --sort-by=memory` only shows a point-in-time snapshot, so repeat it over time or use your monitoring system to see the trend.
     - Use tools like `kubectl exec <pod-name> -- pmap -x <pid>` (where `<pid>` is the process ID inside the pod; this only works if the container image includes `pmap`) or application-specific profiling tools to find the leak.
   - Fix: Address the memory leak in the application code. This is highly application-specific but generally involves fixing incorrect memory allocation/deallocation patterns or resource handling.
   - Why it works: Fixing the leak prevents the pod from continuously gobbling up memory, thus reducing the overall memory pressure on the node.
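Because `kubectl top` only shows a snapshot, spotting a leak means looking at a series of readings. This minimal sketch assumes you have collected memory samples for one pod at a fixed interval, for example by re-running `kubectl top pod <pod-name>`. It fits a least-squares slope and also requires most consecutive readings to rise. The thresholds and helper names are hypothetical, and only the binary suffixes `Ki`, `Mi`, `Gi` and `Ti` are parsed.

```python
# Sketch: decide whether evenly spaced memory samples for one pod look like a leak.
from typing import List

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_bytes(quantity: str) -> int:
    """Parse a binary-suffixed quantity such as "245Mi" into bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain bytes


def slope_per_sample(samples: List[int]) -> float:
    """Least-squares slope of the samples, in bytes per sampling interval."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples: List[int], min_slope_bytes: float,
                    min_rising_fraction: float = 0.8) -> bool:
    """True if usage trends upward and most consecutive readings increase."""
    if len(samples) < 3:
        return False
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return (slope_per_sample(samples) >= min_slope_bytes
            and rising / (len(samples) - 1) >= min_rising_fraction)
```

Requiring both a positive slope and mostly rising readings helps filter out a single spike that would otherwise tilt the fitted line.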
3. Missing or Excessive Pod Memory Limits: The scheduler places pods based on `requests.memory`, but a container can keep growing up to its `limits.memory`, or without bound if no limit is set. When many pods on a node have no limit, or limits far above their requests, their combined usage can exceed the node’s memory, leading to unexpected exhaustion.
   - Diagnosis: Compare each pod’s `requests.memory` and `limits.memory` against actual usage from `kubectl top pods`. If `limits.memory` is set too high or not at all, a pod can consume all available node memory. Look at `spec.containers[].resources.requests.memory` and `spec.containers[].resources.limits.memory` in the output of `kubectl get pod <pod-name> -o yaml`.
   - Fix: Set appropriate `limits.memory` for your pods. A good starting point is to set `limits.memory` equal to `requests.memory` if your application is stable, or slightly higher (e.g., 1.5x-2x) if you know it has spiky usage.
     ```yaml
     resources:
       requests:
         memory: "256Mi"
       limits:
         memory: "512Mi"
     ```
   - Why it works: Kubernetes uses `requests.memory` to reserve node capacity when scheduling, and the container runtime enforces `limits.memory` through cgroups. A container that exceeds its limit is OOM-killed on its own instead of starving other pods or the node itself.
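To check a node for this kind of overcommit, sum the memory requests and limits of the containers scheduled on it. Then compare the totals with the node’s allocatable memory, shown under `Allocatable` in `kubectl describe node <node-name>`. This sketch assumes you have already extracted each container’s `resources` block into a dict, and the names are hypothetical.

```python
# Sketch: compare a node's summed container memory requests and limits with its allocatable memory.
from typing import Dict, List

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Parse a binary-suffixed quantity such as "512Mi" (decimal suffixes not handled)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def memory_overcommit(containers: List[Dict[str, Dict[str, str]]],
                      allocatable: str) -> Dict[str, float]:
    alloc = to_bytes(allocatable)
    total_requests = 0
    total_limits = 0
    without_limit = 0
    for res in containers:
        limits = res.get("limits", {})
        requests = res.get("requests", {})
        # An unset memory request defaults to the memory limit.
        total_requests += to_bytes(requests.get("memory", limits.get("memory", "0")))
        if "memory" in limits:
            total_limits += to_bytes(limits["memory"])
        else:
            without_limit += 1  # no cap: this container can grow until the node runs out
    return {
        "requests_ratio": total_requests / alloc,
        "limits_ratio": total_limits / alloc,
        "containers_without_limit": without_limit,
    }
```

A `requests_ratio` at or below 1.0 means scheduling succeeded. A `limits_ratio` above 1.0, or any container without a limit, means the node can still be pushed out of memory.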
4. Kubelet Memory Usage: The Kubelet itself, responsible for managing pods on a node, can consume significant memory, especially on nodes with many pods or custom configurations.
   - Diagnosis: Check the Kubelet’s memory usage directly on the node (e.g., via `top` or `htop`).
   - Fix:
     - Reserve memory for the Kubelet and the container runtime with `kubeReserved` in the Kubelet config file (or the `--kube-reserved` flag), so the scheduler does not hand that memory to pods. The Kubelet usually runs as a systemd service rather than as a pod, so pod resource settings do not apply to it.
     - Consider reducing the `--max-pods` setting (`maxPods` in the config file) if you have extremely high pod density on a node, or if your node is memory-constrained.
     - Ensure your node OS is not running out of memory before Kubernetes even gets a chance to manage it.
   - Why it works: Reserving memory for the Kubelet keeps pods from consuming the memory it needs to run. A starved Kubelet can cause node instability and pod evictions.
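Reserved memory comes out of what the scheduler can hand to pods. Kubernetes documents node allocatable memory as capacity minus kube-reserved, minus system-reserved, minus the hard eviction threshold. The arithmetic below is a sketch with made-up numbers for a 16 GiB node, and the class and field names are hypothetical.

```python
# Sketch: node allocatable memory = capacity - kube-reserved - system-reserved - hard eviction threshold.
from dataclasses import dataclass


@dataclass
class NodeMemory:
    capacity_mib: int
    kube_reserved_mib: int = 0
    system_reserved_mib: int = 0
    eviction_hard_mib: int = 100  # matches the default memory.available<100Mi on Linux

    def allocatable_mib(self) -> int:
        """Memory the scheduler can hand out to pods' requests."""
        return (self.capacity_mib - self.kube_reserved_mib
                - self.system_reserved_mib - self.eviction_hard_mib)
```

With 1 GiB kube-reserved, 512 MiB system-reserved and a 500 MiB hard threshold, `NodeMemory(16384, 1024, 512, 500).allocatable_mib()` leaves 14348 MiB for pods. Without reservations, only the 100 MiB default threshold is subtracted, so pods can be scheduled into memory the Kubelet itself needs.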
5. System Daemons and Other Processes: Non-Kubernetes processes on the node (like Docker/containerd, systemd, monitoring agents, or even other applications not managed by Kubernetes) can consume memory.
   - Diagnosis: SSH into the affected node and use `top` or `htop` to identify which processes are consuming the most memory. Use `free -h` for overall usage.
   - Fix:
     - Optimize or reduce the resource footprint of these non-Kubernetes processes.
     - If possible, run these agents as Kubernetes pods with memory requests and limits, so their usage is accounted for by the scheduler and capped.
     - For daemons that must stay outside Kubernetes, reserve memory for them with `systemReserved` in the Kubelet config file (or the `--system-reserved` flag).
     - Ensure your node image is lean and only includes necessary system packages.
   - Why it works: Reducing the memory footprint of non-Kubernetes processes frees up more memory for pods. Reserving what they still need keeps the scheduler from promising that memory to pods.
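The per-process resident memory that `top` reports can also be read from `/proc`. This Linux-only sketch lists the largest processes by `VmRSS`, which helps spot daemons running outside Kubernetes. It must run directly on the node, and the helper names are hypothetical.

```python
# Linux-only sketch: list the processes using the most resident memory.
import os
from typing import List, Optional, Tuple


def parse_status(text: str) -> Tuple[str, Optional[int]]:
    """Extract the process name and VmRSS (in kB) from /proc/<pid>/status text."""
    name, rss_kb = "", None
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])
    return name, rss_kb


def top_rss(n: int = 10) -> List[Tuple[int, str, int]]:
    """Return up to n (rss_kb, name, pid) tuples, largest first."""
    procs = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                name, rss_kb = parse_status(f.read())
        except OSError:
            continue  # the process exited or its status is unreadable
        if rss_kb is not None:  # kernel threads have no VmRSS line
            procs.append((rss_kb, name, int(entry)))
    return sorted(procs, reverse=True)[:n]
```

For example, `for rss_kb, name, pid in top_rss(5): print(pid, name, rss_kb // 1024, "MiB")` prints the five largest consumers.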
6. Eviction Thresholds Too High/Low: The Kubelet has eviction thresholds that, when crossed, trigger pod evictions before the OOM killer is invoked. If these are misconfigured, the Kubelet either evicts pods prematurely or reacts too late, leaving the kernel OOM killer to act first.
   - Diagnosis: Check the Kubelet configuration on the node. Look in the config file (e.g., /var/lib/kubelet/config.yaml) for `evictionHard` and `evictionSoft`, or in the systemd unit file for the `--eviction-hard` and `--eviction-soft` flags. Common settings are `memory.available<100Mi` (the default hard threshold on Linux) or `memory.available<10%`.
   - Fix: Adjust the eviction thresholds.
     - If nodes hit the OOM killer before the Kubelet evicts anything, raise the hard threshold (e.g., to `memory.available<500Mi`) so eviction starts while the node still has headroom.
     - If pods are evicted while plenty of memory is still free, lower it.
     - If your nodes consistently hit these thresholds, that indicates an underlying resource shortage as described in point 1.
     ```yaml
     # Example in Kubelet config.yaml
     evictionHard:
       memory.available: "500Mi"
     evictionSoft:
       memory.available: "1Gi"
     evictionSoftGracePeriod:
       memory.available: "5m"
     ```
   - Why it works: By adjusting the thresholds, you change the point at which Kubernetes proactively evicts pods to reclaim memory, potentially preventing the more drastic OOM killer action.
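A `memory.available<X` threshold fires when available memory drops below X, so raising X makes the Kubelet evict earlier. The sketch below evaluates a threshold written the way it appears in the Kubelet configuration. The helpers are hypothetical, and only binary suffixes and percentages are parsed. The Kubelet itself derives `memory.available` from cgroup working-set statistics rather than from `free`, so the available-memory input here is an assumption you would supply.

```python
# Sketch: evaluate a memory.available eviction threshold written as "500Mi" or "10%".
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def threshold_bytes(threshold: str, capacity_bytes: int) -> int:
    """Convert a threshold string to bytes; percentages are relative to node capacity."""
    if threshold.endswith("%"):
        return int(capacity_bytes * float(threshold[:-1]) / 100)
    for suffix, factor in _UNITS.items():
        if threshold.endswith(suffix):
            return int(float(threshold[: -len(suffix)]) * factor)
    return int(threshold)


def eviction_triggered(available_bytes: int, threshold: str, capacity_bytes: int) -> bool:
    """memory.available<threshold is met when available memory falls below it."""
    return available_bytes < threshold_bytes(threshold, capacity_bytes)
```

Consider a 16 GiB node with 300 MiB free. The default `100Mi` threshold has not fired yet, while `500Mi` and `10%` have. A higher threshold means earlier eviction and more headroom before the OOM killer.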
After fixing these, pods that were restarted by the OOM killer may still land in CrashLoopBackOff if they have other issues that prevent them from starting up cleanly.
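To see which of those pods were actually OOM-killed, check the container statuses. A container killed by the OOM killer reports the reason `OOMKilled`, usually with exit code 137. The reason appears in `state.terminated` or, after a restart, in `lastState.terminated`. This sketch scans the JSON from `kubectl get pods --all-namespaces -o json`, and the function name is hypothetical.

```python
# Sketch: list containers whose current or previous termination reason is OOMKilled.
import json
from typing import List, Tuple


def oom_killed_containers(pods_json: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod, container) tuples from `kubectl get pods -o json` output."""
    found = []
    for pod in json.loads(pods_json).get("items", []):
        meta = pod.get("metadata", {})
        for status in pod.get("status", {}).get("containerStatuses", []):
            current = status.get("state", {}).get("terminated") or {}
            previous = status.get("lastState", {}).get("terminated") or {}
            if "OOMKilled" in (current.get("reason"), previous.get("reason")):
                found.append((meta.get("namespace", ""), meta.get("name", ""),
                              status.get("name", "")))
    return found
```

Save the output with `kubectl get pods --all-namespaces -o json > pods.json`, then call `oom_killed_containers(open("pods.json").read())`.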