The Linkerd proxy’s resource requests and limits are surprisingly tricky to get right. The default settings are often a poor fit for real-world traffic patterns, leading to either wasted resources or performance degradation.
Imagine a single pod running Linkerd. Inside that pod, you have your application container and the linkerd-proxy container. The linkerd-proxy is the busy little bee intercepting all inbound and outbound traffic for your application. It’s doing a lot: TLS termination, routing, metrics collection, retries, circuit breaking. All this work requires CPU and memory.
Here’s a simplified view of a Linkerd-enabled pod’s resource definition in Kubernetes:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      template:
        spec:
          containers:
            - name: my-app
              image: my-app:latest
              resources:
                requests:
                  cpu: "100m"
                  memory: "128Mi"
                limits:
                  cpu: "200m"
                  memory: "256Mi"
            - name: linkerd-proxy
              image: public.ecr.aws/linkerd/proxy:2.14.1
              resources:
                requests:
                  cpu: "100m"
                  memory: "50Mi"
                limits:
                  cpu: "200m"
                  memory: "100Mi"
The problem arises because the default resource requests and limits for the linkerd-proxy are often too low, especially for applications with moderate to high traffic or complex request patterns. This leads to the Kubernetes scheduler potentially placing pods on nodes that don’t have enough resources, or the proxy itself being starved of CPU and memory when it needs it most.
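The proxy container is normally added by Linkerd’s proxy injector rather than written by hand, so its resources are usually set through annotations on the pod template. The following is a minimal sketch, assuming injection is enabled for the workload. It reuses the `my-app` name and the starting values from the example above; the `app: my-app` label is a placeholder.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app          # placeholder label, not from the original example
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        linkerd.io/inject: enabled
        # Per-workload overrides for the injected linkerd-proxy container
        config.linkerd.io/proxy-cpu-request: "100m"
        config.linkerd.io/proxy-cpu-limit: "200m"
        config.linkerd.io/proxy-memory-request: "50Mi"
        config.linkerd.io/proxy-memory-limit: "100Mi"
    spec:
      containers:
        - name: my-app
          image: my-app:latest
```

Changing an annotation changes the pod template, so the Deployment rolls out new pods with the updated proxy resources.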
Common Causes and Fixes
- Under-provisioned CPU Requests for the Proxy:
    - Diagnosis: Monitor the `linkerd-proxy` container’s CPU usage with `kubectl top pod <pod-name> --containers`. Look for consistently high CPU usage, especially during traffic spikes. You might also see increased latency in your application’s response times, which can be a symptom of the proxy competing for CPU on a busy node. CPU pressure does not produce `OOMKilled` events (those indicate a memory problem), so compare the observed usage against `requests.cpu` rather than looking for container kills.
    - Fix: Increase the `requests.cpu` for the `linkerd-proxy` container. For a moderately busy service, start with `200m`.

            resources:
              requests:
                cpu: "200m"     # Increased from 100m
                memory: "50Mi"
              limits:
                cpu: "400m"
                memory: "100Mi"

    - Why it works: The scheduler only places the pod on a node that still has at least this much unreserved CPU, which keeps it off an already saturated node. The request also determines the container’s relative CPU share, so the proxy is entitled to at least this much CPU when the node is contended.
- Under-provisioned CPU Limits for the Proxy:
    - Diagnosis: As with CPU requests, monitor CPU usage. If usage sits at the limit (e.g., `limits.cpu` is `200m` and `kubectl top pod <pod-name> --containers` shows sustained usage near `200m`), the proxy is being throttled. This leads to increased latency and potentially dropped requests. Throttling does not terminate the container, so `kubectl describe pod <pod-name>` will not show a termination reason for it. If your cluster scrapes cAdvisor metrics, a rising `container_cpu_cfs_throttled_periods_total` for the proxy container confirms the throttling.
    - Fix: Increase the `limits.cpu` for the `linkerd-proxy` container. For a moderately busy service, `400m` is a common starting point.

            resources:
              requests:
                cpu: "200m"
                memory: "50Mi"
              limits:
                cpu: "400m"     # Increased from 200m
                memory: "100Mi"

    - Why it works: The CPU limit caps how much CPU time the `linkerd-proxy` can use in each scheduling period. Raising it lets the proxy burst through transient traffic spikes without being throttled.
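CPU throttling is easier to catch with an alert than by watching `kubectl top`. The rule below is a sketch under several assumptions: the Prometheus Operator CRDs are installed, cAdvisor metrics are scraped with a `container` label, and the rule name, `monitoring` namespace, 25% threshold, and 10-minute window are placeholders to tune.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: linkerd-proxy-cpu-throttling   # hypothetical name
  namespace: monitoring                # hypothetical namespace
spec:
  groups:
    - name: linkerd-proxy-resources
      rules:
        - alert: LinkerdProxyCPUThrottled
          # Fraction of CFS periods in which the proxy container was throttled
          expr: |
            sum by (namespace, pod) (
              rate(container_cpu_cfs_throttled_periods_total{container="linkerd-proxy"}[5m])
            )
            /
            sum by (namespace, pod) (
              rate(container_cpu_cfs_periods_total{container="linkerd-proxy"}[5m])
            ) > 0.25
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "linkerd-proxy in {{ $labels.namespace }}/{{ $labels.pod }} is being CPU throttled"
```

A sustained non-zero ratio points at `limits.cpu` being too low for the traffic the proxy is carrying.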
- Under-provisioned Memory Requests for the Proxy:
    - Diagnosis: Monitor the `linkerd-proxy` container’s memory usage with `kubectl top pod <pod-name> --containers`. If usage is consistently above `requests.memory`, the pod is an early candidate for eviction when the node runs low on memory. Check `kubectl describe pod <pod-name>` for a status of `Evicted`. Usage close to `limits.memory` is a separate problem, covered in the next item.
    - Fix: Increase the `requests.memory` for the `linkerd-proxy` container. For a moderately busy service, `100Mi` or `150Mi` is often appropriate.

            resources:
              requests:
                cpu: "200m"
                memory: "100Mi"  # Increased from 50Mi
              limits:
                cpu: "400m"
                memory: "200Mi"

    - Why it works: A higher memory request makes the scheduler reserve that much memory on the node, so the pod is not placed where memory is already committed. Under node memory pressure the kubelet evicts pods that exceed their requests first, so a request that matches real usage makes the pod less likely to be evicted.
- Under-provisioned Memory Limits for the Proxy:
    - Diagnosis: Observe the `linkerd-proxy` memory usage. If it hits the memory limit, the kernel’s OOM killer terminates the container and Kubernetes reports it as `OOMKilled` in `kubectl describe pod <pod-name>`. This is a hard stop: the proxy restarts and its open connections are dropped.
    - Fix: Increase the `limits.memory` for the `linkerd-proxy` container. A common starting point for moderate traffic is `200Mi`.

            resources:
              requests:
                cpu: "200m"
                memory: "100Mi"
              limits:
                cpu: "400m"
                memory: "200Mi"  # Increased from 100Mi

    - Why it works: The memory limit defines the maximum amount of memory the `linkerd-proxy` can consume. Increasing it gives the proxy headroom for its internal data structures, connection pooling, and other memory-intensive operations without being killed.
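The four proxy settings above can also be applied mesh-wide as defaults instead of per workload. The fragment below is a sketch that assumes Linkerd was installed from the `linkerd-control-plane` Helm chart and that your chart version uses the `proxy.resources` layout shown. Check the key names against the chart’s `values.yaml` before relying on them.

```yaml
# values.yaml fragment (assumed layout for the linkerd-control-plane chart)
proxy:
  resources:
    cpu:
      request: 200m
      limit: 400m
    memory:
      request: 100Mi
      limit: 200Mi
```

Defaults set this way apply to newly injected pods; existing pods keep their old proxy resources until they are restarted.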
- Application Container Resource Starvation:
    - Diagnosis: Sometimes the problem isn’t the proxy but a starved application container. If your application container has low `requests.cpu` and `requests.memory`, it might not get enough resources, causing it to slow down or fail. The proxy then sees this slowness and can appear to be the bottleneck. Check `kubectl top pod <pod-name> --containers` for the application container’s CPU and memory usage.
    - Fix: Increase the `requests.cpu` and `requests.memory` for your application container. The exact values depend heavily on your application.

            spec:
              containers:
                - name: my-app
                  image: my-app:latest
                  resources:
                    requests:
                      cpu: "500m"      # Increased
                      memory: "512Mi"  # Increased
                    limits:
                      cpu: "1"
                      memory: "1Gi"

    - Why it works: This ensures your application gets the resources it needs to run efficiently. A healthy application means the proxy has less work to do in terms of retries and error handling, indirectly improving overall perceived performance.
- Linkerd Control Plane Resource Issues:
    - Diagnosis: If all your Linkerd-enabled pods are showing resource pressure or latency, the issue might be with the Linkerd control plane itself. Check the logs and resource usage of the control plane pods in the `linkerd` namespace. In recent releases, such as the one matching the 2.14 proxy image shown earlier, these are `linkerd-destination`, `linkerd-identity`, and `linkerd-proxy-injector`; older releases also ran `linkerd-controller` and `linkerd-web` there.
    - Fix: Scale up the Linkerd control plane pods or increase their resource requests/limits. This is typically done by modifying the Linkerd installation configuration, or the Helm values if installed via Helm.

            # Example: scaling up the destination controller replica count
            kubectl scale deployment linkerd-destination -n linkerd --replicas=3

      Or, if Linkerd was installed with the CLI, re-render the manifests with `--set`:

            linkerd upgrade --set controllerReplicas=3 | kubectl apply -f -

    - Why it works: The control plane is responsible for distributing routing information and managing the overall Linkerd mesh. If it’s overloaded, it can’t effectively serve the data plane proxies, leading to issues across the mesh.
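For Helm-based installs, replica counts and per-component resources can be kept in a values file so they survive upgrades. The fragment below is a sketch: the key names `controllerReplicas`, `destinationResources`, and `identityResources` are assumed from the `linkerd-control-plane` chart and should be verified against your chart version, and the numbers are illustrative rather than recommendations.

```yaml
# values.yaml fragment (assumed key names; verify against your chart version)
controllerReplicas: 3
destinationResources:
  cpu:
    request: 100m
  memory:
    request: 50Mi
    limit: 250Mi
identityResources:
  cpu:
    request: 100m
  memory:
    request: 10Mi
    limit: 250Mi
```

Keeping these in version control next to the rest of the Helm values makes it easier to see when control plane sizing last changed relative to mesh growth.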
After adjusting these resources, the next error you might encounter is related to the Linkerd policy controller if you haven’t configured it, or potentially certificate rotation issues if your linkerd-identity service is also under-resourced.