Kubernetes cost optimization isn’t about squeezing more performance out of your existing nodes; it’s about understanding that your workloads are telling you they need less.
Let’s watch a deployment in action and see how it behaves. Imagine we have a simple web application deployed with a Deployment object.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: web
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"
When this Deployment is applied to a Kubernetes cluster, the kube-controller-manager ensures that three Pods matching the template are running. Each Pod requests 100 millicores (m) of CPU and 128 mebibytes (Mi) of memory, and is limited to 200m CPU and 256Mi memory. These requests are crucial. They tell the Kubernetes scheduler how much capacity to reserve for each Pod on a node, so they determine how many nodes the cluster needs, and on most managed clusters the nodes are what your cloud provider bills you for. The limits are a ceiling that prevents a runaway Pod from starving others.
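To make the reservation concrete, here is a minimal Python sketch that converts the quantity strings from the manifest into plain numbers and totals what the scheduler sets aside for three replicas. The helper names (parse_cpu_millicores, parse_memory_bytes) are illustrative, not part of any Kubernetes client library, and the parser only handles the common suffixes.

```python
from decimal import Decimal

# Common memory suffixes; Kubernetes accepts a few more (Pi, Ei, P, E).
_MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "100m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" to bytes."""
    # Try two-character suffixes before one-character ones.
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(Decimal(quantity))  # no suffix means plain bytes


replicas = 3
cpu_reserved_m = replicas * parse_cpu_millicores("100m")
mem_reserved_bytes = replicas * parse_memory_bytes("128Mi")

print(f"CPU reserved by requests: {cpu_reserved_m}m")
print(f"Memory reserved by requests: {mem_reserved_bytes // 1024**2}Mi")
```

For the manifest above this comes to 300m of CPU and 384Mi of memory reserved across the three Pods, regardless of what they actually consume.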
The problem arises when these requests are set too high. If your nginx:latest container, in reality, only ever uses 20m CPU and 64Mi memory, you’re paying for resources that are consistently idle. This is the core of "right-sizing."
The most effective way to right-size is to observe your applications’ actual resource utilization over time. The Kubernetes Metrics Server gives a point-in-time snapshot, while Prometheus and specialized cost optimization platforms retain history, which you need to see daily and weekly peaks.
Here’s how you’d typically check current usage for your my-web-app Pods:
First, ensure metrics-server is installed in your cluster. You can check with kubectl get pods -n kube-system | grep metrics-server. If it’s not there, you’ll need to install it.
Then, use kubectl top pods to see real-time usage:
kubectl top pods -l app=my-web-app
This might output something like:
NAME                          CPU(cores)   MEMORY(bytes)
my-web-app-7d5b9d6f7f-abcde   15m          70Mi
my-web-app-7d5b9d6f7f-fghij   18m          75Mi
my-web-app-7d5b9d6f7f-klmno   12m          68Mi
Compare this to the container’s resources.requests in the Pod template (100m CPU, 128Mi memory). You’re requesting significantly more than you’re using.
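The comparison can be scripted. The sketch below parses the sample kubectl top output shown above and reports each Pod's usage as a percentage of its request. It assumes CPU is always reported in millicores and memory in Mi, as in the sample; parse_top and utilization are illustrative helper names.

```python
TOP_OUTPUT = """\
NAME                          CPU(cores)   MEMORY(bytes)
my-web-app-7d5b9d6f7f-abcde   15m          70Mi
my-web-app-7d5b9d6f7f-fghij   18m          75Mi
my-web-app-7d5b9d6f7f-klmno   12m          68Mi
"""

REQUEST_CPU_M = 100   # from the Deployment: cpu: "100m"
REQUEST_MEM_MI = 128  # from the Deployment: memory: "128Mi"


def parse_top(output):
    """Return (pod, cpu_millicores, memory_mib) tuples, skipping the header."""
    rows = []
    for line in output.strip().splitlines()[1:]:
        name, cpu, mem = line.split()
        # Assumes the "m" and "Mi" suffixes seen in the sample output.
        rows.append((name, int(cpu[:-1]), int(mem[:-2])))
    return rows


def utilization(rows, request_cpu_m, request_mem_mi):
    """Return (pod, cpu_percent, memory_percent) relative to the requests."""
    return [
        (name,
         round(100 * cpu_m / request_cpu_m),
         round(100 * mem_mi / request_mem_mi))
        for name, cpu_m, mem_mi in rows
    ]


report = utilization(parse_top(TOP_OUTPUT), REQUEST_CPU_M, REQUEST_MEM_MI)
for name, cpu_pct, mem_pct in report:
    print(f"{name}: CPU {cpu_pct}% of request, memory {mem_pct}% of request")
```

A single snapshot like this only shows usage at one moment, so treat the percentages as a starting point rather than a sizing decision.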
The fix is to update your Deployment YAML to reflect the actual usage. You’d lower the requests and, ideally, also adjust the limits to be a reasonable buffer above the observed peak usage, not an arbitrary high number.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: web
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "50m"        # Reduced from 100m
            memory: "100Mi"   # Reduced from 128Mi
          limits:
            cpu: "100m"       # Reduced from 200m, reasonable buffer
            memory: "150Mi"   # Reduced from 256Mi, reasonable buffer
Applying this change (kubectl apply -f your-deployment.yaml) modifies the Pod template, so the Deployment performs a rolling update and replaces the existing Pods with new ones that carry the lower requests. The scheduler then reserves less CPU and memory for each Pod. If your cluster’s nodes are over-provisioned, this can lead to fewer nodes being required, directly reducing your cloud infrastructure bill. The scheduler’s job is to pack Pods onto nodes efficiently, and accurate requests are its primary input for doing so.
It’s also important to consider the type of resource. CPU is compressible: a container that hits its CPU limit is throttled, which slows it down but does not kill it. Memory is not compressible: a container that exceeds its memory limit is OOMKilled (Out Of Memory killed), and a Pod that uses more memory than it requested is an early candidate for eviction when the node runs short. Therefore, memory requests and limits are often more critical to get right for stability.
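How a Pod is treated when a node runs short of memory also depends on its Quality of Service (QoS) class, which Kubernetes derives from the requests and limits. The following Python sketch mirrors the documented classification rules in simplified form. The function name qos_class is illustrative, and comparing quantities as strings is an assumption made for brevity (a real check would normalize "0.1" and "100m" first).

```python
def qos_class(containers):
    """Classify a Pod from its containers' CPU and memory requests/limits.

    Each container is a dict with optional "requests" and "limits" dicts.
    Simplified: quantities are compared as strings, not normalized values.
    """
    resources = ("cpu", "memory")

    # BestEffort: no container sets any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"

    def is_guaranteed(container):
        limits = container.get("limits", {})
        # A request that is omitted defaults to the corresponding limit.
        requests = {**limits, **container.get("requests", {})}
        return all(r in limits and requests[r] == limits[r] for r in resources)

    # Guaranteed: every container has CPU and memory limits equal to requests.
    if all(is_guaranteed(c) for c in containers):
        return "Guaranteed"
    return "Burstable"


web = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "200m", "memory": "256Mi"},
}
pinned = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "100m", "memory": "128Mi"},
}

print(qos_class([web]))     # requests below limits
print(qos_class([pinned]))  # requests equal to limits
print(qos_class([{}]))      # nothing set
```

The my-web-app Pods, with requests below limits, fall into the Burstable class. BestEffort Pods are generally the first to be reclaimed under memory pressure, and Guaranteed Pods the last.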
Beyond individual Pods, consider the node pool sizes. If your cluster consistently has nodes with low CPU or memory utilization across all Pods, you can scale down the number of nodes in that node pool. This is often an automated process managed by the Cluster Autoscaler, which reacts to pending Pods (when there aren’t enough resources) and idle nodes. However, the autoscaler’s efficiency is directly tied to the accuracy of your Pod resource requests. If Pods request 10 CPU cores but only use 1, the autoscaler might spin up more nodes than are actually needed to accommodate those "large" requests.
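The effect of inflated requests on node count can be estimated with a simple bin-packing sketch. The Python below uses first-fit decreasing on CPU requests only. The node size (16 allocatable cores) and the Pod count (8) are hypothetical, and the real scheduler and Cluster Autoscaler also weigh memory, affinity rules, system overhead, and more.

```python
def nodes_needed(pod_requests_m, node_allocatable_m):
    """Estimate node count with first-fit decreasing on CPU requests."""
    free = []  # remaining millicores on each node opened so far
    for request in sorted(pod_requests_m, reverse=True):
        if request > node_allocatable_m:
            raise ValueError("Pod request exceeds the node's allocatable CPU")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request  # fits on an existing node
                break
        else:
            free.append(node_allocatable_m - request)  # open a new node
    return len(free)


NODE_CPU_M = 16_000  # hypothetical node with 16 allocatable cores

oversized = [10_000] * 8  # eight Pods each requesting 10 cores
right_sized = [1_000] * 8  # the same Pods requesting the 1 core they use

print("Nodes with 10-core requests:", nodes_needed(oversized, NODE_CPU_M))
print("Nodes with 1-core requests:", nodes_needed(right_sized, NODE_CPU_M))
```

Under these assumptions the oversized requests need eight nodes, because no two 10-core requests fit on one 16-core node, while the right-sized requests fit on a single node.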
The concept of "requests" versus "limits" is a fundamental aspect of Kubernetes resource management. The scheduler places Pods using requests only; limits are enforced at runtime on the node. Many teams set limits equal to requests, which leaves no room to burst, so they end up raising the requests to cover peak load and over-provisioning as a result. Others set limits extremely high, which can lead to noisy neighbor problems where one Pod consumes excessive resources and impacts others on the same node. The sweet spot is usually a request that matches observed average usage, and a limit that provides a comfortable buffer for spikes, perhaps 1.5x to 2x the request, depending on the application’s sensitivity to latency and its historical behavior.
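As a rough illustration of that rule of thumb, the Python sketch below derives a request from the mean of observed samples and a limit from a multiplier on the request, never letting the limit fall below the observed peak. The sample values and the function name suggest_resources are hypothetical. For memory, where exceeding the limit kills the container, it may be safer to base the request on a high percentile rather than the mean.

```python
import math
from statistics import mean


def suggest_resources(samples, limit_multiplier=2.0):
    """Suggest (request, limit) from usage samples in a single unit.

    request: mean observed usage, rounded up.
    limit:   request times the multiplier, but never below the observed peak.
    """
    if not samples:
        raise ValueError("need at least one usage sample")
    request = math.ceil(mean(samples))
    limit = max(math.ceil(request * limit_multiplier), max(samples))
    return request, limit


# Hypothetical CPU samples in millicores collected over time.
cpu_samples_m = [12, 15, 18, 20, 35]

request_m, limit_m = suggest_resources(cpu_samples_m)
print(f"cpu request: {request_m}m, cpu limit: {limit_m}m")

tight_request_m, tight_limit_m = suggest_resources(cpu_samples_m, 1.5)
print(f"cpu request: {tight_request_m}m, cpu limit: {tight_limit_m}m")
```

With a 1.5x multiplier the computed limit would sit below the 35m peak, so the sketch raises it to the peak instead. That is one possible policy, not the only reasonable one.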
The next challenge you’ll face is identifying and optimizing stateful workloads, which have different resource management considerations than stateless applications.