The most surprising thing about right-sizing GKE pod resources is that your application is probably asking for far more CPU and memory than it actually needs, and the scheduler is happily reserving all of it.

Let’s see what that looks like. Imagine you have a simple web server running in a GKE pod. You’ve declared its resource requests and limits like this:

resources:
  requests:
    cpu: "1000m"  # 1 full CPU core
    memory: "2Gi" # 2 GiB of RAM
  limits:
    cpu: "2000m"  # 2 full CPU cores
    memory: "4Gi" # 4 GiB of RAM

This tells Kubernetes, "Reserve 1 CPU and 2Gi of RAM for this pod, and never let it use more than 2 CPUs and 4Gi of RAM." Kubernetes uses the requests to schedule your pod onto a node. If no node has enough unreserved CPU and memory to satisfy the requests, your pod stays Pending. The limits are for enforcement: a container that hits its CPU limit is throttled, and a container that exceeds its memory limit is OOM-killed and restarted.
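The scheduling rule can be sketched in a few lines of Python. This is a simplified model, not the scheduler’s actual code: the `Resources` type, the `fits` function, and the node and workload sizes are all hypothetical. What it shows is that the check compares requests against unreserved capacity, and actual usage never enters into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_m: int       # CPU in millicores
    memory_mib: int  # memory in MiB


def fits(allocatable: Resources, reserved: list[Resources], pod: Resources) -> bool:
    """True if the pod's requests fit in what is not yet reserved on the node."""
    free_cpu = allocatable.cpu_m - sum(r.cpu_m for r in reserved)
    free_mem = allocatable.memory_mib - sum(r.memory_mib for r in reserved)
    return pod.cpu_m <= free_cpu and pod.memory_mib <= free_mem


# Hypothetical node and existing workloads, for illustration only.
node = Resources(cpu_m=2000, memory_mib=8192)
reserved = [Resources(cpu_m=1200, memory_mib=3072)]

as_declared = Resources(cpu_m=1000, memory_mib=2048)  # the requests shown above
print(fits(node, reserved, as_declared))  # False: only 800m of CPU is unreserved
```

Even if the pods already on this node were idle, the web server would not be placed there, because the scheduler looks at what has been requested, not at what is being used.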

Now, what if your web server, in reality, only ever uses about 200m of CPU and 500Mi of memory, even under load? You’re over-requesting by 5x for CPU and about 4x for memory. That reserved-but-idle capacity is wasted: other pods that actually need it are pushed to other nodes, or you’re forced to provision larger, more expensive nodes than you need.
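To make the arithmetic concrete, here is a small Python sketch that parses the quantity strings and computes the over-request ratios. The helper names `parse_cpu` and `parse_memory` are made up for this example, and they only handle the suffixes used in this article, not the full Kubernetes quantity syntax.

```python
def parse_cpu(quantity: str) -> float:
    """Return CPU in cores; handles plain numbers and the 'm' (milli) suffix."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


_MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Return memory in bytes; handles Ki, Mi, Gi, and plain byte counts."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


cpu_ratio = parse_cpu("1000m") / parse_cpu("200m")
mem_ratio = parse_memory("2Gi") / parse_memory("500Mi")

print(f"CPU over-request: {cpu_ratio:.1f}x")     # CPU over-request: 5.0x
print(f"Memory over-request: {mem_ratio:.1f}x")  # Memory over-request: 4.1x
```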

This is where the Vertical Pod Autoscaler (VPA) comes in. VPA observes your pods’ actual resource usage over time and then either recommends new requests or applies them to your pods automatically. It’s like having a tireless intern constantly monitoring your application’s resource consumption and adjusting the thermostats.

Here’s a typical VPA configuration:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-webserver-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-webserver
  updatePolicy:
    updateMode: "Auto" # other modes: "Recreate", "Initial", "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: "*" # Apply to all containers in the pod
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2" # 2 CPUs
        memory: "4Gi"

With updateMode: "Auto", VPA not only recommends new values but also applies them. It does not edit the Deployment’s pod template. Instead, the VPA updater evicts pods whose requests have drifted too far from the recommendation, and when the Deployment controller creates replacements, the VPA admission controller rewrites their resource requests on the way in. This means pods get restarted. Recreate currently behaves the same way as Auto, so if you want to avoid evictions, choose Initial (requests are set only when pods are created) or Off (recommendations only). The resourcePolicy lets you set boundaries, ensuring VPA doesn’t shrink resources too aggressively or expand them beyond what’s feasible for your nodes.
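The effect of minAllowed and maxAllowed is a simple clamp. The Python sketch below mirrors the bounds from the configuration above; the function names and the sample inputs are hypothetical, and the real VPA works on Kubernetes quantities rather than plain integers.

```python
# Bounds copied from the resourcePolicy above, in millicores and MiB.
MIN_ALLOWED = {"cpu_m": 100, "memory_mib": 128}
MAX_ALLOWED = {"cpu_m": 2000, "memory_mib": 4096}


def clamp(value: int, low: int, high: int) -> int:
    """Pin value into the inclusive range [low, high]."""
    return max(low, min(value, high))


def apply_policy(uncapped: dict[str, int]) -> dict[str, int]:
    """Apply minAllowed/maxAllowed to a raw (uncapped) recommendation."""
    return {
        name: clamp(value, MIN_ALLOWED[name], MAX_ALLOWED[name])
        for name, value in uncapped.items()
    }


# A raw CPU recommendation below the floor is raised to minAllowed.
print(apply_policy({"cpu_m": 40, "memory_mib": 550}))
# {'cpu_m': 100, 'memory_mib': 550}

# A raw recommendation above the ceiling is cut down to maxAllowed.
print(apply_policy({"cpu_m": 2600, "memory_mib": 6000}))
# {'cpu_m': 2000, 'memory_mib': 4096}
```

This is also why the VPA status reports both an Uncapped Target and a Target: the first is the raw estimate, and the second is what remains after these bounds are applied.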

Let’s look at the output of VPA. After VPA has been running for a while and observed your my-webserver deployment, you can inspect its status:

kubectl describe vpa my-webserver-vpa

You’ll see output like this:

...
  Recommendation:
    Container Recommendations:
      Container Name:  my-app-container
      Lower Bound:
        Cpu:  150m
        Memory: 350Mi
      Target:
        Cpu:  220m
        Memory: 550Mi
      Uncapped Target:
        Cpu:  220m
        Memory: 550Mi
...

Here, VPA’s Target is 220m of CPU and 550Mi of memory. If your updateMode was Auto, VPA would evict the existing pods, and its admission controller would set these values on the replacement pods as they are created. The Deployment’s own pod template stays unchanged.
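One detail worth working through is what happens to the limits. By default VPA adjusts limits along with requests, keeping the limit-to-request ratio from the original pod spec. The Python sketch below applies that rule to the numbers in this article; `scale_limit` is a hypothetical helper, and the exact rounding VPA uses may differ.

```python
def scale_limit(original_request: int, original_limit: int, new_request: int) -> int:
    """Scale a limit so the limit:request ratio of the original spec is preserved."""
    return new_request * original_limit // original_request


# CPU in millicores: originally request 1000m, limit 2000m; new request 220m.
new_cpu_limit = scale_limit(1000, 2000, 220)

# Memory in MiB: originally request 2Gi (2048), limit 4Gi (4096); new request 550Mi.
new_memory_limit = scale_limit(2048, 4096, 550)

print(f"cpu limit: {new_cpu_limit}m")          # cpu limit: 440m
print(f"memory limit: {new_memory_limit}Mi")   # memory limit: 1100Mi
```

So a pod that started with a 2x headroom between request and limit keeps a 2x headroom after right-sizing, which means the absolute limit drops sharply too.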

The magic happens because VPA calculates these recommendations based on historical usage, including spikes. It doesn’t just take an average; it works from a high percentile of the observed usage (the open-source recommender targets roughly the 90th percentile of CPU usage by default) so that it accommodates typical peak loads without over-provisioning for rare, extreme outliers. This means your pods get enough resources to perform well under normal, busy conditions, but not so much that they waste node capacity.
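The difference between an average, a high percentile, and the maximum is easy to see on a small sample. The Python sketch below uses a plain nearest-rank percentile over made-up CPU samples, with the 90th percentile chosen as an assumption for illustration. The real recommender is more sophisticated (it aggregates usage into histograms over time), so treat this only as a picture of the idea.

```python
def percentile(samples: list[int], p: int) -> int:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integer math
    return ordered[max(rank, 1) - 1]


# Hypothetical CPU usage samples in millicores, including one rare spike.
cpu_samples_m = [180, 190, 200, 205, 210, 195, 185, 215, 220, 900]

mean = sum(cpu_samples_m) / len(cpu_samples_m)
p90 = percentile(cpu_samples_m, 90)
peak = max(cpu_samples_m)

print(f"mean: {mean:.0f}m")  # mean: 270m
print(f"p90:  {p90}m")       # p90:  220m
print(f"max:  {peak}m")      # max:  900m
```

The single 900m spike drags the mean up and would dominate a max-based request, while the 90th percentile lands on 220m, which covers the pod’s normal busy periods.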

A common misunderstanding is that VPA only ever reduces requests. VPA can also increase them when the current requests are too low. If observed usage keeps running above the current requests, or if containers are being OOMKilled (killed for running out of memory), VPA raises its recommendation, and the limits rise along with the requests. This is crucial for performance and stability; VPA helps you avoid performance degradation caused by resource starvation.

After you’ve right-sized your pods with VPA, the next logical step is horizontal scaling. The Horizontal Pod Autoscaler (HPA) scales the number of pods based on metrics like CPU or memory utilization. Avoid pointing HPA and VPA at the same CPU or memory metric for one workload, since the two autoscalers will work against each other.
