Beyond Uptime: Real Health Checks That Matter

Kubernetes health checks are less about whether your app is "alive" than about resolving disagreement between what the kubelet observes and what your application believes about its own state.

Let’s watch a pod go through its paces.

apiVersion: v1
kind: Pod
metadata:
  name: health-demo
spec:
  containers:
  - name: app
    image: nginx:latest
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /
        port: 80
      failureThreshold: 30
      periodSeconds: 10

Here, the nginx image starts up fast. The startupProbe requests / on port 80 every 10 seconds and tolerates up to 30 consecutive failures, which gives the container a startup budget of 30 * 10 seconds = 5 minutes. If the probe is still failing after that, the kubelet kills the container, and the pod's restartPolicy decides whether it is started again. While the startupProbe is still running, the livenessProbe and readinessProbe are held back; they take over once it succeeds. The livenessProbe runs every 5 seconds with an initial 5-second delay. Because failureThreshold is not set, the default of 3 applies: three consecutive failures (roughly 3 * 5 seconds = 15 seconds) make the kubelet restart the container. The readinessProbe also runs every 5 seconds, with an initial 10-second delay. After three consecutive failures the pod is marked not ready and is removed from Service endpoints, so it receives no new traffic until the probe passes again.
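The startup budget in this manifest is simply failureThreshold multiplied by periodSeconds. As a rough sketch, the same probe tuned for a hypothetical service that needs up to two minutes to boot might look like this (the 120-second figure is an assumption for illustration, not a recommendation):

```yaml
# Fragment of a container spec (goes under spec.containers[]).
startupProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5        # probe every 5 seconds
  failureThreshold: 24    # 24 failures * 5 s = 120 s startup budget
```

A shorter period means a started container is noticed sooner, at the cost of more probe requests during boot.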

The magic is in how these probes tell the kubelet, the agent on each node, about your application’s state. The kubelet is the one acting on the answers: restart the container (liveness), take the pod out of Service traffic (readiness), or kill a container that never finished starting (startup). It’s not about your app saying "I’m good"; it’s about the kubelet asking and your app answering in a way the kubelet understands.

The initialDelaySeconds setting (defaults to 0) is crucial when you have no startupProbe. It gives your application time to start up before the first probe runs, which prevents spurious failures for applications that take a while to initialize. periodSeconds (defaults to 10) dictates how often the probe runs, and timeoutSeconds (not shown, defaults to 1) sets the maximum time a single probe has to complete. successThreshold (defaults to 1, and must be 1 for liveness and startup probes) is the number of consecutive successes needed for the probe to count as passing after a failure. failureThreshold (defaults to 3) is the number of consecutive failures needed before the probe counts as failed.
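To see what the demo pod's livenessProbe actually runs with, it can help to write out every field. The fragment below is a sketch that makes the implicit defaults explicit; its behavior should match the shorter version in the manifest above.

```yaml
# Fragment of a container spec. Fields marked "default" are what
# Kubernetes applies when the field is omitted.
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5   # from the demo pod; the default is 0
  periodSeconds: 5         # from the demo pod; the default is 10
  timeoutSeconds: 1        # default: each probe must answer within 1 second
  successThreshold: 1      # default; must be 1 for liveness and startup probes
  failureThreshold: 3      # default: 3 consecutive failures trigger a restart
```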

Many people think readiness probes are just a less aggressive liveness probe. They’re fundamentally different. A liveness probe failure means the container is unhealthy and needs to be restarted. A readiness probe failure means the container is temporarily unable to serve traffic, but it’s not necessarily broken. This distinction is key for graceful rolling updates and managing traffic during deployments. If a new pod isn’t ready, the Service won’t send traffic to it. If an old pod becomes unready, the Service stops sending it traffic before it’s terminated, preventing dropped requests.
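To make the rolling-update point concrete, here is a minimal sketch of a Deployment that relies on readiness to gate traffic. The name health-demo-deploy, the app: health-demo label, and the replica count are assumptions for illustration; they do not appear in the pod manifest above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: health-demo-deploy   # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep all 3 replicas available during the rollout
      maxSurge: 1         # bring up one extra pod at a time
  selector:
    matchLabels:
      app: health-demo
  template:
    metadata:
      labels:
        app: health-demo
    spec:
      containers:
      - name: app
        image: nginx:latest
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          periodSeconds: 5
```

With maxUnavailable set to 0, an old pod is only scaled down after a replacement has passed its readinessProbe, so the rollout stalls rather than shifting traffic to pods that cannot serve it.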

A common pitfall is using the same probe for both liveness and readiness. While it might seem simpler, it misses the nuance. Imagine an application that’s running but has a temporary database connection issue. A liveness probe that checks the database would restart the container unnecessarily, and since a restart does not fix the database, this can lead to a restart loop. A readiness probe, however, would simply mark the pod as not ready, preventing new traffic while the application attempts to recover. The startupProbe is the newest addition, designed specifically for applications with long startup times. It gives you a long grace period for startup without inflating the livenessProbe’s initialDelaySeconds or failureThreshold, which would delay detecting actual runtime failures.
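One way to keep the two concerns apart is to point each probe at its own endpoint. The sketch below assumes the application exposes /healthz and /ready; both paths are hypothetical, and the stock nginx image does not serve them unless you configure it to.

```yaml
# Fragment of a container spec. /healthz and /ready are hypothetical
# endpoints your application would need to implement.
livenessProbe:
  httpGet:
    path: /healthz   # should answer 200 whenever the process itself can serve requests
    port: 80
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /ready     # should answer 200 only when dependencies (e.g. the database) are reachable
    port: 80
  periodSeconds: 5
```

With this split, a database outage takes the pod out of rotation without restarting a process that is otherwise fine.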

The most surprising thing about Kubernetes health checks is that liveness and startup probes operate on an "innocent until proven guilty" basis, while readiness works the other way around. The kubelet treats a container as alive until its liveness probe has failed failureThreshold times in a row, but a container with a readinessProbe counts as not ready until that probe first succeeds. When a probe crosses its threshold, it triggers a state change. For livenessProbe, this means a container restart. For readinessProbe, it means the pod is removed from Service endpoints. A startupProbe failure also ends with the container being killed, but the probe only runs during the initial startup phase, allowing for a more forgiving initial period.

If your probes are failing, the next thing you’ll likely see is a climbing RESTARTS count and eventually CrashLoopBackOff for liveness failures, or pods stuck at 0/1 in the READY column for readiness failures.