The most surprising thing about microservice health checks is that they often don’t tell you whether your service is actually healthy, only whether it is capable of being healthy.
Let’s see this in action. Imagine a simple web service that needs to connect to a database.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-web-app
spec:
  containers:
  - name: web-app-container
    image: my-web-app:latest
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
```
In this Kubernetes Pod definition, we have two probes: livenessProbe and readinessProbe. The livenessProbe is like asking, "Is this process still alive and running?" If it fails repeatedly (three consecutive failures by default, controlled by failureThreshold), Kubernetes restarts the container. The readinessProbe asks, "Is this service ready to accept traffic?" If it fails, Kubernetes removes the Pod’s IP address from the endpoints of every Service that selects it, so no traffic is routed to it until the probe passes again.
The difference between /health/live and /health/ready is crucial. A service might be live (the process is running) but not ready (it can’t connect to its database, for instance).
The /health/live endpoint might just check if the web server process is responding. A simple curl http://localhost:8080/health/live should return 200 OK.
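As a rough illustration of such a shallow check, here is a minimal sketch using only Python's standard library. The names liveness_response, HealthHandler, and make_server are hypothetical and chosen for this example; the point is that the handler answers without touching any external dependency.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def liveness_response(path):
    """Return (status, body) for a request path without touching any dependency."""
    if path == "/health/live":
        return 200, b"alive"
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: if it can answer at all, the process is alive."""

    def do_GET(self):
        status, body = liveness_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the request log


def make_server(host="0.0.0.0", port=8080):
    """Bind on all interfaces so the kubelet can reach the Pod IP."""
    return HTTPServer((host, port), HealthHandler)


# To run the server: make_server().serve_forever()
```

Keeping this endpoint free of dependency checks is deliberate: a database outage should not cause the liveness probe to fail and trigger a restart that cannot fix the problem.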
The /health/ready endpoint, however, should perform a more thorough check. For our web app that needs a database, it would attempt to connect to the database and ideally run a simple SELECT 1 query. If this check passes, it returns 200 OK; otherwise, it returns an error status such as 503 Service Unavailable. Kubernetes treats any status from 200 to 399 as success, so the failure response must fall outside that range.
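The readiness logic might look something like the sketch below. It assumes a DB-API style driver; sqlite3 stands in for the real database only so the example runs on its own, and the names database_is_reachable and readiness_status are hypothetical.

```python
import sqlite3


def database_is_reachable(connect):
    """Return True if a connection opens and answers SELECT 1.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            row = cur.fetchone()
            return row is not None and row[0] == 1
        finally:
            conn.close()
    except Exception:
        # Any failure to connect or query means the dependency is unavailable.
        return False


def readiness_status(connect):
    """Map the dependency check to the HTTP status the probe will see."""
    return 200 if database_is_reachable(connect) else 503


def in_memory_db():
    # Stand-in for the real database connection (an assumption for this sketch).
    return sqlite3.connect(":memory:")
```

With the stand-in connection, readiness_status(in_memory_db) evaluates to 200; a connect callable that raises yields 503, which is the signal for Kubernetes to stop routing traffic to the Pod.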
This separation allows Kubernetes to manage the lifecycle of your microservice effectively. If your /health/live endpoint starts failing, Kubernetes restarts the container, assuming the application itself is stuck. If your /health/ready endpoint fails but /health/live continues to pass, Kubernetes treats the application as running but unable to serve (for example, because its database is down) and stops sending traffic to it until it recovers. This prevents users from hitting an unhealthy instance.
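The division of labor between the two probes can be summarized as a small decision function. This is a deliberately simplified model, not Kubernetes code: the real kubelet acts only after a configurable number of consecutive failures, and the name probe_outcome is hypothetical.

```python
def probe_outcome(live_ok: bool, ready_ok: bool) -> str:
    """Simplified view of what Kubernetes does for each probe combination."""
    if not live_ok:
        # A failing liveness probe means the process is considered stuck.
        return "restart container"
    if not ready_ok:
        # Alive but not ready: keep the container, stop routing traffic.
        return "remove from Service endpoints"
    return "route traffic"


for live_ok, ready_ok in [(True, True), (True, False), (False, False)]:
    print(f"live={live_ok} ready={ready_ok} -> {probe_outcome(live_ok, ready_ok)}")
```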
The problem most people miss is that a probe can pass even if the service is experiencing significant degradation. For example, a readinessProbe that only checks if the web server is up and running, but doesn’t actually test the critical database connection, will report the service as "ready" even if it can’t fulfill its core function. The probe needs to reflect the actual operational state of the service’s primary responsibilities.
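To make the gap concrete, the sketch below compares a shallow readiness check with one that also covers the critical dependency. The function readiness_report and both check functions are hypothetical, and the database outage is simulated.

```python
def readiness_report(checks):
    """Run each named check; the service is ready only if every check passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as a failure
    status = 200 if all(results.values()) else 503
    return status, results


def web_server_up():
    return True  # if this code runs, the web server is responding


def database_reachable():
    raise ConnectionError("database unavailable")  # simulated outage


shallow_checks = {"web_server": web_server_up}
deep_checks = {"web_server": web_server_up, "database": database_reachable}

print(readiness_report(shallow_checks))  # reports ready despite the outage
print(readiness_report(deep_checks))     # reports not ready
```

During the simulated outage the shallow probe still answers 200, so traffic would keep flowing to an instance that cannot do its job, while the deeper probe answers 503.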
The next concept you’ll run into is how to implement these probes effectively for more complex dependencies and asynchronous operations.