Load balancers often use passive health checks to detect unhealthy backend servers by observing traffic patterns, rather than actively sending probes.
Let’s see this in action. Imagine a load balancer distributing traffic to three backend web servers, 192.168.1.101, 192.168.1.102, and 192.168.1.103, with a configuration along these lines:
{
  "load_balancer": {
    "name": "my-web-lb",
    "port": 80,
    "backends": [
      {"address": "192.168.1.101", "port": 8080},
      {"address": "192.168.1.102", "port": 8080},
      {"address": "192.168.1.103", "port": 8080}
    ],
    "health_check": {
      "type": "passive",
      "timeout_threshold": 5,
      "error_rate_threshold": 0.8,
      "recovery_threshold": 3
    }
  }
}
When clients connect to my-web-lb:80, the load balancer forwards requests to one of the backend servers. If 192.168.1.102 starts failing to respond within the load balancer’s configured timeout (say, 5 seconds), the load balancer doesn’t immediately mark it as down. Instead, it notes this failure. If a significant percentage of requests (e.g., 80% or more) directed to 192.168.1.102 result in timeouts or specific error codes within a rolling window, the load balancer will then deem 192.168.1.102 unhealthy and stop sending new traffic to it. It will continue to monitor 192.168.1.102 passively; if it starts responding successfully again for a set number of consecutive requests (e.g., 3), it will be marked as healthy and traffic will resume.
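The eject-and-recover behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of any particular load balancer: the class name `PassiveHealthTracker` and the `window_size` parameter are assumptions introduced here, while `error_rate_threshold` and `recovery_threshold` mirror the example configuration. It also assumes that an ejected backend still receives the occasional trial request, since otherwise there would be no traffic to observe during recovery.

```python
from collections import deque


class PassiveHealthTracker:
    """Tracks one backend's health from observed request outcomes (hypothetical sketch)."""

    def __init__(self, error_rate_threshold=0.8, recovery_threshold=3, window_size=10):
        self.error_rate_threshold = error_rate_threshold
        self.recovery_threshold = recovery_threshold
        # Rolling window of recent outcomes: True = failure, False = success.
        # window_size is an assumption; the example config does not define it.
        self.window = deque(maxlen=window_size)
        self.consecutive_successes = 0
        self.healthy = True

    def record(self, failed):
        """Record one observed outcome and return the updated health state."""
        self.window.append(failed)
        self.consecutive_successes = 0 if failed else self.consecutive_successes + 1

        if self.healthy:
            # Judge only once the window is full, so a single early
            # failure cannot eject the backend on its own.
            if len(self.window) == self.window.maxlen:
                error_rate = sum(self.window) / len(self.window)
                if error_rate >= self.error_rate_threshold:
                    self.healthy = False
                    self.consecutive_successes = 0
        elif self.consecutive_successes >= self.recovery_threshold:
            # Enough consecutive successes: restore the backend and
            # start again with a clean window.
            self.healthy = True
            self.window.clear()
        return self.healthy


if __name__ == "__main__":
    tracker = PassiveHealthTracker(window_size=5)
    for failed in [True] * 5 + [False] * 3:
        state = "healthy" if tracker.record(failed) else "unhealthy"
        print(f"failed={failed} -> {state}")
```

The load balancer would keep one such tracker per backend and call `record` after every forwarded request. Requiring a full window before ejecting is a design choice made for this sketch; real implementations differ in how they treat low-traffic backends.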
Passive health checks matter because they reflect what real users actually experience on each backend. An active health check, such as a periodic GET /health request, can pass even while the server is struggling with real traffic because of load, application-specific errors, or database contention that a simple probe never exercises. By observing the connection timeouts, TCP resets, and HTTP 5xx responses that come back from the backend, the load balancer gets a more realistic view of backend health and avoids routing traffic to a server that looks up to a probe but is failing real requests. In the example configuration, timeout_threshold and error_rate_threshold set the sensitivity of detection: a backend is flagged once the fraction of its recent requests that take longer than timeout_threshold seconds (5 here) or return errors reaches error_rate_threshold (0.8 here, i.e. 80%). recovery_threshold sets how many consecutive successful responses are needed before the server is considered healthy again.
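To make the failure signals concrete, here is a small Python sketch of how a single observed request might be classified. The `Outcome` type and the `is_failure` function are hypothetical names used only for illustration; the 5-second default mirrors timeout_threshold in the example configuration, and treating only 5xx status codes as failures is an assumption that a real deployment may want to adjust.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """What the load balancer observed for one forwarded request (hypothetical type)."""
    status_code: Optional[int] = None  # None when no HTTP response arrived
    elapsed_seconds: float = 0.0
    connection_reset: bool = False     # backend closed the connection with a TCP reset


def is_failure(outcome: Outcome, timeout_threshold: float = 5.0) -> bool:
    """Return True if the observation should count against the backend."""
    if outcome.connection_reset:
        return True
    # No response at all, or a response slower than the threshold.
    if outcome.status_code is None or outcome.elapsed_seconds > timeout_threshold:
        return True
    # Server-side errors count; client errors such as 404 do not,
    # because they usually reflect the request rather than the backend.
    return 500 <= outcome.status_code <= 599


if __name__ == "__main__":
    samples = [
        Outcome(status_code=200, elapsed_seconds=0.2),
        Outcome(status_code=503, elapsed_seconds=0.1),
        Outcome(status_code=None, elapsed_seconds=5.5),
    ]
    for sample in samples:
        print(sample, "->", "failure" if is_failure(sample) else "ok")
```

Excluding 4xx responses keeps a burst of bad client requests from ejecting a healthy backend, which is why this sketch counts only server-side errors.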
The most surprising aspect of passive health checks is that they infer health from observed failures rather than verifying it explicitly. A server can therefore be subtly unhealthy for a while, and some users will see errors, before the load balancer registers a problem. The upside is that when the load balancer does react, it is reacting to genuine, user-impacting failures.
The next challenge is understanding how to configure different types of passive health checks beyond simple timeouts, such as those based on HTTP status codes.