The most surprising thing about load balancer health checks is that they’re often too fast, leading to unnecessary traffic shifts and cascading failures.
Let’s see this in action. Imagine a simple web service behind an AWS Application Load Balancer (ALB). We have two EC2 instances, i-0123456789abcdef0 and i-0fedcba9876543210, both running a web server on port 80. Our ALB listener is configured to forward requests to a target group tg-webservers.
Here’s a snapshot of our ALB’s target group configuration (simplified):
{
  "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-webservers/9012345678901234",
  "TargetType": "instance",
  "Protocol": "HTTP",
  "Port": 80,
  "VpcId": "vpc-abcdef0123456789",
  "HealthCheckProtocol": "HTTP",
  "HealthCheckPort": "traffic-port",
  "HealthCheckPath": "/health",
  "HealthCheckIntervalSeconds": 5,
  "HealthCheckTimeoutSeconds": 2,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 2,
  "Matcher": {
    "HttpCode": "200"
  },
  "TargetGroupHealth": {
    "Targets": [
      {
        "Id": "i-0123456789abcdef0",
        "Port": 80,
        "State": "healthy",
        "Reason": "Elb.Registered"
      },
      {
        "Id": "i-0fedcba9876543210",
        "Port": 80,
        "State": "healthy",
        "Reason": "Elb.Registered"
      }
    ]
  }
}
The ALB will periodically send an HTTP GET request to /health on each instance. If it receives a 200 OK response within 2 seconds, and this happens twice in a row (HealthyThresholdCount: 2), the target is considered healthy. If it receives anything else (a timeout, a different status code, a connection refused), and this happens twice in a row (UnhealthyThresholdCount: 2), the target is marked unhealthy and removed from the load balancer’s rotation.
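The threshold logic described above can be sketched as a small state machine. This is a simplified model for building intuition, not the ALB's actual implementation: the class name `TargetHealthTracker` and its `record` method are hypothetical, and the sketch assumes the target starts out healthy.

```python
from dataclasses import dataclass


@dataclass
class TargetHealthTracker:
    """Simplified model of consecutive-count health check thresholds."""

    healthy_threshold: int = 2
    unhealthy_threshold: int = 2
    healthy: bool = True  # assumption: the target starts out healthy
    _streak: int = 0      # consecutive results that contradict the current state

    def record(self, status_code, timed_out=False):
        """Record one check result and return the resulting healthy flag."""
        passed = (status_code == 200) and not timed_out
        if passed == self.healthy:
            # The result agrees with the current state, so the opposing streak resets.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= threshold:
            # Enough consecutive contradicting results: flip the state.
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = TargetHealthTracker()
states = [tracker.record(code) for code in (503, 200, 503, 503, 200, 200)]
print(states)
```

In this model a lone 503 followed by a 200 changes nothing, while two 503s in a row flip the target to unhealthy, and two 200s in a row bring it back.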
The problem arises when a temporary glitch occurs on one instance. Let’s say i-0123456789abcdef0 experiences a brief spike in CPU, causing its web server to respond to a health check with a 503 Service Unavailable error.
- Health Check 1 (at T=0s): Instance responds with 503.
- Health Check 2 (at T=5s): Instance responds with 503.
The second consecutive failure lands at T=5s, and the ALB marks i-0123456789abcdef0 as unhealthy shortly afterward. With a low HealthCheckIntervalSeconds (like 5s) and a low UnhealthyThresholdCount (like 2), an instance can be taken out of rotation within roughly 5 to 10 seconds of a problem starting.
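To make the timing concrete, here is a rough sketch of how long a failing target might stay in rotation. It assumes checks fire exactly every interval and that only the last failing check waits out the timeout; the function name `detection_window` is hypothetical, and real ALB timing may differ from this model.

```python
def detection_window(interval_s, unhealthy_threshold, timeout_s=0):
    """Return (fastest, slowest) seconds from failure onset to removal.

    Simplified model: checks fire exactly every `interval_s` seconds and a
    failing check takes at most `timeout_s` seconds to be recorded.
    """
    # Fastest: the failure begins just before a check, so the first failing
    # check is immediate and the remaining ones follow one interval apart.
    fastest = (unhealthy_threshold - 1) * interval_s
    # Slowest: the failure begins just after a passing check, and the final
    # failing check waits out the full timeout before it counts.
    slowest = unhealthy_threshold * interval_s + timeout_s
    return fastest, slowest


aggressive = detection_window(5, 2, timeout_s=2)  # settings from the example config
lenient = detection_window(10, 3, timeout_s=4)    # a hypothetical, more lenient config
print(aggressive, lenient)
```

Under these assumptions, the example configuration removes a failing target somewhere between 5 and 12 seconds after the problem begins.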
Now, consider the opposite scenario. What if the instance is actually fine, but the network between the ALB and the instance has a transient blip? Or maybe the web server is just slow to start up after a deployment or a restart, and it occasionally misses the health check window. If the health check is too aggressive, it can prematurely declare a healthy instance unhealthy.
This can lead to a cascade:
- Instance becomes unhealthy: The ALB stops sending traffic to i-0123456789abcdef0.
- Increased load on remaining instances: All traffic now goes to i-0fedcba9876543210.
- Remaining instances struggle: If i-0fedcba9876543210 was already near capacity, the sudden influx of traffic might cause it to start failing health checks.
- All instances become unhealthy: If both instances fail, the load balancer has nowhere to send traffic, resulting in a complete outage.
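The cascade above can be illustrated with a toy simulation. It assumes traffic is spread evenly across healthy targets and that a target fails as soon as its share exceeds its capacity; the function name `simulate_cascade` and the request rates are made up for illustration.

```python
def simulate_cascade(total_rps, capacity_rps, removed):
    """Spread load evenly over healthy targets and drop any that overload.

    total_rps: total incoming requests per second (hypothetical figure)
    capacity_rps: dict mapping target id -> requests per second it can sustain
    removed: ids already marked unhealthy before the simulation starts
    Returns (rounds, healthy): the per-round history as (ids, per-target load)
    tuples, and the ids still healthy once the system settles.
    """
    healthy = [t for t in capacity_rps if t not in removed]
    rounds = []
    while healthy:
        load = total_rps / len(healthy)
        rounds.append((list(healthy), load))
        survivors = [t for t in healthy if capacity_rps[t] >= load]
        if len(survivors) == len(healthy):
            break  # every remaining target copes, so the system is stable
        healthy = survivors
    return rounds, healthy


capacity = {"i-0123456789abcdef0": 100, "i-0fedcba9876543210": 100}
rounds, remaining = simulate_cascade(150, capacity, removed={"i-0123456789abcdef0"})
print(rounds, remaining)
```

With 150 requests per second and 100 of capacity per instance, removing one instance pushes the other past its limit and nothing is left. At 90 requests per second, the surviving instance absorbs the load and the cascade stops.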
The key is to tune these parameters to match the expected resilience of your application. A microservice that can spin up in 500ms and is designed for rapid scaling might tolerate very aggressive health checks. A monolithic application that takes 30 seconds to initialize and has fewer instances might need more lenient checks.
Here’s the mental model: the health check isn’t just an "is it alive?" ping; it’s a statement of confidence from the load balancer about the target’s ability to serve traffic reliably. A short interval and a low threshold mean the LB has very low confidence and will err on the side of caution. A longer interval and a higher threshold mean the LB has higher confidence and will tolerate more temporary hiccups.
The HealthCheckTimeoutSeconds is critical. If your application can respond, but it’s often slow, a timeout of 2 seconds might be too short. If a typical response time for /health is 1.5 seconds, and you have network latency of 0.5 seconds, you’re already at 2 seconds. A slight delay could push it over. You might want to set HealthCheckTimeoutSeconds to at least twice your typical application response time for the health check endpoint.
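The rule of thumb above can be written down as a small helper. This sketch doubles the observed round trip (response time plus network latency), which is one interpretation of the guidance, and clamps the result to 2 to 120 seconds, the range I understand ALB target groups to accept; verify that range against current AWS documentation. The name `suggest_timeout` is hypothetical.

```python
import math

# Assumed ALB limits for HealthCheckTimeoutSeconds; verify against AWS docs.
ALB_MIN_TIMEOUT_S = 2
ALB_MAX_TIMEOUT_S = 120


def suggest_timeout(typical_response_s, network_latency_s=0.0, multiplier=2.0):
    """Suggest a whole-second health check timeout with headroom."""
    observed = typical_response_s + network_latency_s
    suggested = math.ceil(observed * multiplier)
    # Clamp to the assumed valid range for the setting.
    return max(ALB_MIN_TIMEOUT_S, min(ALB_MAX_TIMEOUT_S, suggested))


timeout = suggest_timeout(1.5, network_latency_s=0.5)
print(timeout)
```

For the figures in the example (1.5 seconds of response time plus 0.5 seconds of latency), this suggests 4 seconds instead of 2. Keep the timeout comfortably below HealthCheckIntervalSeconds as well.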
For a more resilient service, consider these adjustments:
- Increase HealthCheckIntervalSeconds: Instead of 5 seconds, try 10 or 15 seconds. This reduces the frequency of checks, giving the instance more breathing room.
- Increase HealthCheckTimeoutSeconds: If your application sometimes takes a bit longer to respond, increase this from 2 seconds to 4 or 5 seconds.
- Increase HealthyThresholdCount: If you want to be very sure an instance is healthy before sending traffic, set this to 3 or 4.
- Increase UnhealthyThresholdCount: This is often the most impactful. Instead of 2 failures, require 3 or 4 consecutive failures before marking an instance unhealthy. For example, setting UnhealthyThresholdCount to 3 with an interval of 10s means an instance must keep failing for up to roughly 30 seconds before being removed.
Let’s say we adjust our target group config to:
- HealthCheckIntervalSeconds: 10
- HealthCheckTimeoutSeconds: 4
- UnhealthyThresholdCount: 3
Now a single transient 503 from i-0123456789abcdef0, or even two in a row, will not take it out of rotation. After the first failure, the ALB checks again in 10 seconds, and if that check also fails, once more 10 seconds later. Only after three consecutive failures (up to roughly 30 seconds of sustained trouble) will the instance be marked unhealthy. This provides a much larger buffer against temporary network issues or minor application hiccups.
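One way to apply these settings is boto3's `modify_target_group` call on the `elbv2` client. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, the ARN is the placeholder from the configuration above, and the apply step needs boto3 installed plus valid AWS credentials, so only the parameter builder runs here.

```python
def build_health_check_update(target_group_arn):
    """Return keyword arguments for the elbv2 modify_target_group call."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckIntervalSeconds": 10,
        "HealthCheckTimeoutSeconds": 4,
        "UnhealthyThresholdCount": 3,
    }


def apply_health_check_update(target_group_arn, region="us-east-1"):
    """Apply the update; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("elbv2", region_name=region)
    return client.modify_target_group(**build_health_check_update(target_group_arn))


# Placeholder ARN taken from the example configuration.
params = build_health_check_update(
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-webservers/9012345678901234"
)
print(params)
```

Keeping the parameters in a separate function makes the change easy to review or unit test before it touches a real target group.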
The next thing you’ll likely encounter is optimizing the content of your health check endpoint itself.