Load balancer health checks are a surprisingly passive-aggressive system, constantly policing your backend services without ever directly telling them they’re doing a bad job.

Let’s watch a request flow through AWS Elastic Load Balancing (ELB) with health checks enabled. Imagine a simple web application running on EC2 instances behind an Application Load Balancer (ALB).

Client -> DNS Resolution -> ALB DNS -> ALB Frontend -> Backend Instance 1 (Healthy) -> Response
ALB Health Checker -> Backend Instance 2 (Unhealthy) -> Removed from rotation, receives no client traffic

In this scenario, the client’s request hits the ALB. The ALB does not probe targets on each request. It runs the pre-configured health check against every registered instance in the background, on a fixed interval, and routes each incoming request to an instance that is currently marked healthy. When an instance fails enough consecutive checks, the ALB stops sending traffic to it, effectively isolating it from the user base. The client never reaches the unhealthy instance and is served by a healthy one instead. The client sees an error generated by the ALB itself (such as a 503) only when there is nothing to route to, for example when the target group has no registered targets. One caveat: if every registered target is unhealthy, an ALB fails open and routes requests to all of them anyway. This is crucial: the ALB is acting as a gatekeeper, steering traffic toward responsive, healthy instances.

The core problem health checks solve is availability. Without them, a load balancer would happily send traffic to an instance that’s crashed, overloaded, or otherwise incapable of responding. Users would experience intermittent or complete failures, and there’d be no automated way for the system to recover. Health checks provide the mechanism for the load balancer to detect these failures and automatically reroute traffic to healthy instances, maintaining service continuity.

Internally, the health check is a lightweight, periodic probe initiated by the load balancer itself to each registered target. This probe can take various forms: an HTTP request, a TCP connection attempt, or even a gRPC call. A single check succeeds if the target responds within a defined timeout period and, for HTTP/gRPC checks, returns one of the configured success codes (on an ALB the default is 200, and you can widen it to other codes or ranges). The load balancer tracks the health status of each target and distributes traffic only to targets currently marked healthy. On AWS, the number of such targets is exposed as the HealthyHostCount metric.
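To make a single probe concrete, here is a minimal Python sketch of what one HTTP check might do. It is an illustration only, not how any particular load balancer is implemented; the function name probe and its parameters are assumptions made for this example.

```python
import http.client
import urllib.error
import urllib.request

# Talk to the target directly, ignoring any proxy environment variables.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout_s: float = 5.0, success_codes=frozenset({200})) -> bool:
    """Return True if one health probe to `url` counts as a success."""
    try:
        # Note: urllib follows redirects, which a real load balancer does not.
        with _OPENER.open(url, timeout=timeout_s) as resp:
            return resp.status in success_codes
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses surface as exceptions; still compare to the matcher.
        return err.code in success_codes
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, malformed reply, or timeout.
        return False
```

A single True or False from a probe like this does not change a target's status on its own; the thresholds described below decide that.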

The primary levers you control are:

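  • Protocol: The type of probe. The options depend on the load balancer: on AWS, ALB target groups use HTTP or HTTPS health checks (including for gRPC targets), while Network Load Balancers support TCP, HTTP, and HTTPS.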
  • Port: The port the load balancer should connect to on the target (e.g., 80 for HTTP, 443 for HTTPS).
  • Path (for HTTP/HTTPS/GRPC): The specific URL path the load balancer should request (e.g., /health, /status). This is vital for application-level checks.
  • Interval: How often the health check is performed (e.g., 30 seconds).
  • Timeout: How long the load balancer waits for a response before counting that individual check as failed (e.g., 5 seconds).
  • Healthy Threshold: The number of consecutive successful health checks required for a target to be marked healthy (e.g., 2).
  • Unhealthy Threshold: The number of consecutive failed health checks required for a target to be marked unhealthy (e.g., 3).
  • Status Codes (for HTTP/HTTPS/GRPC): The response codes that indicate success (e.g., 200, 301, 302). For gRPC checks these are gRPC status codes rather than HTTP ones.
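The two thresholds are easiest to see as a small state machine. The sketch below is a simplified model, not any vendor's implementation: the class name TargetHealth is hypothetical, and it assumes a target starts out unhealthy until it passes enough checks.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    healthy: bool = False  # assumption: a new target must earn its healthy status
    streak: int = 0        # consecutive results that contradict the current status

    def record(self, success: bool) -> bool:
        """Feed in one check result and return the target's resulting status."""
        if success == self.healthy:
            # The result agrees with the current status, so any streak is broken.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.healthy_threshold if success else self.unhealthy_threshold
        if self.streak >= needed:
            self.healthy = success
            self.streak = 0
        return self.healthy


target = TargetHealth()
history = [target.record(ok) for ok in (True, True, False, False, True, False, False, False)]
print(history)  # [False, True, True, True, True, True, True, False]
```

Note how the two failures followed by a success never flip the status: the streak has to be consecutive, which is what keeps a single slow response from pulling a target out of rotation.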

The health check endpoint itself is a critical piece of the puzzle. A common mistake is to configure the health check to hit the root path (/) of your application. While this might work for a simple static site, it’s often too broad for dynamic applications. If your application can serve the root page but is failing on specific API endpoints or business logic, a / health check will incorrectly report the instance as healthy. A dedicated health check endpoint (e.g., /api/v1/healthz) that performs essential internal checks (database connectivity, critical service availability) is far more robust.
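A dedicated endpoint of this kind can be quite small. The following is a sketch using only the Python standard library; check_database is a hypothetical placeholder for whatever cheap dependency check makes sense in your application, and the 503 returned on failure is a choice made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical dependency check; replace with a cheap real query."""
    return True


CHECKS = {"database": check_database}


def run_checks(checks=CHECKS):
    """Run every check and return (http_status, per-check results)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed check.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/api/v1/healthz":
            self.send_error(404)
            return
        status, results = run_checks()
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the listener. Keep the individual checks fast, since the whole handler has to finish inside the load balancer's timeout.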

Consider an HTTP health check on an ALB configured like this:

  • Protocol: HTTP
  • Port: 80
  • Path: /healthz
  • Interval: 30 seconds
  • Timeout: 5 seconds
  • Healthy Threshold: 2
  • Unhealthy Threshold: 2
  • Success Codes: 200
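For reference, these settings could be expressed as keyword arguments to the ELBv2 modify_target_group call in boto3. This is a sketch: the helper apply_health_check is hypothetical, elbv2_client is assumed to be a boto3.client("elbv2") instance, and the target group ARN is whatever your environment provides.

```python
# Health check settings from the list above, using boto3's ELBv2 parameter names.
health_check_settings = {
    "HealthCheckProtocol": "HTTP",
    "HealthCheckPort": "80",
    "HealthCheckPath": "/healthz",
    "HealthCheckIntervalSeconds": 30,
    "HealthCheckTimeoutSeconds": 5,
    "HealthyThresholdCount": 2,
    "UnhealthyThresholdCount": 2,
    "Matcher": {"HttpCode": "200"},
}


def apply_health_check(elbv2_client, target_group_arn: str):
    """Apply the settings to an existing target group (hypothetical helper)."""
    return elbv2_client.modify_target_group(
        TargetGroupArn=target_group_arn,
        **health_check_settings,
    )
```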

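If an instance is running but its /healthz endpoint returns a 500 Internal Server Error, the ALB will mark it unhealthy after two consecutive failed checks. Because checks run 30 seconds apart, that takes roughly 30 to 60 seconds from the moment the endpoint starts failing, depending on where in the interval the failure begins. The 5-second timeout matters only when the target does not respond at all; a 500 fails the check as soon as it arrives. If the application then recovers and /healthz returns 200 OK, the ALB will mark it healthy again after two consecutive successes, which takes a similar 30 to 60 seconds.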
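How long a status change takes follows from the interval and the threshold, with the timeout adding to it only when the target hangs. The arithmetic can be sketched as below, assuming a single checker probing at evenly spaced intervals; real load balancers probe from several nodes, so treat the numbers as approximate. The function name detection_window is made up for this example.

```python
def detection_window(interval_s: float, threshold: int, response_time_s: float = 0.0):
    """Return (fastest, slowest) seconds between a target changing state
    and the load balancer flipping its status.

    The first check to observe the change lands somewhere between 0 and
    one interval later; the remaining (threshold - 1) checks each add a
    full interval. response_time_s is how long the final check takes to
    complete (up to the timeout if the target hangs).
    """
    fastest = (threshold - 1) * interval_s + response_time_s
    slowest = threshold * interval_s + response_time_s
    return fastest, slowest


print(detection_window(30, 2))       # (30.0, 60.0): endpoint answers quickly with a 500
print(detection_window(30, 2, 5.0))  # (35.0, 65.0): endpoint hangs until the 5 s timeout
```

Lowering the interval or the unhealthy threshold shortens this window, at the cost of more probe traffic and a greater chance of flapping on transient errors.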

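When you configure your health check path to be / on a complex application that requires a specific Host header to respond correctly (name-based virtual hosting, for example), the load balancer might fail the health check even if the application is fundamentally working. This is because the probe is addressed to the target directly, so the Host header it carries is not your site’s hostname. On an AWS ALB, the Host header in health check requests contains the IP address of the load balancer node and the listener port, and the target group settings offer no way to override it. The fix therefore belongs on the target: make the health check path respond regardless of the Host header (for instance, through a default virtual host), or serve health checks from a separate port. Some other load balancers do let you set a Host header on the health check, so check what yours supports.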

The next logical step after mastering health checks is understanding how they interact with deregistration delays and connection draining.
