Load balancers are supposed to prevent downtime, but they often become the very source of it. The most common failure mode isn’t a complete outage, but rather a subtle imbalance where a few backend servers get hammered while others sit idle, leading to cascading failures and eventual unavailability for some users, even if the load balancer itself is technically "up."
Let’s look at the common ways this happens and how to fix them.
1. Sticky Sessions Without Health Checks
You’ve configured your load balancer to send a user’s requests to the same backend server for their entire session. This is great for stateful applications. However, if that server suddenly becomes unhealthy (e.g., a process crashes, a disk fills up), the load balancer keeps sending new requests to it because it thinks the session is still active. Users hitting that server get stuck with errors.
Diagnosis: Check your load balancer’s session stickiness configuration and compare it with its health check settings. Look for a mismatch where stickiness is enabled but health checks are either disabled or not aggressive enough. On the backend server, look for a high error rate or unresponsiveness that isn’t being detected by the load balancer.
Fix:
- AWS ELB/ALB: On the target group, set the `stickiness.enabled` attribute to `true` and configure `stickiness.lb_cookie.duration_seconds` appropriately. Crucially, make sure health checks are enabled and `HealthCheckIntervalSeconds` is set to a low value (e.g., `10`-`30` seconds), with a low `UnhealthyThresholdCount` (e.g., `2`) and a short `HealthCheckTimeoutSeconds` (e.g., `5`).
- Nginx (as LB): In your `upstream` block, use `sticky cookie srv_id expires=1h httponly;` (NGINX Plus only) or `ip_hash;`. Then, ensure your `location` block has `proxy_pass http://your_upstream;`. For active checks, add a `health_check` directive to that `location` block (also NGINX Plus only); on open-source Nginx, rely on passive checks via `max_fails` and `fail_timeout` on each `server` line, or use `resolver` with DNS for dynamic updates. If using `ip_hash`, ensure your health check is robust.
- HAProxy: Configure `balance roundrobin` (or your preferred algorithm) and then add `cookie SERVERID insert indirect nocache` to your `backend` section. Ensure each `server` line has a cookie value and a health check defined: `server server1 192.168.1.10:80 cookie s1 check port 80 inter 10s fall 3 rise 2`.
Why it works: With robust health checks and low thresholds, the load balancer quickly marks a failing server as unhealthy and stops routing traffic to it, including requests from sessions that were pinned to it. Those sessions are reassigned to a healthy server, so users may lose state that lived only on the failed server, but they stop receiving errors.
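To make the interaction between stickiness and health checks concrete, here is a minimal Python sketch of the routing decision. It is a toy model, not how any particular load balancer is implemented; the function name `pick_backend` and the `pinned` table are hypothetical.

```python
import hashlib


def pick_backend(session_id, backends, healthy, pinned):
    """Return the backend for a session, re-pinning it if its server is unhealthy.

    backends: ordered list of backend names
    healthy:  set of backend names currently passing health checks
    pinned:   dict mapping session_id -> backend (the "sticky" table)
    """
    current = pinned.get(session_id)
    if current in healthy:
        return current  # stickiness is honored only while the server is healthy

    candidates = [b for b in backends if b in healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")

    # Deterministic choice (hash of the session id) so the example is reproducible.
    digest = hashlib.sha256(session_id.encode()).digest()
    choice = candidates[digest[0] % len(candidates)]
    pinned[session_id] = choice
    return choice
```

Without the `current in healthy` test, the function would keep returning the pinned server forever, which is exactly the failure mode described above.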
2. Overly Aggressive Health Checks
This is the flip side. Your health checks are too sensitive. A momentary blip in a backend server’s response time, a brief network hiccup, or a single slow request causes the load balancer to mark the server as unhealthy. It then removes the server from the pool, reducing capacity. If this happens repeatedly, you can starve your application of resources, leading to slow responses and eventual timeouts for all users.
Diagnosis: Examine load balancer logs for frequent "target deregistration" or "server down" events. Check the health check status for your backend servers – are they fluctuating between healthy and unhealthy? Look at application logs on the backend servers for transient errors or high latency during these periods.
Fix:
- AWS ELB/ALB: Increase `UnhealthyThresholdCount` (e.g., to `3` or `5`) and `HealthCheckIntervalSeconds` (e.g., to `30` or `60` seconds). Consider increasing `HealthCheckTimeoutSeconds` if your application legitimately takes longer to respond sometimes (e.g., to `10` seconds).
- Nginx (as LB): In your `upstream` block, adjust `max_fails` and `fail_timeout`. For example: `server backend1.example.com:8080 max_fails=3 fail_timeout=30s;`.
- HAProxy: Increase the `fall` value and `inter` value in your `server` directive. For instance: `server server1 192.168.1.10:80 check port 80 inter 30s fall 5 rise 2`.
Why it works: By increasing the number of consecutive failures required to mark a server unhealthy and extending the interval between checks, you allow for minor, transient issues to resolve themselves without prematurely removing a server from service.
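The effect of consecutive-failure thresholds can be sketched in a few lines of Python. This is an illustrative model of `fall`/`rise` style hysteresis, assuming a server starts healthy; the class and function names are hypothetical and the check results are made up.

```python
class HealthTracker:
    """Flip state only after `fall` consecutive failures or `rise` consecutive passes."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall
        self.rise = rise
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        threshold = self.fall if self.healthy else self.rise
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


def count_flaps(results, fall, rise):
    """Count how many times a server changes state over a sequence of check results."""
    tracker = HealthTracker(fall, rise)
    state, flaps = tracker.healthy, 0
    for passed in results:
        new_state = tracker.record(passed)
        if new_state != state:
            flaps += 1
            state = new_state
    return flaps


# A server with occasional blips but no sustained outage.
blips = [True, False, True, True, False, True, False, False, True, True]
print(count_flaps(blips, fall=1, rise=1))  # every blip changes the state
print(count_flaps(blips, fall=3, rise=2))  # blips are absorbed
```

The trade-off is detection time: a dead server is ejected after roughly the check interval multiplied by the failure threshold, so `inter 30s fall 5` takes about 150 seconds to react.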
3. No Health Checks at All
This is surprisingly common on smaller setups or when a load balancer is initially configured. The load balancer just blindly sends traffic to all registered backend servers, regardless of their actual health. If one server goes down, it just sits there, receiving traffic and returning errors to users. This leads to a degraded experience for a subset of users and wasted resources.
Diagnosis: Check your load balancer configuration. If there are no explicit health check settings defined for the backend pool, this is your problem.
Fix:
- AWS ELB/ALB: Define a health check on the target group. For an HTTP/HTTPS load balancer, this might be: `HealthCheckEnabled: true`, `HealthCheckPath: /health`, `HealthCheckPort: traffic-port`, `HealthCheckProtocol: HTTP`, `HealthCheckIntervalSeconds: 30`, `HealthCheckTimeoutSeconds: 5`, `HealthyThresholdCount: 3`, `UnhealthyThresholdCount: 2`. Ensure your backend applications expose a `/health` endpoint that returns a 200 OK when healthy.
- Nginx (as LB): Open-source Nginx only performs passive checks, so set `max_fails` and `fail_timeout` on each `server` line in your `upstream` block. For active checks, add a `health_check` directive (NGINX Plus) to the `location` block that proxies to the upstream, for example `health_check uri=/health;`, or use a third-party module such as `nginx_upstream_check_module`.
- HAProxy: Add `check` to your `server` lines: `server web1 192.168.1.10:80 check port 80`.
Why it works: Health checks allow the load balancer to dynamically remove unhealthy backend servers from its rotation, ensuring that traffic is only sent to servers that can actually respond successfully.
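On the backend side, a `/health` endpoint can be very small. The sketch below uses only the Python standard library; `dependencies_ok` is a hypothetical placeholder for whatever your application actually needs to verify (database reachability, disk space, and so on), and the default host and port are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ok():
    """Hypothetical readiness probe; replace with real checks for your application."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        ok = dependencies_ok()
        body = b"ok\n" if ok else b"unhealthy\n"
        # 200 tells the load balancer to keep routing here; 503 tells it to stop.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health-check polling out of the access log


def make_server(host="127.0.0.1", port=8080):
    """Build the server; bind to a routable address so the load balancer can reach it."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `make_server().serve_forever()` starts it. In practice the endpoint usually lives inside the application process itself, so that a crashed or hung application also fails the check.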
4. Incorrect Load Balancing Algorithm
You’re using `roundrobin` for a stateful application, or `least_conn` for stateless APIs where connection count isn’t a good indicator of load. This can lead to uneven distribution. `roundrobin` ignores how expensive each request is, so one server can accumulate several long-running requests simply because it was next in line, while `least_conn` treats every connection as equal, so it can keep sending work to a server holding a few heavy, long-lived connections while sparing one that holds many short, cheap ones.
Diagnosis: Understand your application’s traffic patterns. Is it stateful? Are requests long or short-lived? Are some requests computationally intensive? Review your load balancer’s algorithm setting.
Fix:
- Stateful Applications: Use cookie-based stickiness (`sticky cookie` in NGINX Plus, `cookie` in HAProxy, sticky sessions on an AWS ALB target group) to ensure a user’s requests go to the same backend server.
- Stateless, CPU-bound applications: `roundrobin` is often fine, but if you see uneven CPU usage, consider `least_time` (NGINX Plus), which balances based on response time, or least outstanding requests (AWS ALB).
- Stateless, I/O-bound applications (e.g., many short DB queries): least connections (`least_conn` in Nginx, `leastconn` in HAProxy) is usually a good choice, distributing connections to the server with the fewest active connections.
- AWS ELB/ALB: On an Application Load Balancer target group, set the `load_balancing.algorithm.type` attribute to `round_robin` or `least_outstanding_requests`. Stickiness is configured separately by setting `stickiness.enabled` to `true`.
- Nginx: Use `upstream your_upstream { ip_hash; }` for sticky sessions, or `upstream your_upstream { least_conn; }` to balance based on active connections.
- HAProxy: Use `balance leastconn` or `balance roundrobin`. For sticky sessions, use `cookie SERVERID insert`.
Why it works: The right algorithm ensures that traffic is distributed based on the actual needs and characteristics of your backend services, preventing one server from being disproportionately burdened.
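A small simulation shows how much the algorithm matters when request cost varies. This is a deliberately simplified model (one request arrives per tick, durations are known up front, servers are identical), and the workload is invented for illustration.

```python
def simulate(durations, n_servers, policy):
    """Assign one request per tick; return the peak concurrent requests on any server."""
    active = [[] for _ in range(n_servers)]  # finish times of in-flight requests
    peak = 0
    for tick, duration in enumerate(durations):
        # Drop requests that have finished by this tick.
        active = [[end for end in conns if end > tick] for conns in active]
        if policy == "roundrobin":
            target = tick % n_servers
        elif policy == "leastconn":
            target = min(range(n_servers), key=lambda s: len(active[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        active[target].append(tick + duration)
        peak = max(peak, len(active[target]))
    return peak


# Alternating slow (10 ticks) and fast (1 tick) requests across two servers.
workload = [10, 1] * 20
print("roundrobin peak:", simulate(workload, 2, "roundrobin"))
print("leastconn peak: ", simulate(workload, 2, "leastconn"))
```

With this alternating pattern, round robin happens to send every slow request to the same server, while least connections spreads them out. Real traffic is rarely this regular, but the same effect appears whenever request cost is uneven.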
5. Over-reliance on DNS Round Robin
You publish several DNS A records for your backend servers and expect DNS round-robin to balance the load. DNS answers are often heavily cached by clients and intermediate resolvers, so many clients keep getting the same IP address for a long time, which can lead to massive imbalance. It also means that when a server goes down, clients keep hitting its IP until their cached record expires.
Diagnosis: Run `dig` or `nslookup` several times from different locations and observe whether you consistently get the same IP address. Check the DNS TTL (Time To Live) values; if they are high (e.g., hours), this is a problem.
Fix: Do not use DNS round-robin for load balancing. Instead, use a dedicated load balancer service (like AWS ELB, Google Cloud Load Balancing, Azure Load Balancer, or self-hosted Nginx/HAProxy). Point your application’s public-facing DNS record at the load balancer: its IP address for a self-hosted one, or its DNS name (via a CNAME or alias record) for a managed service such as AWS ELB, whose IP addresses can change.
Why it works: A dedicated load balancer sits between your clients and your backend servers, actively managing traffic distribution and health checks in real-time, unlike the slow and unreliable nature of DNS caching.
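The imbalance caused by caching can be illustrated with a toy model. It assumes each caching resolver asks the authoritative server once per TTL, receives the next address in rotation, and hands that same address to every client behind it. The resolver sizes and the documentation-range IP addresses below are made up.

```python
from collections import Counter
from itertools import cycle


def dns_round_robin_load(resolver_clients, ips):
    """Clients per IP when each caching resolver holds one rotated answer for the TTL."""
    rotation = cycle(ips)
    load = Counter({ip: 0 for ip in ips})
    for clients in resolver_clients:
        ip = next(rotation)    # authoritative server rotates its answer per query
        load[ip] += clients    # every client behind this resolver reuses the cached IP
    return dict(load)


# One large ISP resolver and three small ones, three backend IPs.
load = dns_round_robin_load(
    resolver_clients=[9000, 500, 300, 200],
    ips=["192.0.2.1", "192.0.2.2", "192.0.2.3"],
)
print(load)
```

The rotation is perfectly fair per query, yet one backend ends up with the bulk of the clients because a single resolver’s cached answer covers most of them.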
6. Load Balancer Resource Exhaustion
The load balancer itself becomes the bottleneck. This can happen if the load balancer instance is undersized for the traffic volume, or if its configuration (e.g., connection limits, CPU/memory limits) is too restrictive. This manifests as slow responses, connection resets, or outright failures originating from the load balancer.
Diagnosis: Monitor the load balancer’s own performance metrics: CPU utilization, memory usage, network throughput, active connections, and request latency. Check load balancer logs for errors related to resource limits being hit.
Fix:
- AWS ELB/ALB: AWS manages the load balancer’s capacity, so there is no instance size to choose; Application Load Balancers scale automatically with traffic. If you expect a sudden spike that outpaces automatic scaling, you can reserve a minimum capacity in advance (LCU reservation). Ensure your health check settings aren’t causing excessive load on the LB itself (e.g., checking too frequently).
- Nginx/HAProxy: Increase worker processes (`worker_processes`), worker connections (`worker_connections` in Nginx), or tune HAProxy’s `maxconn` settings. Ensure the server running Nginx/HAProxy has sufficient CPU and RAM.
Why it works: By providing the load balancer with adequate resources or configuring it to handle more concurrent traffic, you ensure it can effectively distribute requests without becoming a bottleneck itself.
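For Nginx, a rough capacity estimate helps when choosing these values. The sketch below assumes the common rule of thumb that each proxied request occupies two connection slots (one to the client, one to the upstream); the function name and the example numbers are illustrative, and OS limits such as open file descriptors are ignored.

```python
def max_proxied_clients(worker_processes, worker_connections):
    """Rough upper bound on concurrent clients for Nginx acting as a reverse proxy.

    Each proxied request holds a client-side and an upstream-side connection,
    so only about half of the total connection slots serve distinct clients.
    """
    return (worker_processes * worker_connections) // 2


# Example: 4 workers with 1024 connections each.
print(max_proxied_clients(4, 1024))
```

If your monitored peak of active connections approaches this estimate, raise the limits or add load balancer capacity before users start seeing resets.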
The next thing you’ll likely encounter is 502 Bad Gateway or 503 Service Unavailable errors originating from the load balancer itself, indicating it couldn’t reach any healthy backend targets.