Load balancers are supposed to prevent downtime, but they often become the very source of it. The most common failure mode isn’t a complete outage, but rather a subtle imbalance where a few backend servers get hammered while others sit idle, leading to cascading failures and eventual unavailability for some users, even if the load balancer itself is technically "up."

Let’s look at the common ways this happens and how to fix them.

1. Sticky Sessions Without Health Checks

You’ve configured your load balancer to send a user’s requests to the same backend server for their entire session. This is great for stateful applications. However, if that server becomes unhealthy (e.g., a process crashes or a disk fills up) and nothing detects it, the load balancer keeps routing that user’s requests to it because the session is still pinned there. Users pinned to that server get stuck with errors.

Diagnosis: Check your load balancer’s session stickiness configuration and compare it with its health check settings. Look for a mismatch where stickiness is enabled but health checks are either disabled or not aggressive enough. On the backend server, look for a high error rate or unresponsiveness that isn’t being detected by the load balancer.

Fix:

  • AWS ELB/ALB: On the target group, set the stickiness.enabled attribute to true and configure stickiness.lb_cookie.duration_seconds appropriately. Crucially, ensure HealthCheckEnabled is true and HealthCheckIntervalSeconds is set to a low value (e.g., 10-30 seconds), with a low UnhealthyThresholdCount (e.g., 2) and a short HealthCheckTimeoutSeconds (e.g., 5).
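  • Nginx (as LB): In your upstream block, use ip_hash (open-source Nginx) or sticky cookie srv_id expires=1h httponly (NGINX Plus only). Then, ensure your location block has proxy_pass http://your_upstream;. Open-source Nginx only checks health passively, through max_fails and fail_timeout on each server line. The active health_check directive is an NGINX Plus feature and goes in the location block next to proxy_pass, not on individual servers. Using resolver with DNS keeps the server list current, but it is not a health check. If using ip_hash, make sure failure detection is tuned, because a failed server’s clients are only rerouted once it is marked unavailable.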
  • HAProxy: Configure balance roundrobin (or your preferred algorithm) and then add cookie SERVERID insert indirect nocache to your backend section. Give each server line a cookie value and a health check: server server1 192.168.1.10:80 cookie server1 check port 80 inter 10s fall 3 rise 2. Without the cookie value on the server line, HAProxy has nothing to pin the session to.

Why it works: With robust health checks and low thresholds, the load balancer marks a failing server unhealthy after a few consecutive failed checks and removes it from the pool. From then on it ignores the stickiness mapping for that server and routes its clients to a healthy one, while requests already in flight may still complete or fail.
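The interaction between stickiness and health state can be sketched in a few lines of Python. This is a toy model, not how any particular load balancer is implemented; the StickyBalancer class and its method names are made up for illustration, and mark_down stands in for whatever a real health checker would do.

```python
from itertools import cycle


class StickyBalancer:
    """Toy model of cookie-style stickiness that respects health state."""

    def __init__(self, servers: list[str]) -> None:
        self.servers = servers
        self.healthy = set(servers)          # assume everything starts healthy
        self.sessions: dict[str, str] = {}   # session id -> pinned server
        self._rotation = cycle(servers)

    def mark_down(self, server: str) -> None:
        # In a real load balancer this is driven by failed health checks.
        self.healthy.discard(server)

    def route(self, session_id: str) -> str:
        pinned = self.sessions.get(session_id)
        if pinned in self.healthy:
            return pinned                    # stickiness honoured
        # No pin yet, or the pinned server is unhealthy: pick a healthy one.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                self.sessions[session_id] = candidate
                return candidate
        raise RuntimeError("no healthy servers")


lb = StickyBalancer(["web1", "web2"])
first = lb.route("alice")    # pinned to web1
lb.mark_down(first)          # health checks would trigger this
second = lb.route("alice")   # re-pinned to web2 instead of erroring
```

Without the health lookup in route, "alice" would keep landing on web1 for as long as the session mapping existed, which is exactly the failure described above.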

2. Overly Aggressive Health Checks

This is the flip side. Your health checks are too sensitive. A momentary blip in a backend server’s response time, a brief network hiccup, or a single slow request causes the load balancer to mark the server as unhealthy. It then removes the server from the pool, reducing capacity. If this happens repeatedly, you can starve your application of resources, leading to slow responses and eventual timeouts for all users.

Diagnosis: Examine load balancer logs and metrics for frequent "target unhealthy" or "server down" events. Check the health check status for your backend servers: are they flapping between healthy and unhealthy? Look at application logs on the backend servers for transient errors or high latency during these periods.

Fix:

  • AWS ELB/ALB: Increase UnhealthyThresholdCount (e.g., to 3 or 5) and HealthCheckIntervalSeconds (e.g., to 30 or 60 seconds). Consider increasing HealthCheckTimeoutSeconds if your application legitimately takes longer to respond sometimes (e.g., to 10 seconds).
  • Nginx (as LB): In your upstream block, adjust max_fails and fail_timeout. For example: server backend1.example.com:8080 max_fails=3 fail_timeout=30s;.
  • HAProxy: Increase the fall value and inter value in your server directive. For instance: server server1 192.168.1.10:80 check port 80 inter 30s fall 5 rise 2.

Why it works: By increasing the number of consecutive failures required to mark a server unhealthy and extending the interval between checks, you allow for minor, transient issues to resolve themselves without prematurely removing a server from service.
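To see why a higher failure threshold absorbs blips, here is a small sketch of the consecutive-failure logic, loosely modelled on HAProxy’s fall and rise parameters. The HealthState class and the detection-time helper are illustrative names, and the helper ignores the check timeout, so treat its result as an approximation.

```python
class HealthState:
    """Tracks one server's state using fall/rise style counters."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall      # consecutive failures needed to go down
        self.rise = rise      # consecutive successes needed to come back
        self.up = True
        self._streak = 0      # consecutive results contradicting current state

    def record(self, check_passed: bool) -> bool:
        if check_passed == self.up:
            self._streak = 0  # result agrees with current state: reset
        else:
            self._streak += 1
            limit = self.rise if check_passed else self.fall
            if self._streak >= limit:
                self.up = check_passed
                self._streak = 0
        return self.up


def approx_detection_seconds(interval_s: float, fall: int) -> float:
    """Approximate time to notice a hard failure (timeouts not included)."""
    return interval_s * fall


state = HealthState(fall=3, rise=2)
# One isolated failure, then three in a row, then recovery.
checks = [True, False, True, False, False, False, True, True]
history = [state.record(c) for c in checks]
```

The trade-off is detection time: with inter 10s fall 3 a dead server is noticed after roughly 30 seconds, while inter 30s fall 5 takes roughly 150 seconds.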

3. No Health Checks at All

This is surprisingly common on smaller setups or when a load balancer is initially configured. The load balancer just blindly sends traffic to all registered backend servers, regardless of their actual health. If one server goes down, it just sits there, receiving traffic and returning errors to users. This leads to a degraded experience for a subset of users and wasted resources.

Diagnosis: Check your load balancer configuration. If there are no explicit health check settings defined for the backend pool, this is your problem.

Fix:

  • AWS ELB/ALB: Define health check settings on the target group. For an HTTP/HTTPS load balancer, this might be: HealthCheckEnabled: true, HealthCheckPath: /health, HealthCheckPort: traffic-port, HealthCheckProtocol: HTTP, HealthCheckIntervalSeconds: 30, HealthCheckTimeoutSeconds: 5, HealthyThresholdCount: 3, UnhealthyThresholdCount: 2. Ensure your backend applications expose a /health endpoint that returns a 200 OK when healthy.
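  • Nginx (as LB): Open-source Nginx has no active health checks, so at minimum set max_fails and fail_timeout on each server line in the upstream block so that failing servers are taken out of rotation passively. For active checks, NGINX Plus provides the health_check directive, which goes in the location block next to proxy_pass http://your_upstream; (for example health_check uri=/health;) and requires a zone in the upstream block. Alternatively, build open-source Nginx with a third-party module such as nginx_upstream_check_module.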
  • HAProxy: Add check to your server lines: server web1 192.168.1.10:80 check port 80.

Why it works: Health checks allow the load balancer to dynamically remove unhealthy backend servers from its rotation, ensuring that traffic is only sent to servers that can actually respond successfully.
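On the backend side, the /health endpoint mentioned above can be very small. The following is a minimal sketch using only Python’s standard library; dependencies_ok is a hypothetical hook that you would replace with real checks, and port 8080 is an arbitrary choice.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ok() -> bool:
    """Hypothetical hook: replace with real checks (DB ping, disk space, ...)."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        ok = dependencies_ok()
        body = b"ok\n" if ok else b"unhealthy\n"
        # 200 tells the load balancer to keep us in rotation, 503 to pull us.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep frequent health probes out of the request log


def serve(port: int = 8080) -> None:
    """Run the health endpoint until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Call serve() to run it. In a real application the handler would usually live inside the existing web framework, so that a wedged application process also fails the check.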

4. Incorrect Load Balancing Algorithm

You’re using roundrobin for a stateful application, or least_conn for stateless APIs where connection count isn’t a good indicator of load. Either can lead to uneven distribution. roundrobin ignores how long each request runs, so one server can be handed several long-running requests in a row while its neighbours get cheap ones. least_conn counts connections rather than work, so it can treat a server holding a few expensive, long-lived connections as less loaded than one handling many short, quick ones.

Diagnosis: Understand your application’s traffic patterns. Is it stateful? Are requests long or short-lived? Are some requests computationally intensive? Review your load balancer’s algorithm setting.

Fix:

  • Stateful Applications: Use cookie-based persistence (sticky cookie in NGINX Plus, cookie in HAProxy) or target group stickiness (AWS ALB) to ensure a user’s requests go to the same backend server.
  • Stateless, CPU-bound applications: roundrobin is often fine, but if you see uneven CPU usage, consider an algorithm that reacts to actual load, such as least_time (NGINX Plus), which weighs response time together with active connections, or leastconn (HAProxy).
  • Stateless, I/O-bound applications (e.g., many short DB queries): least_conn (Nginx), leastconn (HAProxy), or least outstanding requests (AWS ALB) is usually a good choice, sending each new request to the server with the fewest in flight.
  • AWS ELB/ALB: On an Application Load Balancer, set the target group attribute load_balancing.algorithm.type to round_robin or least_outstanding_requests. Stickiness is configured separately on the same target group by setting stickiness.enabled to true.
  • Nginx: Use upstream your_upstream { ip_hash; } for sticky sessions, or upstream your_upstream { least_conn; } to balance based on active connections.
  • HAProxy: Use balance leastconn or balance roundrobin. For sticky sessions, use cookie SERVERID insert.

Why it works: The right algorithm ensures that traffic is distributed based on the actual needs and characteristics of your backend services, preventing one server from being disproportionately burdened.
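The difference between the two algorithms is easiest to see with a deliberately unlucky traffic pattern. The sketch below is a simplified simulation, not a model of any real load balancer: one request arrives per tick, each stays open for a fixed number of ticks, and the simulate function and its parameters are invented for this example.

```python
def simulate(durations: list[int], n_servers: int, algorithm: str) -> list[int]:
    """Return the total busy time assigned to each server.

    One request arrives per tick; durations[i] is how many ticks
    request i keeps its connection open.
    """
    finish_times: list[list[int]] = [[] for _ in range(n_servers)]
    busy = [0] * n_servers
    for tick, duration in enumerate(durations):
        # Forget connections that have already finished.
        finish_times = [[t for t in ends if t > tick] for ends in finish_times]
        if algorithm == "roundrobin":
            target = tick % n_servers
        elif algorithm == "leastconn":
            # Fewest open connections wins; ties go to the lowest index.
            target = min(range(n_servers), key=lambda s: len(finish_times[s]))
        else:
            raise ValueError(f"unknown algorithm: {algorithm}")
        finish_times[target].append(tick + duration)
        busy[target] += duration
    return busy


# Alternating long (10 ticks) and short (1 tick) requests, two servers.
traffic = [10, 1] * 4
rr = simulate(traffic, 2, "roundrobin")   # [40, 4]: one server gets every long request
lc = simulate(traffic, 2, "leastconn")    # [22, 22]
```

Real traffic is rarely this regular, but the same effect shows up whenever request cost is correlated with arrival order, for example a client that alternates a heavy report query with a cheap status poll.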

5. Over-reliance on DNS Round Robin

You’re publishing multiple DNS A records for your backend servers and expecting DNS round-robin to balance the load. DNS answers are heavily cached by clients and intermediate resolvers, so many clients keep using the same IP address for a long time, which can produce a massive imbalance. It also means that when a server goes down, clients continue hitting its IP until their cached record expires.

Diagnosis: Use dig or nslookup multiple times from different locations. Observe if you consistently get the same IP address. Check DNS TTL (Time To Live) values – if they are high (e.g., hours), this is a problem.
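The repeated-lookup check can also be scripted. This is a rough sketch that asks the local system resolver (so it sees the same caching a client on that machine would) and tallies the first address returned each time; the function names are made up for this example, and "app.example.com" in the usage comment is a placeholder.

```python
import socket
from collections import Counter
from typing import Callable


def system_resolve(hostname: str) -> str:
    """Return the first IPv4 address the local resolver hands back."""
    return socket.gethostbyname(hostname)


def tally_first_answers(
    hostname: str,
    attempts: int = 20,
    resolve: Callable[[str], str] = system_resolve,
) -> Counter:
    """Count how often each address comes back first across repeated lookups."""
    return Counter(resolve(hostname) for _ in range(attempts))


# Usage (requires network access):
#   print(tally_first_answers("app.example.com"))
# A single address holding nearly all of the count suggests that clients on
# this resolver are being pinned to one backend.
```

Run it from several networks, since each resolver caches independently.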

Fix: Do not use DNS round-robin for load balancing. Instead, use a dedicated load balancer (a managed service like AWS ELB, Google Cloud Load Balancing, or Azure Load Balancer, or self-hosted Nginx/HAProxy). Point your application’s public-facing DNS record at the load balancer: its IP address for a self-hosted setup, or its DNS name (via a CNAME or alias record) for managed services such as AWS ELB, whose IP addresses can change.

Why it works: A dedicated load balancer sits between your clients and your backend servers, actively managing traffic distribution and health checks in real-time, unlike the slow and unreliable nature of DNS caching.

6. Load Balancer Resource Exhaustion

The load balancer itself becomes the bottleneck. This can happen if the load balancer instance is undersized for the traffic volume, or if its configuration (e.g., connection limits, CPU/memory limits) is too restrictive. This manifests as slow responses, connection resets, or outright failures originating from the load balancer.

Diagnosis: Monitor the load balancer’s own performance metrics: CPU utilization, memory usage, network throughput, active connections, and request latency. Check load balancer logs for errors related to resource limits being hit.

Fix:

  • AWS ELB/ALB: These are managed services, so there is no instance size to pick; AWS scales Application Load Balancers automatically as traffic grows. That scaling is not instantaneous, so for an expected sharp spike, arrange capacity ahead of time (for example through a load balancer capacity unit reservation where available, or through AWS Support). Also ensure your health check settings aren’t generating excessive load (e.g., by checking too frequently).
  • Nginx/HAProxy: In Nginx, increase worker_processes and worker_connections; in HAProxy, tune the maxconn settings. Ensure the server running Nginx/HAProxy has sufficient CPU and RAM.

Why it works: By providing the load balancer with adequate resources or configuring it to handle more concurrent traffic, you ensure it can effectively distribute requests without becoming a bottleneck itself.
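For the Nginx settings, a back-of-the-envelope capacity check helps decide whether the limits are plausible for your traffic. The sketch below assumes the common rule of thumb that each proxied client ties up two of a worker’s connections (one to the client, one to the upstream); the function name is hypothetical, and real capacity also depends on keepalive behaviour and file descriptor limits.

```python
def nginx_proxy_client_capacity(worker_processes: int, worker_connections: int) -> int:
    """Rough ceiling on concurrent proxied clients for an Nginx reverse proxy.

    Assumes two connections per client: client side plus upstream side.
    """
    if worker_processes < 1 or worker_connections < 1:
        raise ValueError("both settings must be positive")
    return (worker_processes * worker_connections) // 2


# Example: 4 workers with 1024 connections each.
ceiling = nginx_proxy_client_capacity(4, 1024)   # 2048
```

If peak concurrent connections in your monitoring approach this ceiling, raise the settings (and the host’s open-file limit) before the load balancer starts refusing connections.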

The next thing you’ll likely encounter is 502 Bad Gateway or 503 Service Unavailable errors originating from the load balancer itself, indicating that it got no valid response from a backend or had no healthy backend targets to send the request to.
