Locust’s failure rate and latency metrics are not just numbers; they’re a direct readout of how your application is performing under stress, and often, the failure rate is a more sensitive indicator of problems than raw latency.
Let’s see what this looks like in the wild. Imagine we’re running a simple load test against a basic web service.

    from locust import HttpUser, task, between

    class WebsiteUser(HttpUser):
        wait_time = between(1, 5)
        host = "http://localhost:8080"

        @task
        def index(self):
            self.client.get("/")

        @task
        def about(self):
            self.client.get("/about")

When we run this with Locust, we get a web UI. On the "Statistics" tab, we’ll see lines for each endpoint (/ and /about in this case), and columns for "Requests", "Failures", "Med. (ms)", "95%", etc.
The "Requests" column shows the total number of times a request was made to that endpoint. "Failures" is the count of requests that did not complete successfully. "Med. (ms)" is the median response time, meaning 50% of requests were faster than this. "95%" is the 95th percentile, meaning 95% of requests were faster than this value.
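To make the two latency columns concrete, here is a minimal sketch that computes a median and a 95th percentile from a handful of made-up response times. The sample values and the `percentile` helper are assumptions for illustration, and the nearest-rank method used here is only one common definition; Locust's reported values may differ slightly because it aggregates response times internally.

```python
import math
import statistics

# Hypothetical response times (ms) for one endpoint; not real measurements.
response_times_ms = [42, 45, 47, 48, 50, 51, 53, 55, 60, 480]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


median_ms = statistics.median(response_times_ms)
p95_ms = percentile(response_times_ms, 95)
gap_ms = p95_ms - median_ms

print(f"median={median_ms} ms, p95={p95_ms} ms, gap={gap_ms} ms")
```

A single slow request is enough to pull the 95th percentile far away from the median in this sample, which is why the two columns are worth reading together.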
The core problem Locust helps you diagnose is understanding why requests are failing and how slow they are becoming under load. It’s not enough to see a high number of requests; you need to correlate that with failures and latency to pinpoint bottlenecks.
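One way to correlate request counts with failures is to turn the two columns into a per-endpoint failure rate. The sketch below does that for hypothetical totals; the numbers, the `failure_rate` helper, and the 1% alert threshold are all assumptions chosen for illustration, not values Locust provides.

```python
# Hypothetical totals per endpoint, as you might read them off the Statistics tab.
stats = {
    "/": {"requests": 12000, "failures": 30},
    "/about": {"requests": 8000, "failures": 160},
}

FAILURE_RATE_THRESHOLD = 0.01  # assumed alert level of 1%


def failure_rate(requests, failures):
    """Fraction of requests that failed; 0.0 when nothing was sent."""
    return failures / requests if requests else 0.0


rates = {
    endpoint: failure_rate(row["requests"], row["failures"])
    for endpoint, row in stats.items()
}
flagged = [endpoint for endpoint, rate in rates.items() if rate > FAILURE_RATE_THRESHOLD]

for endpoint, rate in rates.items():
    print(f"{endpoint}: {rate:.2%}")
print("above threshold:", flagged)
```

Here `/` has more raw failures than you might expect from a healthy endpoint, but only `/about` crosses the assumed threshold once the counts are normalized by traffic.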
Interpreting Failure Rates
A failure rate isn’t just a percentage; it’s a symptom of a system under duress. If the failure rate for a specific endpoint climbs past roughly 0.5% to 1%, it’s a strong signal that something is breaking. Common causes include:
- Application Errors (5xx Status Codes): The most common cause. Your application is explicitly returning an error.
  - Diagnosis: Check your application logs for tracebacks or error messages corresponding to the failing requests; server-side exceptions (e.g., ValueError, KeyError) show up there. In the Locust UI, the "Failures" tab lists what the client saw, typically the HTTP status code (e.g., 500 Internal Server Error).
  - Fix: Debug your application code. For example, if the logs show KeyError: 'user_id', your code is trying to access data['user_id'] but that key doesn’t exist. You’d add a check such as user_id = data.get('user_id') and handle the None case, or ensure the key is always present.
  - Why it works: Eliminates the bug in your application logic that leads to the error.
- Connection Timeouts (Locust Exception ReadTimeout): The client (Locust) waited for a response from the server but didn’t receive one within its configured timeout.
  - Diagnosis: In Locust’s "Failures" tab, you’ll see ReadTimeout or similar network-related exceptions. This often coincides with high latency.
  - Fix: Increase the server’s timeout configurations (e.g., in your web server like Nginx or your application framework’s settings). For example, in Nginx, you might increase proxy_read_timeout from 60s to 120s. You might also need to raise Locust’s client timeout. With HttpUser, the timeout is passed per request, e.g. self.client.get("/", timeout=120); with FastHttpUser, set the network_timeout class attribute.
  - Why it works: Gives the server more time to process slow requests, preventing the client from giving up prematurely.
- Connection Refused/Reset (ConnectionRefusedError, ConnectionResetError): The server actively rejected the connection or closed it unexpectedly.
  - Diagnosis: These errors appear directly in the Locust failures. This usually means the server process crashed, was overloaded and couldn’t accept new connections, or a firewall intervened.
  - Fix: Ensure your application server process is running and healthy. Check resource utilization (CPU, memory) on the server. If it’s a scaling issue, increase the number of application server instances or their resources. For connection refused, check if the port is open and the service is listening.
  - Why it works: Addresses the underlying issue of the server being unavailable or unable to handle the load.
- DNS Resolution Failures (NewConnectionError with "[Errno -2] Name or service not known"): Locust can’t resolve the hostname of your service.
  - Diagnosis: A NewConnectionError in Locust’s failures whose message includes "Name or service not known" points to DNS. The same exception can also wrap a refused connection, so read the message and not just the exception name.
  - Fix: Verify your /etc/resolv.conf or network settings for correct DNS servers. Ensure the hostname you’re testing against is actually resolvable from where Locust is running.
  - Why it works: Ensures Locust can establish a network connection to the target service.
- Rate Limiting (429 Status Codes): Your API is intentionally throttling requests.
  - Diagnosis: Locust will report 429 "Too Many Requests" status codes under "Failures."
  - Fix: Adjust your load test to respect the API’s rate limits. This might involve reducing the user count, increasing wait_time in your Locustfile, or implementing retry logic with backoff in your Locustfile if the API supports it.
  - Why it works: Aligns your test with the service’s designed capacity.
- Resource Exhaustion on Server (e.g., File Descriptors, Memory Leaks): The server runs out of system resources.
  - Diagnosis: This can manifest as ConnectionRefusedError (if the OS won’t allow new connections) or application-level errors due to out-of-memory conditions. Monitor server-side metrics such as RAM usage and the number of open file descriptors, and compare the latter against the limit reported by ulimit -n.
  - Fix: Increase system limits (e.g., ulimit -n 65536) or fix memory leaks in your application.
  - Why it works: Provides the necessary resources for the application to operate correctly.
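When the "Failures" tab holds a long mix of messages, a small triage helper can bucket them into the categories above. The following is a rough sketch using only the standard library; the `classify_failure` function, the category names, and the sample messages are hypothetical, and the substring and regex checks assume message formats similar to what the requests library produces. Resource exhaustion has no bucket of its own here because, as noted above, it tends to surface as one of the other errors.

```python
import re
from collections import Counter

# Assumed format: "<status> Client Error ..." or "<status> Server Error ...".
_STATUS_RE = re.compile(r"\b(\d{3}) (?:Client|Server) Error")


def classify_failure(message: str) -> str:
    """Map one failure message to a coarse category."""
    # DNS is checked first because it is also reported via NewConnectionError.
    if "Name or service not known" in message:
        return "dns"
    if "ConnectionRefusedError" in message or "Connection refused" in message:
        return "connection_refused"
    if "ConnectionResetError" in message or "Connection reset" in message:
        return "connection_reset"
    if "Timeout" in message:
        return "timeout"
    match = _STATUS_RE.search(message)
    if match:
        status = int(match.group(1))
        if status == 429:
            return "rate_limited"
        if 500 <= status <= 599:
            return "application_error"
    return "other"


# Hypothetical failure messages, written to resemble typical client output.
samples = [
    "HTTPError('500 Server Error: Internal Server Error for url: /')",
    "ReadTimeout: Read timed out. (read timeout=5)",
    "ConnectionError: [Errno 111] Connection refused",
    "ConnectionResetError(104, 'Connection reset by peer')",
    "NewConnectionError: [Errno -2] Name or service not known",
    "HTTPError('429 Client Error: Too Many Requests for url: /about')",
]

counts = Counter(classify_failure(m) for m in samples)
print(counts.most_common())
```

Network-level checks run before the status-code check so that numbers inside a connection error (an errno or a port) are never mistaken for an HTTP status.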
Interpreting Latency
While failures are critical, high latency, even without failures, indicates performance degradation. For a user, a slow response is almost as bad as a failed one.
- Median vs. 95th Percentile: The median (50th percentile) tells you what half your users experience. The 95th percentile shows the experience for your slower users. A large gap between median and 95th percentile (e.g., median 50ms, 95th percentile 500ms) indicates significant variability and potential for user frustration.
- Bottlenecks: High latency often points to specific parts of your system that are struggling. This could be database queries, external API calls, inefficient algorithms, or contention for shared resources.
  - Diagnosis: Correlate high latency on specific Locust endpoints with performance metrics from your application (database query times, external service response times). Tools like APM (Application Performance Monitoring) are invaluable here.
  - Fix: Optimize the identified bottleneck. This might mean adding database indexes, caching frequently accessed data, optimizing SQL queries, or refactoring slow code paths.
  - Why it works: Directly addresses the slow operation that is delaying the response.
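If no APM tool is available, a lightweight way to start the diagnosis is to time the stages of a request handler yourself and see which one dominates. This is only a sketch under stated assumptions: the `timed` context manager, the `handle_request` function, and the stage names are hypothetical, and `time.sleep` stands in for a slow database query.

```python
import time
from contextlib import contextmanager

timings = {}  # accumulated seconds per labelled stage


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)


def handle_request():
    with timed("db_query"):
        time.sleep(0.05)  # stand-in for a slow database query
    with timed("render"):
        sum(range(1000))  # stand-in for cheap in-process work


handle_request()
slowest = max(timings, key=timings.get)
print(f"slowest stage: {slowest} ({timings[slowest] * 1000:.1f} ms)")
```

Comparing these per-stage timings with the endpoint latency Locust reports indicates whether the time is spent in the stage you suspect or somewhere else in the request path.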
Once you fix a failure, the latency of the previously failing requests becomes measurable, and it may reveal a new performance bottleneck.