Locust’s percentile statistics are actually a historical artifact, and most users are better off using a different, more robust metric entirely.
Let’s see Locust in action. Imagine you’re running a simple load test against a hypothetical API endpoint /login.
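    from locust import HttpUser, task, between


    class WebsiteUser(HttpUser):
        # Each simulated user pauses 1 to 5 seconds between tasks.
        wait_time = between(1, 5)
        host = "http://localhost:8080"

        @task
        def login(self):
            # Issue a GET request against the endpoint under test.
            self.client.get("/login")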
After running this for a bit, you’d see a report like this in Locust’s web UI:
User statistics (10000 requests)
Name       | # requests | Min   | Avg    | Max     | Median | p95    | p99
-----------|------------|-------|--------|---------|--------|--------|-------
GET /login | 10000      | 10 ms | 150 ms | 2000 ms | 120 ms | 250 ms | 500 ms
The p95 and p99 columns show the response times below which 95% and 99% of requests, respectively, fell. So, for /login, 95% of requests finished in 250 ms or less, and 99% finished in 500 ms or less. This is useful for understanding tail latency – the performance experienced by a small, but significant, fraction of your users.
The problem Locust solves with these percentiles is identifying and quantifying outliers. In load testing, we often care about the "average" performance, but a few very slow requests can severely degrade the user experience for those unfortunate users. Percentiles give us a way to measure this.
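To make the point about averages concrete, here is a small sketch using made-up numbers (the sample values and the 1-second threshold are assumptions for illustration, not measurements from a real test). It shows how a handful of slow requests barely registers in the median and only partly shows up in the mean:

```python
import statistics

# Hypothetical sample: 97 requests at 100 ms and 3 stragglers at 3000 ms.
response_times_ms = [100] * 97 + [3000] * 3

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)
slow_requests = sum(1 for t in response_times_ms if t > 1000)

print(f"mean: {mean_ms} ms, median: {median_ms} ms")
print(f"requests slower than 1 s: {slow_requests} of {len(response_times_ms)}")
```

The mean here is 187 ms, which describes neither the typical 100 ms request nor the 3000 ms stragglers. A high percentile is what surfaces those three users.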
The straightforward, exact way to calculate these percentiles is to collect all response times for a given endpoint, sort them in ascending order, and pick the value at the 95% or 99% mark. For example, with 100 requests sorted from fastest to slowest, the p95 is the 95th value, which means only five requests were slower.
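The sort-and-pick calculation can be sketched in a few lines. This is a minimal illustration of the nearest-rank method, not Locust's own code; the function name `exact_percentile` and the sample data are hypothetical:

```python
import math


def exact_percentile(samples, percent):
    """Nearest-rank percentile: the smallest sample such that at least
    `percent` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)  # ascending: fastest request first
    # 1-based rank of the sample to pick, rounded up.
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 100 response times of 1 ms, 2 ms, ..., 100 ms.
samples_ms = list(range(1, 101))
p95 = exact_percentile(samples_ms, 95)
p99 = exact_percentile(samples_ms, 99)
print(f"p95={p95} ms, p99={p99} ms")
```

Note that the whole `samples` list has to be kept in memory and sorted before any percentile can be read off.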
However, this approach has a fundamental flaw: it requires storing every single response time. As your load test scales, especially with high request volumes or long-running tests, this memory usage can become prohibitive. A one-hour test at 10,000 requests per second has to hold 36 million individual response times, and a longer or faster run climbs into the billions, which can lead to out-of-memory failures or severe slowdown of the Locust master process itself. This makes exact percentiles impractical for long or high-throughput tests.
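A quick back-of-the-envelope calculation shows how the storage grows for the scenario above. The per-sample sizes are assumptions: 8 bytes for a tightly packed value, and a larger figure for a plain Python list of floats, estimated from `sys.getsizeof` plus an assumed 8-byte list slot per element on a 64-bit interpreter:

```python
import sys

requests_per_second = 10_000
test_duration_seconds = 60 * 60  # one hour

total_samples = requests_per_second * test_duration_seconds

# Assumption: best case, each sample packed as a raw 8-byte value.
packed_bytes = total_samples * 8

# Assumption: a plain Python list holds one pointer (8 bytes on a 64-bit
# build) per element, plus a separate float object for each sample.
list_bytes = total_samples * (8 + sys.getsizeof(0.0))

print(f"{total_samples:,} samples per hour")
print(f"packed: about {packed_bytes / 1e6:.0f} MB")
print(f"list of floats: about {list_bytes / 1e6:.0f} MB")
```

The totals scale linearly with both request rate and test duration, so a multi-hour soak test multiplies these figures accordingly.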
The true power of percentiles in performance testing lies not in their exact calculation from all data points, but in their approximation using algorithms like t-digest or HDR Histogram. These algorithms maintain a close estimate of the distribution in a small, bounded amount of memory, regardless of the number of requests. This allows you to get reliable p95 and p99 values even for massive load tests. While an exact, store-everything percentile calculation is intuitive, its memory footprint makes it unsuitable for real-world, large-scale scenarios. For production-grade load testing, you’ll want to integrate or use tools that employ these memory-efficient percentile estimation algorithms.
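To illustrate the bounded-memory idea, here is a deliberately simplified sketch. It is neither t-digest nor HDR Histogram, only a hand-rolled histogram with geometrically growing buckets; the class name `LogBucketHistogram` and its parameters (`min_ms`, `max_ms`, `growth`) are hypothetical. It keeps a fixed list of counters and reports the upper edge of the bucket that contains the requested rank:

```python
import math


class LogBucketHistogram:
    """Approximate percentiles using a fixed number of counters.

    Simplified illustration only; not t-digest or HDR Histogram.
    For values between min_ms and max_ms the estimate is at most
    a factor of `growth` above the exact nearest-rank percentile.
    """

    def __init__(self, min_ms=1.0, max_ms=60_000.0, growth=1.05):
        self.min_ms = min_ms
        self.growth = growth
        self.log_growth = math.log(growth)
        n_buckets = int(math.log(max_ms / min_ms) / self.log_growth) + 2
        self.counts = [0] * n_buckets  # memory is fixed from here on
        self.total = 0

    def record(self, value_ms):
        # Values outside the configured range are clamped to the edge buckets.
        clamped = max(value_ms, self.min_ms)
        index = int(math.log(clamped / self.min_ms) / self.log_growth)
        index = min(index, len(self.counts) - 1)
        self.counts[index] += 1
        self.total += 1

    def percentile(self, percent):
        if self.total == 0:
            raise ValueError("no samples recorded")
        target = math.ceil(percent * self.total / 100)  # 1-based rank
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                # Report the bucket's upper edge as the estimate.
                return self.min_ms * self.growth ** (index + 1)
        return self.min_ms * self.growth ** len(self.counts)


# Hypothetical workload: 100,000 samples spread evenly over 1..1000 ms.
hist = LogBucketHistogram()
for i in range(1, 100_001):
    hist.record(1 + i % 1000)

estimate = hist.percentile(95)
print(f"approximate p95: {estimate:.1f} ms using {len(hist.counts)} counters")
```

The trade-off is precision for memory: the number of counters depends only on the configured range and `growth`, never on how many requests were recorded. Production libraries refine this idea with tighter error bounds and mergeable data structures.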
The next critical concept to grasp is how to interpret these percentiles in the context of your service-level objectives (SLOs).