Gatling’s "failures" metric is actually the most important one to watch during a load test, not its "throughput."
Here’s how to rip through Gatling results and find what’s actually slowing your system down.
Imagine you’ve just run a Gatling load test and you’re staring at the HTML report. You’ve got your usual suspects: requests per second, response times, error rates. But where’s the bottleneck? It’s not always obvious, and Gatling’s strength is in showing you the system’s behavior, not just your code’s.
Let’s say you’re seeing a dip in throughput and an increase in response times as your load ramps up. You drill into the "Requests" section. You’re looking for any request that deviates significantly from the others, especially those that are not the most frequent.
Scenario 'MyScenario':
- Request 'POST /api/users'
  - Total: 10000
  - OK: 9950 (99.5%)
  - KO: 50 (0.5%)
  - Response Time (max): 2500 ms
  - Response Time (std dev): 800 ms
  - Response Time (median): 1200 ms
  - Response Time (75th percentile): 1800 ms
  - Response Time (95th percentile): 2400 ms
  - Response Time (99th percentile): 2500 ms
The first thing to check is your KO (failed) requests. What’s the error? If you see a lot of 503 Service Unavailable or 500 Internal Server Error for a specific endpoint, that’s your first major clue. It means the server is actively rejecting or failing your requests.
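As a rough sketch of that first pass, the snippet below ranks requests by KO rate and by how far the 95th percentile sits from the median. The numbers are copied from the excerpt above; the dictionary layout, the `triage` helper, and both thresholds are assumptions made for illustration, not a Gatling export format or a recommended standard.

```python
# Values copied by hand from the report excerpt; the dict layout is an
# illustrative assumption, not something Gatling produces.
stats = {
    "POST /api/users": {
        "total": 10000, "ko": 50,
        "median_ms": 1200, "p95_ms": 2400, "p99_ms": 2500,
    },
}

def triage(stats, max_ko_pct=0.1, max_tail_ratio=3.0):
    """Return (name, ko_pct, tail_ratio, flags) rows, worst KO rate first.

    The thresholds are arbitrary starting points; tune them to your own SLOs.
    """
    rows = []
    for name, s in stats.items():
        ko_pct = 100.0 * s["ko"] / s["total"]
        tail_ratio = s["p95_ms"] / s["median_ms"]
        flags = []
        if ko_pct > max_ko_pct:
            flags.append("errors")
        if tail_ratio > max_tail_ratio:
            flags.append("long tail")
        rows.append((name, ko_pct, tail_ratio, flags))
    return sorted(rows, key=lambda r: r[1], reverse=True)

for name, ko_pct, tail_ratio, flags in triage(stats):
    print(f"{name}: KO {ko_pct:.2f}%, p95/median {tail_ratio:.1f}, flags={flags}")
```

With more than one request in the dictionary, the request with the highest KO rate comes first, which is usually the one to investigate before looking at latency.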
Cause 1: Resource Exhaustion on the Server
This is the classic. Your application server (or database, or any downstream service) is running out of CPU, memory, or file descriptors. Gatling is hammering it, and it can’t keep up.
- Diagnosis: On your application server, run `top` or `htop` to check CPU and memory usage. For file descriptors, count what the process has open with `ls /proc/<pid>/fd | wc -l` (or `lsof -p <pid> | wc -l`, which overcounts slightly because it also lists non-descriptor entries) and compare it to the "Max open files" row in `/proc/<pid>/limits`. Note that `/proc/sys/fs/file-max` is the system-wide ceiling, not the per-process limit.
- Fix:
  - CPU/Memory: Increase instance size, optimize application code, or scale out horizontally. For example, if you’re on AWS EC2, upgrade from `t3.medium` to `t3.xlarge`.
  - File Descriptors: Increase the `ulimit -n` for the user running your application. A common fix is to add `* soft nofile 65536` and `* hard nofile 65536` to `/etc/security/limits.conf`, then start a new login session and restart the application. Services managed by systemd take their limit from `LimitNOFILE` in the unit file instead.
- Why it works: More CPU/memory gives the application more processing power. Increasing file descriptors allows the OS to handle more concurrent network connections, which your application needs to serve requests.
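The file descriptor check can also be scripted. The sketch below is Linux-only and assumes you have permission to read the target process's `/proc` entries; `fd_usage` is a hypothetical helper name. It reads the same two sources as the manual check: the `fd` directory and the "Max open files" row of the `limits` file.

```python
import os

def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) for a Linux process.

    soft_limit is None when the limit is reported as "unlimited".
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units.
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit

# Only meaningful where /proc exists (Linux); skipped elsewhere.
if os.path.isdir("/proc/self/fd"):
    used, limit = fd_usage()
    print(f"open fds: {used}, soft limit: {limit}")
```

Pass the application's PID instead of the default to inspect the server process. Sampling this during the test shows whether usage climbs toward the limit as load ramps up.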
Cause 2: Database Connection Pool Exhaustion
Your application is trying to get database connections, but the pool is empty because all connections are in use or being held too long. This often manifests as slow requests or timeouts for endpoints that hit the database.
- Diagnosis: Check your database connection pool metrics (available in your application framework or APM tool). Look for `Active Connections`, `Idle Connections`, and `Connection Wait Time`. If `Active Connections` is consistently at or near your pool size and `Connection Wait Time` is high, this is your problem.
- Fix: Increase the maximum size of your database connection pool. For example, if your HikariCP pool is set to `maximumPoolSize=20`, increase it to `maximumPoolSize=50`, provided the database’s own connection limit leaves room for that across all application instances. Restart your application.
- Why it works: A larger pool allows more concurrent requests to acquire database connections, preventing them from getting stuck waiting.
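Rather than guessing a new pool size, you can estimate it with Little's law: the number of connections in use at once is roughly the request rate multiplied by how long each request holds a connection. The sketch below is a back-of-the-envelope calculation. The function name, the 20% headroom, and the example figures (200 requests per second, 150 ms hold time) are assumptions, not measurements from the report above.

```python
import math

def connections_needed(requests_per_sec, hold_time_ms, headroom_pct=20):
    """Estimate pool size as arrival rate x hold time, plus headroom.

    Integer inputs keep the arithmetic exact before the final division.
    """
    connection_ms_per_sec = requests_per_sec * hold_time_ms
    return math.ceil(connection_ms_per_sec * (100 + headroom_pct) / 100_000)

# Hypothetical workload: 200 req/s, each holding a connection for 150 ms.
print(connections_needed(requests_per_sec=200, hold_time_ms=150))  # 36
```

Under those assumed numbers a pool of 20 would be undersized and 36 would be a reasonable starting point. If the estimate exceeds what the database can accept, shortening the time each request holds a connection is the more effective lever.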
Cause 3: Slow Downstream Services
Your application relies on another service (an API, a microservice, a third-party integration) that is itself overloaded or experiencing issues. Gatling might show high response times for your endpoint, but the root cause is external.
- Diagnosis: Use an Application Performance Monitoring (APM) tool (like Datadog, New Relic, Dynatrace) to trace requests across services. Look for the "time spent" in external calls. If a significant portion of your endpoint’s latency is attributed to a call to `http://other-service.example.com/api/data`, that’s your culprit.
- Fix: Scale up the downstream service or optimize its performance. If it’s a third-party service, implement aggressive caching or circuit breakers in your application. For example, add a cache with a TTL of 60 seconds for responses from `http://other-service.example.com/api/data`.
- Why it works: By reducing the time your application spends waiting for slow external services, you improve its overall response time and capacity. Caching avoids repeated calls to the slow service altogether.
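To make the 60-second TTL cache concrete, here is a minimal in-memory sketch. `TTLCache` and `fetch_data` are hypothetical names, the fetch function is a stand-in for the real HTTP call, and the class is not thread-safe as written, so a production version would need locking or an established caching library.

```python
import time

class TTLCache:
    """Cache values per key and refetch once an entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # function called on a cache miss
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the slow call
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value

def fetch_data(url):
    # Stand-in for the real HTTP call to the slow service.
    fetch_data.calls += 1
    return {"url": url, "call": fetch_data.calls}

fetch_data.calls = 0

cache = TTLCache(fetch_data, ttl_seconds=60.0)
cache.get("http://other-service.example.com/api/data")
cache.get("http://other-service.example.com/api/data")
print(fetch_data.calls)  # 1: the second lookup was served from the cache
```

Caching only suits responses that can tolerate being up to 60 seconds stale. For data that must be current, a circuit breaker or a shorter timeout is the safer choice.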
Cause 4: Inefficient Database Queries
A poorly optimized SQL query can consume massive amounts of CPU and I/O on the database server, slowing down all requests that use it. This often appears as high response times for specific endpoints in Gatling.
- Diagnosis: Enable slow query logging on your database. Analyze the logs for queries taking longer than a few hundred milliseconds. Use `EXPLAIN` (or `EXPLAIN ANALYZE`) on those queries to understand their execution plan.
- Fix: Add appropriate indexes to your database tables based on the `EXPLAIN` output. For example, if a query on the `users` table is slow and filters on `email`, add an index: `CREATE INDEX idx_users_email ON users (email);`.
- Why it works: Indexes allow the database to find rows much faster without scanning entire tables, drastically reducing query execution time.
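The effect of that index on the query plan can be reproduced in miniature. The sketch below uses Python's built-in SQLite as a stand-in for your production database. The table contents are made up, and SQLite's `EXPLAIN QUERY PLAN` output differs in wording from PostgreSQL's or MySQL's `EXPLAIN`, but the shift from a full scan to an index search is the same idea.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as a single readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 is the detail text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

sql = "SELECT * FROM users WHERE email = ?"
params = ("user500@example.com",)

before = query_plan(conn, sql, params)   # full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = query_plan(conn, sql, params)    # lookup through the new index

print("before:", before)
print("after: ", after)
```

The first plan reports a scan of `users`; the second names `idx_users_email`. Run the equivalent check on a copy of your real schema before adding the index in production, since every index also adds write overhead.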
Cause 5: Network Latency or Bandwidth Limitations
While less common for internal bottlenecks, if your Gatling injector machines are far from your application servers, or if there’s network congestion between them, it can add significant latency.
- Diagnosis: Use `ping` and `traceroute` from the Gatling injector machines to your application servers. Monitor network interface statistics (`ifconfig` or `ip -s link`) on both sides for errors, dropped packets, or high utilization.
- Fix: Deploy Gatling injectors in the same network/region as your application. Increase network bandwidth if utilization is consistently high.
- Why it works: Reducing physical distance and ensuring sufficient network capacity minimizes the time packets take to travel, directly impacting overall request latency.
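When ICMP is blocked and `ping` is not an option, timing a TCP connect from the injector gives a comparable estimate of the network round trip. This swaps in a different technique from the one described above. The function names are hypothetical, and the demo targets a loopback listener only so that the sketch runs anywhere.

```python
import socket
import statistics
import time

def connect_times_ms(host, port, samples=5, timeout=2.0):
    """Time several TCP handshakes to host:port, in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established; close it immediately
        times.append((time.perf_counter() - start) * 1000.0)
    return times

def summarize(times_ms):
    return {
        "min": min(times_ms),
        "median": statistics.median(times_ms),
        "max": max(times_ms),
    }

if __name__ == "__main__":
    # Loopback listener for demonstration; from an injector, pass your
    # application server's host and port instead.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(16)
    host, port = server.getsockname()
    print(summarize(connect_times_ms(host, port)))
    server.close()
```

If the median connect time from the injector is a meaningful fraction of the response times Gatling reports, the network path is contributing to the latency you are measuring.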
Cause 6: Application Threading Issues (Deadlocks, Contention)
Your application’s threads might be getting stuck waiting for each other (deadlock) or spending too much time acquiring locks (contention), preventing requests from being processed.
- Diagnosis: Use thread dumps. For Java applications, `jstack <pid>` produces one, and it also reports any Java-level deadlock it detects. Analyze the dumps for threads stuck in `WAITING` or `BLOCKED` states, and look for lock acquisition patterns. APM tools can also help visualize thread contention.
- Fix: Refactor code to reduce lock scope, use non-blocking I/O where possible, or adjust thread pool sizes. For example, if a specific `synchronized` block is causing contention, consider breaking down the synchronized operation or using `java.util.concurrent.locks.ReentrantLock`. Its optional fairness policy prevents starvation but usually lowers throughput, so enable it only if starvation is the actual problem.
- Why it works: Minimizing the time threads spend waiting for locks or resolving deadlocks allows them to process more requests concurrently.
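A quick way to read a large thread dump is to count threads per state before looking at individual stacks. The sketch below relies on the `java.lang.Thread.State:` lines that `jstack` prints; the sample dump is a trimmed, made-up excerpt for illustration, and `thread_state_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# jstack prints one "java.lang.Thread.State: <STATE>" line per thread.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")

def thread_state_counts(dump_text):
    """Count threads per state in the text of a jstack dump."""
    return Counter(STATE_RE.findall(dump_text))

# Trimmed, made-up excerpt; a real dump also includes stack frames.
sample_dump = '''
"worker-1" #21 prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-2" #22 prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-3" #23 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
"worker-4" #24 prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (sleeping)
'''

for state, count in thread_state_counts(sample_dump).most_common():
    print(f"{state}: {count}")
```

Taking several dumps a few seconds apart and comparing the counts helps separate a persistent pile-up of `BLOCKED` threads from a momentary spike.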
Once you’ve addressed these, keep an eye on your Gatling reports. The next thing you’ll likely see is a plateau in throughput with consistently low response-time percentiles, indicating you’re hitting the maximum capacity of your system for that specific workload.