The most important number to watch in a Gatling report during a load test is the failed-request (KO) count, not throughput.

Here’s how to rip through Gatling results and find what’s actually slowing your system down.

Imagine you’ve just run a Gatling load test and you’re staring at the HTML report. You’ve got your usual suspects: requests per second, response times, error rates. But where’s the bottleneck? It’s not always obvious, and Gatling’s strength is in showing you the system’s behavior, not just your code’s.

Let’s say you’re seeing a dip in throughput and an increase in response times as your load ramps up. You drill into the "Requests" section. You’re looking for any request whose error rate or response times deviate significantly from the others. Pay attention to low-volume requests too: they barely move the global averages, so they are easy to overlook.

Scenario 'MyScenario':
  - Request 'POST /api/users'
    - Total: 10000
    - OK: 9950 (99.5%)
    - KO: 50 (0.5%)
    - Response Time (max): 2500 ms
    - Response Time (std dev): 800 ms
    - Response Time (median): 1200 ms
    - Response Time (75th percentile): 1800 ms
    - Response Time (95th percentile): 2400 ms
    - Response Time (99th percentile): 2500 ms

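The first thing to check is your KO (failed) requests. What’s the error? A KO can be an error status, a timeout, a refused connection, or a failed check, so read the error messages in the report before anything else. If you see a lot of 503 Service Unavailable or 500 Internal Server Error for a specific endpoint, that’s your first major clue. It means the server is actively rejecting or failing your requests.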
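
To make the "which request deviates" check concrete, here is a minimal sketch that computes the KO percentage and two percentiles per request name. The (name, status, response time) tuple layout and the sample values are assumptions made for illustration; this is not Gatling’s log format, and the nearest-rank percentile used here may differ slightly from the method Gatling uses in its report.

```python
from collections import defaultdict
from math import ceil


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples):
    """Group (request_name, status, response_ms) tuples by request name."""
    grouped = defaultdict(list)
    for name, status, response_ms in samples:
        grouped[name].append((status, response_ms))

    summary = {}
    for name, rows in grouped.items():
        times = sorted(ms for _, ms in rows)
        ko = sum(1 for status, _ in rows if status == "KO")
        summary[name] = {
            "total": len(rows),
            "ko_pct": 100 * ko / len(rows),
            "p50": percentile(times, 50),
            "p95": percentile(times, 95),
        }
    return summary


# Hypothetical samples: one healthy request, one with failures and a long tail.
samples = (
    [("GET /health", "OK", 20)] * 95
    + [("GET /health", "OK", 40)] * 5
    + [("POST /api/users", "OK", 1200)] * 90
    + [("POST /api/users", "KO", 2500)] * 10
)

# Worst error rate first, so the suspicious request is at the top.
for name, stats in sorted(summarize(samples).items(), key=lambda kv: -kv[1]["ko_pct"]):
    print(name, stats)
```

Sorting by KO percentage instead of request volume is a deliberate choice: it surfaces a failing low-volume request that a global average would hide.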

Cause 1: Resource Exhaustion on the Server

This is the classic. Your application server (or database, or any downstream service) is running out of CPU, memory, or file descriptors. Gatling is hammering it, and it can’t keep up.

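  • Diagnosis: On your application server, run top or htop to check CPU and memory usage. For file descriptors, count the process’s open descriptors with ls /proc/<pid>/fd | wc -l and compare the result to the per-process limit in the "Max open files" row of /proc/<pid>/limits. (/proc/sys/fs/file-max is the system-wide ceiling, not the limit your process hits first, and lsof -p <pid> also lists entries such as memory-mapped files that are not descriptors, so it overcounts.)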
  • Fix:
    • CPU/Memory: Increase instance size, optimize application code, or scale out horizontally. For example, if you’re on AWS EC2, upgrade from t3.medium to t3.xlarge.
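    • File Descriptors: Raise the open-file limit (ulimit -n) for the user running your application. A common fix is to add * soft nofile 65536 and * hard nofile 65536 to /etc/security/limits.conf, then start the application from a fresh login session so the new limit applies. limits.conf is applied by PAM at login, so a service started by systemd needs LimitNOFILE=65536 in its unit file instead.
  • Why it works: More CPU/memory gives the application more processing power. Every concurrent network connection uses a file descriptor, so raising the limit lets the process hold more open connections at once, which your application needs to serve requests.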
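
The file-descriptor check can be scripted. The sketch below is Linux-only and assumes the usual /proc layout; the function names are made up for this example. It reads the per-process limit from /proc/<pid>/limits and counts the entries in /proc/<pid>/fd.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of None means the limit is reported as "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process.

    Linux only; requires permission to read /proc/<pid>.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft, _hard = parse_max_open_files(fh.read())
    return open_fds, soft
```

Calling fd_usage(pid) every few seconds during the test shows whether the count climbs toward the soft limit as load ramps up.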

Cause 2: Database Connection Pool Exhaustion

Your application is trying to get database connections, but the pool is empty because all connections are in use or being held too long. This often manifests as slow requests or timeouts for endpoints that hit the database.

  • Diagnosis: Check your database connection pool metrics (available in your application framework or APM tool). Look for Active Connections, Idle Connections, and Connection Wait Time. If Active Connections is consistently at or near your pool size, and Connection Wait Time is high, this is your problem.
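  • Fix: First check whether connections are being held too long (slow queries, transactions left open across remote calls), because a bigger pool only hides that. If the pool is simply too small, increase its maximum size. For example, if your HikariCP pool is set to maximumPoolSize=20, increase it to maximumPoolSize=50 and restart your application. Confirm that the database’s own connection limit can accommodate the new size across all application instances.
  • Why it works: A larger pool allows more concurrent requests to acquire database connections, preventing them from getting stuck waiting. It helps only while the database has spare capacity; past that point, extra connections add contention on the database instead.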
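
A rough way to sanity-check a pool size is Little’s law: the average number of connections in use equals the request rate multiplied by the time each request holds a connection. The sketch below applies that formula; the function name, the 25% headroom factor, and the example numbers are assumptions for illustration, not recommendations.

```python
from math import ceil


def connections_needed(requests_per_sec, db_time_ms, headroom=1.25):
    """Estimate pool size from Little's law plus a safety margin.

    requests_per_sec: rate of requests that touch the database
    db_time_ms: average time one request holds a connection, in milliseconds
    headroom: multiplier to absorb bursts (assumed value)
    """
    in_use = requests_per_sec * db_time_ms / 1000
    return ceil(in_use * headroom)


print(connections_needed(200, 50))    # 10 in use on average -> 13 with headroom
print(connections_needed(200, 250))   # 50 in use on average -> 63 with headroom
```

The second line shows why hold time matters as much as pool size: at the same request rate, a fivefold increase in hold time needs roughly five times the connections.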

Cause 3: Slow Downstream Services

Your application relies on another service (an API, a microservice, a third-party integration) that is itself overloaded or experiencing issues. Gatling might show high response times for your endpoint, but the root cause is external.

  • Diagnosis: Use an Application Performance Monitoring (APM) tool (like Datadog, New Relic, Dynatrace) to trace requests across services. Look for the "time spent" in external calls. If a significant portion of your endpoint’s latency is attributed to a call to http://other-service.example.com/api/data, that’s your culprit.
  • Fix: Scale up the downstream service or optimize its performance. If it’s a third-party service, implement aggressive caching or circuit breakers in your application. For example, add a cache with a TTL of 60 seconds for responses from http://other-service.example.com/api/data.
  • Why it works: By reducing the time your application spends waiting for slow external services, you improve its overall response time and capacity. Caching avoids repeated calls to the slow service altogether.
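
The caching idea can be sketched as a small time-to-live cache. This is a minimal, single-threaded illustration under stated assumptions: the class name and its get_or_load method are invented for this example, there is no eviction of unused keys, and production code would normally use a tested caching library.

```python
import time


class TTLCache:
    """Caches loader results per key for ttl_seconds."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the slow call
        value = loader()             # expired or missing: call the slow service
        self._entries[key] = (now + self._ttl, value)
        return value
```

Here loader would wrap the call to the slow service. This only suits responses that can tolerate being up to one TTL out of date.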

Cause 4: Inefficient Database Queries

A poorly optimized SQL query can consume massive amounts of CPU and I/O on the database server, slowing down all requests that use it. This often appears as high response times for specific endpoints in Gatling.

  • Diagnosis: Enable slow query logging on your database. Analyze the logs for queries taking longer than a few hundred milliseconds. Use EXPLAIN (or EXPLAIN ANALYZE) on those queries to understand their execution plan.
  • Fix: Add appropriate indexes to your database tables based on the EXPLAIN output. For example, if a query on the users table is slow and filters on email, add an index: CREATE INDEX idx_users_email ON users (email);. Re-run EXPLAIN afterwards to confirm the planner actually uses it.
  • Why it works: Indexes allow the database to find rows much faster without scanning entire tables, drastically reducing query execution time.
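
The effect of the index on the query plan can be demonstrated with an in-memory SQLite database, used here only as a stand-in for whatever database you run. The table layout and data are made up, and SQLite’s EXPLAIN QUERY PLAN output differs from the EXPLAIN output of PostgreSQL or MySQL, but the scan-versus-index-lookup distinction is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def plan(sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT name FROM users WHERE email = ?"
before = plan(query, ("user42@example.com",))

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query, ("user42@example.com",))

print("before:", before)   # a full scan of users
print("after: ", after)    # a search that uses idx_users_email
```

The exact wording of the plan varies between SQLite versions, so treat the printed text as indicative.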

Cause 5: Network Latency or Bandwidth Limitations

While less common for internal bottlenecks, if your Gatling injector machines are far from your application servers, or if there’s network congestion between them, it can add significant latency.

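  • Diagnosis: Use ping and traceroute from the Gatling injector machines to your application servers. Check network interface statistics (ifconfig or ip -s link; plain ip addr does not show counters) on both sides for errors and dropped packets, and watch per-interface throughput with a tool such as sar -n DEV 1 to spot high utilization.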
  • Fix: Deploy Gatling injectors in the same network/region as your application. Increase network bandwidth if utilization is consistently high.
  • Why it works: Reducing physical distance and ensuring sufficient network capacity minimizes the time packets take to travel, directly impacting overall request latency.
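
Before blaming the network, it helps to estimate how much bandwidth the test should need. The sketch below is simple arithmetic; the function name and the example numbers are assumptions, and it ignores protocol overhead (headers, TLS, TCP), so real utilization will be somewhat higher.

```python
def link_utilization(requests_per_sec, avg_bytes_per_exchange, link_mbit_per_sec):
    """Fraction of link capacity used by request plus response payloads."""
    bits_per_sec = requests_per_sec * avg_bytes_per_exchange * 8
    return bits_per_sec / (link_mbit_per_sec * 1_000_000)


# Hypothetical test: 2,000 req/s, 50 KB per request/response pair, 1 Gbit/s link.
print(f"{link_utilization(2000, 50_000, 1000):.0%}")  # 80%
```

If the estimate is already close to 100%, the injector’s or server’s link is a plausible bottleneck even when CPU and memory look fine.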

Cause 6: Application Threading Issues (Deadlocks, Contention)

Your application’s threads might be getting stuck waiting for each other (deadlock) or spending too much time acquiring locks (contention), preventing requests from being processed.

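  • Diagnosis: Use thread dumps. For Java applications, jstack <pid> (or jcmd <pid> Thread.print) captures one. Take several dumps a few seconds apart and look for threads that stay BLOCKED on the same monitor, and for the "Java-level deadlock" section that jstack prints when it detects a deadlock. Threads in WAITING or TIMED_WAITING are often just idle pool workers, so judge them by their stack traces, not by state alone. APM tools can also help visualize thread contention.
  • Fix: Refactor code to reduce lock scope, use non-blocking I/O where possible, or adjust thread pool sizes. For example, if a specific synchronized block is causing contention, consider breaking down the synchronized operation so that less work happens under the lock, or replacing the shared structure with a concurrent one such as java.util.concurrent.ConcurrentHashMap. java.util.concurrent.locks.ReentrantLock adds options such as tryLock with a timeout, but its fairness policy prevents starvation at the cost of throughput, so it is not a cure for contention.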
  • Why it works: Minimizing the time threads spend waiting for locks or resolving deadlocks allows them to process more requests concurrently.
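
Comparing several dumps by eye gets tedious, so a small script can tally thread states per dump. The sketch below relies on the "java.lang.Thread.State:" lines that HotSpot thread dumps contain; the function names and the sample dump are hand-written for illustration, not captured output.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: BLOCKED (on object monitor)".
STATE_LINE = re.compile(r"^\s*java\.lang\.Thread\.State: (\w+)", re.MULTILINE)


def count_thread_states(dump_text):
    """Return a Counter of thread states found in one thread dump."""
    return Counter(STATE_LINE.findall(dump_text))


def has_deadlock(dump_text):
    """True if the dump contains a deadlock report section."""
    return "Java-level deadlock" in dump_text


# Hand-written sample in the style of a jstack dump (hypothetical threads).
sample_dump = """
"http-exec-1" #31 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"http-exec-2" #32 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"http-exec-3" #33 daemon prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"main" #1 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
"""

print(count_thread_states(sample_dump))
```

A BLOCKED count that stays high across consecutive dumps is the signal worth chasing; a single snapshot proves little.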

Once you’ve addressed these, keep an eye on your Gatling reports. The next thing you’ll likely see is throughput that plateaus as you add load, with response-time percentiles that stay flat up to that point and climb beyond it, indicating you’re hitting the maximum capacity of your system for that specific workload.
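
Finding that capacity point from a stepped load test can be sketched as follows. The input format (pairs of concurrent users and measured requests per second), the function name, the 5% threshold, and the numbers are all assumptions chosen for illustration.

```python
def saturation_point(steps, min_gain=0.05):
    """Find where throughput stops scaling with load.

    steps: list of (concurrent_users, requests_per_sec), sorted by users.
    Returns the first step after which throughput grows by less than
    min_gain (5% by default), or None if it keeps scaling.
    """
    for (users, rps), (_next_users, next_rps) in zip(steps, steps[1:]):
        if next_rps < rps * (1 + min_gain):
            return users, rps
    return None


# Hypothetical results from a test that doubles the user count at each step.
steps = [(50, 480), (100, 950), (200, 1800), (400, 1850), (800, 1840)]
print(saturation_point(steps))  # (200, 1800)
```

In this made-up data, doubling from 200 to 400 users adds under 3% throughput, so roughly 1,800 requests per second is the ceiling for this workload.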
