k6 Baseline Tests: Measure Normal System Performance (2026)

A k6 baseline test isn’t just about seeing what your system can do; it’s about understanding what it normally does under light, non-disruptive load.

Here’s a k6 script that measures a simple API endpoint with 10 virtual users, each sending roughly one request per second, for 30 seconds:

// baseline.js
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 10, // Virtual Users (concurrent users)
  duration: '30s', // Test duration
  thresholds: {
    http_req_failed: ['rate<0.01'], // Fewer than 1% of requests may fail
    http_req_duration: ['p(95)<500'], // 95th percentile must stay below 500ms
  },
};

export default function () {
  http.get('http://your-api.example.com/status');
  sleep(1); // Wait 1 second before the next iteration
}

To run this, you’d execute: k6 run baseline.js

This script defines vus (virtual users) and duration. The thresholds section is crucial: it defines acceptable performance limits. If 1% or more of requests fail, or if the 95th-percentile request duration reaches 500ms (in other words, more than about 5% of requests take 500ms or longer), k6 marks the run as failed and exits with a non-zero status. The http.get simulates a user request, and sleep(1) keeps each virtual user from hammering the server: every VU pauses one second between iterations, mimicking typical user behavior.
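To make the p(95)<500 rule concrete, here is a minimal sketch in plain JavaScript (not k6 code). It uses a simplified nearest-rank percentile, which is an assumption for illustration and may differ slightly from how k6 computes percentiles; the function names percentile, passesP95Threshold, and sampleRun are hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// `p` percent of all samples are less than or equal to it.
// Simplified for illustration; k6's own calculation may differ slightly.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error('percentile() needs at least one sample');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Mirrors the intent of the 'p(95)<500' threshold expression.
function passesP95Threshold(durationsMs, limitMs) {
  return percentile(durationsMs, 95) < limitMs;
}

// 100 requests: `slowCount` of them take 800 ms, the rest take 120 ms.
function sampleRun(slowCount) {
  return Array.from({ length: 100 }, (_, i) => (i < slowCount ? 800 : 120));
}

console.log(passesP95Threshold(sampleRun(5), 500)); // true: 5% slow still passes
console.log(passesP95Threshold(sampleRun(6), 500)); // false: 6% slow fails
```

The takeaway is that a p(95) threshold tolerates a small tail of slow requests, so it says nothing about the slowest 5%.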

The core problem baseline tests solve is the "it works on my machine" syndrome, or more broadly, the lack of a clear, objective definition of "normal." Without a baseline, you don’t know if a sudden performance degradation is a catastrophic failure or just a slightly busier-than-usual Tuesday. Baseline tests establish that objective reality. They are your system’s "resting heart rate."

Internally, k6 orchestrates these virtual users. Each VU runs the default function independently and concurrently. k6 collects metrics such as request duration, error rate, and data transfer size for every request made by every VU. It then aggregates these into percentiles, averages, and rates, giving you a statistical overview of performance. k6 checks the thresholds against these aggregated metrics, and the final evaluation at the end of the run decides whether the test passes.
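One way to keep these aggregates as a reusable baseline is k6’s handleSummary() hook, which receives the end-of-test metrics. The sketch below is plain JavaScript that picks out a few headline values. It assumes the summary object has the shape data.metrics.<name>.values with keys such as 'avg', 'p(95)', and 'rate'; the function name extractBaseline, the file name baseline.json, and the sample numbers are hypothetical.

```javascript
// Picks the headline baseline numbers out of a k6 end-of-test summary object.
// Assumes the shape data.metrics.<metric>.values; missing values become null.
function extractBaseline(data) {
  const metrics = (data && data.metrics) || {};
  const duration = (metrics.http_req_duration || {}).values || {};
  const failed = (metrics.http_req_failed || {}).values || {};
  const pick = (values, key) => (key in values ? values[key] : null);
  return {
    avgMs: pick(duration, 'avg'),
    p95Ms: pick(duration, 'p(95)'),
    errorRate: pick(failed, 'rate'),
  };
}

// Inside a k6 script, the hook could be wired up like this:
//
//   export function handleSummary(data) {
//     return { 'baseline.json': JSON.stringify(extractBaseline(data), null, 2) };
//   }

// Example with a hand-written summary object (made-up numbers).
const exampleSummary = {
  metrics: {
    http_req_duration: { values: { avg: 120.5, 'p(95)': 310.2 } },
    http_req_failed: { values: { rate: 0.002 } },
  },
};
console.log(extractBaseline(exampleSummary));
```

Storing only a handful of numbers keeps later comparisons simple. Be aware that defining handleSummary() replaces k6’s default end-of-test summary unless the returned object also includes a 'stdout' entry.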

The sleep() duration is a critical lever. If you set sleep(0), each VU fires requests back to back as fast as the server responds, and you’re running a load test, not a baseline. For a baseline, you want to mimic realistic, paced user activity. If your API typically sees one request per user every 5 seconds on average, your sleep() should reflect that, for example sleep(5). If a single user session consists of several distinct actions, your script would chain http calls with a sleep after each one.
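A chained session might look like the following sketch. The paths /items and /items/42 and the think times are hypothetical, as is the runSession helper. The HTTP call and the pause are passed in as functions, an illustrative design choice so the same flow can be dry-run outside k6.

```javascript
// Hypothetical steps of one user session: a path and the think time after it.
const sessionSteps = [
  { path: '/status', thinkSeconds: 1 },
  { path: '/items', thinkSeconds: 5 },
  { path: '/items/42', thinkSeconds: 3 },
];

// Runs one session. `get` and `pause` are injected so the same flow can be
// driven by k6 (http.get and sleep) or by simple fakes outside k6.
function runSession(baseUrl, get, pause) {
  for (const step of sessionSteps) {
    get(baseUrl + step.path);
    pause(step.thinkSeconds);
  }
}

// In a k6 script:
//
//   import http from 'k6/http';
//   import { sleep } from 'k6';
//   export default function () {
//     runSession('http://your-api.example.com', (url) => http.get(url), (s) => sleep(s));
//   }

// Dry run with fakes that only record what would happen.
const sessionLog = [];
runSession(
  'http://your-api.example.com',
  (url) => sessionLog.push('GET ' + url),
  (seconds) => sessionLog.push('sleep ' + seconds)
);
console.log(sessionLog.join('\n'));
```

Each VU iteration then represents one full session of three requests spread over roughly nine seconds, rather than a single isolated request.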

A common mistake is to make the vus count too high, turning a baseline test into a load test. For example, if your system’s average load is 100 concurrent users, and you run a baseline test with vus: 500, you’re not measuring normal; you’re measuring how your system copes with overload. The goal is to represent the typical concurrent user count, not the peak or maximum capacity.
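When sizing vus, it can help to estimate the request rate a script will produce. As a rough approximation, a VU that makes one request per iteration completes an iteration every (response time + sleep) seconds, so the overall rate is about vus divided by that sum. The helper names below are hypothetical, and the numbers are illustrative rather than measured.

```javascript
// Approximate requests per second produced by a script that makes one
// request per iteration and then sleeps.
function estimatedRequestRate(vus, sleepSeconds, avgResponseSeconds) {
  return vus / (sleepSeconds + avgResponseSeconds);
}

// Inverse: VUs needed to approximate a target request rate, rounded up.
function vusForRate(targetRps, sleepSeconds, avgResponseSeconds) {
  return Math.ceil(targetRps * (sleepSeconds + avgResponseSeconds));
}

// 10 VUs, sleep(1), responses averaging 250 ms.
console.log(estimatedRequestRate(10, 1, 0.25)); // 8

// Target of 20 requests per second with sleep(5) and negligible response time.
console.log(vusForRate(20, 5, 0)); // 100
```

Comparing the estimate against the traffic your system normally sees is a quick check that the chosen vus and sleep() values describe a typical day and not an overload.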

Another subtle point is the distribution of sleep. All 10 VUs in the example above start together and each calls sleep(1), so they tend to hit the server in near-simultaneous bursts about once per second. In reality, user actions are more staggered. To simulate this more realistically, you might use sleep(Math.random() * 5 + 1); which waits a random time between 1 and 6 seconds, making the load pattern less predictable and more representative of real-world user behavior. Keep in mind that the longer average wait (about 3.5 seconds) also lowers the overall request rate compared with sleep(1).
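The randomized wait can be wrapped in a small helper so the range is explicit. This is a minimal sketch; the name thinkTime and the optional random parameter are hypothetical conveniences, not part of k6.

```javascript
// Think time in seconds, uniformly distributed in [minSeconds, maxSeconds).
// `random` defaults to Math.random and can be replaced for deterministic checks.
function thinkTime(minSeconds, maxSeconds, random = Math.random) {
  return minSeconds + random() * (maxSeconds - minSeconds);
}

// In a k6 script: sleep(thinkTime(1, 6)); // same range as Math.random() * 5 + 1

// Quick check of the range and the average over many draws.
const thinkSamples = Array.from({ length: 1000 }, () => thinkTime(1, 6));
const thinkMean = thinkSamples.reduce((sum, s) => sum + s, 0) / thinkSamples.length;
console.log(Math.min(...thinkSamples) >= 1, Math.max(...thinkSamples) < 6); // true true
console.log(thinkMean.toFixed(1)); // close to 3.5, varies per run
```

Naming the bounds makes it easier to tune the pacing later without rereading the arithmetic.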

The next concept you’ll likely explore is how to use these baseline metrics to inform more aggressive load testing scenarios, or how to integrate them into your CI/CD pipeline to catch regressions automatically.