Chaos engineering in k6 is about deliberately breaking things to see if your system can handle it, especially when it’s already under stress.
Here’s a k6 test script that simulates user login traffic and approximates added network latency with a random pause:
    import http from 'k6/http';
    import { sleep } from 'k6';
    import { randomIntBetween } from 'https://jslib.k6.io/k6-utils/1.2.0/index.js';

    export const options = {
      stages: [
        { duration: '1m', target: 100 }, // Ramp up to 100 users over 1 minute
        { duration: '2m', target: 100 }, // Stay at 100 users for 2 minutes
        { duration: '1m', target: 0 }, // Ramp down to 0 users over 1 minute
      ],
      thresholds: {
        http_req_failed: ['rate<0.01'], // Fewer than 1% of requests may fail
        http_req_duration: ['p(95)<500'], // 95% of requests must finish in under 500ms
      },
    };

    export default function () {
      const username = `user_${__VU}_${__ITER}`;
      const password = 'password123';

      // An object body is sent as form-encoded data
      const res = http.post('http://your-app.com/login', {
        username: username,
        password: password,
      });

      // Stand-in for network latency: a random 50-500ms pause after the request.
      // This lengthens the iteration but does not change http_req_duration.
      // In a real chaos experiment, latency would be injected by a tool like Toxiproxy.
      const latency = randomIntBetween(50, 500);
      sleep(latency / 1000); // k6 sleep takes seconds

      // Log failed logins for debugging
      if (res.status !== 200) {
        console.log(`Login failed: Status ${res.status}, Body: ${res.body}`);
      }

      sleep(1); // Simulate think time between requests
    }
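This script ramps up to 100 virtual users logging in. The random 50ms to 500ms pause after each request only stands in for latency: it lengthens each iteration but is not part of http_req_duration, so real latency still has to be injected by an external tool. The thresholds fail the test if 1% or more of requests fail, or if the 95th percentile of request duration reaches 500ms.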
The core problem k6 chaos engineering solves is understanding how your system behaves when things go wrong, not just when everything is perfect. It’s about proactively identifying weaknesses before they impact real users. You’re not just testing performance; you’re testing resilience.
Internally, k6 runs these scripts by launching multiple virtual users (VUs) concurrently. Each VU executes the default function in a loop for as long as the load profile keeps it active; __VU and __ITER expose the current VU’s ID and iteration count. The options object controls the load profile (how many users, for how long) and defines success criteria (thresholds).
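To make the load profile concrete, here is a minimal sketch in plain JavaScript (runnable in Node, not a k6 script) that estimates how many VUs a stages array asks for at a given moment. It assumes linear ramping between stage targets starting from 0 VUs; the helper names parseDurationSeconds and targetVUsAt are hypothetical, and k6’s own scheduling may round differently.

```javascript
// Sketch: approximate the VU target at a given moment for a stages-based profile.
// Assumes linear ramping between stage targets, starting from 0 VUs.
function parseDurationSeconds(text) {
  // Supports only simple forms such as '30s', '1m', '2h' (a simplification).
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`Unsupported duration: ${text}`);
  const unit = { s: 1, m: 60, h: 3600 }[match[2]];
  return Number(match[1]) * unit;
}

function targetVUsAt(stages, elapsedSeconds, startVUs = 0) {
  let stageStart = 0;
  let previousTarget = startVUs;
  for (const stage of stages) {
    const length = parseDurationSeconds(stage.duration);
    const stageEnd = stageStart + length;
    if (elapsedSeconds <= stageEnd) {
      // Fraction of the current stage that has elapsed (zero-length stages jump straight to the target)
      const progress = length === 0 ? 1 : (elapsedSeconds - stageStart) / length;
      return Math.round(previousTarget + (stage.target - previousTarget) * progress);
    }
    stageStart = stageEnd;
    previousTarget = stage.target;
  }
  return previousTarget; // past the last stage
}

const stages = [
  { duration: '1m', target: 100 },
  { duration: '2m', target: 100 },
  { duration: '1m', target: 0 },
];
console.log(targetVUsAt(stages, 30));  // halfway through the ramp-up
console.log(targetVUsAt(stages, 120)); // steady state
console.log(targetVUsAt(stages, 210)); // halfway through the ramp-down
```

Knowing roughly how many VUs are active at a given second helps when you later line up a fault injection with a specific phase of the test.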
Here’s how you’d typically set up a chaos experiment:
- Define your hypothesis: "Our login service will remain available and responsive even if 20% of requests to the authentication microservice experience 1-second network latency."
- Instrument your system: Ensure you have monitoring in place (Prometheus, Grafana, APM tools) to observe system health, error rates, and latency.
- Choose your chaos tool: For network issues, tools like Toxiproxy, Chaos Mesh, or LitmusChaos are common. For resource exhaustion, you might use tools that manipulate CPU/memory.
- Write your k6 script: As shown above, simulate the "normal" user traffic that your system expects.
- Orchestrate the chaos:
  - Start your k6 test.
  - During the test, use your chaos tool to inject the failure (e.g., "add 1000ms latency to all requests going to auth.your-app.com").
  - Observe your monitoring dashboards and k6 output.
  - Did the system stay up? Did k6 report failures above the threshold? Did response times degrade gracefully or crash?
- Analyze and remediate: If the system failed, investigate why. If k6 reported errors, understand the root cause. Fix the issue, then repeat the experiment.
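As a sketch of the injection step, the snippet below builds the HTTP requests you could send to Toxiproxy’s API to add and later remove a latency toxic. It is plain JavaScript that only constructs the requests. The proxy name auth_service, the toxic name auth_latency, and the helper functions are hypothetical, and it assumes Toxiproxy is listening on its default API address (localhost:8474) with a proxy already created in front of the authentication service.

```javascript
// Sketch: build requests for Toxiproxy's HTTP API (assumed at its default address).
// 'auth_service' and 'auth_latency' are hypothetical names for illustration.
const TOXIPROXY_URL = 'http://localhost:8474';

function buildLatencyToxic(name, latencyMs, toxicity = 1.0, jitterMs = 0) {
  if (toxicity < 0 || toxicity > 1) {
    throw new RangeError('toxicity must be between 0 and 1');
  }
  return {
    name,
    type: 'latency',
    stream: 'downstream',
    toxicity, // probability that the toxic is applied to a connection
    attributes: { latency: latencyMs, jitter: jitterMs },
  };
}

function addToxicRequest(proxyName, toxic) {
  return {
    method: 'POST',
    url: `${TOXIPROXY_URL}/proxies/${proxyName}/toxics`,
    body: JSON.stringify(toxic),
  };
}

function removeToxicRequest(proxyName, toxicName) {
  return {
    method: 'DELETE',
    url: `${TOXIPROXY_URL}/proxies/${proxyName}/toxics/${toxicName}`,
    body: null,
  };
}

// 1-second latency on roughly 20% of connections, matching the example hypothesis
const toxic = buildLatencyToxic('auth_latency', 1000, 0.2);
const inject = addToxicRequest('auth_service', toxic);
const cleanup = removeToxicRequest('auth_service', toxic.name);
console.log(inject.method, inject.url, inject.body);
console.log(cleanup.method, cleanup.url);
```

Any HTTP client can send these requests while the k6 test is running. Note that Toxiproxy applies toxicity per connection rather than per request, so a value of 0.2 only approximates "20% of requests" when connections are reused.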
The power of this approach is seeing the cascading effects of a failure. A single slow database query, if unhandled, can lead to connection pool exhaustion, which then causes application servers to fail requests, leading to high error rates in k6, and ultimately, a complete outage. k6 helps you quantify this degradation.
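A back-of-the-envelope calculation shows why connection pools are often the first thing to give way. The sketch below applies Little’s law (connections in use = arrival rate × time each connection is held). All numbers are hypothetical: 80 requests per second, a 20-connection pool, a 50ms query that slows to 1050ms once 1 second of latency is injected. It also assumes the arrival rate stays constant, which a fixed number of looping VUs will not strictly do, since slower responses reduce how often each VU sends a request.

```javascript
// Sketch: Little's law applied to a connection pool (all figures are hypothetical).
function connectionsNeeded(requestsPerSecond, holdTimeMs) {
  // Average number of connections in use = arrival rate x hold time
  return (requestsPerSecond * holdTimeMs) / 1000;
}

function poolStatus(requestsPerSecond, holdTimeMs, poolSize) {
  const needed = connectionsNeeded(requestsPerSecond, holdTimeMs);
  return { needed, poolSize, exhausted: needed > poolSize };
}

const normal = poolStatus(80, 50, 20);     // healthy query time
const degraded = poolStatus(80, 1050, 20); // same load with 1000ms injected
console.log(normal);
console.log(degraded);
```

Under these assumptions the same traffic goes from needing a handful of connections to needing several times the pool size, which is the point where requests start queuing or failing.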
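A crucial aspect often missed is correlating the k6 metrics with the actual injected chaos. When latency is injected on the path between k6 and your application, http_req_duration rises, because the observed duration is roughly your application’s processing time plus the injected latency. To understand your application’s own performance under duress, subtract the known injected delay from the observed duration. When the fault sits deeper in the stack, retries and timeouts can make the observed effect larger or smaller than the injected value, so treat the subtraction as an estimate. Either way, this requires recording when and how much latency was injected, for example by logging fault start and stop times from your chaos tool and lining them up with the k6 timeline.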
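The subtraction can be sketched as a small post-processing step over exported results. This is plain JavaScript with made-up sample data; the field names observedMs and injectedMs are hypothetical, and it assumes each request’s injected delay is known and was applied exactly once on the path between k6 and the application.

```javascript
// Sketch: estimate application time by removing a known injected delay.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank percentile; k6's own calculation may interpolate differently.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function applicationTimes(samples) {
  // Clamp at 0 in case the recorded injection exceeds the observed duration
  return samples.map(({ observedMs, injectedMs }) => Math.max(observedMs - injectedMs, 0));
}

// Made-up samples: three requests hit the 1000ms fault, two did not
const samples = [
  { observedMs: 1120, injectedMs: 1000 },
  { observedMs: 140, injectedMs: 0 },
  { observedMs: 1310, injectedMs: 1000 },
  { observedMs: 95, injectedMs: 0 },
  { observedMs: 1180, injectedMs: 1000 },
];

const adjusted = applicationTimes(samples);
console.log(adjusted);
console.log(percentile(samples.map((s) => s.observedMs), 95)); // observed p95
console.log(percentile(adjusted, 95)); // estimated application p95
```

Comparing the two percentiles separates "the fault made requests slow" from "the application itself got slower under the fault", which is the part you can actually fix.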
The next step after mastering basic chaos injection is to explore more complex scenarios like injecting random errors, simulating disk I/O delays, or even terminating critical pods unexpectedly.