The JVM is actually a lot faster than you think, and most performance regressions aren’t about raw CPU speed.

Let’s look at a typical CI pipeline that builds and tests a Java application. Imagine this scenario:

# Checkout code
git checkout main
git pull origin main

# Build the application
./mvnw clean package

# Run tests (note: `package` above already runs the test phase unless tests are skipped)
./mvnw test

This looks simple, but what if the package or test phases suddenly take 5 minutes longer than they used to? Where did that time go?

The most common culprit isn’t a new, slow algorithm in your code. It’s usually a subtle shift in how the JVM manages its memory and executes code, leading to increased garbage collection pauses or inefficient JIT compilation.

Common Causes of JVM Performance Regressions in CI

  1. Increased Heap Usage Leading to More Frequent/Longer GC Pauses:

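    • Diagnosis: Monitor heap usage during the CI run. If you can reach the running JVM, sample it with jstat -gcutil <pid> <interval>. Otherwise, enable GC logging through the build environment and analyze the log afterwards: on JDK 8, set JAVA_TOOL_OPTIONS="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log"; on JDK 9 and later, use JAVA_TOOL_OPTIONS="-Xlog:gc*:file=gc.log", because unified logging superseded the older flags. Look for consistently higher Old Gen utilization (the O column) and a rising full GC count (FGC).
    • Fix: Identify the code that’s allocating more objects. Often, this is due to larger data structures, increased logging, or inefficient caching. The durable fix is to optimize those allocation patterns; a quicker stopgap is to raise the maximum heap size (-Xmx). For example, if you see a sudden spike in object creation, you might go from -Xmx4g to -Xmx6g if your CI runner has sufficient memory. With Maven, tests usually run in a forked Surefire JVM, so the setting has to reach that JVM (for example through Surefire’s argLine), not only the Maven process itself.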
    • Why it works: A larger heap means the JVM can hold more objects before needing to run a full garbage collection, reducing the frequency of these potentially long pauses. Optimizing allocation patterns reduces the rate at which garbage is generated, achieving a similar effect.
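When attaching jstat to a CI job is awkward, the same counters can be read from inside the JVM. The following is a minimal sketch, assuming you can run a small harness in the build; `GcProbe` and its stand-in workload are hypothetical names for illustration, while `GarbageCollectorMXBean` is the standard `java.lang.management` API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: sums GC activity across all collectors in this JVM.
class GcProbe {
    /** Total collections so far; collectors reporting -1 (undefined) count as 0. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();

        // Stand-in workload: allocate many short-lived buffers to create garbage.
        List<byte[]> keep = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            byte[] chunk = new byte[16 * 1024];
            if (i % 1_000 == 0) {
                keep.add(chunk); // retain a few so the allocations are not trivially dead
            }
        }

        System.out.println("collections during workload: " + (totalCollections() - countBefore));
        System.out.println("gc millis during workload:   " + (totalCollectionMillis() - millisBefore));
        System.out.println("retained chunks: " + keep.size());
    }
}
```

Logging these two deltas per build gives a rough trend line; a jump between commits suggests the regression is allocation-related and worth a closer look in the GC log.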
  2. JIT Compiler Deoptimization/Recompilation:

    • Diagnosis: The JVM’s Just-In-Time (JIT) compiler optimizes frequently executed code. If the assumptions it made about code behavior change (e.g., due to new code paths being hit more often, or changes in data types), it may deoptimize and recompile. This shows up as increased CPU usage during the test phase. GC logs will not reveal it; instead, look at the output of -XX:+PrintCompilation (deoptimized methods are reported as "made not entrant") or record a JVM Flight Recorder session (jcmd <pid> JFR.start name=jit-recompilation) and inspect its compilation events.
    • Fix: This is harder to directly "fix" without code changes. However, you can influence JIT behavior. For instance, ensuring consistent execution profiles across runs can help. Sometimes, increasing the number of JIT compiler threads (-XX:CICompilerCount=<n>) can help it keep up, but this treats a symptom, not the root cause.
    • Why it works: With more compiler threads, the JIT can work through its recompilation queue faster, so less time is spent running in interpreted mode while methods wait to be recompiled.
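A coarse first signal for JIT trouble is the total time the JVM has spent compiling, which includes recompilation after deoptimization. Here is a small sketch under that assumption; `JitProbe` and `hotLoop` are hypothetical names, and `CompilationMXBean` is the standard management API (it may be absent or unsupported on some JVMs, which the code accounts for).

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports accumulated JIT compilation time.
class JitProbe {
    /** Accumulated JIT compilation time in ms, or -1 if the JVM does not report it. */
    static long compilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    /** Stand-in hot loop so the JIT has something to compile. */
    static long hotLoop(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += (i % 7) * 3L;
        }
        return sum;
    }

    public static void main(String[] args) {
        long before = compilationMillis();
        long result = hotLoop(50_000_000);
        long after = compilationMillis();

        System.out.println("loop result: " + result);
        if (before < 0 || after < 0) {
            System.out.println("compilation time monitoring is not supported on this JVM");
        } else {
            System.out.println("jit millis during loop: " + (after - before));
        }
    }
}
```

This number cannot tell you which methods were recompiled; it only indicates whether compilation time moved between builds, at which point -XX:+PrintCompilation or a JFR recording is the better tool.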
  3. Increased Thread Contention:

    • Diagnosis: If your tests or build process use multiple threads, increased contention for shared resources (locks, data structures) can cause threads to block, slowing down execution. Monitor thread states using jstack <pid> or by analyzing thread dumps. Look for threads stuck in BLOCKED or WAITING states for extended periods.
    • Fix: Refactor code to reduce shared mutable state, use more granular locks, or employ concurrent data structures (like ConcurrentHashMap instead of a HashMap wrapped in Collections.synchronizedMap). For example, replacing a synchronized ArrayList with CopyOnWriteArrayList can drastically reduce contention in read-heavy scenarios, since reads take no lock; writes copy the whole array, so it is a poor fit for write-heavy lists.
    • Why it works: Reducing contention means threads spend less time waiting for each other and more time doing actual work.
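To make the ConcurrentHashMap suggestion concrete, here is a minimal sketch of a shared counter updated from several threads without a map-wide lock. `HitCounter`, `runDemo`, and the "login" key are hypothetical names chosen for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example: per-key counters shared between worker threads.
class HitCounter {
    private final ConcurrentHashMap<String, Long> hits = new ConcurrentHashMap<>();

    /** Atomically increments the counter for one key; no external lock is needed. */
    void record(String key) {
        hits.merge(key, 1L, Long::sum);
    }

    long count(String key) {
        return hits.getOrDefault(key, 0L);
    }

    /** Runs `threads` workers that each record `perThread` hits, then returns the total. */
    static long runDemo(int threads, int perThread) {
        HitCounter counter = new HitCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.record("login");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.count("login");
    }

    public static void main(String[] args) {
        System.out.println("total hits: " + runDemo(4, 1_000));
    }
}
```

Updates to the same key still serialize briefly, but updates to different keys no longer wait on one shared monitor, which is where a synchronized map tends to lose time.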
  4. Classloading Overhead:

    • Diagnosis: If new dependencies are added, or if the build process involves dynamically loading classes, this can add overhead. While less common for a stable CI, a sudden increase in classloading time might point to new libraries being introduced or complex plugin architectures. You can use JVM Flight Recorder (JFR) to profile class loading.
    • Fix: Ensure only necessary dependencies are loaded. For build-time issues, review your dependency management. For runtime issues in tests, ensure static initializers are efficient.
    • Why it works: Faster classloading means the JVM can get to executing your code sooner.
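Before reaching for a full JFR recording, you can check whether the number of loaded classes changed between builds. This is a rough sketch; `ClassLoadProbe` is a hypothetical name, and `ClassLoadingMXBean` is the standard management API.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reads class loading counters from the running JVM.
class ClassLoadProbe {
    /** Classes loaded since JVM start, including any that were later unloaded. */
    static long totalLoaded() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getTotalLoadedClassCount();
    }

    /** Classes currently loaded. */
    static int currentlyLoaded() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getLoadedClassCount();
    }

    public static void main(String[] args) {
        long before = totalLoaded();

        // Stand-in for the code under test: touching a class may trigger loading.
        java.util.zip.CRC32 checksum = new java.util.zip.CRC32();
        checksum.update(42);

        System.out.println("classes loaded by this step: " + (totalLoaded() - before));
        System.out.println("classes currently loaded:    " + currentlyLoaded());
    }
}
```

Printing the total at the end of a test run and comparing it across builds is a cheap way to spot a new dependency pulling in far more classes than expected.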
  5. Garbage Collector Choice or Tuning:

    • Diagnosis: The default garbage collector might not be optimal for your CI environment’s characteristics (e.g., short-lived or long-lived objects, pause time tolerance). A JVM upgrade can also change the default GC; JDK 9, for example, made G1 the default in place of ParallelGC. Check your JVM version and GC logs for frequent Stop-The-World pauses.
    • Fix: Explicitly set a GC. For CI environments where throughput is key and longer pauses are acceptable, ParallelGC (-XX:+UseParallelGC) might be faster overall than G1GC (-XX:+UseG1GC). If pause times are critical, G1GC or ZGC/Shenandoah (if available and compatible with your JVM version) are better, but might have higher CPU overhead. For example, switching from the default G1GC to ParallelGC might look like adding -XX:+UseParallelGC to your JVM options.
    • Why it works: Different GCs have different trade-offs. ParallelGC is optimized for throughput (getting work done faster) by using multiple threads for collection but can result in longer pauses. G1GC aims for predictable pause times. Choosing the right one for your workload can improve overall speed.
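Because the default can shift with the JVM version, it helps to have the build print which collector is actually in use. A minimal sketch follows; `GcReport` is a hypothetical name, and the exact collector names are JVM-specific (HotSpot typically reports names such as "G1 Young Generation" or "PS Scavenge").

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which collectors and GC flags this JVM is running with.
class GcReport {
    /** Names of the active collectors as reported by the JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** JVM arguments that look like an explicit collector choice, e.g. -XX:+UseParallelGC. */
    static List<String> explicitGcFlags() {
        List<String> flags = new ArrayList<>();
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-XX:+Use") && arg.endsWith("GC")) {
                flags.add(arg);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println("java.version:      " + System.getProperty("java.version"));
        System.out.println("active collectors: " + collectorNames());
        System.out.println("explicit GC flags: " + explicitGcFlags());
    }
}
```

If the explicit flag list is empty, the JVM picked its own default, which is the situation where a version or runner change can silently alter GC behavior.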
  6. I/O Bound Operations in Tests:

    • Diagnosis: Some tests might perform disk I/O (e.g., writing temporary files, reading configuration). If these operations become slower (e.g., due to disk contention on the CI runner, or larger datasets being written), they can significantly slow down the test phase. Monitor disk I/O metrics on your CI agent.
    • Fix: Optimize I/O operations. Use in-memory solutions where possible (e.g., H2 in-memory database for tests), ensure temporary files are cleaned up promptly, or investigate faster storage for your CI agents.
    • Why it works: Reducing the time spent waiting for disk operations directly speeds up the tests that rely on them.
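If you suspect the CI agent’s disk, a quick timing probe run at the start of the job can confirm it before you dig into individual tests. The sketch below is an assumption-laden illustration: `IoProbe` is a hypothetical name, and because the operating system’s page cache can absorb small writes, the result is only a coarse indicator.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: times a write-then-read round trip through a temp file.
class IoProbe {
    /** Writes and reads back `bytes` bytes; returns elapsed nanoseconds. */
    static long timeTempFileRoundTrip(int bytes) {
        byte[] payload = new byte[bytes];
        Path file = null;
        long start = System.nanoTime();
        try {
            file = Files.createTempFile("io-probe", ".bin");
            Files.write(file, payload);
            byte[] back = Files.readAllBytes(file);
            if (back.length != bytes) {
                throw new IllegalStateException("short read: " + back.length);
            }
            return System.nanoTime() - start;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Clean up promptly so the probe itself does not add disk pressure.
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // Best effort: a leftover temp file is not worth failing the build.
                }
            }
        }
    }

    public static void main(String[] args) {
        long nanos = timeTempFileRoundTrip(8 * 1024 * 1024);
        System.out.println("8 MiB temp file round trip: " + (nanos / 1_000_000) + " ms");
    }
}
```

Comparing this figure across runners or over time helps separate "the tests got slower" from "the disk got slower".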

The next error you’ll likely hit after fixing performance regressions is related to test timeouts, as the system will now correctly identify tests that are genuinely too slow to be useful, rather than just being bogged down by JVM overhead.
