The JVM’s Garbage Collector (GC) is aggressively collecting, causing long pauses that directly spike your application’s latency.
Here’s what’s likely happening and how to fix it:
1. Heap Size Too Small (Most Common)
- Diagnosis: Monitor your heap usage over time. If the Old Generation is consistently close to full before a GC cycle, the heap is too small. Use `jstat -gcutil <pid> <interval>` and watch the `O` column (Old Gen utilization as a percentage).
- Fix: Increase the maximum heap size (`-Xmx`) and initial heap size (`-Xms`). For example, if you have 16GB RAM and your app needs about 8GB, set `-Xms8g -Xmx8g`.
- Why it works: A larger heap gives the GC more breathing room, meaning it doesn’t have to run as frequently. Less frequent GCs mean fewer pauses.
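To watch the same signal from inside the application, a minimal sketch using the standard `java.lang.management` API might look like the following. The class and method names (`HeapUsageCheck`, `usedPercent`) are illustrative, not part of any library.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapUsageCheck {
    // Percentage of the maximum heap currently in use; -1 if the maximum is undefined.
    static double usedPercent(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return 100.0 * usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        // getMax() reflects -Xmx (or the JVM's default maximum if -Xmx is not set).
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MiB, max=%d MiB, used=%.1f%%%n",
                heap.getUsed() >> 20, heap.getMax() >> 20,
                usedPercent(heap.getUsed(), heap.getMax()));
    }
}
```

A single reading says little; sampling this periodically and looking at the value just after a collection gives a rough idea of how much of the heap is live data.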
2. Excessive Object Allocation
- Diagnosis: High allocation rates lead to frequent GCs. Use a profiler like YourKit or JProfiler to find the hottest allocation sites. `jmap -histo:live <pid>` can also help: it shows which classes currently occupy the most heap, but it forces a full GC first and reports live objects rather than allocation rate.
- Fix: Optimize your code to reduce object creation. This might involve reusing objects (e.g., using `StringBuilder` instead of string concatenation in loops), using primitive types where possible, or employing object pooling. For instance, replace `result += item;` inside a loop with a single `StringBuilder sb = new StringBuilder();` created before the loop and `sb.append(item);` inside it. (A constant expression such as `"a" + "b" + "c"` is folded by the compiler and allocates nothing extra.)
- Why it works: Fewer objects created means less work for the GC, leading to less frequent and shorter GC cycles.
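As a small illustration of the `StringBuilder` advice, the sketch below contrasts the two approaches. The class and method names are made up for the example.

```java
import java.util.List;

class JoinExample {
    // Allocation-heavy: every += allocates a new String holding the whole result so far.
    static String joinWithConcat(List<String> parts) {
        String result = "";
        for (String p : parts) {
            result += p;
        }
        return result;
    }

    // One builder is created before the loop and reused for every append.
    static String joinWithBuilder(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> parts = List.of("a", "b", "c");
        System.out.println(joinWithConcat(parts));
        System.out.println(joinWithBuilder(parts));
    }
}
```

Both methods return the same string; the difference is the number of intermediate objects, which only matters when the loop is hot or the list is long.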
3. Inefficient GC Algorithm Chosen
- Diagnosis: The default GC might not be suitable for your workload. Observe GC pause times. If they are consistently above your latency budget, the current GC is likely not optimal. `jstat -gcutil <pid> <interval>` shows collection counts and cumulative GC time (`YGC`, `YGCT`, `FGC`, `FGCT`, `GCT`); for per-event pause durations, enable GC logging with `-Xlog:gc*` (Java 9+) or `-XX:+PrintGCDetails` (Java 8).
- Fix: Experiment with different GCs. For low-pause applications, G1 GC (`-XX:+UseG1GC`) is often a good choice, and it is already the default on most server-class machines since Java 9. For very high throughput, Parallel GC (`-XX:+UseParallelGC`) can be faster but with longer pauses. For extremely low latency needs, ZGC (`-XX:+UseZGC`) or Shenandoah GC (`-XX:+UseShenandoahGC`) might be options, though they have their own trade-offs. Start with G1: `-XX:+UseG1GC -XX:MaxGCPauseMillis=200` (200 ms is G1’s default pause goal; lower it only if your latency target requires it).
- Why it works: Different GCs have different strategies for reclaiming memory. G1 aims for predictable pause times, while Parallel GC prioritizes throughput. Choosing the right one matches your application’s needs.
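To confirm which collectors are actually active and how much time they have consumed, a minimal sketch with the standard `GarbageCollectorMXBean` API could look like this. The class and method names (`GcStats`, `summarize`) are illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcStats {
    // One line per collector: its name, how often it ran, and its accumulated time.
    static List<String> summarize() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: collections=%d, totalTimeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        summarize().forEach(System.out::println);
    }
}
```

The collector names in the output change with the GC flag in use, which makes this a quick sanity check after switching collectors. Note that the reported time is accumulated collection time, which for concurrent collectors is not the same as pause time.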
4. Too Many Short-Lived Objects in Old Generation
- Diagnosis: This is a symptom of premature promotion. Objects that should have been garbage collected in the Young Generation are surviving and being promoted to the Old Generation, which fills up and forces Full GCs. Look for rising Old Gen usage (`OU`) followed by an increase in the Full GC count (`FGC`) in `jstat -gc <pid> <interval>`.
- Fix: Tune the Young Generation size. If objects are living longer than expected but are still short-lived, increasing the Young Gen size (with `-Xmn`, or by letting the JVM calculate it from `-XX:NewRatio=N`) can help. For example, `-XX:NewRatio=3` means the Old Gen is 3 times larger than the Young Gen, so a lower ratio gives the Young Gen a larger share of the heap.
- Why it works: A larger Young Generation allows more objects to be collected there before they are promoted, reducing the load on the Old Generation and the need for Full GCs.
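The arithmetic behind `-XX:NewRatio` is easy to get backwards, so here is a small worked sketch. It assumes the simple relationship young = heap / (NewRatio + 1); the real JVM also aligns region sizes, and adaptive collectors such as G1 may resize the generations at runtime. The names are illustrative.

```java
class NewRatioMath {
    // With NewRatio = N the old generation is N times the young generation,
    // so the young generation gets 1 / (N + 1) of the heap.
    static long youngGenBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heap = 8L << 30; // 8 GiB heap, as in -Xmx8g
        for (int ratio = 1; ratio <= 3; ratio++) {
            long young = youngGenBytes(heap, ratio);
            System.out.printf("NewRatio=%d -> young=%d MiB, old=%d MiB%n",
                    ratio, young >> 20, (heap - young) >> 20);
        }
    }
}
```

For an 8 GiB heap, `NewRatio=3` leaves roughly 2 GiB for the Young Generation, while `NewRatio=1` splits the heap evenly.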
5. Excessive Finalizers
- Diagnosis: Objects with `finalize()` methods are more expensive to collect. They need at least one extra GC cycle, because the finalizer has to run before the memory can be reclaimed. Use a profiler to detect objects with finalizers.
- Fix: Avoid using `finalize()` (deprecated since Java 9). Instead, use `try-with-resources` for `AutoCloseable` objects or explicit `close()` methods. If you must clean up resources as a safety net, use `java.lang.ref.Cleaner` (Java 9+), which is more efficient.
- Why it works: Finalizers add an extra step to the GC process. Eliminating them allows objects to be reclaimed more directly.
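A minimal sketch of the `Cleaner` plus `AutoCloseable` pattern is shown below. `NativeResource`, `State`, and the `RELEASED` counter are hypothetical names used only to make the example observable; a real class would release an actual handle in `run()`.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

class NativeResource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Counts cleanups so the example can show the action runs only once.
    static final AtomicInteger RELEASED = new AtomicInteger();

    // The cleanup action must not hold a reference to the NativeResource itself,
    // otherwise the object could never become unreachable.
    private static final class State implements Runnable {
        @Override
        public void run() {
            RELEASED.incrementAndGet(); // stand-in for releasing a real handle
        }
    }

    private final Cleaner.Cleanable cleanable;

    NativeResource() {
        this.cleanable = CLEANER.register(this, new State());
    }

    @Override
    public void close() {
        cleanable.clean(); // runs the action at most once, even if called again
    }

    public static void main(String[] args) {
        try (NativeResource r = new NativeResource()) {
            System.out.println("using resource");
        }
        System.out.println("released=" + RELEASED.get());
    }
}
```

The explicit `close()` path is the primary one; the `Cleaner` only acts as a fallback if the caller forgets to close the object.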
6. Tuning GC Threads
- Diagnosis: The number of GC threads might be insufficient or excessive for your CPU cores. Look at the GC logs for unusually long GC times that don’t correlate with heap usage.
- Fix: Adjust the number of GC threads. For G1 GC, you can use `-XX:ParallelGCThreads=<N>` (threads used during stop-the-world pauses) and `-XX:ConcGCThreads=<M>` (threads used for concurrent work). Often, the default settings are reasonable, but in CPU-bound scenarios, tweaking might help. For example, `-XX:ParallelGCThreads=8 -XX:ConcGCThreads=4` if you have 16 cores.
- Why it works: Properly allocating GC threads to available CPU resources ensures that GC operations complete efficiently without starving the application threads or wasting CPU cycles.
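Before overriding the thread counts, it helps to know roughly what the JVM would pick on its own. The sketch below encodes the commonly cited HotSpot heuristics; treat both formulas as assumptions and confirm the real values on your JVM with `java -XX:+PrintFlagsFinal -version`. The class and method names are illustrative.

```java
class GcThreadDefaults {
    // Assumed HotSpot heuristic: one GC thread per CPU up to 8 CPUs,
    // then 5/8 of a thread for each additional CPU.
    static int parallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    // Assumed G1 heuristic: about a quarter of the parallel threads, at least one.
    static int concGcThreads(int parallelThreads) {
        return Math.max((parallelThreads + 2) / 4, 1);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int parallel = parallelGcThreads(cpus);
        System.out.printf("cpus=%d, estimated ParallelGCThreads=%d, ConcGCThreads=%d%n",
                cpus, parallel, concGcThreads(parallel));
    }
}
```

Under these assumptions a 16-core machine would default to about 13 parallel and 3 concurrent GC threads, so the `8`/`4` example above caps stop-the-world work while slightly raising concurrent work.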
7. PermGen/Metaspace Leaks (Less common for GC pauses but can cause OutOfMemoryError leading to restarts)
- Diagnosis: While not directly causing GC pauses, a leak in the PermGen (Java 7 and below) or Metaspace (Java 8+) can lead to `OutOfMemoryError: PermGen space` or `OutOfMemoryError: Metaspace` errors, forcing application restarts and perceived downtime. Monitor Metaspace usage with `jstat -gc <pid> <interval>` (look at `MC` and `MU`, the Metaspace capacity and usage).
- Fix: For Metaspace, increase `-XX:MaxMetaspaceSize=<size>`. For example, `-XX:MaxMetaspaceSize=512m`. The root cause is usually dynamic class loading and unloading issues, which might require application code changes.
- Why it works: This prevents the JVM from running out of space for class metadata, which would otherwise cause an OutOfMemoryError and application termination. If usage keeps growing because of a genuine leak, a higher limit only postpones the error.
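Metaspace usage can also be read from inside the process. The sketch below assumes a HotSpot JVM, where the memory pool is named `Metaspace`; other JVMs may use a different name, in which case the lookup simply returns empty. The class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

class MetaspaceUsage {
    // Finds the pool whose name contains "Metaspace" (the HotSpot naming, an assumption).
    static Optional<MemoryUsage> metaspace() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // getMax() is -1 when no -XX:MaxMetaspaceSize limit is set.
        metaspace().ifPresentOrElse(
                u -> System.out.printf("metaspace used=%d KiB, committed=%d KiB, max=%d%n",
                        u.getUsed() >> 10, u.getCommitted() >> 10, u.getMax()),
                () -> System.out.println("no Metaspace pool found on this JVM"));
    }
}
```

A used value that climbs steadily across redeploys or over days of uptime, without ever dropping, is the usual sign of a class loader leak.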
After fixing these, the next error you may run into is `java.lang.OutOfMemoryError: Direct buffer memory`, if your application uses a lot of off-heap memory.