The surprising truth about JVM heap dumps is that they’re not just a snapshot of memory; they’re a detailed forensic record that can pinpoint the exact objects responsible for your application’s memory bloat, often revealing subtle programming errors you didn’t know existed.
Let’s look at a real-world scenario. Imagine a web application that’s been running for days, and its memory usage is steadily climbing, eventually leading to an OutOfMemoryError. We’ve captured a heap dump using jmap while the application is experiencing high memory usage:
jmap -dump:format=b,file=heapdump.hprof <pid>
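This command writes the contents of the Java heap to a heapdump.hprof file. Adding the live option (-dump:live,format=b,file=heapdump.hprof) restricts the dump to live, reachable objects. Now, we load this file into the Eclipse Memory Analyzer (MAT). MAT presents a wealth of information, but we’re specifically looking for leaks.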
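If attaching jmap to the process is not an option, a dump can also be triggered from inside the JVM. The following is a minimal sketch, assuming a HotSpot-based JVM where the com.sun.management.HotSpotDiagnosticMXBean interface is available; the HeapDumper class and its method names are hypothetical and only serve the illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes a heap dump of the current JVM (HotSpot only).
class HeapDumper {

    // target should end in .hprof and must not exist yet;
    // liveOnly = true dumps only reachable objects, like jmap's "live" option.
    static void dump(Path target, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience for the demo: dumps into a fresh temporary directory.
    static Path dumpToTempDir(boolean liveOnly) {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            Path target = dir.resolve("heapdump.hprof");
            dump(target, liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap dump written to " + dumpToTempDir(true));
    }
}
```

The resulting file is an ordinary .hprof dump and can be opened in MAT in the same way as one produced by jmap.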
The first thing MAT typically shows is the "Overview" page. From there, you’ll want to navigate to the "Leak Suspects" report. This report automatically analyzes the heap dump and flags objects, or groups of objects, that retain an unusually large share of the heap. It’s not magic: a single dump carries no timing information, so MAT cannot know how long an object has been alive. Instead, it looks for large concentrations of retained memory and reports what is keeping them alive.
The Leak Suspects report is built on the dominator tree. An object "dominates" another if every path from a GC root to the second object passes through the first, meaning that if the dominator becomes collectable, all the objects it dominates become collectable too. The memory that would be freed this way is the dominator’s retained size. Large dominators, especially those that keep growing when you compare dumps taken at different times, are prime suspects for leaks.
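To make the idea of domination concrete, here is a small sketch that computes the retained set of a node in a toy object graph by brute force: everything reachable from the root with the node present, minus everything still reachable with the node removed. This is not how MAT computes its dominator tree; the graph, the node names, and the RetainedSetSketch class are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RetainedSetSketch {

    // Nodes reachable from root when `removed` is treated as absent (null = remove nothing).
    static Set<String> reachable(Map<String, List<String>> refs, String root, String removed) {
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        if (!root.equals(removed)) pending.push(root);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!seen.add(current)) continue; // already visited
            for (String target : refs.getOrDefault(current, Collections.emptyList())) {
                if (!target.equals(removed)) pending.push(target);
            }
        }
        return seen;
    }

    // Everything that becomes unreachable once `node` is gone, including node itself.
    static Set<String> retainedSet(Map<String, List<String>> refs, String root, String node) {
        Set<String> retained = new TreeSet<>(reachable(refs, root, null));
        retained.removeAll(reachable(refs, root, node));
        return retained;
    }

    // Toy graph: "shared" is referenced from both the cache and the config.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("root", Arrays.asList("cache", "config"));
        refs.put("cache", Arrays.asList("entry1", "entry2"));
        refs.put("entry1", Arrays.asList("shared"));
        refs.put("entry2", Arrays.asList("blob"));
        refs.put("config", Arrays.asList("shared"));
        return refs;
    }

    public static void main(String[] args) {
        // prints [blob, cache, entry1, entry2]
        System.out.println(retainedSet(sampleGraph(), "root", "cache"));
    }
}
```

Note that "shared" is not part of the cache’s retained set: it is also reachable through "config", so collecting the cache would not free it. This is why an object’s retained size can be much smaller than the total size of everything it references.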
Let’s say the Leak Suspects report highlights a HashMap as a major leak suspect. This is common. Applications often use collections like HashMap to store data, and if these collections are not properly cleared or managed, they can accumulate objects indefinitely.
To investigate this HashMap, you’d open the context menu on the suspect object and select "Path To GC Roots" with weak and soft references excluded. This is the core of heap dump analysis. It shows you the chain of references that prevents an object (or a collection of objects) from being garbage collected. You’ll see the HashMap itself, and then a chain of objects leading back to a GC root. GC roots are objects that are always considered reachable, such as live threads, local variables on their stacks, classes loaded by the system class loader (and through them, their static fields), and JNI references. If your suspect object is reachable from a static field or from a thread that never terminates, it’s a strong indicator of a leak.
Consider this common leak pattern: a ThreadLocal holds a large object, and either the thread never terminates (as in a thread pool) or remove() is never called on the ThreadLocal. Each thread owns a ThreadLocalMap whose entries reference their values strongly, and the live thread, being a GC root, keeps that map and everything in it alive.
Another frequent culprit is a cache that isn’t configured with an eviction policy. You might have a ConcurrentHashMap acting as a cache, and as more items are added without any mechanism to remove old ones, it grows unbounded.
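One way to bound such a cache is sketched below. Note that it swaps the ConcurrentHashMap for a LinkedHashMap in access order, whose removeEldestEntry hook gives a simple LRU policy; the LruCache class name and the capacity of 2 are assumptions chosen for the example, and a production cache would more likely come from a dedicated caching library.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache: evicts the least recently used entry
// once more than maxEntries entries are present.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, oldest first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put(); returning true drops the eldest entry.
        return size() > maxEntries;
    }
}

class LruCacheDemo {
    public static void main(String[] args) {
        // LinkedHashMap is not thread-safe, so wrap it for shared use.
        Map<String, String> cache = Collections.synchronizedMap(new LruCache<>(2));
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // "a" is now the most recently used entry
        cache.put("c", "3"); // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

With a hard upper bound on the number of entries, the cache can still show up as a large dominator in MAT, but it can no longer grow until the heap is exhausted.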
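When examining the "Path to GC Roots," pay close attention to the type of references. Strong references always prevent garbage collection. Soft references are cleared at the collector’s discretion under memory pressure, and are guaranteed to be cleared before an OutOfMemoryError is thrown. Weak references are cleared as soon as the collector finds that the object is no longer strongly or softly reachable. If your leak suspect is held by a strong reference from an unexpected source, that’s your leak.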
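The difference between strong and weak reachability can be observed in a few lines. This is only a sketch: the ReferenceStrengthDemo class and its static HOLDER list are hypothetical stand-ins for a leaking collection, and System.gc() is merely a request, so the second printed value is typical rather than guaranteed.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

class ReferenceStrengthDemo {
    // Stand-in for a leaking collection: a strong path from a GC root.
    static final List<byte[]> HOLDER = new ArrayList<>();

    // Allocates a payload, pins it via HOLDER, and returns a weak handle to it.
    static WeakReference<byte[]> allocateHeld() {
        byte[] payload = new byte[1024];
        HOLDER.add(payload);
        return new WeakReference<>(payload);
    }

    // While HOLDER references the payload, a GC must not clear the weak reference.
    static boolean survivesGcWhileHeld() {
        WeakReference<byte[]> weak = allocateHeld();
        System.gc();
        return weak.get() != null;
    }

    public static void main(String[] args) {
        WeakReference<byte[]> weak = allocateHeld();
        System.gc();
        System.out.println("while held: " + (weak.get() != null)); // true

        HOLDER.clear(); // drop the only strong reference
        System.gc();    // a hint only; clearing is likely but not guaranteed
        System.out.println("after clear: " + (weak.get() != null));
    }
}
```

This mirrors what the path view shows: as long as one strong chain from a root exists, weak and soft holders are irrelevant, which is why they are excluded when searching for the cause of a leak.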
A typical fix for a HashMap leak might involve identifying where the HashMap is being populated and ensuring that either:
- Entries are explicitly removed when no longer needed.
- The HashMap is cleared periodically.
- If it’s a cache, a proper eviction strategy (like LRU or time-based expiry) is implemented.
For example, if you find a HashMap holding user sessions and it’s being populated in a request handler, you’d look for the logic that removes the session when the user logs out or the session times out. If that logic is missing or flawed, you’d add it.
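A session map with both removal paths might look like the following sketch. The SessionRegistry class, its method names, and the 30-second timeout are hypothetical; timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical session store with the two cleanup paths described above.
class SessionRegistry {
    private static final class Session {
        final String user;
        final long lastAccessMillis;

        Session(String user, long lastAccessMillis) {
            this.user = user;
            this.lastAccessMillis = lastAccessMillis;
        }
    }

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    SessionRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called from the request handler: creates or refreshes a session.
    void touch(String sessionId, String user, long nowMillis) {
        sessions.put(sessionId, new Session(user, nowMillis));
    }

    // Cleanup path 1: explicit removal on logout.
    void logout(String sessionId) {
        sessions.remove(sessionId);
    }

    // Cleanup path 2: drop sessions idle for longer than the timeout.
    void evictExpired(long nowMillis) {
        sessions.values().removeIf(s -> nowMillis - s.lastAccessMillis > timeoutMillis);
    }

    int size() {
        return sessions.size();
    }

    public static void main(String[] args) {
        SessionRegistry registry = new SessionRegistry(30_000);
        registry.touch("s1", "alice", 0);
        registry.touch("s2", "bob", 20_000);
        registry.logout("s1");         // alice logs out
        registry.evictExpired(60_000); // bob has been idle for 40 seconds
        System.out.println(registry.size()); // prints 0
    }
}
```

In a real application, evictExpired would typically be driven by a scheduled task, since a user who simply closes the browser never triggers the logout path.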
If the leak is due to a ThreadLocal, the fix is usually to ensure threadLocalVariable.remove() is called in a finally block after the thread finishes its work, or if the thread is pooled and reused, to ensure the variable is cleared before the thread is returned to the pool.
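The pattern can be sketched as follows, assuming a single-threaded pool so that both tasks are guaranteed to run on the same worker thread. The RequestScratch class, its BUFFER field, and the 1 MiB buffer size are illustrative only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class RequestScratch {
    // Hypothetical per-thread scratch buffer.
    private static final ThreadLocal<byte[]> BUFFER = new ThreadLocal<>();

    static int handleRequest() {
        BUFFER.set(new byte[1024 * 1024]);
        try {
            return BUFFER.get().length; // stand-in for the real work
        } finally {
            BUFFER.remove(); // always clear before the thread goes back to the pool
        }
    }

    static boolean bufferStillBound() {
        return BUFFER.get() != null;
    }

    // Runs a request, then checks on the same pooled thread whether the buffer survived.
    static boolean leaksOnPooledThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> request = RequestScratch::handleRequest;
            Callable<Boolean> check = RequestScratch::bufferStillBound;
            pool.submit(request).get();
            return pool.submit(check).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("leaks: " + leaksOnPooledThread()); // prints "leaks: false"
    }
}
```

Without the finally block, the second task would still see the 1 MiB buffer, and in a heap dump it would appear under the worker thread’s ThreadLocalMap for as long as the pool keeps that thread alive.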
Sometimes, the leak isn’t in your application code directly, but in a third-party library. MAT’s dominator tree can help here too. If a large portion of memory is dominated by objects from a specific library, and you can’t find a logical reason for them to be held, it might be a library bug or an incorrect usage pattern of that library.
The key is to follow the chain of references. If you see an object that shouldn’t be there, ask yourself: "What is holding onto it?" and then use MAT to find that "what."
Once you’ve fixed the identified leak and redeployed, the next common problem you’ll encounter is a StackOverflowError if you’ve inadvertently introduced infinite recursion while trying to fix the original issue.