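JVM String Deduplication, introduced in Java 8u20 for the G1 collector (JEP 192), can noticeably reduce heap usage in string-heavy applications by finding String objects with identical contents and making them share a single backing array. Savings of 10-30% are often quoted, but JEP 192 itself estimates roughly 10% on average, and the real figure depends on how many duplicate strings your application keeps alive.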
Let’s see it in action. Imagine you have a bunch of String objects that are all the same:
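import java.util.ArrayList;
import java.util.List;

public class StringDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            // Copy the characters so every String gets its own backing array.
            // new String("commonString") alone would simply reuse the literal's
            // backing array, leaving nothing to deduplicate.
            strings.add(new String("commonString".toCharArray()));
        }
        // Ask for a collection so the GC gets a chance to look at the strings
        // (System.gc() is only a hint and may be ignored)
        System.gc();
        try {
            Thread.sleep(60000); // Keep the JVM alive for inspection
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(strings.size()); // keeps the list reachable until the end
    }
}
Without deduplication, every one of these String objects carries its own copy of the same character data in a separate backing array. That is 100,000 arrays where one would do.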
Here’s how String Deduplication tackles this:
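In G1, the collector the feature was originally built for, candidates are found as a side effect of normal garbage collection. While the collector visits live objects during a young or mixed evacuation pause (or during the marking phase of a full GC), it queues String objects that qualify for later processing. A background deduplication thread then works through that queue concurrently with the application. Two strings count as duplicates when their backing arrays have the same contents, which is the same condition under which string1.equals(string2) returns true.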
This works because a String does not hold its characters directly: it holds a reference to a backing array (a char[] in Java 8, a byte[] since Java 9). String.intern(), which has been part of Java since 1.0, attacks duplication at a different level. It returns one canonical String object per distinct content from the JVM's string pool, and string literals are interned automatically. Note that new String("...") never returns the pooled instance: it always allocates a new String object. Strings built at runtime (parsed from input, concatenated, decoded from bytes) each get their own backing array unless something arranges for them to share one.
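The difference between object identity and content equality is easy to check directly. The following is a minimal sketch; the class name InternDemo and the helper sameObject are made up for illustration.

```java
public class InternDemo {
    /** Returns true if both references point to the same String object. */
    public static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "commonString";           // literals are interned automatically
        String copy = new String("commonString");  // always a brand-new String object
        String interned = copy.intern();           // canonical instance from the string pool

        System.out.println(sameObject(literal, copy));     // false: two distinct objects
        System.out.println(literal.equals(copy));          // true: same characters
        System.out.println(sameObject(literal, interned)); // true: both are the pooled instance
    }
}
```

Interning changes which object you hold a reference to, so it has to be called explicitly and the result has to be used in place of the original.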
String Deduplication, enabled with the -XX:+UseStringDeduplication JVM flag, works differently and does not rely on String.intern(). On Java 8 it works only with G1, which must be selected explicitly with -XX:+UseG1GC; with any other collector the flag has no effect. Since JDK 18, all HotSpot collectors support it. The GC identifies String objects in the regular heap whose backing arrays have identical contents, keeps one array, and makes the other String objects point to it. Crucially, the String objects themselves remain distinct instances, so == comparisons and synchronization behave exactly as before. Only the underlying character data is shared.
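To see this in action, enable the flag. On Java 8 you also need to select G1, and adding -XX:+PrintStringDeduplicationStatistics puts deduplication statistics into the GC log:
java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar StringDemo.jar
On Java 9 and later, the Print* logging flags were replaced by unified logging (-Xlog). The tag used for string deduplication output has changed between JDK versions, so check java -Xlog:help for the one your JDK uses.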
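Observe the GC logs. With deduplication statistics enabled, they contain entries summarizing the activity, such as how many strings were inspected and how many were deduplicated. You can also use tools like VisualVM or JProfiler: take a heap dump from a run without -XX:+UseStringDeduplication and another from a run with it, then compare them. The number of String objects stays the same, but the memory occupied by their backing arrays (char[] on Java 8) drops, because strings with identical contents now share one array. Heap usage after GC goes down; the configured heap size does not change.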
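If you want to confirm from inside the process that the flag actually took effect, HotSpot exposes its -XX options through the HotSpotDiagnosticMXBean. The sketch below assumes a HotSpot-based JVM; the class name DedupFlagCheck and the helper vmOption are hypothetical, and the helper falls back to "unavailable" on JVMs that do not expose the option.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DedupFlagCheck {
    /** Returns the current value of a HotSpot -XX option, or "unavailable" if it cannot be read. */
    public static String vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown option, or a JVM that does not provide this MXBean
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseStringDeduplication = "
                + vmOption("UseStringDeduplication"));
        System.out.println("StringDeduplicationAgeThreshold = "
                + vmOption("StringDeduplicationAgeThreshold"));
    }
}
```

Run it with and without -XX:+UseStringDeduplication on the command line to see the first value change.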
The primary benefit is a smaller memory footprint, which can mean fewer GC cycles (especially full GCs), shorter pauses, and potentially higher throughput. It is particularly effective in applications that keep many long-lived strings with identical content alive, such as web servers processing similar requests, data parsing applications, or applications holding large datasets of repeated strings. Short-lived strings benefit little, because they usually become garbage before the collector ever considers them for deduplication.
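Before enabling the flag, it can help to get a rough feel for how much duplication a dataset contains. The following sketch counts the characters held by strings whose content has already been seen. DuplicateEstimator and duplicateChars are hypothetical names, and the result is a character count, not a byte count: the real saving also depends on the string encoding, array headers, and whether the strings live long enough to be deduplicated.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateEstimator {
    /**
     * Sums the lengths of all strings whose content already appeared
     * earlier in the list, i.e. characters that sharing could save.
     */
    public static long duplicateChars(List<String> strings) {
        Map<String, Integer> seen = new HashMap<>();
        long duplicated = 0;
        for (String s : strings) {
            // merge returns the updated occurrence count for this content
            if (seen.merge(s, 1, Integer::sum) > 1) {
                duplicated += s.length();
            }
        }
        return duplicated;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("GET", "POST", "GET", "GET", "PUT", "POST");
        // Two repeated "GET" (3 + 3) and one repeated "POST" (4)
        System.out.println(duplicateChars(sample)); // prints 10
    }
}
```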
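One crucial aspect is that String Deduplication is not tied to full GCs. With G1, a string becomes a candidate once it has survived a number of young collections, controlled by -XX:StringDeduplicationAgeThreshold (default 3), or when it is promoted to the old generation before reaching that age. The deduplication work itself runs on a background thread, concurrently with the application. The practical consequence is that strings which die young are never deduplicated, and that an application has to run through a few GC cycles before the savings show up. By shrinking the live data set, deduplication can in turn help avoid costly full GCs.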
The mechanism is straightforward. For each candidate String, the JVM looks up its backing array (char[] on Java 8, byte[] with Compact Strings on Java 9 and later) in an internal deduplication table keyed by array contents. If an array with identical contents is already in the table, the String is updated to point to that shared array, and its original array becomes garbage to be reclaimed by a later collection. If not, the String's array is added to the table so that later duplicates can share it.
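The lookup-and-remap step can be modeled in a few lines of plain Java. This is a toy model for illustration only, not HotSpot's implementation: FakeString is a hypothetical stand-in for java.lang.String (whose backing array cannot be reassigned from application code), and ToyDedupTable stands in for the JVM's internal table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ToyDedupTable {
    /** Hypothetical stand-in for java.lang.String: an object with a backing array. */
    public static final class FakeString {
        public char[] value;
        public FakeString(String s) { this.value = s.toCharArray(); }
    }

    /** Wraps a char[] so it can serve as a map key compared by content. */
    private static final class Key {
        final char[] chars;
        Key(char[] chars) { this.chars = chars; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(chars, ((Key) o).chars);
        }
        @Override public int hashCode() { return Arrays.hashCode(chars); }
    }

    private final Map<Key, char[]> table = new HashMap<>();

    /** Points s at an existing equal array if there is one; returns true if s was remapped. */
    public boolean deduplicate(FakeString s) {
        char[] canonical = table.putIfAbsent(new Key(s.value), s.value);
        if (canonical == null || canonical == s.value) {
            return false;       // first sighting of this content, or already shared
        }
        s.value = canonical;    // the old array is now unreachable and can be collected
        return true;
    }

    public static void main(String[] args) {
        ToyDedupTable table = new ToyDedupTable();
        FakeString a = new FakeString("commonString");
        FakeString b = new FakeString("commonString");
        table.deduplicate(a);
        table.deduplicate(b);
        System.out.println(a != b);             // true: still two separate objects
        System.out.println(a.value == b.value); // true: one shared backing array
    }
}
```

The model captures the key property of the real feature: the wrapper objects stay distinct while their character data is shared.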
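This feature is most impactful when you have a large number of identical, long-lived strings that are not being intern()-ed. If your application already relies heavily on String.intern(), the benefits of UseStringDeduplication may be less pronounced, because intern() already collapses equal strings into one canonical String object, which saves the object as well as the array. However, intern() has its own costs: it must be called explicitly at every place where duplicates are created, each call performs a lookup in the JVM's global string table, and callers have to use the returned reference. UseStringDeduplication requires no code changes and does its work in the background, driven by the GC.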
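The next step in optimizing string handling is understanding how Compact Strings (JEP 254, enabled by default since Java 9) further reduce memory usage by storing strings in a byte[] with one byte per character when every character fits in Latin-1, instead of two bytes per character.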