Java code can often be written to be more cache-friendly, and it’s not about obscure compiler flags or processor-specific intrinsics.
Let’s watch a tiny Java program interact with the CPU cache. Imagine we have two arrays, a and b, both filled with integers. We want to sum their elements into a third array, c.
public class CacheDemo {
public static void main(String[] args) {
int size = 1024 * 1024; // 1 million integers
int[] a = new int[size];
int[] b = new int[size];
int[] c = new int[size];
// Initialize arrays (this part is not cache-sensitive for the demo)
for (int i = 0; i < size; i++) {
a[i] = i;
b[i] = i * 2;
}
// Warm-up the JVM
long startTime = System.nanoTime();
for (int i = 0; i < size; i++) {
c[i] = a[i] + b[i];
}
long endTime = System.nanoTime();
System.out.println("Simple sum duration: " + (endTime - startTime) + " ns");
// Now, let's try a different operation that can be less cache-friendly
startTime = System.nanoTime();
for (int i = 0; i < size; i++) {
c[i] = a[i] + b[size - 1 - i]; // Accessing b in reverse
}
endTime = System.nanoTime();
System.out.println("Reverse sum duration: " + (endTime - startTime) + " ns");
}
}
When you run this, you’ll likely see the "Simple sum duration" is significantly faster than the "Reverse sum duration." Why? Because the CPU’s cache is designed for locality of reference. When the CPU fetches data from main memory, it brings it into its L1, L2, or L3 cache, which is much faster. If subsequent operations need that same data, they can grab it from the cache instead of going back to slow main memory.
In the "Simple sum," both a[i] and b[i] are accessed sequentially. As the CPU reads a[0], it also fetches a[1], a[2], etc., into the cache. Similarly for b[i]. The data is laid out contiguously in memory, and the CPU’s prefetchers can often predict the next accesses, filling the cache with the data needed before the code actually asks for it.
However, in the "Reverse sum," we access a[i] sequentially but b[size - 1 - i] in reverse. This means for i=0, we access b[size-1]. For i=1, we access b[size-2]. The memory accesses for b are jumping all over the place. The CPU’s prefetchers struggle to predict these non-sequential accesses. Each time b[size - 1 - i] is needed, it’s likely not in the cache, forcing a slow trip to main memory. This leads to cache misses, where the CPU has to wait for data to be loaded.
The core problem the CPU cache tries to solve is the speed gap between the CPU and main memory. Modern CPUs are orders of magnitude faster than RAM. Without caches, CPUs would spend most of their time waiting for data. Caches exploit two principles:
- Temporal Locality: If a piece of data is accessed, it’s likely to be accessed again soon.
- Spatial Locality: If a piece of data is accessed, data located near it in memory is also likely to be accessed soon.
Our "Simple sum" benefits from both. Sequential array access is the epitome of spatial locality. The "Reverse sum" breaks spatial locality for array b.
To write cache-friendly Java code, you want to maximize data parallelism and access patterns that exhibit locality. This often means processing data in contiguous blocks. For example, instead of processing one element from a, then one from b, then one from c, you might process a "chunk" of a, then a chunk of b, and so on, or process a chunk of a and b to produce a chunk of c.
Consider a common pattern: iterating over a list of objects. If you have List<MyObject> objects, and in your loop you access obj.getFieldA() and obj.getFieldB(), and MyObject has many fields, the object might be laid out in memory such that FieldA and FieldB are not close together. If you have a hot loop that only needs FieldA and FieldB, it might be more cache-friendly to use a structure where these fields are closer, perhaps by using separate arrays for FieldA and FieldB data, indexed by the object’s original position. This is the essence of Array of Structs (AoS) vs. Struct of Arrays (SoA). The CacheDemo above is implicitly using SoA (separate arrays a, b, c).
The most surprising thing about CPU caches is how much the physical layout of your data in memory, not just its logical structure, impacts performance. Even with a garbage collector, the JVM tries to keep objects allocated contiguously for a while, but the underlying memory management can scatter them. When you deal with primitive arrays, you have direct control over this contiguous allocation.
When you’ve optimized array access for locality, the next performance bottleneck you’ll likely encounter is the overhead of method calls within tight loops, which can hinder inlining and introduce stack frame management costs.