GPU Memory Layers: Beyond Caches

A GPU’s memory hierarchy is designed to feed its thousands of cores as fast as possible, and the biggest surprise is how much of that speed comes from predictable latency rather than raw bandwidth.

Let’s see this in action. Imagine a simple CUDA kernel that sums an array:

__global__ void sum_array(float* data, float* result, int n) {
    __shared__ float sdata[256]; // Shared memory for this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    if (i < n) {
        sdata[tid] = data[i];
    } else {
        sdata[tid] = 0; // Pad with zeros
    }
    __syncthreads(); // Ensure all threads in block have loaded

    // Perform reduction in shared memory
    for (int s = 256 / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads(); // Synchronize after each reduction step
    }

    // Write final sum for this block to global memory
    if (tid == 0) {
        result[blockIdx.x] = sdata[0];
    }
}

Here, data is in HBM (High Bandwidth Memory), sdata is in shared memory, and the tid variable is held in a register. The kernel orchestrates a multi-stage reduction. Threads first load a chunk of data into sdata. __syncthreads() is crucial – it ensures all threads in a block have finished loading before the reduction begins. Then, in stages, threads combine elements within sdata. Finally, the result for each block is written back to result in HBM.

The fundamental problem the GPU memory hierarchy solves is the memory wall: the ever-widening gap between processor speed and memory access speed. GPUs, with their massive parallelism, are particularly susceptible. If thousands of threads are all waiting for data from slow global memory (HBM), performance grinds to a halt. The hierarchy is a multi-tiered strategy to keep the compute units fed.

At the very top are registers. These are private to each individual thread, incredibly fast (single-cycle latency), and have immense bandwidth per thread. A core can have hundreds or even thousands of registers. Variables like tid and loop counters (i, s) typically live here. The compiler tries to keep frequently used variables in registers. If a thread needs more registers than available, it "spills" to local memory (which is just uncached global memory), causing a massive performance hit.

Next is shared memory. It’s a user-managed cache, local to a Streaming Multiprocessor (SM). A block of threads can access shared memory. It’s much faster than global memory (tens of cycles latency) and has high bandwidth for the threads within a block. In our example, sdata resides here. Developers explicitly manage shared memory, loading data into it and synchronizing access with __syncthreads(). Shared memory is organized into banks, and accessing contiguous locations within a bank can lead to bank conflicts, slowing down access. However, staggered accesses across banks can achieve very high bandwidth.

Then come L1 and L2 caches. These are hardware-managed caches. L1 is typically per SM, and L2 is shared across all SMs on the GPU. They act as transparent caches for global memory (HBM). L1 is faster than L2, but L2 has a larger capacity. The goal is to hold frequently accessed global memory data closer to the SMs. Unlike shared memory, you don’t explicitly control what goes into L1/L2; the hardware decides based on access patterns.

Finally, at the bottom is HBM (High Bandwidth Memory). This is the main, large capacity memory. It has enormous bandwidth but relatively high latency (hundreds of cycles). All the data loaded into registers, shared memory, and caches ultimately originates from or is written back to HBM. The "high bandwidth" comes from the sheer number of memory channels and very wide buses.

The key to performance is minimizing accesses to HBM. By keeping working data in registers and shared memory, and leveraging L1/L2 caches, threads can execute without long stalls waiting for HBM. The __syncthreads() calls in our example are critical for coordinating these loads and computations, ensuring that when a thread reads from sdata, the data has already been written by another thread in the same block.

The single most surprising thing about this hierarchy is how much latency is hidden through concurrency. When one warp (a group of 32 threads) stalls waiting for memory, the SM immediately switches to executing another warp that is ready. This rapid context switching effectively hides the latency of memory operations, making the "slow" HBM appear much faster than it would for a single thread. The memory hierarchy provides the data, but the SM’s ability to juggle thousands of threads is what makes the latency disappear.

The next hurdle is understanding how warp scheduling and occupancy impact performance within this memory hierarchy.