A GPU kernel is typically memory bound when the majority of its execution time is spent waiting for data to be fetched from or written to memory, rather than actively performing computations.

Let’s see this in action. Imagine a simple matrix multiplication kernel.

__global__ void matrixMultiply(float* C, const float* A, const float* B, int N) {
    int row = blockIdx.row * blockDim.row + threadIdx.row;
    int col = blockIdx.col * blockDim.col + threadIdx.col;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

In this kernel, each thread computes one element of the output matrix C. The core operation is the dot product of a row from A and a column from B. If the GPU is constantly fetching A and B elements and writing C elements, it’s memory bound. If it’s spending most of its time in the for loop performing multiplications and additions, it’s compute bound.

The problem this solves is optimizing GPU performance. Without understanding whether your kernel is bottlenecked by memory or computation, you might try to optimize the wrong thing. For example, if your kernel is memory bound, adding more complex mathematical operations won’t help; you need to improve data access patterns. Conversely, if it’s compute bound, optimizing memory access might yield diminishing returns if the arithmetic intensity is already high.

Internally, GPUs have sophisticated memory hierarchies (global memory, shared memory, L1/L2 caches) and many processing cores. A memory-bound kernel means these cores are often idle, waiting for data. A compute-bound kernel means the cores are saturated with calculations. The key is to keep those cores fed with data and instructions.

To diagnose this, we use profiling tools. NVIDIA’s Nsight Compute is the go-to for this. The primary metric we look at is Arithmetic Intensity (AI). AI is defined as the number of floating-point operations (FLOPs) per byte of data loaded from global memory.

  • High AI (e.g., > 20-30 for many workloads, but context-dependent): Suggests the kernel is compute-bound. The kernel performs many calculations for each byte of data it touches.
  • Low AI (e.g., < 10): Suggests the kernel is memory-bound. The kernel performs fewer calculations per byte of data, meaning data movement is a significant portion of the execution time.

Here’s how you’d gather this using Nsight Compute:

  1. Install Nsight Compute: Download and install from the NVIDIA developer website.
  2. Run the profiler: From your terminal, execute your CUDA application with nsys profile or nvprof (older, but still functional) and then analyze the output with ncu (Nsight Compute CLI). A common command structure looks like this:
    nvprof --analysis-metrics --log-file report.nvprof ./your_cuda_application
    
    Or with Nsight Systems (which can launch Nsight Compute analysis):
    nsys profile --stats=true -o report.nsys ./your_cuda_application
    nsys stats report.nsys # then analyze the generated report.html or use ncu
    
  3. Analyze the report: Open the generated report file (e.g., report.nvprof or the HTML generated by nsys stats) with Nsight Compute. Look for the "Roofline Analysis" or "Metrics" section.

Common Diagnosis Steps and Fixes:

  • Check Arithmetic Intensity (AI):

    • Diagnosis: In Nsight Compute, find the Avg Arithmetic Intensity metric for your kernel. If it’s low (e.g., 3-5 for a simple matrix multiply that could theoretically achieve much higher), it’s likely memory bound.
    • Fix: Increase the amount of computation per memory access.
      • Tiling/Blocking: Rewrite your kernel to use smaller blocks (tiles) of data that fit into faster on-chip memory (shared memory or L1/L2 caches). This reuses data multiple times before it needs to be fetched from global memory again. For matrix multiply, this involves loading tiles of A and B into shared memory, performing the multiply-accumulate on those tiles, and then writing back the result.
      • Example Tiling Logic (Conceptual):
        // Load tileA and tileB into shared memory
        // ...
        __syncthreads(); // Ensure all threads have data
        // Perform computation using shared memory tiles
        for (int k = 0; k < TILE_DIM; ++k) {
            sum += tileA[tid + k * TILE_DIM] * tileB[k * TILE_DIM + tid];
        }
        __syncthreads(); // Ensure computation is done before next tile load
        
      • Why it works: Shared memory has much higher bandwidth and lower latency than global memory. By using it, you reduce the number of global memory fetches per computation.
  • Examine Global Memory Throughput:

    • Diagnosis: Look at metrics like DRAM Throughput or Global Memory Bandwidth. If your kernel is utilizing a low percentage of the theoretical maximum bandwidth (e.g., < 50%), and AI is low, it’s memory bound. Also, check Global Load/Store Efficiency or L1 Cache Hit Rate. Low hit rates indicate frequent misses and thus a reliance on slower global memory.
    • Fix: Improve memory access patterns to achieve coalescing.
      • Coalesced Access: Ensure that threads within a warp (32 threads) access contiguous memory locations. For example, if thread 0 accesses data[0], thread 1 accesses data[1], and so on, up to thread 31 accessing data[31].
      • Example Coalesced Access: In the matrix multiply, if N is a multiple of the warp size (32), and threads in a warp are responsible for consecutive columns (col), then accesses to B[k * N + col] will be coalesced.
      • Why it works: The GPU’s memory controller can fetch data for an entire warp in a single transaction if accesses are aligned and contiguous, significantly increasing effective bandwidth.
  • Analyze Instruction Mix and Occupancy:

    • Diagnosis: Check SM Occupancy (how many warps are active on a Streaming Multiprocessor) and Issue Slot Utilization. Low occupancy might indicate that threads are blocked waiting for memory, not that there aren’t enough resources. Low issue slot utilization can mean the SM is stalled.
    • Fix: Increase occupancy by reducing register usage or shared memory usage per thread.
      • Reduce Register Pressure: If a kernel uses too many registers per thread, it limits the number of warps that can be resident on an SM, leading to lower occupancy. This can happen with complex intermediate calculations.
      • Reduce Shared Memory Usage: Similarly, excessive shared memory allocation per thread block can limit occupancy.
      • Why it works: Higher occupancy means more warps are available to execute when others are stalled on memory operations, masking latency.
  • Check L1/L2 Cache Hit Rates:

    • Diagnosis: Look at L1 Cache Hit Rate and L2 Cache Hit Rate. Low hit rates (e.g., < 70%) for L1, or consistently low L2 utilization when global memory bandwidth is not saturated, suggest that data isn’t being effectively reused or the access patterns aren’t cache-friendly.
    • Fix: Implement data reuse patterns and ensure data locality.
      • Data Reuse: Similar to tiling for shared memory, structure your computation so that data loaded into L1/L2 caches is used multiple times before being evicted. This is often achieved through algorithmic changes or by processing data in a way that promotes temporal locality.
      • Why it works: Caches store frequently accessed data closer to the SMs. High hit rates mean computation is happening on cached data, avoiding slow global memory access.
  • Consider Texture/Surface Memory:

    • Diagnosis: If you have irregular access patterns or need hardware-accelerated interpolation, and your kernel is memory-bound, the standard global memory might not be optimal.
    • Fix: Use texture or surface memory for specific access patterns.
      • Texture Memory: Offers hardware-accelerated caching and filtering/interpolation capabilities, often beneficial for spatially localized accesses (like accessing neighboring pixels in an image).
      • Why it works: The specialized hardware can provide benefits (like interpolation and potentially better caching for specific access patterns) that are not available with regular global memory loads.
  • Increase Parallelism (if possible):

    • Diagnosis: If the kernel is compute-bound but not fully utilizing all SMs, or if you have many independent kernels that could run concurrently.
    • Fix: Increase the grid dimensions (number of blocks) or block dimensions (number of threads per block) to better saturate the GPU’s compute resources.
      • Why it works: More threads and blocks mean more work to keep the SMs busy, even if some threads are waiting for memory. This is more of a compute-bound strategy but can help mask memory latency if not critically memory bound.

The next error you’ll encounter after fixing your memory-bound kernel is likely related to compute utilization or register spilling, indicating you’ve successfully masked memory latency and are now hitting the limits of the GPU’s computational power or register file.

Want structured learning?

Take the full Gpu course →