To get the most out of your GPU, you want to keep its many execution units as busy as possible, and that means maximizing something called "warp occupancy."

Let’s see this in action. Imagine a simple CUDA kernel that adds two arrays.

__global__ void vectorAdd(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

When this kernel runs, the GPU breaks the work into blocks, and each block is further broken down into warps of 32 threads. These warps are the fundamental unit of execution scheduling. If you have a lot of warps ready to run, the GPU can switch between them when one warp has to wait (e.g., for memory access), ensuring that the execution units are always fed with work. This is warp occupancy.

The core problem warp occupancy solves is hiding latency. GPUs are incredibly fast at computation, but memory operations are orders of magnitude slower. If a warp requests data from global memory and has to wait 100s of clock cycles for it to arrive, that’s 32 threads doing nothing. If, during that wait, the GPU can switch to another ready warp and execute its instructions, those 32 threads are still being productive. Maximizing occupancy means having enough concurrently active warps such that when one is stalled, there are others ready to take its place on the Streaming Multiprocessors (SMs).

To understand how this works internally, you need to know about the SM. An SM is the core processing unit on a GPU. It contains CUDA Cores (for floating-point arithmetic), Tensor Cores (for matrix operations), Load/Store Units (for memory access), Special Function Units, and importantly, register files and shared memory. Each SM can hold a certain number of warps in flight. The number of warps an SM can execute concurrently is limited by the available resources:

  • Register File: Each thread in a warp uses registers to store its local variables. The total number of registers available per SM is finite. If a kernel requires too many registers per thread, it will limit the number of warps that can be active simultaneously because the total register usage across all active warps cannot exceed the SM’s capacity.
  • Shared Memory: This is a user-managed, on-chip memory that is much faster than global memory. If your kernel uses shared memory, the amount allocated per thread block (and thus, indirectly, per warp) will consume a portion of the SM’s shared memory capacity. This can also limit the number of active warps.
  • Maximum Warps per SM: There’s a hard limit on how many warps an SM can schedule at any given time, regardless of other resources. This is a hardware design parameter.

The "occupancy" is essentially the ratio of currently active warps to the maximum possible warps that could be active on an SM. A fully occupied SM has all its schedulable warps running. The goal is to get as close to this maximum as possible.

You can check your kernel’s occupancy using nvprof or nsight compute. For example, to profile a kernel named myKernel in an executable my_app:

nvprof --analysis-metrics ./my_app

Look for the "SM Occupancy" metric. It’s often expressed as a percentage. A common target is 80% or higher, but the ideal value depends on the specific kernel and hardware.

To control occupancy, you primarily adjust three things:

  1. Threads per Block: Increasing threads per block (up to a point) often leads to more warps per block. Since each block has at least one warp, more threads per block can mean more warps that can be scheduled on an SM if the SM has enough resources. For example, if your kernel uses 256 threads per block and the SM can schedule 64 warps, that’s 64 warps/SM. If you increase to 512 threads per block, you might still only be limited by the SM’s scheduling capacity of 64 warps, but you’ve potentially used the SM’s resources more efficiently if other constraints are met.
  2. Registers per Thread: This is often the most direct lever. If nvprof shows low occupancy due to register constraints, you need to reduce the number of registers your kernel uses. This usually involves rewriting code to use fewer local variables, reusing registers, or using techniques like loop unrolling judiciously. For example, if a kernel is using 120 registers per thread and the SM can only support 32 warps with that many registers, but could support 64 warps if each thread used only 60 registers, reducing register usage will increase occupancy.
  3. Shared Memory per Block: Similar to registers, if your kernel uses a lot of shared memory, it can limit the number of blocks (and thus warps) that can reside on an SM simultaneously. Reducing shared memory usage or optimizing its access patterns can free up this resource.

Let’s say nvprof reports that your kernel is register-limited and has an occupancy of 30%. The report might also tell you the maximum number of warps per SM is 64 and your current kernel is only managing to run 19 warps concurrently. It might also show that each thread is using 100 registers, and the SM has a total of 65536 registers, which, when divided by 100 registers/thread, means only 655 threads can be active if registers were the only limit. But because you have 32 threads per warp, that’s only 655 / 32 = ~20 warps. If you can refactor your kernel to use only 50 registers per thread, then the SM could theoretically support 65536 / 50 = ~1310 threads, or 1310 / 32 = ~41 warps. If the SM’s actual warp limit is 64, you’ve now increased your potential occupancy significantly.

The next logical step after maximizing warp occupancy for a compute-bound kernel is to ensure that the warps that are running are getting their data to and from global memory as quickly as possible.

Want structured learning?

Take the full Gpu course →