CUDA Compute: Parallelism Demystified

You can actually run CUDA code without a dedicated NVIDIA GPU by using a software emulator called CUDA-on-ARM or by leveraging cloud providers that offer GPU instances.

Here’s a look at how CUDA works under the hood, focusing on the fundamental building blocks: kernels, threads, and memory.

Imagine you have a massive dataset and you want to perform the same operation on every single element. Doing this on a CPU, with its few powerful cores, would be painfully slow. This is where CUDA shines. It lets you harness the power of thousands of simpler cores on a GPU to perform these repetitive tasks in parallel.

The core of any CUDA program is the kernel. This is a C/C++ function that you mark with __global__ to indicate it will be executed on the GPU. When you launch a kernel, you’re not just running a function; you’re launching it many times in parallel, with each execution happening on a different thread.

Let’s see this in action. Suppose we want to add two vectors, a and b, and store the result in c.

__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    // ... (CPU-side setup: allocate host memory, initialize data)

    float *d_a, *d_b, *d_c; // Pointers to device memory
    int size = n * sizeof(float);

    // Allocate memory on the GPU
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Copy data from CPU to GPU
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    // Launch the kernel
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result from GPU to CPU
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // ... (CPU-side cleanup: free device memory, process results)

    return 0;
}

In this example, vectorAdd<<<numBlocks, blockSize>>> is the kernel launch. numBlocks determines how many blocks of threads will be created, and blockSize determines how many threads are in each block.

Each thread within a block gets a unique identifier. threadIdx.x is the index of the thread within its block, and blockIdx.x is the index of the block within the grid of blocks. By combining these, int i = blockIdx.x * blockDim.x + threadIdx.x; calculates a unique global index i for each thread. This i is then used to access the correct elements in the input and output arrays (a[i], b[i], c[i]). The if (i < n) check is crucial to prevent threads from accessing memory out of bounds if the total number of threads launched exceeds the size of the data.

The memory hierarchy is critical for performance. When data is on the GPU, it resides in global device memory. This is the largest and slowest memory available to the GPU. Before you can operate on data with a kernel, you must explicitly copy it from host (CPU) memory to device memory using cudaMemcpy. Similarly, results must be copied back.

Within the kernel, threads within the same block can cooperate using shared memory. Shared memory is much faster than global memory but is limited in size and is local to a block. You declare it using __shared__.

__global__ void sharedMemoryExample(float* data, int n) {
    __shared__ float sdata[256]; // Shared memory for a block

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data from global memory to shared memory
    if (i < n) {
        sdata[tid] = data[i];
    }
    __syncthreads(); // Synchronize threads within the block

    // Perform computation using shared memory
    if (i < n) {
        // Example: Sum reduction within the block
        float val = sdata[tid];
        // ... (reduction logic using val and sdata)
        // data[i] = ... (write result back to global memory)
    }
}

The __syncthreads() call is essential. It ensures that all threads in a block reach that point before any thread proceeds. This is vital for operations like reduction or stencil computations where threads depend on data loaded by other threads in the same block.

Beyond global and shared memory, there are constant memory and texture memory, each with specific caching mechanisms and access patterns that can boost performance for certain workloads. Local memory is essentially a per-thread cache of global memory, often used when compiler optimizations can’t keep data in registers. Registers are the fastest memory, private to each thread, but are extremely limited in number.

When you launch a kernel, the GPU groups threads into thread blocks. These blocks are then executed by streaming multiprocessors (SMs) on the GPU. An SM can execute multiple thread blocks concurrently. The GPU schedules threads within a block in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. If threads in a warp take different paths due to conditional branches (if statements), this leads to warp divergence, where the SM has to execute both paths serially for that warp, significantly reducing efficiency.

A key concept often overlooked is how the GPU handles memory requests. When multiple threads in a warp request data from global memory, the GPU tries to combine these requests into a single, larger transaction to maximize bandwidth. This is called coalesced memory access. If threads access memory non-contiguously, the memory request is broken into multiple smaller transactions, leading to much lower effective bandwidth. For example, if threadIdx.x accesses data[threadIdx.x * stride], where stride is not a power of 2, you’re likely to get poor coalescing.

The next hurdle you’ll likely face is understanding occupancy, which is the ratio of active warps to the maximum number of warps an SM can handle, and how to tune grid and block dimensions to maximize it without running into resource limitations like shared memory or register usage.