CUDA streams are the key to overlapping computation and data transfer, allowing your GPU to work on multiple tasks simultaneously instead of waiting for one to finish before starting another.

Imagine you have a GPU. It’s a powerful number cruncher, but it can only do one thing at a time. Normally, if you want to do some computation, you first have to copy data from your CPU’s memory (host) to the GPU’s memory (device), then tell the GPU to compute on that data, and finally copy the results back to the host. This sequence looks like this:

Host -> Device (Data Transfer)
Device (Computation)
Device -> Host (Data Transfer)

This is sequential. The GPU is idle during the data transfers. CUDA streams let us break this dependency. A stream is essentially a sequence of operations that execute in order on the GPU. Crucially, operations on different streams can execute concurrently, and importantly, operations on one stream can overlap with operations on another.

Let’s see this in action. Suppose we have two independent sets of data, dataA and dataB, and we want to perform two independent computations, computeA and computeB, on them.

Here’s a simplified, non-streamed version (conceptually):

// 1. Copy dataA to device
cudaMemcpy(deviceA, hostA, sizeA, cudaMemcpyHostToDevice);

// 2. Compute on dataA
computeA<<<gridA, blockA>>>(deviceA);
cudaDeviceSynchronize(); // Wait for computeA to finish

// 3. Copy resultsA back to host
cudaMemcpy(hostA_results, deviceA_results, sizeA, cudaMemcpyDeviceToHost);

// 4. Copy dataB to device
cudaMemcpy(deviceB, hostB, sizeB, cudaMemcpyHostToDevice);

// 5. Compute on dataB
computeB<<<gridB, blockB>>>(deviceB);
cudaDeviceSynchronize(); // Wait for computeB to finish

// 6. Copy resultsB back to host
cudaMemcpy(hostB_results, deviceB_results, sizeB, cudaMemcpyDeviceToHost);

Notice the cudaDeviceSynchronize() calls. These force the CPU to wait until the GPU is completely done with all its work up to that point. This is inefficient.

Now, let’s use streams to overlap. We’ll create two streams: streamA and streamB.

// Create streams
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

// --- Operations for Data A ---
// 1. Copy dataA to device on streamA
cudaMemcpyAsync(deviceA, hostA, sizeA, cudaMemcpyHostToDevice, streamA);

// 2. Launch computeA on streamA
computeA<<<gridA, blockA, 0, streamA>>>(deviceA);

// --- Operations for Data B ---
// 3. Copy dataB to device on streamB
cudaMemcpyAsync(deviceB, hostB, sizeB, cudaMemcpyHostToDevice, streamB);

// 4. Launch computeB on streamB
computeB<<<gridB, blockB, 0, streamB>>>(deviceB);

// --- Asynchronous result retrieval ---
// 5. Copy resultsA back to host on streamA
cudaMemcpyAsync(hostA_results, deviceA_results, sizeA, cudaMemcpyDeviceToHost, streamA);

// 6. Copy resultsB back to host on streamB
cudaMemcpyAsync(hostB_results, deviceB_results, sizeB, cudaMemcpyDeviceToHost, streamB);

// Wait for all operations on both streams to complete
cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);

// Destroy streams
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);

In this streamed version, cudaMemcpyAsync and kernel launches are enqueued onto their respective streams. The GPU can start computeA on streamA as soon as dataA is transferred, while dataB is being transferred on streamB. Even better, the CPU doesn’t block on cudaMemcpyAsync. It just enqueues the operation and returns immediately. The GPU can start computeB as soon as dataB arrives, potentially while computeA is still running. The final cudaMemcpyAsync calls to retrieve results can also be issued asynchronously. The cudaStreamSynchronize calls at the end are the only points where we must wait for everything to finish.

The problem this solves is maximizing GPU utilization. By keeping the GPU busy with computation while data is moving, and by moving data for the next task while the current task is computing, we drastically reduce idle time. The "computation" here can be a kernel launch, a cudaMemcpy, or even a cudaEvent recording. The "data transfer" is typically cudaMemcpyAsync.

Internally, the CUDA driver and runtime manage these streams. When you enqueue an operation on a stream, it’s added to that stream’s queue. The GPU hardware pulls operations from these queues. Operations within a single stream must execute in the order they were enqueued. However, the hardware can interleave operations from different streams. For example, if streamA has memcpy then kernelA, and streamB has memcpy then kernelB, the GPU might start kernelA after streamA’s memcpy is done, and simultaneously start streamB’s memcpy. If streamB’s memcpy finishes before streamA’s memcpy, kernelB could start before kernelA even if kernelA was enqueued first overall. The dependency is only within a stream.

The levers you control are:

  1. Stream Creation/Destruction: cudaStreamCreate(), cudaStreamDestroy(). You decide how many streams you need. More streams can mean more overlap, but also more overhead.
  2. Asynchronous Operations: Using cudaMemcpyAsync() and kernel launches with a stream argument (e.g., kernel<<<..., stream>>>()). This is the core of non-blocking behavior.
  3. Synchronization: cudaStreamSynchronize(), cudaDeviceSynchronize(), and cudaEventRecord()/cudaEventSynchronize(). You use these to enforce dependencies or wait for completion. cudaStreamSynchronize(stream) waits only for that specific stream. cudaDeviceSynchronize() waits for all previously issued operations on the device, regardless of stream.

A common pitfall is forgetting to use cudaMemcpyAsync when launching operations on a stream. If you use cudaMemcpy (synchronous) on a stream, it will block the CPU until the copy is complete, defeating the purpose of overlapping. Similarly, if you launch a kernel without specifying a stream (i.e., using the default stream, which is implicitly NULL or 0), that kernel will execute sequentially with other operations in the default stream, and cudaMemcpyAsync on a different stream might not overlap with it as effectively as you’d expect.

The real magic happens when you have multiple independent data processing pipelines. For instance, one stream could handle data loading and preprocessing, another could handle the main computation, and a third could handle post-processing and writing results. This allows the GPU to be continuously fed with work.

When you use cudaMemcpyAsync to copy data from host to device on streamA, and then launch a kernel on streamA, the kernel will only start after the cudaMemcpyAsync has completed on that stream. However, if you then launch a kernel on streamB, that kernel can start executing as soon as its dependencies are met, potentially while the cudaMemcpyAsync for streamA is still in progress. The crucial insight is that the GPU can execute operations from different streams concurrently.

If you have multiple independent kernels that operate on different data, you can assign each kernel to its own stream. The data transfers for these kernels can also be assigned to their respective streams. The GPU will then attempt to execute these concurrently, leading to significant speedups. The amount of overlap you achieve is directly tied to how well you can decouple your workload into independent tasks and manage their execution across multiple streams.

The next step is often understanding how to manage dependencies between streams using CUDA events, which allow you to signal completion of an operation on one stream and have another stream wait for that signal.

Want structured learning?

Take the full Gpu course →