Accelerate NumPy Workloads on GPU with CuPy (2026)

CuPy is a NumPy-compatible array library that runs your array code on NVIDIA GPUs. For large arrays, that often makes it dramatically faster than NumPy on the CPU.

Here’s a taste of it in action. Let’s say we have a massive array and we want to perform an element-wise operation.

import time

import numpy as np
import cupy as cp

# Create a large NumPy array on the CPU (about 800 MB of float64)
size = 10000
numpy_array = np.random.rand(size, size)

# Create a CuPy array of the same shape directly on the GPU
cupy_array = cp.random.rand(size, size)

# Warm up: the first call compiles and caches the GPU kernels
cp.mean(cupy_array * 2 + 1)
cp.cuda.Stream.null.synchronize()  # Make sure all setup work has finished

# --- NumPy operation ---
start_time = time.perf_counter()
result_numpy = numpy_array * 2 + 1
np.mean(result_numpy)  # Same reduction as the CuPy run, for a fair comparison
end_time = time.perf_counter()
print(f"NumPy execution time: {end_time - start_time:.4f} seconds")

# --- CuPy operation ---
start_time = time.perf_counter()
result_cupy = cupy_array * 2 + 1
cp.mean(result_cupy)
cp.cuda.Stream.null.synchronize()  # GPU calls are asynchronous; wait before stopping the clock
end_time = time.perf_counter()
print(f"CuPy execution time: {end_time - start_time:.4f} seconds")

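When you run this on a machine with a reasonably capable GPU, CuPy’s execution time should be significantly lower, and the gap widens as the arrays grow. This isn’t magic; it’s about parallel processing. NumPy runs on your CPU, which has a few powerful cores. CuPy uses the thousands of smaller cores on your NVIDIA GPU to perform the same calculation on many data points at once. Two details matter when you time GPU code: CuPy calls return before the GPU has finished, which is why the timing above calls synchronize() before reading the clock, and the first use of each operation pays a one-time kernel compilation cost.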

CuPy is designed to be a drop-in replacement for NumPy, and it covers a large subset of NumPy's API with matching names and signatures. If your code uses NumPy for array manipulation and mathematical operations, you can often switch to CuPy by changing import numpy as np to import cupy as cp and updating the np. prefixes to cp. to match. Functions such as cp.sin(), cp.dot(), cp.mean(), and cp.sum() behave like their NumPy counterparts but execute on the GPU.
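To make the drop-in idea concrete, here is a minimal sketch that binds a single module alias to CuPy when a GPU is usable and to NumPy otherwise. The alias name xp is only a common convention assumed here, and normalize() is a hypothetical function written for this example.

```python
import numpy as np

try:
    import cupy as cp
    # is_available() reports whether at least one CUDA device can be used.
    xp = cp if cp.cuda.is_available() else np
except Exception:
    # CuPy is not installed, or the CUDA driver/runtime could not be loaded.
    xp = np


def normalize(a):
    """Scale an array to zero mean and unit variance using only shared API calls."""
    return (a - xp.mean(a)) / xp.std(a)


data = xp.arange(10, dtype=xp.float64)
z = normalize(data)

# float() works for both backends; with CuPy it copies the scalar to the host.
print(xp.__name__, float(xp.mean(z)), float(xp.std(z)))
```

The body of normalize() never mentions which backend it runs on, which is the property that makes switching cheap.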

Internally, CuPy translates your NumPy-like calls into CUDA kernels. These are small programs that run directly on the GPU. For element-wise operations and reductions, CuPy generates a kernel specialized for the data types involved, compiles it the first time it is needed, and caches it so that later calls skip the compilation. For linear algebra and Fast Fourier Transforms, it delegates to NVIDIA libraries such as cuBLAS and cuFFT that are specifically engineered for GPU performance.
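You can also hand CuPy a kernel body yourself. The sketch below uses cp.ElementwiseKernel to express the same x * 2 + 1 operation as a single user-defined kernel. It illustrates the user-facing kernel API rather than what CuPy does internally for its built-in operations. The kernel name scale_shift is made up for this example, and the CPU branch exists only so the snippet still runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    HAS_GPU = cp.cuda.is_available()
except Exception:
    # CuPy is missing or CUDA could not be initialized.
    HAS_GPU = False

x_host = np.linspace(0.0, 1.0, 5)

if HAS_GPU:
    # 'T' is a type placeholder that is resolved from the input dtype at call time.
    scale_shift = cp.ElementwiseKernel(
        'T x',            # input parameters
        'T y',            # output parameters
        'y = x * 2 + 1',  # CUDA C++ statement applied to every element
        'scale_shift',    # kernel name (hypothetical, chosen for this example)
    )
    # The kernel is compiled on its first call and reused afterwards.
    result = cp.asnumpy(scale_shift(cp.asarray(x_host)))
else:
    # CPU fallback with the same arithmetic.
    result = x_host * 2 + 1

print(result)
```

Writing the whole expression as one kernel avoids the temporary array that x * 2 would otherwise create before the + 1 step.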

The primary levers you control with CuPy are where your data lives and which operations you perform on it. You enable GPU acceleration either by copying data from CPU memory (NumPy arrays) to GPU memory (CuPy arrays) with cp.asarray() or cp.array(), or by creating arrays directly on the GPU with functions such as cp.zeros(), cp.arange(), or cp.random.rand(). Any subsequent operations on these CuPy arrays run on the GPU. When you need a result on the CPU, copy it back with cp.asnumpy() or the array's .get() method.
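A small round trip shows each of these steps. This is a sketch only: the variable names are arbitrary, and the CPU branch is a fallback assumed here so the snippet also runs where no GPU is present.

```python
import numpy as np

try:
    import cupy as cp
    HAS_GPU = cp.cuda.is_available()
except Exception:
    # CuPy is missing or CUDA could not be initialized.
    HAS_GPU = False

host_in = np.arange(6, dtype=np.float32).reshape(2, 3)

if HAS_GPU:
    dev = cp.asarray(host_in)                     # host -> device copy
    dev_new = cp.zeros((2, 3), dtype=cp.float32)  # allocated on the GPU, no host copy
    dev_out = dev * 2 + dev_new                   # computed on the GPU
    host_out = cp.asnumpy(dev_out)                # device -> host copy (same as dev_out.get())
else:
    host_out = host_in * 2 + np.zeros((2, 3), dtype=np.float32)

print(type(host_out).__name__, host_out.tolist())
```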

The real power comes when you can keep your data on the GPU for a sequence of operations. If you perform a calculation on the GPU, then immediately use the result for another GPU calculation without transferring it back to the CPU, you avoid the significant overhead of PCIe bus transfers. This is why code that involves many steps of array manipulation, like in deep learning training or scientific simulations, sees the most dramatic speedups with CuPy.
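As an illustration, the hypothetical function below chains several steps and crosses the bus only twice: once to copy the input in, and once to bring a single scalar back. The function name and the xp alias are assumptions made for this sketch, and it falls back to NumPy when no GPU is usable.

```python
import numpy as np

try:
    import cupy as cp
    xp = cp if cp.cuda.is_available() else np
except Exception:
    # CuPy is missing or CUDA could not be initialized.
    xp = np


def rms_of_centered(host_data):
    """Hypothetical multi-step pipeline: one copy in, one scalar out."""
    a = xp.asarray(host_data)   # the only host -> device transfer
    a = a - xp.mean(a)          # intermediate stays where the input lives
    a = xp.square(a)            # intermediate stays where the input lives
    rms = xp.sqrt(xp.mean(a))   # still a 0-d array on the same device
    return float(rms)           # the only device -> host transfer: one scalar


value = rms_of_centered(np.array([1.0, 2.0, 3.0, 4.0]))
print(value)
```

Calling cp.asnumpy() after each step instead would produce the same number while adding a transfer per step.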

When you’re working with CuPy, remember that the GPU is a separate processing unit with its own memory. NumPy and CuPy arrays are therefore not interchangeable: CuPy functions generally reject NumPy arrays, and CuPy refuses to convert a GPU array to a NumPy array implicitly, so a call such as np.asarray() on a CuPy array raises a TypeError. You must convert explicitly, using cp.asarray() (CPU to GPU) and cp.asnumpy() (GPU to CPU). Each transfer has a cost, so minimizing transfers by keeping data on the GPU for as long as possible is key to maximizing performance.
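If a function has to accept either kind of array, one option is to look up the matching module per call rather than converting. The sketch below relies on cp.get_array_module(), which returns cupy for CuPy arrays and numpy otherwise. The wrapper array_module() and the softplus() helper are hypothetical names introduced for this example, and the wrapper falls back to NumPy when CuPy is unavailable.

```python
import numpy as np

try:
    import cupy as cp
    HAS_GPU = cp.cuda.is_available()
except Exception:
    # CuPy is missing or CUDA could not be initialized.
    cp = None
    HAS_GPU = False


def array_module(a):
    """Return cupy for CuPy arrays and numpy for everything else."""
    if HAS_GPU:
        return cp.get_array_module(a)
    return np


def softplus(a):
    """Hypothetical helper that runs wherever its input already lives."""
    xp = array_module(a)
    return xp.log1p(xp.exp(a))


host = np.array([0.0, 1.0])
out_host = softplus(host)
print(type(out_host).__module__, out_host)

if HAS_GPU:
    out_dev = softplus(cp.asarray(host))  # stays on the GPU, no implicit transfer
    print(type(out_dev).__module__, cp.asnumpy(out_dev))
```

The result comes back as the same array type as the input, so the caller decides when a transfer happens.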

The next hurdle you’ll likely encounter is managing memory on the GPU.