Write CUDA Kernels in Python with Numba and CuPy (2026)

Numba lets you write CUDA kernels in Python. It is more than a JIT compiler for numerical loops: from Python you choose the grid and block dimensions, manage device memory, and decide what each GPU thread does.

Let’s see it in action. Imagine we want to perform a simple element-wise addition on two large arrays on the GPU.

import numba.cuda  # import the CUDA submodule explicitly so numba.cuda is loaded
import numpy as np

@numba.cuda.jit
def add_arrays_cuda(x, y, out):
    idx = numba.cuda.grid(1)
    if idx < out.size:
        out[idx] = x[idx] + y[idx]

# Create sample data on the host
host_x = np.arange(1000000, dtype=np.float32)
host_y = np.arange(1000000, dtype=np.float32) * 2

# Allocate memory on the device and copy data
device_x = numba.cuda.to_device(host_x)
device_y = numba.cuda.to_device(host_y)
device_out = numba.cuda.device_array_like(host_x)

# Configure the kernel launch
threads_per_block = 128
blocks_per_grid = (host_x.size + (threads_per_block - 1)) // threads_per_block

# Launch the kernel
add_arrays_cuda[blocks_per_grid, threads_per_block](device_x, device_y, device_out)

# Copy the result back to the host
host_out = device_out.copy_to_host()

print(host_out[:10])
# Expected output: [ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]

This code defines a CUDA kernel, add_arrays_cuda, using Numba’s @numba.cuda.jit decorator. The first time the kernel is launched with the [blocks_per_grid, threads_per_block] syntax, Numba compiles it for the given argument types into PTX (Parallel Thread Execution) code, which the NVIDIA driver then loads and executes on the GPU; later launches with the same types reuse the compiled kernel. The numba.cuda.grid(1) call is the key step: it returns the calling thread’s global index in a one-dimensional grid, computed as the thread’s index within its block plus the block index times the block size, so each thread handles its own array element. The if idx < out.size: check prevents out-of-bounds memory access. It is needed because the grid is rounded up to a whole number of blocks: here 7,813 blocks of 128 threads give 1,000,064 threads for 1,000,000 elements, and the extra 64 threads must do nothing.
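To make the indexing and the bounds guard concrete, here is a minimal CPU-only sketch that replays a one-dimensional launch sequentially in plain Python. It is not how Numba executes a kernel; the helper name simulate_launch and the small sizes (10 elements, 4 threads per block) are assumptions chosen for illustration.

```python
def simulate_launch(x, y, out, blocks_per_grid, threads_per_block):
    """Run the kernel body once per simulated thread, sequentially on the CPU.

    Hypothetical helper for illustration only; a real GPU runs threads in parallel.
    """
    skipped = 0
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            # Same arithmetic a 1D grid uses for the global thread index.
            idx = block_idx * threads_per_block + thread_idx
            if idx < len(out):      # bounds guard, as in the kernel
                out[idx] = x[idx] + y[idx]
            else:
                skipped += 1        # thread that falls past the end of the array
    return skipped


n = 10
threads_per_block = 4
# Round up so that every element is covered by some thread.
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block

x = [float(i) for i in range(n)]
y = [2.0 * i for i in range(n)]
out = [0.0] * n

skipped = simulate_launch(x, y, out, blocks_per_grid, threads_per_block)
print(blocks_per_grid * threads_per_block, "threads launched,", skipped, "skipped by the guard")
print(out)
```

With these sizes the simulated grid has 12 threads for 10 elements, so two threads reach the guard and do nothing. Without the guard, those two would index past the end of the arrays.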

Numba and CuPy address the development overhead and complexity of traditional CUDA programming. Writing kernels in C++ means managing memory explicitly, maintaining a compiler toolchain, and hand-coding data transfers between host and device. Numba and CuPy hide much of this. Numba compiles a restricted subset of Python to GPU-executable code, and CuPy provides a NumPy-like interface for GPU arrays. The two interoperate: CuPy arrays expose the CUDA Array Interface, so they can be passed directly to Numba kernels without an extra copy. The goal is near C-level performance for highly parallel tasks without leaving the Python ecosystem.
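As a rough illustration of CuPy’s NumPy-like interface, the sketch below performs the same element-wise addition using array expressions instead of a hand-written kernel. The xp alias and the fallback to NumPy when CuPy or a CUDA device is unavailable are assumptions made so the snippet also runs on a machine without a GPU; they are not part of either library’s API.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption for this sketch: treat "no usable CUDA device" like "no CuPy".
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:
    xp = np  # CPU fallback so the example still runs without a GPU

# The array-creation and arithmetic calls are spelled the same in both libraries.
x = xp.arange(1_000_000, dtype=xp.float32)
y = xp.arange(1_000_000, dtype=xp.float32) * 2
out = x + y  # executes on the GPU when xp is CuPy

# cupy.asnumpy copies a device array back to host memory.
host_out = cp.asnumpy(out) if xp is not np else out
print(host_out[:5])
```

Only the import and the final copy back to the host differ between the two paths, which is the sense in which CuPy acts as a drop-in for many NumPy operations.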

Internally, Numba starts from the Python bytecode of the decorated function. It infers a type for every variable from the argument types, lowers the typed function to LLVM IR, and compiles that IR to PTX with NVIDIA’s NVVM library. Numba does not search the code for parallelism: the kernel body is compiled as written, and the parallelism comes from launching many threads that each run it. For memory management, numba.cuda.to_device allocates device (GPU) memory and copies a host (CPU) array into it, numba.cuda.device_array_like allocates an uninitialized device array with the same shape and dtype, and copy_to_host brings results back. The blocks_per_grid and threads_per_block parameters determine how the work is divided: the GPU schedules whole blocks onto its streaming multiprocessors. Every thread in every block executes the same kernel code, and numba.cuda.grid(1) gives each thread a different index so that it works on a different piece of data.
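The launch arithmetic is easy to check in isolation. The following sketch uses a hypothetical helper, launch_config (not a Numba function), to apply the same round-up division as the example and report how many launched threads have no element to process.

```python
def launch_config(n_elements, threads_per_block=128):
    """Return (blocks_per_grid, idle_threads) for a 1D launch covering n_elements.

    Hypothetical helper for illustration; it only performs integer arithmetic.
    """
    # Ceiling division: the smallest number of blocks that covers every element.
    blocks_per_grid = (n_elements + threads_per_block - 1) // threads_per_block
    # Threads in the last block that fall past the end of the data.
    idle_threads = blocks_per_grid * threads_per_block - n_elements
    return blocks_per_grid, idle_threads


for n in (1_000_000, 1_000_064, 1):
    blocks, idle = launch_config(n)
    print(f"n={n}: {blocks} blocks x 128 threads, {idle} idle")
```

For the 1,000,000-element arrays above this gives 7,813 blocks with 64 idle threads in the last block, which is why the kernel needs its bounds check.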

A subtle point about Numba’s CUDA support concerns control flow inside kernels. Kernels can use ordinary Python loops and branches, and they can call helper functions compiled with @numba.cuda.jit(device=True), which run on the GPU and may return values. What a kernel cannot do is launch another kernel: Numba’s documentation lists dynamic parallelism among the CUDA features it does not implement, so nested launches that CUDA C++ allows are not available here. Algorithms with adaptive or multi-stage structure are therefore usually written as a sequence of kernel launches orchestrated from the host, with shared logic factored into device functions.

A natural next step is shared memory: threads within a block can stage data in fast on-chip memory (numba.cuda.shared.array), synchronize with numba.cuda.syncthreads(), and cooperate on that data to reduce global memory accesses.