Write Custom GPU Kernels in Python with OpenAI Triton (2026)

Triton lets you write custom GPU kernels in Python that can approach the speed of hand-tuned CUDA with a fraction of the effort.

Let’s see Triton in action with a simple element-wise addition of two large tensors on the GPU. PyTorch already provides this operation, so the goal here is not to beat it but to show the structure that every Triton kernel shares. The same pattern carries over to operations that generic library kernels do not cover, or that you want to tune for your specific hardware or use case, where Triton's fine-grained control pays off.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(
    x_ptr, y_ptr, output_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    # ID of this program instance along the 1D launch grid
    pid = tl.program_id(axis=0)

    # Offsets of the BLOCK_SIZE elements handled by this program instance
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)

    # Guard against reading or writing past the end of the tensors
    # when n_elements is not a multiple of BLOCK_SIZE
    mask = offsets < n_elements

    # Load data from global memory (inputs are assumed contiguous,
    # so the same linear offsets apply to x, y, and output)
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)

    # Perform the addition
    output = x + y

    # Store the result back to global memory
    tl.store(output_ptr + offsets, output, mask=mask)

def add_kernel_wrapper(x, y):
    assert x.is_cuda and y.is_cuda
    assert x.shape == y.shape
    # The kernel indexes memory linearly, so both inputs must be contiguous
    assert x.is_contiguous() and y.is_contiguous()

    output = torch.empty_like(x)
    n_elements = x.numel()

    # Determine the block size. It must be a power of two (a tl.arange
    # requirement); 1024 is a common starting point.
    BLOCK_SIZE = 1024

    # Launch one program instance per block of BLOCK_SIZE elements
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    add_kernel[grid](
        x, y, output,
        n_elements,
        BLOCK_SIZE=BLOCK_SIZE
    )
    return output

# Example usage:
size = 1024 * 1024 * 10 # 10 * 2**20, about 10.5 million elements
x = torch.randn(size, device='cuda')
y = torch.randn(size, device='cuda')

# Triton version
triton_output = add_kernel_wrapper(x, y)

# PyTorch version for comparison
pytorch_output = x + y

# Verify
assert torch.allclose(triton_output, pytorch_output)
print("Triton and PyTorch outputs match!")

This code defines an add_kernel that runs on the GPU. Each "program" (one instance of the kernel, executed by a group of threads) processes BLOCK_SIZE elements. It reads its unique ID (pid), computes the memory offsets of the elements it will handle, loads data through the pointers (x_ptr, y_ptr), performs the addition, and stores the result through output_ptr. The add_kernel_wrapper function sets up the arguments and launches the kernel over a grid with enough programs to cover every element.
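To make the pid-to-offset mapping concrete, here is a minimal CPU-only sketch in plain Python. It is not Triton code: the helper names cdiv and program_offsets are hypothetical, chosen for illustration, and the tiny sizes are picked so the last block runs past the end of the data. It mirrors the arithmetic pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) and the bounds mask that kernels typically pair with it.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same rounding rule as triton.cdiv."""
    return (a + b - 1) // b


def program_offsets(pid: int, block_size: int, n_elements: int):
    """Return (offsets, mask) for one program instance.

    offsets mirrors pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE);
    mask mirrors offsets < n_elements.
    """
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]
    return offsets, mask


# Tiny illustrative sizes: 10 elements do not divide evenly into blocks of 4
n_elements, block_size = 10, 4
grid = cdiv(n_elements, block_size)  # number of program instances to launch

for pid in range(grid):
    offsets, mask = program_offsets(pid, block_size, n_elements)
    print(pid, offsets, mask)
```

The last program instance receives offsets 8 through 11, but only the first two are in bounds. On the GPU, the mask is what keeps the out-of-range lanes from touching memory.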

The problem Triton solves is the impedance mismatch between high-level deep learning frameworks and the low-level, performance-critical nature of GPU computation. Frameworks like PyTorch and TensorFlow provide abstractions that make it easy to build and train models, but when you need to squeeze out every bit of performance for a specific operation (like a custom attention mechanism, a novel activation function, or a specialized data preprocessing step), their generic operations might not be optimal. Writing raw CUDA C++ is powerful but incredibly complex, time-consuming, and prone to subtle bugs. Triton offers a middle ground: a Python-based domain-specific language (DSL) that compiles down to highly efficient GPU code, often comparable to what an expert would write in CUDA.

Internally, Triton’s compiler takes your Python-like kernel code and lowers it through its own intermediate representation and LLVM into PTX (Parallel Thread Execution) assembly on NVIDIA GPUs, which the driver then compiles for the specific device. It performs numerous optimizations automatically, such as:

Tiling: Breaking down the computation into smaller blocks that fit into faster on-chip memory (shared memory or registers).
Vectorization: Grouping operations on multiple data elements together to utilize the GPU’s wide SIMD (Single Instruction, Multiple Data) units.
Shared Memory Management: Staging data in faster shared memory where it helps, and handling the movement from global GPU memory without the programmer allocating it by hand.
Thread Synchronization: Inserting the barriers that threads within a block need to coordinate correctly when accessing shared data.
Memory Coalescing: Arranging memory accesses so that threads in a warp access contiguous memory locations, maximizing bandwidth.
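Memory coalescing is the easiest of these to reason about with arithmetic. The sketch below is a simplified model, not a measurement. It assumes a 32-thread warp, 4-byte elements, an aligned base address, and a 128-byte transaction segment; real hardware uses finer-grained sectors and caches, so treat the numbers as illustrative. The function name segments_touched is hypothetical.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128  # assumed memory transaction granularity
ELEM_BYTES = 4       # size of a float32 element


def segments_touched(stride_elems: int) -> int:
    """Count the distinct memory segments one warp touches when
    thread t reads element t * stride_elems from an aligned base."""
    addresses = [t * stride_elems * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


for stride in (1, 2, 8, 32):
    print(f"stride={stride:2d} -> {segments_touched(stride)} segment(s)")
```

Under these assumptions a unit stride touches a single segment, while a stride of 32 elements touches one segment per thread. That is why contiguous offsets such as pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) are the friendly case.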

The levers you control in Triton are primarily how you structure your kernel, the BLOCK_SIZE (which affects occupancy and register and shared memory usage), launch options such as num_warps, and how you load and store data. You define the computation at the level of blocks, and Triton’s compiler handles the intricate mapping to GPU hardware.
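As a rough way to see how these levers interact, the following CPU-only sketch tabulates a 1D launch for a few settings. The helper launch_shape is hypothetical, the default of 4 warps is an assumption about your Triton version, and dividing the block evenly across threads is a simplification of how the compiler actually lays out data.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_shape(n_elements: int, block_size: int, num_warps: int = 4):
    """Summarize a 1D launch: number of program instances, threads per
    program, and (as a rough model) elements handled per thread."""
    programs = (n_elements + block_size - 1) // block_size  # ceiling division
    threads = num_warps * WARP_SIZE
    return {
        "programs": programs,
        "threads_per_program": threads,
        "elements_per_thread": block_size / threads,
    }


n = 1024 * 1024 * 10  # same size as the example tensors
for block_size, num_warps in [(256, 4), (1024, 4), (4096, 8)]:
    print(block_size, num_warps, launch_shape(n, block_size, num_warps))
```

Larger blocks mean fewer program instances, each doing more work per thread. Which setting is fastest depends on the kernel and the GPU, so it is worth measuring rather than guessing.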

A common misconception is that you need to manually manage shared memory for performance. Triton does not ask you to allocate or index shared memory yourself: you describe loads, stores, and block-level operations, and the compiler decides when to stage data in shared memory, for example for the operands of a matrix multiply or the intermediate values of a reduction. A simple element-wise addition needs no shared memory at all, since each value is read once, held in registers, and written back. This is a large part of Triton's appeal: the compiler analyzes your kernel’s memory access patterns and handles data placement behind the scenes, keeping frequently used data close to the processing cores.

The next concept you’ll likely encounter is handling more complex data layouts and reductions, which introduce challenges in thread synchronization and data aggregation.
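As a preview of why reductions need aggregation across programs, here is a minimal CPU-only sketch of a two-stage sum in plain Python. The function blockwise_sum is hypothetical. In a real Triton kernel the first stage would typically use tl.sum inside each program, and the second stage would be an atomic add or a small follow-up kernel; that mapping is an assumption about how you would structure it, not something this sketch implements.

```python
def blockwise_sum(values, block_size):
    """Two-stage sum: each 'program' reduces its own block, then the
    per-block partial results are combined in a second pass."""
    n = len(values)
    grid = (n + block_size - 1) // block_size  # ceiling division
    partials = []
    for pid in range(grid):
        start = pid * block_size
        # Slicing clips at the end of the list, playing the role of the mask
        block = values[start:start + block_size]
        partials.append(sum(block))
    return sum(partials), partials


total, partials = blockwise_sum(list(range(10)), block_size=4)
print(partials, total)
```

On the GPU the blocks run concurrently, so the combine step is where the synchronization questions arise.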