Speed Up Distributed Training with NCCL AllReduce and P2P Transfers (2026)

NCCL is the secret sauce that makes multi-GPU training actually fast, and it achieves this by ditching the naive "send everything to one GPU, then broadcast" approach for something much more clever.

Here’s what it looks like under the hood when you’re training a model across multiple GPUs, say four of them, and you need to average their gradients. Instead of one GPU collecting all gradients and then sending them back out, NCCL orchestrates a ring-based communication. GPU 0 sends its gradient to GPU 1, GPU 1 receives from 0 and sends its own gradient to GPU 2, GPU 2 receives from 1 and sends to GPU 3, and GPU 3 receives from 2 and sends its own gradient to GPU 0. Simultaneously, each GPU is also receiving from its "upstream" neighbor in the ring. This way, every GPU gets a chunk of the gradient from every other GPU, and after one full pass, everyone has the complete averaged gradient.

This is the core idea behind AllReduce, one of NCCL’s most powerful operations. It’s not just about moving data; it’s about transforming it in transit. AllReduce takes an operation (like sum, average, max, min) and applies it to tensors distributed across multiple devices, making the result available on all devices.

Let’s see it in action. Imagine you have two GPUs, gpu:0 and gpu:1, and you want to sum two tensors, tensor_a on gpu:0 and tensor_b on gpu:1. The result, tensor_sum, should be on both.

import torch
import torch.distributed as dist

# Initialize distributed environment (this would typically be done once)
# For demonstration, we'll simulate a simple setup
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", rank=0, world_size=2) # Assume rank 0 and 1

# Create tensors on different GPUs
tensor_a = torch.ones(2, 2).cuda(0)
tensor_b = torch.ones(2, 2).cuda(1)

# Perform AllReduce (sum)
# We need a buffer to store the result, which will be the sum of tensor_a and tensor_b
tensor_sum = torch.zeros(2, 2).cuda(0) # Result will be on GPU 0

# In a real scenario, you'd have a tensor on each GPU that you want to reduce
# For this example, we'll simulate by sending tensor_b to tensor_a's GPU and then reducing
# A more direct NCCL AllReduce would look like this on each GPU:

# On GPU 0:
# input_tensor = torch.randn(2, 2).cuda(0)
# output_tensor = torch.zeros_like(input_tensor).cuda(0)
# dist.all_reduce(input_tensor, op=dist.ReduceOp.SUM, async_op=False) # input_tensor becomes the reduced tensor

# On GPU 1:
# input_tensor = torch.randn(2, 2).cuda(1)
# output_tensor = torch.zeros_like(input_tensor).cuda(1)
# dist.all_reduce(input_tensor, op=dist.ReduceOp.SUM, async_op=False) # input_tensor becomes the reduced tensor

# Let's simulate the outcome for clarity:
# If tensor_a was on GPU 0 and tensor_b was on GPU 1, and we wanted the sum on both:
# After dist.all_reduce(tensor_a, op=dist.ReduceOp.SUM) on GPU 0
# and dist.all_reduce(tensor_b, op=dist.ReduceOp.SUM) on GPU 1,
# tensor_a on GPU 0 would become 2*tensor_a (if it was the only thing on GPU 0)
# tensor_b on GPU 1 would become 2*tensor_b (if it was the only thing on GPU 1)

# The actual operation involves a tensor on each GPU participating.
# Let's consider a scenario where we have identical tensors on both GPUs and we want to sum them.
tensor_gpu0 = torch.ones(2, 2).cuda(0)
tensor_gpu1 = torch.ones(2, 2).cuda(1)

# Perform AllReduce on GPU 0
dist.all_reduce(tensor_gpu0, op=dist.ReduceOp.SUM, async_op=False)
# After this, tensor_gpu0 on GPU 0 will contain the sum of its original value and the value from GPU 1.
# Because it's an AllReduce, the same operation happens on GPU 1 concurrently.

# To verify, let's retrieve the tensor from GPU 1 (assuming it also performed all_reduce)
# In a real distributed run, you'd check tensor_gpu1 on GPU 1.
# For this simulation, we'll just show what tensor_gpu0 on GPU 0 would become.
# If tensor_gpu0 started as 1s and tensor_gpu1 started as 1s, after AllReduce SUM:
# tensor_gpu0 on GPU 0 would be 2s.
# tensor_gpu1 on GPU 1 would also be 2s.
print("Tensor on GPU 0 after AllReduce SUM:", tensor_gpu0)
# Expected output:
# Tensor on GPU 0 after AllReduce SUM:
# tensor([[2., 2.],
#         [2., 2.]], device='cuda:0')

The key here is that NCCL uses optimized communication patterns. For AllReduce, it’s typically a ring-based algorithm as described earlier, or a tree-based algorithm for larger numbers of GPUs. The point is that data doesn’t need to travel to a central point and back. Every GPU participates in sending and receiving, making it highly efficient.

Beyond AllReduce, NCCL also excels at Point-to-Point (P2P) transfers. While AllReduce is for collective operations where all participants need the final result, P2P is for direct communication between two specific GPUs. This is useful for more complex communication patterns or when you don’t need the result on all GPUs.

Consider a scenario where GPU 0 needs to send a tensor to GPU 2, and GPU 1 needs to send a tensor to GPU 3. This is a P2P operation.

import torch
import torch.distributed as dist

# Initialize distributed environment (simulated)
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", rank=0, world_size=4) # Assume ranks 0, 1, 2, 3

# Create tensors on different GPUs
tensor_gpu0 = torch.arange(4).float().cuda(0)
tensor_gpu1 = torch.arange(4, 8).float().cuda(1)

# We want to send tensor_gpu0 to GPU 2 and tensor_gpu1 to GPU 3.
# This requires creating receive buffers on the destination GPUs.

# On GPU 2 (receiving from GPU 0):
recv_tensor_gpu2 = torch.zeros(4).float().cuda(2)
# We'd initiate a receive operation. The send would be initiated on GPU 0.

# On GPU 3 (receiving from GPU 1):
recv_tensor_gpu3 = torch.zeros(4).float().cuda(3)
# Similar receive initiation on GPU 3.

# To demonstrate, let's simulate the send/recv pair.
# In a real distributed run, these calls would be on the respective ranks.

# On Rank 0 (GPU 0) - Send:
send_op_0_to_2 = dist.isend(tensor_gpu0, dst=2)

# On Rank 2 (GPU 2) - Receive:
# We need to know the size and dtype of the tensor being sent.
recv_tensor_gpu2 = torch.zeros(tensor_gpu0.shape, dtype=tensor_gpu0.dtype, device='cuda:2')
recv_op_2_from_0 = dist.irecv(recv_tensor_gpu2, src=0)

# On Rank 1 (GPU 1) - Send:
send_op_1_to_3 = dist.isend(tensor_gpu1, dst=3)

# On Rank 3 (GPU 3) - Receive:
recv_tensor_gpu3 = torch.zeros(tensor_gpu1.shape, dtype=tensor_gpu1.dtype, device='cuda:3')
recv_op_3_from_1 = dist.irecv(recv_tensor_gpu3, src=1)

# Wait for operations to complete
send_op_0_to_2.wait()
recv_op_2_from_0.wait()
send_op_1_to_3.wait()
recv_op_3_from_1.wait()

print("Tensor received on GPU 2 from GPU 0:", recv_tensor_gpu2)
print("Tensor received on GPU 3 from GPU 1:", recv_tensor_gpu3)
# Expected output:
# Tensor received on GPU 2 from GPU 0: tensor([0., 1., 2., 3.], device='cuda:2')
# Tensor received on GPU 3 from GPU 1: tensor([4., 5., 6., 7.], device='cuda:3')

NCCL P2P is crucial for asynchronous operations and custom communication patterns that AllReduce doesn’t cover. It allows for fine-grained control over data movement between specific GPU pairs.

The performance gains from NCCL stem from its deep understanding of the underlying hardware. It uses specialized GPU direct communication paths (like NVLink) and multi-core CPU offload for coordination, minimizing CPU bottlenecks and maximizing GPU utilization. It carefully manages buffer allocation and kernel launches to ensure that communication and computation overlap as much as possible.

One often-overlooked aspect of NCCL performance is its ability to handle multiple outstanding requests. When you issue multiple isend or irecv operations, or when AllReduce is broken down into smaller chunks for pipelining, NCCL can schedule these operations concurrently. This means that while one GPU is busy with a send operation, another can be performing a receive or initiating a different send, keeping the interconnect saturated and hiding latency. The async_op=True (or isend/irecv) is your gateway to this concurrency.

The next hurdle you’ll face is diagnosing performance regressions when your NCCL operations aren’t as fast as expected, often pointing to network topology or incorrect backend initialization.