NVLink is faster than PCIe for GPU-to-GPU communication, but PCIe is more versatile for connecting to the CPU and other peripherals.
Let’s see how this plays out in a real training scenario. Imagine we’re training a large language model like GPT-3. This model has billions of parameters, far too many to fit on a single GPU. We need to distribute the model across multiple GPUs, and these GPUs need to constantly exchange gradients and activations during training.
Here’s a simplified view of the data flow during a single training step:
- Forward Pass: Data flows through the network on each GPU. Intermediate activations are computed.
- Gradient Calculation: Gradients are computed on each GPU.
- Gradient Synchronization: This is where the interconnect shines. Gradients calculated on one GPU need to be communicated to other GPUs that hold parts of the model. If the model is split across GPUs, the gradients for parameters on one GPU might depend on activations from another. This involves massive data transfers.
- Parameter Update: Gradients are aggregated, and parameters are updated.
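For the data-parallel case, where every GPU holds a full replica of the parameters, the synchronization and update steps above can be sketched in plain Python. This is a toy simulation, not how NCCL or any framework implements it: the "GPUs" are just lists, and the names `all_reduce_mean` and `sgd_step` are hypothetical helpers introduced for illustration.

```python
from typing import List


def all_reduce_mean(per_gpu_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across simulated GPUs.

    After a real all-reduce, every GPU holds this same averaged vector.
    """
    num_gpus = len(per_gpu_grads)
    return [sum(values) / num_gpus for values in zip(*per_gpu_grads)]


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update applied identically on every replica."""
    return [p - lr * g for p, g in zip(params, grads)]


# Four simulated GPUs, each with gradients from its own mini-batch
# for the same two parameters (made-up values).
local_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]
params = [10.0, 20.0]

avg_grads = all_reduce_mean(local_grads)         # [4.0, 5.0]
new_params = sgd_step(params, avg_grads, lr=0.5)  # [8.0, 17.5]
print(avg_grads, new_params)
```

The averaging itself is trivial; what makes the step expensive on real hardware is that every gradient value has to cross the interconnect.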
In a multi-GPU setup, the GPUs are typically connected in one of two ways:
- PCIe: Each GPU connects to the host over PCIe lanes. GPU-to-GPU traffic has to cross the PCIe fabric, either peer-to-peer through a PCIe switch or the CPU's root complex, or staged through host memory. This is inefficient for frequent, high-bandwidth GPU-to-GPU traffic. Think of it like sending messages between two offices in a building by going through the main reception desk each time.
- NVLink: This is a direct, high-speed interconnect specifically designed for GPU-to-GPU communication. It bypasses the CPU, allowing GPUs to talk to each other much faster. This is like having a direct phone line or a dedicated hallway between those two offices.
Let’s look at some numbers. A PCIe 4.0 x16 slot offers about 32 GB/s in each direction (roughly 64 GB/s bidirectional). A PCIe 5.0 x16 slot doubles that to about 64 GB/s per direction (roughly 128 GB/s bidirectional). NVLink offers significantly more. For example, NVIDIA’s NVLink 3 on the A100 provides 600 GB/s of total bidirectional bandwidth per GPU (across 12 links), and NVLink 4 on the H100 offers 900 GB/s (across 18 links).
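To make these figures concrete, here is a rough back-of-the-envelope calculation of ideal one-way transfer times. It assumes round per-direction peak numbers (about 32 and 64 GB/s for PCIe 4.0 and 5.0 x16, and half of the quoted bidirectional NVLink totals, i.e. 300 and 450 GB/s) and ignores latency and protocol overhead, so real transfers will be slower. The 10 GB payload is an arbitrary example size.

```python
# Approximate per-direction peak bandwidths in GB/s (assumed round figures).
PER_DIRECTION_GB_S = {
    "PCIe 4.0 x16": 32.0,
    "PCIe 5.0 x16": 64.0,
    "NVLink 3 (A100)": 300.0,
    "NVLink 4 (H100)": 450.0,
}


def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Ideal one-way transfer time, ignoring latency and overhead."""
    return size_gb / bandwidth_gb_s


def report(size_gb: float) -> dict:
    """Map each interconnect name to its ideal transfer time in seconds."""
    return {
        name: transfer_seconds(size_gb, bw)
        for name, bw in PER_DIRECTION_GB_S.items()
    }


times = report(10.0)  # a 10 GB payload, chosen only for illustration
for name, seconds in times.items():
    print(f"{name:18s} {seconds * 1000:8.1f} ms")
```

Under these assumptions, NVLink 3 moves the same payload roughly nine times faster than PCIe 4.0 x16.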
The difference in performance is stark when training very large models that require extensive inter-GPU communication. For a model that fits in a single GPU’s memory and trains on that one GPU, the choice of interconnect is far less critical, as most traffic is CPU-to-GPU data loading. But as models scale across multiple GPUs, the bottleneck rapidly shifts to GPU-to-GPU communication.
Consider a scenario where you’re training a model with model parallelism, where different layers of the model reside on different GPUs. During the forward and backward passes, activations and gradients must be exchanged between these GPUs.
- With PCIe: If GPU A needs to send a large tensor (say, 10 GB) to GPU B, the data may be staged through the CPU’s RAM: copied from GPU A to host memory, then from host memory to GPU B. This is slow. Even direct peer-to-peer communication over PCIe carries overhead and is limited by the PCIe bus speed.
- With NVLink: GPU A can send that same 10 GB tensor directly to GPU B at speeds approaching its NVLink bandwidth, completing the transfer in a fraction of the time.
Here’s a snippet of what NVLink topology might look like in `nvidia-smi topo -m`:

```
        GPU0  GPU1  GPU2  GPU3  mlx5_0  mlx5_1
GPU0    X     NV1   NV1   NV1   PIX     PIX
GPU1    NV1   X     NV1   NV1   PIX     PIX
GPU2    NV1   NV1   X     NV1   PIX     PIX
GPU3    NV1   NV1   NV1   X     PIX     PIX
mlx5_0  PIX   PIX   PIX   PIX   X       PHB
mlx5_1  PIX   PIX   PIX   PIX   PHB     X
```
In this output:
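- `X` indicates self-connection.
- `NV1` (or similar, like `NV2`, `NV4`) signifies a direct NVLink connection between GPUs. The number indicates how many NVLinks are bonded together between the two devices.
- `PIX` indicates a PCIe connection that traverses at most a single PCIe bridge.
- `PHB` indicates a PCIe connection that traverses a PCIe Host Bridge (typically the CPU).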
This shows that GPUs 0-3 are directly connected to one another via NVLink, enabling high-speed GPU-to-GPU communication. The `mlx5_0` and `mlx5_1` entries are network adapters, which the GPUs reach over PCIe.
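If you want to inspect a matrix like this programmatically, a small parser is enough. The sketch below works on the trimmed sample shown above, embedded as a string; real `nvidia-smi topo -m` output also includes affinity columns and a legend, which this code does not attempt to handle. `parse_topo` and `nvlink_pairs` are hypothetical helper names.

```python
from itertools import combinations

# Trimmed sample matrix, copied from the output discussed above.
SAMPLE = """\
        GPU0  GPU1  GPU2  GPU3  mlx5_0  mlx5_1
GPU0    X     NV1   NV1   NV1   PIX     PIX
GPU1    NV1   X     NV1   NV1   PIX     PIX
GPU2    NV1   NV1   X     NV1   PIX     PIX
GPU3    NV1   NV1   NV1   X     PIX     PIX
mlx5_0  PIX   PIX   PIX   PIX   X       PHB
mlx5_1  PIX   PIX   PIX   PIX   PHB     X
"""


def parse_topo(text: str) -> dict:
    """Parse a whitespace-separated topology matrix into nested dicts."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    # First token of each row is the device name; the rest are link types.
    return {row[0]: dict(zip(header, row[1:])) for row in body}


def nvlink_pairs(matrix: dict) -> list:
    """Return GPU pairs whose link type starts with 'NV' (direct NVLink)."""
    gpus = [name for name in matrix if name.startswith("GPU")]
    return [
        (a, b)
        for a, b in combinations(gpus, 2)
        if matrix[a][b].startswith("NV")
    ]


matrix = parse_topo(SAMPLE)
pairs = nvlink_pairs(matrix)
print(f"{len(pairs)} NVLink-connected GPU pairs: {pairs}")
```

For four fully NVLinked GPUs, every one of the six possible GPU pairs should appear in the result; a missing pair means that traffic between those two GPUs takes another path.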
When do you absolutely need NVLink?
- Large Model Parallelism: When your model is too big for one GPU and you split it across multiple GPUs, requiring frequent, large data exchanges.
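- Data Parallelism with Large Models: Even with pure data parallelism, every step synchronizes a full set of gradients. The volume depends on the parameter count rather than the batch size, so for large models the gradient synchronization step can become a bottleneck if it is not accelerated by NVLink.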
- Multi-GPU Inference: For real-time inference of massive models, low-latency inter-GPU communication is key.
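To judge whether gradient synchronization is likely to dominate a step, a rough estimate helps. The sketch below uses the standard cost model for a bandwidth-optimal ring all-reduce, in which each GPU sends about 2(N-1)/N times the gradient payload. The model size, precision, GPU count, and per-direction bandwidths are illustrative assumptions, and the estimate ignores latency and any overlap with computation.

```python
def ring_allreduce_seconds(
    num_params: float,
    bytes_per_param: int,
    num_gpus: int,
    bandwidth_gb_s: float,
) -> float:
    """Ideal ring all-reduce time for one full gradient synchronization.

    Each GPU sends roughly 2 * (N - 1) / N times the payload size.
    """
    payload_gb = num_params * bytes_per_param / 1e9
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / bandwidth_gb_s


# Assumed scenario: 1 billion parameters, 2-byte (fp16) gradients, 8 GPUs.
pcie_s = ring_allreduce_seconds(1e9, 2, 8, bandwidth_gb_s=32.0)     # PCIe 4.0 x16
nvlink_s = ring_allreduce_seconds(1e9, 2, 8, bandwidth_gb_s=300.0)  # NVLink 3

print(f"PCIe 4.0 x16: {pcie_s * 1000:.1f} ms per sync")
print(f"NVLink 3:     {nvlink_s * 1000:.1f} ms per sync")
```

If the per-step compute time is of the same order as the PCIe figure, communication is a real fraction of each step and a faster interconnect is worth considering.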
When is PCIe sufficient?
- Small to Medium Models: Models that fit on a single GPU or can be easily distributed across a few GPUs where inter-GPU communication is not the primary bottleneck.
- CPU-Bound Workloads: If your training is dominated by CPU preprocessing or data loading, the GPU interconnect speed is less relevant.
- Cost-Sensitive Deployments: PCIe-based motherboards and CPUs are generally less expensive than those designed for extensive NVLink configurations.
The most surprising thing about NVLink is how it changes the architecture of distributed deep learning. It’s not just about faster pipes; it enables a more tightly coupled view of multiple GPUs, blurring the lines between individual devices and a unified computing resource for certain tasks. For some workloads you can plan memory and bandwidth for a node of NVLinked GPUs almost as if it were a single, larger GPU, which is rarely practical with pure PCIe.
One critical aspect often overlooked is the topology of NVLink connections. Not all NVLink configurations are equal. A fully connected, NVSwitch-based topology (like in an 8-GPU HGX board) offers consistent bandwidth between any two GPUs. However, in some server configurations, only some GPU pairs have a direct NVLink connection, and traffic between the others must traverse intermediate GPUs or fall back to PCIe, which can degrade performance. Always check your server’s NVLink topology.
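One way to reason about a topology is to count how many NVLink hops separate each pair of GPUs. The sketch below runs a breadth-first search over a link map and compares two hypothetical four-GPU layouts, a ring and a fully connected mesh. The layouts and the `hop_counts` helper are made up for illustration and do not describe any specific server.

```python
from collections import deque


def hop_counts(links: dict, source: int) -> dict:
    """Breadth-first search: minimum number of links from source to each GPU."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical layouts: each GPU maps to the GPUs it has a direct link to.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
full = {g: [h for h in range(4) if h != g] for g in range(4)}

ring_hops = hop_counts(ring, source=0)
full_hops = hop_counts(full, source=0)

print("ring:", ring_hops)  # GPU 2 is two hops away from GPU 0
print("full:", full_hops)  # every other GPU is one hop away
```

Any pair that is more than one hop apart is a candidate for reduced bandwidth, which is worth knowing before deciding how to place model partitions.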
The next hurdle you’ll face after optimizing for NVLink is understanding how to effectively utilize that bandwidth with optimized communication libraries like NCCL (NVIDIA Collective Communications Library) and how to profile your application to identify which communication patterns are actually benefiting from NVLink.