NVLink is an NVIDIA technology that provides high-bandwidth, low-latency connections between GPUs, while NVSwitch is a switch that allows multiple NVLink-connected GPUs to communicate with each other at full bandwidth.

Let’s see this in action. Imagine you’re training a massive language model that doesn’t fit on a single GPU. You’ve got four A100 GPUs, each with 40GB of HBM2 memory, and you want to use all 160GB for your model.

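```bash
# Example command to launch a distributed training job on one node with 4 GPUs.
# This assumes a PyTorch script that sets up torch.distributed itself. Note that
# DistributedDataParallel keeps a full model replica on every GPU, so a model that
# does not fit on one GPU needs a sharding strategy such as FullyShardedDataParallel.
torchrun --nproc_per_node=4 --master_addr="localhost" --master_port=29500 your_training_script.py
```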

In this scenario, the GPUs need to exchange gradients and model parameters constantly. If they were connected only via PCIe, this communication would be a bottleneck: a PCIe 4.0 x16 slot tops out at roughly 64 GB/s bidirectional. NVLink on the A100, by contrast, offers up to 600 GB/s of bidirectional bandwidth per GPU, dramatically speeding up inter-GPU communication. NVSwitch takes this further. Instead of relying on fixed pairwise GPU connections, an NVSwitch fabric lets all 4 GPUs talk to each other simultaneously at full NVLink speed, creating a fully connected fabric. This means GPU 0 can talk to GPU 1, GPU 2, and GPU 3 at the same time, and because the switch is designed to be non-blocking, those transfers do not have to contend for a shared link.
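To make the bandwidth gap concrete, here is a rough back-of-the-envelope sketch of how long one gradient all-reduce might take over each interconnect. It uses the standard idealized cost of a ring all-reduce, in which each GPU sends about 2(N-1)/N times the payload. The bandwidth constants are assumed peak per-direction figures (about 32 GB/s for PCIe 4.0 x16, and 300 GB/s for an A100's NVLink, half of the 600 GB/s bidirectional number), and the 1-billion-parameter fp16 model is hypothetical. Latency, protocol overhead, and overlap with compute are all ignored, so treat the results as an illustration rather than a benchmark.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Idealized ring all-reduce time.

    Each GPU sends (and receives) 2 * (N - 1) / N times the payload,
    so the time is that traffic divided by the per-direction bandwidth.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


GB = 1e9

# Assumed peak per-direction bandwidths (real throughput is lower).
PCIE4_X16 = 32 * GB
NVLINK_A100 = 300 * GB

# Hypothetical model: 1 billion parameters, 2 bytes per fp16 gradient.
grad_bytes = 1e9 * 2

pcie_t = ring_allreduce_seconds(grad_bytes, 4, PCIE4_X16)
nvlink_t = ring_allreduce_seconds(grad_bytes, 4, NVLINK_A100)

print(f"PCIe 4.0 x16:  {pcie_t * 1e3:.1f} ms per all-reduce")
print(f"NVLink (A100): {nvlink_t * 1e3:.1f} ms per all-reduce")
print(f"Ratio:         {pcie_t / nvlink_t:.1f}x")
```

Since an all-reduce like this happens on every training step, even a difference of tens of milliseconds per step adds up quickly over a long run.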

The problem this solves is the memory and compute limitations of single GPUs for increasingly large AI models. Training models with billions or trillions of parameters requires distributing the model across multiple GPUs. Without high-speed interconnects, the time spent moving data between GPUs can dwarf the actual computation time, making training infeasible. NVLink provides direct, high-bandwidth links, and NVSwitch extends this to a multi-GPU system, effectively creating a "super GPU" where all connected processors can communicate as if they were directly attached to each other.

Internally, NVLink is a serial, electrical interconnect that uses differential signaling. Each NVLink link consists of multiple lanes (differential pairs), and NVIDIA GPUs expose multiple links (e.g., 12 links on the A100, each carrying 50 GB/s bidirectional, for the 600 GB/s total). NVSwitch is a dedicated silicon chip designed to aggregate these NVLink connections. An NVSwitch can connect multiple GPUs, and multiple NVSwitches can be interconnected to scale to even larger systems. For instance, an NVIDIA DGX A100 system uses six NVSwitch chips to connect its 8 A100 GPUs, providing full bisection bandwidth between all GPUs.
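The 600 GB/s per-GPU figure can be checked with simple arithmetic from the A100's 12 NVLink links. The sketch below assumes 25 GB/s per link in each direction, which is the commonly quoted third-generation NVLink rate. The function name is purely illustrative.

```python
def aggregate_nvlink_gb_per_s(num_links: int, gb_per_s_per_link_per_dir: float):
    """Return (per-direction, bidirectional) aggregate bandwidth in GB/s."""
    per_direction = num_links * gb_per_s_per_link_per_dir
    return per_direction, 2 * per_direction


# A100: 12 links, assumed 25 GB/s per link in each direction.
per_direction, bidirectional = aggregate_nvlink_gb_per_s(12, 25)
print(f"Per direction: {per_direction} GB/s")
print(f"Bidirectional: {bidirectional} GB/s")
```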

The levers you control are primarily in your distributed training framework (like PyTorch’s DistributedDataParallel or TensorFlow’s MirroredStrategy) and in how you configure your hardware. For example, when using torchrun, the --nproc_per_node argument specifies how many processes, typically one per GPU, are launched on each machine. You do not select NVLink explicitly: the communication library (NCCL, the backend PyTorch typically uses for GPU collectives) detects the interconnect topology and uses NVLink and NVSwitch when they are present and properly configured. If you are using multiple machines, you would typically use your cluster manager (like Slurm or Kubernetes) to launch processes on different nodes and assign them ranks. Communication between nodes then happens over network interconnects (like InfiniBand), while NVLink/NVSwitch remains critical for communication within a node.
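As a minimal sketch of what a script launched by torchrun might look like, the example below reads the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets for each process, binds the process to its GPU, and runs one DistributedDataParallel step over the NCCL backend. The helper name read_torchrun_env, the tiny Linear model, and the random batch are placeholders for illustration, not part of any real training setup. Nothing in the script mentions NVLink; the choice of transport is left to NCCL.

```python
import os


def read_torchrun_env(env=os.environ):
    """Return (rank, local_rank, world_size) from the variables torchrun sets."""
    return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])


def main():
    # Imported here so the helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    rank, local_rank, world_size = read_torchrun_env()
    torch.cuda.set_device(local_rank)        # one process per GPU
    dist.init_process_group(backend="nccl")  # NCCL picks the transport

    # Placeholder model; DDP keeps a full replica on every GPU.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Placeholder batch of random data.
    inputs = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = ddp_model(inputs).sum()
    loss.backward()   # gradients are all-reduced across GPUs during backward
    optimizer.step()

    if rank == 0:
        print(f"finished one step on {world_size} GPUs")
    dist.destroy_process_group()


# Only run the training step when launched by torchrun.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

To see which transport was actually chosen, one option is to set the NCCL_DEBUG=INFO environment variable before launching. NCCL then prints its initialization log, which typically shows the paths it selected between GPUs.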

A common misconception is that NVLink and NVSwitch are simply faster versions of PCIe. While they all serve to connect components, NVLink and NVSwitch are designed from the ground up for GPU-to-GPU communication, prioritizing extremely high bandwidth and very low latency. PCIe is a general-purpose bus, and its latency characteristics and bandwidth allocation are not optimized for the massive, synchronous data transfers required in large-scale distributed deep learning. This architectural difference means that for communication-bound workloads, the uplift from NVLink/NVSwitch over PCIe can approach an order of magnitude (600 GB/s versus roughly 64 GB/s of bidirectional bandwidth per A100), not just an incremental gain.

The next challenge you’ll encounter is optimizing data loading and preprocessing to keep these high-bandwidth GPUs fed with data.
