NVIDIA Tensor Cores aren’t just about doing more math; they fundamentally change how floating-point numbers are represented during computation to make things faster.
Let’s see this in action. Imagine training a deep neural network. Typically, weights and activations are stored as 32-bit floating-point numbers (FP32). This gives a wide dynamic range and good precision, but it’s computationally expensive.
import torch
# FP32 tensors
fp32_weight = torch.randn(1024, 1024, device='cuda')
fp32_activation = torch.randn(1024, 1024, device='cuda')
# FP16 tensors
fp16_weight = fp32_weight.half()
fp16_activation = fp32_activation.half()
# FP32 matrix multiplication
fp32_output = torch.matmul(fp32_weight, fp32_activation)
# FP16 matrix multiplication (leveraging Tensor Cores if available)
fp16_output = torch.matmul(fp16_weight, fp16_activation)
# Mixed Precision: Accumulate in FP32, store in FP16
# This is what happens under the hood with libraries like AMP
mixed_precision_output = torch.matmul(fp16_weight.float(), fp16_activation.float()).half()
The core idea behind Tensor Cores and mixed precision is to use lower-precision formats, primarily 16-bit floating-point (FP16), for the bulk of the computations. Tensor Cores are specialized hardware units on NVIDIA GPUs that can perform matrix multiply-accumulate (MMA) operations extremely quickly on FP16 inputs, often accumulating the results in FP32 to maintain accuracy.
When you use mixed precision, you’re strategically choosing which operations are safe to perform in FP16 and which require FP32. The dominant operations in deep learning training, like matrix multiplications and convolutions, benefit the most. Weights and activations are typically cast to FP16 before being fed into the Tensor Core. The multiplication happens in FP16, and the accumulation often happens in FP32. The final result might be cast back to FP16 or FP32 depending on the subsequent operation.
This technique solves several problems. First, FP16 numbers take up half the memory of FP32. This means you can fit larger models or larger batch sizes into GPU memory, which directly translates to faster training throughput. Second, FP16 operations are much faster. Tensor Cores can achieve significantly higher FLOPS (floating-point operations per second) for FP16 matrix multiplications compared to FP32. This reduces the time spent on the most computationally intensive parts of the training process.
The challenge is that FP16 has a smaller dynamic range and less precision than FP32. This can lead to issues like vanishing or exploding gradients, or loss of precision in weight updates. Mixed precision training addresses this by keeping certain critical values in FP32. For example, master weights might be stored in FP32, and gradients are often accumulated in FP32 before being used to update the FP32 master weights. Loss scaling is another crucial technique: the loss is multiplied by a large factor before backpropagation to keep intermediate gradients from underflowing to zero in FP16. This scaled loss is then used for gradient calculation, and after gradients are computed, they are unscaled before being applied to the weights.
The torch.cuda.amp.autocast() context manager and torch.cuda.amp.GradScaler in PyTorch, or similar functionalities in TensorFlow, automate this process. They intelligently decide which operations should run in FP16 and manage the loss scaling, making mixed precision training much easier to implement.
The real magic happens when you combine FP16 operations with FP32 accumulation. The Tensor Core can perform an FP16 x FP16 multiplication and then add that result to an FP32 accumulator in a single, highly optimized step. This "mixed" nature of the operation—FP16 inputs, FP32 accumulation—is what delivers both speed and accuracy. Without the FP32 accumulator, the precision loss from repeated FP16 additions would quickly become a problem.
When you enable mixed precision, the GPU hardware will automatically utilize Tensor Cores for supported operations if they are available and the data types are compatible. This means that the torch.matmul(fp16_weight, fp16_activation) call, when executed on a compatible GPU with Tensor Cores enabled, will automatically leverage this specialized hardware for a significant speedup. The underlying CUDA kernels are designed to detect and use these cores.
The next frontier you’ll likely explore is optimizing gradient accumulation strategies and exploring even lower precision formats like BF16.