MLPerf benchmarks tell a story about GPU performance, but not the one most people think. It’s not about which GPU is "fastest" in an absolute sense, but rather which GPU is best suited for the specific type of ML workload you’re running, because the benchmarks are designed to stress different aspects of GPU architecture.

Let’s look at the MLPerf Training benchmark. Imagine you’re training an image classification model like ResNet-50. The headline metric is "time to train": the wall-clock time needed to reach a fixed target accuracy. Training throughput (how many samples per second the system processes) is not the official score, but it is a useful derived number when comparing systems.

Here’s a simplified, illustrative snapshot of a ResNet-50 training run on an NVIDIA A100. The field names and values are for discussion only and do not follow the official MLPerf results format:

{
  "benchmark": "resnet-50",
  "scenario": "training",
  "hardware": "NVIDIA A100",
  "results": {
    "time_to_train": {
      "value": 18.5,
      "unit": "minutes"
    },
    "training_throughput": {
      "value": 2150,
      "unit": "samples/sec"
    },
    "power_consumption": {
      "value": 400,
      "unit": "Watts"
    }
  },
  "details": {
    "gpu_utilization": "98%",
    "memory_bandwidth_usage": "75%",
    "compute_utilization": "95%",
    "interconnect_bandwidth": "100 GB/s"
  }
}

This output is a snapshot. The real value is in comparing it to, say, an NVIDIA V100 or an AMD Instinct MI250X. You’d see differences in time_to_train and training_throughput that trace back to the underlying hardware architecture. High compute_utilization and interconnect_bandwidth (on the A100, its NVLink links) matter most for large models that involve massively parallel matrix multiplications and frequent data movement between GPUs.
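To make that comparison concrete, here is a minimal Python sketch that turns two snapshots into a speedup and an energy-efficiency figure. The `TrainingResult` class and both helper functions are hypothetical names introduced for illustration. The A100 numbers are the illustrative ones from the snapshot, and the second system is a made-up placeholder, not a measured result.

```python
from dataclasses import dataclass


@dataclass
class TrainingResult:
    hardware: str
    time_to_train_min: float
    throughput_samples_per_sec: float
    power_watts: float


def speedup(baseline: TrainingResult, candidate: TrainingResult) -> float:
    """How many times faster the candidate reaches the target accuracy."""
    return baseline.time_to_train_min / candidate.time_to_train_min


def samples_per_joule(result: TrainingResult) -> float:
    """Throughput divided by power: (samples/sec) / (joules/sec)."""
    return result.throughput_samples_per_sec / result.power_watts


# Illustrative values from the snapshot in the text.
a100 = TrainingResult("NVIDIA A100", 18.5, 2150, 400)
# Placeholder values for a hypothetical second system (not measured).
other = TrainingResult("hypothetical GPU B", 37.0, 1100, 300)

print(f"speedup of A100 over B: {speedup(other, a100):.2f}x")
print(f"A100 samples per joule: {samples_per_joule(a100):.3f}")
print(f"B samples per joule:    {samples_per_joule(other):.3f}")
```

Looking at both numbers side by side is the point: a system can win on time to train and still lose on work done per unit of energy.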

Now, consider the MLPerf Inference benchmark, specifically for BERT. This tests a different muscle.

{
  "benchmark": "bert",
  "scenario": "inference",
  "hardware": "NVIDIA A100",
  "results": {
    "latency": {
      "value": 15.2,
      "unit": "ms",
      "percentile": 99
    },
    "throughput": {
      "value": 1800,
      "unit": "queries/sec"
    },
    "power_consumption": {
      "value": 350,
      "unit": "Watts"
    }
  },
  "details": {
    "gpu_utilization": "85%",
    "memory_bandwidth_usage": "50%",
    "compute_utilization": "70%",
    "cache_hit_rate": "88%"
  }
}

Here, latency (especially the 99th percentile) becomes crucial. BERT inference involves a lot of sequential operations and attention mechanisms that are sensitive to memory access patterns and cache performance. You might see a GPU that’s excellent at raw FP16 throughput for training struggle with the low-latency requirements of BERT inference if its cache hierarchy isn’t optimized for transformer models. The A100’s large L2 cache and high memory bandwidth help, but the cache_hit_rate becomes a more revealing metric here than raw compute FLOPS.
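To see why the 99th percentile tells a different story from the average, here is a small Python sketch that computes a nearest-rank percentile over a list of per-query latencies. The function name and the sample latencies are assumptions made up for illustration; real harnesses may use a different percentile definition.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [10.0] * 98 + [15.0, 40.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

print(f"mean={mean_ms:.2f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

In this made-up sample the mean and median sit near 10 ms while the p99 is 15 ms, so a system can look fine on average and still miss a tail-latency target.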

The key takeaway is that MLPerf isn’t a single leaderboard. It’s a suite of benchmarks, each designed to mimic a specific class of ML problem. When you look at the results, ask yourself:

  1. What kind of models am I running? Are they large, dense, compute-bound training tasks (like ResNet, GPT-3)? Or are they smaller, more latency-sensitive inference tasks (like BERT, object detection on edge devices)?
  2. What are the critical metrics for my use case? Is it raw throughput for training speed, or is it low, consistent latency for real-time inference?
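  3. How does the GPU’s architecture map to these metrics? Look at utilization details like those in the snapshots above, which typically come from profiling rather than from the headline score. High memory bandwidth matters most when a workload is limited by moving data rather than by arithmetic; efficient cache utilization and specialized tensor cores might be more important for transformer inference.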
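One common way to reason about the third question is a roofline-style estimate: a kernel is limited either by peak compute or by memory bandwidth, depending on how many FLOPs it performs per byte moved. The sketch below is a simplification under stated assumptions. The hardware figures are round placeholder numbers rather than any real GPU’s datasheet values, and the two workload intensities are invented for illustration.

```python
def ridge_point(peak_tflops, mem_bw_tb_per_s):
    """FLOPs per byte at which the compute and memory limits meet."""
    return peak_tflops / mem_bw_tb_per_s


def attainable_tflops(peak_tflops, mem_bw_tb_per_s, flops_per_byte):
    """Roofline estimate: the lower of the compute roof and the bandwidth roof."""
    return min(peak_tflops, mem_bw_tb_per_s * flops_per_byte)


def limiting_resource(peak_tflops, mem_bw_tb_per_s, flops_per_byte):
    if flops_per_byte >= ridge_point(peak_tflops, mem_bw_tb_per_s):
        return "compute"
    return "memory bandwidth"


# Placeholder hardware numbers (not a real datasheet).
PEAK_TFLOPS = 300.0
MEM_BW_TB_PER_S = 2.0

# Invented arithmetic intensities (FLOPs per byte) for two kinds of work.
workloads = [("large matrix multiply", 400.0), ("small-batch inference layer", 20.0)]

for name, intensity in workloads:
    limit = limiting_resource(PEAK_TFLOPS, MEM_BW_TB_PER_S, intensity)
    tflops = attainable_tflops(PEAK_TFLOPS, MEM_BW_TB_PER_S, intensity)
    print(f"{name}: limited by {limit}, at most {tflops:.0f} TFLOPS")
```

With these placeholder numbers, the low-intensity workload cannot exceed 40 TFLOPS no matter how much peak compute the chip advertises, which is why headline FLOPS alone can mislead.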

The benchmarks also reveal how well a GPU scales with multiple devices. For training, the interconnect_bandwidth between GPUs (e.g., NVLink for NVIDIA) is paramount. A GPU with slightly lower per-chip compute might outperform a theoretically "faster" one if its interconnect allows for much more efficient data sharing in a multi-GPU setup. MLPerf often reports results for different numbers of GPUs (e.g., 8-GPU systems), which is where you see the impact of these high-speed interconnects.
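A simple way to quantify this is scaling efficiency: measured multi-GPU throughput divided by the ideal of N times the single-GPU throughput. The Python sketch below uses hypothetical function names and made-up throughput figures. It only illustrates the arithmetic and does not model any real interconnect.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Fraction of ideal linear scaling actually achieved (1.0 means perfect)."""
    return multi_gpu_throughput / (single_gpu_throughput * num_gpus)


def projected_throughput(per_gpu_throughput, num_gpus, efficiency):
    """Multi-GPU throughput implied by a given scaling efficiency."""
    return per_gpu_throughput * num_gpus * efficiency


# Made-up example: 2150 samples/sec on one GPU, 15480 samples/sec on eight.
eff = scaling_efficiency(2150, 15480, 8)
print(f"8-GPU scaling efficiency: {eff:.0%}")

# Hypothetical comparison: GPU A is slower per chip but scales better than GPU B.
gpu_a_total = projected_throughput(2000, 8, 0.95)
gpu_b_total = projected_throughput(2200, 8, 0.80)
print(f"GPU A x8: {gpu_a_total:.0f} samples/sec")
print(f"GPU B x8: {gpu_b_total:.0f} samples/sec")
```

In this invented case the chip with lower per-GPU throughput finishes ahead at eight GPUs, purely because less of its capacity is lost to communication.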

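Most people don’t realize that MLPerf results are also split by rule set. Training and Inference are separate benchmark suites. Within Inference, each result is measured under a "scenario" (such as Server or Offline) that defines how queries arrive, and every submission belongs to a "division" (Closed or Open). The Closed division fixes the model and the rules for running it so that systems can be compared like for like, while the Open division allows changes such as a different model, which can significantly boost performance but makes results harder to compare. Submissions in either division are usually heavily tuned, with mixed precision, fused operations, and specific kernel tuning, so they might not reflect a "stock" setup. Understanding which scenario and division a result corresponds to is vital for a fair comparison.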
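In practice, this means only comparing results that carry the same labels. The following Python sketch groups result records by their labels before picking a best entry. The record layout, the field names (`division`, `scenario`, `queries_per_sec`), the label values, and every number are hypothetical placeholders, not the official results schema or real submissions.

```python
from collections import defaultdict


def group_comparable(results):
    """Bucket results so that only like-for-like entries share a group."""
    groups = defaultdict(list)
    for record in results:
        key = (record["benchmark"], record["division"], record["scenario"])
        groups[key].append(record)
    return dict(groups)


def best_in_group(group, metric, higher_is_better=True):
    """Pick the best record within one comparable group."""
    pick = max if higher_is_better else min
    return pick(group, key=lambda record: record[metric])


# Placeholder records for illustration only.
results = [
    {"system": "system-a", "benchmark": "bert", "division": "closed",
     "scenario": "offline", "queries_per_sec": 1700},
    {"system": "system-b", "benchmark": "bert", "division": "closed",
     "scenario": "offline", "queries_per_sec": 1500},
    {"system": "system-c", "benchmark": "bert", "division": "open",
     "scenario": "offline", "queries_per_sec": 2600},
]

groups = group_comparable(results)
for key, group in groups.items():
    winner = best_in_group(group, "queries_per_sec")
    print(key, "->", winner["system"])
```

Here `system-c` posts the largest number overall, but it sits in a different group, so it is never ranked against `system-a` and `system-b`.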

Choosing the right GPU involves dissecting these benchmark results, not just looking at the headline numbers. Map the benchmark’s demands to your own workload’s demands, and scrutinize the architectural features that enable the reported performance.

The next step after choosing a GPU based on MLPerf is understanding how to optimize your specific ML framework (like PyTorch or TensorFlow) to fully leverage that GPU’s capabilities.
