Blackwell ups the ante on training performance

In the MLPerf Training 4.1 industry benchmarks, the Nvidia Blackwell platform delivered impressive results on workloads across all tests – and up to 2.2x more performance per GPU on LLM benchmarks, including Llama 2 70B fine-tuning and GPT-3 175B pretraining.

In addition, Nvidia’s submissions on the Nvidia Hopper platform continued to hold at-scale records on all benchmarks, including a submission with 11,616 Hopper GPUs on the GPT-3 175B benchmark.

The first Blackwell training submission to the MLCommons Consortium – which creates standardised, unbiased and rigorously peer-reviewed testing for industry participants – highlights how the architecture is advancing generative AI training performance.

The architecture includes new kernels that make more efficient use of Tensor Cores. Kernels are optimised, purpose-built math operations like matrix-multiplies that are at the heart of many deep learning algorithms.
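
For readers unfamiliar with the term, the following is a minimal sketch of the operation a matrix-multiply kernel computes, written in plain Python. It is purely illustrative: the function name `matmul` is hypothetical, and it is not Nvidia's implementation, which runs on Tensor Cores and is tuned far beyond this triple loop.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols).

    Matrices are plain lists of lists. This is the textbook triple loop,
    shown only to make the term "matrix-multiply" concrete.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            # Each output element is a dot product of a row of a and a column of b.
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out
```

An optimised kernel produces the same result as this loop, but splits the work into tiles that thousands of GPU threads process in parallel.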

Blackwell’s higher per-GPU compute throughput and significantly larger and faster high-bandwidth memory allow it to run the GPT-3 175B benchmark on fewer GPUs while achieving excellent per-GPU performance.

Taking advantage of larger, higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were able to run the GPT-3 LLM benchmark without compromising per-GPU performance. The same benchmark run on Hopper needed 256 GPUs.
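
To see why memory capacity affects the minimum GPU count, here is a rough back-of-envelope sketch. Everything in it is an assumption rather than a figure from the MLPerf submissions: it uses the common rule of thumb of about 16 bytes of model state per parameter for mixed-precision Adam training, and the per-GPU capacities passed in the usage note (80 GB and 192 GB) are illustrative values. The function name `min_gpus_for_model_state` is hypothetical. The result is only a lower bound for holding weights, gradients and optimizer state; it ignores activations, communication buffers and the parallelism layout, so it does not reproduce the 64 and 256 GPU configurations quoted above.

```python
# Assumption: 2 bytes fp16 weights + 2 bytes fp16 gradients
# + 12 bytes fp32 master weights, Adam momentum and Adam variance.
BYTES_PER_PARAM = 16


def min_gpus_for_model_state(num_params, hbm_gb, bytes_per_param=BYTES_PER_PARAM):
    """Lower bound on GPUs needed just to hold the model state.

    num_params: parameter count of the model (an integer)
    hbm_gb: memory per GPU in gigabytes (10**9 bytes), an assumed integer value
    """
    total_bytes = num_params * bytes_per_param
    hbm_bytes = hbm_gb * 10**9
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // hbm_bytes)
```

Calling `min_gpus_for_model_state(175_000_000_000, 80)` and then the same with `192` shows how the bound shrinks as per-GPU memory grows, which is the effect the paragraph above describes.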

The Blackwell training results follow an earlier submission to MLPerf Inference 4.1, where Blackwell delivered up to 4x more LLM inference performance versus the Hopper generation. Taking advantage of the Blackwell architecture’s FP4 precision, along with the NVIDIA QUASAR Quantization System, the submission revealed powerful performance while meeting the benchmark’s accuracy requirements.

Nvidia platforms undergo continuous software development, racking up performance and feature improvements in training and inference for a wide variety of frameworks, models and applications.

In this round of MLPerf Training submissions, Hopper delivered a 1.3x improvement on GPT-3 175B per-GPU training performance since the introduction of the benchmark.

Nvidia also submitted large-scale results on the GPT-3 175B benchmark using 11,616 Hopper GPUs connected with Nvidia NVLink and NVSwitch high-bandwidth GPU-to-GPU communication and Nvidia Quantum-2 InfiniBand networking.

Nvidia Hopper GPUs have more than tripled scale and performance on the GPT-3 175B benchmark since last year. In addition, on the Llama 2 70B LoRA fine-tuning benchmark, Nvidia increased performance by 26% using the same number of Hopper GPUs, reflecting continued software enhancements.