In the latest MLPerf benchmarks, Nvidia TensorRT-LLM boosted the performance of Nvidia Hopper architecture GPUs on the GPT-J LLM nearly 3x over their results just six months ago.

Nvidia TensorRT-LLM is software that speeds and simplifies the complex job of inference on large language models (LLMs).
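For a sense of what that looks like in practice, the snippet below is a minimal sketch based on TensorRT-LLM's high-level Python LLM API in recent releases; the model name, sampling parameters, and output fields shown here are illustrative assumptions and may differ by version, so consult the TensorRT-LLM documentation for your install.

```python
# Hedged sketch: generating text with TensorRT-LLM's high-level Python API.
# The model identifier and parameter values below are placeholders, not a
# reproduction of Nvidia's MLPerf submission configuration.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-hf")   # builds or loads an optimized engine
params = SamplingParams(max_tokens=128)        # cap on generated tokens per request

outputs = llm.generate(
    ["Summarize the MLPerf inference benchmark in one sentence."],
    params,
)

for output in outputs:
    print(output.outputs[0].text)
```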

TensorRT-LLM running on Nvidia H200 Tensor Core GPUs delivered the fastest performance running inference in MLPerf’s biggest test of generative AI to date.

The new benchmark uses the largest version of Llama 2, a state-of-the-art large language model packing 70 billion parameters. The model is more than 10x larger than the GPT-J LLM first used in the September benchmarks.

The memory-enhanced H200 GPUs, in their MLPerf debut, used TensorRT-LLM to produce up to 31,000 tokens/second, a record on MLPerf’s Llama 2 benchmark.

The H200 GPU results include up to 14% gains from a custom thermal solution. It’s one example of innovations beyond standard air cooling that system builders are applying to their Nvidia MGX designs to take the performance of Hopper GPUs to new heights.

Nvidia is sampling H200 GPUs to customers today and will ship them in the second quarter. They’ll be available soon from nearly 20 leading system builders and cloud service providers.

On a per-accelerator basis, Hopper GPUs swept every test of AI inference in the latest round of the MLPerf industry benchmarks.

In addition, Nvidia Jetson Orin remains at the forefront in MLPerf’s edge category. In the last two inference rounds, Orin ran the most diverse set of models in the category, including GPT-J and Stable Diffusion XL.

The MLPerf benchmarks cover today’s most popular AI workloads and scenarios, including generative AI, recommendation systems, natural language processing, speech and computer vision.