As enterprises race to adopt generative AI and bring new services to market, the demands on data centre infrastructure have never been greater – training large language models (LLMs) is one challenge, but delivering LLM-powered realtime services is another.
In the latest round of MLPerf industry benchmarks, Inference v4.1, Nvidia platforms delivered leading performance across all data centre tests. The first-ever submission of the upcoming Nvidia Blackwell platform revealed up to 4x more performance than the Nvidia H100 Tensor Core GPU on MLPerf’s biggest LLM workload – Llama 2 70B – thanks to its use of a second-generation Transformer Engine and FP4 Tensor Cores.
The Nvidia H200 Tensor Core GPU delivered outstanding results on every benchmark in the data centre category – including the latest addition to the benchmark, the Mixtral 8x7B mixture of experts (MoE) LLM, which features a total of 46.7 billion parameters, with 12.9 billion parameters active per token.
MoE models have gained popularity as a way to bring more versatility to LLM deployments as they’re capable of answering a wide variety of questions and performing more diverse tasks in a single deployment.
They’re also more efficient since they activate only a few experts per token – meaning they deliver results much faster than dense models of a similar size.
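To make the "few experts per token" idea concrete, here is a minimal Python sketch of top-k expert routing. It is an illustration only, not the Mixtral or MLPerf implementation: the toy experts, the router scores and the function names (`route_top_k`, `moe_layer`) are all hypothetical, and the choice of two active experts out of eight is an assumption about how a model of this kind is typically configured. The final line simply divides the two parameter counts quoted above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    # Renormalise the router scores over the selected experts only.
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))

# Hypothetical toy setup: eight "experts", each just scaling its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_scores = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.5, 1.0]  # made-up values

selected = route_top_k(router_scores, k=2)
output = moe_layer(3.0, experts, router_scores, k=2)

# Share of parameters active per token, using the figures quoted above.
active_fraction = 12.9 / 46.7
print(selected, output, f"{active_fraction:.1%}")
```

In this sketch only two of the eight expert functions are ever called for the input, which is where the compute saving comes from. With the quoted figures, roughly 28% of the model's parameters are active for any one token.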
The continued growth of LLMs is driving the need for more compute to process inference requests. To meet realtime latency requirements for serving today’s LLMs, and to do so for as many users as possible, multi-GPU compute is a must.
Nvidia NVLink and NVSwitch provide high-bandwidth communication between GPUs based on the Nvidia Hopper architecture, bringing significant benefits for realtime, cost-effective large-model inference. The Blackwell platform will further extend NVLink Switch’s capabilities with larger NVLink domains of 72 GPUs.
In addition to the Nvidia submissions, 10 of its partners – ASUSTek, Cisco, Dell Technologies, Fujitsu, Giga Computing, Hewlett Packard Enterprise (HPE), Juniper Networks, Lenovo, Quanta Cloud Technology, and Supermicro – all made solid MLPerf Inference submissions, underscoring the wide availability of Nvidia platforms.
Nvidia platforms undergo continuous software development, racking up performance and feature improvements on a monthly basis.
In the latest inference round, Nvidia offerings including the Nvidia Hopper architecture, Nvidia Jetson platform, and Nvidia Triton Inference Server made significant performance gains. The Nvidia H200 GPU delivered up to 27% more generative AI inference performance over the previous round, underscoring the added value customers get over time from their investment in the platform.
Triton Inference Server is a fully featured open-source inference server that helps organisations consolidate framework-specific inference servers into a single, unified platform. This helps lower the total cost of ownership of serving AI models in production and cuts model deployment times from months to minutes.
In this round of MLPerf, Triton Inference Server delivered near-equal performance to Nvidia’s bare-metal submissions, showing that organisations no longer have to choose between using a feature-rich production-grade AI inference server and achieving peak throughput performance.
Deployed at the edge, generative AI models can transform sensor data such as images and videos into realtime actionable insights with strong contextual awareness. The Jetson platform for edge AI and robotics is uniquely capable of running any kind of model locally – including LLMs, vision transformers, and Stable Diffusion.
In this round of MLPerf benchmarks, Jetson AGX Orin system-on-modules achieved more than a 6.2x throughput improvement and 2.4x latency improvement over the previous round on the GPT-J LLM workload. Rather than developing for a specific use case, developers can now use this general-purpose 6-billion-parameter model to seamlessly interface with human language – transforming generative AI at the edge.