Intel Gaudi AI accelerator drives doubled performance on GPT-3

MLCommons has published results of the industry-standard MLPerf training v3.1 benchmark for training AI models, with Intel submitting results for Intel Gaudi2 accelerators and 4th Gen Intel Xeon Scalable processors with Intel Advanced Matrix Extensions (Intel AMX).

Intel Gaudi2 demonstrated a 2x performance leap on the MLPerf training v3.1 GPT-3 benchmark after implementing the FP8 data type. The benchmark submissions reinforced Intel’s commitment to bringing AI everywhere with competitive AI solutions.

“We continue to innovate with our AI portfolio and raise the bar with our MLPerf performance results in consecutive MLCommons AI benchmarks,” says Sandra Rivera, Intel executive vice-president and GM of the data centre and AI group.

“Intel Gaudi and 4th Gen Xeon processors deliver a significant price-performance benefit for customers and are ready to deploy today. Our breadth of AI hardware and software configuration offers customers comprehensive solutions and choice tailored for their AI workloads.”

The newest MLCommons MLPerf results build on Intel’s strong AI performance in the previous MLPerf training results from June. The Intel Xeon processor remains the only CPU reporting MLPerf results, and Intel Gaudi2 is one of only three accelerator solutions upon which results are based, only two of which are commercially available.

Intel Gaudi2 and 4th Gen Xeon processors demonstrate compelling AI training performance in a variety of hardware configurations to address the increasingly broad array of customer AI compute requirements.

Gaudi2 continues to be a viable alternative to Nvidia’s H100 for AI compute needs, delivering a significant price-performance benefit. MLPerf results for Gaudi2 showed the AI accelerator’s increasing training performance:

* Gaudi2 demonstrated a 2x performance leap on the v3.1 GPT-3 training benchmark with the implementation of the FP8 data type. Time-to-train fell by more than half compared to the June MLPerf benchmark, with training completed in 153.58 minutes on 384 Intel Gaudi2 accelerators. The Gaudi2 accelerator supports FP8 in both E5M2 and E4M3 formats, with the option of delayed scaling when necessary.

* Intel Gaudi2 demonstrated training on the Stable Diffusion multi-modal model with 64 accelerators in 20.2 minutes, using BF16.

* While FP8 was used only for GPT-3 in this MLPerf training submission and for GPT-J in the previous inference submission, Intel is expanding FP8 support in Gaudi2 software and tools to additional models for both training and inference.

* On eight Intel Gaudi2 accelerators, benchmark results were 13.27 and 15.92 minutes for BERT and ResNet-50, respectively, using BF16.
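
For readers unfamiliar with the FP8 terms in the GPT-3 result above, the following Python sketch may help. It is illustrative only and does not reflect Intel's implementation. It assumes the commonly published FP8 conventions, in which E5M2 reserves its top exponent code for infinities and NaN while E4M3 reserves only a single NaN pattern; the article does not say which encoding details Gaudi2 uses, so the computed ranges may not match the hardware. The `DelayedScaler` class and its `history_len` parameter are hypothetical names. The class shows the general idea usually meant by delayed scaling: the scale for the current step comes from the largest absolute values seen in earlier steps, not from the current tensor.

```python
from collections import deque


def fp8_max_finite(exp_bits, man_bits, ieee_style):
    """Largest finite value of a small float format with bias 2**(exp_bits-1) - 1."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_style:
        # Top exponent code is reserved for inf/NaN, so the largest usable
        # exponent code is one below it, paired with a full mantissa.
        max_exp_code = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Assumption: only the all-ones pattern is NaN, so the top exponent
        # code is usable with every mantissa except all-ones.
        max_exp_code = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return (1 + max_mantissa / 2 ** man_bits) * 2.0 ** (max_exp_code - bias)


# Wider range, less precision (5 exponent bits, 2 mantissa bits).
E5M2_MAX = fp8_max_finite(5, 2, ieee_style=True)
# Narrower range, more precision (4 exponent bits, 3 mantissa bits).
E4M3_MAX = fp8_max_finite(4, 3, ieee_style=False)


class DelayedScaler:
    """Hypothetical helper: picks a scale from the amax of previous steps."""

    def __init__(self, fp8_max, history_len=4):
        self.fp8_max = fp8_max
        self.amax_history = deque(maxlen=history_len)

    def scale(self):
        # Use history only; fall back to 1.0 until a nonzero amax is recorded.
        amax = max(self.amax_history, default=0.0)
        return self.fp8_max / amax if amax > 0 else 1.0

    def step(self, values):
        s = self.scale()  # chosen before inspecting this step's values
        # Scale and saturate to the representable range. Rounding to actual
        # FP8 bit patterns is omitted to keep the sketch short.
        scaled = [max(-self.fp8_max, min(self.fp8_max, v * s)) for v in values]
        # Record this step's amax so that later steps can use it.
        self.amax_history.append(max(abs(v) for v in values))
        return scaled, s


if __name__ == "__main__":
    print("E5M2 max:", E5M2_MAX)
    print("E4M3 max:", E4M3_MAX)
    scaler = DelayedScaler(E4M3_MAX)
    for batch in ([0.5, -2.0], [1.0, 4.0], [0.25, 3.0]):
        scaled, s = scaler.step(batch)
        print("scale:", s, "scaled:", scaled)
```

Because the scale lags behind the data, a sudden jump in magnitude can saturate for one step, as the second batch in the usage loop does. That trade-off is one reason such a scheme would be offered as an option and not as a default.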