Aurora supercomputer completion marks exascale milestone

The Aurora supercomputer at Argonne National Laboratory is now fully equipped with all 10 624 compute blades, boasting 63 744 Intel Data Center GPU Max Series GPUs and 21 248 Intel Xeon CPU Max Series processors.

“Aurora is the first deployment of Intel’s Max Series GPU, the biggest Xeon Max CPU-based system, and the largest GPU cluster in the world,” says Jeff McVeigh, Intel corporate vice-president and GM of the Super Compute Group. “We’re proud to be part of this historic system and excited for the groundbreaking AI, science and engineering Aurora will enable.”

A collaboration between Intel, Hewlett Packard Enterprise (HPE) and the Department of Energy (DOE), the Aurora supercomputer is designed to unlock the potential of the three pillars of high-performance computing (HPC), namely simulation, data analytics and artificial intelligence (AI), on an extremely large scale.

The system incorporates more than 1 024 storage nodes (using DAOS, Intel’s distributed asynchronous object storage), providing 220PB of capacity at 31TBps of total bandwidth, and leverages the HPE Slingshot high-performance fabric.

Later this year, Aurora is expected to be the world’s first supercomputer to achieve a theoretical peak performance of more than 2 exaflops (an exaflop is 10^18, or a billion billion, operations per second) when it enters the TOP500 list.

Aurora will harness the full power of the Intel Max Series GPU and CPU product family.

The Max Series GPUs are designed to meet the demands of dynamic and emerging HPC and AI workloads. Early results demonstrate leading performance on real-world science and engineering workloads, showing up to twice the performance of AMD MI250X GPUs on OpenMC and near-linear scaling up to hundreds of nodes.

The Intel Xeon Max Series CPU drives a 40% performance advantage over the competition in many real-world HPC workloads, such as earth systems modeling, energy and manufacturing.

From tackling climate change to finding cures for deadly diseases, researchers face monumental challenges that demand advanced computing technologies at scale. Aurora is poised to address the needs of the HPC and AI communities, providing the necessary tools to push the boundaries of scientific exploration.

“While we work toward acceptance testing, we’re going to be using Aurora to train some large-scale open source generative AI models for science,” says Rick Stevens, Argonne National Laboratory associate laboratory director.

“Aurora, with over 60 000 Intel Max GPUs, a very fast I/O system, and an all-solid-state mass storage system, is the perfect environment to train these models.”

How it works:

At the heart of this state-of-the-art system are Aurora’s sleek rectangular blades, housing processors, memory, networking and cooling technologies. Each blade consists of two Intel Xeon Max Series CPUs and six Intel Max Series GPUs.

The Xeon Max Series product family is already demonstrating great early performance on Sunspot, the test bed and development system with the same architecture as Aurora.

Developers are utilising oneAPI and AI tools to accelerate HPC and AI workloads and enhance code portability across multiple architectures.

The installation of these blades has been a delicate operation, with each 70-pound blade requiring specialized machinery to be vertically integrated into Aurora’s refrigerator-sized racks.

The system’s 166 racks accommodate 64 blades each and span eight rows, occupying a space equivalent to two professional basketball courts in the Argonne Leadership Computing Facility (ALCF) data centre.
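The rack, blade and processor counts quoted in this article can be cross-checked with simple arithmetic. The short Python sketch below is illustrative only: the constant and function names are made up for this example, and it assumes every rack is fully populated with blades of the configuration described above.

```python
# Figures quoted in the article; the names are illustrative, not from any Aurora tooling.
RACKS = 166
BLADES_PER_RACK = 64
CPUS_PER_BLADE = 2   # Intel Xeon Max Series CPUs per blade
GPUS_PER_BLADE = 6   # Intel Max Series GPUs per blade


def aurora_totals(racks: int = RACKS, blades_per_rack: int = BLADES_PER_RACK) -> dict:
    """Return blade, CPU and GPU totals, assuming every rack is full."""
    blades = racks * blades_per_rack
    return {
        "blades": blades,
        "cpus": blades * CPUS_PER_BLADE,
        "gpus": blades * GPUS_PER_BLADE,
    }


totals = aurora_totals()
print(totals)
```

Under those assumptions the totals come to 10 624 blades, 21 248 CPUs and 63 744 GPUs, matching the figures given at the top of the article.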

Researchers from the ALCF’s Aurora Early Science Program (ESP) and DOE’s Exascale Computing Project will migrate their work from the Sunspot test bed to the fully installed Aurora. This transition will allow them to scale their applications on the full system.

Early users will stress-test the supercomputer and identify potential bugs that need to be resolved before deployment. Their work includes the effort to develop generative AI models for science, announced recently at the ISC’23 conference.

Featured picture: Credit Argonne National Laboratory