Introduction

This paper presents our preliminary work on benchmarking and comparing performance across different processors, with a special focus on Intel's Sapphire Rapids (SPR) [1] and its high bandwidth memory (HBM) feature, which promises a performance boost for memory bandwidth-limited applications. An early study using the SPR architecture reported more than 8.5x faster runtimes for multi-physics codes relative to Intel’s Broadwell architecture when utilizing high bandwidth memory [2]. That paper investigated the runtime of two hydrodynamics applications on Intel Broadwell, as well as on SPR with and without HBM. Another study examined bandwidth limitations in the SPR processor and concluded that the lack of sufficient concurrency in the processor's cores limits bandwidth and explains why the peak is never achieved when using HBM [3]. Wang et al. [4] explored the effects of HBM on several benchmarks and applications, finding that many scientific applications benefit from it. The authors of [5] analyzed the performance of AMD Genoa and Intel Sapphire Rapids CPUs and compared them to older CPU models, using the HPL, HPCC, and NAS parallel benchmarks, as well as LAMMPS, GROMACS, and NWChem. They conclude that both the Intel Sapphire Rapids and the AMD Genoa CPUs provide a significant performance boost of 20% to 50% over older AMD and Intel CPUs. The authors of [6] investigated the SPEChpc 2021 benchmark suite, in MPI-only mode, on Intel Ice Lake and Sapphire Rapids and analyzed the performance in terms of runtime and power/energy.

In this work, we extend these studies by evaluating a different set of benchmarks and applications and by comparing performance across a diverse set of architectures, namely Fujitsu A64FX, Intel Skylake, and AMD Milan. The latest AMD model is not included because it was not available to us for experimentation. All benchmarks and applications were compiled with full optimizations for each architecture. Selected cases were also used to compare Intel Sapphire Rapids with NVIDIA’s Grace-Grace and Grace-Hopper superchips [7].

We also share the observed performance gains from using the new Advanced Matrix Extensions (AMX), which complement the AVX-512 ISA. SPR is the first Intel CPU to implement these tile instructions, which execute on a dedicated accelerator, the tile matrix multiply unit (TMUL), operating on data stored in separate 2-D tile registers. The authors of [8] show improved inference performance for BERT, a deep learning model, by applying quantization and operator fusion on top of Intel’s AMX features; their results are comparable to NVIDIA’s T4 GPU for smaller batch sizes. In our case study, we report the performance of multiple input problems for the convolution operator and examine how the memory type and hardware support for the input data types affect the results.
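Although our measurements rely on oneDNN's JIT-generated kernels rather than hand-written intrinsics, the following minimal sketch (our own illustration, not code from the benchmarks) shows how a single int8 tile multiply-accumulate is expressed with the AMX intrinsics; the tile shapes and the Linux permission request reflect common usage and are simplified assumptions.

```cpp
// Illustrative AMX sketch: multiply one 16x64 int8 tile pair into a
// 16x16 int32 accumulator tile on the TMUL. Requires an AMX-capable CPU,
// a recent Linux kernel, and compilation with -mamx-tile -mamx-int8.
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Tile configuration layout defined by the AMX architecture (palette 1).
struct TileConfig {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row for each tile register
    uint8_t  rows[16];    // number of rows for each tile register
};

int main() {
    // On Linux, user space must request permission to use the tile registers.
    constexpr int ARCH_REQ_XCOMP_PERM = 0x1023;
    constexpr int XFEATURE_XTILEDATA  = 18;
    syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);

    alignas(64) int8_t  a[16][64] = {};   // A tile: 16 rows x 64 int8
    alignas(64) int8_t  b[16][64] = {};   // B tile (re-packed VNNI-style layout)
    alignas(64) int32_t c[16][16] = {};   // C tile: 16x16 int32 accumulators

    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0 = C (16 rows x 64 bytes)
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1 = A
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2 = B
    _tile_loadconfig(&cfg);

    _tile_loadd(1, a, 64);                // load A tile, stride 64 bytes
    _tile_loadd(2, b, 64);                // load B tile
    _tile_zero(0);                        // clear the C accumulator tile
    _tile_dpbssd(0, 1, 2);                // C += A * B (signed int8 -> int32)
    _tile_stored(0, c, 64);               // store C back to memory
    _tile_release();                      // release the tile state

    std::printf("c[0][0] = %d\n", c[0][0]);
    return 0;
}
```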

The results presented here are a first step and will be extended in the future. The paper is organized as follows. We describe the benchmarking protocol, the micro-benchmarks, and three science applications in Section “Materials and Methods”. The next section presents the observed performance for these benchmarks and compares it to other systems. We also include an investigation of how changing various BIOS settings affects the performance and energy consumption of a selected application. This analysis was initially necessary to choose settings that kept the power consumption within data center limits without decreasing the performance of the nodes. Finally, we discuss our key observations.

Materials and Methods

Benchmarking protocol: The systems evaluated in this benchmark analysis are listed in Table 1. The compilers, MPI libraries, and processor-specific flags used can be found in the Appendix. Typically, each application or micro-benchmark was executed five to ten times on each system, and mean values are reported. HBM on the SPR nodes was configured in Flat Mode with Sub-NUMA Clustering 4 (SNC4). In this configuration, memory is spread across 16 NUMA regions, with regions 0–7 containing the cores and DDR5 memory and regions 8–15 containing the HBM. To preferentially use HBM over DDR5 where possible, the tested applications and micro-benchmarks were run with numactl --preferred-many=8-15, unless specified otherwise. In our tests, this setup produced the same performance as binding applications to specific NUMA nodes while being less cumbersome to execute. Hereafter, SPR-HBM denotes runs on the SPR processor in which HBM was preferentially used, while SPR-DDR denotes runs with the default memory configuration. Energy consumption was measured using ipmitool [9] on each system.
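In practice the applications were simply wrapped with numactl as described above; purely for illustration, an equivalent explicit placement of a single buffer with libnuma might look as follows (the node number 8 assumes the SNC4 flat-mode layout described above and would differ on other configurations).

```cpp
// Sketch: placing an allocation on an HBM NUMA node with libnuma.
// Assumes the flat-mode SNC4 layout described above (HBM on nodes 8-15).
// Link with -lnuma.
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    const size_t bytes = size_t(1) << 30;   // 1 GiB test buffer
    const int hbm_node = 8;                 // first HBM-only node in this layout

    // Explicitly place the buffer on the chosen HBM node.
    double *buf = static_cast<double *>(numa_alloc_onnode(bytes, hbm_node));
    if (!buf) { std::perror("numa_alloc_onnode"); return 1; }

    // Touch one double per page so the pages are actually faulted in there.
    for (size_t i = 0; i < bytes / sizeof(double); i += 512) buf[i] = 0.0;

    numa_free(buf, bytes);
    return 0;
}
```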

The analyzed metrics are:

Runtime: The runtime, or time to solution, is often the primary concern of users, as it limits the pace of discovery and determines whether publication deadlines can be met. It is also a good indicator of how well the combination of software and hardware scales.

Energy consumption: This is most important to system managers, who seek to minimize overall energy consumption. Increasingly, however, this information is also passed on to users to inform climate-conscious decisions.

Efficiency: This metric is mainly of interest to system managers, who seek good overall system throughput and cost-efficient utilization.

Peak power consumption: On its own, this metric delivers little useful information, but for a given performance, efficiency, or other target it can inform system design, and it is a key constraint on peak performance. Simple relations between these quantities are given below.
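For clarity, the quantities reported later relate as follows (a simple formalization added here for reference, not taken from any benchmark specification):

\[
E_{\mathrm{total}} \;=\; \int_0^{T} P(t)\,dt \;\approx\; \bar{P}\,T,
\qquad
\mathrm{efficiency} \;=\; \frac{\text{useful work (e.g., FLOP or simulated time)}}{E_{\mathrm{total}}},
\qquad
P_{\mathrm{peak}} \;=\; \max_{0 \le t \le T} P(t),
\]

where \(T\) is the runtime and \(P(t)\) the instantaneous node power.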

For selected benchmarks, results on Graviton3 CPUs and NVIDIA A100 GPUs are included, as those results are available from a previous study [10].

Table 1 Studied systems

Benchmarks

Several smaller benchmarks are investigated to test multiple attributes of the systems. They are listed and described below:

DAXPY and a simple memory copy were used to analyze the memory bandwidth on SPR with and without HBM; a minimal sketch of these kernels follows.
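The sketch below (our own illustration; array size and thread placement are simplified assumptions) shows the two OpenMP kernels and the bandwidth figure that is reported, counting one read plus one write per copied element and two reads plus one write per DAXPY element.

```cpp
// Sketch of the DAXPY and memory-copy kernels used to probe bandwidth.
// Compile with, e.g., -O3 -fopenmp. On NUMA systems, parallel first-touch
// initialization of the arrays would also matter; omitted here for brevity.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = size_t(1) << 27;              // 2^27 doubles (~1 GiB per array)
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const double alpha = 3.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)                 // memory copy: 1 read + 1 write
        c[i] = a[i];
    double t_copy = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)                 // DAXPY: 2 reads + 1 write
        b[i] = alpha * a[i] + b[i];
    double t_daxpy = omp_get_wtime() - t0;

    // Reported bandwidth counts explicit reads and writes only.
    std::printf("copy : %.1f GB/s\n", 2.0 * n * sizeof(double) / t_copy  / 1e9);
    std::printf("daxpy: %.1f GB/s\n", 3.0 * n * sizeof(double) / t_daxpy / 1e9);
    std::printf("checksum: %f\n", c[n / 2] + b[n / 2]);   // keep results live
    return 0;
}
```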

STREAM - We compile the standard source code for all processors except for A64FX, where a tuned version of STREAM is publicly available. The array size is chosen such that at least half of the available node memory is in use. The benchmark is run on all architectures at full subscription.
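As an illustration of this array-size rule, the compile-time array length \(N\) (set via -DSTREAM_ARRAY_SIZE) must satisfy, for the three working arrays of doubles,

\[
3 \cdot 8\,\mathrm{B} \cdot N \;\ge\; \tfrac{1}{2}\,M_{\mathrm{node}},
\]

so that, for a hypothetical node with \(M_{\mathrm{node}} = 512\) GB (used here purely as an example), \(N \gtrsim 1.1 \times 10^{10}\).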

The HPCC (HPC Challenge) benchmark [11] combines multiple benchmarks. Here we report on three of them: High Performance LINPACK (HPL), matrix-matrix multiplication (DGEMM), and the Fast Fourier Transform (FFT). LINPACK solves a dense linear system of equations using all cores in parallel. The performance is measured in Giga Floating Point Operations Per Second (GFLOPS) and corresponds to the performance of the application on all allocated compute resources. We also report GFLOPS/Core, which is the total GFLOPS divided by the number of cores.
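The reported GFLOPS follow the standard operation counts for these kernels (lower-order terms omitted), divided by the measured runtime:

\[
\mathrm{FLOP}_{\mathrm{HPL}} \approx \tfrac{2}{3}N^{3},
\qquad
\mathrm{FLOP}_{\mathrm{DGEMM}} = 2\,M N K,
\qquad
\mathrm{FLOP}_{\mathrm{FFT}} \approx 5\,N \log_{2} N.
\]

GFLOPS/Core is then this figure divided by the number of cores used.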

The HPCG (High-Performance Conjugate Gradients) benchmark is an alternative to the HPL benchmark (used in HPCC) and uses methods and access patterns common to many PDE solvers [12]. Unlike HPCC, HPCG does not rely on external libraries; instead, vendors provide their own optimized versions. Thus, for the x86 machines we used the Intel version of HPCG, and for the A64FX the version from Cray.
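For illustration (this is not the vendor-optimized HPCG code), the memory-bound access pattern that dominates HPCG-like solvers is essentially a sparse matrix-vector product, sketched here in CSR form:

```cpp
// Illustrative CSR sparse matrix-vector product (y = A*x), the kind of
// memory-bound kernel that dominates HPCG-like PDE solvers.
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::vector<std::size_t> row_ptr;   // size nrows + 1
    std::vector<int>         col_idx;   // size nnz
    std::vector<double>      values;    // size nnz
};

void spmv(const CsrMatrix &A, const std::vector<double> &x,
          std::vector<double> &y) {
    const std::size_t nrows = A.row_ptr.size() - 1;
    #pragma omp parallel for
    for (std::size_t row = 0; row < nrows; ++row) {
        double sum = 0.0;
        // Indirect accesses to x give low arithmetic intensity:
        // roughly 2 flops per 12-16 bytes of memory traffic.
        for (std::size_t k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[row] = sum;
    }
}
```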

The oneAPI Deep Neural Network Library (oneDNN) is an open-source library providing optimized deep learning primitives for CPUs and GPUs. Many deep learning frameworks, such as PyTorch and TensorFlow, use oneDNN as a backend for accelerated computing on CPUs. oneDNN detects the underlying instruction set architecture and uses just-in-time (JIT) code generation to deploy optimized kernels at runtime. In this study, we explore the performance of new features included in the SPR processors, namely the Advanced Matrix Extensions (AMX), and compare them to the best dispatch available on Skylake-X and A64FX processors for the same problem, inputs, and run configuration. To avoid an unfair comparison, we exclude the AMD Milan processor, since it lacks 512-bit vector processing units.
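The ISA dispatch can also be inspected or capped programmatically; the short sketch below uses oneDNN's CPU dispatcher-control API as we understand it for v3.x (function and enum names are assumptions to be checked against the installed headers). In our runs, the default, i.e., best available, dispatch was used.

```cpp
// Sketch: querying and capping the ISA that oneDNN's JIT kernels may target.
// Build against oneDNN v3.x and link with -ldnnl.
#include <dnnl.hpp>
#include <cstdio>

int main() {
    // Highest ISA the dispatcher will currently target on this CPU.
    dnnl::cpu_isa isa = dnnl::get_effective_cpu_isa();
    std::printf("effective cpu isa id: %d\n", static_cast<int>(isa));

    // Optionally cap the dispatch, e.g. to plain AVX-512, to mimic a
    // Skylake-X run on SPR (AMX/BF16 kernels are then excluded).
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core);

    // Primitives created after this point will JIT at most AVX-512 code.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    (void)eng;
    return 0;
}
```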

Applications

Scientific applications are investigated to better understand whether, and how, SPR with and without HBM can benefit real-life applications.

GROMACS is a software package for the simulation of biomolecular systems such as proteins, membranes, DNA, and RNA [13]. It calculates how atoms move over time under a classical physics approximation by solving ordinary differential equations based on Newton’s second law. Three systems, consisting of 82 K, 2 M, and 12 M atoms, were used as benchmarks [14].
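Concretely, the equations of motion and a leapfrog-type update are (shown schematically; GROMACS offers several integrators and constraint algorithms on top of this)

\[
m_i \frac{d^{2}\mathbf{r}_i}{dt^{2}} = \mathbf{F}_i = -\nabla_{\mathbf{r}_i} V(\mathbf{r}_1,\dots,\mathbf{r}_N),
\qquad
\mathbf{v}_i^{\,t+\Delta t/2} = \mathbf{v}_i^{\,t-\Delta t/2} + \frac{\Delta t}{m_i}\,\mathbf{F}_i^{\,t},
\qquad
\mathbf{r}_i^{\,t+\Delta t} = \mathbf{r}_i^{\,t} + \Delta t\,\mathbf{v}_i^{\,t+\Delta t/2}.
\]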

OpenFOAM is a library and a collection of applications for the numerical solution of partial differential equations [15]. The test case is a calculation of the incompressible airflow around a motorcycle and is based on a case included in the OpenFOAM suite (incompressible/simpleFoam/motorBike). We increased the initial grid 2, 4, and 6 times in each direction to increase the resolution and problem size. The grid is further refined around the obstacle, and the Navier–Stokes equations are solved on an unstructured grid. The resulting meshes consist of 2 M, 11 M, and 35 M cells. Total wall time was used as the performance metric (smaller is better).
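For reference, the equations solved here are the incompressible continuity and momentum (Navier–Stokes) equations, written in the steady form used by simpleFoam (turbulence closure omitted for brevity; the effective viscosity \(\nu_{\mathrm{eff}}\) includes the modeled turbulent contribution):

\[
\nabla \cdot \mathbf{u} = 0,
\qquad
(\mathbf{u} \cdot \nabla)\,\mathbf{u} = -\frac{1}{\rho}\nabla p + \nabla \cdot \bigl(\nu_{\mathrm{eff}}\,\nabla \mathbf{u}\bigr).
\]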

ROMS, the Regional Ocean Modeling System [16, 17], is an ocean model widely used in the scientific community. It is a free-surface, terrain-following, primitive-equation ocean model. The test case used in this work simulates the flow around the western Antarctic Peninsula [18]. ROMS was chosen as a test application because it is an important local application with performance characteristics representative of other structured-grid codes.

HPCC, HPCG, GROMACS, and OpenFOAM were executed on all cores of a single node (96 cores) using MPI for parallel execution.

The input parameters and automation procedures (batch job submission and monitoring, output parsing) for HPCC, HPCG, GROMACS, and OpenFOAM were adopted from the XDMoD application kernel module [10, 19].

Results

Here we focus on runtime, memory bandwidth, and energy consumption.

Benchmarks

Memory bandwidth. STREAM TRIAD shows a sustained HBM bandwidth on SPR of 1,360.5 GB/s, 3.5 times higher than the DDR bandwidth on the same node (Table 2, STREAM column). This is consistent with the results obtained by Wang et al. [4]. We also studied the bandwidth as a function of the number of threads and their distribution using the memory copy benchmark (Fig. 1). Impressively, with HBM, increasing the number of threads up to full subscription of a node improves the memory bandwidth almost linearly, whereas in DDR mode the bandwidth saturates quickly. The cumulative single-node memory copy bandwidth was more than 4.2x higher for SPR-HBM than for SPR-DDR (Fig. 1). In addition, the spread thread distribution consistently outperformed close thread placement, except at the highest thread count tested.

Table 2 Benchmarks: STREAM, HPCC (DGEMM, LINPACK, and FFT)
Fig. 1

Memory copy routines showing memory bandwidth patterns

The HPCC benchmark uses BLAS and FFT libraries, which are common in scientific applications. Matrix-matrix multiplication (DGEMM) is one of the few practical calculations capable of approaching the theoretical peak FLOPS, in part due to cache-efficient algorithms that significantly reduce the memory bandwidth requirements. As can be seen from Table 2 (matrix multiplication column), SPR demonstrates the highest per-core and per-node performance, with HBM bringing an additional 13% improvement. In the LINPACK test, the memory bandwidth requirements are higher, and the HBM systems (A64FX and SPR-HBM) show better performance (Table 2, LINPACK column). SPR-DDR has a per-core performance similar to the Intel Skylake-X CPU, and HBM brings an additional 29% improvement, making SPR-HBM the fastest system both per core and per node. FFT has even higher memory bandwidth requirements than LINPACK and DGEMM, and systems with faster memory show better performance (Table 2, FFT column). SPR shows the fastest per-core and per-node results, with HBM responsible for an 11% improvement.

The HPCG benchmark reflects the performance of many PDE solvers. SPR-DDR exhibits performance similar to previous generations, while using HBM brings an impressive 2.4x increase in performance (Table 3).

Table 3 Performance in HPCG Benchmark

oneDNN is publicly available on GitHub; we chose the latest release, v3.2, for this work. For benchmarking, we measure the performance of the convolution driver of benchdnn, a benchmarking harness provided with oneDNN. We selected the 10 sample inputs listed in Table 4 for the convolution driver. The naming convention is a sequence of descriptor-size pairs: mb stands for mini-batch size; ic and ih stand for input channels and input height; oc and oh stand for output channels and output height; and kh, sh, and ph stand for kernel, stride, and padding height, respectively. For example, mb256ic64ih56oc256oh56kh1ph0n means a mini-batch size of 256, 64 input channels, input height 56, 256 output channels, output height 56, kernel height 1, and padding height 0. Next, we selected three input configurations to test, i.e., the input data types to the operator: signed and unsigned 8-bit integers (important for inference), 32-bit floating point (fp32), and BrainFloat-16 (BF16), the latter two being important for both training and inference. The low-precision data types are chosen because SPR supports them natively and they can leverage the TMUL accelerator via AMX instructions.

Table 4 benchDNN: input problems
Table 5 benchDNN: best CPU dispatch available

For the SPR chips, we use numactl to bind memory explicitly to the HBM or DDR NUMA nodes corresponding to the CPUs running the benchmark. This helps us understand the role HBM plays in the observed performance. For each supported data type there is a CPU dispatch control (except on A64FX), which selects the ISA to be used. Here we use the default, which is the best available for the input configuration; the resulting dispatches are listed in Table 5.

We compare the performance of the convolution operator with the same inputs on the three architectures that implement a 512-bit vector length. In Fig. 2, for the samples run in forward mode with the 32-bit floating point data type typically used during training, SPR-HBM is up to 1.7x faster than Skylake-X and up to 3.5x faster than A64FX. The improvements over Skylake-X for this problem reflect the general benefit of the newer Intel hardware, as the same dispatch (AVX512_CORE) is used for the JIT-generated kernels in both cases.

Fig. 2

FWD_B mode (forward mode with bias) used in training with 32-bit floating point input and output

We also highlight the forward-mode runs with the BrainFloat-16 data type, which is commonly used on GPUs. Figure 3 shows the performance gains from using the BF16 data type compared with the other architectures. A64FX is omitted because its JIT kernels do not have a BF16 implementation. Compared with the SPR-HBM results in fp32 mode, we see speedups ranging from 1.79x to 9.2x. This result is important because the BF16 data type can be used for the mixed-precision training available in deep learning frameworks and can accelerate workloads on CPU architectures with native support. Skylake-X suffers from a massive performance degradation, even compared to fp32 mode, because it lacks native support for BF16 instructions. Additional details can be found in the Appendix.

Fig. 3

FWD_B mode (forward mode with bias) with 16-bit BrainFloat data type for input and output

Lastly, Fig. 4 shows the performance of the same inputs in forward mode for inference with 8-bit integer data types. We see speedups of up to 16.1x for the SPR processor over Skylake-X and of up to 17.5x over A64FX when using SPR-HBM. There is also a general benefit of using HBM over DDR on the SPR processor, with speedups of up to 2x. However, across all figures there are some cases with no speedup or an insignificant loss in performance.

Fig. 4

FWD_I mode (forward pass in inference) with 8-bit integer data type input and output

Applications

All our benchmarks show significant improvement over the previous CPU generation, with HBM bringing a substantial performance boost in some cases. We therefore wanted to determine whether these improvements in core libraries and system capabilities translate into gains for real-world applications.

GROMACS and other molecular dynamics applications are responsible for a significant portion of HPC system utilization. SPR performs the fastest among the pure CPU systems (Table 6). However, the use of HBM does not have a significant effect on performance across the tested problem sizes. For the small MEM system (82 K atoms), the SPR CPU is 2.5 times faster than the Milan CPU, but for the larger systems (RIB, 2 M atoms, and PEP, 12 M atoms), SPR is only 31–37% faster. We speculate that, due to its smaller size, the MEM system fits mostly within the CPU caches, allowing more efficient utilization of AVX-512 instructions. As a result, the single-node SPR performance for MEM is only 15% slower than that of an NVIDIA A100 GPU. For the larger problems, SPR is 40–50% slower than the A100 (for a more extensive comparison with modern GPUs, see [7]). Another notable point is that the older Skylake CPU still has strong per-core performance, and multi-node execution can compensate for its lower per-node core count.

Table 6 Performance for GROMACS. GPU performance is given for comparison, see [10] for more details. The GPU systems had four and two GPUs but only one was used. The power measurements also include idle GPUs

OpenFOAM tests demonstrate increased SPR performance for larger problems compared to the previous Intel and AMD CPUs (Table 7). For the smallest problem (2 M cells), SPR performs similarly to Milan, with little effect from HBM. For the larger problems (11 M and 35 M cells), SPR is 25% faster than Milan, and an additional 21–25% improvement can be achieved by using HBM. Interestingly, HBM has no significant effect on the meshing step; most of the performance improvement comes from the solver. This is most likely because the meshing process is dominated by intensive memory allocation and deallocation, for which DDR and HBM perform similarly.

Table 7 Performance in OpenFOAM application

ROMS - The benchmark results in terms of runtime, scaling, and energy consumption are shown in Fig. 5. On SPR, using HBM is nearly two times faster than using DDR (Fig. 5a). The performance of Skylake and SPR-DDR is very similar. A64FX shows poor performance at low core counts but outpaces the other CPUs as the number of cores increases. The code scales well on the A64FX nodes, whereas scaling on the other processors, especially on SPR (both HBM and DDR) and Milan, is poor. The scaling was evaluated as the ratio of the runtime on a single node to that on multiple nodes (Fig. 5b). The energy consumption is depicted in Fig. 5c. SPR, with a peak power consumption of more than 1000 W, exhibits substantial energy consumption, especially at higher core counts. A64FX, on the other hand, with a peak power consumption of around 120 W, is very energy efficient. The energy efficiency of the A64FX was also demonstrated by the A64FX-based Fugaku prototype leading the Green500 list in 2019. In summary, ROMS showcases both a positive and a negative aspect: the favorable performance gain from HBM, and the hard fact that poor multi-node scaling inevitably leads to poor energy efficiency.
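Explicitly, the scaling plotted in Fig. 5b and the corresponding parallel efficiency follow the standard definitions

\[
S(n) = \frac{T(1)}{T(n)},
\qquad
E(n) = \frac{S(n)}{n},
\]

where \(T(n)\) is the runtime on \(n\) nodes; values of \(E(n)\) close to 1 indicate good scaling.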

Fig. 5

Results for running ROMS on different CPUs

Investigating the Effect of BIOS Changes

This section focuses on the effect of changes to the system’s BIOS on performance and energy consumption. We investigated two workload profiles: the default HPC profile and our custom profile (Table 8). The default HPC profile prioritizes performance and disables the CPU’s C-states, which prevents the CPU from entering deeper idle states. To quantify the energy savings and the effect on performance, we tested a custom profile that uses balanced performance settings and allows idle states up to C6, enabling higher energy savings when a node is unused. Our measurements show that the custom profile reduces the idle power consumption from roughly 450 W to roughly 310 W. The BIOS settings were tested with ROMS, GROMACS, and OpenFOAM.

Table 8 Differences in the HPC and the custom profile

Four nodes were configured with each workload profile, and the ROMS application was run five times for each configuration on 1, 2, and 4 nodes. The runtime and energy consumption results are shown in Fig. 6.

Interestingly, both the runtime and the energy consumption decrease slightly under the custom profile, indicating that these settings have a measurable impact. The peak power consumption decreased from 1150 W to 1034 W with the custom BIOS profile.

GROMACS and OpenFOAM show performance degradations of 3% and 6%, respectively, when run with the custom BIOS profile. Still, for underutilized machines, the energy savings might outweigh the performance degradation. Future work will investigate this in more detail.

Fig. 6

Results of ROMS with different BIOS configurations. Left: runtime in minutes vs. number of cores. Right: energy consumption per simulation vs. number of cores

Discussion and Conclusion

In this work, we evaluated the new Intel SPR CPU with its optional HBM and AMX instructions. HBM holds the promise of enhancing the performance of memory-bound applications. Our results with multiple benchmarks and applications demonstrate promising performance improvements with the new SPR even in DDR mode, as is evident for ROMS, GROMACS, and OpenFOAM.

HBM brings a further and often substantial improvement in the benchmarks (2.4 times in HPCG) and real applications (almost doubling for ROMS and 21–25% for OpenFOAM). However, some applications like GROMACS do not benefit from the HBM featured in this processor.

The AMX extension to the Intel x86 ISA, complementing AVX-512, shows significant speedups in our tests, positioning the SPR CPU as a potential option for mixed-precision deep-learning training and inference workloads with appropriate data types.

In centers with diverse workloads, HBM-enabled SPR nodes can offer significant performance enhancements for specific applications such as ROMS. Adjusting the BIOS profile can help keep power consumption low with minimal impact on performance. Future work will involve further BIOS investigations with additional use cases. Furthermore, we plan to extend our analysis of SPR to additional scientific applications and larger, computationally more demanding test cases.