1 Introduction

High-performance supercomputers play a vital supporting role in national security, the economy, and social development (Matsuoka et al. 2023; Lu 2019; Asch et al. 2018; Su and Naffziger 2023). Accelerating the development and adoption of supercomputers is therefore of great significance for scientific research, technological innovation, and high-quality economic and social development. In the field of computing, floating-point operations per second (FLOPS) is the standard measure of a supercomputer's performance.

Since the International Supercomputing Conference (ISC) in May 2022 (Top500 2022), Frontier, a supercomputer developed by Oak Ridge National Laboratory (ORNL) in the United States, has ranked first on four consecutive Top500 lists. It is worth emphasizing that Frontier remains the sole exascale system among currently deployed supercomputers. On the Linpack benchmark, it achieved a sustained speed of 1.194 EFLOPS (10^18 operations/s) against a theoretical peak of 1.68 EFLOPS (Top500 2023). Frontier delivers exascale computing using only 22.7 MW of power, for an energy efficiency of 52.59 GFLOPS (10^9 operations/s) per watt. It is also the world's first supercomputer with single-node performance exceeding 100 TFLOPS (10^12 operations/s).
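The efficiency figure follows directly from the sustained performance and the power draw; a quick sanity check using the Top500 numbers quoted above:

```python
# Sanity-check Frontier's reported energy efficiency (Top500 2023 figures).
sustained_flops = 1.194e18   # 1.194 EFLOPS sustained Linpack performance
power_watts = 22.7e6         # 22.7 MW total power

gflops_per_watt = sustained_flops / power_watts / 1e9
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # ≈52.60, matching the quoted 52.59 GFLOPS/W
```

The small residual difference comes from rounding in the published performance and power figures.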

Compute nodes in supercomputers around the world can be classified into two categories by structure. (1) Homogeneous nodes generally consist of a single processor or accelerated processor. Although the computing power of a single node is limited, the entire system can contain 10,000 or even more than 100,000 such nodes; examples include Japan's Fugaku (Shimizu 2020; Sato et al. 2022, 2020), which has over 100,000 nodes, and China's Sunway TaihuLight (Fu et al. 2016; Gao et al. 2021). (2) Heterogeneous nodes typically combine several processors with co-processors/accelerators, interconnected within the node through PCIe or a cache-coherent interface. Accelerators with multi-teraflops throughput greatly enhance single-node computing power, so the total number of nodes in such systems usually does not exceed 10,000.

Building 10-exascale to zettascale (10^21 operations/s) supercomputers in the future faces many challenges, including power consumption control, interconnection among processors and accelerators, and software system design. Enhancing system computing power simply by increasing the number of nodes is impractical. Taking Frontier as an example, the power consumption of a 10-exascale supercomputer built the same way is estimated to exceed 200 MW, making such an implementation cost prohibitive. Improving the performance and energy efficiency of a single node is therefore the more feasible approach. Furthermore, with the convergence of high performance computing (HPC) and artificial intelligence (AI), supercomputers must provide greater computing power while meeting ever higher energy-efficiency demands (Huerta et al. 2020; Moreno-Álvarez et al. 2022; Ward et al. 2019; Blaiszik et al. 2019). To address these challenges, we analyze the compute-node architectures and key technology trends of major supercomputers. We also explore the technical challenges compute nodes face on the path to 10-exascale and even zettascale systems, as well as ways to achieve 1000 TFLOPS compute nodes.
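The 200 MW estimate is consistent with naive linear scaling of Frontier's power draw, before accounting for the interconnect and cooling overheads that grow with system size; a rough sketch:

```python
# Naive linear-scaling estimate: power of a 10-exascale machine built by
# replicating Frontier-class nodes at Frontier's current efficiency.
frontier_power_mw = 22.7   # MW for ~1.194 EFLOPS sustained
frontier_eflops = 1.194

target_eflops = 10.0
estimated_power_mw = frontier_power_mw * target_eflops / frontier_eflops
print(f"~{estimated_power_mw:.0f} MW")  # ~190 MW before overheads
```

Even this optimistic lower bound lands near 200 MW, which is why node-level efficiency gains, rather than node count, are the feasible path forward.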

2 Current status of compute nodes in supercomputers

Fig. 1 illustrates representative supercomputers released from 2012 onwards. Notably, among these systems, Aurora, El Capitan, and Eos are announced exascale supercomputers. This study focuses on analyzing the pivotal technologies of the Summit, Frontier, Aurora, El Capitan, and Eos supercomputers.

Fig. 1

Representative supercomputers released from 2012 to present

The Summit supercomputer represents a significant milestone on the road to exascale computing, owing to its use of the V100 chip, the first to integrate Nvidia's tensor core technology and thereby provide strong support for both HPC and AI applications. Given this significance, we analyze the Summit supercomputer comprehensively and in depth.

2.1 Summit supercomputer

Summit, launched by ORNL in June 2018, took the top position on the Top500 list at that time, a significant milestone on the exascale computing path. By using the NVLink coherent bus instead of the PCIe bus for CPU–GPU communication within nodes, Summit improved communication performance and scalability. It ranked seventh on the latest Top500 list (Table 1). The entire system consists of 4608 compute nodes, with a Linpack benchmark performance of 148.6 PFLOPS (10^15 operations/s), a peak performance of 200.79 PFLOPS, and a power consumption of 10.1 MW (Hines 2018; Preface 2020).

Table 1 The Top10 of global supercomputer (Top500 2023)

From the perspective of hardware architecture, the compute nodes in Summit adopt a heterogeneous approach. Fig. 2 shows that each compute node consists of two 22-core Power9 processors and six Nvidia Tesla V100 accelerators. With the V100, Nvidia introduced the tensor core for the first time, enabling significant acceleration of HPC and AI workloads. In the traditional supercomputing node architecture, the CPU–GPU connection relies on the PCIe bus, which limits bandwidth. In Summit, Power9 and V100 are instead connected through NVLink, a high-speed, high-bandwidth mesh interconnect. However, scaling thousands of nodes into a high-performance cluster requires a state-of-the-art network: Summit employs a non-blocking fat-tree network built on Mellanox EDR InfiniBand, offering high-performance node-to-node communication and filesystem access. The system achieves an energy efficiency of 14.7 GFLOPS/W (Hines 2018; Kahle et al. 2019).

Fig. 2

Conceptual block diagram of Summit supercomputer architecture (Hines 2018; Kahle et al. 2019; Gonzalez et al. 2018; IBM POWER9 NPU team 2018)

Within each Summit compute node, the Power9 CPUs use NVLink 2.0 and are equipped with six NVLink channels arranged in three groups of two. Each two-channel group provides 100 GB/s of bidirectional bandwidth between CPU and GPU. The GV100 GPU likewise features six NVLink 2.0 channels divided into three groups: one group connects to the CPU, while the remaining two connect to different GPUs, each group again providing 100 GB/s of bandwidth. For CPU-to-CPU communication, IBM's X-Bus is used; this 4-byte, 16 GT/s link offers 64 GB/s of bidirectional bandwidth, effectively meeting the communication requirements between the two processors (Gonzalez et al. 2018; IBM POWER9 NPU team 2018).
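The bandwidth figures above decompose cleanly from per-link rates; a small sketch, assuming NVLink 2.0's published 25 GB/s per direction per link:

```python
# Decompose the intra-node bandwidth figures quoted for Summit.

# NVLink 2.0: 25 GB/s per direction per link -> 50 GB/s bidirectional.
nvlink2_bidir_per_link = 2 * 25           # GB/s per link
group_bw = 2 * nvlink2_bidir_per_link     # two links per group
print(group_bw)                           # 100 GB/s per CPU<->GPU group

# IBM X-Bus: 4-byte-wide link at 16 GT/s.
xbus_bw = 4 * 16                          # GB/s
print(xbus_bw)                            # 64 GB/s CPU<->CPU
```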

2.2 Frontier supercomputer

Frontier, developed at ORNL, was the world's first exascale supercomputer, with single-node computing power over 100 TFLOPS. When running the HPL-MxP mixed-precision benchmark (32-bit, 24-bit, and 16-bit) suited to machine learning applications, Frontier delivered 9.95 EFLOPS of throughput (HPL-MxP 2023). Furthermore, Frontier was the world's second most energy-efficient supercomputer on the June 2022 Green500 list, with an energy efficiency of 52.23 GFLOPS/W.

Frontier consists of 74 Hewlett Packard Enterprise (HPE) Cray EX supercomputer cabinets with a total of 9408 nodes. Each node, shown in Fig. 3a, has a single AMD 3rd Generation EPYC 64C 2 GHz CPU and four AMD MI250X GPUs, for a total of 9472 CPUs and 37,888 GPUs in the entire system (AMD 2021; Frontier 2024; Rajaraman 2023). The CPU and GPU are connected through AMD's proprietary Infinity Fabric interconnect, which keeps them coherent and provides a unified view of shared memory. Infinity Fabric offers high-bandwidth, low-latency interconnect channels, effectively optimizing data transfer: it minimizes latency and raises transfer speed between the CPU and GPU. Furthermore, Infinity Fabric enables the CPU and GPU to access shared memory directly, without data replication or additional transfers, leading to improved efficiency.

Fig. 3

Compute node architecture of the a Frontier (Rajaraman 2023) and b Aurora (Gomes et al. 2022) supercomputers

Two compute nodes are integrated onto a printed circuit board, referred to as a blade, which can be installed in a rack. The blades are water-cooled, with each rack accommodating 64 of them. Blades are interconnected through HPE Slingshot, built around a custom-designed 64-port switch that provides an aggregate bandwidth of 12.8 terabits per second (Tb/s). The Slingshot interconnect maximizes throughput and minimizes latency, offering adaptive routing, congestion control, and strong quality of service (QoS). A "Dragonfly" topology interconnects groups of blades, ensuring no more than three hops between any two nodes in Frontier.
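The switch bandwidth figure is consistent with 64 ports at the 200 Gb/s rate Slingshot uses per port; a one-line check:

```python
# Aggregate bandwidth of a 64-port Slingshot switch at 200 Gb/s per port.
ports = 64
port_gbps = 200                           # Gb/s per port
switch_tbps = ports * port_gbps / 1000    # convert Gb/s -> Tb/s
print(switch_tbps)                        # 12.8 Tb/s, the figure quoted above
```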

2.3 Aurora supercomputer

Aurora, a supercomputer co-developed by Intel, HPE, and the U.S. Department of Energy (DOE), has been installed at Argonne National Laboratory (ANL). Despite being in its partial installation phase, Aurora ranks second on the latest Top500 list (Table 1). It is anticipated that upon full installation, Aurora will achieve a peak performance exceeding 2 EFLOPS, surpassing Frontier.

Based on recently published data, Aurora comprises a total of 10,624 compute nodes. As shown in Fig. 3b, each compute node is equipped with two Intel Xeon Max "Sapphire Rapids" CPUs featuring High Bandwidth Memory (HBM) and six Intel Data Center GPU Max "Ponte Vecchio" GPUs, along with all the memory, networking, and cooling technology required for high-performance applications. CPU-to-GPU connections are established via PCIe, while GPU-to-GPU connections use Xe Link. In total, Aurora has 63,744 GPUs and 21,248 CPUs. It also includes over 1024 Distributed Asynchronous Object Store (DAOS) nodes with a total capacity of 230 petabytes (PB) and a maximum bandwidth of up to 31 TB/s. Aurora uses HPE's Slingshot fabric, designed specifically for supercomputers, with 64 200-Gbps ports per switch and a Dragonfly topology with adaptive routing (Aurora 2023; Gomes et al. 2022).

2.4 El Capitan supercomputer

El Capitan is being built by the U.S. DOE, Lawrence Livermore National Laboratory (LLNL), and HPE in partnership with AMD (Smith 2020; Morgan 2022). The system aims to deliver up to 2 EFLOPS of double-precision computing power within a power envelope of 40 MW. Installation at LLNL began in May 2023, and the system will primarily be used by the National Nuclear Security Administration (NNSA) for nuclear weapons modeling, replacing actual weapon testing. Secondarily, it will serve as a research system in other fields, particularly those where machine learning can be applied. El Capitan will thus become DOE's third exascale supercomputer, following ORNL's Frontier and ANL's Aurora.

Similar to Frontier, El Capitan will primarily adopt AMD's hardware architecture. It differs, however, in being the first supercomputer based on AMD's server-grade APUs. Each compute node features two sets of resources, each consisting of a Zen4-based Genoa EPYC processor and four Instinct MI300A accelerators, interconnected via AMD's Infinity Fabric technology. The AMD Instinct MI300A APU is a multi-chip/multi-IP accelerator that uses 3D packaging to integrate the CPU, GPU, cache, and HBM memory: it combines next-generation CDNA3 GPU cores with next-generation Zen4 CPU cores. Furthermore, El Capitan will be designed by HPE with the Slingshot-11 interconnect connecting the HPE Cray XE racks.

2.5 Eos supercomputer

At the GPU Technology Conference (GTC) in March 2022, Nvidia officially unveiled the H100 Tensor Core GPU, based on the Hopper architecture, and subsequently launched a range of products including machine learning workstations and supercomputers. The DGX H100 system combines eight H100 Tensor Core GPUs with four NVSwitches, offering up to 32 PFLOPS of FP8 AI computing power, 0.5 PFLOPS of FP64, and 640 GB of HBM3 memory. The new NVLink Switch System and Quantum-2 InfiniBand networking allow 32 DGX H100 systems to be integrated into a DGX SuperPod (Nvidia 2023a, b, c, d, e). A single H100 SuperPod thus has 256 H100 GPUs, 20 TB of HBM3 memory, and up to 1 EFLOPS of FP8 AI computing power (Fig. 4).

Fig. 4

Conceptual block diagram of architecture of Nvidia DGX SuperPod (Nvidia 2023a, b, c, d, e)
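The DGX H100 and SuperPod figures above compose from per-GPU specifications; a quick sketch, where the per-GPU numbers (~4 PFLOPS FP8 with sparsity, 80 GB HBM3) are approximate published specs rather than figures from this text:

```python
# How the quoted DGX H100 / SuperPod figures compose from per-GPU numbers.
gpus_per_dgx = 8
dgx_per_pod = 32

fp8_pflops_per_gpu = 4        # ~4 PFLOPS FP8 per H100 (with sparsity), approximate
hbm_gb_per_gpu = 80           # 80 GB HBM3 per H100

dgx_fp8 = gpus_per_dgx * fp8_pflops_per_gpu      # 32 PFLOPS FP8 per DGX H100
dgx_hbm = gpus_per_dgx * hbm_gb_per_gpu          # 640 GB HBM3 per DGX H100

pod_gpus = dgx_per_pod * gpus_per_dgx            # 256 GPUs per SuperPod
pod_fp8_eflops = dgx_per_pod * dgx_fp8 / 1000    # ~1 EFLOPS FP8 per SuperPod
pod_hbm_tb = dgx_per_pod * dgx_hbm / 1000        # ~20 TB HBM3 per SuperPod
```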

Building upon the DGX SuperPod, Nvidia will create a supercomputer named Eos comprising 18 DGX SuperPods, for a total of 576 DGX H100 systems (4608 H100 GPUs) and 360 NVLink switches. By traditional supercomputing standards, Eos will offer 275 PFLOPS of FP64 computing power. For AI computing, Eos will deliver 18.4 EFLOPS of FP8, or 9 EFLOPS of FP16. As a result, Eos is anticipated to become the world's fastest AI supercomputer.
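The Eos totals follow from the SuperPod composition described above; a short check, again assuming ~4 PFLOPS FP8 per H100 (with sparsity) as the approximate per-GPU rate:

```python
# Compose Eos system totals from the SuperPod counts quoted above.
pods = 18
dgx_systems = pods * 32            # 576 DGX H100 systems
gpus = dgx_systems * 8             # 4608 H100 GPUs

fp8_eflops = gpus * 4 / 1000       # ~18.4 EFLOPS FP8 (approx. 4 PFLOPS/GPU)
fp16_eflops = fp8_eflops / 2       # ~9.2 EFLOPS FP16 (FP16 at half the FP8 rate)
```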

3 Key technological trends for compute nodes

In the field of supercomputing, performance improvements come primarily from enhancing single-node performance and increasing the number of compute nodes. If the performance of individual nodes is inadequate, however, simply adding nodes cannot deliver the required performance.

It bears mentioning that a supercomputer compute node typically consists of processors and accelerators. In Frontier, for example, each node comprises a single AMD EPYC Trento processor and four AMD Radeon Instinct MI250X accelerators. The processor, the accelerator, and their interconnection technology are the crucial factors determining single-node computing power. We therefore summarize the key technologies in the compute nodes of major supercomputers and analyze the development trends of these technologies at the 100 TFLOPS level and toward 1000 TFLOPS.

3.1 Innovative processor design using high-performance core microarchitectures

In supercomputers, processors typically employ advancements in high-performance core microarchitecture to achieve sustained improvements in both memory access and single-core performance. The addition of a matrix computation acceleration engine provides significant acceleration for tensor processing at the heart of deep learning algorithms.

Sapphire Rapids, employed in Aurora, is Intel's fourth-generation Xeon Scalable processor and Intel's first Xeon designed with chiplet technology (Biswas 2021; Nassif et al. 2022). Fabricated on the Intel 7 process, Sapphire Rapids incorporates Intel's new Performance-core (P-core) microarchitecture, Golden Cove. Designed to raise speed and overcome limitations in low-latency and single-threaded application performance, the P-core features a wider, deeper, and more intelligent architecture. Sapphire Rapids follows a tiled, modular System-on-Chip (SoC) design leveraging Intel's EMIB packaging technology. This maintains the advantages of a monolithic CPU interface while offering significant scalability, and provides a unified memory-access architecture in which every thread has full access to all resources across all tiles, including cache, memory, and input/output. Consequently, the SoC exhibits consistently low latency and high cross-sectional bandwidth across the entire architecture. Furthermore, Sapphire Rapids incorporates Intel's Accelerator Interfacing Architecture instruction set to facilitate efficient scheduling, synchronization, and signaling between accelerators and devices. It also features Intel's Advanced Matrix Extensions (AMX), a new acceleration engine dedicated to the tensor processing at the core of deep learning algorithms: Sapphire Rapids can perform 2000 INT8 operations and 1000 BF16 operations per cycle, a significant increase in computing power. A Data Streaming Accelerator (DSA) is included as well, designed to offload the most common data-movement tasks and reduce costs in data-center-scale deployments. From the third-generation Ice Lake Xeons to the fourth-generation Sapphire Rapids there is both continuity and change: the 2D mesh architecture continues, while chiplet technology is among the major updates.
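The per-cycle AMX figures above translate into per-core throughput once multiplied by clock frequency; a rough sketch, where the 2 GHz clock is an illustrative assumption rather than a published all-core AMX frequency:

```python
# Rough per-core AMX throughput implied by the ops/cycle figures above.
int8_ops_per_cycle = 2000     # INT8 operations per cycle per core (quoted)
bf16_ops_per_cycle = 1000     # BF16 operations per cycle per core (quoted)
clock_ghz = 2.0               # assumed sustained clock, for illustration only

int8_tops_per_core = int8_ops_per_cycle * clock_ghz / 1000    # TOPS per core
bf16_tflops_per_core = bf16_ops_per_cycle * clock_ghz / 1000  # TFLOPS per core
print(int8_tops_per_core, bf16_tflops_per_core)               # 4.0 TOPS, 2.0 TFLOPS
```

Scaling these per-core rates by the core count gives the socket-level tensor throughput, which is why AMX is a meaningful contribution to node-level AI performance.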

AMD has pursued architectural innovation to achieve significant performance leaps. Table 2 shows that from Zen to Zen4, substantial improvements in instructions per cycle (IPC) have been achieved. The first processor based on the Zen architecture (Singh et al. 2017, 2018) was the Ryzen 1000 series desktop processor, built on a 14 nm FinFET process; its core complex comprised four cores sharing an 8 MB L3 cache. Zen2 (Singh et al. 2020; Suggs et al. 2019, 2020) is similar to the previous generation in core design but delivers a significant 15% IPC increase. Fabricated with TSMC's 7 nm FinFET process, Zen2 introduced an advanced branch-prediction method in its front end, nearly doubling branch-target capacity. In contrast, Zen3 (Burd et al. 2022; Evers et al. 2021) marked a substantial microarchitecture overhaul within AMD's Zen line: despite using the same 7 nm technology as Zen2, it achieved a remarkable 19% IPC gain. Several microarchitectural improvements stand out in the Zen3 design. First, AMD doubled the L1 branch target buffer (BTB) from 512 to 1024 entries, improving branch-predictor bandwidth and accuracy and allowing faster access to target addresses; Zen3 also adds pipelines to raise throughput for inference workloads. Second, Zen3 widens floating-point scheduling and dispatch, shortens FMAC latency, and adds two INT8 IMAC pipelines. Furthermore, Zen3 extends AVX2 instructions to a maximum width of 256 bits to accelerate encryption and decryption algorithms. Finally, Zen3 features AMD 3D V-Cache technology for even greater single-core performance, especially in the most computationally intensive workloads and engineering designs. The newly unveiled Zen4 (Munger et al. 2023) represents AMD's cutting-edge x86-64 core, crafted on a 5 nm FinFET process. It delivers a 13% IPC improvement over its predecessor, with major redesigns across the chip's front end, execution engine, and load/store hierarchy, and a doubled L2 cache. Notably, Zen4 can clock up to 5.7 GHz, delivering a single-threaded performance boost of up to 29%. AMD's fourth-generation EPYC Genoa processor is built on this Zen4 architecture.

Table 2 AMD's Zen to Zen4 architectural differences (Singh et al. 2017; Munger et al. 2023)
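Since each quoted IPC gain is measured against its immediate predecessor, the gains compound across generations; a quick calculation of the cumulative growth from Zen to Zen4, using the percentages above:

```python
# Each generation's IPC gain is relative to its predecessor, so gains compound.
gains = {"Zen2": 0.15, "Zen3": 0.19, "Zen4": 0.13}

cumulative = 1.0
for gen, g in gains.items():
    cumulative *= 1 + g

print(f"Zen -> Zen4 cumulative IPC: ~{cumulative:.2f}x")  # ~1.55x
```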

From the analysis of Intel's fourth-generation Xeon Scalable Sapphire Rapids processor and AMD's fourth-generation EPYC Genoa processor, the main microarchitecture trends are as follows:

  • Firstly, current high-performance general-purpose processors have adopted multi-core designs to enhance performance. Intel's first-generation Xeon Scalable processors, released in 2017, supported 28 cores; the fourth generation supports up to 60 cores, and Intel further plans the 144-core Xeon Sierra Forest. AMD's EPYC Genoa supports 96 cores, and its next-generation Bergamo processor will support 128 cores. Ampere, AWS, and China's FeiTeng have all launched high-performance general-purpose processors with more than 64 cores.

  • Secondly, cache sizes at all levels of general-purpose processors keep growing, with designs essentially using private L1 and L2 caches and a shared L3. The L3 cache is the largest and slowest level, and most current high-performance processors carry more than 64 MB of it. For example, Intel's Sapphire Rapids processor has a 105 MB L3 cache, while AMD's EPYC Genoa processor offers up to 1152 MB; AMD's extra-large cache is achieved with a 3D-stacked L3 called 3D V-Cache.

  • Finally, alongside the steady development of traditional high-speed interfaces, new interfaces are emerging. Mainstream processors have gradually begun supporting HBM, DDR5, and PCIe 5.0. Compared with the existing AMD EPYC "Milan" and Intel Xeon Scalable "Ice Lake" processors, the Sapphire Rapids CPU with HBM2E packaging achieves a performance improvement of approximately 2.8 times, injecting new vitality into high-performance computing. Meanwhile, DDR6 has entered early development, and the PCIe 6.0 specification, officially released in 2022, doubles the bandwidth of PCIe 5.0, providing broader headroom for data transmission. In addition, new high-speed interfaces such as CXL, CCIX, and NVLink-C2C continue to emerge, further enriching and expanding the application scenarios of high-speed interface technology.

3.2 Designing accelerator using domain specific architecture

Accelerators in supercomputers typically adopt the design philosophy of Domain Specific Architecture (DSA). This approach incorporates a substantial quantity of vector and tensor cores, tailored for optimizing HPC and AI applications. Such optimization is attained through seamless integration of software and hardware. Prominent manufacturers, namely Nvidia, AMD, and Intel, integrate DSA principles into their accelerators, thereby enhancing performance across a range of application domains.

HPC and AI assist numerous professionals in their respective fields, from researchers in laboratories to engineers solving complex technical problems and financial analysts using mathematical algorithms for market predictions. Nvidia has adopted a DSA that serves both HPC and AI applications. The architecture is continually iterated and updated, introducing new data-precision types and replacing high-precision units with low-precision ones. In 2012, Nvidia introduced the Kepler architecture with GPUDirect technology, which enables direct data exchange with other local or remote GPUs, bypassing the CPU and system memory. The 2016 Pascal architecture added a DP unit supporting double-precision computing (Nvidia 2017) and employs NVLink for point-to-point communication among multiple GPUs in a single machine, with bandwidths up to 160 GB/s. The 2017 Volta architecture focused squarely on deep learning, introducing the tensor core (Raihan et al. 2019) for fused multiply-add operations. The tensor core, an integral component of Volta optimized for training deep neural networks, specializes in matrix-matrix multiplication at reduced precision through mixed-precision training, improving performance without sacrificing accuracy. In 2018, Nvidia released the Turing architecture, building on Volta with upgraded tensor cores that support INT8, INT4, and binary (INT1) computation, performance doubling at each step down in precision; it also added the ray tracing (RT) core, capable of rendering light and sound effects at high speed (Nvidia 2018). The Ampere architecture, released in 2020, upgraded the tensor core again, adding support for the TF32 and BF16 data formats as well as sparse-matrix computation (Nvidia 2020). The Hopper architecture, released in 2022, triples the FLOPS for TF32, FP64, FP16, and INT8 precision and advances tensor core technology with the Transformer Engine to accelerate AI model training (Nvidia 2023a, b, c, d, e). The Nvidia H100 Tensor Core GPU (Choquette 2023), a new-generation accelerated computing platform based on Hopper, leverages TSMC's advanced 4N process to pack 80 billion transistors and introduces various new system-architecture features. A notable highlight is the fourth-generation NVLink interconnect, enabling efficient scaling across multiple GPU-accelerated nodes. Consequently, the H100 offers performance gains of 2–3× over its predecessors in both HPC and AI applications.

AMD currently maintains two GPU architectures: RDNA, optimized for gaming to achieve the highest frame rates (AMD 2019), and CDNA, optimized for compute to push the limits of FLOPS (AMD 2020). The AMD Radeon Instinct MI250X accelerators used in Frontier are based on the CDNA2 architecture, which is optimized specifically for accelerated computing: elements such as the display and pixel-rendering engines and ray tracing are eliminated or reduced to enhance performance and cut costs, while logic that contributes to data-center performance, such as tensor processing units, is added. In the past, GPUs have been optimized for single-precision floating-point operations, with double-precision operations running at one-half to one-sixteenth the speed. To improve performance for scientific computing applications, the CDNA2 vector pipeline was adjusted so that operations on wider double-precision data run at the same rate as single precision: 64 fused multiply-add (FMA) operations per clock cycle. With this improvement, the pipeline can also operate on packed single-precision values, doubling throughput to 128 single-precision FMA operations per cycle.
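These per-CU rates compose into the MI250X's published FP64 vector peak; a sketch, where the 220 compute units and 1.7 GHz clock are the MI250X's published figures rather than numbers from this text:

```python
# FP64 vector peak implied by CDNA2's 64 FMA/cycle per compute unit.
cus = 220                     # MI250X compute units (published spec)
clock_ghz = 1.7               # peak engine clock (published spec)
fma_per_cycle_per_cu = 64     # double-precision FMAs per cycle (quoted above)
flops_per_fma = 2             # one FMA = one multiply + one add

fp64_tflops = cus * fma_per_cycle_per_cu * flops_per_fma * clock_ghz * 1e9 / 1e12
print(f"~{fp64_tflops:.1f} TFLOPS FP64")  # ~47.9 TFLOPS, the MI250X's vector peak
```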

Intel's Xe GPU family spans a range of microarchitectures, from integrated/low-power (Xe-LP) to enthusiast/high-performance gaming (Xe-HPG), data center/AI (Xe-HP), and HPC (Xe-HPC). Notably, the Ponte Vecchio accelerator (Jiang 2022), developed for the Aurora system, is rooted in the Xe-HPC architecture. This architecture is built around the Xe-core, which houses eight vector and eight matrix engines alongside a substantial 512 KB L1 cache/Shared Local Memory (SLM). Each vector engine is 512 bits wide, supporting 16 FP32 Single Instruction Multiple Data (SIMD) operations with fused FMAs; collectively, the eight vector engines give the Xe-core 512 FP16, 256 FP32, and 256 FP64 operations/cycle. Each matrix engine is 4096 bits wide; with eight matrix engines, the Xe-core provides 8192 INT8 and 4096 FP16/BF16 operations/cycle. Furthermore, the Xe-core delivers a remarkable memory-system bandwidth of 1024 bytes per cycle for both loads and stores, facilitating efficient data transfer within the GPU architecture. The Xe-HPC architecture supports multi-stack designs: the Xe-HPC 2-stack Ponte Vecchio GPU consists of two stacks, each with a large L2 cache, four HBM2E memory controllers, one media engine, eight Xe-Link high-speed coherent fabric links, a copy engine, and a PCIe controller. Leveraging EMIB packaging and inter-stack interconnect channels, it ensures memory consistency between the stacks.
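The per-cycle vector figures above follow from the lane widths; a short derivation for the FP32 and FP16 rates (FP64 matches FP32 because Xe-HPC runs double precision at full rate):

```python
# Per-Xe-core vector throughput implied by the engine widths quoted above.
engines = 8                      # vector engines per Xe-core
lanes_fp32 = 512 // 32           # 16 FP32 lanes per 512-bit engine
flops_per_lane = 2               # fused multiply-add = 2 ops per lane per cycle

fp32_ops = engines * lanes_fp32 * flops_per_lane   # 256 FP32 ops/cycle
fp16_ops = fp32_ops * 2                            # 512 FP16 ops/cycle (half-width lanes)
print(fp32_ops, fp16_ops)
```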

Powerful hardware requires powerful software support. In 2006, Nvidia introduced CUDA, a general-purpose parallel computing architecture (Nvidia 2009). Before CUDA, harnessing GPU computing power required writing large amounts of low-level code or relying on graphics APIs. After more than a decade of development, the entire HPC and AI ecosystem has become deeply intertwined with CUDA; seen another way, the industry currently relies on CUDA to excess. AMD uses ROCm, an open software ecosystem for heterogeneous computing. ROCm is an open-source GPU computing platform that lets developers customize and tailor their GPU software to their specific needs; as primarily open-source software (OSS), it gives developers a collaborative community that can help solve problems in an agile, flexible, rapid, and secure manner. Intel has introduced oneAPI, an application programming interface for XPU hybrid computing. With oneAPI, a single, complete solution spans CPUs, GPUs, FPGAs, and accelerators of other architectures, simplifying cross-architecture development and significantly lowering the barrier to heterogeneous computing. This allows developers to freely choose the best hardware for a given solution without the economic and technical burdens of dedicated programming models.

3.3 Memory consistent interconnection within the node

Currently, most supercomputer compute nodes are heterogeneous, consisting of processors and accelerators. Computing power comes mainly from the accelerators, and the devices within a node are connected using coherent interconnect technologies. Earlier heterogeneous systems such as Tianhe-2 and Titan used PCIe interfaces for high-speed data transmission between processors and accelerators, as well as among accelerators. However, the PCIe interface suffers from limited bandwidth, non-uniform memory access, and inefficient program execution. To address these issues, major manufacturers have successively introduced memory-coherent interconnect technologies such as AMD's Infinity Fabric (Singh et al. 2017; Pires 2021), Nvidia's NVLink (Nvidia 2023a, b, c, d, e; Li et al. 2020; Ishii and Wells 2022), and Intel's Xe Link (Blythe 2021). Moving the processor–accelerator connection from PCIe to a coherent interconnect improves the bandwidth between them, makes programming easier, and improves the execution efficiency of application programs.

The processor and accelerators in the Frontier system utilize AMD's Infinity Architecture 3.0 interconnect, achieving high-bandwidth interconnection between one AMD EPYC Trento processor and four AMD Radeon Instinct MI250X accelerators within a single node. The Infinity Architecture roadmap is shown in Fig. 5a. The first-generation Infinity Fabric, used in AMD's first-generation EPYC processors, enables fast connectivity between chiplets within a processor, as well as between sockets in a multi-socket server (Singh et al. 2017). The second-generation Infinity Architecture can connect two CPUs and four GPUs in a ring, but the CPU–GPU connection is still based on PCIe. The third-generation Infinity Architecture provides a coherent communication channel not only between the sockets of a dual-socket EPYC system but also between CPU and GPU, without relying on PCIe, and raises the maximum number of simultaneously connected GPUs from 4 to 8 (6 links per GPU). The speed at which the GPUs talk to one another has also been greatly improved. Infinity Architecture 3.0 provides each Infinity Fabric link with 100 GB/s of bandwidth, sufficient for the demands of the entire system, and can accommodate up to two EPYC CPUs and eight GPU accelerators, making it an ideal solution for HPC applications (Pires 2021).

Fig. 5

Roadmap of interconnection technology for AMD (Singh et al. 2017; Pires 2021), Nvidia (Li et al. 2020; Ishii and Wells 2022), and Intel (Blythe 2021)

Both the Summit and Eos supercomputers employ Nvidia's NVLink and NVSwitch technologies. NVLink is a direct GPU-to-GPU interconnect that scales multi-GPU input/output within servers. NVSwitch connects multiple NVLinks to provide all-to-all GPU communication at NVLink's full speed within a single node and between nodes. Since its debut with the Nvidia P100 GPU, NVLink has evolved in parallel with the Nvidia GPU architecture, with each new architecture accompanied by a new generation of NVLink (see Fig. 5b). The V100 GPU (Choquette and Gandhi 2020) used in the Summit system provides a total bandwidth of 300 GB/s across 6 links per GPU. The newly launched H100 GPU provides 18 NVLink links totaling 900 GB/s of bandwidth, more than seven times the bandwidth of PCIe Gen5. For large AI models, Nvidia uses fourth-generation NVLink together with an external NVLink Switch, which extends NVLink into an interconnect network between servers. The system supports clusters of up to 256 H100 GPUs, providing 9 times higher bandwidth than the previous generation based on Nvidia's HDR Quantum InfiniBand network. This results in smoother data-processing pathways (Li et al. 2020; Ishii and Wells 2022).
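As a quick sanity check on the figures above, the per-link rate implied by both NVLink generations can be recovered from the quoted totals. This is a back-of-the-envelope sketch; the 128 GB/s PCIe Gen5 x16 comparison figure is an assumption added for illustration.

```python
def aggregate_bw(links: int, per_link_gbs: float) -> float:
    """Total per-GPU NVLink bandwidth in GB/s (links * per-link rate)."""
    return links * per_link_gbs

# Both generations deliver 50 GB/s per link (bidirectional):
v100_total = aggregate_bw(6, 50.0)   # Summit's V100: 6 links
h100_total = aggregate_bw(18, 50.0)  # H100: 18 links

print(v100_total)  # 300.0 -> matches the 300 GB/s V100 figure
print(h100_total)  # 900.0 -> matches the 900 GB/s H100 figure

# Assumed PCIe Gen5 x16 bidirectional bandwidth of ~128 GB/s:
print(h100_total / 128.0)  # ~7.0, consistent with "more than seven times"
```

The tripling of link count, not a faster per-link rate, is what drives the generational bandwidth jump here.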

The upcoming Aurora system utilizes Intel's Xe Link interconnect technology, which provides a high-speed, coherent, unified fabric across multi-GPU configurations. Figure 5c shows that Xe Link supports 2-way, 4-way, 6-way, and 8-way topologies, with each GPU directly connected to every other GPU (Blythe 2021). Aggregating GPUs in these topologies scales the compute capacity available within a node.

After conducting an in-depth analysis, we observe that major manufacturers are uniformly committed to high-bandwidth, low-latency communication technologies. The connection between CPU and GPU has also evolved from traditional PCIe interfaces towards coherent interconnect technologies, exemplified by Nvidia's NVLink Chip-to-Chip (C2C) and AMD's third-generation Infinity Architecture. In addition, Nvidia has a dedicated NVSwitch chip for scalability, which AMD and Intel lack. For example, Nvidia's GH200 superchip seamlessly integrates the Grace CPU with the Hopper GPU through NVLink-C2C technology, forming a unified system with a bidirectional bandwidth of up to 900 GB/s between the chips (Nvidia 2024). The DGX GH200 supercomputer, announced in 2023, can integrate 256 GH200 superchips and up to 144 TB of shared memory into a single unit, offering AI computing power of up to 1 EFLOPS. Similarly, Intel's Xe Link interconnect technology demonstrates robust performance: it supports coherent interconnection of up to 8 GPUs and can link 8 such clusters, providing up to an 8x increase in computing power. AMD, on the other hand, relies on its third-generation Infinity Architecture, which likewise supports coherent interconnection among up to 8 GPUs. Meanwhile, with the recent launch of its Instinct MI300 accelerator, AMD is set to upgrade to Infinity Architecture 4.0. Reportedly, the fourth-generation Infinity Architecture will support CXL 2.0 memory pooling and system-level cache coherency. This means that devices like Samsung's 512GB CXL memory extender could be supported by AMD processors, a development with significant implications for data centers (Alcorn 2022).
In summary, the continuous innovation and advancements in interconnection technologies by major manufacturers are driving the development of HPC and opening up new possibilities for future data center architectures.

3.4 Advanced integration based on chiplet technology

With the advent of the post-Moore's Law era, relying solely on process improvement to enhance chip performance faces bottlenecks. Therefore, adopting a chiplet-based design has become a common choice, with both AMD and Intel using chiplet technology. Through die-to-die high-speed interconnects and advanced packaging, homogeneous or heterogeneous integration can be achieved, resulting in higher performance.

In the field of HPC and AI, AMD has adopted the CDNA architecture, and its chiplet and packaging technologies have iterated continuously, progressing from early 2.5D packaging, to subsequent "2.5D plus Multi-Chip Module (MCM)" packaging, and now to the latest 3.5D packaging. In 2020, AMD debuted its CDNA architecture (AMD 2020), specifically tailored for data centers and marking a distinct departure from its preceding RDNA gaming architecture. The first product based on the CDNA architecture was the MI100 accelerator, which featured 32GB of HBM2 memory. By integrating the GPU chiplet with the HBM2 chiplet within the same package, the MI100 achieved highly efficient memory access. Subsequently, in 2021, AMD introduced the MI250 and MI250X accelerators, both based on the CDNA2 architecture (AMD 2021). These products leveraged 6 nm FinFET technology and incorporated the 2.5D Elevated Fanout Bridge (EFB) interconnection technique. Notably, they marked the first implementation of MCM packaging in the accelerator domain, integrating two cores within a single package. This innovation significantly enhanced both computing performance and energy efficiency. In 2023, AMD continued to drive technological innovation with its Instinct MI300 series (AMD 2023), specifically the Instinct MI300A and Instinct MI300X. These products incorporate the latest CDNA3 and Zen4 architectures in a chiplet design. As illustrated in Fig. 6a, this architecture integrates eight HBM3 memory stacks and up to 256MB of AMD Infinity Cache. By combining 3D hybrid bonding with a 2.5D silicon interposer, a novel "3.5D packaging" technology has been achieved.
Specifically, the MI300A APU incorporates three "Zen 4" x86 CPU compute dies, six third-generation AMD CDNA (CDNA3) GPU accelerator compute dies (XCDs), and four I/O dies (IODs), providing 128GB of HBM memory. The MI300X accelerator, on the other hand, integrates eight XCDs and four IODs, offering 192GB of memory. With a transistor count of 153 billion, it is the largest chip AMD has ever manufactured.

Fig. 6

Architecture diagram of a AMD MI300 series (AMD 2023) and b Intel Ponte Vecchio (Gomes et al. 2022)

The Ponte Vecchio accelerator for Aurora is Intel's first exascale accelerator. As shown in Fig. 6b, it is a heterogeneous petaop-class 3D processor consisting of 47 functional tiles fabricated on five process nodes. These tiles are connected via Embedded Multi-die Interconnect Bridges (EMIB) (Mahajan et al. 2016) and Foveros (Ingerly et al. 2019), implemented as a single unit, achieving the efficient performance required of a scalable exascale supercomputer (Jiang 2022). Ponte Vecchio contains over 100 billion transistors, consisting of 16 TSMC N5 compute tiles (Xe-HPC architecture) and 8 Intel-7 random-access-bandwidth-optimized SRAM (RAMBO) memory tiles, 3D-stacked on two Intel-7 Foveros base dies. Eight HBM2E memory tiles and two TSMC N7 SerDes connection tiles are connected to the base dies via 11 compact EMIB bridges. Furthermore, the SerDes connectivity offers a high-speed, coherent, unified fabric for scaling out connections among Ponte Vecchio SoCs. Each tile includes an 8-port switch supporting up to 8-way fully connected configurations over 90G SerDes links. The SerDes tiles support load/store, bulk data transfers, and synchronization semantics essential for scalability in HPC and AI applications (Gomes et al. 2022).

3.5 Next generation accelerator with heterogeneous integrated architecture

Heterogeneous computing has emerged as a prominent paradigm in HPC, leveraging the unique strengths of different processing elements to enhance overall system performance. In this context, the integration of CPUs and GPUs into a unified architecture represents a significant advancement.

Historically, GPUs have excelled in parallel processing tasks, owing to their ability to execute a large number of instructions simultaneously. However, their thread-group Single Instruction Multiple Threads (SIMT) execution model imposes specific demands on memory access patterns, favoring regular and dense computing tasks. Conversely, irregular tasks may not fully utilize the hardware capabilities of GPUs, resulting in suboptimal performance. To address these challenges, major design manufacturers have been exploring the integration of CPU and GPU dies into a single chip. This approach aims to combine the benefits of both processing elements, enabling more efficient memory access and higher overall performance.

A notable example of this trend is AMD's Instinct MI300A, unveiled at CES 2023 (AMD 2023). This next-generation accelerator chip represents AMD's first foray into the data center/HPC-class APU market. The MI300A integrates 24 Zen4 CPU cores with the latest CDNA3 GPU architecture, leveraging a 3D packaging technology to combine multiple dies onto a single chip. Furthermore, it incorporates 128GB of HBM3, providing a unified memory pool for both the CPU and GPU.

The unified architecture of the MI300A brings several advantages. Firstly, it eliminates the need for redundant memory copies, as processors can directly access and modify data without copying it to their own dedicated memory pool. This significantly improves performance by reducing memory latency and increasing bandwidth utilization. Secondly, the unified memory pool allows CPUs and GPUs to share the same memory resources, eliminating the need for a separate pool of memory chips. This not only simplifies system design but also reduces power consumption and overall cost.

Similarly, Nvidia's GH200 superchip integrates an Arm-based Nvidia Grace CPU with the Nvidia H100 Tensor Core GPU. The superchip utilizes Nvidia's NVLink-C2C interconnect technology to provide a unified CPU and GPU memory model. This approach eliminates the need for traditional CPU-to-GPU PCIe connections, reducing latency and increasing bandwidth between the processing elements. The Grace Hopper superchip also adopts an integrated architecture of LPDDR5X and HBM3 heterogeneous memory fusion, providing a high-bandwidth, low-latency memory solution for demanding HPC applications (Nvidia 2024).

The development trajectory of AMD's server GPUs bears striking similarities to those of Intel and Nvidia, a trend worth emphasizing. All three companies are committed to integrated "CPU and GPU" products. It is believed that in the near future we can expect new breakthroughs combining heterogeneous fusion processors with machine learning technologies.

Certainly, it is worth highlighting that the Sunway SW26010 processor, employed in China's Sunway TaihuLight supercomputer, exemplifies a distinctive heterogeneous integrated architecture (Fu et al. 2016; Gao et al. 2021). Unlike the "CPU and accelerator" approach, the SW many-core processor heterogeneously integrates multiple types of cores onto a single chip. Within this architecture, a small number of highly capable management processing elements (MPEs) detect instruction-level parallelism and manage the chip, while a large number of computing processing elements (CPEs) concentrate on thread-level parallelism. This division of labor significantly enhances chip performance. The heterogeneous design seamlessly blends the flexibility of a general-purpose CPU with the high performance of a specialized accelerator, effectively increasing computational density and providing robust support for complex scientific computations and data processing tasks. Notably, the adoption of a unified instruction set streamlines software development and enhances compatibility with existing software environments, offering convenience to both researchers and developers.

4 Key technical challenges and suggestions for compute nodes

With the introduction of exascale supercomputers, 10-exascale and even zettascale supercomputers have become the future targets to be tackled by supercomputing developers. In the post-Moore era, building future supercomputers of 10-exascale and above faces many challenges, such as energy efficiency, reliability, programming efficiency, scalability, and other issues (Lu 2019; Su and Naffziger 2023; Liao et al. 2018). Among them, reliability, programming efficiency, and scalability are intricately linked with the system size; accordingly, these aspects will not be examined in-depth in this article. Instead, our focus will be on three primary areas: enhancing computing energy efficiency, devising memory optimization techniques, and exploring advanced packaging technologies and chiplet interconnects related to compute nodes. Furthermore, plausible solutions and recommendations will be provided.

4.1 Technology for enhancing computing energy efficiency

A zettaflop supercomputer would deliver 1000 EFLOPS of computing power. Currently, the world's fastest supercomputer, Frontier, peaks at slightly over 1 EFLOPS while consuming about 20 MW. If today's supercomputing technology were simply scaled up to a zettascale machine, it would consume around 20 GW, equivalent to the output of 20 nuclear power plants. Moreover, while supercomputer performance has doubled every 1.2 years since 1995, computing energy efficiency has only doubled every 2-2.5 years. At the current rate of progress in performance and energy efficiency, a zettascale supercomputer would still require about 500 MW, equivalent to half the output of a nuclear power plant, which remains impractical. From the perspectives of both technology and sustainability, the flattening curve of computing energy efficiency has become the biggest challenge that must be addressed.
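The power argument above can be reproduced with a few lines of arithmetic. This sketch uses only the Frontier figures quoted earlier in this paper and the 500 MW target from the text.

```python
# Frontier's sustained Linpack figures, as quoted in the text.
frontier_flops = 1.194e18        # FLOPS
frontier_power = 22.7e6          # W
eff = frontier_flops / frontier_power   # ~52.6e9 FLOPS/W (52.59 GFLOPS/W)

zetta = 1.0e21                   # 1000 EFLOPS
power_today = zetta / eff        # zettascale power at today's efficiency
print(power_today / 1e9)         # ~19 GW, i.e. "around 20 GW"

# Efficiency improvement needed to fit zettascale into 500 MW:
eff_needed = zetta / 500e6       # FLOPS/W
ratio = eff_needed / eff
print(ratio)                     # ~38x improvement over Frontier
```

The ~38x gap is why the text singles out the flattening energy-efficiency curve, rather than raw performance scaling, as the central obstacle.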

In order to achieve comparable or surpassing progress in the next decade compared to the past decade, it is crucial to prioritize energy efficiency in the execution of physical design, architectural innovation, and algorithmic enhancement (Su and Naffziger 2023).

  • Firstly, the improvement of silicon technology will always be a key component in enhancing compute efficiency (Liu 2021), because the energy per operation of logic gates continues to decrease, providing critical efficiency gains. However, after the 7 nm node the pace of density improvement slows, and energy efficiency flattens at the 5 nm node. In addition, the increasing process complexity of advanced nodes has led to a significant increase in the cost per mm² with each successive process generation. Researchers are therefore simultaneously exploring alternatives to silicon.

  • Secondly, in-/near-memory computing techniques based on new packaging technologies can improve the energy efficiency of data movement between processor and memory. The energy consumed moving data between the processor and memory depends heavily on the memory type and its connection to the processor. DDR5 is the mainstream memory interface for modern servers, while in recent years HBM has become the preferred memory for HPC components such as GPU accelerators. HBM leverages advanced packaging to provide much higher bandwidth, with per-bit energy efficiency 3-4 times better than DDR5. Even higher bandwidth and lower energy per bit can be achieved by 3D-stacking memory directly on the compute die, as in AMD's 3D V-Cache technology; extending this technique to DRAM and alternative DRAM-density memory technologies could reduce data-movement energy by an order of magnitude relative to HBM. Furthermore, another approach to the memory-efficiency challenge is to embed dedicated computing data paths directly into the memory itself, known as Processing-in-Memory (PIM) (Gokhale et al. 1995; Kim et al. 2022; Jang et al. 2023). PIM has the potential to cut data-movement energy consumption by up to 85%, making it a promising technology for optimizing future systems.

  • Thirdly, DSA accelerators and hardware-software co-design are extremely effective in improving compute efficiency (Wolf 1994; Moreno and Wen 2021). General-purpose computing has been the backbone of compute infrastructure for decades, but accelerators have now gained increasing prominence. For instance, the AMD Instinct MI250X accelerator is the primary compute engine in the leading exascale Frontier supercomputer. The accelerator utilizes a highly parallel, efficient GPU architecture and dedicated matrix-multiplication primitives to perform the GEMM (General Matrix Multiplication) operations used in Linpack and many HPC algorithms more efficiently. Compared to traditional general floating-point execution, this roughly halves the energy per 64-bit floating-point operation, allowing software algorithms to take full advantage of these primitives for higher compute efficiency. In addition, the customized features of custom computing break the abstraction layers and interaction interfaces between existing software and hardware. To fully exploit the computing power of custom architectures, hardware-software co-design optimization has become an inevitable choice for future HPC systems, with enormous potential. It is therefore necessary to move away from the traditional model in which software, hardware, and chip design develop independently of one another, and instead adopt an integrated software-hardware-chip systems approach.

  • Finally, we need to actively apply AI in all aspects of our designs and systems. Recently, through AI for Science, some expensive and inefficient physics-based codes have been replaced with AI-based surrogate models. Practical "HPC and AI" applications combine AI-trained approximations of the physics with pure physics codes in the final stages to obtain accurate results, greatly improving the efficiency of scientific computing.
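To make the data-movement energies in the second point above concrete, the sketch below compares the cost of moving one terabyte under assumed per-bit energies. The absolute pJ/bit values are illustrative assumptions, not vendor data; only the ~3-4x DDR5-to-HBM ratio and the order-of-magnitude 3D-stacking gain follow the text.

```python
# Assumed, illustrative energy per bit moved (pJ/bit); not vendor figures.
pj_per_bit = {
    "DDR5": 6.0,         # baseline off-package DRAM
    "HBM": 1.7,          # ~3.5x better than DDR5, per the text
    "3D-stacked": 0.17,  # ~10x better than HBM, per the text
}

bytes_moved = 1e12  # moving 1 TB once
for tech, pj in pj_per_bit.items():
    joules = bytes_moved * 8 * pj * 1e-12  # bits * pJ/bit -> joules
    print(f"{tech}: {joules:.2f} J per TB moved")
```

At exascale rates, where terabytes move every second, these per-transfer joules translate directly into the megawatts discussed in this section.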
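The GEMM primitive highlighted in the third point above is easy to exercise from high-level code. This is a sketch using NumPy, whose `@` operator dispatches to an optimized BLAS GEMM, as a stand-in for the dedicated matrix engines the text describes.

```python
import numpy as np

n = 512
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# C = A @ B is dispatched to an optimized BLAS GEMM under the hood;
# on accelerators the same operation maps onto dedicated matrix engines.
c = a @ b

# Standard FLOP estimate for an n x n GEMM: 2*n^3 (one multiply and one
# add per inner-product term).
flops = 2 * n**3
print(flops)  # 268435456 FLOPs for n = 512
```

Matrix engines pay off precisely because GEMM's O(n³) arithmetic reuses O(n²) data, giving the high arithmetic intensity that dedicated primitives exploit.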

4.2 Technology for devising memory optimization

With the convergence of HPC and AI, the demand for computing power is constantly increasing due to the rapid growth of data scale and expanding application requirements. In particular, computing power for deep learning-based AI is growing much faster than memory. According to OpenAI in 2018, the computing power utilized for AI training tasks has doubled approximately every 3.43 months since 2012. In recent years, with the rise of large-scale AI models starting from tens-of-billions-scale parameters, the demand for computing power in AI applications has leapt to new heights. Meanwhile, while processor performance has improved by approximately 55% each year over the past 20 years, memory performance has only improved by approximately 10% each year. Memory performance therefore significantly lags behind computing performance. The development of future 10-exascale and above supercomputers faces the "memory wall" problem: a mismatch between computing power and memory capability. The memory design of high-performance nodes thus needs to address three challenges: the balanced design of computing and memory, the optimization of memory access performance, and the expansion of memory capacity.
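Compounding the growth rates quoted above shows how quickly the gap opens. This is a sketch; the 55% and 10% annual rates are the figures given in the text.

```python
years = 20
cpu_growth = 1.55 ** years   # processor performance, ~55%/yr compounded
mem_growth = 1.10 ** years   # memory performance, ~10%/yr compounded

gap = cpu_growth / mem_growth
print(round(gap))  # a roughly thousand-fold compute/memory gap over 20 years
```

This divergence, not any single generation's shortfall, is what the "memory wall" names.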

4.2.1 Balancing design between computing and memory

Chip hardware development primarily aims to enhance peak computing performance; however, this optimization often requires simplifying or eliminating other components, such as the memory hierarchy. To enhance execution efficiency, modern CPUs integrate a multilevel cache architecture; consequently, on memory-bandwidth-limited problems such as large-scale recommendation systems, CPUs can significantly outperform GPUs. However, compared to AI chips like GPUs and TPUs, the computing power of current CPUs is weaker by an order of magnitude. One reason is that many AI chips maximize computing power by removing certain components to make room for more computing units. In addition, intra- and inter-chip communication has become a bottleneck for many applications. Finding a balance between hardware bandwidth and computing power is a challenging fundamental problem. By optimizing the design architecture, a better "computing power/bandwidth" balance can be sought; it is important to perform a full performance evaluation across different compute-memory configurations and select the appropriate chip architecture.
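One standard way to reason about the "computing power/bandwidth" balance described above is the roofline model, sketched below. The peak and bandwidth figures are hypothetical, chosen only to illustrate the method, not to describe any specific chip.

```python
def attainable_gflops(peak_gflops: float, bw_gbs: float,
                      intensity_flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * intensity_flops_per_byte)

peak, bw = 50_000.0, 3_000.0   # hypothetical node: 50 TFLOPS, 3 TB/s
ridge = peak / bw              # intensity at which it becomes compute-bound
print(ridge)                   # ~16.7 FLOPs/byte for this configuration

print(attainable_gflops(peak, bw, 0.25))   # sparse kernel: bandwidth-bound
print(attainable_gflops(peak, bw, 100.0))  # dense GEMM-like: compute-bound
```

Evaluating target workloads against such a model, for each candidate compute-memory configuration, is one concrete form of the "full performance evaluation" the text calls for.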

4.2.2 Optimization design of memory access performance

The objective of memory optimization is to improve access bandwidth and reduce latency, and several methods beyond advanced memory technologies such as HBM can serve this goal: (1) implementing a hierarchical cache structure to efficiently reuse data, minimizing data movement; (2) decreasing access latency through techniques such as vectorization and merged (coalesced) access when retrieving data; (3) exploiting parallel processing technologies such as multithreading to overlap memory access with computation, hiding memory latency behind useful work.
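Method (2) above can be illustrated in a few lines. This is a NumPy sketch; the same principle underlies SIMD intrinsics on CPUs and coalesced loads on GPUs.

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Scalar loop: one narrow load and store per element, so memory
# bandwidth is poorly utilized and per-element overhead dominates.
y_loop = np.empty_like(x)
for i in range(x.size):
    y_loop[i] = 2.0 * x[i] + 1.0

# Vectorized form: the runtime streams contiguous memory with wide,
# merged (SIMD) accesses in a single pass.
y_vec = 2.0 * x + 1.0

assert np.allclose(y_loop, y_vec)  # identical results, far fewer accesses
```

The vectorized form typically runs orders of magnitude faster in practice, precisely because it turns many small accesses into a few wide, contiguous ones.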

In addition to current memory optimization techniques, the industry has actively engaged in researching Computing-in-Memory (CIM) technology (Kang et al. 2019; Moreau et al. 2018). CIM refers to embedding computing power in memory and directly utilizing memory for data processing or computing. By merging memory and computing in the same area of a chip, CIM eliminates the bottleneck of the von Neumann computing architecture. This technology is particularly effective when handling large-scale parallel and large data volume applications. Similar to the integration of memory and computing in the human brain, CIM architecture combines the data memory unit and the computing unit, reducing data movement and enhancing computing parallelism and energy efficiency.

To promote the development of the CIM architecture, a crucial near-term strategy is to reduce the distance between memory and computing units through advances in chip design, integration, and packaging. This approach, known as Computing-Near-Memory, improves bandwidth and minimizes data-movement overhead, alleviating bottlenecks caused by data transfer. In the mid term, memory and computing units will be combined through architectural innovations to realize Computing-in-Memory, which has become the primary development focus for various manufacturers: computing operations are executed by autonomous computing units situated within the memory chip area. These in-memory computations can be analog or digital, and are generally employed for algorithmic calculations in fixed scenarios. Looking towards the future, device-level innovations aim to achieve Logic-in-Memory, wherein devices function as both memory units and computing units. By adding computing logic inside the memory, data can be processed directly where it is stored, truly achieving "Computing-in-Memory".

4.2.3 Design of memory capacity expansion

With the development of HPC and AI convergence, supercomputers need to provide strong computing-power support for AI applications such as large models.

The complexity of AI models has witnessed an exponential surge in parameter counts, while the size of individual model weights has escalated to the terabyte (TB) scale (Fedus et al. 2022). As a result, the performance of large-model AI applications is limited by the ability to store and retrieve training and inference data from memory; memory capacity, in addition to memory bandwidth, is an important factor affecting application performance. Coherent interconnect technology facilitates the extension of memory capacity within a node, catering to the high-capacity memory demands of AI training workloads such as large models. For example, Nvidia's latest Grace Hopper GH200 superchip is based on NVLink memory-coherent interconnect technology (Li et al. 2020; Ishii and Wells 2022). It can interconnect up to 256 superchips, providing 144 TB of shared memory, and delivers a strong acceleration effect for applications limited by memory capacity. Hierarchical memory-capacity expansion is also possible through technologies such as CXL (CXL 2023; Gouk et al. 2023; Park et al. 2022). A "CXL Memory Expander" resembles an external memory device such as a solid-state drive (SSD): it expands DRAM capacity by attaching additional memory through a new interface, without changing or replacing the existing server structure. It enables flexible memory construction and increases the system's DRAM capacity, though its latency is higher than traditional DDR, which may limit its application in HPC systems. At present, CXL interface technology is still relatively new and in the early stages of development, but it is widely expected to play an important role in the coming industrial wave.
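A simple weighted-latency model illustrates the CXL trade-off described above. The nanosecond figures are assumptions chosen for illustration, not measured values for any specific device.

```python
def avg_latency_ns(local_ns: float, cxl_ns: float, frac_local: float) -> float:
    """Average access latency across local DDR and CXL-attached memory,
    weighted by the fraction of accesses served locally."""
    return frac_local * local_ns + (1.0 - frac_local) * cxl_ns

local_ddr, cxl_mem = 100.0, 300.0  # assumed: CXL ~3x the local latency
for frac in (1.0, 0.9, 0.5):
    print(frac, avg_latency_ns(local_ddr, cxl_mem, frac))
```

Keeping hot data local (a high `frac_local`, via tiering or page placement) contains the latency penalty while still gaining the extra capacity, which is why CXL suits capacity-bound rather than latency-bound workloads.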

4.3 Technology of chiplet interconnects and advanced packaging

In response to escalating computing power requirements, the semiconductor industry faces significant challenges in physical constraints, cost, and energy consumption. Chiplet technology has emerged as a viable solution and has been recognized by both academia and industry as a vital component of future high-performance chips. It splits an SoC into chiplets with different functions and heterogeneously integrates them through advanced packaging, thus improving yield and reducing cost. In contrast to traditional SoC technology, chiplet technology primarily focuses on architecture design and advanced packaging.

Specifically, the overall architecture design consists of two major components: chiplet design and chiplet interconnection. For chiplet design, the functional dies must first be completed, after which the interconnections between individual functional dies are established. Current mainstream architectural schemes fall into two categories: (1) multiple dies are partitioned by function, no single die contains a complete set of functions, and different products are built by combining different die packages; (2) a single die contains a relatively independent and complete set of functions, and performance scales linearly by interconnecting multiple dies. For chiplet interconnection, there are two primary approaches, serial and parallel, with parallel interconnection becoming the prevailing industry trend. Currently, Intel, AMD, TSMC, the Open Compute Project Foundation (OCP), and others have developed their own chiplet interface standards. Among them, Intel's Universal Chiplet Interconnect Express (UCIe) standard, released in 2022, has gained widespread attention for being data-center oriented and compatible with the PCIe and CXL protocols, giving it a strong ecosystem. However, the lack of uniform packaging and communication-interface standards remains the main obstacle to chiplet packaging and communication, as each manufacturer pursues its own scheme. Upstream and downstream industries, together with electronic design automation (EDA) vendors, foundries, and others, need to collaborate on interconnect standards to promote the widespread adoption and application of this technology.

In the design process of chip miniaturization, several considerations arise, notably the limited chiplet area paired with a greater number of interfaces, alongside weak signal-transmission quality between chiplets. Therefore, advanced packaging technologies characterized by high-density, high-bandwidth wiring become paramount. Advanced packaging is currently dominated by industry leaders such as TSMC, Samsung, and Intel, ranging from 2D MCM to 2.5D CoWoS and EMIB, and on to 3D hybrid bonding, all of which are mainstream packaging technologies for chiplets. However, packaging multiple die stacks in a limited space can lead to high temperatures and heat-dissipation issues. Resolving thermal problems in chiplet packaging requires breakthroughs across multiple disciplines. One possible path is the development of new materials that meet integrated electrical, thermal, and mechanical requirements; new integrated system testing tools and methods are also necessary. It is important to note that software accounts for 30-40% of the cost of chip design. A key aspect of DARPA's CHIPS project is its emphasis on electronic design automation (EDA) tools, which are essential for the interconnection, packaging, and testing of chiplets.

5 Conclusions

The Frontier supercomputer, ranking first on the Top500 list, has ushered in the exascale era of HPC, in which a compute node's double-precision computing power exceeds 100 TFLOPS. This paper analyzes the compute-node structures and key technologies of the existing Summit and Frontier supercomputers, as well as the planned Aurora, El Capitan, and Eos supercomputers. We have summarized and explored the development trends of compute-node technologies for achieving 100 TFLOPS node performance, both presently and in the future, covering high-performance core microarchitectures, domain-specific architectures, interconnect technology, chiplet integration, and next-generation accelerators. In addition, we explore the challenges faced by future 10-exascale and even zettascale supercomputers, including energy efficiency, memory optimization, chiplet integration, and interconnect and packaging technologies, and provide preliminary suggestions.