The history of HPC systems, spanning more than 50 years, shows the progression of recurring design concepts adapted to evolving scenarios:
- Fabrication technology advancements and varying limitations, mostly driven by the trend of Moore’s Law and characterized by short- to medium-term periods of evolution;
- The application landscape of HPC, generally characterized by a growing range of applications but a relatively slow evolution.
Following the evolution of HPC architecture design, programming paradigms have generally evolved to meet the requirements for approaching the peak performance of each architecture. Figure 1 depicts the main drivers of HPC development over the last decades.
Here we analyse part of this HPC evolution from the above perspective, through a set of representative cases, with a view to outlining the expected path for the development of European Exascale computing systems.
The first rise of hardware acceleration: vector computers (1974–1993)
At the end of the mainframe age, which had spanned more than two decades after World War II, many foundational ideas of modern computer science, such as compilers, operating systems, floating-point arithmetic, virtual memory and the memory hierarchy, had already been conceived and experimented with. Mainframes were general-purpose systems, although the range of computer applications was limited with respect to the present idea of a general-purpose computer, and they had struggled for years with limited memory capacity.
Technology and application drivers
On the technology side, bipolar transistors and emitter-coupled logic (ECL) families had already been adopted to boost speed at the expense of power efficiency. With the advancement of memory technology and memory sub-system design, the size of the memory address space ceased to be the limiting factor for application development. On the application side, the presence of heavy matrix algebra in scientific and military applications opened the way for the acceleration of arithmetic operations on vectors of real numbers.
Architecture design advancement
Instructions operating on vector operands, rather than the more conventional scalar operands, were introduced along with hardware support within the CPU for executing vector operations. Vector processors exploit data-level parallelism (DLP), where a Single Instruction operates over Multiple Data streams (SIMD) (Asanovic 1998; Espasa et al. 1998). This constitutes the first representative example of hardware acceleration of computational kernels, especially in the form of dedicated, parallel (SIMD-organized) functional units and of a dedicated vector register file. For this reason, vector computers can be considered the first form of domain-specific machine. Vector machines appeared in the early 1970s and dominated supercomputer designs for two decades.
There are two main classes of vector processors, depending on the location of their vector operands. Vector memory–memory architectures locate all vector operands in memory, whereas vector register architectures provide vector instructions operating on registers, with separate vector loads and stores moving data between memory and the register file. Relevant vector memory–memory machines that appeared in that period are the ILLIAC IV supercomputer (Barnes et al. 1968), the TI advanced scientific computing (ASC) supercomputer (Watson 1972), and the CDC STAR 100 (Hintz and Tate 1972) and its successors (Lincoln 1982). In contrast, representative vector register architectures include the Cray series (Russell 1978; Cray Research 1984). These designs exploited DLP with long vectors of thousands of bits.
Many applications can potentially benefit from vector execution through better performance, higher energy efficiency and greater resource utilization. Ultimately, the effectiveness of a vector architecture depends on its ability to vectorize large quantities of code. However, the code vectorization process encounters several obstacles, such as horizontal operations, data structure conversion or divergence control. As a result, significant effort was devoted to improving the automatic vectorization of scientific codes (Callahan et al. 1988). Even so, autovectorizing large scientific codes requires the programmer to perform some degree of code annotation, modification or even a complete rewrite, as the sketch below illustrates.
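As a minimal illustration (our own sketch, not taken from the cited works), the DAXPY kernel below is the kind of loop a vectorizing compiler maps onto vector loads, vector multiply-adds over whole vector registers, and vector stores; the optional annotation is a modern way of asserting that the iterations are independent:

#include <stddef.h>

/* DAXPY: y = a*x + y. Each iteration is independent, so the loop can be
 * executed as vector loads, vector multiply-adds and vector stores.
 * The OpenMP simd pragma is an optional hint to the compiler. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Divergence (an if inside the loop body) or indirect accesses such as y[idx[i]] are typical features that defeat automatic vectorization and force the programmer to restructure the code.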
The rise of massive homogeneous parallelism (1994–2007)
Technology and application drivers
The progress of CMOS technology allowed the growth of general-purpose single-chip microprocessor performance, pushed by increasing clock frequencies and by microarchitecture advances targeting the high-volume personal computer (PC) market during the 1980s. Microprocessor speed grew faster than memory speed, which represented the first appearance of a memory wall limiting performance. With the appearance of on-chip caches, made possible by the increasing scale of integration of CMOS, this initial memory wall was overcome and the performance of general-purpose complex CPUs, organized in multi-processor parallel architectures, became comparable with that of vector computer systems. Vector computers relied on specialized hardware boards composed of multiple integrated circuits. Because of their relatively limited market, the development of powerful CMOS-based single-chip vector microprocessors was not justified. In order to keep pace with the speed of general-purpose multi-processors, vector computers remained bound to bipolar technology, which again was not in the mainstream of the semiconductor industry. More recently, a further significant advance in memory technology was the advent of eDRAM, allowing on-chip L3 caches for example, thus pushing back the memory wall in commodity-CPU-based HPC systems [eDRAM].
On the application side, the market favoured the availability of general-purpose parallel architectures that were not intrinsically devoted to a special class of algorithms, as vector CPUs were.
Architecture design advancements
The rise of massively parallel architectures based on off-the-shelf processors soon opened the way to shared-memory symmetric multiprocessors and subsequently to distributed architectures, introduced to overcome the limited memory bandwidth with respect to the traffic generated by multiple CPUs (the new memory wall). Cluster-based architectures, with multi-processor shared-memory nodes connected by interconnection networks of various topologies and technologies, have represented the dominant paradigm of HPC systems for the last 25 years.
On the software side, programming massively parallel architectures has produced a number of approaches and APIs (application programming interfaces), broadly divided into shared-memory paradigms and message-passing paradigms, with the support of compiler technology.
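A minimal sketch of how the two families of paradigms are commonly combined on clusters of shared-memory nodes (the trivial summation is illustrative only): MPI handles message passing between processes on different nodes, while OpenMP exploits the cores that share memory within each node:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                    /* message passing across nodes */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process sums its share of the terms; threads within the node
     * share memory and cooperate through the OpenMP reduction. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += size)
        local += 1.0 / (1.0 + i);

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}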
The renaissance of acceleration units (2008–2018)
Technology and application drivers
The arrival of the power wall, due to increasing power consumption density, has definitively limited clock frequencies in favour of increasing the number of cores integrated on the same silicon die. This phenomenon has been characterized by technology-node progress through geometry scaling accompanied by voltage scaling (a.k.a. Dennard scaling), while maintaining practically the same clock frequency and increasing performance by augmenting the number of cores on the die. Yet, due to the need for acceptable noise margins in the logic gates, Dennard scaling has proven unfeasible, thus slowing supply-voltage scaling with respect to geometry scaling. This effect has made it impossible to further increase the number of active cores on the die, again due to excessive power density, a situation known as the dark silicon necessity (Esmaeilzadeh et al. 2012). Dark silicon refers to the need to keep part of the silicon die inactive, or active at a lower frequency than the CPU cores. One main way of facing this design complication is the adoption of specialised hardware acceleration, which yields dramatic gains in power efficiency. To clarify the real foundations of the above trends, Table 1 shows the actual efficiency gain of hardware specialization on a simple computational kernel.
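The first-order scaling relations behind the power wall and dark silicon can be summarized as follows (standard CMOS approximations, not derived from Table 1). The dynamic power of a gate is roughly

P \approx C \, V_{dd}^{2} \, f .

Under ideal Dennard scaling by a factor \kappa > 1, we have C \to C/\kappa, V_{dd} \to V_{dd}/\kappa and f \to \kappa f, so the power per gate scales as P \to P/\kappa^{2}; since gate density grows by \kappa^{2}, power density remains constant across generations. Once supply-voltage scaling stalls (V_{dd} roughly constant), the same shrink leaves the power per gate unchanged while density still grows by \kappa^{2}, so power density grows by roughly \kappa^{2} per generation; this is why part of the die must remain dark or run at reduced frequency.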
Table 1 Performance and energy efficiency of hardware acceleration on a numerical application design case

On the memory side, a decisive step in pushing back the memory wall has been the introduction of 3D-stacked DRAM technology created for graphics applications, essentially High Bandwidth Memory and the Hybrid Memory Cube, which permit data rates of up to 512 GB/s (HBM3).
Architecture design advancements
While the trend towards increasing parallelism continues, the designers of HPC systems, driven by the need for power efficiency, have rediscovered hardware acceleration. The consequences of this technology scenario, while limiting the proliferation of cores in large Systems-on-Chip such as those employed in HPC systems, move in two directions:
- Limiting the number of cores and the die area, and accelerating the computation by means of specialized external units hosted on the same board or even on daughterboards connected by high-speed links such as PCIe (Peripheral Component Interconnect Express);
- Leveraging on-chip hardware acceleration units that allow increased throughput at relatively low clock frequency, and thus at considerably higher energy efficiency.
The first direction has been followed since 2008, with the development of the first supercomputers equipped with GPU (graphics processing unit) acceleration boards, allowing a tremendous increase in parallelism while relieving the CPU of the heaviest computational load (Fig. 2). Notably, the advent of GPUs in HPC systems has been possible thanks to the availability of off-the-shelf, top-performance silicon chips fabricated for existing high-volume markets (high-end personal computers). This mirrored the appearance of off-the-shelf high-speed CPUs in the HPC market 15 years earlier.
The second direction of development of acceleration units, i.e. on-chip accelerators, while promising a new boost in performance thanks to the less severe limitations imposed by data transfers, necessarily requires similar support from a high-volume market in order to sustain the cost of dedicated chip design and fabrication and to avoid the decline experienced by vector architectures in the 1990s. This opportunity is provided by the emergence of new application domains, not traditionally related to HPC, which may benefit from the use of such dedicated high-speed processing chips. Such applications are already widely documented and include Artificial Intelligence (Deep Learning), statistical analysis (Big Data), biology, and others. Notably, in addition to extending the applicability of HPC solutions to a larger market, the new applications also present particular requirements that demand attention from HPC system designers. Examples are reduced floating-point precision, integer processing, and bit-level processing. In this scenario, it is significant that big market drivers, such as Google, are designing their own hardware solutions to meet their high-performance computing needs.
On the CPU side, the need for higher power efficiency, along with the availability of high-performance ARM architectures and their established software eco-system, has opened the way to ARM-based supercomputers in the search for less power-consuming CPUs. In the last decade, the Barcelona Supercomputing Center (BSC) has pioneered the adoption of ARM-based systems in HPC. The Mont-Blanc projects (Online: http://montblanc-project.eu/), together with other European projects, allowed the development of an HPC system software stack on ARM. The first integrated ARM-based HPC prototype was deployed in 2015 with 1080 Samsung Exynos 5 Dual processors (2160 cores in total) (Rajovic et al. 2016). The clear success of these projects has influenced the international roadmaps for Exascale: the Post-K processor, designed by Fujitsu, will make use of the Arm64 ISA (Yoshida 2016), while Cray and HPE are developing supercomputers together with ARM in the US.
A special case of hardware acceleration units is represented by FPGA-enhanced systems. Some relevant examples of the adoption of this technology are already at a mature level of development. Most such systems use FPGA acceleration to cut down the communication latency in HPC networks, as in the case of the Novo-G# architecture, leveraging the long experience of FPGA-based routers. More generally, FPGAs can be used for node-level processing acceleration, by reconfiguring data-paths within the FPGA connected to each CPU. This approach relies on the reconfigurable on-chip connections and on-chip memory structures within the FPGA, which in principle allow the exploitation of a higher degree of parallelism at the expense of an order-of-magnitude decrease in clock frequency. An example of a system going in this direction is Catapult V2 from Microsoft, which employs FPGAs for local acceleration, network/storage latency acceleration, and remote computing acceleration; yet the most interesting impact appears to be in network latency reduction. A rather exhaustive list of systems that have experimented with FPGA accelerators in HPC and related systems can be found in http://www.bu.edu/caadlab/HPEC16c.pdf.
On the programming paradigm side, the advent of GPU accelerators, as well as on-chip accelerators and possibly FPGA accelerators, has certainly complicated the scenario. In general, this trend pushes towards increasingly collaborative development between hardware and software designers. In this scenario, a holistic strategy matches application requirements with the technical implementation of the final design. In recent years, we have observed a rise in the popularity of novel programming models targeting HPC systems, especially focused on managing parallelism at the node level. Parallelism between nodes at large scale still relies on the standardized Message Passing Interface (MPI).
Traditional programming models and computing paradigms have been successful at achieving significant throughput from current HPC infrastructures. However, more asynchronous and flexible programming models and runtime systems will be needed to exploit the huge amounts of parallelism offered by the hardware. With that goal, several task-based programming models have been developed in the past years for multiprocessors and multicores, such as Cilk, Intel Threading Building Blocks (TBB), NVIDIA’s CUDA, and OpenCL, while task support was also introduced in OpenMP 3.0. These programming models allow the programmer to split the code into several sequential pieces, called tasks, by adding annotations that identify potentially parallel phases of the application, as the sketch below illustrates.
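As a minimal illustration of task annotation in the OpenMP 3.0 style (the recursive Fibonacci toy kernel is our own example, not taken from the cited works), the programmer marks independent pieces of work as tasks and the runtime schedules them on the available cores:

#include <stdio.h>

/* Naive recursive Fibonacci parallelized with OpenMP 3.0 tasks:
 * each recursive call is annotated as a task, and taskwait joins them. */
long fib(long n)
{
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single
    r = fib(30);                 /* one thread spawns the task tree */
    printf("fib(30) = %ld\n", r);
    return 0;
}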
More recently, emerging data-flow task-based programming models allow the programmer to explicitly specify data dependencies between the different tasks of the application. In such programming models, the programmer (or the compiler) identifies tasks that have the potential to run in parallel and specifies their required input and output parameters. The runtime (or the programmer) then builds a task-dependency graph to handle data dependencies and expose the parallel workload to the underlying hardware transparently. Therefore, the application code does not contain information on how to handle the workload beyond specifying data dependencies. Representative examples of such programming models are Charm++, Codelets, Habanero, OmpSs, and the task support in OpenMP 4.0; a minimal data-flow tasking sketch follows.
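A minimal sketch of data-flow tasking in the OpenMP 4.0 style (variable names and the trivial computation are illustrative only): the depend clauses declare each task's inputs and outputs, and the runtime derives the task-dependency graph and runs independent tasks in parallel:

#include <stdio.h>

int main(void)
{
    double a = 0.0, b = 0.0, c = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer tasks: no mutual dependencies, so they may run in parallel. */
        #pragma omp task depend(out: a)
        a = 1.0;

        #pragma omp task depend(out: b)
        b = 2.0;

        /* Consumer task: started by the runtime only after both producers
         * have completed, as expressed by the declared dependencies. */
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;

        #pragma omp taskwait
        printf("c = %f\n", c);
    }
    return 0;
}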