Finally, how many efficiencies do the supercomputers have?

Using an extremely large number of processing elements in computing systems leads to unexpected phenomena, such as different efficiencies of the same system for different tasks, that cannot be explained within the frame of the classical computing paradigm. The simple, non-technical model introduced here provides the frame and formalism needed to explain the unexpected experiences around supercomputing. The paper shows that the degradation of the efficiency of parallelized sequential systems is a natural consequence of the computing paradigm, rather than an engineering imperfection. The workload is largely responsible for wasting energy, as well as for limiting the size and the type of tasks that supercomputers can run. Case studies provide insight into how different contributions compete to dominate the resulting payload performance of the computing system, and how enhancing the technology made computing plus communication the dominating contribution in defining the efficiency of supercomputers. The model also enables us to derive predictions about supercomputer performance limitations for the near future and provides hints for enhancing supercomputer components. The phenomena show interesting parallels with phenomena experienced in science more than a century ago, through whose study modern science was developed.


Introduction
Given that the dynamic growth of single-processor performance stalled about two decades ago [1] and the computing demand grows faster than the computing capacity [2], the only way that remained to achieve the required high computing performance is to parallelize the work of a vast number of sequentially working single processors. However, as was predicted very early [3], and confirmed experimentally decades later [4], the scaling of parallelized computing is not linear. Even, "there comes a point when using more processors . . . actually increases the execution time rather than reducing it" [4]. Parallelized sequential processing has different rules of the game [4], [5]: its performance gain ("speedup") has its inherent bounds [6].
Just as the laws of science limit the performance of single-thread processors [7], the commonly used computing paradigm (through its technical implementation) limits the payload performance of supercomputers [5]. On the one side, experts expected performance to achieve the magic 1 Eflops around year 2020; see Figure 1 in [8]. "The performance increase of the No. 1 systems slowed down around 2013, and it was the same for the sum performance" [8], but the authors extrapolated linearly, expecting that the development continues and shall achieve "zettascale computing" (i.e., a $10^4$-fold increase over the present performance) in just more than a decade. On the other side, it has recently been admitted that linearity is "A trend that can't go on ad infinitum", and that it "can be seen in our current situation where the historical ten-year cadence between the attainment of megaflops, teraflops, and petaflops has not been the case for exaflops" [9]. Officially, the TOP500 [10] evaluation concludes (as of 2020) that "the top of the list remains largely unchanged" and "the full list recorded the smallest number of new entries since the project began in 1993". The 2021 list added: "Still waiting for Exascale".
The expectations against supercomputers are excessive. For example, the name of the company PEZY witnesses that a billion-fold increase in payload performance is expected. (Note that there are doubts about the definition of exaFLOPS: whether it means nominal performance $R_{Peak}$ or payload performance $R_{Max}$; measured by which benchmark, or how it depends on the workload the computer runs; and using which operand length. Here the term is used as $R_{Max}^{HPL-64}$. Several other benchmark results, not related to floating-point computation, have been published to produce higher figures.) It looks like, in the feasibility studies on supercomputing using parallelized sequential systems, an analysis of whether building computers of such size is feasible (and reasonable) remained out of sight, either in the USA [11,12] or in the EU [13] or in Japan [14] or in China [8]. Even in the most prestigious journals, the "gold rush" goes on [15,12]. In addition to the previously existing "two different efficiencies of supercomputers" [16], other efficiency/performance values appeared (of course with higher numeric figures), and we can easily derive several more efficiencies.
Although severe counter-arguments have also been published, mainly based on the power consumption of single processors and large computing centers [17], the moon-shot of limitless parallelized processing performance is still pursued. The probable source of the idea is "weak scaling" [18,4]. However, it is based simply on a misinterpretation [19,20] of the terms in Amdahl's law [21]. In reality, Amdahl's Law (in its original spirit) is valid for all parallelized sequential activities, including computing-unrelated ones, and it is the governing law of distributed (including super-) computing.
Demonstrative failures of some systems (such as the supercomputers Gyoukou and Aurora'18, and the brain simulator SpiNNaker) are already known, and we expect more to follow, such as Aurora'21 [24] and the mystic Chinese supercomputers. Fugaku, although it considerably enhanced the efficacy of computing, mainly due to the clever placing and use mode of its on-chip memory, also stalled at about 40% of its planned capacity [25] and could increase its payload performance only marginally within a year. Perlmutter produced a modest 10% payload performance enhancement in its Phase 2. No sign of life was seen from the US exascale supercomputer Aurora.
The situation is similar with exascale applications, such as brain simulation. Exaggerated news appeared about simulating the brain of some animals or a large percentage of the human brain. The reality is that the many-thread implementation of the brain simulator can fill a tremendous amount of memory with the data of billions of artificial neurons [26], and a purpose-built (mainly hardware (HW)) brain simulator can be designed to simulate one billion neurons [27]. In practice, however, both can simulate only about 80 thousand neurons [28], mainly because of "the quantal nature of the computing time" [29]. "More Is Different" [30].
In June 2022, Frontier crossed the magic barrier. The major contribution was a seamless integration achieved by sharing the L1 and L2 data of the Central Processing Unit (CPU) with the Graphics Processing Unit (GPU). Although this kind of accelerated processor eliminated the performance-limiting effect of the former implementations (particularly at a high number of Processing Units (PUs)) and caused a theoretically unpredictable jump in supercomputers' performance, it required no changes in their theoretical discussion.
The confusion keeps growing. This paper attempts to clarify the terms by scrutinizing the basic notions, contributions, and measurement methods. In section 2, an intentionally and enormously simplified non-technical model, based on the temporal behavior of a physical implementation of computing [31], is presented. The notations for Amdahl's Law, which form the basis of the present paper, are introduced in section 3. We show that the degradation of the efficiency of parallelized sequential systems is a natural consequence of the computing paradigm, rather than an engineering imperfection (in the sense that it could be fixed later). Furthermore, its consequence is that parallelized sequential computing systems, by their very nature, have an upper payload performance (or more precisely, payload performance gain) bound. Different contributions arise from the sequential portion of the task (and through this, they degrade its parallelized performance), as detailed in section 4. We validate the established model in section 5.
Given that the race to produce computing systems having components and systems with higher performance numbers is going on, in section 6 the expected results of developments in the near future are predicted. The section introduces some further performance merits and, through interpreting them, concludes that increasing the size of supercomputers further and making expensive enhancements in their technologies only increase their non-payload performance. In section 7, we discuss that under extreme conditions, the technical objects of computing exhibit a series of behavioral features (for more details, see [5]) similar to those of natural objects.

A conceptual model of parallelized sequential operation
The performance measurements are simple time measurements (sometimes secondary merits, such as GFlops/Watt or GFlops/USD, are also derived), although they need careful handling and proper interpretation; see good textbooks such as [32]. A standardized set of machine instructions is executed (a large number of times), and the known number of relevant operations is divided by the measurement time; this holds for both single-processor and distributed, parallelized sequential systems. In the latter case, however, the joint work must also be organized and implemented with additional machine instructions and additional execution time, forming an overhead. (The 'weak scaling' approximation neglects this aspect: a many-billion-USD mistake.) This extra activity reduces efficiency: one of the processors orchestrates the joint operation while the others wait idly. At this point, the "dark performance" appears: the processing units are ready to operate and consume power but do not perform any payload work. As discussed in detail in [31], the "stealthy nature" of the incremental development of technology made its appearance unnoticed. However, today, "idle time" is the primary reason that power consumption is used mainly for delaying electronic signals [33] inside our computing systems and for delivering data rather than performing computations [17].
Amdahl listed [3] different reasons why losses in the "computational load" can occur. Amdahl's idea enables us to put everything that cannot be parallelized, i.e., distributed between the fellow processing units, into the sequential-only fraction of the task. To describe the parallelized operation of sequentially working units, the model depicted in Figure 1 was prepared (based on the temporal behavior of components, as described in [31]). The technical implementations of the different parallelization methods show a virtually infinite variety [34], so here an intentionally strongly simplified model is presented. However, the model is general enough to discuss qualitatively systems working in parallel. We shall neglect the various contributions where possible in the different cases. Although with some obvious limitations, one can easily convert our model to a technical (quantitative) one by interpreting its contributions in technical terms. Such technical interpretations also enable us to find some technical factors limiting the performance of parallelized computing.
The non-parallelizable (i.e., apparently sequential) part of the task comprises contributions from HW, the operating system (OS), software (SW) and Propagation Delay (PD), and also some access time is needed for reaching the parallelized system. This separation is conceptual rather than strict, although dedicated measurements can reveal their roles, at least approximately. Some features can be implemented in either SW or HW, or shared between them. Furthermore, some sequential activities may happen partly in parallel with each other. The relative weights of these contributions are very different for different parallelized systems and, even within those cases, depend on many specific factors. In every single parallelization case, a careful analysis is required. The SW activity represents what was assumed and discussed by Amdahl as the total sequential fraction. The non-determinism of modern HW systems [35], [36] also contributes to the non-parallelizable portion of the task: the resulting execution time of parallelly working processing elements is defined by the slowest unit.
Notice that our model assumes no interaction between the processes running on the parallelized system beyond the necessary minimum: starting and terminating otherwise independent processes, which take their parameters at the beginning and return their results at the end. It can, however, be trivially extended to more general cases, when processes must share some resource (such as a database, which shall provide different records for the different processes), either implicitly or explicitly. Concurrent objects have their inherent sequentiality [37], and synchronization and communication between those objects considerably increase [38] the non-parallelizable portion of the task (i.e., the contribution to $(1-\alpha)$). Because of this effect, in the case of many processors, we must devote special attention to their role in the efficiency of applications on parallelized systems.
The physical size of the computing system also matters. A processor connected to the first one with a cable of dozens of meters must wait several hundred clock cycles for a signal to arrive. This waiting is due solely to the finite speed of light propagation, topped by the latency of their interconnection and the hops it involves (not to mention geographically distributed computer systems, such as some clouds, connected through general-purpose networks). Detailed calculations are given in [39].
After reaching a certain number of processors, adding more processors does not increase the payload fraction any further: the first fellow processor has already finished its task and is idly waiting, while the last one is still idly waiting for its start command. We can increase this limiting number by organizing the processors into clusters: the first computer must then speak directly only to the heads of the clusters. Another way is to distribute the job close to the processing units: this can happen either inside the processor [40,25], or one can let the processing units of a GPU do the job. This looping contribution is not considerable (and, in this way, not noticeable) at a low number of processing units (compared with the other contributions). Still, it can dominate at a high number of processing units. This "high number" was a few dozen at the time of writing the paper [4]; today it is a few million. We consider the effect of the looping contribution as the borderline between the first- and second-order approximations in modeling the system's payload performance. The housekeeping keeps growing with the number of processors while, in contrast, the system's resulting performance does not increase anymore. The first-order approximation assumes the contribution of housekeeping to be constant. The second-order approximation also considers that, as the number of processing units grows, housekeeping grows with it, gradually becomes the dominating factor of the performance limitation, and decreases the payload performance.
As Figure 1 shows, in the parallelized operating mode (in addition to the computation, also the communication of data between the processing units), both software and hardware contribute to the execution time, i.e., both must be considered in Amdahl's Law. This finding is not new, again: see [3]. Figure 1 also shows where there is room to improve computing efficiency. When combining PD properly with sequential scheduling, one can considerably reduce the non-payload time when fine-tuning the system (see the performance increases of Sierra and Summit half a year after their appearance on the TOP500 list). Also, mismatching the total time and the extended measurement time (or not making a proper correction) may lead to completely wrong conclusions [41], as discussed in [39].

Amdahl's Law in terms of our model

Usually, Amdahl's law is expressed as

$S^{-1} = (1-\alpha) + \alpha/N$   (1)

where $N$ is the number of parallelized code fragments (or PUs), $\alpha$ is the ratio of the parallelizable portion to the total, and $S$ is the measurable speedup. From this,

$\alpha = \frac{N}{N-1}\cdot\frac{S-1}{S}$   (2)

When calculating the speedup, one calculates

$S = \frac{1}{(1-\alpha) + \alpha/N}$   (3)

hence the resulting efficiency of the system (see Figure 2) is

$E = \frac{S}{N} = \frac{1}{N(1-\alpha) + \alpha}$   (4)

Fig. 2 The two-parameter efficiency surface (as a function of the parallelization efficiency and the number of processing elements), as concluded from Amdahl's Law (see Eq. (4)), in first-order approximation. Some sample efficiency values for selected supercomputers are shown, measured with the benchmarks HPL and HPCG, respectively. "This decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [4].

This phenomenon itself has been known for decades [4], and $\alpha$ is theoretically established [42]. Presently, however, the theory has somewhat faded, mainly due to the rapid development of parallelization technology and the increase in single-processor performance.
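As a minimal numeric illustration of Eqs. (1)-(4), the following sketch computes the speedup and efficiency for a fixed $\alpha$ and a growing number of processing units; the values of alpha and N below are hypothetical, chosen only to show the trend.

def speedup(alpha, n):
    # Eq. (3): S = 1 / ((1 - alpha) + alpha / N)
    return 1.0 / ((1.0 - alpha) + alpha / n)

def efficiency(alpha, n):
    # Eq. (4): E = S / N
    return speedup(alpha, n) / n

alpha = 0.999999  # hypothetical parallelizable fraction
for n in (1_000, 100_000, 10_000_000):
    print(f"N={n:>10,}  S={speedup(alpha, n):12.1f}  E={efficiency(alpha, n):7.4f}")
# The speedup saturates near 1/(1 - alpha) = 1e6: beyond roughly that many
# processing units, added cores mostly produce "dark performance".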
During the past quarter of a century, the proportion of the contributions changed considerably: today the number of processors is thousands of times higher than a quarter of a century ago. The growing physical size and the higher processing speed increased the role of the propagation overhead; furthermore, the large number of processing units enormously amplified the role of the looping overhead. As a side result of the technological development, the phenomenon of performance limitation returned, in a technically different form, at a much higher number of processors.
Using Eq. (4), $E = S/N = R_{Max}/R_{Peak}$ is equally suitable for describing the efficiency of parallelization of a setup:

$\alpha_{eff}(E,N) = \frac{E \cdot N - 1}{E \cdot (N-1)}$   (5)

As we discuss below, except for an extremely high number of processors, we can safely assume that the value of $\alpha$ is independent of the number of processors in the system. Eq. (5) can be used to derive the value of $\alpha$ from the parameters $R_{Max}/R_{Peak}$ and the number of cores $N$. According to Eq. (4), the payload efficiency can be described by a two-dimensional surface, as shown in Figure 2. On that surface, we display some measured efficiencies of the current top supercomputers to illustrate some general rules. Both the HPL and the HPCG efficiency values are displayed. We project the measured values back onto the axes to enable the reader to compare the corresponding values of the number of processors and the parallelization efficiency. The HPL efficiency sharply decreases with the increasing number of cores in the system. In the case of unnaturally high (or not provided) HPCG efficiency values, the values are not displayed in the figure. The last "world champion" that benchmarked HPCG correctly was Taihulight.
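A sketch of how Eq. (5) can be applied to TOP500-style entries follows; the numeric values are illustrative placeholders, not actual list data.

def alpha_eff(r_max, r_peak, n_cores):
    # Eq. (5): alpha = (E*N - 1) / (E*(N - 1)), with E = R_Max / R_Peak
    e = r_max / r_peak
    return (e * n_cores - 1.0) / (e * (n_cores - 1.0))

# hypothetical HPL-like entry: 80% efficiency on 1,000,000 cores
a = alpha_eff(r_max=0.8, r_peak=1.0, n_cores=1_000_000)
print(f"alpha_eff = {a:.9f},  (1 - alpha_eff) = {1.0 - a:.2e}")
# -> (1 - alpha_eff) is about 2.5e-7; HPCG-like entries of the same machine
#    yield (1 - alpha_eff) values that are orders of magnitude larger.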
We can divide the measured values into two groups. The first group comprises measurements where the same number of cores was used in both benchmarks. For visibility, only the HPL projections are displayed, and on top of them, the HPCG data points. The general experience is that the ratio of HPL-to-HPCG efficiency is about 200-500 when using the same number of cores. The HPCG payload performance reaches its "roofline" [43], [21] level at a lower number of cores; using all cores would decrease the system's performance because of the higher number of cores. This issue is why the members of the second group reduced their number of cores in the HPCG benchmark.
The recent trend is that only a tiny fraction (only about 10%) of a supercomputer's cores are used in the HPCG benchmarking, while all cores are used in the HPL benchmarking. Beginning with June 2021, the data "Measured cores" are not provided anymore, obscuring this aspect. The presumable reason is that in this way the measured HPCG/HPL efficiency ratio gets much higher, providing the illusion that the vast supercomputers became more suitable for real-life tasks. The supercomputers based on the newly developed AMD EPYC 7763 64C processors also did not provide their HPCG efficiency and performance, or they provided an incorrect number of measured cores.
There is an inflection point in the performance: "there comes a point when using more PUs . . . actually increases the execution time rather than reducing it" [4]. We can observe a quick breakdown of the performance gain, as theoretically predicted [21] and experimentally measured; see [44](Fig.13) or [45](Fig.7). As can be concluded from the figure, increasing a system's nominal performance by an order of magnitude decreases, at the same time, its efficiency (and so its payload performance) by more than an order of magnitude.
The goal of the Gordon Bell Prize [46] was originally "to demonstrate a speedup of at least 200 times on a real problem", and the community noticed a decade ago that the efficacy measured with the benchmark HPL and that of the real-life applications started to differ by up to two orders of magnitude.

Fig. 3 The performance gain of top supercomputers, as a function of their year of production. The marks display the measured values derived using the HPL and HPCG benchmarks for the TOP3 supercomputers. The small black dots mark the performance data of the supercomputers JUQUEEN and K as of June 2014, for the HPL and HPCG benchmarks, respectively. The big black dot denotes the performance value of the system used by [28]. The red pentagons denote the performance gain measured using half-precision operands. The saturation effect can be observed for both the HPL and HPCG benchmarks.
It would be worth recalling that the new benchmark program HPCG [47] was introduced because "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications" [48]. The present design efforts target building "racing" computers, and the constructors either do not publish their HPCG efficiency or measure it using only a fraction (about 10%) of their cores. The developers aim to produce higher figures with the HPL benchmark instead of achieving higher performance in the real-life-imitating HPCG benchmark. It is simply misleading to claim that building such racing supercomputers "will enable scientists to develop critically needed technologies for the country's energy, economic and national security, helping researchers address problems of national importance that were impossible to solve just five years ago" and that "Scientists and engineers from around the world will put these extraordinary computing speeds to work to solve some of the most challenging questions of our era" [49].
Comments such as "The HPCG performance at only 0.3% of peak performance shows the weakness of the Sunway TaihuLight architecture with slow memory and modest interconnect performance" [50] and "The HPCG performance at only 2.5% of peak performance shows the strength of the memory architecture performance" [25] show that supercomputing experts did not realize that the efficiency is an (at least) two-parameter function, depending on both the number of PUs in the system and its workload; furthermore, that the workload defines the achievable payload performance. It looks like the community experienced the effect of the two-dimensional efficiency but did not want to comprehend its reason, despite the early and clear prediction: "this decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [4]. In excessive systems of modern HW, it is also dictated by the laws of nature [5]. Furthermore, we can perfectly describe its dependence by the correctly interpreted Amdahl's Law, rather than treating it as an "empirical efficiency".
Notice two more rooflines in the figure. During the development of the interconnection technology, between the years 2004 and 2012, different implementation ideas were around and competed for years. The beginning of the second roofline, around the year 2011, coincides with the dawn of GPUs, interfering with the effect of the interconnection technology; see also section 4.3.
The top roofline appeared with Taihulight: some assistant processors take over part of the duties of the individual cores, and in this way, one can mitigate the non-payload portion of the workload. This roofline can lie slightly above the otherwise possible roofline, at the price of using a slightly modified computing paradigm: cooperating cores.
It is important to notice the two red pentagons in the figure: they represent the performance gain achieved using half-precision operands. The performance gain is lower than its double-precision equivalent, precisely because of the increased relative weight of housekeeping, as discussed in detail in section 4.5.
The projections of the efficiency values onto the axes show that the top few supercomputers offer similar parallelization efficiency and core-number values; both features are required to receive one of the top slots. The supercomputers Taihulight and Fugaku are exceptions on both axes: they have the highest number of cores and the best HPL parallelization efficiency. An interesting coincidence is that the processors of both supercomputers have "assistant cores" (i.e., some cores do not perform payload computing; instead, they take over the "housekeeping duties"). This solution performs housekeeping in parallel with the payload computing (it reduces the sequential portion of the task) and, in this way, decreases the internal latency of the processors making payload computing and increases the system's efficiency. They both use a "light-weight operating system" (as do Summit and Sierra) to reduce processor latency. This efficiency, of course, requires executing several floating-point instructions per clock cycle. That mode of operation gets more and more challenging for the interconnection, which delivers data to and from the data processing units. Notice also, in their cases, the role of "near" memories: as explained in [31], the data delivery time considerably increases the "idle time" of computing. This idle time is why Fugaku, with its cleverly placed L2 cache memories, can be more effective when measured with HPL. However, this trick does not work in the case of HPCG, because its "sparse" computations use those cache memories ineffectively. We expect the "true" HPCG efficiency of Fugaku to be between the corresponding values of Summit and Taihulight.
The newly (as of 2022) developed AMD EPYC 7763 64C processors did not cause a revolution. On the one side, they introduced a performance jump, quite similar to the appearance of the NVidia accelerators some years ago. On the other side, their internal latency allowed a somewhat higher performance gain to be achieved. As the data point representing Frontier shows, their performance gain is slightly higher than that of systems without acceleration and considerably higher than that of systems having accelerators with non-shared data spaces. However, they did not require developing a new theoretical approach.
The processors of Taihulight comprise cooperating cores [40]. The direct core-to-core transfer uses a (slightly) different computing paradigm: the processor cores explicitly assume the presence of another core, and in this way, their effective parallelism becomes much better; see also Fig. 6. In that figure, this data point and the ones using shorter operands (Summit and Fugaku) result in performance values above the limiting line. However, reducing the loop count by internal clustering (in addition to the "hidden clustering" enabled by its assistant cores) and exchanging data without using the global memory works only for the HPL case, where the contribution of SW is low. The poor value of $(1-\alpha_{eff}^{HPCG})$ is not necessarily a sign of architectural weakness [11]: Taihulight comprises about four times more cores than Summit and performs the HPCG benchmark with ten-fold more cores. Given that HPCG mimics "real-life" applications, one can conclude that, for practical purposes, only systems comprising a few hundred thousand cores shall be built; see also section 4.4. More cores contribute only to the "dark performance".
According to Eq. (4), efficiency can be interpreted in terms of $\alpha$ and $N$, and the payload performance of a parallelized sequential computing system can be calculated as

$R_{Max} = E(N,\alpha) \cdot N \cdot P_{single} = \frac{N \cdot P_{single}}{N(1-\alpha)+\alpha}$   (6)

where $P_{single}$ denotes the single-processor performance. This simple formula explains why the payload performance is not a linear function of the nominal performance and why, in the case of very good parallelization ($(1-\alpha) \ll 1$) and low $N$, this nonlinearity cannot be noticed. The value of $\alpha$, however, can hardly be calculated for the present complex HW/SW systems from their technical data (for a detailed discussion, see [21]). We can follow two ways to estimate the value of $\alpha$. One way is to calculate $\alpha$ for existing supercomputing systems (making "computational experiments" [20]) by applying Eq. (5) to data in the TOP500 list [5]. This way provides a lower bound, already achieved, for $(1-\alpha)$. The other way is to consider contributions of different origins (see section 4) and to calculate the limiting value of $(1-\alpha)$ that the given contribution alone does not allow to be exceeded (provided that that contribution is the dominant one). It gives us good confidence in the reliability of the parameters that the values derived in these two ways differ only by up to a factor of two. At the same time, this also means that the technology is already very close to its theoretical limitations.
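The following sketch evaluates the reconstructed Eq. (6) for a hypothetical single-processor performance and a hypothetical $\alpha$ value; it only illustrates why the nominal and the payload performance diverge as $N$ grows.

def payload_performance(n, p_single, alpha):
    # Eq. (6): R_Max = N * P_single / (N*(1 - alpha) + alpha)
    return n * p_single / (n * (1.0 - alpha) + alpha)

p_single = 50e9       # Flops per core, hypothetical
alpha = 0.9999995     # hypothetical effective parallelization
for n in (1e5, 1e6, 1e7):
    nominal = n * p_single
    payload = payload_performance(n, p_single, alpha)
    print(f"N={n:.0e}  nominal={nominal:.2e}  payload={payload:.2e}")
# The nominal performance grows linearly with N, while the payload performance
# saturates near P_single / (1 - alpha); the widening gap is the "dark performance".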
Notice that the "algorithmic effects", such as dealing with "sparse" data structures (which affects cache behavior and will have a growing importance in the age of "real-time everything", "big data", and "neural networks") or communication between parallelly running threads, such as returning results repeatedly to the main thread in an iteration (which greatly increases the non-parallelizable fraction in the main thread), manifest through the HW/SW architecture, and we can hardly separate them. Also notice that there are one-time and fixed-size contributions, such as utilizing time-measurement facilities or calling system services. Since $\alpha_{eff}$ is a relative merit, the absolute measurement time must be large. When utilizing efficiency data from measurements dedicated to some other goal, we must exercise proper caution with the interpretation and accuracy of those data.
The 'right efficiency metric' [51] has always been a question (for a summary, see the references cited in [52]) when defining efficient supercomputing. The present discussion aims to discover the inherent limitations of parallelized sequential computing and to provide numerical values. For this goal, the 'classic interpretation' [3,4,42] of performance was used, in its original spirit. We scrutinized the contributions mentioned in those papers and revised their importance under current technical conditions.

The effect of different contributions to α
Theory can describe systems with any contributors and any parameters. From measured data, we can conclude only on the sum of all contributions, although dedicated measurements can reveal the value of the separated contributions experimentally. The publicly available data enable us to draw conclusions of mainly limited validity.

Estimating different limiting factors of α
The estimations below assume that the given contribution is the dominating one, i.e., that it alone defines the achievable performance. This is usually not the case in practice (the convolution of the different contributions is limiting), but this approach enables us to find the limiting $(1-\alpha)$ values for all contributions.
In systems implemented in the Single Processor Approach (SPA) [3] as parallelized sequential systems, life begins and ends in one such sequential subsystem; see also Fig. 1. In large parallelized applications running on general-purpose supercomputers, initially and finally only one thread exists, i.e., the minimal necessary non-parallelizable activity is to fork the other threads and to join them again.
With the present technology, no such action can be shorter than one processor clock period. (This statement is valid even if some parallelly working units can execute more than one instruction in a clock period. One can take these two clock periods as an ideal, but not realistic, case; the actual limitation will inevitably be much worse than the one calculated for this idealistic case, and the exact number of clock periods depends on many factors, as discussed below.) The theoretical absolute minimum value of the non-parallelizable portion of the task is thus given as the ratio of these two clock periods to the total execution time. The latter is a free parameter in describing efficiency. That is, the value of the effective parallelization $\alpha_{eff}$ depends on the total benchmarking time (and so does the achievable parallelization gain).
This dependence, of course, is well known to supercomputer scientists. For measuring the efficiency with better accuracy (and also for producing better $\alpha_{eff}$ values), hours of execution time are used in practice. In the case of benchmarking the supercomputer Taihulight [50], a 13,298-second HPL benchmark runtime was used; on the 1.45 GHz processors this means $2\times10^{13}$ clock periods. The inherent limit of $(1-\alpha_{eff})$ at such a benchmarking time is $10^{-13}$ (or, equivalently, the achievable performance gain is $10^{13}$). For simplicity, 1.00 GHz processors (i.e., 1 ns clock cycle time) will be assumed in this paper.
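A back-of-envelope check of the quoted inherent limit, using the Taihulight figures cited above:

f_clock = 1.45e9       # Hz, Taihulight processor clock
t_bench = 13_298       # s, HPL benchmark runtime cited from [50]
total_cycles = f_clock * t_bench          # about 1.9e13 clock periods
limit = 2.0 / total_cycles                # two non-parallelizable clock periods
print(f"total cycles ≈ {total_cycles:.1e}, (1 - alpha_eff) ≥ {limit:.0e}")
# -> roughly 1e-13, i.e., a performance-gain ceiling of about 1e13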
Supercomputers are also distributed systems. In a stadium-sized supercomputer, we can assume a distance between processors (cable length) of up to about 100 m. The net signal round-trip time is ca. $10^{-6}$ s, or $10^3$ clock periods, i.e., in the case of a finite-sized supercomputer, the performance gain cannot be above $10^{10}$, simply because of the physical size of the supercomputer. The presently available network interfaces have 100...200 ns latency, and sending a message between processors takes a time of the same order of magnitude, typically 500 ns. This timing also means that building a better interconnection is not a bottleneck in enhancing performance; this statement is also underpinned by the discussion in section 4.3. However, sharing data instead of copying them can result in demonstrative performance improvements; see the case of Frontier.
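A similar quick estimate of the physical-size limit mentioned above, assuming a 100 m cable, a 1 GHz clock, and the benchmark length of the previous example:

c = 3.0e8                                    # m/s, speed of light
round_trip_cycles = (2 * 100 / c) * 1.0e9    # ≈ 670 cycles, order of 1e3
one_minus_alpha = round_trip_cycles / 2e13   # share of a ~2e13-cycle benchmark
print(f"round trip ≈ {round_trip_cycles:.0f} cycles, (1 - alpha) ≈ {one_minus_alpha:.0e}")
# -> about 3e-11, consistent with the ~1e10 gain ceiling quoted above
#    (the text rounds the round trip up to 1e3 clock periods)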
These predictions enable us to assume that the presently achieved value of $(1-\alpha_{eff})$ could also persist for roughly a hundred times more cores. However, another major issue arises from the computing principle SPA: the first processor can address only one processor at a time. As a consequence, at least as many clock cycles are used for organizing the parallelized work as there are addressing steps required. This number equals the number of cores in the supercomputer, i.e., the addressing operation in supercomputers in the TOP10 positions typically needs clock cycles in the order of $5\times10^5 \ldots 10^7$, degrading the value of $(1-\alpha_{eff})$ into the range $10^{-6} \ldots 2\times10^{-5}$. One can use two tricks to mitigate the number of addressing steps: either the cores are organized into clusters, as many supercomputer builders do, or, at the other end, the processor itself can take over the responsibility of addressing its cores [40]. Depending on the actual construction, the reducing factor of clustering of those types can be in the range $10^1 \ldots 5\times10^3$, i.e., the resulting value of $(1-\alpha_{eff})$ is expected to be around $10^{-7}$.
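An illustrative sketch of the addressing cost with and without clustering (all numbers hypothetical): in the flat SPA case the orchestrating core issues one addressing step per core, while with clustering it addresses only the cluster heads, which then address their members in parallel.

n_cores = 8_000_000        # hypothetical TOP10-class core count
cluster_size = 1_000       # hypothetical cores per cluster
flat_steps = n_cores                        # sequential addressing steps without clustering
clustered_steps = n_cores // cluster_size   # only cluster heads are addressed sequentially
print(f"flat: {flat_steps:,} cycles, clustered: {clustered_steps:,} cycles")
# The reduction factor equals the cluster size, in line with the
# 1e1 ... 5e3 range quoted above.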
As Eq. (6) suggests, the trivial way to increase a system's performance is to improve the single processor's performance, for example, by combining the PUs with accelerators. Various implementations of the idea exist, but "What is initially perplexing is that accelerators do not dominate the architectures in the Top500, given the performance and price/performance advantages they offer as compute engines" [53]. This situation seems to change: now Frontier has taken over the leading position, and presumably, other supercomputers will be built with its technology. The primary reason for the change is that sharing data (mainly at the L1 and L2 levels), as opposed to copying data between the separated address spaces of PU and GPU, dramatically reduces the internal latency of the PUs; in the case of "racing" supercomputers, this non-payload contribution can be decisive, see section 4.2. A critical difference between the previous and recent designs of GPUs is that "The two GK110 GPUs in the K80s were not interconnected and could not share data; the two Aldebaran GCDs are and can" [54]. The consequence of the lower latency is a higher performance gain and, due to it, a higher absolute performance. The former GPUs, because of their higher latency due to copying between address spaces, could not get into the top group of supercomputers.
Fig. 4 The time dependence of the number of floating-point operations performed when running the benchmark HPL on the supercomputer Frontier. Reproduced from [55].

[56] introduces an unexplained claim: "As configured for the Linpack run, the Frontier system had a power draw of 21.1 megawatts, but . . . when the Frontier machine is first set loose on the Linpack, it draws an extra 15 megawatts of power as it is starts the job." Together with Fig. 4, one can understand why.
At the very beginning and end of the computation, the operands must be distributed and the results collected, respectively. This non-payload contribution grows with the number of cores. As discussed above, several ideas are used to reduce its time: from the very common "clustering" to using assistant cores [40]. Frontier introduced a revolutionary solution for that task. As the left side of Fig. 4 illustrates, for about 3 minutes Frontier, at least its monitored part, performs no floating-point operations. After that period, all operands suddenly appear in their place, and the supercomputer can make its computations at full speed; apparently, no time is spent transferring the operands to the peer cores. This task is what Frontier uses the extra 15 MW of power for: for this short period, a hidden second specialized supercomputer starts up, with the sole mission of delivering the operands to their place. From the viewpoint of the coordinator core, it is read-only access that can be massively parallelized. However, it needs an enormous parallel bus capacity and electric power to drive millions of Input/Output (I/O) ports.
The right side of Fig. 4 illustrates that when the benchmark computation HPL finishes, it takes about 40 minutes to collect the data from its nearly 9M cores. It also enables us to estimate that the value of $T_T$ (the time needed for a single transfer) is about 200,000 ns [55]; a value unexpectedly high compared to Slingshot's advertised 200 Gb/s transfer bandwidth. This indicates very poor bus utilization; the datasheet value does not consider the arbitration time and the data transmission time. This finding is not unprecedented: [53] mentions that "IBM and Nvidia had issues with the NVLink coherent fabric between the CPUs and GPUs in Summit, and that machine did not get 200 Gb/sec InfiniBand as was hoped when it was installed." Yes, the "proxy" core and the separated CPU/GPU address space increased the node's internal latency (its non-payload contribution).
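A quick consistency check of the $T_T$ estimate above, assuming the roughly 40-minute collection ramp is dominated by one sequential transfer per core:

ramp_seconds = 40 * 60       # duration of the collection ramp
cores = 8.7e6                # "nearly 9M" cores
t_transfer = ramp_seconds / cores
print(f"T_T ≈ {t_transfer * 1e9:,.0f} ns")
# ≈ 270,000 ns, the same order of magnitude as the ~200,000 ns cited from [55]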
The figure also shows that the reported computing performance is the average of the momentary values over the period shown in the figure. Without the "phantom supercomputer" (i.e., having two ramps), the performance would be about 0.8 EFlops (that is, less than the magic 1 EFlops), and with a sufficiently long measurement time, it could approach 1.3 EFlops. Notice that the "ramp" on the right side would occur several times in HPCG-type computations (and, of course, in all real-life computations) and would keep the average computing performance low. To prevent this, one can reduce the number of the measured cores to, say, 10%, in which case the ramp's length is about 4 minutes, and in that way, a much better efficiency and speedup can be reported.
The 'end of computation' situation means that the orchestrating core must collect the results from all peer cores. It imitates the case when, in an Artificial Intelligence (AI) workload, a 'neuron' must collect the results from all of its (interested) peers, and all neurons want to perform that action at the same time, using the same single bus. For an AI-type application, the increased level of communication keeps PUs waiting for the single high-speed bus, so that the computed results "are processed as they come in" and, to reduce the apparent overload, "are dropped if the receiving process is busy over several delivery cycles" [28]. Actually, using a single high-speed bus excludes the chance of achieving real-time simulation speed. This method of implementation is why "artificial intelligence, . . . it's the most disruptive workload from an I/O pattern perspective".
One must also use an operating system, for protection and convenience. If fork/join is executed by the OS, as usual, then because of the needed context switchings $2\times10^4$ [57,58] clock cycles are used, rather than the 2 clock cycles considered in the idealistic case. The derived values are correspondingly different by four orders of magnitude; that is, the absolute limit is $\approx 5\times10^{-8}$, on a zero-sized supercomputer. This value is somewhat better than the limiting value derived above, but it is close to it and surely represents a considerable contribution. This limitation is why a growing number of supercomputers use a "light-weight kernel" or run their actual computations in kernel mode; a method of computing that can be used only with well-known benchmarks. Frontier's 'phantom supercomputer' eliminates this OS contribution as well.
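A sketch of the OS contribution estimate (the $2\times10^4$-cycle context-switch cost is the figure cited from [57,58]; the total benchmark length below is a hypothetical value chosen only to reproduce the quoted limit):

ctx_switch_cycles = 2e4      # fork/join via the OS, clock cycles [57,58]
total_cycles = 4e11          # hypothetical benchmark length (~400 s at 1 GHz)
print(f"(1 - alpha_OS) ≈ {ctx_switch_cycles / total_cycles:.0e}")
# -> about 5e-8, the zero-sized-supercomputer limit quoted above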
However, this optimistic limit assumes that the processor can access its instructions in one clock cycle. This is usually not the case, but it seems to be a good approximation. On the one hand, even a cached instruction in the memory needs about five times more access time, and the time required to access 'far' memory is roughly 100 times longer. Correspondingly, the optimistic achievable performance gain values shall be scaled down by a factor of 5...100. A considerable part of the difference between the efficiencies $\alpha_{eff}^{HPL}$ and $\alpha_{eff}^{HPCG}$ can be attributed to different cache behavior because of the 'sparse' matrix operations.

The effect of workload
The admittedly complex Figure 5 attempts to explain why and how the performance of a supercomputer configuration depends on the application it runs. The non-parallelizable fraction (denoted in the figure by $\alpha_{eff}^X$) of the computing task comprises components $X$ of different origins. As we already discussed, and as was noticed decades ago, "the inherent communication-to-computation ratio in a parallel application is one of the important determinants of its performance on any architecture" [4], suggesting that communication can be a dominant contribution to a system's performance. Figure 5.A displays a case with minimum communication, and Figure 5.B a case with moderately increased communication (corresponding to real-life supercomputer tasks). As the nominal performance increases linearly and the payload performance decreases inversely with the number of cores, at some critical value, where an inflection point occurs, the resulting payload performance starts to drop. The resulting non-parallelizable fraction sharply decreases the efficacy (or, in other words, the performance gain or speedup) of the system [6,29]. The effect was noticed early [4], under different technical conditions, but somewhat faded due to the successes of the development of parallelization technology; see [4], [45](Figure 7) and [28](Figure 8). The looping contribution (the thin green line in the figure) is the same in both cases. Consequently, the achievable payload performance is lower, and the payload performance breakdown is softer, in the case of real-life tasks.
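A minimal sketch of the first- versus second-order model described above (all parameters hypothetical): in second order the sequential fraction grows with N, so the payload performance first rises, then peaks, then drops, reproducing the inflection point qualitatively.

def payload(n, p_single, seq_fixed, seq_per_core):
    # second-order model: (1 - alpha) grows linearly with N (looping/housekeeping)
    one_minus_alpha = seq_fixed + seq_per_core * n
    return n * p_single / (n * one_minus_alpha + (1.0 - one_minus_alpha))

for n in (1e4, 1e5, 1e6, 1e7):
    print(f"N={n:.0e}  payload={payload(n, 1.0, 1e-7, 1e-12):.3e}")
# The payload performance peaks near N = 1/sqrt(seq_per_core) = 1e6 and then
# decreases: "using more PUs ... increases the execution time".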
Figure 3 depicts the experimental equivalent of Figure 5. Given that no dedicated measurements exist, it is hard to compare the theoretical prediction directly to measured data. However, the impressive and quick development of interconnection technologies provides a helping hand.
In the HPL class, the communication intensity is the lowest possible: the computing units receive their task (and parameters) at the beginning of the computation, and they return their results at the very end. The core orchestrating their work must deal with the fellow cores only in these periods, so the communication intensity is proportional to the number of cores in the system. Notice the need to queue requests at the task's beginning and end.
In the HPCG class, iteration takes place: the cores return the result of one iteration to the coordinator core, which performs sequential operations: it not only receives and re-sends parameters but also needs to compute new parameters before sending them to the fellow cores. The program repeats this process several times. As a consequence, the non-parallelizable fraction of the benchmarking time grows proportionally to the number of iterations and the size of the problem. The effect of this extra communication decreases the achievable performance roofline [43]: as shown in Fig. 3, the HPCG roofline is about 200 times lower than the HPL one, as discussed in section 3. Turning the memory into a (partly) active element, using different 'coherence' solutions such as OpenCAPI [59] or data sharing at the L1 and L2 levels, can mitigate this effect. See also section 5.

The effect of interconnection
As discussed above, in a somewhat simplified view, we can calculate the resulting performance using the contributions to $\alpha$ as

$\alpha_{total} = \alpha_{Net} + \alpha_{Compute} + \ldots$

where $\alpha_X$ here denotes the contribution of origin $X$ to the non-parallelizable fraction. That is, we must handle two of the contributions with emphasis. The theory provides values for the contributions of interconnection and calculation separately. Fortunately, the public database TOP500 [10] also provides data measured under conditions that come considerably close to measuring the 'net' interconnection contribution. Of course, the measured data reflect the sum of the contributions of all components. However, as will be shown below, the mentioned contributions dominate the total, and all but the contribution from networking are (nearly) unchanged, so the difference of the measured $\alpha$ values can be directly compared to the difference of the corresponding sum of the calculated $\alpha$ values, although here only qualitative agreement can be expected. Both the quality of the interconnection and the nominal performance are parametric functions of their construction time. On the theory side, one can assume that (in a limited period) the interconnection contribution changed as a function of the nominal performance as shown in Figure 6A. The other primary contribution is assumed to be the computation itself. The benchmark computation contributions for HPL and HPCG are very different, so the sum of the respective component plus the interconnection component is also very different. Given that at the beginning of the considered period the contribution from the HPCG calculation and that of the interconnection were of the same order of magnitude, their sum changed only marginally (see the upper diagram lines), i.e., the measured performance improved only slightly. Because the benchmark HPCG is communication-bound (and so are real-life programs), their efficiency would be an order of magnitude worse. The reason is Eq. (4): when supercomputers use all of their cores, their achievable performance is not higher (or may even be lower), only their power consumption is higher (and the calculated efficiency is lower). As predicted: "scaling thus put larger machines at an inherent disadvantage" [4]. The cloud-like supercomputers have a disadvantage in the HPCG competition: the Ethernet-like operation results in relatively high $(1-\alpha)$ values.
The case with the HPL calculation is drastically different (see the lower diagram lines). In this case, at the beginning of the considered period, the contribution from the interconnection is much larger than that from the computation. Consequently, the sum of these two contributions changes sensitively as the speed of the interconnection improves. As soon as the contribution from the interconnection decreases to a value comparable with that from the computation, the decrease of the sum slows down considerably, and further improvement of the interconnection causes only a marginal decrease in the value of the resulting $\alpha$ (and so only a marginal increase in the payload performance).
The measured data enable us to draw the same conclusion, but one must consider that here multiple parameters may have changed. Their tendency, however, is surprisingly straightforward. Figure 6.B is actually a 2.5D diagram: the size of the marks is proportional to the time passed since the beginning of the considered period. A decade ago, the speed of the interconnection gave a significant contribution to $\alpha_{total}$. Enhancing it drastically in the past few years increased the systems' efficacy. At the same time, because of the stalled single-processor performance, the other technology components changed only marginally. The computational contribution to $\alpha$ from benchmark HPL remained constant as a function of time, so the quick improvement of the interconnection technology resulted in a quick decrease of $\alpha_{total}$, and the relative weights of $\alpha_{Net}$ and $\alpha_{Compute}$ reversed. The decrease in the value of $(1-\alpha)$ can be considered as the result of the decreased contribution from the interconnection.
However, the total $\alpha$ contribution decreased considerably only until $\alpha_{Net}$ reached the order of magnitude of $\alpha_{Compute}$. This happened in the first 4-5 years of the period shown in Figure 6B: the sloping line is due to the enhancement of the interconnection. Then the two contributors changed their roles, and the constant contribution due to the computation started to dominate, i.e., the total $\alpha$ contribution decreased only marginally. As soon as the computing contribution took over the dominating role, the value of $\alpha_{total}$ did not fall anymore: all measured data remained above that value of $\alpha$. Correspondingly, the payload performance improved only marginally (and due to factors other than the interconnection).
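A sketch of how the contributions of different origins combine and how the largest term dominates (all values hypothetical, chosen only to mirror the qualitative picture in Fig. 6):

contributions = {            # hypothetical contributions of each origin to the
    "net": 1e-7,             # non-parallelizable fraction (alpha_X in the text)
    "compute_HPL": 5e-8,
    "compute_HPCG": 2e-5,    # iteration/communication makes this dominant
}
alpha_total_hpl = contributions["net"] + contributions["compute_HPL"]
alpha_total_hpcg = contributions["net"] + contributions["compute_HPCG"]
print(f"HPL: {alpha_total_hpl:.1e}  HPCG: {alpha_total_hpcg:.1e}")
# Improving the interconnection (the "net" term) sharply reduces the HPL sum
# but barely changes the HPCG sum: once the computing contribution dominates,
# the total saturates, as in Fig. 6.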

Fig. 6 The effect of changing the dominating contribution. The left subfigure (A) shows the theoretical estimation, the right subfigure (B) the corresponding measured data, as derived from the public database TOP500 [10] (only values for the first four supercomputers are shown). When the contribution from the interconnection drops below that of the computation, the value of $(1-\alpha)$ (and the performance gain) gets saturated. The red 'x' marks denote half-precision values.
At this point, as a consequence of the change in the dominating contributor, it was noticed that the efficacy of benchmark HPL and the efficacy of real-life applications started to differ by up to two orders of magnitude. Because of this, a new benchmark program, HPCG [47], was introduced, since "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications" [48].
Since the major contributor is the computing itself, the different benchmarks contribute differently, and since that time the "supercomputers have two different efficiencies" [16]. Indeed, if the dominating $\alpha$ contribution (from the benchmark calculation) is different, then the same computer shows different efficiencies as a function of the computation it runs. Since that time, the interconnection provides a smaller contribution than the computation of the benchmark. Due to that change, enhancing the interconnection contributes mainly to the dark performance rather than to the payload performance. For the goals of exascale computing, the efforts (and expenses) to enhance the interconnection seem to be rather pointless. Today, even in the benchmark HPL, the sequential contribution of the interconnection is only a tiny fraction of that of the computation itself; for solving real-life problems, an enhanced interconnection makes no noticeable difference. The fundamental issue is that the need for communication (including "sparse" computations) blocks computing performance.

The effect of accelerator
As suggested by Eq. (6), the trivial way to increase supercomputers' resulting payload performance is to increase the single-processor performance of their processors. Given that the single-processor performance has reached its limitations, accelerators are frequently used for this goal. Fig. 7 shows how using accelerators with separate address spaces influences the ranking of supercomputers.
The left subfigure shows that a supercomputer's ranking does not depend on which acceleration method it uses. Essentially the same is confirmed by the right subfigure: the non-payload portion rises with the ranking position, and the slope is the same for any acceleration. As the left subfigure depicts, GPU-accelerated processors increase the payload performance of the system by a factor of 2-2.5 (the value is in good accordance with that measured in [60]). The right subfigure reveals that the non-payload to payload ratio of GPU-accelerated systems is nearly an order of magnitude worse than that of the non-accelerated systems. That is, the resulting efficiency can be (depending on the size of the system) worse than in the case of using unaccelerated processors; this can be a definite disadvantage when GPUs are used in a system with a vast number of processors. See also section 5: on the one side, how introducing GPUs to the system increased their payload performance; on the other side, how changing the GPU to another type, with a higher nominal performance but a larger memory to be filled with copied content, affected the payload performance of the system.
The 'weak scaling' provides a quick and easy way to estimate the nominal performance of a designed system. However, we experience an empirical efficiency (see Fig. 2) when measuring actual systems' payload performance, which refutes the validity of 'weak scaling'. With the proliferation of supercomputers using the same GPU accelerator and based on the same processor in systems of various sizes, we can refute the fallacy that an added GPU accelerator multiplies the payload performance of the system's cores by the nominal performance of the GPU. At the same time, we can more safely estimate the final empirical limits of supercomputing (using the present computing paradigm).

Fig. 8 The correlation of the specific single-processor performance (top row) and the non-payload to payload efficiency (middle row: HPL, bottom row: HPCG) with the number of cores in the system. The empty marks denote non-accelerated, and the filled marks GPU-accelerated processor systems. The present (as of November 2021) state (right column) versus the state three years ago (left column). The red line section displays the estimated theoretically possible best non-payload to payload ratio (see section 4.1). The dashed line serves as a guide for the eye.
Fig. 8 shows how the specific single-processor performance (the single processor performance divided by the clock frequency) and the non-payload to payload ratio change in the function of the number of system's cores.The figures also show how a better integration with the GPU components (and the intention to achieve higher performance) transformed the landscape in three years.
As the top row displays, the accelerated specific processor performance (unlike the power consumption) decreases to the level of the non-accelerated specific processor performance as the system's size increases. We could guess this tendency already three years ago, but now it has become much more pronounced with the appearance of masses of GPU-accelerated systems. On the one hand, the enhanced acceleration technology is sensitive to the implementation method (it shows a significant variance), but it can produce a performance boost factor of about 4. Dedicated measurements provided a factor of only 2...3 [60], and in some supercomputer implementations the enhancement is marginal (see the upper left subfigure of Fig. 8). On the other hand, the required non-payload to payload ratio strongly decreases as the number of cores in the system increases (also reflecting the effect of the physical size), and it is different for different workloads.
As the middle and bottom rows display, using an accelerator seems not to influence how the amplification factor depends on the number of cores: the increased latency (caused by the need to copy data between the address spaces) counterbalances the higher processor performance at those core numbers. In the case of the benchmark HPL, the dashed line (serving to guide the eye) crosses the estimated performance limit (the short red line section, see subsection 4.1) at about 10M cores; this could be the maximum number of cores that can cooperate for this workload. In the case of the benchmark HPCG, the performance roofline is about 200 times lower, and the critical number of cores is about ten times lower. The red diamonds impressively populate the figure, but they all belong to a relatively low number (a few hundred thousand) of cores. There is no chance they can work with cores above a million. The right subfigures underpin that a high number of processors must be accompanied by good parallelization efficiency; otherwise, the large number of cores cannot counterbalance the decreasing efficiency, see Eq. (6). We cannot build more extensive systems using the present paradigm and technology.
The two measured data points above the rooflines mark the supercomputers Fugaku and Taihulight. The latter uses 'cooperating processors' [40]; both deviate slightly from the classic computing paradigm. The former can perform better because of its cleverly placed cache memories (closer to 'in-memory computing'). We can increase the nominal performance without limitation; still, the system either will not start up at all (mainly because of the "communication collapse" [61]) or will work only at marginal computing and energetic efficiency. For the HPCG workload, because of the reasons mentioned, only reliable data are included.
Noticeably, in Fig. 2 the systems having the best efficiency values do not use accelerators: the efficiency of systems using accelerators is much lower also in the case of the HPL benchmark, and even more disadvantageous in the case of the HPCG benchmark. As can be seen, one can reach the "roofline" efficiency using only a fraction of the available cores. The two new items in Fig. 2 (as of June 2021, Selene and Jewels; based entirely on GPUs) not only show the worst efficiency for the HPL benchmark, but their HPCG efficiencies push the number of usable cores (for real-life tasks) well below the hundred thousand limit. On the one side, the achieved HPCG efficiency values show the same tendency as the HPL efficiency values: the more cores, the lower the efficiency. On the other side, Figures 2 and 8 show that for real-life tasks, the reasonable size of supercomputers (even in the case of benchmark HPL) is below 1M cores for non-accelerated cores. At that number of cores, their efficiency is just a few percent, and increasing the number of cores decreases their efficiency (it does NOT increase their payload performance). In other words, we can conclude that the accelerators enable the systems to reach a much worse non-payload to payload ratio. For an example, see Perlmutter: the slight increase in both the nominal and payload performance for benchmark HPL (the HPL roofline is reached) is not accompanied by any enhancement for benchmark HPCG (the HPCG roofline was already reached at a small fraction of the cores).
In the past few years, a tendency has appeared to present better efficiency values when oversized supercomputers attack real-life tasks; the right bottom figure shows why. The theoretically estimated amplification factor is about 200-300 times lower for the HPCG workload than for the HPL workload. However, some HPCG data points seem to appear with a much better amplification factor. The reason is that the HPCG workload does not enable increasing the system's payload performance above its roofline when using the same number of cores that were used when benchmarking the systems with the HPL workload. Because of this limitation, the "measured cores" in the HPCG benchmarking are just a fraction (about 10%) of the total cores. The figure shows two connected data point pairs: the upper point is the published performance (calculated assuming that the total number of cores was used), the lower point is the actual performance (calculated using the "measured cores" data). This correction puts those performance amplification data back onto the correct scale. It confirms that for real-life tasks, the maximum number of processors in the system is below 1M for non-accelerated cores and below 0.1M cores for accelerated ones. Unfortunately, the item "measured cores" is no longer published in the TOP500 lists' records. This cheating can be seen only in that the HPCG/HPL performance ratio is greatly improved; even in the case of older machines, there is a sudden jump (about a ten times better value) in their performance ratio.

The effect of reduced operand length
The so-called HPL-AI benchmark uses Mixed Precision rather than Double Precision computations. (It is a common fallacy that HPL-AI is a benchmark for AI systems. The name means "The High Performance LINPACK for Accelerator Introspection" (HPL-AI), and "that benchmark seeks to highlight the convergence of HPC and artificial intelligence (AI) workloads", see https://www.icl.utk.edu/hpl-ai/. It has not much to do with AI, except that it uses the operand length common in AI tasks. HPL, similarly to AI, is a workload type. However, even https://www.top500.org/lists/top500/2020/06/ mismatches operand length and workload: "In single or further reduced precision, which are often used in machine learning and AI applications, Fugaku's peak performance is over 1,000 petaflops (1 exaflops)".) The name suggests that AI applications may run on the supercomputer with that efficiency. However, the type of workload does not change, so one can expect the same overall behavior for AI applications, including Artificial Neural Networks (ANNs), as for double-precision operands. For AI applications, the limitations remain the same as described above, except that when using Mixed Precision, the efficiency can be better by a factor of 2...3, strongly depending on the workload of the system; see also Fig. 10.
Unfortunately, when using half-precision, the enhancement comes from accessing fewer data in memory and using faster operations on shorter operands, instead of reducing the communication intensity that defines the efficiency (on the contrary, the relative weight of communication increases in this way). Similarly, exchanging data directly between processing units [40] (without using global or even local memory) also enhances α (and payload performance) [62], but it represents a (slightly) different computing paradigm. Only the two mentioned measured data points fall above the limiting roofline of (1 − α) in Figure 6.
The recent supercomputers Fugaku [25] and Summit [63] provided their HPL performance for both 64-bit and 16-bit operands. Of course, their performance seems much better with the shorter operand length (for the same number of operations, the total measurement time is much shorter). One would expect their performance to be four times higher when using four times shorter operands. The power consumption data [63] underpin this expectation: power consumption is about four times lower for four times shorter operands. The computing performance, however, shows a smaller enhancement: a factor of 3.01 for Summit and 3.42 for Fugaku, because of the needed housekeeping.
In the long run, a Time_X value comprises housekeeping and computation. We assume that housekeeping (indexing, branching) takes the same fixed amount of time for different operand lengths, and that the other time contribution (data delivery and bit manipulation) is proportional to the operand length. Given that, according to our model, the measured payload performance directly reflects the sum of all contributions, we can assume that

Time_X = F_0 + (X/16)·F_16,

where F_0 is the time contribution from housekeeping (in a long-term run, using benchmark HPL), and F_16 is the time contribution due to manipulating 16 bits.
We can use two simple models to calculate the relative values of F_0 and F_16. Both models are speculative rather than technically established, but they nicely point to the vital issue: the role of housekeeping.
If no part of the housekeeping runs in parallel with the floating-point computation and, furthermore, computing and data transfer operations do not block each other, we can use the simple summing above; see Fig. 9. With a proper parallelization method, we can perform the non-floating housekeeping for the next operand in parallel with the floating operation of the current operand. In this way, theoretically, we can reach the possible ratio of four. Notice, however, that this model is not aware of the data transfer time.
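Under the serial-summing model, the published 16-bit/64-bit performance ratios quoted above already determine the relative weight of housekeeping. A short worked example (a sketch under the serial-summing assumption only, neglecting the measurement-time correction discussed below):

Time_64 = F_0 + 4·F_16,   Time_16 = F_0 + F_16
r = Time_64 / Time_16   ⇒   F_0 / F_16 = (4 − r) / (r − 1)
Summit: r = 3.01 ⇒ F_0/F_16 ≈ 0.49;   Fugaku: r = 3.42 ⇒ F_0/F_16 ≈ 0.24

Already this crude estimate points to the conclusion drawn from Table 1 below: housekeeping weighs considerably less for Fugaku than for Summit.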
If we consider the temporal behavior of the components, we can instead use the trigonometric sum of the non-payload and payload times; see [31].
Table 1 shows the data calculated from the values published in the TOP500 list for the supercomputers Fugaku and Summit, together with their parameters calculated as described above, using the time-aware summing model. There is no significant difference between the values derived using the two models in the discussed simple workload case. The values in the figures are given in units of payload performance ratios; the values in the table are given in units of the respective α contributions. We can compare the numbers only as discussed below. The values Eff_64 and Eff_16 are calculated from the corresponding published R_Max/R_Peak values. We calculate Amdahl's parameter using Eq. (5) for the two different operand lengths. As discussed, (1 − α) is the sequential portion of the total measurement time (aka the non-payload to payload ratio). Assuming that the time unit is the total measurement time, the limiting time of performing a floating operation with 64-bit operands, Time_64 (on our arbitrary time scale), is directly derived from the (1 − α_64) value. To get a Time_16 value on the same scale, we must correct the measured (1 − α_16) values for their differing measurement times (the measured performance ratios 3.42 and 3.01, respectively).
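The computation chain behind Table 1 can be sketched as follows. This is only a sketch under stated assumptions: Eq. (5) is taken to be the usual Amdahl inversion α = (E·N − 1)/(E·(N − 1)), and the efficiency values, core count, and performance ratio below are hypothetical placeholders, not the published Fugaku/Summit data.

# Sketch of the Table 1 computation chain (hypothetical inputs, not published data).
# Assumed Eq. (5): alpha = (E*N - 1) / (E*(N - 1)); serial-summing model:
# Time_64 = F_0 + 4*F_16, Time_16 = F_0 + F_16 (on the HPL-64 time scale).
def alpha_from_efficiency(eff, n_cores):
    """Amdahl's parameter inferred from efficiency E = R_Max/R_Peak and core count."""
    return (eff * n_cores - 1.0) / (eff * (n_cores - 1.0))

eff_64, eff_16 = 0.80, 0.70   # hypothetical R_Max/R_Peak values for 64/16-bit runs
n_cores = 7_000_000           # hypothetical core count
perf_ratio = 3.4              # hypothetical measured 16-bit/64-bit performance ratio

time_64 = 1.0 - alpha_from_efficiency(eff_64, n_cores)
# Rescale the 16-bit value to the 64-bit time scale (that run was perf_ratio times shorter)
time_16 = (1.0 - alpha_from_efficiency(eff_16, n_cores)) / perf_ratio

f_16 = (time_64 - time_16) / 3.0      # from Time_64 - Time_16 = 3*F_16
f_0 = time_16 - f_16
print(f"Time_64={time_64:.2e}  Time_16={time_16:.2e}  F_0/F_16={f_0 / f_16:.2f}")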
One cannot compare the absolute values of the data in the last two columns of the table directly with those of the other supercomputer (their measurement times were different). Given that the task and the computing model were the same, we can directly compare the ratios of the F_16 and F_0 values. The proportions of both the F_0 values and the F_16/F_0 values show that housekeeping is much better for Fugaku than for Summit. Given that their architectures are globally similar, the plausible reason for the difference in their efficiency (and performance) is that in the case of Summit, the processor core plays the role of a proxy (and in this way represents a bottleneck), while Fugaku uses "assistant cores". The housekeeping increases latency and significantly decreases the performance of the system. The plausible reason for Fugaku's better F_16 values is the clever use (and positioning, see [31]) of L2-level cache memories.
In the case of Summit, we also know its HPCG efficiency. In its case, the Time_64^HPCG value is 2.08·10^−5, i.e., several hundred times higher than Time_64^HPL. Given that F_16 and F_64 are the same in the case of the two benchmarks, the difference is caused by F_0. In the long run, the different workload (iteration, more intensive communication, "sparse" computation forcing different cache utilization) forces a different F_0 value, and that leads to the "different efficiencies" [16] of supercomputers under different workloads. Given that we used a single "snapshot" of measurement times, these values are average values taken over all floating operations rather than actual operating times.
Fugaku used only a fraction of its cores in the benchmark HPCG, so we can validate only the achieved payload performance (but not its efficiency). It is very plausible that its HPCG performance reached its roofline and that, because of the higher number of cores, its real HPCG efficiency would be around that of Taihulight. Anyhow, it would not be fair either to assume that speculation or to accept a value measured at a different number of cores: there is no measured data.
These data directly underpin that the technology is (almost) perfect: the contribution from the benchmark calculation HPCG-FP64 is orders of magnitude larger than the contribution from all the rest. Recalling that the benchmark program imitates the behavior (as defined by the resulting α) of real-life programs, one can see that the contribution from the other computing-related actors is about a thousand times smaller than the contribution of the computation+communication.
[Figure/table fragment: HPL and HPCG timing, Summit and Fugaku implementation]
The unique role of the "mixed precision" efficiencies (the third kind of efficiency of a supercomputer), see the red 'x' marks in Fig. 6, deserves special attention. Strictly speaking, we cannot correctly position these points in the figure; they belong to a different scale (they are measured on different HW). On the one side, the same number of operations is performed, using the same number of PUs. On the other side, four times fewer data are transferred and manipulated. The nominal performance is expected to be four times higher than in the case of using double-precision operands. Without correcting for the more than three times shorter measurement (see below), the efficiency mark is slightly above the corresponding value measured with double-length operands (the relative weight of F_0 is higher); with the correction, it is somewhat below it. This difference, however, is noticeable only with benchmark HPL; in the case of the HPCG workload, computation (including the operand length) has a marginal effect. In the former case, the contributions of computation and communication are in the same range of magnitude, competing to dominate the system's performance. In the latter case, communication dominates the performance; computing has a marginal role. This difference is the reason why no data are available for the HPCG benchmark using half-precision operands.
Fig. 10 illustrates the role of the non-payload performance with respect to the operand length. In the figure (for visibility), a hypothetical ratio of 10 between the efficiencies measured by benchmarks HPL and HPCG was assumed. The non-payload contribution blocks the operation of the floating-point unit. The dominating role of the non-payload contribution also means that it is of little importance whether we use double- or half-precision operands in the computation. The blue vectors essentially represent the case of Summit; the red vector represents the FP_0 value of Fugaku (transformed to the scaling of Summit). We attribute the difference between their HPL performances to their different FP_0 values. This difference, however, becomes marginal as the workload approaches real-life conditions.

Further efficiency values
The performance corresponding to α_HPL^FP0 is slightly above 1 EFlops (when making no floating-point operations, i.e., rather Eops). Another peak performance, reported when running genomics code on Summit (using a mixture of numerical precisions and mostly non-floating-point instructions), is 1.88 Eops, corresponding to α_Genom^FP0 = 1·10^−8. Those two values refer to a different mixture of instructions, so the agreement is more than satisfactory.
Our simple calculations show that, in the case of the benchmark HPL, the FP_0 values are on the order of the FP_16 values, and that the benchmark HPL is computing-bound: reducing the housekeeping (including communication) makes some sense for "racing supercomputers", while for real-life applications it has only a marginal effect. On the other side, increasing the housekeeping (more communication, such as in the case of benchmark HPCG or ANNs) degrades the apparent performance. At a sufficiently large amount of communication [21], the housekeeping dominates the performance, and the contribution of FP_X becomes marginal. For large ANN applications, using FP_16 operands makes no real difference; their workload defines their performance (and efficiency). The "commodity supercomputers" can achieve the same payload performance, although they have a much lower number of PUs: the "racing supercomputers" either cannot use their impressively vast number of cores at all, or using all their cores does not increase their payload performance in solving real-life tasks.

The effect of the clock period's length
The behavior of time in computing systems parallels the quantal nature of energy known from modern science. Time in computing systems passes in discrete steps rather than continuously. This difference is not noticeable under usual conditions: both human perception and macroscopic computing operations work on time scales a million-fold longer. Under the extreme conditions represented by many-many-core systems, however, this quantal nature is the source of the inherent limitation of parallelized sequential systems. The fundamental issue is that operations must be synchronized; asynchronous operation provides performance advantages [64].
The need to synchronize operations (including those of many-many processors) using a central clock signal is especially disadvantageous when attempting to imitate the behavior of biological systems, which have no such central signal.
Although the intention to provide an asynchronous operating mode was a major design point [27], the hidden synchronization (mainly introduced by thinking in terms of conventional SW solutions) led to very poor efficiency [28] when the system attempted to perform its flagship goal: simulating the functionality of (a large part of) the human brain.
As we discussed in section 4.1, the performance also depends on the length of the measurement time, because of the fixed-time contributions. When making only 10-second-long measurements, the smaller denominator (compared to the HPL benchmarking time) may result in up to 10^3 times worse (1 − α_eff) and performance gain values. The dominant limiting factor, however, is a different one.
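A two-line sketch illustrates the measurement-time effect (the numbers are assumed, round values): a contribution that takes a fixed absolute time has a relative weight inversely proportional to the measurement time, so shortening an hours-long HPL-like run to 10 seconds inflates such terms by roughly three orders of magnitude.

# Illustrative only: relative weight of a fixed-time non-payload contribution
# for a long (HPL-like) run versus a 10-second run. Numbers are assumed.
t_fixed = 1.0                       # fixed non-payload time, s (assumed)
for t_total in (1.0e4, 10.0):       # hours-long run vs a 10-second run
    print(f"T_total = {t_total:>7.0f} s  ->  fixed-time share = {t_fixed / t_total:.0e}")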
In brain simulation, a 1 ms integration time (essentially, a sampling time) is commonly used [28]. The biological time (when events happen) and the computing time (how much computing time is required to perform the computing operations describing the same events) are not only different but also not directly related. Working with "signals from the future" must be excluded; for this reason, at the end of this period, one must communicate the calculated new values of the state variables to all (interested) fellow neurons. This action essentially introduces a "biology clock signal period", million-fold longer than the electronic clock signal period. Correspondingly, the achievable performance is desperately low: the system can simulate less than 10^5 neurons, out of the planned 10^9 [28]. For a detailed discussion see [29,31,21].
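The numbers behind the 'million-fold' statement can be checked with a one-line estimate (round, assumed values: a ns-range clock period against the 1 ms integration time):

# Rough order-of-magnitude check (assumed round values, not measured data)
t_bio = 1e-3   # integration (sampling) period used in brain simulation, s
t_clk = 1e-9   # electronic clock period, s (order of magnitude)
print(f"biology/electronic clock-period ratio: {t_bio / t_clk:.0e}")   # ~1e6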
The researchers also investigated [28] their power consumption efficiency. We presumed that, to avoid needless energy consumption, they performed the measurement at the point where involving more cores increases the power consumption but does not increase the payload simulation performance. This assumption resulted in the "reasoned guess" for the efficiency of brain simulation in Figures 2 and 3. Given that using AI workloads on supercomputers is of growing importance (for a discussion from this point of view, see [21]), the performance gain of a processor-based AI application can be estimated to lie between those of HPCG and brain simulation, closer to that of HPCG. As discussed experimentally in [65] and theoretically in [21], in the case of neural networks (especially when selecting an improper layering depth), the efficiency can be much lower than that estimated value.
Recall that since AI nodes usually perform simple calculations compared to the functionality of supercomputer benchmarks, their communication/calculation ratio is much higher, making the efficiency even worse. The experimental research [65] underpins our conclusions: "strong scaling is stalling after only a few dozen nodes"; "The scalability stalls when the compute times drop below the communication times, leaving compute units idle. Hence becoming a communication bound problem."; "network layout has a large impact on the crucial communication/compute ratio: shallow networks with many neurons per layer ... scale worse than deep networks with less neurons." The massively "bursty" nature of the data (different nodes of a layer want to use the communication subsystem simultaneously) also makes the case harder. The commonly used global bus is overloaded with messages (for a detailed discussion see [31]). That overload may lead to a "communicational collapse" (demonstrated in Figure 5.(a) in [61]): at an extremely large number of cores, exceeding the critical threshold of communication intensity leads to an unexpected and drastic change in network latency.

Accuracy and reliability of our model
As the values of the parameters of our model are inferred from non-dedicated, single-shot measurements, their reliability is limited. One can verify, however, how our model predicts values derived from later measurements. Supercomputers usually do not have a long lifespan and several documented stages. One of the rare exceptions is the supercomputer Piz Daint, whose documented lifetime spans over six years. Different numbers of cores, without and with acceleration, using various accelerators, were used during that period.
Figure 11 depicts the performance and efficiency values published during its lifetime, together with diagram lines predicting (at the time of making the prediction) values at higher nominal performance values. The left subfigure shows how changes made in the configuration affected its efficiency (the timeline starts in the top right corner, and a line connects adjacent stages).
In the right subfigure, bubbles represent data published in adjacent editions of the TOP500 lists, and the diagram lines crossing them are predictions made from that snapshot. We compare the predicted value to the value published in the next list. It is especially remarkable that introducing GPGPU acceleration resulted in only a slight increase (in good agreement with [60] and [17]) compared to the value expected based purely on the increase in the number of cores. Although more than one parameter changed between our "samplings", so the net effect cannot be demonstrated cleanly, the measured data sufficiently underpin our conclusions (of limited validity) and show that the theory correctly describes the tendency of the development of performance and efficiency; even its predicted performance values are reasonably accurate.
Introducing a GPU accelerator is a one-time performance-increasing step [17], and the theory cannot take it into account. Notice that introducing an accelerator increased the payload performance but decreased the efficiency (copying data from one address space to another increases latency). Changing the accelerator to another type with slightly higher performance (but higher latency due to its larger GPGPU memory) caused a slight decrease in the absolute performance because of the considerably dropped efficiency.

Towards zettaflops
As detailed above, our theoretical model enables us to calculate the payload performance, in first-order approximation, at any nominal performance value. In light of all of this, one can estimate the short-term and long-term development of supercomputer performance; see Figs. 12.A and 12.B. We calculated the diagram lines using Eq. (4), with α parameter values derived from the TOP500 data of the Summit supercomputer; the bubbles show measured values. The diagram lines, from the bottom up, show the double-precision HPCG, the double-precision HPL, and the half-precision [63] HPL (HPL-AI) diagrams. Given that we calculated our parameter values from a snapshot, that the calculation is essentially an extrapolation, and furthermore that at high nominal performance values using the second-order approximation becomes more and more pressing, the predictions shown in Figure 12 are rough and very optimistic approximations. However, they show values somewhat similar to the actual upper limiting values.
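The diagram lines of such a figure can be reproduced, in first-order approximation, from a single snapshot. The sketch below assumes that Eq. (4) has the Amdahl-based form R_Max = R_Peak / (N·(1 − α) + α); the single-core performance and the α values used here are illustrative placeholders rather than the ones actually derived from Summit's data.

# Sketch (illustrative parameters, not the values derived from Summit's data):
# first-order payload vs. nominal performance, as used for the Fig. 12 diagram lines.
P_SINGLE = 3.0e9   # assumed single-core performance, flop/s

def r_max(n_cores, alpha, p_single=P_SINGLE):
    """Assumed Eq. (4) form: R_Max = R_Peak / (N*(1 - alpha) + alpha), R_Peak = N*P."""
    r_peak = n_cores * p_single
    return r_peak / (n_cores * (1.0 - alpha) + alpha)

lines = {"HPCG": 1 - 1e-6, "HPL": 1 - 5e-9, "HPL-AI": 1 - 2e-9}   # placeholder alphas
for name, alpha in lines.items():
    for n in (1e6, 1e7, 1e8, 1e9):
        print(f"{name:7s} N={n:.0e}  R_Peak={n * P_SINGLE:.2e}  R_Max={r_max(n, alpha):.2e}")

Each line saturates at its roofline P_SINGLE/(1 − α), so increasing R_Peak further only widens the gap between nominal and payload performance.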
In addition to the measured and published performance data, two more diagram lines, representing two more calculated α values, are also depicted. The 'FP0' (orange) diagram line is calculated with the assumption that the supercomputer performs everything needed to run the HPL benchmark, but the actual FP operations are not performed. In other words, the computer works with zero-bit-length floating operands (FP0).
The 'Science' (red) diagram line is calculated with the assumption that nothing is calculated, but science alone (the finite propagation time due to the limited signal propagation speed) limits the performance.

Adding payload performances

Eq. (6) tells us that (in first-order approximation) the speedup of a parallelized computing system cannot exceed 1/(1 − α); a well-known consequence of Amdahl's statement. Due to this, the computing performance of a system cannot be increased above the performance defined by the single-processor performance, the parallelization technology, and the number of processors. The laws of nature prohibit exceeding a specific computing performance (using the classical paradigm and its classical implementation). There is an analogy between adding speeds in physics and adding performances in computing. In both cases, a correction term is introduced that provides a noticeable effect only at tremendous values. It is an interesting parallel that both nature and extremely cutting-edge technical (computing) systems show extraordinary behavior: the linear behavior experienced under normal conditions becomes strongly non-linear at large values of the independent variable. That behavior makes the validity of linear extrapolations, as well as the linear addition of performances, at least questionable at high performance values in computing.
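The analogy can be written out explicitly (an illustrative juxtaposition under the first-order model used above, not a claim of formal equivalence): relativistic addition of speeds introduces a correction term that matters only near c, and the performance of N processors adds with a correction term that matters only near the roofline:

w = (u + v) / (1 + u·v/c^2)              (adding speeds in physics)
R_Max(N) = N·P / (N·(1 − α) + α)         (adding the performance of N processors, first order)

For u·v/c^2 ≪ 1 the first expression reduces to w ≈ u + v; for N·(1 − α) ≪ 1 the second reduces to R_Max ≈ N·P, i.e., performances add linearly. Near c, and correspondingly near the roofline P/(1 − α), the correction term dominates and the linear addition fails.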

Quantal nature of computing time
In computing, the cooperation of components needs some kind of synchronization. Typically, it uses a central clock (with clock domains and distribution trees). In today's technology, the length of the clock cycle is in the ns range, so it appears quasi-continuous to human perception. In some applications, for example when attempting to imitate the parallel operation of the nervous system, the non-continuous nature of computing time comes to light. In that case, the independently running neurons must be put back onto their biological time scale, i.e., they stop at the end of the "integration grid time" and distribute their results to the peer neurons. This synchronization introduces a "biological clock time", about a million-fold longer than the "processor clock time", and limits the achievable parallelization gain. The effect is noticeable only at a vast number of cores. The limit is conceptual, but it is made much worse by the "communication burst" that the simultaneous need for communication in neural networks represents. For details see [29].

'Quantum states' of supercomputers
As discussed in [5], the behavior of computing systems under extreme conditions shows surprising parallels with the behavior of natural objects. Really, "More Is Different" [30]. The behavior of supercomputers is somewhat analogous to that of quantum systems, where the measurement selects one of the possible states (and at the same time kills all other possible states). In computing, a supercomputer, as a general-purpose computing system, has the potential of high performance, defined by the impressive parameters of its components, for all possible workflows. However, when we run a computation (that is, we measure the computing performance of our computer), that workload selects the best possible combination of limitations that defines its performance for the given workload and kills all other potential performances.
The logical dependence of the operation of the components implicitly also means their temporal dependence [31], and it introduces idle times into computing. In this way, the workload defines how much of those potential abilities can be used: the datasheet values represent hard limits, and the workload sets the soft limits, given that the components block each other's operation. The workload defines a fill-out factor: it introduces different idle times into the components' operation and, in this way, forces workload-defined soft performance limits onto the components of the supercomputer. Different workloads force different limitations (use the available resources differently), giving a natural explanation of the "different efficiencies" [16]. In other words, running some computation destroys the potentially achievable high performance defined individually by the components.
Benchmarking such computing systems introduces one more limiting component: the needed computation itself. For floating-point computations, the 'best possible' (producing the highest figures of merit) benchmark is HPL. With the development of parallelization and processor technology, the floating-point computation itself became the major contributor in defining the system's efficiency and performance. Since the benchmark measurement method itself is a computation, the best measurable floating payload performance value cannot be smaller than what the benchmark procedure itself represents.
For real-life programs (such as HPCG), their workload-defined performance level (saturation value) is already set well below the Eflops nominal performance; see Fig. 2. Further enhancements in technology, such as tensor processors and the OpenCAPI connection bus, can slightly increase their saturation level but cannot change the science-defined shape of the diagram line.

Conclusion
Supercomputers have reached their technical limitations; their development is out of steam. Continuing to enhance the components of a supercomputer that is intended to run any calculation, without changing its underlying computing paradigm, is not worthwhile anymore. To enter the "next level", the classic computing paradigm must really be renewed [66-68,31].
The ironic remark that 'Perhaps supercomputers should just be required to have written in small letters at the bottom on their shiny cabinets: "Object manipulations in this supercomputer run slower than they appear." [16]' is becoming increasingly relevant. The impressive numbers about the performance of their components (including single-processor and/or GPU performance and the speed of the interconnection) become less relevant when going to the extremes. Given that the most substantial α contribution today originates in the computation the supercomputer runs, even the best possible benchmark, HPL, dominates the floating performance measurement. Enhancing other contributions, such as the interconnection, results in a marginal enhancement of the payload performance, i.e., the overwhelming majority of the expenses increases their "dark performance" only. Because of this, the answers to the questions in the title are: there are as many performance values as there are measurement methods (which vary in how large a portion of the available cores is used in the measurement), and the benchmarks actually measure mainly how much mathematics/communication the benchmark program performs, rather than the supercomputer architecture (provided that all components deliver their technically achievable best parameters).

Fig. 5
Fig. 5 The figure explains how the different communication/computation intensities of applications lead to different payload performance values in the same supercomputer system. Left column: models of the computing intensities for different benchmarks. Right column: the corresponding payload performances and α contributions as a function of the nominal performance of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz). The blue diagram line refers to the right-hand scale (R_Max values), all others ((1 − α_eff^X) contributions) to the left-hand scale. The figure purely illustrates the concepts; the displayed numbers are only similar to real ones. The performance breakdown shown in the figures was experimentally measured by [4], [45] (Figure 7) and [28] (Figure 8).

Figure 5.
Figure 5.A illustrates the behavior measured with the HPL benchmark. The looping contribution becomes remarkable around 0.1 Eflops and breaks down the payload performance (see Figure 1 in [4]) when approaching 1 Eflops. In Figure 5.B, the behavior measured with benchmark HPCG is displayed. The application's contribution (brown line) is much higher than in the previous case. The looping contribution (thin green line) is the same as above. Consequently, the achievable payload performance is lower, and the payload performance breakdown is softer in the case of real-life tasks. Figure 3 depicts the experimental equivalent of Figure 5. Given that no dedicated measurements exist, it is hard to compare the theoretical predictions with the measurements.

Fig. 7
Fig. 7 Correlation of the performance of processors using accelerators with separate address space and of the effective parallelism with ranking, in 2017.
Fig. 8 The correlation of the specific single-processor performance (top row) and the non-payload to payload efficiency (middle row: HPL, bottom row: HPCG) with the number of cores in the system. The empty marks denote non-accelerated, and the filled marks GPU-accelerated processor systems. The present (as of November 2021) state (right column) versus the state three years ago (left column). The red line section displays the estimated theoretically possible best non-payload to payload ratio (see section 4.1). The dashed line serves as a guide for the eye.

Fig. 9
Fig. 9 Left: the serial summing of payload and non-payload contributions (assumes no parallelization and/or no blocking). Right: the parallel summing of payload and non-payload contributions (assumes parallelization and enables considering mutual blocking).

Fig. 10
Fig. 10 The role of the non-payload contribution in defining the HPL and HPCG efficiencies, for double- and half-precision floating operation modes. For visibility, a hypothetical efficiency ratio E_HPL/E_HPCG = 10 is assumed. The housekeeping (including transfer time and cache misses) dominates; the length of the operands has only marginal importance.

Fig. 11
Fig. 11 History of the supercomputer Piz Daint in terms of efficiency and payload performance [10]. The left subfigure shows how the efficiency changed as the developers proceeded towards higher performance. The right subfigure shows the reported performance data (the bubbles), together with the diagram lines calculated from those values as described above. Compare the value of a diagram line to the measured performance data in the next reported stage.

Table 1
Floating-point characteristics of the supercomputers Fugaku and Summit