Performance evaluation of multi-exaflops machines using Equality network topology

In modern computing architectures, graph theory plays a central role due to rising core counts, and it is indispensable to keep finding better ways to connect the cores. A novel chordal-ring interconnect topology, Equality, is revisited in this paper and compared with a few previous works. This paper details the procedures for constructing Equality interconnects, their special routing procedures, the strategies for selecting a configuration, and the evaluation of their performance using the open-source cycle-accurate BookSim package. Four scenarios representing small- to large-scale computing facilities are presented to assess network performance. This work shows that in 16,384-endpoint systems, the Equality network turns out to be the most efficient. The results also show the steady scalability of Equality networks extending to 48,000, 320,000, and a million endpoints. Equality networks are adjustable to fit commodity hardware and resilient under ten common traffic models. It is suggested that Equality network topology can be used to construct efficient multi-exaflops supercomputers and data centers.


Introduction
High-performance computing (HPC) is a type of computing that uses high-end computing components to cooperatively address large-scale tasks that cannot be solved easily by ordinary computers. The computing components are connected by HPC networks to achieve better efficiency.
An HPC network differs from other networks in that it often seeks to overlap communication and computation so that communication interrupts computation as little as possible, increasing efficiency. An HPC network also tends to use homogeneous computing hardware, such as the same model of switches (with an equal number of ports), CPUs, and accelerators across the entire installation. Homogeneous products in a system ensure lower prices for each component due to mass production, and more straightforward repair by prompt replacement when parts fail.
Hwang et al. have shown the potential of Equality network compared with a few popular HPC network topologies [1][2][3][4] such as 2-tier fat-tree, 3-tier fat-tree, 3D torus, and 5D torus. In this work, we further analyze the performance of Equality networks in different scales to compare with Slim Fly, Dragonfly, and two popular network topologies, Fat-tree and Tori. We also extend the focus on applying the Equality networks to enable machines capable of reaching multi-exaflops based on current hardware craftsmanship.
The main contributions of the current work, beyond previous works, include the following:
• The development and implementation of systematic routing tables for Equality networks,
• The modified routing algorithm bottleneck-UGAL, which refrains from over-subscribed paths,
• The introduction and explanation of a new measure called bisection ratio, in addition to bisection bandwidth,
• The analysis of the resulting network properties (diameter, average distance, latency, and throughput) of various scales of Equality networks and comparison with other existing publications,
• The strategies for finding a suitable configuration for a future HPC system utilizing Equality network topology, and
• The largest cycle-accurate simulation ever calculated by BookSim (a 1 M-endpoint system).

Network architecture

Different network topologies are often designed beforehand to fit specific workloads. To judge the quality of a network, and whether it is suitable for the target application workloads, one can inspect the performance measures of the network and additionally perform simulations on it. The standard network measures used in this paper are the network diameter d and the average distance a. The standard communication measures are the message latency and the network's overall throughput under different traffic patterns and injection intensities.
A well-balanced topology should have a reasonable network diameter, accompanied by tailored routing algorithms to reduce latency and increase throughput. Nevertheless, for any application, a given network has an effective diameter, d_eff, if all communication patterns of the specific application use no more than d_eff hops in the network, regardless of the actual network diameter. The same idea applies to the average hop count.
Under low injection rates, when a packet never contends with other packets for resources, the latency is called the 'zero-load latency,' l_0, which is the sum of the per-hop latencies when no queuing or blocking is involved. Under high injection rates, the network saturates under the specific load, traffic pattern, and routing algorithm: the latency grows without bound while the throughput approaches the saturation throughput, t_sat.
Predominantly, the latency can be reduced if the network diameter d is reduced; however, the network latency and throughput still depend on the traffic patterns of real applications. Several traffic patterns [5] are devised based on communication patterns triggered by real-world applications. For instance, the transpose traffic mode is encountered in corner-turn [6] and matrix transpose applications. Choosing a network that outperforms others on most traffic patterns contributes to a better system.

Switching delays
The port-to-port latency of contemporary high-end switches ranges from tens of nanoseconds to a few microseconds. The switching delay depends on many parameters, including the cable length, cable material, optical-electrical conversion, buffer size, switching logic cycle frequency, switching memory access time, routing table size, routing calculation complexity, hop count, etc. The hardware performance can be calculated by summing up all the instruction cycles contributed by each logic element, multiplied by the time in nanoseconds per cycle. Hardware issues, such as memory access time and logic frequency, are out of the scope of the current paper.
If the latency per hop is fixed when the switching hardware is selected, the only way to reduce the overall latency of the system is to fine-tune the topology. In reality, the messages can be waiting in channel buffers in packet-based routing architectures or for the channel to be freed in wormhole routing architectures. In the events of network congestion, the hopping distance may have a lesser impact on the network latency; therefore, the design of the topology and routing algorithm should not only focus on the diameter of the network.

Network selection
Existing topologies each bear inherited shortcomings. Torus networks suffer the routing difficulty of long hopping distances and over-subscribed paths. Although fat-tree networks perform well on all permutation traffic patterns, the cost of large fat-tree networks is considerably higher, and the number of layers determines the zero-load latency (l_0) of a fat-tree network.
Most topologies have a total router number that is a product of integers, meaning the number of routers cannot be changed easily if the budget is modified. Dragonfly networks suffer from low global bandwidth, which becomes the bottleneck of global traffic. The irregularity of Slim Fly network links predestines routing intricacy when the network diameter is more than two. Slim Fly MMS requires very high-radix routers, according to its mathematical expression, when it scales up. With 80-port routers, the largest configuration it can offer is the k′ = 53, δ = −1 MMS graph, which can hold about 2450 routers and 63,700 endpoints. Once a router model (in this case, 80-port) is selected for Slim Fly MMS, the number of endpoints is almost fixed. Among the closest solutions in this range: Daryin and Korzh [7] searched for low-diameter topologies with optimal performance for the Russian supercomputer manufactured by T-Platforms. They compared a hybrid of Slim Fly and Flattened Butterfly against Dragonfly, tori, hypercube, and Flattened Butterfly. In the end, they chose Flattened Butterfly for their system, #22 on the November 2014 Top500 list [8] with Rmax 1.8 petaflops. This shows that, from the perspective of a system designer, the window for finding an appropriate system size can be very narrow. The comparison of our results with this petascale system is shown in Fig. 5 (Sect. 5).
Hwang et al. have compared the performance of Equality network against 2-tier fat-tree [1], 3-tier fat-tree [2], and 3D torus [3]. They also show that Equality can be used to design many-core computer chips [4].


Methods
In the following sections, we recapitulate the construction of an Equality network (Sect. 3.1). Section 3.2 addresses how we describe Equality networks and related prior art. Further, we detail the optimization procedures (Sect. 3.3) and the routing table construction procedures (Sect. 3.4). Section 3.5 briefs the three routing algorithms utilized in this work, and Sect. 3.6 describes the routine for cycle-accurate simulation.
The concentration p depends on what applications will run on the network and is chosen while the system is being designed. Nevertheless, finding a proper p under saturated traffic is essential. The balance of network radix K and concentration p, together with the system's networking cost and bisection bandwidth, is discussed in Sect. 4.
To estimate the performance of our designs, we utilize the open-source BookSim [9] for cycle-accurate simulation of our networks. The package we use is downloaded from its GitHub site (https://github.com/booksim) and modified locally to include our home-brewed routing procedures and algorithms. We implemented our topology, routing algorithms, and a few extra traffic models in BookSim. The results at various system scales are discussed in Sect. 5.
Symbols and notations used in the current paper are listed in Table 1.

Connection rules
Since our previous works are short conference papers, we would like to take this chance to say more about the initial conception of the Equality network topology. Upon constructing a network, one of the intuitive approaches is to link all the nodes into a ring. A ring topology has a network radix of 2. At the dawn of the current study, we looked at Hamiltonian cycles and sought a way to reduce the network's diameter. Since we focus only on Hamiltonian networks with equal radix on all router nodes, upon adding one link to a router, the same link has to be applied to all routers. To keep the routing identical for all routers, we tried many strategies for adding connections to all nodes.
To make the interconnects, every member of the routers makes links through the following rules. An Equality network has N routers, where N is an even number. The routers are sequentially numbered from zero to N − 1 , i.e., r 0 , r 1 , … , r N−1 . The routers are connected to form a ring and later to other ring members, just like chordal ring topologies.
A set of positive integers C, ranging from 2 to N − 3 and excluding any even numbers greater than N/2, is used as the candidate set for making the physical links.¹ From C, a subset of integers S, composed of S_A (a collection of odd positive integers) and S_B (a collection of even positive integers), is selected for the interconnections. The connections are made for every router r_i with r_(i+S_j) mod N if i is even, or with r_(i−S_j) mod N if i is odd, for every number S_j.

¹ It is worth noting that every even number S_j < N/2 makes the same links as the corresponding even number N − S_j; therefore, all even numbers greater than N/2 are excluded from the set.

Routing terms (from Table 1):
P⃗(r_i, r_j): shortest paths from r_i to r_j
P⃗′(r_i, r_j): sub-shortest paths from r_i to r_j
R: shortest-path routing table
L_0j: list of routers for sub-shortest paths from r_0 to r_j
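As a concrete illustration, the alternating connection rule above can be sketched in a few lines of Python (a minimal sketch; the function name `equality_links` and the convention of passing the ring hops ±1 inside the hop set are our own):

```python
def equality_links(N, S):
    """Build the inter-router links of an Equality network.

    N : even number of routers, numbered 0 .. N-1.
    S : the selected hop set (the odd hops S_A, including the ring
        hops -1 and +1, plus the even hops S_B).
    Every router r_i links to r_((i+S_j) mod N) if i is even,
    or to r_((i-S_j) mod N) if i is odd, for every S_j in S.
    """
    links = set()
    for i in range(N):
        for s in S:
            j = (i + s) % N if i % 2 == 0 else (i - s) % N
            links.add((min(i, j), max(i, j)))  # undirected link
    return links
```

For the N14K6[−1,1,3,9](4) example of Fig. 1, this yields 14 · 6 / 2 = 42 links, and every router ends up with exactly six inter-router connections.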

Syntax
The general notation of Equality involves an 'n' followed by a number denoting the total number of routers, and a 'k' followed by another number indicating the network radix. The notation 'p' can be omitted if the number of attached endpoints is not yet specified. Both uppercase and lowercase are allowed for 'n,' 'k,' and 'p' as long as they appear in sequence to describe the constraints of the target network. The notation is meant to give designers a rough idea of what configuration the network follows rather than the full specification. A pair of square brackets and a pair of parentheses enclosing comma-separated numbers give the detailed specification of an Equality network, where S_A is listed in the square brackets and S_B in the parentheses. For instance, an Equality network named N14K6[−1,1,3,9](4) is presented in Fig. 1. As described above, the 'N' and 'K' mark the number of router nodes and the network radix, respectively. Hence, the number 14 means there are 14 routers in the network, and the number 6 indicates that the network radix of the routers is six. This specification is more detailed about the hops but does not say how many endpoints are attached.

Remark 1
For the general configuration of a network, one can also express in the short-hand notation of n14k6p3 to represent all Equality networks with 14 switches, network radix 6, and 3 endpoints per switch.
An Equality network has an equal number of inter-router connections in all switches. The number of inter-router connections, K, can be evaluated by the following equation:

K = |S_A| + 2|S_B| − [N/2 ∈ S_B] (1)

A direct explanation of Eq. 1 is that each odd number adds one inter-router connection to each router. Each even number adds two inter-router connections to each router, except the diameter link of the ring (hop N/2), which, like the odd numbers, adds only one inter-router connection to each router. The diameter link of the ring can be either an odd or an even link.
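Eq. 1 can be checked mechanically; the sketch below (our own helper, with the Iverson-bracket term written as an `if`) mirrors the verbal rule: one connection per odd hop, two per even hop, minus one if the diameter hop N/2 is among the even hops:

```python
def network_radix(N, S_A, S_B):
    """Inter-router connections per router (Eq. 1): each odd hop in
    S_A adds one, each even hop in S_B adds two, except the ring's
    diameter hop N/2, which adds only one."""
    K = len(S_A) + 2 * len(S_B)
    if N // 2 in S_B:
        K -= 1  # the diameter hop forms a single link per router
    return K
```

The N14K6[−1,1,3,9](4) network gives K = 4 + 2·1 = 6, and an 1840-router S_B of four even hops gives 8 connections without hop 920 but only 7 when 920 is included, matching the example in Sect. 3.3.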
The breakthrough of Equality among chordal ring topologies [10][11][12][13][14] is the mixing of multiple even and odd links in a ring of an even number of nodes while keeping the alternating nature of the odd links. In addition, systematic routing rules are provided for the derived networks.
For instance, in 2016, Faraha et al. [13] discussed degree-six 3-modified chordal ring networks, where the total number of nodes N must be divisible by 3 and every three nodes are grouped into a class. Zabłudowski et al. [15] discussed modified double-ring network structures. The only publication we found that shows a likeness to the Equality networks is [12] (modified chordal rings CHR5_a(20; 3,7), CHR5_c(16; 3,5,7), and CHR5_d(16; 3,5,8)); however, it is not based on the same construction rules and applies only to radix-5 networks.

Network optimization
Equality topology offers a plethora of networks to be assessed and put into practice. One needs to set a goal to find the candidates.

Assignment of S_A and S_B
Once the network radix is decided, one follows Eq. 1 to confine the lengths of S_A and S_B to achieve a fixed K. For instance, suppose one would like to construct a network of 1840 routers with network radix 17. If four even numbers are selected as S_B, assuming hop 920 (i.e., N/2) is not selected, they contribute eight connections to each router. If S_B contains 920, they consume seven connections on each router. Supposing 920 is not in S_B, an additional nine numbers can then be added to S_A to obtain an Equality network of radix 17.

Optimization
We optimize Equality networks with a genetic algorithm. Ideally, an initial random seed is given for the generation of S_A and S_B from C, with the constraint of a predefined radix K. We then select the population size and other simulation parameters, such as the maximum number of generations and the mutation rate. The goal of the optimization is to minimize the product of the average distance a and the network diameter d. At the end of the optimization, a series of best results from generations of evolution are reported. If the search space being explored is large, the optimized results are not necessarily the global minimum; however, the results are usually low enough for application, and if a sufficient amount of evolution is conducted, one can get results close to the global minimum. For large systems, the empirical approach can involve optimizing only S_A while keeping S_B fixed. For instance, some of our systems are derived by handpicking S_B (and possibly part of S_A) and optimizing the remaining S_A; otherwise, the phase space would be too large to explore. The designer has to adjust S_B many times, in length and composition, and re-optimize S_A to reach a better set of answers.
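A toy version of this optimization loop, with BFS from r_0 for the distance measures and mutation of a single hop per child (the function names and the specific GA operators are our own simplifications, not the authors' implementation), might look like:

```python
import random
from collections import deque

def diameter_and_avg(N, S):
    """Diameter d and average distance a, measured by BFS from r_0
    (sufficient here because every router sees the same structure)."""
    adj = [set() for _ in range(N)]
    for i in range(N):
        for s in S:
            j = (i + s) % N if i % 2 == 0 else (i - s) % N
            adj[i].add(j)
            adj[j].add(i)
    dist = [-1] * N
    dist[0] = 0
    q = deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                q.append(v)
    if min(dist) < 0:                    # disconnected candidate
        return float("inf"), float("inf")
    return max(dist), sum(dist) / (N - 1)

def optimize(N, K, odd_pool, S_B, gens=20, pop=12, rng=random):
    """Evolve S_A with S_B fixed, minimizing the product a * d."""
    n_odd = K - 2 * len(S_B)             # odd hops needed to reach radix K
    def fitness(sa):
        d, a = diameter_and_avg(N, sa + S_B)
        return d * a
    popu = [rng.sample(odd_pool, n_odd) for _ in range(pop)]
    for _ in range(gens):
        popu.sort(key=fitness)
        keep = popu[: pop // 2]          # elitist selection
        popu = keep[:]
        while len(popu) < pop:           # mutate one hop of a survivor
            child = rng.choice(keep)[:]
            k = rng.randrange(len(child))
            child[k] = rng.choice([h for h in odd_pool if h not in child])
            popu.append(child)
    return min(popu, key=fitness)
```

For the 14-router example, `diameter_and_avg(14, [-1, 1, 3, 9, 4])` reports d = 2 and a = 20/13 ≈ 1.54.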

Routing table
Routing is an essential part of communication. In the current work, the routing procedures include a universal routing table and three routing algorithms.

Remark 2
An even node and an odd node in an Equality network see entirely identical network structures, only in reverse directions.
Starting from r_0 in an Equality network, one can define a set of paths P(r_0, r_j) to any other node r_j in the network. From P(r_0, r_j), one can also find the shortest paths P⃗(r_0, r_j) from r_0 to r_j and save them as the first routing table R of the network.
From Remark 2, one can then derive the shortest paths P⃗(r_i, r_j) between r_i and r_j by simple conversion. Since the network is symmetric, the shortest paths (in fact, any paths) between two nodes depend only on the difference of their IDs modulo the total node number N, as described in Eq. 2.
Apart from the shortest paths P⃗(r_0, r_j), we are also interested in paths that are slightly longer: the sub-shortest paths P⃗′(r_0, r_j) from r_0 to r_j.
For each target node r_j, the sub-shortest paths P⃗′(r_0, r_j) can be evaluated by checking all neighbors r_k not on the shortest paths P⃗(r_0, r_j), finding all their distances d_0k, keeping the nodes r_k with d_0k ≥ d_0j, and collecting the list of those r_k for which d_0k is minimal. The collected list L_0j of nodes r_k can then be incorporated with the shortest-path routing table R to form another routing table R′.
R′ thus contains the shortest paths and the alternative targets for sub-shortest paths.
The definition of R′ involves only knowledge of the total router number N and the interconnection configuration, without knowing the number of endpoints p per router. Therefore, the complexity of the routing table of an Equality network is O(N), instead of the O((Np)²) required for an irregular network.
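Under one possible reading of the procedure above (treating the candidate nodes r_k as neighbors of the target r_j; the helper name is ours), the L_0j lists can be collected as follows:

```python
from collections import deque

def sub_shortest_lists(N, S):
    """Distances d_0j from r_0 plus, per target j, the list L_0j of
    neighbors r_k of r_j that are not closer to r_0 than r_j itself
    and whose distance d_0k is minimal -- candidates for sub-shortest
    paths, to be merged into the routing table R'."""
    adj = [set() for _ in range(N)]
    for i in range(N):
        for s in S:
            j = (i + s) % N if i % 2 == 0 else (i - s) % N
            adj[i].add(j)
            adj[j].add(i)
    dist = [-1] * N                      # BFS from r_0 builds R
    dist[0] = 0
    q = deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                q.append(v)
    L = {}
    for j in range(1, N):
        cand = [k for k in adj[j] if dist[k] >= dist[j]]
        if cand:
            m = min(dist[k] for k in cand)
            L[j] = sorted(k for k in cand if dist[k] == m)
    return dist, L
```

By symmetry (Remark 2), running this once from r_0 is enough; tables for every other source follow by index conversion, which is what keeps the table size at O(N).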

Routing algorithms
The routing algorithms, including adaptive minimal (abbreviated amin), nonminimal UGAL (global adaptive routing using local information, abbreviated ugal), and bottleneck-UGAL (abbreviated bgal), are provided in this work to assess the performance of the Equality networks under consideration.

amin
The adaptive minimal routing algorithm routes packets through the shortest paths, of which there may be one or more.
From the universal routing table R′, the alternative intermediate router IDs mentioned above as L_0j are used by the ugal and bgal routing algorithms.

ugal
The ugal routing algorithm, originally defined in [16], is implemented and simulated for comparison. In the current work, a packet is routed such that all provided shortest and sub-shortest paths are considered and weighted by the product of each path's queue length and hop distance.

bgal
The bottleneck-UGAL routing is designed based on the ugal algorithm; however, only the bottleneck pairs among all pair-wise relationships are allowed to utilize the sub-shortest paths. It differs from ugal in quenching the available sub-shortest paths when the number of shortest paths is higher than a threshold value of 2 (fixed in this work, but adjustable). This means that if the number of shortest paths exceeds the threshold value, only the shortest paths are included in the routing paths.
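The quenching rule can be condensed into a single selection function (a sketch with our own data layout: each candidate path is a (first-hop port, hop count) pair and `queue_len` maps ports to their occupancy):

```python
def bgal_choose(shortest, sub_shortest, queue_len, threshold=2):
    """bgal next-hop selection: when the number of shortest paths
    exceeds the threshold, the sub-shortest paths are quenched;
    otherwise all candidates compete.  Each candidate is weighted by
    queue length x hop distance, and the lightest one wins."""
    candidates = list(shortest)
    if len(shortest) <= threshold:
        candidates += list(sub_shortest)
    return min(candidates, key=lambda c: queue_len[c[0]] * c[1])[0]
```

With a single congested shortest path the packet may detour onto a sub-shortest path, whereas three or more shortest paths suppress the detours entirely.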


Traffic models
Ten traffic models, including uniform random, asymmetric, random permutation, neighbor, tornado, bit rotation [5], bit complement, bit reverse, transpose, and bit shuffle, are implemented in BookSim to evaluate the Equality networks. Of the ten traffic models, only bit rotation is implemented locally; it applies the relation d_i = s_(i+1) mod b to set a target endpoint for each source, where d_i and s_i are the i-th address bits of the destination and source and b is the number of address bits.
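Under this relation, bit rotation rotates the b-bit endpoint address right by one position. A sketch of such a pattern (mirroring conceptually what our local BookSim addition does; b is assumed to be log2 of the endpoint count):

```python
def bit_rotation(src, b):
    """Destination endpoint whose bit i equals source bit (i+1) mod b,
    i.e., a rotate-right of the b-bit source address."""
    dst = 0
    for i in range(b):
        if (src >> ((i + 1) % b)) & 1:
            dst |= 1 << i
    return dst
```

For b = 4, source 0b0001 maps to 0b1000 and source 0b0010 maps to 0b0001; the mapping is a permutation, so every endpoint receives exactly one flow.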

Deadlock freedom
Deadlocks happen when packets wait on one another's buffers in a cyclic dependency. Deadlock freedom is achieved as described in [17], which guarantees it by either limiting the routing to ensure cycle freedom in the channel dependency graph [18] or utilizing virtual channels (VCs) to break such cycles into different sets of buffers [19].
The routing strategy used herein is similar to that introduced in [20, 21]. Consider a packet sent from router r_i to r_j; for minimal routing, we use a number of virtual channels equal to the diameter. If r_i and r_j are directly connected (i.e., one hop), the packet is routed using VC0. If the path between r_i and r_j consists of two hops, VC0 is used for the first hop and VC1 for the second. Consider, for example, a network of diameter two: only one turn can be taken on the path, and therefore the maximum number of required VCs is two [22]. That way, a flit only ever depends on the virtual channel one above its current virtual channel to make progress, so no cyclic dependency can occur.
For adaptive routing, on the other hand, the number of virtual channels equals the maximum hop count of the paths in R′. For d = 2 systems, the maximum hop count in R′ is 4, and for d ≥ 3 systems, it is d + 1. BookSim reports an error when the number of VCs is insufficient. Generalizing the algorithm above, we use VCk (0 ≤ k < n) on hop k of an n-hop path between r_i and r_j.
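The VC budget can be written down directly (a restatement of the rules above as code; the function names are ours):

```python
def vcs_required(d, adaptive):
    """VCs needed: the diameter d for minimal routing; for adaptive
    routing, the maximum hop count in R' -- 4 when d = 2, and d + 1
    when d >= 3."""
    if not adaptive:
        return d
    return 4 if d == 2 else d + 1

def vc_for_hop(k):
    """Hop k of an n-hop path travels on VCk, so a flit only ever
    waits for the next-higher VC and no cyclic buffer dependency
    can form."""
    return k
```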
The general parameters in all simulations are the same as described in [17]: single-flit packets are used to avoid the flow-control issues described in [17] and [22], the virtual channel buffer size is set to 64 flit entries, and the number of virtual channels is set as described above, depending on the network diameter.
We then collected latency and throughput benchmarks under various traffic patterns and injection rates until the values converged.

The balance of K, p and a
The most intuitive interpretation of K/p is the ratio of ports used for inter-router connections to the number of ports used for endpoints on each router. It is easy to see that if K/p is lower, the interconnect's price would be lower for a fixed number of endpoints.
In general, if budget is the principal consideration, Equality networks can be restricted to the range p·(d + 1) ≥ P > p·(a + 1), or equivalently pd ≥ K > pa (where P = K + p is the total port count per router). The K/p value can be relaxed if performance is the principal concern. The design can be fine-tuned depending on what applications will run on the final system. Equality is probably the only topology that does not need to sacrifice any port to adjust this ratio: if the number of endpoints is reduced (lower p), the freed ports can be used for larger K values.
On fat-tree systems, the value of K/pd is always one (and a is very close to d). For instance, a 3-tier fat-tree network uses twice the number of inter-router links as links to endpoints, making the number of inter-router ports four times the number of endpoint ports on average per router (each inter-router link consumes two ports). Coincidentally, the diameter (maximum inter-router hop count) of a 3-layer fat-tree network is four. The same idea applies to 2-layer and 4-layer fat-trees.
The power model behaves similarly to the networking cost, as the ratio of SerDes for the inter-router links to SerDes for the endpoint links is the same as K/p.
We calculated the total networking cost of all Equality systems with a model similar to that described in [17], where each cabinet is assumed to be 1 m × 1 m with no aisle in the cluster. Each endpoint and router is assumed to be 1U in size. The overhead cable pathways are 1 m above the cabinets. Endpoints and routers are allocated sequentially in a cabinet until it is filled to the standard 42U cabinet size. All routers are situated at the top of the cabinet. Depending on the remaining cabinet space, endpoints can be allocated in the same cabinet as their router or in an adjacent one.
Manhattan distance is calculated for each cable, including the distance from each module to the overhead cable pathway, the space across the aisle, and an additional 0.5 m horizontal distance from the side of the cabinet to the port. Under this model, a cable to the adjacent module is 104.45 cm, i.e., 2 × 0.5 m + 44.45 mm. Cables longer than 8 m are fiber; otherwise copper. The results are summarized in Sect. 5.2.
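A minimal sketch of this cable model (the coordinate convention and function names are our own reading of the description; positions are (cabinet row, cabinet column, unit slot) with a 1 m cabinet pitch and 44.45 mm per U):

```python
U = 0.04445          # one rack unit in meters
CABINET_UNITS = 42   # standard cabinet height

def cable_length_m(pos_a, pos_b):
    """Manhattan cable length between two modules.  Same-cabinet
    cables run 0.5 m out to the port on each side plus the 1U pitch
    between slots; inter-cabinet cables additionally climb to the
    overhead pathway 1 m above the cabinet and back down."""
    (ra, ca, ua), (rb, cb, ub) = pos_a, pos_b
    horiz = abs(ra - rb) + abs(ca - cb)          # cabinet grid, 1 m pitch
    if horiz == 0:
        return 0.5 + 0.5 + abs(ua - ub) * U      # same cabinet
    up = (CABINET_UNITS - ua) * U + 1.0
    down = (CABINET_UNITS - ub) * U + 1.0
    return 0.5 + up + horiz + down + 0.5

def media(length_m):
    """Cables longer than 8 m are fiber, otherwise copper."""
    return "fiber" if length_m > 8.0 else "copper"
```

Adjacent modules in the same cabinet reproduce the ≈104.45 cm figure quoted above.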


Cost per node
From the cost models presented by Besta et al. (Fig. 11(c) of [17]) and Kim et al. (Fig. 19 of [23]), it is evident that all topologies follow an almost steady cost/endpoint ratio. While copper cable reduces the networking cost in small systems, the effect is insignificant in larger systems. In practice, this ratio still depends on the cable lengths; i.e., larger computing nodes consume more space, leading to a higher proportion of inter-router connection cost. For any two non-torus networks with computing nodes of a given form factor (for instance, 0.5U or 2U per endpoint), regardless of the topology used, the copper cable prices of the two networks should be equal if the numbers of routers and servers are the same. The analysis is reported in Sect. 5.3.

Bisection ratio
By definition, the bisection bandwidth, B, is evaluated by cutting a minimal number of cables to separate the system into two parts.
Instead of discussing the bisection bandwidth directly, we introduce two variables, the network bisection ratio (b_r) and the topology bisection ratio (B_r), to take the networking cost into account. The network bisection ratio is the bisection bandwidth divided by the total network bandwidth (including all links to endpoints), i.e., b_r = B/Φ, whereas the topology bisection ratio B_r is the bisection bandwidth divided by the total inter-router bandwidth (excluding all links to endpoints). The total network bandwidth Φ of a direct network is the total number of cables in the system multiplied by the channel bandwidth, i.e., Φ = N(K/2 + p). For a fully connected 12-port 10 Gb/s Ethernet switch, the bisection bandwidth is 60 Gb/s, which is half the total cable number multiplied by the channel bandwidth; it is easy to see that the bisection ratio of this fully connected switch is one half, i.e., b_r = 1/2. For a two-tier fat-tree, this value becomes 1/4, as the number of cables in the network doubles, whereas for a three-tier fat-tree, b_r is 1/6. The lower the bisection ratio, the higher the networking hardware cost, because a higher fraction of the networking cost goes to the inter-router links.
For symmetric chordal ring networks like the Equality networks, the bisection bandwidth equals the minimum number of links cut when one splits the network into two semicircles. During the generation of new networks, the topology bisection ratio B_r is evaluated by counting the minimal number of links cut when splitting the network into two equivalent halves. The relation between B_r and the network bisection ratio b_r, which follows directly from their definitions, is b_r = B_r · (K/2)/(K/2 + p) (Eq. 3), so b_r/B_r depends only on K/p. The data are collected and discussed in Sect. 5.4.
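The semicircle counting can be sketched directly (our own helpers; the cut scans every rotation of the two contiguous halves, and the ratios follow the definitions above):

```python
def equality_edges(N, S):
    """Undirected inter-router links under the alternating hop rule."""
    return {(min(i, j), max(i, j))
            for i in range(N) for s in S
            for j in [(i + s) % N if i % 2 == 0 else (i - s) % N]}

def semicircle_cut(N, S):
    """Minimum links cut when splitting the ring into two contiguous
    halves (semicircles), scanning all N rotations of the cut."""
    edges = equality_edges(N, S)
    return min(
        sum((a in half) != (b in half) for a, b in edges)
        for r in range(N)
        for half in [{(r + t) % N for t in range(N // 2)}]
    )

def bisection_ratios(N, S, p):
    """(b_r, B_r): bisection links over all links (including the p
    endpoint links per router) and over inter-router links only."""
    edges = equality_edges(N, S)
    B = semicircle_cut(N, S)
    inter = len(edges)                 # = N * K / 2
    return B / (inter + N * p), B / inter
```

By construction, the two returned ratios satisfy the b_r = B_r · (K/2)/(K/2 + p) relation for any hop set.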

Results
We have selected a few publications as targets for comparison purposes. To make a sensible comparison, most networks utilize switches with port counts available on the market.
Table 2 Systems whose latency or throughput values are reported in this work.
† The effective K/p values for fat-tree systems are provided for reference purposes. For fat-tree systems, the number of wires for inter-router connections is n(L − 1), L being the number of layers; the total port number for inter-router connections is 2n(L − 1).
‡ Not shown in Fig. 2a and b to prevent overlapping with the Slim Fly networks.
⋄ The a value here is calculated based on a full-span fat-tree.
⊲ T3D is a 22-ary 3-cube, T5D1 is a 6-ary 5-cube, and T5D2 is a 7-ary 5-cube. The number of virtual channels used is 4, and the virtual channel buffer size is 8.
⊳ Citation for SF14, DF14, and 3FT44: [17]. Citation for DF08: [23]. Citation for SF16: [27].

Fig. 2 a Various sizes of systems under consideration; the palette on the right-hand side denotes the network diameter of each system. b Zoom into the range where the router number is less than 10,000. c Latency against K/pa under injection rate 0.9 flit/cycle. The fitting function f(x) = ax + b, where a = −20.3211 (asymptotic standard error ±2.736, 13.46%) and b = 53.255 (asymptotic standard error ±3.577, 6.717%), is fitted against all Equality networks in Table 2. The Slim Fly results are shown as inverted triangles, whereas the fat-tree results are shown as diamonds. In all sub-figures, the number next to a symbol identifies the system within its router-radix category; for instance, a triangle with a 6 next to it represents the sixth system using 80-port routers, so its system ID is E806, as listed in Table 2.

Targeted systems
Equality can be used to achieve reasonably good ratios of the Moore bound. We include a column in Table 2 for the ratio of the network size to the Moore bound. We obtained networks reaching 39% of the diameter-2 Moore bound, 9.66% of the diameter-3 Moore bound, 1.64% of the diameter-4 Moore bound, and 0.18% of the diameter-5 Moore bound. We simulated networks at scales from small up to 1,024,000 endpoints, with various router counts and radices. The listed systems, named according to the radix of their routers (e.g., E361 is the first system using 36-port routers), are selected networks for each configuration. The listed networks are included in Appendix 1.
We have chosen carefully by optimizing the configuration as discussed in Sect. 3.3 and hand-picked one network with better performance in all traffic modes for each configuration. Some of the networks in the table are designed for exascale systems. The values of the torus, 3-tier fat-tree, Dragonfly, and Slim Fly networks are there for reference.
A Slim Fly network denoted SF14 is compared with three diameter-3 Equality networks (E441, E442, and E443) in the table to show that, with a slight relaxation in router number, Equality allows better throughput using identical hardware specifications. For Slim Fly in this comparison, the latency under injection rate 0.9 flit/cycle is over 50 cycles, whereas all three Equality systems are under 50 cycles. It can also be seen that the maximum sizes 3-tier fat-trees can achieve are far smaller than Equality's.

The balance of K, p and a
Figure 2c shows that the latency at packet injection rate 0.9 flit/cycle is negatively correlated with K/pa under the amin routing algorithm. The value K/pa stands for the switch's balance of upward and downward flow ratios. The higher the a value, the more frequently a message has to consult the switches. By adjusting the weight of K/p, one can counterbalance the weight of a. The larger systems follow the same trend as the fitted line; only their system sizes are much larger. A higher number of endpoints slightly increases the average networking cost per node, but not as significantly as the ratio K/p does. The cluster model can be adjusted to include hot/cold aisles and different rack sizes, but that is beyond the scope of the current study.
Figure 4 shows the distribution of the Equality networks compared with the 3D-torus, 5D-torus, Slim Fly, Dragonfly, and fat-tree networks. The location of a network in this graph depends on its K/p and b_r values. Since b_r/B_r is a function of K/p, the Slim Fly network sits close to the line K/p = 2, where two of the Equality networks (E369 and E487) also fall (near FT2L). The distribution of S_A and S_B defines B_r in Equality networks.
Fig. 5 The columns in the gray box contain the three routing algorithms amin, ugal, and bgal running on the target Equality E361 system with the configuration n2048k28p8 using 36-port switches. The data in the right panel contain the best results of the respective networks from [7].
Fig. 6 Comparison of throughput and latency of the Equality E481 system with the configuration n4800k38p10 using 48-port switches in 10 traffic modes. In total, there are 48,000 endpoints in this system. The transpose traffic uses 16,384 endpoints for calculation, whereas the other bit permutation traffics use 32,768 endpoints for calculation.
The fat-tree networks have good b_r in two tiers, but it degrades as the number of tiers increases. The b_r values of tori also explain why uniform traffic is the nightmare of torus networks.

Bisection ratio
The introduction of the bisection ratios b_r and B_r gives a new viewpoint on the bisection bandwidth of a network. The question becomes, "Within the constraint of the networking budget, what percentage of the budget is contributed to the bisection bandwidth?" instead of "How large is the bisection bandwidth of the network?" The point is not to maximize the bisection ratio, as one needs a proper balance between communicating with nearby and remote routers. Our experiments found that B_r around 0.4–0.5 and b_r around 0.3 are suitable for most traffic models, where global and local traffic ratios are in balance.
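As a toy illustration, such a ratio can be computed by brute force on a small graph. Here b_r is taken simply as the fraction of all links crossing the minimum balanced cut; this simplified definition, the helper name, and the 8-node ring example are our own assumptions for illustration, not the paper's exact formulation.

```python
from itertools import combinations

def bisection_ratio(nodes, edges):
    """Brute-force minimum balanced bisection of a small graph.

    Returns the fraction of all links that cross the minimum balanced
    cut -- a simplified stand-in for the paper's b_r."""
    nodes = list(nodes)
    n = len(nodes)
    best_cut = None
    # Enumerate every way to place half the nodes on one side.
    for half in combinations(nodes, n // 2):
        half = set(half)
        cut = sum(1 for u, v in edges if (u in half) != (v in half))
        best_cut = cut if best_cut is None else min(best_cut, cut)
    return best_cut / len(edges)

# 8-node ring: a balanced bisection cuts exactly 2 of its 8 links.
ring = [(i, (i + 1) % 8) for i in range(8)]
print(bisection_ratio(range(8), ring))  # → 0.25
```

The exponential enumeration is only feasible for tiny graphs; it merely illustrates why low b_r values for tori (a ring is a 1D torus) make uniform traffic so punishing.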

Individual scenarios
The 16,384-endpoint scenario is prepared for comparison with [7]. Figure 5 shows the throughput performance of the E361 network alongside the best results reported in [7]. For the Equality network, three routing algorithms (amin, ugal, and bgal) are shown inside the gray box on the left-hand side. The other seven blocks, "FlatFly, SFxFF-1, SFxFF-2, Dragonfly, Hypercube, 4D-torus, and 3D-torus," are the best values taken directly from that paper. In almost all traffic modes, the E361 network, using the same number of switches of the same radix, outperforms all networks presented in [7] under all three routing algorithms.
It also shows that the ugal algorithm performs akin to FlatFly in the bit complement traffic model, where all other topologies fail. Although fat-tree performs well in permutation traffic, we do not include it in this comparison because the maximum-size 3-layer fat-tree achievable with 36-port routers has only 11,664 endpoints.
Fig. 7 Comparison of throughput and latency of the Equality E807 system with the configuration n20000k64p16 using 80-port switches in 10 traffic modes. In total, there are 320,000 endpoints in this system. All bit permutation traffics use 262,144 endpoints for calculation
The 48,000-endpoint scenario is prepared for the situation where the InfiniBand LID limit is a major concern. Detailed simulations on ten traffic models are demonstrated in graphs containing both latency and throughput results.
All the injection processes are simulated at various injection rates to cover the spectrum of latency values under 50 cycles. We focus on latencies lower than 50 cycles, as they reflect the usable range in which the system is reliable under the given injection process. Figure 6 presents all ten traffic processes.
The 1E scenario features systems for future applications with multi-exaflops peak performance, depending on the computing power of each endpoint. Figure 7 shows the ten traffic models. The E807 network tested here has the configuration n20000k64p16. The 48,000-endpoint and 1E networks, which have close K/pa values, are included to show the scalability of Equality networks: despite the significant increase in system size, the performance remains consistent across the two networks.
The 10^6-endpoint scenario is a single-point simulation (at a 0.9 flit/cycle injection rate) reaching a million endpoints. We show only the throughput and latency of the amin routing algorithm for the E806 system with the configuration n64000k64p16, which has 1,024,000 endpoints, in Table 2. Running this million-endpoint system, the BookSim package took a maximum of 272 GB of memory on a single core under bitcomp traffic. As the only large-memory resource at our site is an IBM box, the calculation took around 20 days to complete. At that point, the BookSim package fails with an integer error, presumably due to the range limitation of the integer-typed _cur_pid variable, but all simulations converged reasonably before the job ended.
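This failure mode is consistent with a signed 32-bit packet-ID counter wrapping around. A back-of-the-envelope sketch shows how quickly a million-endpoint simulation would exhaust such a counter; the 32-bit width of _cur_pid and the single-flit-packet assumption are our own estimates, not confirmed from the BookSim source.

```python
# Rough estimate of when a 32-bit signed packet-ID counter overflows in
# a cycle-accurate simulation. Assumptions (ours): _cur_pid is a 32-bit
# signed int incremented once per injected packet, one flit per packet.
INT32_MAX = 2**31 - 1
endpoints = 1_024_000
injection_rate = 0.9  # flits/cycle per endpoint

packets_per_cycle = int(endpoints * injection_rate)
cycles_until_overflow = INT32_MAX // packets_per_cycle
print(cycles_until_overflow)  # → 2330
```

At this scale the counter survives only a few thousand simulated cycles, which explains why the error appears only in the million-endpoint run and not in the smaller scenarios.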
All of the above systems show the flexibility and performance of the Equality networks.

Conclusion and future work
This paper shows that the low-memory-footprint routing logic of Equality networks is naturally suited to routing algorithms using minimal adaptive and sub-shortest paths. With the topology setting and routing logic implemented in BookSim, we demonstrate simulation results from small to large scales. We also perform the first million-node cycle-accurate calculations based on the BookSim package.
Many have advocated [17, 24–26] the use of low-diameter networks to realize enormous network sizes with high-radix routers. Table 2 shows that excellent performance persists in Equality networks with reasonably low diameters and ordinary router radix under high injection rates. In contrast, extremely low-diameter topologies usually yield networks that are inflexible or require very high-radix routers. For the 1M-endpoint network, the current work provides a solution of 64,000 routers (80-port, shown in Table 2) to replace the network solution of 53,138 routers, each with 264 ports, reported in [27]. Moreover, the resulting network performance remains competitive. We have shown that Equality networks have similar or better performance compared with many other topologies, including tori, Dragonfly, Slim Fly, Flat Fly, and fat-tree.
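The hardware saving can be quantified with a quick port count; the router counts come from the text and [27], while the total-port arithmetic and the cost interpretation are ours.

```python
# Total switch ports needed for the two ~1M-endpoint designs.
equality_ports = 64_000 * 80   # 64,000 routers, 80 ports each
reported_ports = 53_138 * 264  # 53,138 routers, 264 ports each [27]

ratio = reported_ports / equality_ports
print(equality_ports, reported_ports, round(ratio, 2))
# The 264-port design needs roughly 2.74x more switch ports in total,
# and 264-port routers are far from commodity hardware.
```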
Equality does not handle permutation traffic (e.g., bit complement and bit shuffle) quite as well as fat-tree. Still, it generally trades zero-load latency for throughput while keeping moderate-to-high usability.
Equality networks can be used with low- or high-radix routers [28] available on the market. The results of this paper also show excellent performance of Equality networks under uniform random traffic even with an adaptive minimal routing algorithm, which suggests good performance using commodity hardware for general-purpose clusters. We have also routed a mini-sized Equality network on 12 HPE 5945 48SFP28 8QSFP28 switches for real applications. The routing of this small cluster is accomplished with multi-protocol label switching (MPLS), where all shortest paths and some sub-shortest paths are assigned between every pair of routers. On InfiniBand, we expect the network to work well with the Nue routing introduced by Domke et al. [29].
The adjustability and outstanding performance of Equality networks give the industry a new network topology option for HPC systems. The system designers will have more flexibility in picking commodity hardware. It is also an opportunity for data centers to reorganize for better efficiency.
In the future, we plan to run simulations using ROSS-CODES and TraceR, as described in [30], to reflect the effects of real application traffic, especially subtler effects that arise when dealing with many jobs on a cluster [31].
To summarize, the current work reiterates the construction of networks based on a novel class of network topology, allowing routing with simple logic to achieve strong scalability from small to large systems. The work shows the performance of networks by comparing the cycle-accurate BookSim benchmarks against many previous works. The results show significant benefits of utilizing this new class of network topology for future high-performance computing applications. Table 3 lists the configurations of the Equality networks simulated in this work.