Abstract
In modern computing architectures, graph theory plays a central role because core counts keep rising, and finding better ways to connect the cores remains indispensable. A novel chordal-ring interconnect topology system, Equality, is revisited in this paper and compared with several previous works. This paper details the procedures for constructing the Equality interconnects, their special routing procedures, the strategies for selecting a configuration, and the evaluation of their performance using the open-source cycle-accurate BookSim package. Four scenarios representing small- to large-scale computing facilities are presented to assess the network performance. This work shows that in 16,384-endpoint systems, the Equality network turns out to be the most efficient system. The results also show the steady scalability of Equality networks extending to 48–320K, and a million endpoints. Equality networks are adjustable to fit with commodity hardware and resilient under ten common traffic models. It is suggested that Equality network topology can be used in constructing efficient multi-exaflops supercomputers and data centers.
1 Introduction
High-performance computing (HPC) is a type of computing that uses high-end computing components to cooperatively address large-scale tasks that cannot be solved easily by ordinary computers. The computing components are connected by HPC networks to achieve better efficiency.
An HPC network differs from other networks in that it often seeks to synchronize communication and computation so that communication interrupts computation as little as possible, which increases efficiency. An HPC network also tends to use homogeneous computing hardware, such as the same model of switches (with an equal number of ports), CPUs, and accelerators across the entire implementation. Homogeneous products in a system ensure lower prices for each component due to mass production and more straightforward restoration by prompt replacement when some parts go wrong.
Hwang et al. have shown the potential of the Equality network compared with a few popular HPC network topologies [1,2,3,4] such as 2-tier fat-tree, 3-tier fat-tree, 3D torus, and 5D torus. In this work, we further analyze the performance of Equality networks at different scales to compare with Slim Fly, Dragonfly, and two popular network topologies, fat-tree and tori. We also extend the focus to applying the Equality networks to enable machines capable of reaching multi-exaflops based on current hardware craftsmanship.
The main contributions of the current work that are different from previous works include the following:

The development and implementation of the systematic routing tables for Equality networks,

The modified routing algorithm bottleneck-UGAL to refrain from oversubscribed paths,

The introduction and explanation of a new measure called bisection ratio in addition to bisection bandwidth,

The analysis of the resulting network properties (diameter, average distance, latency, and throughput) of various scales of the Equality networks and the comparison to other existing publications,

The strategies of finding a suitable configuration for a future HPC system utilizing Equality network topology, and

The largest cycle-accurate simulation ever calculated by BookSim (a 1 M-endpoint system).
2 Network architecture
Network topologies are often designed in advance to fit specific workloads. To judge the quality of a network and whether it suits the target application workloads, one can inspect the performance measures of the network and additionally perform simulations on it. The standard network measures used in this paper include the network diameter d and average distance a. The standard communication measures are the message latency and the network’s overall throughput under different traffic patterns and injection intensities.
A well-balanced topology should have a reasonable network diameter and be accompanied by tailored routing algorithms to reduce the latency and increase the throughput. Nevertheless, for any application, a given network has an effective diameter, \(d_{\text {eff}}\), if all communication patterns of the specific application use no more than \(d_{\text {eff}}\) hops in the network regardless of the actual network diameter. The same idea goes for the average hop count.
Under low injection rates, when a packet never contends with other packets for resources, the latency is called ‘zero-load latency,’ \(l_0\), which is the sum of the latency contributions when no queuing or blocking is involved. Under high injection rates, when the network is saturated under the specific load, traffic pattern, and routing algorithm, the latency rises sharply and the throughput approaches the saturation throughput, \(t_\text {sat}\).
Predominantly, the latency can be reduced if the network diameter d is reduced; however, the network latency and throughput still depend on the traffic patterns of real applications. Several traffic patterns [5] are devised based on communication patterns triggered by real-world applications. For instance, the transpose traffic mode is encountered in corner-turn [6] and matrix transpose applications. Choosing a network that outperforms others on most traffic patterns contributes to a better system.
2.1 Switching delays
The port-to-port latency of contemporary high-end switches ranges from tens of nanoseconds to a few \(\mu\)s. The switching delay depends on many parameters, including the cable length, cable material, optical-electrical conversion, buffer size, switching logic cycle frequency, switching memory access time, routing table size, routing calculation complexity, hop count, etc. The hardware performance can be calculated by summing up all the instruction cycles contributed by each logic element, multiplied by the time in nanoseconds per cycle. Hardware issues, such as memory access time and logic frequency, are out of the scope of the current paper.
If the latency per hop is fixed once the switching hardware is selected, the only way to reduce the overall latency of the system is to fine-tune the topology. In reality, the messages can be waiting in channel buffers in packet-based routing architectures or for the channel to be freed in wormhole routing architectures. In the event of network congestion, the hopping distance may have a lesser impact on the network latency; therefore, the design of the topology and routing algorithm should not focus only on the diameter of the network.
2.2 Network selection
The existing topologies have many inherent shortcomings. The torus networks bear the routing difficulty of long hopping distances and oversubscribed paths. Although fat-tree networks perform well on all permutation traffic patterns, the cost for large fat-tree networks is considerably higher, and the number of layers determines the zero-load latency (\(l_0\)) of a fat-tree network.
Most topologies have a total router number that is a product of integers, meaning the number of routers cannot be changed easily if the budget is modified. Dragonfly networks suffer from low global bandwidth, which becomes the bottleneck of global traffic. The irregularity of Slim Fly network links complicates routing when the network diameter is more than two. Slim Fly MMS requires very high radix routers according to its mathematical expression when it scales up. According to its expression, with 80-port routers, the largest configuration it can offer is \(k'=53\), \(\delta =-1\) MMS, which can hold about 2450 routers and 63,700 endpoints. Once a router model (in this case, 80-port) is selected for Slim Fly MMS, the number of endpoints is almost fixed. Other closest solutions in this range are:

\(\delta =0\), holding 2048 routers and 49,152 endpoints.

\(\delta =1\), holding 2402 routers and 52,272 endpoints.
and one can see these solutions differ significantly from the first solution.
Daryin and Korzh [7] looked for low-diameter topologies with optimal performance for the Russian supercomputer manufactured by T-Platforms. They compared a hybrid of Slim Fly and Flattened Butterfly with Dragonfly, tori, hypercube, and Flattened Butterfly. In the end, they chose Flattened Butterfly for their system, ranked #22 in the November 2014 Top500 list [8] with an Rmax of 1.8 petaflops. This shows that from the perspective of a system designer, the window for finding an appropriate system size can be very narrow. The comparison of our results with this petascale system is shown in Fig. 5 (Sect. 5).
Hwang et al. have compared the performance of the Equality network against 2-tier fat-tree [1], 3-tier fat-tree [2], and 3D torus [3]. They also show that Equality can be used to design many-core computer chips [4].
3 Methods
In the following sections, we recapitulate the construction of an Equality network (Sect. 3.1). Section 3.2 addresses how we describe Equality networks and some related prior art. Further, we detail the optimization procedures (Sect. 3.3) and the routing table construction procedures (Sect. 3.4). Section 3.5 briefly describes the three routing algorithms utilized in this work, and Sect. 3.6 describes the routine for cycle-accurate simulation.
The concentration p depends on what application will run on the network and is chosen while the system is being designed. However, finding a proper p under saturated traffic is essential. The balance of network radix K and concentration p, together with the system’s networking cost and bisection bandwidth, is discussed in Sect. 4.
To estimate the performance of our designs, we utilize the open-source BookSim [9] for cycle-accurate simulation of our networks. The package we use is downloaded from its GitHub site (https://github.com/booksim) and later modified locally to include our home-brewed routing procedures and algorithms. We implemented our topology, routing algorithms, and a few extra traffic models in BookSim. The results at various system scales are discussed in Sect. 5.
Symbols and notations used in the current paper are listed in Table 1.
3.1 Connection rules
Since our previous works are short conference papers, we would like to have this chance to address more about the initial conception of Equality network topology.
Upon constructing a network, one of the intuitive approaches is to link all the nodes into a ring. A ring topology has a network radix of 2. At the dawn of the current study, we looked at Hamiltonian cycles and sought to find a way to reduce the network’s diameter. Since we only focus on any Hamiltonian networks that have equal radix on all router nodes, upon adding one link on a router, the same link has to be applied on all routers. To keep the routing identical for all routers, we tried many strategies for adding connections on all nodes.
To make the interconnects, every member of the routers makes links through the following rules. An Equality network has N routers, where N is an even number. The routers are sequentially numbered from zero to \(N-1\), i.e., \(r_0, r_1, \ldots , r_{N-1}\). The routers are connected to form a ring and later to other ring members, just like chordal ring topologies.
A set of positive integers, \({ C }\), starting from 2 to \(N-3\), excluding any even numbers greater than N/2, are used as the candidates \({ C }\) for making the physical links.^{Footnote 1} From \({ C }\), a subset of integers, \({ S }\), composed of \(S_\text {A}\) (a collection of odd positive integers) and \(S_\text {B}\) (a collection of even positive integers) is selected for the interconnections. The connections are made for every router \(r_i\) with \(r_{(i + S_j) \bmod N}\) if i is even, or with \(r_{(i - S_j) \bmod N}\) if i is odd, for every number \(S_j\).
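The connection rule can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are ours, the hop set is the N14K6[-1,1,3,9](4) example of Sect. 3.2, and treating the hops -1 and 1 as the ring links is our reading of that notation. The sketch builds the adjacency sets and measures the diameter d and average distance a by breadth-first search.

```python
from collections import deque

def equality_adjacency(n, hops):
    """Adjacency sets of an Equality network with n routers (n even) and hop set S."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        sign = 1 if i % 2 == 0 else -1      # even routers add S_j, odd routers subtract it
        for s in hops:
            j = (i + sign * s) % n
            adj[i].add(j)
            adj[j].add(i)                   # physical links are bidirectional
    return adj

def distances_from(adj, src):
    """Hop distance from src to every reachable router (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(adj):
    """Diameter d and average distance a, assuming the network is connected."""
    n = len(adj)
    diameter, total = 0, 0
    for src in range(n):
        dist = distances_from(adj, src)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    return diameter, total / (n * (n - 1))

# Example hop set: S_A = {-1, 1, 3, 9}, S_B = {4}, N = 14
adj = equality_adjacency(14, [-1, 1, 3, 9, 4])
print(sorted(adj[0]))                 # neighbours of r_0
print({len(nb) for nb in adj})        # set of router degrees (one value if the radix is uniform)
print(diameter_and_average(adj))
```

Every router ends up with the same number of neighbours, which is the property that gives the topology its name.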
3.2 Syntax
The general notation of Equality involves an ‘n’ followed by the total number of routers and a ‘k’ followed by the network radix; a ‘p’ followed by the number of endpoints per router may be appended. The ‘p’ part can be omitted if the number of attached endpoints is not yet specified. Both uppercase and lowercase are allowed for ‘n,’ ‘k,’ and ‘p’ as long as they appear in sequence to describe the constraints of the target network. The notation gives designers a rough idea of the configuration the network follows rather than the full specification.
A pair of square brackets and a pair of parentheses enclosing comma-separated numbers give the detailed specification of an Equality network, where \(S_\text {A}\) is listed in the square brackets and \(S_\text {B}\) in the parentheses. For instance, an Equality network named N14K6[\(-1\),1,3,9](4) is presented in Fig. 1. As described above, the ‘N’ and ‘K’ mark the number of router nodes and network radix constraints, respectively. Hence, the number 14 means there are 14 routers in the network, and the number 6 indicates that the network radix of the routers is six. This specification details the hops but does not say how many endpoints are attached.
Remark 1
For the general configuration of a network, one can also express in the shorthand notation of n14k6p3 to represent all Equality networks with 14 switches, network radix 6, and 3 endpoints per switch.
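An Equality network has an equal number of inter-router connections in all switches. The number of inter-router connections, K, can be evaluated by the following equation:
\[K = |S_\text {A}| + 2\,|S_\text {B}| - \epsilon , \qquad \epsilon = 1 \ \text {if} \ N/2 \in S_\text {B}, \ \text {otherwise} \ \epsilon = 0 \qquad (1)\]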
A direct explanation of Eq. 1 is that each odd number adds one interrouter connection for each router. Each even number adds two interrouter connections for each router, except the diameter link of the ring, which is the same as the odd numbers, adds one interrouter connection for each router. The diameter link of the ring can be either an odd or an even link.
The breakthrough of Equality in chordal ring topologies [10,11,12,13,14] is in the mixing of multiple even and odd links in a ring of even nodes while keeping the alternating nature of odd links. In addition, systematic routing rules are provided for derived networks.
For instance, in 2016, Faraha et al. [13] discussed degree-six 3-modified chordal ring networks, where the total number of nodes N must be divisible by 3, and every three nodes are grouped into a class. Zabłudowski et al. [15] discussed modified network double ring structures. The only publication we found that resembles the Equality networks is [12] (modified chordal ring CHR5_a(20; 3,7), CHR5_c(16; 3,5,7) and CHR5_d(16; 3,5,8)); however, it is not based on the same construction rules and applies only to radix-5 networks.
3.3 Network optimization
Equality topology offers a plethora of networks to be assessed and made to practice. One needs to set a goal to find the candidates.
3.3.1 Assignment of \(\mathbf {S_A}\) and \(\mathbf {S_B}\)
Once the network radix is decided, one follows Eq. 1 to confine the sizes of \(S_\text {A}\) and \(S_\text {B}\) to achieve a fixed K. For instance, suppose one would like to construct a network of 1840 routers with network radix 17. If four even numbers are selected as \(\mathbf {S_B}\), assuming hop 920 (i.e., N/2) is not selected, they contribute eight connections to each router. If \(S_\text {B}\) includes 920, they consume seven connections on each router. Let us say 920 is not in \(S_\text {B}\); then, an additional nine numbers can be added to \(S_\text {A}\) to get an Equality network of radix 17.
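The port counting in this example can be checked with a short sketch. The function name and the concrete hop values below are hypothetical placeholders chosen only to have the right parity and count; the counting rule itself follows the explanation of Eq. 1 (one port per odd hop, two per even hop, one for the N/2 hop).

```python
def network_radix(n, s_a, s_b):
    """Inter-router ports per router for hop sets S_A (odd hops) and S_B (even hops)."""
    if any(s % 2 == 0 for s in s_a) or any(s % 2 != 0 for s in s_b):
        raise ValueError("S_A must hold odd hops and S_B must hold even hops")
    k = len(s_a) + 2 * len(s_b)
    if n // 2 in s_b:
        k -= 1          # the N/2 hop reaches a single partner, so it costs one port
    return k

# Hypothetical hop values for a 1840-router network (illustration only)
nine_odd_hops = [-1, 1, 3, 5, 7, 9, 11, 13, 15]
print(network_radix(1840, nine_odd_hops, [2, 4, 6, 8]))     # four even hops, 920 not used
print(network_radix(1840, nine_odd_hops, [2, 4, 6, 920]))   # 920 used: one port fewer
```

With 920 among the four even hops, a tenth odd hop would be needed to reach radix 17 again.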
3.3.2 Optimization
We optimize Equality networks with a genetic algorithm. Ideally, an initial random seed is given for the generation of \(S_\text {A}\) and \(S_\text {B}\) from \({ C }\), with the constraint of a predefined radix K. We then select the population size and other simulation parameters, such as the maximum number of generations, and mutation rate. The goal of the optimization is to minimize the product of the average distance a and network diameter d. At the end of the optimization, a series of best results from generations of evolution are reported. If the search space being explored is large, the optimized results are not necessarily the global minimum; however, the results are usually low enough for application. If a sufficient amount of evolution is conducted, one can get results close to the global minimum.
For large systems, the empirical approach can involve only optimizing \(S_\text {A}\) while keeping \(S_\text {B}\) fixed. For instance, some of our systems are derived by handpicking \(S_\text {B}\) (and possibly a part of \(S_\text {A}\)) and optimizing the remaining \(S_\text {A}\); otherwise, the search space would be too large to explore. The designer has to adjust \(S_\text {B}\) many times in length and composition and optimize \(S_\text {A}\) to reach a better set of answers.
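The search over \(S_\text {A}\) with \(S_\text {B}\) held fixed can be illustrated with a much simpler procedure than a full genetic algorithm. The sketch below is a mutation-only hill climb that we use as a stand-in; population handling, crossover, and the parameter values are not taken from the paper, and all names are ours. It minimizes the same objective, the product of the average distance a and the diameter d, and it assumes the hops -1 and 1 form the ring.

```python
import random
from collections import deque

def score(n, hops):
    """Objective a * d for a hop set; infinity if the network is disconnected."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        sign = 1 if i % 2 == 0 else -1
        for s in hops:
            j = (i + sign * s) % n
            adj[i].add(j)
            adj[j].add(i)
    total, diameter = 0, 0
    for src in (0, 1):          # shifting IDs by 2 is a symmetry, so one even and one odd source suffice
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < n:
            return float("inf")
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return (total / (2 * (n - 1))) * diameter

def optimize_s_a(n, s_b, n_odd, generations=200, seed=0):
    """Hill-climb n_odd odd hops (besides the ring links) while S_B stays fixed."""
    rng = random.Random(seed)
    candidates = list(range(3, n - 2, 2))       # odd hops from 3 to N-3
    fixed = [-1, 1] + list(s_b)
    best = rng.sample(candidates, n_odd)
    best_score = score(n, fixed + best)
    for _ in range(generations):
        child = list(best)
        child[rng.randrange(n_odd)] = rng.choice(candidates)   # mutate one hop
        if len(set(child)) < n_odd:
            continue                            # duplicate hops would lower the radix
        child_score = score(n, fixed + child)
        if child_score <= best_score:
            best, best_score = child, child_score
    return sorted(best), best_score

print(optimize_s_a(32, [4], 3))
```

As with the genetic algorithm, the result is not guaranteed to be the global minimum; running more generations or several seeds explores more of the search space.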
3.4 Routing table
Routing is an essential part of communication. In the current work, the routing procedures include a universal routing table and three routing algorithms.
Remark 2
An even node and an odd node see an entirely identical network structure only in the reverse direction in an Equality network.
Starting from \(r_0\) in an Equality network, one can define a set of paths \({ P }(r_0,r_j)\) to any other node \(r_j\) in the network. From \({ P }(r_0,r_j)\), one can also find the shortest paths \(\overrightarrow{{ P }}(r_0,r_j)\) from \(r_0\) to \(r_j\) and save the paths as the first routing table \(\widehat{{ R }}\) of the network.
From Remark 2, one can then derive the shortest paths \(\overrightarrow{{ P }}(r_i,r_j)\) between \(r_i\) and \(r_j\) by simple conversion. Since the network is symmetric, the shortest paths (in fact, any paths) between nodes depend only on the difference of their IDs modulo the total node number N, as described in Eq. 2.
Apart from the shortest paths, \(\overrightarrow{{ P }}(r_0,r_j)\), we are also interested in the paths that are slightly longer, which are the sub-shortest paths \(\overrightarrow{{ P }}'(r_0,r_j)\), from \(r_0\) to \(r_j\).
For each target node \(r_j\), the sub-shortest paths \(\overrightarrow{{ P }}'(r_0,r_j)\) can be evaluated by checking all neighbors \(r_k\) not in the shortest paths \(\overrightarrow{{ P }}(r_0,r_j)\), finding all the distances \(d_{0k}\), picking the nodes \(r_k\) for which \(d_{0k} \ge d_{0j}\), and collecting the list of nodes \(r_k\) for which \(d_{0k}\) is minimal. The collected list \(L_{0j}\) of nodes \(r_k\) can then be incorporated with the shortest path routing table \(\widehat{{ R }}\) to form another routing table \(\widehat{{ R }'}\).
\(\widehat{{ R }'}\) thus contains the shortest paths and alternative targets for sub-shortest paths.
The definition of \(\widehat{{ R }'}\) involves the knowledge of total router number N and the interconnection configuration without knowing the number of endpoints p for each router. Therefore, the complexity of the routing table of an Equality network is O(N) instead of the O(\((Np)^2\)) of an irregular network.
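The conversion from the \(r_0\) table to an arbitrary source can be sketched as follows. This is our reading of Remark 2 and not a reproduction of Eq. 2: for an even source \(r_i\) the table is shifted by i, and for an odd source it is mirrored, so the relative target is \((j-i) \bmod N\) or \((i-j) \bmod N\), respectively. Function names are ours, and the hop set is again the N14K6 example with -1 and 1 assumed to be the ring links.

```python
from collections import deque

def build(n, hops):
    """Adjacency sets of an Equality network (even routers add the hop, odd ones subtract it)."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        sign = 1 if i % 2 == 0 else -1
        for s in hops:
            j = (i + sign * s) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def bfs(adj, src):
    """Hop distances from src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def first_hops_from_r0(adj):
    """Table for r_0 only: target -> neighbours of r_0 lying on a shortest path."""
    dist0 = bfs(adj, 0)
    table = {}
    for j in range(1, len(adj)):
        dist_j = bfs(adj, j)
        table[j] = sorted(k for k in adj[0] if 1 + dist_j[k] == dist0[j])
    return table

def next_hops(table, n, i, j):
    """Shortest-path next hops from r_i to r_j, converted from the r_0 table."""
    if i % 2 == 0:
        rel = (j - i) % n                       # even source: shift by i
        return sorted((i + h) % n for h in table[rel])
    rel = (i - j) % n                           # odd source: mirrored view of the network
    return sorted((i - h) % n for h in table[rel])

n = 14
adj = build(n, [-1, 1, 3, 9, 4])
table = first_hops_from_r0(adj)                 # N - 1 entries serve every source router
print(next_hops(table, n, 1, 7))
```

Only the single table for \(r_0\) is stored; every other router derives its next hops with two modular operations.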
3.5 Routing algorithms
The routing algorithms, including the adaptive minimal (abbreviated amin), non-minimal UGAL (global adaptive routing using local information, abbreviated ugal), and bottleneck-UGAL (abbreviated bgal), are provided in this work to assess the performance of the Equality networks under consideration.
3.5.1 amin
The adaptive minimal routing algorithm routes the packets through the shortest paths, where there may be one or more shortest paths.
From the universal routing table \(\widehat{{ R }'}\) the alternative intermediate router IDs mentioned above as \(L_{0j}\) are used for ugal and bgal routing algorithms.
3.5.2 ugal
The ugal routing algorithm originally defined in [16] is implemented and simulated for comparison. In the current work, the packet is routed so that all the provided shortest and sub-shortest paths are considered and weighted based on the product of each path’s queue length and hop distance.
3.5.3 bgal
The bottleneck-UGAL routing is based on the ugal algorithm; however, only the bottleneck pairs among all pairwise relationships are allowed to utilize the sub-shortest paths. It differs from ugal in quenching the available sub-shortest paths when the number of shortest paths is higher than a threshold value of 2 (fixed in this work, but adjustable). That is, if the number of shortest paths exceeds the threshold value, only the shortest paths are included in the routing paths.
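The difference between the three algorithms can be illustrated with a small path-selection sketch. It is not the BookSim implementation: the function, its arguments, and the queue-occupancy dictionary are hypothetical, the weight is the queue length multiplied by the hop count as described above, and ties are simply resolved in favour of the first candidate. Applying the same weight to amin is our assumption.

```python
def choose_path(shortest, sub_shortest, queue_len, algorithm="ugal", threshold=2):
    """Pick one path (a tuple of router IDs) for a packet.

    queue_len maps the first-hop router of a path to the occupancy of the
    output queue leading to it (a hypothetical local congestion estimate).
    """
    if algorithm == "amin":
        candidates = list(shortest)                        # minimal paths only
    elif algorithm == "bgal" and len(shortest) > threshold:
        candidates = list(shortest)                        # enough minimal paths: quench the detours
    else:
        candidates = list(shortest) + list(sub_shortest)   # ugal, or bgal at a bottleneck pair
    # weight = queue length x hop count; the smallest weight wins
    return min(candidates, key=lambda path: queue_len[path[1]] * (len(path) - 1))

# Paths from r_0 to r_8 in the N14K6[-1,1,3,9](4) example, with made-up queue occupancies
shortest = [(0, 3, 8), (0, 4, 8), (0, 9, 8)]
sub_shortest = [(0, 1, 12, 8)]
queues = {3: 10, 4: 12, 9: 9, 1: 1}
print(choose_path(shortest, sub_shortest, queues, "ugal"))
print(choose_path(shortest, sub_shortest, queues, "bgal"))
```

Here ugal takes the lightly loaded three-hop detour, while bgal stays on a minimal path because three shortest paths exceed the threshold of two.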
3.6 Cycle-accurate simulation conditions
3.6.1 Traffic models
Ten traffic models, including uniform random, asymmetric, random permutation, neighbor, tornado, bit rotation [5], bit complement, bit reverse, bit complement, and bit shuffle, are implemented in BookSim to evaluate the Equality networks. Of the ten traffic models, only bit rotation is implemented locally; it applies the relation \(d_i = s_{(i+1) \bmod b}\) to set a target endpoint for each source.
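Reading the bit rotation relation as \(d_i = s_{(i+1) \bmod b}\), where \(s_i\) and \(d_i\) are bit i of the b-bit source and destination addresses, the pattern is a one-bit right rotation. The helper below is an illustrative sketch with a name of our choosing, not the code added to BookSim.

```python
def bit_rotation(src, b):
    """Destination address whose bit i equals source bit (i + 1) mod b."""
    if not 0 <= src < (1 << b):
        raise ValueError("source address must fit in b bits")
    return (src >> 1) | ((src & 1) << (b - 1))   # the lowest bit wraps around to the top

for src in (0b0001, 0b0110, 0b1011):
    print(format(src, "04b"), "->", format(bit_rotation(src, 4), "04b"))
```

Because the mapping is a rotation, it is a permutation: every endpoint receives traffic from exactly one source.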
3.6.2 Deadlock freedom
Deadlocks happen when several turning points wait for buffers in a cyclic dependency. Deadlock freedom is achieved as described in [17], which guarantees it by either limiting the routing to ensure cycle-freedom in the channel dependency graph [18] or utilizing virtual channels (VCs) to break such cycles into different sets of buffers [19].
The routing strategy used herein is similar to that introduced in [20, 21]. If we consider a packet sent from router \(r_i\) to \(r_j\), we use a number of virtual channels equal to the diameter for minimal routing. If \(r_i\) and \(r_j\) are directly connected (i.e., one hop), then the packet is routed using VC0. If the path between \(r_i\) and \(r_j\) consists of two hops, then VC0 is used for the first hop, and VC1 is used for the second hop. Consider, for example, a network of diameter two: only one turn can be taken on the path, and therefore, the maximum number of required VCs is two [22]. That way, a flit only depends on the virtual channel one above its current virtual channel to make progress. Thus, no cyclic dependency can happen.
For adaptive routing, on the other hand, the number of virtual channels is equal to the maximum hop count of the paths in \(\widehat{{ R }'}\). For \(d=2\) systems, the maximum hop count in \(\widehat{{ R }'}\) is 4, and for \(d\ge 3\) systems, it is \(d+1\). BookSim gives an error when the number of VCs is insufficient. To generalize the algorithm above, we use VCk (\(0\le k<n\)) on hop k of an n-hop path between \(r_i\) and \(r_j\).
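The virtual channel rule can be written down directly. The sketch below uses function names of our own and encodes only what is stated above: hop k of a path uses VCk, minimal routing needs d virtual channels, and adaptive routing needs 4 for \(d=2\) and \(d+1\) for \(d\ge 3\).

```python
def assign_vcs(path):
    """Pair every hop (u, v) of a router path with its virtual channel: hop k uses VC k."""
    return [(path[k], path[k + 1], k) for k in range(len(path) - 1)]

def required_vcs(diameter, adaptive):
    """Number of virtual channels needed for a network of the given diameter."""
    if not adaptive:
        return diameter                  # minimal routing: one VC per hop of the longest path
    if diameter < 2:
        raise ValueError("the text covers adaptive routing only for d >= 2")
    return 4 if diameter == 2 else diameter + 1

print(assign_vcs([0, 3, 8]))             # two-hop path: VC0, then VC1
print(required_vcs(2, adaptive=False), required_vcs(2, adaptive=True))
```

Since the VC index only increases along a path, a flit never waits on a buffer with a lower index, which is what rules out cyclic dependencies.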
The general parameters in all simulations are the same as described in [17], which are:
Single-flit packets are used to avoid flow control issues as described in [17] and [22]; the virtual channel buffer size is set to 64 flit entries, and the number of virtual channels is set as described above depending on the network diameter.
We then collected latency and throughput benchmarks under various traffic patterns and injection rates until the values converged.
4 Cost and balance
4.1 The balance of K, p and a
The most intuitive interpretation of K/p is the ratio of the number of ports used for inter-router connections to the number of ports used for endpoints on each router. It is easy to see that if K/p is lower, the interconnect’s price would be lower for a fixed number of endpoints.
In general, if budget is the principal consideration, Equality networks can be restricted to the range of \(p \cdot (d+1) \ge P > p \cdot (a+1)\); alternatively, \(pd \ge K > pa\). The K/p value can be relaxed if performance is of principal concern. The design can be fine-tuned depending on what applications will be run on the final system. Equality is probably the only topology that does not need to sacrifice any port to adjust this ratio. If the number of endpoints is reduced (reduced p), the empty ports can be used for larger K values.
On fat-tree systems, the value of K/pd is always one (and a is very close to d). For instance, a 3-tier fat-tree network uses twice as many inter-router links as endpoint links, which makes the number of inter-router ports four times the number of endpoint ports on average on each router (each inter-router link consumes two ports). Coincidentally, the diameter (maximum inter-router hop count) of a 3L fat-tree network is four. The same idea applies to 2L and 4L fat-trees.
The power model behaves similarly to the networking cost, as the ratio of the number of SerDes for the inter-router links to that for the endpoint links is the same as K/p.
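The budget-oriented range can be checked numerically. In the sketch below the function name is ours, and the values d = 2 and a = 20/13 are what a breadth-first search gives for the 14-router Fig. 1 example under our reading of its notation; they are used here only as an illustration of the inequality \(pd \ge K > pa\).

```python
def in_budget_window(k, p, d, a):
    """True when p*d >= K > p*a, i.e. p*(d+1) >= P > p*(a+1) with P = K + p ports in use."""
    return p * d >= k > p * a

# n14k6p3: K = 6, p = 3, d = 2, a = 20/13 (illustrative values)
print(in_budget_window(6, 3, 2, 20 / 13))
# With p = 2 the same radix exceeds p*d, so the design leans toward performance
print(in_budget_window(6, 2, 2, 20 / 13))
```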
We calculated the total networking cost of all Equality systems with a model similar to what is described in [17], where each of the cabinets is assumed to be 1 m \(\times\) 1 m without aisle in the cluster.^{Footnote 2} Each of the endpoints and routers is assumed to be 1U in size. The overhead cable pathways are 1 m above the cabinet. The endpoints and the routers are allocated sequentially on the cabinet until the cabinet is filled to 42U standard cabinet size. All routers are situated on the top of the cabinet. Depending on the remaining cabinet space, endpoints can be allocated in the same or adjacent cabinet to the router.
Manhattan distance is calculated for each cable to include the distance from each module to the overhead cable pathway, the space on the aisle, and an additional 0.5 m horizontal distance from the side of the cabinet to the port. Therefore, from this model, a cable to the adjacent module is 104.45 cm, i.e., 2\(\times\)0.5 m + 44.45 mm. Cables longer than 8 m are fiber, otherwise copper. The results are summarized in Sect. 5.2.
4.2 Cost per node
From the results of the cost models presented by Besta et al. (Fig. 11(c) of [17]) and Kim et al. (Fig. 19 of [23]), it is evident that all topologies follow an almost steady cost/endpoint ratio. While the copper cable reduces the networking cost in small systems, the effect is insignificant in larger systems. In practice, this ratio will still depend on the cable length; i.e., larger computing nodes consume larger space, leading to a higher proportion of inter-router connection cost. For any two non-torus networks with computing nodes of a given form factor (for instance, 0.5U or 2U per endpoint), regardless of the topology used, the copper cable prices of the two networks should be equal if the numbers of routers and servers are the same. The analysis is reported in Sect. 5.3.
4.3 Bisection ratio
By definition, the bisection bandwidth, B, is evaluated by cutting a minimal number of cables to separate the system into two parts.
Instead of discussing the bisection bandwidth, we introduce two variables named network bisection ratio (\(b_r\)) and topology bisection ratio (\(B_r\)) to take the networking cost into account. To clarify, the network bisection ratio is the bisection bandwidth divided by total network bandwidth (including all links to endpoints), i.e., \(b_r= B/\Phi\), whereas the topology bisection ratio, \(B_r\), is the bisection bandwidth divided by total inter-router bandwidth (excluding all links to endpoints). The total network bandwidth, \(\Phi\), for a direct network is the total number of cables in the system multiplied by the channel bandwidth \(\phi\), i.e., \(\Phi = N\phi (\frac{K}{2}+p)\).
For a fully connected 12-port 10 Gb/s Ethernet switch, the bisection bandwidth is 60 Gb/s, which is half the total cable number multiplied by the channel bandwidth. It is easy to see that the bisection ratio for this fully connected switch is one half, i.e., \(b_r= 1/2\). For a two-tier fat-tree, this value becomes 1/4, as the number of cables in the network doubles, whereas for a three-tier fat-tree, the value of \(b_r\) is 1/6. The lower the bisection ratio, the higher the cost of networking hardware because a higher fraction of networking cost goes to the inter-router links.
For symmetric chordal ring networks like the Equality networks, the bisection bandwidth is equal to the minimum number of links that are cut if one splits the network into two semicircles. During the generation of new networks, the topology bisection ratio \(B_r\) is evaluated by counting the minimal number of links being cut when splitting the network into two equivalent halves.^{Footnote 3}
The relation of \(B_r\) and network bisection ratio \(b_r\) is shown in Eq. 3.
\[b_r = B_r \cdot \frac{K}{K+2p} \qquad (3)\]
The data are collected and discussed in Sect. 5.4.
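Both ratios can be computed for a small network by counting the links cut by a semicircle split. The sketch below is illustrative: the function names are ours, all channels are assumed to have the same bandwidth \(\phi\) (which then cancels), and the split is restricted to contiguous semicircles of the ring, as described above. The last line compares \(b_r\) with \(B_r \cdot K/(K+2p)\), which follows from \(\Phi = N\phi (\frac{K}{2}+p)\).

```python
def semicircle_cut(n, hops):
    """Minimum number of links cut over all splits of the ring into two semicircles."""
    edges = set()
    for i in range(n):
        sign = 1 if i % 2 == 0 else -1
        for s in hops:
            j = (i + sign * s) % n
            edges.add((min(i, j), max(i, j)))
    half = n // 2
    best = len(edges)
    for start in range(half):                    # start + n/2 gives the complementary split
        side = {(start + t) % n for t in range(half)}
        cut = sum((u in side) != (v in side) for u, v in edges)
        best = min(best, cut)
    return best, len(edges)

def bisection_ratios(n, hops, p):
    """Return (B_r, b_r) for p endpoints per router and equal channel bandwidth."""
    cut, links = semicircle_cut(n, hops)
    topology_ratio = cut / links                 # B_r: endpoint links excluded
    network_ratio = cut / (links + n * p)        # b_r: endpoint links included
    return topology_ratio, network_ratio

B_r, b_r = bisection_ratios(14, [-1, 1, 3, 9, 4], p=3)   # n14k6p3 example
print(B_r, b_r, B_r * 6 / (6 + 2 * 3))
```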
5 Results
We have selected a few publications as targets for comparison purposes. To make a sensible comparison, most networks utilize switches with the port number available in the market.
5.1 Targeted systems
Equality can be used to achieve reasonably good ratios of the Moore bounds. We include a column in Table 2 to address the ratio of the network size against the Moore bound. We obtained networks reaching ratios 39% of diameter 2 Moore bound, 9.66% of diameter 3 Moore bound, 1.64% of diameter 4 Moore bound, and 0.18% of diameter 5 Moore bound.
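For reference, the Moore bound for a network of radix K and diameter d is \(1 + K\sum _{i=0}^{d-1}(K-1)^i\). The short sketch below evaluates it; the function name is ours, and the 14-router, radix-6, diameter-2 case is used only as an illustration (it is not one of the systems in Table 2).

```python
def moore_bound(k, d):
    """Upper bound on the number of nodes of a graph with degree k and diameter d."""
    return 1 + k * sum((k - 1) ** i for i in range(d))

print(moore_bound(6, 2))                 # bound for radix 6, diameter 2
print(14 / moore_bound(6, 2))            # fraction of the bound reached by 14 routers
```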
We simulated networks of scales from small to 1,024,000 endpoints in this work with various router counts and radices. The listed systems, named according to the radix of the routers (i.e., E36-1 represents the first system using 36-port routers), are selected networks for each configuration. The listed networks are included in Appendix 1.
We have chosen carefully by optimizing the configuration as discussed in Sect. 3.3 and hand-picked one network with better performance in all traffic modes for each configuration. Some of the networks in the table are designed for exascale systems. The values of the torus, 3-tier fat-tree, Dragonfly, and Slim Fly networks are there for reference.
A Slim Fly network denoted by SF14 is compared with three diameter-3 Equality networks (E44-1, E44-2, and E44-3) in the table to show that with a bit of relaxation in router number, Equality allows better throughput using identical hardware specification. In this comparison, the latency of Slim Fly under an injection rate of 0.9 flit/cycle is over 50 cycles, whereas all three Equality systems are under 50 cycles.
Figure 2a and b show the comparison of scales (number of routers and endpoints) of the listed systems in the table. It can be seen that the maximum sizes 3-tier fat-trees can achieve are far smaller than those of Equality.
5.2 The balance of K, p and a
Figure 2(c) shows that the latency at the packet injection rate of 0.9 flit/cycle is negatively correlated with K/pa with the amin routing algorithm. The value K/pa stands for the switch’s balance of upward and downward flow ratios. The higher the a value, the more often a message has to consult the switches. By adjusting the weight of K/p, one can counterbalance the weight of a.
5.3 Cost per node
Figure 3 shows that with K/p being the variable, the ‘cost per endpoint’ values for all topologies (shown here: Slim Fly, Dragonfly, FBF3, FT3L, and DLN) follow the same trend if fiber cables are used for longer links. On the other hand, if only copper cables are used, the cost behaves like that of the tori. Equality systems are shown in circles, which follow the same trend as the fitted line, only that the system sizes are much larger. A higher number of endpoints will slightly increase the average networking cost per node, but not as significantly as the ratio K/p. The cluster model can be adjusted to include hot/cold aisles and different rack sizes, but this is not in the scope of the current study.
5.4 Bisection ratio
Figure 4 shows the distribution of the Equality networks compared with 3D-torus, 5D-torus, Slim Fly, Dragonfly, and fat-tree networks. The location of the network in this graph depends on the K/p and \(b_r\) values of the respective network. Since \(b_r/B_r\) is a function of K/p, the Slim Fly network sits close to the line of \(K/p=2\), where two of the Equality networks (E36-9 and E48-7) fall on that line (near FT2L). The distribution of \(\mathbf {S_A}\) and \(\mathbf {S_B}\) defines \(B_r\) in Equality networks. The fat-tree networks have good \(b_r\) in two tiers, but it degrades as the number of layers increases. The \(b_r\) values of tori also explain why uniform traffic is the nightmare of torus networks.
The introduction of bisection ratios \(b_r\) and \(B_r\) gives a new viewpoint on the bisection bandwidth of a network. The question becomes, “With the constraint of the networking budget, what percentage of the budget is contributed to the bisection bandwidth?” instead of “How much is the bisection bandwidth of the network?” The point is not to get a large bisection ratio, as one needs a proper balance to communicate with the nearby and remote routers. Our experiments found that \(B_r\) around 0.4–0.5 and \(b_r\) around 0.3 are suitable for most traffic models, where global and local traffic ratios are in balance.
5.5 Individual scenarios
The 16,384-endpoint scenario is prepared for comparison with [7]. Figure 5 shows the throughput of the E361 network alongside the best results reported in [7]. For the Equality network, three routing algorithms (amin, ugal, and bgal) are shown inside the gray box on the left-hand side. The other seven blocks (FlatFly, SFxFF1, SFxFF2, Dragonfly, Hypercube, 4D-torus, and 3D-torus) are the best values taken directly from that paper. In almost all traffic modes, the E361 network, built with the same number of switches of the same radix, performs better under the three routing algorithms than all networks presented in [7].
Figure 5 also shows that the ugal algorithm performs on par with FlatFly under the bit-complement traffic model, where all other topologies fail. Although fat-tree performs well in permutation traffic, we do not include it in this comparison because the largest 3-layer fat-tree that can be built with 36-port routers has only 11,664 endpoints.
The 48,000-endpoint scenario is prepared for the situation where the InfiniBand LID limit is a major concern. Detailed simulations of ten traffic models are presented in graphs containing both latency and throughput results.
All the injection processes are simulated at various injection rates to cover the range of latency values under 50 cycles. We focus on latencies below 50 cycles because they reflect the usable range of the system, where it remains reliable under the given injection process. Figure 6 presents all ten traffic models.
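The 11,664-endpoint limit can be checked with the standard size formula for a full-bisection fat-tree (folded Clos) built from identical radix-k routers, where an L-level tree supports at most 2(k/2)^L endpoints. The function name below is ours and the sketch only reproduces that arithmetic; it is not taken from the simulation code.

```python
def fat_tree_max_endpoints(radix: int, levels: int) -> int:
    """Maximum endpoints of a full-bisection fat-tree of identical routers.

    Each non-top router uses half its ports downward and half upward, so
    every added level multiplies the endpoint count by radix / 2.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number of at least 2")
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return 2 * (radix // 2) ** levels


if __name__ == "__main__":
    # 36-port routers, three layers
    print(fat_tree_max_endpoints(36, 3))
```

With 36-port routers and three layers this gives 2 × 18³ = 11,664, which is below the 16,384 endpoints required by this scenario.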
The 1E scenario features systems for future applications with multi-exaflops peak performance, depending on the computing power of each endpoint. Figure 7 shows the ten traffic models. The E807 network tested here has the configuration n20000K64p16. The 48,000-endpoint and 1E networks, which have close K/pa values, are included to show the scalability of Equality networks. Despite the significant increase in system size, the performance remains consistent across the two networks.
The \(10^6\)-endpoint system is a single-point simulation (run only at an injection rate of 0.9 flit/cycle) intended to reach a million endpoints. We show only the throughput and latency of the amin routing algorithm at that injection rate for the E806 system, which has the configuration n64000k64p16 in Table 2 and 1,024,000 endpoints. Running this million-endpoint system in BookSim on a single core consumed up to 272 GB of memory under bitcomp traffic. Because the only large-memory resource on our site is an IBM box, the calculation took around 20 days to complete. At that point, BookSim fails with an integer error, presumably due to the range limit of the integer type of the _cur_pid variable, but all simulations had reasonably converged before the job ended.
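The configuration labels can be unpacked with a small helper. This sketch assumes the label format n<routers>K<inter-router ports>p<endpoints per router>, which is inferred from the router counts, port counts, and endpoint totals quoted in this paper rather than from a formal definition; the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EqualityConfig:
    routers: int               # n: number of routers
    network_ports: int         # K: ports used for router-to-router links
    endpoints_per_router: int  # p: ports used for endpoints

    @property
    def endpoints(self) -> int:
        return self.routers * self.endpoints_per_router

    @property
    def radix(self) -> int:
        return self.network_ports + self.endpoints_per_router


def parse_config(label: str) -> EqualityConfig:
    """Parse a label such as 'n20000K64p16' (the case of the letters is ignored)."""
    match = re.fullmatch(r"n(\d+)k(\d+)p(\d+)", label.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    n, k, p = (int(group) for group in match.groups())
    return EqualityConfig(routers=n, network_ports=k, endpoints_per_router=p)
```

Under this reading, `parse_config("n20000K64p16")` describes 20,000 routers with 80 ports each and 320,000 endpoints in total.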
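A back-of-the-envelope estimate shows why a packet-ID counter could run out of range at this scale. The sketch assumes, without confirmation from the simulator source, that _cur_pid is a 32-bit signed counter incremented once per generated packet, that packets are a single flit long, and that every endpoint injects at the stated rate; the function name is hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def cycles_until_id_overflow(endpoints: int,
                             packets_per_cycle_per_endpoint: float,
                             max_id: int = INT32_MAX) -> int:
    """Whole simulated cycles before a global packet-ID counter exceeds max_id."""
    ids_per_cycle = endpoints * packets_per_cycle_per_endpoint
    if ids_per_cycle <= 0:
        raise ValueError("the total injection rate must be positive")
    return int(max_id // ids_per_cycle)


if __name__ == "__main__":
    # 1,024,000 endpoints at 0.9 single-flit packets per cycle each
    print(cycles_until_id_overflow(1_024_000, 0.9))
```

Under these assumptions the counter would be exhausted after roughly 2,330 simulated cycles, so an overflow during a long run is at least plausible.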
All of the above systems show the flexibility and performance of the Equality networks.
6 Conclusion and future work
This paper shows that the routing logic of Equality networks, with its low memory footprint, is naturally suited to routing algorithms that use minimal adaptive and sub-shortest paths. With the topology setting and routing logic implemented in BookSim, we demonstrate simulation results from small to large scales. We also perform the first million-node cycle-accurate calculations based on the BookSim package.
Many have advocated [17, 24,25,26] the use of low-diameter networks to realize enormous network sizes with high-radix routers. Table 2 shows that excellent performance persists in Equality networks with reasonably low diameters and ordinary router radix under high injection rates. Conversely, extremely low-diameter topologies usually involve networks that are not flexible or that need very high-radix routers. For the 1M-endpoint network, the current work provides a solution that builds a network of 64,000 routers with 80 ports each (shown in Table 2) as a replacement for the network solution of 53,138 routers, each with 264 ports, reported in [27]. Moreover, the resulting network performance is plausible. We have shown that Equality networks have similar or better performance compared with many other topologies, including Tori, Dragonfly, Slim Fly, Flat Fly, and Fat-Tree.
Equality does not handle permutation traffic (e.g., bit complement and bit shuffle) quite as well as fat-tree. Still, it generally trades zero-load latency for throughput while maintaining moderate to high usability.
Equality networks can be used with the low- or high-radix routers [28] available on the market. The results of this paper also show excellent performance of Equality networks under uniform random traffic even with an adaptive minimal routing algorithm, which suggests good performance on commodity hardware for general-purpose clusters. We have also routed a mini-sized Equality network on 12 HPE 5945 48SFP28 8QSFP28 switches for real applications. The routing of this small cluster is accomplished with multiprotocol label switching (MPLS), where all shortest paths and some sub-shortest paths are assigned between every pair of routers. On InfiniBand, we expect the network to work well with the Nue routing introduced by Domke et al. [29].
The adjustability and outstanding performance of Equality networks give the industry a new network topology option for HPC systems. The system designers will have more flexibility in picking commodity hardware. It is also an opportunity for data centers to reorganize for better efficiency.
In the future, we plan to run simulations using ROSSCODES and TraceR as described in [30] to capture the effects of real application traffic, especially the subtler effects that arise when many jobs run on the cluster [31].
To summarize, the current work reiterates the construction of networks based on a novel class of network topology that allows routing with simple logic and achieves strong scalability from small to large systems. The work evaluates network performance by comparing cycle-accurate BookSim benchmarks against many previous works. The results show significant benefits of using this new class of network topology for future high-performance computing applications.
Data availability
Not applicable.
Notes
It is worth noting that every even number \(S_j < N/2\) produces the same links as the corresponding even number \(N-S_j\); therefore, all even numbers greater than N/2 are excluded from the set.
The cost of copper cables, \(f(x) = 0.4079x + 0.5771\) [$/Gb/s], the cost of fiber cables, \(f(x) = 0.0919x + 7.2745\) [$/Gb/s], and the cost of a router, \(f(k) = 350.4k - 892.3\) [$], are consistent with [17] for comparison.
The number of links to the endpoints, p, is a variable and therefore is not considered in the topology generation stage.
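The first note above, which says that a chord length and its complement with respect to N produce the same links, can be checked on a plain chordal ring. This is a minimal sketch that assumes every node i is linked to node (i + s) mod N; the actual Equality construction described earlier in the paper may apply chords to subsets of nodes, and the function name is hypothetical.

```python
def chord_links(n: int, s: int) -> set:
    """Undirected links created by chord length s on a ring of n nodes."""
    return {frozenset((i, (i + s) % n)) for i in range(n)}


if __name__ == "__main__":
    n = 16
    # Chord lengths 4 and 16 - 4 = 12 describe the same set of links.
    print(chord_links(n, 4) == chord_links(n, n - 4))
```

Because the link from i to i + s is the same undirected link as the one from i + s to (i + s) + (N − s), keeping only the chord lengths below N/2 loses no connectivity.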
References
Yang CY, Liang CH, Wu HL, Cheng CH, Li CC, Chen CM, Huang PL, Hwang CC (2019) Exceeding the performance of two-tier fat-tree: equality network topology. Future of Information and Communication Conference, 14. Springer, Cham, pp 1187–1199
Liang CH, Cheng CH, Wu HL, Li CC, Chen CM, Huang PL, Huang SL, Hwang CC (2018) Beyond the performance of three-tier fat-tree: equality topology with low diameter. In 2018 international symposium on computer, consumer and control (IS3C) (pp. 22–29). IEEE
Wu HL, Cheng CH, Liang CH, Li CC, Chen CM, Huang PL, Huang SL, Hwang CC (2020) Beyond the performance of 3D-torus: equality topology with low radix. In 2020 international symposium on computer, consumer and control (IS3C) (pp. 319–322). IEEE
Cheng CH, Wu HL, Liang CH, Li CC, Chen CM, Huang PL, Huang SL, Hwang CC (2020) Equality NoC: a novel NoC topology for high performance and energy efficiency. In 2020 international symposium on computer, consumer and control (IS3C) (pp. 83–86). IEEE
Dally WJ, Towles BP (2004) Principles and practices of interconnection networks. Elsevier
Lutomirski A, Tegmark M, Sanchez NJ, Stein LC, Urry WL, Zaldarriaga M (2011) Solving the corner-turning problem for large interferometers. Mon Not R Astron Soc 410(3):2075–80
Daryin A, Korzh A (2015) Early evaluation of direct large-scale InfiniBand networks with adaptive routing. Supercomput Front Innov 1(3):56–69
Dongarra JJ, Meuer HW, Strohmaier E (1997) TOP500 supercomputer sites. Supercomputer 15(13):89–111
Jiang N, Balfour J, Becker DU, Towles B, Dally WJ, Michelogiannakis G, Kim J (2013) A detailed and flexible cycle-accurate network-on-chip simulator. In performance analysis of systems and software (ISPASS), 2013 IEEE international symposium on (pp. 86–96). IEEE
Beivide R, Martínez C, Izu C, Gutierrez J, Gregorio JÁ, Miguel-Alonso J (2003) Chordal topologies for interconnection networks. International symposium on high performance computing, 20. Springer, Berlin, pp 385–392
Parhami B (2008) Periodically regular chordal rings are preferable to double-ring networks. J Interconnect Netw 9:99–126
Dubalski B, Bujnowski S, Ledzinski D, Zabludowski A, Kiedrowski P (2012) Analysis of modified fifth degree chordal rings. In New Frontiers in Graph Theory. InTech
Farah RN, Chien SLE, Othman M (2016) Graph theoretical properties of degree six 3-modified chordal ring networks. J Eng Appl Sci 11(9):1987–1999
Parhami B (1995) Periodically regular chordal ring networks for massively parallel architectures. In frontiers of massively parallel computation, Proceedings. Frontiers '95, fifth symposium on the (pp. 315–322). IEEE
Zabłudowski Ł, Dubalski B, Kiedrowski P, Ledziński D, Marciniak T (2012) Modified NDR structures. Image Process Commun 17(3):29–45
Singh A (2005) Load-balanced routing in interconnection networks (Doctoral dissertation, Stanford University)
Besta M, Hoefler T (2014) Slim Fly: A cost effective low-diameter network topology. In High Performance Computing, Networking, Storage and Analysis, SC14: International Conference (pp. 348–359). IEEE
Duato J (1995) A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst 6(10):1055–67
Dally WJ, Seitz CL (1988) Deadlock-free message routing in multiprocessor interconnection networks. California Institute of Technology. (Unpublished)
Duato J, Yalamanchili S, Ni LM (2003) Interconnection networks: an engineering approach. Morgan Kaufmann
Gopal IS (1994) Interconnection networks for high-performance parallel computers, chapter Prevention of store-and-forward deadlock in computer networks, pp. 338–344. IEEE computer society press, Los Alamitos, CA
Kim J, Balfour J, Dally W (2007) Flattened butterfly topology for on-chip networks. In microarchitecture. MICRO 2007. 40th annual IEEE/ACM international symposium on 2007 (pp. 172–182). IEEE
Kim J, Dally WJ, Scott S, Abts D (2008) Technology-driven, highly-scalable dragonfly topology. ACM SIGARCH Comput Archit News. IEEE Comput Soc 36(3):77–88
Kathareios G, Minkenberg C, Prisacari B, Rodriguez G, Hoefler T (2015) Cost-effective diameter-two topologies: analysis and evaluation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (p. 36). ACM
Mubarak M, Carothers CD, Ross R, Carns P (2012) Modeling a million-node dragonfly network using massively parallel discrete-event simulation. In high performance computing, networking, storage and analysis (SCC), 2012 SC Companion (pp. 366–376). IEEE
Kim J, Dally WJ, Abts D (2007) Flattened butterfly: a cost-efficient topology for high-radix networks. In ACM SIGARCH Computer Architecture News (pp. 126–137). ACM
Wolfe N, Carothers CD, Mubarak M, Ross R, Carns P (2016) Modeling a million-node slim fly network using parallel discrete-event simulation. In Proceedings of the 2016 annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation (pp. 189–199). ACM
Alistarh D, Ballani H, Costa P, Funnell A, Benjamin J, Watts P, Thomsen B (2015) A high-radix, low-latency optical switch for data centers. In ACM SIGCOMM computer communication review (pp. 367–368). ACM
Domke J, Hoefler T, Matsuoka S (2016) Routing on the dependency graph: a new approach to deadlock-free high-performance routing. In proceedings of the 25th ACM international symposium on high-performance parallel and distributed computing (pp. 3–14)
Jain N, Bhatele A, White S, Gamblin T, Kale LV (2016) Evaluating HPC networks via simulation of parallel workloads. Framework 20(21):22
Yang X, Jenkins J, Mubarak M, Ross RB, Lan Z (2016) Watch out for the bully!: job interference study on dragonfly network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (p. 64). IEEE Press
Acknowledgments
The authors acknowledge the incisive feedback from all the reviewers. This material is partially supported by the Air Force Office of Scientific Research under award number FA9550-16-1-0499. Morphing Supercomputer INC also partially supports this work by backing the research budget.
Funding
Ministry of Science and Technology, Taiwan, MOST 102-2911-I-006-301. Air Force Office of Scientific Research, United States, FA9550-16-1-0499. Morphing Supercomputer INC.
Author information
Contributions
Chi-Hsiu Liang was the main inventor of the network topology, wrote 90% of the simulation code, and wrote the manuscript. Chun-Ho Cheng and Hong-Lin Wu ran the simulations of all the networks. Chao-Chin Li maintained the IT hardware and provided hardware support in our group. Po-Lin Huang wrote 10% of the simulation code and worked on its extended applications. Chi-Chuan Hwang, the PI, initiated the project and contributed most of the ideas during brainstorming.
Ethics declarations
Conflict of interest
Not applicable.
Ethical approval
Not applicable.
Appendix 1 Configuration of the listed Equality networks
Table 3 lists the configurations of the Equality networks simulated in this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liang, CH., Cheng, CH., Wu, HL. et al. Performance evaluation of multi-exaflops machines using Equality network topology. J Supercomput 79, 8729–8753 (2023). https://doi.org/10.1007/s11227022050051