1 Introduction

In recent years, the fastest-growing applications and services among those requiring high-performance computing (HPC) have been those related to big data and deep learning [1], which have revolutionized the design trends of supercomputers and data-center systems. These technologies are characterized by handling large amounts of data that must be processed in a short period of time. As a result, the systems intended to process these data require ever-increasing computing and storage capacity.

The computational power required to solve the complex problems associated with the applications mentioned above can be huge. Therefore, HPC systems aim to increase their computational capacity year after year. Indeed, the objective of the previous decade was to break the ExaFlop computing barrier (i.e., \(10^{18}\) floating-point operations per second) by 2018, although this achievement was postponed until June 2022, when the Frontier supercomputer reached 1.1 ExaFlops [2]. To reach the required levels of computational power, today’s HPC systems are composed of an ever-increasing number of interconnected processing nodes working in parallel. Hence, one of the most crucial requirements to achieve the desired performance in these systems is to provide high bandwidth and low latency to the communication operations among these nodes. Therefore, in an HPC system, the interconnection network is a key subsystem that must be fast enough not to become the bottleneck of the entire system.

Indeed, in the last decades, industry and academia have devoted significant effort to designing high-performance interconnection networks that offer high communication bandwidth and low latency while being reliable and cost-effective. These efforts have focused mainly on several key design issues, such as the network topology, routing algorithms, congestion management techniques, and Quality-of-Service (QoS) policies. In particular, the network topology and the routing algorithm are design aspects of utmost importance, as they determine the maximum bandwidth and expected latency that the application communication patterns will experience.

In more detail, the network topology defines the arrangement of the system nodes and how these nodes are interconnected. Over the years, topologies have evolved from traditional networks, such as buses or rings, to more complex but more efficient ones, such as Tori or Fat-Trees. In recent years, the design trend for network topologies has been to provide a low diameter (i.e., to keep the number of hops a packet needs to cross the network as small as possible), path diversity (i.e., multiple paths between each pair of nodes), and a low acquisition cost.

Furthermore, cost-effective network topologies require efficient routing algorithms. The routing algorithm defines the path a packet must follow from the moment it is injected into the network until it reaches its destination. To leverage the properties of newly proposed network topologies, efficient routing algorithms have been proposed, which can be classified into two main categories: oblivious and adaptive. With oblivious routing, a packet is sent through one of the possible paths in the network without considering the network state. Therefore, packets between two end nodes may select any of the routes offered by the topology. However, the routing algorithm may always select the same path between two end nodes; in this case, we refer to deterministic routing, a subset of oblivious routing algorithms. By contrast, with adaptive routing, a packet is sent through the least-loaded path in the topology, so that the packet sidesteps congested areas and the traffic flows are spread throughout the network paths. Adaptive and oblivious routing are widely deployed in current HPC network technologies, such as InfiniBand and Slingshot, since they leverage the multi-path nature of different network topologies. However, we have shown in previous studies that adaptive routing can be counterproductive under some traffic conditions [3,4,5], such as those affected by network congestion. In these scenarios, congestion may spread from where it originates back through the multiple paths offered by the topology, thus significantly degrading network performance. Therefore, adaptive or oblivious routing should only be enabled under certain traffic scenarios. For this reason, deterministic routing is still widely used in interconnection networks for HPC clusters and data centers.

The network topologies traditionally implemented in real HPC systems and data centers can be classified into three main types [6]: direct, indirect, and hybrid. In direct topologies, or router-based networks, every node is both a terminal (injecting and receiving traffic packets) and a switch [7], and each node is directly connected to its neighboring nodes. These networks are easy to implement if they only have 2 or 3 dimensions, while using more than 3 dimensions increases the layout complexity, and they do not scale well in performance. Direct networks can be divided into strictly orthogonal topologies (e.g., meshes, tori, or hypercubes), where each node has a link per dimension, and weakly orthogonal topologies, where some nodes may not have any link in some dimensions (e.g., trees, stars, etc.).

By contrast, indirect topologies appeared as an alternative to improve the bandwidth and latency provision of direct networks. In indirect networks, each node is either a terminal or a switch [7], so the terminal nodes (or end nodes) are interconnected via switches organized in different stages, as happens in Fat-Tree [8] or Clos [9] topologies. As the number of nodes grows, indirect networks achieve smaller network diameters than direct networks, but the number of devices required grows significantly, thus also increasing the network cost.

The third type of network topology is the hybrid topology, which combines the best qualities of direct and indirect networks and avoids their disadvantages [6]. Indeed, hybrid topologies provide high bandwidth, low latency, and a reduced diameter (like indirect topologies), while achieving a reduced cost (like direct ones) and keeping the rich path diversity offered by both indirect and direct topologies. One example of a hybrid network is the k-ary n-direct s-indirect, also known as KNS, which combines the best features of direct and indirect topologies to connect an extremely high number of processing nodes efficiently. The KNS topology is an n-dimensional topology where the k nodes of each dimension are connected through a small indirect network of s stages. Indeed, the KNS topology offers bandwidth and latency similar to indirect networks but at a lower cost. Unfortunately, although the KNS topology and its properties have been widely studied through simulation, as far as we know, it has not been implemented in real hardware.

In this work, we have implemented and evaluated the KNS network topology in a real HPC cluster using the InfiniBand interconnection network technology. As the KNS topology defines its own routing algorithm, Hybrid-DOR, based on the dimension order routing (DOR) algorithm proposed for strictly orthogonal direct networks, we have also implemented the Hybrid-DOR routing algorithm in the InfiniBand control software. Specifically, we have implemented Hybrid-DOR in the OpenFabrics Software (OFS), which includes an open-source implementation of the InfiniBand subnet manager: OpenSM.

To evaluate the proposed implementation, we have used the research cluster CELLIA (Cluster for the Evaluation of Low-Latency Interconnection Architectures), available at the University of Castilla-La Mancha (Spain). CELLIA comprises 50 server nodes and uses an InfiniBand-based interconnection network, so all the servers contain InfiniBand network cards. Moreover, the CELLIA cluster provides 50 8-port switches, allowing the configuration of different network topologies. Note that the InfiniBand technology currently supports many direct, indirect, and hierarchical networks, but the KNS topology has not yet been implemented using it. In summary, the main contributions of this work are the following:

  • We have built for the first time the KNS topology in a real InfiniBand-based cluster. We propose a wiring scheme and a simple layout for this topology (see the supplementary file).

  • We have implemented the Hybrid-DOR routing for KNS networks in the OpenSM software, so the network devices are identified, and the routing tables at network switches are conveniently populated.

  • We have evaluated the performance of the proposed implementation using well-known parallel applications and benchmarks, such as HPCC, HPCG, Graph500, and Netgauge. The experimental results demonstrate that the KNS topology can be built using InfiniBand components.

The remainder of this work is organized as follows. In Sect. 2, we review the background of HPC networks, the hybrid KNS topology, and the InfiniBand architecture. Section 3 describes the tools used to implement Hybrid-DOR in OpenSM, as well as the cluster and the tools used to evaluate the performance of the topology. In Sect. 4, the performance results of several experiments performed on a real InfiniBand cluster are analyzed. Section 5 discusses related work on network topologies. Finally, some conclusions are drawn in Sect. 6.

2 Background

In this section, we describe the basic background concepts of interconnection networks. First, we overview different taxonomies for network topologies, analyzing their advantages and disadvantages, which motivate the proposal for the KNS topology. Next, we focus on the properties of the KNS topology and the Hybrid-DOR routing. Finally, we provide some details regarding the InfiniBand network technology.

2.1 Interconnection networks

High-performance computing (HPC) systems and data centers require an interconnection network that allows the computing and storage nodes to communicate quickly and efficiently when running applications and services. Specifically, an interconnection network consists of links providing point-to-point connections between two network components: two switches, a node connected to a switch, or two directly connected nodes. Switches comprise several ports, used to interconnect them with other network devices. A switch forwards incoming packets from one port to the appropriate output port to reach each packet’s destination. Finally, the network interfaces connect a node’s internal system bus to the network links.

Designing an efficient interconnection network faces important challenges, such as scalability, physical limitations, cost, and performance, to guarantee that it is not the bottleneck of the whole system. A key design decision is the topology, which defines how the network elements are connected. Note that the ideal network would have all its nodes directly connected, but this is not feasible for a large number of nodes. Therefore, different topologies have been studied to connect many nodes in the most efficient way possible. The literature provides different network classifications depending on the characteristics used for their interconnection. The most common classification distinguishes between direct and indirect networks.

2.1.1 Direct or router-based networks

In a direct network, every node is both a terminal and a switch [7]. This switch is also referred to as a router and is in charge of providing the terminal node with connectivity to the rest of the network. Each node’s router is directly connected to the routers of other neighboring nodes, so these networks are also called router-based networks. Direct networks are scalable in terms of cost, which grows linearly with the number of network devices. However, they do not scale well in terms of performance, since increasing the number of nodes increases both the average distance between them and the latency. In addition, their bisection bandwidth is small compared to that of other topologies. Some direct networks can be defined with the notation k-ary n-cube, where k is the number of nodes per dimension and n is the number of dimensions. However, other direct topologies do not follow this nomenclature, like circular rings or Hamiltonian graphs. Figure 1 shows the most common networks of this type, such as Hypercubes (Fig. 1a), Meshes (Fig. 1b), and Tori (Fig. 1c).

Fig. 1
figure 1

Example of different types of direct networks

The simplicity of the connection patterns of direct networks allows efficient and straightforward routing algorithms to be implemented. Hypercubes offer the best performance as the number of nodes in the network grows. This is because, with only two nodes per dimension, the average distance is lower than in other direct topologies, where more nodes must be traversed in each dimension to reach the destination. The problem with hypercubes is cost, since they need more links to connect all the dimensions, and the routers need more ports, as they must have one port per dimension. In HPC systems, it is also common to use dedicated routers, which replace the router integrated into the nodes; in this case, the nodes’ network interface is simpler. One example of this approach is the Fugaku supercomputer and its proprietary Tofu-D network technology [10].

2.1.2 Indirect networks

In indirect networks, each node is either a terminal node or a switch [7], and the terminal nodes are interconnected through switches. In these networks, a switch can be connected to end nodes and other switches simultaneously, or only to other switches. The ideal way to connect the n nodes of an indirect network would be through a single switch with n ports [6], but the physical and cost limitations would be prohibitive in systems with thousands of nodes. When the number of nodes increases in these networks, the diameter and the average distance do not increase, or increase only slightly, so these networks are scalable in performance. The disadvantage is that the number of resources required (switches, cables, and power consumption) grows considerably, so they are limited by cost and physical constraints. The most common indirect networks are the multistage interconnection networks (MINs), often called Fat-Trees [8] because this is the connectivity pattern most commonly used in MINs. A MIN consists of a series of stages formed by switches that connect to end nodes or to switches of other stages (see Fig. 2a). There are different types of Fat-Trees, depending on how the switches of some stages are connected to others. One example is the Slimmed Fat-Tree (SFT) [11]. In an SFT network like the one shown in Fig. 2b, the lower stages have more links than the upper stages. This design increases the bandwidth between closer nodes, taking better advantage of the locality exhibited by some traffic patterns, without degrading overall performance too much [12]. These networks allow the connection of many nodes without the performance loss of direct networks, although the number of devices required, and hence the acquisition cost, increases considerably.

Fig. 2
figure 2

Example of different types of indirect networks

2.1.3 Discussion on other taxonomies for interconnection networks

Historically, direct networks have used dedicated routers, while indirect networks have used switches. Nowadays, switches and routers perform similar functions, blurring the line between direct and indirect networks. This paper considers direct networks to be those in which all switches are connected to at least one terminal node, and indirect networks to be those in which some switches are only connected to other switches, although other switches may also be connected to end nodes.

In addition, other classifications have been proposed, as in [6], where the authors define hybrid networks as those that cannot be classified as direct or indirect networks because their main characteristic is the use of elements of both types. The intention is to use the best features of direct and indirect networks to create a more efficient network: one in which, as the number of nodes increases, the number of devices needed to build the network does not grow considerably, as happens in indirect networks, and the distance between nodes does not grow as much as in direct networks.

In [6], the authors also define hierarchical networks as a subgroup of hybrid networks. These networks are characterized by arranging nodes into groups and interconnecting the groups hierarchically. The DragonFly [13] network can be classified as hierarchical: the nodes of each group are directly connected through one or more switches, all switches connecting the nodes of a group are directly connected, and all groups of nodes are directly connected. This means that latency is lower among the nodes of the same group, taking advantage of node locality. Each group is connected to the other groups via global links that connect switches from different groups; these links are usually longer than the rest and may lead to higher latency.

Figure 3 shows an example of a 72-node DragonFly with 9 groups of 8 nodes each. Each group is interconnected by 4 switches, and these switches are connected to the switches of the rest of the groups. Moreover, in newer versions of the DragonFly, such as the MegaFly [14], each group is composed of a MIN interconnecting the nodes of the group to the rest of the groups, improving the minimum path diversity.

Fig. 3
figure 3

72-node DragonFly

2.2 The KNS network topology

The family of KNS topologies was first proposed in [15], and its performance and cost scalability have been evaluated using simulation models in [16] and [17]. However, no known implementation of a KNS network in a real system using InfiniBand exists. The objective of this work is to implement a real KNS network and to verify that it delivers the expected performance under the chosen HPC applications.

The KNS hybrid topology combines a direct network of n dimensions with small indirect subnetworks connecting the nodes of each dimension. By combining the advantages of direct and indirect networks, this topology can connect nodes with low latency, high bandwidth, and high fault tolerance, but at a lower cost than indirect networks. In this topology, a router connects each node to the rest of the terminal nodes, as in direct networks. However, this router is not connected to the routers of other terminal nodes; instead, it is connected to the small indirect networks that connect all the nodes of each dimension. Thus, a router provides connectivity to n indirect networks, one per dimension.

Considering the above, this topology can be defined with 3 parameters: the number of nodes per dimension (k), the number of dimensions (n), and the number of stages of the indirect network connecting the nodes of each dimension (s). Note that the number of dimensions n and the number of nodes per dimension k are the same parameters as for direct topologies. The number of nodes in the KNS topology is given by \(N = k^n\). Moreover, the number of stages of the indirect network, s, depends on the number of switch ports (d) and the k nodes in each dimension. If \(k \le d\), one switch is sufficient for the indirect network, so \(s = 1\). By contrast, if \(k > d\), it is necessary to interconnect the k nodes per dimension with an indirect network (e.g., a Clos or a Fat-Tree), whose number of stages is given by \(\log_d{k}\).
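As an illustration, the following Python sketch (a helper of our own, not part of OpenSM) computes these parameters for a given configuration; the switch count is only derived for the single-stage case (s = 1), which is the one built in this work:

```python
def kns_parameters(k: int, n: int, d: int) -> dict:
    """Basic figures of a k-ary n-direct s-indirect (KNS) network.

    k: nodes per dimension, n: number of dimensions,
    d: number of ports of the switches forming the indirect subnetworks.
    """
    nodes = k ** n                     # N = k^n terminal nodes
    s, span = 1, d                     # smallest s such that d^s >= k
    while span < k:
        span *= d
        s += 1
    routers = nodes                    # one router per terminal node
    # With s = 1, each dimension needs k^(n-1) single-switch indirect networks.
    switches = n * k ** (n - 1) if s == 1 else None
    return {"nodes": nodes, "s": s, "routers": routers, "switches": switches}

# The 6-ary 2-direct 1-indirect KNS built on CELLIA with 8-port switches:
print(kns_parameters(k=6, n=2, d=8))
# {'nodes': 36, 's': 1, 'routers': 36, 'switches': 12}
```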

Note that this topology’s name, KNS, is based on the k, n, and s parameters. Another way of referring to this hybrid topology is k-ary n-direct s-indirect. Table 1 summarizes these parameters, and Fig. 4 shows an example of a 16-node 4-ary 2-direct 1-indirect KNS network. Note that the routers are depicted as circles (colored orange), each with an [x, y] coordinate, and that each router is connected to a terminal node (colored green). Finally, switches are illustrated as blue rectangles. For simplicity, we only show the terminal nodes on the corner routers.

Table 1 KNS parameters
Fig. 4
figure 4

An example of a 4-ary 2-direct 1-indirect KNS topology

Regarding the routing algorithms for KNS topologies, an adaptation of the dimension order routing (DOR) algorithm used in meshes and tori is proposed in [17]. The main idea behind DOR routing is that packets traverse the network in the order given by the available dimensions: first the X dimension, then the Y dimension, and so on. Thus, it is guaranteed that the minimum path between two nodes is used and that no deadlocks can occur. Specifically, the Hybrid-DOR routing defined for KNS topologies identifies switches and routers by their position in the network. Routers and terminal nodes are labeled as in some direct networks, with a set of coordinates \([r_{n-1},r_{n-2},...,r_{1},r_{0}]\) and \([x_{n-1},x_{n-2},...,x_{1},x_{0}]\), respectively, with as many coordinates as there are dimensions (n). Each coordinate represents the position of the router or terminal node in each dimension. Note that each terminal node has the same coordinates as the router to which it is connected. To identify the switches, two parameters \([d, p]\) are used, where d is the dimension in which the switch is located and p is the position of that switch in dimension d. For cases where \(s > 1\), more coordinates are needed to uniquely identify each switch.

As we can see in Fig. 4, we assume that switches only interconnect the nodes corresponding to the same dimension, so when they have to forward a packet, they send it to the router whose \(r_{d}\) matches the \(x_{d}\) of the destination (where d is the dimension in which the switch is located). This behavior is defined in Algorithm 1, which receives as input the destination terminal node coordinates (\([x_{n-1},x_{n-2},...,x_{1},x_{0}]\)) and returns the link connecting to the router whose \(r_d = x_d\). Note that this only works when \(s = 1\), the case of the KNS implemented in this paper. If \(s > 1\), the indirect network is formed by more than one switch, and it can be built using a Fat-Tree topology. In this case, the implementation of the routing algorithm is a bit more complex, but some of the existing routing algorithms can be used, as proposed in [17].

Algorithm 1
figure a

Hybrid-DOR for Switches in the KNS (\(s = 1\)).
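As an illustration, a minimal Python sketch of Algorithm 1 follows. The data structures are ours (a mapping from router coordinates to switch ports, assumed to be known after the setup phase), not the actual implementation:

```python
def switch_route(switch_dim: int, port_of_router: dict, dest: tuple) -> int:
    """Hybrid-DOR at an indirect-network switch (s = 1), as in Algorithm 1.

    switch_dim     : dimension d in which this switch operates.
    port_of_router : maps the coordinate r_d of each attached router to the
                     switch port leading to it (our simplified model).
    dest           : destination coordinates, indexed so that dest[d] is the
                     coordinate x_d of the destination in dimension d.
    """
    # Forward to the router whose coordinate in dimension d matches x_d.
    return port_of_router[dest[switch_dim]]
```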

The Hybrid-DOR algorithm for routers is defined in Algorithm 2. The input of this algorithm is, as for switches, the destination coordinates (\([x_{n-1},x_{n-2},...,x_{1},x_{0}]\)), and the output is the next link a packet must take on its path toward its destination. To select the link following the dimension order, a router compares the coordinates of the packet destination with its own coordinates in the n dimensions. If there is a mismatch in some dimension, the router forwards the packet to the switch that connects the routers of that dimension. If all the coordinates match those of the destination, the destination must be the node directly connected to that router, so the selected link is the one connecting to the destination terminal node.

Algorithm 2
figure b

Hybrid-DOR for Routers in the KNS.
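Similarly, a minimal sketch of Algorithm 2 for routers could look as follows (same conventions and hypothetical structures as in the previous sketch):

```python
def router_route(router_coords: tuple, dim_port: dict, terminal_port: int,
                 dest: tuple) -> int:
    """Hybrid-DOR at a router, as in Algorithm 2.

    router_coords : this router's coordinates, one per dimension.
    dim_port      : for each dimension d, the port connected to the indirect
                    network (switch) of that dimension.
    terminal_port : port connected to the locally attached terminal node.
    dest          : destination coordinates, indexed by dimension as before.
    """
    # Compare coordinates in dimension order (X first, then Y, ...).
    for d, (own, target) in enumerate(zip(router_coords, dest)):
        if own != target:
            return dim_port[d]      # mismatch: go through dimension d's switch
    return terminal_port            # all coordinates match: local terminal node
```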

According to this definition of the KNS topology, a new type of network is obtained, which scales to a large number of nodes at a cost not as high as that of indirect topologies and without the efficiency loss commonly observed when increasing the number of nodes in direct topologies [17]. Instead of traversing each dimension node by node until reaching the destination, as in direct networks, in the KNS all the nodes in the same dimension are connected through a small indirect network, which reduces the network diameter. The cost does not increase as much as in traditional indirect networks, since the networks needed to connect the nodes of each dimension are simple and do not require many switches and cables.

2.3 The InfiniBand network technology

InfiniBand is one of the most widely used and efficient interconnect technologies for HPC systems. The InfiniBand Architecture (IBA) specification [18] defines how to build small and large networks with high bandwidth and low latency per link and a reduced CPU load on the processing nodes. IBA networks use host channel adapter (HCA) network cards, switches, and links to interconnect them. Moreover, IBA-based networks require a software entity called the subnet manager to identify all the network devices and orchestrate their communication. Specifically, the HCAs connect the physical network link to the processing node’s internal bus. Within the nodes and at the application layer, the IBA provides the verbs, a software API that allows applications to use specific procedures to directly access the HCAs and perform communication operations without the intervention of the CPU and the OS kernel. Note that the verbs also handle the support for transport protocols on the HCAs, such as the remote direct memory access (RDMA) protocol. On the other hand, switches are responsible for forwarding packets from one port to another. Each switch has a linear forwarding table (LFT) with as many entries as there are destinations in the network. In IBA-based networks, each destination is identified by a local identifier (LID), which is a 16-bit address assigned to each device by the subnet manager at the network setup stage. Each entry in an LFT consists of a destination LID and the output port through which the switch must route a packet to reach that destination.
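Conceptually, an LFT is simply a table indexed by destination LID, as in the following schematic sketch (example values only, not IBA code):

```python
# Schematic view of a linear forwarding table (LFT): one entry per destination
# LID in the subnet, giving the output port used to reach that destination.
lft = {
    1: 3,   # packets addressed to LID 1 leave through port 3
    2: 3,
    3: 7,
}

def forward(dest_lid: int) -> int:
    """Return the output port for a packet addressed to dest_lid."""
    return lft[dest_lid]

print(forward(3))   # -> 7
```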

One of IBA’s most important elements is the subnet manager (SM), which runs on only one node in the network, although there are switches capable of running it. The main functions of the SM are to discover the network topology, identify each network device (HCAs and switches) with a specific LID, and populate each device’s LFTs. Moreover, each network device runs a subnet management agent (SMA), which receives the configuration information generated by the SM. In other words, the SM computes the routing in a centralized manner in the node where it is executed, populating the LFTs of each device and sending them to each device. The port assigned to each destination LID depends on the routing algorithm given to the SM when it is initialized.

Furthermore, the IBA specification defines an API of protocols and services bundled into middleware between user applications and IBA hardware. The IBA specification does not state how to implement the API but defines the functionality that InfiniBand devices must support. These functionalities, referred to as verbs, represent actions requested by applications, and the collection of verbs defines the range of interactions an application can have with the IBA hardware. These functionalities are provided in the OpenFabrics Software (OFS), developed by the OpenFabrics Alliance (OFA). The OFS comprises open-source implementations of the SM, the InfiniBand drivers, and the libraries necessary for the communication types IBA supports, such as RDMA and MPI. The Open Subnet Manager (OpenSM), provided by the OFS, is also open source, and it offers different routing algorithms or routing engines, such as updn [19], ftree [20], DOR [21], LASH [22], torus-2QoS [19], SSSP [23], DFSSSP [24], and minhop [19]. The OpenSM architecture allows us to implement other routing algorithms that are better suited to other topologies.

3 Description of the KNS implementation

In this section, we describe the different steps to implement the KNS topology in our cluster. First, we describe the hardware and software infrastructure that we have used. Next, we provide some details of implementing the Hybrid-DOR routing. Finally, we describe the cabling and layout for the KNS topology in the CELLIA cluster.

3.1 Hardware/software infrastructure

This work’s hardware and software infrastructure includes the OFS, which provides the OpenSM source code. Additionally, we have used the simulation tool ibsim to emulate the control traffic generated by the subnet manager in the setup phase of InfiniBand networks to discover and identify the network devices. Moreover, we use the CELLIA cluster, whose interconnection network is based on InfiniBand technology. Figure 5 illustrates a general diagram of this work’s hardware and software infrastructure.

Fig. 5
figure 5

A diagram of the hardware/software infrastructure

We have implemented the Hybrid-DOR routing algorithm for KNS topologies in the OpenSM control software. During the code development stage, we simulated the control traffic of an InfiniBand network using the ibsim simulator. This open-source program enables us to execute OpenSM and emulate its behavior without an actual InfiniBand network, which allows us to modify OpenSM, add new features, and examine them before testing on real hardware. The ibsim simulator requires a topology description file that defines an InfiniBand network, listing all the devices that compose it (HCAs and switches) and the connection pattern that each device’s ports follow. We have used the IB tools, which are also provided by the OFS and the InfiniBand drivers installed in the CELLIA cluster, to debug possible network failures. Specifically, we have used ibnetdiscover, which displays the list of InfiniBand network devices and their connection patterns in the same format used by the ibsim simulator. Note that this hardware/software infrastructure approach separates the software development from the hardware configuration. Therefore, the routing engines of IBA-based networks can be implemented and validated without requiring large hardware installations.

Finally, we wired a KNS network on the CELLIA cluster and launched OpenSM with our Hybrid-DOR implementation to ensure it runs correctly. We have also evaluated the performance of this topology by running different HPC applications and benchmarks, comparing the results obtained on the KNS with our implementation of Hybrid-DOR against those obtained with other routing engines available in OpenSM and used by other topologies, such as Fat-Trees, which we describe below.

3.2 Hybrid-DOR implementation in OpenSM

In this section, we explain the implementation of the Hybrid-DOR routing algorithm in OpenSM, which is divided into the four phases explained below. For the sake of simplicity, we assume hereafter a two-dimensional (2D) KNS topology that uses a single switch in the indirect network, i.e., a k-ary 2-direct 1-indirect KNS topology.Footnote 1 Note that the HCAs in the cluster cannot act as routers in the KNS since they cannot forward traffic from one port to another. Therefore, it is necessary to connect an additional switch to each HCA to act as a router and provide connectivity between that HCA and the indirect-network switches connecting each dimension’s nodes.

3.2.1 Network devices discovering and addressing

As described before, the first step for all the routing engines in InfiniBand-based networks is to discover the topology. OpenSM broadcasts management datagrams (MADs), which are routed through the network without prior knowledge of the topology, using all the physical paths available. When a MAD reaches a specific device (HCA or switch), that device responds directly to the endpoint running OpenSM and reports its information. This initial communication allows OpenSM to address all the network HCAs and switches and assign each of them a unique LID. Moreover, this network discovery phase allows OpenSM to identify the physical connections between the network devices and build a logical view of the network topology. Note that OpenSM runs on a single computing node of the network, so this node initiates and receives all the control traffic (i.e., MADs) during the discovery phase. Once OpenSM discovers the topology and creates the logical network view, the next step is to place the different LIDs in the different dimensions of the KNS topology and differentiate between router and switch LIDs.

3.2.2 Switches and routers identification

To place the HCAs in the KNS topology, OpenSM assigns a coordinate to each HCA LID in the same way it sets these coordinates for direct networks. Therefore, in a 2D KNS topology, each HCA has an [x, y] coordinate. Moreover, OpenSM needs to identify which switches are directly connected to the HCAs (i.e., the routers) and which switches are connected to each router in each dimension. Therefore, to identify each device type, the Hybrid-DOR routing engine examines all the devices that are not HCAs (i.e., switches or routers) and, for each of them, inspects the devices reachable from each of its ports. If at least one HCA is connected to one of the ports of the current device, that device is considered a router; in the KNS topology, routers interconnect the HCAs with the switches (see Fig. 4). By contrast, if the devices connected to the current device are only other switches (acting either as routers or as switches), the current device is identified as a switch that is part of the indirect network.
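A simplified sketch of this classification step over OpenSM's logical view (with hypothetical device objects of our own) could be:

```python
def classify_devices(devices) -> tuple:
    """Split non-HCA devices into routers and indirect-network switches.

    devices: iterable of objects with an .is_hca flag and a .neighbors list
             holding the devices reachable from each port (a simplified
             stand-in for OpenSM's internal structures).
    """
    routers, switches = [], []
    for dev in devices:
        if dev.is_hca:
            continue
        # A device with at least one HCA attached acts as a router;
        # otherwise it is part of an indirect network.
        if any(neigh.is_hca for neigh in dev.neighbors):
            routers.append(dev)
        else:
            switches.append(dev)
    return routers, switches
```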

3.2.3 Devices coordinate assignment

Once the switches and routers have been identified, the Hybrid-DOR routing engine assigns the corresponding coordinates to each device (i.e., HCAs, routers, and switches). To accomplish this, Hybrid-DOR goes through the list of devices already identified in the network. First, the first switch of the indirect network found in the list is assigned the coordinates [0, 0], i.e., switch 0 of dimension 0. Moreover, this switch assigns coordinates in ascending order \(([0,0],[0,1],..., [0,k-1])\) to the routers connected to its ports. Note that, as we assume a 2D KNS topology, all the ports of a given switch are connected to routers, which share the same coordinates as the HCAs attached to them. When a router receives its coordinates from the switch in the X-dimension, it assigns coordinates to the other switch connected to it, the one in the Y-dimension: that switch preserves the Y-coordinate received from the router and replaces the X-coordinate value with the Y-dimension identifier. For instance, if the connected router is at [0, 3], the switch in the Y-dimension will be labeled [1, 3]. Once all the switches in the Y-dimension have coordinates assigned, the remaining switches in the X-dimension can also be assigned coordinates: Hybrid-DOR searches for switches without coordinates in the device list and assigns them coordinates in ascending order. To complete the identification of devices, Hybrid-DOR searches the device list for routers without coordinates and assigns them coordinates based on the switches to which they are connected. For instance, if a router is connected to switch [0, 4] and switch [1, 2], the coordinates of the router will be [4, 2]. This method of identifying devices offers flexibility in wiring, as a specific port order is not required, thus preventing potential wiring mistakes.
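For instance, the last step of this procedure (deriving a router's coordinates from the [d, p] labels of the switches it is attached to) can be sketched as follows; the example reproduces the one given in the text:

```python
def router_coords_from_switches(attached_switches) -> list:
    """Derive a router's coordinates from the [d, p] labels of the switches
    it is attached to (one switch per dimension).

    attached_switches: iterable of (d, p) pairs, where d is the dimension of
                       the switch and p its position in that dimension.
    """
    # Order the attached switches by dimension and keep their positions.
    return [p for (_d, p) in sorted(attached_switches)]

# Example from the text: switches [0, 4] and [1, 2] -> router [4, 2]
print(router_coords_from_switches([(1, 2), (0, 4)]))   # [4, 2]
```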

3.2.4 Routing tables population

When the Hybrid-DOR routing engine has assigned coordinates to all the devices, the next step is to populate the routing tables (a.k.a. LFTs) at switches and routers. Note that OpenSM first fills the LFTs in its logical view of the topology. As described above, each switch and router has an LFT containing as many entries as there are LIDs in the network. The Hybrid-DOR routing engine calculates the output port for each LID at a given switch or router. Specifically, for each LID, Hybrid-DOR checks whether the device associated with that LID is directly connected to the current switch or router. If so, it assigns to that LID, in the LFT, the port connecting to that device. Otherwise, if the LID corresponds to a device not directly connected to the current switch or router, its coordinates are compared to those of the current device. If the not-connected device is an HCA, its coordinates are the same as those of the router to which it is connected, and the routing engine selects the router’s port that connects to that HCA.

By contrast, routers check, following the dimension order, whether their coordinates differ from those of the destination. If the coordinates in a dimension do not match, the port associated with that LID will be the port connecting to the switch that interconnects the routers of that dimension. In the case of switches, the port for the target LID is the one connecting to the router whose coordinate in the dimension in which the switch works (X or Y) matches that of the target device in that dimension.
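As a sketch, the LFT of a router in the logical view could be filled as follows (HCA destinations only, with hypothetical attribute names; the switch case is analogous, using the single-dimension rule described above):

```python
def populate_router_lft(router, hcas_by_lid: dict) -> dict:
    """Fill a router's LFT for HCA destinations in the logical view (sketch).

    router      : object with .coords (its coordinates), .dim_port (dimension
                  -> output port towards that dimension's switch) and
                  .terminal_port (port towards the locally attached HCA);
                  these attribute names are ours, for illustration only.
    hcas_by_lid : maps each HCA's LID to its coordinates (an HCA shares the
                  coordinates of the router it is attached to).
    """
    lft = {}
    for lid, dest_coords in hcas_by_lid.items():
        for d, (own, target) in enumerate(zip(router.coords, dest_coords)):
            if own != target:
                lft[lid] = router.dim_port[d]   # reach dimension d via its switch
                break
        else:
            lft[lid] = router.terminal_port     # same coordinates: local HCA
    return lft
```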

Finally, once the Hybrid-DOR routing engine has populated the LFTs in the logical view of the KNS topology, OpenSM sends MADs to each of the switches and routers to fill their LFTs. After the LFTs have been filled in the physical devices, the setup phase finishes, and the network is ready to operate.

3.3 The CELLIA Cluster

As previously stated, the CELLIA (Cluster for the Evaluation of Low-Latency Interconnection Architectures) cluster has been used in this work. This cluster belongs to the RAAP research groupFootnote 2 at the University of Castilla-La Mancha (Spain). The CELLIA cluster comprises 50 HP ProLiant DL120 Gen9 server nodes, each with an Intel Xeon E5-2630v3 8-core 1.80 GHz processor and 32 GB of RAM. Each node is equipped with an InfiniBand HCA with QDR technology, providing 40 Gbps of link speed. The cluster is also equipped with 50 Mellanox\(^{TM}\) IS5022 8-port switches with InfiniBand QDR technology, providing 40 Gbps of bandwidth per port. We have used copper cables and QSFP transceivers from Mellanox\(^{TM}\) compatible with QDR technology. All these devices and the administrative network are racked in three cabinets, as shown in Fig. 6: two for the nodes and one for the switches (called the “Switch rack”). Note that the switch rack contains the 50 switches and can be accessed for cabling at the front and the back, which has allowed us to build several network topologies in the past.

Fig. 6
figure 6

CELLIA cluster

3.4 KNS built on CELLIA cluster

Given the available resources in the CELLIA cluster and the limitation of having to use switches as routers, the best KNS that can be built is a 6-ary 2-direct 1-indirect network (see Fig. 7), since we can interconnect 36 nodes using 48 switches in total (36 acting as routers plus 12 for the indirect networks).

Fig. 7
figure 7

36-node 6-ary 2-direct 1-indirect KNS topology built on CELLIA cluster

In this figure, the routers are illustrated as circles (colored orange), and their disposition is like that of a 2D mesh (i.e., rows and columns), so each router has an [x, y] coordinate. Each router is connected to an HCA (only the four HCAs in the corners are shown; the rest are omitted for clarity). Finally, switches are illustrated as blue rectangles. Note that switches, routers, and HCAs have been labeled using the coordinates described in the previous section. Another option would be to build a 3-ary 3-direct 1-indirect network, but it would require 54 switches (27 acting as routers and 27 for the indirect networks).

In the supplementary file, we have provided a detailed description of the switch layout and cabling necessary to construct this topology. Additionally, a brief analysis of the cost and complexity of this implementation is included in the same file.

It is worth mentioning that KNS topologies could be built with fewer hardware components if the HCAs offered routing functionality, allowing traffic to be forwarded from one HCA port to the other without the host’s intervention. This feature would make cabling easier, requiring fewer devices and less space in switch racks. As far as we know, the InfiniBand hardware cannot forward traffic from one HCA port to another without the host’s intervention. However, we can still use the routers to improve connectivity. For instance, more HCAs could be connected to each router, and parallel links could be used between switches and routers to increase the aggregated bandwidth offered by the topology.

4 Experiments and results

This section evaluates the performance of the KNS topology (described in Sect. 2.2) built in the CELLIA cluster. Note that our goal is not to achieve the best possible performance that a given topology can reach in the CELLIA cluster, but to verify that our implementation of the Hybrid-DOR routing engine in OpenSM works as expected and that the KNS topology built using the available InfiniBand hardware in the CELLIA cluster achieves good results when running well-known HPC benchmarks. In the following sections, we describe the selected benchmarks, the experiments performed, and their configuration parameters in more detail. Next, we analyze the experiments’ results.

4.1 Benchmarks setup

As described above, this section presents the benchmarks that we have compiled and executed in CELLIA to evaluate the performance of the 2D KNS topology. For this study, we have selected some of the most relevant HPC benchmarks based on the Message Passing Interface (MPI) paradigm, namely HPCC, HPCG, Graph500, and Netgauge. From these benchmarks, we have specifically focused on the tests that generate typical communication patterns in the interconnection network, as explained in the following sections.

4.1.1 HPCC

HPC Challenge (HPCC) [25] is a collection of benchmarks measuring computing and memory-access performance. Essentially, HPCC comprises seven tests: Linpack (HPL) [26], which performs floating-point operations to solve a series of systems of linear equations; DGEMM, which multiplies matrices of real numbers and reports the number of FLOPs achieved; STREAM, which gauges the sustainable memory bandwidth (in GB/s) and the associated compute rate for a simple vector kernel; PTRANS, which performs a parallel matrix transpose; Random-Access, which chooses a random position in memory, reads an integer, modifies it in the processor, writes it back, and measures the speed of random integer updates in memory in Giga Updates Per Second (GUPS); FFT, which measures the FLOPs of the system using the Discrete Fourier Transform (DFT); and b_eff, which obtains the communication latency and bandwidth of different communication patterns that are run simultaneously. Not all the HPCC tests are intended to measure network performance, so we have selected those that generate at least moderate network communication, namely Random-Access, PTRANS, and b_eff.

4.1.2 HPCG

The High-Performance Conjugate Gradient (HPCG) [27] benchmark is a more accurate representation of real-world scenarios. This benchmark is based on the Conjugate Gradient (CG) method, a widely used iterative method for solving large problems requiring many operations. The HPCG benchmark represents real-world scenarios more accurately because it incorporates computation and data-access patterns that closely match a different and broad set of important applications. The HPCG rating is a weighted GFLOP/s (billion floating-point operations per second) value computed from the operations performed in the CG iteration phase over the time taken.

4.1.3 Graph500

The Graph500 [28] benchmark focuses on data-intensive applications, which have become very important for business analytics, finance, and all those fields where large amounts of data need to be handled, such as deep learning and big data applications. This benchmark defines five problem classes depending on the input size, ranging from 17 GB to 1.1 PB. The benchmark’s tests involve creating a graph and running typical operations on this data structure to evaluate the system’s performance.

4.1.4 Netgauge

Netgauge [29] is a benchmark focused on evaluating the performance of HPC system networks. Although it supports different communication patterns, such as 1toN, where one node sends traffic to the rest of the nodes in the network, or Nto1, where all nodes in the network send traffic to one node, the most representative test for comparing different networks is the Effective Bisection Bandwidth (EBB) test. Basically, EBB creates communication operations for different bisections of the network, calculates the bisection bandwidth of each one, and then calculates the average communication performance in Mbps. Consequently, Netgauge is a proper benchmark for measuring network performance.

4.2 Experiments setup

To evaluate the KNS topology configured in CELLIA and the correctness of the Hybrid-DOR implementation in OpenSM, we decided to configure a Fat-Tree topology with characteristics similar to those of the KNS in terms of bisection bandwidth, diameter, and number of nodes. Specifically, we have built a 45-node Slimmed Fat-Tree (SFT) topology, as shown in Fig. 8, in which 45 nodes are indirectly connected by 21 switches distributed over three stages. There are 3 groups of 15 compute nodes each, each one connecting 3 subgroups of 5 terminal nodes. Note that stages 1 and 2 of the SFT topology have more switches and links (and more bandwidth) than stage 3, improving the communication speed of close server nodes. In the upper stage, there are parallel links between switches to increase the bandwidth in that part of the network. We have removed one node from each subgroup, resulting in nine equal subgroups of four nodes each, so that this configuration has the same number of nodes as the KNS used. The SFT is a popular topology widely used in production HPC systems. The SFT we built in CELLIA is perfectly valid for obtaining throughput and latency metrics when running real HPC benchmarks, which can be compared with the performance metrics obtained when using a KNS network. Indeed, our purpose is not to demonstrate that the KNS outperforms the SFT, but to show that the KNS with Hybrid-DOR routing can operate while obtaining performance metrics similar to those of the SFT.

Fig. 8
figure 8

45-node topology SFT built on CELLIA cluster

Furthermore, to evaluate the performance of the KNS and SFT topologies, we conducted tests by configuring different routing engines in OpenSM to determine which one suits each topology better. The routing engines that we have used are the following:

  • updn. This routing engine sets a maximum number of paths in the network to prevent circular buffer dependencies and ensure no deadlocks occur. It generates a directed graph of the network with each link assigned a direction: a flow cannot cross a link in the “up” direction after crossing a link in the “down” direction. The root node serves as the root of this graph, and only links in the “up” direction reach it. In every possible cycle, there is a node that cannot be crossed, breaking all cyclic dependencies. When executing this routing engine in OpenSM, it is necessary to specify the root node [30].

  • Layered shortest path (LASH). This routing engine aims to prevent deadlocks by dividing messages into virtual layers, utilizing IBA’s Virtual Lanes (VLs). The VLs are only used on links interconnecting switches, as links connecting switches with HCAs cannot cause a deadlock. This routing engine calculates the shortest path between each pair of nodes, and every route is included in a layer only if it does not generate cycles in that layer. It ensures that the same VL is used for both the SRC/DST and DST/SRC directions of a pair, although it does not ensure that the path followed from SRC to DST is the reverse of the one from DST to SRC. To avoid concentrating routes in a few layers, LASH balances the number of routes in each layer while avoiding introducing routes with cyclic dependencies into the same layer. This routing engine uses the minimum number of VLs needed to ensure that no deadlocks appear.

  • DFSSSP. This is the deadlock-free (DF) version of the single-source shortest-path (SSSP) routing algorithm, which globally balances the number of routes per link to optimize link utilization. DFSSSP uses the VLs to guarantee deadlock freedom. This routing engine searches for the shortest path between all nodes in the network using Dijkstra’s algorithm. Once the minimum paths have been calculated, they are checked for cycles that may cause blockages. If a cycle is found, i.e., a possible deadlock is present, an edge of this cycle is selected and all routes containing this edge are moved to the next higher VL. Finally, this routing engine balances the routes among all available VLs. In the case of the CELLIA cluster devices, there are 8 VLs.

  • ftree. This routing engine is specifically designed for Fat-Tree topologies, so it fits well with the configured SFT topology. It does not use VLs but, as with updn, it prevents deadlocks. This routing engine works with Fat-Trees of 2 to 8 stages. The OpenSM documentation indicates that, to obtain the best results with this routing engine, topologies should be symmetrical; however, it can also work with irregular Fat-Tree configurations.

  • hdor. This is our implementation of Hybrid-DOR in OpenSM. Hybrid-DOR is an adaptation, for KNS topologies, of the DOR algorithm used in Mesh or Torus topologies. With this routing engine, messages are forwarded through the network following a dimension order. As happens in DOR, this routing engine avoids deadlocks.

We used the LASH, updn, and DFSSSP routing engines, which work on any topology. In addition, for the KNS topology we have also used the hdor routing engine, and for the SFT topology we employed the ftree routing engine. Table 2 summarizes the number of VLs used by each routing engine in each topology, as this affects the results of the experiments.

Table 2 Number of VLs used by each routing engine in each topology

We have executed each benchmark in both topologies with each of the previously mentioned routing engines. For each benchmark and routing engine, we executed the benchmarks by distributing the MPI tasks linearly and randomly across the nodes, to observe how this factor affects the performance obtained in each experiment. Note that the random distribution follows the same order in all the experiments.

4.3 Results analysis

In this section, we examine the results obtained for the HPCC, HPCG, Graph500, and Netgauge benchmarks. All the figures follow the same structure: for each metric, the left graph displays the results achieved by the KNS, while the results for the SFT are presented on the right, and each graph shows the results obtained when the MPI tasks are distributed linearly (on the left) and randomly (on the right).

4.3.1 HPCC

We will begin with an analysis of the HPCC benchmark results. Firstly, we present the results obtained with the PingPong test, which consists of sending many pings from an origin node to a destination node to calculate the maximum, minimum, and mean for the latency and the bandwidth metrics.

Figure 9 shows the plots corresponding to the bandwidth in the PingPong test. The maximum bandwidth values (Fig. 9a and Fig. 9b) are very similar for the two topologies and all routing algorithms, standing at almost 4 GB/s. This makes sense since the CELLIA cluster links go up to 40 Gbps (i.e., 5 GB/s), but, due to the 8b/10b encoding, the effective bandwidth is 32 Gbps (4 GB/s). From the minimum bandwidth values shown in Fig. 9c, we can see that the KNS obtains similar values for all the routing engines and task distributions. For the minimum bandwidth in the SFT topology (see Fig. 9d), the performance decreases when MPI tasks are randomly mapped onto the nodes, but it obtains results similar to those of the KNS when the tasks are linearly mapped. For the KNS, the maximum and minimum values are quite similar; therefore, the average bandwidth (Fig. 9e) is close to the maximum. In the SFT topology (Fig. 9f), the average bandwidth drops slightly for some routing engines, but the maximum average bandwidth is the same as in the KNS. It should be noted that, despite being frequently employed in the HPCC benchmark, the PingPong traffic pattern introduces a light load on the topology.

Figure 10 shows the average, maximum, and minimum latency results in the PingPong test. The results are similar to those of the bandwidth. For the maximum latency values (Figs. 10a and 10b), there is one case, using the updn routing engine in the SFT topology, that reaches up to 14 microseconds with the linear mapping of the tasks. When the tasks are mapped randomly in the SFT, the maximum latency is lower for updn, while for the rest of the routing engines it gets worse by 1 microsecond; in the KNS, the results are the same for both mappings. For the minimum latency values (Figs. 10c and 10d), the SFT offers better results because its minimum distance is two hops, whereas in the KNS it is 4 hops. It is important to highlight that 2 out of these 4 hops are made in routers, whose latency in a KNS using HCAs with routing functionality would be much lower than in this configuration. As for the average latency (see Figs. 10e and 10f), similar results have been obtained for the two topologies and all the routing engines used, although it worsens slightly when mapping the MPI tasks onto the nodes randomly in the SFT topology.

Fig. 9
figure 9

HPCC: PingPong bandwidth results

Fig. 10
figure 10

HPCC: PingPong latency results

Continuing with the HPCC results, we analyze the MPI Random Access test. Figure 11 shows the results of evaluating random memory accesses using MPI. The unit of measurement is GUPS (Giga Updates Per Second) and indicates the number of integer updates the system can perform in one second. With the linear distribution of MPI tasks, we achieve about 0.05 GUPS for both the KNS and SFT topologies, obtaining a slightly better result for the DFSSSP routing engine. However, when these tasks are randomly distributed, the performance of the SFT decreases compared with that of the KNS, which is maintained and even improves with the hdor routing engine.

Fig. 11
figure 11

HPCC: MPI random access results

In Figs. 12 and 13, we can see more latency-related results, in this case using the ring communication pattern. This pattern arranges the processes in a ring structure, where each process only receives data from and sends data to the processes on its left and right. The processes are distributed either in natural order or randomly. We obtained results comparable to the previous cases. When MPI tasks are distributed linearly across the nodes, both KNS and SFT obtain similar results; the worst routing engine with this task distribution in the SFT topology is updn. The rest of the routing engines work slightly better (by less than \(0.5\,\mu s\)) in the SFT because it has better locality in the lower stages than our KNS implementation. When the MPI tasks are randomly distributed among the nodes, the latency in the SFT increases significantly. Finally, we can observe that in the KNS the best result is always obtained by the LASH routing engine, and in Fig. 13, when the tasks are randomly mapped, this result even improves, since LASH takes advantage of the VLs. In [31], it is demonstrated that using the InfiniBand VLs improves network performance and reduces interference between flows. As discussed above, the latency in the KNS is higher than it would be in an implementation using HCAs with router functionality, since the minimum path is 3 hops across 3 switches rather than 2 routers and 1 switch.

Fig. 12
figure 12

HPCC: naturally ordered ring latency

Fig. 13
figure 13

HPCC: randomly ordered ring latency

Concluding the HPCC benchmark results, let us examine the results of the PTRANS test. Figure 14 shows the bandwidth results. As in previous tests, the KNS obtains around 12 GB/s for all the routing engines and task distributions. This result equals the maximum achieved in the SFT topology when using DFSSSP. Note that although hdor performs better in the KNS than ftree does in the SFT, it does not achieve the best result in any configuration. The best options for the KNS are DFSSSP and LASH, which use VLs (see Table 2), unlike hdor. Note also that, despite ftree being specific to Fat-Trees, it does not obtain the best results in the SFT topology, where DFSSSP, taking advantage of the VLs, obtains the best result for both task distribution mappings.

Fig. 14
figure 14

HPCC: bandwidth results in PTRANS test

4.3.2 HPCG

This section analyzes the results obtained with the HPCG benchmark. Specifically, we compare the execution time and the GFLOPS (\(10^{9}\) floating-point operations per second) achieved by each execution. In this case, a curious result has been observed: in the KNS, better results are obtained when MPI tasks are assigned in random order than when they are linearly distributed, as shown in Fig. 15, which presents the computational performance of the CELLIA cluster measured in GFLOPS. Both topologies reach a peak of nearly 30 GFLOPS. Note that the hdor routing engine performs best on the KNS topology (see Fig. 15a) when the tasks are randomly mapped; recall that this random distribution has the same order in all the experiments. Finally, note also that, although there are topology-specific routing algorithms (hdor for KNS and ftree for SFT), in most cases the best performance is achieved with the routing algorithms that use InfiniBand’s VLs (LASH and DFSSSP). This is because there is less interference between flows when using VLs, and the Head-of-Line blocking effect is reduced [31].

Fig. 15  HPCG: GFLOPS

4.3.3 Graph500

This section compares the performance obtained by the Graph500 benchmark on each topology. The focus is on the number of Traversed Edges Per Second (TEPS) obtained in each execution, a metric specific to this benchmark and analogous to the number of floating point operations per second (FLOPS in LINPACK) and the number of memory operations (GUPS in HPCC). For a linear distribution of MPI tasks across the nodes, the TEPS values remain almost the same for the KNS and SFT topologies (see Fig. 16), staying close to 2.5 GTEPS. However, for a random distribution of tasks across the nodes, the performance decreases slightly, except for the hdor routing engine in the KNS topology, which remains constant at 2.5 GTEPS (see Fig. 16a).
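
For clarity, the TEPS metric normalizes the size of the traversed graph by the search time; the expression below reflects the usual definition (Graph500 additionally reports an aggregate over many searches):

\[
\mathrm{TEPS} = \frac{m}{t_{\mathrm{BFS}}},
\]

where \(m\) is the number of edges of the traversed graph and \(t_{\mathrm{BFS}}\) is the time of one breadth-first search. A value of 2.5 GTEPS therefore corresponds to about \(2.5\times10^{9}\) traversed edges per second.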

Fig. 16  Graph500: traversed edges per second (TEPS)

4.3.4 Netgauge

We conclude the analysis of benchmark results with Netgauge. This benchmark offers various traffic patterns, including a measurement of the effective bisection bandwidth (EBB). Figure 17 shows that this value is significantly higher for the KNS than for the SFT implemented in the CELLIA cluster. Additionally, hdor achieves the highest bisection bandwidth for both distributions of the MPI tasks, notably improving on the results obtained by the routing engines that take advantage of the IB VLs (LASH and DFSSSP). In the SFT topology, we observe that DFSSSP obtains better bisection bandwidth than the ftree routing engine, even though the latter is specifically designed for Fat-Tree topologies.
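
As a rough sketch of the metric (the exact averaging performed by Netgauge may differ), the effective bisection bandwidth can be thought of as the mean per-flow bandwidth observed when the nodes are repeatedly split into two random halves and each node communicates with a random partner in the opposite half:

\[
\mathrm{EBB} \approx \frac{1}{k}\sum_{i=1}^{k}\left(\frac{1}{|P_i|}\sum_{(s,d)\in P_i} b_{s\rightarrow d}\right),
\]

where \(P_i\) is the \(i\)-th random matching across a random bisection of the nodes and \(b_{s\rightarrow d}\) is the bandwidth measured for the pair \((s,d)\) while all pairs in \(P_i\) communicate simultaneously.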

Fig. 17  Netgauge: effective bisection bandwidth (EBB) results

4.4 Experiments conclusions

To summarize the results of the analysis, it is relevant to note that latency in the KNS topology tests is higher because switches are used as routers; using HCAs with traffic forwarding and routing capabilities would reduce it. The KNS results demonstrate that hdor works well as a routing engine in our KNS topology configuration. Our implementation of hdor provides good results compared to other routing engines, obtaining the best result in some experiments, such as the EBB test of the Netgauge benchmark. It should be noted that in some experiments where there is no considerable difference between routing engines, those that use VLs occasionally outperform hdor, since our routing engine does not take advantage of the VLs; as proved in [16], using VLs in the KNS topology can improve the performance of some traffic patterns. Hence, adding VL support to our hdor implementation could improve its results, although this is out of the scope of this paper.

In comparison with the SFT topology, we have observed that some tests show similar results when using the same routing algorithms (updn, ftree, and DFSSSP). Besides, we observe that hdor obtains results close to those of the other routing engines. Therefore, we can conclude that the implementation of Hybrid-DOR in OpenSM (hdor) and the KNS topology built in CELLIA have been carried out correctly.

5 Related work

Several topologies similar to the KNS topology [17] have been proposed in recent decades. One example is the multi-ring topology [32], formed by multiple interconnected rings. In [33], the authors propose using a bus as the local network and a mesh as the global network. However, this topology is unsuitable for large machines due to the low performance of buses and meshes as the number of nodes increases.

One of the most popular topologies in recent years in the HPC field is the DragonFly [13], which groups routers to increase the effective radix of the network. However, to balance the load across the global channels in this topology, it is recommended to use non-minimal global adaptive routing. Indeed, these global channels are usually long links connecting different groups, which increases latency. The MegaFly [14] topology was proposed to increase the path diversity between groups, which is achieved by increasing the number of switches per group.

On the other hand, the recursive diagonal torus topology [34] was proposed to combine multiple tori networks, targeting large supercomputers. This topology is built on the basis of a 2D torus, adding bypass links in the diagonal direction. The proposal keeps the number of hops within a single dimension unchanged, since no new links are added to connect nodes of the same dimension, while optimizing the number of hops required when traversing multiple dimensions. However, due to its complex wiring layout, the proposal is limited to \(2^{16}\) processing nodes.

Finally, another topology similar to KNS is the BCube [35], where each processing node is connected to multiple dimensions using multiple NICs. A switch is used to connect processing nodes within the same dimension. Nevertheless, routing is performed through end-processing nodes by moving packets out of the network through one NIC and reintroducing them through another NIC. As stated in the original KNS paper [17], the BCube and other alternatives could be considered as particular configurations of the KNS family of topologies.

6 Conclusions

In conclusion, this study has demonstrated the feasibility of implementing a previously unbuilt KNS topology in a real system using InfiniBand, a widely used technology for HPC systems; according to the June 2024 Top500 list, 47.8% of supercomputers use InfiniBand as their interconnect technology. As mentioned in this paper, the interconnection network is one of the main points for improvement in HPC systems: it must not become a bottleneck that prevents nodes from reaching their maximum computational capacity and leaves them idle when messages cannot be sent or received due to network congestion. Efficient topologies and routing algorithms can be used to improve network performance. The KNS topology examined and implemented in this study is relevant to the present and future of interconnection networks because of the characteristics explained in Sect. 2.2.

After implementing it in a real cluster and testing it with standard benchmarks, we can conclude that this topology operates efficiently. Comparatively, it produces significantly better results than the SFT topology implemented in the CELLIA cluster for some benchmarks, particularly when the MPI tasks are mapped randomly among the terminal nodes. Our implementation of Hybrid-DOR has proved to be a reasonably good-performing routing engine, despite not taking advantage of InfiniBand's VLs. The results are promising, considering that better results could be obtained by applying queuing schemes and congestion control, which other routing algorithms use. Moreover, implementing the Hybrid-DOR routing algorithm in OpenSM is relatively straightforward.

On the other hand, the complexity of KNS regarding the resources required for its physical implementation can be a problem if no InfiniBand HCAs can act as routers, since using switches for this function is expensive and wastes ports. However, if HCAs with router capability are available, the KNS topology uses resources very efficiently, as demonstrated in the complexity analysis (see supplementary file). This study shows that this topology provides favorable results and should be considered for the interconnection network of future HPC and data center systems. Nevertheless, as this is the first implementation in a real system, further work is necessary, such as implementing queuing schemes using VLs and comparing the KNS performance with other contemporary low-diameter topologies like DragonFly and other indirect network configurations.