Providing quality of service in omni-path networks

New hierarchical crossbar switch architectures, such as Omni-Path (OPA) and Cray X2, have appeared to improve packet latency, reduce overall cost and increase fault tolerance of the high-performance interconnection networks in supercomputing and data center systems. These and other interconnect technologies (Infiniband or 40/100 Gigabit Ethernet) include support to provide quality of service (QoS) to the applications. In this paper, we show how this QoS support can be enabled to achieve bandwidth and/or latency differentiation in Omni-Path interconnection networks, as a representative case of hierarchical switches. To do that, three different table-based schedulers are used. We include the description of these schedulers and a comparative study by using the results obtained when we evaluate them with Hiperion, a simulation tool that implements an OPA model.


Introduction
In the last decades, there has been a constant advancement in high-speed interconnection network technologies. This development has been fueled by the growth of supercomputing and data center services, where the interconnection network is usually the limiting factor (bottleneck), i.e., the central element upon which the performance of the whole system relies. Therefore, it is critical to keep improving the overall interconnection network performance. This is achieved through continuous improvements in the physical elements of the network (links, switches, NICs, etc.) and the techniques they implement (routing algorithms, congestion avoidance mechanisms, etc.).
Moreover, the total bandwidth per switch has increased due to a combination of higher pin density and faster signaling rates. As the total bandwidth increases, switch designers face two possibilities to exploit this bandwidth: to build switches with a high number of thin ports (high-radix switches) or to build switches with a low number of fat ports (low-radix switches). The current trend is to use high-radix switches [1][2][3], as they present several advantages: the final packet latency is reduced, the overall cost of the interconnection network is reduced, the wiring is also reduced, the power dissipated by the network decreases, the network fault tolerance is increased and a distributed packet arbitration process may be applied. However, high-radix switches also face some problems: the balance between cost and efficiency is hard to maintain, increased buffer requirements drive cost up, and the virtual channel allocation process becomes more complex, among others. To address some of these issues, fully-buffered crossbar and hierarchical crossbar switch architectures have been introduced. Fully-buffered crossbar switches require a huge silicon area when the radix increases, making them unfeasible due to the associated costs. Hierarchical crossbar switch architectures overcome that drawback while achieving a very high port count. Some high performance devices such as YARC [1], Omni-Path [4] and Slingshot's Rosetta switches [5] use a hierarchical crossbar architecture to achieve high-radix interconnection devices.
Omni-Path (OPA) emerged with the aim of occupying a space in the select group of high-performance interconnection network technologies, such as InfiniBand (IB) [6] or 40/100 Gigabit Ethernet (GE) [7]. These interconnection network technologies have been competing with each other for better performance and market share. In terms of market share, since its introduction in the TOP500 list of the most powerful computers [8], OPA has ranged from 1.6 to 10%. Considering the 100 most powerful computers on the list, OPA has reached up to 13%.
Current interconnection networks carry not only traffic of applications such as backup or file transfer protocols, which does not require service differences, but also traffic from others like real-time protocols [9], MPI communications or traffic from users with different privilege levels in the system [10]. Therefore, QoS has become the focus of much discussion and research during the last decades [11,12]. A sign of this interest is the inclusion of support aimed to provide QoS on interconnection networks such as GE, IB and also OPA.
One of the most important QoS mechanisms is the scheduling algorithm [13,14]. High performance interconnection networks usually use packet switching as the switching technique. This kind of network can carry packets from different applications, users and flows, which interact with each other in every interconnection network element. Without any scheduling policy, packets from different traffic flows use as many resources as they need and, in the worst scenario, a single flow may consume all the system resources, causing starvation of the others. In this way, users may experience poor system performance even if the system is not overloaded. Therefore, the scheduling algorithm is a crucial element to provide QoS.

[...] than the two previous ones, although it is able to provide bandwidth and latency differences with a reasonable computational and implementation complexity [20].
The structure of the paper is as follows: Sect. 2 reviews the OPA architecture and our OPA-based simulation model. Section 3 explains the main output scheduling algorithms proposed. Section 4 shows the results obtained evaluating bandwidth and latency differentiation, and, finally, Sect. 5 presents some conclusions.

The OPA architecture
As stated in Sect. 1, OPA rapidly grew in popularity after it was first introduced. The OPA architecture has some elements, such as a hierarchical internal crossbar and multiple QoS tables, that make it different from the most popular high performance interconnection network architectures like IB and GE. This allows enabling QoS techniques that are simply not feasible in the rest of high performance interconnection network architectures. In order to design, explore, and evaluate the performance of these possibilities, testing tools such as simulation programs and mathematical models are required. OPA was developed by Intel until 2019, when all the OPA technology IP was transferred to Cornelis Networks, a new company that is continuing the support and development of OPA products [21,22].
As described in Sect. 1, we have chosen OPA just as an example, but the findings and conclusions could be adapted to other similar architectures such as YARC [1] or Slingshot's Rosetta [5] switches.
We have collected all relevant information about the OPA architecture and QoS support, and we have developed the simulation tool Hiperion (HIgh PERformance InterconnectiOn Network), which includes an OPA simulation model [18]. Hiperion is an open-source simulation tool available for researchers and companies and includes multiple useful mechanisms to perform many comparative studies. The simulation model includes all the main features for simulating the movement of packets between source and destination using several configurable QoS strategies. These QoS strategies will be analyzed and compared.

OPA support for QoS
The OPA architecture offers support to provide QoS to applications, flows, packets, etc. According to [17], this support is given through the following elements: Service Levels (SLs) are mapped to Service Channels (SCs) via the SL2SC tables and SCs are mapped to SLs via the SC2SL tables, depending on whether the packets are sent or received, respectively. Each SC carries traffic of a single SL in a single Traffic Class (TC), and the Fabric Manager (FM) fills in the SC2VL and VL2SC tables, determining how SCs are mapped onto Virtual Lanes (VLs) at each port and vice versa. The FM is also responsible for discovering the fabric topology, provisioning the fabric components with identifiers, formulating and provisioning routing tables, monitoring utilization, performance and error rates, and filling in arbitration tables. OPA also includes QoS mechanisms such as the VLArbitration algorithm and preemption tables. However, there is not much public information about how these mechanisms work.
Figure 1 shows an example of the use of TCs, SLs, and SCs across the paths followed by three traffic flows (red, green and blue) in an OPA network. The different links crossed by these packets are numbered from 1 to 7. Following [17], this example assumes two TCs (TC0 and TC1), three SLs (SL0, SL1 and SL2) and six SCs (SC0, SC1, SC2, SC3, SC4 and SC5). Moreover, each SL is assigned two SCs, which, in turn, are mapped to two VLs. TC0 (i.e., the red and green traffic flows) is used, for example, for a request/response high level communication library such as one based on the Partitioned Global Address Space (PGAS) model. Let us suppose that TC0 is assigned SL0 (red traffic flow) and SL1 (green traffic flow), that SL0 is mapped to SC0 and SC1, and that SL1 is mapped to SC2 and SC3. On the other hand, TC1 is used, for instance, for storage communications. It is assigned SL2, and SL2 is mapped to SC4 and SC5.
The main goal of assigning a pair of SCs to each SL is topology deadlock avoidance, as is commonly required in torus topologies, while the SLs of TC0 are used for avoiding protocol deadlocks. As we can see in the figure, packets can change SC from link to link; however, the SL and TC are always consistent end-to-end [17].
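The mappings of the Fig. 1 example can be written down as small lookup tables. The following Python sketch is only illustrative: the dictionary and function names are hypothetical, and the one-to-one SC-to-VL mapping is an assumption made for the example, since the text does not give the concrete SC2VL contents.

```python
# Illustrative lookup tables for the example of Fig. 1 (hypothetical names).
TC_OF_SL = {0: 0, 1: 0, 2: 1}              # SL0, SL1 -> TC0; SL2 -> TC1
SL2SC = {0: (0, 1), 1: (2, 3), 2: (4, 5)}  # each SL owns a pair of SCs

# SC2SL is the inverse of SL2SC, used on the receiving side.
SC2SL = {sc: sl for sl, scs in SL2SC.items() for sc in scs}

# Assumption for this sketch only: every SC has its own VL with the same index.
SC2VL = {sc: sc for sc in SC2SL}


def classify(sc):
    """Return (TC, SL, VL) for a packet that carries the given SC identifier."""
    sl = SC2SL[sc]
    return TC_OF_SL[sl], sl, SC2VL[sc]


# A packet of the red flow may switch between SC0 and SC1 from link to link,
# but its SL and TC stay the same end-to-end.
print(classify(0), classify(1))
```

Only the SC identifier travels in the packet, which is why the sketch derives the SL and TC from it instead of storing them separately.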

OPA simulation model
We have carried out the study presented in this work using simulation, as it is one of the most popular techniques to evaluate, verify and validate the behavior and performance of high performance interconnection networks. There are multiple simulation tools, such as Garnet [23], xSim [24], etc., focused on on-chip networks. These simulators allow full-system simulations, which are feasible for on-chip networks due to the small network sizes. However, when the network grows to hundreds of elements, the computational resources needed make full-system simulation impractical. Moreover, the characteristics of off-chip and on-chip traffic are disparate. There are also multiple off-chip simulation tools such as CODES [25], SST [26], etc. However, these simulation tools do not support any hierarchical crossbar switch architecture with QoS. Therefore, we have proposed an OPA-based simulation model and a simulation tool called Hiperion, based on the publicly available information [4,17]. It builds on previous tools that have been used for years in our research group and that are backed by multiple publications [27,28]. Using our own simulator gives us a deep knowledge of its operation and wide flexibility regarding the techniques that can be implemented and their interoperability.
Hiperion is a discrete-event network simulator that includes an OPA simulation model mimicking the behavior of the main OPA elements, such as switches, links and network interfaces. The main goal of the simulator is to perform comparative studies by tuning a large range of parameters such as queue sizes, topology, routing, packet sizes, scheduling algorithms, etc. The simulator is capable of running simulations using a wide variety of synthetic traffic types such as random, uniform, bit-reversal, bit-complement, etc., and MPI applications using the VEF trace framework [29]. Performance and scalability of the interconnection network are evaluated using several metrics: throughput, end-to-end latency, network latency, etc.
Figure 2 shows a detailed scheme of a 48-port OPA-based switch, as implemented in Hiperion. The OPA switch model assumes that each port delivers one flit per cycle. Hence, the bandwidth is defined by the clock rate and the flit size. However, the OPA hierarchical architecture has a large range of internal links with different bandwidths [17]. The OPA model takes the input/output port bandwidth (12.5 GB/s) as a reference, so that an x3 internal link has a speed-up of 3 and may deliver 3 flits/cycle. The number of input and output links is represented as INPORTS:OUTPORTS in the crossbar elements, i.e. the MPort xBars and the Central Crossbar. For instance, in Fig. 2, the MPort0 xBar has 4 input links and 6 output links (4:6), and the Central Crossbar has 24 input links and 48 output links (24:48). The OPA model shown in this figure includes the following elements:
• Input buffers: They store the flits from the input ports. There is one input buffer per input port.
• Routing unit: There is one routing unit per input buffer.
• MPort Xbar: This crossbar has 4 input links, one per input buffer, and 6 output links: 4 links for the output buffers and 2 links for the Central Crossbar. Note that the 75 GB/s link to the Central Crossbar is represented in this model as two x3 links, i.e., each of them may deliver 3 flits/cycle, resulting in 2 links × 3 × 12.5 GB/s = 75 GB/s.
• Output buffers: They store the flits of the output ports. There is one output buffer per output port.
• Input arbiter: Given an input buffer, it selects the virtual lane (VL) that participates in the second allocator phase. The more VLs, the bigger the arbiter is.
• Output arbiter: Given an output buffer, it chooses which input port will transmit flits. A flit can arrive at this output buffer coming from an input buffer or from the Central Crossbar.
• Output scheduler: Given an output port, it chooses which VL will transmit flits to the neighbor switch. It provides QoS to the OPA switch.
As stated before, Hiperion is a discrete-event simulation tool for modeling high-performance interconnection networks. Hiperion defines and implements the following discrete events:
• IB (Input Buffering): A flit arrives at an input port and is stored in the corresponding queue, depending on the VL. Each input buffer can receive 1 flit/cycle. If that flit is a packet header flit, it is set as RT-ready, and the routing event is called to determine the flit output port. Otherwise, the flit is set as X-ready, it is stored in the input buffer and it waits to be moved to the appropriate buffer in an Xbar event. When the output port is connected to the same MPort as the input port, the flit is moved to an output buffer. Otherwise, the flit is moved to a Central Crossbar buffer. For example, let us suppose an OPA switch with 48 ports and 4 ports per MPort (Fig. 2). If a flit needs to travel from input port 0 to output port 5, the input port belongs to MPort 0, which contains input ports 0 to 3, while the output port belongs to MPort 1, which contains output ports 4 to 7. Therefore, the flit must cross the Central Crossbar to arrive at MPort 1.
• RT (RouTing): Routes a packet and determines its output port when the packet header flit is tagged as RT-ready. After that, the header flit is tagged as VA-SA-ready and the input buffer storing this flit can be chosen in the first phase of the allocation event. The RT event is only applied to header flits. Non-header flits always follow the header flit, since the OPA architecture implements virtual cut-through as switching technique [17]. The routing function is configurable and must match the simulated topology.
• VA-SA (Virtual Allocator and Switch Allocator): Performs the allocation using a two-stage allocator:
  - Virtual Allocator: Each input arbiter chooses a VL, only if its input buffer contains at least one VA-SA-ready header flit. The winning VL will be allowed to deliver a packet. Since the Central Crossbar links have VLs as well, the virtual allocation is also performed on the input buffers of the Central Crossbar.
  - Switch Allocator: Each output arbiter chooses an input buffer with a winning VL. The winning input buffers will be allowed to move a packet to an output buffer or to a Central Crossbar buffer, depending on the destination MPort.
  Buffers allowed to transmit tag the top header flit as X-ready. A central buffer has to arbitrate between the 4 input buffers that are connected to its MPort. An output buffer has to arbitrate between the 24 Central Crossbar buffers and its 4 MPort buffers. Currently, both the virtual and the switch allocators implement round-robin arbiters. However, we are developing more sophisticated arbiters able to provide applications with QoS.
• X (Xbar): Once the allocation is performed, the winning input and Central Crossbar buffers transmit the first packet of their winning VLs to the appropriate output buffer or Central Crossbar buffer. If a packet is moved from an input buffer to a Central Crossbar buffer, the header flit is tagged again as VA-SA-ready in order to perform a VA-SA event from Central Crossbar buffers to output buffers. If the packet reaches an output buffer, its flits are tagged as OB-ready. The bandwidth depends on the input/output pair. The MPort xbars can deliver 3 flits/cycle regardless of the destination buffer, while the Central Crossbar xbar can deliver 4 flits/cycle.
• OB (Output Buffering): Each output scheduler chooses which VL will send flits to the neighbor switch. The scheduler selects a VL with OB-ready packets and enough credits to transmit at least one packet. When the last flit of the packet (i.e., the tail flit) is transmitted, the output scheduler releases the winning VL and selects a new VL. Each output port can send 1 flit/cycle. At this point, QoS and packet preemption can be applied. Currently, three scheduling algorithms have been implemented (Sect. 3), but packet preemption is not implemented yet.
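The decision taken in the IB event (local delivery versus crossing the Central Crossbar) and the link bandwidth arithmetic of the switch model can be sketched in a few lines of Python. The function and constant names below are hypothetical and are not taken from Hiperion; the numeric values are the ones quoted for the 48-port switch of Fig. 2.

```python
PORTS_PER_MPORT = 4      # ports grouped under one MPort in the Fig. 2 example
PORT_BW_GBPS = 12.5      # reference input/output port bandwidth (GB/s)


def mport_of(port, ports_per_mport=PORTS_PER_MPORT):
    """MPort that a given input or output port belongs to."""
    return port // ports_per_mport


def crosses_central_crossbar(in_port, out_port, ports_per_mport=PORTS_PER_MPORT):
    """True if the flit must travel through the Central Crossbar."""
    return mport_of(in_port, ports_per_mport) != mport_of(out_port, ports_per_mport)


def link_bandwidth(num_links, speedup, port_bw=PORT_BW_GBPS):
    """Aggregate bandwidth of `num_links` internal links with a given speed-up."""
    return num_links * speedup * port_bw


print(crosses_central_crossbar(0, 5))   # ports 0 and 5 sit in MPort 0 and MPort 1
print(link_bandwidth(2, 3))             # two x3 links towards the Central Crossbar
```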
VL buffer storage space is dynamically managed, i.e. the buffer space is shared by all the VLs. The buffer storage space is divided according to the traffic requirements, ensuring a minimum and a maximum number of flits per VL. This prevents a single VL from taking up all the space in the buffer and causing starvation of the remaining VLs. This dynamic buffer storage management strategy provides more flexibility than static buffers [30]. The main OPA QoS support elements, such as SCs, SLs, VLs, SL2SC and SC2VL tables, etc., have also been implemented in Hiperion. Some additional mechanisms that are not directly related to QoS have been implemented as well, since they are crucial in some cases. Some of them are: variable Maximum Transfer Units (MTUs) per SL, message generation based on variable MTU sizes and variable injection rate definition per SL, among others. The goal of these QoS mechanisms and of the simulation model is to develop, test and compare different QoS scheduling algorithms.
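One possible way to enforce a minimum and a maximum number of flits per VL on a shared buffer is sketched below. This is not the Hiperion implementation: the class name, its methods and the admission rule (a flit is accepted only if the unmet minimums of the other VLs can still be honoured) are assumptions chosen to illustrate the idea.

```python
class SharedVLBuffer:
    """Shared buffer with a guaranteed minimum and a cap per VL (sketch)."""

    def __init__(self, capacity, min_flits, max_flits):
        self.capacity = capacity          # total flits the buffer can hold
        self.min_flits = list(min_flits)  # guaranteed space per VL
        self.max_flits = list(max_flits)  # upper bound per VL
        self.used = [0] * len(self.min_flits)

    def can_accept(self, vl, flits=1):
        if self.used[vl] + flits > self.max_flits[vl]:
            return False                  # the VL would exceed its maximum
        # Space still owed to the other VLs to honour their minimums.
        reserved = sum(max(0, self.min_flits[v] - self.used[v])
                       for v in range(len(self.used)) if v != vl)
        return sum(self.used) + flits + reserved <= self.capacity

    def store(self, vl, flits=1):
        if not self.can_accept(vl, flits):
            raise ValueError("not enough buffer space for this VL")
        self.used[vl] += flits

    def release(self, vl, flits=1):
        self.used[vl] -= flits


buf = SharedVLBuffer(capacity=16, min_flits=[4, 4], max_flits=[12, 12])
buf.store(0, 12)              # VL0 grows up to its maximum
print(buf.can_accept(0, 1))   # VL0 cannot take more space
print(buf.can_accept(1, 4))   # VL1 still finds its guaranteed minimum
```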

Scheduling algorithms
The main goal of scheduling algorithms is to determine when packets from different SLs are delivered in order to satisfy the specified end-to-end latency and bandwidth requirements. Not all scheduling algorithms are capable of satisfying both requirements; some are only able to fulfill bandwidth requirements. Moreover, in the context of high-performance interconnection networks, scheduling algorithms must meet two main requirements: low computational complexity (the scheduler latency must be smaller than the average packet latency) and low implementation complexity (the scheduling algorithm is typically implemented in hardware, and a high implementation complexity implies a large silicon area).
In this section we detail three scheduling algorithm proposals adapted to the context of hierarchical-crossbar-switch architectures, specifically, to the OPA architecture.

The round-robin output scheduler
The round-robin output scheduler is the simplest output scheduler. Its main goal is to distribute the total bandwidth equally among all SLs. The bandwidth that each SL obtains is $1/NumSLs$, where $NumSLs$ is the total number of SLs. This scheduler could be based on an arbitration table or on a hardware-implemented algorithm. Although both approaches are feasible as long as the bandwidth is properly distributed, we have chosen the arbitration table because the other algorithms presented in this work are also based on arbitration tables, as we will explain in Sects. 3.2 and 3.3. In fact, this scheduler can be implemented using the SBT scheduler, equally distributing all the bandwidth among all the SLs. Table 1 shows an example of an SBT scheduling table configured to work in a round-robin way. The actual value of the initial entry weights is not relevant as long as it is the same in every table entry. Since the round-robin scheduler is a particular SBT configuration, the details about how it works can be found in Sect. 3.2. Note that the round-robin output scheduler does not provide any QoS differences. We have considered this scheduler in order to establish a comparison baseline.

The simple bandwidth table mechanism
Simple Bandwidth Table (SBT) is a table-based scheduler. It is one of the simplest techniques to provide bandwidth differences in a high performance interconnection network.
The SBT scheduler is based on one arbitration table per output port with as many entries as SLs are considered. Each table entry is assigned to one SL and stores an entry weight. This weight represents how many packets an SL may deliver. Every time an SL delivers a packet, its entry weight is decremented, until it reaches zero. Table 2 shows an example considering two SLs, where SL0 has a weight of 55 and SL1 has a weight of 45. If, in a given output port, SL0 delivers 3 packets, its remaining weight will be 52. Therefore, the fraction of the total bandwidth $\phi_i$ assigned to SL$_i$ is

$$\phi_i = \frac{weight_i}{\sum_{j=0}^{N-1} weight_j},$$

where $N$ is the total number of SLs and $weight_j$ is the entry weight assigned to SL$_j$. In the SBT arbitration table (Table 2), SL0 will get 55% of the total bandwidth and SL1 will get 45% of the bandwidth. In our proposal, for the sake of simplicity, $\sum_{j=0}^{N-1} weight_j$ must be equal to 100. In this way, the bandwidth percentage of each SL can be easily obtained.

Table 1 Round-robin table QoS algorithm sample
SL   Weight
0    50
1    50
The arbitration table is cycled through in a round-robin way when the entry weight of the SL in transmission reaches zero. The table is also cycled when the SL in transmission becomes "inactive", i.e. the SL has no packets to transmit. When the sum of all entry weights is zero, the initial entry weights are restored. Note that an SL can only transmit when its weight is greater than zero. However, there is an exception: when an active SL does not have enough weight left but it is the only active SL, the transmission of packets is allowed. This exception avoids packet starvation and wasted link bandwidth.
Finally, note that the bandwidth is distributed by SL, not by VL. Otherwise, if some SLs have a different number of VLs assigned than others, the total bandwidth cannot be distributed correctly between the SLs. Let us suppose two SLs and three VLs in the network. SL0 can use two VLs and SL1 can use the remaining VL. We want to assign 50% of the bandwidth to each SL, and we assign the same weight to each VL. Then, SL0 will get 2/3 of the total bandwidth, while SL1 will only get 1/3. It would also be possible to distribute the bandwidth by VL instead of by SL, but this complicates the table configuration and offers no added benefit.
Algorithm 1 shows the generic mechanism of the SBT scheduler on every port. Note that the first_flit() function extracts the first flit from a given VL queue and the is_active() function determines whether an SL is active (i.e. it has packets to transmit) or not. Since OPA uses virtual cut-through as switching technique, these algorithms are only applied to header flits, so that body and tail flits always follow the header flit at a distance of one flit. Furthermore, the SC identifier is the only QoS identifier stored in packets [17]. For this reason, SC2SL tables are used to get the SL identifier from the SC identifier of the packet.
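The SBT rules described above can be sketched in Python as follows. This is a simplified, hypothetical model and not a transcription of Algorithm 1: it works directly on SLs (the SC2SL lookup and the VL queues are left out), every call schedules exactly one packet, and the fallback branch generalizes the "only active SL" exception to any set of active SLs that have run out of weight, which is an assumption of this sketch.

```python
class SBTScheduler:
    """Simple Bandwidth Table scheduler for one output port (sketch)."""

    def __init__(self, weights):
        self.initial = list(weights)    # configured entry weight per SL
        self.remaining = list(weights)  # weight left in the current round
        self.current = 0                # table entry being served

    def next_sl(self, active):
        """Return the SL allowed to send one packet, or None if all are idle.

        `active[sl]` tells whether SL `sl` has packets to transmit.
        """
        n = len(self.initial)
        if not any(active):
            return None
        if sum(self.remaining) == 0:
            self.remaining = list(self.initial)   # restore the initial weights
        # Stay on the current entry while it is active and has weight left;
        # otherwise cycle through the table in a round-robin way.
        for step in range(n):
            sl = (self.current + step) % n
            if active[sl] and self.remaining[sl] > 0:
                self.current = sl
                self.remaining[sl] -= 1
                return sl
        # Exception: no active SL has weight left (the others are inactive),
        # so an active SL may still transmit to avoid wasting the link.
        for step in range(n):
            sl = (self.current + step) % n
            if active[sl]:
                self.current = sl
                return sl


sched = SBTScheduler([55, 45])
served = [sched.next_sl([True, True]) for _ in range(200)]
print(served.count(0), served.count(1))   # shares follow the 55/45 weights
```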
The main advantages of SBT are its capacity to provide bandwidth differences and its very low computational and implementation complexity (Sect. 3.2.1). However, SBT is not able to provide latency differences, which could be crucial in many scenarios.

Complexity considerations
In terms of computational complexity, SBT is quite simple. In this case, arbitration tables have as many table entries as SLs. OPA supports up to 32 SLs according to [17]. Hence, in the worst case, if all table entries have to be looked over in order to find the next active SL, at most 32 table entries will be visited.
One of the most computationally complex tasks in Algorithm 1 is the is_active() function. However, the optimization strategy suggested in [20] may be used in order to keep the complexity low. Regarding the implementation complexity, keeping an arbitration table per output port would require a large silicon area in hardware implementations. Therefore, instead of keeping a table per output port, a single table per switch with the structure shown in Table 3 may be used. The arbitration table has as many rows as SLs ($N$) and as many columns as output ports ($p$) plus 2 extra columns ($p + 2$). The first two columns hold the SL$_i$ identifier and the weight $x_i$ associated with SL$_i$. The other columns hold the remaining weight $x_i - \delta_{i,j}$ of each SL for each output port $j$, where $\delta_{i,j}$ denotes the weight already consumed by SL$_i$ at port $j$. Every output port column is initially populated with the weight associated with each SL. When the remaining weights in a given column sum to zero, the values from the Weight column are copied to the column of that port and thus the port is allowed to deliver packets again.
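The single-table-per-switch layout can be sketched as a list of per-port columns that share one Weight column. The class and method names are hypothetical, and the sketch models only the bookkeeping of the remaining weights, not the arbitration itself.

```python
class SwitchArbitrationTable:
    """One arbitration table per switch: a Weight column plus one column of
    remaining weights per output port (sketch)."""

    def __init__(self, weights, num_ports):
        self.weights = list(weights)                       # x_i, one per SL
        self.remaining = [list(weights) for _ in range(num_ports)]

    def consume(self, port, sl, packets=1):
        """Account for `packets` delivered by SL `sl` through `port`."""
        column = self.remaining[port]
        column[sl] = max(0, column[sl] - packets)
        if sum(column) == 0:
            # The whole column is exhausted: copy the Weight column again.
            self.remaining[port] = list(self.weights)


table = SwitchArbitrationTable(weights=[55, 45], num_ports=48)
table.consume(port=0, sl=0, packets=3)
print(table.remaining[0])   # only the column of port 0 has changed
print(table.remaining[1])
```

Each port only touches its own column, so the ports can be served independently while the configured weights are stored once.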

The DTable scheduling mechanism
As explained in Sect. 3.2, SBT is not able to provide latency differences. Furthermore, SBT has other problems that we will discuss in Sect. 4. Therefore, we implemented, adapted and tested the DTable scheduler [31] on our OPA-based simulation model.
The DTable scheduler is based on an arbitration table with a structure similar to that of SBT arbitration tables: each table entry has a column for an SL identifier and another column for an associated weight. However, there is an important difference between SBT and DTable arbitration tables: SBT arbitration tables have as many table entries as SLs, whilst DTable arbitration tables have a larger, arbitrary number of table entries, e.g. 32, 64, 128, etc. This difference is used to provide latency differences among SLs. The number of table entries and the maximum distance between any pair of consecutive table entries assigned to the same SL allow controlling the SL latency [32]. Note that now each SL can have multiple table entries, and therefore, the bandwidth $\phi_i$ assigned to SL$_i$ is

$$\phi_i = \frac{\sum_{j \in J} weight_j}{\sum_{k=0}^{N-1} weight_k},$$

where $J$ is the set of table entries assigned to SL$_i$, $weight_j$ is the entry weight assigned to table entry $j$, and the sum in the denominator runs over all $N$ table entries.
Moreover, each SL is assigned a deficit counter, initially set to 0. The deficit counters represent the weight that the scheduler owes to the SLs. The purpose of this counter is explained further on. When scheduling is needed, arbitration tables are cycled through sequentially in a round-robin way until an active SL is found. The DTable scheduler also has an accumulated weight counter, which is equal to the sum of the selected table entry weight and the SL deficit counter. The scheduler will deliver as many packets from the selected SL as the accumulated weight allows. The accumulated weight is decremented when packets are transmitted.

(Table 3: arbitration table implementation with a single table per switch.)
Two situations make the scheduler select the next active table entry:
1. The SL becomes inactive. In this case the remaining accumulated weight is discarded and the deficit counter is set to zero.
2. The accumulated weight becomes smaller than the size of the packet at the head of the queue. In this case the accumulated weight is saved in the deficit counter.
Algorithm 2 shows a generic DTable scheduler. When the scheduler gets the "Next table entry assigned to an active SL" (line 19), the arbitration table is cycled through in a round-robin way until an active SL is found. The function returns the entry identifier and a VL associated with the selected SL. As stated in Sect. 2.1, an SL may span multiple SCs. In that case, the function arbitrates between the SCs belonging to the same SL in a round-robin way, and it selects the VL through the SC2VL tables. For instance, in a given configuration, SC0, SC1 and SC2 have been associated with SL0, and VL0, VL1 and VL2 with SC0, SC1 and SC2, respectively. The first time that SL0 is allowed to deliver packets, it will deliver packets from SC0 and VL0; the second time, SL0 will deliver packets from SC1 and VL1, etc. Obviously, other SC selection strategies can be applied, such as dividing the accumulated weight among the SCs of the same SL.
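The interplay between entry weights, the accumulated weight and the deficit counters can be sketched as follows. This is a simplified, hypothetical model rather than a transcription of Algorithm 2: queues are kept per SL (the round-robin choice among the SCs and VLs of an SL is omitted), and packet sizes are expressed in the same units as the entry weights.

```python
class DTableScheduler:
    """Deficit-table scheduler for one output port (sketch)."""

    def __init__(self, table, num_sls):
        self.table = list(table)        # list of (sl, entry_weight) pairs
        self.deficit = [0] * num_sls    # weight owed to each SL
        self.pos = -1                   # last table entry that was served

    def serve_next(self, queues):
        """Serve the next table entry assigned to an active SL.

        `queues[sl]` is a list of packet sizes, head of the queue first.
        Returns (sl, sizes_sent), or None when no SL is active.
        """
        n = len(self.table)
        for step in range(1, n + 1):
            idx = (self.pos + step) % n
            sl, weight = self.table[idx]
            if queues[sl]:              # the SL is active
                break
        else:
            return None
        self.pos = idx
        accumulated = weight + self.deficit[sl]
        sent = []
        while queues[sl] and queues[sl][0] <= accumulated:
            size = queues[sl].pop(0)
            accumulated -= size
            sent.append(size)
        # Case 1: the SL became inactive, the leftover weight is discarded.
        # Case 2: the head packet does not fit, the leftover is saved.
        self.deficit[sl] = accumulated if queues[sl] else 0
        return sl, sent


sched = DTableScheduler(table=[(0, 2), (1, 2)], num_sls=2)
queues = {0: [3, 3], 1: [1, 1, 1]}
while True:
    result = sched.serve_next(queues)
    if result is None:
        break
    print(result)
```

In the usage example, SL0 cannot send its 3-unit packet with an entry weight of 2 on its first turn, so the weight is carried over in the deficit counter and the packet is sent on the next turn.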


The DTable scheduler and variable OPA MTUs
In our original OPA-based simulation model, described in Sect. 2.2, the global MTU is one packet of 128 bytes (i.e. 16 flits of 64 bits).
However, if the MTU of every SL is one packet and the minimum entry weight is also one packet, the deficit counter will never be used. Moreover, the main advantage of the DTable scheduler is the use of different MTUs for different SLs [33], which allows decoupling the bandwidth assignments from the latency requirements (see Sect. 3.3.2 for further details).
To achieve this, we have modified the message delivery system. Before sending a message to the next network element, the DTable scheduler has to ensure that: i) the entire message fits in the receiving buffer of the neighbour and ii) there is enough remaining weight for the selected VL. To this end, SL_MTU tables are used, which have as many entries as SLs; each entry stores the MTU associated with one SL. The message generation is also based on those tables. For instance, if a given SL has an MTU of three packets, the SL will always generate messages of three packets. Moreover, when a transmission is performed, the SL will deliver three consecutive packets. Note that because of the switching technique used (i.e. virtual cut-through) and the atomic message delivery system, all flits of the same message are stored, sent and received consecutively. Hence, VL buffers must have enough space (i.e. flow control credits) for storing at least the biggest MTU in the system.
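The two conditions checked before a message is sent, together with the buffer sizing rule, can be sketched as follows. The names are hypothetical, the SL_MTU table is modelled as a plain dictionary, and MTUs, weights and credits are all expressed in packets for simplicity.

```python
# Hypothetical SL_MTU table: MTU (in packets) associated with each SL.
SL_MTU = {0: 1, 1: 3, 2: 2}


def can_send_message(sl, accumulated_weight, neighbour_credits, sl_mtu=SL_MTU):
    """True if a whole message of SL `sl` can be delivered atomically."""
    mtu = sl_mtu[sl]
    fits_in_neighbour = neighbour_credits >= mtu     # condition i)
    enough_weight = accumulated_weight >= mtu        # condition ii)
    return fits_in_neighbour and enough_weight


def min_vl_buffer_size(sl_mtu=SL_MTU):
    """Smallest VL buffer (in packets) able to hold the biggest MTU."""
    return max(sl_mtu.values())


print(can_send_message(sl=1, accumulated_weight=3, neighbour_credits=2))
print(can_send_message(sl=1, accumulated_weight=3, neighbour_credits=3))
print(min_vl_buffer_size())
```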

DTable configuration methodology
In order to provide applications, flows or SLs specific QoS differences, DTable arbitration tables must be configured in a proper way. DTable scheduling mechanisms themselves do not provide QoS without applying a proper configuration methodology [19].
As stated in Sect. 3.3, the maximum distance between any pair of consecutive table entries assigned to the same SL allows controlling the latency distribution among SLs [32]. In a given arbitration table configured to meet the latency requirements of the SLs, we would like to be able to assign SL$_i$ a certain bandwidth $\phi_i$ in a flexible way. In other words, this means keeping the minimum bandwidth $\min\phi_i$ that can be assigned to SL$_i$ as small as possible, and the maximum bandwidth $\max\phi_i$ assignable to SL$_i$ as large as possible. Table 4 shows the definition of all parameters involved in the configuration methodology.
The maximum total weight that can be divided among the table entries is $M \times N$. However, we have fixed it to a lower value called $pool$, which is determined by the $k$ configuration parameter. Sect. 3.3.1 explains that a specific MTU value can be assigned to each SL. Then, the bandwidth $\phi_i$ assigned to SL$_i$ is:

$$\phi_i = \frac{\sum_{j \in J} weight_j}{pool},$$

where $J$ is the set of table entries assigned to SL$_i$ and $weight_j$ is the weight assigned to table entry $j$. Therefore, the minimum and maximum bandwidth values assignable to SL$_i$ are:

$$\min\phi_i = \frac{n_i \times MTU_i}{pool}, \qquad \max\phi_i = \frac{n_i \times M}{pool}.$$

Let us define $M$ and $pool$ using the $GMTU$ parameter and the decoupling parameters $w$ and $k$:

$$M = w \times GMTU, \qquad pool = N \times k \times GMTU,$$

where $k \le w$ because the bandwidth pool cannot be larger than $N \times M$. Hence, the maximum and minimum bandwidth depend not only on the proportion of table entries $n_i / N$, but also on the $w$ and $k$ parameters and on the proportion between the specific $MTU_i$ and $GMTU$:

$$\min\phi_i = \frac{n_i}{N} \times \frac{1}{k} \times \frac{MTU_i}{GMTU}, \qquad \max\phi_i = \frac{n_i}{N} \times \frac{w}{k}.$$

Therefore, the parameters $w$ and $k$ and the specific $MTU_i$ assigned to each SL allow varying the maximum and minimum bandwidth assignable to SLs without affecting the final latency [19].
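The bandwidth bounds can be evaluated numerically with a short Python sketch. It assumes that $M = w \times GMTU$ and $pool = N \times k \times GMTU$, that every entry of SL$_i$ holds at least one $MTU_i$ and at most $M$, and that $n_i$ is the number of table entries assigned to SL$_i$; the function name and the sample values are hypothetical.

```python
def bandwidth_bounds(n_i, n_entries, mtu_i, gmtu, w, k):
    """Return (min, max) bandwidth fraction assignable to one SL.

    n_i       -- number of table entries assigned to the SL
    n_entries -- total number of table entries (N)
    mtu_i     -- MTU of the SL; gmtu -- global MTU (same units)
    w, k      -- decoupling parameters, with k <= w
    """
    if k > w:
        raise ValueError("k must not exceed w (pool <= N * M)")
    max_entry_weight = w * gmtu           # M
    pool = n_entries * k * gmtu
    min_phi = n_i * mtu_i / pool          # every entry holds one MTU_i
    max_phi = n_i * max_entry_weight / pool   # every entry holds M
    return min_phi, max_phi


# An SL owning a quarter of a 64-entry table, with an MTU four times
# smaller than the global MTU, w = 2 and k = 1.
print(bandwidth_bounds(n_i=16, n_entries=64, mtu_i=1, gmtu=4, w=2, k=1))
```

With these values the SL can be assigned anything between 6.25% and 50% of the bandwidth while keeping the same entry placement, and hence the same latency behaviour.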

DTable bandwidth correction algorithm
Once the configuration methodology has been applied, we can choose a bandwidth $\phi_i$ for each SL within the given $[min\phi_i, max\phi_i]$ range. Then, the total entry weight $Tweight_i$ has to be computed as $pool \times \phi_i$. After that, we have to obtain the entry weight as $Tweight_i / n_i$ for each SL, where $n_i$ is the number of table entries of SL$_i$, and fill in the arbitration tables with these values. As stated in Sect. 3.3, the entry weight represents how many packets can be delivered from an active SL, so it must be at least large enough to deliver one packet/MTU. Moreover, it must be an integer value because floating-point numbers would cause some issues:
• The fractional part will only be useful once it has accumulated in the deficit counter and the sum is equal to one packet/MTU.
• The final hardware implementation will require more silicon area due to the IEEE 754 floating-point representation [34].
• The final entry weight may not be enough to deliver a packet/MTU from an active SL without cycling through the arbitration tables several times.
To avoid these issues, the entry weight obtained as $Tweight_i / n_i$ is always rounded up.
However, this rounding can produce some bandwidth imprecision. Table 5 shows an example of this issue. In this example, each SL should get $\phi_i = \frac{1}{3}$ of the pool. However, as seen in the $R_i$ column, the real bandwidth of SL$_i$ is not $\frac{1}{3}$ of the pool. Specifically, $R_0 = \frac{448}{1216}$, $R_1 = \frac{384}{1216}$ and $R_2 = \frac{384}{1216}$. To solve this issue, the DTable bandwidth correction algorithm is applied. First, the bandwidth difference between $\phi_i$ and $R_i$ is obtained. The column $\phi_i - R_i$ of Table 5 shows the bandwidth differences. Secondly, the amount of extra weight that the table entries of SL$_i$ require, called $Dweight_i$, is calculated:

$$Dweight_i = -(R_i - \phi_i) \times pool$$

For instance, in Table 5 we have $Dweight_0 = -(0.03502 \times 1216) = -43$, $Dweight_1 = -(-0.01751 \times 1216) = 21$, etc. Finally, the $Dweight_i$ value is added to $\sum_{j=0}^{n_i-1} weight_j$, giving $Fweight_i$. As can be seen in the $F_i$ column of Table 5, the final bandwidths are very close to the desired ones.
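The rounding step and the resulting weight correction can be reproduced with exact rational arithmetic. The following is a minimal sketch under our reading of the text: the entry weight is the total weight of the SL divided by its number of entries and rounded up, and the correction is the difference between desired and real bandwidth scaled by the pool (1216 in the Table 5 example) and rounded to the nearest integer. The function names are hypothetical.

```python
import math
from fractions import Fraction


def entry_weight(bandwidth: Fraction, pool: int, n_entries: int) -> int:
    """Integer weight per table entry: Tweight / n_entries, rounded up."""
    tweight = bandwidth * pool
    return math.ceil(tweight / n_entries)


def dweight(desired: Fraction, real_weight: int, pool: int) -> int:
    """Extra weight (possibly negative) needed to reach the desired share."""
    real = Fraction(real_weight, pool)
    return round((desired - real) * pool)


third = Fraction(1, 3)
print(dweight(third, 448, 1216))  # SL0 in the Table 5 example
print(dweight(third, 384, 1216))  # SL1 in the Table 5 example
```

With the values quoted from Table 5, the two calls yield −43 and 21, matching the example in the text. Using `Fraction` instead of floats avoids spurious round-ups when the division happens to be exact.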
Another important aspect is how and when $Dweight_i$ is added to the arbitration tables. Assuming that the DTable configuration and adjustment are done by the FM during the start-up process, the simplest strategy is: (i) to populate a pre-arbitration table with the bandwidth imperfections discussed here; (ii) to run the DTable correction algorithm; and (iii) to send the final arbitration table to the network elements. However, there are many possible ways to add $Dweight_i$. In our OPA-based simulation model, the algorithm always starts from the end of the arbitration table and adjusts the weight of the entries of each SL in a round-robin way. Table 6 shows an arbitration table where the SLs have a $Dweight_i$ of −3, 1 and 2 for SL0, SL1 and SL2, respectively.
The first three rows show the arbitration table before running the DTable bandwidth correction algorithm and the last three rows after running it. The first and fourth rows show the table entry identifiers and the third and sixth rows the SL identifier and the associated weight respectively. The algorithm starts with SL0 at entry 6, performing 4 + (−1) = 3, moves to entry 4, performing 4 + (−1) = 3, and finishes with entry 2, performing 4 + (−1) = 3. Then, the algorithm [...] On the other hand, it could be interesting to study a different approach to find out whether there are differences between starting from the bottom and from the top of the table. However, it is essential that the system checks, during the adjustment process, that the weight of each entry is still enough to deliver a packet/MTU.
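The distribution of the correction weight over the table can be sketched as follows. This is a simplified illustration of the strategy described above (start from the end of the table, change one unit of weight per entry in a round-robin way, and never leave an entry below the weight needed for one packet/MTU). It is not the code of the simulator, and the table layout, the function name `apply_correction` and the error handling are assumptions.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int]  # (SL identifier, entry weight)


def apply_correction(table: List[Entry], dweight: Dict[int, int],
                     min_weight: Dict[int, int]) -> List[Entry]:
    """Return a corrected copy of the arbitration table."""
    corrected = [list(entry) for entry in table]
    for sl, delta in dweight.items():
        # Entries of this SL, visited from the end of the table backwards.
        positions = [i for i in range(len(corrected) - 1, -1, -1)
                     if corrected[i][0] == sl]
        step = 1 if delta > 0 else -1
        remaining = abs(delta)
        cursor = 0
        blocked = 0  # consecutive entries that could not be adjusted
        while remaining > 0:
            if not positions or blocked == len(positions):
                raise ValueError(f"cannot apply Dweight {delta} to SL {sl}")
            index = positions[cursor % len(positions)]
            cursor += 1
            new_weight = corrected[index][1] + step
            if new_weight < min_weight[sl]:
                blocked += 1  # entry must still deliver one packet/MTU
                continue
            corrected[index][1] = new_weight
            remaining -= 1
            blocked = 0
    return [(sl, weight) for sl, weight in corrected]


# Hypothetical 8-entry table: SL0 owns entries 2, 4 and 6 with weight 4.
table = [(1, 2), (2, 2), (0, 4), (1, 2), (0, 4), (2, 2), (0, 4), (1, 2)]
print(apply_correction(table, {0: -3, 1: 1, 2: 2}, {0: 1, 1: 1, 2: 1}))
```

In this hypothetical table, SL0 ends with a weight of 3 on entries 6, 4 and 2, as in the walk-through above; the weights of the SL1 and SL2 entries are made up for the example.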

Performance evaluation
In this section, we evaluate the performance of DTable and SBT proposals against a round-robin scheduler as the baseline reference. We have used our simulator Hiperion which implements the simulation model explained in Sect. 2.2, as well as the QoS mechanisms detailed in Sect. 2.1. Note that although we use OPA for configuring the network parameters, our proposal can be applied to any hierarchical-crossbar-switch based interconnection network.
We have also evaluated the QoS mechanisms in two different scenarios. In the first scenario, the network has been evaluated using a synthetic traffic model composed of several traffic flows. These flows represent the network load generated by applications commonly found in clusters and data centers. In the second scenario, the synthetic HPC flow is replaced by the traffic of real MPI applications using the VEF trace framework [29].
In Sect. 4.1 we present the network model used in the performance evaluation. Section 4.2 presents the synthetic scenario and its results, while Sect. 4.3 includes the evaluation and results obtained using the MPI traces.

Network model
We have used two different interconnection topologies with two different layouts each: a 2D Torus with 8x8 switches, a 3D Torus with 8x8x4 switches, an 8-ary 3-tree with 192 switches and a 24-ary 2-tree with 48 switches. The configuration of each scenario is the following:
• The 2D Torus configuration has 512 endpoints (NICs). Each switch has 48 ports: eight single links to endpoints and four 10x trunk links to neighboring switches.
• The 3D Torus configuration has 1024 endpoints; the switches have a radix of 28 with 4x trunk links.
• The 8-ary 3-tree has been configured with 512 NICs and 16-port switches.
• The 24-ary 2-tree has a total of 576 endpoints and 48-port switches.
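The switch, endpoint and port counts listed above follow from the usual torus and k-ary n-tree formulas, and can be cross-checked with a short script. The helper names are hypothetical, and the value of four endpoints per switch in the 3D Torus is inferred here from 1024 endpoints over 8x8x4 switches rather than stated in the text.

```python
from math import prod
from typing import Dict, Sequence


def torus(dims: Sequence[int], nics_per_switch: int,
          trunk_width: int) -> Dict[str, int]:
    """Counts for an n-dimensional torus with trunked inter-switch links."""
    switches = prod(dims)
    # Two neighbours per dimension, each reached through one trunk.
    ports = nics_per_switch + 2 * len(dims) * trunk_width
    return {"switches": switches,
            "nics": switches * nics_per_switch,
            "ports": ports}


def k_ary_n_tree(k: int, n: int) -> Dict[str, int]:
    """Counts for a k-ary n-tree built from switches with 2k ports."""
    return {"switches": n * k ** (n - 1), "nics": k ** n, "ports": 2 * k}


print(torus((8, 8), nics_per_switch=8, trunk_width=10))
print(torus((8, 8, 4), nics_per_switch=4, trunk_width=4))
print(k_ary_n_tree(8, 3))
print(k_ary_n_tree(24, 2))
```

The four calls give 64, 256, 192 and 48 switches with 512, 1024, 512 and 576 NICs and radices of 48, 28, 16 and 48, which agrees with the configurations listed above.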
We have chosen these topologies because they are very common and well known solutions in high performance environments. The detailed explanation about the switch architecture can be found in Sect. 2. The SL2SC and SC2VL tables configuration is shown in Table 7. For instance, the SL VO has two SCs, SC0 and SC1, and they have VL0 and VL1 associated respectively. Further details about SLs will be provided in Sect. 4.2.1.
The switch model implements a credit-based flow control protocol. Packets are transmitted only when there is enough buffer space in the next network device. Therefore, packets are not dropped when congestion appears. Traffic with similar characteristics is aggregated via SLs; packet scheduling is performed with SLs and flow control with VCs. According to [17], the GMTU of OPA messages may be up to 8KB, but we have used a GMTU of 1KB in this evaluation for the sake of simplicity. Nevertheless, the evaluation may be performed with greater MTUs using larger buffers. The credit-based flow control unit is 64 bytes, and thus the GMTU is up to 16 credits.
As stated before, we have used input, output and central buffer queuing architecture. The buffer capacity is 65,536 bytes (64 × GMTU) per input and output ports of switches and 32,768 bytes (32 × GMTU) at the network interfaces. The central crossbar buffer capacity is 131,072 bytes (128 × GMTU) per MPort. If an application wants to inject a packet into a network interface queue but the queue is full, we assume that the packet is stored in the application layer queue.
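Since flow control works in 64-byte credits, it can be convenient to express these buffer sizes in credits as well. The following sketch only restates the figures given above; the dictionary keys and the helper `to_credits` are hypothetical names.

```python
CREDIT_BYTES = 64   # flow control unit
GMTU_BYTES = 1024   # GMTU used in this evaluation


def to_credits(size_bytes: int) -> int:
    """Convert a size in bytes into whole flow control credits."""
    if size_bytes % CREDIT_BYTES:
        raise ValueError("size is not a whole number of credits")
    return size_bytes // CREDIT_BYTES


BUFFERS_BYTES = {
    "switch input/output port": 64 * GMTU_BYTES,
    "network interface": 32 * GMTU_BYTES,
    "central crossbar (per MPort)": 128 * GMTU_BYTES,
}

BUFFERS_CREDITS = {name: to_credits(size)
                   for name, size in BUFFERS_BYTES.items()}
print(to_credits(GMTU_BYTES), BUFFERS_CREDITS)
```

With a GMTU of 16 credits, the switch ports, network interfaces and central crossbar buffers hold 1024, 512 and 2048 credits respectively, so every buffer can store many maximum-size messages.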

Performance evaluation using the synthetic traffic model
In this section we explain the details of the evaluation performed using synthetic traffic.

Traffic model
Table 8 shows each traffic type considered. There are five types of traffic flows: three SLs with explicit QoS requirements, such as latency and bandwidth, and two SLs for best-effort traffic with slightly different levels of priority. The packets of each SL have been generated using different Constant-Bit-Rate (CBR) distributions. We have selected the following packet payloads for each SL:
• Voice (VO) traffic is generated using a packet payload of 128 bytes. According to [35], the payload of voice packets ranges from 20 to 160 bytes.
• Video (VI) traffic is generated using a packet payload of 256 bytes. According to [36], a payload ranging from 100 bytes to 64KB is feasible.
• Controlled Load (CL) traffic is generated using a packet payload of 512 bytes, representing a possible average packet payload of many HPC application communications.
• The traffic of the best-effort SLs, Best-effort (BE) and Background (BK), is generated using a packet payload of 1024 bytes.
For all cases, the destination pattern is uniform in order to fully load the network. Note that we have chosen a heterogeneous scenario where multiple types of traffic are mixed. However, our proposal is aimed at any environment where flows with different QoS requirements coexist in a high-performance network.

Simulated scenario and scheduler configurations
We have assumed a scenario where the goal is to give the voice traffic 10% of the egress link bandwidth and the lowest packet latency; the video traffic 30% of the bandwidth and a higher packet latency than the voice traffic; the controlled load traffic around 50% of the bandwidth and a higher packet latency than the voice traffic; and the best-effort traffic the remaining 10% of the bandwidth and the highest latency. The bandwidth percentages are intended to represent, as closely as possible, a realistic combination of traffic and QoS needs from applications with different requirements. We have configured the schedulers according to these traffic requirements. As mentioned in Sect. 3.2, SBT is the simplest QoS algorithm in terms of complexity and configuration. We have filled in each row of the SBT tables with a weight proportional to the percentages mentioned above. That is, a weight of 10 for the first table row (VO traffic), a weight of 30 for the second table row (VI traffic), etc. SBT does not require any further configuration. Note, however, that the total table weight has to be 100.
In the case of the DTable scheduler, the configuration process is more complex. We have applied the decoupling methodology explained in Sect. 3.3 and in [31], distributing the table entries among SLs according to their latency requirements. To do that, we have established the maximum distance between two consecutive table entries of the same SL as follows: a maximum distance of two entries for SL VO and a maximum distance of 16 for SL BE and SL BK. Table 9 also shows the total number of table entries (#entr.) and the proportion of table entries given to each SL (%entr.). For maximum flexibility, the MTU of each SL has been set as small as the expected packet size of each traffic type. Specifically, we have set an MTU of 128 bytes for VO, an MTU of 256 bytes for VI, an MTU of 512 bytes for CL and an MTU of 1024 bytes, which is the maximum, for BE and BK traffic.
Finally, we have configured proper values for the w and k parameters. The main condition that we have taken into account is that we want SL CL to get a bandwidth several times higher than the proportion of table entries assigned to it. Moreover, the SL VO has been assigned a high proportion of table entries, whilst it requires a small proportion of bandwidth. However, it is important to keep the value of the k parameter as small as possible in order to obtain good latency performance. We have finally chosen a value of 8 for w and a value of 2 for k. This combination of values allows us to get a $[min\phi_i, max\phi_i]$ range of assignable bandwidth that fits the bandwidth needed. Table 9 shows the minimum and maximum bandwidth that may be assigned to each SL with this configuration.
Table 10 shows the total amount of traffic that each SL injects, expressed in flits/cycle/NIC (Inj. column). This table also shows the total weight (T.W.) that we have distributed among the table entries of each SL and the weight assigned to each table entry (E.W.) of each SL. Note that the SL VO and the SL CL have an E.W. of 6-7 and 130 respectively, due to the DTable bandwidth correction algorithm (Sect. 3.3.3). On the one hand, the SL VO has a $Dweight_0 = -32$ and therefore the first 32 table entries have a weight of 7 and the next 32 table entries have a weight of 6. On the other hand, the SL CL has a $Dweight_2 = 32$ and therefore each table entry has a weight of 130. The rest of the SLs are not affected by the DTable bandwidth correction algorithm because their obtained bandwidth is equal to the configured bandwidth. Columns $R_i$ and $F_i$ show the bandwidth percentage assigned to each SL before and after applying the bandwidth correction algorithm, respectively. Without the adjustment, SLs VO and CL would get 11% and 49% instead of the desired 10% and 50%, respectively. In this specific example, the bandwidth percentage difference is 1%.
Nevertheless, in a scenario where link bandwidths are up to 12.5 GB/s, those differences could have a significant impact on the application execution over time. Besides, without the bandwidth correction algorithm, system administrators would be forced to find an appropriate combination of parameters, i.e. a combination of MTU, k and w values, that yields the required bandwidth distribution, which in some cases may not be possible. Note that in the case of SBT, the scheduler has only one entry for each SL, because entry weights and the total weight are equal.

Simulation results using the synthetic workload
In this section, simulation results are shown. The values shown for each injection rate are the average of 30 different simulations varying the seed of the random number generator. We have used two metrics to evaluate the networks and the different QoS mechanisms:
• End-to-end latency: Message latency from generation to delivery. It is the latency that users will experience.
• Normalized SL throughput: Total amount of data, expressed in flits/cycle/NIC, transmitted through the interconnection network. This metric has been divided by SL and normalized to the total throughput.
Figures 3a, c, 4a and c show the end-to-end latency in the 2D Torus, 3D Torus, 8-ary 3-tree and 24-ary 2-tree topologies, respectively. Note that each SL is drawn with the same color and line pattern for every QoS algorithm, while each QoS algorithm always uses the same point style: e.g. the SL VO is plotted with a line-dot pattern in blue, SLs under DTable (DT) are marked with a circle, SLs under SBT with a square and SLs under the round-robin baseline (RR) with a cross. As explained in Sect. 3.2, the SBT and RR algorithms do not provide latency differentiation, which can be seen in the end-to-end latency figures: the higher the generation rate assigned to an SL, the higher its latency. The only exception is the SLs BE and BK, which achieve a slightly higher latency because they share VLs. For instance, the SL VO has the same injection rate as SLs BE and BK combined, yet the best-effort SLs achieve higher latency values when SBT or RR are used. Regarding the DTable end-to-end latency, in some cases SLs get a higher latency than with SBT or RR. This is because DTable does not reduce the overall latency to meet the latency requirements; instead, it splits the total latency among SLs based on the distance between table entries. For example, SL VO using DTable gets a higher latency than the same SL when SBT or RR are used, but SL CL with DTable gets a lower latency than with SBT or RR. Note that in Fig. 4a and c the SL CL in the SBT and RR tests is off the chart. We have decided to leave it outside for the sake of clarity; otherwise, the rest of the lines would be too close to each other. The end-to-end latency for an injection rate of 1 flit/cycle/NIC is over 4,500 ns in both cases.
Figures 3b, d, 4b and d show the normalized throughput achieved on each topology configuration. DTable obtains a normalized throughput very close to the desired one, with an error of ±2%. SBT shows a larger error than DTable; specifically, it gets almost the same bandwidth division as the RR scheduler. These results suggest that SBT is not suitable for high-performance interconnection networks. The maximum throughput for each configuration is: 0.94 flits/cycle/NIC with DTable and 0.8 flits/cycle/NIC with SBT or RR for the 2D Torus; 0.78 flits/cycle/NIC with DTable and 0.68 flits/cycle/NIC with SBT or RR for the 3D Torus. DTable achieves more throughput because the scheduler has a higher hit rate than SBT or RR, and the fact that the scheduler does not try to inject long bursts of packets helps to significantly reduce head-of-line blocking.
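For reference, the normalized SL throughput metric and the averaging over seeds can be computed as in the following sketch. It only illustrates how we read the metric definition (per-SL throughput divided by the total throughput, then averaged over runs); the function names and the input layout are assumptions, not the post-processing code used for the figures.

```python
from statistics import mean
from typing import Dict, List


def normalized_sl_throughput(per_sl: Dict[str, float]) -> Dict[str, float]:
    """Share of the total throughput obtained by each SL in one run."""
    total = sum(per_sl.values())
    return {sl: value / total for sl, value in per_sl.items()}


def average_over_seeds(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average a per-SL metric over runs with different random seeds."""
    return {sl: mean(run[sl] for run in runs) for sl in runs[0]}
```

A typical use would be to normalize each of the 30 runs of an injection rate with `normalized_sl_throughput` and then pass the resulting list to `average_over_seeds`, so that the averaged shares can be compared directly with the configured bandwidth percentages.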
Regarding the topology configurations, in terms of end-to-end latency, the Torus scenarios show higher latency values for all SLs before and after the network reaches the saturation point (Fig. 3a, c, e and f). Also, Torus topologies penalize the best-effort SLs less after the saturation point than the k-ary n-tree configurations, i.e. the latency of all SLs increases progressively as the injection rate increases. The expected behavior is that the latency of the best-effort SLs increases as much as possible before the latency of the high-priority SLs does. This is particularly evident in the 3D Torus (Fig. 3d). The k-ary n-tree configurations keep the high-priority latencies closer to each other than the nD Torus topologies, which means that k-ary n-tree topologies segregate the traffic better than the nD Torus scenarios. Results do not show significant differences in terms of achieved throughput per SL. Only the 3D Torus topology in Fig. 3b shows a slight throughput reduction in the SLs when the DTable scheduler is used, and the network gets congested earlier than the others. This happens because head-of-line blocking is stronger on the 3D Torus topology than on the other networks, since this topology has more endpoints and thinner trunk links than the 2D Torus topology. Nevertheless, the DTable scheduler is able to keep the bandwidth distribution very close to the expected distribution.
Finally, Figs. 3e, f, 4e and f show the end-to-end latency of the DTable SLs. The main aim of these figures is to show the latency differentiation among SLs. We have established that the SL VO must have the lowest latency, the SL VI must have a higher latency than the SL VO, and so on. As can be seen, SLs get latencies proportional to the desired ones. After the network gets saturated, i.e. the NICs inject more packets per cycle than they are able to deliver, the effect of the SL entry distances becomes clearer and the latency differentiation is more obvious.
SBT and RR behave similarly before the network gets saturated. Their results are practically the same because both algorithms work in the same way and, before saturation, each SL can inject as many flits as the NICs generate. However, when saturation appears, there are differences because SBT will try to adjust the throughput to the desired one, while RR will try to give each SL $1/NumSLs$ of the available bandwidth. Note that the differences are clearer with higher injection rates. Nevertheless, we have decided not to include these rates because they are just theoretical injection rates.

Scenario using application trace files
As stated in Sect. 2.2, Hiperion includes support for MPI application trace files using the VEF trace framework [29]. The traces are a very representative way to know how a real HPC application will behave in any interconnection network simulator without requiring a complex system to run the applications. Therefore, we have also used trace files for performing more representative experiments. We expect to see how a poor QoS assignment or the absence of QoS degrades the system performance in terms of application runtime. Hence, those experiments will give us a better perspective on how OPA behaves using traffic from real MPI applications.
We have carried out experiments using multiple trace files obtained from different MPI applications: NAMD (NAMD) [37], a parallel application for simulating large biomolecular systems; GROMACS (gro) [38], a scientific application to perform molecular dynamics; and the LINPACK (HPL) and MPIRandomAccess applications from the HPCC Benchmark Suite [39], which is one of the most used benchmarks for evaluating supercomputers. These applications have been run with 512 tasks. In each experiment, we have used the SLs CL and BE, one carrying all the trace file traffic and the other carrying the CBR connections detailed in Sect. 4.2.1, and vice versa. The purpose of the CBR traffic is to introduce background network workload in order to see how the output scheduling algorithm distinguishes between traffic classes. Otherwise, there would be no competition for resources and the trace would occupy them all, making no difference between using QoS or not.
We have performed several experiments for each trace file, varying the injection rate of the background traffic (1%, 5%, 10% and 20%) and the SL through which the trace file is injected (SL CL and SL BE). This combination reveals the behavior of the scheduler and the architecture both when the application can use more or fewer network resources and when it has to compete harder for them, because the growing background traffic also tries to use them. Those injection rate values have been chosen because they allow us to complete the experiments in a reasonable amount of time while still yielding significant results. Also, we have run the trace file without any background traffic and without QoS support to get the execution time baseline.
We have used DTable as the output scheduling algorithm with the configuration shown in Table 10. The results of RR are also included in order to compare the application performance without QoS mechanism. From the results obtained with the synthetic workload, we have considered DTable more interesting for this experimentation than SBT. For this reason, and for the sake of clarity and not overloading the figures with too much information, we have not included the SBT scheduler in the results.
In those experiments, we have used the same network configuration as that described in Sect. 4.1. We have only changed the interconnection topologies, since the trace evaluation is limited by the number of tasks of the trace. Since we only have 512-task traces available and this size is not enough to fill the systems presented in Sect. 4.1, we have chosen two different topologies: a 2D Torus with 4x4 switches and an 8-ary 2-tree with 16 switches. The configuration of each topology is the following:
• The 2D Torus configuration has 128 NICs. Each switch has 48 ports: eight single links to NICs and four 10x trunk links to neighboring switches.
• The 8-ary 2-tree has been configured with 64 NICs and switches with 16 ports.
To analyse the results of these experiments, we have used the normalized total execution time. It is expressed as the percentage ratio between the execution time in the QoS scenario with background traffic and the execution time in the scenario without background traffic or QoS support.
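The metric can be written as a one-line computation. This sketch assumes that both execution times are measured in the same unit (for example simulated cycles) and that the baseline is the run without background traffic and without QoS; the function name is hypothetical.

```python
def normalized_execution_time(time_with_background: float,
                              baseline_time: float) -> float:
    """Execution time relative to the baseline run, as a percentage."""
    if baseline_time <= 0:
        raise ValueError("baseline execution time must be positive")
    return 100.0 * time_with_background / baseline_time
```

Under this definition, a value of 100% means that the background traffic does not slow the application down at all, and larger values indicate a longer execution.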

Simulation results
In this section, simulation results using the application traces are shown. Figures 5 and 6 show the bandwidth differences produced by DTable using trace files with the 8-ary 2-tree and the 2D Torus configurations, respectively. Those topologies, as well as the experiment configurations, are detailed in Sect. 4.3. Each bar in Figs. 5 and 6 represents the normalized execution time, expressed as a percentage, of the trace file with background traffic and QoS with respect to the same trace file without QoS enabled. Those percentages have been calculated for each SL. For example, in Fig. 5 the NAMD-CL result with an injection rate of 0.01 flits/cycle/NIC is calculated by running a simulation where: i) QoS is enabled; ii) the NAMD trace file is injected through the SL CL; and iii) the background traffic is generated in the SL BK at an injection rate of 0.01 flits/cycle/NIC. The execution time of this simulation is compared with that obtained using the NAMD trace with QoS disabled and the background traffic removed. This process has been performed to calculate each result.
Regarding the results shown in Figs. 5 and 6, the total execution time increases with the injection rate, with the MPIRandomAccess trace file being the most time-consuming in both topologies. In each experiment performed, the result obtained with the same trace file is, regardless of the background traffic injection rate, always lower for the SL CL than for the SL BK. This fact is more obvious as the background traffic injection rate increases. This means that the DTable output scheduling algorithm is able to properly segregate the traffic flows because the SL CL has many more resources assigned than the SL BK. Therefore, although the SL BK tries to progressively allocate more resources, DTable properly limits the amount of resources it can use. As the injection rate of the background workload increases, the differences between the SLs CL and BK grow. This is because, as the background workload increases, it tries to use more resources and is penalized by DTable, which increases its execution time. At low background injection rates, the differences between SLs CL and BK are in the range of 1% to 5% because they do not have to compete strongly for resources, as the network has sufficient capacity to serve both SLs. For both topology configurations, in the scenarios where an RR output scheduler is used, i.e. no QoS is provided, the application execution times are higher than in the scenarios where the DTable scheduler is used. Those execution times are even higher at low background traffic injection rates. Hence, DTable improves application performance by distributing the available network resources.
Comparing the results obtained in Figs. 5 and 6, the trend of the results is the same. However, the applications tend to require more time to complete their execution in the 2D Torus topology. This variation is because the 2D Torus topology has a smaller radix and a larger number of connected NICs than the 8-ary 2-tree. The percentage differences between SLs using DTable are not significantly different in the two topologies. Therefore, the topology configuration does not have any impact on how the DTable output scheduler distributes resources. Nevertheless, this is not true in the case of RR, where the percentage differences between SLs fluctuate depending on the topology. Summing up, DTable is able to distribute the resources according to its configuration and outperforms the RR scheduler in terms of the total execution time required by the applications and the distribution of resources among SLs.

Conclusion
To enable QoS support in high-performance interconnection networks, the most critical architectural decision is the selection of an adequate output scheduling algorithm, which is in charge of selecting the next packet to be transmitted at any moment. This output scheduling algorithm has to keep the computational complexity as low as possible so it can be realistically implemented. In this paper we have addressed this issue in the context of hierarchical crossbar switch architectures, specifically on OPA switches using the SBT and DTable mechanisms. These two table-based output scheduling algorithms keep the computational complexity low and are able to provide the required QoS differences. Moreover, we have proposed a DTable bandwidth correction algorithm capable of adjusting the bandwidth imprecisions produced by the baseline DTable configuration methodology. This methodology, in conjunction with the bandwidth correction algorithm, allows any bandwidth proportion to be assigned to SLs.
We have evaluated the performance of these scheduling algorithms using a heterogeneous scenario where multiple traffic flows coexist. We have carried out different experiments using several topology configurations and we have compared the results against a round-robin output scheduler, which represents a scenario without QoS provision. On the one hand, results show that SBT is capable of providing bandwidth differences but it is not able to provide latency differences. Moreover, SBT is not able to distribute the bandwidth according to its configuration. On the other hand, DTable is capable of providing bandwidth and latency differences among SLs, and we are able to establish which SL will experience higher and lower end-to-end latency.
We have also carried out several experiments using multiple network topologies with the aim of finding out differences in the behavior of schedulers in terms of QoS provision. Results show that k-ary n-tree topologies are able to provide slightly better end-to-end latency results than the nD Torus configurations. In terms of throughput, there are no significant differences between configurations.
Moreover, we have carried out experiments with two different topologies using communication trace files obtained from real MPI applications and background network workload. These experiments have been performed with DTable as the output scheduling algorithm. Results show that even with real MPI communications, DTable is able to properly segregate the traffic according to the predefined configuration, regardless of the network topology and the trace file.
As explained in Sect. 2.2, unlike in non-hierarchical switches, packets sometimes have to cross the central crossbar in order to reach the required output buffer. Hence, we are currently evaluating the impact of including another DTable scheduler in these central buffers. We are also planning to perform a deeper hardware study in order to offer estimates of the silicon area that this middle scheduling algorithm would require.