
Compact network reconfiguration in fat-trees



In large high-performance computing systems, the probability of component failure is high. At the same time, sustaining system performance often requires reconfiguration to ensure high utilization of available resources. Reconfiguration in interconnection networks such as InfiniBand (IB) typically involves computing and distributing a new set of routes to maintain connectivity and performance. In general, current routing algorithms do not consider the existing routes in a network when calculating new ones. Such configuration-oblivious routing can substantially modify the existing paths, which makes reconfiguration costly because it potentially involves a large number of source–destination pairs. In this paper, we propose SlimUpdate, a novel routing algorithm for IB-based fat-tree topologies. SlimUpdate employs path preservation techniques to reduce the total number of path modifications by up to 80% compared with OpenSM’s fat-tree routing algorithm in most reconfiguration scenarios. Furthermore, we present a metabase-aided re-routing method for fat-trees, based on destination leaf-switch multipathing. The proposed method significantly reduces network reconfiguration overhead while providing greater routing flexibility. On successive runs, it saves up to 85% of the total routing time compared with the traditional re-routing scheme. Based on the metabase-aided routing, we also present a modified SlimUpdate routing algorithm that dynamically optimizes routes for a given MPI node order.
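
To make the cost metric concrete, the following is a minimal sketch of how the number of routing modifications between two configurations might be counted. It is not the SlimUpdate algorithm or OpenSM code. It assumes, purely for illustration, that each switch's forwarding table is a mapping from destination LID to output port, and it counts changed table entries as a proxy for modified paths. The names `count_modified_entries` and `reduction_percent`, and all table contents, are hypothetical.

```python
from typing import Dict

# Hypothetical representation: switch name -> {destination LID -> output port}.
ForwardingTables = Dict[str, Dict[int, int]]


def count_modified_entries(old: ForwardingTables, new: ForwardingTables) -> int:
    """Count (switch, destination) entries whose output port differs."""
    changed = 0
    for switch in old.keys() | new.keys():
        old_lft = old.get(switch, {})
        new_lft = new.get(switch, {})
        # An entry present in only one table also counts as a modification.
        for lid in old_lft.keys() | new_lft.keys():
            if old_lft.get(lid) != new_lft.get(lid):
                changed += 1
    return changed


def reduction_percent(baseline_changes: int, preserved_changes: int) -> float:
    """Relative decrease in modifications versus a baseline re-route."""
    if baseline_changes == 0:
        return 0.0
    return 100.0 * (baseline_changes - preserved_changes) / baseline_changes


# Made-up tables for two leaf switches and four destinations.
old = {
    "leaf1": {1: 1, 2: 2, 3: 3, 4: 4},
    "leaf2": {1: 1, 2: 2, 3: 3, 4: 4},
}
# A configuration-oblivious re-route that reshuffles most entries.
oblivious = {
    "leaf1": {1: 2, 2: 1, 3: 4, 4: 3},
    "leaf2": {1: 1, 2: 1, 3: 3, 4: 4},
}
# A path-preserving re-route that changes only what it must.
preserving = {
    "leaf1": {1: 1, 2: 2, 3: 4, 4: 4},
    "leaf2": {1: 1, 2: 2, 3: 3, 4: 4},
}

baseline = count_modified_entries(old, oblivious)
preserved = count_modified_entries(old, preserving)
print(baseline, preserved, reduction_percent(baseline, preserved))
```

Counting table entries rather than full source–destination paths keeps the sketch short. A path-level count would additionally need the topology to trace each pair's route through the switches.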




  1. The OpenFabrics Enterprise Distribution (OFED) is the de facto standard software stack for deploying IB-based applications. http://openfabrics.org/.

  2. Multi-homed nodes can be considered as distinct multiple nodes in the routing.

  3. Available multipaths between leaf switches are different from switch-to-switch paths in the OpenSM’s fat-tree routing. The fat-tree routing algorithm uses single-path non-balanced switch-to-switch routing, as a relatively small amount of switch-to-switch traffic is anticipated.

  4. The nodes connected to the same leaf switch have full bandwidth between them.




The authors would like to thank Mellanox Technologies for providing some of the hardware used in our experiments.

Author information

Correspondence to Feroz Zahid.

Additional information

This work was supported by the Norwegian Research Council under the ERAC project (Project Number: 213283/O70).


Zahid, F., Gran, E.G., Bogdański, B. et al. Compact network reconfiguration in fat-trees. J Supercomput 72, 4438–4467 (2016). https://doi.org/10.1007/s11227-016-1759-y



  • Routing algorithms
  • Interconnection networks
  • Network reconfiguration
  • Fat-trees
  • InfiniBand