The BXI routing architecture for exascale supercomputer

The Journal of Supercomputing

Abstract

BXI, Bull eXascale Interconnect, is the new interconnection network developed by Atos for high-performance computing. It has been designed to meet the requirements of exascale supercomputers. At such a scale, faults have to be expected and dealt with transparently so that applications remain unaffected by them. BXI features various mechanisms for this purpose, one of which is based on a clear separation between two modes of routing table computation: an offline mode used during bring-up and an online mode used to deal with link failures and recoveries. This new architecture is presented along with several offline and online routing algorithms and their actual performance: the full routing tables for a 64k-node fat-tree can be computed in a few minutes in offline mode, and the online mode can withstand numerous inter-router link failures without any noticeable impact on running applications.

Notes

  1. The term “failures” must be understood from a fabric management point of view: it covers not only hardware failures but also human mistakes and maintenance operations.

  2. Curie is an instance of the previous, petaflop-scale generation of Bull supercomputers. It was ranked 9th in the Top500 list of June 2012: http://www.top500.org/system/177818.

  3. https://gmplib.org/~tege/x86-timing.

  4. In particular, the number of connected ports of each switch must be the same, top switches included.

  5. The actual rate is much higher since only deterministic routes are considered in this rate computation.

  6. For any two cells, switches at the same relative location are assigned the same private non-routed IP address for out-of-band communication. This eases delivery and bring-up, and also optimizes the out-of-band management network. More details are given in [1].

  7. Only inter-switch link faults can be dealt with by an online routing algorithm.

  8. http://zeromq.org/.

References

  1. Derradji S, Palfer-Sollier T, Panziera J-P, Poudes A, Wellenreiter F (2015) The BXI interconnect architecture. In: 2015 IEEE 23rd annual symposium on high-performance interconnects (HOTI)

  2. Agarwal A (1991) Limits on interconnection network performance. IEEE Trans Parallel Distrib Syst 2:398–412 (online). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.8845

  3. Duato J, Yalamanchili S, Ni L (2002) Interconnection networks: an engineering approach. Morgan Kaufmann Publishers Inc., San Francisco

  4. Leiserson CE (1985) Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Trans Comput 34(10):892–901 (online). http://dl.acm.org/citation.cfm?id=4492.4495

  5. Ohring S, Ibel M, Das S, Kumar M (1995) On generalized fat trees. In: Proceedings of 9th international parallel processing symposium

  6. Petrini F, Vanneschi M (1997) k-ary n-trees: high performance networks for massively parallel architectures. In: Proceedings 11th international parallel processing symposium

  7. Zahavi E (2010) D-Mod-K routing providing non-blocking traffic for shift permutations on real life fat trees. CCIT Technical Report (online). http://webee.eedev.technion.ac.il/wp-content/uploads/2014/08/publication_574

  8. Kim J, Dally WJ, Abts D (2007) Flattened butterfly: a cost-efficient topology for high-radix networks. SIGARCH Comput Archit News 35(2):126–137. doi:10.1145/1273440.1250679

  9. Ahn JH, Binkert N, Davis A, McLaren M, Schreiber RS (2009) HyperX: topology, routing, and packaging of efficient large-scale networks. In: Proceedings of the conference on high performance computing networking, storage and analysis, ser. SC ’09. ACM, New York, pp 41:1–41:11 (online). doi:10.1145/1654059.1654101

  10. Kim J, Dally WJ, Scott S, Abts D (2008) Technology-driven, highly-scalable dragonfly topology. SIGARCH Comput Archit News 36(3):77–88. doi:10.1145/1394608.1382129

  11. Kim J, Dally W, Scott S, Abts D (2009) Cost-efficient dragonfly topology for large-scale systems. IEEE Micro 29(1):33–40. doi:10.1109/MM.2009.5

  12. Besta M, Hoefler T (2014) Slim fly: a cost effective low-diameter network topology. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, ser. SC ’14. IEEE Press, Piscataway, pp 348–359. (online). doi:10.1109/SC.2014.34

  13. Duato J (1997) A theory of fault-tolerant routing in wormhole networks. IEEE Trans Parallel Distrib Syst 8:790–802

  14. Martínez JC, Flich J, Robles A, López P, Duato J (2003) Supporting fully adaptive routing in infiniband networks. In: Proceedings of the 17th international symposium on parallel and distributed processing, ser. IPDPS ’03. IEEE Computer Society, Washington, DC, p 44.1 (online). http://dl.acm.org/citation.cfm?id=838237.838493

  15. Skeie T, Lysne O, Flich J, López P, Robles A, Duato J (2004) LASH-TOR: a generic transition-oriented routing algorithm. Proc Int Conf Parallel Distrib Syst ICPADS 10:595–604

  16. Lysne O, Skeie T, Reinemo SA, Theiss IR (2006) Layered routing in irregular networks. IEEE Trans Parallel Distrib Syst 17:51–65

  17. Flich J, Skeie T, Mejia A, Lysne O, Lopez P, Robles A, Duato J, Koibuchi M, Rokicki T, Sancho JC (2012) A survey and evaluation of topology-agnostic deterministic routing algorithms. IEEE Trans Parallel Distrib Syst 23(3):405–425

  18. Cherkassky BV, Goldberg AV, Radzik T (1996) Shortest paths algorithms: theory and experimental evaluation, pp 129–174

  19. Chen G, Pang M, Wang J (2007) Calculating shortest path on edge-based data structure of graph. In: Proceedings of 2nd workshop on digital media and its application in museum and heritage, DMAMH 2007, pp 416–421

  20. Demetrescu C, Italiano GF (2006) Experimental analysis of dynamic all pairs shortest. ACM Trans Algorithms 2:578–601

  21. Theiss IR, Lysne O (2006) FRoots: a fault tolerant and topology-flexible routing technique. IEEE Trans Parallel Distrib Syst 17:1136–1150

  22. Mejia A, Flich J, Duato J, Reinemo SA, Skeie T (2006) Segment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori. In: 20th International parallel and distributed processing symposium, IPDPS 2006, vol 2006

  23. Flich J, Mejia A, Lopez P, Duato J (2007) Region-based routing: an efficient routing mechanism to tackle unreliable hardware in network on chips. In: Proceedings of NOCS 2007: first international symposium on networks-on-chip, pp 183–194

  24. Sem-Jacobsen FO, Lysne O (2008) Fault tolerance with shortest paths in regular and irregular networks. IPDPS Miami 2008. In: Proceedings of the 22nd IEEE international parallel and distributed processing symposium, program and CD-ROM, no. 1

  25. Zahavi E, Keslassy I, Kolodny A (2014) Quasi fat trees for HPC clouds and their fault-resilient closed-form routing. In: 2014 IEEE 22nd annual symposium on high-performance interconnects (HOTI). IEEE, pp 41–48

  26. Dijkstra EW (1971) A short introduction to the art of programming. Technische Hogeschool Eindhoven Eindhoven, vol 4

  27. Schwiebert L, Jayasimha DN (1996) A necessary and sufficient condition for deadlock-free wormhole routing. J Parallel Distrib Comput 32:103–117

  28. Schroeder MD, Birrell AD, Burrows M, Murray H, Needham RM, Rodeheffer TL, Satterthwaite EH, Thacker CP (1991) Autonet: a high-speed, self-configuring local area network using point-to-point links. IEEE J Select Areas Commun 9(8):1318–1335

  29. Greenberg RI, Leiserson CE (1985) Randomized routing on fat-trees. In: 26th annual symposium on foundations of computer science (SFCS 1985)

  30. Rodriguez G, Minkenberg C, Beivide R, Luijten RP, Labarta J, Valero M (2009) Oblivious routing schemes in extended generalized fat tree networks. In: IEEE international conference on cluster computing and workshops, 2009. CLUSTER’09. IEEE, pp 1–8

  31. Kerbyson DJ, Lang M, Johnson G (2006) PAL Roadrunner Report 2: application specific optimization of infiniband networks. Tech Rep

  32. Zahavi E (2012) Fat-tree routing and node ordering providing contention free traffic for MPI global collectives. J Parallel Distrib Comput 72(11):1423–1432. Communication Architectures for Scalable Systems (online). http://www.sciencedirect.com/science/article/pii/S0743731512000305

  33. Gómez C, Gilabert F, Gómez ME, López P, Duato J (2007) Deterministic versus adaptive routing in fat-trees. In: Proceedings of workshop on communication architecture on clusters (CAC07)

  34. Kim J, Dally WJ, Abts D (2006) Adaptive routing in high-radix clos network. In: Proceedings of the 2006 ACM/IEEE conference on supercomputing, ser. SC ’06. ACM, New York (online). doi:10.1145/1188455.1188552

  35. Underwood KD, Borch E (2011) A unified algorithm for both randomized deterministic and adaptive routing in torus networks. In: IEEE international symposium on parallel and distributed processing workshops and PhD forum, pp 723–732 (online). http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6008843

  36. Quintin J-N, Vignéras P (2013) Transitively deadlock-free routing algorithms. In: Proceedings of the 2nd IEEE international workshop on high-performance interconnection networks in the exascale and big-data era, Barcelona

Acknowledgments

We are thankful to the Portals team at Sandia National Laboratories for their unconditional support, particularly Ron Brightwell, Brian Barrett (now at Amazon) and Ryan Grant. We also acknowledge the passionate discussions we had with Keith Underwood from Intel during the early stages of this project. We would also like to thank our colleagues Jean-Pierre Panziera, Ben Bratu, Anne-Marie Fourel and Pascale Bernier-Bruna for their reviews and valuable comments.

Author information

Corresponding author

Correspondence to Pierre Vignéras.

Additional information

This BXI development has been undertaken as part of a cooperation between CEA and Atos. The goal of this cooperation is to co-design extreme computing solutions. Atos thanks CEA for all its input, which was very valuable for this research.

This research was partly funded by a grant from the Programme des Investissements d’Avenir.

BXI development was also part of PERFCLOUD, the French FSN (Fonds pour la Société Numérique) cooperative project that brings together academic and industrial partners to design and provide building blocks for new generations of HPC data centers.

Vignéras, P., Quintin, JN. The BXI routing architecture for exascale supercomputer. J Supercomput 72, 4418–4437 (2016). https://doi.org/10.1007/s11227-016-1755-2

  • DOI: https://doi.org/10.1007/s11227-016-1755-2
