The design of ultra scalable MPI collective communication on the K computer

  • Tomoya Adachi
  • Naoyuki Shida
  • Kenichi Miura
  • Shinji Sumimoto
  • Atsuya Uno
  • Motoyoshi Kurokawa
  • Fumiyoshi Shoji
  • Mitsuo Yokokawa
Special Issue Paper

Abstract

This paper proposes the design of ultra-scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world’s first system to exceed 10 PFLOPS. The nodes are connected by the Tofu interconnect, which provides a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network because they assume typical cluster environments. We therefore design collective algorithms optimized for the K computer.
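To make the six-dimensional addressing concrete, the small C sketch below computes the two wrap-around neighbors of a node along each torus axis. Only the number of axes is taken from the paper; the axis extents and the example coordinates are made-up values for illustration and do not reflect the actual Tofu configuration.

    /* Illustrative sketch: neighbor addressing on a 6D torus.
     * The axis extents and the example node address below are
     * assumptions for illustration, not the Tofu configuration. */
    #include <stdio.h>

    #define NDIMS 6

    static const int extent[NDIMS] = {4, 4, 4, 2, 3, 2};  /* assumed extents */

    /* Coordinate of the neighbor in direction dir (+1 or -1) along
     * the given axis, with torus wrap-around. */
    static int neighbor_coord(int coord, int axis, int dir)
    {
        return (coord + dir + extent[axis]) % extent[axis];
    }

    int main(void)
    {
        int node[NDIMS] = {3, 0, 2, 1, 0, 1};  /* an example 6D address */
        for (int axis = 0; axis < NDIMS; axis++)
            printf("axis %d: -neighbor %d, +neighbor %d\n", axis,
                   neighbor_coord(node[axis], axis, -1),
                   neighbor_coord(node[axis], axis, +1));
        return 0;
    }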

In designing the algorithms, we place importance on freedom from message collisions for long messages and on low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and rely on neighbor communication to achieve high bandwidth and avoid message collisions. The short-message algorithms, on the other hand, are designed to reduce software overhead, which grows with the number of relaying nodes. Evaluation on up to 55,296 nodes of the K computer shows that the new implementation outperforms the existing one by a factor of 4 to 11 for long messages, and that the short-message algorithms complement the long-message ones.
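As a minimal sketch of the neighbor-communication idea for long messages, the C/MPI fragment below pipelines a broadcast along a chain of ranks, forwarding one chunk at a time so that transfers overlap along the pipeline. It illustrates only the general principle under the assumption of a simple one-dimensional chain; it is not the paper’s multi-interface RDMA implementation.

    /* Minimal sketch: chunked, pipelined broadcast along a chain of
     * ranks (rank 0 is the root). Illustrates neighbor communication
     * for long messages only; not the paper's multi-NIC RDMA design. */
    #include <mpi.h>

    static void chain_pipeline_bcast(char *buf, int count, int chunk,
                                     MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int off = 0; off < count; off += chunk) {
            int len = (count - off < chunk) ? count - off : chunk;
            if (rank > 0)               /* receive a chunk from the predecessor */
                MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, 0, comm,
                         MPI_STATUS_IGNORE);
            if (rank < size - 1)        /* forward the chunk to the successor */
                MPI_Send(buf + off, len, MPI_BYTE, rank + 1, 0, comm);
        }
    }

With a suitably sized chunk, every intermediate rank is sending one chunk while receiving the next; this overlap is what the long-message algorithms exploit, and mapping the chain onto physical torus neighbors is what keeps the traffic collision-free.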

Keywords

K computer · MPI collective communication · Torus network

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Tomoya Adachi (1)
  • Naoyuki Shida (1)
  • Kenichi Miura (1)
  • Shinji Sumimoto (1)
  • Atsuya Uno (2)
  • Motoyoshi Kurokawa (2)
  • Fumiyoshi Shoji (2)
  • Mitsuo Yokokawa (2)
  1. Fujitsu Limited, Kawasaki, Japan
  2. RIKEN, Kobe, Japan