The design of ultra scalable MPI collective communication on the K computer
This paper proposes the design of ultra-scalable MPI collective communication for the K computer, which consists of 82,944 compute nodes and was the world's first system to exceed 10 PFLOPS. The nodes are connected by the Tofu interconnect, which provides a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct-network system because they assume typical cluster environments. We therefore design collective algorithms optimized for the K computer.
In designing the algorithms, we prioritize collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and rely on neighbor communication to achieve high bandwidth and avoid message collisions. The short-message algorithms, in contrast, are designed to reduce the software overhead incurred at relaying nodes. Evaluation results on up to 55,296 nodes of the K computer show that the new implementation outperforms the existing one by a factor of 4 to 11 for long messages, and that the short-message algorithms complement the long-message ones.
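The collision-free neighbor-communication principle behind the long-message algorithms can be illustrated with the classic ring (bucket) allgather on a one-dimensional torus. The sketch below is only an illustration of that principle, not the paper's implementation: each of the P−1 steps moves exactly one block to the immediate neighbor, so every link carries one block per step and no two messages contend for the same link. (The actual K computer algorithms additionally exploit multiple RDMA interfaces and the 6-D Tofu topology.)

```python
def ring_allgather(blocks):
    """Simulate a ring (bucket) allgather on a 1-D torus.

    blocks[i] is the block initially owned by node i.  In step s,
    node i forwards block (i - s) mod P to node (i + 1) mod P, i.e.
    the block it received in the previous step.  Every link is used
    exactly once per step, so the schedule is collision-free.
    """
    p = len(blocks)
    # gathered[i] holds the blocks node i has collected so far
    gathered = [[None] * p for _ in range(p)]
    for i in range(p):
        gathered[i][i] = blocks[i]

    for step in range(p - 1):
        for src in range(p):
            blk = (src - step) % p          # block src holds at this step
            dst = (src + 1) % p             # immediate neighbor only
            gathered[dst][blk] = blocks[blk]
    return gathered

# After P-1 neighbor exchanges, every node holds all P blocks.
result = ring_allgather([10, 20, 30, 40])
```

A bucket allreduce follows the same pattern: a reduce-scatter phase of P−1 neighbor steps (combining blocks as they pass) followed by this allgather, which is why such algorithms achieve near-peak link bandwidth for long messages on torus networks.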
Keywords: K computer · MPI collective communication · Torus network