The design of ultra scalable MPI collective communication on the K computer

  • Special Issue Paper
Computer Science - Research and Development

Abstract

This paper proposes the design of ultra-scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world’s first system to exceed 10 PFLOPS. The nodes are connected by the Tofu interconnect, which introduces a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network because they assume typical cluster environments. We therefore design collective algorithms optimized for the K computer.
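
As a concrete illustration of such a topology, the following is a minimal sketch, in C, of how one-hop neighbors can be computed on a six-dimensional torus with wraparound. The dimension extents and the torus_neighbor helper are hypothetical illustrations, not the Tofu hardware’s actual coordinate interface or the K computer’s actual configuration.

    /* A minimal sketch of one-hop neighbor computation on a 6D mesh/torus.
     * The dimension extents below are illustrative placeholders, NOT the
     * K computer's actual configuration; real Tofu coordinates are managed
     * by the system, not by application code like this. */
    #include <stdio.h>

    #define NDIMS 6

    /* Hypothetical extents for the six dimensions (x, y, z, a, b, c). */
    static const int extent[NDIMS] = {8, 8, 8, 2, 3, 2};

    /* Fill `out` with the coordinate one hop from `coord` in dimension
     * `dim` (dir = +1 or -1), wrapping around torus-style. */
    static void torus_neighbor(const int coord[NDIMS], int dim, int dir,
                               int out[NDIMS]) {
        for (int d = 0; d < NDIMS; d++)
            out[d] = coord[d];
        out[dim] = (coord[dim] + dir + extent[dim]) % extent[dim];
    }

    int main(void) {
        int me[NDIMS] = {0, 0, 0, 0, 0, 0};
        int nb[NDIMS];
        torus_neighbor(me, 0, -1, nb);   /* one hop in -x wraps to x = 7 */
        printf("neighbor in -x direction: x = %d\n", nb[0]);
        return 0;
    }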

In designing the algorithms, we place importance on collision-freeness for long messages and on low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and rely on neighbor communication in order to attain high bandwidth and avoid message collisions. The short-message algorithms, in contrast, are designed to reduce software overhead, which grows with the number of relaying nodes. Evaluation results on up to 55,296 nodes of the K computer show that the new implementation outperforms the existing one for long messages by a factor of 4 to 11, and that the short-message algorithms complement the long-message ones.
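
To illustrate the neighbor-communication principle behind the long-message design, the following is a minimal sketch of a generic ring ("bucket") allgather, in which each rank exchanges data only with its two direct neighbors so that no link carries more than one message per step. It is a textbook one-dimensional algorithm under stated assumptions, not the paper’s actual multi-RDMA-interface 6D implementation; the ring_allgather function is hypothetical.

    /* A minimal sketch of a textbook ring ("bucket") allgather: each rank
     * talks only to its two direct neighbors, so no link carries more than
     * one message per step (collision-free on a 1D torus).  A generic
     * illustration, not the paper's multi-interface 6D implementation. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    static void ring_allgather(const char *sendbuf, char *recvbuf,
                               int blocklen, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank - 1 + size) % size;   /* direct neighbors only */
        int right = (rank + 1) % size;

        /* Place our own block, then forward one block per step: in step s,
         * send the block received in step s-1 to the right neighbor while
         * receiving the next block from the left neighbor. */
        memcpy(recvbuf + (size_t)rank * blocklen, sendbuf, (size_t)blocklen);
        for (int s = 0; s < size - 1; s++) {
            int sendidx = (rank - s + size) % size;
            int recvidx = (rank - s - 1 + size) % size;
            MPI_Sendrecv(recvbuf + (size_t)sendidx * blocklen, blocklen,
                         MPI_CHAR, right, 0,
                         recvbuf + (size_t)recvidx * blocklen, blocklen,
                         MPI_CHAR, left, 0, comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char block[4];
        memset(block, 'A' + (rank % 26), sizeof block);
        char *all = malloc((size_t)size * sizeof block);
        ring_allgather(block, all, (int)sizeof block, MPI_COMM_WORLD);
        free(all);
        MPI_Finalize();
        return 0;
    }

Each of the size - 1 steps moves every block one hop between neighbors, so for long messages the achieved bandwidth approaches the link bandwidth without collisions. For short messages, however, those same relay steps accumulate per-hop software overhead, which is why dedicated short-message algorithms with fewer relaying nodes are needed to complement the long-message ones.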

Author information

Correspondence to Tomoya Adachi.

Cite this article

Adachi, T., Shida, N., Miura, K. et al. The design of ultra scalable MPI collective communication on the K computer. Comput Sci Res Dev 28, 147–155 (2013). https://doi.org/10.1007/s00450-012-0211-7
