Abstract
In this study, we propose a 2D-compatible implementation of 2.5D parallel matrix multiplication (2.5D-PDGEMM), designed to perform computations on 2D-distributed matrices on a 2D process grid. We evaluate the performance of our implementation using 16384 nodes (131072 cores) of the K computer, a highly parallel computer. The results show that our 2.5D implementation outperforms conventional 2D implementations, including the ScaLAPACK PDGEMM routine, in terms of strong scaling, even when the cost of matrix redistribution between the 2D and 2.5D distributions is included. We discuss the performance of our implementation by providing a performance breakdown and describing a performance model of the implementation.
Acknowledgment
The results were obtained using the K computer at the RIKEN Advanced Institute for Computational Science (project number: ra000022). This study is a part of the Flagship2020 project. We thank Akiyoshi Kuroda (RIKEN Advanced Institute for Computational Science), Eiji Yamanaka, and Naoki Sueyasu (Fujitsu Limited) for their helpful suggestions and discussions.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Mukunoki, D., Imamura, T. (2018). Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2017. Lecture Notes in Computer Science, vol 10777. Springer, Cham. https://doi.org/10.1007/978-3-319-78024-5_31
DOI: https://doi.org/10.1007/978-3-319-78024-5_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78023-8
Online ISBN: 978-3-319-78024-5
eBook Packages: Computer Science (R0)