Matrix Multiplication on Multidimensional Torus Networks

  • Edgar Solomonik
  • James Demmel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7851)


Blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. We introduce a generalized 'Split-Dimensional' version of Cannon's algorithm (SD-Cannon) with a higher-dimensional, bidirectional communication structure. This algorithm is useful for torus interconnects that can achieve more injection bandwidth than single-link bandwidth. On a bidirectional torus network of dimension d, SD-Cannon can lower the algorithmic bandwidth cost by a factor of up to d. With rectangular collectives, SUMMA also achieves the lower bandwidth cost but has a higher latency cost. We use Charm++ virtualization to efficiently map SD-Cannon onto unbalanced and odd-dimensional torus network partitions. Our performance study on Blue Gene/P demonstrates that an MPI version of SD-Cannon can exploit multiple communication links and improve performance.
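The 2-dimensional communication structure the abstract refers to can be illustrated with a single-process sketch of classical Cannon's algorithm (the baseline, not the split-dimensional variant). Block indices stand in for processes on a q-by-q torus, and each block shift models one nearest-neighbor message; all function names and sizes here are illustrative, not from the paper.

```python
# Single-process simulation of Cannon's algorithm on a q x q process grid.
# Plain Python lists keep the sketch self-contained; each "shift" below
# models one wraparound (torus) message between neighboring processes.

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def blocks(M, q, b):
    # Partition an n x n matrix into a q x q grid of b x b blocks.
    return [[[row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]] for j in range(q)]
            for i in range(q)]

def cannon(A, B, q):
    n = len(A)
    b = n // q
    Ab, Bb = blocks(A, q, b), blocks(B, q, b)
    # Initial skew: row i of A shifts left by i; column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Each "process" (i, j) multiplies its current local blocks.
        for i in range(q):
            for j in range(q):
                Cb[i][j] = madd(Cb[i][j], matmul(Ab[i][j], Bb[i][j]))
        # Shift A blocks left by one and B blocks up by one, with wraparound.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the full product from the C blocks.
    return [[Cb[i // b][j // b][i % b][j % b] for j in range(n)]
            for i in range(n)]
```

Each of the q steps moves every A and B block across exactly one torus link, which is the 2-dimensional structure SD-Cannon generalizes by splitting the shifts across all d dimensions of the torus.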


Keywords: Matrix Multiplication · Outer Product · Bandwidth Cost · Latency Cost · Torus Network





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Edgar Solomonik (1)
  • James Demmel (1)

  1. Division of Computer Science, University of California at Berkeley, USA
