Distributed general matrix multiply and add for a 2D mesh processor network

  • Bo Kågström
  • Mikael Rännar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1041)


A distributed algorithm with the same functionality as the single-processor level 3 BLAS operation GEMM (general matrix multiply and add) is presented. By the same functionality we mean the ability to perform GEMM operations on arbitrary subarrays of the matrices involved. The logical network is a 2D square mesh with torus connectivity, and the matrices are distributed with a non-scattered blocked data distribution. The algorithm consists of two main parts: alignment and data movement of the subarrays involved in the operation, followed by a distributed blocked matrix multiplication on the (sub)matrices using only a square submesh. Our general approach makes it possible to perform GEMM operations on non-overlapping submeshes simultaneously.
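The two-part structure described above (align the subarrays, then shift and multiply blocks around the torus) can be illustrated with a Cannon-style blocked multiply, one classical realization of GEMM on a square torus. The single-process simulation below is a sketch of that general idea, not the paper's exact algorithm; all names (`cannon_gemm`, `mat_mul`, `mat_axpy`) and the block layout are illustrative assumptions.

```python
# Sketch (assumed, not the paper's code): C <- beta*C + alpha*A*B computed
# Cannon-style on a q x q torus of "processors", each holding one block.

def mat_mul(A, B):
    """Naive local block multiply (the per-processor level 3 kernel)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for t in range(k):
            a = A[i][t]
            for j in range(m):
                C[i][j] += a * B[t][j]
    return C

def mat_axpy(C, D, alpha):
    """C += alpha * D, elementwise, in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += alpha * D[i][j]

def cannon_gemm(Ablocks, Bblocks, Cblocks, alpha=1.0, beta=1.0):
    """Cannon-style GEMM on a q x q grid of blocks (simulated torus)."""
    q = len(Ablocks)
    # Alignment phase: skew row i of A left by i, column j of B up by j,
    # modelling the initial data movement on the torus.
    A = [[Ablocks[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[Bblocks[(i + j) % q][j] for j in range(q)] for i in range(q)]
    # Pre-scale C by beta, as in the GEMM specification.
    for i in range(q):
        for j in range(q):
            for y in range(len(Cblocks[i][j])):
                for x in range(len(Cblocks[i][j][0])):
                    Cblocks[i][j][y][x] *= beta
    # Multiplication phase: q steps of local multiply, then circular shifts.
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                mat_axpy(Cblocks[i][j], mat_mul(A[i][j], B[i][j]), alpha)
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift left
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift up
    return Cblocks
```

For a 2x2 grid of 1x1 blocks holding A = [[1,2],[3,4]] and B = [[5,6],[7,8]] with alpha = 1 and beta = 0, the result blocks hold [[19,22],[43,50]], i.e. the ordinary product A*B. In a real mesh implementation the two shift loops become nearest-neighbour sends on the torus, which is what confines each GEMM to a square submesh.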





Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Bo Kågström (1)
  • Mikael Rännar (1)

  1. Department of Computing Science, University of Umeå, Umeå, Sweden
