Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms

  • Edgar Solomonik
  • James Demmel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6853)

Abstract

Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon’s algorithm, and to be faster in practice. “3D” algorithms arrange the p processors in a 3D array and store redundant copies of the matrices on each of \(p^{1/3}\) layers. “2D” algorithms, such as Cannon’s algorithm, store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any \(c \in \{1, 2, \ldots, \lfloor p^{1/3} \rfloor\}\), to reduce the bandwidth cost of Cannon’s algorithm by a factor of \(c^{1/2}\) and the latency cost by a factor of \(c^{3/2}\). We also show that these costs reach the lower bounds, modulo polylog(p) factors. We introduce a novel algorithm for 2.5D LU decomposition. To the best of our knowledge, this LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. Our 2.5D LU algorithm uses communication-avoiding pivoting, a stable alternative to partial pivoting. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of \(c^{1/2}\), the latency must increase by a factor of \(c^{1/2}\), so that the 2D LU algorithm (c = 1) in fact minimizes latency. We provide implementations and performance results for 2D and 2.5D versions of all the new algorithms. Our results demonstrate that 2.5D matrix multiplication and LU algorithms strongly scale more efficiently than 2D algorithms. Each of our 2.5D algorithms performs over 2× faster than the corresponding 2D algorithm for certain problem sizes on 65,536 cores of a BG/P supercomputer.
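The cost scaling described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name is invented, constants and lower-order terms are omitted, and the formulas simply instantiate the factors stated above (Cannon's algorithm, c = 1, moves \(O(n^2/p^{1/2})\) words in \(O(p^{1/2})\) messages; c replicated copies divide the bandwidth by \(c^{1/2}\) and the latency by \(c^{3/2}\)):

```python
import math

def matmul_2_5d_costs(n, p, c):
    """Asymptotic per-processor communication costs (constants dropped)
    for 2.5D matrix multiplication of n-by-n matrices on p processors
    with c replicated copies of the data, 1 <= c <= p**(1/3)."""
    assert 1 <= c <= p ** (1.0 / 3.0) + 1e-9, "c must lie in [1, p^(1/3)]"
    bandwidth = n ** 2 / math.sqrt(c * p)  # words communicated
    latency = math.sqrt(p / c ** 3)        # messages sent
    return bandwidth, latency
```

For example, on p = 64 processors, moving from c = 1 (2D) to c = 4 (3D, since \(64^{1/3} = 4\)) halves the bandwidth term and cuts the message count by a factor of 8.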



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Edgar Solomonik (1)
  • James Demmel (1)
  1. Department of Computer Science, University of California at Berkeley, Berkeley, USA
