Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

  • Yizhuo Wang
  • Weixing Ji
  • Xu Chen
  • Sensen Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9530)


Matrix multiplication is a key computational kernel in many science and engineering applications. This paper presents a parallel implementation framework for dense matrix multiplication on multi-socket multi-core architectures. The framework first partitions the computation among the multi-core processors. Each processor then runs a hybrid matrix multiplication algorithm that combines the Winograd algorithm with the classical algorithm. In addition, a hierarchical work-stealing scheme provides dynamic load balancing and enforces data locality. Performance experiments on two platforms show that our implementation achieves significant performance gains over state-of-the-art implementations.
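To make the idea of a hybrid algorithm concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes square matrices stored as Python lists of lists and uses the well-known Winograd variant of Strassen's recursion (7 block multiplications, 15 block additions), falling back to the classical triple loop once a block is small or has odd dimension. The function names and the `cutoff` parameter are hypothetical and chosen for illustration only; partitioning across sockets and work-stealing are not modeled here.

```python
def classical(A, B):
    """Classical O(n^3) product of two matrices given as lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(M):
    """Split an even-sized square matrix into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def hybrid_matmul(A, B, cutoff=2):
    """Winograd-variant recursion above `cutoff`, classical algorithm below.

    `cutoff` is an illustrative tuning knob (an assumption, not a value
    taken from the paper).
    """
    n = len(A)
    if n <= cutoff or n % 2:          # small or odd-sized block: classical
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Pre-additions (8 block additions)
    S1 = add(A21, A22); S2 = sub(S1, A11); S3 = sub(A11, A21); S4 = sub(A12, S2)
    T1 = sub(B12, B11); T2 = sub(B22, T1); T3 = sub(B22, B12); T4 = sub(T2, B21)
    # Seven recursive block products
    P1 = hybrid_matmul(A11, B11, cutoff)
    P2 = hybrid_matmul(A12, B21, cutoff)
    P3 = hybrid_matmul(S4, B22, cutoff)
    P4 = hybrid_matmul(A22, T4, cutoff)
    P5 = hybrid_matmul(S1, T1, cutoff)
    P6 = hybrid_matmul(S2, T2, cutoff)
    P7 = hybrid_matmul(S3, T3, cutoff)
    # Post-additions (7 block additions)
    U2 = add(P1, P6); U3 = add(U2, P7); U4 = add(U2, P5)
    C11 = add(P1, P2); C12 = add(U4, P3)
    C21 = sub(U3, P4); C22 = add(U3, P5)
    # Reassemble the four quadrants into one matrix
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom

if __name__ == "__main__":
    n = 8
    A = [[i * n + j for j in range(n)] for i in range(n)]
    B = [[(i - j) % 5 for j in range(n)] for i in range(n)]
    assert hybrid_matmul(A, B) == classical(A, B)
```

In a sketch like this, the cutoff trades the lower operation count of the recursive scheme against the overhead of the extra block additions and temporaries; a practical implementation would tune it per platform and replace the classical fallback with an optimized kernel.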


Matrix multiplications · Multi-socket · Fast algorithms · Winograd · Work-stealing



The authors thank Professor Alexandru Nicolau and Professor Feng Shi for their input during discussions of the present study, and the anonymous reviewers for their valuable comments on the manuscript. This work was partially supported by the National Natural Science Foundation of China under grant NSFC-61300011.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
