Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

Wang, Yizhuo; Ji, Weixing; Chen, Xu; Hu, Sensen

doi:10.1007/978-3-319-27137-8_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Matrix multiplication is a key computational kernel in many scientific and engineering applications. This paper presents a parallel implementation framework for dense matrix multiplication on multi-socket multi-core architectures. The framework first partitions the computation among the multi-core processors. Each processor then runs a hybrid matrix multiplication algorithm that combines the Winograd algorithm with the classical algorithm. In addition, a hierarchical work-stealing scheme provides dynamic load balancing and enforces data locality. Performance experiments on two platforms show significant performance gains over state-of-the-art implementations.
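The hybrid idea in the abstract can be illustrated with a minimal sequential sketch. It is not the authors' implementation. It assumes that "the Winograd algorithm" means the Winograd variant of Strassen's method (7 recursive products and 15 block additions per level) and that the recursion falls back to the classical algorithm below a size threshold. The function names, the `CUTOFF` value, and the list-of-lists representation are hypothetical choices made for illustration; partitioning across sockets and work stealing are not shown.

```python
# Illustrative sketch only: a sequential hybrid of Winograd-variant recursion
# and classical multiplication on square list-of-lists matrices.
CUTOFF = 2  # hypothetical threshold; practical values are tuned per machine


def classical(A, B):
    """Classical O(n^3) product."""
    cols = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def split(M):
    """Return the four quadrants M11, M12, M21, M22 of an even-sized matrix."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def hybrid_multiply(A, B, cutoff=CUTOFF):
    n = len(A)
    # Small or odd-sized blocks are handled by the classical algorithm.
    if n <= cutoff or n % 2:
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Eight pre-additions.
    S1 = add(A21, A22); S2 = sub(S1, A11); S3 = sub(A11, A21); S4 = sub(A12, S2)
    T1 = sub(B12, B11); T2 = sub(B22, T1); T3 = sub(B22, B12); T4 = sub(T2, B21)
    # Seven recursive products.
    P1 = hybrid_multiply(A11, B11, cutoff)
    P2 = hybrid_multiply(A12, B21, cutoff)
    P3 = hybrid_multiply(S4, B22, cutoff)
    P4 = hybrid_multiply(A22, T4, cutoff)
    P5 = hybrid_multiply(S1, T1, cutoff)
    P6 = hybrid_multiply(S2, T2, cutoff)
    P7 = hybrid_multiply(S3, T3, cutoff)
    # Seven post-additions.
    U2 = add(P1, P6); U3 = add(U2, P7); U4 = add(U2, P5)
    C11 = add(P1, P2); C12 = add(U4, P3); C21 = sub(U3, P4); C22 = add(U3, P5)
    # Reassemble the quadrants into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

With integer inputs, `hybrid_multiply(A, B)` returns exactly the same matrix as `classical(A, B)`. With floating-point inputs the two results may differ by rounding, because the recursive variant reorders the arithmetic.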

Acknowledgments

The authors thank Professor Alexandru Nicolau and Professor Feng Shi for their input during discussions of this study, and the anonymous reviewers for their valuable comments on the manuscript. This work was partially supported by the National Natural Science Foundation of China under grant NSFC-61300011.

Author information

Authors and Affiliations

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
Yizhuo Wang, Weixing Ji, Xu Chen & Sensen Hu

Corresponding author

Correspondence to Yizhuo Wang.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Wang, Y., Ji, W., Chen, X., Hu, S. (2015). Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_8

DOI: https://doi.org/10.1007/978-3-319-27137-8_8
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27136-1
Online ISBN: 978-3-319-27137-8
eBook Packages: Computer Science, Computer Science (R0)
