Abstract
Matrix multiplication is a key computational kernel in many scientific and engineering applications. This paper presents a parallel implementation framework for dense matrix multiplication on multi-socket multi-core architectures. The framework first partitions the computation among the multi-core processors. Each processor then runs a hybrid matrix multiplication algorithm that combines the Winograd algorithm with the classical algorithm. In addition, a hierarchical work-stealing scheme provides dynamic load balancing and enforces data locality. Performance experiments on two platforms show that our implementation achieves significant performance gains over state-of-the-art implementations.
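The hybrid scheme described in the abstract can be illustrated with a minimal sequential sketch. It is not the authors' implementation: it assumes square matrices stored as Python lists of lists, uses the standard Strassen-Winograd recursion (7 multiplications and 15 additions per level), and falls back to the classical algorithm below a hypothetical `cutoff` size or when the dimension is odd. The names `classical_matmul`, `hybrid_matmul`, and `cutoff` are illustrative only, and the partitioning across sockets and the work-stealing scheduler are omitted.

```python
def classical_matmul(a, b):
    """Classical triple-loop product of two n x n matrices."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]


def sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]


def split(m):
    """Split an even-sized square matrix into four quadrants."""
    h = len(m) // 2
    return ([row[:h] for row in m[:h]], [row[h:] for row in m[:h]],
            [row[:h] for row in m[h:]], [row[h:] for row in m[h:]])


def hybrid_matmul(a, b, cutoff=64):
    """Winograd recursion above `cutoff` (assumed value), classical below."""
    n = len(a)
    if n <= cutoff or n % 2:
        return classical_matmul(a, b)

    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # Eight pre-additions.
    s1 = add(a21, a22)
    s2 = sub(s1, a11)
    s3 = sub(a11, a21)
    s4 = sub(a12, s2)
    t1 = sub(b12, b11)
    t2 = sub(b22, t1)
    t3 = sub(b22, b12)
    t4 = sub(t2, b21)

    # Seven recursive products.
    p1 = hybrid_matmul(a11, b11, cutoff)
    p2 = hybrid_matmul(a12, b21, cutoff)
    p3 = hybrid_matmul(s4, b22, cutoff)
    p4 = hybrid_matmul(a22, t4, cutoff)
    p5 = hybrid_matmul(s1, t1, cutoff)
    p6 = hybrid_matmul(s2, t2, cutoff)
    p7 = hybrid_matmul(s3, t3, cutoff)

    # Seven post-additions.
    u2 = add(p1, p6)
    u3 = add(u2, p7)
    u4 = add(u2, p5)
    c11 = add(p1, p2)
    c12 = add(u4, p3)
    c21 = sub(u3, p4)
    c22 = add(u3, p5)

    # Reassemble the quadrants into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

In a task-parallel setting, the seven recursive products are independent of one another and are natural candidates for tasks. A suitable cutoff depends on the platform and would need to be tuned experimentally.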
Acknowledgments
The authors thank Professor Alexandru Nicolau and Professor Feng Shi for their input during discussions of the present study, and the anonymous reviewers for their valuable comments on the manuscript. This work was partially supported by the National Natural Science Foundation of China under grant NSFC-61300011.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, Y., Ji, W., Chen, X., Hu, S. (2015). Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_8
DOI: https://doi.org/10.1007/978-3-319-27137-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27136-1
Online ISBN: 978-3-319-27137-8
eBook Packages: Computer Science, Computer Science (R0)