Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

  • Conference paper
  • In: Algorithms and Architectures for Parallel Processing (ICA3PP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Abstract

Matrix multiplication is an important computational kernel in many science and engineering applications. This paper presents a parallel implementation framework for dense matrix multiplication on multi-socket multi-core architectures. Our framework first partitions the computation among the multi-core processors. A hybrid matrix multiplication algorithm, which combines the Winograd algorithm with the classical algorithm, is then used on each processor. In addition, a hierarchical work-stealing scheme provides dynamic load balancing and enforces data locality. Performance experiments on two platforms show that our implementation achieves significant performance gains over state-of-the-art implementations.
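The hybrid algorithm outlined in the abstract can be sketched as follows: Winograd's variant of Strassen's algorithm (seven block products instead of eight) recurses on 2x2 block partitions, switching to the classical cubic algorithm once the blocks fall below a cutoff. The cutoff value, the pure-Python list-of-lists representation, and the restriction to even dimensions are illustrative assumptions for this sketch, not details of the paper's actual implementation.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def classical(A, B):
    # Classical O(n^3) multiplication; serves as the recursion base case.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

def winograd(A, B, cutoff=64):
    # Winograd's variant of Strassen's algorithm: 7 multiplications
    # and 15 additions/subtractions per 2x2 block step.
    n = len(A)
    if n <= cutoff or n % 2:
        return classical(A, B)
    h = n // 2
    split = lambda M: ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                       [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    S1 = mat_add(A21, A22); S2 = mat_sub(S1, A11)
    S3 = mat_sub(A11, A21); S4 = mat_sub(A12, S2)
    T1 = mat_sub(B12, B11); T2 = mat_sub(B22, T1)
    T3 = mat_sub(B22, B12); T4 = mat_sub(T2, B21)

    # Seven recursive products instead of the classical eight.
    P1 = winograd(A11, B11, cutoff); P2 = winograd(A12, B21, cutoff)
    P3 = winograd(S4, B22, cutoff);  P4 = winograd(A22, T4, cutoff)
    P5 = winograd(S1, T1, cutoff);   P6 = winograd(S2, T2, cutoff)
    P7 = winograd(S3, T3, cutoff)

    U2 = mat_add(P1, P6); U3 = mat_add(U2, P7); U4 = mat_add(U2, P5)
    C11 = mat_add(P1, P2); C12 = mat_add(U4, P3)
    C21 = mat_sub(U3, P4); C22 = mat_add(U3, P5)

    # Stitch the four result blocks back together.
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

In a task-parallel setting such as the one the paper targets, the seven recursive products are natural units of work to place in per-processor task queues; the cutoff trades Winograd's lower operation count against the additions' memory traffic and is typically tuned per platform.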



Acknowledgments

The authors thank Professor Alexandru Nicolau and Professor Feng Shi for their input during discussions of this study, and the anonymous reviewers for their valuable comments on the manuscript. This work was partially supported by the National Natural Science Foundation of China under grant NSFC-61300011.

Author information

Correspondence to Yizhuo Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, Y., Ji, W., Chen, X., Hu, S. (2015). Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds.) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol. 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_8

  • DOI: https://doi.org/10.1007/978-3-319-27137-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27136-1

  • Online ISBN: 978-3-319-27137-8

  • eBook Packages: Computer Science (R0)
