Abstract
Many-core systems are designed primarily for applications with large data parallelism. We propose an efficient hybrid matrix multiplication implementation based on the Strassen and Winograd algorithms (S-MM and W-MM) for many-core systems. A depth-first (DFS) traversal of the recursion tree is used, in which all cores work in parallel to compute each of the \(N \times N\) sub-matrices, and the sub-matrices are computed in sequence. DFS reduces storage at the cost of substantial data motion to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic linear-algebra functions to reduce overhead, (2) an efficient library (CUBLAS 5.5) to implement those basic functions, and (3) parameter tuning of a parametric kernel to improve resource occupancy. S-MM and W-MM are evaluated on a GPU and on the MIC (Xeon Phi). On the GPU, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for matrices with \(N \ge 2048\) and \(N \ge 3072\), respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to the NVIDIA SDK library, S-MM and W-MM achieve speedups between 20\(\times \) and 80\(\times \) for the same matrix sizes. On the MIC, two-recursion S-MM with reordering is 14–26% faster than the MKL library for \(N \ge 1024\). The proposed implementations achieve 2.35 TFLOPS (67% of peak) on the GPU and 0.5 TFLOPS (21% of peak) on the MIC. Similarly encouraging results are obtained on a 16-core Xeon E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic linear-algebra libraries.
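The depth-limited recursion described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden stand-in, not the authors' CUDA/MIC implementation: `np.matmul` plays the role of the tuned library GEMM (CUBLAS or MKL) at the base case, the `depth` parameter models the "few recursion levels," and the seven products are computed in sequence, mirroring the DFS traversal in which all cores cooperate on one sub-product at a time.

```python
import numpy as np

def strassen_mm(A, B, depth=1):
    """Depth-limited Strassen multiply (illustrative sketch).
    At depth 0 (or for odd N) it falls back to a library GEMM,
    here np.matmul standing in for CUBLAS/MKL."""
    n = A.shape[0]
    if depth == 0 or n % 2 != 0:
        return A @ B  # base case: hand off to the tuned library GEMM
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products, evaluated in sequence (DFS order).
    M1 = strassen_mm(A11 + A22, B11 + B22, depth - 1)
    M2 = strassen_mm(A21 + A22, B11,       depth - 1)
    M3 = strassen_mm(A11,       B12 - B22, depth - 1)
    M4 = strassen_mm(A22,       B21 - B11, depth - 1)
    M5 = strassen_mm(A11 + A12, B22,       depth - 1)
    M6 = strassen_mm(A21 - A11, B11 + B12, depth - 1)
    M7 = strassen_mm(A12 - A22, B21 + B22, depth - 1)
    # Gather/aggregate step: combine the products into result quadrants.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

One recursion level replaces eight half-size multiplications with seven, which is the source of the speedup over a plain library call once \(N\) is large enough to amortize the extra additions and data motion.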
Khan, A.u.H., Al-Mouhamed, M., Fatayer, A. et al. Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core. Int J Parallel Prog 44, 801–830 (2016). https://doi.org/10.1007/s10766-015-0378-1