A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Kelefouras, Vasilios; Kritikakou, A.; Mporas, Iosif; Kolonias, Vasilios

doi:10.1007/s11227-015-1613-7

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Published: 22 January 2016

Volume 72, pages 804–844, (2016)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Vasilios Kelefouras¹,
A. Kritikakou²,
Iosif Mporas¹ &
…
Vasilios Kolonias¹

984 Accesses
15 Citations
Explore all metrics

Abstract

Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. The scheduling parameter values selection is a very difficult and time-consuming task, since parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data caches sizes and associativities), giving high-quality solutions and a smaller search space. This methodology refers to a wide range of CPU and GPU architectures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors

Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors

Article 12 June 2019

Feng Zhang, Weifeng Liu, … Xiaoyong Du

Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations

References

Kelefouras VI, Kritikakou A, Papadima E, Goutis CE (2015) A methodology for speeding up matrix vector multiplication for single/multi-core architectures. J Supercomput 71(7):2644–2667
Article Google Scholar
Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1): 21–38
MathSciNet Google Scholar
Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi:10.1145/2512432
Google Scholar
Bacon DF, Graham SL, Oliver, Sharp J (1994) Compiler transformations for high-performance computing. ACM Comput Surv 26:345–420
Article Google Scholar
Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO’01)
Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization (CGO’03). IEEE Computer Society, Washington, DC, pp 204–215. http://dl.acm.org/citation.cfm?id=776261.776284. Accessed 15 Jan 2015
Cooper KD, Subramanian D, Torczon L (2001) Adaptive optimizing compilers for the 21st century. J Supercomput 23:2002
MATH Google Scholar
Kisuki T, Knijnenburg PMW, O’Boyle MFP, Bodin F, Wijshoff HAG (1999) A feasibility study in iterative compilation. In: Proceedings of the second international symposium on high performance computing (ISHPC’99). Springer, London, pp 121–132. http://dl.acm.org/citation.cfm?id=646347.690219. Accessed 15 Jan 2015
Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi:10.1145/1509864.1509865
Article Google Scholar
Kulkarni P, Hines S, Hiser J, Whalley D, Davidson J, Jones D (2004) Fast searches for effective optimization phase sequences. SIGPLAN Not 39(6):171–182. doi:10.1145/996893.996863
Article Google Scholar
Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems (CASES’11). ACM, New York, pp 65–74. doi:10.1145/2038698.2038711
Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications (AIMSA’02). Springer, London, pp 41–50. http://dl.acm.org/citation.cfm?id=646053.677574. Accessed 15 Jan 2015
Stephenson M, Amarasinghe S, Martin M, O’Reilly U-M (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90. doi:10.1145/780822.781141
Article Google Scholar
Tartara M, Crespi Reghizzi S (2013) Continuous learning of compiler heuristics. ACM Trans Archit Code Optim 9(4):46:1–46:25. doi:10.1145/2400682.2400705
Article Google Scholar
Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization (CGO’06). IEEE Computer Society, Washington, DC, pp 295–305. doi:10.1109/CGO.2006.37
Whaley RC (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw Pract Exp 35(2):101–121
Article Google Scholar
Openblas (2012) An optimized blas library. http://xianyi.github.com/OpenBLAS/. Accessed 15 Jan 2015
Guennebaud G, Jacob B et al (2010) Eigen v3. http://eigen.tuxfamily.org. Accessed 15 Jan 2015
Intel (2012) Intel mkl. http://software.intel.com/en-us/intel-mkl. Accessed 15 Jan 2015
Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing, ACM SIGARC, Vienna
Goto K, van de Geijn RA (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25. doi:10.1145/1356052.1356053
Article MathSciNet MATH Google Scholar
Van Zee FG, Van De Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14-1–14-33
Smith TM, van de Geijn RA, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: 2014 IEEE 28th international parallel and distributed processing symposium, Phoenix, pp 1049–1059. doi:10.1109/IPDPS.2014.110
Yotov K, Li X, Ren G, Garzaran M, Padua D, Pingali K, Stodghill P (2005) Is search really necessary to generate high-performance blas? Proc IEEE 93(2):358–386. (Special issue on “Program generation, optimization, and adaptation”)
Volkov V, Demmel JW (2008) Benchmarking gpus to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE conference on supercomputing (SC’08). IEEE Press, Piscataway, pp 31:1–31:11. http://dl.acm.org/citation.cfm?id=1413370.1413402. Accessed 15 Jan 2015
Nath R, Tomov S, Dongarra J (2010) An improved Magma Gemm for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511515. doi:10.1177/1094342010385729
Article Google Scholar
Kelefouras VI, Kritikakou A, Goutis C (2014) A matrix–matrix multiplication methodology for single/multi-core architectures using SIMD, supercomputing. Springer, New York. doi:10.1007/s11227-014-1098-9
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not 42(6):89–100. doi:10.1145/1273442.1250746
Article Google Scholar
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. doi:10.1145/2024716.2024718
Article Google Scholar
Austin T, Larson E, Ernst D (2002) Simplescalar: an infrastructure for computer system modeling. Computer 35:59–67. doi:10.1109/2.982917
Article Google Scholar
Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805
Article Google Scholar
Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box, pp 349–357
Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62
Article MATH Google Scholar
Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192
Article Google Scholar
Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11
Article Google Scholar
Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114
Article Google Scholar
Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432
Article Google Scholar
Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269
Article MATH Google Scholar
Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11
Article Google Scholar
Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(02):201–229
Article Google Scholar
Arabnia H (1995) A distributed stereocorrelation algorithm. In: Proceedings of fourth international conference on computer communications and networks. IEEE, pp 479–482
Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35
Article MATH Google Scholar
Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: Ninth SIAM conference on parallel processing for scientific computing, CD-ROM proceedings
Whaley RC, Dongarra JJ (1998) Automatically tuned linear algebra software. In: Proceedings of the 1998 ACM/IEEE conference on supercomputing, supercomputing’98. IEEE Computer Society, Washington, DC, pp 1–27. http://dl.acm.org/citation.cfm?id=509058.509096. Accessed 15 Jan 2015
Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Tech. Rep. UT-CS-97-366, University of Tennessee
Bjørstad P, Manne F, Sørevik T, Vajtersic M (1992) Efficient matrix multiplication on simd computers. SIAM J Matrix Anal Appl 13:386–401
Article MathSciNet MATH Google Scholar
Geijn RAVD, Watts J (1997) Summa: scalable universal matrix multiplication algorithm. Tech. rep
Chatterjee S, Lebeck AR, Patnala PK, Thottethodi M (1999) Recursive array layouts and fast parallel matrix multiplication. In: Proceedings of eleventh annual ACM symposium on parallel algorithms and architectures, pp 222–231
Moon B, Jagadish HV, Faloutsos C, Saltz JH (2001) Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 13:2001
Article Google Scholar
Thottethodi M, Chatterjee S, Lebeck AR (1998) Tuning strassen’s matrix multiplication for memory efficiency. In: Proceedings of SC98 (CD-ROM)
Kulkarni M, Pingali K (2008) An experimental study of self-optimizing dense linear algebra software. Proc IEEE 96(5):832–848
Article Google Scholar
Garcia E, Venetis IE, Khan R, Gao GR (2010) Optimized dense matrix multiplication on a many-core architecture. In: Proceedings of the 16th international Euro-Par conference on parallel processing: part II (Euro-Par’10). Springer, Berlin, pp 316–327. http://dl.acm.org/citation.cfm?id=1885276.1885308. Accessed 15 Jan 2015
Choi J (1998) A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. Concurr Pract Exp 10(8):655–670. http://dblp.uni-trier.de/db/journals/concurrency/concurrency10.html. Accessed 15 Jan 2015
Kurzak J, Alvaro W, Dongarra J (2009) Optimizing matrix multiplication for a short-vector simd architecture—cell processor. Parallel Comput 35(3):138–150. doi:10.1016/j.parco.2008.12.010
Article Google Scholar
Desprez F, Suter F (2004) Impact of mixed-parallelism on parallel implementations of the strassen and winograd matrix multiplication algorithms: research articles. Concurr Comput Pract Exp 16(8):771–797. doi:10.1002/cpe.v16:8
Article Google Scholar
Hattori M, Ito N, Chen W, Wada K (2005) Parallel matrix-multiplication algorithm for distributed parallel computers. Syst Comput Jpn 36(4):48–59. doi:10.1002/scj.v36:4
Article Google Scholar
Hunold S, Rauber T, Rünger G (2004) Multilevel hierarchical matrix multiplication on clusters. In: Proceedings of the 18th annual international conference on supercomputing (ICS’04). ACM, New York, pp 136–145. doi:10.1145/1006209.1006230
Krishnan M, Nieplocha J (2006) Memory efficient parallel matrix multiplication operation for irregular problems. In: Proceedings of the 3rd conference on computing frontiers (CF’06). ACM, New York, pp 229–240. doi:10.1145/1128022.1128054
Hunold S, Rauber T (2005) Automatic tuning of pdgemm towards optimal performance. In: Proceedings of the 11th international Euro-Par conference on parallel processing (Euro-Par’05). Springer, Berlin, pp 837–846. doi:10.1007/11549468_91
Desprez F, Suter F (2002) Impact of mixed-parallelism on parallel implementations of Strassen and Winograd matrix multiplication algorithms. Rapport de recherche RR-4482, INRIA. http://hal.inria.fr/inria-00072106. Accessed 15 Jan 2015
Tsilikas G, Fleury M (2004) Matrix multiplication performance on commodity shared-memory multiprocessors. In: Proceedings of the international conference on parallel computing in electrical engineering (PARELEC’04). IEEE Computer Society, Washington, DC, pp 13–18. doi:10.1109/PARELEC.2004.43
Rünger G, Schwind M (2010) Fast recursive matrix multiplication for multi-core architectures. Procedia Comput Sci 1(1):67–76. [International conference on computational science 2010 (ICCS’10)]
Krishnan M, Nieplocha J (2004) Srumma: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems. Parallel Distrib Process Symp Int 1:70b. doi:10.1109/IPDPS.2004.1303000. Accessed 15 Jan 2015
Strassen V (1969) Gaussian elimination is not optimal. Numer Math 14(3):354–356
Article MathSciNet MATH Google Scholar
Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. SIGPLAN Not 30(8):207–216. doi:10.1145/209937.209958
Article Google Scholar
Michaud P (2011) Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view, In: Proceedings of the 6th international conference on high performance and embedded architectures and compilers (HiPEAC’11). ACM, New York, pp 187–196. doi:10.1145/1944862.1944890
Nikolopoulos DS (2003) Code and data transformations for improving shared cache performance on smt processors. In: ISHPC, pp 54–69
Kurzak J, Tomov S, Dongarra J (2012) Autotuning gemm kernels for the fermi gpu. 23(11):2045–2057. http://dblp.uni-trier.de/db/journals/tpds/tpds23.html. Accessed 15 Jan 2015
Jiang C, Snir M (2005) Automatic tuning matrix multiplication performance on graphics hardware. In: Proceesings of the 14th international conference on parallel architecture and compilation techniques (PACT), pp 185–196
Li J, Ranka S, Sahni S (2011) Strassen’s matrix multiplication on gpus. In: Proceedings of the 2011 IEEE 17th international conference on parallel and distributed systems (ICPADS’11). IEEE Computer Society, Washington, DC, pp 157–164. doi:10.1109/ICPADS.2011.130
Cecilia JM, García JM, Ujaldon M (2009) The GPU on the matrix–matrix multiply: performance study and contributions. In: Chapman BM, Desprez F, Joubert GR, Lichnewsky A, Peters FJ, Priol T (eds) PARCO, vol. 19 of advances in parallel computing. IOS Press, pp 331–340. http://dblp.uni-trier.de/db/conf/parco/parco2009.html. Accessed 15 Jan 2015
Cecilia JM, García JM, Ujaldon M (2009) The GPU on the matrix-matrix multiply: performance study and contributions. In: Parallel computing: from multicores and GPU’s to petascale, proceedings of the conference ParCo 2009, 1–4 September 2009, Lyon, France. pp 331–340. doi:10.3233/978-1-60750-530-3-331.http://dblp.uni-trier.de/rec/bib/conf/parco/CeciliaGU09. Accessed 15 Jan 2015
Athil T, Christian R, Reddy YB (2014) Cuda memory techniques for matrix multiplication on quadro 4000. In: Proceedings of the 2014 11th international conference on information technology: new generations (ITNG’14). IEEE Computer Society, Washington, DC, pp 419–425. doi:10.1109/ITNG.2014.24
Djinevski L, Arsenovski S, Ristov S, Gusev M (2013) Performance drawbacks for matrix multiplication using set associative cache in GPU devices. In: 36th international convention on information and communication technology electronics and microelectronics (MIPRO). IEEE, p 193198
Matsumoto K, Nakasato N, Sakai T, Yahagi H, Sedukhin SG (2011) Multi-level optimization of matrix multiplication for GPU-equipped systems. Procedia Comput Sci 4(0):342–351. doi:10.1016/j.procs.2011.04.036. http://www.sciencedirect.com/science/article/pii/S1877050911000949. (Proceedings of the international conference on computational science, ICCS’11)

Download references

Author information

Authors and Affiliations

Electrical and Computer Engineering Department, University of Patras, Patras, Greece
Vasilios Kelefouras, Iosif Mporas & Vasilios Kolonias
Education and Research Department Computer Science and Electrical Engineering, University Rennes 1 - IRISA/INRIA, Rennes, France
A. Kritikakou

Authors

Vasilios Kelefouras
View author publications
You can also search for this author in PubMed Google Scholar
A. Kritikakou
View author publications
You can also search for this author in PubMed Google Scholar
Iosif Mporas
View author publications
You can also search for this author in PubMed Google Scholar
Vasilios Kolonias
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vasilios Kelefouras.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kelefouras, V., Kritikakou, A., Mporas, I. et al. A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures. J Supercomput 72, 804–844 (2016). https://doi.org/10.1007/s11227-015-1613-7

Download citation

Published: 22 January 2016
Issue Date: March 2016
DOI: https://doi.org/10.1007/s11227-015-1613-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Abstract

Access this article

Similar content being viewed by others

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors

Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors

Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Abstract

Access this article

Similar content being viewed by others

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors

Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors

Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation