Skip to main content

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures


Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. The scheduling parameter values selection is a very difficult and time-consuming task, since parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data caches sizes and associativities), giving high-quality solutions and a smaller search space. This methodology refers to a wide range of CPU and GPU architectures.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23


  1. Kelefouras VI, Kritikakou A, Papadima E, Goutis CE (2015) A methodology for speeding up matrix vector multiplication for single/multi-core architectures. J Supercomput 71(7):2644–2667

    Article  Google Scholar 

  2. Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1): 21–38

    MathSciNet  Google Scholar 

  3. Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi:10.1145/2512432

    Google Scholar 

  4. Bacon DF, Graham SL, Oliver, Sharp J (1994) Compiler transformations for high-performance computing. ACM Comput Surv 26:345–420

    Article  Google Scholar 

  5. Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO’01)

  6. Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization (CGO’03). IEEE Computer Society, Washington, DC, pp 204–215. Accessed 15 Jan 2015

  7. Cooper KD, Subramanian D, Torczon L (2001) Adaptive optimizing compilers for the 21st century. J Supercomput 23:2002

    MATH  Google Scholar 

  8. Kisuki T, Knijnenburg PMW, O’Boyle MFP, Bodin F, Wijshoff HAG (1999) A feasibility study in iterative compilation. In: Proceedings of the second international symposium on high performance computing (ISHPC’99). Springer, London, pp 121–132. Accessed 15 Jan 2015

  9. Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi:10.1145/1509864.1509865

    Article  Google Scholar 

  10. Kulkarni P, Hines S, Hiser J, Whalley D, Davidson J, Jones D (2004) Fast searches for effective optimization phase sequences. SIGPLAN Not 39(6):171–182. doi:10.1145/996893.996863

    Article  Google Scholar 

  11. Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems (CASES’11). ACM, New York, pp 65–74. doi:10.1145/2038698.2038711

  12. Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications (AIMSA’02). Springer, London, pp 41–50. Accessed 15 Jan 2015

  13. Stephenson M, Amarasinghe S, Martin M, O’Reilly U-M (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90. doi:10.1145/780822.781141

    Article  Google Scholar 

  14. Tartara M, Crespi Reghizzi S (2013) Continuous learning of compiler heuristics. ACM Trans Archit Code Optim 9(4):46:1–46:25. doi:10.1145/2400682.2400705

    Article  Google Scholar 

  15. Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization (CGO’06). IEEE Computer Society, Washington, DC, pp 295–305. doi:10.1109/CGO.2006.37

  16. Whaley RC (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw Pract Exp 35(2):101–121

    Article  Google Scholar 

  17. Openblas (2012) An optimized blas library. Accessed 15 Jan 2015

  18. Guennebaud G, Jacob B et al (2010) Eigen v3. Accessed 15 Jan 2015

  19. Intel (2012) Intel mkl. Accessed 15 Jan 2015

  20. Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing, ACM SIGARC, Vienna

  21. Goto K, van de Geijn RA (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25. doi:10.1145/1356052.1356053

    Article  MathSciNet  MATH  Google Scholar 

  22. Van Zee FG, Van De Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14-1–14-33

  23. Smith TM, van de Geijn RA, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: 2014 IEEE 28th international parallel and distributed processing symposium, Phoenix, pp 1049–1059. doi:10.1109/IPDPS.2014.110

  24. Yotov K, Li X, Ren G, Garzaran M, Padua D, Pingali K, Stodghill P (2005) Is search really necessary to generate high-performance blas? Proc IEEE 93(2):358–386. (Special issue on “Program generation, optimization, and adaptation”)

  25. Volkov V, Demmel JW (2008) Benchmarking gpus to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE conference on supercomputing (SC’08). IEEE Press, Piscataway, pp 31:1–31:11. Accessed 15 Jan 2015

  26. Nath R, Tomov S, Dongarra J (2010) An improved Magma Gemm for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511515. doi:10.1177/1094342010385729

    Article  Google Scholar 

  27. Kelefouras VI, Kritikakou A, Goutis C (2014) A matrix–matrix multiplication methodology for single/multi-core architectures using SIMD, supercomputing. Springer, New York. doi:10.1007/s11227-014-1098-9

  28. Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not 42(6):89–100. doi:10.1145/1273442.1250746

    Article  Google Scholar 

  29. Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. doi:10.1145/2024716.2024718

    Article  Google Scholar 

  30. Austin T, Larson E, Ernst D (2002) Simplescalar: an infrastructure for computer system modeling. Computer 35:59–67. doi:10.1109/2.982917

    Article  Google Scholar 

  31. Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805

    Article  Google Scholar 

  32. Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box, pp 349–357

  33. Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62

    Article  MATH  Google Scholar 

  34. Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192

    Article  Google Scholar 

  35. Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11

    Article  Google Scholar 

  36. Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114

    Article  Google Scholar 

  37. Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432

    Article  Google Scholar 

  38. Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269

    Article  MATH  Google Scholar 

  39. Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11

    Article  Google Scholar 

  40. Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(02):201–229

    Article  Google Scholar 

  41. Arabnia H (1995) A distributed stereocorrelation algorithm. In: Proceedings of fourth international conference on computer communications and networks. IEEE, pp 479–482

  42. Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35

    Article  MATH  Google Scholar 

  43. Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: Ninth SIAM conference on parallel processing for scientific computing, CD-ROM proceedings

  44. Whaley RC, Dongarra JJ (1998) Automatically tuned linear algebra software. In: Proceedings of the 1998 ACM/IEEE conference on supercomputing, supercomputing’98. IEEE Computer Society, Washington, DC, pp 1–27. Accessed 15 Jan 2015

  45. Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Tech. Rep. UT-CS-97-366, University of Tennessee

  46. Bjørstad P, Manne F, Sørevik T, Vajtersic M (1992) Efficient matrix multiplication on simd computers. SIAM J Matrix Anal Appl 13:386–401

    Article  MathSciNet  MATH  Google Scholar 

  47. Geijn RAVD, Watts J (1997) Summa: scalable universal matrix multiplication algorithm. Tech. rep

  48. Chatterjee S, Lebeck AR, Patnala PK, Thottethodi M (1999) Recursive array layouts and fast parallel matrix multiplication. In: Proceedings of eleventh annual ACM symposium on parallel algorithms and architectures, pp 222–231

  49. Moon B, Jagadish HV, Faloutsos C, Saltz JH (2001) Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 13:2001

    Article  Google Scholar 

  50. Thottethodi M, Chatterjee S, Lebeck AR (1998) Tuning strassen’s matrix multiplication for memory efficiency. In: Proceedings of SC98 (CD-ROM)

  51. Kulkarni M, Pingali K (2008) An experimental study of self-optimizing dense linear algebra software. Proc IEEE 96(5):832–848

    Article  Google Scholar 

  52. Garcia E, Venetis IE, Khan R, Gao GR (2010) Optimized dense matrix multiplication on a many-core architecture. In: Proceedings of the 16th international Euro-Par conference on parallel processing: part II (Euro-Par’10). Springer, Berlin, pp 316–327. Accessed 15 Jan 2015

  53. Choi J (1998) A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. Concurr Pract Exp 10(8):655–670. Accessed 15 Jan 2015

  54. Kurzak J, Alvaro W, Dongarra J (2009) Optimizing matrix multiplication for a short-vector simd architecture—cell processor. Parallel Comput 35(3):138–150. doi:10.1016/j.parco.2008.12.010

    Article  Google Scholar 

  55. Desprez F, Suter F (2004) Impact of mixed-parallelism on parallel implementations of the strassen and winograd matrix multiplication algorithms: research articles. Concurr Comput Pract Exp 16(8):771–797. doi:10.1002/cpe.v16:8

    Article  Google Scholar 

  56. Hattori M, Ito N, Chen W, Wada K (2005) Parallel matrix-multiplication algorithm for distributed parallel computers. Syst Comput Jpn 36(4):48–59. doi:10.1002/scj.v36:4

    Article  Google Scholar 

  57. Hunold S, Rauber T, Rünger G (2004) Multilevel hierarchical matrix multiplication on clusters. In: Proceedings of the 18th annual international conference on supercomputing (ICS’04). ACM, New York, pp 136–145. doi:10.1145/1006209.1006230

  58. Krishnan M, Nieplocha J (2006) Memory efficient parallel matrix multiplication operation for irregular problems. In: Proceedings of the 3rd conference on computing frontiers (CF’06). ACM, New York, pp 229–240. doi:10.1145/1128022.1128054

  59. Hunold S, Rauber T (2005) Automatic tuning of pdgemm towards optimal performance. In: Proceedings of the 11th international Euro-Par conference on parallel processing (Euro-Par’05). Springer, Berlin, pp 837–846. doi:10.1007/11549468_91

  60. Desprez F, Suter F (2002) Impact of mixed-parallelism on parallel implementations of Strassen and Winograd matrix multiplication algorithms. Rapport de recherche RR-4482, INRIA. Accessed 15 Jan 2015

  61. Tsilikas G, Fleury M (2004) Matrix multiplication performance on commodity shared-memory multiprocessors. In: Proceedings of the international conference on parallel computing in electrical engineering (PARELEC’04). IEEE Computer Society, Washington, DC, pp 13–18. doi:10.1109/PARELEC.2004.43

  62. Rünger G, Schwind M (2010) Fast recursive matrix multiplication for multi-core architectures. Procedia Comput Sci 1(1):67–76. [International conference on computational science 2010 (ICCS’10)]

  63. Krishnan M, Nieplocha J (2004) Srumma: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems. Parallel Distrib Process Symp Int 1:70b. doi:10.1109/IPDPS.2004.1303000. Accessed 15 Jan 2015

  64. Strassen V (1969) Gaussian elimination is not optimal. Numer Math 14(3):354–356

    Article  MathSciNet  MATH  Google Scholar 

  65. Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. SIGPLAN Not 30(8):207–216. doi:10.1145/209937.209958

    Article  Google Scholar 

  66. Michaud P (2011) Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view, In: Proceedings of the 6th international conference on high performance and embedded architectures and compilers (HiPEAC’11). ACM, New York, pp 187–196. doi:10.1145/1944862.1944890

  67. Nikolopoulos DS (2003) Code and data transformations for improving shared cache performance on smt processors. In: ISHPC, pp 54–69

  68. Kurzak J, Tomov S, Dongarra J (2012) Autotuning gemm kernels for the fermi gpu. 23(11):2045–2057. Accessed 15 Jan 2015

  69. Jiang C, Snir M (2005) Automatic tuning matrix multiplication performance on graphics hardware. In: Proceesings of the 14th international conference on parallel architecture and compilation techniques (PACT), pp 185–196

  70. Li J, Ranka S, Sahni S (2011) Strassen’s matrix multiplication on gpus. In: Proceedings of the 2011 IEEE 17th international conference on parallel and distributed systems (ICPADS’11). IEEE Computer Society, Washington, DC, pp 157–164. doi:10.1109/ICPADS.2011.130

  71. Cecilia JM, García JM, Ujaldon M (2009) The GPU on the matrix–matrix multiply: performance study and contributions. In: Chapman BM, Desprez F, Joubert GR, Lichnewsky A, Peters FJ, Priol T (eds) PARCO, vol. 19 of advances in parallel computing. IOS Press, pp 331–340. Accessed 15 Jan 2015

  72. Cecilia JM, García JM, Ujaldon M (2009) The GPU on the matrix-matrix multiply: performance study and contributions. In: Parallel computing: from multicores and GPU’s to petascale, proceedings of the conference ParCo 2009, 1–4 September 2009, Lyon, France. pp 331–340. doi:10.3233/978-1-60750-530-3-331. Accessed 15 Jan 2015

  73. Athil T, Christian R, Reddy YB (2014) Cuda memory techniques for matrix multiplication on quadro 4000. In: Proceedings of the 2014 11th international conference on information technology: new generations (ITNG’14). IEEE Computer Society, Washington, DC, pp 419–425. doi:10.1109/ITNG.2014.24

  74. Djinevski L, Arsenovski S, Ristov S, Gusev M (2013) Performance drawbacks for matrix multiplication using set associative cache in GPU devices. In: 36th international convention on information and communication technology electronics and microelectronics (MIPRO). IEEE, p 193198

  75. Matsumoto K, Nakasato N, Sakai T, Yahagi H, Sedukhin SG (2011) Multi-level optimization of matrix multiplication for GPU-equipped systems. Procedia Comput Sci 4(0):342–351. doi:10.1016/j.procs.2011.04.036. (Proceedings of the international conference on computational science, ICCS’11)

Download references

Author information

Authors and Affiliations


Corresponding author

Correspondence to Vasilios Kelefouras.

Rights and permissions

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kelefouras, V., Kritikakou, A., Mporas, I. et al. A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures. J Supercomput 72, 804–844 (2016).

Download citation

  • Published:

  • Issue Date:

  • DOI: