The Journal of Supercomputing

, Volume 72, Issue 3, pp 804–844 | Cite as

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

  • Vasilios Kelefouras
  • A. Kritikakou
  • Iosif Mporas
  • Vasilios Kolonias


Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. The scheduling parameter values selection is a very difficult and time-consuming task, since parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data caches sizes and associativities), giving high-quality solutions and a smaller search space. This methodology refers to a wide range of CPU and GPU architectures.


Matrix–matrix multiplication Data reuse Optimization SIMD Memory hierarchy Loop tiling 


  1. 1.
    Kelefouras VI, Kritikakou A, Papadima E, Goutis CE (2015) A methodology for speeding up matrix vector multiplication for single/multi-core architectures. J Supercomput 71(7):2644–2667CrossRefGoogle Scholar
  2. 2.
    Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1): 21–38MathSciNetGoogle Scholar
  3. 3.
    Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi: 10.1145/2512432 Google Scholar
  4. 4.
    Bacon DF, Graham SL, Oliver, Sharp J (1994) Compiler transformations for high-performance computing. ACM Comput Surv 26:345–420CrossRefGoogle Scholar
  5. 5.
    Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO’01)Google Scholar
  6. 6.
    Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization (CGO’03). IEEE Computer Society, Washington, DC, pp 204–215. Accessed 15 Jan 2015
  7. 7.
    Cooper KD, Subramanian D, Torczon L (2001) Adaptive optimizing compilers for the 21st century. J Supercomput 23:2002zbMATHGoogle Scholar
  8. 8.
    Kisuki T, Knijnenburg PMW, O’Boyle MFP, Bodin F, Wijshoff HAG (1999) A feasibility study in iterative compilation. In: Proceedings of the second international symposium on high performance computing (ISHPC’99). Springer, London, pp 121–132. Accessed 15 Jan 2015
  9. 9.
    Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi: 10.1145/1509864.1509865 CrossRefGoogle Scholar
  10. 10.
    Kulkarni P, Hines S, Hiser J, Whalley D, Davidson J, Jones D (2004) Fast searches for effective optimization phase sequences. SIGPLAN Not 39(6):171–182. doi: 10.1145/996893.996863 CrossRefGoogle Scholar
  11. 11.
    Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems (CASES’11). ACM, New York, pp 65–74. doi: 10.1145/2038698.2038711
  12. 12.
    Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications (AIMSA’02). Springer, London, pp 41–50. Accessed 15 Jan 2015
  13. 13.
    Stephenson M, Amarasinghe S, Martin M, O’Reilly U-M (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90. doi: 10.1145/780822.781141 CrossRefGoogle Scholar
  14. 14.
    Tartara M, Crespi Reghizzi S (2013) Continuous learning of compiler heuristics. ACM Trans Archit Code Optim 9(4):46:1–46:25. doi: 10.1145/2400682.2400705 CrossRefGoogle Scholar
  15. 15.
    Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization (CGO’06). IEEE Computer Society, Washington, DC, pp 295–305. doi: 10.1109/CGO.2006.37
  16. 16.
    Whaley RC (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw Pract Exp 35(2):101–121CrossRefGoogle Scholar
  17. 17.
    Openblas (2012) An optimized blas library. Accessed 15 Jan 2015
  18. 18.
    Guennebaud G, Jacob B et al (2010) Eigen v3. Accessed 15 Jan 2015
  19. 19.
    Intel (2012) Intel mkl. Accessed 15 Jan 2015
  20. 20.
    Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing, ACM SIGARC, ViennaGoogle Scholar
  21. 21.
    Goto K, van de Geijn RA (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25. doi: 10.1145/1356052.1356053 MathSciNetCrossRefzbMATHGoogle Scholar
  22. 22.
    Van Zee FG, Van De Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14-1–14-33Google Scholar
  23. 23.
    Smith TM, van de Geijn RA, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: 2014 IEEE 28th international parallel and distributed processing symposium, Phoenix, pp 1049–1059. doi: 10.1109/IPDPS.2014.110
  24. 24.
    Yotov K, Li X, Ren G, Garzaran M, Padua D, Pingali K, Stodghill P (2005) Is search really necessary to generate high-performance blas? Proc IEEE 93(2):358–386. (Special issue on “Program generation, optimization, and adaptation”)Google Scholar
  25. 25.
    Volkov V, Demmel JW (2008) Benchmarking gpus to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE conference on supercomputing (SC’08). IEEE Press, Piscataway, pp 31:1–31:11. Accessed 15 Jan 2015
  26. 26.
    Nath R, Tomov S, Dongarra J (2010) An improved Magma Gemm for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511515. doi: 10.1177/1094342010385729 CrossRefGoogle Scholar
  27. 27.
    Kelefouras VI, Kritikakou A, Goutis C (2014) A matrix–matrix multiplication methodology for single/multi-core architectures using SIMD, supercomputing. Springer, New York. doi: 10.1007/s11227-014-1098-9
  28. 28.
    Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not 42(6):89–100. doi: 10.1145/1273442.1250746 CrossRefGoogle Scholar
  29. 29.
    Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. doi: 10.1145/2024716.2024718 CrossRefGoogle Scholar
  30. 30.
    Austin T, Larson E, Ernst D (2002) Simplescalar: an infrastructure for computer system modeling. Computer 35:59–67. doi: 10.1109/2.982917 CrossRefGoogle Scholar
  31. 31.
    Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805CrossRefGoogle Scholar
  32. 32.
    Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box, pp 349–357Google Scholar
  33. 33.
    Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62CrossRefzbMATHGoogle Scholar
  34. 34.
    Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192CrossRefGoogle Scholar
  35. 35.
    Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11CrossRefGoogle Scholar
  36. 36.
    Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114CrossRefGoogle Scholar
  37. 37.
    Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432CrossRefGoogle Scholar
  38. 38.
    Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269CrossRefzbMATHGoogle Scholar
  39. 39.
    Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11CrossRefGoogle Scholar
  40. 40.
    Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(02):201–229CrossRefGoogle Scholar
  41. 41.
    Arabnia H (1995) A distributed stereocorrelation algorithm. In: Proceedings of fourth international conference on computer communications and networks. IEEE, pp 479–482Google Scholar
  42. 42.
    Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35CrossRefzbMATHGoogle Scholar
  43. 43.
    Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: Ninth SIAM conference on parallel processing for scientific computing, CD-ROM proceedingsGoogle Scholar
  44. 44.
    Whaley RC, Dongarra JJ (1998) Automatically tuned linear algebra software. In: Proceedings of the 1998 ACM/IEEE conference on supercomputing, supercomputing’98. IEEE Computer Society, Washington, DC, pp 1–27. Accessed 15 Jan 2015
  45. 45.
    Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Tech. Rep. UT-CS-97-366, University of TennesseeGoogle Scholar
  46. 46.
    Bjørstad P, Manne F, Sørevik T, Vajtersic M (1992) Efficient matrix multiplication on simd computers. SIAM J Matrix Anal Appl 13:386–401MathSciNetCrossRefzbMATHGoogle Scholar
  47. 47.
    Geijn RAVD, Watts J (1997) Summa: scalable universal matrix multiplication algorithm. Tech. repGoogle Scholar
  48. 48.
    Chatterjee S, Lebeck AR, Patnala PK, Thottethodi M (1999) Recursive array layouts and fast parallel matrix multiplication. In: Proceedings of eleventh annual ACM symposium on parallel algorithms and architectures, pp 222–231Google Scholar
  49. 49.
    Moon B, Jagadish HV, Faloutsos C, Saltz JH (2001) Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 13:2001CrossRefGoogle Scholar
  50. 50.
    Thottethodi M, Chatterjee S, Lebeck AR (1998) Tuning strassen’s matrix multiplication for memory efficiency. In: Proceedings of SC98 (CD-ROM)Google Scholar
  51. 51.
    Kulkarni M, Pingali K (2008) An experimental study of self-optimizing dense linear algebra software. Proc IEEE 96(5):832–848CrossRefGoogle Scholar
  52. 52.
    Garcia E, Venetis IE, Khan R, Gao GR (2010) Optimized dense matrix multiplication on a many-core architecture. In: Proceedings of the 16th international Euro-Par conference on parallel processing: part II (Euro-Par’10). Springer, Berlin, pp 316–327. Accessed 15 Jan 2015
  53. 53.
    Choi J (1998) A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. Concurr Pract Exp 10(8):655–670. Accessed 15 Jan 2015
  54. 54.
    Kurzak J, Alvaro W, Dongarra J (2009) Optimizing matrix multiplication for a short-vector simd architecture—cell processor. Parallel Comput 35(3):138–150. doi: 10.1016/j.parco.2008.12.010 CrossRefGoogle Scholar
  55. 55.
    Desprez F, Suter F (2004) Impact of mixed-parallelism on parallel implementations of the strassen and winograd matrix multiplication algorithms: research articles. Concurr Comput Pract Exp 16(8):771–797. doi: 10.1002/cpe.v16:8 CrossRefGoogle Scholar
  56. 56.
    Hattori M, Ito N, Chen W, Wada K (2005) Parallel matrix-multiplication algorithm for distributed parallel computers. Syst Comput Jpn 36(4):48–59. doi: 10.1002/scj.v36:4 CrossRefGoogle Scholar
  57. 57.
    Hunold S, Rauber T, Rünger G (2004) Multilevel hierarchical matrix multiplication on clusters. In: Proceedings of the 18th annual international conference on supercomputing (ICS’04). ACM, New York, pp 136–145. doi: 10.1145/1006209.1006230
  58. 58.
    Krishnan M, Nieplocha J (2006) Memory efficient parallel matrix multiplication operation for irregular problems. In: Proceedings of the 3rd conference on computing frontiers (CF’06). ACM, New York, pp 229–240. doi: 10.1145/1128022.1128054
  59. 59.
    Hunold S, Rauber T (2005) Automatic tuning of pdgemm towards optimal performance. In: Proceedings of the 11th international Euro-Par conference on parallel processing (Euro-Par’05). Springer, Berlin, pp 837–846. doi: 10.1007/11549468_91
  60. 60.
    Desprez F, Suter F (2002) Impact of mixed-parallelism on parallel implementations of Strassen and Winograd matrix multiplication algorithms. Rapport de recherche RR-4482, INRIA. Accessed 15 Jan 2015
  61. 61.
    Tsilikas G, Fleury M (2004) Matrix multiplication performance on commodity shared-memory multiprocessors. In: Proceedings of the international conference on parallel computing in electrical engineering (PARELEC’04). IEEE Computer Society, Washington, DC, pp 13–18. doi: 10.1109/PARELEC.2004.43
  62. 62.
    Rünger G, Schwind M (2010) Fast recursive matrix multiplication for multi-core architectures. Procedia Comput Sci 1(1):67–76. [International conference on computational science 2010 (ICCS’10)]Google Scholar
  63. 63.
    Krishnan M, Nieplocha J (2004) Srumma: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems. Parallel Distrib Process Symp Int 1:70b. doi: 10.1109/IPDPS.2004.1303000. Accessed 15 Jan 2015
  64. 64.
    Strassen V (1969) Gaussian elimination is not optimal. Numer Math 14(3):354–356MathSciNetCrossRefzbMATHGoogle Scholar
  65. 65.
    Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. SIGPLAN Not 30(8):207–216. doi: 10.1145/209937.209958 CrossRefGoogle Scholar
  66. 66.
    Michaud P (2011) Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view, In: Proceedings of the 6th international conference on high performance and embedded architectures and compilers (HiPEAC’11). ACM, New York, pp 187–196. doi: 10.1145/1944862.1944890
  67. 67.
    Nikolopoulos DS (2003) Code and data transformations for improving shared cache performance on smt processors. In: ISHPC, pp 54–69Google Scholar
  68. 68.
    Kurzak J, Tomov S, Dongarra J (2012) Autotuning gemm kernels for the fermi gpu. 23(11):2045–2057. Accessed 15 Jan 2015
  69. 69.
    Jiang C, Snir M (2005) Automatic tuning matrix multiplication performance on graphics hardware. In: Proceesings of the 14th international conference on parallel architecture and compilation techniques (PACT), pp 185–196Google Scholar
  70. 70.
    Li J, Ranka S, Sahni S (2011) Strassen’s matrix multiplication on gpus. In: Proceedings of the 2011 IEEE 17th international conference on parallel and distributed systems (ICPADS’11). IEEE Computer Society, Washington, DC, pp 157–164. doi: 10.1109/ICPADS.2011.130
  71. 71.
    Cecilia JM, García JM, Ujaldon M (2009) The GPU on the matrix–matrix multiply: performance study and contributions. In: Chapman BM, Desprez F, Joubert GR, Lichnewsky A, Peters FJ, Priol T (eds) PARCO, vol. 19 of advances in parallel computing. IOS Press, pp 331–340. Accessed 15 Jan 2015
  72. 72.
    Cecilia JM, García JM, Ujaldon M (2009) The GPU on the matrix-matrix multiply: performance study and contributions. In: Parallel computing: from multicores and GPU’s to petascale, proceedings of the conference ParCo 2009, 1–4 September 2009, Lyon, France. pp 331–340. doi: 10.3233/978-1-60750-530-3-331. Accessed 15 Jan 2015
  73. 73.
    Athil T, Christian R, Reddy YB (2014) Cuda memory techniques for matrix multiplication on quadro 4000. In: Proceedings of the 2014 11th international conference on information technology: new generations (ITNG’14). IEEE Computer Society, Washington, DC, pp 419–425. doi: 10.1109/ITNG.2014.24
  74. 74.
    Djinevski L, Arsenovski S, Ristov S, Gusev M (2013) Performance drawbacks for matrix multiplication using set associative cache in GPU devices. In: 36th international convention on information and communication technology electronics and microelectronics (MIPRO). IEEE, p 193198Google Scholar
  75. 75.
    Matsumoto K, Nakasato N, Sakai T, Yahagi H, Sedukhin SG (2011) Multi-level optimization of matrix multiplication for GPU-equipped systems. Procedia Comput Sci 4(0):342–351. doi: 10.1016/j.procs.2011.04.036. (Proceedings of the international conference on computational science, ICCS’11)

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Vasilios Kelefouras
    • 1
  • A. Kritikakou
    • 2
  • Iosif Mporas
    • 1
  • Vasilios Kolonias
    • 1
  1. 1.Electrical and Computer Engineering DepartmentUniversity of PatrasPatrasGreece
  2. 2.Education and Research Department Computer Science and Electrical EngineeringUniversity Rennes 1 - IRISA/INRIARennesFrance

Personalised recommendations