A methodology for speeding up matrix vector multiplication for single/multi-core architectures

Kelefouras, Vasilios; Kritikakou, Angeliki; Papadima, Elissavet; Goutis, Costas

doi:10.1007/s11227-015-1409-9

Published: 29 March 2015

The Journal of Supercomputing, Volume 71, pages 2644–2667 (2015)

Vasilios Kelefouras¹,
Angeliki Kritikakou²,
Elissavet Papadima¹ &
Costas Goutis¹

Abstract

In this paper, a new methodology for computing the Dense Matrix Vector Multiplication is presented, targeting both embedded processors (without a SIMD unit) and general purpose processors (single and multi-core, with a SIMD unit). The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedup from 1.2 up to 1.45). This is achieved by fully exploiting the combination of the software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity), which are considered simultaneously as one problem and not separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of (i) the number of data cache levels; (ii) the data cache sizes; (iii) the data cache associativities; (iv) the data cache and main memory latencies; (v) the data array layout of the matrix; and (vi) the number of cores.
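
To make the idea of a cache-dependent schedule concrete, the following is a minimal sketch of a column-tiled dense matrix vector multiplication in C. It is not the schedule produced by the methodology; it only illustrates how one software parameter (reuse of the vector x) can be tied to one hardware parameter (data cache size). The function name `mvm_tiled`, the constant `TILE_J`, and its value are assumptions made for illustration, and a row-major matrix layout is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile width: the number of x elements meant to stay resident
 * in the L1 data cache while all rows are swept. The value is an assumption
 * for illustration, not a figure taken from the paper. */
#define TILE_J 1024

/* Computes y = A * x for an n_rows x n_cols matrix A stored row-major. */
void mvm_tiled(const float *A, const float *x, float *y,
               size_t n_rows, size_t n_cols)
{
    assert(A != NULL && x != NULL && y != NULL);

    for (size_t i = 0; i < n_rows; ++i)
        y[i] = 0.0f;

    /* Outer loop walks over column tiles, so each tile of x is loaded once
     * and then reused by every row before the next tile is touched. */
    for (size_t jj = 0; jj < n_cols; jj += TILE_J) {
        size_t j_end = (jj + TILE_J < n_cols) ? jj + TILE_J : n_cols;

        for (size_t i = 0; i < n_rows; ++i) {
            const float *row = A + i * n_cols;
            float acc = 0.0f; /* partial dot product for this tile */

            for (size_t j = jj; j < j_end; ++j)
                acc += row[j] * x[j];

            y[i] += acc;
        }
    }
}
```

In a sketch like this, a different cache size or matrix layout would call for a different `TILE_J` or loop order, which is the kind of dependence between schedule and hardware parameters that the abstract describes.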

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Vasilios Kelefouras, Elissavet Papadima & Costas Goutis
Education and Research Department in Computer science and Electrical Engineering, University of Rennes 1-IRISA/INRIA, Rennes, France
Angeliki Kritikakou

Corresponding author

Correspondence to Vasilios Kelefouras.

About this article

Cite this article

Kelefouras, V., Kritikakou, A., Papadima, E. et al. A methodology for speeding up matrix vector multiplication for single/multi-core architectures. J Supercomput 71, 2644–2667 (2015). https://doi.org/10.1007/s11227-015-1409-9

Published: 29 March 2015
Issue Date: July 2015
DOI: https://doi.org/10.1007/s11227-015-1409-9
