Matrix computation is a core component of machine learning and artificial intelligence, and fast matrix computation can greatly accelerate many large-scale computational projects. The Basic Linear Algebra Subprograms (BLAS) were proposed to classify different kinds of matrices and provide a standardized interface for operating on them. Currently, the most commonly used processors in heterogeneous computing platforms are the central processing unit (CPU) and the graphics processing unit (GPU), and BLAS has been implemented on both. However, because algorithms and hardware differ in their characteristics, a matrix method should be designed for a particular processor, and it is important to choose the right processor for a given matrix computation. This paper first briefly reviews BLAS and then introduces the architecture and optimization methods of the CPU and GPU. The performance of different BLAS subroutines is studied through experiments. Finally, we discuss the reasons behind the results and a processor selection scheme for matrix computations.
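As a rough illustration of the standardized interface mentioned above, BLAS routines are conventionally grouped into three levels: vector-vector (Level 1), matrix-vector (Level 2) and matrix-matrix (Level 3) operations. The sketch below shows one representative operation from each level in plain Python. The function names follow the usual BLAS naming (axpy, gemv, gemm), but they are hypothetical teaching helpers written for this example, not calls into any BLAS library, and they say nothing about the performance measured in the paper.

```python
# Illustrative pure-Python versions of one routine from each BLAS level.
# The names mirror conventional BLAS names, but these are hypothetical
# helpers for explanation only, not bindings to a real BLAS library.

def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y, O(n) work."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(alpha, A, x, beta, y):
    """Level 2 (matrix-vector): return alpha*(A x) + beta*y, O(n^2) work."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]

def gemm(alpha, A, B, beta, C):
    """Level 3 (matrix-matrix): return alpha*(A B) + beta*C, O(n^3) work."""
    cols_B = list(zip(*B))  # columns of B, so each entry is a dot product
    return [[alpha * sum(a * b for a, b in zip(row, col)) + beta * c
             for col, c in zip(cols_B, c_row)]
            for row, c_row in zip(A, C)]

if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[0.0, 1.0], [1.0, 0.0]]
    x = [1.0, 1.0]
    y = [10.0, 20.0]
    zero = [[0.0, 0.0], [0.0, 0.0]]
    print(axpy(2.0, x, y))              # Level 1
    print(gemv(1.0, A, x, 0.0, y))      # Level 2
    print(gemm(1.0, A, B, 0.0, zero))   # Level 3
```

The amount of arithmetic per data element grows from Level 1 to Level 3, which is one reason the same processor can behave very differently across subroutines.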
This research was supported in part by NSFC under Grant Nos. 61572158 and 61602132, and by the Shenzhen Science and Technology Program under Grant Nos. JSGG20150512145714247, JCYJ20160330163900579 and JCYJ20170413105929681. The manuscript has been approved by all listed authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part.
Conflict of interest
No conflict of interest exists in the submission of this manuscript.
About this article
Cite this article
Li, F., Ye, Y., Tian, Z. et al. CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms. Neural Comput & Applic 31, 4353–4365 (2019). https://doi.org/10.1007/s00521-018-3354-z