
CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms


Matrix computation is a core component of machine learning and artificial intelligence, and fast matrix computation can greatly accelerate many large-scale computational projects. The Basic Linear Algebra Subprograms (BLAS) classify matrix operations into levels and provide a standardized interface for them. The most commonly used heterogeneous computing platforms are the central processing unit (CPU) and the graphics processing unit (GPU), and BLAS has been implemented on both. However, because algorithms and hardware have different characteristics, a matrix routine should be designed for a particular processor, and it is therefore important to choose the right processor for a given matrix computation. This paper first briefly reviews BLAS, then introduces the architectures and optimization methods of the CPU and GPU. The behavior of different BLAS subroutines is studied through experiments. Finally, we discuss the reasons for the observed differences and propose a processor selection scheme for matrix computations.
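As an illustrative sketch (not from the paper itself), the three BLAS levels can be demonstrated from Python: NumPy dispatches its matrix products to whatever CPU BLAS it was built against (e.g., OpenBLAS or MKL), while a GPU BLAS such as cuBLAS exposes the same GEMM-style interface. The matrix size `n` below is an arbitrary choice for the example.

```python
# Sketch of the three BLAS levels via NumPy, which forwards these
# operations to the underlying CPU BLAS library.
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

C = A @ B            # Level 3 (dgemm): matrix-matrix, O(n^3) flops
y = A @ B[:, 0]      # Level 2 (dgemv): matrix-vector, O(n^2) flops
s = A[0] @ B[:, 0]   # Level 1 (ddot):  vector-vector, O(n) flops

# GEMM performs 2*n^3 flops on O(n^2) data; this high flop-to-memory
# ratio is why Level-3 routines benefit most from GPUs and from
# cache-blocked CPU kernels, while Level-1/2 routines are memory-bound.
print(C.shape, y.shape, s.shape if hasattr(s, "shape") else ())
```

The consistency checks are simple: `y` equals the first column of `C`, and `s` equals its top-left entry.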







This research was supported in part by NSFC under Grant Nos. 61572158 and 61602132, and by the Shenzhen Science and Technology Program under Grant Nos. JSGG20150512145714247, JCYJ20160330163900579 and JCYJ20170413105929681. The manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part.

Author information



Corresponding authors

Correspondence to Feng Li or Xiaofeng Zhang.

Ethics declarations

Conflict of interest

The authors declare that no conflict of interest exists in the submission of this manuscript.


About this article


Cite this article

Li, F., Ye, Y., Tian, Z. et al. CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms. Neural Comput & Applic 31, 4353–4365 (2019).

