Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs

  • Hans Henrik Brandenborg Sørensen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7203)


In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of the level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels perform 25% to 100% better than the current CUBLAS 3.2 library.
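The general idea of empirical auto-tuning mentioned in the abstract can be illustrated with a small CPU-only sketch. It is not the paper's CUDA implementation: the function names `snrm2_blocked` and `autotune`, the candidate block sizes, and the timing scheme are all assumptions chosen for illustration. The norm is computed in two stages (per-block partial sums of squares, then a final sum), which loosely mirrors how a GPU reduction assigns one partial result to each thread block, and the tuner simply times every candidate and keeps the fastest.

```python
import math
import time


def snrm2_blocked(x, block_size):
    """Euclidean norm via per-block partial sums of squares, then a final sum.

    CPU stand-in for a blocked reduction; block_size is the tunable parameter.
    """
    partials = [
        sum(v * v for v in x[start:start + block_size])
        for start in range(0, len(x), block_size)
    ]
    return math.sqrt(sum(partials))


def autotune(kernel, x, candidates, repeats=3):
    """Time kernel(x, param) for each candidate; return (best_param, timings)."""
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(x, param)
            # Keep the fastest of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - t0)
        timings[param] = best
    best_param = min(timings, key=timings.get)
    return best_param, timings


x = [float(i % 7) for i in range(10_000)]
candidates = [32, 64, 128, 256, 512]  # hypothetical search space
best_block, timings = autotune(snrm2_blocked, x, candidates)
```

On a real GPU the tuned parameters would typically include quantities such as threads per block and work per thread, and the winning configuration depends on the device and problem size, so the selection here carries no meaning beyond showing the search loop.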


Keywords: GPU, BLAS, Dense linear algebra, Parallel algorithms





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Hans Henrik Brandenborg Sørensen (1)
  1. Informatics and Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark
