Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs

  • Hans Henrik Brandenborg Sørensen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7203)


In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidian norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display between 25-100% better performance than the current CUBLAS 3.2 library.


GPU BLAS Dense linear algebra Parallel algorithms 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Hans Henrik Brandenborg Sørensen
    • 1
  1. 1.Informatics and Mathematical ModellingTechnical University of DenmarkLyngbyDenmark

Personalised recommendations