Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures

  • Alexander Monakov
  • Anton Lokhmotov
  • Arutyun Avetisyan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5952)


Graphics processors are increasingly used in scientific applications due to their high computational power, which comes from hardware with multiple-level parallelism and memory hierarchy. Sparse matrix computations frequently arise in scientific applications, for example, when solving PDEs on unstructured grids. However, traditional sparse matrix algorithms are difficult to efficiently parallelize for GPUs due to irregular patterns of memory references. In this paper we present a new storage format for sparse matrices that better employs locality, has low memory footprint and enables automatic specialization for various matrices and future devices via parameter tuning. Experimental evaluation demonstrates significant speedups compared to previously published results.


Storage Format Memory Bandwidth Thread Block Sparse Matrice Texture Memory 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (December 2006)Google Scholar
  2. 2.
    Baskaran, M.M., Bordawekar, R.: Optimizing sparse matrix-vector multiplication on GPUs. Technical report, IBM TJ Watson Research Center (2009)Google Scholar
  3. 3.
    Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004 (2008)Google Scholar
  4. 4.
    Buatois, L., Caumon, G., Lévy, B.: Concurrent number cruncher: An efficient sparse linear solver on the GPU. In: Perrott, R., Chapman, B.M., Subhlok, J., de Mello, R.F., Yang, L.T. (eds.) HPCC 2007. LNCS, vol. 4782, pp. 358–371. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  5. 5.
    Kincaid, D.R., Oppe, T.C., Young, D.M.: ITPACKV 2D User’s GuideGoogle Scholar
  6. 6.
    Monakov, A., Avetisyan, A.: Implementing blocked sparse matrix-vector multiplication on NVIDIA GPUs. In: SAMOS, pp. 289–297 (2009)Google Scholar
  7. 7.
    NVIDIA Corporation. NVIDIA CUDA Programming Guide 2.2 (2009)Google Scholar
  8. 8.
    Vázquez, F., Garzón, E.M., Martnez, J.A., Fernández, J.J.: The sparse matrix vector product on GPUs. Technical report, University of Almeria (2009)Google Scholar
  9. 9.
    Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pp. 1–11. IEEE Press, Los Alamitos (2008)Google Scholar
  10. 10.
    Vuduc, R.W.: Automatic performance tuning of sparse matrix kernels, PhD thesis, University of California, Berkeley (2003); Chair-Demmel, J.W.Google Scholar
  11. 11.
    Williams, S., Oliker, L., Vuduc, R.W., Shalf, J., Yelick, K.A., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In: SC, p. 38 (2007)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Alexander Monakov
    • 1
  • Anton Lokhmotov
    • 2
  • Arutyun Avetisyan
    • 1
  1. 1.Institute for System Programming of RASMoscowRussian Federation
  2. 2.Department of ComputingImperial College LondonLondonUnited Kingdom

Personalised recommendations