The Journal of Supercomputing, Volume 50, Issue 1, pp 36–77

Performance evaluation of the sparse matrix-vector multiplication on modern architectures

  • Georgios Goumas
  • Kornilios Kourtis
  • Nikos Anastopoulos
  • Vasileios Karakasis
  • Nectarios Koziris


In this paper, we revisit the performance issues of the widely used sparse matrix-vector multiplication (SpMxV) kernel on modern microarchitectures. Previous scientific work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, a fact that may lead to misguided, and thus unsuccessful, attempts at optimization. To gain insight into the details of SpMxV performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. In addition, we investigate the parallel version of the kernel and report the corresponding performance results and their relation to each architecture’s specific multithreaded configuration. Based on our experiments, we extract useful conclusions that can serve as guidelines for optimizing both the single- and multithreaded versions of the kernel.


Keywords: Sparse matrix-vector multiplication · Multicore architectures · Scientific applications · Performance evaluation





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Georgios Goumas (1)
  • Kornilios Kourtis (1)
  • Nikos Anastopoulos (1)
  • Vasileios Karakasis (1)
  • Nectarios Koziris (1)

  1. Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Greece
