Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T A x

  • Richard Vuduc
  • Attila Gyulassy
  • James W. Demmel
  • Katherine A. Yelick
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2659)


This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation y = A^T A x, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
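One way to read "brings A through the memory hierarchy only once" is through the identity A^T A x = Σ_i a_i (a_i^T x), where a_i^T is the i-th row of A: each row can be used for a dot product with x and then immediately for an update of y. The following is a minimal sketch of that idea, assuming A is stored in compressed sparse row (CSR) form. The function and parameter names (`ata_x_fused`, `ata_x_two_pass`, `row_ptr`, `col_idx`, `values`) are illustrative assumptions and not taken from the paper, and the register blocking optimization mentioned in the abstract is not shown.

```python
def ata_x_fused(n_cols, row_ptr, col_idx, values, x):
    """Sketch: compute y = A^T (A x) in one pass over the rows of a CSR matrix.

    Assumed (hypothetical) layout: row i owns the entries
    values[row_ptr[i]:row_ptr[i + 1]] with column indices in col_idx.
    """
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Step 1: t = a_i^T x, the dot product of row i with x.
        t = 0.0
        for k in range(start, end):
            t += values[k] * x[col_idx[k]]
        # Step 2: y += t * a_i, reusing row i right after it was read.
        for k in range(start, end):
            y[col_idx[k]] += t * values[k]
    return y


def ata_x_two_pass(n_cols, row_ptr, col_idx, values, x):
    """Reference version: t = A x, then y = A^T t, reading A twice."""
    n_rows = len(row_ptr) - 1
    t = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            t[i] += values[k] * x[col_idx[k]]
    y = [0.0] * n_cols
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * t[i]
    return y


# Small example: A = [[1, 0, 2], [0, 3, 0]] and x = [1, 1, 1].
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0]
y_fused = ata_x_fused(3, row_ptr, col_idx, values, x)
y_ref = ata_x_two_pass(3, row_ptr, col_idx, values, x)
```

Both functions perform the same arithmetic; the fused version differs only in the order of memory accesses, since each row is consumed twice in quick succession instead of in two separate sweeps over the matrix. Whether this reordering pays off on a given machine depends on the matrix and the cache sizes, which is the kind of question the paper's bounds and experiments address.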


Sparse Matrix, Cache Line, Memory Hierarchy, Algorithmic Cache, Cache Capacity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Richard Vuduc (1)
  • Attila Gyulassy (1)
  • James W. Demmel (1)
  • Katherine A. Yelick (1)
  1. Computer Science Division, University of California, Berkeley
