Using Recursion to Boost ATLAS’s Performance

  • Paolo D’Alberto
  • Alexandru Nicolau
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4759)


We investigate the performance benefits of a novel recursive formulation of Strassen’s algorithm over highly tuned matrix-multiply (MM) routines, such as the widely used ATLAS for high-performance systems.

We combine Strassen’s recursion with a highly tuned version of ATLAS MM and present a family of recursive algorithms that achieve up to 15% speed-up over ATLAS alone. We show experimental results for 7 different systems.
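The general idea of layering Strassen’s recursion on top of a tuned kernel can be sketched as follows. This is a minimal illustration of the classic seven-product Strassen scheme with a cutoff, not the novel formulation the paper presents. The names `base_mm`, `strassen`, and `cutoff` are hypothetical, and the plain triple loop only stands in for a tuned routine such as ATLAS MM. The sketch assumes square matrices stored as lists of rows.

```python
def base_mm(a, b):
    """Plain triple-loop product; a stand-in for a tuned kernel (assumption)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[u + v for u, v in zip(r, s)] for r, s in zip(x, y)]


def sub(x, y):
    """Element-wise difference of two equally sized matrices."""
    return [[u - v for u, v in zip(r, s)] for r, s in zip(x, y)]


def split(m):
    """Split a square matrix of even size into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=2):
    """Multiply square matrices a and b with Strassen's seven-product recursion.

    Falls back to base_mm when the size is at most `cutoff` or is odd.
    """
    n = len(a)
    if n <= cutoff or n % 2:
        return base_mm(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of the eight of the blocked algorithm.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    # Recombine the products into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

In a sketch like this, the cutoff decides where the recursion hands over to the base kernel. A larger cutoff leaves more of the work to the tuned routine, while a smaller one trades multiplications for extra matrix additions.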


Keywords: dense kernels, matrix-matrix product, performance optimizations





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Paolo D’Alberto (1)
  • Alexandru Nicolau (2)
  1. Department of Electrical and Computer Engineering, Carnegie Mellon University
  2. Department of Computer Science, University of California at Irvine
