Cache Blocking for Linear Algebra Algorithms

  • Fred G. Gustavson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7203)


We briefly describe Cache Blocking for Dense Linear Algebra Algorithms on computer architectures since about 1985. Before that one had uniform memory architectures. The Cray I machine was the last holdout. We cover the where, when, what, how and why of Cache Blocking. Almost all computer manufacturers have recently (about seven years ago) dramatically changed their computer architectures to produce Multicore (MC) processors. It will be seen that the arrangement in memory of the submatrices A ij of A is a critical factor for obtaining high performance. From a practical point of view, this work is very important as it will allow existing codes using LAPACK and ScaLAPACK to remain usable by new versions of LAPACK and ScaLAPACK.


Dimension Theory Rectangular Block Cache Blocking Generalize Data Structure Basic Linear Algebra Subprogram 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Agarwal, R.C., Cooley, J.W., Gustavson, F.G., Shearer, J.B., Slishman, G., Tuckerman, B.: New scalar and vector elementary functions for the IBM System/370. IBM Journal of Research and Development 30(2), 126–144 (1986)MathSciNetCrossRefGoogle Scholar
  2. 2.
    Agarwal, R.C., Gustavson, F.G.: A Parallel Implementation of Matrix Multiplication and LU factorization on the IBM 3090. In: Wright, M. (ed.) Proceedings of the IFIP WG 2.5 on Aspects of Computation on Asynchronous Parallel Processors, Stanford CA, pp. 217–221. North Holland (August 1988)Google Scholar
  3. 3.
    Agarwal, R.C., Gustavson, F.G., Zubair, M.: Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of Research and Development 38(5), 563–576 (1994)CrossRefGoogle Scholar
  4. 4.
    Agarwal, R.C., Gustavson, F.G., Zubair, M.: A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM J. R. & D. 38(6), 673–681 (1994); See also IBM RC 18694 with dates 8/5/92 & 8/10/92 & 2/8/93CrossRefGoogle Scholar
  5. 5.
    Andersen, B.S., Gunnels, J.A., Gustavson, F.G., Reid, J.K., Waśniewski, J.: A Fully Portable High Performance Minimal Storage Hybrid Cholesky Algorithm. ACM TOMS 31(2), 201–227 (2005)zbMATHCrossRefGoogle Scholar
  6. 6.
    Anderson, E., et al.: LAPACK Users’ Guide Release 3.0. SIAM, Philadelphia (1999)CrossRefGoogle Scholar
  7. 7.
    Blackford, L.S., et al.: ScaLAPACK Users’ Guide. SIAM, Philadelphia (1997)zbMATHCrossRefGoogle Scholar
  8. 8.
    Bilmes, J., Asanovic, K., Whye Chin, C., Demmel, J.: Optimizing Matrix Multiply Using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology. In: Proceedings of International Conference on Supercomputing, Vienna, Austria (1997)Google Scholar
  9. 9.
    Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class of parallel tiled linear algorithms for MC architectures. Parallel Comput. 35(1), 38–53 (2009)MathSciNetCrossRefGoogle Scholar
  10. 10.
    Chan, E., Quintana-orti, E.S., Quintana-orti, G., Van De Geijn, R.: Super-Matrix Out-of-Core Scheduling of Matrix Operations for SMP and Multi-Core Architectures. In: SPAA 2007, June 9-11, pp. 116–125 (2007)Google Scholar
  11. 11.
    Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.: A Set of Level 3 Basic Linear Algebra Subprograms. TOMS 16(1), 1–17 (1990)zbMATHCrossRefGoogle Scholar
  12. 12.
    Elmroth, E., Gustavson, F.G., Jonsson, I., Kågström, B.: Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review 46(1), 3–45 (2004)MathSciNetzbMATHCrossRefGoogle Scholar
  13. 13.
    Gallivan, K., Jalby, W., Meier, U., Sameh, A.: The Impact of Hierarchical Memory Systems on Linear Algebra Algorithm Design. International Journal of Supercomputer Applications 2(1), 12–48 (1988)CrossRefGoogle Scholar
  14. 14.
    Golub, G., VanLoan, C.: Matrix Computations, 3rd edn. John Hopkins Press, Baltimore and London (1996)zbMATHGoogle Scholar
  15. 15.
    Gustavson, F.G.: Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms. IBM J. R. & D. 41(6), 737–755 (1997)CrossRefGoogle Scholar
  16. 16.
    Gustavson, F.G.: New Generalized Data Structures for Matrices Lead to a Variety of High-Performance Algorithms. In: Boisvert, R.F., Tang, P.T.P. (eds.) Proceedings of the IFIP WG 2.5 Working Group on The Architecture of Scientific Software, Ottawa, Canada, October 2-4, pp. 211–234. Kluwer Academic Publishers (2000)Google Scholar
  17. 17.
    Gustavson, F.G.: High Performance Linear Algebra Algs. using New Generalized Data Structures for Matrices. IBM J. R. & D. 47(1), 31–55 (2003)MathSciNetCrossRefGoogle Scholar
  18. 18.
    Gustavson, F.G.: New Generalized Data Structures for Matrices Lead to a Variety of High Performance Dense Linear Algebra Algorithms. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds.) PARA 2004. LNCS, vol. 3732, pp. 11–20. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  19. 19.
    Gustavson, F.G., Gunnels, J., Sexton, J.: Minimal Data Copy For Dense Linear Algebra Factorization. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 540–549. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  20. 20.
    Gustavson, F.G., Swirszcz, T.: In-Place Transposition of Rectangular Matrices. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 560–569. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  21. 21.
    Gustavson, F.G.: The Relevance of New Data Structure Approaches for Dense Linear Algebra in the New Multicore/Manycore Environments. IBM Research report RC24599, also, to appear in PARA 2008 Proceeding, 10 pages (2008)Google Scholar
  22. 22.
    Gustavson, F.G., Karlsson, L., Kågström, B.: Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion. ACM TOMS, 34 pages (to appear, 2012)Google Scholar
  23. 23.
    IBM. IBM Engineering and Scientific Subroutine Library. IBM Pub. No. SA22-7272-00 (February 1986); Also, Release II, 1987 & AIX Version 3, Release 3Google Scholar
  24. 24.
    Karlsson, L.: Blocked in-place transposition with application to storage format conversion. Tech. Rep. UMINF 09.01. ISSN 0348-0542, Department of Computing Science, Umeå University, Umeå, Sweden (January 2009)Google Scholar
  25. 25.
    Knuth, D.: The Art of Computer Programming, 3rd edn., vol. 1, 2 & 3. Addison-Wesley (1998)Google Scholar
  26. 26.
    Kurzak, J., Buttari, A., Dongarra, J.: Solving systems of Linear Equations on the Cell Processor using Cholesky Factorization. IEEE Trans. Parallel Distrib. Syst. 19(9), 1175–1186 (2008)CrossRefGoogle Scholar
  27. 27.
    Kurzak, J., Dongarra, J.: Implementation of mixed precision in solving mixed precision of linear equations on the Cell processor: Research Articles. Concurr. Comput.: Pract. Exper. 19(10), 1371–1385 (2007)CrossRefGoogle Scholar
  28. 28.
    Park, N., Hong, B., Prasanna, V.: Tiling, Block Data Layout, and Memory Hierarchy Performance. IEEE Trans. Parallel and Distributed Systems 14(7), 640–654 (2003)CrossRefGoogle Scholar
  29. 29.
    Tietze, H.: Three Dimensions–Higher Dimensions. In: Famous Problems of Mathematics, pp. 106–120. Graylock Press (1965)Google Scholar
  30. 30.
    Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing (1-2), 3–35 (2001)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Fred G. Gustavson
    • 1
    • 2
  1. 1.IBM T.J. Watson Research Center, EmeritusUSA
  2. 2.Umeå UniversitySweden

Personalised recommendations