Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

  • Gianfranco Bilardi
  • Paolo D’Alberto
  • Alex Nicolau
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2141)


The practical portability of a simple version of matrix multiplication is demonstrated. The multiplication algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the memory system organization of any particular machine. Through both simulations and execution on a number of platforms, we show that portability across memory hierarchies does not sacrifice floating-point performance; indeed, performance is always a significant fraction of peak and, on at least one machine, is higher than that of the tuned routines from both ATLAS and the vendor. The results are obtained by careful algorithm engineering, which combines a number of known as well as novel implementation ideas. This effort can be viewed as an experimental case study, complementary to the theoretical investigations on portability of cache performance begun by Bilardi and Peserico.
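As a rough illustration of the general idea, and not the authors' implementation, the following Python sketch multiplies two matrices by recursing on quadrants. Because every recursive call works on a smaller contiguous sub-block, the access pattern has locality at every scale without any cache-size parameter. The function names, the power-of-two size restriction, and the `leaf` cutoff are all assumptions made for this sketch.

```python
def matmul_recursive(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, leaf=2):
    """Accumulate the n x n product of a sub-block of A and a sub-block of B
    into a sub-block of C. (ai, aj), (bi, bj), (ci, cj) are the top-left
    corners of the sub-blocks. Hypothetical helper for this sketch."""
    if n <= leaf:
        # Base case: plain triple loop on a small block.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight recursive products: C[di][dj] += A[di][dk] * B[dk][dj].
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                matmul_recursive(A, B, C, h,
                                 ai + di, aj + dk,
                                 bi + dk, bj + dj,
                                 ci + di, cj + dj, leaf)


def multiply(A, B, leaf=2):
    """Return A * B for square matrices given as lists of rows.
    Assumption of this sketch: the dimension is a power of two."""
    n = len(A)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes power-of-two size"
    C = [[0] * n for _ in range(n)]
    matmul_recursive(A, B, C, n, leaf=leaf)
    return C
```

Calling `multiply(A, B)` on two 4 x 4 lists of rows returns their product as a new list of rows. A tuned implementation would also have to handle arbitrary sizes, data layout, and register-level scheduling, none of which this sketch attempts.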


Keywords: Matrix Multiplication, Call Tree, Memory Hierarchy, R5000 IP32, Fractal Scheme
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir: A model for hierarchical memory. Proc. of 19th Annual ACM Symposium on the Theory of Computing, New York, 1987, 305–314.
  2. A. Aggarwal, A. K. Chandra and M. Snir: Hierarchical memory with block transfer. 1987 IEEE.
  3. B. Alpern, L. Carter, E. Feig and T. Selker: The uniform memory hierarchy model of computation. Algorithmica, vol. 12 (1994), 72–129.
  4. U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua: Automatic program parallelization. Proceedings of the IEEE, vol. 81, n. 2, Feb. 1993.
  5. G. Bilardi, P. D’Alberto and A. Nicolau: Fractal Matrix Multiplication: a Case Study on Portability of Cache Performance. University of California at Irvine, ICS TR#00-21, 2000.
  6. G. Bilardi and F. P. Preparata: Processor-time tradeoffs under bounded-speed message propagation. Part II: lower bounds. Theory of Computing Systems, Vol. 32, 531–559, 1999.
  7. G. Bilardi and E. Peserico: An Approach toward an Analytical Characterization of Locality and its Portability. IWIA 2000, International Workshop on Innovative Architectures, Maui, Hawaii, January 2001.
  8. G. Bilardi and E. Peserico: A Characterization of Temporal Locality and its Portability Across Memory Hierarchies. ICALP 2001, International Colloquium on Automata, Languages, and Programming, Crete, July 2001.
  9. G. Bilardi, A. Pietracaprina and P. D’Alberto: On the space and access complexity of computation DAGs. 26th Workshop on Graph-Theoretic Concepts in Computer Science, Konstanz, Germany, June 2000.
  10. J. Bilmes, K. Asanovic, C. Chin and J. Demmel: Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. International Conference on Supercomputing, July 1997.
  11. S. Carr and K. Kennedy: Compiler blockability of numerical algorithms. Proceedings of Supercomputing, Nov. 1992, pp. 114–124.
  12. S. Chatterjee, V. V. Jain, A. R. Lebeck and S. Mundhra: Nonlinear array layouts for hierarchical memory systems. Proc. of ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.
  13. S. Chatterjee, A. R. Lebeck, P. K. Patnala and M. Thottethodi: Recursive array layout and fast parallel matrix multiplication. Proc. 11th ACM SIGPLAN, June 1999.
  14. D. Coppersmith and S. Winograd: Matrix multiplication via arithmetic progressions. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pp. 1–6, 1987.
  15. P. D’Alberto, G. Bilardi and A. Nicolau: Fractal LU-decomposition with partial pivoting. Manuscript.
  16. M. J. Dayde and I. S. Duff: A blocked implementation of level 3 BLAS for RISC processors. TRPA9606, available on line Apr. 6 1996.
  17. N. Eiron, M. Rodeh and I. Steinwarts: Matrix multiplication: a case study of algorithm engineering. Proceedings WAE’98, Saarbrücken, Germany, Aug. 20–22, 1998.
  18. Engineering and Scientific Subroutine Library.
  19. P. Flajolet, G. Gonnet, C. Puech and J. M. Robson: The analysis of multidimensional searching in Quad-Tree. Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, 1991, pp. 100–109.
  20. J. D. Frens and D. S. Wise: Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 32, 7 (July 1997), 206–216.
  21. M. Frigo and S. G. Johnson: The fastest Fourier transform in the west. MIT-LCS-TR-728, Massachusetts Institute of Technology, Sep. 11 1997.
  22. M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran: Cache-oblivious algorithms. Proc. 40th Annual Symposium on Foundations of Computer Science, 1999.
  23. E. D. Granston, W. Jalby and O. Temam: To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts. Proceedings of Supercomputing, Nov. 1993, pp. 410–419.
  24. G. H. Golub and C. F. van Loan: Matrix computations. Johns Hopkins University Press, 3rd edition.
  25. F. G. Gustavson: Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM Journal of Research and Development, Volume 41, Number 6, November 1997.
  26. F. Gustavson, A. Henriksson, I. Jonsson, P. Ling and B. Kågström: Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. In B. Kågström et al. (eds), Applied Parallel Computing. Large Scale Scientific and Industrial Problems, PARA’98 Proceedings. Lecture Notes in Computer Science, No. 1541, pp. 195–206, Springer-Verlag, 1998.
  27. N. J. Higham: Accuracy and stability of numerical algorithms. SIAM, 1996.
  28. Hong Jia-Wei and H. T. Kung: I/O complexity: The Red-Blue pebble game. Proc. of the 13th Ann. ACM Symposium on Theory of Computing, Oct. 1981, 326–333.
  29. B. Kågström, P. Ling and C. Van Loan: Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues. ACM Transactions on Mathematical Software, Vol. 24, No. 3, Sept. 1998, pp. 303–316.
  30. B. Kågström, P. Ling and C. Van Loan: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software, Vol. 24, No. 3, Sept. 1998, pp. 268–302.
  31. M. Lam, E. Rothberg and M. Wolfe: The cache performance and optimizations of blocked algorithms. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Apr. 1991, pp. 63–74.
  32. S. S. Muchnick: Advanced compiler design and implementation. Morgan Kaufmann.
  33. P. D’Alberto: Performance Evaluation of Data Locality Exploitation. Technical Report UBLCS-2000-9, Department of Computer Science, University of Bologna.
  34. P. R. Panda, H. Nakamura, N. D. Dutt and A. Nicolau: Improving cache performance through tiling and data alignment. Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science, Springer-Verlag, 1997.
  35. John E. Savage: Space-Time tradeoff in memory hierarchies. Technical report, Oct. 19, 1993.
  36. V. Strassen: Gaussian elimination is not optimal. Numerische Mathematik 14(3):354–356, 1969.
  37. S. Toledo: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., Vol. 18, No. 4, pp. 1065–1081, Oct. 1997.
  38. M. Thottethodi, S. Chatterjee and A. R. Lebeck: Tuning Strassen’s matrix multiplication for memory efficiency. Proc. SC98, Orlando, FL, Nov. 1998.
  39. R. C. Whaley and J. J. Dongarra: Automatically Tuned Linear Algebra Software.
  40. D. S. Wise: Undulant-block elimination and integer-preserving matrix inversion. Technical Report 418, Computer Science Department, Indiana University, August 1995.
  41. M. Wolfe: More iteration space tiling. Proceedings of Supercomputing, Nov. 1989, pp. 655–665.
  42. M. Wolfe and M. Lam: A data locality optimizing algorithm. Proceedings of the ACM SIGPLAN’91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26–28, 1991.
  43. M. Wolfe: High performance compilers for parallel computing. Addison-Wesley Pub. Co., 1995.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Gianfranco Bilardi (1)
  • Paolo D’Alberto (2)
  • Alex Nicolau (2)

  1. Dipartimento di Elettronica e Informatica, Università di Padova, Italy
  2. Information and Computer Science, University of California at Irvine, USA
