Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

  • Conference paper
Algorithm Engineering (WAE 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2141)


Abstract

The practical portability of a simple version of matrix multiplication is demonstrated. The multiplication algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the specific memory-system organization of any particular machine. By both simulations and execution on a number of platforms, we show that portability across memory hierarchies does not sacrifice floating-point performance: it is always a significant fraction of peak and, on at least one machine, exceeds that of the tuned routines of both ATLAS and the vendor. The results are obtained by careful algorithm engineering, which combines a number of known as well as novel implementation ideas. This effort can be viewed as an experimental case study, complementary to the theoretical investigations on portability of cache performance begun by Bilardi and Peserico.

This work was supported, in part, by CNR and MURST of Italy.

Supported by AMRM DABT63-98-C-0045.
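
The fractal algorithm itself is not reproduced on this page. As a rough illustration of the idea the abstract alludes to, the following minimal C sketch performs cache-oblivious recursive matrix multiplication on square matrices whose order is a power of two, stored row-major; the recmm function and the RECMM_BASE cutoff are hypothetical names introduced here, and the authors' actual algorithm additionally relies on a fractal (recursive) data layout and a tuned base-case kernel, neither of which is shown.

    /* Minimal sketch (not the paper's code): cache-oblivious recursive
     * matrix multiplication, C += A * B. Assumes square matrices whose
     * order n is a power of two, stored row-major with leading dimension
     * ld. RECMM_BASE is a hypothetical cutoff; a tuned implementation
     * would use a register-blocked base case instead of the plain loop. */
    #include <stddef.h>

    #define RECMM_BASE 16

    static void recmm(size_t n, size_t ld,
                      const double *A, const double *B, double *C)
    {
        if (n <= RECMM_BASE) {
            /* Base case: straightforward triple loop, C += A * B. */
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < n; k++)
                    for (size_t j = 0; j < n; j++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        size_t h = n / 2;
        /* Quadrant views of each operand: Xrc is the (r,c) half-size block. */
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h * ld, *B22 = B + h * ld + h;
        double       *C11 = C,          *C12 = C + h,
                     *C21 = C + h * ld, *C22 = C + h * ld + h;

        /* Eight half-size products accumulated into the quadrants of C. */
        recmm(h, ld, A11, B11, C11);  recmm(h, ld, A12, B21, C11);
        recmm(h, ld, A11, B12, C12);  recmm(h, ld, A12, B22, C12);
        recmm(h, ld, A21, B11, C21);  recmm(h, ld, A22, B21, C21);
        recmm(h, ld, A21, B12, C22);  recmm(h, ld, A22, B22, C22);
    }

Because each half-size subproblem eventually fits within every cache level, the recursion exploits each level of the hierarchy without any machine-specific blocking parameters, which is the sense in which such algorithms are portable across memory hierarchies.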



References

  1. A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir: A model for hierarchical memory. Proc. of the 19th Annual ACM Symposium on Theory of Computing, New York, 1987, 305–314.

  2. A. Aggarwal, A. K. Chandra and M. Snir: Hierarchical memory with block transfer. IEEE, 1987.

  3. B. Alpern, L. Carter, E. Feig and T. Selker: The uniform memory hierarchy model of computation. Algorithmica, vol. 12 (1994), 72–129.

  4. U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua: Automatic program parallelization. Proceedings of the IEEE, vol. 81, no. 2, Feb. 1993.

  5. G. Bilardi, P. D’Alberto, and A. Nicolau: Fractal Matrix Multiplication: a Case Study on Portability of Cache Performance. University of California at Irvine, ICS TR#00-21, 2000.

  6. G. Bilardi and F. P. Preparata: Processor-time tradeoffs under bounded-speed message propagation. Part II: lower bounds. Theory of Computing Systems, vol. 32, 531–559, 1999.

  7. G. Bilardi and E. Peserico: An Approach toward an Analytical Characterization of Locality and its Portability. IWIA 2000, International Workshop on Innovative Architectures, Maui, Hawaii, January 2001.

  8. G. Bilardi and E. Peserico: A Characterization of Temporal Locality and its Portability Across Memory Hierarchies. ICALP 2001, International Colloquium on Automata, Languages, and Programming, Crete, July 2001.

  9. G. Bilardi, A. Pietracaprina, and P. D’Alberto: On the space and access complexity of computation DAGs. 26th Workshop on Graph-Theoretic Concepts in Computer Science, Konstanz, Germany, June 2000.

  10. J. Bilmes, K. Asanovic, C. Chin and J. Demmel: Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. International Conference on Supercomputing, July 1997.

  11. S. Carr and K. Kennedy: Compiler blockability of numerical algorithms. Proceedings of Supercomputing, Nov. 1992, pp. 114–124.

  12. S. Chatterjee, V. V. Jain, A. R. Lebeck and S. Mundhra: Nonlinear array layouts for hierarchical memory systems. Proc. of the ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.

  13. S. Chatterjee, A. R. Lebeck, P. K. Patnala and M. Thottethodi: Recursive array layout and fast parallel matrix multiplication. Proc. 11th ACM SIGPLAN, June 1999.

  14. D. Coppersmith and S. Winograd: Matrix multiplication via arithmetic progressions. Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pp. 1–6, 1987.

  15. P. D’Alberto, G. Bilardi and A. Nicolau: Fractal LU-decomposition with partial pivoting. Manuscript.

  16. M. J. Dayde and I. S. Duff: A blocked implementation of level 3 BLAS for RISC processors. TRPA9606, available online at http://www.cerfacs.fr/algorreports/TRPA9606.ps.gz, Apr. 6, 1996.

  17. N. Eiron, M. Rodeh and I. Steinwarts: Matrix multiplication: a case study of algorithm engineering. Proceedings of WAE’98, Saarbrücken, Germany, Aug. 20–22, 1998.

  18. Engineering and Scientific Subroutine Library. http://www.rs6000.ibm.com/resource/aixresource/spbooks/essl/

  19. P. Flajolet, G. Gonnet, C. Puech and J. M. Robson: The analysis of multidimensional searching in quad-trees. Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, 1991, pp. 100–109.

  20. J. D. Frens and D. S. Wise: Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 32, 7 (July 1997), 206–216.

  21. M. Frigo and S. G. Johnson: The fastest Fourier transform in the west. MIT-LCS-TR-728, Massachusetts Institute of Technology, Sep. 11, 1997.

  22. M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran: Cache-oblivious algorithms. Proc. 40th Annual Symposium on Foundations of Computer Science, 1999.

  23. E. D. Granston, W. Jalby and O. Temam: To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts. Proceedings of Supercomputing, Nov. 1993, pp. 410–419.

  24. G. H. Golub and C. F. van Loan: Matrix Computations. Johns Hopkins University Press, 3rd edition.

  25. F. G. Gustavson: Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM Journal of Research and Development, vol. 41, no. 6, November 1997.

  26. F. Gustavson, A. Henriksson, I. Jonsson, P. Ling, and B. Kågström: Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. In B. Kågström et al. (eds), Applied Parallel Computing: Large Scale Scientific and Industrial Problems, PARA’98 Proceedings. Lecture Notes in Computer Science, no. 1541, pp. 195–206, Springer-Verlag, 1998.

  27. N. J. Higham: Accuracy and Stability of Numerical Algorithms. SIAM, 1996.

  28. Hong Jia-Wei and H. T. Kung: I/O complexity: the red-blue pebble game. Proc. of the 13th Annual ACM Symposium on Theory of Computing, Oct. 1981, 326–333.

  29. B. Kågström, P. Ling and C. Van Loan: Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues. ACM Transactions on Mathematical Software, vol. 24, no. 3, Sept. 1998, pp. 303–316.

  30. B. Kågström, P. Ling and C. Van Loan: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software, vol. 24, no. 3, Sept. 1998, pp. 268–302.

  31. M. Lam, E. Rothberg and M. Wolfe: The cache performance and optimizations of blocked algorithms. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Apr. 1991, pp. 63–74.

  32. S. S. Muchnick: Advanced Compiler Design and Implementation. Morgan Kaufmann.

  33. P. D’Alberto: Performance Evaluation of Data Locality Exploitation. Technical Report UBLCS-2000-9, Department of Computer Science, University of Bologna.

  34. P. R. Panda, H. Nakamura, N. D. Dutt and A. Nicolau: Improving cache performance through tiling and data alignment. Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science, Springer-Verlag, 1997.

  35. J. E. Savage: Space-time tradeoff in memory hierarchies. Technical report, Oct. 19, 1993.

  36. V. Strassen: Gaussian elimination is not optimal. Numerische Mathematik 14(3):354–356, 1969.

  37. S. Toledo: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., vol. 18, no. 4, pp. 1065–1081, Oct. 1997.

  38. M. Thottethodi, S. Chatterjee and A. R. Lebeck: Tuning Strassen’s matrix multiplication for memory efficiency. Proc. SC98, Orlando, FL, Nov. 1998 (http://www.supercomp.org/sc98).

  39. R. C. Whaley and J. J. Dongarra: Automatically Tuned Linear Algebra Software. http://www.netlib.org/atlas/index.html

  40. D. S. Wise: Undulant-block elimination and integer-preserving matrix inversion. Technical Report 418, Computer Science Department, Indiana University, August 1995.

  41. M. Wolfe: More iteration space tiling. Proceedings of Supercomputing, Nov. 1989, pp. 655–665.

  42. M. Wolfe and M. Lam: A data locality optimizing algorithm. Proceedings of the ACM SIGPLAN ’91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26–28, 1991.

  43. M. Wolfe: High Performance Compilers for Parallel Computing. Addison-Wesley Pub. Co., 1995.


Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bilardi, G., D’Alberto, P., Nicolau, A. (2001). Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance. In: Brodal, G.S., Frigioni, D., Marchetti-Spaccamela, A. (eds) Algorithm Engineering. WAE 2001. Lecture Notes in Computer Science, vol 2141. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44688-5_3

  • DOI: https://doi.org/10.1007/3-540-44688-5_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42500-7

  • Online ISBN: 978-3-540-44688-0

  • eBook Packages: Springer Book Archive
