
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

  • Gianfranco Bilardi
  • Paolo D’Alberto
  • Alex Nicolau
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2141)

Abstract

The practical portability of a simple version of matrix multiplication is demonstrated. The algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the memory-system organization of any particular machine. Through both simulations and executions on a number of platforms, we show that portability across memory hierarchies does not sacrifice floating-point performance: it is always a significant fraction of peak and, on at least one machine, exceeds that of the tuned routines from both ATLAS and the vendor. These results are obtained by careful algorithm engineering, which combines a number of known as well as novel implementation ideas. The effort can be viewed as an experimental case study, complementary to the theoretical investigations of portability of cache performance begun by Bilardi and Peserico.
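The fractal scheme is in the spirit of cache-oblivious recursion: each matrix product is split into products of quadrants, visited in a fixed recursive order, so that every level of the memory hierarchy sees reuse without any cache parameter appearing in the code. A minimal sketch of this idea follows; it is illustrative only (plain row-major lists, a hypothetical function name, and an arbitrary base-case size), not the paper's tuned implementation or its fractal data layout.

```python
def matmul_recursive(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, base=2):
    """Accumulate the product of the n x n blocks of A and B (at the
    given row/column offsets) into the corresponding block of C.
    n is assumed to be a power of two for simplicity."""
    if n <= base:
        # Base case: straightforward triple loop on a small block.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight half-size products: C[di][dj] += A[di][dk] * B[dk][dj]
    # for all quadrant indices di, dk, dj in {0, 1}.
    for di, dk, dj in [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1),
                       (1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]:
        matmul_recursive(A, B, C, h,
                         ai + di * h, aj + dk * h,
                         bi + dk * h, bj + dj * h,
                         ci + di * h, cj + dj * h, base)
```

Because the recursion order alone determines the block structure, the same code adapts to any number of cache levels; the paper's engineering effort lies in the base case, the layout, and the call-tree organization that this sketch omits.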

Keywords

Matrix Multiplication, Call Tree, Memory Hierarchy, R5000 IP32, Fractal Scheme
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir: A model for hierarchical memory. Proc. of the 19th Annual ACM Symposium on Theory of Computing, New York, 1987, 305–314.
  2. A. Aggarwal, A. K. Chandra and M. Snir: Hierarchical memory with block transfer. Proc. of the 28th Annual IEEE Symposium on Foundations of Computer Science, 1987.
  3. B. Alpern, L. Carter, E. Feig and T. Selker: The uniform memory hierarchy model of computation. Algorithmica, vol. 12 (1994), 72–129.
  4. U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua: Automatic program parallelization. Proceedings of the IEEE, vol. 81, no. 2, Feb. 1993.
  5. G. Bilardi, P. D’Alberto, and A. Nicolau: Fractal matrix multiplication: a case study on portability of cache performance. University of California at Irvine, ICS TR#00-21, 2000.
  6. G. Bilardi and F. P. Preparata: Processor-time tradeoffs under bounded-speed message propagation. Part II: lower bounds. Theory of Computing Systems, vol. 32, 531–559, 1999.
  7. G. Bilardi and E. Peserico: An approach toward an analytical characterization of locality and its portability. IWIA 2000, International Workshop on Innovative Architectures, Maui, Hawaii, January 2001.
  8. G. Bilardi and E. Peserico: A characterization of temporal locality and its portability across memory hierarchies. ICALP 2001, International Colloquium on Automata, Languages, and Programming, Crete, July 2001.
  9. G. Bilardi, A. Pietracaprina, and P. D’Alberto: On the space and access complexity of computation DAGs. 26th Workshop on Graph-Theoretic Concepts in Computer Science, Konstanz, Germany, June 2000.
  10. J. Bilmes, K. Asanovic, C. Chin and J. Demmel: Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. International Conference on Supercomputing, July 1997.
  11. S. Carr and K. Kennedy: Compiler blockability of numerical algorithms. Proceedings of Supercomputing, Nov. 1992, pp. 114–124.
  12. S. Chatterjee, V. V. Jain, A. R. Lebeck and S. Mundhra: Nonlinear array layouts for hierarchical memory systems. Proc. of the ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.
  13. S. Chatterjee, A. R. Lebeck, P. K. Patnala and M. Thottethodi: Recursive array layouts and fast parallel matrix multiplication. Proc. 11th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1999.
  14. D. Coppersmith and S. Winograd: Matrix multiplication via arithmetic progressions. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pp. 1–6, 1987.
  15. P. D’Alberto, G. Bilardi and A. Nicolau: Fractal LU-decomposition with partial pivoting. Manuscript.
  16. M. J. Dayde and I. S. Duff: A blocked implementation of level 3 BLAS for RISC processors. TRPA9606, available online at http://www.cerfacs.fr/algorreports/TRPA9606.ps.gz, Apr. 6, 1996.
  17. N. Eiron, M. Rodeh and I. Steinwarts: Matrix multiplication: a case study of algorithm engineering. Proceedings of WAE’98, Saarbrücken, Germany, Aug. 20–22, 1998.
  18. Engineering and Scientific Subroutine Library. http://www.rs6000.ibm.com/resource/aixresource/spbooks/essl/
  19. P. Flajolet, G. Gonnet, C. Puech and J. M. Robson: The analysis of multidimensional searching in quad-trees. Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, 1991, pp. 100–109.
  20. J. D. Frens and D. S. Wise: Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 32, 7 (July 1997), 206–216.
  21. M. Frigo and S. G. Johnson: The fastest Fourier transform in the West. MIT-LCS-TR-728, Massachusetts Institute of Technology, Sep. 11, 1997.
  22. M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran: Cache-oblivious algorithms. Proc. 40th Annual Symposium on Foundations of Computer Science, 1999.
  23. E. D. Granston, W. Jalby and O. Temam: To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts. Proceedings of Supercomputing, Nov. 1993, pp. 410–419.
  24. G. H. Golub and C. F. van Loan: Matrix Computations. Johns Hopkins University Press, 3rd edition.
  25. F. G. Gustavson: Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM Journal of Research and Development, vol. 41, no. 6, November 1997.
  26. F. Gustavson, A. Henriksson, I. Jonsson, P. Ling, and B. Kågström: Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. In B. Kågström et al. (eds.), Applied Parallel Computing: Large Scale Scientific and Industrial Problems, PARA’98 Proceedings. Lecture Notes in Computer Science, no. 1541, pp. 195–206, Springer-Verlag, 1998.
  27. N. J. Higham: Accuracy and Stability of Numerical Algorithms. SIAM, 1996.
  28. Hong Jia-Wei and H. T. Kung: I/O complexity: the red-blue pebble game. Proc. of the 13th Annual ACM Symposium on Theory of Computing, 1981, 326–333.
  29. B. Kågström, P. Ling and C. Van Loan: Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues. ACM Transactions on Mathematical Software, vol. 24, no. 3, Sept. 1998, pp. 303–316.
  30. B. Kågström, P. Ling and C. Van Loan: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software, vol. 24, no. 3, Sept. 1998, pp. 268–302.
  31. M. Lam, E. Rothberg and M. Wolf: The cache performance and optimizations of blocked algorithms. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Apr. 1991, pp. 63–74.
  32. S. S. Muchnick: Advanced Compiler Design and Implementation. Morgan Kaufmann.
  33. P. D’Alberto: Performance evaluation of data locality exploitation. Technical Report UBLCS-2000-9, Department of Computer Science, University of Bologna.
  34. P. R. Panda, H. Nakamura, N. D. Dutt and A. Nicolau: Improving cache performance through tiling and data alignment. Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science, Springer-Verlag, 1997.
  35. J. E. Savage: Space-time tradeoffs in memory hierarchies. Technical report, Oct. 19, 1993.
  36. V. Strassen: Gaussian elimination is not optimal. Numerische Mathematik 13:354–356, 1969.
  37. S. Toledo: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., vol. 18, no. 4, pp. 1065–1081, Oct. 1997.
  38. M. Thottethodi, S. Chatterjee and A. R. Lebeck: Tuning Strassen’s matrix multiplication for memory efficiency. Proc. SC98, Orlando, FL, Nov. 1998 (http://www.supercomp.org/sc98).
  39. R. C. Whaley and J. J. Dongarra: Automatically Tuned Linear Algebra Software. http://www.netlib.org/atlas/index.html
  40. D. S. Wise: Undulant-block elimination and integer-preserving matrix inversion. Technical Report 418, Computer Science Department, Indiana University, August 1995.
  41. M. Wolfe: More iteration space tiling. Proceedings of Supercomputing, Nov. 1989, pp. 655–665.
  42. M. Wolf and M. Lam: A data locality optimizing algorithm. Proceedings of the ACM SIGPLAN’91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26–28, 1991.
  43. M. Wolfe: High Performance Compilers for Parallel Computing. Addison-Wesley, 1995.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Gianfranco Bilardi (1)
  • Paolo D’Alberto (2)
  • Alex Nicolau (2)

  1. Dipartimento di Elettronica e Informatica, Università di Padova, Italy
  2. Information and Computer Science, University of California at Irvine, USA
