Abstract
The practical portability of a simple version of matrix multiplication is demonstrated. The multiplication algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the specific memory system organization of any particular machine. Through both simulation and execution on a number of platforms, we show that portability across memory hierarchies does not sacrifice floating-point performance; indeed, performance is always a significant fraction of peak and, on at least one machine, is higher than that of the tuned routines from both ATLAS and the vendor. The results are obtained by careful algorithm engineering, which combines a number of known and novel implementation ideas. This effort can be viewed as an experimental case study, complementary to the theoretical investigations on portability of cache performance begun by Bilardi and Peserico.
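As a rough illustration of how locality can be obtained at every level of a memory hierarchy without knowing any cache parameters, the sketch below multiplies matrices by recursive quadrant decomposition. It is a generic divide-and-conquer scheme, not the authors' fractal implementation: the function name `recursive_matmul`, the `leaf` cutoff, the row-major list-of-lists storage, and the restriction to power-of-two sizes are all assumptions made for this example.

```python
def recursive_matmul(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, leaf=2):
    """Accumulate C += A * B on n x n blocks addressed by (row, col) offsets.

    Assumption for this sketch: n is a power of two, so halving always
    yields equal quadrants until the leaf size is reached.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    if n <= leaf:
        # Base case: plain triple loop on a block small enough to stay
        # in the fastest level of the hierarchy.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight half-size products: C[di][dj] += A[di][dk] * B[dk][dj].
    # No block size is tuned to a particular cache; each recursion
    # level works on operands that are a quarter of the previous size.
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                recursive_matmul(A, B, C, h,
                                 ai + di, aj + dk,
                                 bi + dk, bj + dj,
                                 ci + di, cj + dj,
                                 leaf)


if __name__ == "__main__":
    n = 4
    A = [[i * n + j for j in range(n)] for i in range(n)]
    B = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
    C = [[0] * n for _ in range(n)]
    recursive_matmul(A, B, C, n)
    print(C == A)
```

Because the recursion keeps halving the operands, some level of the recursion fits each level of cache regardless of its capacity, which is the intuition behind hierarchy-independent locality. A production version would also need a data layout and a call order chosen to match, which this sketch does not attempt.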
This work was supported, in part, by CNR and MURST of Italy.
Supported by AMRM DABT63-98-C-0045.
References
A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir: A model for hierarchical memory. Proc. of 19th Annual ACM Symposium on the Theory of Computing, New York, 1987, 305–314.
A. Aggarwal, A. K. Chandra and M. Snir: Hierarchical memory with block transfer. 1987 IEEE.
B. Alpern, L. Carter, E. Feig and T. Selker: The uniform memory hierarchy model of computation. In Algorithmica, vol. 12, (1994), 72–129.
U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua: Automatic program parallelization. Proceedings of the IEEE vol 81, n. 2 Feb. 1993.
G. Bilardi, P. D’Alberto, and A. Nicolau: Fractal Matrix Multiplication: a Case Study on Portability of Cache Performance, University of California at Irvine, ICS TR#00-21, 2000.
G. Bilardi and F. P. Preparata: Processor-time tradeoffs under bounded-speed message propagation. Part II: lower bounds. Theory of Computing Systems, Vol. 32, 531–559, 1999.
G. Bilardi, E. Peserico: An Approach toward an Analytical Characterization of Locality and its Portability. IWIA 2000, International Workshop on Innovative Architectures, Maui, Hawaii, January 2001.
G. Bilardi, E. Peserico: A Characterization of Temporal Locality and its Portability Across Memory Hierarchies. ICALP 2001, International Colloquium on Automata, Languages, and Programming, Crete, July 2001.
G. Bilardi, A. Pietracaprina, and P. D’Alberto: On the space and access complexity of computation DAGs. 26th Workshop on Graph-Theoretic Concepts in Computer Science, Konstanz, Germany, June 2000.
J. Bilmes, Krste Asanovic, C. Chin and J. Demmel: Optimizing matrix multiply using PHiPAC: a portable, high-performance, Ansi C coding methodology. International Conference on Supercomputing, July 1997.
S. Carr and K. Kennedy: Compiler blockability of numerical algorithms. Proceedings of Supercomputing Nov 1992, pg. 114–124.
S. Chatterjee, V. V. Jain, A. R. Lebeck and S. Mundhra: Nonlinear array layouts for hierarchical memory systems. Proc. of ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.
S. Chatterjee, A. R. Lebeck, P. K. Patnala and M. Thottethodi: Recursive array layout and fast parallel matrix multiplication. Proc. 11-th ACM SIGPLAN, June 1999.
D. Coppersmith and S. Winograd: Matrix multiplication via arithmetic progressions. In Proceedings of 19th Annual ACM Symposium on Theory of Computing, pp. 1–6, 1987.
P. D’Alberto, G. Bilardi and A. Nicolau: Fractal LU-decomposition with partial pivoting. Manuscript.
M. J. Dayde and I. S. Duff: A blocked implementation of level 3 BLAS for RISC processors. TRPA9606, available on line http://www.cerfacs.fr/algorreports/TRPA9606.ps.gz Apr. 6 1996.
N. Eiron, M. Rodeh and I. Steinwarts: Matrix multiplication: a case study of algorithm engineering. Proceedings WAE’98, Saarbrücken, Germany, Aug. 20–22, 1998.
Engineering and Scientific Subroutine Library. http://www.rs6000.ibm.com/resource/aixresource/spbooks/essl/
P. Flajolet, G. Gonnet, C. Puech and J. M. Robson: The analysis of multidimensional searching in Quad-Tree. Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, 1991, pp. 100–109.
J. D. Frens and D. S. Wise: Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 32, 7 (July 1997), 206–216.
M. Frigo and S. G. Johnson: The fastest Fourier transform in the west. MIT-LCS-TR-728, Massachusetts Institute of Technology, Sep. 11 1997.
M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran: Cache-oblivious algorithms. Proc. 40th Annual Symposium on Foundations of Computer Science, (1999).
E. D. Granston, W. Jalby and O. Temam: To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts. Proceedings of Supercomputing Nov 1993, pg. 410–419.
G. H. Golub and C. F. van Loan: Matrix computations. Johns Hopkins University Press, 3rd edition.
F. G. Gustavson: Recursion leads to automatic variable blocking for dense linear algebra algorithms. Journal of Research and Development Volume 41, Number 6, November 1997.
F. Gustavson, A. Henriksson, I. Jonsson, P. Ling, and B. Kågström: Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. In B. Kågström et al (eds), Applied Parallel Computing. Large Scale Scientific and Industrial Problems, PARA’98 Proceedings. Lecture Notes in Computer Science, No. 1541, p. 195–206, Springer Verlag, 1998.
N. J. Higham: Accuracy and stability of numerical algorithms. SIAM, 1996.
Hong Jia-Wei and H. T. Kung: I/O complexity: The Red-Blue pebble game. Proc. of the 13th Ann. ACM Symposium on Theory of Computing, Oct. 1981, 326–333.
B. Kågström, P. Ling and C. Van Loan: Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues. ACM Transactions on Mathematical Software, Vol. 24, No. 3, Sept. 1998, pages 303–316.
B. Kågström, P. Ling and C. Van Loan: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software, Vol. 24, No. 3, Sept. 1998, pages 268–302.
M. Lam, E. Rothberg and M. Wolfe: The cache performance and optimizations of blocked algorithms. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Apr. 1991, pg. 63–74.
S. S. Muchnick: Advanced compiler design and implementation. Morgan Kaufmann.
P. D’Alberto: Performance Evaluation of Data Locality Exploitation. Technical Report UBLCS-2000-9. Department of Computer Science, University of Bologna.
P. R. Panda, H. Nakamura, N. D. Dutt and A. Nicolau: Improving cache performance through tiling and data alignment. Solving Irregularly Structured Problems in Parallel Lecture Notes in Computer Science, Springer-Verlag 1997.
John E. Savage: Space-Time tradeoff in memory hierarchies. Technical report Oct 19, 1993.
V. Strassen: Gaussian elimination is not optimal. Numerische Mathematik 14(3):354–356, 1969.
S. Toledo: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. Vol. 18, No. 4, pp. 1065–1081, Oct. 1997.
M. Thottethodi, S. Chatterjee and A. R. Lebeck: Tuning Strassen’s matrix multiplication for memory efficiency. Proc. SC98, Orlando, FL, Nov. 1998 (http://www.supercomp.org/sc98).
R. C. Whaley and J. J. Dongarra: Automatically Tuned Linear Algebra Software. http://www.netlib.org/atlas/index.html
D. S. Wise: Undulant-block elimination and integer-preserving matrix inversion. Technical Report 418, Computer Science Department, Indiana University, August 1995.
M. Wolfe: More iteration space tiling. Proceedings of Supercomputing, Nov. 1989, pg. 655–665.
M. Wolfe and M. Lam: A data locality optimizing algorithm. Proceedings of the ACM SIGPLAN’91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26–28, 1991.
M. Wolfe: High performance compilers for parallel computing. Addison-Wesley Pub. Co., 1995.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bilardi, G., D’Alberto, P., Nicolau, A. (2001). Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance. In: Brodal, G.S., Frigioni, D., Marchetti-Spaccamela, A. (eds) Algorithm Engineering. WAE 2001. Lecture Notes in Computer Science, vol 2141. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44688-5_3
DOI: https://doi.org/10.1007/3-540-44688-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42500-7
Online ISBN: 978-3-540-44688-0
eBook Packages: Springer Book Archive