Optimizing Matrix Multiplication with a Classifier Learning System

Li, Xiaoming; Garzarán, María Jesús

doi:10.1007/978-3-540-69330-7_9

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4339)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Compilers have been very successful at automating the process of program optimization, but there is still a significant performance gap between compiler-generated code and hand-optimized code. Library generators such as ATLAS, SPIRAL, and FFTW address this problem by using empirical search to find the parameter values of certain optimizations, such as the degree of loop unrolling. We have recently developed a generator of sorting routines. Sorting differs from the algorithms implemented by other library generators in that its performance depends not only on the target platform but also on the characteristics of the input data. In that work we used a classifier learning system to generate sorting routines that adapt to the input data. In this paper we follow a similar approach and use a classifier learning system to generate high-performance libraries for matrix-matrix multiplication. Our library generator produces matrix multiplication routines that use recursive layouts and several levels of tiling. The classifier learning system searches the space of possible partitionings of the input matrices for the one that performs best. As a result, our system determines the number of levels of tiling and the tile size for each level depending on the target platform and the dimensions of the input matrices.
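
The idea of choosing the number of tiling levels and the tile size per level can be illustrated with a small sketch. This is not the authors' generator: it omits recursive layouts and replaces the classifier learning system with a plain timing loop over a handful of candidates. The names `matmul_naive`, `matmul_tiled`, and `pick_tiling`, and the convention of passing one tile size per level as a tuple, are assumptions made for illustration only.

```python
import time


def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def matmul_tiled(a, b, tiles):
    """Multiply with nested tiling.

    `tiles` holds one tile size per level, outermost first, e.g. (64, 8).
    An empty tuple means no tiling at all.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]

    def block(i0, i1, p0, p1, j0, j1, level):
        if level == len(tiles):
            # Innermost level: plain loops over the current sub-block.
            for i in range(i0, i1):
                row_c, row_a = c[i], a[i]
                for p in range(p0, p1):
                    aip, row_b = row_a[p], b[p]
                    for j in range(j0, j1):
                        row_c[j] += aip * row_b[j]
            return
        t = tiles[level]
        # Split the current block into t-sized tiles and descend one level.
        for i in range(i0, i1, t):
            for p in range(p0, p1, t):
                for j in range(j0, j1, t):
                    block(i, min(i + t, i1), p, min(p + t, p1),
                          j, min(j + t, j1), level + 1)

    block(0, n, 0, k, 0, m, 0)
    return c


def pick_tiling(a, b, candidates):
    """Time each candidate tiling on the given inputs; return the fastest."""
    best, best_time = None, float("inf")
    for tiles in candidates:
        start = time.perf_counter()
        matmul_tiled(a, b, tiles)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = tiles, elapsed
    return best
```

Calling `pick_tiling(a, b, [(), (32,), (64, 8)])` would compare no tiling, one level, and two levels on a given pair of matrices. The winner depends on the machine and the matrix dimensions, which is the dependence the paper's system learns instead of re-measuring on every call.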

This work was supported in part by the National Science Foundation under grant CCR 01-21401 ITR; by DARPA under contract NBCH30390004; and by gifts from INTEL and IBM. This work is not necessarily representative of the positions or policies of the Army or Government.

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Urbana-Champaign,
Xiaoming Li & María Jesús Garzarán

Editor information

Editors and Affiliations

BSC-UPC,
Eduard Ayguadé
Department of Computer Science, Louisiana State University, 70803, Baton Rouge, LA, USA
Gerald Baumgartner
Dept. of Electrical and Computer Engg., Louisiana State University, Baton Rouge, LA, USA
J. Ramanujam
Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Avenue, 43210, Columbus, OH, USA
P. Sadayappan

About this paper

Cite this paper

Li, X., Garzarán, M.J. (2006). Optimizing Matrix Multiplication with a Classifier Learning System. In: Ayguadé, E., Baumgartner, G., Ramanujam, J., Sadayappan, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2005. Lecture Notes in Computer Science, vol 4339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69330-7_9

DOI: https://doi.org/10.1007/978-3-540-69330-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69329-1
Online ISBN: 978-3-540-69330-7
eBook Packages: Computer Science (R0)
