Abstract
Compilers have been very successful at automating program optimization, but a significant performance gap remains between compiler-generated code and hand-optimized code. Library generators such as ATLAS, SPIRAL, and FFTW address this problem by using empirical search to find the parameter values of certain optimizations, such as the degree of unrolling. We have recently developed a generator of sorting routines. Sorting differs from the algorithms implemented by other library generators in that its performance depends not only on the target platform but also on the characteristics of the input data. In that work we used a classifier learning system to generate sorting routines that adapt to the input data. In this paper we follow a similar approach and use a classifier learning system to generate high-performance libraries for matrix-matrix multiplication. Our library generator produces matrix multiplication routines that use recursive layouts and several levels of tiling. The classifier learning system searches the space of possible ways to partition the input matrices for the one that performs best. As a result, our system determines the number of tiling levels and the tile size at each level, depending on the target platform and the dimensions of the input matrices.
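To make the idea of "several levels of tiling, chosen from the matrix dimensions" concrete, the following is a minimal Python sketch and not the generator described in the paper. The names `tiled_matmul`, `choose_tiles`, and `HYPOTHETICAL_RULES` are assumptions introduced here for illustration. The rule table is a hand-written stand-in for what a classifier learning system would learn, its thresholds and tile sizes are invented rather than measured, and the sketch uses ordinary row-major lists instead of recursive layouts.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def tiled_matmul(a: Matrix, b: Matrix, tiles: Sequence[int]) -> Matrix:
    """Multiply a (n x k) by b (k x m) with one tiling level per entry in `tiles`.

    `tiles` lists tile edge lengths from the outermost level to the innermost,
    for example (128, 16). An empty sequence means no tiling at all.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]

    def block(i0, i1, j0, j1, p0, p1, level):
        if level == len(tiles):
            # Innermost level: plain triple loop restricted to the current block.
            for i in range(i0, i1):
                row_a, row_c = a[i], c[i]
                for p in range(p0, p1):
                    a_ip, row_b = row_a[p], b[p]
                    for j in range(j0, j1):
                        row_c[j] += a_ip * row_b[j]
            return
        t = tiles[level]
        # Split the current block into t x t x t sub-blocks and descend one level.
        for ii in range(i0, i1, t):
            for pp in range(p0, p1, t):
                for jj in range(j0, j1, t):
                    block(ii, min(ii + t, i1), jj, min(jj + t, j1),
                          pp, min(pp + t, p1), level + 1)

    block(0, n, 0, m, 0, k, 0)
    return c


# Hypothetical rule table: (largest dimension covered, tiling to use).
# The numbers are illustrative assumptions, not tuned or measured values.
HYPOTHETICAL_RULES: List[Tuple[float, Tuple[int, ...]]] = [
    (16, ()),                    # tiny inputs: no tiling
    (256, (32,)),                # medium inputs: one level
    (float("inf"), (128, 16)),   # large inputs: two levels
]


def choose_tiles(n: int, m: int, k: int, rules=HYPOTHETICAL_RULES) -> Tuple[int, ...]:
    """Return the tiling of the first rule whose bound covers the largest dimension."""
    largest = max(n, m, k)
    for max_dim, tiles in rules:
        if largest <= max_dim:
            return tiles
    return ()


if __name__ == "__main__":
    a = [[float(i + j) for j in range(5)] for i in range(4)]
    b = [[float(i - j) for j in range(6)] for i in range(5)]
    tiles = choose_tiles(len(a), len(b[0]), len(b))
    print(tiles, tiled_matmul(a, b, tiles)[0])
```

In this sketch the number of tiling levels is simply the length of the tuple returned by `choose_tiles`, so a learned rule set could vary both the depth and the tile sizes per input shape without changing the multiplication routine.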
This work was supported in part by the National Science Foundation under grant CCR 01-21401 ITR; by DARPA under contract NBCH30390004; and by gifts from INTEL and IBM. This work is not necessarily representative of the positions or policies of the Army or Government.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, X., Garzarán, M.J. (2006). Optimizing Matrix Multiplication with a Classifier Learning System. In: Ayguadé, E., Baumgartner, G., Ramanujam, J., Sadayappan, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2005. Lecture Notes in Computer Science, vol 4339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69330-7_9
DOI: https://doi.org/10.1007/978-3-540-69330-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69329-1
Online ISBN: 978-3-540-69330-7
eBook Packages: Computer Science, Computer Science (R0)