Optimizing Matrix Multiplication with a Classifier Learning System

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4339)

Abstract

Compilers have been very successful at automating program optimization, but there is still a significant performance gap between compiler-generated code and hand-optimized code. Library generators such as ATLAS, SPIRAL, and FFTW address this problem by using empirical search to find the parameter values of certain optimizations, such as the degree of loop unrolling. We have recently developed a generator of sorting routines. Sorting differs from the algorithms implemented by other library generators in that its performance depends not only on the target platform but also on the characteristics of the input data. In that work we used a classifier learning system to generate sorting routines that adapt to the input data. In this paper we follow a similar approach and use a classifier learning system to generate high-performance libraries for matrix-matrix multiplication. Our library generator produces matrix multiplication routines that use recursive layouts and several levels of tiling. The classifier learning system searches the space of possible ways to partition the input matrices for the partitioning that performs best. As a result, our system determines the number of tiling levels and the tile size for each level depending on the target platform and the dimensions of the input matrices.
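To make the search space concrete, the following is a minimal sketch of matrix multiplication with a configurable number of tiling levels. It is an illustration only, not the generator described in the paper: the function name `matmul_tiled` and the `tiles` parameter are hypothetical, the matrices are plain row-major Python lists rather than recursive layouts, and the tile sizes are supplied by the caller rather than chosen by a classifier learning system.

```python
def matmul_tiled(A, B, tiles):
    """Multiply A (n x k) by B (k x m) with nested tiling.

    `tiles` is a hypothetical parameter: a list of tile sizes, outermost
    level first (for example [64, 8]). An empty list means no tiling.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]

    def block(i0, i1, j0, j1, p0, p1, level):
        # Innermost level: plain triple loop over the current block.
        if level == len(tiles):
            for i in range(i0, i1):
                for p in range(p0, p1):
                    a = A[i][p]
                    for j in range(j0, j1):
                        C[i][j] += a * B[p][j]
            return
        # Otherwise split the block into tiles of size t and recurse.
        t = tiles[level]
        for i in range(i0, i1, t):
            for j in range(j0, j1, t):
                for p in range(p0, p1, t):
                    block(i, min(i + t, i1),
                          j, min(j + t, j1),
                          p, min(p + t, p1),
                          level + 1)

    block(0, n, 0, m, 0, k, 0)
    return C
```

Every choice of `tiles` computes the same product; only the order in which memory is touched changes. The length of the list (number of tiling levels) and its entries (tile size per level) are the kind of parameters the abstract says the system selects per platform and per input dimensions.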

This work was supported in part by the National Science Foundation under grant CCR 01-21401 ITR; by DARPA under contract NBCH30390004; and by gifts from INTEL and IBM. This work is not necessarily representative of the positions or policies of the Army or Government.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, X., Garzarán, M.J. (2006). Optimizing Matrix Multiplication with a Classifier Learning System. In: Ayguadé, E., Baumgartner, G., Ramanujam, J., Sadayappan, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2005. Lecture Notes in Computer Science, vol 4339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69330-7_9

  • DOI: https://doi.org/10.1007/978-3-540-69330-7_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69329-1

  • Online ISBN: 978-3-540-69330-7

  • eBook Packages: Computer Science, Computer Science (R0)
