Optimizing Matrix Multiplication with a Classifier Learning System

  • Xiaoming Li
  • María Jesús Garzarán
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4339)

Abstract

Compilers have been very successful at automating the process of program optimization, but there is still a significant performance gap between compiler-generated code and hand-optimized code. Library generators such as ATLAS, SPIRAL, and FFTW address this problem by using empirical search to find the parameter values of certain optimizations, such as the degree of loop unrolling. We have recently developed a generator of sorting routines. Sorting differs from the algorithms implemented by other library generators in that its performance depends not only on the target platform but also on the characteristics of the input data. In that work we used a classifier learning system to generate sorting routines that adapt to the input data. In this paper we follow a similar approach and use a classifier learning system to generate high-performance libraries for matrix-matrix multiplication. Our library generator produces matrix multiplication routines that use recursive layouts and several levels of tiling. The classifier learning system searches the space of possible partitionings of the input matrices for the one that performs best. As a result, our system determines the number of tiling levels and the tile size for each level depending on the target platform and the dimensions of the input matrices.
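To make the idea of "several levels of tiling" concrete, here is a minimal Python sketch of a multi-level tiled matrix multiplication. It is not the code produced by the authors' generator: the function names and the `tile_sizes` parameter are hypothetical, the matrices use ordinary row-major nested lists rather than a recursive layout, and the tile sizes are supplied by hand instead of being chosen by a classifier learning system.

```python
def tiled_matmul(A, B, tile_sizes):
    """Multiply A (n x k) by B (k x m) using nested levels of tiling.

    tile_sizes lists one tile size per level, outermost first,
    e.g. [64, 8]. An empty list falls back to a plain triple loop.
    (Hypothetical interface, for illustration only.)
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    _multiply_block(A, B, C, (0, n), (0, k), (0, m), list(tile_sizes))
    return C


def _multiply_block(A, B, C, rows, inner, cols, tile_sizes):
    """Accumulate the product of one block of A and one block of B into C."""
    (i0, i1), (p0, p1), (j0, j1) = rows, inner, cols

    if not tile_sizes:
        # Innermost level: ordinary loops over the current block.
        for i in range(i0, i1):
            for p in range(p0, p1):
                a = A[i][p]
                for j in range(j0, j1):
                    C[i][j] += a * B[p][j]
        return

    # Split the current block into tiles of size t and recurse,
    # passing the remaining tile sizes to the next level.
    t, rest = tile_sizes[0], tile_sizes[1:]
    for i in range(i0, i1, t):
        for p in range(p0, p1, t):
            for j in range(j0, j1, t):
                _multiply_block(
                    A, B, C,
                    (i, min(i + t, i1)),
                    (p, min(p + t, p1)),
                    (j, min(j + t, j1)),
                    rest,
                )


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    # Two levels of tiling: tiles of size 2, then tiles of size 1.
    print(tiled_matmul(A, B, [2, 1]))
```

In this sketch, the length of `tile_sizes` plays the role of the number of tiling levels and its entries play the role of the per-level tile sizes, which are the two quantities the abstract says the system selects from the target platform and the matrix dimensions.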

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiaoming Li (1)
  • María Jesús Garzarán (1)
  1. Department of Computer Science, University of Illinois at Urbana-Champaign
