Iterative Compilation with Kernel Exploration
The increasing complexity of hardware mechanisms in recent processors makes high-performance code generation very challenging. One of the main issues for high performance is the optimization of memory accesses. General-purpose compilers, which have no knowledge of the application context and only an approximate memory model, are ill-suited to this task. Combining application-dependent source-code optimizations with an exploration of optimization parameters, as ATLAS does, has been shown to be one way to improve performance. Yet hand-tuned codes such as those in the MKL library still outperform ATLAS by a significant margin, and some effort is needed to bridge the gap between the performance obtained by automatic and by manual optimization.
In this paper, a new iterative compilation approach for the generation of high-performance codes is proposed. Unlike ATLAS, this approach is not application-specific. The idea is to separate the memory-optimization phase from the computation-optimization phase. The first step automatically finds all possible decompositions of the code into kernels. With data sets that fit in the cache and simplified memory accesses, these kernels are easier to optimize, whether by the compiler, at source level, or with a dedicated code generator. The best decomposition is then selected by a model-guided approach that performs the required memory optimizations on the source code.
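The kind of kernel decomposition described above can be illustrated on matrix-matrix multiplication. The following is a minimal sketch, not the paper's actual decomposition engine: the function names and the tile size are illustrative choices. The point is that each innermost kernel touches only three small tiles, so its working set can stay resident in cache and the kernel can be optimized in isolation.

```python
def matmul_naive(A, B, n):
    """Reference triple loop: large working set, poor locality for big n."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C

def matmul_tiled(A, B, n, tile=32):
    """Same computation, decomposed into tile x tile kernels.

    Each kernel invocation reads one tile of A and one tile of B and
    updates one tile of C, so its data set is bounded by the tile size
    rather than by n.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Kernel: small, regular memory accesses on a tile triple.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both functions compute the same result; only the iteration order, and hence the memory-access pattern, differs.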
Exploration of optimization sequences and their parameters is carried out with a meta-compilation language, the X language. First results on linear algebra codes for the Itanium processor show that the performance obtained narrows the gap with that of highly optimized hand-tuned codes.
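The parameter-exploration step can be sketched as a timing loop over candidate values. In the paper this search is driven by the X language and applied to optimization sequences; the plain Python driver below stands in for it, searching only over tile sizes for a tiled multiply (the candidate set, problem size, and function names are assumptions for illustration).

```python
import time

def matmul_tiled(A, B, n, tile):
    """Tiled matrix-matrix multiply, parameterized by tile size."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

def explore_tile_sizes(n, candidates=(4, 8, 16, 32)):
    """Empirically time each candidate and keep the fastest, as an
    iterative-compilation search would."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        matmul_tiled(A, B, n, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile
```

A real search would measure each candidate several times on the target machine and could be guided by a cache model to prune the space, as the model-guided approach in the paper does.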
Keywords: cache size, instruction-level parallelism, tile size, matrix-matrix multiplication, prefetching distance
- 2. Bodin, F., Mevel, Y., Quiniou, R.: A user level program transformation tool. In: ACM Int. Conf. on Supercomputing, Melbourne, Australia, pp. 180–187. ACM Press, New York (1998). doi:10.1145/277830.277868
- 3. Clauss, P.: Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs. In: ACM Int. Conf. on Supercomputing, pp. 278–295. ACM Press, New York (1996)
- 5. Cooper, K.D., Waterman, T.: Investigating adaptive compilation using the MIPSPro compiler. In: Symp. of the Los Alamos Computer Science Institute (October 2003)
- 6. Djoudi, L., et al.: Exploring application performance: a new tool for a static/dynamic approach. In: Symp. of the Los Alamos Computer Science Institute, Santa Fe, NM (October 2005)
- 8. Engineering and Scientific Subroutine Library: Guide and Reference. IBM
- 10. Fraguela, B., Doallo, R., Zapata, E.: Automatic analytical modeling for the estimation of cache misses. In: Int. Conf. on Parallel Architectures and Compilation Techniques, Washington, DC, USA, p. 221. IEEE Computer Society Press, Los Alamitos (1999)
- 11. Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical report, The University of Texas at Austin, Department of Computer Sciences (2002)
- 14. Kodukula, I., Pingali, K.: Transformations for imperfectly nested loops. In: ACM Int. Conf. on Supercomputing, Pittsburgh, Pennsylvania, USA, p. 12. IEEE Computer Society, Washington (1996). doi:10.1145/369028.369051
- 15. Metzger, R., Wen, Z.: Automatic Algorithm Recognition: A New Approach to Program Optimization. MIT Press, Cambridge (2000)
- 16. Intel Math Kernel Library (Intel MKL). Intel
- 17. Triantafyllis, S., Vachharajani, M., August, D.I.: Compiler optimization-space exploration. Journal of Instruction-Level Parallelism (2005)
- 18. Whaley, R., Dongarra, J.: Automatically tuned linear algebra software (1997)
- 19. Wolfe, M.: Iteration space tiling for memory hierarchies. In: Conf. on Parallel Processing for Scientific Computing, pp. 357–361. Society for Industrial and Applied Mathematics, Philadelphia (1989)
- 20. CAPS entreprise. http://www.caps-entreprise.com
- 21. Yotov, K., et al.: Is search really necessary to generate high-performance BLAS? (2005)