A Programming Language Interface to Describe Transformations and Code Generation

  • Gabe Rudy
  • Malik Murtaza Khan
  • Mary Hall
  • Chun Chen
  • Jacqueline Chame
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6548)

Abstract

This paper presents a programming language interface, a complete scripting language, to describe composable compiler transformations. These transformation programs can be written, shared and reused by non-expert application and library developers. From a compiler writer’s perspective, a scripting language interface permits rapid prototyping of compiler algorithms that can mix levels and compose different sequences of transformations, producing readable code as output. From a library or application developer’s perspective, the use of transformation programs permits expression of clean high-level code, and a separate description of how to map that code to architectural features, easing maintenance and porting to new architectures.

We illustrate this interface in the context of CUDA-CHiLL, a source-to-source compiler transformation and code generation framework that transforms sequential loop nests to high-performance GPU code. We show how this high-level transformation and code generation language can be used to express: (1) complex transformation sequences, exemplified by a single loop restructuring construct used to generate a series of tiling and permute commands; and (2) complex code generation sequences to produce CUDA code from a high-level specification. We demonstrate that the automatically-generated code either closely matches or outperforms two hand-tuned GPU library kernels from NVIDIA's CUBLAS 2.2 and 3.2 libraries.
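CUDA-CHiLL's actual transformation recipes are scripts in its own high-level language, which this page does not reproduce. As an illustration of the kind of loop restructuring such a recipe's tiling commands direct, the following is a minimal Python sketch; the function names and the tiling structure here are illustrative assumptions, not the CUDA-CHiLL API.

```python
# Sketch of loop tiling, the core restructuring behind a tiling
# command in a transformation recipe. On a GPU, the tile loops
# (ii, jj, kk) would map to thread blocks and the intra-tile
# loops (i, j, k) to threads; here everything runs sequentially.

def matmul_naive(A, B, n):
    """Reference triple-loop matrix multiply."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, T):
    """Same computation with each loop split into a tile loop of
    stride T and a bounded intra-tile loop, improving locality."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

For each fixed (i, j), the tiled version still visits k in increasing order, so it computes exactly the same sums as the naive version; only the traversal order of the iteration space changes, which is what makes tiling a legal locality transformation for this loop nest.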

Keywords

Iteration Space, Transformation Recipe, Memory Hierarchy, Tile Size, Loop Transformation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Gabe Rudy (1)
  • Malik Murtaza Khan (2)
  • Mary Hall (1)
  • Chun Chen (1)
  • Jacqueline Chame (2)
  1. School of Computing, University of Utah, Salt Lake City, US
  2. Information Sciences Institute, USC, Marina del Rey, US
