A Transformation-Based Approach to Developing High-Performance GPU Programs

  • Bastian Hagedorn
  • Michel Steuwer
  • Sergei Gorlatch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10742)


We advocate the use of formal patterns and transformations for programming modern many-core processors like Graphics Processing Units (GPUs), as an alternative to the currently used low-level, ad hoc programming approaches like CUDA or OpenCL. Our new contribution is an intermediate level of low-level patterns that bridges the abstraction gap between popular high-level patterns (map, fold/reduce, zip, etc.) and imperative, executable code for many-cores. We define our low-level patterns based on the OpenCL programming model, which is portable across parallel architectures of different vendors, and we introduce semantics-preserving rewrite rules that transform programs with high-level patterns into programs with low-level patterns, from which executable OpenCL programs are automatically generated. We show that program design decisions and optimizations, which are usually applied ad hoc by experts, are systematically expressed in our approach as provably correct transformations for high- and low-level patterns. We evaluate our approach by systematically deriving several differently optimized OpenCL implementations of parallel reduction that achieve performance competitive with OpenCL programs that are manually written and highly tuned by performance experts.
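To make the idea of a semantics-preserving rewrite rule concrete, the following is a minimal sketch of one well-known identity for reduction: reducing a whole array gives the same result as reducing fixed-size chunks and then reducing the partial results, provided the operator is associative and has an identity element. The helper names (`split`, `reduce_flat`, `reduce_chunked`) are hypothetical and chosen only for illustration; this is not the paper's rule set or its generated OpenCL code, only a sequential model of the kind of equivalence such rules rely on.

```python
from functools import reduce
import operator


def split(n, xs):
    """Cut xs into consecutive chunks of length n (the last may be shorter)."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]


def reduce_flat(op, identity, xs):
    """High-level pattern: a single reduction over the whole input."""
    return reduce(op, xs, identity)


def reduce_chunked(op, identity, n, xs):
    """Rewritten form: reduce each chunk, then reduce the partial results.

    Assumption: op is associative and identity is its neutral element.
    On a GPU, each chunk could be handled by a separate group of threads.
    """
    partials = [reduce(op, chunk, identity) for chunk in split(n, xs)]
    return reduce(op, partials, identity)


data = list(range(1, 11))
flat = reduce_flat(operator.add, 0, data)
chunked = reduce_chunked(operator.add, 0, 4, data)
print(flat == chunked)
```

Both forms compute the same value, but the chunked form exposes independent work per chunk, which is the kind of design decision the abstract describes as being expressed through transformations rather than applied by hand.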


Keywords: Parallel programming · Rewrite rules · Algorithmic patterns · GPU · OpenCL · Code generation · Skeletons · Transformations



This work was supported by the German Research Council (DFG) within the Cluster of Excellence CiM (University of Münster), by the German Ministry of Education and Research (BMBF) within the project HPC\(^2\)SE, and by a EuroLab-4-HPC collaboration. We thank Nvidia for their generous hardware donation used in our experiments.




Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Bastian Hagedorn (1)
  • Michel Steuwer (2)
  • Sergei Gorlatch (1)
  1. University of Münster, Münster, Germany
  2. University of Glasgow, Glasgow, UK
