Mastering Software Variant Explosion for GPU Accelerators

  • Richard Membarth
  • Frank Hannig
  • Jürgen Teich
  • Mario Körner
  • Wieland Eckert
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7640)


Mapping algorithms efficiently to the target hardware poses a challenge for algorithm designers. This is particularly true for heterogeneous systems hosting accelerators such as graphics cards. While algorithm developers have profound knowledge of the application domain, they often lack the detailed insight into the underlying accelerator hardware that is needed to exploit its processing power. Therefore, this paper introduces a rule-based, domain-specific optimization engine for generating the most appropriate code variant for different Graphics Processing Unit (GPU) accelerators. The optimization engine relies on knowledge fused from the application domain and the target architecture. It is embedded into a framework that allows imaging algorithms to be designed in a Domain-Specific Language (DSL). We show that this makes it possible to keep one common description of an algorithm in the DSL and to select the optimal target code variant for different GPU accelerators and target languages such as CUDA and OpenCL.
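
The idea of a rule-based variant selector can be illustrated with a minimal sketch. This is not the paper's engine: the types `Device`, `Kernel`, and `Variant`, the function `select_variant`, and every rule and threshold below are hypothetical and chosen only for illustration. The sketch shows how domain knowledge (the window size of a local operator) and architecture knowledge (vendor, texture cache, local memory size) might be combined to pick one code variant from a single algorithm description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Architecture knowledge (all fields are illustrative assumptions)."""
    name: str
    vendor: str
    has_texture_cache: bool
    local_memory_kib: int


@dataclass(frozen=True)
class Kernel:
    """Domain knowledge: the window of a local operator."""
    window_width: int
    window_height: int


@dataclass(frozen=True)
class Variant:
    """One generated code variant: target language plus memory path."""
    backend: str
    memory_path: str


def select_variant(kernel: Kernel, device: Device) -> Variant:
    """Pick a variant with simple, hand-written rules (hypothetical)."""
    # CUDA is only available on NVIDIA hardware; otherwise use OpenCL.
    backend = "CUDA" if device.vendor == "NVIDIA" else "OpenCL"

    window_size = kernel.window_width * kernel.window_height
    if window_size == 1:
        # Point operator: no data reuse between neighbours, so staging
        # pixels in a faster memory brings no benefit.
        memory_path = "global"
    elif device.has_texture_cache:
        # Assumed rule: prefer cached texture reads for local operators.
        memory_path = "texture"
    elif device.local_memory_kib >= 16:
        # Assumed threshold: stage the tile in local memory if it fits.
        memory_path = "local"
    else:
        memory_path = "global"
    return Variant(backend, memory_path)


if __name__ == "__main__":
    blur = Kernel(window_width=5, window_height=5)
    devices = [
        Device("hypothetical-gpu-a", "NVIDIA", True, 48),
        Device("hypothetical-gpu-b", "AMD", False, 32),
    ]
    for dev in devices:
        print(dev.name, select_variant(blur, dev))
```

The same `Kernel` description yields a different variant per device, which is the property the abstract claims for the DSL-based framework. A real engine would derive its rules from measured device characteristics rather than from fixed constants.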


Graphic Processing Unit, Local Operator, Local Memory, Iteration Space, Code Variant
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Richard Membarth (1)
  • Frank Hannig (1)
  • Jürgen Teich (1)
  • Mario Körner (2)
  • Wieland Eckert (2)
  1. Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany
  2. Siemens Healthcare Sector, H IM AX, Forchheim, Germany
