Directive-Based Compilers for GPUs

  • Swapnil Ghike
  • Rubén Gran
  • María J. Garzarán
  • David Padua
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8967)


General-purpose graphics processing units (GPGPUs) can substantially improve the performance of many contemporary scientific applications. However, programming GPUs using machine-specific notations like CUDA or OpenCL can be complex and time-consuming. In addition, the resulting programs are typically fine-tuned for a particular target device. A promising alternative is to program in a conventional, machine-independent notation extended with directives and to use compilers to generate GPU code automatically. These compilers enable portability and increase programmer productivity and, if effective, would impose little performance penalty.
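The directive-based style described above can be illustrated with a minimal sketch, assuming OpenACC syntax in C. The function name `saxpy` and the chosen data clauses are illustrative assumptions and are not taken from the paper or from the benchmarks it evaluates.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for 0 <= i < n.
 *
 * Assumption: an OpenACC-capable compiler treats the pragma as a request
 * to offload the loop to the GPU, copying x to the device and y in both
 * directions. A compiler without OpenACC support ignores the pragma and
 * runs the same loop sequentially on the host. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source remains ordinary C: the loop body is unchanged, and the parallelization and data-movement hints live entirely in the directive. An equivalent CUDA version would typically need an explicit kernel, device allocations, and memory copies.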

This paper evaluates two such compilers, PGI and Cray. We first identify a collection of standard transformations that these compilers can apply. Then, we propose a sequence of manual transformations that programmers can apply to enable the generation of efficient GPU kernels. Lastly, using the Rodinia benchmark suite, we compare the performance of the code generated by the PGI and Cray compilers with that of code written in CUDA. Our evaluation shows that the code produced by the PGI and Cray compilers can perform well. For 6 of the 15 benchmarks that we evaluated, the compiler-generated code achieved over 85% of the performance of a hand-tuned CUDA version.


Keywords: Directive-based compiler, OpenACC, GPGPU, Evaluation, Cray, PGI, Accelerator



This research is part of the Blue Waters sustained-petascale computing project, which is supported by NSF (award number OCI 07-25070) and the state of Illinois. It was also supported by NSF under Award CNS 1111407 and by grants TIN2007-60625, TIN2010-21291-C02-01 and TIN2013-64957-C2-1-P (Spanish Government and European ERDF), gaZ: T48 research group (Aragon Government and European ESF).



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Swapnil Ghike (1)
  • Rubén Gran (2), corresponding author
  • María J. Garzarán (1)
  • David Padua (1)
  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, USA
  2. Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Zaragoza, Spain
