Compiling and Optimizing OpenMP 4.X Programs to OpenCL and SPIR

  • Marcio M. Pereira
  • Rafael C. F. Sousa
  • Guido Araujo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10468)

Abstract

Given their massively parallel computing capabilities, heterogeneous architectures composed of CPUs and accelerators have been increasingly used to speed up scientific and engineering applications. Nevertheless, programming such architectures is a challenging task for most non-expert programmers, as typical accelerator programming languages (e.g., CUDA and OpenCL) demand a thorough understanding of the underlying hardware to achieve an effective application speed-up. To achieve this, programmers are usually required to significantly change and adapt program structures and algorithms, thus impacting both performance and productivity. A simpler alternative is to use high-level directive-based programming models like OpenACC and OpenMP. These models allow programmers to insert directives and runtime calls into existing source code, providing hints that the compiler and runtime use to perform certain transformations and optimizations on the annotated code regions. In this paper, we present ACLang, an open-source LLVM/Clang compiler framework (http://www.aclang.org) that implements the recently released OpenMP 4.X Accelerator Programming Model. ACLang automatically converts OpenMP 4.X annotated program regions into OpenCL/SPIR kernels, while providing a set of polyhedral-based optimizations like tiling and vectorization. OpenCL kernels generated by ACLang can be executed on any OpenCL/SPIR compatible acceleration device: not only GPUs, but also FPGA accelerators like those found in the Intel HARP architecture. To the best of our knowledge, and at the time this paper was written, this is the first LLVM/Clang implementation of the OpenMP 4.X Accelerator Model that provides a source-to-target OpenCL conversion. Experiments using ACLang on the PolyBench benchmark reveal speed-ups of up to 30x on an Exynos 8890 Octacore CPU with an ARM Mali-T880 MP12 GPU, up to 62x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU, and up to 112x on a 2.1 GHz 32-core Intel Xeon processor equipped with a Tesla K40c GPU.
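
For illustration, the sketch below (not taken from the paper) shows the kind of OpenMP 4.X annotated region that a framework such as ACLang is designed to offload, followed by a hand-written OpenCL C kernel that such a loop roughly corresponds to. The kernel name and the one-work-item-per-iteration mapping are assumptions for exposition, not ACLang's actual output.

    /* Host side: an OpenMP 4.X target region. The "target" directive
     * offloads the annotated region to an accelerator device; the "map"
     * clauses tell the compiler and runtime how to move data between
     * host and device memory. */
    void vec_add(int n, const float *a, const float *b, float *c)
    {
        #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Device side: an OpenCL C kernel equivalent to the loop above,
     * with each work-item computing one loop iteration. */
    __kernel void vec_add_kernel(int n,
                                 __global const float *a,
                                 __global const float *b,
                                 __global float *c)
    {
        int i = get_global_id(0);
        if (i < n)
            c[i] = a[i] + b[i];
    }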

Notes

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments.

This work is supported by Samsung (grant 4716.08) and FAPESP Center for Computational Engineering and Sciences (grant 13/08293-7).

References

  1. OpenCL: The Open Standard for Parallel Programming of Heterogeneous Systems. Khronos Group (2010). http://www.khronos.org/opencl
  2. SPIR: An OpenCL Standard Portable Intermediate Language for Parallel Compute and Graphics. Khronos Group (2014). https://www.khronos.org/spir
  3. CUDA – Compute Unified Device Architecture. NVIDIA. http://www.nvidia.com/object/cuda_home_new.html
  4. OpenMP API Specification for Parallel Programming, Version 4.5. OpenMP ARB (2015). http://openmp.org/wp/openmp-specifications/
  5. PolyBench/GPU: Implementation of PolyBench codes for GPU processing. http://web.cse.ohio-state.edu/~pouchet/software/polybench/GPU/
  6.
  7.
  8. Baskaran, M.M., Ramanujam, J., Sadayappan, P.: Automatic C-to-CUDA code generation for affine programs. In: Gupta, R. (ed.) CC 2010. LNCS, vol. 6011, pp. 244–263. Springer, Heidelberg (2010). doi:10.1007/978-3-642-11970-5_14
  9. Lee, S., Min, S.-J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: PPoPP 2009, pp. 101–110 (2009)
  10. Liao, C., Yan, Y., de Supinski, B.R., Quinlan, D.J., Chapman, B.: Early experiences with the OpenMP accelerator model. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 84–98. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40698-0_7
  11. Verdoolaege, S.: isl: an integer set library for the polyhedral model. In: Fukuda, K., Hoeven, J., Joswig, M., Takayama, N. (eds.) ICMS 2010. LNCS, vol. 6327, pp. 299–302. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15582-6_49
  12. Verdoolaege, S., Juega, J.C., Cohen, A., Gómez, J.I., Tenllado, C., Catthoor, F.: Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9(4), 1–23 (2013)
  13. Bastoul, C.: Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004). IEEE Computer Society (2004)
  14. Grosser, T., Verdoolaege, S., Cohen, A.: Polyhedral AST generation is more than scanning polyhedra. ACM Trans. Program. Lang. Syst. 37(4), Article 12, 50 pages (2015). http://dx.doi.org/10.1145/2743016
  15. Bertolli, C., Antao, S.F., Bercea, G.-T., Jacob, A.C., Eichenberger, A.E., Chen, T., Sura, Z., Sung, H., Rokos, G., Appelhans, D., O’Brien, K.: Integrating GPU support for OpenMP offloading directives into Clang. In: LLVM-HPC 2015, Austin, Texas, USA, 15–20 November 2015 (2015)
  16. Antao, S.F., Bataev, A., Jacob, A.C., Bercea, G.-T., Eichenberger, A.E., Rokos, G., Martineau, M., Jin, T., Ozen, G., Sura, Z., Chen, T., Sung, H., Bertolli, C., O’Brien, K.: Offloading support for OpenMP in Clang and LLVM. In: Third Workshop on the LLVM Compiler Infrastructure in HPC (2016)
  17. Martineau, M., McIntosh-Smith, S., Bertolli, C., Jacob, A.C., Antao, S.F., Eichenberger, A., Bercea, G.-T., Chen, T., Jin, T., O’Brien, K., Rokos, G., Sung, H., Sura, Z.: Performance analysis and optimization of Clang’s OpenMP 4.5 GPU support. In: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, pp. 54–64. IEEE Press (2016)
  18. Tian, X., Saito, H., Su, E., Gaba, A., Masten, M., Garcia, E., Zaks, A.: LLVM framework and IR extensions for parallelization, SIMD vectorization and offloading. In: Third Workshop on the LLVM Compiler Infrastructure in HPC (2016)
  19. Nuzman, D., Zaks, A.: Outer-loop vectorization revisited for short SIMD architectures. In: International Conference on Parallel Architectures and Compilation Techniques (PACT 2008) (2008)
  20. Trifunović, K., Nuzman, D., Cohen, A., Zaks, A., Rosen, I.: Polyhedral-model guided loop-nest auto-vectorization. In: International Conference on Parallel Architectures and Compilation Techniques (PACT 2009) (2009)
  21. Firestone, D.: SmartNIC: FPGA innovation in OCS servers for Microsoft Azure. In: OCP U.S. Summit (2016)
  22. Hussain, W., Airoldi, R., Hoffmann, H., Ahonen, T., Nurmi, J.: HARP2: an X-scale reconfigurable accelerator-rich platform for massively-parallel signal processing algorithms. J. Sig. Process. Syst. 85(3), 341 (2016)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marcio M. Pereira 1
  • Rafael C. F. Sousa 1
  • Guido Araujo 1

  1. Institute of Computing, University of Campinas (UNICAMP), Campinas, Brazil