
Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading

Published in: The Journal of Supercomputing

Abstract

Heterogeneous multicores such as GPGPUs are now commonplace in modern computing systems. Although they offer the potential for high performance, programmers struggle to program such systems. This paper presents OAO, a compiler-based approach that automatically translates shared-memory OpenMP data-parallel programs to run on heterogeneous multicores through OpenMP offloading directives. Given the large user base of shared-memory OpenMP programs, our approach allows programmers to continue using a familiar single-source programming model while benefiting from heterogeneous performance. OAO introduces a novel runtime optimization scheme that automatically eliminates unnecessary host–device communication, minimizing the communication overhead between the host and the accelerator device. We evaluate OAO by applying it to 23 benchmarks from the PolyBench and Rodinia suites on two distinct GPU platforms. Experimental results show that OAO achieves up to 32× speedup over the original OpenMP version and reduces the host–device communication overhead by up to 99% compared with the hand-translated version.


References

  1. Al-Saber N, Kulkarni M (2015) SemCache++: semantics-aware caching for efficient multi-GPU offloading. In: Proceedings of the 29th ACM International Conference on Supercomputing. ACM, pp 79–88

  2. Baskaran MM, Ramanujam J, Sadayappan P (2010) Automatic C-to-CUDA code generation for affine programs. In: International Conference on Compiler Construction. Springer, pp 244–263

  3. Castro D, Romano P, Ilic A, Khan AM (2019) HeTM: transactional memory for heterogeneous systems. In: 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, pp 232–244

  4. Che S, Boyer M, Meng J, Tarjan D, Sheaffer JW, Lee SH, Skadron K (2009) Rodinia: a benchmark suite for heterogeneous computing. In: 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE, pp 44–54

  5. NVIDIA Corporation (2019) CUDA Toolkit documentation v10.2.89. https://docs.nvidia.com/cuda. Accessed 10 Dec 2019

  6. Huang Y, Li D (2017) Performance modeling for optimal data placement on GPU with heterogeneous memory systems. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, pp 166–177

  7. Jablin TB, Jablin JA, Prabhu P, Liu F, August DI (2012) Dynamically managed data for CPU–GPU architectures. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM, pp 165–174

  8. Jablin TB, Prabhu P, Jablin JA, Johnson NP, Beard SR, August DI (2011) Automatic CPU–GPU communication management and optimization. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, pp 142–151

  9. Kim Y, Kim H (2019) Translating CUDA to OpenCL for hardware generation using neural machine translation. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, pp 285–286

  10. Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis and transformation. In: International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, pp 75–86

  11. Lee S, Eigenmann R (2010) OpenMPC: extended OpenMP programming and tuning for GPUs. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, pp 1–11

  12. Li L, Chapman B (2019) Compiler-assisted hybrid implicit and explicit GPU memory management under unified address space. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, p 51

  13. The LLVM Team (2020) Clang: a C language family frontend for LLVM. http://clang.llvm.org. Accessed 14 Sep 2020

  14. The LLVM Team (2020) The LLVM compiler infrastructure. http://llvm.org. Accessed 14 Sep 2020

  15. Mendonça G, Guimarães B, Alves P, Pereira M, Araújo G, Pereira FMQ (2017) DawnCC: automatic annotation for data parallelism and offloading. ACM Trans Archit Code Optim (TACO) 14(2):13


  16. Mendonça G, Guimarães B, Pereira FMQ (2018) Benchmarks used to evaluate DawnCC. http://cuda.dcc.ufmg.br/dawn/benchmarks.zip. Accessed 21 Dec 2018

  17. Mendonça GSD, Guimaraes BCF, Alves PRO, Pereira FMQ, Pereira MM, Araújo G (2016) Automatic insertion of copy annotation in data-parallel programs. In: 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, pp 34–41

  18. Nugteren C, Corporaal H (2015) Bones: an automatic skeleton-based C-to-CUDA compiler for GPUs. ACM Trans Archit Code Optim (TACO) 11(4):35


  19. O’Boyle MF, Wang Z, Grewe D (2013) Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE Computer Society, pp 1–10

  20. OpenMP ARB (2019) OpenMP Application Program Interface version 3.1. https://www.openmp.org/wp-content/uploads/OpenMP3.1.pdf. Accessed 07 Nov 2019

  21. OpenMP ARB (2019) OpenMP Application Program Interface version 4.0. https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf. Accessed 07 Nov 2019

  22. OpenMP ARB (2019) OpenMP Application Program Interface version 4.5. https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf. Accessed 07 Nov 2019

  23. OpenMP ARB (2019) OpenMP Application Program Interface version 5.0. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf. Accessed 07 Nov 2019

  24. Pai S, Govindarajan R, Thazhuthaveetil MJ (2012) Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, pp 33–42

  25. Pouchet LN et al (2018) PolyBench/C: the polyhedral benchmark suite. https://web.cse.ohio-state.edu/~pouchet.2/software/polybench. Accessed 21 Dec 2018

  26. Riebler H, Vaz G, Kenter T, Plessl C (2019) Transparent acceleration for heterogeneous platforms with compilation to OpenCL. ACM Trans Archit Code Optim (TACO) 16(2):1–26


  27. Saraswat V, Bloom B, Peshansky I, Tardieu O, Grove D (2019) The X10 parallel programming language. http://x10-lang.org. Accessed 10 Dec 2019

  28. Sathre P, Gardner M, Feng WC (2019) On the portability of CPU-accelerated applications via automated source-to-source translation. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp 1–8

  29. Sousa R, Pereira M, Pereira FMQ, Araujo G (2019) Data-flow analysis and optimization for data coherence in heterogeneous architectures. J Parallel Distrib Comput 130:126–139


  30. Verdoolaege S, Juega JC, Cohen A, Gómez JI, Tenllado C, Catthoor F (2013) Polyhedral parallel code generation for CUDA. ACM Trans Archit Code Optim (TACO) 9(4):54


  31. Wang K, Che S, Skadron K (2019) Rodinia: a benchmark suite for heterogeneous computing. http://lava.cs.virginia.edu/Rodinia/download_links.htm. Accessed 23 June 2019

  32. Wang X, Huang K, Knoll A, Qian X (2019) A hybrid framework for fast and accurate GPU performance estimation through source-level analysis and trace-based simulation. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, pp 506–518

  33. Wu S, Dong X, Zhang X, Zhu Z (2019) NOT: a high-level no-threading parallel programming method for heterogeneous systems. J Supercomput 75(7):3810–3841


  34. Xiao J, Andelfinger P, Cai W, Richmond P, Knoll A, Eckhoff D (2020) OpenABLext: an automatic code generation framework for agent-based simulations on CPU–GPU–FPGA heterogeneous platforms. Concurrency and Computation: Practice and Experience, e5807

  35. Zhang W, Cheng AM, Subhlok J (2015) DwarfCode: a performance prediction tool for parallel applications. IEEE Trans Comput 65(2):495–507


  36. Zhang W, Hao M, Snir M (2017) Predicting HPC parallel program performance based on the LLVM compiler. Cluster Comput 20(2):1179–1192



Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB0202901), the Key-Area Research and Development Program of Guangdong Province (No. 2019B010136001), the National Natural Science Foundation of China (NSFC) (No. 61672186), and the Shenzhen Technology Research and Development Fund (No. JCYJ20190806143418198). Professor Zhang is the corresponding author.

Author information

Correspondence to Weizhe Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, F., Zhang, W., Guo, H. et al. Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading. J Supercomput 77, 4957–4987 (2021). https://doi.org/10.1007/s11227-020-03452-2

