
Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading

Published in: The Journal of Supercomputing

Abstract

Heterogeneous multicores such as GPGPUs are now commonplace in modern computing systems. Although they offer the potential for high performance, programmers struggle to program such systems. This paper presents OAO, a compiler-based approach that automatically translates shared-memory OpenMP data-parallel programs to run on heterogeneous multicores through OpenMP offloading directives. Given the large user base of shared-memory OpenMP programs, our approach allows programmers to continue using a familiar single-source programming model while benefiting from heterogeneous performance. OAO introduces a novel runtime optimization scheme that automatically eliminates unnecessary host–device communication, minimizing the communication overhead between the host and the accelerator device. We evaluate OAO by applying it to 23 benchmarks from the PolyBench and Rodinia suites on two distinct GPU platforms. Experimental results show that OAO achieves up to 32× speedup over the original OpenMP version and reduces the host–device communication overhead by up to 99% compared with the hand-translated version.


References

  1. Al-Saber N, Kulkarni M (2015) SemCache++: semantics-aware caching for efficient multi-GPU offloading. In: Proceedings of the 29th ACM International Conference on Supercomputing. ACM, pp 79–88

  2. Baskaran MM, Ramanujam J, Sadayappan P (2010) Automatic C-to-CUDA code generation for affine programs. In: International Conference on Compiler Construction. Springer, pp 244–263

  3. Castro D, Romano P, Ilic A, Khan AM (2019) HeTM: transactional memory for heterogeneous systems. In: 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, pp 232–244

  4. Che S, Boyer M, Meng J, Tarjan D, Sheaffer JW, Lee SH, Skadron K (2009) Rodinia: a benchmark suite for heterogeneous computing. In: 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE, pp 44–54

  5. NVIDIA Corporation (2019) CUDA Toolkit documentation v10.2.89. https://docs.nvidia.com/cuda. Accessed 10 Dec 2019

  6. Huang Y, Li D (2017) Performance modeling for optimal data placement on GPU with heterogeneous memory systems. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, pp 166–177

  7. Jablin TB, Jablin JA, Prabhu P, Liu F, August DI (2012) Dynamically managed data for CPU–GPU architectures. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM, pp 165–174

  8. Jablin TB, Prabhu P, Jablin JA, Johnson NP, Beard SR, August DI (2011) Automatic CPU–GPU communication management and optimization. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, pp 142–151

  9. Kim Y, Kim H (2019) Translating CUDA to OpenCL for hardware generation using neural machine translation. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, pp 285–286

  10. Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis and transformation. In: International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, pp 75–86

  11. Lee S, Eigenmann R (2010) OpenMPC: extended OpenMP programming and tuning for GPUs. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, pp 1–11

  12. Li L, Chapman B (2019) Compiler-assisted hybrid implicit and explicit GPU memory management under unified address space. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, p 51

  13. The LLVM Team (2020) Clang: a C language family frontend for LLVM. http://clang.llvm.org. Accessed 14 Sep 2020

  14. The LLVM Team (2020) The LLVM compiler infrastructure. http://llvm.org. Accessed 14 Sep 2020

  15. Mendonça G, Guimarães B, Alves P, Pereira M, Araújo G, Pereira FMQ (2017) DawnCC: automatic annotation for data parallelism and offloading. ACM Trans Archit Code Optim (TACO) 14(2):13


  16. Mendonça G, Guimarães B, Pereira FMQ (2018) Benchmarks used to evaluate DawnCC. http://cuda.dcc.ufmg.br/dawn/benchmarks.zip. Accessed 21 Dec 2018

  17. Mendonça GSD, Guimaraes BCF, Alves PRO, Pereira FMQ, Pereira MM, Araújo G (2016) Automatic insertion of copy annotation in data-parallel programs. In: 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, pp 34–41

  18. Nugteren C, Corporaal H (2015) Bones: an automatic skeleton-based C-to-CUDA compiler for GPUs. ACM Trans Archit Code Optim (TACO) 11(4):35


  19. O’Boyle MF, Wang Z, Grewe D (2013) Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE Computer Society, pp 1–10

  20. OpenMP ARB (2019) OpenMP Application Program Interface version 3.1. https://www.openmp.org/wp-content/uploads/OpenMP3.1.pdf. Accessed 07 Nov 2019

  21. OpenMP ARB (2019) OpenMP Application Program Interface version 4.0. https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf. Accessed 07 Nov 2019

  22. OpenMP ARB (2019) OpenMP Application Program Interface version 4.5. https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf. Accessed 07 Nov 2019

  23. OpenMP ARB (2019) OpenMP Application Program Interface version 5.0. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf. Accessed 07 Nov 2019

  24. Pai S, Govindarajan R, Thazhuthaveetil MJ (2012) Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, pp 33–42

  25. Pouchet LN et al (2018) PolyBench/C: the polyhedral benchmark suite. https://web.cse.ohio-state.edu/~pouchet.2/software/polybench. Accessed 21 Dec 2018

  26. Riebler H, Vaz G, Kenter T, Plessl C (2019) Transparent acceleration for heterogeneous platforms with compilation to OpenCL. ACM Trans Archit Code Optim (TACO) 16(2):1–26


  27. Saraswat V, Bloom B, Peshansky I, Tardieu O, Grove D (2019) The X10 parallel programming language. http://x10-lang.org. Accessed 10 Dec 2019

  28. Sathre P, Gardner M, Feng WC (2019) On the portability of CPU-accelerated applications via automated source-to-source translation. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp 1–8

  29. Sousa R, Pereira M, Pereira FMQ, Araujo G (2019) Data-flow analysis and optimization for data coherence in heterogeneous architectures. J Parallel Distrib Comput 130:126–139


  30. Verdoolaege S, Juega JC, Cohen A, Gómez JI, Tenllado C, Catthoor F (2013) Polyhedral parallel code generation for CUDA. ACM Trans Archit Code Optim (TACO) 9(4):54


  31. Wang K, Che S, Skadron K (2019) Rodinia: a benchmark suite for heterogeneous computing. http://lava.cs.virginia.edu/Rodinia/download_links.htm. Accessed 23 June 2019

  32. Wang X, Huang K, Knoll A, Qian X (2019) A hybrid framework for fast and accurate GPU performance estimation through source-level analysis and trace-based simulation. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, pp 506–518

  33. Wu S, Dong X, Zhang X, Zhu Z (2019) NOT: a high-level no-threading parallel programming method for heterogeneous systems. J Supercomput 75(7):3810–3841


  34. Xiao J, Andelfinger P, Cai W, Richmond P, Knoll A, Eckhoff D (2020) OpenABLext: an automatic code generation framework for agent-based simulations on CPU–GPU–FPGA heterogeneous platforms. Concurrency and Computation: Practice and Experience, e5807

  35. Zhang W, Cheng AM, Subhlok J (2015) DwarfCode: a performance prediction tool for parallel applications. IEEE Trans Comput 65(2):495–507


  36. Zhang W, Hao M, Snir M (2017) Predicting HPC parallel program performance based on the LLVM compiler. Cluster Comput 20(2):1179–1192



Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB0202901), the Key-Area Research and Development Program of Guangdong Province (No. 2019B010136001), the National Natural Science Foundation of China (NSFC) (No. 61672186), and the Shenzhen Technology Research and Development Fund (No. JCYJ20190806143418198). Professor Zhang is the corresponding author.

Author information

Correspondence to Weizhe Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, F., Zhang, W., Guo, H. et al. Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading. J Supercomput 77, 4957–4987 (2021). https://doi.org/10.1007/s11227-020-03452-2

