
The TRegion Interface and Compiler Optimizations for OpenMP Target Regions

  • Johannes Doerfert
  • Jose Manuel Monsalve Diaz
  • Hal Finkel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11718)

Abstract

OpenMP is a well-established, single-source programming language extension that introduces parallelism into (historically) sequential base languages, namely C/C++ and Fortran. To program not only multi-core CPUs but also many-core processors and heavily parallel accelerators, OpenMP 4.0 adopted a flexible offloading scheme inspired by the hierarchy in many GPU designs. The flexibility of this scheme allows it to be used in various application scenarios. However, it may also result in a significant performance loss, especially because OpenMP semantics are traditionally interpreted solely in the language front-end to avoid problems with the “sequential-execution-minded” optimization pipeline. Given the limited analysis and transformation capabilities of a modern compiler front-end, the actual syntax used for OpenMP offloading can substantially impact the observed performance: whenever certain facts are not syntactically obvious, the front-end has to favor correct but overly conservative code.
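To make this syntax sensitivity concrete, the following C sketch (illustrative only, not code from the paper) shows two semantically equivalent offloaded SAXPY loops. A front-end-only interpretation maps the combined directive in the first version directly onto an efficient SPMD execution scheme, whereas the split version, whose parallelism is not syntactically visible at the target directive, forces a conservative “generic” scheme with a master thread and a worker state machine.

```c
#define N 1024

/* Combined directive: the parallel loop provably spans the whole target
 * region, so the front-end can emit efficient SPMD-mode code directly. */
void saxpy_combined(float a, float *x, float *y) {
  #pragma omp target teams distribute parallel for \
      map(to: x[0:N]) map(tofrom: y[0:N])
  for (int i = 0; i < N; ++i)
    y[i] = a * x[i] + y[i];
}

/* Split directives with sequential code in between: a front-end-only
 * implementation cannot prove the region is SPMD-compatible and falls
 * back to the guarded "generic" execution scheme, even though the two
 * versions compute the same result. */
void saxpy_split(float a, float *x, float *y) {
  #pragma omp target map(to: x[0:N]) map(tofrom: y[0:N])
  {
    float scale = a;  /* sequential part, run by a single thread */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
      y[i] = scale * x[i] + y[i];
  }
}
```

Both sketches compile with, e.g., clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda.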

In this work, we investigate how we can delay (target-specific) implementation decisions that are currently taken early during the compilation of OpenMP offloading code. We prototyped our solution in LLVM/Clang, an industrial-strength OpenMP compiler, to show that we can base these decisions on semantic source code analyses instead of relying on the user-provided syntax. Our preliminary results on the rather simple Rodinia benchmarks already show speedups of up to 1.55×.
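As an illustration of what such a semantic analysis can recover (a minimal sketch under our own assumptions, not code from the paper): in the region below, the sequential preamble is side-effect free and computes the same value for every thread, so a middle-end analysis can prove it safe to execute redundantly on all threads and thereby convert the conservatively guarded region into SPMD-mode code, regardless of the syntax the user chose.

```c
#include <math.h>

#define N 4096

/* The sequential part of this target region only computes a side-effect
 * free, thread-invariant value.  A semantic (middle-end) analysis can
 * prove that executing it redundantly on every thread is safe, allowing
 * the guarded "generic" scheme emitted by the front-end to be rewritten
 * into straight-line SPMD execution. */
void normalize(float *v, float squared_norm) {
  #pragma omp target map(tofrom: v[0:N])
  {
    float inv = 1.0f / sqrtf(squared_norm);  /* uniform across threads */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
      v[i] *= inv;
  }
}
```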

Keywords

Compiler optimizations · GPU · Accelerator offloading

Acknowledgments

We would like to thank the reviewers for their helpful and extensive comments.

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, USA
