
OpenMP Code Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries

  • Conference paper
Accelerator Programming Using Directives (WACCPD 2018)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11381)


Abstract

This paper presents three ideas that focus on improving the execution of high-level parallel code on GPUs. The first addresses programs that include multiple parallel blocks within a single region of GPU code. A proposed compiler transformation can split such regions into multiple regions, leading to the launch of multiple kernels, one for each parallel region. Advantages include the opportunity to tailor the grid geometry of each kernel to the parallel region that it executes and the elimination of the overheads imposed by a code-generation scheme meant to handle multiple nested parallel regions. The second idea is a code transformation that sets up a pipeline of kernel execution and asynchronous data transfer, enabling the overlap of communication and computation. The intricate technical details required for this transformation are described. The third idea is that the selection of a grid geometry for the execution of a parallel region must balance GPU occupancy against the potential saturation of memory throughput in the GPU. Adding this parameter to the geometry-selection heuristic can often yield better performance at lower occupancy levels.
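To make the first idea concrete, the sketch below is illustrative only (the arrays a, b, c and the size n are hypothetical, not taken from the paper's benchmarks). It shows a single OpenMP target region containing two parallel loops, followed by the form the code takes after the proposed splitting transformation, which launches one kernel per parallel loop:

/* Before: one target region with two parallel loops -> one GPU kernel,
   generated with the general (and costly) multi-parallel-region scheme. */
#pragma omp target map(to: a[0:n]) map(from: b[0:n], c[0:n])
{
  #pragma omp parallel for
  for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];

  #pragma omp parallel for
  for (int i = 0; i < n; i++) c[i] = a[i] + 1.0;
}

/* After splitting: one target region, and thus one kernel, per parallel
   loop; each kernel can now be launched with a grid geometry tailored
   to its own loop. */
#pragma omp target parallel for map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];

#pragma omp target parallel for map(to: a[0:n]) map(from: c[0:n])
for (int i = 0; i < n; i++) c[i] = a[i] + 1.0;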


Notes

  1. CUDA terminology is used in this paper.

  2. Dynamic frequency scaling makes achieving consistent, reproducible results very challenging due to high variance and the increased effects of device warm-up.


Acknowledgements

This research was supported by the IBM Canada Software Lab Centre for Advanced Studies (CAS) and by the Natural Sciences and Engineering Research Council (NSERC) of Canada through its Collaborative Research and Development (CRD) program and its Undergraduate Student Research Awards (USRA) program.

Author information


Corresponding author

Correspondence to Artem Chikin.


A Artifact Description Appendix: OpenMP Target Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries

A.1 Abstract

This artifact contains the code for our experimental evaluation of the kernel-splitting and kernel-pipelining methods, along with instructions to run the benchmark versions and replicate all experimental results from Sect. 7.
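As a rough illustration of the kernel-pipelining idea evaluated here (a sketch only, not code taken from the benchmarks; the array a, the size n, and the chunk size CHUNK are hypothetical), OpenMP 4.5 asynchronous target constructs can be combined so that the transfer of one chunk overlaps with the kernel that processes another:

/* a[0:n] is allocated on the device up front; no data is transferred yet. */
#pragma omp target enter data map(alloc: a[0:n])

for (int off = 0; off < n; off += CHUNK) {
  int len = (off + CHUNK < n) ? CHUNK : n - off;

  /* Asynchronously copy this chunk to the device. */
  #pragma omp target update to(a[off:len]) depend(out: a[off]) nowait

  /* Launch the kernel for this chunk once its transfer completes.
     Different chunks use distinct dependence addresses, so the transfer
     of one chunk can overlap with the computation on another. */
  #pragma omp target teams distribute parallel for depend(inout: a[off]) nowait
  for (int i = off; i < off + len; i++)
    a[i] = 2.0 * a[i];
}

/* Wait for all deferred transfers and kernels, then copy the results back. */
#pragma omp taskwait
#pragma omp target exit data map(from: a[0:n])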

A.2 Description

Check-List

  • Program: C code, Python3 code

  • Compilation: Prototype of Clang-YKT compiler used

  • Transformations: Kernel-Splitting, Kernel-Pipelining

  • Hardware: Intel i7-4770 with Nvidia Titan X Pascal, IBM POWER8 (8335-GTB) with Nvidia P100 GPU

  • Software: x86: Ubuntu 18.04 LTS, Cuda V9.1.85; POWER: RHEL Server 7.3, Cuda V9.2.88

  • Experiment workflow: Install Clang-YKT prototype then run the provided benchmarks with the given script

  • Publicly available?: Yes

How Software Can Be Obtained. Our prototype of the Clang-YKT compiler is available on GitHub; it includes the benchmark versions used for all experiments.

The original Clang-YKT compiler can be found at:

https://github.com/clang-ykt/clang

With the commit hash: 49d8020e03f898ea31212f6c565001e067f67d4f

Hardware Dependencies. An Intel i7-4770 machine with an Nvidia Titan X Pascal GPU was used for almost all of the experiments; an equivalent machine is required to obtain similar results. This is especially true for the occupancy experiments, because the reported optimal occupancies are tied to the Nvidia Titan X Pascal GPU. An additional IBM POWER8 (8335-GTB) host with an Nvidia P100 GPU was used for the kernel-pipelining experiments with different page sizes; replicating those results requires a similar machine.

Software Dependencies. A prototype of the Clang-YKT compiler, obtained from GitHub, was used to compile the OpenMP code; however, any compiler that supports OpenMP 4 can be used to build and run the kernel-splitting and kernel-pipelining benchmark versions.

Datasets. Each benchmark requires only the desired trip count as input; only SRAD additionally requires a pgm image.

A.3 Installation

Clone the Clang-YKT prototype repository (includes all testing files):

$ git clone https://github.com/uasys/openmp-split

Then install the compiler with the following commands:

$ mkdir -p $build

$ cd $build

# 60 stands for GPU compute capability

$ cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=$CLANGYKT_DIR -DLLVM_ENABLE_BACKTRACES=ON -DLLVM_ENABLE_WERROR=OFF -DBUILD_SHARED_LIBS=OFF -DLLVM_ENABLE_RTTI=ON -DOPENMP_ENABLE_LIBOMPTARGET=ON -DCMAKE_C_FLAGS='-DOPENMP_NVPTX_COMPUTE_CAPABILITY=60' -DCMAKE_CXX_FLAGS='-DOPENMP_NVPTX_COMPUTE_CAPABILITY=60' -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY=60 -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 -DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true -G Ninja $LLVM_BASE

$ ninja -j4; ninja install

After installation, the GPU clock rate must be locked at 80% of the GPU's nominal clock rate to prevent performance variation due to frequency scaling during the experiments.

To lock the clocks on the Nvidia Titan X Pascal, run:

$ nvidia-smi -pm 1

$ nvidia-smi --applications-clocks=4513,1240
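If desired, the default application clocks can be restored after the experiments with nvidia-smi --reset-applications-clocks (both commands require administrative privileges).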

A.4 Experiment Workflow

Experiments are run by executing the runTest.py script in the folder of the chosen transformation, passing the name of the chosen benchmark and the trip count at which to run it, as in the hypothetical invocation shown below. The benchmarks for kernel-splitting and those for kernel-pipelining are kept in separate folders.
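For example (a hypothetical invocation; the placeholders stand for the actual folder, benchmark name, and trip count, and the exact argument format depends on the script):

$ cd <transformation-folder>

$ python3 runTest.py <benchmark-name> <tripcount>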

A.5 Evaluation and Expected Results

Once all runs are complete, the script prints the average run time of each version, the percentage variance, and the speedup relative to the baseline.

A.6 Experiment Customization

Grid geometry can be adjusted by editing the values of the BLOCKS macro in the benchmark files whose names include a G postfix, which indicates a version with custom grid geometry; a sketch follows.
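A minimal sketch of what such a version might look like (the BLOCKS macro name comes from the artifact; the loop body and the use of num_teams are illustrative assumptions):

#define BLOCKS 512  /* edit this value to change the grid geometry */

#pragma omp target teams distribute parallel for num_teams(BLOCKS)
for (int i = 0; i < n; i++)
  out[i] = in[i] + 1.0;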

A.7 Notes

None.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Chikin, A., Gobran, T., Amaral, J.N. (2019). OpenMP Code Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries. In: Chandrasekaran, S., Juckeland, G., Wienke, S. (eds) Accelerator Programming Using Directives. WACCPD 2018. Lecture Notes in Computer Science, vol 11381. Springer, Cham. https://doi.org/10.1007/978-3-030-12274-4_3


  • DOI: https://doi.org/10.1007/978-3-030-12274-4_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12273-7

  • Online ISBN: 978-3-030-12274-4

  • eBook Packages: Computer Science, Computer Science (R0)
