
OpenMP Code Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries

  • Conference paper
Accelerator Programming Using Directives (WACCPD 2018)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11381)


Abstract

This paper presents three ideas that focus on improving the execution of high-level parallel code on GPUs. The first addresses programs that include multiple parallel blocks within a single region of GPU code. A proposed compiler transformation can split such regions into multiple regions, leading to the launch of multiple kernels, one for each parallel region. Advantages include the opportunity to tailor the grid geometry of each kernel to the parallel region that it executes and the elimination of the overheads imposed by a code-generation scheme meant to handle multiple nested parallel regions. The second idea is a code transformation that sets up a pipeline of kernel execution and asynchronous data transfer, enabling the overlap of communication and computation. The intricate technical details required for this transformation are described. The third idea is that the selection of a grid geometry for the execution of a parallel region must balance GPU occupancy against the potential saturation of memory throughput in the GPU. Adding this parameter to the geometry-selection heuristic can often yield better performance at lower occupancy levels.
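To make the first idea concrete, the sketch below is illustrative only (the arrays a, b, c and the size n are hypothetical, not taken from the paper's benchmarks). It shows a single OpenMP target region containing two parallel loops, followed by the form the code takes after the proposed splitting transformation, which launches one kernel per parallel loop:

/* Before: one target region with two parallel loops -> one GPU kernel,
   generated with the general (and costly) multi-parallel-region scheme. */
#pragma omp target map(to: a[0:n]) map(from: b[0:n], c[0:n])
{
  #pragma omp parallel for
  for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];

  #pragma omp parallel for
  for (int i = 0; i < n; i++) c[i] = a[i] + 1.0;
}

/* After splitting: one target region, and thus one kernel, per parallel
   loop; each kernel can now be launched with a grid geometry tailored
   to its own loop. */
#pragma omp target parallel for map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];

#pragma omp target parallel for map(to: a[0:n]) map(from: c[0:n])
for (int i = 0; i < n; i++) c[i] = a[i] + 1.0;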


Notes

  1. CUDA terminology is used in this paper.

  2. Dynamic frequency scaling makes achieving consistent, reproducible results very challenging due to high variance and the increased effects of device warm-up.


Acknowledgements

This research was supported by the IBM Canada Software Lab Centre for Advanced Studies (CAS) and by the Natural Sciences and Engineering Research Council (NSERC) of Canada through its Collaborative Research and Development (CRD) program and its Undergraduate Student Research Awards (USRA) program.

Author information


Corresponding author

Correspondence to Artem Chikin.


A Artifact Description Appendix: OpenMP Target Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries

A.1 Abstract

This artifact contains the code for our experimental evaluation of the kernel-splitting and kernel-pipelining methods, along with instructions to run the benchmark versions and replicate all experimental results from Sect. 7.
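As a rough illustration of the kernel-pipelining idea evaluated here (a sketch only, not code taken from the benchmarks; the array a, the size n, and the chunk size CHUNK are hypothetical), OpenMP 4.5 asynchronous target constructs can be combined so that the transfer of one chunk overlaps with the kernel that processes another:

/* a[0:n] is allocated on the device up front; no data is transferred yet. */
#pragma omp target enter data map(alloc: a[0:n])

for (int off = 0; off < n; off += CHUNK) {
  int len = (off + CHUNK < n) ? CHUNK : n - off;

  /* Asynchronously copy this chunk to the device. */
  #pragma omp target update to(a[off:len]) depend(out: a[off]) nowait

  /* Launch the kernel for this chunk once its transfer completes.
     Different chunks use distinct dependence addresses, so the transfer
     of one chunk can overlap with the computation on another. */
  #pragma omp target teams distribute parallel for depend(inout: a[off]) nowait
  for (int i = off; i < off + len; i++)
    a[i] = 2.0 * a[i];
}

/* Wait for all deferred transfers and kernels, then copy the results back. */
#pragma omp taskwait
#pragma omp target exit data map(from: a[0:n])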

A.2 Description

Check-List

  • Program: C code, Python3 code

  • Compilation: Prototype of Clang-YKT compiler used

  • Transformations: Kernel-Splitting, Kernel-Pipelining

  • Hardware: Intel i7-4770 with Nvidia Titan X Pascal, IBM POWER8 (8335-GTB) with Nvidia P100 GPU

  • Software: x86: Ubuntu 18.04 LTS, Cuda V9.1.85; POWER: RHEL Server 7.3, Cuda V9.2.88

  • Experiment workflow: Install Clang-YKT prototype then run the provided benchmarks with the given script

  • Publicly available?: Yes

How Software Can Be Obtained. Our prototype of the Clang-YKT compiler is available on GitHub; it includes the benchmark versions used for all experiments.

The original Clang-YKT compiler can be found at:

https://github.com/clang-ykt/clang

With the commit hash: 49d8020e03f898ea31212f6c565001e067f67d4f

Hardware Dependencies. An Intel i7-4770 machine with an Nvidia Titan X Pascal GPU was used for almost all of the experiments; an equivalent machine is required to obtain similar results. This is especially true for the occupancy experiments, because the reported optimal occupancies are tied to the Nvidia Titan X Pascal GPU. An additional IBM POWER8 (8335-GTB) host with an Nvidia P100 GPU was used for the kernel-pipelining experiments with different page sizes; replicating those results requires a similar machine.

Software Dependencies. A prototype of the Clang-YKT compiler, obtained from GitHub, was used to compile the OpenMP code; however, any compiler that supports OpenMP 4 can be used to build and run the kernel-splitting and kernel-pipelining benchmark versions.

Datasets. Each benchmark requires only the desired trip count as input; only SRAD additionally requires a pgm image.

A.3 Installation

Clone the Clang-YKT prototype repository (includes all testing files):

$ git clone https://github.com/uasys/openmp-split

Then install the compiler with the following commands:

$ mkdir -p $build

$ cd $build

# 60 stands for GPU compute capability

$ cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=$CLANGYKT_DIR -DLLVM_ENABLE_BACKTRACES=ON -DLLVM_ENABLE_WERROR=OFF -DBUILD_SHARED_LIBS=OFF -DLLVM_ENABLE_RTTI=ON -DOPENMP_ENABLE_LIBOMPTARGET=ON -DCMAKE_C_FLAGS='-DOPENMP_NVPTX_COMPUTE_CAPABILITY=60' -DCMAKE_CXX_FLAGS='-DOPENMP_NVPTX_COMPUTE_CAPABILITY=60' -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY=60 -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 -DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true -G Ninja $LLVM_BASE

$ ninja -j4; ninja install

After installation, the GPU clock rate must be locked at 80% of the GPU's nominal clock rate to prevent performance variation due to frequency scaling during the experiments.

To lock the clocks on the Nvidia Titan X Pascal, run:

$ nvidia-smi -pm 1

$ nvidia-smi --applications-clocks=4513,1240
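If desired, the default application clocks can be restored after the experiments with nvidia-smi --reset-applications-clocks (both commands require administrative privileges).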

A.4 Experiment Workflow

Experiments are run by executing the runTest.py script in the folder of the chosen transformation, passing the name of the chosen benchmark and the trip count at which to run it, as in the hypothetical invocation shown below. The benchmarks for kernel-splitting and those for kernel-pipelining are kept in separate folders.
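For example (a hypothetical invocation; the placeholders stand for the actual folder, benchmark name, and trip count, and the exact argument format depends on the script):

$ cd <transformation-folder>

$ python3 runTest.py <benchmark-name> <tripcount>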

A.5 Evaluation and Expected Results

Once all runs are complete, the script prints the average run time of each version, the percentage variance, and the speedup relative to the baseline.

A.6 Experiment Customization

Grid geometry can be adjusted by editing the values of the BLOCKS macro in the benchmark files whose names include a G postfix, which indicates a version with custom grid geometry; a sketch follows.
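A minimal sketch of what such a version might look like (the BLOCKS macro name comes from the artifact; the loop body and the use of num_teams are illustrative assumptions):

#define BLOCKS 512  /* edit this value to change the grid geometry */

#pragma omp target teams distribute parallel for num_teams(BLOCKS)
for (int i = 0; i < n; i++)
  out[i] = in[i] + 1.0;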

A.7 Notes

None.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Chikin, A., Gobran, T., Amaral, J.N. (2019). OpenMP Code Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries. In: Chandrasekaran, S., Juckeland, G., Wienke, S. (eds) Accelerator Programming Using Directives. WACCPD 2018. Lecture Notes in Computer Science, vol 11381. Springer, Cham. https://doi.org/10.1007/978-3-030-12274-4_3


  • DOI: https://doi.org/10.1007/978-3-030-12274-4_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12273-7

  • Online ISBN: 978-3-030-12274-4

  • eBook Packages: Computer Science, Computer Science (R0)
