OpenMP Code Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries

  • Artem Chikin
  • Tyler Gobran
  • José Nelson Amaral
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11381)


This paper presents three ideas that focus on improving the execution of high-level parallel code on GPUs. The first addresses programs that include multiple parallel blocks within a single region of GPU code. A proposed compiler transformation can split such a region into multiple regions, launching a separate kernel for each parallel block. Advantages include the opportunity to tailor the grid geometry of each kernel to the parallel region that it executes and the elimination of the overheads imposed by a code-generation scheme meant to handle multiple nested parallel regions. The second idea is a code transformation that sets up a pipeline of kernel execution and asynchronous data transfer, enabling the overlap of communication and computation. The intricate technical details required for this transformation are described. The third idea is that the selection of a grid geometry for the execution of a parallel region must balance GPU occupancy against the potential saturation of the GPU's memory throughput. Adding this parameter to the geometry-selection heuristic can often yield better performance at lower occupancy levels.



This research was supported by the IBM Canada Software Lab Centre for Advanced Studies (CAS) and by the Natural Sciences and Engineering Research Council (NSERC) of Canada through its Collaborative Research and Development (CRD) program and its Undergraduate Student Research Awards (USRA) program.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Artem Chikin, University of Alberta, Edmonton, Canada
  • Tyler Gobran, University of Alberta, Edmonton, Canada
  • José Nelson Amaral, University of Alberta, Edmonton, Canada
