A Pattern for Overlapping Communication and Computation with OpenMP* Target Directives

  • Jonas Hahnfeld
  • Tim Cramer
  • Michael Klemm
  • Christian Terboven
  • Matthias S. Müller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10468)


OpenMP* 4.0 introduced initial support for heterogeneous devices. OpenMP 4.5 improved programmability and added capabilities for asynchronous device kernel offload and data transfer management. However, programmers are still burdened with optimizing data transfers for performance and with handling the limited amount of memory on the target device. This work presents a pipelining concept to efficiently overlap communication and computation using the OpenMP 4.5 target directives. Our evaluation of two key HPC kernels shows performance improvements of up to 24% and the ability to process data sets larger than the device memory.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Jonas Hahnfeld (1)
  • Tim Cramer (1)
  • Michael Klemm (2)
  • Christian Terboven (1)
  • Matthias S. Müller (1)
  1. Chair for High Performance Computing & IT Center, JARA-HPC, RWTH Aachen University, Aachen, Germany
  2. Intel Deutschland GmbH, Feldkirchen, Germany
