A Pattern for Overlapping Communication and Computation with OpenMP\(^*\) Target Directives

  • Jonas Hahnfeld
  • Tim Cramer
  • Michael Klemm
  • Christian Terboven
  • Matthias S. Müller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10468)


OpenMP\(^*\) 4.0 introduced initial support for heterogeneous devices. OpenMP 4.5 improved programmability and added capabilities for asynchronous device kernel offload and data transfer management. However, programmers still bear the burden of optimizing data transfers for performance and of coping with the limited amount of memory on the target device. This work presents a pipelining concept that efficiently overlaps communication and computation using the OpenMP 4.5 target directives. Our evaluation of two key HPC kernels shows performance improvements of up to 24% and the ability to process data sets larger than device memory.
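As an illustration of the general idea only, and not the authors' implementation, the following minimal C sketch splits an array into chunks and issues the transfer to the device, the kernel, and the transfer back as asynchronous OpenMP 4.5 target tasks. The function name `pipelined_scale`, the scaling kernel, and the two-slot dependence scheme are assumptions made for this example; the page does not show the paper's actual pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): multiplies n doubles by `factor`
 * on the default device, one chunk at a time. Each chunk gets three
 * asynchronous target tasks: copy in, compute, copy out. */
void pipelined_scale(double *data, size_t n, size_t chunks, double factor)
{
    assert(chunks > 0);
    size_t chunk_len = (n + chunks - 1) / chunks;   /* ceiling division */

    /* Host-only dependence tokens; they are never mapped to the device.
     * Tasks of chunk c depend on slot[c % 2], so the three tasks of a chunk
     * run in order, chunk c + 2 waits for chunk c, and neighbouring chunks
     * remain free to overlap (an assumed double-buffering scheme). */
    char slot[2];
    (void)slot;   /* silences "unused" warnings when built without OpenMP */

    for (size_t c = 0; c < chunks; ++c) {
        size_t lo = c * chunk_len;
        if (lo >= n)
            break;
        size_t len = (n - lo < chunk_len) ? n - lo : chunk_len;
        double *p = data + lo;

        /* Host-to-device transfer of this chunk. */
        #pragma omp target enter data map(to: p[0:len]) depend(inout: slot[c % 2]) nowait

        /* Kernel on the chunk; map(alloc:) reuses the data already present. */
        #pragma omp target teams distribute parallel for map(alloc: p[0:len]) depend(inout: slot[c % 2]) nowait
        for (size_t i = 0; i < len; ++i)
            p[i] *= factor;

        /* Device-to-host transfer; this also releases the device copy. */
        #pragma omp target exit data map(from: p[0:len]) depend(inout: slot[c % 2]) nowait
    }

    /* Wait for all outstanding target tasks before returning. */
    #pragma omp taskwait
}
```

A call such as `pipelined_scale(a, 10, 3, 2.0)` doubles every element of a ten-element array `a`. Because at most two chunks are in flight under this scheme, the device only needs room for two chunks at a time. Whether transfers and kernels actually overlap depends on the compiler, runtime, and device. Without OpenMP support the pragmas are ignored and the loop runs sequentially on the host with the same result.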



Parts of this work were funded by the German Federal Ministry of Research and Education (BMBF) under Grant Number 01IH13008A (ELP). Simulations were performed with computing resources granted by JARA-HPC from RWTH Aachen University under project jara0001.

Intel, Xeon, and Xeon Phi are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\(^*\)Other names and brands are the property of their respective owners.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Jonas Hahnfeld (1, corresponding author)
  • Tim Cramer (1)
  • Michael Klemm (2)
  • Christian Terboven (1)
  • Matthias S. Müller (1)

  1. Chair for High Performance Computing & IT Center, JARA–HPC, RWTH Aachen University, Aachen, Germany
  2. Intel Deutschland GmbH, Feldkirchen, Germany
