Skip to main content

Enhancing OpenMP Tasking Model: Performance and Portability

Part of the Lecture Notes in Computer Science book series (LNPSE,volume 12870)


OpenMP, as the de-facto standard programming model in symmetric multiprocessing for HPC, has seen its performance boosted continuously by the community, either through implementation enhancements or specification augmentations. Furthermore, the language has evolved from a prescriptive nature, as defined by the thread-centric model, to a descriptive behavior, as defined by the task-centric model. However, the overhead related to the orchestration of tasks is still relatively high. Applications exploiting very fine-grained parallelism and systems with a large number of cores available might fail on scaling.

In this work, we propose to include the concept of Task Dependency Graph (TDG) in the specification by introducing a new clause, named taskgraph, attached to task or target directives. By design, the TDG allows alleviating the overhead associated with the OpenMP tasking model, and it also facilitates linking OpenMP with other programming models that support task parallelism. According to our experiments, a GCC implementation of the taskgraph is able to significantly reduce the execution time of fine-grained task applications and increase their scalability with regard to the number of threads.


  • OpenMP specification
  • Tasking model
  • Runtime overhead

This is a preview of subscription content, access via your institution.

Buying options

USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-030-85262-7_3
  • Chapter length: 15 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
USD   64.99
Price excludes VAT (USA)
  • ISBN: 978-3-030-85262-7
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   84.99
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.


  1. 1.

    The execution has run in a node of the Marenostrum IV [1] supercomputer, equipped with an Intel Xeon Platinum 8160 CPU, having 2 sockets of 24 physical cores each, at 2.1 GHz and 33 MB L3 cache.


  1. BSC: Marenostrum IV User’s Guide (2017).

  2. Castello, A., Seo, S., Mayo, R., Balaji, P., Quintana-Orti, E.S., Pena, A.J.: GLTO: on the adequacy of lightweight thread approaches for openmp implementations. In: Proceedings of the International Conference on Parallel Processing, pp. 60–69 (2017)

    Google Scholar 

  3. Gautier, T., Perez, C., Richard, J.: On the impact of OpenMP task granularity. In: de Supinski, B.R., Valero-Lara, P., Martorell, X., Mateo Bellido, S., Labarta, J. (eds.) IWOMP 2018. LNCS, vol. 11128, pp. 205–221. Springer, Cham (2018).

    CrossRef  Google Scholar 

  4. Giannozzi, P., et al.: Quantum espresso: a modular and open-source software project for quantum simulations of materials. J. Phys. Condens. Matter 21(39), 395502 (2009)

    Google Scholar 

  5. Kalray MPPA products (2021).

  6. Komatitsch, D., Tromp, J.: SPECFEM3D Cartesian (2021).

  7. Kukanov, A., Voss, M.J.: The foundations for scalable multi-core software in intel threading building blocks. Intel Technol. J. 11(4), 309–322 (2007).;jsessionid=79B311F4CEB9A4B610520177C7144D57?doi=

  8. Lagrone, J., Aribuki, A., Chapman, B.: A set of microbenchmarks for measuring OpenMP task overheads. In: Proceedingis of International Conference on Parallel and Distributed Processing Techniques and Applications II, pp. 594–600 (2011).

  9. Leiserson, C.E.: The Cilk++ concurrency platform. J. Supercomput. 51(3), 244–257 (2010)

    CrossRef  Google Scholar 

  10. Munera, A., Royuela, S., Quinones, E.: Towards a qualifiable OpenMP framework for embedded systems. In: Proceedings of the 2020 Design, Automation and Test in Europe Conference and Exhibition, DATE 2020, no. 2, pp. 903–908 (2020)

    Google Scholar 

  11. Nvidia: CUDA Graph programming guide (2021).

  12. Olivier, S.L., Prins, J.F.: Evaluating OpenMP 3.0 run time systems on unbalanced task graphs. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 63–78. Springer, Heidelberg (2009).

    CrossRef  Google Scholar 

  13. Perez, J.M., Beltran, V., Labarta, J., Ayguade, E.: Improving the integration of task nesting and dependencies in OpenMP. In: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017, pp. 809–818 (2017)

    Google Scholar 

  14. Sainz, F., Mateo, S., Beltran, V., Bosque, J.L., Martorell, X., Ayguadé, E.: Leveraging OmpSs to exploit hardware accelerators. In: 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pp. 112–119. IEEE (2014)

    Google Scholar 

  15. Schuchart, J., Nachtmann, M., Gracia, J.: Patterns for OpenMP task data dependency overhead measurements. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) Scaling OpenMP for Exascale Performance and Portability, pp. 156–168. Springer International Publishing, Cham (2017)

    CrossRef  Google Scholar 

  16. Serrano, M.A., Melani, A., Vargas, R., Marongiu, A., Bertogna, M., Quiñones, E.: Timing characterization of OpenMP4 tasking model. In: 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2015, pp. 157–166 (2015)

    Google Scholar 

  17. Stpiczyński, P.: Language-based vectorization and parallelization using intrinsics, openmp, tbb and cilk plus. J. Supercomput. 74(4), 1461–1472 (2018)

    CrossRef  Google Scholar 

  18. TOP500 (2020).

  19. Valero-Lara, P., Catalán, S., Martorell, X., Usui, T., Labarta, J.: sLASs: a fully automatic auto-tuned linear algebra library based on openmp extensions implemented in ompss (lass library). J. Parallel Distrib. Comput. 138, 153–171 (2020)

    CrossRef  Google Scholar 

  20. Yu, C., Royuela, S., Quiñones, E.: OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices. In: Proceedings of the 23rd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2020, pp. 42–47 (2020)

    Google Scholar 

Download references


This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.

Author information

Authors and Affiliations


Corresponding authors

Correspondence to Chenle Yu , Sara Royuela or Eduardo Quiñones .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Yu, C., Royuela, S., Quiñones, E. (2021). Enhancing OpenMP Tasking Model: Performance and Portability. In: McIntosh-Smith, S., de Supinski, B.R., Klinkenberg, J. (eds) OpenMP: Enabling Massive Node-Level Parallelism. IWOMP 2021. Lecture Notes in Computer Science(), vol 12870. Springer, Cham.

Download citation

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85261-0

  • Online ISBN: 978-3-030-85262-7

  • eBook Packages: Computer ScienceComputer Science (R0)