
Fine-Grained MPI+OpenMP Plasma Simulations: Communication Overlap with Dependent Tasks

  • Jérôme Richard
  • Guillaume Latu
  • Julien Bigot
  • Thierry Gautier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)

Abstract

This paper demonstrates how OpenMP 4.5 tasks can be used to efficiently overlap computations and MPI communications, based on a case study conducted on multi-core and many-core architectures. It focuses on task granularity, dependencies and priorities, and also identifies some limitations of OpenMP. Results on 64 Skylake nodes show that even though 64% of the wall-clock time is spent in MPI communications, 60% of the cores are kept busy with computation, which is a good result. Indeed, the chosen dataset is small enough to be a challenging case in terms of overlap, and is thus useful to assess worst-case scenarios for future simulations.
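To illustrate the overlap pattern described above, the following C sketch shows how OpenMP 4.5 dependent tasks can wrap non-blocking MPI calls so that interior computation proceeds while a halo exchange is in flight. This is a minimal, hypothetical example, not the paper's code: the kernels compute_interior and compute_border, the buffer names and the halo layout are assumptions. The MPI library must be initialized with at least MPI_THREAD_SERIALIZED, since a task (possibly running on a non-master thread) issues MPI calls.

    #include <mpi.h>

    /* Hypothetical computation kernels (declarations only, for the sketch). */
    void compute_interior(double *field, int n, int halo);
    void compute_border(double *field, const double *halo_recv, int n, int halo);

    void step(double *field, double *halo_send, double *halo_recv,
              int n, int halo, int left, int right, MPI_Comm comm)
    {
        #pragma omp parallel
        #pragma omp single
        {
            /* Communication task: reads the send buffer, produces the recv buffer. */
            #pragma omp task depend(in: halo_send[0:halo]) depend(out: halo_recv[0:halo])
            {
                MPI_Request req[2];
                MPI_Isend(halo_send, halo, MPI_DOUBLE, left,  0, comm, &req[0]);
                MPI_Irecv(halo_recv, halo, MPI_DOUBLE, right, 0, comm, &req[1]);
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            }

            /* Interior computation does not touch the halo buffers, so the
             * runtime may execute it concurrently with the communication task. */
            #pragma omp task depend(inout: field[halo:n-2*halo])
            compute_interior(field, n, halo);

            /* Border computation waits for the received halo data. */
            #pragma omp task depend(in: halo_recv[0:halo]) depend(inout: field[0:halo])
            compute_border(field, halo_recv, n, halo);
        } /* implicit barrier: all tasks complete before returning */
    }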

Two key features were identified: using task priorities improved performance by 5.7% (mainly thanks to better overlap), and using recursive tasks shortened the execution time by 9.7%. We also illustrate the need for task tracing and task visualization tools: they enabled a fine-grained understanding of the execution and contributed to the performance gains of this task-based OpenMP+MPI code.
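The two features mentioned above can be expressed schematically as follows; this is a hypothetical C sketch rather than the paper's implementation (process_chunk and the grain size are assumptions). The priority clause hints to the runtime that the communication task should be scheduled early (it only has an effect if OMP_MAX_TASK_PRIORITY is set above zero), while recursive task creation keeps task granularity small enough for the scheduler to fill cores with computation while communication is pending.

    void process_chunk(double *a, int n);   /* hypothetical computation kernel */

    /* Recursive tasks: split the work until it reaches the grain size, so many
     * small tasks are available to keep cores busy during communication. */
    void compute_recursive(double *a, int n, int grain)
    {
        if (n <= grain) {
            process_chunk(a, n);
            return;
        }
        #pragma omp task
        compute_recursive(a, n / 2, grain);
        #pragma omp task
        compute_recursive(a + n / 2, n - n / 2, grain);
        #pragma omp taskwait   /* wait for the two child tasks */
    }

    void timestep(double *a, int n)
    {
        #pragma omp parallel
        #pragma omp single
        {
            /* High-priority communication task: the scheduler tries to run it
             * before the default-priority computation tasks created below. */
            #pragma omp task priority(10)
            {
                /* non-blocking MPI calls and the matching wait would go here */
            }

            compute_recursive(a, n, 4096);
        }
    }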

Keywords

Dependent tasks · OpenMP 4.5 · MPI · Many-core


Acknowledgments

This work was supported by the EoCoE and EoCoE2 projects, grant agreement numbers 676629 & 824158, funded within the EU’s H2020 program. We also acknowledge CEA for the support provided by Programme Transversal de Compétences – Simulation Numérique.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. CEA/IRFM, St-Paul-lez-Durance, France
  2. Zébrys, Toulouse, France
  3. Maison de la Simulation, CEA, CNRS, Univ. Paris-Sud, UVSQ, Université Paris-Saclay, Gif-sur-Yvette, France
  4. Univ. Lyon, Inria, CNRS, ENS de Lyon, Univ. Claude-Bernard Lyon 1, LIP, Lyon, France
