
Efficient Communication/Computation Overlap with MPI+OpenMP Runtimes Collaboration

  • Marc Sergent
  • Mario Dagrada
  • Patrick Carribault
  • Julien Jaeger
  • Marc Pérache
  • Guillaume Papauré
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11014)

Abstract

Overlapping network communications with computation is a major requirement for ensuring the scalability of HPC applications on future exascale machines. To this end, the de facto MPI standard provides non-blocking routines for asynchronous communication progress. In several implementations, a dedicated progress thread (PT) is deployed on the host CPU to actually achieve this overlap. However, current PT solutions struggle to balance efficient detection of network events against minimal impact on the application's computations. In this paper we propose a solution, inspired by the PT approach, which exploits the idle time of compute threads to make MPI communications progress in the background. We implement our idea in the context of MPI+OpenMP collaboration using the OpenMP Tools (OMPT) interface, which will be part of the OpenMP 5.0 standard. Our solution shows an overall performance gain on unbalanced workloads such as the AMG CORAL benchmark.
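The sketch below is a minimal, self-contained illustration of the kind of communication/computation overlap the paper targets, not the authors' implementation: a non-blocking MPI exchange is posted, OpenMP threads compute, and one thread polls MPI_Testall so the MPI library can progress the transfer in the background. The buffer size, neighbour choice and compute kernel are placeholders, and the explicit polling loop stands in for the progress-thread baseline; the paper's contribution is instead to trigger such progress calls from idle OpenMP threads via OMPT callbacks, so that no core has to be reserved for communication progress.

```c
/* Hypothetical overlap pattern (baseline sketch, not the paper's runtime):
 * one thread polls the MPI requests while the others compute. */
#include <mpi.h>
#include <omp.h>

#define N 1000000                       /* placeholder message/work size */

static double sendbuf[N], recvbuf[N], work[N];

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = (rank + 1) % size;       /* placeholder neighbour */
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            /* "Progress" polling: repeatedly test the requests so the MPI
             * library can advance the transfer while other threads compute. */
            int done = 0;
            while (!done)
                MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
        }

        /* Computation that does not depend on the incoming data. */
        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < N; i++)
            work[i] = work[i] * 2.0 + 1.0;
    }

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Finalize();
    return 0;
}
```

Dedicating a thread (or, as here, a busy-polling loop) to progression is exactly the overhead the paper seeks to avoid: by hooking into the OpenMP runtime, threads that become idle can perform this MPI polling opportunistically instead.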

Keywords

Parallel computing · Distributed computing · Runtime systems · Runtime collaboration


Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Marc Sergent (1)
  • Mario Dagrada (1)
  • Patrick Carribault (2)
  • Julien Jaeger (2)
  • Marc Pérache (2)
  • Guillaume Papauré (1)
  1. Atos Bull Technologies, Echirolles, France
  2. CEA, DAM, DIF, Arpajon, France