Automatic Partitioning of MPI Operations in MPI+OpenMP Applications

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2021)

Abstract

The new MPI 4.0 standard includes a new chapter on partitioned point-to-point communication operations. These partitioned operations allow multiple actors of one MPI process (e.g., multiple threads) to contribute data to a single communication operation. They are designed to mitigate current problems in multithreaded MPI programs, and prior work suggests a substantial performance benefit (up to 26%) compared to the existing non-blocking counterparts.
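To make the semantics concrete, the following is a minimal, hand-written sketch (not taken from the paper) of how a partitioned send is intended to be used from OpenMP threads; the partition count, buffer layout, and the fill_partition helper are illustrative assumptions.

```c
/* Sketch of an MPI 4.0 partitioned send: each OpenMP thread marks its
 * partition as ready as soon as its data is produced, instead of all threads
 * synchronizing before a single MPI_Isend. Partition count, buffer size and
 * fill_partition are assumptions made for illustration. */
#include <mpi.h>
#include <omp.h>

#define PARTITIONS 8
#define COUNT_PER_PARTITION 1024

/* hypothetical helper that produces the data of one partition */
void fill_partition(double *partition_start, int count, int partition_id);

void partitioned_send(double *buf, int dest, MPI_Comm comm) {
    MPI_Request req;

    /* One logical send, split into PARTITIONS partitions of
     * COUNT_PER_PARTITION doubles each. */
    MPI_Psend_init(buf, PARTITIONS, COUNT_PER_PARTITION, MPI_DOUBLE,
                   dest, /* tag */ 0, comm, MPI_INFO_NULL, &req);
    MPI_Start(&req);

    #pragma omp parallel for
    for (int p = 0; p < PARTITIONS; ++p) {
        fill_partition(buf + p * COUNT_PER_PARTITION, COUNT_PER_PARTITION, p);
        MPI_Pready(p, req);   /* partition p may be transferred now */
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the whole operation */
    MPI_Request_free(&req);              /* release the persistent request */
}
```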

In this work, we explore the possibility for the compiler to automatically partition sending operations across multiple OpenMP threads. For this purpose, we developed an LLVM compiler pass that partitions MPI sending operations across the different iterations of OpenMP for loops. We demonstrate the feasibility of this approach by applying it to 2D stencil codes, observing very little overhead while the correctness of the codes is preserved. This approach therefore facilitates the use of these new additions to the MPI standard in existing codes.
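Since the pass itself operates on LLVM IR (see note 2 below), the following hand-written, source-level sketch only illustrates the kind of effect such a transformation has on a stencil-style send; the chunk-to-partition mapping and the compute_row_value helper are assumptions made for illustration, not the pass's actual output.

```c
#include <mpi.h>
#include <omp.h>

double compute_row_value(const double *grid, int i);   /* hypothetical stencil update */

/* Original: the boundary row is only sent after the whole parallel loop
 * (in a real code, other work could be placed between Isend and Wait). */
void send_boundary_original(const double *grid, double *row, int n,
                            int dest, MPI_Comm comm) {
    MPI_Request req;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        row[i] = compute_row_value(grid, i);
    MPI_Isend(row, n, MPI_DOUBLE, dest, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

/* Partitioned: the send is started before the loop, and every chunk of
 * iterations marks its partition ready as soon as it has been computed. */
void send_boundary_partitioned(const double *grid, double *row, int n,
                               int dest, MPI_Comm comm) {
    const int partitions = omp_get_max_threads();
    const int chunk = n / partitions;      /* assume n is divisible */
    MPI_Request req;

    MPI_Psend_init(row, partitions, chunk, MPI_DOUBLE, dest, 0, comm,
                   MPI_INFO_NULL, &req);
    MPI_Start(&req);

    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < n; ++i) {
        row[i] = compute_row_value(grid, i);
        if ((i + 1) % chunk == 0)          /* last iteration of this chunk */
            MPI_Pready(i / chunk, req);
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Request_free(&req);
}
```

Overlapping the transfer of already-computed partitions with the computation of the remaining iterations is what the partitioned operations enable here.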

Our code is available on GitHub: https://github.com/tudasc/CommPart.


Notes

  1. Meaning that modifying this partition of the sending operation is forbidden until the sending operation has completed locally.

  2. Note that this only illustrates the transformation; as the transformation happens on the LLVM IR, no source code is output.

  3. For example, a type created by MPI_Type_contiguous (a minimal sketch follows after these notes).

  4. This is a valid implementation according to the MPI standard.

  5. For a receive operation, though, both reading and writing are forbidden.

  6. False positives in the MPI implementation or the application itself are not filtered.

  7. On the Lichtenberg cluster equipped with Intel Xeon Platinum 9242 CPUs, the execution of the unaltered version compiled with clang 11.1 took 614 s on average, while the execution of the automatically partitioned version took 619 s.
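As a minimal sketch of the kind of derived datatype mentioned in note 3 (the element count is an illustrative assumption, not taken from the paper):

```c
#include <mpi.h>

/* Build and commit a contiguous derived datatype describing four
 * consecutive doubles. */
void make_contiguous_type(MPI_Datatype *newtype) {
    MPI_Type_contiguous(4, MPI_DOUBLE, newtype);
    MPI_Type_commit(newtype);   /* must be committed before use in communication */
}
```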

References

  1. Ahmed, H., Skjellum, A., Bangalore, P., Pirkelbauer, P.: Transforming blocking MPI collectives to non-blocking and persistent operations. In: Proceedings of the 24th European MPI Users’ Group Meeting, pp. 1–11 (2017)

  2. Danalis, A., Pollock, L., Swany, M.: Automatic MPI application transformation with ASPhALT. In: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1–8. IEEE (2007)

  3. Danalis, A., Pollock, L., Swany, M., Cavazos, J.: MPI-aware compiler optimizations for improving communication-computation overlap. In: Proceedings of the 23rd International Conference on Supercomputing, pp. 316–325 (2009)

  4. Grant, R., Skjellum, A., Bangalore, P.V.: Lightweight threading with MPI using Persistent Communications Semantics. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States) (2015)

  5. Grant, R.E., Dosanjh, M.G.F., Levenhagen, M.J., Brightwell, R., Skjellum, A.: Finepoints: partitioned multithreaded MPI communication. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) ISC High Performance 2019. LNCS, vol. 11501, pp. 330–350. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20656-7_17

  6. Guo, J., Yi, Q., Meng, J., Zhang, J., Balaji, P.: Compiler-assisted overlapping of communication and computation in MPI applications. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 60–69. IEEE (2016)

  7. Jammer, T., Iwainsky, C., Bischof, C.: Automatic detection of MPI assertions. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 34–42. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_3

  8. Laguna, I., Marshall, R., Mohror, K., Ruefenacht, M., Skjellum, A., Sultana, N.: A large-scale study of MPI usage in open-source HPC applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. ACM (2019). https://doi.org/10.1145/3295500.3356176

  9. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard Version 4.0 (2021). https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf

  10. Nguyen, V.M., Saillard, E., Jaeger, J., Barthou, D., Carribault, P.: Automatic code motion to extend MPI nonblocking overlap window. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 43–54. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_4

  11. Schonbein, W., Dosanjh, M.G.F., Grant, R.E., Bridges, P.G.: Measuring multithreaded message matching misery. In: Aldinucci, M., Padovani, L., Torquati, M. (eds.) Euro-Par 2018. LNCS, vol. 11014, pp. 480–491. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-96983-1_34

  12. Seward, J., et al.: Memcheck: a memory error detector (2020). https://valgrind.org/docs/manual/mc-manual.html

  13. Squar, J., Jammer, T., Blesel, M., Kuhn, M., Ludwig, T.: Compiler assisted source transformation of OpenMP kernels. In: 2020 19th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 44–51 (2020). https://doi.org/10.1109/ISPDC51135.2020.00016

Acknowledgements

We especially want to thank Dr. Christian Iwainsky (TU Darmstadt) for fruitful discussion. This work was supported by the Hessian Ministry for Higher Education, Research and the Arts through the Hessian Competence Center for High-Performance Computing. Measurements for this work were conducted on the Lichtenberg high performance computer of the TU Darmstadt. Some of the code analyzing the OpenMP parallel regions originated from CATO [13] (https://github.com/JSquar/cato).

Author information

Corresponding author

Correspondence to Tim Jammer.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Jammer, T., Bischof, C. (2021). Automatic Partitioning of MPI Operations in MPI+OpenMP Applications. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_12

  • DOI: https://doi.org/10.1007/978-3-030-90539-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90538-5

  • Online ISBN: 978-3-030-90539-2

  • eBook Packages: Computer Science, Computer Science (R0)
