Automatic Partitioning of MPI Operations in MPI+OpenMP Applications

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2021)

Abstract

The new MPI 4.0 standard includes a new chapter on partitioned point-to-point communication operations. These partitioned operations allow multiple actors of one MPI process (e.g., multiple threads) to contribute data to a single communication operation. They are designed to mitigate current problems in multithreaded MPI programs, and prior work suggests a substantial performance benefit (up to 26%) compared to the existing non-blocking counterparts.
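To make the semantics concrete, the following is a minimal, hand-written sketch (not taken from the paper) of how a partitioned send is intended to be used from OpenMP threads; the partition count, buffer layout, and the fill_partition helper are illustrative assumptions.

```c
/* Sketch of an MPI 4.0 partitioned send: each OpenMP thread marks its
 * partition as ready as soon as its data is produced, instead of all threads
 * synchronizing before a single MPI_Isend. Partition count, buffer size and
 * fill_partition are assumptions made for illustration. */
#include <mpi.h>
#include <omp.h>

#define PARTITIONS 8
#define COUNT_PER_PARTITION 1024

/* hypothetical helper that produces the data of one partition */
void fill_partition(double *partition_start, int count, int partition_id);

void partitioned_send(double *buf, int dest, MPI_Comm comm) {
    MPI_Request req;

    /* One logical send, split into PARTITIONS partitions of
     * COUNT_PER_PARTITION doubles each. */
    MPI_Psend_init(buf, PARTITIONS, COUNT_PER_PARTITION, MPI_DOUBLE,
                   dest, /* tag */ 0, comm, MPI_INFO_NULL, &req);
    MPI_Start(&req);

    #pragma omp parallel for
    for (int p = 0; p < PARTITIONS; ++p) {
        fill_partition(buf + p * COUNT_PER_PARTITION, COUNT_PER_PARTITION, p);
        MPI_Pready(p, req);   /* partition p may be transferred now */
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the whole operation */
    MPI_Request_free(&req);              /* release the persistent request */
}
```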

In this work, we explore the possibility for the compiler to automatically partition sending operations across multiple OpenMP threads. For this purpose, we developed an LLVM compiler pass that partitions MPI sending operations across the different iterations of OpenMP for loops. We demonstrate the feasibility of this approach by applying it to 2D stencil codes, observing very little overhead while the correctness of the codes is preserved. This approach therefore facilitates the use of these new additions to the MPI standard in existing codes.
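Since the pass itself operates on LLVM IR (see note 2 below), the following hand-written, source-level sketch only illustrates the kind of effect such a transformation has on a stencil-style send; the chunk-to-partition mapping and the compute_row_value helper are assumptions made for illustration, not the pass's actual output.

```c
#include <mpi.h>
#include <omp.h>

double compute_row_value(const double *grid, int i);   /* hypothetical stencil update */

/* Original: the boundary row is only sent after the whole parallel loop
 * (in a real code, other work could be placed between Isend and Wait). */
void send_boundary_original(const double *grid, double *row, int n,
                            int dest, MPI_Comm comm) {
    MPI_Request req;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        row[i] = compute_row_value(grid, i);
    MPI_Isend(row, n, MPI_DOUBLE, dest, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

/* Partitioned: the send is started before the loop, and every chunk of
 * iterations marks its partition ready as soon as it has been computed. */
void send_boundary_partitioned(const double *grid, double *row, int n,
                               int dest, MPI_Comm comm) {
    const int partitions = omp_get_max_threads();
    const int chunk = n / partitions;      /* assume n is divisible */
    MPI_Request req;

    MPI_Psend_init(row, partitions, chunk, MPI_DOUBLE, dest, 0, comm,
                   MPI_INFO_NULL, &req);
    MPI_Start(&req);

    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < n; ++i) {
        row[i] = compute_row_value(grid, i);
        if ((i + 1) % chunk == 0)          /* last iteration of this chunk */
            MPI_Pready(i / chunk, req);
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Request_free(&req);
}
```

Overlapping the transfer of already-computed partitions with the computation of the remaining iterations is what the partitioned operations enable here.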

Our code is available on GitHub: https://github.com/tudasc/CommPart.


Notes

  1. Meaning that modifying this partition of the sending operation is forbidden until the sending operation has completed locally.

  2. Note that this only illustrates the transformation; as the transformation happens on the LLVM IR, no source code is output.

  3. For example, a type created by MPI_Type_contiguous (a minimal sketch follows after these notes).

  4. This is a valid implementation according to the MPI standard.

  5. For a receive operation, though, both reading and writing are forbidden.

  6. False positives in the MPI implementation or the application itself are not filtered.

  7. On the Lichtenberg cluster equipped with Intel Xeon Platinum 9242 CPUs, the execution of the unaltered version compiled with clang 11.1 took 614 s on average, while the execution of the automatically partitioned version took 619 s.
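As a minimal sketch of the kind of derived datatype mentioned in note 3 (the element count is an illustrative assumption, not taken from the paper):

```c
#include <mpi.h>

/* Build and commit a contiguous derived datatype describing four
 * consecutive doubles. */
void make_contiguous_type(MPI_Datatype *newtype) {
    MPI_Type_contiguous(4, MPI_DOUBLE, newtype);
    MPI_Type_commit(newtype);   /* must be committed before use in communication */
}
```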

References

  1. Ahmed, H., Skjellum, A., Bangalore, P., Pirkelbauer, P.: Transforming blocking MPI collectives to non-blocking and persistent operations. In: Proceedings of the 24th European MPI Users’ Group Meeting, pp. 1–11 (2017)

  2. Danalis, A., Pollock, L., Swany, M.: Automatic MPI application transformation with ASPhALT. In: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1–8. IEEE (2007)

  3. Danalis, A., Pollock, L., Swany, M., Cavazos, J.: MPI-aware compiler optimizations for improving communication-computation overlap. In: Proceedings of the 23rd International Conference on Supercomputing, pp. 316–325 (2009)

  4. Grant, R., Skjellum, A., Bangalore, P.V.: Lightweight threading with MPI using Persistent Communications Semantics. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States) (2015)

  5. Grant, R.E., Dosanjh, M.G.F., Levenhagen, M.J., Brightwell, R., Skjellum, A.: Finepoints: partitioned multithreaded MPI communication. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) ISC High Performance 2019. LNCS, vol. 11501, pp. 330–350. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20656-7_17

  6. Guo, J., Yi, Q., Meng, J., Zhang, J., Balaji, P.: Compiler-assisted overlapping of communication and computation in MPI applications. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 60–69. IEEE (2016)

  7. Jammer, T., Iwainsky, C., Bischof, C.: Automatic detection of MPI assertions. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 34–42. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_3

  8. Laguna, I., Marshall, R., Mohror, K., Ruefenacht, M., Skjellum, A., Sultana, N.: A large-scale study of MPI usage in open-source HPC applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. ACM (2019). https://doi.org/10.1145/3295500.3356176

  9. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard Version 4.0 (2021). https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf

  10. Nguyen, V.M., Saillard, E., Jaeger, J., Barthou, D., Carribault, P.: Automatic code motion to extend MPI nonblocking overlap window. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 43–54. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_4

  11. Schonbein, W., Dosanjh, M.G.F., Grant, R.E., Bridges, P.G.: Measuring multithreaded message matching misery. In: Aldinucci, M., Padovani, L., Torquati, M. (eds.) Euro-Par 2018. LNCS, vol. 11014, pp. 480–491. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-96983-1_34

  12. Seward, J., et al.: Memcheck: a memory error detector (2020). https://valgrind.org/docs/manual/mc-manual.html

  13. Squar, J., Jammer, T., Blesel, M., Kuhn, M., Ludwig, T.: Compiler assisted source transformation of OpenMP kernels. In: 2020 19th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 44–51 (2020). https://doi.org/10.1109/ISPDC51135.2020.00016

Acknowledgements

We especially want to thank Dr. Christian Iwainsky (TU Darmstadt) for fruitful discussion. This work was supported by the Hessian Ministry for Higher Education, Research and the Arts through the Hessian Competence Center for High-Performance Computing. Measurements for this work were conducted on the Lichtenberg high performance computer of the TU Darmstadt. Some of the code analyzing the OpenMP parallel regions originated from CATO [13] (https://github.com/JSquar/cato).

Author information

Corresponding author

Correspondence to Tim Jammer.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Jammer, T., Bischof, C. (2021). Automatic Partitioning of MPI Operations in MPI+OpenMP Applications. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_12

  • DOI: https://doi.org/10.1007/978-3-030-90539-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90538-5

  • Online ISBN: 978-3-030-90539-2

  • eBook Packages: Computer Science, Computer Science (R0)
