Abstract
The Message Passing Interface (MPI-3) provides a one-sided communication interface, also known as MPI Remote Memory Access (RMA), which enables one process to specify all required communication parameters for both the sending and receiving side. While this communication interface enables superior performance potential developers have to deal with a complex memory consistency model. Proper synchronization of asynchronous remote memory accesses to shared data structures is a challenging task. More importantly, it is difficult to pinpoint such synchronization bugs as they do not necessarily manifest in an error or occur for example only after porting the application to a different HPC environment.
We introduce a debugging tool to support the detection of latent synchronization bugs. Based on the semantic flexibility of the MPI-3 specification we dynamically modify executions of improperly synchronized MPI remote memory accesses to force a manifestation of an error. An experimental evaluation with small applications and the usage in a library which heavily relies on MPI RMA reveal that this approach can uncover synchronization bugs which would otherwise likely go unnoticed.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Chen, Z., Dinan, J., Tang, Z., Balaji, P., Zhong, H., Wei, J., Huang, T., Qin, F.: MC-Checker: detecting memory consistency errors in MPI one-sided applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 499–510. IEEE Press, Piscataway (2014)
Dan, A.M., Lam, P., Hoefler, T., Vechev, M.: Modeling and analysis of remote memory access programming. In: Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, Amsterdam, pp. 129–144 (2016)
Faanes, G., Bataineh, A., Roweth, D., Court, T., Froese, E., Alverson, B., Johnson, T., Kopnick, J., Higgins, M., Reinhard, J.: Cray cascade: a scalable HPC system based on a dragonfly network. In: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–9. IEEE, Washington, DC (2012)
Fürlinger, K., Fuchs, T., Kowalewski, R.: DASH: a C++ PGAS library for distributed data structures and parallel algorithms. In: Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications HPCC (2016)
Gropp, W., Thakur, R.: An evaluation of implementation options for MPI one-sided communication. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 415–424. Springer, Berlin (2005)
Hermanns, M.A., Miklosch, M., Böhme, D., Wolf, F.: Understanding the formation of wait states in applications with one-sided communication. In: Proceedings of the 20th European MPI Users’ Group Meeting, pp. 73–78. ACM, New York (2013)
Hilbrich, T., Protze, J., Schulz, M., de Supinski, B.R., Müller, M.S.: MPI runtime error detection with MUST: advances in deadlock detection. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp. 30:1–30:11. IEEE Computer Society Press, Los Alamitos, CA (2012)
Hoefler, T., Dinan, J., Thakur, R., Barrett, B., Balaji, P., Gropp, W., Underwood, K.: Remote memory access programming in MPI-3. ACM Trans. Parallel Comput. 2 (2), 9:1–9:26 (2015). doi:10.1145/2780584
Infiniband Trade Association: InfiniBand Architecture Specification Volume 2. https://cw.infinibandta.org/document/dl/7155 (2006)
Kowalewski, R., Fürlinger, K.: Nasty-MPI: Debugging Synchronization Errors in MPI-3 One-Sided Applications. Lecture Notes in Computer Science, pp. 51–62. Springer, Cham (2016). doi:10.1007/978-3-319-43659-3_4. http://dx.doi.org/10.1007/978-3-319-43659-3_4
Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21 (7), 558–565 (1978). doi:10.1145/359545.359563
Leibniz Supercomputing Centre, Munich, Germany: SuperMUC Petascale System. https://www.lrz.de/services/compute/supermuc/systemdescription/. Last accessed 2016
Luecke, G.R., Spanoyannis, S., Kraeva, M.: The performance and scalability of SHMEM and MPI-2 one-sided routines on a SGI origin 2000 and a Cray T3E-600: performances. Concurr. Comput. Pract. Exper. 16 (10), 1037–1060 (2004). doi:10.1002/cpe.v16:10
Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9 (1), 21–65 (1991). doi:10.1145/103727.103729
MPI Forum: MPI: A Message-Passing Interface Standard. Version 3.0 (2012). Available at: http://www.mpi-forum.org
National Energy Research Center, United States: Edison System Configuration. https://www.nersc.gov/users/computational-systems/edison/configuration/. Last accessed 2016
Park, C.S., Sen, K., Hargrove, P., Iancu, C.: Efficient data race detection for distributed memory parallel programs. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pp. 51:1–51:12. ACM, New York (2011). doi:10.1145/2063384.2063452
Pervez, S., Gopalakrishnan, G., Kirby, R., Thakur, R., Gropp, W.: Formal verification of programs that use MPI one-sided communication. In: Mohr, B., Träff, J., Worringen, J., Dongarra, J. (eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface. Lecture Notes in Computer Science, vol. 4192, pp. 30–39. Springer, Berlin/ Heidelberg (2006). doi:10.1007/11846802_13
Acknowledgements
We gratefully acknowledge funding by the German Research Foundation (DFG) through the German Priority Programme 1648 Software for Exascale Computing (SPPEXA). We further want to inform that this work is an extended revision from an originally published paper [10].
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kowalewski, R., Fürlinger, K. (2017). Debugging Latent Synchronization Errors in MPI-3 One-Sided Communication. In: Niethammer, C., Gracia, J., Hilbrich, T., Knüpfer, A., Resch, M., Nagel, W. (eds) Tools for High Performance Computing 2016. Springer, Cham. https://doi.org/10.1007/978-3-319-56702-0_5
Download citation
DOI: https://doi.org/10.1007/978-3-319-56702-0_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56701-3
Online ISBN: 978-3-319-56702-0
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)