Abstract
The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with stop-and-restart checkpointing solutions. The User Level Failure Mitigation (ULFM) proposal to include resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and both make it possible to transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) applications. The evaluation focuses on the scalability of the two solutions, comparing both proposals using up to 3072 cores.
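The distinction between the two approaches can be illustrated with a toy model. The sketch below is not CPPC or ULFM code and makes no MPI calls. It assumes, for illustration only, that both strategies roll back to the most recent checkpoint after a failure and differ only in whether the whole job is aborted and relaunched. The function name `simulate` and all of its parameters are hypothetical.

```python
def simulate(total_iters, ckpt_every, fail_at, resilient):
    """Toy model of one run with injected failures.

    total_iters: iterations the application must complete
    ckpt_every:  a checkpoint is taken every `ckpt_every` iterations
    fail_at:     iteration indices at which a failure is injected (each fires once)
    resilient:   True  -> recover in place, the job keeps running
                 False -> stop-and-restart, the job is aborted and relaunched

    Returns (iterations_executed, job_launches).
    """
    pending = sorted(fail_at)
    launches = 1        # the initial job submission
    executed = 0        # iterations actually computed, including redone work
    last_ckpt = 0       # iteration index saved by the latest checkpoint
    i = 0
    while i < total_iters:
        if pending and i == pending[0]:
            pending.pop(0)
            if not resilient:
                launches += 1   # assumption: the whole job is resubmitted
            i = last_ckpt       # assumption: both strategies roll back
            continue
        i += 1
        executed += 1
        if i % ckpt_every == 0:
            last_ckpt = i
    return executed, launches


if __name__ == "__main__":
    for resilient in (False, True):
        label = "resilient" if resilient else "stop-and-restart"
        print(label, simulate(100, 10, [35], resilient))
```

In this model the recomputed work is identical for both strategies, and only the number of job launches changes. It does not capture the real costs that a performance comparison measures, such as failure detection, communicator reconstruction, or job relaunch time.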
Acknowledgments
This work has been supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P and predoctoral Grant of Nuria Losada ref. BES-2014-068066) and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research Units, cofunded by FEDER funds (Ref. GRC2013/055). We gratefully thank CESGA for providing access to the FinisTerrae-II supercomputer.
Cite this article
Losada, N., Martín, M.J. & González, P. Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications. J Supercomput 73, 316–329 (2017). https://doi.org/10.1007/s11227-016-1863-z
DOI: https://doi.org/10.1007/s11227-016-1863-z