Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications

The Journal of Supercomputing

Abstract

The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support, and failures have traditionally been addressed with stop-and-restart checkpointing solutions. The User Level Failure Mitigation (ULFM) proposal for including resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilient proposal. Both approaches are built on top of the ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and both transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) programs. The evaluation focuses on the scalability of the two solutions, comparing both proposals using up to 3072 cores.
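The difference between the two approaches can be illustrated with the general ULFM recovery pattern: instead of aborting the job and relaunching every process from a checkpoint file, the surviving processes install a communicator error handler, revoke and shrink the failed communicator, and roll back from checkpoint data without leaving the MPI runtime. The sketch below is illustrative only, not the CPPC-based implementation evaluated in this article, and it assumes an ULFM-enabled MPI build that provides the `MPIX_*` extensions (it will not compile against a standard MPI-3 library):

```c
/* Sketch of ULFM-style resilient recovery (assumes an ULFM-enabled MPI;
 * the MPIX_* calls are ULFM extensions, declared in <mpi-ext.h>). */
#include <mpi.h>
#include <mpi-ext.h>

static MPI_Comm world;  /* communicator the application actually uses */

/* Invoked when a peer fails (MPI_ERR_PROC_FAILED): revoke the
 * communicator so all survivors observe the failure, shrink it to
 * exclude the dead processes, and restore application state. */
static void handle_failure(MPI_Comm *comm, int *err, ...)
{
    MPI_Comm shrunk;
    MPIX_Comm_revoke(*comm);           /* propagate failure knowledge */
    MPIX_Comm_shrink(*comm, &shrunk);  /* survivors build a new comm  */
    world = shrunk;
    /* ...roll back to the latest application-level checkpoint here... */
}

int main(int argc, char **argv)
{
    MPI_Errhandler eh;
    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world);
    MPI_Comm_create_errhandler(handle_failure, &eh);
    MPI_Comm_set_errhandler(world, eh);  /* don't abort on process failure */
    /* ... SPMD computation with periodic checkpoints on `world` ... */
    MPI_Finalize();
    return 0;
}
```

A stop-and-restart solution, by contrast, leaves the default `MPI_ERRORS_ARE_FATAL` handler in place: any failure terminates the whole job, which must then be resubmitted and re-read its checkpoint files from stable storage.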



Acknowledgments

This work has been supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P and predoctoral Grant of Nuria Losada ref. BES-2014-068066) and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research Units, cofunded by FEDER funds (Ref. GRC2013/055). We gratefully thank CESGA for providing access to the FinisTerrae-II supercomputer.

Author information

Correspondence to Nuria Losada.


Cite this article

Losada, N., Martín, M.J. & González, P. Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications. J Supercomput 73, 316–329 (2017). https://doi.org/10.1007/s11227-016-1863-z
