The Journal of Supercomputing, Volume 73, Issue 1, pp 316–329

Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications

  • Nuria Losada
  • María J. Martín
  • Patricia González


The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support, and failures have traditionally been addressed with stop-and-restart checkpointing solutions. The User Level Failure Mitigation (ULFM) proposal for including resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and both transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) applications. The evaluation focuses on the scalability of the two solutions, comparing both proposals on up to 3072 cores.
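To make the contrast concrete, the resilient approach relies on ULFM's proposed extensions, under which a communication call can report a peer's failure instead of aborting the job, and the survivors rebuild their communicator and continue. The following is a minimal sketch of that pattern, not code from the paper; the `MPIX_*` names follow the ULFM proposal and require a ULFM-enabled MPI implementation, and the `recover` helper is hypothetical.

```c
/* Sketch of ULFM-style resilience: detect a process failure via a
 * returned error code, then shrink the communicator and continue.
 * Requires a ULFM-enabled MPI; MPIX_* symbols are ULFM extensions. */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (MPIX_Comm_revoke, ...) */

/* Hypothetical recovery step: rebuild the communicator from survivors.
 * An application-level tool such as CPPC would then restore the last
 * checkpoint and resume execution from there. */
static void recover(MPI_Comm *comm)
{
    MPI_Comm shrunk;
    MPIX_Comm_revoke(*comm);           /* invalidate on all survivors */
    MPIX_Comm_shrink(*comm, &shrunk);  /* new communicator, failed ranks removed */
    MPI_Comm_free(comm);
    *comm = shrunk;
}

int main(int argc, char **argv)
{
    MPI_Comm world;
    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world);
    /* Turn fatal errors into return codes so failures can be handled. */
    MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(world);       /* any communication may report a failure */
    if (rc == MPIX_ERR_PROC_FAILED || rc == MPIX_ERR_REVOKED)
        recover(&world);

    MPI_Comm_free(&world);
    MPI_Finalize();
    return 0;
}
```

A stop-and-restart solution, by contrast, lets the job abort on failure and resubmits it, restoring state from checkpoint files at startup; the paper's evaluation quantifies the scalability gap between these two recovery models.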


Keywords: Resilience · Checkpointing · Fault tolerance · MPI



This work has been supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P and predoctoral Grant of Nuria Losada ref. BES-2014-068066) and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research Units, cofunded by FEDER funds (Ref. GRC2013/055). We gratefully thank CESGA for providing access to the FinisTerrae-II supercomputer.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain
