The Journal of Supercomputing, Volume 73, Issue 1, pp 100–113

Resilient MPI applications using an application-level checkpointing framework and ULFM

  • Nuria Losada
  • Iván Cores
  • María J. Martín
  • Patricia González

Abstract

Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need new fault tolerance techniques to run to completion. The Fault Tolerance Working Group, within the MPI Forum, has presented the User Level Failure Mitigation (ULFM) proposal, which provides new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit these new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, a multithreaded multilevel checkpointing scheme, in which checkpoint files are saved at different levels of the memory hierarchy, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.
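
For context, the following is a minimal sketch, in C, of the ULFM recovery pattern that the extended CPPC framework builds on: running with MPI_ERRORS_RETURN, detecting a process failure from the error class of a communication call, revoking and shrinking the damaged communicator, and rolling back to an application-level checkpoint. It uses only the extensions defined in the ULFM proposal (MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED); the restart_from_checkpoint hook is a hypothetical placeholder, not part of CPPC or of the paper's actual implementation.

    /* Minimal ULFM recovery sketch (illustrative only, not the paper's CPPC code). */
    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_* */

    /* Hypothetical hook: roll the surviving processes back to the data saved
     * in the last application-level checkpoint file. */
    static void restart_from_checkpoint(MPI_Comm comm) { (void)comm; /* ... */ }

    static void resilient_allreduce(MPI_Comm *comm, double *buf, int count)
    {
        /* Return error codes instead of aborting, so failures can be handled. */
        MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

        int rc = MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, *comm);

        int eclass = MPI_SUCCESS;
        if (rc != MPI_SUCCESS)
            MPI_Error_class(rc, &eclass);

        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            MPI_Comm survivors;
            MPIX_Comm_revoke(*comm);              /* propagate the failure to all ranks */
            MPIX_Comm_shrink(*comm, &survivors);  /* new communicator without failed ranks */
            MPI_Comm_free(comm);
            *comm = survivors;
            restart_from_checkpoint(*comm);       /* resume from the last checkpoint */
        }
    }

ULFM deliberately leaves the recovery policy to the application; a checkpointing framework such as CPPC supplies that policy by restoring the application state from checkpoint files once the communicator has been repaired.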

Keywords

Resilience, Checkpointing, Fault Tolerance, MPI

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain