
Resilient MPI applications using an application-level checkpointing framework and ULFM

The Journal of Supercomputing

Abstract

Future exascale systems, comprising millions of cores, will present high failure rates, and long-running applications will need new fault tolerance techniques to ensure successful completion. The Fault Tolerance Working Group within the MPI Forum has presented the User Level Failure Mitigation (ULFM) proposal, which provides new functionality for implementing resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit this new ULFM functionality. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, a multithreaded multilevel checkpointing scheme, in which checkpoint files are saved to different memory levels, improves the scalability of the solution. The experimental evaluation shows low overhead when tolerating failures in one or several MPI processes.
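
The implementation itself is not included in this abstract. As a rough illustration of the kind of ULFM functionality the abstract refers to, the following minimal sketch shows how an MPI application can detect a process failure, revoke and shrink its communicator, and resume from a restart point where an application-level checkpoint would be read back. It assumes an MPI library built with the ULFM extensions (for example, ULFM Open MPI), which exposes MPIX_Comm_revoke, MPIX_Comm_shrink and the MPIX_ERR_PROC_FAILED / MPIX_ERR_REVOKED error classes through <mpi-ext.h>; it is not the authors' CPPC instrumentation.

/* Minimal ULFM failure-handling sketch (illustrative only, not the CPPC
 * framework described in the article). Requires an MPI library with the
 * ULFM extensions, e.g. ULFM Open MPI, which provides MPIX_Comm_revoke,
 * MPIX_Comm_shrink and the MPIX_ERR_PROC_FAILED / MPIX_ERR_REVOKED error
 * classes through <mpi-ext.h>. */
#include <mpi.h>
#include <mpi-ext.h>
#include <setjmp.h>
#include <stdio.h>

static MPI_Comm world;          /* communicator the application works on */
static jmp_buf restart_point;   /* where survivors resume after recovery */

/* Error handler attached to 'world': on a process failure, make the
 * failure globally known, build a survivors-only communicator and jump
 * back to the restart point, where a checkpoint would be restored. */
static void failure_handler(MPI_Comm *comm, int *errcode, ...)
{
    int eclass;
    MPI_Error_class(*errcode, &eclass);
    if (eclass != MPIX_ERR_PROC_FAILED && eclass != MPIX_ERR_REVOKED)
        MPI_Abort(*comm, *errcode);       /* not a process failure: give up */

    MPIX_Comm_revoke(world);              /* ensure all survivors see the failure */

    MPI_Comm survivors;
    MPIX_Comm_shrink(world, &survivors);  /* exclude the failed processes */
    world = survivors;                    /* keep working on the shrunken comm */

    longjmp(restart_point, 1);            /* resume execution at the restart point */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so failures on
     * 'world' are reported to failure_handler instead of aborting. */
    MPI_Errhandler errh;
    MPI_Comm_create_errhandler(failure_handler, &errh);
    MPI_Comm_set_errhandler(world, errh);

    if (setjmp(restart_point) != 0) {
        /* Recovery path: an application-level checkpointing layer (CPPC
         * in the article) would restore the saved state here. */
        int rank;
        MPI_Comm_rank(world, &rank);
        if (rank == 0) printf("recovered on a shrunken communicator\n");
    }

    /* ... application work using 'world'; any communication failure on it
     * is routed through failure_handler ... */

    MPI_Finalize();
    return 0;
}

The article's contribution automates this kind of detection and recovery within CPPC, instrumenting the application code transparently and combining it with the multithreaded multilevel checkpointing mentioned in the abstract; the sketch above only shows the bare ULFM mechanics it builds on.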



Acknowledgments

This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P, CAPAP-H5 network TIN2014-53522-REDT, and the predoctoral Grant of Nuria Losada ref. BES-2014-068066) and by the Galician Government (Xunta de Galicia) and FEDER funds of the EU under the Consolidation Program of Competitive Research (ref. GRC2013/055).

Author information


Corresponding author

Correspondence to Nuria Losada.


About this article


Cite this article

Losada, N., Cores, I., Martín, M.J. et al. Resilient MPI applications using an application-level checkpointing framework and ULFM. J Supercomput 73, 100–113 (2017). https://doi.org/10.1007/s11227-016-1629-7

