
An evaluation of User-Level Failure Mitigation support in MPI

Computing

Abstract

As computing platforms grow to extreme scale, applications face increasingly stringent fault-tolerance requirements. Techniques that address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support they are bound to fail. This paper discusses the failure-free overhead and the recovery impact of the User-Level Failure Mitigation (ULFM) proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications and yields satisfactory recovery times when failures occur.




Author information

Corresponding author

Correspondence to George Bosilca.


About this article

Cite this article

Bland, W., Bouteiller, A., Herault, T. et al. An evaluation of User-Level Failure Mitigation support in MPI. Computing 95, 1171–1184 (2013). https://doi.org/10.1007/s00607-013-0331-3


  • DOI: https://doi.org/10.1007/s00607-013-0331-3
