User Level Failure Mitigation in MPI

  • Wesley Bland
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7640)


In a constant effort to deliver steady performance improvements, the size of High Performance Computing (HPC) systems, as observed by the Top 500 ranking1, has grown tremendously over the last decade. This trend, along with the resultant decrease of the Mean Time Between Failure (MTBF), is unlikely to stop; thereby many computing nodes will inevitably fail during application execution [5]. It is alarming that most popular fault tolerant approaches see their efficiency plummet at Exascale [3,4], calling for more efficient approaches evolving around application centric failure mitigation strategies [7].


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Bland, W., Bosilca, G., Bouteiller, A., Herault, T., Dongarra, J.: A proposal for User-Level Failure Mitigation in the MPI-3 standard. Tech. rep., Department of Electrical Engineering and Computer Science, University of Tennessee (2012)Google Scholar
  2. 2.
    Bland, W., Bouteiller, A., Herault, T., Hursey, J., Bosilca, G., Dongarra, J.J.: An Evaluation of User-Level Failure Mitigation Support in MPI. In: Träff, J.L., Benkner, S., Dongarra, J.J. (eds.) EuroMPI 2012. LNCS, vol. 7490, pp. 193–203. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  3. 3.
    Bosilca, G., Bouteiller, A., Brunet, É., Cappello, F., Dongarra, J., Guermouche, A., Hérault, T., Robert, Y., Vivien, F., Zaidouni, D.: Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Tech. Rep. RR-7950, INRIA (2012)Google Scholar
  4. 4.
    Bougeret, M., Casanova, H., Robert, Y., Vivien, F., Zaidouni, D.: Using group replication for resilience on exascale systems. Tech. Rep. 265, LAWNs (2012)Google Scholar
  5. 5.
    Cappello, F., Geist, A., Gropp, B., Kalé, L.V., Kramer, B., Snir, M.: Toward exascale resilience. IJHPCA 23(4), 374–388 (2009)Google Scholar
  6. 6.
    Gropp, W., Lusk, E.: Fault tolerance in Message Passing Interface programs. IJHPCA 18, 363–372 (2004)Google Scholar
  7. 7.
    Huang, K., Abraham, J.: Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers 100(6), 518–528 (1984)CrossRefGoogle Scholar
  8. 8.
    Hursey, J., Naughton, T., Vallee, G., Graham, R.L.: A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp. 255–263. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  9. 9.
    Mohan, C., Lindsay, B.: Efficient commit protocols for the tree of processes model of distributed transactions. In: SIGOPS OSR, vol. 19, pp. 40–52. ACM (1985)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Wesley Bland
    • 1
  1. 1.Innovative Computing LaboratoryUniversity of TennesseeUSA

Personalised recommendations