An Evaluation of User-Level Failure Mitigation Support in MPI

  • Wesley Bland
  • Aurelien Bouteiller
  • Thomas Herault
  • Joshua Hursey
  • George Bosilca
  • Jack J. Dongarra
Conference paper

DOI: 10.1007/978-3-642-33518-1_24

Part of the Lecture Notes in Computer Science book series (LNCS, volume 7490)
Cite this paper as:
Bland W., Bouteiller A., Herault T., Hursey J., Bosilca G., Dongarra J.J. (2012) An Evaluation of User-Level Failure Mitigation Support in MPI. In: Träff J.L., Benkner S., Dongarra J.J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2012. Lecture Notes in Computer Science, vol 7490. Springer, Berlin, Heidelberg

Abstract

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact aspects of the User-Level Failure Mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Wesley Bland
    • 1
  • Aurelien Bouteiller
    • 1
  • Thomas Herault
    • 1
  • Joshua Hursey
    • 2
  • George Bosilca
    • 1
  • Jack J. Dongarra
    • 1
  1. Innovative Computing Laboratory, University of Tennessee, USA
  2. Oak Ridge National Laboratory, USA
