Chapter

Recent Advances in the Message Passing Interface

Volume 7490 of the series Lecture Notes in Computer Science, pp. 193-203

An Evaluation of User-Level Failure Mitigation Support in MPI

  • Wesley Bland (Innovative Computing Laboratory, University of Tennessee)
  • Aurelien Bouteiller (Innovative Computing Laboratory, University of Tennessee)
  • Thomas Herault (Innovative Computing Laboratory, University of Tennessee)
  • Joshua Hursey (Oak Ridge National Laboratory)
  • George Bosilca (Innovative Computing Laboratory, University of Tennessee)
  • Jack J. Dongarra (Innovative Computing Laboratory, University of Tennessee)

Abstract

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact aspects of the User-Level Failure Mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.