Asynchronous Checkpointing by Dedicated Checkpoint Threads

  • Faisal Shahzad
  • Markus Wittmann
  • Thomas Zeiser
  • Gerhard Wellein
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7490)

Abstract

Checkpoint/restart (C/R) is a classical approach to introducing fault tolerance in large HPC applications. Although it is relatively easy compared with other fault tolerance approaches, its overhead hinders its wide use. We present an application-level checkpointing technique that significantly reduces the checkpoint overhead. The checkpoint I/O is overlapped with the application's computation by a two-stage checkpointing mechanism in which dedicated threads perform the I/O.
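The abstract does not spell out the two stages, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that stage one takes a fast in-memory snapshot of the application state and that stage two lets a dedicated thread write that snapshot to disk while the computation continues. The class `AsyncCheckpointer`, the functions `run` and `restart`, the file naming scheme, and the use of `pickle` are all hypothetical choices made for illustration.

```python
import os
import pickle
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: stage 1 snapshots in memory, stage 2 writes in a thread."""

    def __init__(self, directory):
        self.directory = directory
        # Assumption: at most one snapshot may wait for I/O at a time.
        self._pending = queue.Queue(maxsize=1)
        self._thread = threading.Thread(target=self._io_loop, daemon=True)
        self._thread.start()

    def checkpoint(self, step, state):
        # Stage 1 (synchronous, short): copy the state into an in-memory buffer.
        snapshot = pickle.dumps(state)
        # Blocks only if the previous snapshot has not yet been picked up.
        self._pending.put((step, snapshot))

    def _io_loop(self):
        # Stage 2 (asynchronous): the dedicated thread performs the file I/O.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            step, snapshot = item
            final = os.path.join(self.directory, f"ckpt_{step:06d}.pkl")
            tmp = final + ".tmp"
            with open(tmp, "wb") as f:
                f.write(snapshot)
                f.flush()
                os.fsync(f.fileno())
            # Rename last so a crash never leaves a partial file under the final name.
            os.replace(tmp, final)

    def close(self):
        # Wait until every queued snapshot has been written.
        self._pending.put(None)
        self._thread.join()


def run(steps, every, directory):
    """Toy compute loop that checkpoints every `every` steps."""
    state = {"step": 0, "values": [0.0] * 8}
    ckpt = AsyncCheckpointer(directory)
    for step in range(1, steps + 1):
        state["values"] = [v + 1.0 for v in state["values"]]  # stand-in for real work
        state["step"] = step
        if step % every == 0:
            ckpt.checkpoint(step, state)
    ckpt.close()
    return state


def restart(directory):
    """Load the most recent complete checkpoint."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".pkl"))
    with open(os.path.join(directory, names[-1]), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        final_state = run(steps=10, every=5, directory=d)
        print(restart(d)["step"], final_state["step"])
```

In this sketch the compute loop pays only for the in-memory copy, and it stalls only when a new checkpoint is requested before the I/O thread has picked up the previous one. Writing to a temporary file and renaming it afterwards is a design choice of the sketch, so that a restart only ever sees complete checkpoint files.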

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Faisal Shahzad (1)
  • Markus Wittmann (1)
  • Thomas Zeiser (1)
  • Gerhard Wellein (2)

  1. Erlangen Regional Computing Center, University of Erlangen-Nuremberg, Germany
  2. Department of Computer Science, University of Erlangen-Nuremberg, Germany
