Compiler-Enhanced Incremental Checkpointing

  • Greg Bronevetsky
  • Daniel Marques
  • Keshav Pingali
  • Radu Rugina
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5234)

Abstract

As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures: it allows applications to periodically save their state and to restart the computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains by far the most popular approach because of its superior performance. This paper focuses on improving the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis is shown to significantly reduce checkpoint sizes (by up to 78%) and to enable asynchronous checkpointing.
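
To make the general idea of incremental checkpointing concrete, the following is a minimal sketch that saves only the memory regions written since the previous checkpoint. It is an illustration under stated assumptions, not the compiler analysis the paper describes: modifications are tracked at run time through a hypothetical `write` method, and the class name `IncrementalCheckpointer`, the region names, and the in-memory `saved` dictionary standing in for stable storage are all invented for this example.

```python
import copy


class IncrementalCheckpointer:
    """Hypothetical helper: saves only regions written since the last checkpoint."""

    def __init__(self, regions):
        # regions: dict mapping a region name to its mutable state (here, a list)
        self.regions = regions
        self.dirty = set(regions)  # nothing has been saved yet, so all regions are dirty
        self.saved = {}            # stand-in for stable storage (assumption)

    def write(self, name, index, value):
        # All updates go through this method so the region is marked as modified.
        self.regions[name][index] = value
        self.dirty.add(name)

    def checkpoint(self):
        # Copy only the dirty regions and return how many were saved.
        for name in self.dirty:
            self.saved[name] = copy.deepcopy(self.regions[name])
        count = len(self.dirty)
        self.dirty.clear()
        return count

    def restart(self):
        # Restore every region from the most recent saved copy.
        for name, data in self.saved.items():
            self.regions[name] = copy.deepcopy(data)
        self.dirty.clear()


state = {"grid": [0.0] * 4, "params": [1, 2, 3]}
ckpt = IncrementalCheckpointer(state)
first = ckpt.checkpoint()    # first checkpoint saves both regions
ckpt.write("grid", 0, 3.5)
second = ckpt.checkpoint()   # incremental checkpoint saves only "grid"
ckpt.write("grid", 1, 9.9)   # work done after the last checkpoint
ckpt.restart()               # simulated failure: roll back to the saved state
```

In this sketch the second checkpoint copies one region instead of two, which is the kind of size reduction incremental checkpointing aims for. The update made after the last checkpoint is lost on restart.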

Keywords

Execution Time; Batch Size; Soft Error; Memory Region; Checkpoint Function
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Greg Bronevetsky (1)
  • Daniel Marques (2)
  • Keshav Pingali (2)
  • Radu Rugina (3)

  1. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, USA
  2. Department of Computer Sciences, The University of Texas at Austin, Austin, USA
  3. Department of Computer Science, Cornell University, Ithaca, USA