Optimal Checkpointing Period: Time vs. Energy

  • Conference paper
High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation (PMBS 2013)

Abstract

This short paper addresses parallel scientific applications that use non-blocking, periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs by instantiating the model with a set of realistic scenarios for Exascale systems. We place particular emphasis on I/O transfers, because the relative cost of communication is expected to increase dramatically, in terms of both latency and consumed energy, on future Exascale platforms.
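
To make the notion of an "optimal checkpointing period" concrete, the sketch below computes the classical first-order approximation due to Young (1974) for blocking checkpoints. It is not the non-blocking time/energy model of this paper, whose formulas are not reproduced on this page. The function names, the simplified waste expression (downtime and recovery cost are ignored), and the numeric values are illustrative assumptions.

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal period: sqrt(2 * C * MTBF).

    Assumes blocking checkpoints of cost C (seconds) and a platform
    mean time between failures MTBF (seconds), with C much smaller than MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost to resilience (assumed simplified model).

    C / T        : overhead of taking one checkpoint per period
    T / (2*MTBF) : expected re-execution, half a period lost per failure
    Downtime and recovery time are deliberately ignored here.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 10-minute checkpoint, one failure per day.
    C, MTBF = 600.0, 86400.0
    T_opt = young_period(C, MTBF)
    print(f"period ~ {T_opt:.0f} s, waste ~ {first_order_waste(T_opt, C, MTBF):.3f}")
```

Under this simplified model, checkpointing more often than the computed period pays more checkpoint overhead, while checkpointing less often loses more work per failure. An energy-optimal period would need a separate cost function, which is the trade-off the paper studies.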



Author information

Corresponding author

Correspondence to Guillaume Aupy.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Aupy, G., Benoit, A., Hérault, T., Robert, Y., Dongarra, J. (2014). Optimal Checkpointing Period: Time vs. Energy. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science, vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_10

  • DOI: https://doi.org/10.1007/978-3-319-10214-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10213-9

  • Online ISBN: 978-3-319-10214-6

  • eBook Packages: Computer Science, Computer Science (R0)
