Optimal Checkpointing Period: Time vs. Energy

Aupy, Guillaume; Benoit, Anne; Hérault, Thomas; Robert, Yves; Dongarra, Jack

doi:10.1007/978-3-319-10214-6_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8551)

Included in the following conference series:

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

Abstract

This short paper deals with parallel scientific applications that use non-blocking, periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs by instantiating the model with a set of realistic scenarios for Exascale systems. We place particular emphasis on I/O transfers, because the relative cost of communication is expected to increase dramatically, in terms of both latency and consumed energy, on future Exascale platforms.
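The formulas of the paper itself are not reproduced on this page, so the following is only a minimal sketch of the kind of computation the abstract refers to. It uses the classical first-order approximation of the time-optimal period attributed to Young, sqrt(2 C mu), and not the paper's own model. It assumes blocking checkpoints, ignores downtime and recovery, and does not cover energy at all. The function names and the sample values (a 10-minute checkpoint, a one-day platform MTBF) are hypothetical and chosen for illustration.

```python
import math


def young_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order time-optimal checkpointing period, sqrt(2 * C * mu).

    Assumption: blocking checkpoints of cost C and a platform MTBF mu
    that is much larger than C. This is NOT the paper's non-blocking model.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost: C / T + T / (2 * mu).

    C / T is the checkpointing overhead per period; T / (2 * mu) is the
    expected re-execution after a failure (half a period lost on average).
    """
    if period_s <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    C = 600.0      # hypothetical checkpoint cost, in seconds
    MU = 86400.0   # hypothetical platform MTBF, in seconds
    T = young_period(C, MU)
    print(f"period: {T:.0f} s, waste: {first_order_waste(T, C, MU):.3f}")
```

Under these assumptions the waste is minimized at the returned period. Checkpointing more often raises the C / T term, and checkpointing less often raises the T / (2 mu) term. An energy-optimal period would need separate power terms for computation and I/O, which this sketch does not attempt to model.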

Author information

Authors and Affiliations

Laboratoire LIP, École Normale Supérieure de Lyon, Lyon, France
Guillaume Aupy, Anne Benoit & Yves Robert
University of Tennessee, Knoxville, USA
Thomas Hérault, Yves Robert & Jack Dongarra

Corresponding author

Correspondence to Guillaume Aupy.

Editor information

Editors and Affiliations

University of Warwick, Coventry, West Midlands, United Kingdom
Stephen A. Jarvis
University of Warwick, Coventry, West Midlands, United Kingdom
Steven A. Wright
Sandia National Laboratories CSRI, Albuquerque, New Mexico, USA
Simon D. Hammond

About this paper

Cite this paper

Aupy, G., Benoit, A., Hérault, T., Robert, Y., Dongarra, J. (2014). Optimal Checkpointing Period: Time vs. Energy. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science, vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_10

DOI: https://doi.org/10.1007/978-3-319-10214-6_10
Published: 01 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10213-9
Online ISBN: 978-3-319-10214-6
eBook Packages: Computer Science, Computer Science (R0)