Abstract
This short paper deals with parallel scientific applications that use non-blocking, periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs by instantiating the model with a set of realistic scenarios for Exascale systems. We place particular emphasis on I/O transfers, because the relative cost of communication is expected to increase dramatically, both in terms of latency and consumed energy, on future Exascale platforms.
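For readers unfamiliar with the notion of an optimal checkpointing period, the following is a minimal sketch of the classical first-order approximation usually attributed to Young, which applies to blocking checkpoints and minimizes time only. It is not the model of this paper, whose formulas for non-blocking checkpointing and energy are not reproduced in the abstract. The function names and the numeric values are hypothetical and chosen for illustration.

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Classical first-order period sqrt(2 * C * mu) for blocking checkpoints.

    checkpoint_cost (C) and mtbf (mu) must use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost with a given period.

    C / T is the checkpoint overhead; T / (2 * mu) is the expected
    re-execution after a failure (half a period lost on average).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    C = 600.0      # hypothetical checkpoint cost, in seconds
    mu = 86400.0   # hypothetical platform MTBF, in seconds
    T = young_period(C, mu)
    for candidate in (0.5 * T, T, 2.0 * T):
        waste = first_order_waste(candidate, C, mu)
        print(f"period={candidate:10.1f} s  waste={waste:.4f}")
```

Under this simplified waste model the period returned by `young_period` is the minimizer, so shorter and longer periods both lose more time. An energy-optimal period would generally differ from it, which is the trade-off the paper studies.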
© 2014 Springer International Publishing Switzerland
Cite this paper
Aupy, G., Benoit, A., Hérault, T., Robert, Y., Dongarra, J. (2014). Optimal Checkpointing Period: Time vs. Energy. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science(), vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_10
DOI: https://doi.org/10.1007/978-3-319-10214-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10213-9
Online ISBN: 978-3-319-10214-6
eBook Packages: Computer Science, Computer Science (R0)