Automatic Parameter Tuning of Hierarchical Incremental Checkpointing

  • Alfian Amrizal
  • Shoichi Hirasawa
  • Hiroyuki Takizawa (email author)
  • Hiroaki Kobayashi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8969)


As future HPC systems become larger, failure rates and the cost of checkpointing to the global file system are expected to increase. Hierarchical incremental checkpoint/restart (CPR) is a promising approach to this problem. It uses a hierarchical storage system consisting of local and global storage, and performs incremental checkpointing by writing only the memory pages updated between two consecutive checkpoints. In this paper, we respond to an open question: how to optimize the checkpoint interval when the checkpoint overhead changes over time, as it does in hierarchical incremental CPR. We propose a runtime checkpoint interval autotuning technique to optimize the efficiency of hierarchical incremental CPR. Evaluation results show that the efficiency can be significantly increased if the storage hierarchy is exploited with appropriate checkpoint intervals.
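To illustrate why a fixed interval is a poor fit when checkpoint overhead varies, the following minimal sketch recomputes the interval after every checkpoint. It is not the autotuning technique proposed in the paper, which the abstract does not specify. It assumes Young's well-known first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF), and feeds it a simple moving average of recently observed checkpoint costs. The function names, the window size, and the sample costs are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimum checkpoint interval.

    Both arguments are in seconds. This is a classic baseline, used here
    only as an assumption; it is not the paper's own model.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def retuned_intervals(observed_costs_s, mtbf_s, window=3):
    """Recompute the interval after each checkpoint.

    The cost fed into Young's formula is the simple moving average of the
    last `window` observed checkpoint costs (hypothetical smoothing choice).
    """
    intervals = []
    for i in range(len(observed_costs_s)):
        recent = observed_costs_s[max(0, i - window + 1): i + 1]
        average_cost = sum(recent) / len(recent)
        intervals.append(young_interval(average_cost, mtbf_s))
    return intervals


if __name__ == "__main__":
    # Hypothetical costs: a full checkpoint first, then cheaper incremental ones.
    costs = [200.0, 50.0, 8.0]
    for cost, interval in zip(costs, retuned_intervals(costs, mtbf_s=10000.0)):
        print(f"observed cost {cost:6.1f} s -> next interval {interval:7.1f} s")
```

With these made-up numbers the suggested interval shrinks as the incremental checkpoints become cheaper, which is the qualitative behavior a runtime tuner would need to track.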



This research is partially supported by JST CREST “An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems” and Grant-in-Aid for Scientific Research (B) #25280041. The first author, Alfian Amrizal, is financially supported by Monbukagakusho.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Alfian Amrizal (1, 3)
  • Shoichi Hirasawa (1, 3)
  • Hiroyuki Takizawa (1, 3), email author
  • Hiroaki Kobayashi (1, 2, 3)
  1. Graduate School of Information Science, Tohoku University, Sendai, Japan
  2. Cyberscience Center, Tohoku University, Sendai, Japan
  3. Japan Science and Technology Agency, CREST, Kawaguchi, Japan
