Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing

  • Prasanna Balaprakash
  • Leonardo A. Bautista Gomez
  • Mohamed-Slim Bouguerra
  • Stefan M. Wild
  • Franck Cappello
  • Paul D. Hovland
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8966)


In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger, offering improved computational science throughput. Nevertheless, as the number of system components and their interactions increases, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues that need to be addressed for extreme-scale computing. We develop analytical models for run time and energy usage for multilevel fault-tolerance schemes. We use these models to study the tradeoff between run time and energy in FTI, a recently developed multilevel checkpoint library, on an IBM Blue Gene/Q. Our results show that the energy consumed by FTI is low and that the tradeoff between run time and energy is small. Using the analytical models, we explore the impact of various system-level parameters on run time and energy tradeoffs.


Power Consumption, Pareto Front, Idle Power, Parallel File System, Checkpoint Interval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work was supported by the SciDAC and X-Stack activities within the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program under contract number DE-AC02-06CH11357.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Prasanna Balaprakash (1, 2)
  • Leonardo A. Bautista Gomez (1)
  • Mohamed-Slim Bouguerra (1)
  • Stefan M. Wild (1)
  • Franck Cappello (1, 3)
  • Paul D. Hovland (1)

  1. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA
  2. Leadership Computing Facility, Argonne National Laboratory, Argonne, USA
  3. University of Illinois at Urbana-Champaign, Champaign, USA
