Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

New Generation Computing

Abstract

The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Parallel applications must therefore tolerate hardware failures, so that a machine failure does not wipe out all the computation already done. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large-scale systems. This paper proposes and implements three different, complementary techniques for reducing the size of the checkpoints generated by application-level checkpointing. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.



Author information

Corresponding author

Correspondence to Iván Cores.

About this article

Cite this article

Cores, I., Rodríguez, G., Martín, M.J. et al. Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes. New Gener. Comput. 31, 163–185 (2013). https://doi.org/10.1007/s00354-013-0302-4

  • DOI: https://doi.org/10.1007/s00354-013-0302-4
