Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Cores, Iván; Rodríguez, Gabriel; Martín, María J.; González, Patricia; Osorio, Roberto R.

doi:10.1007/s00354-013-0302-4


Published: 03 August 2013

Volume 31, pages 163–185, (2013)

New Generation Computing

Iván Cores¹,
Gabriel Rodríguez¹,
María J. Martín¹,
Patricia González¹ &
Roberto R. Osorio¹


Abstract

The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Parallel applications must therefore tolerate hardware failures so that a machine failure does not discard all the computation already done. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization, and storage resources. Checkpoint-recovery techniques must therefore minimize these costs to be useful on large-scale systems. This paper proposes and implements three different, complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.
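
To make the idea of reducing checkpoint size concrete, the following is a minimal, generic sketch and not the authors' implementation: the abstract does not name the three techniques, so the example only illustrates two well-known approaches, writing only the blocks that changed since the previous checkpoint and compressing what is written. The block size, the function names, and the use of SHA-256 and zlib are all assumptions made for illustration, and the sketch assumes the state never shrinks between checkpoints.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block granularity in bytes (an assumption)


def block_hashes(data):
    """Return one SHA-256 digest per fixed-size block of the state."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def incremental_checkpoint(data, prev_hashes):
    """Build a checkpoint holding only blocks that differ from the last one.

    Returns (changed, new_hashes), where `changed` maps a block index to the
    zlib-compressed contents of that block.
    """
    new_hashes = block_hashes(data)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        # A block is saved if it is new or its hash no longer matches.
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * BLOCK_SIZE
            changed[idx] = zlib.compress(data[start:start + BLOCK_SIZE])
    return changed, new_hashes


def restore(base, changed):
    """Apply one incremental checkpoint on top of an earlier restored state."""
    state = bytearray(base)
    for idx in sorted(changed):
        start = idx * BLOCK_SIZE
        block = zlib.decompress(changed[idx])
        state[start:start + len(block)] = block
    return bytes(state)
```

In this sketch the first call, made with an empty hash list, saves every block; later calls save only the modified blocks, so the checkpoint size tracks how much of the state changed rather than how large the state is.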



Author information

Authors and Affiliations

Computer Architecture Group, University of A Coruña, Coruña, Spain
Iván Cores, Gabriel Rodríguez, María J. Martín, Patricia González & Roberto R. Osorio


Corresponding author

Correspondence to Iván Cores.

About this article

Cite this article

Cores, I., Rodríguez, G., Martín, M.J. et al. Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes. New Gener. Comput. 31, 163–185 (2013). https://doi.org/10.1007/s00354-013-0302-4


Received: 19 April 2013
Published: 03 August 2013
Issue Date: July 2013
DOI: https://doi.org/10.1007/s00354-013-0302-4
