Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes
The execution times of large-scale parallel applications on current multi/many-core systems are usually longer than the mean time between failures. Parallel applications must therefore tolerate hardware failures, so that a machine failure does not wipe out all the computation done so far. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization, and storage resources, so checkpoint-recovery techniques must minimize these costs to be useful on large-scale systems. This paper proposes and implements three different, complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.
Keywords: Parallel Programming, Message-Passing, MPI, Fault Tolerance, Checkpointing