New Generation Computing, Volume 31, Issue 3, pp 163–185

Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

  • Iván Cores
  • Gabriel Rodríguez
  • María J. Martín
  • Patricia González
  • Roberto R. Osorio


The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures so that not all completed computation is lost when a machine fails. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storage resources. Thus, checkpoint-recovery techniques must minimize these costs to be useful on large-scale systems. This paper proposes and implements three different, complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.


Keywords

  • Parallel Programming
  • Message-Passing
  • MPI
  • Fault Tolerance
  • Checkpointing





Copyright information

© Ohmsha and Springer Japan 2013

Authors and Affiliations

  • Iván Cores (1)
  • Gabriel Rodríguez (1)
  • María J. Martín (1)
  • Patricia González (1)
  • Roberto R. Osorio (1)
  1. Computer Architecture Group, University of A Coruña, Coruña, Spain
