On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance

  • Dewan Ibtesham
  • Dorian Arnold
  • Kurt B. Ferreira
  • Patrick G. Bridges
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)

Abstract

The increasing size and complexity of high performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as a part of a scalable checkpoint/restart solution and discuss additional scenarios and improvements that may make checkpoint data compression even more viable.

Keywords

Checkpoint data compression, extreme scale fault-tolerance, checkpoint/restart

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Dewan Ibtesham (1)
  • Dorian Arnold (1)
  • Kurt B. Ferreira (2)
  • Patrick G. Bridges (1)
  1. University of New Mexico, Albuquerque, USA
  2. Sandia National Laboratories, Albuquerque, USA
