European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 302–311

On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance


  • Dewan Ibtesham,
  • Dorian Arnold,
  • Kurt B. Ferreira &
  • Patrick G. Bridges
  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7156)

Abstract

The increasing size and complexity of high performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as a part of a scalable checkpoint/restart solution and discuss additional scenarios and improvements that may make checkpoint data compression even more viable.
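
The abstract does not spell out the viability model, so the following is only an illustrative sketch of one plausible formulation, not the authors' model. It assumes compression and commit happen one after the other, and that compression pays off when the time to compress plus the time to commit the smaller checkpoint is less than the time to commit the raw checkpoint. The names `commit_rate`, `compress_rate`, and `compression_factor` (the fraction of the checkpoint removed by compression) are hypothetical.

```python
def commit_times(size, commit_rate, compress_rate, compression_factor):
    """Return (raw_time, compressed_time) for one checkpoint.

    Assumed model (an illustration, not taken from the paper):
      raw_time        = size / commit_rate
      compressed_time = size / compress_rate
                        + (1 - compression_factor) * size / commit_rate
    Rates are in bytes per second; compression_factor is the fraction
    of bytes removed by compression, in [0, 1).
    """
    if size < 0 or commit_rate <= 0 or compress_rate <= 0:
        raise ValueError("size must be non-negative and rates positive")
    if not 0.0 <= compression_factor < 1.0:
        raise ValueError("compression_factor must be in [0, 1)")
    raw_time = size / commit_rate
    compressed_time = (size / compress_rate
                       + (1.0 - compression_factor) * size / commit_rate)
    return raw_time, compressed_time


def compression_is_viable(commit_rate, compress_rate, compression_factor):
    """True when compress-then-commit is faster than committing raw data.

    Checkpoint size cancels out of the inequality above, leaving:
      commit_rate < compression_factor * compress_rate
    """
    if commit_rate <= 0 or compress_rate <= 0:
        raise ValueError("rates must be positive")
    if not 0.0 <= compression_factor < 1.0:
        raise ValueError("compression_factor must be in [0, 1)")
    return commit_rate < compression_factor * compress_rate
```

Under these assumptions, compression helps when storage bandwidth is scarce relative to compression speed, and it hurts when the commit path is already faster than the compressor can shrink the data.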

Keywords

  • Checkpoint data compression
  • extreme scale fault-tolerance
  • checkpoint/restart



Author information

Authors and Affiliations

  1. University of New Mexico, Albuquerque, NM, 87131, USA

    Dewan Ibtesham, Dorian Arnold & Patrick G. Bridges

  2. Sandia National Laboratories, Albuquerque, NM, 87123, USA

    Kurt B. Ferreira


Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Traff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ibtesham, D., Arnold, D., Ferreira, K.B., Bridges, P.G. (2012). On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_34

  • DOI: https://doi.org/10.1007/978-3-642-29740-3_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer Science, Computer Science (R0)
