libhashckpt: Hash-Based Incremental Checkpointing Using GPU’s

  • Conference paper
Recent Advances in the Message Passing Interface (EuroMPI 2011)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6960)

Abstract

Concern is growing in the high-performance computing (HPC) community about the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization of traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
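
The hashing half of the idea in the abstract can be illustrated with a minimal sketch. This is not the libhashckpt implementation: it runs on the CPU with Python's hashlib rather than on a GPU, it omits page protection entirely, and the names BLOCK_SIZE, block_hashes, incremental_checkpoint and restore are hypothetical, chosen only for this example. The 4096-byte block size and the choice of SHA-256 are assumptions as well.

```python
import hashlib

# Assumed block granularity (a common page size); not taken from the paper.
BLOCK_SIZE = 4096


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Compare per-block hashes against the previous checkpoint.

    Returns (changed, new_hashes), where `changed` maps a block index to
    the block contents for every block that is new or whose hash differs.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(prev_hashes) or prev_hashes[i] != digest:
            changed[i] = data[i * block_size:(i + 1) * block_size]
    return changed, new_hashes


def restore(base, changed, block_size=BLOCK_SIZE):
    """Apply the changed blocks to an earlier image of the same size."""
    buf = bytearray(base)
    for i, block in changed.items():
        start = i * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    state = bytearray(4 * BLOCK_SIZE)
    # First checkpoint: no previous hashes, so every block is saved.
    full, hashes = incremental_checkpoint(bytes(state), [])
    # Modify one byte inside block 1, then checkpoint again.
    state[5000] = 1
    delta, hashes = incremental_checkpoint(bytes(state), hashes)
    print(sorted(delta))
```

Because blocks are compared by content rather than by write access, a block that is rewritten with identical values is not saved again. The sketch assumes the buffer does not shrink between checkpoints.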

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferreira, K.B., Riesen, R., Brightwell, R., Bridges, P., Arnold, D. (2011). libhashckpt: Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2011. Lecture Notes in Computer Science, vol 6960. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24449-0_31

  • DOI: https://doi.org/10.1007/978-3-642-24449-0_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24448-3

  • Online ISBN: 978-3-642-24449-0

  • eBook Packages: Computer Science (R0)
