libhashckpt: Hash-Based Incremental Checkpointing Using GPU’s

  • Kurt B. Ferreira
  • Rolf Riesen
  • Ron Brightwell
  • Patrick Bridges
  • Dorian Arnold
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6960)

Abstract

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
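
The hashing half of the idea can be illustrated with a minimal CPU-only sketch. It is not the libhashckpt implementation: the block size, the choice of SHA-1, and the function names `block_hashes` and `changed_blocks` are all assumptions made for illustration, and the page-protection and GPU offload steps described in the abstract are omitted.

```python
import hashlib

# Assumed block size (a common page size); not taken from the paper.
BLOCK_SIZE = 4096


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one digest per fixed-size block of `data` (hypothetical helper)."""
    return [
        hashlib.sha1(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def changed_blocks(prev_hashes, data, block_size=BLOCK_SIZE):
    """Compare fresh block digests against the previous checkpoint's digests.

    Returns (indices of blocks that are new or whose digest changed,
    the new digest list to keep for the next checkpoint).
    """
    new_hashes = block_hashes(data, block_size)
    changed = [
        i for i, h in enumerate(new_hashes)
        if i >= len(prev_hashes) or h != prev_hashes[i]
    ]
    return changed, new_hashes


if __name__ == "__main__":
    state = bytearray(4 * BLOCK_SIZE)          # stand-in for application memory
    _, hashes = changed_blocks([], bytes(state))  # first checkpoint: all blocks saved
    state[BLOCK_SIZE + 10] = 0xFF              # touch block 1
    state[3 * BLOCK_SIZE] = 0x01               # touch block 3
    dirty, hashes = changed_blocks(hashes, bytes(state))
    print(dirty)                               # prints [1, 3]
```

Only the blocks listed in `dirty` would need to be written at the next checkpoint. As with any hash-based comparison, two different blocks can in principle share a digest, so a changed block could be missed with very small probability.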

Keywords

Graphic Process Unit, High Performance Computing, Sandia National Laboratory, Checkpoint Mechanism, High Performance Computing System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Kurt B. Ferreira (1, 3)
  • Rolf Riesen (2)
  • Ron Brightwell (1)
  • Patrick Bridges (3)
  • Dorian Arnold (3)

  1. Scalable System Software, Sandia National Laboratories, USA
  2. IBM Research, Ireland
  3. Department of Computer Science, University of New Mexico, USA
