Abstract
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
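To make the hashing half of the idea concrete, the following is a minimal CPU-only sketch of hash-based change detection, not the libhashckpt implementation. It assumes a fixed block size equal to a common page size and uses SHA-256 from Python's standard library; the block size, the hash function, and the helper names `block_hashes` and `changed_blocks` are all illustrative assumptions, and neither the GPU offload nor the page-protection component described in the abstract is modeled.

```python
import hashlib

# Hypothetical block size, chosen to match a common page size (assumption).
BLOCK_SIZE = 4096


def block_hashes(buf, block_size=BLOCK_SIZE):
    """Return one digest per fixed-size block of buf."""
    return [hashlib.sha256(buf[i:i + block_size]).digest()
            for i in range(0, len(buf), block_size)]


def changed_blocks(prev_hashes, buf, block_size=BLOCK_SIZE):
    """Return (indices of blocks whose digest differs, new digest list)."""
    new_hashes = block_hashes(buf, block_size)
    changed = [i for i, h in enumerate(new_hashes)
               if i >= len(prev_hashes) or h != prev_hashes[i]]
    return changed, new_hashes


# Application state at the previous checkpoint: four zero-filled blocks.
state = bytearray(4 * BLOCK_SIZE)
hashes = block_hashes(state)

# Between checkpoints the application writes to two blocks.
state[BLOCK_SIZE + 10] = 0xFF  # block 1: contents really change
state[0] = 0                   # block 0: written, but with the same value

changed, hashes = changed_blocks(hashes, state)
print(changed)  # [1]
```

Only block 1 would need to be saved in the next incremental checkpoint. Block 0 was written to, so a scheme that relies on write tracking alone would typically mark it dirty, whereas comparing digests shows its contents are unchanged.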
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferreira, K.B., Riesen, R., Brightwell, R., Bridges, P., Arnold, D. (2011). libhashckpt: Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2011. Lecture Notes in Computer Science, vol 6960. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24449-0_31
DOI: https://doi.org/10.1007/978-3-642-24449-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24448-3
Online ISBN: 978-3-642-24449-0
eBook Packages: Computer Science, Computer Science (R0)