Abstract
Distributed Shared Memory (DSM) provides parallel programmers with the abstraction of a physically shared memory. Most recent software DSMs provide relaxed memory models that guarantee consistency only at synchronization operations. Because the main goal of DSM systems is to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Our results on some popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
This work was partially supported by NSERC, Canada Foundation for Innovation and Canada Research Chair Programs.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boukerche, A., Koch, J., de Melo, A.C.M.A. (2005). Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers. In: Nikoletseas, S.E. (eds) Experimental and Efficient Algorithms. WEA 2005. Lecture Notes in Computer Science, vol 3503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11427186_35
DOI: https://doi.org/10.1007/11427186_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25920-6
Online ISBN: 978-3-540-32078-4
eBook Packages: Computer Science, Computer Science (R0)