Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers

  • Azzedine Boukerche
  • Jeferson Koch
  • Alba Cristina Magalhaes Alves de Melo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3503)

Abstract

Distributed Shared Memory (DSM) creates an abstraction of a physical shared memory that parallel programmers can access. Most recent software DSMs provide relaxed memory models that guarantee consistency only at synchronization operations. Because the main goal of DSM systems is to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Our results on some popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
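
The idea of taking a checkpoint at a barrier can be illustrated with a small sketch. The code below is not the mechanism evaluated in the paper: it uses threads in a single process rather than a DSM system, and the names CheckpointingBarrier, wait, restore, and run_demo are hypothetical. It only shows why a barrier is a convenient place for a coordinated checkpoint: every participant has arrived, so the set of saved local states forms one consistent global snapshot.

```python
import copy
import threading


class CheckpointingBarrier:
    """Barrier that also records a coordinated checkpoint (illustrative only)."""

    def __init__(self, parties):
        self._lock = threading.Lock()
        self._pending = {}    # snapshots deposited for the barrier in progress
        self._committed = {}  # last complete global checkpoint
        self.epoch = 0        # number of committed checkpoints
        # The action runs once, in one thread, after all parties have arrived
        # and before any of them is released.
        self._barrier = threading.Barrier(parties, action=self._commit)

    def _commit(self):
        # Every worker has deposited a snapshot, so the set is consistent.
        self._committed = self._pending
        self._pending = {}
        self.epoch += 1

    def wait(self, worker_id, state):
        # Save a private copy of the local state, then synchronize.
        with self._lock:
            self._pending[worker_id] = copy.deepcopy(state)
        self._barrier.wait()

    def restore(self, worker_id):
        # Recovery: return this worker's part of the last global checkpoint.
        return copy.deepcopy(self._committed[worker_id])


def run_demo(workers=3, steps=4):
    barrier = CheckpointingBarrier(workers)
    results = {}

    def worker(wid):
        state = {"step": 0, "total": 0}
        for step in range(1, steps + 1):
            state["step"] = step
            state["total"] += wid * step
            barrier.wait(wid, state)
        # Assumed failure: the local state is lost after the last barrier,
        # so the worker rolls back to the last coordinated checkpoint.
        state["total"] = -1
        results[wid] = barrier.restore(wid)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier.epoch, results
```

Calling run_demo() returns the number of committed checkpoints and the state each worker recovers. In a real DSM the saved state would also have to cover shared pages and protocol metadata, and the snapshot would go to stable storage; the sketch leaves both out.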

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Azzedine Boukerche (1)
  • Jeferson Koch (2)
  • Alba Cristina Magalhaes Alves de Melo (2)

  1. SITE – School of Information Technology and Engineering, University of Ottawa, Canada
  2. Department of Computer Science, University of Brasilia, Brasilia, Brazil
