Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers

  • Azzedine Boukerche
  • Jeferson Koch
  • Alba Cristina Magalhaes Alves de Melo
Conference paper

DOI: 10.1007/11427186_35

Part of the Lecture Notes in Computer Science book series (LNCS, volume 3503)
Cite this paper as:
Boukerche A., Koch J., de Melo A.C.M.A. (2005) Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers. In: Nikoletseas S.E. (eds) Experimental and Efficient Algorithms. WEA 2005. Lecture Notes in Computer Science, vol 3503. Springer, Berlin, Heidelberg

Abstract

Distributed Shared Memory (DSM) creates an abstraction of a physical shared memory that parallel programmers can access. Most recent software DSMs provide relaxed memory models that guarantee consistency only at synchronization operations. Because the main goal of DSM systems is to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Our results on some popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
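
The general idea of taking a checkpoint at a barrier can be illustrated with a small single-process sketch. It is not the authors' implementation and does not model a real DSM system: threads stand in for DSM nodes, a dictionary stands in for shared memory, and the names `CheckpointingBarrier`, `recover`, and `run_demo` are hypothetical. It relies only on the standard `threading.Barrier`, whose `action` callback runs in exactly one thread after every participant has arrived and before any is released, so the shared state is quiescent while it is copied.

```python
import copy
import threading


class CheckpointingBarrier:
    """Barrier that snapshots shared state once all workers have arrived.

    Illustrative only: a dict stands in for DSM pages, threads for nodes.
    """

    def __init__(self, parties, shared_state):
        self.shared_state = shared_state  # stand-in for shared memory
        self.checkpoint = None            # (epoch, snapshot) of the last barrier
        self.epoch = 0                    # number of completed barriers
        # The action runs in one thread, after all parties arrive and
        # before any of them is released.
        self._barrier = threading.Barrier(parties, action=self._take_checkpoint)

    def _take_checkpoint(self):
        self.epoch += 1
        self.checkpoint = (self.epoch, copy.deepcopy(self.shared_state))

    def wait(self):
        self._barrier.wait()

    def recover(self):
        """Roll shared state back to the last checkpoint and return its epoch."""
        epoch, snapshot = self.checkpoint
        self.shared_state.clear()
        self.shared_state.update(copy.deepcopy(snapshot))
        return epoch


def run_demo(workers=4, steps=3):
    state = {rank: 0 for rank in range(workers)}
    barrier = CheckpointingBarrier(workers, state)

    def worker(rank):
        for _ in range(steps):
            state[rank] += 1  # each worker writes only its own slot
            barrier.wait()    # the checkpoint is taken at this point

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    state[0] = -1             # simulate a fault after the last checkpoint
    epoch = barrier.recover()
    return epoch, dict(state)
```

Because the snapshot is taken while every worker is blocked at the barrier, no extra coordination messages are needed in this sketch. A real DSM system would additionally have to save per-node state to stable storage and handle in-flight messages, which this sketch leaves out.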

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Azzedine Boukerche (1)
  • Jeferson Koch (2)
  • Alba Cristina Magalhaes Alves de Melo (2)

  1. SITE – School of Information Technology and Engineering, University of Ottawa, Canada
  2. Department of Computer Science, University of Brasilia, Brasilia, Brazil
