Experimental and Efficient Algorithms

Volume 3503 of the series Lecture Notes in Computer Science pp 403-414

Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers

  • Azzedine Boukerche (SITE – School of Information Technology and Engineering, University of Ottawa)
  • Jeferson Koch (Department of Computer Science, University of Brasilia)
  • Alba Cristina Magalhaes Alves de Melo (Department of Computer Science, University of Brasilia)


Distributed Shared Memory (DSM) provides parallel programmers with the abstraction of a physically shared memory. Most recent software DSM systems offer relaxed memory models that guarantee consistency only at synchronization operations. Because DSM systems mainly aim to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Results on several popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
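
To make the idea of checkpointing at a barrier concrete, the following is a minimal sketch and not the protocol evaluated in the paper. It assumes threads in one process stand in for DSM nodes and an in-memory dictionary stands in for stable storage; the class `CheckpointingBarrier` and the function `run` are hypothetical names introduced only for illustration. The sketch relies on the observation from the abstract that consistency is guaranteed at synchronization operations: once every worker has reached the barrier, each one can save its state, and no worker continues until all saves are complete.

```python
import copy
import threading


class CheckpointingBarrier:
    """Barrier that also saves each worker's state (hypothetical helper)."""

    def __init__(self, n_workers):
        self._barrier = threading.Barrier(n_workers)
        self._lock = threading.Lock()
        # worker_id -> (epoch, saved state); assumed stand-in for stable storage
        self.checkpoints = {}

    def wait(self, worker_id, epoch, state):
        # Phase 1: block until every worker has arrived at the barrier.
        self._barrier.wait()
        # All workers are now at the same synchronization point, so each
        # one saves a private copy of its own state.
        snapshot = copy.deepcopy(state)
        with self._lock:
            self.checkpoints[worker_id] = (epoch, snapshot)
        # Phase 2: nobody resumes computing until every snapshot is saved.
        self._barrier.wait()

    def restore(self, worker_id):
        # Recovery: return the last epoch and a copy of the state saved there.
        epoch, snapshot = self.checkpoints[worker_id]
        return epoch, copy.deepcopy(snapshot)


def run(n_workers=4, n_epochs=3):
    barrier = CheckpointingBarrier(n_workers)

    def worker(worker_id):
        state = {"sum": 0}
        for epoch in range(n_epochs):
            state["sum"] += worker_id + epoch  # stand-in for real computation
            barrier.wait(worker_id, epoch, state)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier
```

After `run()` returns, `barrier.restore(i)` yields the epoch and state that worker `i` saved at the last barrier, which is where a restarted worker would resume. The second barrier phase is the design choice worth noting: it keeps all saved states in the same epoch, at the cost of one extra synchronization per barrier.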