Performance Comparison of Hierarchical Checkpoint Protocols Grid Computing

  • Ndeye Massata Ndiaye
  • Pierre Sens
  • Ousmane Thiare
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 151)


A grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity imposed by the distribution: any node can fail at any moment, and the mean time between failures decreases sharply. Fault tolerance is intended to allow the system to provide service as specified despite the occurrence of faults, and many techniques have been proposed in the literature to make supercomputing applications robust to failures. We study protocols based on rollback recovery, which fall into two categories: checkpoint-based rollback-recovery protocols and message-logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network, and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proved inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or a hierarchical environment such as a grid. Hence there is a need to implement these protocols in a hierarchical fashion and compare their performance in grid computing. In this paper, we propose hierarchical versions of these protocols. We have implemented them and compared their performance in clusters and grids using the OMNeT++ simulator.


Keywords: Grid Computing, Fault Tolerance, Grid Infrastructure, Faulty Node, Recovery Protocol





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Ndeye Massata Ndiaye (1, 2)
  • Pierre Sens (2)
  • Ousmane Thiare (1)
  1. Gaston Berger University of Saint-Louis, Saint-Louis, Senegal
  2. Regal team, Paris Jussieu, Paris, France
