Performance Comparison of Hierarchical Checkpoint Protocols Grid Computing
A grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity: any node can fail at any moment, and the mean time between failures decreases sharply. Fault tolerance is intended to allow the system to provide service as specified in spite of the occurrence of faults, and many techniques have been proposed in the literature to improve the robustness of supercomputing applications in the presence of failures. We study protocols based on rollback recovery, which fall into two categories: checkpoint-based rollback recovery protocols and message logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network, and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proved inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or for a hierarchical environment such as a grid. Hence there is a need to implement these protocols in a hierarchical fashion and to compare their performance in grid computing. In this paper, we propose hierarchical versions of these protocols. We have implemented them and compared their performance on clusters and grids using the OMNeT++ simulator.
Keywords: Grid Computing, Fault Tolerance, Grid Infrastructure, Faulty Node, Recovery Protocol