A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid

  • Kayhan Erciyes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4330)


We describe a replication-based protocol that uses group communication for fault tolerance in the Computational Grid. The Grid is partitioned into a number of clusters, and each cluster has a designated coordinator that manages the states of the replicas within its cluster. The coordinators belong to a process group, and the proposed protocol ensures that the coordinators deliver messages to the replicas in the correct sequence. Any failing node of the Grid is replaced by an active replica so that the application continues to operate correctly. We present the theoretical framework, illustrate the replication protocol, report its implementation results, and analyze its performance and scalability.
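The following is a minimal sketch of the general idea summarized above, not the protocol from the paper: one coordinator assigns sequence numbers to messages, replicas deliver them strictly in that order, and a surviving replica takes over when the active node fails. All names here (`Replica`, `Coordinator`, `multicast`, `handle_failure`) are hypothetical, and the sketch assumes a single cluster, an in-process "network", and a failure that is already detected.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that delivers messages in sequence-number order."""
    name: str
    alive: bool = True
    next_seq: int = 0
    log: list = field(default_factory=list)      # messages delivered so far
    pending: dict = field(default_factory=dict)  # out-of-order messages

    def receive(self, seq, msg):
        # Buffer the message, then deliver every consecutive one available.
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Coordinator:
    """Hypothetical cluster coordinator acting as a sequencer (an assumption)."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.primary = self.replicas[0]  # node currently serving the application
        self.seq = 0

    def multicast(self, msg):
        # Stamp the message once so all live replicas see the same order.
        seq = self.seq
        self.seq += 1
        for replica in self.replicas:
            if replica.alive:
                replica.receive(seq, msg)

    def handle_failure(self, failed):
        # Mark the node as failed and promote the first live replica if needed.
        failed.alive = False
        if failed is self.primary:
            self.primary = next(r for r in self.replicas if r.alive)
        return self.primary


a, b, c = Replica("a"), Replica("b"), Replica("c")
coord = Coordinator([a, b, c])
coord.multicast("m0")
coord.multicast("m1")
coord.handle_failure(a)   # "a" fails; "b" is promoted
coord.multicast("m2")
```

Because every live replica applies the same messages in the same order, the promoted replica already holds the state the failed node had, which is what lets it continue the computation. A real deployment would also need failure detection, agreement among coordinators across clusters, and state transfer for recovering nodes, none of which this sketch attempts.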


Keywords: Cluster Head, Fault Tolerance, Group Communication, Total Order, Correct Process
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kayhan Erciyes
    Computer Eng. Dept., Izmir Institute of Technology, Urla, Turkey
