A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4330)

Abstract

We describe a replication-based protocol that uses group communication to provide fault tolerance in the Computational Grid. The Grid is partitioned into clusters, and each cluster has a designated coordinator that manages the states of the replicas within it. The coordinators form a process group, and the protocol ensures that they deliver messages to the replicas in the correct sequence. Any failing node of the Grid is replaced by an active replica so that the application continues to operate correctly. We present the theoretical framework, illustrate the replication protocol, report its implementation results, and analyze its performance and scalability.
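The two ideas named in the abstract, ordered delivery by a coordinator and replacement of a failed node by an active replica, can be illustrated with a minimal sketch. This is not the protocol from the paper, whose details are not given on this page. The class names `Replica` and `Coordinator`, the per-cluster sequence counter, and the "first live replica" promotion rule are all assumptions made for illustration, and the process group formed by the coordinators is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that applies messages strictly in sequence order."""
    name: str
    next_seq: int = 0                              # next sequence number to deliver
    pending: dict = field(default_factory=dict)    # seq -> message, held until in order
    delivered: list = field(default_factory=list)  # messages applied so far
    alive: bool = True

    def receive(self, seq, msg):
        # Buffer the message, then deliver every message that is now in order.
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


@dataclass
class Coordinator:
    """Hypothetical cluster coordinator (assumed design, not the paper's)."""
    replicas: list
    seq: int = 0

    def multicast(self, msg):
        # Stamp the message once, then send it to every live replica.
        seq = self.seq
        self.seq += 1
        for replica in self.replicas:
            if replica.alive:
                replica.receive(seq, msg)

    def primary(self):
        # The first live replica acts for the node; a failure promotes the next.
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError("no live replica left in the cluster")
```

Because every live replica has applied the same messages in the same order, the replica returned by `primary()` after a failure already holds the state needed to continue, which is the property the abstract attributes to active replication.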




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Erciyes, K. (2006). A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_62

  • DOI: https://doi.org/10.1007/11946441_62

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68067-3

  • Online ISBN: 978-3-540-68070-3

  • eBook Packages: Computer Science, Computer Science (R0)
