A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid

Erciyes, Kayhan

doi:10.1007/11946441_62

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4330)

Abstract

We describe a replication-based protocol that uses group communication for fault tolerance in the Computational Grid. The Grid is partitioned into a number of clusters and each cluster has a designated coordinator that manages the states of the replicas within its cluster. The coordinators belong to a process group and the proposed protocol ensures the correct sequence of message deliveries to the replicas by the coordinators. Any failing node of the Grid is replaced by an active replica to provide correct continuation of the operation of the application. We present the theoretical framework, illustrate the replication protocol, report its implementation results, and analyze its performance and scalability.
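
The abstract gives only the outline of the protocol, so the following Python fragment is a minimal sketch of the general idea rather than the paper's actual algorithm. It assumes, purely for illustration, that the coordinator group stamps each message with a single global sequence number, that replicas buffer out-of-order messages until the gap is filled, and that a failed node is replaced by the first live replica in its cluster. All class and method names (`Replica`, `Coordinator`, `CoordinatorGroup`, `multicast`, `active`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    next_seq: int = 0                               # next sequence number to deliver
    pending: dict = field(default_factory=dict)     # buffer for out-of-order messages
    delivered: list = field(default_factory=list)   # messages delivered so far, in order

    def receive(self, seq, msg):
        """Buffer the message, then deliver everything that is now in order."""
        if not self.alive:
            return
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


@dataclass
class Coordinator:
    cluster: str
    replicas: list

    def forward(self, seq, msg):
        """Pass a sequenced message to every replica in this cluster."""
        for replica in self.replicas:
            replica.receive(seq, msg)

    def active(self):
        """Return the first live replica; it stands in for a failed node."""
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError(f"no live replica in cluster {self.cluster}")


class CoordinatorGroup:
    """Hypothetical process group that imposes one global message order.

    Assumption: a single shared counter replaces whatever ordering
    mechanism the paper actually uses among coordinators.
    """

    def __init__(self, coordinators):
        self.coordinators = coordinators
        self.seq = 0

    def multicast(self, msg):
        seq, self.seq = self.seq, self.seq + 1
        for coordinator in self.coordinators:
            coordinator.forward(seq, msg)
        return seq


def demo():
    a = Coordinator("A", [Replica("a0"), Replica("a1")])
    b = Coordinator("B", [Replica("b0"), Replica("b1")])
    group = CoordinatorGroup([a, b])
    group.multicast("m0")
    group.multicast("m1")
    a.replicas[0].alive = False   # simulated node failure in cluster A
    group.multicast("m2")
    return a.active().name, a.active().delivered, b.active().delivered
```

In this sketch, calling `demo()` shows the surviving replica in cluster A holding the same ordered message history as the replicas in cluster B, which is the property that lets it take over from the failed node. Failure detection, membership changes, and coordinator failure are deliberately left out.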

Author information

Authors and Affiliations

Computer Eng. Dept., Izmir Institute of Technology, Urla, Izmir, 35430, Turkey
Kayhan Erciyes

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Minyi Guo
Department of Computer Science, St. Francis Xavier University, Antigonish, Canada
Laurence T. Yang
Dipartimento di Ingegneria dell’Informazione, Second University of Naples, Real Casa dell’Annunziata, via Roma 29, 81031 Aversa (CE), Italy
Beniamino Di Martino
Institute of Scientific Computing, University of Vienna, Nordbergstr. 15/C/3, A-1090, Vienna, Austria/JPL, Caltech, USA
Hans P. Zima
Computer Science Department, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra
Grid Computing Center, Shanghai Jiao Tong University, 800 Dongchuan Road, 200240, Shanghai, China
Feilong Tang

About this paper

Cite this paper

Erciyes, K. (2006). A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_62

DOI: https://doi.org/10.1007/11946441_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68067-3
Online ISBN: 978-3-540-68070-3
eBook Packages: Computer Science, Computer Science (R0)
