Fault-tolerant grid architecture and practice

Jin, Hai; Zou, DeQing; Chen, HanHua; Sun, JianHua; Wu, Song

doi:10.1007/BF02948916

Published: July 2003

Volume 18, pages 423–433, (2003)

Journal of Computer Science and Technology

Abstract

Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems over wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems, and unreliable fault detection is one of the most effective. Globus, a grid middleware, manages resources in a wide area network. The Globus fault detection service uses well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. This paper presents a fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service. The platform offers effective strategies in three areas: grid key components, user tasks, and high-level applications.
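To make the notion of an unreliable fault detector concrete, the following is a minimal Python sketch of a heartbeat-based detector with timeouts. It is not the FTGP or Globus implementation; the class name `HeartbeatDetector`, its parameters (`timeout`, `backoff`), and the component name `node-a` are all hypothetical, chosen only for illustration. The detector is "unreliable" in the sense that it may wrongly suspect a slow component and later retract that suspicion.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Timeout-based unreliable failure detector (illustrative sketch only)."""
    timeout: float = 3.0   # hypothetical initial timeout, in seconds
    backoff: float = 2.0   # hypothetical factor applied after a false suspicion
    last_seen: dict = field(default_factory=dict)  # component -> last heartbeat time
    timeouts: dict = field(default_factory=dict)   # component -> current timeout
    suspected: set = field(default_factory=set)    # components currently suspected

    def heartbeat(self, component: str, now: float) -> None:
        """Record a heartbeat; a suspected component becomes trusted again."""
        if component in self.suspected:
            # The earlier suspicion was a mistake: retract it and
            # lengthen the timeout so the same mistake is less likely.
            self.suspected.discard(component)
            self.timeouts[component] *= self.backoff
        self.timeouts.setdefault(component, self.timeout)
        self.last_seen[component] = now

    def check(self, now: float) -> set:
        """Suspect every component whose heartbeat is overdue at time `now`."""
        for component, seen in self.last_seen.items():
            if now - seen > self.timeouts[component]:
                self.suspected.add(component)
        return set(self.suspected)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("node-a", now=0.0)
print(detector.check(now=2.0))         # heartbeat still fresh: nothing suspected
print(detector.check(now=4.0))         # heartbeat overdue: node-a is suspected
detector.heartbeat("node-a", now=5.0)  # a late heartbeat arrives
print(detector.check(now=6.0))         # suspicion retracted, timeout is now longer
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic. A real service would also have to handle message loss, clock drift, and reporting of suspicions to interested parties.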

Author information

Authors and Affiliations

Huazhong University of Science and Technology, 430074, Wuhan, P.R. China
Hai Jin, DeQing Zou, HanHua Chen, JianHua Sun & Song Wu

Corresponding author

Correspondence to Hai Jin.

Additional information

This work is supported by the National Natural Science Foundation of China under Grants No.60125208 and No.60273076.

JIN Hai is a professor of computer science and engineering at the Huazhong University of Science and Technology (HUST) in China. He received his Ph.D. degree in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service (DAAD) fellowship to visit the Technical University of Chemnitz in Germany. He worked for the University of Hong Kong between 1998 and 2000 and participated in the HKU Cluster project. He worked as a visiting scholar at the Internet and Cluster Computing Laboratory at the University of Southern California from 1999 to 2000. He is a member of IEEE and ACM, and an associate editor of the International Journal of Parallel and Distributed Systems and Networks. He has served as guest editor of the Journal of Parallel and Distributed Computing, Future Generation Computer Systems, Cluster Computing, and Calculateur Parallèle. He is an executive member and region coordinator of the IEEE Task Force on Cluster Computing (TFCC). He is a member of the IASTED technical committee on Parallel & Distributed Computing and Systems. He is a steering committee member of the IEEE/ACM International Symposium on Cluster Computing and Grid (CCGrid), and served as program vice-chair of the First IEEE/ACM International Symposium on Cluster Computing and Grid (CCGrid'01) and the Fourth International Conference on Parallel and Distributed Computing, Applications, and Technologies (PDCAT 2003). He served as conference chair for the International Workshop on Internet Computing and E-Commerce from 2001 to 2003. He has also served as a program committee member for more than 30 international conferences and workshops. He has co-authored four books and published over 50 research papers. His research interests include parallel I/O, high performance storage systems, cluster computing and grid computing, network security, and fault tolerance.

ZOU DeQing received his B.S. degree in computer science from Fuzhou University (China) in 1997 and entered Huazhong University of Science and Technology (China) for an M.S. degree in 1999. He is currently a Ph.D. candidate in the School of Computer, HUST. His research interests include grid computing, peer-to-peer computing, semantic web, operating systems, and parallel program design.

CHEN HanHua received his B.S. degree in computer science from Wuhan University of Science and Technology (China) in 2001. He entered there for the M.S. degree in 1999. His research interests include grid computing, peer-to-peer computing, and web services.

SUN JianHua received her B.S. degree from the School of Computer, Henan University (China) in 2000, entered Huazhong University of Science and Technology (China) for the M.S. degree in 2000, and is now a Ph.D. candidate in the School of Computer, HUST. Her research interests include computer and network security, and data mining.

WU Song received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology (China) in 2002. He is an associate professor at the same university. He has led several national projects, such as NSFC and 863 projects. He has published more than 10 scientific papers. His research interests include cluster/grid computing, storage systems, and multimedia servers.

About this article

Cite this article

Jin, H., Zou, D., Chen, H. et al. Fault-tolerant grid architecture and practice. J. Comput. Sci. & Technol. 18, 423–433 (2003). https://doi.org/10.1007/BF02948916

Received: 07 January 2003
Revised: 28 March 2003
Issue Date: July 2003
DOI: https://doi.org/10.1007/BF02948916
