Fault-tolerant grid architecture and practice

Journal of Computer Science and Technology

Abstract

Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems over wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems, and unreliable fault detection is one of the most effective. Globus, a grid middleware, manages resources in a wide area network. The Globus fault detection service uses well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. This paper presents a fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service. The platform offers effective strategies in three areas: grid key components, user tasks, and high-level applications.
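
As a rough illustration of what "unreliable" means for a fault detector, the sketch below assumes a heartbeat-and-timeout design: a component is suspected once no heartbeat has arrived within a timeout, and the suspicion is withdrawn if a later heartbeat shows up. This is not the FTGP or Globus implementation; the class name `HeartbeatDetector`, its methods, and the injected `clock` parameter are hypothetical choices made only for this example.

```python
import time
from typing import Callable, Dict, Set


class HeartbeatDetector:
    """Timeout-based unreliable failure detector (illustrative sketch).

    A component is suspected when no heartbeat has arrived within
    `timeout` seconds. A suspicion may be wrong (a slow link looks like
    a crash) and is dropped as soon as a newer heartbeat is recorded.
    """

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, component: str) -> None:
        # Record the arrival time of an "I am alive" message.
        self.last_seen[component] = self.clock()

    def suspects(self) -> Set[str]:
        # Every monitored component whose last heartbeat is older
        # than the timeout is currently suspected of having failed.
        now = self.clock()
        return {c for c, t in self.last_seen.items()
                if now - t > self.timeout}
```

Because a suspicion rests only on elapsed time, a component that is merely slow can be suspected and later cleared. That possibility of a wrong answer is the sense in which such a detector is unreliable.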

Author information

Corresponding author

Correspondence to Hai Jin.

Additional information

This work is supported by the National Natural Science Foundation of China under Grants No.60125208 and No.60273076.

JIN Hai is a professor of computer science and engineering at the Huazhong University of Science and Technology (HUST) in China. He received his Ph.D. degree in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service (DAAD) fellowship to visit the Technical University of Chemnitz in Germany. He worked at the University of Hong Kong between 1998 and 2000 and participated in the HKU Cluster project. He was a visiting scholar at the Internet and Cluster Computing Laboratory at the University of Southern California from 1999 to 2000. He is a member of IEEE and ACM and an associate editor of International Journal of Parallel and Distributed Systems and Networks. He has been a guest editor of Journal of Parallel and Distributed Computing, Future Generation of Computer Systems, Cluster Computing, and Calculateur Parallèle. He is an executive member and region coordinator of the IEEE Task Force on Cluster Computing (TFCC), a member of the IASTED technical committee on Parallel & Distributed Computing and Systems, and a steering committee member of the IEEE/ACM International Symposium on Cluster Computing and Grid (CCGrid). He served as program vice-chair of the First IEEE/ACM International Symposium on Cluster Computing and Grid (CCGrid'01) and the Fourth International Conference on Parallel and Distributed Computing, Applications, and Technologies (PDCAT 2003), and as conference chair of the International Workshop on Internet Computing and E-Commerce from 2001 to 2003. He has also served as a program committee member for more than 30 international conferences and workshops. He has co-authored four books and published over 50 research papers. His research interests include parallel I/O, high performance storage systems, cluster computing and grid computing, network security, and fault tolerance.

ZOU DeQing received his B.S. degree in computer science from Fuzhou University (China) in 1997 and entered Huazhong University of Science and Technology (China) for an M.S. degree in 1999. He is currently a Ph.D. candidate in the School of Computer, HUST. His research interests include grid computing, peer-to-peer computing, the semantic web, operating systems, and parallel program design.

CHEN HanHua received his B.S. degree in computer science from Wuhan University of Science and Technology (China) in 2001. He entered there for the M.S. degree in 1999. His research interests include grid computing, peer-to-peer computing, and web services.

SUN JianHua received her B.S. degree from the School of Computer, Henan University (China) in 2000, entered Huazhong University of Science and Technology (China) for the M.S. degree in 2000, and is now a Ph.D. candidate in the School of Computer, HUST. Her research interests include computer and network security, and data mining.

WU Song received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology (China) in 2002. He is an associate professor at the same university. He has been leading several national projects such as NSFC and 863. He has published more than 10 scientific papers. His research interests include cluster/grid computing, storage systems, and multimedia servers.

About this article

Cite this article

Jin, H., Zou, D., Chen, H. et al. Fault-tolerant grid architecture and practice. J. Comput. Sci. & Technol. 18, 423–433 (2003). https://doi.org/10.1007/BF02948916

  • DOI: https://doi.org/10.1007/BF02948916
