Science China Information Sciences

Volume 58, Issue 5, pp 1–12

Service fault tolerance for highly reliable service-oriented systems: an overview

  • ZiBin Zheng
  • Michael Rung Tsong Lyu
  • HuaiMin Wang
Review: Special Focus on High-Confidence Software Technologies


Service-oriented systems are widely employed in e-business, e-government, finance, management systems, and other domains. Service fault tolerance is one of the most important techniques for building highly reliable service-oriented systems. This paper provides an overview of service fault tolerance techniques in three sections: fault tolerance strategy design, fault tolerance strategy selection, and Byzantine fault tolerance. The first section introduces the design of static and dynamic fault tolerance strategies, together with the major problems that arise when designing them. The second section, building on these strategies, identifies the significant components of a complex service-oriented system and investigates algorithms for optimal fault tolerance strategy selection. The third section discusses a special type of service fault tolerance technique, namely Byzantine fault tolerance.
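As a rough illustration of what a static fault tolerance strategy can look like, the sketch below shows three common textbook patterns: retry, sequential failover (in the spirit of recovery blocks), and replica voting (in the spirit of N-version programming). The abstract does not name specific strategies, so this choice is an assumption, and every name in the code (`retry`, `failover`, `vote`, and the toy replicas) is hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical service signature used only for this sketch: int -> int.
Service = Callable[[int], int]


def retry(service: Service, arg: int, attempts: int = 3) -> int:
    """Time redundancy: invoke the same replica again after a failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return service(arg)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all retries failed") from last_error


def failover(replicas: Sequence[Service], arg: int) -> int:
    """Sequential strategy: try replicas in order until one succeeds."""
    for replica in replicas:
        try:
            return replica(arg)
        except Exception:
            continue  # fall through to the next replica
    raise RuntimeError(f"all {len(replicas)} replicas failed")


def vote(replicas: Sequence[Service], arg: int) -> int:
    """Voting strategy: call every replica, return the strict-majority answer.

    Replicas are called one after another here for simplicity; a real
    active-replication setup would normally invoke them in parallel.
    """
    results = []
    for replica in replicas:
        try:
            results.append(replica(arg))
        except Exception:
            pass  # a crashed replica casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count > len(replicas) // 2:
            return value
    raise RuntimeError("no majority among replica responses")


# Toy replicas (assumptions for the demo, not real services).
def good(x: int) -> int:
    return x * x


def crashing(x: int) -> int:
    raise ConnectionError("replica unavailable")


def wrong(x: int) -> int:
    return x * x + 1  # returns a wrong value instead of crashing


if __name__ == "__main__":
    print(failover([crashing, good], 4))
    print(vote([good, wrong, good], 4))
```

Retry and failover only mask failures that surface as errors, whereas voting can also mask a minority of replicas that return wrong values. A dynamic strategy would choose among such patterns at run time, for example based on observed response times or failure rates.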


fault tolerance, software reliability, Web service, SOA



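Service-oriented systems are widely used in e-commerce, e-government, finance, management systems, and other fields. Service fault tolerance is one of the most important techniques for building highly reliable service-oriented systems. This paper gives an overview of service fault tolerance techniques in three parts: fault tolerance strategy design, fault tolerance strategy selection, and Byzantine fault tolerance. The first part focuses on the design of static and dynamic fault tolerance strategies and on the main problems to be solved when designing service fault tolerance strategies. Given the wide variety of service fault tolerance strategies, the second part covers methods for quickly locating the critical modules of a complex service-oriented system, together with algorithms for optimal fault tolerance strategy selection. Finally, the third part discusses a special service fault tolerance technique, Byzantine fault tolerance.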


fault tolerance, software reliability, Web service, service-oriented architecture





Copyright information

© Science China Press and Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • ZiBin Zheng
    • 1
  • Michael Rung Tsong Lyu
    • 1
  • HuaiMin Wang
    • 2
  1. Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong, China
  2. National Laboratory for Parallel & Distributed Processing, National University of Defense Technology, Changsha, China
