
A survey on self-healing systems: approaches and systems

Computing

Abstract

Today's large-scale information technology environments are complex, heterogeneous compositions that often suffer from unpredictable behavior and poor manageability. This has fostered substantial research on designs and techniques that give these systems autonomous behavior. In this survey, we focus on the self-healing branch of this research and give an overview of the existing approaches. The survey opens with an outline of the origins of self-healing. Based on the principles of autonomic computing and self-adaptive systems research, we identify the fundamental principles of self-healing systems. These principles support our analysis of the collected approaches. In a final discussion, we summarize the common and individual characteristics of the approaches. A comprehensive tabular overview of the surveyed material concludes the survey.



Author information

Corresponding author

Correspondence to Harald Psaier.

Additional information

Communicated by C.H. Cap.


About this article

Cite this article

Psaier, H., Dustdar, S. A survey on self-healing systems: approaches and systems. Computing 91, 43–73 (2011). https://doi.org/10.1007/s00607-010-0107-y
