Abstract
Today's large-scale information technology environments are complex, heterogeneous compositions that often suffer from unpredictable behavior and poor manageability. This has fostered substantial research on designs and techniques that equip these systems with autonomous behavior. In this survey, we focus on the self-healing branch of that research and give an overview of existing approaches. The survey opens with an outline of the origins of self-healing. Drawing on the principles of autonomic computing and self-adaptive systems research, we identify the fundamental principles of self-healing systems. These principles then guide our analysis of the collected approaches. In a final discussion, we summarize the characteristics the approaches share and those that set them apart. A comprehensive tabular overview of the surveyed material concludes the survey.
Additional information
Communicated by C.H. Cap.
Cite this article
Psaier, H., Dustdar, S. A survey on self-healing systems: approaches and systems. Computing 91, 43–73 (2011). https://doi.org/10.1007/s00607-010-0107-y
DOI: https://doi.org/10.1007/s00607-010-0107-y
Keywords
- Autonomous behaving systems
- Autonomic computing
- Self-adaptive systems
- Self-* properties
- Self-healing principles
- Self-healing approaches
- Survey