Cluster Computing, Volume 9, Issue 2, pp 175–190

Autonomous recovery in componentized Internet applications

  • George Candea
  • Emre Kiciman
  • Shinichi Kawamoto
  • Armando Fox


In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recovering from transient and intermittent software failures, without requiring application modifications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection and localization, microrebooting for rapid recovery, and external management of recovery actions. The individual techniques are autonomous and work across a wide range of componentized Internet applications, making them well-suited to the rapidly changing software of Internet services. The proposed framework has been integrated with JBoss, an open-source J2EE application server. Our prototype provides an execution platform that can automatically recover J2EE applications within seconds of the manifestation of a fault. Our system can provide a subset of a system's active end users with the illusion of continuous uptime, in spite of failures occurring behind the scenes, even when there is no functional redundancy in the system.
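The division of labor described in the abstract (detect and localize a fault, recover only the faulty component, and drive recovery from outside the application) can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the authors' prototype and not JBoss code: the names `Component`, `Detector`, and `RecoveryManager` are hypothetical, and a fixed failure-rate threshold stands in for the statistical macroanalysis the paper refers to.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for one application component."""
    name: str
    healthy: bool = True
    reboots: int = 0

    def handle(self) -> bool:
        """Serve one request; return True on success."""
        return self.healthy

    def microreboot(self) -> None:
        """Reinitialize only this component; the others keep running."""
        self.healthy = True
        self.reboots += 1


@dataclass
class Detector:
    """Toy localization: flag components whose failure rate exceeds a threshold.

    Assumption: a fixed threshold replaces the paper's macroanalysis.
    """
    threshold: float = 0.5
    stats: dict = field(default_factory=dict)  # name -> [failures, requests]

    def observe(self, name: str, ok: bool) -> None:
        failures, total = self.stats.get(name, [0, 0])
        self.stats[name] = [failures + (0 if ok else 1), total + 1]

    def suspects(self) -> list:
        return [n for n, (f, t) in self.stats.items() if t and f / t > self.threshold]

    def reset(self, name: str) -> None:
        """Forget past observations for a component that was just recovered."""
        self.stats.pop(name, None)


class RecoveryManager:
    """Sits outside the application and maps detector reports to recovery actions."""

    def __init__(self, components, detector):
        self.components = {c.name: c for c in components}
        self.detector = detector

    def step(self) -> list:
        """Microreboot every currently suspected component; return their names."""
        recovered = []
        for name in self.detector.suspects():
            self.components[name].microreboot()
            self.detector.reset(name)
            recovered.append(name)
        return recovered


if __name__ == "__main__":
    cart, catalog = Component("ShoppingCart"), Component("Catalog")
    detector = Detector()
    manager = RecoveryManager([cart, catalog], detector)
    cart.healthy = False  # inject a transient fault into one component
    for _ in range(4):
        for c in (cart, catalog):
            detector.observe(c.name, c.handle())
    print(manager.step())
```

In this sketch only the component flagged by the detector is restarted, so requests that touch the other component are never interrupted, which mirrors the abstract's point about preserving the illusion of uptime for a subset of users.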


Keywords: Operating System, Communication Network, Fault Detection, Rapid Recovery, Application Server





Copyright information

© Springer Science + Business Media, Inc. 2006

Authors and Affiliations

  • George Candea
    • 1
  • Emre Kiciman
    • 1
  • Shinichi Kawamoto
    • 1
  • Armando Fox
    • 1
  1. Computer Systems Lab, Stanford University, USA
