Autonomous recovery in componentized Internet applications


In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recovering from transient and intermittent software failures, without requiring application modifications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection and localization, microrebooting for rapid recovery, and external management of recovery actions. The individual techniques are autonomous and work across a wide range of componentized Internet applications, making them well-suited to the rapidly changing software of Internet services. The proposed framework has been integrated with JBoss, an open-source J2EE application server. Our prototype provides an execution platform that can automatically recover J2EE applications within seconds of the manifestation of a fault. Our system can provide a subset of a system's active end users with the illusion of continuous uptime, in spite of failures occurring behind the scenes, even when there is no functional redundancy in the system.


Author information



Corresponding author

Correspondence to George Candea.

About this article

Cite this article

Candea, G., Kiciman, E., Kawamoto, S. et al. Autonomous recovery in componentized Internet applications. Cluster Comput 9, 175–190 (2006).


  • Operating System
  • Communication Network
  • Fault Detection
  • Rapid Recovery
  • Application Server