Utility-Driven Proactive Management of Availability in Enterprise-Scale Information Flows

  • Zhongtang Cai
  • Vibhore Kumar
  • Brian F. Cooper
  • Greg Eisenhauer
  • Karsten Schwan
  • Robert E. Strom
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4290)


Enterprises rely critically on the timely and sustained delivery of information. To support this need, we augment information flow middleware with new functionality that provides high levels of availability to distributed applications while maximizing the utility end users derive from such information. Specifically, the paper presents utility-driven ‘proactive availability-management’ techniques that offer (1) information flows that dynamically self-determine their availability requirements based on high-level utility specifications, (2) flows that can trade recovery time for performance based on the ‘perceived’ stability of the underlying system and on failure predictions (early alarms) for it, and (3) methods, based on real-world case studies, to deal with both transient and non-transient failures. Utility-driven ‘proactive availability-management’ is integrated into information flow middleware and used with representative applications. Experiments reported in the paper demonstrate the middleware's capability to self-determine availability guarantees, to offer improved performance compared with a statically configured system, and to be resilient to a wide range of faults.
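
The abstract does not say how the failure predictions (early alarms) are produced, but the keywords list a Sequential Probability Ratio Test. As an illustration only, and not the authors' implementation, the following is a minimal sketch of Wald's classical SPRT applied to a monitored metric, assuming Gaussian noise with known standard deviation. The function name `sprt_gaussian`, the parameters `mu0` (healthy mean), `mu1` (degraded mean), `sigma`, and the error rates `alpha` and `beta` are all hypothetical choices made for this example.

```python
import math


def sprt_gaussian(samples, mu0, mu1, sigma, alpha=0.01, beta=0.01):
    """Wald's SPRT for a shift in the mean of Gaussian observations.

    Hypothetical setup: H0 says the metric has mean mu0 (healthy),
    H1 says it has mean mu1 (degraded). alpha is the tolerated
    false-alarm rate and beta the tolerated missed-alarm rate.
    Returns a (decision, samples_used) pair.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0                              # cumulative log-likelihood ratio
    used = 0
    for x in samples:
        used += 1
        # Log-likelihood ratio of one Gaussian observation, H1 versus H0.
        llr += (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
        if llr >= upper:
            return "alarm", used
        if llr <= lower:
            return "healthy", used
    return "undecided", used


if __name__ == "__main__":
    # Readings sitting at the degraded mean accumulate evidence for H1.
    print(sprt_gaussian([2.0, 2.0, 2.0, 2.0], mu0=0.0, mu1=2.0, sigma=1.0))
```

A sequential test of this kind reaches a decision as soon as the accumulated evidence crosses a threshold, which is why it is a natural fit for early alarms: the number of samples consumed adapts to how clearly the metric has drifted.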


Keywords: Active Node; Failure Prediction; Sequential Probability Ratio Test; Upstream Node; Availability Requirement



Copyright information

© IFIP International Federation for Information Processing 2006

Authors and Affiliations

  • Zhongtang Cai (1)
  • Vibhore Kumar (1)
  • Brian F. Cooper (1)
  • Greg Eisenhauer (1)
  • Karsten Schwan (1)
  • Robert E. Strom (2)

  1. College of Computing, Georgia Institute of Technology, Atlanta, USA
  2. IBM Watson Research Center, Hawthorne, USA