Cluster Computing, Volume 9, Issue 4, pp 385–399

Discovering likely invariants of distributed transaction systems for autonomic system management

Abstract

Large amounts of monitoring data can be collected from distributed systems as observables for analyzing system behavior. However, without reasonable models to characterize these systems, such monitoring data can hardly be interpreted effectively for system management. In this paper, a new concept named flow intensity is introduced to measure the intensity with which internal monitoring data reacts to the volume of user requests in distributed transaction systems. We propose a novel approach to automatically model and search for relationships between the flow intensities measured at various points across the system. If the modeled relationships hold all the time, they are regarded as invariants of the underlying system. Experimental results from a real system demonstrate that such invariants widely exist in distributed transaction systems. Further, we discuss how such invariants can be used to characterize complex systems and support autonomic system management.
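
The idea of treating a relationship that keeps holding as a likely invariant can be illustrated with a minimal sketch. It assumes a plain linear least-squares model between two flow-intensity series; the abstract does not state the model form beyond the keyword "Regression model", so this is not the paper's method. The names `fit_linear`, `fitness`, and `is_likely_invariant`, the window size, and the threshold are all hypothetical choices made for illustration.

```python
from statistics import mean


def fit_linear(x, y):
    """Ordinary least squares fit of y ~ a * x + b (assumed model form)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance; cannot fit a slope")
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b = my - a * mx
    return a, b


def fitness(x, y, a, b):
    """Normalized fit score: 1.0 is a perfect fit, lower values are worse."""
    my = mean(y)
    resid = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)) ** 0.5
    spread = sum((yi - my) ** 2 for yi in y) ** 0.5
    return 1.0 - resid / spread if spread else 0.0


def is_likely_invariant(x, y, window=4, threshold=0.9):
    """Fit on the first window, then require that same model to keep
    fitting every later window (hypothetical window and threshold)."""
    a, b = fit_linear(x[:window], y[:window])
    for start in range(window, len(x) - window + 1, window):
        xs = x[start:start + window]
        ys = y[start:start + window]
        if fitness(xs, ys, a, b) < threshold:
            return False
    return True


# Hypothetical flow intensities sampled at two monitoring points.
requests = [10, 20, 30, 40, 15, 25, 35, 45, 12, 22, 32, 42]
sql_queries = [3 * r + 2 for r in requests]          # tracks request volume
unrelated = [5, 9, 2, 7, 8, 1, 6, 3, 4, 9, 2, 5]     # does not track it

print(is_likely_invariant(requests, sql_queries))
print(is_likely_invariant(requests, unrelated))
```

In this sketch, a pair of measurements is kept as a candidate invariant only while the model learned on early data continues to explain later data. A relationship that fails on any later window is discarded.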

Keywords

System management, Distributed transaction systems, Flow intensity, Regression model, Invariants

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. NEC Laboratories America, Princeton
