M.K. Aguilera, J.C. Mogul, J.L. Wiener, P. Reynolds, and A. Muthitacharoen, Performance debugging for distributed systems of black boxes, in: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (2003) pp. 74–89.
http://phx.corporate-ir.net/phoenix.zhtml?c=97664&p=iro-newsArticle&ID=798960&highlight=
W. Brogan, Modern Control Theory, 3rd edn (Prentice Hall, 1990).
M. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer, Path-based failure and evolution management, in: 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI ’04), San Francisco, CA (March, 2004), pp. 309–322.
I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox, Capturing, indexing, clustering, and retrieving system history, SIGOPS Oper. Syst. Rev. 39(5) (2005) 105–118.
M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, Dynamically discovering likely program invariants to support program evolution, IEEE Trans. on Software Engineering 27(2) (2001) 99–123.
J. Gertler, Fault Detection and Diagnosis in Engineering Systems (Marcel Dekker, New York, 1998).
S. Hangal and M. Lam, Tracking down software bugs using automatic anomaly detection, in: Proceedings of the 24th International Conference on Software Engineering (2002) pp. 291–301.
R. Isermann and P. Balle, Trends in the application of model-based fault detection and diagnosis of industrial process, Control Engineering Practice 5(5) (1997) 709–719.
G. Jiang, H. Chen, and K. Yoshihira, Modeling and tracking of transaction flow dynamics for fault detection in complex systems, to appear in IEEE Trans. on Dependable and Secure Computing.
L. Ljung, System Identification—Theory for The User, 2nd edn (Prentice Hall PTR, 1998).
J. O’Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) framework, Technical Report UCI-ICS 03-17, UC Irvine Information and Computer Science (2003). Available at jung.sourceforge.net
D. Oppenheimer, A. Ganapathi, and D. Patterson, Why do internet services fail, and what can be done about it, in: 4th Usenix Symposium on Internet Technologies and Systems (USITS03) (2003) pp. 1–16.
D. Patterson, A simple way to estimate the cost of downtime, in: Proceedings of LISA-2002: Sixteenth System Administration Conference (2002) pp. 185–188.
D. Patterson, A. Brown et al., Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies, Technical Report UCB//CSD-02-1175, UC Berkeley Computer Science (2002). Available at roc.cs.berkeley.edu
A. Yemini and S. Kliger, High speed and robust event correlation, IEEE Communications Magazine 34(5) (1996) 82–90.
G. Zhen, G. Jiang, H. Chen, and K. Yoshihira, Tracking probabilistic correlation of monitoring data for fault detection in complex systems, in: The International Conference on Dependable Systems and Networks (DSN2006), Philadelphia, PA (June 2006).