Online Black-Box Failure Prediction for Mission Critical Distributed Systems

  • Roberto Baldoni
  • Giorgia Lodi
  • Luca Montanari
  • Guido Mariotta
  • Marco Rizzuto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7612)


This paper introduces a novel approach to failure prediction for mission critical distributed systems that is black-box, non-intrusive and online. The approach combines Complex Event Processing (CEP) and Hidden Markov Models (HMM) to analyze symptoms of impending failures, which appear as anomalous values of performance metrics selected for this purpose. The paper describes an architecture named CASPER, based on CEP and HMM, that predicts anomalies that can lead to software failures using only information sniffed from the communication network of a mission critical system. An instance of CASPER has been implemented, trained and tuned to monitor a real Air Traffic Control (ATC) system. An extensive experimental evaluation of CASPER is presented. The results show (i) a very low percentage of false positives under both normal and stress conditions, and (ii) a failure prediction time long enough for the system to apply appropriate recovery procedures.
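The general idea of pairing an event-processing step with an HMM can be illustrated with a minimal sketch. It is not CASPER's implementation, which the abstract does not detail. It assumes a two-state model (safe, unsafe), a single hypothetical metric turned into symbols by a fixed threshold, and made-up probabilities. The names `discretize`, `forward_filter` and `first_alert` are illustrative only.

```python
from typing import List, Optional, Sequence

# Hypothetical two-state model: index 0 = safe, index 1 = unsafe.
# All probabilities below are illustrative assumptions, not values from the paper.
INITIAL = [0.99, 0.01]
TRANSITION = [[0.95, 0.05],   # from safe   -> (safe, unsafe)
              [0.10, 0.90]]   # from unsafe -> (safe, unsafe)
EMISSION = [[0.9, 0.1],       # safe   emits (normal, anomalous)
            [0.2, 0.8]]       # unsafe emits (normal, anomalous)


def discretize(metric_values: Sequence[float], threshold: float) -> List[int]:
    """Stand-in for the event-processing step: map each metric sample
    to symbol 1 (anomalous) if it exceeds the threshold, else 0 (normal)."""
    return [1 if v > threshold else 0 for v in metric_values]


def forward_filter(observations: Sequence[int],
                   initial: Sequence[float] = INITIAL,
                   transition: Sequence[Sequence[float]] = TRANSITION,
                   emission: Sequence[Sequence[float]] = EMISSION) -> List[List[float]]:
    """Normalized HMM forward recursion: returns P(state_t | obs_1..t) for each t."""
    n = len(initial)
    beliefs: List[List[float]] = []
    prior = list(initial)
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the previous belief through the transition matrix.
            prior = [sum(beliefs[-1][i] * transition[i][j] for i in range(n))
                     for j in range(n)]
        # Update: weight each state by how likely it is to emit this symbol.
        unnormalized = [prior[j] * emission[j][obs] for j in range(n)]
        total = sum(unnormalized)
        if total == 0.0:
            raise ValueError("observation has zero probability under the model")
        beliefs.append([u / total for u in unnormalized])
    return beliefs


def first_alert(beliefs: Sequence[Sequence[float]], alert_level: float) -> Optional[int]:
    """Index of the first step where P(unsafe) reaches alert_level, or None."""
    for t, belief in enumerate(beliefs):
        if belief[1] >= alert_level:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical response times in milliseconds; 100 ms is an assumed threshold.
    samples = [40.0, 55.0, 130.0, 150.0, 170.0, 160.0]
    symbols = discretize(samples, threshold=100.0)
    beliefs = forward_filter(symbols)
    print(symbols, first_alert(beliefs, alert_level=0.8))
```

In this sketch a single anomalous sample does not raise an alert; the belief in the unsafe state has to build up over consecutive anomalous symbols, which is one way such a model can keep false positives low at the cost of a later alert.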





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Roberto Baldoni (1)
  • Giorgia Lodi (1)
  • Luca Montanari (1)
  • Guido Mariotta (1)
  • Marco Rizzuto (2)
  1. “Sapienza” University of Rome, Rome, Italy
  2. Finmeccanica Group, Selex Sistemi Integrati, Rome, Italy
