Anomaly Detection and Levels of Automation for AI-Supported System Administration

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1070)


Artificial Intelligence for IT Operations (AIOps) describes the process of maintaining and operating large IT infrastructures using AI-supported methods and tools on different levels. This includes automated anomaly detection and root cause analysis, remediation and optimization, as well as fully automated initiation of self-stabilizing activities. While automation is mandatory given the system complexity and the criticality of QoS-bounded responses, the measures compiled and deployed by the AI-controlled administration are not always easy to understand or reproduce. Therefore, explainability of the actions taken by automated systems is becoming a regulatory requirement for future IT infrastructures. In this paper we present ZerOps, a system we developed and deployed, as an example of the design of the corresponding architecture, tools, and methods. The system applies deep learning models and data analytics to monitoring data in order to detect and remediate anomalies.
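To make the detect-then-remediate loop concrete, the following is a minimal sketch only. It is not the ZerOps implementation: the abstract states that ZerOps uses deep learning models, whereas this sketch swaps in a simple rolling z-score over a single monitoring metric. The function names `detect_anomalies` and `remediate`, the window size, the threshold, and the sample values are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the recent window.

    Assumption: a rolling z-score stands in for the deep learning models
    mentioned in the abstract; it is not the method used by ZerOps.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Guard against a flat window, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
                continue  # keep the anomalous sample out of the baseline
        history.append(value)
    return anomalies


def remediate(index, value):
    """Hypothetical remediation hook; a real system might restart or migrate a service."""
    return f"sample {index}: value {value} flagged, remediation requested"


# Illustrative CPU load readings with one obvious spike.
cpu_load = [0.31, 0.29, 0.33, 0.30, 0.32, 0.31, 0.95, 0.30, 0.29]
actions = [remediate(i, cpu_load[i]) for i in detect_anomalies(cpu_load)]
```

In this sketch the remediation hook only returns a message. Whether such a hook acts on its own or waits for an operator to confirm is the kind of decision the levels of automation in the title refer to.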


AIOps, Predictive fault tolerance, Anomaly detection



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. TU Berlin, Berlin, Germany
