A Systematic Mapping Study in AIOps

  • Conference paper
Service-Oriented Computing – ICSOC 2020 Workshops (ICSOC 2020)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12632)

Abstract

Today's IT systems are becoming larger and more complex, which makes human supervision increasingly difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges using AI and Big Data. However, past AIOps contributions are scattered, unorganized, and lack a common terminology, which makes them impractical to discover and compare. In this work, we conduct an in-depth mapping study to collect and organize the numerous scattered contributions to AIOps in a single reference index. We create an AIOps taxonomy to build a foundation for future contributions and to allow efficient comparison of AIOps papers that treat similar problems. We investigate temporal trends and classify AIOps contributions by their choice of algorithms, data sources, and target components. Our results show a recent and growing interest in AIOps, especially in contributions that address failure-related tasks (62%), such as anomaly detection and root cause analysis.

Author information

Corresponding author

Correspondence to Paolo Notaro.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Notaro, P., Cardoso, J., Gerndt, M. (2021). A Systematic Mapping Study in AIOps. In: Hacid, H., et al. Service-Oriented Computing – ICSOC 2020 Workshops. ICSOC 2020. Lecture Notes in Computer Science, vol 12632. Springer, Cham. https://doi.org/10.1007/978-3-030-76352-7_15

  • DOI: https://doi.org/10.1007/978-3-030-76352-7_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76351-0

  • Online ISBN: 978-3-030-76352-7

  • eBook Packages: Computer Science, Computer Science (R0)
