Multi-source Distributed System Data for AI-Powered Analytics

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12054)

Abstract

The emerging field of Artificial Intelligence for IT Operations (AIOps) uses monitoring data, big data platforms, and machine learning to automate operations and maintenance (O&M) tasks in complex IT systems. The available research data usually contain only a single source of information, often logs or metrics. Because single-source data cannot describe the precise state of a distributed system, methods built on it fail to make effective use of joint information and therefore produce a large number of false predictions. The current data thus limit the possibilities for greater advances in AIOps research. To overcome these constraints, we created a complex distributed system testbed that generates multi-source data composed of distributed traces, application logs, and metrics. This paper provides detailed descriptions of the infrastructure, testbed, experiments, and statistics of the generated data. Furthermore, it identifies how such data can serve as a stepping stone for the development of novel methods for O&M tasks such as anomaly detection, root cause analysis, and remediation.

The data from the testbed and its code are available at https://zenodo.org/record/3549604.

Keywords

AIOps, Distributed system, Dataset, Tracing, Metrics, Logs, Anomaly detection, Root-cause analysis

Copyright information

© IFIP International Federation for Information Processing 2020

Authors and Affiliations

  1. Technische Universität Berlin, Berlin, Germany
  2. Huawei Munich Research Center, Munich, Germany
  3. Department of Informatics Engineering/CISUC, University of Coimbra, Coimbra, Portugal
