An automatic model management system and its implementation for AIOps on microservice platforms

The Journal of Supercomputing

Abstract

As applications based on microservice architectures expand, the complexity of system operation and maintenance grows significantly. With the advent of AIOps, machine learning models can now automatically monitor system state, allocate resources, issue warnings, and detect anomalies. Because online workloads are dynamic, the running state of a microservice system in production is constantly in flux. AIOps models must therefore be continuously trained, packaged, and deployed against the current system status so that they adapt to the system environment. To accomplish this and simplify the process, this paper proposes a model update and management pipeline framework for AIOps models in microservice systems. In addition, a prototype system based on Kubernetes and GitLab is designed to provide a preliminary implementation and validation of the framework. The system consists of three components: model training, model packaging, and model deployment. Parallelization and parameter search are incorporated into the model training procedure to enable rapid training of multiple models and automated hyperparameter tuning. The packaging and deployment process is automated using continuous integration technology. Experiments are conducted to validate the prototype system, and the results demonstrate the feasibility of the proposed framework. This work serves as a useful resource for constructing an integrated and streamlined AIOps model management system.
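The training stage described above combines parallelization with parameter search. The following is a minimal sketch of that idea only, not the paper's implementation: `train_and_score` is a hypothetical stand-in that scores a moving-average forecaster, the `window` parameter and the sample series are invented for illustration, and threads stand in for whatever parallel workers (processes, containers, cluster jobs) a real pipeline would use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params, series):
    """Hypothetical stand-in for model training: evaluate a moving-average
    forecaster with the given window and return its mean absolute error."""
    window = params["window"]
    errors = []
    for i in range(window, len(series)):
        forecast = sum(series[i - window:i]) / window
        errors.append(abs(series[i] - forecast))
    return sum(errors) / len(errors)


def parallel_grid_search(grid, series, max_workers=4):
    """Evaluate every parameter combination concurrently and return the
    (params, score) pair with the lowest error."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda p: train_and_score(p, series), candidates))
    return min(zip(candidates, scores), key=lambda pair: pair[1])


# Illustrative data: a linearly increasing metric.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
best_params, best_score = parallel_grid_search({"window": [1, 2, 3]}, series)
print(best_params, best_score)
```

In a real setting each candidate would train an actual model, so CPU-bound work would likely call for separate processes or cluster jobs rather than threads; the selection logic stays the same.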

Notes

  1. Kubeflow - https://www.kubeflow.org.

  2. MLflow - https://MLflow.org.

  3. S2I - https://github.com/openshift/source-to-image.

  4. Argo - https://argoproj.github.io.

  5. Harbor - https://goharbor.io.

  6. Determined - https://github.com/determined-ai/determined.

  7. ARIMA - https://github.com/jixinpu/aiopstools/blob/master/aiopstools.

  8. Transformer Time Series Prediction - https://github.com/oliverguhr/transformer-time-series-prediction.

Acknowledgements

This paper was supported by the National Key R&D Program of China (Funding No. 2021ZD0110601) and the State Key Laboratory of Software Development Environment (Funding No. SKLSDE-2020ZX-01).

Author information

Corresponding author

Correspondence to Ruibo Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, R., Pu, Y., Shi, B. et al. An automatic model management system and its implementation for AIOps on microservice platforms. J Supercomput 79, 11410–11426 (2023). https://doi.org/10.1007/s11227-023-05123-4

  • DOI: https://doi.org/10.1007/s11227-023-05123-4