Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11236)

Abstract

Driven by emerging business models (e.g., digital sales) and IT technologies (e.g., DevOps and cloud computing), software architecture is rapidly shifting from monolithic to microservice designs. Microservices significantly accelerate software development and delivery. However, with many microservices running in a dynamic cloud environment and interacting in complex ways, identifying and locating abnormal services is extraordinarily difficult. This paper presents a novel system named “Microscope” that identifies and locates abnormal services and produces a ranked list of possible root causes in microservice environments. Without instrumenting the source code of the microservices, Microscope efficiently constructs a service causal graph and infers the causes of performance problems in real time. Experimental evaluations in a microservice benchmark environment show that Microscope achieves good diagnostic results, namely 88% precision and 80% recall, which are higher than those of several state-of-the-art methods. It also scales well to large-scale microservice systems.
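
The idea of walking a service causal graph to produce a ranked list of root-cause candidates can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the method described in the paper: the graph is taken as given (a hypothetical mapping from each service to the services that may causally affect it), anomaly flags are supplied by the caller, candidates are abnormal services with no abnormal causes of their own, and the ranking score is the absolute Pearson correlation with the front-end latency series. The service names and metric values are invented.

```python
from collections import deque
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_root_causes(causal_graph, abnormal, metrics, frontend):
    """Return root-cause candidates, most suspicious first.

    causal_graph: hypothetical dict, service -> services that may affect it.
    abnormal: set of services currently flagged as abnormal (assumed input).
    metrics: dict, service -> latency samples (assumed input).
    """
    candidates = []
    seen = {frontend}
    queue = deque([frontend])
    while queue:
        service = queue.popleft()
        abnormal_causes = [c for c in causal_graph.get(service, []) if c in abnormal]
        # An abnormal service with no abnormal cause behind it is a candidate.
        if service in abnormal and not abnormal_causes:
            candidates.append(service)
        # Keep following the chain of abnormal services only.
        for cause in abnormal_causes:
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    # Assumed scoring rule: similarity to the front-end latency series.
    return sorted(
        candidates,
        key=lambda s: abs(pearson(metrics[s], metrics[frontend])),
        reverse=True,
    )


# Invented example data.
causal_graph = {
    "frontend": ["cart", "catalogue"],
    "cart": ["cart-db"],
    "catalogue": ["catalogue-db"],
}
abnormal = {"frontend", "cart", "cart-db", "catalogue"}
metrics = {
    "frontend": [100, 120, 300, 420, 500],
    "cart": [50, 60, 150, 200, 260],
    "cart-db": [10, 12, 31, 40, 52],
    "catalogue": [20, 45, 22, 48, 25],
}
ranked = rank_root_causes(causal_graph, abnormal, metrics, "frontend")
print(ranked)
```

In this toy setup the traversal stops at `cart-db` and `catalogue`, because neither has an abnormal cause behind it, and `cart-db` is ranked first because its latency tracks the front-end series far more closely.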

Keywords

Microservice · Kubernetes · Root cause analytics · Cloud computing

Notes

Acknowledgment

The work described in this paper was supported by the National Key R&D Program of China (2018YFB1004804), the National Natural Science Foundation of China (61722214) and the Guangdong Province Universities and Colleges Pearl River Scholar Funded Scheme 2016.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
