Science China Information Sciences, Volume 55, Issue 12, pp 2757–2773

Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs

  • HaiBo Mi
  • HuaiMin Wang
  • YangFan Zhou
  • Michael R. Lyu
  • Hua Cai
Research Paper · Progress of Projects Supported by NSFC


It is hard to localize the primary causes of performance anomalies in cloud computing systems because of the complex interactions among components. The hidden correlations among the huge number of request execution paths in such systems, however, carry useful information for diagnosing performance anomalies. We propose an approach that localizes anomalous invoked methods and their physical locations by leveraging request trace logs. It involves two steps: (1) cluster the requests according to their call sequences, identify anomalous requests with principal component analysis (PCA), and then pick out anomalous methods with the Mann-Whitney hypothesis test; (2) compare the behavior similarities of all replicated instances of the anomalous methods with the Jensen-Shannon divergence, and select the instances whose behaviors differ from those of the others as the final culprits of the performance anomalies. We validate our approach with four real-world cases from Alibaba Cloud Computing Inc. The results demonstrate that our approach can locate the prime causes of performance anomalies with low false-positive and false-negative rates.
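Step (2) of the approach can be illustrated with a minimal sketch: given per-replica latency samples for an anomalous method, build a normalized latency histogram per replica, compute pairwise Jensen-Shannon divergences, and flag replicas whose distributions diverge from the rest. The function names, bin edges, and the 0.5 flagging threshold below are illustrative assumptions, not details taken from the paper.

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence for discrete distributions (0*log 0 := 0).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def histogram(samples, edges):
    # Normalized latency histogram over fixed bin edges [edges[i], edges[i+1]).
    counts = [0] * (len(edges) - 1)
    for s in samples:
        for i in range(len(edges) - 1):
            if edges[i] <= s < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]

def outlier_replicas(latencies_by_replica, edges, threshold=0.5):
    """Flag replicas whose latency distribution diverges from the others.

    For each replica, average its pairwise JS divergence against every other
    replica; replicas above `threshold` (a tunable assumption here) are
    returned as candidate culprits.
    """
    hists = {r: histogram(s, edges) for r, s in latencies_by_replica.items()}
    flagged = []
    for r, h in hists.items():
        others = [js_divergence(h, h2) for r2, h2 in hists.items() if r2 != r]
        if sum(others) / len(others) > threshold:
            flagged.append(r)
    return flagged

# Hypothetical example: four replicas of one anomalous method; vm4 is slow.
latencies = {
    "vm1": [3, 5, 7, 4, 6],
    "vm2": [4, 6, 5, 7, 3],
    "vm3": [5, 4, 6, 3, 7],
    "vm4": [32, 35, 31, 38, 36],
}
print(outlier_replicas(latencies, edges=[0, 10, 20, 30, 40]))  # → ['vm4']
```

The averaging step assumes a majority of replicas behave normally: a healthy replica's mean divergence stays low because most of its peers resemble it, while the culprit diverges from everyone.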


Keywords: cloud computing systems · performance anomalies · request trace logs · fault localization





Copyright information

© Science China Press and Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • HaiBo Mi (1)
  • HuaiMin Wang (1)
  • YangFan Zhou (2)
  • Michael R. Lyu (2)
  • Hua Cai (3)

  1. National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
  2. Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
  3. Computing Platform, Alibaba Cloud Computing Company, Hangzhou, China
