Frontiers of Computer Science, Volume 7, Issue 3, pp 431–445

An online service-oriented performance profiling tool for cloud computing systems

  • Haibo Mi
  • Huaimin Wang
  • Yangfan Zhou
  • Michael Rung-Tsong Lyu
  • Hua Cai
  • Gang Yin
Research Article


The growing scale and complexity of component interactions in cloud computing systems pose great challenges for operators trying to understand the characteristics of system performance. Profiling has long proven to be an effective approach to performance analysis; however, existing approaches face new challenges that emerge in cloud computing systems. First, the efficiency of profiling becomes a critical concern; second, service-oriented profiling is needed to support separation-of-concerns performance analysis. To address these issues, we present P-Tracer, an online performance profiling tool specifically tailored for cloud computing systems. First, P-Tracer constructs a dedicated search engine that proactively processes performance logs and generates an index for fast queries; second, for each service, P-Tracer derives statistical insight into performance characteristics along multiple dimensions and provides operators with a suite of web-based interfaces for querying the critical information. We evaluate P-Tracer in terms of tracing overhead, data preprocessing scalability, and querying efficiency. Three real-world case studies from the Alibaba cloud computing platform demonstrate that P-Tracer can help operators understand software behaviors and localize the primary causes of performance anomalies effectively and efficiently.


Keywords: cloud computing, performance profiling, performance anomaly, visual analytics





Copyright information

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Haibo Mi (1)
  • Huaimin Wang (1)
  • Yangfan Zhou (2)
  • Michael Rung-Tsong Lyu (2)
  • Hua Cai (3)
  • Gang Yin (1)
  1. National Lab for Parallel & Distributed Processing, National University of Defense Technology, Changsha, China
  2. Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
  3. Computing Platform, Alibaba Cloud Computing Company, Hangzhou, China