ASDF: An Automated, Online Framework for Diagnosing Performance Problems

  • Keith Bare
  • Soila P. Kavulya
  • Jiaqi Tan
  • Xinghao Pan
  • Eugene Marinelli
  • Michael Kasick
  • Rajeev Gandhi
  • Priya Narasimhan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6420)

Abstract

Performance problems account for a significant percentage of documented failures in large-scale distributed systems, such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF, an online diagnostic framework that transparently monitors and analyzes different time-varying data sources (e.g., OS performance counters, Hadoop logs) and narrows down performance problems to a specific node or a set of nodes. ASDF’s flexible architecture allows system administrators to easily customize data sources and analysis modules for their unique operating environments. We demonstrate the effectiveness of ASDF’s diagnostics on documented performance problems in Hadoop; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time and achieves a balanced accuracy of 80% at localizing problems to the culprit node.
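The abstract does not define balanced accuracy. It is commonly taken to be the mean of the true-positive rate and the true-negative rate, which prevents a diagnoser from scoring well simply by labeling every node as healthy. The following is a minimal sketch under that assumption; the function name and the example counts are hypothetical and are not taken from the paper's evaluation.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of true-positive rate and true-negative rate.

    Assumed definition; the abstract itself does not spell it out.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one faulty and one healthy sample")
    tpr = tp / (tp + fn)  # fraction of faulty nodes correctly flagged
    tnr = tn / (tn + fp)  # fraction of healthy nodes correctly left unflagged
    return (tpr + tnr) / 2


# Hypothetical counts for illustration only (not results from the paper).
score = balanced_accuracy(tp=7, fn=3, tn=90, fp=10)
print(f"{score:.2f}")
```

Under this definition, a diagnoser that never flags any node gets a true-positive rate of 0 and a true-negative rate of 1, so it scores 0.5 no matter how rare faults are.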

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Keith Bare (1)
  • Soila P. Kavulya (1)
  • Jiaqi Tan (2)
  • Xinghao Pan (2)
  • Eugene Marinelli (1)
  • Michael Kasick (1)
  • Rajeev Gandhi (1)
  • Priya Narasimhan (1)

  1. Carnegie Mellon University, Pittsburgh, USA
  2. DSO National Laboratories, Singapore
