Chapter

Architecting Dependable Systems VII

Volume 6420 of the series Lecture Notes in Computer Science, pp. 201-226

ASDF: An Automated, Online Framework for Diagnosing Performance Problems

  • Keith Bare (Lancaster University; Carnegie Mellon University)
  • Soila P. Kavulya (Lancaster University; Carnegie Mellon University)
  • Jiaqi Tan (Carnegie Mellon University; DSO National Laboratories)
  • Xinghao Pan (Carnegie Mellon University; DSO National Laboratories)
  • Eugene Marinelli (Lancaster University; Carnegie Mellon University)
  • Michael Kasick (Lancaster University; Carnegie Mellon University)
  • Rajeev Gandhi (Lancaster University; Carnegie Mellon University)
  • Priya Narasimhan (Lancaster University; Carnegie Mellon University)

Abstract

Performance problems account for a significant percentage of documented failures in large-scale distributed systems, such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF, an online diagnostic framework that transparently monitors and analyzes different time-varying data sources (e.g., OS performance counters, Hadoop logs) and narrows down performance problems to a specific node or a set of nodes. ASDF’s flexible architecture allows system administrators to easily customize data sources and analysis modules for their unique operating environments. We demonstrate the effectiveness of ASDF’s diagnostics on documented performance problems in Hadoop; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time and achieves a balanced accuracy of 80% at localizing problems to the culprit node.
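
For readers unfamiliar with the metric, balanced accuracy is conventionally the mean of the true-positive rate and the true-negative rate, which keeps the score from being dominated by the many problem-free nodes. The sketch below assumes that standard definition; the abstract does not spell out the exact evaluation procedure, and the function name and node counts are hypothetical, chosen only for illustration.

```python
def balanced_accuracy(true_pos: int, false_neg: int,
                      true_neg: int, false_pos: int) -> float:
    """Return the mean of the true-positive and true-negative rates.

    Assumes the standard definition of balanced accuracy; the counts
    passed in are per-node diagnosis outcomes.
    """
    tpr = true_pos / (true_pos + false_neg)   # culprit nodes correctly flagged
    tnr = true_neg / (true_neg + false_pos)   # healthy nodes correctly cleared
    return (tpr + tnr) / 2.0


# Hypothetical counts, NOT taken from the paper's experiments.
example = balanced_accuracy(true_pos=7, false_neg=3, true_neg=90, false_pos=10)
print(f"balanced accuracy: {example:.2f}")
```

With these made-up counts, a diagnoser that flags 7 of 10 culprit nodes and clears 90 of 100 healthy nodes scores well below its raw accuracy, which is the reason to report the balanced figure when faulty nodes are rare.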