A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9523)

Abstract

The complexity and cost of managing high-performance computing infrastructures are on the rise. Automating management and repair through predictive models, so as to minimize human intervention, is one way to increase system availability and contain these costs. Predictive models accurate enough to be useful in automatic management cannot be built on restricted log data from individual subsystems; they require a holistic approach to analyzing data from disparate sources. Here we provide a detailed multi-scale characterization study based on four datasets reporting power consumption, temperature, workload, and hardware/software events for an IBM Blue Gene/Q installation. We show that the system runs a rich parallel workload, with low correlation among its components in terms of temperature and power, but higher correlation in terms of events. As expected, power and temperature correlate strongly, while events display negative correlations with load and power. Power and workload show moderate correlations, and only at the scale of components. The aim of the study is a systematic, integrated characterization of the computing infrastructure and the discovery of correlation sources and levels, to serve as a basis for future predictive modeling efforts.
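As an illustration of the kind of cross-source correlation the abstract refers to, the following is a minimal sketch and not the authors' pipeline. It assumes each log source can be reduced to `(timestamp, value)` samples, averages them into fixed-width time buckets, and computes a Pearson coefficient over the buckets that both sources share. The function names, the 60-second bucket width, and the sample readings are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def bucket_mean(samples, width):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in samples:
        key = int(ts // width)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def correlate_sources(a, b, width):
    """Align two sample streams on their shared buckets, then correlate them."""
    ba, bb = bucket_mean(a, width), bucket_mean(b, width)
    shared = sorted(ba.keys() & bb.keys())
    return pearson([ba[k] for k in shared], [bb[k] for k in shared])


# Made-up readings as (seconds, value); these are NOT data from the paper.
power = [(0, 50.0), (30, 52.0), (60, 60.0), (90, 62.0), (120, 70.0), (150, 72.0)]
temperature = [(10, 20.0), (70, 25.0), (130, 30.0)]
r = correlate_sources(power, temperature, width=60)
```

Changing `width` is one simple way to repeat the same comparison at different time scales, since sources sampled at different rates are only compared after being reduced to a common bucket size.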

Keywords

Data science; Correlation analysis; HPC system monitoring; Log data integration; Predictive modeling

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
