Framework for Enabling System Understanding

  • J. Brandt
  • F. Chen
  • A. Gentile
  • Chokchai (Box) Leangsuksun
  • J. Mayo
  • P. Pebay
  • D. Roe
  • N. Taerat
  • D. Thompson
  • M. Wong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)

Abstract

Building the effective HPC resilience mechanisms required for the viability of next-generation supercomputers will require an in-depth understanding of system and component behaviors. Our goal is to build an integrated framework for high-fidelity long-term information storage, historic and run-time analysis, and algorithmic and visual information exploration, in order to enable system understanding, timely failure detection and prediction, and the triggering of appropriate responses to failure situations. Since it is unknown what information is relevant, and since potentially relevant data may be expressed in a variety of forms (e.g., numeric, textual), this framework must provide capabilities to process different forms of data and also support the integration of new data, data sources, and analysis capabilities. Further, to ensure ease of use as capabilities and data sources expand, it must also provide interactivity between its elements. This paper describes our integration of the capabilities mentioned above into our OVIS tool.

Keywords

resilience, HPC, system monitoring

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • J. Brandt (1)
  • F. Chen (1)
  • A. Gentile (1)
  • Chokchai (Box) Leangsuksun (2)
  • J. Mayo (1)
  • P. Pebay (1)
  • D. Roe (1)
  • N. Taerat (2)
  • D. Thompson (1)
  • M. Wong (1)

  1. Sandia National Laboratories, Livermore, USA
  2. Louisiana Tech University, Ruston, USA
