European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 231–240

Framework for Enabling System Understanding

  • J. Brandt,
  • F. Chen,
  • A. Gentile,
  • Chokchai (Box) Leangsuksun,
  • J. Mayo,
  • P. Pebay,
  • D. Roe,
  • N. Taerat,
  • D. Thompson &
  • M. Wong

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7156)

Abstract

Building the effective HPC resilience mechanisms required for the viability of next-generation supercomputers will require an in-depth understanding of system and component behaviors. Our goal is to build an integrated framework for high-fidelity, long-term information storage, historic and run-time analysis, and algorithmic and visual information exploration, in order to enable system understanding, timely failure detection and prediction, and triggering of appropriate responses to failure situations. Since it is unknown which information is relevant, and since potentially relevant data may be expressed in a variety of forms (e.g., numeric, textual), this framework must provide capabilities to process different forms of data and must also support the integration of new data, data sources, and analysis capabilities. Further, to ensure ease of use as capabilities and data sources expand, it must also provide interactivity between its elements. This paper describes our integration of these capabilities into our OVIS tool.

Keywords

  • resilience
  • HPC
  • system monitoring

These authors were supported by the United States Department of Energy, Office of Defense Programs. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed-Martin Company, for the United States Department of Energy under contract DE-AC04-94-AL8500.

Author information

Authors and Affiliations

  1. Sandia National Laboratories, Livermore, CA, USA

    J. Brandt, F. Chen, A. Gentile, J. Mayo, P. Pebay, D. Roe, D. Thompson & M. Wong

  2. Louisiana Tech University, Ruston, LA, 71270, USA

    Chokchai (Box) Leangsuksun & N. Taerat

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Träff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brandt, J. et al. (2012). Framework for Enabling System Understanding. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_27

  • DOI: https://doi.org/10.1007/978-3-642-29740-3_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer Science, Computer Science (R0)
