European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 234–243

A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment

  • Thomas Naughton,
  • Geoffroy Vallée,
  • Christian Engelmann &
  • Stephen L. Scott
  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7155)

Abstract

Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques.

While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chain for these systems is often tailored for the platform, and the operating environments typically contain many site- and machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption.

The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools for investigation into the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.
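
As a rough illustration of what a fault injection experiment does at its simplest, the sketch below flips a single bit in a byte buffer that stands in for a VM's guest memory. It is not the framework presented in the paper: the names `flip_bit`, `inject_random_fault`, and `guest_memory` are hypothetical, and a real VM-based injector would act through the hypervisor rather than on a Python buffer.

```python
import random
from typing import Tuple


def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Flip one bit in place, emulating a transient single-bit memory fault."""
    if not 0 <= bit_index < 8:
        raise ValueError("bit_index must be in 0..7")
    memory[byte_index] ^= 1 << bit_index


def inject_random_fault(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip a randomly chosen bit and return its location for later analysis."""
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    flip_bit(memory, byte_index, bit_index)
    return byte_index, bit_index


# Hypothetical stand-in for a guest memory region (assumption, not from the paper).
guest_memory = bytearray(64)
fault_location = inject_random_fault(guest_memory, random.Random(0))
```

Seeding the random generator keeps a run repeatable, so the same fault can be replayed against the same execution environment when comparing fault tolerance mechanisms.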

Keywords

  • Virtual Machine
  • Message Passing Interface
  • System Under Test
  • Fault Injection
  • Fault Tolerance Mechanism

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

ORNL’s work was supported by the U.S. Department of Energy, under Contract DE-AC05-00OR22725.

References

  1. Buntinas, D., Bosilca, G., Graham, R.L., Vallée, G., Watson, G.R.: A Scalable Tools Communication Infrastructure. In: Proceedings of the 22nd International High Performance Computing Symposium (HPCS 2008), June 9-11, session track: 6th Annual Symposium on OSCAR and HPC Cluster Systems (OSCAR 2008). IEEE Computer Society (2008), http://www.csm.ornl.gov/oscar08/

  2. Gupta, R., Beckman, P., Park, B.H., Lusk, E., Hargrove, P., Geist, A., Lumsdaine, A., Dongarra, J.: CIFTS: A coordinated infrastructure for fault-tolerant systems. In: International Conference on Parallel Processing, ICPP (2009)

  3. Hoarau, W., Lemarinier, P., Herault, T., Rodriguez, E., Tixeuil, S., Cappello, F.: FAIL-MPI: How fault-tolerant is fault-tolerant MPI? In: IEEE International Conference on Cluster Computing, pp. 1–10 (September 2006)

  4. Hoarau, W., Tixeuil, S., Vauchelles, F.: FAIL-FCI: Versatile fault injection. Future Generation Computer Systems 23(7), 913–919 (2007), http://www.sciencedirect.com/science/article/pii/S0167739X07000209

  5. Hsueh, M.C., Tsai, T.K., Iyer, R.K.: Fault injection techniques and tools. Computer 30(4), 75–82 (1997)

  6. Carreira, J., Madeira, H., Silva, J.G.: Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers. IEEE Transactions on Software Engineering 24(2) (February 1998), http://www.xception.org/files/IEEETSE98.pdf

  7. Lange, J., Pedretti, K., Hudson, T., Dinda, P., Cui, Z., Xia, L., Bridges, P., Jaconette, S., Levenhagen, M., Brightwell, R., Widener, P.: Palacios and Kitten: High Performance Operating Systems For Scalable Virtualized and Native Supercomputing. Tech. Rep. NWU-EECS-09-14, Northwestern University, July 20 (2009), http://v3vee.org/papers/NWU-EECS-09-14.pdf

  8. Le, M., Gallagher, A., Tamir, Y.: Challenges and Opportunities with Fault Injection in Virtualized Systems. In: First International Workshop on Virtualization Performance: Analysis, Characterization, and Tools, Austin, Texas, USA (April 2008), http://www.cs.ucla.edu/~tamir/papers/vpact08.pdf

  9. Marinescu, P.D., Candea, G.: LFI: A Practical and General Library-Level Fault Injector. In: Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009), June 29 - July 2. IEEE (2009), http://dslab.epfl.ch/pubs/lfi/index.html

  10. Potyra, S., Sieh, V., Cin, M.D.: Evaluating fault-tolerant system designs using FAUmachine. In: Proceedings of the 2007 Workshop on Engineering Fault Tolerant Systems (EFTS 2007), p. 9. ACM, New York (2007)

  11. Silva, J.G., Carreira, J., Madeira, H., Costa, D., Moreira, F.: Experimental assessment of parallel systems. In: Proceedings of the 26th Annual International Symposium on Fault-Tolerant Computing (FTCS 1996), June 25-27, pp. 415–424 (1996)

  12. Stott, D.T., Floering, B., Burke, D., Kalbarczyk, Z., Iyer, R.K.: NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In: Proceedings of the 4th IEEE International Computer Performance and Dependability Symposium (IPDS), pp. 91–100. IEEE (March 2000)

  13. Süßkraut, M., Creutz, S., Fetzer, C.: Fast fault injection with virtual machines (fast abstract). In: Supplement of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2007) (June 2007), http://wwwse.inf.tu-dresden.de/papers/preprint-suesskraut2007DSNb.pdf

  14. Vallée, G., Naughton, T., Scott, S.L.: System Management Software for Virtual Environments. In: Proceedings of the ACM International Conference on Computing Frontiers (CF 2007), Ischia, Italy, May 7-9 (2007)

  15. Youseff, L., Seymour, K., You, H., Dongarra, J., Wolski, R.: The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software. In: Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC 2008), pp. 141–152. ACM, New York (2008)

Author information

Authors and Affiliations

  1. Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, 37831, USA

    Thomas Naughton, Geoffroy Vallée, Christian Engelmann & Stephen L. Scott

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, USA

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Oak Ridge National Laboratory, Computer Science and Mathematics Division, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Traff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Naughton, T., Vallée, G., Engelmann, C., Scott, S.L. (2012). A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7155. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29737-3_27

  • DOI: https://doi.org/10.1007/978-3-642-29737-3_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29736-6

  • Online ISBN: 978-3-642-29737-3

  • eBook Packages: Computer Science, Computer Science (R0)
