European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 221–230

Simulating Application Resilience at Exascale

  • Rolf Riesen,
  • Kurt B. Ferreira,
  • Maria Ruiz Varela,
  • Michela Taufer &
  • Arun Rodrigues

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7156)

Abstract

The reliability mechanisms for future exascale systems will be a key aspect of their scalability and performance. With the expected jump in hardware component counts, faults will become increasingly common compared to today’s systems. Under these circumstances, the costs of current and emergent resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the measurement of various resilience algorithms with varying application characteristics. For this framework we outline the simulator’s requirements, its application communication pattern generators, and a few of the key hardware component models.

Keywords

  • State Machine
  • Output Port
  • Communication Pattern
  • Sandia National Laboratory
  • Collective Operation

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Author information

Authors and Affiliations

  1. IBM Research, Ireland

    Rolf Riesen

  2. Sandia National Laboratories, Albuquerque, NM, 87123, USA

    Kurt B. Ferreira & Arun Rodrigues

  3. University of Delaware, USA

    Maria Ruiz Varela & Michela Taufer

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Traff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Riesen, R., Ferreira, K.B., Varela, M.R., Taufer, M., Rodrigues, A. (2012). Simulating Application Resilience at Exascale. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_26

  • DOI: https://doi.org/10.1007/978-3-642-29740-3_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer Science, Computer Science (R0)
