Simulating Application Resilience at Exascale

  • Rolf Riesen
  • Kurt B. Ferreira
  • Maria Ruiz Varela
  • Michela Taufer
  • Arun Rodrigues
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)


The reliability mechanisms for future exascale systems will be a key aspect of their scalability and performance. With the expected jump in hardware component counts, faults will become increasingly common compared to today’s systems. Under these circumstances, the costs of current and emergent resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the measurement of various resilience algorithms with varying application characteristics. For this framework we outline the simulator’s requirements, its application communication pattern generators, and a few of the key hardware component models.


State Machine, Output Port, Communication Pattern, Sandia National Laboratory, Collective Operation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Rolf Riesen (1)
  • Kurt B. Ferreira (2)
  • Maria Ruiz Varela (3)
  • Michela Taufer (3)
  • Arun Rodrigues (2)
  1. IBM Research, Ireland
  2. Sandia National Laboratories, Albuquerque, USA
  3. University of Delaware, USA
