Simulating Application Resilience at Exascale
The reliability mechanisms of future exascale systems will be a key factor in their scalability and performance. With the expected jump in hardware component counts, faults will be more common than on today’s systems. Under these circumstances, the costs of current and emerging resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the measurement of various resilience algorithms across varying application characteristics. For this framework we outline the simulator’s requirements, its application communication pattern generators, and a few of the key hardware component models.
Keywords: State Machine, Output Port, Communication Pattern, Sandia National Laboratory, Collective Operation