Abstract
This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that optimally choosing its value leads to significant gains over the CR approach.
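The trade-off behind the detection interval can be illustrated with a toy first-order model. This is a sketch under stated assumptions, not the model developed in the paper: the function names, the parameters `check_cost`, `error_rate`, and `recompute_fraction`, and the assumption that the recomputed fraction stays constant are all hypothetical. Checking more often costs more in detection, while checking less often lets an error stay latent longer and enlarges the work to redo. Setting `recompute_fraction` to 1.0 mimics a full rollback in the style of CR, and a smaller value mimics a recovery restricted to the affected part of the computation.

```python
import math


def overhead_per_iteration(interval, check_cost, error_rate, recompute_fraction):
    """Toy first-order expected overhead per iteration (hypothetical model).

    interval           -- iterations between two error checks (bounds the latency)
    check_cost         -- cost of one check, in units of one iteration
    error_rate         -- probability that an error strikes during one iteration
    recompute_fraction -- share of the work redone after a detection
                          (1.0 = redo everything, as a full rollback would)
    """
    detection = check_cost / interval
    # Upper bound: an error may have been latent for the whole interval,
    # so up to `interval` iterations are redone on the affected fraction.
    recovery = error_rate * interval * recompute_fraction
    return detection + recovery


def best_interval(check_cost, error_rate, recompute_fraction):
    """Interval minimising check_cost / d + error_rate * recompute_fraction * d."""
    return math.sqrt(check_cost / (error_rate * recompute_fraction))


if __name__ == "__main__":
    for fraction in (1.0, 0.25):
        d = best_interval(4.0, 0.01, fraction)
        cost = overhead_per_iteration(d, 4.0, 0.01, fraction)
        print(f"fraction={fraction}: interval={d:.1f}, overhead={cost:.3f}")
```

In this toy setting, shrinking the recomputed fraction both lengthens the best interval and lowers the overhead at that interval. In a real tree computation the affected fraction would likely grow with the error latency, which this sketch deliberately ignores.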
Notes
- 1. Errors that cannot be detected are beyond the ability of any error recovery system to consider.
- 2. Assuming expensive checks means that any improvements in checking can be incorporated – cost is not a disqualifier.
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Cavelan, A., Fang, A., Chien, A.A., Robert, Y. (2018). Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2017. Lecture Notes in Computer Science(), vol 10724. Springer, Cham. https://doi.org/10.1007/978-3-319-72971-8_8
DOI: https://doi.org/10.1007/978-3-319-72971-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72970-1
Online ISBN: 978-3-319-72971-8
eBook Packages: Computer Science (R0)