Abstract
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
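To illustrate the idea of a flux-updated conserved quantity acting as a checksum, the following is a minimal sketch and not the authors' implementation. It assumes a 1D periodic upwind finite-volume advection scheme, a detection check after every step (rather than intermittently), and a checkpoint taken before every step. All names (`flip_bit`, `step`, `run`, the `fault` tuple, and the tolerance `tol`) are hypothetical choices made for this example.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of the IEEE-754 binary64 representation of x."""
    (n,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", n ^ (1 << bit)))
    return y

def step(u, c):
    """One upwind finite-volume step for u_t + a u_x = 0 (a > 0, periodic).

    c = a*dt/dx is the Courant number. Returns the new cell values and the
    interface fluxes (scaled by dt/dx); flux[i] enters cell i from the left.
    """
    n = len(u)
    flux = [c * u[i - 1] for i in range(n)]
    new = [u[i] + flux[i] - flux[(i + 1) % n] for i in range(n)]
    return new, flux

def run(n=32, parts=4, steps=20, c=0.5, fault=None, tol=1e-9):
    """Advance the solution while checking one checksum per subdomain.

    fault = (step_index, cell_index, bit) injects a single transient bit flip.
    Returns the final state and a list of (step, corrupted_subdomains).
    """
    u = [1.0 + 0.5 * ((i * 7) % 5) / 5 for i in range(n)]
    size = n // parts
    bounds = [(k * size, (k + 1) * size) for k in range(parts)]
    # Checksum: the conserved quantity (sum of cell values) of each subdomain.
    checks = [sum(u[lo:hi]) for lo, hi in bounds]
    detections = []
    t = 0
    while t < steps:
        saved_u, saved_checks = u[:], checks[:]  # checkpoint
        u, flux = step(u, c)
        # Local update: only the two boundary fluxes change a subdomain's total.
        checks = [s + flux[lo] - flux[hi % n]
                  for s, (lo, hi) in zip(checks, bounds)]
        if fault is not None and fault[0] == t:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
            fault = None  # transient error: it does not recur after rollback
        # Compare the recomputed total with the flux-updated checksum.
        bad = [k for k, (lo, hi) in enumerate(bounds)
               if abs(sum(u[lo:hi]) - checks[k]) > tol * abs(checks[k])]
        if bad:
            detections.append((t, bad))
            u, checks = saved_u, saved_checks  # roll back and redo the step
            continue
        t += 1
    return u, detections

if __name__ == "__main__":
    clean, _ = run()
    recovered, found = run(fault=(5, 10, 51))
    print(found, recovered == clean)
```

In this sketch the check needs only a local sum and two boundary fluxes per subdomain, so it also identifies which subdomain holds the corrupted value. A flip in a low-order mantissa bit can fall below `tol` and go undetected; the tolerance is an assumption that trades sensitivity against false positives from round-off.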
Under the terms of Contract DE-NA0003525, there is a non-exclusive license for use of this work by or on behalf of the U.S. Government.
Acknowledgments
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525. This work was funded by NNSA’s Advanced Simulation and Computing (ASC) Program. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
Copyright information
© 2020 National Technology & Engineering Solutions of Sandia, LLC.
About this paper
Cite this paper
Salloum, M., Mayo, J.R., Armstrong, R.C. (2020). Physics-Based Checksums for Silent-Error Detection in PDE Solvers. In: Schwardmann, U., et al. Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. https://doi.org/10.1007/978-3-030-48340-1_52
DOI: https://doi.org/10.1007/978-3-030-48340-1_52
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48339-5
Online ISBN: 978-3-030-48340-1
eBook Packages: Computer Science, Computer Science (R0)