Physics-Based Checksums for Silent-Error Detection in PDE Solvers

  • Conference paper
Euro-Par 2019: Parallel Processing Workshops (Euro-Par 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11997)

Abstract

We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
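The idea of a conserved quantity serving as a locally updated checksum can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1D periodic advection equation solved with a first-order conservative upwind scheme, a single monitored subdomain, and an in-memory checkpoint. All names (`flip_bit`, `upwind_step`, `check_subdomain`, `run`) and parameter values are hypothetical and chosen only for illustration.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (n,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", n ^ (1 << bit)))
    return y


def upwind_step(u, c):
    """One conservative upwind step for u_t + a*u_x = 0 (a > 0), periodic.

    c = a*dt/dx. flux[i] is the amount entering cell i through its left
    face during the step (u[-1] wraps around to the last cell).
    """
    n = len(u)
    flux = [c * u[i - 1] for i in range(n)]
    new = [u[i] + flux[i] - flux[(i + 1) % n] for i in range(n)]
    return new, flux


def check_subdomain(u, lo, hi, expected, tol):
    """Compare the recomputed subdomain sum with the flux-updated checksum."""
    actual = sum(u[lo:hi])
    return abs(actual - expected) <= tol * max(1.0, abs(expected))


def run(steps=50, n=64, c=0.5, lo=16, hi=32, flip_at=None, check_every=5):
    """Advect a square pulse, verifying cells [lo, hi) every few steps."""
    u = [2.0 if n // 4 <= i < n // 2 else 1.0 for i in range(n)]
    checksum = sum(u[lo:hi])              # conserved quantity in the subdomain
    checkpoint = (0, list(u), checksum)   # last verified state
    rollbacks = 0
    step = 0
    while step < steps:
        u, flux = upwind_step(u, c)
        # Local update: only the two boundary fluxes are needed.
        checksum += flux[lo] - flux[hi % n]
        step += 1
        if flip_at is not None and step == flip_at:
            u[lo + 3] = flip_bit(u[lo + 3], 61)  # injected exponent-bit flip
            flip_at = None                       # transient: does not recur
        if step % check_every == 0:              # intermittent verification
            if check_subdomain(u, lo, hi, checksum, 1e-10):
                checkpoint = (step, list(u), checksum)
            else:
                step, u, checksum = checkpoint[0], list(checkpoint[1]), checkpoint[2]
                rollbacks += 1
    return u, rollbacks
```

Calling `run(flip_at=7)` injects one bit flip inside the monitored subdomain; the mismatch between the recomputed sum and the flux-updated checksum is caught at the next check, and the run resumes from the last verified state. The sketch assumes the checksum and the checkpoint themselves are never corrupted, which a real solver would have to address separately.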

Under the terms of Contract DE-NA0003525, there is a non-exclusive license for use of this work by or on behalf of the U.S. Government.

Acknowledgments

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525. This work was funded by NNSA’s Advanced Simulation and Computing (ASC) Program. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Author information

Corresponding author

Correspondence to Jackson R. Mayo.

Copyright information

© 2020 National Technology & Engineering Solutions of Sandia, LLC.

About this paper

Cite this paper

Salloum, M., Mayo, J.R., Armstrong, R.C. (2020). Physics-Based Checksums for Silent-Error Detection in PDE Solvers. In: Schwardmann, U., et al. Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. https://doi.org/10.1007/978-3-030-48340-1_52

  • DOI: https://doi.org/10.1007/978-3-030-48340-1_52

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-48339-5

  • Online ISBN: 978-3-030-48340-1

  • eBook Packages: Computer Science, Computer Science (R0)