Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers

  • Pierre-Louis Guhur (email author)
  • Hong Zhang
  • Tom Peterka
  • Emil Constantinescu
  • Franck Cappello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)

Abstract

Silent data corruptions (SDCs) are errors that corrupt the system or falsify results while remaining unnoticed by firmware or operating systems. In numerical integration solvers, SDCs that impact the accuracy of the solver are considered significant. Detecting SDCs in high-performance computing is necessary because results need to be trustworthy and because the growing number and complexity of components in emerging large-scale architectures make SDCs more likely to occur. Until recently, SDC detection methods consisted of replicating the processes of the execution or using checksums (for example, algorithm-based fault tolerance). More recent detection methods rely on mathematical properties of numerical kernels or perform data analysis of the results modified by the application. None of those methods, however, provides a lightweight solution guaranteeing that all significant SDCs are detected. We propose a new method called Hot Rod as a solution to this problem. It checks and potentially corrects the data produced by numerical integration solvers. Our theoretical model shows that all significant SDCs can be detected. We present two detectors and conduct experiments on streamline integration from the WRF meteorology application. Compared with the algorithmic detection methods, the accuracy of our first detector is increased by \(52\,\%\) with a similar false detection rate. The second detector has a false detection rate one order of magnitude lower than these detection methods while improving the detection accuracy by \(23\,\%\). The computational overhead is lower than \(5\,\%\) in both cases. The model has been developed for an explicit Runge-Kutta method, although it can be generalized to other solvers.
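
To make the idea of checking solver output concrete, the following is a minimal sketch of one generic approach: using the local error estimate of an embedded explicit Runge-Kutta pair as an SDC indicator. It is not the Hot Rod detector, whose details the abstract does not give. The Heun-Euler pair, the threshold rule (`factor` times the largest estimate accepted so far), the additive fault model, and all function and parameter names are assumptions chosen for illustration.

```python
def heun_euler_step(f, t, y, h, corrupt=0.0):
    """One step of the Heun-Euler embedded pair (orders 2 and 1).

    `corrupt` is a hypothetical additive fault injected into the second
    stage to emulate a silent data corruption.
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1) + corrupt
    y_high = y + 0.5 * h * (k1 + k2)    # second-order solution
    y_low = y + h * k1                  # first-order (Euler) solution
    return y_high, abs(y_high - y_low)  # solution and local error estimate


def integrate_with_check(f, t0, y0, h, n_steps, factor=10.0,
                         faulty_step=None, fault=0.0):
    """Integrate with a fixed step and flag suspicious error estimates.

    A step is flagged when its estimate exceeds `factor` times the largest
    estimate accepted so far. The first step has no history and is never
    checked. Flagged steps are recomputed, assuming the fault is transient.
    """
    t, y = t0, y0
    max_seen = None
    flagged = []
    for i in range(n_steps):
        corrupt = fault if i == faulty_step else 0.0
        y_new, est = heun_euler_step(f, t, y, h, corrupt)
        if max_seen is not None and est > factor * max_seen:
            flagged.append(i)
            # Recompute the step without the (assumed transient) fault.
            y_new, est = heun_euler_step(f, t, y, h)
        max_seen = est if max_seen is None else max(max_seen, est)
        t, y = t + h, y_new
    return y, flagged


if __name__ == "__main__":
    # Test problem y' = -y, y(0) = 1, with a fault injected at step 50.
    y_end, flagged = integrate_with_check(
        lambda t, y: -y, 0.0, 1.0, 0.01, 100, faulty_step=50, fault=1.0
    )
    print(y_end, flagged)
```

A fault small enough to stay under the threshold passes unnoticed in this sketch, and a tighter threshold raises the false detection rate. That tension is the trade-off between detection accuracy and false detections that the abstract quantifies.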

Keywords

Resilience, Fault tolerance, Runge-Kutta, Numerical integration solvers, HPC, SDC

Notes

Acknowledgments

We express our gratitude to Julie Bessac for assistance with the algorithm and Gail Pieper for comments that greatly improved the manuscript. We also gratefully acknowledge the use of the services and facilities of the Decaf project at Argonne National Laboratory, supported by U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357, program manager Lucy Nowell. We also thank the anonymous reviewers for their helpful comments.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Pierre-Louis Guhur (1, 2), email author
  • Hong Zhang (1)
  • Tom Peterka (1)
  • Emil Constantinescu (1)
  • Franck Cappello (1)

  1. Argonne National Laboratory, Lemont, USA
  2. ENS de Cachan, Cachan, France
