
1 Introduction

The effects of faults at extreme scale are a growing concern for high-performance computing (HPC) applied to scientific simulation [4]. Much resilience work deals with recovery from hard failures, such as a node that crashes. However, erroneous behavior can manifest in other ways. For example, an error may not immediately cause a crash, but may lead to an insidious wrong answer or cascade to a costly wider failure, which could be avoided if caught earlier. Thus, detecting errors with locality in space and time provides the best opportunity to mitigate them.

In scientific computations, error detection at the application level is facilitated by properties that are common in these simulations and are typically violated when errors occur: smoothness, conservation, and other numerical characteristics. In the face of uncertainty about likely error types and rates at extreme scale, improved algorithmic detection can aid both diagnosis and recovery.

Silent hardware errors, such as silent data corruption, are a prime example where precise detection is important. The future prevalence of these errors is unclear, but there is concern that they will be significant at extreme scale [4]. In addition, improved algorithmic detection could help diagnose and localize subtle software issues such as numerical instability and race conditions [2, 9].

Existing work on algorithm-based fault tolerance (ABFT) has developed approaches for application-level error detection. Generic ABFT for linear algebra solvers can be achieved using checksums [13]. In addition, scientific computations often feature physical conserved quantities such as energy or momentum, which can be viewed as a type of checksum, even for nonlinear problems. Such checksums and conserved quantities enable reliable error detection. However, in their standard form, they are defined globally, so in a parallel solver they require expensive collective communication [1] and do not localize errors to specific processes or tasks.

Spatially local error detection offers the potential for greater scalability of resilience, reducing communication and allowing more efficient local (rather than global) recovery, just as is sought for other localized failures in parallel programming models [6, 14]. Techniques explored for detecting errors locally in scientific computations include machine learning [12], comparison between different numerical methods [2], and outlier detection [11], but these techniques are empirical and inexact, with significant risk of false positives and false negatives.

Here we present a “physics-based checksum” (PBC) approach that builds on ABFT checksums and physical conservation laws applicable to scientific computations, and enables precise and efficient local error detection when such conserved quantities exist. As long as some form of checkpoint/restart remains viable, focusing purely on detection can allow recovering from occasional silent errors by rollback, as for hard failures. This is efficient for rare errors because it avoids the cost of more complex checksums that would support not only detection but also correction (roll-forward).

While the greatest expected benefit of the PBC approach is in conjunction with local recovery (restarting only the processes or tasks with errors) [14], the present work uses global checkpoint/restart (driven by local PBC detection) to illustrate the effectiveness in a familiar resilience setting. We demonstrate the approach in simple MPI-based solvers for partial differential equations (PDEs) and evaluate the effect on solver completion time and accuracy in the presence of emulated silent errors.

An abstract of this work was presented previously [10].

2 Checksum Approaches for Resilience

2.1 Error Detection Concepts

Checksums aim to introduce efficient redundancy in a solver via a smaller “side” computation that remains consistent with the solver state if all computations are correct (Fig. 1). State-of-the-art linear algebra checksums (LACs) [13], when verified after a series of linear algebra operations, can indicate with very high probability whether an error (processor or memory error) occurred somewhere in those operations (including in the checksum itself). Even if multiple errors occur, precise cancellation of their effects so that the checksum still matches is very unlikely. Thus, the verification of consistency can be performed intermittently, e.g., just before each checkpoint.

Fig. 1. Data flow in a solver using checksums. An error introduced in an intermediate step (red x) can be detected when the checksums are verified, and the solver can then restart from a valid state. (Color figure online)

The checksum for a floating-point vector u is typically taken as the sum of its entries, \(Q(u) = e^T u\), where \(e = (1, \dots , 1)^T\). When an operation is performed on u, the linearity of such a checksum allows it to be updated without recomputing it directly from the data, thus providing the redundant error check. Even with correctly functioning hardware and software, algebraic checksum relations hold numerically only to the level of floating-point roundoff. Silent errors in low-order bits whose numerical magnitude is within the roundoff level will be false negatives (undetected). When a checksum is verified by recomputing it from the underlying data, it is prudent to re-initialize (refresh) the checksum to remove accumulated roundoff drift.
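
As a concrete illustration (our Python/NumPy sketch, not code from the paper), the following snippet carries a checksum through one matrix-vector product via the precomputed row vector \(e^T A\), verifies it against a direct recomputation with a roundoff-level tolerance, and refreshes it. The matrix, the injected perturbation, and the tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
u = rng.standard_normal(n)
e = np.ones(n)

Q = e @ u            # checksum Q(u) = e^T u
eTA = e @ A          # precomputed row vector e^T A

Q = eTA @ u          # redundant checksum update for the product (uses the old u)
u = A @ u            # the solver's own update

u[123] += 1e-6       # emulate a small silent corruption of one entry

Q_true = e @ u                          # recompute the checksum from the data
tol = 1e-12 * np.linalg.norm(u, 1)      # roundoff-level tolerance (illustrative)
if abs(Q - Q_true) > tol:
    print("checksum mismatch: possible silent error")
else:
    Q = Q_true       # refresh the checksum to remove accumulated roundoff drift
```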

From a different perspective, physical conserved quantities can be used in a similar way. Global conservation laws of the form

$$\begin{aligned} Q = \int _{\text {space}} dV\, \rho = \text {const}, \end{aligned}$$
(1)

where \(\rho \) is a density expressible in terms of solver variables, are an exact property of many continuum equations, including nonlinear ones. Here we consider the preferred case of a “conservative discretization”, where a version of the conservation law holds independent of the mesh size or time step and is exact up to roundoff. As with standard LACs, these conserved quantities can detect errors reliably (via comparison of Q at an initial time and a later time) but involve global communication and do not localize errors in space.

To better leverage the benefits of conservation laws and create efficient local PBCs, we consider the more fundamental, local form of a continuum conservation law, \(\partial \rho /\partial t = -\varvec{\nabla } \cdot \mathbf {J}\), where \(\mathbf {J}\) is the flux density of the conserved quantity. Then, defining the conserved quantity in a spatial region R (e.g., a computational subdomain), \(Q(R) = \int _R dV\, \rho \), we find the integrated conservation law

$$\begin{aligned} \frac{dQ(R)}{dt} = -\oint _{\partial R} d\mathbf {S} \cdot \mathbf {J}. \end{aligned}$$
(2)

Thus, Q(R) changes only due to the flux through the boundary \(\partial R\). The flux is much faster to compute than Q(R) itself because the integral in (2) is lower-dimensional. When a discretized form of the local conservation law holds, Q(R) is a local PBC that can be updated efficiently and verified intermittently, in contrast to generic LACs [13] that are as costly to update as to verify. While this conservation derivation applies to time-dependent problems, we show in Sect. 4.1 that PBCs of the same form also apply to iterative elliptic solvers.

2.2 Injecting and Recovering from Errors

To demonstrate the practical effectiveness of PBC error detection, we test parallel solvers in a simple resilience framework with emulated silent errors. As in previous work [11], each solver process includes a concurrent thread that performs asynchronous, uniformly distributed bit flips in the large memory regions in use (floating-point data arrays) at an adjustable rate. Such a memory error model is also representative of other error types [3], such as processor errors.
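
For concreteness, a minimal Python/NumPy sketch of such an injector is shown below (our illustration; the actual framework uses a concurrent thread running asynchronously alongside an MPI solver, whereas here the injection is a plain function call, and the array and seed are illustrative).

```python
import numpy as np

def inject_bit_flip(arr, rng):
    """Flip one uniformly chosen bit in one uniformly chosen float64 entry."""
    idx = rng.integers(arr.size)                 # which entry
    bit = np.uint64(rng.integers(64))            # which bit within the 64-bit word
    raw = arr.view(np.uint64)                    # reinterpret the same memory as integers
    raw.flat[idx] ^= np.uint64(1) << bit         # toggle the chosen bit in place

rng = np.random.default_rng(1)
u = np.linspace(0.0, 1.0, 10)
inject_bit_flip(u, rng)
print(u)   # one entry now differs, by an amount that may be tiny or enormous
```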

We use a simple global checkpoint/restart scheme where verification of local checksums and writing of checkpoints occur periodically after a certain number of solver time steps or iterations, termed the verification interval. In our solvers, to establish a baseline given ideal checkpoint reliability and performance, checkpoints are stored in memory and are not subject to error injection, and time spent in checkpointing is not included in our measurements of resilience overhead. Rather, we measure the cost of updating and verifying the checksums and of redoing the computations from the previous checkpoint (global rollback) when an error is detected by any process based on a local checksum discrepancy. Checksum verification occurs together with each checkpoint, so the verification cost has the same effect as checkpointing cost. The cost could be adjusted to reflect any specific checkpoint storage technology. We seek resilience efficiency similar to that seen in standard global checkpoint/restart usage, which can achieve very low overhead using long intervals when failures are rare [5].
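
The structure of this scheme, independent of any particular solver, is sketched below (our illustration; the update step is a placeholder permutation that happens to conserve the sum exactly, and the interval length, tolerance, and sizes are illustrative).

```python
import numpy as np

def step(u):
    return np.roll(u, 1)      # placeholder solver step; a permutation conserves the sum

rng = np.random.default_rng(2)
u = rng.standard_normal(1000)
Q = u.sum()                   # checksum of the solver state

VI = 50                       # verification interval (steps between verifications)
ckpt_u, ckpt_Q = u.copy(), Q  # in-memory checkpoint
steps_done, n_steps = 0, 1000
while steps_done < n_steps:
    for _ in range(VI):
        u = step(u)
        # a real solver would update Q here by its cheap checksum rule;
        # this placeholder step conserves the sum, so Q is already correct
    if abs(Q - u.sum()) > 1e-12 * np.linalg.norm(u, 1):   # verification
        u, Q = ckpt_u.copy(), ckpt_Q                      # rollback: redo the interval
    else:
        steps_done += VI
        Q = u.sum()                                       # refresh the checksum
        ckpt_u, ckpt_Q = u.copy(), Q                      # take a new checkpoint
```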

The impact of silent errors should be judged in relation to existing numerical inaccuracies (roundoff, discretization, and incomplete convergence) that solvers exhibit even on perfect hardware. An error rate is considered tolerated by a solver, and overhead results are reported, only when the solver reliably finishes with accuracy similar to that of an error-free run. Silent errors are stochastic and vary from run to run, so the results must be considered as a distribution. A solver is deemed to fail in the presence of errors if, in >10% of runs, it takes longer than a cutoff time or returns a solution for which the residual or error compared to an analytic solution is more than 3 times that obtained by a run without error injection.

3 Application to 1D Hyperbolic Solvers

We describe the application of PBCs to a linear advection equation and to the nonlinear Burgers equation, and present test results for the latter.

3.1 Algorithm

The 1D linear advection equation is written as

$$\begin{aligned} \frac{\partial \phi (t, x)}{\partial t} + \nu \frac{\partial \phi (t, x)}{\partial x}=0, \end{aligned}$$
(3)

where \(\nu \) is a constant. The explicit finite-difference Lax-Wendroff scheme for the linear advection equation is determined by the stencil

$$\begin{aligned} \phi ^{n+1}_j = \frac{c(c+1)}{2} \phi ^n_{j-1} + (1-c^2) \phi ^n_j + \frac{c(c-1)}{2} \phi ^n_{j+1}, \quad 0 \le j \le N-1, \end{aligned}$$
(4)

where the CFL number is \(c = \nu \, \varDelta t/\varDelta x\). This can be thought of as a linear algebra operation, a sparse matrix-vector product \(\phi ^{n+1} = A \phi ^n\), where the tridiagonal matrix A is not explicitly stored.

The vector checksum \(Q(\phi ) = e^T \phi = \sum _j \phi _j\), where e is a vector of ones, is the discrete version of the quantity \(\int dx\, \phi \) conserved by the continuum PDE (3). The checksum computed for each update \(\phi ^{n+1}\) should correspond to the matrix-vector product. A general LAC formula for such a checksum update is

$$\begin{aligned} Q(\phi ^{n+1}) = \left( e^T A - d e^T \right) \phi ^n + d\, Q(\phi ^n), \end{aligned}$$
(5)

where d is an arbitrary scalar constant, whose choice may affect the detectability of propagated errors [13]. In general, this approach incurs the cost of the dot product of \((e^T A - d e^T)\) with \(\phi ^n\), the former being a constant precomputed vector.

However, based on our physical reasoning, it must be possible to compute the update more efficiently. The natural PBC is obtained with the choice \(d = 1\). For global conservation (e.g., a periodic closed domain), all columns of A have sum 1, as is seen by adding the coefficients of the three terms in (4); so \((e^T A - e^T) = 0\) and the update is trivial: \(Q(\phi ^{n+1}) = Q(\phi ^n)\). For local conservation (e.g., a subdomain within a parallel computation), the column sums of the local matrix A differ from 1 only at the boundaries where fluxes occur, i.e., \((e^T A - e^T)\) is a sparse vector, and the update is much more efficient than a general dot product. For the parallel Lax-Wendroff scheme, the local PBC update is

$$\begin{aligned} Q(\phi ^{n+1}) = Q(\phi ^{n}) + \frac{c(c+1)}{2} (\phi ^{n}_{-1} - \phi ^{n}_{N-1}) + \frac{c(c-1)}{2} (\phi ^{n}_{N} - \phi ^{n}_0), \end{aligned}$$
(6)

where \(\phi ^n_{-1}\) and \(\phi ^n_N\) are values communicated from neighboring subdomains.
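
As a worked illustration of (6), the following Python/NumPy sketch (ours) advances two subdomains of a periodic domain with the Lax-Wendroff stencil, updates each local checksum using only the communicated halo values, and confirms agreement with directly recomputed sums; the subdomain size, CFL number, and initial condition are illustrative.

```python
import numpy as np

N = 500                                    # points per subdomain
c = 0.5                                    # CFL number
a, b, g = c*(c+1)/2, 1 - c**2, c*(c-1)/2   # Lax-Wendroff stencil coefficients

x = np.linspace(0.0, 2*np.pi, 2*N, endpoint=False)
phi = [np.sin(x[:N]), np.sin(x[N:])]       # two subdomains of a periodic domain
Q = [p.sum() for p in phi]                 # local checksums

def lw_step(p, left_halo, right_halo):
    ext = np.concatenate(([left_halo], p, [right_halo]))
    return a*ext[:-2] + b*ext[1:-1] + g*ext[2:]

for n in range(200):
    halos = [(phi[1][-1], phi[1][0]),      # subdomain 0: periodic neighbors in subdomain 1
             (phi[0][-1], phi[0][0])]      # subdomain 1: periodic neighbors in subdomain 0
    new_phi, new_Q = [], []
    for i in (0, 1):
        lh, rh = halos[i]
        # PBC update, Eq. (6): only halo and boundary values enter
        new_Q.append(Q[i] + a*(lh - phi[i][-1]) + g*(rh - phi[i][0]))
        new_phi.append(lw_step(phi[i], lh, rh))
    phi, Q = new_phi, new_Q

for i in (0, 1):
    print(i, abs(Q[i] - phi[i].sum()))     # discrepancies remain at roundoff level
```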

PBCs can also be constructed for nonlinear equations where LACs do not apply. The 1D inviscid Burgers equation is written as

$$\begin{aligned} \frac{\partial u(t, x)}{\partial t} + \nu \, u(t,x) \frac{\partial u(t, x)}{\partial x}=0. \end{aligned}$$
(7)

The explicit finite-difference MacCormack scheme for the Burgers equation is determined by the stencil

$$\begin{aligned} u_j^{n+1} &= \frac{1}{2} (u_j^n + u_j^*) - \frac{c}{4} \left( (u_j^*)^2 - (u_{j-1}^*)^2 \right), \quad 0 \le j \le N-1, \nonumber \\ u_j^* &= u_j^n - \frac{c}{2} \left( (u_{j+1}^n)^2 - (u_j^n)^2 \right). \end{aligned}$$
(8)

This stencil cannot be cast purely in terms of linear algebra operations. However, the conservation principle is still valid for the MacCormack scheme, which is conservative by construction. The checksum \(Q(u) = e^T u = \sum _j u_j\) corresponds to the momentum \(\int dx\, u\) conserved by the Burgers equation, with the continuum flux density \(J = \frac{1}{2} \nu u^2\). The corresponding PBC update is

$$\begin{aligned} Q(u^{n+1}) = Q(u^n) + \frac{c}{4} \left( (u_{-1}^*)^2 + (u_0^n)^2 \right) - \frac{c}{4} \left( (u_{N-1}^*)^2 + (u_N^n)^2 \right) . \end{aligned}$$
(9)

Here again, the checksum can be updated from the previous time step by only adding contributions from boundary terms.
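
A corresponding Python/NumPy sketch (ours) of the MacCormack scheme with the PBC update (9) is given below, again using two periodically coupled subdomains with halo values standing in for neighbor communication; the mesh size, CFL number, and initial condition are illustrative.

```python
import numpy as np

N = 400                                    # points per subdomain
c = 0.2                                    # nu * dt / dx
x = np.linspace(0.0, 1.0, 2*N, endpoint=False)
u = [1.5 + 0.3*np.sin(2*np.pi*x[:N]), 1.5 + 0.3*np.sin(2*np.pi*x[N:])]
Q = [ui.sum() for ui in u]                 # local checksums (discrete momentum)

for n in range(100):
    # predictor step needs the right halo u^n_N from the neighboring subdomain
    right_halo = [u[1][0], u[0][0]]        # periodic coupling of the two subdomains
    ustar = []
    for i in (0, 1):
        up1 = np.append(u[i][1:], right_halo[i])            # u^n_{j+1}
        ustar.append(u[i] - (c/2)*(up1**2 - u[i]**2))
    # corrector step needs the left halo u*_{-1} from the neighboring subdomain
    left_star = [ustar[1][-1], ustar[0][-1]]
    new_u, new_Q = [], []
    for i in (0, 1):
        usm1 = np.insert(ustar[i][:-1], 0, left_star[i])    # u*_{j-1}
        new_u.append(0.5*(u[i] + ustar[i]) - (c/4)*(ustar[i]**2 - usm1**2))
        # PBC update, Eq. (9): only boundary contributions enter
        new_Q.append(Q[i] + (c/4)*(left_star[i]**2 + u[i][0]**2)
                          - (c/4)*(ustar[i][-1]**2 + right_halo[i]**2))
    u, Q = new_u, new_Q

for i in (0, 1):
    print(i, abs(Q[i] - u[i].sum()))       # discrepancies remain at roundoff level
```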

3.2 Evaluation

Fig. 2. Left: Example overhead behavior of global checkpoint/restart predicted by an analytic model [5]. Checkpointing and restarting each have a cost of 1 time unit, and the global failure rate per time unit varies from \(10^{-3}\) (bottom curve) to \(10^{-1}\) (top curve). Right: Overhead due to the detection algorithm and additional computations upon restarts when the PBC technique is used in solving the 1D Burgers equation. Results are reported for runs performed on 1024 cores with 100,000 mesh points per core for 25,000 time steps, at several bit-flip rates expressed as probability p per bit per standard time step.

Figure 2 shows the overhead results for the Burgers equation, alongside the typical behavior of global checkpoint/restart for hard failures as a comparison. Upon completing a given verification interval (VI), a global restart is performed if any subdomain’s recomputed “true” checksum \(Q_t\) differs from its efficiently updated checksum Q by more than \(10^{-2}\). The cost of checksum verification is reduced with a longer VI, leading to the initial decreasing trend of overhead with VI, but as VI increases further, the overhead increases due to more restarts and more wasted work. The optimal VI increases at lower error rates. Error injection is also performed on the non-robust version of the solver without error detection, to determine the maximum error rate tolerated. As shown, error rates significantly higher than this level can be tolerated by the robust solver with overhead of \(\sim \)10% or less.

4 Application to 3D Elliptic Solver

To illustrate the applicability of PBCs to iterative unstructured applications, we consider a conjugate gradient solver modeled on the HPCCG and MiniFE mini-apps [7].

4.1 Algorithm

The 3D Laplace equation is a linear elliptic PDE often solved using a finite-element method. The solution is represented as a vector x encoding a superposition of basis functions (elements) defined on a mesh, and the PDE is discretized as a linear system \(A x = b\). Here A is a sparse, symmetric “stiffness matrix” determined by the basis functions, and b is a vector determined by the boundary conditions. In a parallel solver, the mesh is partitioned into subdomains and the corresponding blocks of A, b, and x are distributed among the processes. A typical iterative solver approach is the conjugate gradient method, which repeatedly updates an estimate of the solution x using linear algebra operations until the residual \(b - Ax\) becomes sufficiently small. HPCCG implements an unpreconditioned conjugate gradient solver for the Laplace equation using a notional hexahedral mesh.

A key operation in the conjugate gradient solver is a sparse matrix-vector product Ap, where p is a vector generated within the algorithm. As discussed in Sect. 3.1, the generic LAC update for this operation is

$$\begin{aligned} Q(Ap) = \left( e^T A - d e^T \right) p + d\, Q(p), \end{aligned}$$
(10)

requiring a dot product that is as costly as recomputing the checksum. Again, a more efficient update is possible with the PBC approach. In our problem, e (a vector of ones) represents a superposition of elements into a constant function, and A represents a differential operator constructed from gradients; thus \(e^T A\), corresponding to the derivative of a constant, is a sparse vector (zero except at boundaries). We can take \(d = 0\) and obtain the simpler PBC update

$$\begin{aligned} Q(Ap) = \left( e^T A \right) p. \end{aligned}$$
(11)

Even though elliptic equations do not involve time advancement and so a conservation law does not literally apply, the solver operations are mathematically analogous to time steps and PBCs can still be used.
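
The sparsity of \(e^T A\) and the resulting cheap update can be seen in a small Python/NumPy sketch (ours), here using a 1D Laplace matrix with natural boundary conditions and a single row block standing in for one process's subdomain; the sizes are illustrative, and in a real solver only the few nonzero entries of \(e^T A\) would be stored and used.

```python
import numpy as np

n = 24
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] = A[-1, -1] = 1.0           # natural (Neumann) BCs: constants lie in the null space

rows = slice(8, 16)                 # rows owned by one "process" (its subdomain)
A_I = A[rows, :]                    # local row block of the stiffness matrix
w = A_I.sum(axis=0)                 # e^T A_I, precomputed once
print(np.nonzero(w)[0])             # nonzero only at columns adjacent to the subdomain boundary

rng = np.random.default_rng(3)
p = rng.standard_normal(n)
q_local = A_I @ p                   # local rows of the matrix-vector product
Q_q = w @ p                         # PBC update (11); only the nonzeros of w contribute
print(abs(Q_q - q_local.sum()))     # agreement to roundoff
```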

To obtain a somewhat more generic example, we replace HPCCG’s simple cubic mesh by a cylinder composed of wafers with an unstructured cross-section. Our solver reads in a corresponding stiffness matrix computed offline using basis functions that interpolate between values assigned to each mesh node (trilinear hexahedral elements). Each process operates on a subset of the wafers. The curved surface of the cylinder uses a standard Neumann zero-flux boundary condition, so fluxes in and out of subdomains occur on the boundaries between wafers. The mesh, stiffness matrix, and PBC update are visualized in Fig. 3.

Fig. 3. Top: Schematic 3D unstructured mesh of a cylinder; each wafer (side view not to scale) corresponds to a block in the stiffness matrix. Bottom: In the physics-based approach, the vector of ones (e) is used to form the checksum of solution vectors; the vector on the right is used to update checksums when performing matrix-vector products. The vector on the right is nonzero only on the boundary where fluxes occur, reflecting conservation properties of the Laplace operator.

In this case, due to the uniformity of the cylinder, the above-diagonal blocks are copies of a square matrix B and the below-diagonal blocks are \(B^T\). The nonzero entries in \(e^T A\) arise from these B and \(B^T\) blocks that couple adjacent subdomains. If \(p_i\) denotes the part of the vector p on process i, which can span several wafers, then let \(p_{i,l}\) and \(p_{i,r}\) be the sub-vectors corresponding to the leftmost and rightmost of these wafers. The local PBC update on process i for the vector \(q = Ap\) is then

$$\begin{aligned} Q_q = (e^T B^T) p_{i-1,r} + (e^T B) p_{i+1,l} - (e^T B^T) p_{i,l} - (e^T B) p_{i,r}. \end{aligned}$$
(12)

The full conjugate gradient method including local error detection and global checkpoint/restart is shown in Algorithm 1. Steps in blue are PBC updates performed during every iteration, while steps in green are error detection and checkpointing operations performed only after each verification interval. The basis for detection is the relative discrepancy in each local checksum, e.g., \(\eta _x = (Q_x - Q_{x,t})/\Vert x_i\Vert _1\), upon computing the true checksum \(Q_{x,t} = e^T x_i\) on process i.

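To convey the structure of Algorithm 1, the following serial Python/NumPy sketch (ours, simplified rather than the paper's exact pseudocode) combines the conjugate gradient iteration with cheap per-iteration checksum updates, periodic verification against recomputed sums, an emulated silent corruption, and rollback to an in-memory checkpoint when a discrepancy is found. The matrix, tolerances, and injection point are arbitrary illustrative choices, and a single domain stands in for the per-process subdomains.

```python
import numpy as np

n, VI = 64, 8
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD 1D Laplacian (Dirichlet)
b = np.ones(n)
eTA = A.sum(axis=0)        # e^T A: nonzero only at the boundary (here the domain ends)

x = np.zeros(n); r = b - A @ x; p = r.copy()
Qx, Qr, Qp = x.sum(), r.sum(), p.sum()
rr = r @ r

for interval in range(400):
    xs, rs, ps, Qs = x.copy(), r.copy(), p.copy(), (Qx, Qr, Qp, rr)  # checkpoint
    for _ in range(VI):
        q = A @ p
        Qq = eTA @ p                       # PBC update (11) for the product
        alpha = rr / (p @ q)
        x += alpha * p;  Qx += alpha * Qp  # checksum updates mirror the algebra
        r -= alpha * q;  Qr -= alpha * Qq
        rr_new = r @ r
        beta = rr_new / rr
        p = r + beta * p; Qp = Qr + beta * Qp
        rr = rr_new
        if np.sqrt(rr) < 1e-8:
            break
    if interval == 3:                      # emulate one silent corruption of x
        x[50] += 1.0
    bad = any(abs(Q - v.sum()) > 1e-9*np.linalg.norm(v, 1) + 1e-12
              for Q, v in ((Qx, x), (Qr, r), (Qp, p)))
    if bad:                                # any local discrepancy triggers rollback
        print("checksum discrepancy in interval", interval, "- rolling back")
        x, r, p = xs, rs, ps
        Qx, Qr, Qp, rr = Qs
        continue
    Qx, Qr, Qp = x.sum(), r.sum(), p.sum() # refresh checksums at verification
    if np.sqrt(rr) < 1e-8:
        break

print("final residual:", np.linalg.norm(b - A @ x))
```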

We note several details of error detection:

  • Verifying the x, p, and r checksums is sufficient because an error in q propagates to an error in r that remains detectable.

  • The PBC update (11) does not itself preserve the detectability of an error in p, because \(Q_p\) is not used in computing \(Q_q\). However, because a multiple of p is subsequently added to x, the consequence would still be a detectable error in x. Our results support that errors are detected well with \(d = 0\).

  • The dot products \(p^T q\) and \(r^T r\) require special consideration because dot products do not have checksums [13]. In our memory error model, this is not a problem because an existing error in p, q, or r that affects a dot product will also affect the subsequent use of the same vectors in a detectable way.

  • Error injection is not performed on the stiffness matrix A itself. If corruption of static data like A is a concern, then there are simple protection schemes that can be used [8], but we do not consider this here.

4.2 Evaluation

Error detection thresholds are chosen based on the maximum roundoff-induced checksum discrepancies observed in the solver in the absence of any injected errors. These accumulated roundoff discrepancies in the checksum updates increase with VI because of the nonlinear feedback in the conjugate gradient algorithm across iterations and between processes. We have fitted thresholds for our cylinder example as a function of subdomain size and VI.

Fig. 4. Error detection overhead is plotted for LACs and PBCs (percent overhead of each technique on left, ratio of LAC to PBC overhead on right) in the conjugate gradient solver on 32 processes with no bit flips injected. Different subdomain sizes are indicated by the volume-to-boundary ratio \(R_{\text {vb}}\).

We now examine the overhead induced by our error detection mechanism. With no error injection, we compare the overhead of PBC-based detection to a version where the LAC with \(d > 0\) [13], but computed locally, is used for the matrix-vector product. As shown in Fig. 4, the PBC approach has significantly lower overhead for larger computational subdomains and larger VI (infrequent verification, expected to be feasible for low error rates). This difference occurs because the LAC update requires a dot product with cost proportional to the subdomain volume at every iteration, whereas the PBC update requires computations only along the subdomain boundaries, which are smaller by a ratio \(R_{\text {vb}}\). In the remaining results, we set \(R_{\text {vb}}=8\), corresponding to a subdomain size of 8840 mesh points per process.

The results of overhead measurements with error injection and local PBC detection, shown in Fig. 5, are similar to those for the explicit solvers and likewise reflect the similarity to hard-failure checkpoint/restart (left plot in Fig. 2). A difference is that the conjugate gradient solver cannot afford as large a VI, because roundoff in the checksum updates propagates more strongly through the algorithm and error detection becomes less precise. Error rates and VIs plotted in Fig. 5 are those for which the accuracy criteria in Sect. 2.2 are met.

Fig. 5. For the conjugate gradient solver on 32 processes, overhead is plotted versus VI for several rates of memory bit flips. At relatively low error rates, the overhead is <10% for suitable VI. At larger error rates, the optimal VI decreases and overhead increases due to greater rollback costs, but the solver can still complete.

5 Conclusion

We have demonstrated a streamlined approach to silent-error detection that shows promise for physics simulations. Physics-based checksums (PBCs) enable precise and efficient local error detection with intermittent verification. In conjunction with recovery by rollback, PBCs fit into a typical checkpoint/restart resilience technique. Moreover, PBCs can apply to a range of solvers and error types that may occur at extreme scale. The approach has generality for scientific computing due to its physical foundation.

While existing ABFT linear algebra checksums correspond to conserved quantities in special cases, the conservation viewpoint leads to a general and efficient method for updating subdomain checksums using boundary fluxes, including for nonlinear equations. The local detection provided by these checksums can be further leveraged with local recovery [14].

Reliable algorithmic error detection provides a risk mitigation for future HPC systems and opens a broader space for co-design in which hardware reliability requirements could be relaxed. The conditions under which resilience techniques are effective can provide useful guidance for these future system designs.