A posteriori error analysis of round-off errors in the numerical solution of ordinary differential equations

We prove sharp, computable error estimates for the propagation of errors in the numerical solution of ordinary differential equations. The new estimates extend previous estimates of the influence of data errors and discretization errors with a new term accounting for the propagation of numerical round-off errors, showing that the accumulated round-off error is inversely proportional to the square root of the step size. As a consequence, the numeric precision eventually sets the limit for the pointwise computability of accurate solutions of any ODE. The theoretical results are supported by numerically computed solutions and error estimates for the Lorenz system and the van der Pol oscillator.


Introduction
We consider the numerical solution of general initial value problems for systems of ordinary differential equations (ODE), u(t) = f (u(t), t), t ∈ (0, T ], where the right-hand side f : R N × [0, T ] → R N is assumed to be Lipschitz continuous in u and continuous in t.Our objective is to analyse the error in an approximate solution U : [0, T ] → R N computed by a single-step numerical method, such as an explicit or implicit Runge-Kutta method.For the numerical results presented at the end of this work, we have used a time-stepping method formulated as a Galerkin finite element method -which, for any particular choice of finite element basis and quadrature, will correspond to a particular implicit Runge-Kutta method -but stress that the theoretical results are not particular to time-stepping based on Galerkin formulations but generic to all single-step methods. The propagation of local errors and accumulation of global errors in the numerical solution of ODE have been studied extensively in the literature, see e.g.[5,7,6,8,2].These estimates are based on the formulation of an auxiliary dual problem: the linearised adjoint problem.From the solution of the dual problem, the accumulation rate of local errors may be computed, either as global stability factors or as local stability weights.These factors or weights, together with a measure of the local error, typically the residual R(t) = U − f (U(t), t) lead to a computable estimate of the global error.
Standard estimates may include various sources contributing to the global error, such as discretisation errors, accounting for the use of finite time steps, quadrature errors, accounting for the approximation of the right-hand side f by a particular quadrature rule, and data errors, accounting for the approximation of the initial value u 0 .In this work, we extend these estimates by adding a new term accounting for the use of finite numeric precision in the computation of the numerical solution.This error is normally neglected, since it is typically much smaller than the contribution from the data or discretisation error.However, when the system (1) is very sensitive to perturbations, when the time interval [0, T ] is very long, or when a solution is sought with very high accuracy, the effect of numerical round-off errors as a result of finite numeric precision can and will be the dominating error source, which ultimately limits the computability of a given problem.

Main result
We prove below that the error E in a computed numerical solution U approximating the exact solution u of the ODE (1) is the sum of three contributions: where E D is the data error, which is nonzero if U (0) = u(0); E G is the discretisation error, which is nonzero as a result of a finite time step; and E C is the computational error, which is nonzero as a result of finite numerical precision.Furthermore, we bound each of the three contributions as the product of a stability factor and a residual that measures the size of local contributions to the error.The size of the residuals may be estimated in terms of the size of the time step.We find that where ∆t is the size of the time step, r the order of convergence of the numerical method, and S D (T ), S G (T ), S C (T ) are stability factors which can be computed a posteriori.

Error analysis
The error analysis is based on the solution of an auxiliary dual problem and follows the now well established techniques developed in [7], [6], [5] and [1], with extensions to account for the accumulation of round-off errors.
The dual (linearised adjoint) problem takes the form of an initial value problem for a system of linear ordinary differential equations: Here, Ā denotes the Jacobian matrix of the right-hand side f averaged over the approximate solution U and the exact solution u: By the chain rule, it follows that Based on the formulation of the dual problem we may now derive a (standard) error representation (Theorem 1).It represents the error in an approximate solution U (computed by any numerical method) in terms of the residual R of the computed solution and the solution z of the dual problem (2).The only assumption we make on the numerical solution U is that it is piecewise smooth on a partition of the interval [0, T ] (or that it may be extended to such a function).At points where U is smooth, the residual is defined by Theorem 1 (Error representation) Let u : [0, T ] → R N be the exact solution of the initial value problem (1), let z : [0, T ] → R N be the solution of the dual problem (2), and let U : [0, T ] → R N be any piecewise smooth approximation of u on a partition Then, the error U (T )−u(T ) may be represented by Proof By the definition of the dual problem, we find that which completes the proof.
Remark 1 Theorem 1 holds for any piecewise smooth function U : [0, T ] → R N , in particular for any piecewise smooth extension of any approximate numerical solution obtained by any numerical method for (1).
We next investigate the contribution to the error in the computed numerical solution U from errors in initial data, numerical discretisation, and computation (round-off errors), E = E D + E G + E C , and derive sharp bounds for each term.
To estimate the computational error, we introduce the discrete residual R defined as follows.For any p ≥ 0, let {λ k } p k=0 be the Lagrange nodal basis for P p ([0, 1]), the space of polynomials of degree ≤ p on [0, 1], on a partition We also define the corresponding interpolation operator π onto the space of piecewise polynomial functions on the partition 0 We may now prove the following a posteriori error estimate.
Theorem 2 (Error estimate) Let u : [0, T ] → R N be the exact solution of the initial value problem (1), let z : [0, T ] → R N be the solution of the dual problem (2), and let U : [0, T ] → R N be any piecewise smooth approximation of u on a partition Then, for any p ≥ 0, the following error estimate holds: where Here, Cp and C ′ p are constants depending only on p.The stability factors S D , S G , and S C are defined by Proof Starting from the error representation of Theorem 1, we add and subtract the degree p left-continuous piecewise polynomial interpolant πz defined above to obtain z T , e(T ) = z(0), e(0) We first note that the data error E D is bounded by z(0) e(0) ≡ S D e(0) .By an interpolation estimate, we may estimate the discretisation error E G by where Cp is an interpolation constant.Finally, to estimate the computational error, we expand πz in the nodal basis to obtain Remark 2 Theorem 2 estimates the size of z T , U (T )−u(T ) for any given vector z T .We may thus estimate any bounded linear functional of the error at the final time by choosing z T as the corresponding Riesz representer.In particular, we may estimate the error in any component u i (T ) of the solution by setting z T to the ith unit vector for i = 1, 2, . . ., N .
Theorem 2 extends standard a posteriori error estimates for systems of ordinary differential equations in two ways.First, it does not make any assumption on the underlying numerical method.Second, it includes the effect of numerical round-off errors.A similar estimate can be found in [21] but only for the simplest case of the piecewise linear cG(1) method (Crank-Nicolson).
We now investigate the propagation of numerical round-off errors in more detail.As in the proof of Theorem 2, E C denotes the computational error defined by Theorem 2 bounds the computational error in terms of the discrete residual defined in (4).The discrete residual tests the continuous residual R = U − f of (1) against polynomials of degree p.In particular, it tests how well the numerical method satisfies the relation With a machine precision of size ǫ mach , our best hope is that the numerical method satisfies (7) to within a tolerance of size ǫ mach for each component of the vector We thus have the following corollary.
Corollary 1 The computational error E C of Theorem 2 is bounded by This indicates that the computational error scales like ∆t −1 ; the smaller the time step, the larger the computational error.At first, this seems non-intuitive, but it is a simple consequence of the fact that a smaller time step leads to a larger number of time steps and thus a larger number of round-off errors.The estimate of Corollary 1 is overly pessimistic.It is based on the assumption that round-off errors accumulate without cancellation.In practice, the round-off error is sometimes positive and sometimes negative.As a simple model, we make the assumption that the round-off error is a random variable which takes the value +ǫ mach or −ǫ mach with equal probabilities, for all m, k, i.In reality, round-off errors are not uncorrelated random variables, but the simple model ( 8) may still give useful results.For a discussion on the applicability of random models to the propagation of round-off errors, see [11] (Section 2.8) and [10].
Under the assumption (8), we find that the expected size of the computational error scales like ∆t −1/2 .As we shall see in the next section, this is also confirmed by numerical experiments.A similar result is obtained in a series of papers by Li et.al [19,20].In [19], it is first noted that there exists an optimal time step; that is, a time step for which discretisation errors and round-off errors balance.In [20], it is then found that the round-off error is inversely proportional to the square root of the time step.These results are confirmed by the following theorem.
Theorem 3 Assume that the round-off error is a random variable of size ±ǫ mach with equal probabilities.Then the root-mean squared expected computational error E C of Theorem 2 is bounded by where and C ′ p is a constant depending only on p.
Proof As in the proof of Theorem 2, we obtain where by assumption ( Rm k ) i = ǫ mach x mki and x mki = ±1 with probability 0.5 and 0.5, respectively.It follows that We now note that x 2 mki = 1.Furthermore, y ijklmn = x mki x nlj is a random variable which takes the values +1 and −1 with equal probabilities.We thus find that . This completes the proof.
Remark 3 By additional assumptions on the smoothness of the dual solution z, one may relate the error to the expected value of the distance from the starting point for a random walk which is √ n for n steps (for n large), and prove a similar estimate for the expected absolute value of the computational error, where S C = T 0 πz dt.
Remark 4 By Cauchy-Schwarz, the stability factor S C of Theorem 2 is bounded by √ T S C2 .
Remark 5 In [10], the effect of numerical round-off error accumulation and its relation to Brownian motion (Brouwer's law) are discussed in the context of symplectic methods for Hamiltonian systems.It should be noted that although the assumptions of Theorem 3 are similar to those in [10], namely that the process of error accumulation for round-off errors is random rather than systematic, the point under discussion in the present work is different: the effect of time step size rather than the effect of the interval length.
We conclude this section by discussing how the above error estimates apply to the particular methods used in this work.The estimate of Theorem 2 is valid for any numerical method but is of particular interest as an a posteriori error estimate for the finite element methods cG(q) and dG(q) (see [12,13,14,4]).
The continuous and discontinuous Galerkin methods cG(q) and dG(q) are formulated by requiring that the residual R = U − f (U, •) be orthogonal to a suitable space of test functions.By making a piecewise polynomial ansatz, the solution may be computed on a sequence of intervals partitioning the computational domain [0, T ] by solving a system of equations for the degrees of freedom on each consecutive interval.For a particular choice of numerical quadrature and degree q, the cG(q) and dG(q) methods both reduce to standard implicit Runge-Kutta methods.
In the case of the cG(q) method, the numerical solution U is a continuous piecewise polynomial of degree q that on each interval (t n−1 , tn] satisfies for all v ∈ P q−1 ([t n−1 , tn]).It follows that the discrete residual ( 4) is zero if p ≤ q−1.However, this is only true in exact arithmetic.In practice, the discrete residual is nonzero and measures how well we solve the cG(q) equations ( 9), including round-off errors and errors from numerical quadrature. 1 For the cG(q) method, we further expect the residual to converge as ∆t q .Thus, choosing p = q − 1 in Theorem 2, one may expect the error for the cG(q) method to scale as Here, S(T ) denotes a generic stability factor.As in Theorem 2, each term contributing to the total error is in reality multiplied by a particular stability factor.In practice, however, the growth rates of the different stability factors are similar and related by a constant factor. 1 To account for additional quadrature errors present if the integral of ( 9) is approximated by quadrature, one may add and subtract an interpolant πf of the right-hand side f in the proof of Theorem 2 to obtain an additional term

Numerical results
In this section, we present numerical results in support of Theorem 2 and Theorem 3. The examples are the well-known Lorenz system and Van der Pol oscillator.Both examples illustrate the competing convergence rates for discretisation errors, decreasing rapidly for smaller time steps, and computational errors (round-off error), increasing for smaller time steps.
The numerical results were obtained using the authors' software package Tanganyika [23] which implements the methods described in [21] using high precision numerics provided by GMP [9].A complete code for reproducing all results in this paper is available at [18].For details on the implementation, see [16].
The Lorenz system is deterministically chaotic.In the context of a posteriori error analysis of numerical methods for the solution of ODE initial value problems, as in the present work, this means that solutions may, in principle, be computed over arbitrarily long time intervals, but to a rapidly increasing cost as function of the final time T .

Computability and growth of stability factors
In [8], computability was demonstrated and quantified for the Lorenz on time intervals of moderate length (T = 30) on a standard desktop computer.This result was further extended to time T = 48 in [22], using high order ( e(T ) ∼ ∆t 30 ) finite element methods.Solutions over longer time intervals have been computed based on shadowing (the existence of a nearby exact solution), see [3], but for unknown initial data.Related work on high-precision numerical methods applied to the Lorenz system include [26] and [15].
In [17], the authors study the computability of the Lorenz system in detail on the time interval [0, 1000].Computability is here defined as the maximal final time T = T (ǫ mach ) such that a solution may be computed with a given machine precision ǫ mach .The computability may be estimated by examining the growth rate of the stability factors appearing in the error estimate of Theorem 2. By numerical solution of the dual problem, it was found in [17] that the stability factors grow exponentially as S(T ) ∼ 10 0.388T ∼ 10 0.4T ; see Figure 1.By examining in detail the terms contributing to the error estimate (5), one finds that an optimal step size is given by ∆t ∼ ǫ mach and that the computability of the Lorenz system is given by where n mach = − log 10 ǫ mach is the number of significant digits.Based on this estimate, one may conclude that with 16-digit precision, the Lorenz system is computable on [0, 40], while using 400 digits, the Lorenz system is computable on [0, 1000].
In Figure 2, we plot the solution of the Lorenz system on the interval [0, 1000].The solution was computed with cG(100), which is a method of order 2q = 200, a time step of size ∆t = 0.0037, 420-digit precision arithmetic2 , and a tolerance for the discrete residual of size ǫ mach ≈ 2.26 • 10 −424 .The very rapid (exponential) accumulation of numerical errors makes the Lorenz "fingerprint" displayed in Figure 2 useful as a reference for verification of solutions of the Lorenz system.If a solution is only slightly wrong, the error is quickly magnified so that the error becomes visible by a direct inspection of a plot of the solution.

Order of convergence and optimal step size
We next investigate how the accumulated error at final time depends on the size of the time step ∆t.According to (10), we expect the error to scale like ∆t 2q + ∆t − 1 /2 ǫ mach .Thus, for a gradually decreasing step size, we expect the error to decrease at a rate of ∆t 2q .However, as the time step becomes smaller the second term ∆t − 1 /2 will grow and, for small enough ∆t, be the dominating contribution to the error.This picture is confirmed by the results presented in Figure 3 for two numerical methods, the 2nd order cG(1) and the 10th order cG(5) method.Of particular interest in this figure is the very short range in which the 10th order convergence of the cG(5) method is recovered; with only 16 digits of precision, the dominating contribution to the total error is the accumulated round-off error.We also note that for both methods, one may find an optimal size of the time step ∆t for which both contributions to the total error are balanced.
In Figure 4, results are presented for an investigation of the influence on both the step size ∆t and the polynomial degree q in the cG(q) method.As expected, the minimal error is obtained when both the polynomial degree q and the step Fig. 2 Accurate reference solution for the three components of the Lorenz system on the interval [0, 1000] with the x and y components plotted in blue and green respectively (and almost overlaid) and the z component in red.size are maximal.Maximising the step size minimizes the influence of numerical round-off errors (the term ∆t − 1 /2 ), and as a consequence the polynomial degree q must be large in order to suppress the discretisation error (the term ∆t 2q ).

The Van der Pol oscillator
We next consider the Van der Pol oscillator, given by the second order ODE ü Rewritten as a system of first order equations, it reads We compute solutions on [0, 2µ] for µ = 10 3 and u(0) = (2, 0).This configuration is used as a test problem for ODE solvers in [25].For large values of the parameter µ, the solution quickly approaches a limit cycle.
As for the Lorenz system, the stability factors(s) grow very rapidly (exponentially), as indicated in Figures 5 and 6.However, the rapid growth is localised in time close to T ≈ 807 • n for n ∈ N.For times before or after these points of instability, the stability factor is of moderate size.This means that solutions are difficult to compute only at points near the points of instability; that is, a solution may be easily computed at time t = 1000 but not at time t = 807.This rapid growth of stability factors is reflected in the growth of the error for numerical solutions as shown in Figures 7 and 8. Examining these plots in more detail, we notice that in accordance with the error estimate of Theorem 2 and Equation (10).For the numerical solutions studied in Figures 7 and 8, the discretisation error dominates for low order methods.As the polynomial degree q is increased, the error decreases until the point when the computational error starts to dominate.We notice that the baseline error is of size E ∼ 10 −14 for

Conclusions
We have proved error estimates accounting for data, discretisation and computational (round-off) errors in the numerical solution of initial value problems for ordinary differential equations.These error estimates quantify the accumulation rates for numerical round-off error as inversely proportional to the square root of the step size, and proportional to a specific computable stability factor.The effect  6) cG( 8) cG( 10) cG( 12) cG( 13) cG( 14) cG( 15) of round-off errors is mostly pronounced for large values of the stability factor, which includes both chaotic dynamical systems as well as long-time integration of systems which exhibit only a moderate growth of the stability factor.

z
T , e(T ) = z T , e(T ) − T 0 ż + Ā⊤ z, e dt = z T , e(T ) z, e dt, where e = U −u. Noting that Ā⊤ z, e = z, Āe and integrating by parts, we obtain z T , e(T ) = z(0), e(0) dt ≡ S C and C ′ p is a constant depending only on p.This completes the proof.
U .It follows by the Cauchy-Schwarz inequality that max

Fig. 1
Fig. 1 Growth of the stability factor S C (left) for the Lorenz system on the time interval [0, 1000] and a detailed plot on the time interval [0, 50] (right).

Fig. 5
Fig.5Growth of the computational stability factor S C (T ) for the Van der Pol oscillator(13).

Fig. 6
Fig.6Detail of growth of the computational stability factor S C (T ) for the Van der Pol oscillator(13).

Fig. 7 Fig. 8
Fig. 7 Growth of error for solutions of the Van der Pol oscillator (13) computed with time step ∆t = 10 −3 .