A probabilistic model for the numerical solution of initial value problems
Abstract
We study connections between ordinary differential equation (ODE) solvers and probabilistic regression methods in statistics. We provide a new view of probabilistic ODE solvers as active inference agents operating on stochastic differential equation models that estimate the unknown initial value problem (IVP) solution from approximate observations of the solution derivative, as provided by the ODE dynamics. Adding to this picture, we show that several multistep methods of Nordsieck form can be recast as Kalman filtering on q-times integrated Wiener processes. Doing so provides a family of IVP solvers that return a Gaussian posterior measure, rather than a point estimate. We show that some such methods have low computational overhead, nontrivial convergence order, and that the posterior has a calibrated concentration rate. Additionally, we suggest a step size adaptation algorithm which completes the proposed method to a practically useful implementation, which we experimentally evaluate using a representative set of standard codes in the DETEST benchmark set.
Keywords
Initial value problems · Nordsieck methods · Runge–Kutta methods · Filtering · Gaussian processes · Markov processes · Probabilistic numerics
Mathematics Subject Classification
60H30 · 62M05 · 65C20 · 65L05 · 65L06
1 Introduction
Numerical algorithms estimate intractable quantities from tractable ones. It has been pointed out repeatedly (Poincaré 1896; Diaconis 1988; O’Hagan 1992) that this process is structurally similar to statistical inference, where the tractable computations play the role of data in statistics, and the intractable quantities relate to latent, inferred quantities. In recent years, the search for numerical algorithms which return probability distributions over the solution for a given numerical problem has become an active area of research (Hennig et al. 2015). Several models and methods have been proposed for the solution of initial value problems (IVPs) (Skilling 1992; Chkrebtii et al. 2016; Schober et al. 2014a; Conrad et al. 2017; Kersting and Hennig 2016; Teymur et al. 2016). However, these probabilistic algorithms have no immediate connection to the extensive literature on this task in numerical analysis. Most importantly, such inference algorithms do not come with convergence analysis out of the box. The methods in Chkrebtii et al. (2016), Conrad et al. (2017) and Teymur et al. (2016) have convergence results, but their respective implementations are based on sampling schemes and, thus, do not offer guarantees for individual runs. The methods in Schober et al. (2014a) and Kersting and Hennig (2016) offer a deterministic execution and an analytical guarantee for the first step, but we will show that this guarantee is lacking for the whole integration domain.
In this paper, we present a class of probabilistic solvers which combine properties of the standard and the probabilistic algorithms. We formulate desiderata that users might have for a probabilistic numerical algorithm. We present one construction that fulfills these desiderata and we provide a MATLAB code^{1} which we compare empirically against other available codes. The construction uses the algebra of Gaussian inference to provide a Gaussian posterior distribution over the solution of an IVP. In particular, we show that the posterior mean can be understood as a multistep method in Nordsieck representation, and thus, analytical results about these methods carry over to the present algorithm. Additionally, we propose to interpret the posterior covariance as a measure of uncertainty or error estimator and argue that this interpretation can be analytically justified. In the context of a larger pipeline of empirical studies and numerical computations, the framework of probability modeling provides a common language to analyze the epistemic confidence in its result (Cockayne et al. 2017). In the framework of Cockayne et al. (2017), the code provides approximate Bayesian uncertainty quantification (Sullivan 2015) at low computational overhead and almost complete backwards compatibility to the MATLAB IVP solver suite.
1.1 Problem description
IVPs are a particularly deeply studied class of ODE-related tasks. Part of their significance is due to the Picard–Lindelöf theorem, which guarantees local existence and uniqueness of solutions. As a consequence, IVPs lend themselves to be solved by so-called step-by-step methods, where the solution is advanced iteratively on expanding meshes \(\Delta _{n+1} :={(\{t_0, \ldots , {t_n}\} \cup \{{t_{n+1}}\})} \supset \Delta _n\). The knots \(t_n\) of a mesh are either generated on a regular grid \(t_n:=t_0 + hn, n = 0, \ldots , N\) for some \(N \in \mathbb {N}\) and \(h = (T - t_0)N^{-1}\), or the step size h may vary per step, thus yielding \(t_n = t_0 + \sum _{i=1}^n h_i\).
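The two mesh constructions can be illustrated with a minimal Python sketch. The function names `regular_mesh` and `variable_mesh` are hypothetical and chosen only for this illustration.

```python
from itertools import accumulate


def regular_mesh(t0, T, N):
    """Knots t_n = t0 + h*n for n = 0..N, with h = (T - t0) / N."""
    h = (T - t0) / N
    return [t0 + h * n for n in range(N + 1)]


def variable_mesh(t0, steps):
    """Knots t_n = t0 + sum_{i=1}^n h_i for a list of step sizes h_i."""
    return list(accumulate(steps, initial=t0))


print(regular_mesh(0.0, 1.0, 4))              # [0.0, 0.25, 0.5, 0.75, 1.0]
print(variable_mesh(0.0, [0.5, 0.25, 0.25]))  # [0.0, 0.5, 0.75, 1.0]
```

In a step-by-step solver the variable mesh is not known in advance; the list of step sizes grows by one entry per accepted step.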

Probabilistic inference The computations should be operations on probability distributions.

Global definition The probabilistic model should not depend on the discretization mesh.

Deterministic execution When run several times on the same problem, the algorithm produces the same output each time.

Analytic guarantees The algorithm’s output should have desirable analytic properties.

Online execution The algorithm execution can be extended indefinitely when required.

Speed The execution time should not be prohibitively slow.

Problem adaptiveness The algorithm should automatically adapt parameters to problem and accuracy requirements.
2 From classical to probabilistic numerical algorithms
In this section, we explain and motivate the first two items from our list of desiderata in turn—probabilistic inference and global definition.
Table 1 Properties of existing PNM ODE solvers
| Method | Glob. def.? | Determ.? | Guarantees? |
| --- | --- | --- | --- |
| Skilling (1992) | \(\checkmark \) | \(\times \) | \(\times \) |
| Chkrebtii et al. (2016) | \(\checkmark \) | \(\times \) | \(\approx \) |
| Schober et al. (2014a) | \(\approx \) | \(\checkmark \) | \(\approx \) |
| Conrad et al. (2017) | \(\times \) | \(\times \) | \(\approx \) |
| Kersting and Hennig (2016) | \(\checkmark \) | \(\checkmark \) | \(\times \) |
| Teymur et al. (2016) | \(\times \) | \(\times \) | \(\approx \) |
| PFOS (this paper) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |
Accepting the probabilistic approach as a framework for plausible reasoning (Jeffreys 1969; Cox 1946; Hennig et al. 2015), we require a probability measure or law \(P_Y\) over the numerical solution \(Y_t\). The computations necessary for the construction of \(P_Y\) should be interpretable as (approximate) probabilistic inference. When such an interpretation is admissible, we call the resulting algorithm a probabilistic numerical method (PNM) for the purposes of this paper. A more rigorous definition has been given by Cockayne et al. (2017). The motivation behind this requirement is that there should not be an analysis gap between statistical and numerical computations. This is particularly beneficial when the differential equation solver is embedded in a longer chain of computations (Cockayne et al. 2017). In principle, this should make it possible to build fine-tuned methods that adapt to sources of data uncertainty and computational approximation during runtime and provide richer feedback on approximation quality, as recently empirically validated by Schober et al. (2014b) and Hauberg et al. (2015).
We propose to think of the probabilistic framework as providing more informative output than the point estimates returned by classical numerical algorithms (see also Hennig et al. 2015).
Furthermore, a probabilistic IVP solver shall be called globally defined on its input domain \(\mathbb {T}\), if its probabilistic interpretation does not depend on the discretization mesh \(\Delta \). PNMs satisfying this property provide two benefits. Firstly, users may evaluate the (predictive posterior) distribution \(P(Y_t\,|\,z_{[n]})\) for any \(t \in \mathbb {T}\). In particular, users may evaluate \(P(Y_t\,|\,z_{[n]})\) for \(t \notin {\Delta }\). Thus, users may request \(P(Y_{t_s}\,|\,z_{[n]}), t_s \in \Delta _S\) and the support of a user-defined mesh \(\Delta _S\) is not a separate requirement. Secondly, this implies that the inference can be paused and continued after every expansion from \(\Delta _n \mapsto \Delta _{n+1}\). In principle, this also enables iterative refinement of the solution quality based on its prediction uncertainty.
Table 1 lists PNM ODE solvers that have been proposed in the literature. A \(\checkmark \) indicates that the method satisfies a given property, a \(\times \) indicates that a method does not satisfy a given property, and a \(\approx \) indicates that a property holds with some restrictions. The listing shows that almost all methods proposed so far are globally defined. Furthermore, we see that the definition is independent of a method being sampling based or not. The method proposed by Conrad et al. (2017) is a generative process on subintervals \([{t_n}, {t_{n+1}}] \subset \mathbb {T}\) based on a numerical discretization. It is easy to construct two different meshes \(\Delta _n, \Delta _n'\) that define different distributions for \(Y_t\) in the case of \(y' = \lambda y\), and a general argument can be made from this example. In Teymur et al. (2016), the predictive posterior is only defined on the discretization mesh. This defect is not for lack of definition, but a consequence of the underlying numerical method the probabilistic algorithm is built upon. Since the method is defined on a windowed data frame, it is easy to construct a mesh such that the prediction \(Y_t\) at time t differs depending on which window \([t_{n-i}, \dots , t_{n+j}] \ni t\) the time t is considered to be part of.
The analysis in Schober et al. (2014a) proposes two main modes of operation: naive chaining and probabilistic continuation. Naive chaining is not a globally defined method since mesh points \(t_n\) are part of adjacent Runge–Kutta blocks, and the corresponding predictive posterior distribution \(P(Y_{t_n}\,|\,z_{[n]})\) is different for these two blocks. Probabilistic continuation is globally defined, but there has been no convergence theory for this mode yet. This paper fills this gap.
2.1 State-space models for Gauss–Markov processes
Our approximate model of the true solution y(t) is a vector \(\mathbf {x}(t) = (y^{(0)}(t), \ldots , y^{(q)}(t))^{\intercal }\) where \(y^{(i)}(t)\) is the true ith derivative of y(t) at time t. We represent the prior uncertainty about \(\mathbf {x}(t)\) by the distribution \(P(\mathbf {X}_t)\) of the random variable \(\mathbf {X}_t\)—or more generally as the measure or the law \(P_{\mathbf {X}}\) of the stochastic process \(\mathbf {X}\)—which is then conditioned on the observed values.
Here, we consider models where \(\mathbf {L}\) is the last standard basis vector \(\mathbf {e}_q\) and \(\mathbf {F} = \mathbf {U}_{q+1} + \mathbf {e}_q\mathbf {f}^{\intercal }\) is a (transposed) companion matrix. Here, \(\mathbf {U}_{q+1}\) denotes the upper shift matrix and the row vector \(\mathbf {f}^{\intercal }\) contains the coefficients in the last row of \(\mathbf {F}\). In this case, the vector-valued process \(\mathbf {X}_t = (X_{t,0}, \ldots , X_{t,q})^{\intercal }\) obtains the interpretation \(\mathbf {X}_t = (Y_t, Y_t', \ldots , Y^{(q)}_t)^{\intercal }\), because the form of \(\mathbf {F}\) and \(\mathbf {L}\) implies that the realizations of \(Y_t\) are q-times continuously differentiable on \(\mathbb {R}\). Later, we will also consider scaled systems \(\tilde{\mathbf {X}}_t = \mathbf {B} \mathbf {X}_t\) with an invertible linear transformation \(\mathbf {B}\). In this case, we denote by \(\mathbf {H}_i\) the matrix that projects onto the ith derivative \(Y^{(i)}_t = \mathbf {H}_i \tilde{\mathbf {X}}_t :=\mathbf {e}_i^{\intercal }\mathbf {B}^{-1} \tilde{\mathbf {X}}_t\). Two particular models of this type are the q-times integrated Wiener process (IWP(q)) and the continuous autoregressive processes of order q. Detailed introductions can be found, for example, in Karatzas and Shreve (1991), Øksendal (2003) and Särkkä (2006). SDEs can also be seen as path-space representations of more general temporal Gaussian processes arising in machine learning models (Särkkä et al. 2013).
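For the IWP(q), where \(\mathbf {f} = \mathbf {0}\) and \(\mathbf {F}\) is the upper shift matrix, the discrete-time transition matrix and process noise covariance over a step h have well-known closed forms. The sketch below writes them out in plain Python. The function names are hypothetical, the ordering \((Y, Y', \ldots , Y^{(q)})\) is the unscaled one from the text, and it is an assumption that these closed forms correspond to the discretization referred to elsewhere in the paper.

```python
from math import factorial


def iwp_transition(q, h):
    """A(h) = exp(F h) for the upper shift matrix F: entries h^(j-i)/(j-i)! for j >= i."""
    return [[h ** (j - i) / factorial(j - i) if j >= i else 0.0
             for j in range(q + 1)] for i in range(q + 1)]


def iwp_process_noise(q, h, sigma2=1.0):
    """Q(h) of the q-times integrated Wiener process with diffusion intensity sigma2."""
    def entry(i, j):
        p = 2 * q + 1 - i - j
        return sigma2 * h ** p / (p * factorial(q - i) * factorial(q - j))
    return [[entry(i, j) for j in range(q + 1)] for i in range(q + 1)]


# IWP(1) with h = 0.5: A = [[1, h], [0, 1]], Q = [[h^3/3, h^2/2], [h^2/2, h]]
print(iwp_transition(1, 0.5))
print(iwp_process_noise(1, 0.5))
```

Because both matrices depend on the step only through h, a variable step size simply means recomputing them with the current \(h_n\).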
Models of form (4) are also related to nonparametric spline regression models (Wahba 1990) which often have a natural interpretation in frequentist analysis (Kimeldorf and Wahba 1970). Conceptually, these models are a compromise between globally defined parametric models, which might be too restrictive to achieve convergence, and local parametric models, which might be too expressive to be captured by a globally defined probability distribution. Models of this type have been studied in the literature (Loscalzo and Talbot 1967; Andria et al. 1973), but the presentation here starts from other principles.
The choice of prior measure \(P_{\mathbf {X}}\) in Eq. (4) can be interpreted as a prior assumption or belief encoded in the algorithm, in the sense that the algorithm amounts to an autonomous agent. We emphasize that if one adopts this view, then the results reported in later sections amount to an external analysis of the effects of these assumptions. That is, we will show that if the agent is based on this prior measure \(P_{\mathbf {X}}\) with a likelihood to be defined in Sect. 2.3, they give rise to a posterior distribution with certain desirable properties. By contrast, one could also take a more restrictive standpoint internal to the algorithm and state that the proposed method works well if the true solution \(\mathbf {x}(t)\) is indeed a sample from \(P_{\mathbf {X}}\). This is expressly not our viewpoint here, and it would be a flawed argument, too, given that in practice, \(\mathbf {x}(t)\) is defined through the ODE, thus evidently not a sample from any stochastic process.
2.2 Data generation mechanism
In contrast, many numerical IVP solvers proceed in a step-by-step manner. Having computed a numerical approximation \(P_{Y\,|\,z_{[n]}}\) on the mesh \(\Delta _n\), a prediction \(y^{-}_{n+1}\) of \(y({t_{n+1}})\) is used to evaluate \(f({t_{n+1}}, y^{-}_{n+1})\) and the resulting output \(z_{n+1}\) is used to update the approximation \(P_{Y\,|\,z_{[n+1]}}\) on the extended mesh \(\Delta _{n+1}\). For example, in a deterministic IVP the data \((t_0, y_0)\) can be used to construct the observation \(z_0 = f(t_0,y_0)\) which satisfies the probabilistic interpretation of \(y'(t_0) \sim \delta (z_0 - y'(t_0))\). This serves as a corner case for the general situation. Setting \({t_{-1}}:=t_0\) and \(z_{-1} :=y_0\), it follows that \(y(t_0) \sim \delta (z_{-1} - y({t_{-1}}))\) and the initial value requires almost no special treatment. The concept is illustrated in Algorithm 1 and can, in principle, be extended indefinitely, at constant cost per step. The terms predict–evaluate–correct (PEC) and predictor–corrector methods have a more technical meaning in classic textbooks (Hairer et al. 1987; Deuflhard and Bornemann 2002), but the idea is common to many numerical IVP solvers. Chkrebtii et al. (2016) call the process of evaluating \(f({t_n},y^{-}_{t_n})\) with tentative \(y^{-}_{t_n}\) to generate \(z_n\) a model interrogation. From a statistical perspective, this concept of active model interrogation is similar to the sequential analysis of Wald (1973) and Owhadi and Scovel (2016).
Algorithm 1 conveys the general idea of a probabilistic ODE solver while omitting parameter tuning aspects like error control and step size selection. The exact form of line 5 depends on the choice of observation construction and data likelihood model. Without data, the prior induces a probability distribution on the hidden state \(\mathbf {{X}}_{t_n}\). It remains to construct an observation \(z_n\) and a likelihood model \(P(z_n\,|\,\mathbf {X}_{t_n})\).
2.3 Observation assumptions
From the analytical viewpoint external to the algorithm itself, of course, one does not expect the model assumption of a Gaussian likelihood, much less one with vanishing width, to hold in reality. The point of the analysis in Sect. 3.1 is to demonstrate that this model and evaluation scheme yield a method satisfying sufficient conditions to prove that its point estimate converges at a nontrivial order for some choices of state spaces, while simultaneously keeping computational cost very low (that is, very similar to that of classic multistep solvers). That is because the predictive posterior distributions \(P(\mathbf {X}_{t_n}\,|\,z_{[n]})\) can be computed by the linear-time algorithm known as Kalman filtering (Kalman 1960; Särkkä 2006, 2013). The marginal predictive posterior distributions given all data \(P(\mathbf {X}_t\,|\,z_{[N]})\) can be computed using the Rauch–Tung–Striebel smoothing equations (Rauch et al. 1965; Särkkä 2006, 2013). Simultaneously, one can draw samples from the full joint posterior. These two operations increase the computational cost minimally: They require additional computations comparable to those used for interpolation in classic solvers, but neither smoothing nor sampling requires additional evaluations of f. The computational complexity stays linear in the number of data points collected. If the full joint posterior is also required for some reason, it can also be constructed (Solin 2016; Grigorievskiy et al. 2016). As a second consequence, the computation becomes deterministic, which enables unit testing of the resulting code.
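For orientation, one predict–evaluate–correct step of such a filter can be sketched in textbook Kalman-filter form for a scalar ODE. The symbols \(\mathbf {A}(h)\), \(\mathbf {Q}(h)\), \(\mathbf {m}\), \(\mathbf {C}\), \(\mathbf {P}^{-}\), \(\mathbf {K}_n\), \(\mathbf {H}_i\) and \(R^2\) follow their use elsewhere in the text; the residual \(r_n\) and the innovation variance \(s_n\) are names assumed here, and the arrangement is the standard recursion rather than a quotation of the paper's numbered equations.

\[
\begin{aligned}
\mathbf {m}^{-}_{t_n} &= \mathbf {A}(h_n)\,\mathbf {m}_{t_{n-1}}, &
\mathbf {P}^{-}_{t_n} &= \mathbf {A}(h_n)\,\mathbf {C}_{t_{n-1}}\,\mathbf {A}(h_n)^{\intercal } + \mathbf {Q}(h_n),\\
z_n &= f(t_n, \mathbf {H}_0 \mathbf {m}^{-}_{t_n}), &
r_n &= z_n - \mathbf {H}_1 \mathbf {m}^{-}_{t_n},\\
s_n &= \mathbf {H}_1 \mathbf {P}^{-}_{t_n} \mathbf {H}_1^{\intercal } + R^2, &
\mathbf {K}_n &= \mathbf {P}^{-}_{t_n} \mathbf {H}_1^{\intercal }\, s_n^{-1},\\
\mathbf {m}_{t_n} &= \mathbf {m}^{-}_{t_n} + \mathbf {K}_n r_n, &
\mathbf {C}_{t_n} &= (\mathbf {I} - \mathbf {K}_n \mathbf {H}_1)\,\mathbf {P}^{-}_{t_n}.
\end{aligned}
\]

A likelihood of vanishing width corresponds to \(R^2 = 0\). Each step costs one evaluation of f plus a fixed number of operations on \((q+1) \times (q+1)\) matrices, which is where the linear overall complexity comes from.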
As a side remark, we note some obvious restrictions of the combination of Gaussian (process) prior and likelihood used here: Since this combination means the posterior is always a Gaussian process, one cannot hope to accurately capture bifurcation events, higher-order correlations in the discretization errors or other higher-order effects.
2.4 Detailed example
Figure 2 shows the state of the algorithm after 2 steps have been taken. The solution looks discontinuous, because the information of later updates \(z_n\) has not been propagated to previous time points \(t_m, m < n\). The last column of Fig. 2 shows the (predictive posterior) smoothing distribution wherein all the information is globally available.
3 Classical analysis for the probabilistic method
The most important test for any numerical algorithm is that it works in practice and under the requirements of potential users. The proposed probabilistic numerical algorithm has been motivated and derived from the computational properties that established classical algorithms provide. The classical algorithms have been studied intensely for over a century, to a point where the theory could almost be considered complete (Gear 1981). Thus, a newly proposed algorithm—even when motivated from a different background—should stand up to classical analysis.
While many specialized models and algorithms have been proposed, two standard classes of algorithms have become prevalent for the solution of (1): Runge–Kutta (RK) methods and (linear) multistep methods (LMMs) or combinations thereof (general linear methods, GLMs; Butcher 1985). These classes share a similar type of algorithmic structure and analysis: At time \({t_n}\), evaluate f with a numerical approximation \({y_{t_n}}\) to construct an updated numerical approximation \({y_{t_{n+1}}}\) from linear combinations of the function evaluations \({f_{t_n}}\) (exact definitions below). The update weights are parameters of a given method and, if chosen appropriately, can be shown to coincide with the Taylor approximation of the true solution y up to q terms.
In the following, we present results relating the newly proposed probabilistic method to existing algorithms, which allows us to transfer the known results to our method. Interpreting the probabilistic model from the viewpoint of classical analysis adds a justification to the assumptions made in the previous sections by saying that these assumptions—unintuitive as they may be at first—are the same assumptions that are implied by the application of a classical algorithm.
3.1 On the connection to Nordsieck methods
If \(z^{(M)}\) is used in the computation of (28), the resulting algorithm is called a P(EC)\(^{M}\) method. If Eq. (29) is solved up to numerical precision, the method is called a P(EC)\(^{\infty }\) method. Nordsieck methods with suitable weights \(\mathbf {l}\) can be shown to have local truncation error of order q or \(q+1\) (Skeel and Jackson 1977; Skeel 1979). More details can also be found in standard textbooks (Hairer et al. 1987; Deuflhard and Bornemann 2002).
We will now show how the Kalman filter (20)–(25) can be rewritten such that the mean prediction takes the form of (28). This makes it possible to analyze the proposed algorithm in light of classical Nordsieck method results, and can also guide the further development of the probabilistic approach with the experience of existing software.
Given these invariants, Eq. (37) has the structure of a multistep method written in Nordsieck form (28). The only difference is the changing weight vector \(\mathbf {K}_n\) (37) as compared to the constant weights in (28). Multistep methods with varying weights have been studied in the literature (Crouzeix and Lisbona 1984; Brown et al. 1989). These works are often in the context of variable step sizes \(h_n \ne h\), but variable-coefficient methods have also been studied for other purposes, for example cyclic methods (Albrecht 1978). These works have in common that the weights are free variables that are not limited through the choice of model class. As a consequence, determining optimal weights can be algebraically difficult (Hairer et al. 1987, §III.5).
Here, variable step sizes are easily obtained by working with representation (4) instead of (33) and computing (8) according to \(h_n\). In contrast to classical methods, the weights \(\mathbf {K}_n\) cannot be chosen freely, but are determined through the choice of model (4) and the evolution of the underlying uncertainty \(\mathbf {C}_{t_n}\). While Kersting and Hennig (2016) provide some preliminary empirical evidence that these adaptive weights \(\mathbf {K}_n\) might actually improve the estimate, more rigorous analysis is required for theoretical guarantees.
In fact, Skeel (and Jackson) (1976, 1977) consider more general propagation matrices \(\mathbf {S}\) for \({\mathbf {x}_{t_n}} = \mathbf {S} {\mathbf {x}_{t_{n-1}}}\) in Eq. (28). Every model of form (4) implies such a general propagation matrix by identifying \(\mathbf {S}_n = (\mathbf {I} - \mathbf {K}_n \mathbf {H}_1) \mathbf {A}(h_n)\). Thus, applying the Kalman filter to LTI SDE models is structurally equivalent to a variable-coefficient multistep method. This motivates the following definition and Algorithm 2 for the probabilistic solution of initial value problems.
Definition 1
A probabilistic filtering ODE solver (PFOS) is the Kalman filter applied to an initial value problem with an underlying Gauss–Markov linear, time-invariant SDE and Gaussian observation likelihood model.
As was the case in Algorithm 1, the exact form of lines 10–12 depends on the choice of likelihood model (cf. Kersting and Hennig 2016).
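To make the definition concrete, here is a minimal Python sketch of such a filter for a scalar ODE with an IWP(1) prior, fixed step size and exact derivative observations (\(R^2 = 0\)). It is an illustration under stated assumptions, not the paper's Algorithm 2 or its MATLAB code: the function name `pfos_iwp1` is hypothetical, the initial state is taken as known exactly, and no error control or hyperparameter adaptation is included.

```python
import math


def pfos_iwp1(f, t0, y0, h, n_steps, sigma2=1.0):
    """Kalman-filter ODE solver sketch: IWP(1) prior, exact derivative observations.

    The state is (y, y'). Returns a list of (t, mean_y, var_y) at the knots.
    """
    t = t0
    m = [y0, f(t0, y0)]               # assumption: y(t0) and y'(t0) known exactly
    C = [[0.0, 0.0], [0.0, 0.0]]      # hence zero initial covariance
    out = [(t, m[0], C[0][0])]
    for _ in range(n_steps):
        # Predict with A(h) = [[1, h], [0, 1]] and
        # Q(h) = sigma2 * [[h^3/3, h^2/2], [h^2/2, h]].
        mp = [m[0] + h * m[1], m[1]]
        p00 = C[0][0] + 2 * h * C[0][1] + h * h * C[1][1] + sigma2 * h ** 3 / 3
        p01 = C[0][1] + h * C[1][1] + sigma2 * h ** 2 / 2
        p11 = C[1][1] + sigma2 * h
        # Evaluate: interrogate f at the predicted solution value.
        t += h
        z = f(t, mp[0])
        # Correct: condition on y'(t) = z, i.e. H_1 = (0, 1) and R^2 = 0.
        K = [p01 / p11, 1.0]
        r = z - mp[1]
        m = [mp[0] + K[0] * r, mp[1] + K[1] * r]
        C = [[p00 - K[0] * p01, 0.0], [0.0, 0.0]]
        out.append((t, m[0], C[0][0]))
    return out


# Usage on y' = -y, y(0) = 1, integrated to t = 1.
t_end, mean_end, var_end = pfos_iwp1(lambda t, y: -y, 0.0, 1.0, 0.1, 10)[-1]
print(mean_end, math.exp(-1.0), var_end)
```

With a fixed `sigma2` the returned variance only reflects the prior scale; relating it to the actual error is the subject of the hyperparameter adaptation discussed later in the paper.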
We will now study the long-term behavior of the PFOS. In particular, we will ask what the long-term behavior of the sequence of Kalman gains \((\mathbf {K}_n)_{n=0,\ldots }\) is and how this will influence the solution quality. It can be shown that its properties are linked to properties of the discrete algebraic Riccati equation, of which the theory has largely been developed (Lancaster and Rodman 1995). Denote by \({\upgamma } : \mathbb {R}^{(q+1) \times (q+1)} \rightarrow \mathbb {R}^{(q+1) \times (q+1)}\) the function that maps the covariance matrix \(\mathbf {C}_{t_{n-1}}\) of the previous knot \({t_{n-1}}\) to the covariance matrix \(\mathbf {C}_{t_n}\) at the current knot \({t_n}\) (Eq. (38)). If there exists a (unique) fixed point \(\mathbf {C}^*\) of \({\upgamma }\), it is called the steady state of model (4). Associated with a fixed point \(\mathbf {C}^*\) is also a constant Kalman gain \(\mathbf {K}^*\) that is obtained at the (numerical) convergence of \(\mathbf {C}^*\).
We will now show that there is a subset of model (4) that converges to a steady state. This subsystem completely determines a constant Kalman gain \(\mathbf {K}^*\) at least in the case of the IWP(1) and IWP(2). Thus, just like in the equivalence result for the Runge–Kutta methods in Schober et al. (2014a), the result of the PFOS is equivalent (in the sense of numerically identical) after an initialization period to a corresponding classical Nordsieck method defined by the weight vector \({\mathbf {K}^*}\) and we can apply all the known theory of multistep methods to the mean of the PFOS.
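The convergence of the gain can be observed numerically by iterating the covariance recursion. The following sketch does so for the IWP(q) in the unscaled coordinates \((Y, Y', \ldots , Y^{(q)})\) with \(R^2 = 0\). The function names are hypothetical, the zero initial covariance is an assumption, and the closed form printed for comparison is a hand-derived fixed point of this particular recursion for \(q = 2\), stated here as an assumption rather than taken from the text.

```python
from math import factorial, sqrt


def matmul(X, Y):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def gain_sequence(q, h, n_steps, sigma2=1.0):
    """Iterate the covariance map for the IWP(q) with R^2 = 0; return all gains K_n."""
    d = q + 1
    A = [[h ** (j - i) / factorial(j - i) if j >= i else 0.0 for j in range(d)]
         for i in range(d)]
    At = [list(col) for col in zip(*A)]
    Q = [[sigma2 * h ** (2 * q + 1 - i - j)
          / ((2 * q + 1 - i - j) * factorial(q - i) * factorial(q - j))
          for j in range(d)] for i in range(d)]
    C = [[0.0] * d for _ in range(d)]      # assumption: exactly known initial state
    gains = []
    for _ in range(n_steps):
        ACAt = matmul(matmul(A, C), At)
        P = [[ACAt[i][j] + Q[i][j] for j in range(d)] for i in range(d)]
        # H_1 selects the first derivative, so H_1 P H_1^T = P[1][1].
        K = [P[i][1] / P[1][1] for i in range(d)]
        C = [[P[i][j] - K[i] * P[1][j] for j in range(d)] for i in range(d)]
        gains.append(K)
    return gains


h = 0.1
gains = gain_sequence(q=2, h=h, n_steps=30)
print(gains[0])    # gain at the first step
print(gains[-1])   # gain after 30 steps
print([(3 + sqrt(3)) / 12 * h, 1.0, (3 - sqrt(3)) / h])  # hand-derived limit
```

Note that the variance of the solution component itself keeps growing with n; only the subsystem that determines the gain settles.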
Proposition 1
The PFOS arising from the once integrated Wiener process IWP(1) is equivalent in its predictive posterior mean to the P(EC)\(^{1}\) implementation of the trapezoidal rule.
Proof
The following Theorem 1 for the IWP(2) requires a bit more algebra, but is based on the same principle.
Theorem 1
The predictive posterior mean of the IWP(2) with fixed step size h is a third-order Nordsieck method, when the predictive distribution has reached the steady state.
Proof
The proof proceeds in two steps. First, we show that the update equations induce a specific form for the covariance matrix \(\mathbf {C}_{t_n}\). Then, we will analyze individual entries.
Although Theorem 1 is only valid when the system has reached its steady state, we find that the convergence (visualized in Fig. 3) is rapid in practice. In the extreme case of \(q=1\) (not shown), in fact it is instantaneous, and Proposition 1 is valid from the second step onwards. This limitation could also be circumvented in practice by initializing \(\mathbf {C}_{t_{-1}}\) at steady-state coefficients, but this possibility is not required to achieve high-order convergence on the benchmark problems we considered.
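The \(q = 1\) case is small enough to work out by hand. The following is a sketch under assumptions not restated in this section: the standard IWP(1) matrices, \(\mathbf {H}_1 = (0, 1)\), \(R^2 = 0\), and a covariance whose derivative row and column vanish after an exact observation, \(\mathbf {C}_{t_{n-1}} = \operatorname {diag}(c, 0)\) for some \(c \ge 0\).

\[
\mathbf {A}(h) = \begin{pmatrix} 1 & h \\ 0 & 1 \end{pmatrix}, \quad
\mathbf {Q}(h) = \sigma ^2 \begin{pmatrix} h^3/3 & h^2/2 \\ h^2/2 & h \end{pmatrix}, \quad
\mathbf {P}^{-}_{t_n} = \begin{pmatrix} c + \sigma ^2 h^3/3 & \sigma ^2 h^2/2 \\ \sigma ^2 h^2/2 & \sigma ^2 h \end{pmatrix},
\]
\[
\mathbf {K}_n = \frac{\mathbf {P}^{-}_{t_n} \mathbf {H}_1^{\intercal }}{\mathbf {H}_1 \mathbf {P}^{-}_{t_n} \mathbf {H}_1^{\intercal }} = \begin{pmatrix} h/2 \\ 1 \end{pmatrix}, \qquad
\mathbf {S}_n = (\mathbf {I} - \mathbf {K}_n \mathbf {H}_1)\,\mathbf {A}(h) = \begin{pmatrix} 1 & h/2 \\ 0 & 0 \end{pmatrix},
\]
\[
y_n = y_{n-1} + \frac{h}{2}\left( y'_{n-1} + z_n \right) , \qquad y'_n = z_n, \qquad
\mathbf {C}_{t_n} = \operatorname {diag}\left( c + \sigma ^2 h^3/12,\; 0 \right) .
\]

Under these assumptions the gain depends on neither c nor \(\sigma ^2\), so it is constant as soon as the covariance has this form, and the mean update is the trapezoidal rule with \(z_n = f(t_n, y_{n-1} + h y'_{n-1})\) evaluated at the Euler-type prediction.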
Inspecting the weights of the IWP(2), we find that this method has not been considered previously in the literature, and, in particular, cannot be related to any of the typical formulas, such as Adams–Moulton or backward differentiation formulas. This is not surprising, since the result of this method has been constructed to be twice continuously differentiable, whereas there is no such guarantee for the solution provided by the typical methods. In fact, the probabilistic Nordsieck method is much more closely related to spline-based multistep methods such as Loscalzo and Talbot (1967), Loscalzo (1969), Byrne and Chi (1972) and Andria et al. (1973) since Gaussian process regression models have a one-to-one correspondence to spline smoothing in a reproducing kernel Hilbert space of appropriate choice (Kimeldorf and Wahba 1970; Wahba 1990). This also justifies the application of a full-support distribution, even though it is known that the solution will remain in a compact set. In the former case, the interpretation is one of average-case error, whereas in the latter, the bound corresponds to the worst-case error (Paskov 1993).
Figure 4 displays the work-precision diagram for the IWP(1) and IWP(2) applied to the exemplary problem of Sect. 2.4. The plot shows a good agreement between the theoretical rate and the observed rate of convergence.
We conclude this section by considering some implications of the probabilistic interpretation in contrast to other classical multistep methods.
Keeping all hyperparameters (order q, prior diffusion intensity \(\sigma ^2\) and step size h) fixed, the gain \(\mathbf {K}_n\) is completely determined, and, as a consequence, we could have chosen to fully solve the implicit Eq. (29) for the generation of \(z_n\). Solving (29) up to numerical precision can be interpreted as learning the true value of the model (4) at \(t_n\), which gives another justification for using \(R_n^2 = R^2 = 0\). Since the P(EC)\(^\infty \) and the P(EC)\(^{M}\) have the same order for all M (Deuflhard and Bornemann 2002), this argument can be extended to the case of the P(EC)\(^{1}\) implementation, which gives the most natural connection to the Kalman filter.
In fact, a P(EC)\(^M\), \(M > 1\), implementation would collect and put aside the values \(z_n^{(1)}, \ldots , z_n^{(M-1)}\), which seems unintuitive from an inference point of view, where it is natural to assume that more data should yield a better approximation. A natural question would be whether this is a case of diminishing returns of approximation quality for computational power, but this is beyond the scope of this paper.
It is natural to ask what happens in the case of the IWP(q), \(q > 2\). Using techniques from the analysis of Kalman filters, one can show that these models also contain a stable subsystem and that the weights \(\mathbf {K}_n\) will converge to a fixed point \(\mathbf {K}^*\), even for nonzero, but constant, \(R^2\). However, it remains unclear whether they will be practical. In particular, these methods might even be unstable for most spline models (Loscalzo and Talbot 1967). We have tested the IWP(q), \(q \in \{1, \ldots , 4\}\), empirically on the Hull et al. benchmark (see Sect. 5) and have observed that these converge in practice on these nonstiff problems.
3.2 Initialization via Runge–Kutta methods
Thus far, we have provided the analysis of the long-term behavior of the algorithm, when several Kalman filter steps have been computed and the steady state is reached. Crucially, a necessary condition for this analysis is that enough updates have been performed such that the observable space spans the entire state space, which is \(q+1\) updates in the case of the IWP(q).
Thus, the question remains how to initialize the filter. Schober et al. (2014a) have shown that there are Runge–Kutta steps that coincide with the maximum a posteriori (MAP) of the IWP(q) for \(q \le 3\). This requires \(q+1\) updates using a diffuse prior with \(\mathbf {C}_{t_{-1}}= \lim _{\mathcal {H} \rightarrow \infty } \mathbf {Q}(\mathcal {H})\). In practice, one takes a Runge–Kutta step with the corresponding formula and plugs the resulting values into the analytic expressions for \(\mathbf {m}_{t_1}\) and \(\mathbf {C}_{t_1}\) at \(t_1\). In addition to the cases presented in Schober et al. (2014a), we can report a match between a four-step Runge–Kutta formula of order four and the IWP(4). This match is obtained for the evaluation knots \(t_0 + c_ih\) with the vector \(\mathbf {c} = (0, 1/3, 1/2, 1)^{\intercal }\). Details and exact expressions are given in Appendix B. This approach is structurally similar to an algorithm given by Gear (1980) for the case of classical Runge–Kutta and Nordsieck methods.
However, we want to stress that the analysis by Schober et al. (2014a) is done under exactly the same model and with the same assumptions that have been presented here in different notation. Therefore, the initialization does not require a separate model and our requirement of a globally defined solver still holds.
Finally, it should be pointed out that this is only one feasible initialization. In cases where automatic differentiation (Griewank and Walther 2008) is available, this can be used to initialize the Nordsieck vector up to numerical precision and set \(\mathbf {C}_{t_{-1}}\) to \(\mathbf {0}\). Nordsieck (1962) originally proposed to start with an initial vector \(\mathbf {m}_{t_{-1}}= 0\), followed by \(q+1\) steps in the positive and \(q+1\) steps in the negative direction (that is, integrating backwards to the start). One interpretation is that the method uses \(\mathbf {m}_{t_{-1}}= \mathbb {E}[X_{t_{-1}}\,|\,\tilde{z}_{-1}, \dots , \tilde{z}_q]\), with tentative \(\tilde{z}_i\) computed out of this process.
4 Error estimation and hyperparameter adaptation
While the general algorithm described in Sect. 3.1 can be applied to any IVP at this stage, a modern ODE solver also requires the ability to automatically select sensible values for its hyperparameters. The filter has three remaining parameters to choose: the dimensionality q of the state space, the diffusion amplitude \(\sigma ^2\) and the step size h.
To obtain a globally consistent probability distribution, we fix \(q = 2\) throughout the integration to test the third-order method presented in Sect. 3.1. For the remaining two parameters, we first note that estimating \(\sigma ^2\) lends itself naturally to choosing the step size. To see this, one needs to make the connection between classical ODE solvers and the interpretation of the state-space model. In classical ODE solvers, \(h_n\) is determined based on local error analysis, that is, \(h_n\) is a function of the error \(e_{t_n}\) introduced from step \({t_{n-1}}\) to step \({t_n}\). Then, \(h_n\) is computed as a function of the allowed tolerance and the expected error, which is assumed to evolve similarly to the current error.
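The classical rule described above can be sketched as follows. This is the textbook proportional step size controller, not necessarily the exact rule used by the proposed solver. The function name `propose_step_size`, the safety factor and the clamping bounds are illustrative assumptions.

```python
def propose_step_size(h, err, tol, order, safety=0.9,
                      min_factor=0.2, max_factor=5.0):
    """Textbook step size proposal from a local error estimate.

    h      current step size
    err    estimated local error of the step just taken
    tol    tolerated local error
    order  order p of the method; the local error is assumed to
           scale like h**(p + 1)
    The defaults for safety, min_factor and max_factor are
    illustrative choices, not values taken from the text.
    """
    if err == 0.0:
        # No measurable error: grow by the largest allowed factor.
        return h * max_factor
    factor = safety * (tol / err) ** (1.0 / (order + 1))
    # Clamp the change to avoid erratic step size sequences.
    return h * min(max_factor, max(min_factor, factor))
```

If the estimated error equals the tolerance, the proposal shrinks the step only by the safety factor. An error far above the tolerance triggers the lower clamp instead of an arbitrarily small step.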
To allow for greater flexibility of the model, we allow the amplitude \(\sigma ^2\) to vary between steps, writing \(\sigma ^2_{t_n}\). Note that the mean values are then no longer independent of \(\sigma ^2\), because the factor no longer cancels out in the computation of \(K_n\) in Eq. (24). However, this situation is indeed intended: if there was more diffusion in \([t_{n-1}, t_n]\), we want a stronger update to the mean solution as the observed value is more informative. Additionally, Eq. (22) is independent of \(\sigma ^2_{t_n}\) or any other covariance information \(\mathbf {P}_{t_n}^{-}, \mathbf {Q}(h)\). Therefore, we can apply Eq. (22) before Eq. (21), update \(\sigma ^2_{t_n}\) and then continue to compute the rest of the Kalman step. This idea is similar in spirit to Jazwinski (1970, §11), but follows the general idea of error estimation in numerical ODE solvers, where only local error information is available.
At this point, the inference interpretation of numerical computation comes to bear: once the initial modeling decision (modeling a deterministic object with a probability measure to describe the uncertainty over the solution) is accepted, everything else follows naturally from the probabilistic description. Most importantly, there are no neglected higher-order terms, as they are all incorporated in the diffusion assumption.
This kind of lightweight error estimation is a key ingredient of probabilistic numerical methods: one goal of a probabilistic model is improved decisions under uncertainty. This uncertainty is necessarily a crude approximation, since a more accurate error estimator could be used to improve the overall solution quality. However, the reduction in computational effort up to a tolerated error is exactly what modern numerical solvers try to achieve.
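The reordering described above can be sketched for a scalar ODE as follows. The sketch makes several assumptions that are not taken from the text. The state holds unscaled derivatives \((u, u', u'')\) rather than a Nordsieck vector. The local diffusion estimate is the squared residual divided by the unit-diffusion innovation variance. The names `iwp_matrices` and `filter_step` are hypothetical.

```python
import math
import numpy as np

def iwp_matrices(q, h):
    """Transition matrix A(h) and unit-diffusion noise Q(h) of an IWP(q)."""
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for j in range(q + 1):
            if j >= i:
                A[i, j] = h ** (j - i) / math.factorial(j - i)
            p = 2 * q + 1 - i - j
            Q[i, j] = h ** p / (p * math.factorial(q - i) * math.factorial(q - j))
    return A, Q

def filter_step(f, t, m, C, h, q=2):
    """One Kalman step with a per-step diffusion estimate (illustrative)."""
    A, Q = iwp_matrices(q, h)
    H = np.zeros(q + 1)
    H[1] = 1.0                        # the derivative is observed
    m_pred = A @ m                    # predictive mean
    z = f(t + h, m_pred[0])           # evaluate f at the predicted solution
    r = z - H @ m_pred                # residual: needs no covariance
    # Assumed local estimate of the diffusion amplitude; the floor only
    # guards against a vanishing residual.
    sigma2 = max(r ** 2 / (H @ Q @ H), 1e-30)
    P = A @ C @ A.T + sigma2 * Q      # predictive covariance, computed late
    S = H @ P @ H                     # innovation variance
    K = P @ H / S                     # Kalman gain
    m_new = m_pred + K * r
    C_new = P - np.outer(K, K) * S
    return m_new, C_new, sigma2
```

For example, for \(u' = -u\) with \(u(0) = 1\), the filter can be started from the exact derivatives `m = np.array([1.0, -1.0, 1.0])` with `C = np.zeros((3, 3))` and stepped repeatedly with a fixed `h`.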
4.1 Global versus local error estimation
The results presented in the preceding sections pertain to the estimation of local extrapolation errors. It is a well-known aspect of ODE solvers (Hairer et al. 1987, §III.5) that the global error can be exponentially larger than the local error. More precisely, to scale the stochastic process such that the variance of the resulting posterior measure relates to the squared global error, the intensity \(\sigma ^2_n\) of the stochastic process must be multiplied by a factor \(\exp (L^*(T-t_0))\) (Hairer et al. 1987, Thm III.5.8), where \(L^*\) is a constant depending on the problem. Although related, \(L^*\) is not the same as the local Lipschitz constant L and is harder to estimate in practice (more details in Hairer et al. 1987, §III.5). We stress that this issue does not invalidate the probabilistic interpretation of the posterior measure as such. It is just that the scale of the posterior has to be estimated differently if the posterior is supposed to capture global error instead of local error. In practice, the global error estimate resulting from this rescaling is often very conservative.
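To get a feeling for how conservative this rescaling can be, the factor can be evaluated for illustrative numbers. The function name `global_scale_factor` and the example values of \(L^*\) and the interval length are assumptions chosen only for demonstration.

```python
import math

def global_scale_factor(L_star, t0, T):
    """Factor exp(L*(T - t0)) applied to the local diffusion intensity."""
    return math.exp(L_star * (T - t0))

# Illustrative values only: L* = 1 on an interval of length 10.
factor = global_scale_factor(1.0, 0.0, 10.0)
```

Even for this moderate constant, the intensity is inflated by more than four orders of magnitude.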
5 Experiments
To evaluate the model, we provide two sets of experiments. First, we qualitatively examine the uncertainty quantification by visualizing the posterior distribution of two example problems. We also compare our proposed observation assumption against the model described by Chkrebtii et al. (2016). Second, we more rigorously evaluate the solver on a benchmark and compare it to existing non-probabilistic codes. Our goal in this work is to construct an algorithm that produces meaningful probabilistic output at a computational cost that is comparable to that of classic, non-probabilistic solvers. The experiments will show that this is indeed possible. Other probabilistic methods, in particular that of Chkrebtii et al. (2016), aim for a more expressive, non-Gaussian posterior. In exchange, the computational cost of these methods is at least a large multiple of that of the method proposed here, or even polynomially larger. These methods and ours differ in their intended use cases: more elaborate but expensive posteriors are valuable for tasks in which uncertainty quantification is a central goal, while our solver aims to provide a meaningful posterior as additional functionality in settings where fast estimates are the principal objective.
5.1 Uncertainty quantification
We apply the probabilistic filtering ODE solver to two problems with attracting orbits: the Brusselator (Lefever and Nicolis 1971) and van der Pol’s equation (van der Pol 1926). The filter is applied twice to each problem, once with a fixed step size and once with the adaptive step size algorithm described in Sect. 4. To get a visually interesting plot, the fixed step size and the tolerance threshold were chosen as large as possible without causing instability. Both cases are modeled with a local diffusion parameter \(\sigma ^2_n\) which is estimated using the maximum likelihood estimator of Sect. 4. In the following plots, the samples use the scale \(\sigma ^2_n\) arising from the local error estimate. Because these systems are attractive, the global error correction mentioned in Sect. 4.1 would lead to significantly more conservative uncertainty.
The results in Fig. 6 demonstrate the effectiveness of the error estimator. This problem also demonstrates the quality and utility of the step size adaptation algorithm, since on the majority of the solution trajectory the algorithm is not limited by stability constraints. In the right plot, it can be seen how an increase in step size \(h_{n+1} > h_n\) can also lead to a reduction in posterior uncertainty. This is a consequence of \(\sigma ^2_{t_{n+1}}/\sigma ^2_{t_n}< 1\).
Figure 10 in the Appendix also displays the solution as a function of time.
Van der Pol’s equation is the second-order ODE \(u'' = \mu (1 - u^2)\, u' - u\) with a positive constant \(\mu > 0\). Originally, this model was used to describe vacuum tube circuits. The limit cycle alternates between a non-stiff phase of rapid change and a stiff phase of slow decay. The larger \(\mu \), the more pronounced both effects are. In our example, we set \(\mu = 1\) and integrate over one period with the initial value on the graph of the limit cycle. Exact values can be found in Hairer et al. (1987, §I.16).
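For use with a first-order solver, van der Pol’s equation is usually rewritten as a system in \((u, u')\). A minimal sketch, assuming the standard form \(u'' = \mu (1 - u^2) u' - u\), is given below. The function name `van_der_pol` is illustrative, and the initial value on the limit cycle is deliberately not hard-coded.

```python
import numpy as np

def van_der_pol(t, y, mu=1.0):
    """Right-hand side of van der Pol's equation as a first-order system.

    y[0] is the solution u and y[1] its derivative u'.
    The default mu = 1 matches the setting described in the text.
    """
    u, du = y
    return np.array([du, mu * (1.0 - u ** 2) * du - u])
```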
Figure 7 plots the filter results. Figure 11 displays the solution as a function of time. In the case of van der Pol’s equation, the benefit of step size adaptation is essentially nil, because conservative adaptation (in particular from a cautious starting step size) consumes the gains on the non-stiff parts. However, the example demonstrates the capability to learn an anisotropic diffusion model for individual components.
Finally, we compare two different strategies of quantifying the uncertainty. To this end, we compare our proposed model to the observation model proposed by Chkrebtii et al. (2016, §3.1). In this case, we set \(z_n = f({t_n}, (u_{t_n})_0)\) with \(u_{t_n}\sim \mathcal {N}(\mathbf {m}^-_{t_n}, \mathbf {C}^-_{t_n})\). Figure 8 shows samples of the posterior distribution, computed with the two different evaluation schemes. The scheme used here is not exactly the same as the one proposed by Chkrebtii et al.: their algorithm actually has cubic complexity in the number of f-evaluations; thus, it is limited to a relatively small number of evaluation steps. But our version captures the principal difference between their algorithm and the simpler filter proposed here: their algorithm draws separate samples involving independent evaluations of f at perturbed locations, while ours draws samples from a single posterior constructed from one single set of f-evaluations. As expected, the model of Chkrebtii et al. provides a richer output structure, for example, by identifying divergent solutions (right subplot) if the solver leaves the region of attraction. However, to obtain individual samples, the entire algorithm has to run repeatedly, so the cost of producing S samples is S times that of our algorithm, which produces all its samples in one run, without requiring additional evaluations of f.
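The difference between the two evaluation schemes can be sketched in a few lines. The function name `evaluate_f` and its signature are illustrative assumptions, and the sketch covers only the choice of evaluation point, not the full algorithm of Chkrebtii et al.

```python
import numpy as np

def evaluate_f(f, t, m_pred, C_pred, rng=None):
    """Form the observation z_n under one of two evaluation schemes.

    rng is None : evaluate f at the predictive mean (one deterministic
                  set of f-evaluations, as in the filter proposed here).
    rng given   : evaluate f at a draw u ~ N(m_pred, C_pred), which
                  mimics the perturbed evaluations of the sampling scheme.
    """
    if rng is None:
        u0 = m_pred[0]
    else:
        u = rng.multivariate_normal(m_pred, C_pred)
        u0 = u[0]
    return f(t, u0)
```

With the deterministic branch, a single run yields one Gaussian posterior from which any number of samples can be drawn afterwards. With the sampled branch, every posterior sample requires its own run with fresh evaluations of f.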
5.2 Benchmark evaluation
As is the case with many modern solvers, the theoretical guarantees do not extend to the full implementation with error estimation and step size control. Therefore, an empirical assessment is necessary to compare against trusted implementations. We compare the proposed Kalman filter to a representative set of standard algorithms on the DETEST benchmark set (Hull et al. 1972). While other standardized tests have been proposed (Crane and Fox 1969; Krogh 1973), DETEST has repeatedly been described as representative (Shampine et al. 1976; Deuflhard 1983). By choosing the same comparison criteria across all test problems and tested implementations, the benchmark provides the necessary data to make predictions on the behavior on a large class of problems.
Table: Summary of DETEST results
Method | Total fcn. evals. | Avg. % deceived | Max. error
Tolerance \(\epsilon = 10^{-3}\)
Extrapolation | 16553 | 2.0 | 7.8
Adams (Krogh) | 5394 | 1.1 | 5.3
Adams (Gear) | 9498 | 0.9 | 1.5
RK (4th, Kutta) | 8363 | 5.1 | 25.9
RK (6th, Butcher) | 11105 | 5.1 | 1788.1
RK (8th, Shanks) | 12355 | 6.3 | 1120.6
RK (3rd, Shampine) | 15085 | 5.9 | 2.4
RK (5th, Shampine) | 5785 | 11.2 | 9.5
Adams (Shampine) | 5692 | 6.5 | 7.7
PNM | 19091 | 0.2 | 1.5
Tolerance \(\epsilon = 10^{-6}\)
Extrapolation | 26704 | 0.1 | 2.3
Adams (Krogh) | 11353 | 1.4 | 7.3
Adams (Gear) | 18155 | 0.8 | 2.6
RK (4th, Kutta) | 30763 | 1.8 | 29.1
RK (6th, Butcher) | 23540 | 1.6 | 142.5
RK (8th, Shanks) | 20493 | 4.2 | 4.7
RK (3rd, Shampine) | 430975 | 0.0 | 1.9
RK (5th, Shampine) | 19879 | 0.0 | 1.1
Adams (Shampine) | 10777 | 3.6 | 6.3
PNM | 405469 | 0.0 | 1.4
Tolerance \(\epsilon = 10^{-9}\)
Extrapolation | 43054 | 0.0 | 0.6
Adams (Krogh) | 18984 | 0.5 | 4.0
Adams (Gear) | 38439 | 2.3 | 2.7
RK (4th, Kutta) | 146262 | 0.3 | 2.9
RK (6th, Butcher) | 58634 | 0.9 | 443.4
RK (8th, Shanks) | 39663 | 2.1 | 20.9
RK (3rd, Shampine) | 13587187 | 3.1 | 689.0
RK (5th, Shampine) | 103345 | 0.1 | 2.4
Adams (Shampine) | 18274 | 2.2 | 11.5
PNM | 12731730 | 4.5 | 1938.0
So, as expected, the error estimator is typically a conservative one.
While the probabilistic method does not achieve the same high performance as modern higher-order codes, the performance matches the results of a production Runge–Kutta code of the same order. This is of particular interest since applications in the low-accuracy regime could benefit the most from accurate error indicators (Gear 1981).
6 Conclusions
We proposed a probabilistic inference model for the numerical solution of ODEs and showed the connections with established methods. In particular, we showed how probabilistic inference in Gauss–Markov systems given by linear time-invariant stochastic differential equations leads to Nordsieck-type methods. The maximum a posteriori estimate of the once-integrated Wiener process IWP(1) is equivalent to the trapezoidal rule. The twice-integrated Wiener process IWP(2) is equivalent to a third-order Nordsieck-type method which can be thought of as a spline-based multistep method. We demonstrated the practicality of this probabilistic IVP solver by comparing against other state-of-the-art implementations.
The probabilistic formulation has already proven to be beneficial in larger chains of computations involving boundary value problems (Schober et al. 2014b; Hauberg et al. 2015). While the method presented in this paper is restricted to IVPs, there has also been work on extending the formalism of splines to boundary value problems (Mazzia et al. 2006, 2009). We expect that similar classical guarantees should be transferable to probabilistic boundary value problem solvers as well. Conversely, the probabilistic treatment of the IVP may be beneficial in bigger pipelines as well (cf. Chkrebtii et al. 2016).
Acknowledgements
Open access funding provided by Max Planck Society. The authors thank Hans Kersting for valuable discussions and helpful comments on the manuscript. The authors also thank the anonymous reviewers for their feedback, which helped to improve the presentation significantly.
References
Albrecht, P.: Explicit, optimal stability functionals and their application to cyclic discretization methods. Computing 19(3), 233–249 (1978)
Andria, G.D., Byrne, G.D., Hill, D.R.: Integration formulas and schemes based on g-splines. Math. Comput. 27(124), 831–838 (1973)
Brown, P., Byrne, G., Hindmarsh, A.: VODE: a variable-coefficient ODE solver. SIAM J. Sci. Stat. Comput. 10(5), 1038–1051 (1989)
Butcher, J.: General linear method: a survey. Appl. Numer. Math. 1(4), 273–284 (1985)
Byrne, G.D., Chi, D.N.H.: Linear multistep formulas based on g-splines. SIAM J. Numer. Anal. 9(2), 316–324 (1972)
Byrne, G.D., Hindmarsh, A.C.: A polyalgorithm for the numerical solution of ordinary differential equations. ACM Trans. Math. Softw. 1(1), 71–96 (1975)
Chkrebtii, O.A., Campbell, D.A., Calderhead, B., Girolami, M.A.: Bayesian solution uncertainty quantification for differential equations. Bayesian Anal. 11(4), 1239–1267 (2016)
Cockayne, J., Oates, C., Sullivan, T., Girolami, M.: Bayesian Probabilistic Numerical Methods. ArXiv e-prints (2017)
Conrad, P.R., Girolami, M., Särkkä, S., Stuart, A., Zygalakis, K.: Statistical analysis of differential equations: introducing probability measures on numerical solutions. Stat. Comput. 27(4), 1065–1082 (2017)
Cox, R.: Probability, frequency and reasonable expectation. Am. J. Phys. 14(1), 1–13 (1946)
Crane, P., Fox, P.: A comparative study of computer programs for integrating differential equations. Bell Telephone Laboratories, New York (1969)
Crouzeix, M., Lisbona, F.: The convergence of variable-stepsize, variable-formula, multistep methods. SIAM J. Numer. Anal. 21(3), 512–534 (1984)
Deuflhard, P.: Order and stepsize control in extrapolation methods. Numer. Math. 41(3), 399–422 (1983)
Deuflhard, P., Bornemann, F.: Scientific Computing with Ordinary Differential Equations. Springer, New York (2002)
Diaconis, P.: Bayesian numerical analysis. Stat. Decis. Theory Relat. Top. IV(1), 163–175 (1988)
Gear, C.: Numerical solution of ordinary differential equations: Is there anything left to do? SIAM Rev. 23(1), 10–24 (1981)
Gear, C.W.: Runge–Kutta starters for multistep methods. ACM Trans. Math. Softw. 6(3), 263–279 (1980)
Giné, E., Nickl, R.: Mathematical Foundations of Infinite-Dimensional Statistical Models, vol. 40. Cambridge University Press, Cambridge (2015)
Grewal, M.S., Andrews, A.P.: Kalman Filtering: Theory and Practice Using MATLAB. Wiley, New York (2001)
Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd edn. No. 105 in Other Titles in Applied Mathematics. SIAM, Philadelphia (2008)
Grigorievskiy, A., Lawrence, N., Särkkä, S.: Parallelizable Sparse Inverse Formulation Gaussian Processes (SpInGP). ArXiv e-prints (2016)
Hairer, E., Nørsett, S., Wanner, G.: Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, Berlin (1987)
Hartikainen, J., Särkkä, S.: Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In: IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2010, pp. 379–384 (2010)
Hauberg, S., Schober, M., Liptrot, M., Hennig, P., Feragen, A.: A random Riemannian metric for probabilistic shortest-path tractography. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science, vol. 9349. Springer, Cham (2015)
Hennig, P., Osborne, M.A., Girolami, M.: Probabilistic numerics and uncertainty in computations. Proc. R. Soc. Lond. A Math. Phys. Eng. Sci. 471(2179), 20150142 (2015)
Hull, T., Enright, W., Fellen, B., Sedgwick, A.: Comparing numerical methods for ordinary differential equations. SIAM J. Numer. Anal. 9(4), 603–637 (1972)
Jazwinski, A.H.: Stochastic Processes and Filtering Theory. Academic Press, London (1970)
Jeffreys, H.: Theory of Probability, 3rd edn. Oxford University Press, Oxford (1969)
Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Fluids Eng. 82(1), 35–45 (1960)
Karatzas, I., Shreve, S.E.: Brownian Motion and Stochastic Calculus. Springer, Berlin (1991)
Kersting, H.P., Hennig, P.: Active uncertainty calibration in Bayesian ODE solvers. In: Janzing, I. (eds.) Uncertainty in Artificial Intelligence (UAI), vol. 32. AUAI Press (2016)
Kimeldorf, G.S., Wahba, G.: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. 41(2), 495–502 (1970)
Krogh, F.T.: On testing a subroutine for the numerical integration of ordinary differential equations. J. ACM 20(4), 545–562 (1973)
Lancaster, P., Rodman, L.: Algebraic Riccati Equations. Clarendon Press, Oxford (1995)
Lefever, R., Nicolis, G.: Chemical instabilities and sustained oscillations. J. Theor. Biol. 30(2), 267–284 (1971)
Loscalzo, F.R.: An introduction to the application of spline functions to initial value problems. In: Greville, T.N.E. (ed.) Theory and Applications of Spline Functions, pp. 37–64. Academic Press, New York (1969)
Loscalzo, F.R., Talbot, T.D.: Spline function approximations for solutions of ordinary differential equations. SIAM J. Numer. Anal. 4(3), 433–445 (1967)
Mazzia, F., Sestini, A., Trigiante, D.: B-spline linear multistep methods and their continuous extensions. SIAM J. Numer. Anal. 44(5), 1954–1973 (2006)
Mazzia, F., Sestini, A., Trigiante, D.: The continuous extension of the B-spline linear multistep methods for BVPs on non-uniform meshes. Appl. Numer. Math. 59(3–4), 723–738 (2009). Selected Papers from NUMDIFF-11
Nordsieck, A.: On numerical integration of ordinary differential equations. Math. Comput. 16(77), 22–49 (1962)
O’Hagan, A.: Some Bayesian numerical analysis. Bayesian Stat. 4, 345–363 (1992)
Øksendal, B.: Stochastic Differential Equations: An Introduction with Applications, 6th edn. Springer, Berlin (2003)
Owhadi, H., Scovel, C.: Toward machine Wald. In: Ghanem, R., Higdon, D., Owhadi, H. (eds.) Springer Handbook of Uncertainty Quantification, pp. 1–35. Springer (2016)
Paskov, S.H.: Average case complexity of multivariate integration for smooth functions. J. Complex. 9(2), 291–312 (1993)
Poincaré, H.: Calcul des probabilités. Gauthier-Villars, Paris (1896)
Rauch, H.E., Striebel, C., Tung, F.: Maximum likelihood estimates of linear dynamic systems. AIAA J. 3(8), 1445–1450 (1965)
Särkkä, S.: Recursive Bayesian Inference on Stochastic Differential Equations. Ph.D. thesis, Helsinki University of Technology (2006)
Särkkä, S.: Bayesian Filtering and Smoothing. Cambridge University Press, Cambridge (2013)
Särkkä, S., Solin, A., Hartikainen, J.: Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Process. Mag. 30(4), 51–61 (2013)
Schober, M., Duvenaud, D., Hennig, P.: Probabilistic ODE Solvers with Runge–Kutta Means. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 739–747. Curran Associates, Inc. (2014a)
Schober, M., Kasenburg, N., Feragen, A., Hennig, P., Hauberg, S.: Probabilistic shortest path tractography in DTI using Gaussian Process ODE solvers. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014. Springer (2014b)
Shampine, L., Watts, H., Davenport, S.: Solving nonstiff ordinary differential equations - the state of the art. SIAM Rev. 18, 376–411 (1976)
Skeel, R.: Analysis of fixed-stepsize methods. SIAM J. Numer. Anal. 13(5), 664–685 (1976)
Skeel, R.D.: Equivalent forms of multistep formulas. Math. Comput. 33(148), 1229–1250 (1979)
Skeel, R.D., Jackson, L.W.: Consistency of Nordsieck methods. SIAM J. Numer. Anal. 14(5), 910–924 (1977)
Skilling, J.: Bayesian solution of ordinary differential equations. In: Smith, C.R., Erickson, G.J., Neudorfer, P.O. (eds.) Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics, vol. 50. Springer, Dordrecht (1992)
Solin, A.: Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression. Ph.D. thesis, Aalto University, Helsinki (2016)
Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)
Sullivan, T.J.: Introduction to Uncertainty Quantification, vol. 63. Springer, Berlin (2015)
Teymur, O., Zygalakis, K., Calderhead, B.: Probabilistic linear multistep methods. Adv. Neural Inf. Process. Syst. (2016)
van der Pol, B.: LXXXVIII. On “relaxation-oscillations”. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 978–992 (1926)
Wahba, G.: Spline Models for Observational Data. No. 59 in CBMS-NSF Regional Conferences Series in Applied Mathematics. SIAM (1990)
Wald, A.: Sequential Analysis. Courier Corporation, Mineola (1973)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.