Bayesian ODE solvers: the maximum a posteriori estimate

There is a growing interest in probabilistic numerical solutions to ordinary differential equations. In this paper, the maximum a posteriori estimate is studied under the class of ν\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu $$\end{document} times differentiable linear time-invariant Gauss–Markov priors, which can be computed with an iterated extended Kalman smoother. The maximum a posteriori estimate corresponds to an optimal interpolant in the reproducing kernel Hilbert space associated with the prior, which in the present case is equivalent to a Sobolev space of smoothness ν+1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu +1$$\end{document}. Subject to mild conditions on the vector field, convergence rates of the maximum a posteriori estimate are then obtained via methods from nonlinear analysis and scattered data approximation. These results closely resemble classical convergence results in the sense that a ν\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu $$\end{document} times differentiable prior process obtains a global order of ν\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu $$\end{document}, which is demonstrated in numerical examples.


Introduction
Let T = [0, T ], T < ∞, f : T × R d → R d , y 0 ∈ R d and consider the following ordinary differential equation (ODE): where D denotes the time derivative operator. Approximately solving (1) on a discrete mesh T N = {t n } N n=0 , 0 = t 0 < t 1 < . . . < t N = T , involves finding a functionŷ such that y(t n ) ≈ y(t n ), n = 0, 1, . . . , N and a procedure for findingŷ is called a numerical solver. This is an important problem in science and engineering, and vast base of knowledge has accumulated, as summarised by, for example, Deuflhard and Bornemann (2002), Hairer et al. (1987), Hairer and Wanner (1996), and Butcher (2008).
Classically, the error of a numerical solver has been quantified in terms of the worst case error. However, in applications where a numerical solution is sought as a component of a larger statistical inference problem (see, e.g., Matsuda andMiyatake 2019 andKersting et al. 2020), it is desirable that the error can be quantified with the same semantic, that is to say, probabilistically (Hennig et al., 2015, Oates and. Hence the recent endeavour to develop probabilistic ODE solvers.
Probabilistic ODE solvers can roughly be divided into two classes, sampling based solvers and deterministic solvers. The former class includes classical ODE solvers that are stochastically perturbed (Abdulle and Garegnani, 2020, Conrad et al., 2017, Lie et al., 2019, Teymur et al., 2018, solvers that approximately sample from a Bayesian inference problem (Tronarp et al., 2019b), and solvers that perform Gaussian process regression on stochastically generated data (Chkrebtii et al., 2016). On the other hand, deterministic solvers formulate the problem as a Gaussian process regression problem, either with a data generation mechanism (Hennig and Hauberg, 2014, Kersting and Hennig, 2016, Magnani et al., 2017, Schober et al., 2014, Skilling, 1992 or by attempting to constrain the estimate to solve the ODE at each point on the mesh (John et al., 2019, Tronarp et al., 2019b. For computational reasons it is fruitful to select the Gaussian process prior to be of Markov type (Kersting and Hennig, 2016, Magnani et al., 2017, Tronarp et al., 2019b, as this reduces cost of inference from O(N 3 ) to O(N ) Särkkä, 2010, Särkkä et al., 2013). Because of the connection between inference with Gauss-Markov processes priors and spline interpolation (Kimeldorf and Wahba, 1970, Sidhu and Weinert, 1979, Weinert and Kailath, 1974, the Gaussian process regression approaches are intimately connected with the spline approach to ODEs (Schumaker, 1982, Wahba, 1973. The notion of Bayesian solvers was defined by Cockayne et al. (2019), which poses the approximation to the solution of the ODE as a Bayesian inference problem. Under particular conditions on the vector field, the solvers of Kersting and Hennig (2016), Magnani et al. (2017), Schober et al. (2019), and Tronarp et al. (2019b) produce the exact posterior, if in addition a smoothing recursion is implemented, which corresponds to solving the batch problem as posed by John et al. (2019). In some cases, the exact Bayesian solution can also be obtained by exploiting Lie theory (Wang et al., 2018).
In this paper, the Bayesian formalism of Cockayne et al. (2019) is adopted for probabilistic solvers and priors of Gauss-Markov type are considered. However, rather than the exact posterior, the maximum a posteriori (MAP) estimate is studied. Many of the aforementioned Gaussian inference approaches are related to the MAP estimate. They can similarly to classical solvers, be classified as explicit , semi-implicit (Tronarp et al., 2019b), and implicit, which correspond to cases under which conditions they produce the exact posterior. Due to the Gauss-Markov prior, the method of John et al. (2019) can be implemented efficiently by the extended Kalman smoother (Bell, 1994). Furthermore, the Gauss-Markov prior corresponds to a reproducing kernel Hilbert space (RKHS) of Sobolev type and the MAP estimate is equivalent to the optimum of a minimum norm problem with nonlinear constraints in the RKHS. This enables the use of results from scattered data approximation (Arcangéli et al., 2007, Wendland andRieger, 2005) to establish, under mild conditions, that the MAP estimate converges to the true solution at a high polynomial rate in terms of the fill-distance (or equivalently, the maximum step size).
The rest of the paper is organised as follows. In Section 2, the solution of the ODE (1) is formulated as a Bayesian inference problem and the associated MAP problem is stated. In Section 3, various methods for inference, which are based on Gaussian filtering and smoothing are presented. In the context of ODE solvers, two new variants are introduced, which are based on the iterated extended Kalman filter (Bell and Cathey, 1993) and the iterated extended Kalman smoother (Bell, 1994), respectively. In Section 4, the connection between MAP estimation and optimisation in a certain reproducing kernel Hilbert space is reviewed. In Section 6, the error of the MAP estimate is analysed, for which polynomial convergence rates in the fill-distance are obtained. These rates are demonstrated in Section 7, and the paper is finally concluded by a discussion in Section 8.

Notation
Let Ω ⊂ R, then for a (weakly) differentiable function u : Ω → R d , its (weak) derivative is denoted by Du, or sometimesu. The space of m times continuously differentiable functions from Ω to R d is denoted by C m (Ω, R d ). The space of absolutely continuous functions is denoted by AC(Ω, R d ). The vector valued Lesbegue spaces are denoted by L p (Ω, R d ) and the related Sobolev spaces of m times weakly differentiable functions are denoted by H m If p = 2, the equivalent norm is sometimes used. The Sobolev (semi-)norms are given by (Adams andFournier, 2003, Valent, 2013) Henceforth the domain and codomain of the function spaces will be omitted unless required for clarity. Furthermore, for a function φ with domain Ω ⊂ R, its left-limit at t is denoted by For a positive definite matrix Σ, its symmetric square root is denoted by Σ 1/2 , and the associated Mahalanobis norm of a vector a is denoted by a Σ = a T Σ −1 a.

A Probabilistic State-Space Model
The present approach involves defining a probabilistic state-space model, from which the approximate solution to (1) is inferred. This is essentially the same approach as that of Tronarp et al. (2019b). The class of priors considered is defined in Section 2.1 and the data model is introduced in Section 2.2.

The Prior
Let ν be a positive integer, F m ∈ R d×d , 0 ≤ m ≤ ν and, Γ ∈ R d×d a positive definite matrix, and define the following differential operator: The class of priors considered herein is then given by where W is a standard Wiener process onto R d , X(0) ∼ N (0, Σ(t − 0 )), and G Y is the Green's function associated with D on T with initial condition D m y(t 0 ) = 0, m = 0, . . . , ν. The Green's function is given by where E m = e m ⊗ I d , m = 0, . . . , ν, {e m } ν m=0 is the canonical basis on R ν+1 , I d is the identity matrix in R d×d , θ is Heaviside's step function, and F ∈ R d(ν+1)×d(ν+1) whose d × d blocks are given by By construction, (5) has a state-space representation, which is given by the following stochastic differential equation (Øksendal, 2003) where X takes values in R d(ν+1) and the mth sub-vector of X is given by X m = D m Y and takes values in R d for 0 ≤ m ≤ ν. The transition densities for X are given by (Särkkä and Solin, 2019) where

The Selection of Prior
Selecting the prior can be quite an intricate task. While ν determines the smoothness of the prior, the actual estimator will be of smoothness ν + 1 (see Section 4) and the convergence results of Section 6 pertain to the case when the solution is of smoothness ν + 1 as well. Consequently, if it is known that the solution is of smoothness α ≥ 2 then setting ν = α − 1 ensures the present convergence guarantees are in effect. Though it is likely convergence rates can be obtained for priors that are "too smooth" as well (see Kanagawa et al. 2020 for such results pertaining to numerical integration). Once the degree of smoothness ν has been selected, the parameters Σ(t − 0 ), {F m } ν m=0 , and Γ need to be selected. Some common sub-families of priors are listed below.
• (Released ν times integrated Wiener process onto R d ). The process Y is a ν times integrated Wiener process if F m = 0, m = 1, . . . , ν. The parameters Σ(t − 0 ) and Γ are free. Though it is advisable to set Γ = σ 2 I d for some scalar σ 2 . In this case σ 2 can be fit (estimated) to the particular ODE being solved (see Section 5.1). This class of processes is denoted by Y ∼ IWP(Γ, ν).
• (Mateŕn processes of smoothness ν onto R). If d = 1 then Y is a Mateŕn process of smoothness ν if (cf. Hartikainen and Särkkä 2010) for some λ, σ 2 > 0, and Σ(t − 0 ) is set to the stationary covariance matrix of the resulting X process. If d > 1 then each coordinate of the solution can be modelled by an individual Mateŕn process.
Remark 1. Many popular choices of Gaussian processes not mentioned here also have state-space representations or can be approximated by a state-space model (Hartikainen and Särkkä, 2010, Karvonen and Sarkkä, 2016, Solin and Särkkä, 2014, Tronarp et al., 2018b. A notable example is Gaussian processes with squared exponential kernel (Hartikainen and Särkkä, 2010). See Chapter 12 of Särkkä and Solin (2019), for a thorough exposition.
The IWP class of priors corresponds to polynomial splines in the limit Σ(t − 0 ) → ∞ (Wahba, 1978), and produces methods that are intimately connected with classical Nordsieck methods . Hence it appears as a natural choice if no further information pertaining to the solution (1) is available. On the other hand, suppose the problem is semi-linear: Dy(t) = Λy(t) + ε(t, y(t)).

The Data Model
For the Bayesian formulation of probabilistic numerical methods, the data model is defined in terms of an information operator . In this paper, the information operator is given by where S f is the Nemytsky operator associated with the vector field f (Marcus and Mizel, 1973), 1 that is, Clearly, Z maps the solution of (1) to a known quantity, the zero function. Consequently, inferring Y reduces to conditioning on The function Z[Y ](t) can be expressed in simpler terms by use of the process X. That is, define the function then Furthermore, it is necessary to account for the initial condition, X 0 (0) = y 0 , and with small additional cost the initial condition of the derivative can also be enforced X 1 (0) = f (0, y 0 ).

Maximum A Posteriori Estimation
The MAP estimate for Y restricted to T N is given as the solution to the optimisation problem max where p(y(t 0:N ),ẏ(t 0:N )) is probability density of (Y,Ẏ ) restricted to T N . However, since X is a Markov process it is advantageous to pose the equivalent MAP problem for X restricted to T N , which in view of (8) is given by where h n = t n − t n−1 is the step size sequence and · Σ is the Mahalanobis norm associated with the positive definite matrix Σ.

Gaussian Inference
In the previous section, the following probabilistic state-space model was defined, and in this section methods for inference are developed. All these methods are based on Gaussian filtering and smoothing (Särkkä, 2013, Särkkä andSolin, 2019), where the vector field is linearised in various ways. Some of these methods have already appeared in the literature (John et al., 2019, Tronarp et al., 2019b, while the iterative ones, as applied to solving ODEs, are new. Define the information sets In Gaussian filtering and smoothing, only (approximations of) the mean and covariance matrix of X(t) are tracked. The (approximate) mean and covariance at time t, conditioned on Z (t) are denoted by µ F (t) and Σ F (t), respectively, and µ F (t − ) and Σ F (t − ) correspond to conditioning on Z (t − ), which are limits from the left 2 . The (approximate) mean and covariance conditioned on Z (T ) at time t are denoted by µ S (t) and Σ S (t), respectively.
Remark 2. The mean vectors µ F and µ S contain estimates of D m y, m = 0, . . . , ν for the solution of (1). The idea of tracking derivatives of the solution goes back to Nordsieck (1962). The connection between Nordsieck methods and Gaussian inference based solvers was discussed by Schober et al. (2019).

Inference with Affine Vector Fields
While the information operator is not affine in general, all the methods that are discussed in the sequel are based on replacing it with an affine approximations via affine approximation of the vector field. This results in an affine and Gaussian approximation to the state-space model, for which the Bayesian filtering and smoothing, and consequently MAP problem can be solved exactly (Särkkä, 2013). Hence it is instructive to consider inference for the case of affine vector fields first as it provides the template for the approximate methods. That is, let the vector field be affine: Then the information operator reduces to and the inference problem reduces to Gaussian process regression (Rasmussen and Williams, 2006) with a linear combination of function and derivative observations. In the spline literature this is known as (extended) Hermite-Birkhoff data (Sidhu and Weinert, 1979). In this case, the inference problem can be solved exactly with Gaussian filtering and smoothing (Kalman and Bucy, 1961, Kalman, 1960, Rauch et al., 1965, Särkkä, 2013, Särkkä and Solin, 2019, which is reviewed in the following. Before starting the filtering and smoothing recursions, the process X needs to be conditioned on the initial values Recall that the filtering distribution is right continuous with left limits. More specifically, µF (t − ) = µF (t) and ΣF (t − ) = ΣF (t), unless t ∈ TN , where they jump.

This is can be done by a Kalman update
The filtering mean and covariance for each interval [t n−1 , t n ) and n = 1, . . . , N is governed byμ which on the mesh is solved by where h n = t n − t n−1 , n = 1, . . . , N is the step size sequence. The prediction moments at t ∈ T N are then corrected according to the Kalman update The smoothing mean and covariance are continuous and evolve according to . On the mesh T N the smoothing moments are given by While Gaussian filtering and smoothing only provides the posterior for affine vector fields, it forms the template for nonlinear problems as well. That is, the vector field is the vector field is replaced by an affine approximation. Approaches for doing this are discussed in the following.

Approximate Gaussian Inference
For non-affine vector fields, only the update becomes intractable. Approximation methods involve different ways of approximating the vector field with an affine function whereafter approximate filter means and covariances are obtained by pluggingΛ and ζ into (23). In the present setting, only Taylor series methods are considered, some of which have already appeared in the literature , Tronarp et al., 2019b. Though, there exists other approximation methods based on statistical linear regression (García-Fernández et al., 2015, Lefebvre et al., 2002, Tronarp et al., 2018a, which are based on cubature integration Hennig, 2016, Tronarp et al., 2019b). For classification purposes it is fruitful to study these methods in terms how well they approximate Bayes' rule in terms of the local MAP problem: The method of classification is then provided by Definition 1.

Definition 1.
A Gaussian solver is said to be one of the following.
Remark 4. Definition 1 is merely an analogue to the classifications of standard numerical analysis (Hairer and Wanner, 1996). Classically, implicit methods have to solve a rootfinding problem, which is solved exactly for linear vector fields by semi-implicit methods, and is also typically solved exactly for vector fields that are constant in y by explicit methods.
Here the root finding problem is replaced by the constrained minimisation problem, namely the local MAP problem (26).
The simplest approach is to make a zeroth order approximation of f around µ F (t − n ), which is due to Schober et al. (2019), and the linearisation parameters are given bŷ If the vector field is constant in y, f (t, y) = ζ(t), then it is clear that the approximation (27) is exact and the local MAP problem (26) is solved exactly. Therefore this is an explicit method. The next best approach is to make a first order approximation around µ F (t − n ) (Tronarp et al., 2019b), and the linearisation parameters are given bŷ where J f is the Jacobian of f with respect to the second argument. This method is referred to as the extended Kalman filter (EKF) in signal processing literature (Särkkä, 2013), and was used to design probabilistic ODE solvers by Tronarp et al. (2019b). Clearly that the approximation (28) is exact if f is affine in y. Consequently, this method is semi-implicit. It is not possible to utilise Taylor series expansions to get exact inference for more general classes of vector fields. However, improvements can be made by iteratively relinearising the vector field at the filter update. Let µ 0 Inserting these parameters into (23) then gives µ l+1 F (t n ) and Σ l+1 F (t n ), which leads to the iterated extended Kalman filter (IEKF) (Bell and Cathey, 1993). It is easy to show that z(t n , µ F (t n )) holds at the fixed point, therefore this method is implicit. However, for the smoothing mean z(t n , µ S (t n )) = 0 will in general only hold when the vector field is affine.
By a slight modification of the IEKF, namely re-linearising around the smoothing mean rather than the filtering mean, the iterated extended Kalman smoother (IEKS) is retrievedΛ The smoothing mean and covariance at iteration l + 1, µ l+1 S (t) and Σ l+1 S (t), are then obtained by running the filter and smoother with the parameters in (30). Furthermore, the IEKS is just the Gauss-Newton algorithm for the maximum a posteriori trajectory (Bell, 1994), consequently, at the fixed point and under some conditions on the Jacobian of the vector field it can be shown that the fixed-point is at least a local optimum to the MAP problem (17) (Knoth, 1989). Moreover, the IEKS is just a clever implementation of the method of John et al. (2019) whenever the prior process has a state-space representation.

The Reproducing Kernel Hilbert Space Perspective
The correspondence between inference in stochastic processes and optimisation in reproducing kernel Hilbert spaces is well known (Kimeldorf and Wahba, 1970, Sidhu and Weinert, 1979, Weinert and Kailath, 1974. This correspondence is indeed present in the current setting as well, in the sense that MAP estimation as discussed in Section 2.3 is equivalent to optimisation in the reproducing kernel Hilbert space (RKHS) associated with Y and X (see Kanagawa et al. 2018, Proposition 3.6 for standard Gaussian process regression). The purpose of this section is thus to establish that the RKHS associated with Y , which establishes what function space the estimators discussed in Section 3.2 lie in. Furthermore, it is shown that the MAP estimate is equivalent to an interpolation problem in this RKHS, which implies properties on its norm. These results will then be used in the convergence analysis of the MAP estimate in Section 6.

The Reproducing Kernel Hilbert Space of the Prior
The RKHS of the Wiener process with domain T and codomain R d is the set (cf. van der Vaart and van Zanten 2008, section 10) with inner product given by Let Y ν+1 denote the reproducing kernel Hilbert space associated with the prior process Y , then Y ν+1 is given by (van der Vaart and van Zanten, 2008, lemmas 7.1, 8.1, and 9.1) the image of the operator where y 0 ∈ R d(ν+1) andẇ y ∈ L 2 (T, R d ). That is, and inner product is given by Since G Y is the Green's function of a differential operator of order ν + 1 with smooth coefficients, Y ν+1 can be identified as follows. A function y : Hence by similar arguments as for the released ν times integrated Wiener process, Proposition 1 holds (see proposition 2.6.24 and remark 2.6.25 of Giné and Nickl 2016 which is also the covariance function of Y . The linear functionals are continuous and their representers are given by where R (m,k) denotes R differentiated m and k times with respect to the first and second arguments, respectively. Furthermore, define the matrix η m s = η m,e 1 s . . . η m,e d s , and with notation overloaded in the obvious way, the following identities hold Since there is a one-to-one correspondence between the processes Y and X, the RKHS associated with X is isometrically isomorphic to Y ν+1 , and it is given by where x m is the mth sub-vector of x of dimension d. The kernel associated with X ν+1 is given by and the d × d blocks of P are given by and ψ s = P (t, s) is the representer of evaluation at s, In the following, the short-hands Y = Y ν+1 and X = X ν+1 are in effect.

Nonlinear Kernel Interpolation
Now consider the kernel interpolation problem where the feasible set is given by Define the following subspaces Y Since R N (m) is a closed linear sub-space of Y it follows that any y ∈ Y can be written as y = y + y ⊥ with y ∈ R N (m) and y ⊥ ∈ R ⊥ N (m), where R ⊥ N (m) is the orthogonal complement to R N (m). Similarly to other situations (Cox and O'Sullivan, 1990, Girosi et al., 1995, Kimeldorf and Wahba, 1971) our optimum can be expanded in a finite sub-space spanned by representers, which is the statement of Proposition 2.
Proposition 2. The solution to (40) is contained in R N (1).
Proof. Any y ∈ Y has the orthogonal decomposition y = y + y ⊥ , where y ∈ R N (1) and y ⊥ ∈ R ⊥ N (1). However, it must be the case that y ⊥ Y = 0, since 1 2 By Proposition 2 the optimal point of (40) can be written as However, it is more convenient to expand the optimal point in the larger subspace, where x is the equivalent element in X and or more compactly Here P is the kernel matrix associated with function value observations of X at T N . That is, (45) is up to a constant equal to the negative log-density of X restricted to T N . Proposition 3 immediately follows.

Proposition 3. The optimisation problem (40) is equivalent to the MAP problem (17).
Proof. Note that the kernel P associated with X (see (38)) is also is the covariance function for the X process. Consequently, x 2 X is up to some scaling and constant equal to the negative log-likelihood of X restricted to T N . Now X is a Markov process, so in view of (8) and (9) and the conclusion follows.

Uncertainty Quantification and Calibration
An important aspect of the probabilistic approach to numerical analysis is that the method produces an error estimate in the language of probability. Any such error estimate is only meaningful if it can be related to the actual error of the method. Denote the true solution of the problem by y * ∈ Y, then the optimal interpolant can be written as a linear projection of y * onto R N (1),ŷ = Π N y * . Suppose an estimate of L α y * is sought for some class of continuous linear functional L α = {L α : α ∈ α, L α : Y → R} indexed by some compact set α. Then by Cauchy-Schwartz inequality, the error can be written as where η α e,N is the representer of L α − L α • Π N . The norm of the error functional coincides with the posterior variance (see e.g., Briol et al. 2019 for the numerical integration case) Consequently, the worst case error for the class L α in the unit ball of Y is Unfortunately it appears this line of reasoning generally breaks down in the present case. That is, if the vector field is not affine, then the optimal interpolant can not be expressed as a linear projection of the true solution, and the situation is significantly more complicated. However, in Section 6 bounds are obtained for the nonlinear functionals indexed by t ∈ T defined by and its derivatives up to order ν, though not in terms of the posterior variance. In any case, as the error is generally problem dependent, the kernel parameters should be calibrated to the problem. How to do this for the noise scale of the prior is discussed in the sequel.

Calibrating the Noise Scale
For a full statistical treatment of the inference problem, the parameters F m m = 0, . . . , ν, Γ and Σ(t − 0 ) need to be estimated. Of particular importance in terms of calibrating uncertainty properly are Σ(t − 0 ) and Γ (see (7)). As discussed in Section 6, the parameters F m m = 0, . . . , ν can have a significant impact on the constants appearing in the convergence rates of the MAP estimator. Nevertheless, the present discussion is just concerned with the calibration of uncertainty.
It can be shown that the logarithm of (quasi-) likelihood as produced by the Gaussian inference methods is, up to an unimportant constant, given by (cf. Tronarp et al. 2019a Additionally, if Σ(t − 0 ) = σ 2Σ (t − 0 ) and Γ = σ 2Γ for some positive definite matricesΣ F (t − 0 ) andΓ, then it can be shown that the log-likelihood, up to some unimportant constant, reduces to (see Appendix C of Tronarp et al. 2019b where· denotes the output of the filter using the parameters (Σ(t − 0 ),Γ) rather than (Σ(t − 0 ), Γ). This yields the following proposition, which is proven in Appendix C of Tronarp et al. (2019b), mutatis mutandis.
and Γ = σ 2Γ for some positive definite matrices Σ(t − 0 ) andΓ, then the (quasi-) maximum likelihood estimate of σ 2 is given bŷ Bounds for worst case overconfidence and underconfidence under maximum likelihood estimation of σ 2 has recently been obtained by Karvonen et al. (2020). These results appear to carry over to the present setting for affine vector fields. However, it is not immediately clear how to generalise this to a larger class of vector fields.

Convergence Analysis
In this section, error bounds of the kernel interpolantŷ as defined by (40), and by Proposition 3 the MAP estimate is obtained. These bounds will be in terms of the fill-distance of the mesh T N , which is given by 4 In the following results from the scattered data approximation literature (Arcangéli et al., 2007, Wendland andRieger, 2005) are employed. More specifically, for any y ∈ Y, which satisfies the initial condition y(0) = y 0 , formally has the following representation where the error operator E is defined as Of course any reasonable estimatorŷ ought to have the property that Z[ŷ ](t) ≈ 0 for t ∈ T N . The approach is thus to bound Z[ŷ ](t) in some suitable norm, which in turn gives a bound on E[ŷ ](t). Throughout the discussion ν ≥ 1 is some fixed integer, which corresponds to the differentiability of the prior, that is, the kernel interpolant is in H ν+1 2 (T, R d ). Furthermore, some regularity of the vector field will be required, namely Assumption 1, given below.
the derivatives D α f are locally Lipschitz continuous for any multi-index α with |α| ≤ ν. A convergence analysis of the filter based on the zeroth order linearisation (see Eq. (27)) was carried out by Kersting et al. (2018), where they assumed that f was in C ν and D α f Lipschitz continuous and bounded for any |α| ≤ ν. That is, for the purposes of proving convergence of the MAP estimate, one extra degree of smoothness is imposed on f , while the rather strong assumptions on its derivatives are relaxed.
Assumption 1 will, without explicit mention, be in force throughout the discussion of this section. Essentially, it implies that (i) the model is well specified and (ii) the information operator is well behaved. This shall be made precise in the following.

Model Correctness and Regularity of the Solution
Since ν ≥ 1, Assumption 1 implies f is locally Lipschitz, and the classical existence and uniqueness results for the solution of Equation (1) apply. The extra smoothness on f ensures the solution itself is sufficiently smooth for present purposes. These facts are summarised in Theorem 1. For proof(s) refer to (Arnol'd, 1992, chapter 4, paragraph 32).

Theorem 1. Equation (1) admits a unique solution y
It immediately follows that the model is correctly specified in the sense that y * ∈ Y, which is Corollary 1.
Corollary 1 essentially ensures that there is an a priori bound on the norm of the MAP estimate. Namely ŷ Y ≤ y * Y , which follows from the definition (see (40)).

Properties of the Information Operator
By Proposition 1, Y correspond to the Sobolev space H ν+1 2 (T, R d ), consequently it is crucial to understand how the Nemytsky operator S f , and consequently Z, act on Sobolev spaces. Fortunately, for the Nemytsky operator, the work has already been done (Valent, 1985(Valent, , 2013, and Theorem 2 is immediate.

Theorem 2. Let U be an open subset of
where U some open subset of R d . The Nemytsky operator S f i , associated with the ith coordinate of f is then C 1 mapping from U onto H ν 2 (T, R) for i = 1, . . . , d. If in addition, U is convex and bounded, then for any y ∈ U there is number c 0 (y ) > 0 such that Proof. A direct application of Theorem 4.1 of (Valent, 2013, page 32) establishes the first claim. Furthermore, since the conditions of Theorem 4.5 by Valent (2013) are satisfied, the second claim follows (see point (ii) in the proof of Theorem 4.5 by Valent 2013, page 37).
Essentially, Theorem 2 establishes that S f i as a mapping of U onto H ν 2 (T, R) is locally Lipschitz. This property is inherited by the information operator, which is Proposition 5.
Proposition 5. In the same setting as Theorem 2. The ith coordinate of the information operator, Z i , is a C 1 mapping from U onto H ν 2 (T, R), for i = 1, . . . , d. If in addition, U is convex and bounded, then for any y ∈ U there is number c 1 (y , ν, f i , U ) > 0 such that Proof. The differential operator De T i is a C 1 mapping of U onto H ν 2 (T, R). Consequently, by Theorem 2 the same holds for the operator De T i − S f i = Z i . For the second part, the triangle inequality gives and clearly Consequently, by Theorem 2 the statement holds by selecting c 1 (y , ν, f i , U ) = 1 + c 0 (y )|f i | ν+1,U .

Convergence of the MAP Estimate
Proceeding with the convergence analysis of the MAP estimate can finally be done in view of the regularity properties of the solution y * and the information operator Z established by Corollary 1 and Proposition 5. Combining these results with Theorem 4.1 of Arcangéli et al. (2007) leads to Lemma 1.
In view of Lemma 1, for any estimatorŷ ∈ Y, its convergence rate can be established provided the following is shown: Neither (i) nor (ii) appear trivial to establish for any of the estimators discussed Section 3 in general. However, (i) and (ii) hold for the optimal (MAP) estimateŷ, which yields Theorem 3.
Theorem 3. Let q ∈ [1, ∞], then under the same assumptions as in Lemma 1, there exists a constant c 4 (y * , ν, f i , r) such that for δ < δ 0,ν the following holds for i = 1, . . . , d: , and Z i [ŷ] | T N ∞ = 0 by definition, henceŷ ∈ B(0, ρ Y ), and Lemma 1 gives for m = 1, . . . , ν By Proposition 1, the fact that ŷ Y ≤ y * Y , and the triangle inequality, there exists a constant c B (independent ofŷ and y * ) such that and thus the second bound holds by selecting For the first bound, the triangle inequality for integrals gives which combined with the second bound gives the first.
At first glance, it may appear that there is an appalling absence of dependence on T in the constants of the convergence rates provided by Theorem 3. This is not the case, the T dependence have conveniently been hidden in y * Y and possibly c 4 (y * , ν, f i , r). Now c 4 (y * , ν, f i , r) depends on c 0 (y * ) and |f i | ν+1,B(0,r) , unfortunately an explicit expression for c 0 (y * ) is not provided by Valent (2013), which makes the effect of c 4 (y * , ν, f i , r) difficult to untangle. Nevertheless, the factor y * Y does indeed depend on the interval length T . For example, let λ, y 0 ∈ R and consider the following ODĖ Setting Σ(t − 0 ) = I and selecting the prior IWP(I, ν) gives the following (in this case Consequently, the global error can be quite bad when λ > 0 and T is large even when δ is very small, which is the usual situation (cf. Theorem 3.4 of Hairer et al. 1987).

Discussion of Convergence Results
In the present context it is instructive to view the solution of (1) as a family of a quadrature problems whereẏ(t) = f (t, y(t)) is modelled by an element of H ν 2 (T, R d ). In view of Theorem 3, D mẏ converges uniformly to D mẏ * at a rate of δ ν−m−1/2 , m = 0, . . . , ν − 1, thus forẏ the same rate as for standard spline interpolation is obtained (Schultz, 1970). Furthermore, the rate obtained forŷ by Theorem 3 matches the rate for integral approximations using Sobolev kernels (Kanagawa et al., 2020, Proposition 1). That is, although dealing with a nonlinear interpolation/integration problem, Assumption 1 ensures the problem is still nice enough for the optimal interpolant to enjoy the classical convergence rates.
Global convergence of the filter associated with the zeroth order linearisation was examined by Kersting et al. (2018) under some different assumption on the vector field (see Remark 6). However, there the discussion of global convergence was limited to the priors in the class IWP(Γ, 1), for which a global convergence rate of δ is demonstrated, which matches the rate for the MAP estimator obtained by Theorem 3 (ν = 1). It is important to stress that the statement of Theorem 3 only pertains to the MAP estimate, and the method examined by Kersting et al. (2018) is in general not the MAP estimate (unless f (t, y) = ζ(t) for some ζ). Consequently, Theorem 3 is not a generalisation of the results by Kersting et al. (2018). However, in view of the discussion in Section 6.3 and the empirical findings in Section 7, it appears that Theorem 3 can be generalised to the estimator examined by Kersting et al. (2018), and indeed all the estimators discussed in Section 3.2.

Numerical Examples
In this section, the methods discussed in Section 3, the smoother based on the zeroth order method is denoted by EKS0, the smoother based on the first order method is denoted by EKS1, and the iterated extended Kalman smoother is denoted by IEKS. In particular the convergence rates of the MAP estimator from Section 6 are verified, which appear to be generalisable to the other methods as well.

The Logistic Equation
Consider the logistic equatioṅ which has the following solution.
The initial condition is set to y 0 = 15/100 and approximate solutions are computed by EKS0, EKS1, and IEKS on the interval [0, 1] on a uniform, dense using, grid with interval length 2 −12 using a prior in the class IWP(I, ν), ν = 1, . . . , 4. The filter updates only occur on a decimation of this dense grid by a factor of 2 3+m , m = 1, . . . , 8, which yields the fill-distances δ m = 2 m−10 , m = 1, . . . , 8. The L ∞ error of the zeroth and first derivative estimates of the methods are computed on the dense grid and compared to δ ν and δ ν−1/2 (predicted rates), respectively. The errors of the approximate solutions versus fill-distance are shown in Figure 1 and it appears that EKS0, EKS1, and IEKS all attain at worst the predicted rates once δ is small enough. It appears the rate for IEKS1/IEKS tapers off for ν = 4 and small δ. However, it can be verified that this is due to numerical instability when computing the smoothing gains as the prediction covariances Σ F (t − n ) become numerically singular for too small h n (see (25a)). The results are similar for the derivative of the approximate solution, see Figure 2.
Solution estimates by EKS0 and EKS1 are illustrutated in Figure 3 for ν = 2 and δ = 2 −4 (IEKS is very similar EKS1 and therefore not shown). While both methods produce credible intervals that cover the true solution, those of EKS1 are much tighter. That is, here the EKS1 estimate is of higher quality than that of EKS0, which is particularly clear when looking at the derivative estimates.

A Riccati Equation
The convergence rates are examined for a Riccati equation as well. That is, consider the following ODEẏ The initial condition is set to y 0 = 1. Just as for the logistic map, the solution is approximated by EKS0, EKS1, and IEKS on the interval [0, 1], using a IWP(I, ν), ν = 1, . . . , 4, for various fill-distances δ. The L ∞ errors of the zeroth and first derivative estimates are shown in Figures 4 and 5, respectively. The general results are the same as before, EKS1 and IEKS are very similar, and EKS0 is some orders of magnitude worse while still appearing to converge at a similar rate as the former. The numerical instability in the computation of smoothing gains is still present for large ν and small δ. Additionally, the output of the solvers for ν = 2 is visualised for step-sizes of h = 0.125 and h = 0.25 in Figures 6 and 7, respectively. It can be seen that already for h = 0.25, the solution estimate and uncertainty quantification of the IEKS, while EKS0 and EKS1 leave room for improvement in terms of both accuracy and uncertainty quantification. By halving the step-size EKS1 and IEKS become near identical (wherefore IEKS is not shown in Figure 6), though the error of the EKS0 is still oscillating quite a bit, particularly for the derivative.

Conclusion and Discussion
In this paper, the maximum a posteriori estimate associated with the Bayesian solution of initial value problems  was examined. Several Gaussian approximations were reviewed and classified as explicit , semi-implicit (Tronarp et al., 2019b), and implicit depending on when they solve a local MAP problem, which for explicit and semi-implicit methods means they also solve the global MAP problem in the present setting. Furthermore, it was shown that the MAP estimate corresponds to the optimal interpolant in a Sobolev space, which along with tools from nonlinear analysis Valent (2013) and semi-norm estimates for Sobolev functions with scattered zeroes (Arcangéli et al., 2007) was exploited to obtain convergence rates for the MAP estimate when the vector field is sufficiently smooth.
While the present results are encouraging, there is a lot of open topics for future research. For example, in the present setting the MAP estimate is just taken as a given. Though of course, in practice a reliable method to evaluate it is required. For this end the MAP problems (17) and (26a) need to be analysed more carefully. In particular it would be fruitful to establish which conditions on the vector field and the fill-distance are required to ensure the local optima are global optima for the MAP problem(s), or the convexity of the problem (Boyd and Vandenberghe, 2004). This would of course imply that  1989). If convergence becomes an issue, more advanced MAP estimators can be considered, such as Levenberg-Marquardt (Särkkä and Svensson, 2020) or alternate direction method of multipliers (Aravkin et al., 2017, Boyd et al., 2011, Gao et al., 2019. Furthermore, the empirical findings of Section 7 suggests, although not being MAP estimators, EKS0, EKS1, and IEKF can likely be given convergence statements similar to Theorem 3. It is not immediately clear what the most effective approach for this purpose is. On one hand, one can attempt to significantly extend the results of Kersting et al. (2018), which is more in line with how convergence rates are obtained for classical solvers. On the other hand, it seems like the local MAP problem (26a) can also be cast as a constrained optimisation problem in an RKHS Y n corresponding to the Sobolov space H ν+1 2 ([t n−1 , t n ], R d ), in which case a lot of the arguments from Section 7 could be recycled for local convergence analysis. In either case a complicating factor is that the filtering and smoothing defects z(t n , µ F (t n )) and z(t n , µ S (t n )), respectively, need to be controlled. Another issue is the need for a prior bound on the RKHS norm of µ S ∈ X (recall ρ in Lemma 1).
Another issue is the designing of the mesh, T N , which has been completely omitted in the present work, classically this is referred to as step size control (Hairer et al., 1987). This is in fact one of the more important aspects of designing solvers, which ought to use available computational resources economically while still producing solution estimates of acceptable accuracy. On the other hand, one of the possible advantages of the probabilistic approach is that in some situations it may be the case that an inaccurate solution estimate is acceptable if the provided uncertainty accurately reflects the error.
In statistical terminology, the mesh design/step size control is an issue of experimental design . A heuristic method for designing the mesh on the fly, as the filter is run, was proposed by Schober et al. (2019). This approach monitors the whitened residuals ξ(t n ) = S −1/2 (t n )(ζ(t n ) − C(t n )µ F (t − n )).
The statistics of ξ(t n ) can be calibrated on-line as the estimateσ 2 N can be calculated resursive using the filter output (see Proposition 4). This method is structurally similar to classical methods of step size control (Byrne and Hindmarsh, 1975). In the context of probabilistic ODE solvers, an information theoretic approach to step size control was recently suggested by Chkrebtii and Campbell (2019) for their sampling based solver (Chkrebtii et al., 2016). More generally, a Bayesian experimental design viewpoint for probabilistic numerics was recently explored by .
Another important topic that needs to be considered in practice is the stability properties of the filter and smoother, which is of utmost importance for integrating stiff systems (Hairer and Wanner, 1996). These stability properties depend on the parameters A(h n ), Q(h n ), C(t n ), and ζ(t n ). The latter two parameters depend on the linearisation method, while the latter two parameters depend on the selection of prior (an issue that was omitted from the discussion in Section 2.1.1) Moore, 1979, 1981). The most basic notion of stability is that of A-stability (Dahlquist, 1963), which for a fixed step-size h n = h considers the following test equatioṅ y(t) = Λy(t).
An ODE solver is said to be A-stable if its estimate of the solution of (62) converges to 0 as t → ∞ whenever the real part of the eigenvalues of Λ are strictly negative. The inference problem can be solved exactly by a Kalman filter (and EKF/IEKF). A peculiar result is that for a prior in the IWP(Γ, ν) class, the convergence to zero of the estimate does not depend on the spectrum of Λ but rather on its rank. That is, the Kalman filter estimate converges to zero as t → ∞ provided Λ is of full rank (Tronarp et al., 2019b, Theorem 2). While this is a solid start, stability analysis for the full class of priors discussed in Section 2.1 and linearisation methods of Section 3.2 ought to be carried out.