Bayesian Numerical Methods for Nonlinear Partial Differential Equations

The numerical solution of differential equations can be formulated as an inference problem to which formal statistical approaches can be applied. However, nonlinear partial differential equations (PDEs) pose substantial challenges from an inferential perspective, most notably the absence of explicit conditioning formula. This paper extends earlier work on linear PDEs to a general class of initial value problems specified by nonlinear PDEs, motivated by problems for which evaluations of the right-hand-side, initial conditions, or boundary conditions of the PDE have a high computational cost. The proposed method can be viewed as exact Bayesian inference under an approximate likelihood, which is based on discretisation of the nonlinear differential operator. Proof-of-concept experimental results demonstrate that meaningful probabilistic uncertainty quantification for the unknown solution of the PDE can be performed, while controlling the number of times the right-hand-side, initial and boundary conditions are evaluated. A suitable prior model for the solution of the PDE is identified using novel theoretical analysis of the sample path properties of Mat\'{e}rn processes, which may be of independent interest.


Introduction
Classical numerical methods for differential equations produce an approximation to the solution of the differential equation whose error (called numerical error ) is uncertain in general. For nonadaptive numerical methods, such as finite difference methods, the extent of numerical error is often estimated by comparing approximations at different mesh sizes (Strikwerda, 2004), while for adaptive numerical methods, such as certain finite element methods, the global tolerance that was given as an algorithmic input is used as a proxy for the size of the numerical error (Grätsch and Bathe, 2005). On the other hand, probability theory provides a natural language in which uncertainty can be expressed and, since the solution of a differential equation is unknown, it is interesting to ask whether probability theory can be applied to quantify numerical uncertainty. This perspective, in which numerical tasks are cast as problems of statistical inference, is pursued in the field of probabilistic numerics (Diaconis, 1988;Hennig et al., 2015;Larkin, 1972;O'Hagan, 1992). A probabilistic numerical method (PNM) describes uncertainty in the quantity of interest by returning a probability distribution as its output. The Bayesian statistical framework is particularly natural in this context , since the output of a Bayesian PNM carries the formal semantics of posterior uncertainty and can be unambiguously interpreted. However, Wang et al. (2020) argued that a rigorous Bayesian treatment of ordinary differential equations (ODEs) may only be possible in a limited context (namely, when the ODE admits a solvable Lie algebra). This motivates the development of approximate Bayesian PNMs, which aim to approximate the differential equation in such a way that exact Bayesian inference can be performed, motivated in particular by challenging numerical tasks for which numerical uncertainty cannot be neglected.
This paper focuses on partial differential equations (PDEs); in particular, we focus on PDEs whose governing equations must be evaluated pointwise at high computational cost. To date, no (approximate or exact) Bayesian PNM for the numerical solution of nonlinear PDEs has been proposed. However, the cases of nonlinear ODEs and linear PDEs have each been studied. In Chkrebtii et al. (2016) the authors constructed an approximate Bayesian PNM for the solution of initial value problems specified by either a nonlinear ODE or a linear PDE. The approach relied on conjugacy of Gaussian measures under linear operators to cycle between assimilating and simulating gradient field "data" in parallel. (Recall that the gradient field of an ODE can be a nonlinear function of the state variable, necessitating a linear approximation of the gradient field to obtain a conjugate Gaussian treatment.) The case of ODEs was discussed by Cockayne et al. (2019), who proposed a general computational strategy called numerical disintegration (similar to a sequential Monte Carlo method for rare event simulation) that can in principle be used to condition on nonlinear functionals of the unknown solution, such as provided by the gradient field, and thereby instantiate an exact Bayesian PNM. However, the authors emphasised that such methods are typically impractical. Tronarp et al. (2019Tronarp et al. ( , 2021 demonstrated how nonlinear filtering techniques can be used to obtain low-cost approximations to the PNM described by Cockayne et al. (2019) and applied their approach to approximate the solution of ODEs. Further work on approximate Bayesian PNM for ODEs includes Bosch et al. (2021); Chkrebtii and Campbell (2019); Kersting and Hennig (2016); Kersting et al. (2020); Krämer and Hennig (2020); Schober et al. (2014Schober et al. ( , 2019; Skilling (1992); Teymur et al. (2016Teymur et al. ( , 2018. The focus of much of this earlier work was in the design of a PNM whose mean or mode coincides in some sense with a classical numerical method for ODEs (e.g. a Runge-Kutta method). For more detail on existing PNM for ODEs the reader is referred to the recent survey of Wang et al. (2020). Linear elliptic PDEs were considered by Cockayne et al. (2016); Owhadi (2015); Särkkä (2011) and Chkrebtii et al. (2016), and in this setting exact Gaussian conditioning can be performed. Chkrebtii et al. (2016) also presented a nonlinear PDE (Navier-Stokes), but a pseudo-spectral projection in Fourier space was applied at the outset to transform the PDE into a system of first-order ODEs -an approach that exploited the specific form of that PDE. Chen et al. (2021) extended Owhadi (2015) to nonlinear PDEs, focusing on maximum a posteriori estimation as opposed to uncertainty quantification for the solution of the PDE.
This paper presents the first (approximate) Bayesian PNM for numerical uncertainty quantification in the setting of nonlinear PDEs. Our strategy is based on local linearisation of the nonlinear differential operator, in order to perform conjugate Gaussian updating in an approximate Bayesian framework. Broadly speaking, our approach is a natural generalisation of the approach taken by Chkrebtii et al. (2016) for ODEs, but with local linearisation to address the additional challenges posed by nonlinear PDEs. The aim is to quantify numerical uncertainty with respect to the unknown solution of the PDE. An important point is that we consider only PDEs for which evaluation of either the right-hand side or the initial or boundary conditions is associated with a high computational cost; we do not aim to numerically solve PDEs for which a standard numerical method can readily be employed to drive numerical error to a negligible level, nor do we aim to compete with standard numerical methods in terms of CPU requirement. Such problems occur in diverse application areas, such as modelling of ice sheets, carbon and nitrogen cycles (Hurrell et al., 2013), species abundance and ecosystems (Fulton, 2010), each in response to external forcing from a meteorlogical model, or in solving PDEs that themselves depend on the solution of an auxiliary PDE, which occur both when operator splitting methods are used (MacNamara and Strang, 2016) and when sensitivity equations, expressing the rate of change of the solution of a PDE with respect to its parameters, are to be solved (Cockayne and Duncan, 2020;Petzold et al., 2006). These applications provide strong motivation for PNM, since typically it will not be possible to obtain an accurate approximation to the solution of the PDE and the rich, probabilistic description of numerical uncertainty provided by a PNM can be directly useful (e.g. .
The remainder of the paper is structured as follows: In Section 2 the proposed method is presented. The choice of prior is driven by the mathematical considerations described in Section 3. A detailed experimental assessment is performed in Section 4. Concluding remarks are contained in Section 5.

Methods
In Section 2.1 we present the general form of the nonlinear PDE that we aim to solve using PNM. The use of finite differences for local linearisation is described in Section 2.2. Then, in Section 2.3 we present our proposed approximate Bayesian PNM, discussing how computations are performed and how the associated uncertainty is calibrated.

Set-Up and Notation
For a set S ⊆ R d , let C 0 (S) denote the vector space of continuous functions c : S → R. For two multi-indices α, β ∈ N d 0 , we write α ≤ β if α i ≤ β i for each i = 1, . . . , d. For a multi-index β ∈ N d 0 , we let |β| = β 1 + · · · + β d and let C β (S) ⊆ C 0 (S) denote those functions c whose partial derivatives α ≤ β exist and are continuous in S. Let T ∈ (0, ∞) and let Γ be an open and bounded set in R d , whose boundary is denoted ∂Γ. Let β ∈ N d 0 be a multi-index and consider a differential operator and the associated initial value problem with Dirichlet boundary conditions (1) whose unique classical (i.e. strong) solution u ∈ C β ([0, T ] × Γ) is assumed to exist 1 . The task considered in this paper is to produce a probability distribution on C β ([0, T ] × Γ) that (approximately) carries the semantics of Bayesian inference for u; i.e. we seek to develop an (approximate) Bayesian PNM for the numerical solution of (1) . In particular, we are motivated by the problems described in Section 1, for which evaluation of f , g and h are associated with a high computational cost. Such problems provide motivation for a careful quantification of uncertainty regarding the unknown solution u, since typically it will not be possible to obtain a sufficient number of evaluations of f , g and h in order for u to be precisely identified.

Why Not Emulation?
Given that the dominant computational cost is assumed to be evaluation of f , g and h, it is natural to ask whether the uncertainty regarding these functions can be quantified using a probabilistic model, such as an emulator (Kennedy and O'Hagan, 2001). This would in principle provide a straight-forward Monte Carlo solution to the problem of quantifying uncertainty in the solution u of (1), where first one simulates an instance of f , g and h from the emulator and then applies a classical numerical method to solve (1) to high numerical precision. The problem with this approach is that construction of a defensible emulator is difficult; the functions f , g and h are coupled together by the nonlinear PDE in (1) and, for example, it cannot simultaneously hold that each of f , g and h are Gaussian processes. In fact, the challenge of ensuring that samples of f , g and h are consistent with the existence of a solution to (1) poses a challenge that is comparable with solving the PDE itself. This precludes a straight-forward emulation approach to (1) and motivates our focus on PNM in the remainder, where uncertainty is quantified in the solution space of (1).

Finite Difference Approximation of Differential Operators
If D is linear then the differential equation in (1) is said to be linear and one or more of the Bayesian PNM of Chkrebtii and Campbell (2019); Chkrebtii et al. (2016); Cockayne et al. (2016) may be applied (assuming any associated method-specific requirements are satisfied). If D is nonlinear then at most we can express D = P + Q, where P is linear and Q is nonlinear (naturally such representations are non-unique in general). For example, for Burgers' equation x , Q = u∂ x and also the trivial Q = D, P = 0. In this paper we aim, given a decomposition of D in terms of P and Q, to adaptively approximate Q by a linear operator, in order that exact Gaussian conditioning formulae can be exploited. Although we do not prescribe how to select P and Q, one should bear in mind that we aim to construct a linear approximation of Q, meaning that a decomposition should be identified that renders Q as close to linear as possible, to improve the quality of the approximation. The effect of different selections for P and Q is investigated in Section 4.2.
To adaptively construct linear approximations to the nonlinear differential operator Q, we propose to exploit traditional finite difference formulae (Strikwerda, 2004). Note that our conceptualisation of these approximations as linear operators for Gaussian conditioning is somewhat 1 The existence of a strong solution is a nontrivial assumption, since several PDEs admit only a weak solution; see Section 1.3.2 of Evans (1998) for definitions and background. A well-known class of classical numerical methods that also presuppose the existence of a strong solution are the radial basis function methods (Fornberg and Flyer, 2015). In Section 4.2 we consider, empirically, the performance of the method developed in this paper when applied to a PDE for which a strong solution does not exist.
non-traditional. Define a time discretisation grid t = [t 0 , t 1 . . . t n−1 ], where 0 = t 0 < t 1 < · · · < t n−1 ≤ T with the increment δ := t i − t i−1 fixed. For concreteness, consider Burgers' equation with P = ∂ t − ε∂ 2 x , Q = u∂ x . The following discussion is intended only to be informal. Suppose that the unknown solution u(t i−1 , ·) at time t i−1 has been approximated to accuracy O(δ) by u i−1 (·), as quantified by a norm · on C β ([0, T ] × Γ). Then we could adaptively build a linear approximation to Q at time t i as This provides an approximation D i = P + Q i to the original differential operator D, at time t i , with accuracy O(δ). To achieve higher order accuracy, we can use higher order approximations of Q. For example, letting ∂u ∂t i−1 (x) denote an approximation to ∂u ∂t (t i−1 , x), we could take The only requirement that we impose on finite difference approximations is that Q i uses (only) data that were gathered at earlier time points t i−1 , t i−2 , . . . , analogous to backward difference formulae. This is to ensure that the approximations Q i are well-defined before they are used in our method, which is described next. Henceforth we assume that an appropriate representation D = P + Q has been identified and an appropriate linear approximation to Q has been selected. The next section describes how probabilistic inference for u can then proceed.

Proposed Approach
In this section we describe our proposed method. Recall that we assume there exists a unique u ∈ C β ([0, T ]×Γ) for which (1) is satisfied. Since (1) represents an infinite number of constraints, it is not generally possible to recover u exactly with a finite computational budget. Our proposed method mirrors a general approach used to construct Bayesian PNM , in that we consider conditioning on only a finite number of the constraints in (1) and reporting the remaining uncertainty as our posterior. The case of nonlinear PDEs presents an additional challenge in that a subset of the constraints are nonlinear, and are therefore not amenable to exact Gaussian conditioning. To circumvent this issue, we condition on linear approximations to the constraints following the ideas developed in Section 2.2.

Prior Distribution
The starting point of any Bayesian analysis is the elicitation of a suitable prior distribuition. In our case, we minimally require a prior that is supported on C β ([0, T ] × Γ), since we a priori know that the solution u to (1) has this level of regularity. Our approach is rooted in Gaussian conditioning and thus the regularity of Gaussian process sample paths must be analysed. This analysis is somewhat technical and we therefore defer the discussion of prior elicitation to Section 3. For the remainder of this section we assume that a suitable Gaussian process prior U ∼ GP(µ, Σ) has been elicited. Here µ : is the covariance function; see Rasmussen and Williams (2006) for background. The random variable notation U serves to distinguish the true solution u of (1) from our probabilistic model for it. The specific choices of µ and Σ discussed in Section 3 have sufficient regularity for the subsequent derivations in this section to be well-defined.
Our first task is to condition on (or assimilate) a finite number of constraints that encode the initial condition u(0, x) = g(x), x ∈ Γ. For this purpose we use the spatial discretisation x, and condition on the data U (0, x) = g(x) at each x ∈ x \ ∂Γ. (For example, if Γ = [0, 1] and 0 = x 1 < x 2 < · · · < x m−1 < x m = 1, then we condition on U (0, x i ) = g(x i ) for i = 2, . . . , m − 1. The two boundary locations x 1 , x m ∈ ∂Γ are excluded since these constraints are assimilated as part of the boundary condition, which will shortly be discussed.) To perform conditioning, we use the following vectorised shorthand: Then let a 0 := (t 0 , x \ ∂Γ) denote the locations in [0, T ] × Γ where the initial condition is to be assimilated. At a 0 we have the initial data y 0 := g(a 0 ). These initial data are assimilated into the Gaussian process model according to the standard conditioning formulae (eq. 2.19; Rasmussen and Williams, 2006) where r, s ∈ [0, T ] × Γ.

Time Stepping
Having assimilated the initial data, we now turn to the remaining constraints in (1). Following traditional time-stepping algorithms, we propose to proceed iteratively, beginning at time t 0 and then advancing to t 1 , t 2 , and ultimately to t n−1 . At each iteration i we aim to condition on a finite number of constraints that encode the boundary condition u(t i , x) = h(t i , x), x ∈ ∂Γ, and the differential equation itself Du(t i , x) = f (t i , x), x ∈ Γ. For this purpose we again use the spatial discretisation x, and condition on the boundary data U (t i , x) = h(t i , x) at each x ∈ x∩∂Γ and the differential data Since D is nonlinear, there are no explicit formulae that can be used in general to assimilate the differential data, so instead we propose to condition on the approximate constraints D i U (t i , x) = f (t i , x), x ∈ x where D i = P + Q i is an adaptively defined linear approximation to D, which will be problem-specific and chosen in line with the principles outlined in Section 2.2. For a univariate function such as µ and a linear operator L, we denote µ L (r) = (Lµ)(r). For a bivariate function such as Σ, we denote Σ L (r, s) = L r Σ(r, s), where L r denotes the action of L on the r argument. In addition, we denote ΣL(r, s) = L s Σ(r, s) and we allow subscripts to be concatenated, such as Σ L,L = (Σ L ) L for another linear operator L .
Fix i ∈ {0, 1, . . . , n − 1}. Let b i = (t i , x ∩ ∂Γ) denote the locations in [0, T ] × ∂Γ where the boundary conditions at time t i are to be assimilated. At b i we have boundary data h(b i ). Correspondingly, we have differential data f (v i ) and we concatenate all data at time i into a single vector y i := [h(b i ) , f (v i ) ] , so that y i represents all the information on which (approximate) conditioning is to be performed. Upon assimilating these data we obtain The result of performing n time steps of the algorithm just described is a Gaussian process GP(µ n , Σ n ), to which we associate the semantics of an (approximate) posterior in a Bayesian PNM for the solution of (1). The Bayesian interpretation of GP(µ n , Σ n ) is reasonable since this distribution arises from the conditioning of the prior GP(µ, Σ) on a finite number of constraints that are (approximately) satisfied by the solution u of (1). This is clarified in the following statement: Lemma 1. The stochastic process U n obtained above is identical to the distribution obtained when U ∼ GP(µ, K) is conditioned on the dataset Proof. This follows immediately from the self-consistency property of Bayesian inference (invariance to the order in which data are conditioned), but for completeness we demonstrate their algebraic equivalence in Appendix A.1.
Remark 1 (Computational Complexity). The computational cost of our algorithm is not competitive with that of a standard numerical method. 2 However, we are motivated by problems for which f , g and h are associated with a high computational cost, for which the auxiliary computation required to provide probabilistic uncertainty quantification is inconsequential. Thus we merely remark that the iterative algorithm we presented is gated by the inversion of the matrix A i at the i th time step, the size of which is O(m), independent of i, and therefore the complexity of predicting the final state u(T, ·) of the PDE by performing n iterations of the above algorithm is O(nm 3 ). For comparison, direct Gaussian conditioning on the information in Lemma 1 would incur a higher computational cost of O(n 3 m 3 ), but would provide the joint distribution over the solution u(t, ·) at all times t ∈ [0, T ]. Although we do not pursue it in this paper, in the latter case the grid structure present in t and x could be exploited to mitigate the O(n 3 m 3 ) cost; for example, a compactly supported covariance model Σ would reduce the cost by a constant factor (Gneiting, 2002), or if the preconditions of Schäfer et al. (2021) are satisfied then their approach would reduce the cost to O(nm log(nm) log d+1 (nm/ )) at the expense of introducing an error of O( ). See also the recent work of de Roos et al. (2021).
Remark 2. The posterior mean µ i+1 can be interpreted as a particular instance of a radial basis method (Fornberg and Flyer, 2015), as a consequence of the representer theorem for kernel interpolants (Schölkopf et al., 2001). For brevity we do not explore this connection further, but we note that a similar connection was explored in detail in Cockayne et al. (2016).

Calibration of Uncertainty
The principal advantage of PNM over classical numerical methods is that they provide probabilistic quantification of uncertainty, in our case expressed in the Bayesian framework, which can be integrated along with other sources of uncertainty to facilitate inferences and decisionmaking in a real-world context. In order for our posterior distribution to faithfully reflect the scale of uncertainty about the solution of (1), we must allow the hyper-parameters of the prior model to adapt to the dataset. However, we do not wish to sacrifice the sequential nature of our algorithm and thus we seek an approach to hyper-parameter estimation that operates in real-time as the algorithm is performed.
To achieve this we focus on a covariance model Σ(·, ·; σ) with a scalar hyper-parameter denoted σ > 0, which is assumed to satisfy Σ(·, ·; σ) = σ 2 Σ(·, ·; 1). Such a σ is sometimes called a scale or amplitude hyper-parameter of the covariance model. From Lemma 1 it follows that σ directly controls the spread of the posterior and it is therefore essential that σ is estimated from data in order that the uncertainty reported by the posterior can be meaningful. To estimate σ, we propose to maximise the predictive likelihood of the "differential data" f (v i ), given the information collected up to iteration i − 1, for i ∈ {0, . . . , n − 1}, which can be considered as an empirical Bayes approach based on just those factors in the likelihood that correspond to the differential data. The reasons for focussing on the differential data (as opposed to also including the initial and boundary data) are twofold; first, the differential data constitutes the vast majority of the dataset, and second, this simplifies the computational implementation, described next.
At iteration i, the predictive likelihood for , and the observed differential data are f (v i ). Thus we select σ to maximise the full predictive likelihood of the differential data Crucially, the linear operators D i that we constructed do not directly depend on σ, and it is a standard property of Gaussian conditioning that Σ i . These facts permit a simple closed form expression for the maximiserσ of (3), namelŷ where M −1/2 denotes an inverse matrix square root; (M 1/2 ) 2 = M . Furthermore, it is clear from (4) that in practice one can simply run our proposed algorithm with the prior covariance model Σ(·, ·; 1) and then report the posterior covarianceσ 2 Σ i+1 (·, ·; 1), so that hyper-parameter estimation is performed in real-time without sacrificing the sequential nature of the algorithm.
Closed form expressions such as (4) are not typically available for other hyper-parameters that may be included in the covariance model, and we therefore assume in the sequel that any other hyper-parameters have been expert-elicited. This limits the applicability of our method to situations where some prior expert insight can be provided. However, we note that data-driven estimation of the amplitude parameter σ is able to compensate to a degree for mis-specification of other parameters in the covariance model.

Relation to Earlier Work
Here we summarise how the method just proposed relates to existing literature on Bayesian PNM and beyond.
The sequential updating procedure that we have proposed is similar to that of Chkrebtii et al. (2016) in the special case of a linear PDE. It is not identical in these circumstances though, for two reasons: First, Chkrebtii et al. (2016) incorporated the initial condition u(0, x) = g(x), x ∈ Γ, into the prior model, whereas we explicitly conditioned on initial data g(a 0 ) during the initialisation step of the method. This direct encoding of the initial condition in Chkrebtii et al. (2016) relies on g being analytically tractable in order that a suitable prior can be derived by hand. Our treatment of g as a black-box function from which initial data are provided is therefore more general. Second, in Chkrebtii et al. (2016) the authors advocated the use of an explicit measurement error model, whereas our conditioning formula assume that the differential data y i are exact measurements of U , as clarified in Lemma 1. For linear PDEs this assumption is correct, but it is an approximation in the case a nonlinear PDE. Our decision not to employ a measurement error model here is due to the fact that the scale of the measurement error cannot easily be estimated in an online manner as part of a sequential algorithm, without further approximations being introduced.
To limit scope, the adaptive selection of the t i or x j was not considered, but we refer the reader to Chkrebtii and Campbell (2019) for an example of how this can be achieved using Bayesian PNM. Note, however, that adaptive selection of a time grid may be problematic when evaluation of either f or h is associated with a high computational cost, since the possibility of taking many small time steps relinquishes control of the computational budget. For this reason, non-adaptive methods may be preferred in this context, since the run-time of the PNM can be provided up-front.
The choice of linearisation Q i was left as an input to the proposed method, with some guidelines (only) provided in Section 2.2. This can be contrasted with recent work for ODEs in Bosch et al. (2021); Tronarp et al. (2019Tronarp et al. ( , 2021, where first order Taylor series were used to automatically linearise a nonlinear gradient field. It would be possible to also consider the use of Taylor series methods for nonlinear PDEs. However, their use assumes that the gradient field is analytically tractable and can be differentiated, while in the present paper we are motivated by situations in which f is a black-box that can (only) be point-wise evaluated. The use of linearisations in PNM was also explored Chen et al. (2021), in the maximum a posteriori estimation context.
The combination of local linearisation and Gaussian process conditioning was also studied by Raissi et al. (2018), who considered initial value problems specified by PDEs, where the initial condition was random and the goal was to approximate the implied distribution over the solution space of the PDE. The authors observed that if the initial condition was a Gaussian process, then approximate conjugate Gaussian computation is possible when a finite difference approximation to the differential operator was employed. This provided a one-pass, cost-efficient alternative to the Monte Carlo approach of repeatedly sampling an initial condition and then applying a classical numerical method. Our work bears a superficial similarity to Raissi et al. (2018) and related work on physics-informed Gaussian process regression (e.g. Chen et al., 2020;Jidling et al., 2017;Wang and Berger, 2016;Wheeler et al., 2014), in that finite difference approximations enable approximate Gaussian conditioning to be performed. However, these authors are addressing a fundamentally different problem to that addressed in the present paper; we aim to quantify numerical uncertainty for a single (i.e. non-random) PDE. Accordingly, in this paper we emphasise issues that are critical to the performance of PNM, such as ensuring that the posterior is supported on a set of functions whose regularity matches that known to be possessed by the solution of the PDE (Section 3), and explicitly assessing the quality of the credible sets provided by our PNM (Section 4).

Prior Construction
This section is dedicated to presenting a prior construction that ensures samples generated from the prior are elements of C β ([0, T ] × Γ), the set in which a solution to (1) is sought. First, in Section 3.1 we introduce the technical notions of sample continuity and sample differentiability, clarifying what properties of the prior are required to hold. These sample-path properties are distinct from mean-square properties, the latter being more commonly studied. Then, in Section 3.2 we formally prove that the required properties holds for a particular Matérn tensor product, which we then advocate as a default choice for our PNM. These results may also be of independent interest.

Mathematical Properties for the Prior
This paper is concerned with the strong solution of (1), which is an element of C β ([0, T ] × Γ). It is therefore logical to construct a prior distribution whose samples also belong to this set. In particular, if the true solution has β i derivatives in the variable z i (for instance because the PDE features a term ∂ β i z i u, where we have set z = (t, x)), it would be appropriate to construct a prior (and hence a posterior) whose samples also have β i derivatives in the variable z i .
To make this discussion precise, we make explicit a probability space (Ω, F, P) and recall the fundamental definitions of sample continuity and sample differentiability for a random field X : I × Ω → R defined on an open, pathwise-connected set I ⊆ R d (i.e. I is an interval when d = 1): Definition 1 (Sample Continuity). X is said to be sample continuous if, for P-almost all ω ∈ Ω, the sample path X(·, ω) is continuous (everywhere) in I.
. Then X is said to be sample partial differentiable in the sequence of directions v if for P-almost all ω ∈ Ω, the following limit exists for all z ∈ I The limits above are taken sequentially from left to right. In the discussions that follow, we take v i ∈ {e 1 , e 2 , . . . , e d }, the standard Cartesian unit basis vectors of R d , in which case the usual partial derivatives are retrieved, and we use the shorthand D p X(z, v, ω) = ∂ α X(z, ω) to denote sample partial derivatives, where α = (α 1 , . . . , α p ), |α| = p, and α i denotes the number of times the variable z i is differentiated. A similar property, which is more easily studied than sample continuity (resp. sample differentiability), is mean-square continuity (resp. mean-square differentiability). This property is recalled next, since we will make use of mean-square properties en route to establishing sample path properties in Section 3.2.
For a mean-square differentiable Gaussian processes X, with mean function µ ∈ C α (I) and covariance function Σ ∈ C (α,α) (I × I), one has to denote mean-square partial derivatives, where again the v i are unit vectors parallel to the coordinate axes, |α| = p, and α i denotes the number of times the variable z i is differentiated. 3 See Stein (1999, Section 2.6). If X is mean-square continuous (resp. mean-square differentiable for all α ≤ β) at all z ∈ I, then we say simply that X is mean-square continuous (resp. order β mean-square differentiable ). In contrast to sample path properties, mean-square properties are often straight-forward to establish. In particular, if X is weakly stationary with autocovariance function Σ(z) = Σ(z, 0), then meaning that X is mean-square continuous whenever its autocovariance function Σ is continuous at 0 (Stein, 1999, Section 2.4).

Matérn Tensor Product
Our aim in this section is to establish sample path properties for a particular choice of prior, namely a Gaussian process with tensor product Matérn covariance, to ensure that prior (and posterior) samples are contained in C β ([0, T ] × Γ). Surprisingly, we are unable to find explicit results in the literature for the sample path properties of commonly used covariance models; this is likely due to the comparative technical difficulty in establishing sample path properties compared to mean-square properties. Our aim in this section is, first, to furnish a gap in the literature by rigorously establishing the sample differentiability properties of Gaussian processes defined by the Matérn covariance function and, second, to put forward a default prior for our PNM that takes values in C β ([0, T ] × Γ).
Definition 5 (Matérn Covariance). Let ν = p + 1 2 where p ∈ N. The Matérn covariance function is defined, for z, z ∈ R, as The proof of the following result can be found in Appendix A.2.
Proposition 1 (Mean-Square Differentiability of Univariate Matérn). Let I ⊆ R be an open set and let µ ∈ C p (I). Then any process X ∼ GP(µ, K ν ), with K ν as in (6) with ν = p + 1 2 , is order p mean-square differentiable. Furthermore, ∂ p ms X is mean-square continuous. Following a general approach outlined in Potthoff (2010), and focussing initially on the univariate case, our first step toward establishing sample differentiability is to establish sample continuity of the mean-square derivatives. Recall that, for two stochastic processes X,X on a domain I, we sayX is a modification of X if, for every z ∈ I, P(X(z, ω) =X(z, ω)) = 1. A modification of a stochastic process does not change its mean square properties, but sample path properties need not be invariant to modification. 4 For Gaussian processes, which are characterised up to modifications by their finite dimensional distributions, it is standard practice to work with continuous modifications when they exist (see for example Dudley, 1967;Marcus and Shepp, 1972). The proof of the following result can be found in Appendix A.3.
Proposition 2. Let X be as in Proposition 1. Then ∂ i ms X has a modification that is sample continuous for all 0 ≤ i ≤ p.
The second step is to leverage a fundamental result on the sample path properties of stochastic processes.
Theorem 1 (Criterion for Sample Differentiability; Theorem 3.2 of Potthoff (2010)). Let I ⊆ R d be an open, pathwise connected set, and consider a random field X : I × Ω → R such that E[X(z, ω) 2 ] < ∞ for all z ∈ I. Suppose X is first order mean-square differentiable, with meansquare partial derivatives D 1 ms X(·, e k , ω), 1 ≤ k ≤ d, themselves being mean-square continuous and having modifications that are sample continuous. Then X has a modificationX that is first order sample partial differentiable, with partial derivatives D 1X (·, e k , ω), 1 ≤ k ≤ d, themselves being sample continuous and satisfying, almost surely, D 1X (·, e k , ω) = D 1 ms X(·, e k , ω), 1 ≤ k ≤ d.
Since continuity of partial derivatives implies differentiability, the conclusion of Theorem 1 implies that X is first order sample differentiable.
Iterative application of Theorem 1 to higher order derivatives provides the following, whose proof can be found in Appendix A.4. Corollary 1. Fix p ∈ N. Let I ⊆ R d be an open, pathwise connected set and consider X ∼ GP(µ, Σ) with µ ∈ C p (I) and Σ ∈ C (p,p) (I × I), so that X has mean-square partial derivatives ∂ β ms X, β ∈ N d 0 , |β| ≤ p. Suppose ∂ β ms X is mean-square continuous and sample continuous for all |β| ≤ p. Then X has continuous sample partial derivatives ∂ β X, and they satisfy ∂ β X = ∂ β ms X almost surely, for all |β| ≤ p.
This provides a strategy to establish sample properties of Matérn processes, such as the following: Corollary 2 (Sample Differentiability of Univariate Matérn). Let I ⊆ R be an open interval and let X ∼ GP(µ, K ν ), with µ ∈ C p (I) and K ν as in (6). Then there exists a modificationX of X such that P(X ∈ C p (I)) = 1.
Proof. By Proposition 1, X is order p mean-square differentiable and ∂ i ms X is mean-square continuous for 0 ≤ i ≤ p. By Proposition 2, we may work with a modificationX of X such that ∂ i msX (= ∂ i ms X) is sample continuous for each 0 ≤ i ≤ p. One can directly verify that K ν ∈ C (p,p) (I × I); see the calculations in Appendix A.2. The result then follows from Corollary 1.
Corollary 2 is stronger than existing results in the literature, the most relevant of which is Scheuerer (2010, Theorem 5), who showed that samples from GP(µ, K ν+ ) are C p (I) for any > 0. Finally, we present a multivariate version of the previous result, which will be exploited in the experiments that we perform in Section 4. Importantly, we allow for different smoothness in the different variables, which is necessary to properly capture the regularity of solutions to PDEs. The proof of this result is contained in Appendix A.5.

Experimental Assessment
In this section, proof-of-concept numerical studies of three different initial value problems are presented. The first and simplest case is a homogeneous Burger's equation, a PDE with one nonlinear term and a solution that is known to be smooth. The second case is a porous medium equation, with two nonlinear terms appearing in the PDE and a solution that is known to be piecewise smooth, so that a classical solution does not exist and our modelling assumptions are violated. The third case returns to Burger's equation but now with forcing, to simulate a scenario where the right hand side f is a black box function that may be evaluated at a high computational cost. All three experiments are synthetic, in the sense that the functions f , g and h, which in our motivating task are considered to be black boxes associated with a high computational cost, are in actual fact simple analytic expressions, enabling a thorough empirical assessment to be performed.
In order to assess the empirical performance of our algorithm, two distinct performance measures were employed. The first of these aims to assess the accuracy of the posterior mean, which is analogous to how classical numerical methods are assessed. For this purpose the L ∞ error was considered: In practice the value of (7) is approximated by taking the maximum over the grid t × x on which the data y 0 , . . . , y n−1 were obtained. Accuracy that is comparable to a classical numerical method is of course desirable, but it is not our goal to compete with classical numerical methods in terms of L ∞ error. The second statistic that we consider assesses whether the distributional output from our PNM is calibrated, in the sense that the scale of the Gaussian posterior is comparable with the difference between the posterior mean µ n and the true solution u of (1): This performance measure will be called a Z-score, in analogy with traditional terminology from statistics. For the purpose of this exploratory work, values of Z that are orders of magnitude smaller than 1 are interpreted as indicating that the distributional output from the PNM is under-confident, while values that are orders of magnitude greater than 1 indicate that the PNM is over-confident. A PNM that is neither under nor over confident is said to be calibrated (precise definitions of the term "calibrated" can be found in Cockayne et al., 2021;Karvonen et al., 2020, but the results we present are straight-forward to interpret using the informal approach just described). Our goal in this work is to develop an approximately Bayesian PNM for nonlinear PDEs that is both accurate and calibrated. Again, in practice the supremum in (8) is approximated by the maximum over the t × x grid.
For all experiments below, we consider uniform temporal and spatial grids of respective sizes n = 2 i + 1, m = 2 j + 1, where i, j ∈ {2, 3, 4, 5, 6, 7}. This ensures that the grid points at which data are obtained are strictly nested as either the temporal exponent i or the spatial exponent j are increased. The prior mean µ(t, x) = 0 for all t ∈ [0, T ], x ∈ Γ, will be used throughout.

Homogeneous Burger's Equation
Our first example is the homogeneous Burger's equation with initial and boundary conditions and, for our experiments, α = 0.02, a = 1, b = 2, k = 1, T = 30 and L = 2π. These initial and boundary conditions were chosen because they permit a closed-form solution that can be used as a ground truth for our assessment. To linearise the differential operator Burger's equation we consider approximations of the form in (2), i.e.
where u i−1 (x) was taken equal to the predictive mean µ i−1 (t i , x) arising from the Gaussian process approximation U i−1 . Default Prior: Burger's equation has a first order temporal derivative term and a second order spatial derivative term, so following the discussion in Section 3 we consider as a default a Gaussian process prior with covariance function Σ that is a product between a Matérn 3/2 kernel K 3/2 (t, t ) for the temporal component, and a Matérn 5/2 kernel K 5/2 (x, x ) for the spatial component: Σ((t, x), (t , x )) = K 3/2 (t, t ; ρ 1 , σ 1 )K 5/2 (x, x ; ρ 2 , σ 2 ) (10) The notation in (10) makes explicit the dependence on the amplitude hyper-parameters σ 1 , σ 2 and the length-scale hyper-parameters ρ 1 , ρ 2 ; note that only the product σ := σ 1 σ 2 of the two amplitide parameters is required to be specified. For the experiments below σ was estimated as per (4), while the length-scale parameters were fixed at values ρ 1 = 6, ρ 2 = 3 (not optimised; these were selected based on a post-hoc visual check of the credible sets in Figure 1). From Theorem 2, this construction ensures that the prior is supported on C (1,2) ([0, T ]×[0, L]). Typical output from our PNM equipped with the default prior is presented in Figure 1.
An Alternative Prior: The Matérn covariance models assume only the minimal amount of smoothness required for the PDE to be well-defined. However, in this assessment the ground truth u is available (9) and is seen to be infinitely differentiable in (0, T ] × [0, 2π]. It is therefore interesting to explore whether a prior that encodes additional smoothness can improve on the default prior in (10). A prototypical example of such a prior is Σ((t, x), (t , x )) = C(t, t ; ρ 3 , σ 3 )C(x, x ; ρ 4 , σ 4 ) where C(z, z ; ρ, σ) := σ 1 + (z − z ) 2 ρ 2 −1 is the rational quadratic covariance model. For the experiments below σ was estimated as per (4), while the length-scale parameters were fixed at values ρ 3 = √ 3, ρ 4 = √ 3.

Results:
The error E ∞ was computed at 36 combinations of temporal and spatial grid sizes (n, m) and results for the default prior are displayed in the top row of Figure 2. It can be seen that the error E ∞ is mostly determined, in this example, by the finite length n of the temporal grid rather than the length m of the spatial grid. The slope of the curves in Figure 2 is consistent with a convergence rate of O(n −1 ) for the error E ∞ when spatial discretisation is neglected. The Z-scores asociated with the default prior (bottom row of Figure 2) appear to be bounded as (n, m) are increased, tending toward 0 but taking values of order 1 for all regimes, except for the smallest value (m = 5) of the spatial grid. This provides evidence that our proposed PNM, equipped with the default prior, is either calibrated or slightly under-confident but, crucially from a statistical perspective, it is not over-confident. Equivalent results for the alternative covariance model are presented for the error E ∞ in the top row of Figure 3 and for the Z-score in the bottom row of Figure 3. Here the error E ∞ is again gated by the size n of the temporal grid and decreases at a faster rate compared to when the default prior was used. This is perhaps expected because the rational quadratic covariance model reflects the true smoothness of the solution better than the Matérn model. However, the Z-scores associated with the alternative covariance model are considerably higher, appearing to grow rapidly as n → ∞ with m fixed. This suggests that the alternative covariance model is inappropriate, causing our PNM to be over-confident. We speculate this may be because u does not belong to the support of the rational quadratic covariance model, but note that the support of a Gaussian process can be difficult to characterise (Karvonen, 2021). These results support our proposed strategy for prior selection in Section 3.

Porous Medium Equation
Our second example is the porous medium equation which is more challenging compared to Burger's equation because the solution is only piecewise smooth, meaning that a strong solution does not exist and our modelling assumptions are violated. Furthermore, there are two distinct nonlinearities in the differential operator, allowing us to explore the impact of the choice of linearisation on the performance of the PNM. For our experiment we fix k = 2, so that the porous medium equation becomes and we consider the initial and boundary conditions with t 0 = 2, T = 8, L/2 = 10. These initial and boundary conditions were chosen because they permit a (unique) closed-form solution, due to Barenblatt (1952): The solution is therefore only piecewise smooth, with discontinuous first derivatives at x 2 = 12t 2/3 , which are inside of the domain [−L/2, L/2] for all t ∈ [t 0 , t 0 + T ].
Prior: Henceforth we consider the default prior advocated in Section 3 and Section 4.1, with amplitude σ estimated using maximum likelihood and length-scale parameters fixed at values ρ 1 = 1, ρ 2 = 2 (not optimised; based on a simple post-hoc visual check).
Choice of Linearisation: The differential operator here contains the nonlinear component Qu = (∂ x u) 2 + u∂ 2 x u that must be linearised. The first term (∂ x u) 2 appeared also in Burger's equation, and we linearise this term in an identical way to that used in Section 4.1. The second term u∂ 2 x u can be linearised in at least two distinct ways, fixing either u or ∂ 2 x u to suitable constant values adaptively based on quantities that have been pre-computed. Thus we consider the two linearisations where we recall that µ i−1 (t i , x) is the predictive mean arising from the Gaussian process approximation U i−1 . Through simulation we aim to discover which (if either) linearisation is more appropriate for use in our PNM.

Conservation of Mass:
In addition to admitting multiple linearisations, we consider the porous medium equation because when k > 1 it exhibits a conservation law, which is typical of many nonlinear PDEs that are physically-motivated. Specifically, integrating (11) with respect to x gives d dt and, from the fact that u = 0 for all x 2 ≥ 12t 2/3 , it follows that ∂ x (u k ) = ku k−1 ∂ x u = 0 for all x 2 ≥ 12t 2/3 and thus  methods (LeVeque, 2002) and symplectic integrators (Sanz-Serna, 1992). Interestingly, it is quite straight-forward to enforce this conservation law in our PNM by adding additional linear constraints to the system in Lemma 1. Namely, we add the linear constaints at each point i ∈ {1, . . . , n − 1} on the temporal grid. The performance of our PNM both with and without conservation of mass will be considered.
Results: Empirical results based on the linearisation Q (1) (without conservation of mass) are contained in Figure 4. In this case (and in contrast to our results for Burger's equation in Section 4.1), the error E ∞ is seen to be gated by the smaller of the finite length n of the temporal grid and the length m of the spatial grid. The Z-score values appear to be of order 1 as (n, m) are simultaneously increased, but are slightly higher than for Burger's equation, which may reflect the fact that the solution to the porous medium equation is only piecewise smooth. For increasing n with m fixed the PNM appears to become over-confident, while for increasing m with n fixed the PNM appears to become under-confident; a conservative choice would therefore be to take m ≥ n. Interestingly, this behaviour of the Z-scores is similar to that observed for the rational quadratic covariance model in Section 4.1, and may reflect the fact that in both cases the solution u is outside the support of the covariance model. Next we compared the performance of the linearisation Q (1) with the linearisation Q (2) . The error E ∞ associated to Q (2) (not shown) was larger than the error of Q (1) , and the Z-scores for Q (2) are displayed in Figure 5. Our objective is to quantify numerical uncertainty, so it is essential that output from the PNM is calibrated. Unfortunately, it can be seen that the Z-scores associated with Q (2) are unsatisfactory; for large m the scores are two orders of magintude larger than 1, indicating that the PNM is over-confident. The failure of Q (2) to provide calibrated output is likely due to the fact that approximation of the second order derivative term ∂ 2 x u is more challenging compared to approximation of the solution u, since ∂ 2 x is less regular than u and since our initial and boundary data relate directly to u itself.
Finally we considered inclusion of the conservation law into the PNM. For this purpose we used the best-performing linearisation Q (1) . The errors E ∞ and Z-scores are shown in Figure 6 and can be compared to the equivalent results without the conservation law applied, in Figure 4. It can be seen that the error E ∞ is lower when the conservation law is applied, and moreover the Z-scores are slightly reduced, remaining order 1. These results agree with the intuition that incorporating additional physical constraints, when they are known, can have a positive impact on the performance of our PNM.

Forced Burger's Equation
Our final experiment concerns a nonlinear PDE whose right hand side f is considered to be a black box, associated with a substantial computational cost. To avoid confounding due to the choice of differential operator, we consider again the differential operator from Burger's equation for which the behaviour of our PNM was studied in Section 4.1 (in the case f = 0). The initial and boundary conditions are and we set α = 1, T = 30, L = 1. The aims of this experiment are two-fold: Our first aim is to evaluate the performance of our PNM when the function f is non-trivial (e.g. involving oscillatory behaviour), to understand whether the output from our PNM remains calibrated or not. Recall that our experiments are synthetic, meaning that the black box f is in actual fact an analytic expression, in this case f (t, x) := 10 sin(6πx) cos(3πt) + 2|sin(3πx) cos(6πt)|, enabling a thorough assessment to be performed. This forcing term is deliberately chosen to have some non-smoothness (from the absolute value function) and oscillatory behaviour, as might be encountered in output from a complex computer model. Our second aim is to compare the accuracy of our PNM against a classical numerical method whose computational budget (as quantified by the number of times f is evaluated) is identical to our PNM. The solution to (12) does not admit a closed form, so for our ground truth we used a numerical solution computed using the MATLAB function pdepe, which implements Skeel and Berzins (1990) based on a uniform spatial grid of size 512 and an adaptively selected temporal grid. Our PNM was implemented with the same linearisation used in Section 4.1.
Prior: Again we consider the default prior advocated in Section 3 and Section 4.1, with amplitude σ estimated using maximum likelihood and length-scale parameters fixed at values ρ 1 = 0.5, ρ 2 = 0.5 (not optimised; based on a simple post-hoc visual check).
Crank-Nicolson Benchmark: In this scenario, where the black box function f is associated with a high computational cost, non-adaptive numerical methods are preferred, to control the total computational cost. From a classical perspective, the finite difference methods are natural candidates for the numerical solution of (12). Finite difference methods are classified into explicit and implicit schemes. Explicit schemes are much easier to solve but typically require certain conditions to be met for numerical stability (Thomas, 1998, Table 5.3.1). For example, for the 2D heat equation, it is required that ∆t/(α∆x 2 ) + ∆t/(α∆y 2 ) ≤ 1/2, where α is the diffusivity constant, ∆t the time resolution, and ∆x, ∆y the spatial resolutions (Thomas, 1998, page 158). Such conditions, which require the spatial resolution to be much finer than the time resolution, may be difficult to establish when f is a black box or when manual selection of the solution grid is not possible. Implicit methods in general are more difficult to solve, but stability is often guaranteed. For example, for the same 2D heat equation problem, the Crank-Nicolson scheme (a second order, implicit method) is unconditionally stable (Thomas, 1998, page 159). For these reasons, we considered the Crank-Nicolson finite difference method (Crank and Nicolson, 1947) as a classical numerical method that is well-suited to the task at hand. In the implementation of Crank-Nicholson on the inhomogeneous Burger's equation, the nonlinear term is approximated via a lag nonlinear term (Thomas, 1998, page 140). The same regular temporal grid t and regular spatial grid x were employed in both Crank-Nicolson and our PNM, so that the computational costs for both methods (as quantified in terms of the number of evaluations of f ) are identical.

Results:
The error E ∞ and Z-scores for our PNM are displayed in Figure 7. The error E ∞ is seen to be gated by the size m of the spatial grid and descreases as (n, m) are simultaneously increased. The Z-score values appear to be of order 1 as (n, m) are simultaneously increased, but for increasing n with m fixed the PNM appears to become over-confident; a conservative choice would be to take m ≥ n, which is also what we concluded from the porous medium equation. These results suggest the output from our PNM is reasonably well-calibrated. Finally, we considered the accuracy of our PNM compared to the Crank-Nicolson benchmark. The error E ∞ for Crank-Nicolson is jointly displayed in Figure 7, below the error plots for our PNM, and interestingly, it is generally larger than the error obtained with our PNM. This provides reassurance that our PNM is as accurate as could reasonably be expected.

Conclusion
This paper addressed an important and under-studied problem in numerical analysis; the numerical solution of a PDE under severe restrictions on evaluation of the initial, boundary and/or forcing terms f , g and h in (1). Such restructions occur when f , g and/or h are associated with a computational cost, such as being output from a computationally intensive computer model (Fulton, 2010;Hurrell et al., 2013) or arising as the solution to an auxiliary PDE (Cockayne and Duncan, 2020;MacNamara and Strang, 2016). In many such cases it is not possible to obtain an accurate approximation of the solution of the PDE, and at best one can hope to describe trajectories that are compatible with the limited information available on the PDE. To provide a principled resolution, in this paper we cast the numerical solution of a nonlinear PDE as an inference problem within the Bayesian framework and proposed a probabilistic numerical  Hennig et al., 2015). A common feature of these tasks is that their difficulty justifies the use of sophisticated statistical machinery, such as Gaussian processes, that themselves may be associated with a computational cost. The PNM developed in this paper has a complexity O(nm 3 ) to approximate the final state of the PDE, or O(n 3 m 3 ) to approximate the full solution trajectory of the PDE. This renders our PNM computationally intensive -potentially orders of magnitude slower than a classical numerical method -but such increase in cost can be justified when the demands of evaluating f , g and h exceed those of running the PNM (for example, when evaluation of f requires simulation from a climate model Fulton, 2010;Hurrell et al., 2013). Further work will be required to establish our approach as a general purpose numerical tool for nonlinear PDEs: First, the non-unique partitioning of the differential operator D into linear and nonlinear components, P and Q, together with the non-unique linearisation of Q, necessitates some expert input. This is analogus to the selection of a suitable numerical method in the classical setting, but the classical literature has benefitted from decades of research and extensive practical guidence is now available in that context. Here we took a first step to automation by rigorously establishing sample path properties of a Matérn tensor product covariance model Σ, along with presenting a closed-form maximum likelihood estimator for the amplitude of Σ. The user is left to provide suitable length-scale parameter(s), which is roughly analogous to requiring the user to specify a mesh density in a finite element method (an accepted reality in that context). Second, an extensive empirical assessment will be required to systematically assess the performance of the method; our focus in the present paper was methodology and theory, providing only an experimental proof-of-concept. In particular, it will be important to assess diagnostics for failure of the method; it seems plausible that statistically-motivated diagnostics, such as held-out predictive likelihood, could be used to indicate the quality of the output from the PNM. Finally, we acknowledge that the problem we considered in (1) represents only one class of nonlinear PDEs and further work will be required to develop PNM for other classes of PDEs, such as boundary value problems and PDEs defined on more general domains.

A. Appendix
This supplement contains proofs for all novel theoretical results in the main text.

A.1. Proof of Lemma 1
Proof of Lemma 1.
Our starting point is the stochastic process defined in Lemma 1, and from this we aim to derive the iterative formulation in the main text. The derivation requires several items of notation to be introduced. First, let . . .
The mean and covariance of U i+1 , as defined in Lemma 1, are equal to Similarly, introduce the notation so that the application of D i to µ i+1 and Σ i+1 may be expressed as Notice that we have a recursive partitioning of M i+1 into blocks of the form Thus we may use the block matrix inversion formula to deduce that so our definition of A i coincides with that in the main text, and enables us to simplify M −1 Therefore we have that which can be simplified by noting that to produce the iterative formulation for the mean in the main text: For the covariance we have This can be simplified by noting that and an analogous calculation shows Finally, we must check that the value K (2p) ν (0) agrees with the two limits just derived. K (2p−1) ν (h) is continuously differentiable, so K (2p−1) ν (0) = 0 because it is an odd function (as it is an odd derivative of an even function K ν ). Thus we have that as required.

A.3. Proof of Proposition 2
As in the proof of Proposition 1, the assumed regularity of µ, being an element of C p (I), implies that X and X −µ have identical differentiability properties up to pth order, and we may therefore assume µ = 0. Our main tool is the Kolmogorov continuity theorem (see for example Kunita (1997), Section 1.4): Theorem 3 (Kolmogorov's Continuity Theorem). Let I ⊆ R d be an open set, and let z be a dense subset of I. Let X : I × Ω → R be a random field. If there exists constants α, β, C > 0 such that E |X(z, ω) − X(z , ω)| α ≤ C z − z d+β for all z, z ∈ z, then there exists a modification of X that is sample continuous.
Lemma 2. Let I ⊆ R d be an open set and suppose that a positive definite function Σ : I × I → R satisfies Σ(z, z) + Σ(z , z ) − 2Σ(z, z ) ≤ C z − z γ for some γ, C ∈ (0, ∞) and all z, z ∈ I. Let X ∼ GP(0, Σ). Then there exists a modification of X that is sample continuous.

Proof of Proposition 2.
Our aim is to show that ∂ (i) ms X satisfies the preconditions of Lemma 2. This process has covariance function K (i,i) ν (z, z ) = (−1) i K (2i) ν (z −z ). From stationarity we have, with h = z −z , From similar calculations to those performed in the proof of Proposition 1, we have that for all 0 ≤ i ≤ p, a k |h| k for some real coefficients a k . Therefore: This final term can be upper bounded by an expression of the form C|h| γ for sufficiently large C > 0 and γ = 1. Indeed, as h → 0 the behaviour of (13) is O(|h|). As |h| → ∞ the exponential term dominates and (13) decays to a 0 . In the region 0 < |h| < ∞, (13) is smooth. Thus we can use Lemma 2 to conclude that ∂ (i) ms X has a modification that is sample continuous.

A.4. Proof of Corollary 1
For p = 1 the result is immediate from Theorem 1, so in what follows we concentrate on p > 1.
The main technical challenge of this proof is to deal with modifications, which arise with each application of Theorem 1. Recall that two stochastic processes X andX are said to be indistinguishable if P(X(z, ω) =X(z, ω) ∀z ∈ I) = 1. If X andX are modifications of each other and each is sample continuous, then X andX are indistinguishable (Jeanblanc et al., 2009, Section 1.1). Proof of Corollary 1.
We first present a proof for d = 1, to improve transparency of the argument, then we present the argument for the general case d ≥ 1. Note that, since we are considering Gaussian processes, the requirement for a second moment in Theorem 1 is automatically satisfied.
For each 0 ≤ i < p, it is assumed that ∂ i ms X has mean square derivative ∂ i+1 ms X that is mean square continuous and sample continuous. Theorem 1 therefore implies that each ∂ i ms X has a modification, denoted ψ i , that is sample continuously differentiable, and satisfying ∂ψ i = ∂ i+1 ms X almost surely. Since ψ i and ∂ i ms X are sample continuous they are indistinguishable; i.e. almost surely ψ i = ∂ i ms X. It follows that, for each 0 ≤ i < p, we have almost surely that ∂ i ψ 0 = ∂ i−1 (∂ψ 0 ) = ∂ i−1 ψ 1 = · · · = ψ i , while for i = p we have that ∂ p ψ 0 = ∂ψ p−1 = ∂ p ms X.
The case d ≥ 1 is analogous with more notation is involved; though, since we assumed Σ ∈ C (p,p) (I × I) we may employ the shorthand notation ∂ β ms X for all |β| ≤ p (since the order of derivatives can be freely interchanged). For each 0 ≤ i < p and |β| = i, it is assumed that ∂ β ms X has mean square partial derivatives ∂ β+γ ms X, |γ| = 1, that are mean square continuous and sample continuous. Theorem 1 therefore implies that each ∂ β ms X has a modification, denoted ψ β , that is sample continuously differentiable, and satisfying ∂ γ ψ β = ∂ β+γ ms X almost surely for all |γ| = 1. Since ψ β and ∂ β ms X are sample continuous they are indistinguishable; i.e. almost surely ψ β = ∂ β ms X. It follows that, for each 0 ≤ i < p and |β| = i, we have almost surely that ∂ β ψ 0 = ψ β , while for and |β| = p we have that ∂ β ψ 0 = ∂ β ms X.
Proof. Since µ ∈ C p (I) we may define µ (−q) to be the q times integrated µ for each q ∈ {0, . . . , p}. Note that µ (−q) is well-defined since µ is assumed to be bounded.
Our aim is to use Theorem 4, so we must check that the preconditions are satisfied. (14) is satisfied because H K for the Matérn ν kernel is norm equivalent to the Sobolev Space W α Since β i + α i − P i ≤ β i , and from the product form of Σ (β+α−P,β+α−P ) and the univariate calculations performed in the proofs of Proposition 2 and Proposition 3, (18) is verified. Thus, by Lemma 2, ∂ α ms Z has a sample continuous modification, for any α ∈ N d 0 , |α| ≤ p. Finally, we apply Corollary 1 to deduce that there exists a modificationZ of Z with P(Z ∈ C p (I)). It follows thatX = ∂ P −βZ is a modification of X that satisfies P(X ∈ C β (I)).