1 Introduction

Combining complex mathematical models with observational data is an extremely challenging yet ubiquitous problem in the fields of modern applied mathematics and data science. Inverse problems, where one is interested in learning inputs to a mathematical model such as physical parameters and initial conditions given partial and noisy observations of model outputs, are hence of frequent interest. Adopting a Bayesian approach (Kaipio and Somersalo 2005; Stuart 2010), we incorporate our prior knowledge on the inputs into a probability distribution, the prior distribution, and obtain a more accurate representation of the model inputs in the posterior distribution, which results from conditioning the prior distribution on the observed data.

The posterior distribution contains all the necessary information about the characteristics of our inputs. However, in most cases the posterior is unfortunately intractable and one needs to resort to sampling methods such as Markov chain Monte Carlo (MCMC) (Robert and Casella 2004; Brooks et al. 2011) to explore it. A major challenge in the application of MCMC methods to problems of practical interest is the large computational cost associated with numerically solving the mathematical model for a given set of the input parameters. Since the generation of each sample by an MCMC method requires a solve of the governing equations, and often millions of samples are required in practical applications, this process can quickly become very costly.

One way to deal with the challenge of full Bayesian inference for complex models is the use of surrogate models, also known as emulators, meta-models or reduced order models. Instead of using the complex (and computationally expensive) model, one uses a simpler and computationally more efficient model to approximate the solution of the governing equations, which in turn is used to approximate the data likelihood. Within the statistics literature, the most commonly used type of surrogate model is a Gaussian process emulator (Rasmussen and Williams 2006; Stein 1999; Sacks et al. 1989; Kennedy and O’Hagan 2000; O’Hagan 2006; Higdon et al. 2004), but other types of surrogate models can also be used including projection-based methods (Bui-Thanh et al. 2008), generalised Polynomial Chaos (Xiu and Karniadakis 2003; Marzouk et al. 2007), sparse grid collocation (Babuska et al. 2007; Marzouk and Xiu 2009) and adaptive subspace methods (Constantine 2015; Constantine et al. 2014).

In this paper, we focus on the use of Gaussian process surrogate models for approximating the posterior distribution in inverse problems, where the forward model is related to the solution of a linear partial differential equation (PDE). In particular, we consider two different ways of using the surrogate model, emulating either the parameter-to-observation map or the negative log-likelihood. Convergence properties of the corresponding posterior approximations, as the number of design points N used to construct the surrogate model goes to infinity, have recently been studied in Stuart and Teckentrup (2018); Teckentrup (2020); Helin et al. (2023). These results put the methodology on a firm theoretical footing, and show that the error in the approximate posterior distribution can be bounded by the corresponding error in the surrogate model. Furthermore, the error in the approximate posteriors tends to zero as N tends to infinity. However, when the forward model of interest is given by a complex model such as a PDE, one normally operates in a regime where only a very limited number of design points N can be used due to constraints on computational cost. This setting is less understood and is the main setting of interest in this paper.

With a small number of design points, different modelling choices made in the derivation of the approximate posterior can have a large effect on its accuracy. In particular, the choice of Gaussian prior distribution in the emulator is crucial, as it heavily influences its accuracy. Intuitively, we want to make the Gaussian prior as informative as possible, by incorporating known information about the underlying forward model. For example, such a Gaussian prior specially tailored to solving the forward problem in linear PDEs can be found in Raissi et al. (2017). For incorporating more general constraints, we refer the reader to the recent review (Swiler et al. 2021). Other modelling choices that require careful consideration are whether we build a surrogate model for the parameter-to-observation map or the log-likelihood directly, and whether we use the full distribution of the emulator or only the mean (see e.g. Stuart and Teckentrup (2018); Lie et al. (2018)).

The focus of this paper is on computational aspects of the use of Gaussian process surrogate models in PDE inverse problems, with particular emphasis on the setting where the number of design points is limited by computational constraints. The main contributions of this paper are the following:

  1. We extend the PDE-informed Gaussian process priors from Raissi et al. (2017) to enable their use in inverse problems, which requires a Gaussian process prior as a function of both the spatial variable of the PDE and the unknown parameter(s).

  2. By showing that the required gradients can be computed explicitly, we establish that gradient-based MCMC samplers such as the Metropolis-adjusted Langevin algorithm (MALA) can be used to efficiently sample from the approximate posterior distributions.

  3. Using a range of numerical examples, we demonstrate the isolated effects of various modelling choices, and thus offer valuable insights and guidance for practitioners. This includes choices of posterior approximation in the inverse problem (e.g. emulating the parameter-to-observation map or the log-likelihood) and of prior distributions for the Gaussian process emulator (e.g. black-box or PDE-constrained).

The rest of the paper is organised as follows. In Sect. 2 we set up notation with respect to the inverse problems of interest and discuss the different kinds of posterior approximations that result from using Gaussian surrogate models for the data-likelihood. We then proceed in Sect. 3 to present our main methodology, discussing how one can blend better-informed Gaussian surrogate models with inverse problems as well as presenting the MCMC algorithm that we use. A number of different numerical experiments that illustrate the computational benefits of our approach are then presented in Sect. 4, and finally Sect. 5 provides a summary and discussion of the main results.

2 Preliminaries

We now give more details about the type of inverse problems considered in this paper and discuss different aspects of Gaussian emulators, as well as the corresponding types of approximate posteriors considered in this work. The notation introduced in this section is summarised in Table 1.

Table 1 The list of symbols and notations used in this paper

2.1 PDE inverse problems

Consider the linear PDE

$$\begin{aligned} \mathcal {L}^{\varvec{\theta }} u(\textbf{x})&= f(\textbf{x}), \qquad \textbf{x} \in D, \end{aligned}$$
(1a)
$$\begin{aligned} {\mathcal {B}} u(\textbf{x})&= g(\textbf{x}), \qquad {\textbf{x}} \in \partial D, \end{aligned}$$
(1b)

posed on a domain \(D \subseteq \mathbb {R}^{{d}_{\textbf{x}}}\), where \(\mathcal {L}^{\varvec{\theta }}\) denotes a linear differential operator depending on parameters \(\varvec{\theta } \in \mathcal {T} \subseteq \mathbb {R}^{{d}_{\varvec{\theta }}}\) and the linear operator \({\mathcal {B}}\) incorporates boundary conditions. The inverse problem of interest in this paper is to infer the parameters \(\varvec{\theta }\) from the noisy data \(\textbf{y} \in \mathbb {R}^{{d}_{\textbf{y}}}\) given by

$$\begin{aligned} \textbf{y} = \mathcal {G}_{X}(\varvec{\theta }) + \varvec{\eta }, \end{aligned}$$
(2)

where \(X=\{\textbf{x}_{1},\cdots ,\textbf{x}_{{d}_{\textbf{y}}} \} \subset {\overline{D}}\) are the spatial points where we observe the solution u of our PDE, \({\mathcal {G} _{X}:\mathcal {T}\rightarrow \mathbb {R}^{{d}_{\textbf{y}}}}\) is the parameter-to-observation map defined by \(\mathcal {G}_{X}(\varvec{\theta }) = \{u(\textbf{x}_j; \varvec{\theta })\}_{j=1}^{{{d}_{\textbf{y}}}}\), and \(\varvec{\eta } \sim \mathcal {N}(0,\Gamma _{\eta })\) is an additive Gaussian noise term with covariance matrix \(\Gamma _{\eta } = \sigma _{\eta }^2 I_{{d}_{\textbf{y}}}\). Note that the assumption of Gaussianity and diagonal noise covariance is made for simplicity; both assumptions can be relaxed (Lie et al. 2018). Likewise, the methodology generalises straightforwardly to general bounded linear observation operators applied to the PDE solution u (see the discussion in Sect. 3.1).

To solve the inverse problem we will adopt a Bayesian approach (Stuart 2010). That is, prior to observing the data \(\textbf{y}\), \(\varvec{\theta }\) is assumed to be distributed according to a prior density \(\pi _0(\varvec{\theta }),\) and we are interested in the updated posterior density \(\pi (\varvec{\theta }|{\textbf{y}})\). From (2) we have \( \textbf{y}|\varvec{\theta }\sim \mathcal {N}(\mathcal {G}_{X}(\varvec{\theta }),\Gamma _{\eta })\), so the likelihood is

$$\begin{aligned} L(\textbf{y}|{\varvec{\theta }})&\propto \exp {\left( -\frac{1}{2}\Vert \mathcal {G}_{X}(\varvec{\theta })-\textbf{y}\Vert ^2_{\Gamma _{\eta }}\right) } \nonumber \\&:=\exp {\left( -\Phi (\varvec{\theta },\textbf{y})\right) }, \end{aligned}$$
(3)

where the function \(\Phi :\mathcal {T}\times \mathbb {R}^{{d}_{\textbf{y}}}\rightarrow \mathbb {R}\) is called the negative log-likelihood or potential and \(\Vert \textbf{z}\Vert ^2_{\Gamma _\eta }:= \textbf{z}^\textrm{T} \Gamma _\eta ^{-1} \textbf{z}\) denotes the squared norm weighted by \(\Gamma _\eta ^{-1}\). Our notation for \(\Vert \textbf{z}\Vert _{\Gamma _\eta }\) follows the convention introduced in Stuart (2010). Then by Bayes’ formula we have

$$\begin{aligned} \pi ({\varvec{\theta }}|{\textbf{y}}) \propto L(\textbf{y}|\varvec{\theta })\pi _0(\varvec{\theta }). \end{aligned}$$

The posterior distribution \(\pi ({\varvec{\theta }}|{\textbf{y}})\) is in general intractable, and we need to resort to sampling methods such as MCMC to extract information from it. However, generating a sample typically involves evaluating the likelihood and hence the solution of the PDE (1), which can be prohibitively costly. This motivates the use of surrogate models to emulate the PDE solution, which in turn is used to approximate the posterior and hence accelerate the sampling process.
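To fix ideas, the following minimal Python sketch evaluates the potential \(\Phi \) in (3) and the resulting unnormalised log-posterior. The callables forward_map and log_prior are placeholders of our own (not from the paper) standing in for \(\mathcal {G}_X\) and \(\log \pi _0\); in practice forward_map would wrap the (expensive) PDE solve.

```python
import numpy as np

def potential(theta, y, forward_map, sigma_eta):
    """Negative log-likelihood Phi(theta, y) from (3) for i.i.d. Gaussian noise."""
    residual = forward_map(theta) - y          # G_X(theta) - y
    return 0.5 * np.sum(residual**2) / sigma_eta**2

def log_posterior_unnorm(theta, y, forward_map, sigma_eta, log_prior):
    """Unnormalised log pi(theta | y) = -Phi(theta, y) + log pi_0(theta)."""
    return -potential(theta, y, forward_map, sigma_eta) + log_prior(theta)
```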

2.2 Gaussian processes

Gaussian process regression (GPR) is a flexible non-parametric model for Bayesian inference (Rasmussen and Williams 2006). Our starting point for approximating an arbitrary function \(\textbf{g}:\mathcal {T}\rightarrow \mathbb {R}^d\), for some \(d \in \mathbb {N}\), is the Gaussian process prior

$$\begin{aligned} \textbf{g}_{0}(\varvec{\theta }) \sim \text {GP}(\textbf{m}(\varvec{\theta }), K(\varvec{\theta },\varvec{\theta }')), \end{aligned}$$
(4)

where \(\textbf{m}:\mathcal {T}\rightarrow \mathbb {R}^{d}\) is a mean function and \(K:\mathcal {T}\times \mathcal {T}\rightarrow \mathbb {R}^{d \times d}\) is the matrix-valued positive definite covariance function, which represents the covariance between the different entries of \(\textbf{g}\) evaluated at \(\varvec{\theta }\) and \(\varvec{\theta }'\). Note that this prior is distinct from the prior \(\pi _0\) introduced earlier for the Bayesian inverse problem; it is the prior used for Gaussian process regression. When emulating the forward map, the function \(\textbf{g}\) corresponds to the PDE solution evaluated at \({d}_{\textbf{y}}\) different spatial points, and hence \(d={d}_{\textbf{y}}\). In contrast, \(d=1\) when directly emulating the log-likelihood.

In the case where \(d>1\), there are a number of choices that one can make for the matrix-valued covariance function (Alvarez et al. 2012). In this section, for simplicity, we will assume that the matrix \(K(\varvec{\theta },\varvec{\theta }')\) takes the form

$$\begin{aligned} K(\varvec{\theta },\varvec{\theta }')=k(\varvec{\theta },\varvec{\theta }') I_{d} \end{aligned}$$

for some scalar-valued covariance function \(k:\mathcal {T}\times \mathcal {T}\rightarrow \mathbb {R}\), implying that the entries of \(\textbf{g}\) are independent. We will refer to this as the baseline model. As we will see later, better emulators can be constructed by relaxing this independence assumption.

The mean function and the covariance function fully characterise our Gaussian prior. A typical choice for \(\textbf{m}\) is to set it to zero, while common choices for the covariance function \(k(\varvec{\theta },\varvec{\theta }')\) include the squared exponential covariance function (Rasmussen and Williams 2006)

$$\begin{aligned} k(\varvec{\theta },\varvec{\theta }') = \sigma ^2 \exp \left( -\frac{\left\Vert \varvec{\theta }- \varvec{\theta }'\right\Vert ^2}{2l^2}\right) , \end{aligned}$$
(5)

and the Matérn covariance function (Rasmussen and Williams 2006)

$$\begin{aligned}&k(\varvec{\theta },\varvec{\theta }') = \nonumber \\&\frac{\sigma ^2}{\Gamma (\nu )2^{\nu -1}}\left( \sqrt{2\nu }\frac{\left\Vert \varvec{\theta }-\varvec{\theta }'\right\Vert }{l}\right) ^\nu B_{\nu }\left( \sqrt{2\nu }\frac{\left\Vert \varvec{\theta }-\varvec{\theta }'\right\Vert }{l}\right) . \end{aligned}$$
(6)

For both kernels, the hyperparameter \(\sigma ^2 >0\) governs the magnitude of the covariance and the hyperparameter \(l>0\) governs the length-scale at which the entries of \(\textbf{g}_0(\varvec{\theta })\) and \(\textbf{g}_0(\varvec{\theta }')\) are correlated. For the Matérn covariance function the smoothness of the entries of \(\textbf{g}_0\) depends on the hyperparameter \(\nu > 0\). In the limit \(\nu \rightarrow \infty \) we obtain the squared exponential covariance function, which gives rise to infinitely differentiable sample paths for \(\textbf{g}_0\).
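As an illustration, here is a minimal sketch of the two covariance functions (5) and (6). We take \(B_{\nu }\) to be the modified Bessel function of the second kind, as is standard for the Matérn family; the hyperparameter names are our own.

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def squared_exponential(theta, theta_p, sigma2=1.0, ell=1.0):
    """Squared exponential covariance function (5)."""
    r = np.linalg.norm(theta - theta_p)
    return sigma2 * np.exp(-r**2 / (2.0 * ell**2))

def matern(theta, theta_p, sigma2=1.0, ell=1.0, nu=2.5):
    """Matern covariance function (6); returns sigma2 at r = 0 to handle the limit."""
    r = np.linalg.norm(theta - theta_p)
    if r == 0.0:
        return sigma2
    scaled = np.sqrt(2.0 * nu) * r / ell
    return sigma2 * (2.0**(1.0 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
```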

Now suppose that we are given data in the form of N distinct design points \(\Theta = \{\varvec{\theta }^i\}_{i = 1}^{N} \subseteq \mathbb {R}^{{d}_{\mathbf {\varvec{\theta }}}}\) with corresponding function values

$$\begin{aligned} \textbf{g}(\Theta ):=[\textbf{g}(\varvec{\theta }^1);\cdots ; \textbf{g}(\varvec{\theta }^N)] \in \mathbb {R}^{N{d}_{\textbf{y}}}. \end{aligned}$$

Since we have assumed that the multi-output function \(\textbf{g}_{0}\) is a Gaussian process, the vector

$$\begin{aligned}{}[\textbf{g}_{0}(\varvec{\theta }^1); \cdots ; \textbf{g}_{0}(\varvec{\theta }^N);\textbf{g}_{0}(\tilde{\varvec{\theta }})] \in \mathbb {R}^{(N+1){d}_{\textbf{y}}}, \end{aligned}$$

for any test point \(\tilde{\varvec{\theta }}\), follows a multivariate Gaussian distribution. The conditional distribution of \(\textbf{g}_0(\tilde{\varvec{\theta }})\) given the set of values \(\textbf{g}(\Theta )\) is then again Gaussian with mean and covariance given by the standard formulas for the conditioning of Gaussian random variables (Rasmussen and Williams 2006). In particular, if we denote with \(\textbf{g}^{N}\) the Gaussian process (4) conditioned on the values \(\textbf{g}(\Theta )\) we have

$$\begin{aligned} \textbf{g}^{N}(\varvec{\theta }) \sim \text {GP}(\textbf{m}^{\textbf{g}}_{N}(\varvec{\theta }),K_{N}(\varvec{\theta },\varvec{\theta }')) \end{aligned}$$
(7)

where the predictive mean vector \(\textbf{m}^{\textbf{g}}_{N}\) and the predictive covariance matrix \(K_{N}(\varvec{\theta },\varvec{\theta }')\) are given by

$$\begin{aligned}&\textbf{m}^{\textbf{g}}_{N}(\varvec{\theta }) = \textbf{m}(\varvec{\theta })+K(\varvec{\theta },\Theta )K(\Theta ,\Theta )^{-1}\left( \textbf{g}(\Theta )-\textbf{m}(\Theta )\right) \end{aligned}$$
(8)
$$\begin{aligned}&K_{N}(\varvec{\theta },\varvec{\theta }') = K(\varvec{\theta },\varvec{\theta }') - K(\varvec{\theta },\Theta )K(\Theta ,\Theta )^{-1}K(\varvec{\theta }',\Theta )^{T}, \end{aligned}$$
(9)

with

$$\begin{aligned}&\textbf{m}(\Theta ) = [\textbf{m}(\varvec{\theta }^{1});\cdots ; \textbf{m}(\varvec{\theta }^{N})] \in \mathbb {R}^{N{d}_{\textbf{y}}}, \\&K(\varvec{\theta },\Theta ) = [K(\varvec{\theta },\varvec{\theta }^1),\dots ,K(\varvec{\theta },\varvec{\theta }^N)] \in \mathbb {R}^{ {d}_{\textbf{y}}\times N{d}_{\textbf{y}}} \end{aligned}$$

and

$$\begin{aligned} K(\Theta ,\Theta ) = \begin{bmatrix} K(\varvec{\theta }^1,\varvec{\theta }^1)&{}\dots &{}K(\varvec{\theta }^1,\varvec{\theta }^N)\\ \vdots &{} &{}\vdots \\ K(\varvec{\theta }^N,\varvec{\theta }^1)&{}\dots &{}K(\varvec{\theta }^N,\varvec{\theta }^N)\\ \end{bmatrix} \in \mathbb {R}^{ N{d}_{\textbf{y}}\times N{d}_{\textbf{y}}} \end{aligned}$$

We note that \(\textbf{g}^{N}\) is the Gaussian process posterior, but to avoid confusion with the posterior of the Bayesian inverse problem, we call it the predictive Gaussian process. In addition, for clarity of notation, we use regular font for scalar values, bold font for vector values, and capital font for matrices (details in Table 1).
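The conditioning formulas (8)-(9) simplify considerably under the baseline model \(K=kI_d\). The sketch below (our own illustration, assuming a zero prior mean) computes the predictive mean vector and the shared scalar predictive variance at a single test point; the predictive covariance matrix is then this scalar times \(I_d\).

```python
import numpy as np

def gp_condition(theta_test, Theta, G_Theta, k, jitter=1e-10):
    """
    Predictive mean and variance of the baseline (independent-output) emulator
    at a single test point, following (8)-(9) with zero prior mean and
    K(theta, theta') = k(theta, theta') I_d.
    Theta: (N, d_theta) design points; G_Theta: (N, d) training outputs.
    """
    N = Theta.shape[0]
    K = np.array([[k(Theta[i], Theta[j]) for j in range(N)] for i in range(N)])
    K += jitter * np.eye(N)                       # numerical stabilisation
    k_star = np.array([k(theta_test, Theta[i]) for i in range(N)])   # (N,)
    alpha = np.linalg.solve(K, G_Theta)           # K(Theta,Theta)^{-1} g(Theta)
    mean = k_star @ alpha                         # (d,) predictive mean
    var = k(theta_test, theta_test) - k_star @ np.linalg.solve(K, k_star)
    return mean, var   # variance shared by all d outputs under the baseline model
```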

2.3 Gaussian emulators and approximate posteriors

We now discuss two different approaches for constructing a Gaussian emulator and using it for approximating the posterior of interest. The first approach constructs an emulator for the forward map \(\mathcal {G}_{X}\), while the second approach is based on constructing an emulator directly for the log-likelihood.

2.3.1 Emulating the forward map

Given the data set \({\mathcal {G}_{X}}(\Theta ) = \{\mathcal {G}_{X}(\varvec{\theta }^i)\}_{i=1}^{N}\), we can now proceed with building our Gaussian process emulator for the forward map \(\mathcal {G}_{X}\). Using notation similar to (7), we denote the corresponding predictive Gaussian process by \(\mathcal {G}_{X}^{N}\). One then needs to decide how to incorporate the emulator into the construction of an approximate posterior. In particular, depending on what type of information we plan to utilise, different approximations will be obtained. If we use its predictive mean \(\textbf{m}^{\mathcal {G}_X}_{N}\) as a point estimator of the forward map \(\mathcal {G}_{X}\), we obtain

$$\begin{aligned} \pi ^{N,\mathcal {G}_{X}}_{\textrm{mean}}(\varvec{\theta }|\textbf{y}) \propto \exp \left( -\frac{1}{2}\Vert \textbf{m}^{\mathcal {G}_X}_{N}(\varvec{\theta })-\textbf{y}\Vert ^2_{\Gamma _\eta }\right) \pi _0(\varvec{\theta }). \end{aligned}$$
(10)

Alternatively, we can try to exploit the full information given by the Gaussian process by incorporating its variance in the posterior approximation. A natural way to do this is to consider the following approximation:

$$\begin{aligned}&\pi ^{N,\mathcal {G}_{X}}_{\textrm{marginal}}(\varvec{\theta }|\textbf{y})\propto \mathbb {E}\left( \exp \left( -\frac{1}{2}\Vert \mathcal {G}_{X}^N(\varvec{\theta })-\textbf{y}\Vert ^2_{\Gamma _\eta }\right) \pi _0(\varvec{\theta })\right) \nonumber \\&\propto \left( \frac{\exp \left( -\frac{1}{2}\Vert \textbf{m}^{\mathcal {G}_X}_{N}(\varvec{\theta })-\textbf{y}\Vert ^2_{(K_{N}(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta })}\right) }{\sqrt{(2\pi )^{d_\textbf{y}}\det \left( K_{N}(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta }\right) }}\right) \pi _0(\varvec{\theta }), \end{aligned}$$
(11)

where the expectation is taken over the probability space of the Gaussian process posterior. A detailed derivation of (11) can be found in Appendix A. Comparing (11) with (10), the likelihood function in the marginal approximation is Gaussian, with the additional uncertainty \(K_{N}(\varvec{\theta },\varvec{\theta })\) from the emulator included in its covariance matrix. Hence, for a fixed parameter \(\varvec{\theta }\), the likelihood function in (11) will be less concentrated due to variance inflation. When the magnitude of \(K_{N}(\varvec{\theta },\varvec{\theta })\) is small compared to that of \(\Gamma _{\eta }\), the marginal approximation will be similar to the mean-based approximation.
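A minimal sketch of the two resulting unnormalised log-posteriors (10) and (11); the predictive mean m_N, the predictive covariance K_N and the log-prior value are assumed to be precomputed at the given \(\varvec{\theta }\), and the noise covariance is taken diagonal as in the paper.

```python
import numpy as np

def log_post_mean_approx(m_N, y, sigma_eta2, log_prior_theta):
    """Unnormalised log of the mean-based approximation (10)."""
    r = m_N - y
    return -0.5 * np.dot(r, r) / sigma_eta2 + log_prior_theta

def log_post_marginal_approx(m_N, K_N, y, sigma_eta2, log_prior_theta):
    """Unnormalised log of the marginal approximation (11): a Gaussian likelihood
    with covariance inflated to K_N(theta,theta) + Gamma_eta."""
    d_y = y.shape[0]
    C = K_N + sigma_eta2 * np.eye(d_y)
    r = m_N - y
    sign, logdet = np.linalg.slogdet(C)
    return (-0.5 * r @ np.linalg.solve(C, r)
            - 0.5 * logdet - 0.5 * d_y * np.log(2.0 * np.pi)
            + log_prior_theta)
```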

2.3.2 Emulating the log-likelihood

Another way of building the emulator is to model the potential function \(\Phi \) directly. We can convert the data set \(\mathcal {G}_{X}(\Theta )\) into a data set of negative log-likelihood evaluations \(\varvec{\Phi }(\Theta ) = \{\Phi (\varvec{\theta }^i,\textbf{y})\}_{i = 1}^{N}\), and obtain the predictive Gaussian process \(\Phi ^{N}(\varvec{\theta })\sim \text {GP}(m^{\Phi }_{N}(\varvec{\theta }), k_{N}(\varvec{\theta },\varvec{\theta }))\). Again, if we only include the mean of the Gaussian process emulator the posterior approximation becomes

$$\begin{aligned} \pi ^{N,\Phi }_{\textrm{mean}}(\varvec{\theta }|\textbf{y}) \propto \exp \left( -m^{\Phi }_{N}(\varvec{\theta })\right) \pi _0(\mathbf {\varvec{\theta }}), \end{aligned}$$
(12)

while, in a similar fashion to the forward map emulation, we can take into account the covariance of our emulator to obtain the approximate posterior

$$\begin{aligned}&\pi ^{N,\Phi }_{\textrm{marginal}}(\varvec{\theta }|\textbf{y})\propto \mathbb {E}\left( (\exp {\left( -\Phi ^N(\varvec{\theta })\right) })\pi _{0}(\varvec{\theta })\right) \nonumber \\&\qquad \propto \exp {\left( -m^{\Phi }_{N}(\varvec{\theta })+ \frac{1}{2}k_{N}(\varvec{\theta },\varvec{\theta })\right) }\pi _{0}(\varvec{\theta }). \end{aligned}$$
(13)

The derivation of (13) is similar to that of (11). Note that in this case, the following relationship holds between the two approximate posteriors

$$\begin{aligned} \pi ^{N,\Phi }_{\textrm{marginal}}(\varvec{\theta }|\textbf{y}) \propto \pi ^{N,\Phi }_{\textrm{mean}}(\varvec{\theta }|\textbf{y}) \exp {\left( \frac{1}{2}k_{N}(\varvec{\theta },\varvec{\theta })\right) }, \end{aligned}$$

which again illustrates a form of variance inflation for the marginal posterior approximation.
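For completeness, here is the analogous sketch for the log-likelihood emulator, implementing (12) and (13) given the predictive mean and variance of \(\Phi ^N\) at \(\varvec{\theta }\); again, the argument names are our own.

```python
def log_post_phi_mean(m_Phi, log_prior_theta):
    """Unnormalised log of (12): plug in the predictive mean of the emulated potential."""
    return -m_Phi + log_prior_theta

def log_post_phi_marginal(m_Phi, k_Phi, log_prior_theta):
    """Unnormalised log of (13): variance inflation by half the predictive variance."""
    return -m_Phi + 0.5 * k_Phi + log_prior_theta
```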

In summary, we have two methods for approximating the true posterior, the mean-based approximation and the marginal approximation, and two types of emulators, the forward map emulator and the potential function emulator; in combination this gives four types of approximation in total. The convergence properties of all these approximate posteriors were studied in Stuart and Teckentrup (2018); Teckentrup (2020); Helin et al. (2023), where it was proved under suitable assumptions that all of them converge to the true posterior as \(N \rightarrow \infty \). However, for small N the differences between the approximate posteriors can be large, so the choice between them matters. Furthermore, the type of Gaussian process emulator used plays an even bigger role in this case, and one would like to use a Gaussian prior that is as informative as possible. We discuss how to do this in the next section.

3 Methodology

Having described the different types of posterior approximations we will consider, in this section we discuss different modelling approaches for the prior distribution used in our Gaussian emulators. In doing this, it is important to note that the function we are interested in emulating, in this case the forward map \(\mathcal {G}_{X}(\varvec{\theta })\), depends not only on the parameters \(\varvec{\theta }\) of our PDE, but also on the locations of the spatial observations. In terms of modelling, one would therefore like to take this into account and build spatial correlation explicitly into the prior covariance. Note that when emulating the potential \(\Phi \) instead of the forward map \({\mathcal {G}}_{X}\), we are emulating a scalar-valued function. Since \(\Phi \) is a non-linear function of \({\mathcal {G}}_{X}\), it is not possible to extend the ideas of spatial correlation presented in this section to emulating \(\Phi \), and in particular, it is not possible to construct a PDE-informed emulator in the same way.

Introducing spatial correlation when emulating \(\mathcal {G}_{X}(\varvec{\theta })\) can be done in two different ways, the first by prescribing some explicit form of spatial correlation, and the second by using the fact that we know that our forward map is associated with the solution of a linear PDE. We do this in Sect. 3.1. It is important to note that in both cases it is possible to calculate the gradients with respect to the parameters \(\varvec{\theta }\) in a closed form, which can then be used to sample from the approximate posterior distributions using gradient-based MCMC methods such as MALA. We discuss this in more detail in Sect. 3.2.

3.1 Correlated and PDE-informed priors

We now discuss two different approaches to incorporate spatial correlation into our prior covariance function for the forward map \(\mathcal {G}_{X}(\varvec{\theta })\). Even though this is a function from the parameter space \(\mathcal {T}\) to the observation space \(\mathbb {R}^{{d}_{\textbf{y}}}\), for introducing more complicated spatial correlation it is useful to think first about the PDE solution \(u(\varvec{\theta },\textbf{x})\) as a function from \(\mathcal {T} \times {\overline{D}} \) to \(\mathbb {R}\). We introduce the prior covariance function \(k((\varvec{\theta },\textbf{x}),(\varvec{\theta }',\textbf{x}'))\) for \(u(\varvec{\theta },\textbf{x})\), and choose a separable model

$$\begin{aligned} k((\varvec{\theta },\textbf{x}),(\varvec{\theta }',\textbf{x}'))=k_p(\varvec{\theta },\varvec{\theta }')k_s(\textbf{x}, \textbf{x}'), \end{aligned}$$
(14)

where \(k_p\) and \(k_s\) are the covariance functions for the parameters \(\varvec{\theta }\) and the spatial points \(\textbf{x}\) respectively.

Using the fact that the forward map \(\mathcal {G}_{X}\) relates to the point-wise evaluation of the function \(u(\varvec{\theta }, \textbf{x})\) for \(\textbf{x} \in X\), and assuming zero mean, we then obtain the Gaussian prior

$$\begin{aligned} \mathcal {G}_{X}(\varvec{\theta }) \sim \text {GP}(0,K(\varvec{\theta },\varvec{\theta }')), \end{aligned}$$
(15)

with

$$\begin{aligned} K(\varvec{\theta },\varvec{\theta }') = k_p(\varvec{\theta },\varvec{\theta }')K_s(X, X), \end{aligned}$$

where \(K_s\) is the covariance matrix with entries \((K_{s}(X,X))_{i,j}=k_{s}(\textbf{x}_{i},\textbf{x}_{j})\), \( \textbf{x}_{i},\textbf{x}_{j} \in X \). This prior can then be conditioned on training data \({\mathcal {G}}_X(\Theta ) \), and due to the separable structure in (14), the predictive mean \(\textbf{m}_N^{{\mathcal {G}}_X}(\varvec{\theta })\) is in fact the same as for the baseline model in Sect. 2.2. See Appendix B for details.
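Concretely, conditioning with the separable prior only requires the training covariance matrix \(K(\Theta ,\Theta )\), which under (14)-(15) has blocks \(k_p(\varvec{\theta }^i,\varvec{\theta }^j)K_s(X,X)\) and is therefore a Kronecker product. A short sketch (our own illustration; k_p and k_s are passed as callables):

```python
import numpy as np

def separable_train_cov(Theta, X, k_p, k_s):
    """K(Theta, Theta) for the spatially correlated prior (14)-(15):
    the Kronecker product K_p(Theta, Theta) (x) K_s(X, X)."""
    N, d_y = Theta.shape[0], X.shape[0]
    K_p = np.array([[k_p(Theta[i], Theta[j]) for j in range(N)] for i in range(N)])
    K_s = np.array([[k_s(X[i], X[j]) for j in range(d_y)] for i in range(d_y)])
    return np.kron(K_p, K_s)   # block (i, j) equals k_p(theta_i, theta_j) * K_s(X, X)
```

Conditioning then proceeds exactly as in Sect. 2.2, with this larger matrix in place of \(K(\Theta ,\Theta )\).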

The second way of introducing spatial correlation is explicitly taking into account that the forward map is related to a PDE solution. Given the PDE system

$$\begin{aligned} \mathcal {L}^{\varvec{\theta }} u(\textbf{x})&= f(\textbf{x}), \qquad {\textbf{x}} \in D, \\ {\mathcal {B}} u(\textbf{x})&= g(\textbf{x}), \qquad {\textbf{x}} \in \partial D, \end{aligned}$$

as described in Sect. 2, we can build a joint prior between u, f and g. In particular, if we take fixed points \(\textbf{x},\textbf{x}_{f} \in D\) and \(\textbf{x}_{g} \in \partial D\) we have that

$$\begin{aligned}&\begin{bmatrix}u(\varvec{\theta },\textbf{x})\\ g(\varvec{\theta },{\textbf{x}_{g}})\\ f(\varvec{\theta },\textbf{x}_{f}) \end{bmatrix} \sim \text {GP}\left( \varvec{0}, k_p(\varvec{\theta },\varvec{\theta }') \right. \nonumber \\&\left. \begin{bmatrix} k_s(\textbf{x},\textbf{x}) &{} \mathcal {B}k_s(\textbf{x},\textbf{x}_{g}) &{} \mathcal {L}^{\varvec{\theta }'}k_s(\textbf{x},\textbf{x}_{f})\\ \mathcal {B}k_s(\textbf{x}_{g},\textbf{x}) &{} \mathcal {B}\mathcal {B}k_s(\textbf{x}_{g},\textbf{x}_{g}) &{} \mathcal {B}\mathcal {L}^{\varvec{\theta }'}k_s(\textbf{x}_{g},\textbf{x}_{f})\\ \mathcal {L}^{\varvec{\theta }}k_s(\textbf{x}_{f},\textbf{x}) &{} \mathcal {L}^{\varvec{\theta }}\mathcal {B}k_s(\textbf{x}_{f},\textbf{x}_{g}) &{} \mathcal {L}^{\varvec{\theta }}\mathcal {L}^{\varvec{\theta }'}k_s(\textbf{x}_{f},\textbf{x}_{f})\\ \end{bmatrix}\right) , \end{aligned}$$
(16)

where the above is a Gaussian process as a function of \(\varvec{\theta }\), and we have used known properties of linear operators applied to Gaussian processes (see e.g. Matsumoto and Sullivan (2023)) in the derivation. The idea of a joint prior between u and f was also used in Raissi et al. (2017); Spitieris and Steinsland (2023); Cockayne et al. (2017); Pförtner et al. (2022), while Oates and Girolami (2019) use this explicitly in an inverse problem setting. The crucial difference is that in these works u and f were considered as functions of the spatial variable \(\textbf{x}\) only, while here we instead explicitly model the dependency of u on \(\varvec{\theta }\).

$$\begin{aligned} \begin{bmatrix} \mathcal {G}_{X}(\varvec{\theta })\\ g(\varvec{\theta },X_{g})\\ f(\varvec{\theta },X_f) \end{bmatrix}\sim \text {GP}\left( \varvec{0}, K(\varvec{\theta },\varvec{\theta }')\right) , \end{aligned}$$
(17)

where

$$\begin{aligned}&K(\varvec{\theta },\varvec{\theta }') = k_p(\varvec{\theta },\varvec{\theta }')\\&\begin{bmatrix} K_s(X,X) &{} \mathcal {B}K_s(X,X_{g}) &{} \mathcal {L}^{\varvec{\theta }'}K_s(X,X_f)\\ \mathcal {B}K_s(X_{g},X) &{} \mathcal {B}\mathcal {B}K_s(X_{g},X_{g}) &{} \mathcal {B}\mathcal {L}^{\varvec{\theta }'}K_s(X_{g},X_f)\\ \mathcal {L}^{\varvec{\theta }}K_s(X_f,X) &{} \mathcal {L}^{\varvec{\theta }}\mathcal {B}K_s(X_f,X_{g}) &{} \mathcal {L}^{\varvec{\theta }}\mathcal {L}^{\varvec{\theta }'}K_s(X_f,X_f)\\ \end{bmatrix} \end{aligned}$$

and \(X_{g} \subset \partial D\) and \(X_{f} \subset D\) are collections of \(d_g\) and \(d_f\) points at which g and f have been evaluated, respectively. Note that the marginal prior placed on \({\mathcal {G}}_X\) is the same as in (15).

The prior (17) can then again be conditioned on training data as in Sect. 2.2, see Appendix B for details. Note that in this case we are updating our prior on \({\mathcal {G}}_X(\varvec{\theta })\) using the observations \(g(\Theta ,X_g)\) and \(f(\Theta ,X_f)\) as well as \(\mathcal G_X(\Theta )\), essentially augmenting the space on which the emulator \(\mathcal {G}^N_{X}(\varvec{\theta })\) is trained. Since g and f are assumed known, these additional observations are cheap to obtain. It is also possible to condition on training data \(g(\Theta _g,X_g)\) and \(f(\Theta _f,X_f)\), for point sets \(\Theta _g\) and \(\Theta _f\) different to \(\Theta \), and this has been found to be beneficial in some of the numerical experiments (see Sect. 4 and Appendix D).
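To make the construction concrete, the sketch below assembles one \((\varvec{\theta },\varvec{\theta }')\) block of the covariance in (17), specialised to Dirichlet boundary conditions, where \({\mathcal {B}}\) is point evaluation and hence \({\mathcal {B}}k_s = k_s\). The callables encoding the action of \(\mathcal {L}^{\varvec{\theta }}\) on \(k_s\) are problem dependent and are assumptions of this sketch (they would be derived by hand or symbolically for the PDE at hand).

```python
import numpy as np

def pde_joint_cov(theta, theta_p, X, X_g, X_f, k_p, Ks, LKs, KsL, LKsL):
    """
    One (theta, theta') block of the PDE-constrained prior covariance (17),
    for Dirichlet boundary conditions (B is point evaluation, so B k_s = k_s).
    Ks(Xa, Xb)               : plain spatial covariance matrix K_s(Xa, Xb)
    LKs(theta, Xa, Xb)       : L^theta applied to the first argument of k_s
    KsL(theta_p, Xa, Xb)     : L^theta' applied to the second argument of k_s
    LKsL(theta, theta_p, .)  : L^theta and L^theta' applied to both arguments
    """
    blocks = [
        [Ks(X,   X),          Ks(X,   X_g),          KsL(theta_p, X,   X_f)],
        [Ks(X_g, X),          Ks(X_g, X_g),          KsL(theta_p, X_g, X_f)],
        [LKs(theta, X_f, X),  LKs(theta, X_f, X_g),  LKsL(theta, theta_p, X_f, X_f)],
    ]
    return k_p(theta, theta_p) * np.block(blocks)
```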

3.2 MCMC algorithms

To extract information from the posterior, MCMC algorithms are powerful and popular tools (Robert and Casella 2004; Brooks et al. 2011). In this work, we will consider the Metropolis-Adjusted Langevin Algorithm (MALA) (Roberts and Tweedie 1996), which is a type of MCMC algorithm that uses gradient information to accelerate the convergence of the sampling chain. Central to the idea of MALA is the over-damped Langevin stochastic differential equation (SDE):

$$\begin{aligned} d\varvec{\theta }= \nabla \log \pi (\varvec{\theta }|{\textbf{y}}) dt+ \sqrt{2} dW, \end{aligned}$$
(18)

where W is a standard \({d_{\varvec{\theta }}}\)-dimensional Brownian motion. Under mild conditions on the posterior \(\pi \) (Robert and Casella 2004), (18) is ergodic and has \(\pi \) as its stationary distribution, so that the probability density function of \(\varvec{\theta }(t)\) tends to \(\pi \) as \(t \rightarrow \infty \).

Algorithm 1 Metropolis-Adjusted Langevin Algorithm (MALA)

In practice (18) is discretised with a simple Euler-Maruyama method with a time step \(\gamma \):

$$\begin{aligned} \varvec{\theta }_{n+1} = \varvec{\theta }_{n} + \gamma \nabla \log \pi (\varvec{\theta }_{n}|\textbf{y}) + \sqrt{2\gamma }\xi _n, \end{aligned}$$
(19)

with \(\xi _n \sim \mathcal {N}(0,I_{d_{\varvec{\theta }}})\). Even if the dynamics of (19) remain ergodic, the corresponding numerical invariant measure does not necessarily coincide with the posterior. To remove this bias, one needs to incorporate an accept-reject mechanism (Sanz-Serna 2014). This gives rise to MALA, as described in Algorithm 1.
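For reference, a compact implementation of Algorithm 1. The callables log_post and grad_log_post are placeholders for any of the (approximate) log-posteriors and their gradients; all other names are our own.

```python
import numpy as np

def mala(log_post, grad_log_post, theta0, gamma, n_samples, rng=None):
    """Metropolis-adjusted Langevin algorithm (Algorithm 1), minimal sketch."""
    rng = np.random.default_rng() if rng is None else rng
    d = theta0.size
    theta = theta0.copy()
    lp, glp = log_post(theta), grad_log_post(theta)
    samples, accepted = np.empty((n_samples, d)), 0

    def log_q(x_to, x_from, g_from):
        # log density (up to a constant) of the proposal N(x_from + gamma*g_from, 2*gamma*I)
        diff = x_to - x_from - gamma * g_from
        return -np.dot(diff, diff) / (4.0 * gamma)

    for n in range(n_samples):
        prop = theta + gamma * glp + np.sqrt(2.0 * gamma) * rng.standard_normal(d)
        lp_prop, glp_prop = log_post(prop), grad_log_post(prop)
        log_alpha = (lp_prop - lp) + log_q(theta, prop, glp_prop) - log_q(prop, theta, glp)
        if np.log(rng.uniform()) < log_alpha:        # Metropolis-Hastings correction
            theta, lp, glp = prop, lp_prop, glp_prop
            accepted += 1
        samples[n] = theta
    return samples, accepted / n_samples
```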

An advantage of using the Gaussian process emulator in the posterior is that, assuming the prior is differentiable, \(\nabla \log \pi ^N(\varvec{\theta }|{\textbf{y}})\) can be computed analytically for the mean-based and marginal approximations introduced in Sect. 2.3, which enables us to easily implement the MALA algorithm. Note that, in contrast, since the true posterior involves the (analytical or numerical) solution u to the PDE (1a)-(1b), it is usually impossible to compute these gradients analytically and one needs to resort to numerical approximation. The following lemma gives the gradients of the different approximate posteriors. The proof can be found in Appendix C.

Lemma 1

Given the Gaussian process \(\mathcal {G}^N_X \sim \text {GP}(\textbf{m}^{\mathcal {G}_X}_N(\varvec{\theta }),K_N(\varvec{\theta },\varvec{\theta }))\) that emulates the forward map \(\mathcal {G}_X\) with data \(\mathcal {G}_X(\Theta )\), we have the gradient of the mean-based approximation of the posterior

$$\begin{aligned}&\nabla \log (\pi _\textrm{mean}^{N,\mathcal {G}_X} (\varvec{\theta }|\textbf{y})) \\&\quad = -\frac{1}{\sigma ^{2}_{\eta }} \nabla \textbf{m}_N^{\mathcal {G}_{X}}(\varvec{\theta }) ^{T}(\textbf{m}_N^{\mathcal {G}_{X}}(\varvec{\theta })-\textbf{y}) + \nabla \log \pi _0(\varvec{\theta }), \end{aligned}$$

and the gradient of the marginal approximation of the posterior

$$\begin{aligned}&\nabla \log (\pi _\textrm{marginal}^{N,\mathcal {G}_X} (\varvec{\theta }|\textbf{y})) \\&\quad = -\nabla \textbf{m}_N^{\mathcal {G}_{X}}(\varvec{\theta })^T(K_N(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta })^{-1}(\textbf{m}_N^{\mathcal {G}_{X}}(\varvec{\theta })-\textbf{y}) \\&\qquad - \frac{1}{2}(\textbf{m}_N^{\mathcal {G}_{X}}(\varvec{\theta })-\textbf{y})^T\nabla \left( (K_N(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta })^{-1}\right) (\textbf{m}_N^{\mathcal {G}_{X}}(\varvec{\theta })-\textbf{y})\\&\qquad - \frac{1}{2}{{\,\textrm{Tr}\,}}\left( (K_N(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta } )^{-1}\nabla K_N(\varvec{\theta },\varvec{\theta })\right) \\&\qquad + \nabla \log \pi _0 (\varvec{\theta }),\\&\text {where }\\&\nabla \left( (K_N(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta })^{-1}\right) \\&\quad = -(K_N(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta })^{-1}\nabla \left( K_N(\varvec{\theta },\varvec{\theta })\right) (K_N(\varvec{\theta },\varvec{\theta })+\Gamma _{\eta })^{-1}\\&\quad \text {and } \nabla K_N(\varvec{\theta },\varvec{\theta }) = 2\nabla K(\varvec{\theta },\Theta )K(\Theta ,\Theta )^{-1}K(\Theta ,\varvec{\theta }). \end{aligned}$$
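As an illustration of how the first formula of Lemma 1 is used in practice, the following sketch evaluates the gradient of the mean-based approximation for the baseline emulator with zero prior mean, so that \(\textbf{m}_N^{\mathcal {G}_X}(\varvec{\theta }) = K(\varvec{\theta },\Theta )\varvec{\alpha }\) with \(\varvec{\alpha }\) precomputed offline. The callable grad_k is an assumption of this sketch: it returns the gradient of the scalar kernel with respect to its first argument (e.g. \(-(\varvec{\theta }-\varvec{\theta }^i)k(\varvec{\theta },\varvec{\theta }^i)/l^2\) for the squared exponential kernel (5)). Since the predictive mean under the separable prior of Sect. 3.1 coincides with the baseline one, the same sketch applies there.

```python
import numpy as np

def grad_log_post_mean(theta, Theta, alpha, grad_k, m_N, y, sigma_eta2, grad_log_prior):
    """
    Gradient of the mean-based approximation (first formula of Lemma 1) for the
    baseline emulator with zero prior mean: m_N(theta) = k(theta, Theta) @ alpha,
    alpha = K(Theta, Theta)^{-1} G_X(Theta) precomputed offline.
    """
    grad_kstar = np.stack([grad_k(theta, t_i) for t_i in Theta])  # (N, d_theta)
    J = alpha.T @ grad_kstar                                      # (d_y, d_theta) Jacobian of m_N
    return -(J.T @ (m_N - y)) / sigma_eta2 + grad_log_prior(theta)
```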
Table 2 Symbols and notations used in numerical experiments

4 Numerical experiments

We now discuss a number of different numerical experiments related to inverse problems for the PDE (1a)-(1b) in various set-ups, in terms of the number of spatial and parameter dimensions as well as for different types of forward models. A common theme in all our experiments is that the number of training points N is small, as would be the case in large-scale applications in practice, where increasing the number of training points is often either very costly or infeasible. The number of training points N used will serve as a benchmark for comparing different methodologies. Throughout all our numerical experiments in Sects. 4.1-4.3, when comparing the different approaches we keep N fixed. The value of N is chosen in such a way to ensure that significant uncertainty remains present in the emulator, which is typically the case in applications. Alternatively, one could ask what number of training points for each model is needed to reach a certain accuracy; however, as explained above, this is not the viewpoint taken here. Precise timings for each of the approaches are reported in Sect. 4.4.

In cases where the PDE solution is not available in closed form, we use the finite element software Firedrake (Rathgeber et al. 2016) to obtain the "true" solution. Furthermore, when using MALA we adaptively tune the step size to achieve an average acceptance probability of 0.573 (Brooks et al. 2011). In all our numerical experiments, we replace the uniform prior with a smooth approximation given by the \(\lambda \)-Moreau-Yosida envelope (Bauschke et al. 2011) with \(\lambda = 10^{-3}\).
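Our reading of this construction, offered only as a hedged sketch: the negative log of a uniform prior on a box is, up to a constant, the indicator function of the box, and its Moreau-Yosida envelope is the scaled squared distance to the box. This yields a differentiable log-prior (and gradient) suitable for MALA; the interpretation and variable names below are our own assumptions.

```python
import numpy as np

def smoothed_uniform_log_prior(theta, lower=-1.0, upper=1.0, lam=1e-3):
    """Smoothed log of a uniform prior on a box via the Moreau-Yosida envelope of
    the box indicator: up to a constant, -dist(theta, box)^2 / (2*lam)."""
    proj = np.clip(theta, lower, upper)          # projection onto the box
    return -np.sum((theta - proj)**2) / (2.0 * lam)

def smoothed_uniform_grad_log_prior(theta, lower=-1.0, upper=1.0, lam=1e-3):
    """Gradient of the smoothed log prior: zero inside the box, pulls back towards it outside."""
    proj = np.clip(theta, lower, upper)
    return -(theta - proj) / lam
```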

The selection of hyperparameters is crucial for the application of Gaussian process regression. In this paper, we optimise the hyperparameters by minimising the negative log marginal likelihood, which in general could be computationally intensive. We simplify this process by assuming isotropy for the length-scale in the covariance function for \(\varvec{\theta }\) as in (5) and (6), so the optimisation of the hyperparameters becomes a two-dimensional problem. Additionally, since we are operating in the small training data regime, the computational cost of evaluating the log marginal likelihood is small. To emphasise the improvement brought by the added structure, and also for simplicity, in the case of the spatially correlated and the PDE-constrained models we use the same hyperparameters for the covariance function of the unknown parameter \(\varvec{\theta }\) as in the baseline model and only optimise the hyperparameters of the spatial covariance function. In principle, these assumptions can be relaxed to achieve potentially higher accuracy in the regression. The computational timings for the optimisation of the hyperparameters can be found in Appendix D.
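A minimal sketch of this hyperparameter fit for the baseline model: the negative log marginal likelihood of the \({d}_{\textbf{y}}\) independent outputs as a function of \((\log \sigma ^2, \log l)\) for the squared exponential kernel (5), minimised with a standard optimiser. The jitter value and variable names are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, Theta, G_Theta):
    """Negative log marginal likelihood of the baseline emulator for kernel (5),
    summed over the d_y independent outputs. Theta: (N, d_theta), G_Theta: (N, d_y)."""
    sigma2, ell = np.exp(log_params)
    d2 = np.sum((Theta[:, None, :] - Theta[None, :, :])**2, axis=-1)
    K = sigma2 * np.exp(-d2 / (2.0 * ell**2)) + 1e-10 * np.eye(Theta.shape[0])
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, G_Theta))
    return (0.5 * np.sum(G_Theta * alpha)
            + G_Theta.shape[1] * (np.sum(np.log(np.diag(L)))
                                  + 0.5 * Theta.shape[0] * np.log(2.0 * np.pi)))

# Two-dimensional optimisation over (log sigma^2, log l), e.g.
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(Theta, G_Theta))
```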

To clarify the notation we use in our numerical experiments, we recall some of it in Table 2.

4.1 Examples in one spatial dimension

4.1.1 Two-dimensional piece-wise constant diffusion coefficient

We consider an elliptic equation in one spatial dimension with a piece-wise constant diffusion coefficient depending on a two-dimensional parameter:

$$\begin{aligned}&-\frac{\textrm{d}}{\textrm{d}x}(\exp (\kappa (x, \varvec{\theta }))\frac{\textrm{d}}{\textrm{d}x} {u}(x)) = 4x, \nonumber \\&x \in (0,1), \quad \varvec{\theta }\in [-1,1]^{2}, \nonumber \\&u(0) =0, \qquad u(1) =2, \end{aligned}$$
(20)

where \(\kappa \) is defined as piece-wise constant over four equally spaced intervals. More precisely, we consider

$$\begin{aligned} \kappa (x,\varvec{\theta })= \left\{ \begin{array}{lr} 0, &{} \text {for } x \in [0,\frac{1}{4})\\ \theta _{1}, &{} \text {for } x \in [\frac{1}{4},\frac{1}{2})\\ \theta _{2}, &{} \text {for } x \in [\frac{1}{2},\frac{3}{4})\\ 1 &{} \text {for } x \in [\frac{3}{4},1] \end{array}\right. \end{aligned}$$
(21)

Since it is not possible to solve (20) explicitly, we use Firedrake to obtain its solution.
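For readers without access to a finite element package, the following finite-difference sketch approximates the solution of (20) with coefficient (21); it is only an illustrative stand-in for the Firedrake solve used in the paper, and the discretisation choices are our own.

```python
import numpy as np

def solve_diffusion_fd(theta, n=200):
    """Second-order finite-difference solve of -(exp(kappa) u')' = 4x on (0,1),
    u(0) = 0, u(1) = 2, with the piece-wise constant kappa from (21)."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    kappa = np.select([x < 0.25, x < 0.5, x < 0.75], [0.0, theta[0], theta[1]], default=1.0)
    a = np.exp(kappa)
    a_half = 0.5 * (a[:-1] + a[1:])                  # coefficient at cell midpoints
    # assemble the interior system for nodes 1, ..., n-1
    main = (a_half[:-1] + a_half[1:]) / h**2
    off = -a_half[1:-1] / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    b = 4.0 * x[1:-1]
    b[-1] += a_half[-1] * 2.0 / h**2                 # Dirichlet lift for u(1) = 2
    u = np.zeros(n + 1)
    u[-1] = 2.0
    u[1:-1] = np.linalg.solve(A, b)
    return x, u
```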

Fig. 1 Error between the predictive mean of the PDE-constrained emulator and the ground truth (\(\varvec{\theta }= \varvec{\theta }^{\dagger }\)) at the \({d}_{\textbf{y}}= 6\) observation points for different \(\bar{N}\) (\(d_{f} = 20\)) (top plot) and different \(d_{f}\) (\(\bar{N} = 10\)) (bottom plot), with \(d_g = 2\) fixed

Throughout this numerical experiment, we take the prior of the parameters to be the uniform distribution on \([-1,1]^{2}\), and we generate our data \(\textbf{y}\) according to equation (2) for \(\varvec{\theta }^{\dagger }=[0.098, 0.430]\), \({d}_{\textbf{y}}=6\) (equally spaced points in [0,1]) and noise level \(\sigma ^{2}_{\eta }=10^{-4}\). For the covariance kernels, we choose \(k_p\) to be the squared exponential kernel and \(k_s\) to be the Matérn kernel with \(\nu = \frac{5}{2}\).

For the PDE-constrained model, we first test the effect of the additional training data \(g(\Theta _g, X_g)\) and \(f(\Theta _f, X_f)\) on the accuracy of the emulator. In Fig. 1, we see that as \(d_{f}\) and \(\bar{N}\) increase, the accuracy of the emulator gradually improves.

We now use MALA to draw \(10^{6}\) samples from each of our approximate posteriors. For all models, we have used \(N=4\) training points (chosen to be the first 4 points in the Halton sequence), while additionally for the PDE-constrained model, we have used \(\bar{N} = 10\) (chosen to be the next 10 points in the Halton sequence), \(d_{f} = 20\) and \(d_{g} = 2\). Since we do not have access to the true posterior, we consider the results obtained from a mean-based approximation with the baseline model for \(N=10^{2}\) training points as the ground truth.

Fig. 2 Contour plots of the approximate mean-based and marginal-based posteriors: baseline model (top left plot), spatially correlated (top right plot), PDE-constrained (bottom plot). The symbol \("+"\) denotes \(\varvec{\theta }^{\dagger }\). \(\mathcal {G}_X\) is the discretised solution u in (20)

As we can see in Fig. 2, all the mean-based posteriors fail to put significant posterior mass near the true parameter value \(\varvec{\theta }^{\dagger }\). The situation improves when the uncertainty of the emulator is taken into account, as we can see for the marginal approximations. Out of the three different models, the PDE-constrained one appears to perform best, since it places the most posterior mass around the true value \(\varvec{\theta }^{\dagger }\). This is further illustrated in Fig. 3, where we plot the \(\theta _{1}\) and \(\theta _{2}\) marginals for all the mean-based posterior approximations \(\pi ^{N,\mathcal {G}_X}_{\textrm{mean}}\), \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{mean}}\), \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{mean}}\) and the marginal-based posterior approximations \(\pi ^{N,\mathcal {G}_X}_{\textrm{marginal}}\), \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{marginal}}\), \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{marginal}}\). Note that the marginal plots can be misleading regarding the overall performance of the approximations: in Fig. 3 (top right), for example, the baseline model seems to be better than the PDE-constrained model, but from Fig. 2 we know that this is not true. In other words, the one-dimensional marginals can be approximated well even when the joint posterior is not. When we increase \(d_{f}\) from 20 to 50, the accuracy of the approximation improves, as we can see in Fig. 4, where we compare PDE-constrained approximations for the two different values of \(d_{f}\).

Fig. 3 Comparison of different models’ marginal distributions when \(N=4\), for the PDE model \(d_f = 20\) and \(\bar{N} = 20\): mean-based approximation of the \(\theta _1\) marginal (top left plot) and \(\theta _2\) marginal (top right plot), marginal approximation of the \(\theta _1\) marginal (bottom left plot) and \(\theta _2\) marginal (bottom right plot). \(\mathcal {G}_X\) is the discretised solution u in (20) with diffusion coefficient (21)

Fig. 4 Comparison of different models’ marginal distributions when \(N=4\): mean-based approximation (left plot) and marginal-based approximation (right plot)

4.1.2 Parametric expansion for the diffusion coefficient

In this example, we study again (20), but this time, instead of working with a piece-wise constant diffusion coefficient, we assume that the diffusion coefficient satisfies the following parametric expansion

$$\begin{aligned} \kappa (\varvec{\theta },x) = {\sum _{n=1}^{d_{\varvec{\theta }}}\sqrt{a_n}\theta _n b_n(x)} \end{aligned}$$
(22)

where \(a_n = \frac{8}{\omega ^2_n + 16}\), \(b_{n}(x) = A_n (\sin (\omega _n x) + \frac{\omega _n}{4}\cos (\omega _n x))\), \(\omega _n\) is the \(n\)th solution of the equation \(\tan (\omega _n) = \frac{8\omega _n}{\omega ^2_n - 16}\), and \(A_n\) is a normalisation constant which makes \(\Vert b_n\Vert _{L^2(0,1)} = 1\). This choice is motivated by the fact that, for \(\{\theta _n\}_{n=1}^{d_{\varvec{\theta }}}\) i.i.d. standard normal random variables, this is a truncated Karhunen-Loève expansion of \(\kappa (\varvec{\theta }, x) \sim \text {GP}(0, \exp (-\Vert x-x'\Vert _1))\) (Ghanem and Spanos 1991).
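A sketch of how this expansion can be evaluated in practice: the frequencies \(\omega _n\) are obtained by bracketing roots of the equivalent continuous function \((\omega ^2-16)\sin \omega - 8\omega \cos \omega \), and the \(b_n\) are normalised numerically. All implementation details here (grid sizes, function names) are our own choices.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.integrate import trapezoid

def kl_frequencies(n_modes, grid_pts=20000, omega_max=100.0):
    """First n_modes positive solutions of tan(w) = 8w / (w^2 - 16), found as roots
    of g(w) = (w^2 - 16) sin(w) - 8 w cos(w), which has no poles."""
    g = lambda w: (w**2 - 16.0) * np.sin(w) - 8.0 * w * np.cos(w)
    grid = np.linspace(1e-6, omega_max, grid_pts)
    vals = g(grid)
    roots = []
    for i in np.flatnonzero(np.sign(vals[:-1]) * np.sign(vals[1:]) < 0):
        roots.append(brentq(g, grid[i], grid[i + 1]))
        if len(roots) == n_modes:
            break
    return np.array(roots)

def kappa_expansion(theta, x, omegas):
    """Log-diffusion coefficient kappa(theta, x) from (22) on a 1D grid x."""
    a = 8.0 / (omegas**2 + 16.0)
    b = np.array([np.sin(w * x) + (w / 4.0) * np.cos(w * x) for w in omegas])
    norms = np.sqrt(trapezoid(b**2, x, axis=1))      # A_n makes ||b_n||_{L^2} = 1
    b = b / norms[:, None]
    return np.sum(np.sqrt(a)[:, None] * np.asarray(theta)[:, None] * b, axis=0)
```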

In terms of the inverse problem setting, we are using the same parameters as before (\(\varvec{\theta }^{\dagger } = [0.098,0.430]\), \({d}_{\textbf{y}}= 6\), noise level \(\sigma ^2_{\eta } = 10^{-4}\)). The number of training points for all the emulators has been set to \(N=4\) (chosen using the Halton sequence), while in the case of the PDE-constrained emulator we have used \(\bar{N}=10\) and \(d_{f}=8\). Furthermore, throughout this numerical experiment, we take the prior of the parameters to be the uniform distribution on \([-1,1]^{2}\). For the choices of kernels, we use the squared exponential kernel for both \(k_p\) and \(k_s\).

As in the previous experiments, we produce \(10^{6}\) samples of the posteriors using MALA, and use the results obtained by a mean-based approximation with the baseline model for \(N=10^{2}\) training points as the ground truth.

Fig. 5 Comparison of different models’ marginal distributions when \(N=4\), for the PDE model \(\bar{N} = 10\) and \(d_f = 8\): mean-based approximation of the \(\theta _1\) marginal (top left plot) and \(\theta _2\) marginal (top right plot), marginal approximation of the \(\theta _1\) marginal (bottom left plot) and \(\theta _2\) marginal (bottom right plot). \(\mathcal {G}_X\) is the discretised solution u in (20) with diffusion coefficient (22) and \({d}_{\mathbf {\varvec{\theta }}}= 2\)

We now plot in Fig. 5 the \(\theta _{1}\) and \(\theta _{2}\) marginals for the different Gaussian emulators, both for the mean-based and for the marginal posterior approximations. As we can see in Fig. 5, for the mean-based posterior approximations the baseline and spatially correlated models fail to capture the true posterior, whereas the PDE-constrained model is in excellent agreement with it. When looking at the marginal approximations in Fig. 5 (bottom left and bottom right), we see that the marginals for the baseline and spatially correlated models move closer towards the true value \(\varvec{\theta }^{\dagger }\) and exhibit variance inflation. The PDE-constrained model, in contrast, is again in excellent agreement with the true posterior.

4.2 Two spatial dimensions

In this example, we increase the spatial dimension from \(d_\textbf{x} = 1\) to \(d_\textbf{x} = 2\) and again use a piece-wise constant diffusion coefficient depending on a two-dimensional parameter. The values of the diffusion coefficient are set in a similar way to the first example, but depend only on the first component of \(\textbf{x}\):

$$\begin{aligned} \kappa (\textbf{x},\varvec{\theta })= \left\{ \begin{array}{lr} 0, &{} \text {for } x_1 \in [0,\frac{1}{4}),\\ \theta _{1}, &{} \text {for } x_1 \in [\frac{1}{4},\frac{1}{2}),\\ \theta _{2}, &{} \text {for } x_1 \in [\frac{1}{2},\frac{3}{4}),\\ 1, &{} \text {for } x_1 \in [\frac{3}{4},1]. \end{array}\right. \end{aligned}$$
(23)

The boundary conditions are a mixture of Neumann and Dirichlet conditions, given by

$$\begin{aligned}&\partial _{x_2} u(x_1,0) = \partial _{x_2} u(x_1,1) = 0,&\text {for } x_1 \in [0,1], \\&u(0,x_2) = 1, \quad u(1,x_2) = 0,&\text {for } x_2 \in [0,1]. \end{aligned}$$

These boundary conditions define a flow cell, with no flux at the top and bottom boundary (\(x_2=0,1\)) and flow from left to right induced by the higher value of u at \(x_1=0\).

For the observations, we generate our data \(\textbf{y}\) according to equation (2) for \(\varvec{\theta }^{\dagger }=[0.098, 0.430]\) with \({d}_{\textbf{y}}=6\) (chosen to be the first 6 points in the Halton sequence) and a noise level \(\sigma ^{2}_{\eta }=10^{-5}\). In addition, for the baseline and spatially correlated models, we have used \(N=4\) training points (chosen to be the first 4 points in the Halton sequence), while additionally for the PDE-constrained model, we have used \(\bar{N} = 30\), \(d_{{f}}=30\) and \(d_g=8\), corresponding to 2 equally spaced points on each boundary. For the covariance kernels, we let \(k_p\) be the squared exponential kernel and \(k_s\) be the Matérn kernel with \(\nu = \frac{5}{2}\).

Fig. 6 Comparison of different models’ marginal distributions when \(N=4\), for the PDE model \(\bar{N} = 30\) and \(d_f = 30\): mean-based approximation of the \(\theta _1\) marginal (top left plot) and the \(\theta _2\) marginal (top right plot), and marginal approximation of the \(\theta _1\) marginal (bottom left plot) and the \(\theta _2\) marginal (bottom right plot). \(\mathcal {G}_X\) is the discretised solution u with \(d_\textbf{x} = 2\) and diffusion coefficient (23)

We plot the mean-based approximate marginal posteriors in Fig. 6 (top left and top right). We can see that in this case, the PDE-constrained model significantly improves the approximation accuracy, which is different from the previous piece-wise constant diffusion coefficient example in one spatial dimension. In Fig. 6 (bottom left and bottom right), we compare the marginal approximation for the three models, and we again see that the PDE-constrained model performs best.

4.3 Emulating the negative log-likelihood

As discussed in Sect. 2.3.2, we can emulate the negative log-likelihood directly with Gaussian process regression instead of emulating the forward map. Since emulation of the log-likelihood involves emulating a non-linear functional of the PDE solution u, we are not able to incorporate spatial correlation or PDE constraints in the same way. We test the performance of the mean-based approximation (12) and the marginal approximation (13) using the previous examples: problem (20) with diffusion coefficient (21) with \(d_\textbf{x} = 1\) and \(d_\textbf{x} = 2\). All parameters are kept the same as in Sect. 4.1.1 and Sect. 4.2.

In Fig. 7, we compare the mean-based approximation obtained by emulating the log-likelihood \(\Phi \) with that obtained by emulating the observation operator \({\mathcal {G}}_X\) using the baseline model. The results differ markedly between the two examples. For \(d_\textbf{x}=1\), emulating the log-likelihood performs better than emulating the observations with the baseline model: the approximate posterior is closer to the true posterior. For \(d_\textbf{x} = 2\), its performance is much worse. Hence, emulating the log-likelihood with a small amount of data can be less reliable than emulating the forward map. If we increase the number of training points to \(N=10\) for the \(d_\textbf{x} = 2\) case, we see an improvement in accuracy in Fig. 8, but the result is still worse than emulating the forward map with the baseline model.

Fig. 7 Comparison of emulating the log-likelihood function and emulating the observations when \(N=4\). Both approximations are the mean-based approximation. \(\mathcal {G}_X\) is the negative log-likelihood function in the problem (20) with the diffusion coefficient (21) with \(d_\textbf{x} = 1\) (left plot) and \(d_\textbf{x} = 2\) (right plot)

Similarly, we see in Fig. 9 that marginal approximations of the posterior based on emulation of the log-likelihood appear to be less reliable, but including more training points can again improve the performance.

Fig. 8 The accuracy of the emulator is improved when N increases (\(N=10\)). \(\mathcal {G}_X\) is the negative log-likelihood function in the problem (20) with the diffusion coefficient (21) with \(d_\textbf{x} = 2\) and mean-based approximation

Fig. 9 Marginal approximation with \(N=4\) (left plot) and marginal approximation with \(N=10\) (right plot). \(\mathcal {G}_X\) is the negative log-likelihood function in the problem (20) with the diffusion coefficient (21) with \(d_\textbf{x} = 1\)

4.4 Computational timings

In this section, we discuss computational timings. We focus on the computational gains resulting from using Gaussian process emulators instead of the PDE solution in the posterior (see Table 3) and the relative costs of sampling from the various approximate posteriors (see Tables 4, 5, 6 and 7).

Table 3 gives average computational timings comparing the evaluation of the solution of the PDE using Firedrake with using the Gaussian process surrogate model. For the baseline surrogate model, the two primary costs are (i) computing the coefficients \(\varvec{\alpha } = K(\Theta ,\Theta )^{-1} {\mathcal {G}}_X(\Theta )\), which is an offline cost and only needs to be done once, and (ii) computing the predictive mean \(\textbf{m}^{{\mathcal {G}}_X}_{N}(\varvec{\theta }) = K(\varvec{\theta },\Theta ) \varvec{\alpha }\), which is the online cost and needs to be done for every new test point \(\varvec{\theta }\). We see that evaluating \(\textbf{m}^{{\mathcal {G}}_X}_{N}(\varvec{\theta })\) is orders of magnitude faster than evaluating \({\mathcal {G}}_X(\varvec{\theta })\).

In Tables 4, 5 and 6, we compare average computational timings of drawing one sample from the approximate posterior with different models. In Table 4, we see that the mean-based approximation with the PDE-informed prior is more expensive than the one with the baseline prior, by a factor of 2–4 depending on the setting. This is to be expected, since the PDE-informed posterior mean \(\textbf{m}^{{\mathcal G}_X}_{N,X_f,X_g}\) involves matrices of larger dimensions than the baseline posterior mean \(\textbf{m}^{{{\mathcal {G}}}_X}_{N}\).

Table 5 investigates the different marginal approximations. Compared to the mean-based approximations in Table 4, we see that the marginal approximations are more expensive by a factor of around 2 for the baseline model and around 3–10 for the PDE-constrained model. Within the different marginal approximations, the spatially correlated model is not much more expensive than the baseline model, whereas, depending on the setting, the PDE-constrained model is 2–30 times more expensive.

In Table 6, we can see that emulating the log-likelihood significantly reduces the cost of sampling from the mean-based and marginal approximations, by around a factor of 20 compared to the baseline model for emulating the observations.

Table 3 Timings of PDE solution versus baseline Gaussian process emulator
Table 4 Timings of different mean-based approximations (baseline and PDE-constrained)
Table 5 Timings of different marginal approximations (baseline, spatially correlated and PDE-constrained); \({\overline{N}}\) and \(d_f\) are as in Table 4
Table 6 Timings of mean-based and marginal approximation when emulating the log-likelihood

Finally, Table 7 shows the effective sample sizes (ESSs) obtained for the different posterior approximations with MALA. We can see that the ESSs are all comparable, implying that it is meaningful to look at the cost per sample to compare the different approximate posteriors in terms of computational cost.

5 Conclusions, discussion and actionable advice

Bayesian inverse problems for PDEs pose significant computational challenges. The application of state-of-the-art sampling methods, including MCMC methods, is typically computationally infeasible due to the large computational cost of simulating the underlying mathematical model for a given value of the unknown parameters. A solution to alleviate this problem is to use a surrogate model to approximate the PDE solution within the Bayesian posterior distribution. In this work we considered the use of Gaussian process surrogate models, which are frequently used in engineering and geo-statistics applications and offer the benefit of built-in uncertainty quantification in the variance of the emulator.

The focus of this work was on practical aspects of using Gaussian process emulators in this context, providing efficient MCMC methods and studying the effect of various modelling choices in the derivation of the approximate posterior on its accuracy and computational efficiency. We now summarise the main conclusions of our investigation.

Table 7 Effective sample size for \(10^6\) samples
  1. Emulating log-likelihood vs emulating observations. We can construct an emulator for the negative log-likelihood \(\Phi \) or the parameter-to-observation map \({\mathcal {G}}_X\) in the likelihood (3).

    • Computational efficiency. The log-likelihood \(\Phi \) is always scalar-valued, independent of the number of observations \({d}_{\textbf{y}}\), which makes the computation of the approximate likelihood for a given value of the parameters \(\varvec{\theta }\) much cheaper than the approximate likelihood with emulated \({\mathcal {G}}_X\). The relative cost will depend on \({d}_{\textbf{y}}\).

    • Accuracy. When only limited training data are provided, emulating \({\mathcal {G}}_X\) appears more reliable than emulating \(\Phi \), even with the baseline model. The major advantage of emulating \({\mathcal {G}}_X\) is that it allows us to include correlation between different observations, i.e. between the different entries of \({\mathcal {G}}_X\). This substantially increases the accuracy of the approximate posteriors, in particular if we use the PDE structure to define the correlations (see point 3 below).

  2. Mean-based vs marginal posterior approximations. We can use only the mean of the Gaussian process emulator to define the approximate posterior as in (10) and (12), or we can make use of its full distribution to define the marginal approximate posteriors as in (11) and (13).

    • Computational efficiency. The mean-based approximations are faster to sample from using MALA. This is due to the simpler structure of the gradient required for the proposals. The difference in computational times depends on the prior chosen, and is greater for the PDE-constrained model.

    • Accuracy. The marginal approximations correspond to a form of variance inflation in the approximate posterior (see Sect. 2.3), representing our incomplete knowledge about the PDE solution. They thus combat over-confident predictions. In our experiments, we confirm that they typically allocate larger mass to regions around the true parameter value than the mean-based approximations.

  3. Spatial correlation and PDE-constrained priors.

    • Computational efficiency. Introducing the spatially correlated model only affects the marginal approximation, and sampling from the marginal approximate posterior with the spatially correlated model is slightly slower than with the baseline model. The PDE-constrained model significantly increases the computational times for both the mean-based and marginal approximations, with the extent depending on the size of the additional training data.

    • Accuracy. Introducing spatial correlation improves the accuracy of the marginal approximation compared to the baseline model. The most accurate results are obtained with the PDE-constrained priors, which are problem specific and more informative. A benefit of the spatially correlated model is that it does not rely on the underlying PDE being linear, and easily extends to non-linear settings.

In summary, the marginal posterior approximations and the spatially correlated or PDE-constrained prior distributions provide mechanisms for increasing the accuracy of the inference and avoiding over-confident biased predictions, without the need to increase N. This is particularly useful in practical applications, where the number of model runs N available to train the surrogate model may be very small due to constraints in time and/or cost. This does result in higher computational cost compared to mean-based approximations based on black-box priors, but may still be the preferable option if obtaining another training point is impossible or computationally very costly.

Variance inflation, as exhibited in the marginal posterior approximations considered in this work, is a known tool to improve Bayesian inference in complex models, see e.g. (Conrad et al. 2017; Calvetti et al. 2018; Fox et al. 2020). Conceptually, it is also related to including model discrepancy (Kennedy and O’Hagan 2000; Brynjarsdóttir and O’Hagan 2014). The approach to variance inflation presented in this work has several advantages. Firstly, the variance inflation being equal to the predictive variance of the emulator means that the amount of variance inflation included depends on the location \(\varvec{\theta }\) in the parameter space. We introduce more uncertainty in parts of the parameter space where we have less training points and the emulator is possibly less accurate. Secondly, the amount of variance inflation can be tuned in a principled way using standard techniques for hyperparameter estimation in Gaussian process emulators. There is no need to choose a model for the variance inflation separately to choosing the emulator, since this is determined automatically as part of the emulator.

We did not consider optimal experimental design in this work, i.e. the question of how to optimally choose the locations \(\Theta \) of the training data. One would expect optimal design to have a large influence on the accuracy of the approximate posteriors, especially for small N. In the context of inverse problems, one usually wants to place the training points in regions of parameter space where the (approximate) posterior places significant mass (see e.g. Helin et al. (2023) and the references therein). For a fair comparison between all scenarios, and to eliminate the interplay between optimal experimental design and other modelling choices, we have chosen the training points as a space-filling design in our experiments. We expect the same conclusions to hold with optimally placed points.