1 Introduction

The study of Bayesian inverse problems [8, 26] has gained wide attention during the last decade as the increase in computational resources and algorithmic development have enabled uncertainty quantification in numerous new applications in science and engineering. Large-scale problems, where the computational burden of the likelihood is prohibitive, are, however, still a subject of ongoing research.

In this paper we study the Laplace approximation of the posterior distribution arising in nonlinear Bayesian inverse problems. The Laplace approximation is obtained by replacing the log-posterior density with its second order Taylor approximation around the maximum a posteriori (MAP) estimate and renormalizing the density. This produces a Gaussian measure centered at the MAP estimate whose covariance is the inverse of the Hessian of the negative log-posterior density (see, e.g., [3, Section 4.4]).

The asymptotic behavior of the parametric Laplace approximation in the small noise or large data limit has been studied extensively in the past (see, e.g., [30]). Regarding the approximation properties with respect to posterior expectations of given functions, there is a long line of research, which we discuss below. Our work is parallel to this effort in that we aim to estimate the total variation (TV) distance between the two probability measures. On the one hand, the error in TV distance bounds the error of the expectation of any bounded function with respect to the Laplace approximation. On the other hand, it is a measure of the non-Gaussianity of the posterior distribution. Thus, our results describe and quantify how the nonlinearity of the forward mapping translates into non-Gaussianity of the posterior distribution.

Our work is motivated by a recent result by Schillings, Sprungk, and Wacker in [25], where the authors show that in the context of Bayesian inverse problems, the Laplace approximation error in Hellinger distance converges to zero in the order of the noise level. In practice, one is, however, often interested in estimating the error for a given, fixed noise level. It can, e.g., be unclear whether the noise level is small enough to dominate the error estimate. Indeed, the nonlinearity of the forward mapping (more generally, the non-Gaussianity of the likelihood) or a large problem dimension can have a significant contribution to the constant appearing in the asymptotic estimates. Therefore, it is of interest to quantify such effects in non-asymptotic error estimates for the Laplace approximation. This is the main goal of our work.

1.1 Our contributions

The main contribution of this work is threefold:

  1.

    In Theorem 3.4, we derive our central error estimate for the total variation distance of the Laplace posterior approximation in nonlinear Bayesian inverse problems. The error bound consists of two error terms, for which we derive an implicit optimal balancing rule in Proposition 3.13. To control the error, we assume uniform bounds on the third differentials of the log-likelihood and the log-prior density as well as a quadratic lower bound on the negative log-posterior density. Given such bounds, the error estimate can be evaluated numerically.

  2.

    In Theorem 4.1, we derive a further estimate for the Laplace approximation error that makes the effect of the noise level, the bounds specified above, and the dimension of the problem explicit. This error estimate readily implies linear rates of convergence for fixed problem dimension, both in the small noise limit and when the third differential of the log-likelihood tends to zero, see Corollary 4.4. It furthermore leads to a convergence rate for increasing problem dimension in terms of the noise level, the problem dimension, and the aforementioned bounds that is aligned with [15], see Corollary 4.6.

  3.

    In Theorem 5.3, we quantify the error of the Laplace approximation in terms of the nonlinearity of the forward mapping for linear inverse problems with a small nonlinear perturbation and a Gaussian prior distribution. To control the error, we assume uniform bounds on the differentials of the nonlinear perturbation up to third order. This error estimate immediately implies linear convergence in terms of the size of the perturbation. Moreover, such a result provides insight into Bayesian inference for nonlinear inverse problems in which linearization of the forward mapping has suitable approximation properties.

1.2 Relevant literature

The asymptotic approximation of general integrals of the form \(\int e^{\lambda f(x)}g(x) \mathrm {d}x\) by Laplace’s method is presented in [21, 30]. Non-asymptotic error bounds for the Laplace approximation of such integrals have been stated in the univariate [20] and multivariate case [7, 16]. The Laplace approximation error and its convergence in the limit \(\lambda \rightarrow \infty \) have been estimated in the multivariate case when the function f depends on \(\lambda \) or the maximizer of f is on the boundary of the integration domain [10]. A representation of the coefficients appearing in the asymptotic expansion of the approximated integral utilizing ordinary potential polynomials is given in [18].

The error estimates on the Laplace approximation in TV distance are closely connected to the so-called Bernstein–von Mises (BvM) phenomenon that quantifies the convergence of the scaled posterior distribution toward a Gaussian distribution in the large data or small noise limit. Parametric BvM theory is well-understood [12, 29]. Our work is inspired by a BvM result by Lu [15], where a parametric BvM theorem for nonlinear Bayesian inverse problems with an increasing number of parameters is proved. Similar to our objectives, he quantifies the asymptotic convergence rate in terms of noise level, nonlinearity of the forward mapping and dimension of the problem. However, our emphasis differs from [15] (and other BvM results) in that we are not restricted to considering the vanishing noise limit, but are more interested in quantifying the effect of small nonlinearity or dimension at a fixed noise level. We also point out that BvM theory has been developed for non-parametric Bayesian inverse problems (see, e.g., [6, 17, 19]), where the convergence is quantified in a distance that metrizes the weak convergence.

Let us conclude by briefly emphasizing that the Laplace approximation is widely utilized for different purposes in computational Bayesian statistics including, i.a., the celebrated INLA algorithm [22]. It has also recently gained popularity in optimal Bayesian experimental design (see, e.g., [1, 14, 23]). Moreover, it provides a convenient reference measure for numerical quadrature [4, 24] or importance sampling [2].

1.3 Organization of the paper

Before we present the aforementioned three main results in Sects. 3 to 5, we introduce our set-up and notation, Laplace’s method, and the total variation metric in Sect. 2. In Sect. 3, we introduce our central error bound for the Laplace approximation and explain the idea behind its proof. In Sect. 4, we derive an explicit error estimate for the Laplace approximation and describe its asymptotic behavior. In Sect. 5, we prove the error estimate for inverse problems with a small nonlinearity in the forward mapping and a Gaussian prior distribution.

2 Preliminaries and set-up

We consider for \(\varepsilon > 0\) the inverse problem of recovering \(x \in \mathbb {R}^d\) from a noisy measurement \(y \in \mathbb {R}^d\), where

$$\begin{aligned} y = G(x) + \sqrt{\varepsilon }\eta , \end{aligned}$$

\(\eta \in \mathbb {R}^d\) is random noise with standard normal distribution \(\mathscr {N}(0,I_d)\), and G: \({\mathbb {R}^d}\rightarrow {\mathbb {R}^d}\) is a possibly nonlinear mapping. In the following, \({|\cdot |}\) denotes the Euclidean norm on \(\mathbb {R}^d\). If we assume a prior distribution \(\mu \) on \(\mathbb {R}^d\) with Lebesgue density \(\exp (-R(x))\), then Bayes’ formula yields a posterior distribution \(\mu ^y\) with density

$$\begin{aligned} \mu ^y(\mathrm {d}x) \propto \exp \left( -\frac{1}{2\varepsilon } {|y - G(x) |}^2 \right) \mu (\mathrm {d}x) = \exp \left( -\frac{1}{2\varepsilon } {|y - G(x) |}^2 - R(x) \right) \mathrm {d}x. \end{aligned}$$
(2.1)

For all \(x, y \in \mathbb {R}^d\), we denote the scaled negative log-likelihood by

$$\begin{aligned} \varPhi (x) = \frac{1}{2}{|y - G(x) |}^2. \end{aligned}$$

If

$$\begin{aligned} x \mapsto \varPhi (x) + \varepsilon R(x) \end{aligned}$$

has a unique minimizer in \(\mathbb {R}^d\), we call this minimizer the maximum a posteriori (MAP) estimate and denote it by \(\hat{x}= \hat{x}(y)\). Furthermore, we set

$$\begin{aligned} I(x) := \varPhi (x) + \varepsilon R(x) - \varPhi (\hat{x}) - \varepsilon R(\hat{x}) \end{aligned}$$

for all \(x \in \mathbb {R}^d\). This way, I is nonnegative, the MAP estimate \(\hat{x}\) minimizes I and satisfies \(I(\hat{x}) = 0\). Moreover, we can express the posterior density as

$$\begin{aligned} \mu ^y(\mathrm {d}x) = \frac{1}{Z} \exp \left( -\frac{1}{\varepsilon }I(x) \right) \mathrm {d}x \end{aligned}$$
(2.2)

with a normalization constant Z.

Laplace’s method approximates the posterior distribution by a Gaussian distribution \({\mathscr {L}_{\mu ^y}}\) whose mean and covariance are chosen in such a way that its log-density agrees, up to a constant, with the second order Taylor polynomial around \(\hat{x}\) of the log-posterior density. If \(I \in C^2({\mathbb {R}^d},\mathbb {R})\), the Laplace approximation of \(\mu ^y\) is defined as

$$\begin{aligned} \mathscr {L}_{\mu ^y} := \mathscr {N}({\hat{x}}, \varepsilon \varSigma ), \end{aligned}$$

where \(\varSigma := (D^2I({\hat{x}}))^{-1}\). Here, DI denotes the differential of I, and we identify \(D^2I(\hat{x})\) with the Hessian matrix \(\{D^2I(\hat{x})(e_j,e_k)\}_{j,k = 1}^d\). The Lebesgue density of \({\mathscr {L}_{\mu ^y}}\) is given by

$$\begin{aligned} \mathscr {L}_{\mu ^y}(\mathrm {d}x)&= \frac{1}{\widetilde{Z}} \exp \left( -\frac{1}{2\varepsilon }{\Vert x - {\hat{x}} \Vert }_\varSigma ^2\right) \mathrm {d}x \\&= \frac{1}{\widetilde{Z}} \exp \left( -\frac{1}{2\varepsilon }D^2 I({\hat{x}})(x - {\hat{x}},x - {\hat{x}})\right) \mathrm {d}x, \end{aligned}$$

where

$$\begin{aligned} \widetilde{Z} = \int _{\mathbb {R}^d} \exp \left( -\frac{1}{2\varepsilon }{\Vert x - {\hat{x}} \Vert }_\varSigma ^2\right) \mathrm {d}x = \varepsilon ^{\frac{d}{2}} (2\pi )^\frac{d}{2} \sqrt{\det \varSigma }. \end{aligned}$$
(2.3)

Since \(I(\hat{x}) = 0\) and \(DI(\hat{x}) = 0\), \(\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\) is precisely the truncated Taylor series of \(I/\varepsilon \) around \({\hat{x}}\).
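To make this construction concrete, the following minimal Python sketch assembles the Laplace approximation for a generic problem of the form (2.1): it computes the MAP estimate by numerical optimization, approximates \(D^2I({\hat{x}})\) by finite differences, and returns the mean and covariance of \(\mathscr {N}({\hat{x}}, \varepsilon \varSigma )\). The forward map, prior, and parameter values in the example are illustrative placeholders, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(G, R, y, eps, x0):
    """Return (x_hat, eps * Sigma) for the Laplace approximation N(x_hat, eps * Sigma),
    where Sigma = (D^2 I(x_hat))^{-1} and I(x) = Phi(x) + eps * R(x) up to a constant."""
    def I(x):
        return 0.5 * np.sum((y - G(x)) ** 2) + eps * R(x)

    x_hat = minimize(I, x0, method="BFGS").x

    # Hessian of I at x_hat by second-order central finite differences.
    d, h = x_hat.size, 1e-4
    H = np.zeros((d, d))
    for j in range(d):
        for k in range(d):
            ej, ek = np.zeros(d), np.zeros(d)
            ej[j], ek[k] = h, h
            H[j, k] = (I(x_hat + ej + ek) - I(x_hat + ej - ek)
                       - I(x_hat - ej + ek) + I(x_hat - ej - ek)) / (4.0 * h * h)

    Sigma = np.linalg.inv(H)
    return x_hat, eps * Sigma

# Illustrative example (assumed data): a mildly nonlinear forward map with a
# standard Gaussian prior, so R(x) equals |x|^2 / 2 up to an additive constant.
A = np.array([[2.0, 0.3], [0.0, 1.5]])
G = lambda x: A @ x + 0.1 * np.sin(x)
R = lambda x: 0.5 * np.sum(x ** 2)
x_hat, cov = laplace_approximation(G, R, y=np.array([1.0, -0.5]), eps=0.01, x0=np.zeros(2))
print("MAP estimate:", x_hat, "\ncovariance:\n", cov)
```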

The total variation (TV) distance between two probability measures \(\nu \) and \(\mu \) on \(({\mathbb {R}^d},\mathscr {B}({\mathbb {R}^d}))\) is defined as

$$\begin{aligned} {d_\text {TV}}(\nu ,\mu ) = \sup _{A \in \mathscr {B}(\mathbb {R}^d)} {|\nu (A) - \mu (A) |}, \end{aligned}$$

see Section 2.4 in [27]. It has the alternative representation

$$\begin{aligned} {d_\text {TV}}(\nu ,\mu ) = \frac{1}{2} \sup _{{\Vert f \Vert }_\infty \le 1} \left\{ \int _{\mathbb {R}^d}f \mathrm {d}\nu - \int _{\mathbb {R}^d}f \mathrm {d}\mu \right\} = \frac{1}{2} \int _{\mathbb {R}^d} \left| \frac{d\nu }{d\rho } - \frac{d\mu }{d\rho } \right| d\rho \end{aligned}$$

where \({\Vert f \Vert }_\infty := \sup _{x \in {\mathbb {R}^d}} {|f(x) |}\) and \(\rho \) can be any probability measure dominating both \(\mu \) and \(\nu \), see Remark 5.9 in [27] and equation (1.12) in [11]. The total variation distance is valuable for the purpose of uncertainty quantification because it bounds the error of any credible region when using a measure \(\nu \) instead of another measure \(\mu \). It can, moreover, be used to bound the difference in expectation of any bounded function f on \({\mathbb {R}^d}\) with respect to \(\mu \) and \(\nu \), respectively, by

$$\begin{aligned} \left|\mathbb {E}^\nu [f] - \mathbb {E}^\mu [f] \right| \le 2{\Vert f \Vert }_\infty {d_\text {TV}}(\nu ,\mu ), \end{aligned}$$

see Lemma 1.32 in [11]. By Kraft’s inequality

$$\begin{aligned} d_\text {H}(\mu ,\nu )^2 \le {d_\text {TV}}(\mu ,\nu ) \le \sqrt{2}d_\text {H}(\mu ,\nu ), \end{aligned}$$

the total variation distance bounds the square of the Hellinger distance

$$\begin{aligned} d_\text {H}(\mu ,\nu ) = \left( \frac{1}{2} \int _{\mathbb {R}^d} \left| \sqrt{\frac{d\nu }{d\rho }} - \sqrt{\frac{d\mu }{d\rho }} \right| ^2 d\rho \right) ^\frac{1}{2}, \end{aligned}$$

see Definition 1.28 and Lemma 1.29 in [11] or [9]; moreover, both metrics induce the same topology. The bounded Lipschitz metric

$$\begin{aligned} d_\text {BL}(\mu ,\nu ) = \frac{1}{2}\sup _{{\Vert f \Vert }_\infty + {\Vert f \Vert }_\text {Lip} \le 1} \left\{ \int _{\mathbb {R}^d}f \mathrm {d}\mu - \int _{\mathbb {R}^d}f \mathrm {d}\nu \right\} , \end{aligned}$$

which induces the topology of weak convergence of probability measures, is trivially bounded by the total variation distance. Here, we denote

$$\begin{aligned} {\Vert f \Vert }_\text {Lip} := \sup _{x,y \in {\mathbb {R}^d},\,x \ne y} \frac{{|f(x) - f(y) |}}{{|x - y |}}. \end{aligned}$$

For further information on the relation between the total variation distance and other probability metrics we refer to the survey paper [5].
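As a small numerical illustration of these relations, the following sketch computes the total variation and Hellinger distances between two univariate Gaussian densities on a grid and checks Kraft’s inequality; the particular means and variances are arbitrary choices made for this example only.

```python
import numpy as np
from scipy.stats import norm

# Two arbitrary univariate Gaussians playing the roles of mu and nu.
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)    # Lebesgue density of mu
q = norm.pdf(x, loc=0.7, scale=1.3)    # Lebesgue density of nu

# d_TV = (1/2) int |p - q| dx,  d_H = ((1/2) int (sqrt(p) - sqrt(q))^2 dx)^(1/2).
d_tv = 0.5 * np.sum(np.abs(p - q)) * dx
d_h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

print(f"d_TV = {d_tv:.4f}, d_H = {d_h:.4f}")
print("Kraft's inequality holds:", d_h ** 2 <= d_tv <= np.sqrt(2) * d_h)
```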

3 Central error estimate

We will use the following ideas to bound the error of the Laplace approximation \(\mathscr {L}_{\mu ^y}\) for a given realization of the data \(y \in \mathbb {R}^d\). First, we will prove the fundamental estimate

$$\begin{aligned} {d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}}) \le \frac{1}{\widetilde{Z}} \int _{\mathbb {R}^d} \left|\exp \left( -\frac{1}{\varepsilon }I(x)\right) - \exp \left( -\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \right| \mathrm {d}x. \end{aligned}$$
(3.1)

If we have a radial upper bound \(f({\Vert x - \hat{x} \Vert }_\varSigma )\) for the integrand on the right hand side of (3.1), we can estimate

$$\begin{aligned} {d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}}) \le \frac{1}{\widetilde{Z}} \int _{\mathbb {R}^d}f({\Vert x - \hat{x} \Vert }_\varSigma ) \mathrm {d}x = \frac{\sqrt{\det \varSigma }}{\widetilde{Z}} \int _{\mathbb {R}^d}f({|u |}) \mathrm {d}u, \end{aligned}$$

where we applied a change of variable to a local parameter \(u := \varSigma ^{-\frac{1}{2}}(x - \hat{x})\). This integral can now be expressed as a one-dimensional integral using polar coordinates.

The integrand on the right hand side of (3.1) is very small and flat around \(\hat{x}\), since \(\frac{1}{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2\) is the second order Taylor expansion of I(x) around \(\hat{x}\), and it falls off as \({|x |} \rightarrow \infty \) because it is integrable. Its mass is thus concentrated at an intermediate distance from \(\hat{x}\). This can be seen, e.g., in Fig. 1. We exploit this structure by splitting up the integral in (3.1) and bounding the integrand on a \(\varSigma \)-norm ball

$$\begin{aligned} U(r_0) := \{ x \in {\mathbb {R}^d}: {\Vert x - \hat{x} \Vert }_\varSigma \le r_0 \} \end{aligned}$$

around the MAP estimate \(\hat{x}\) and on the remaining space \(\mathbb {R}^d {\setminus } U(r_0)\) separately. On \(U(r_0)\), we then control the integrand by imposing uniform bounds on the third order differentials of the log-likelihood and the log-prior density. Outside of \(U(r_0)\), we control it by imposing a quadratic lower bound on I.

We make the following assumptions on \(\varPhi \), R, I, \(\hat{x}\), and \(\varSigma \), which will be further discussed in Remark 3.9.

Assumption 3.1

We have \(\varPhi , R \in C^3({\mathbb {R}^d},\mathbb {R})\), I has a unique global minimizer \({\hat{x}} = {\hat{x}}(y) \in {\mathbb {R}^d}\) and \(D^2I({\hat{x}})\) is positive definite.

Assumption 3.2

There exists a constant \(K>0\) such that

$$\begin{aligned} \max \left\{ {\Vert D^3 \varPhi (x) \Vert }_\varSigma , {\Vert D^3 R(x) \Vert }_\varSigma \right\} \le K\end{aligned}$$

for all \(x \in \mathbb {R}^d\), where

$$\begin{aligned} {\Vert D^3\varPhi (x) \Vert }_\varSigma := \sup \Big \{ \big |D^3\varPhi (x)(h_1,h_2,h_3)\big |: {\Vert h_1 \Vert }_\varSigma , {\Vert h_2 \Vert }_\varSigma , {\Vert h_3 \Vert }_\varSigma \le 1 \Big \}. \end{aligned}$$

Assumption 3.3

There exists \(0 < \delta \le 1\) such that

$$\begin{aligned} I(x) \ge \frac{\delta }{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2 \quad \text {for all }x \in \mathbb {R}^d. \end{aligned}$$

Let \(\varGamma (z)\) denote the classical gamma function and \(\gamma (a,z)\) the lower incomplete gamma function. Then,

$$\begin{aligned} \varXi _d(t) := \frac{\gamma \left( \frac{d}{2}, \frac{t}{2}\right) }{\varGamma \left( \frac{d}{2}\right) } \quad \text {for all }t \ge 0, d > 0, \end{aligned}$$

describes the probability of a Euclidean ball in \(\mathbb {R}^d\) with radius \(\sqrt{t}\) around 0 under a standard Gaussian measure (see Lemma 3.12).
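Equivalently, \(\varXi _d(t)\) is the chi-squared distribution function with d degrees of freedom evaluated at t, which is convenient for numerical evaluation. The following sketch (using SciPy's regularized lower incomplete gamma function) checks this identity; the values of d and t are arbitrary.

```python
import numpy as np
from scipy.special import gammainc   # regularized lower incomplete gamma: gamma(a, z) / Gamma(a)
from scipy.stats import chi2

def Xi(d, t):
    """Xi_d(t): standard Gaussian probability of the Euclidean ball of radius sqrt(t) in R^d."""
    return gammainc(d / 2.0, t / 2.0)

d = 5
for t in [0.5, 2.0, 10.0]:
    assert np.isclose(Xi(d, t), chi2.cdf(t, df=d))
    print(f"Xi_{d}({t}) = {Xi(d, t):.6f}")
```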

The main result of this section is the following error estimate.

Theorem 3.4

Suppose that Assumptions 3.1–3.3 hold. Then we have

$$\begin{aligned} {d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}}) \le E_1(r_0) + E_2(r_0) \end{aligned}$$
(3.2)

for all \(r_0 \ge 0\), where

$$\begin{aligned} E_1(r_0)&:= c_d \varepsilon ^{-\frac{d}{2}} \int _0^{r_0} f(r) r^{d-1} \mathrm {d}r, \end{aligned}$$
(3.3)
$$\begin{aligned} E_2(r_0)&:= \delta ^{-\frac{d}{2}}\left( 1 - \varXi _d\bigg (\frac{\delta r_0^2}{\varepsilon }\bigg )\right) \end{aligned}$$
(3.4)

for all \(r_0 \ge 0\),

$$\begin{aligned} f(r) := \left[ \exp \left( \frac{(1 + \varepsilon )K}{6\varepsilon }r^3\right) - 1\right] \exp \left( -\frac{1}{2\varepsilon }r^2\right) \end{aligned}$$
(3.5)

for all \(r \ge 0\), and

$$\begin{aligned} c_d := \frac{2^{1-\frac{d}{2}}}{\varGamma \left( \frac{d}{2}\right) }. \end{aligned}$$

Remark 3.5

The two functions \(E_1\) and \(E_2\) are continuous and monotonic with the following asymptotic behavior. The first error term \(E_1(r_0)\) obeys

$$\begin{aligned} E_1(0) = 0, \quad \lim _{r_0 \rightarrow \infty } E_1(r_0) = \infty , \end{aligned}$$

whereas the second error term \(E_2(r_0)\) satisfies

$$\begin{aligned} E_2(0) = \delta ^{-\frac{d}{2}}, \quad \lim _{r_0 \rightarrow \infty } E_2(r_0) = 0. \end{aligned}$$

This can be seen as follows.

The function \(f(r)r^{d-1}\) is bounded on the interval [0, 1], so that the integral \(\int _0^{r_0} f(r)r^{d-1} \mathrm {d}r\) converges to 0 as \(r_0 \rightarrow 0\), and hence so does \(E_1(r_0)\). On the other hand, \(f(r)r^{d-1}\) converges to \(\infty \) as \(r \rightarrow \infty \), so that the integral \(\int _0^{r_0} f(r)r^{d-1} \mathrm {d}r\) and \(E_1(r_0)\) converge to \(\infty \) as \(r_0 \rightarrow \infty \). Since \(f(r)r^{d-1}\) is nonnegative for all \(r \ge 0\), \(E_1\) moreover increases monotonically. By definition of the lower incomplete gamma function, \(\varXi _d\) increases monotonically and \(\varXi _d(t) \in [0,1]\) for all \(t \ge 0\) and \(d > 0\). Moreover, \(\varXi _d(t) \rightarrow 0\) as \(t \rightarrow 0\) and \(\varXi _d(t) \rightarrow 1\) as \(t \rightarrow \infty \). Consequently, \(E_2(r_0)\) converges toward \(\delta ^{-d/2}\) as \(r_0 \rightarrow 0\), and toward 0 as \(r_0 \rightarrow \infty \). The asymptotic behavior of \(E_2\) is described more precisely in Lemma 4.3.
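Once K, \(\delta \), \(\varepsilon \), and d are specified, the two error terms can be evaluated numerically, for instance as in the following Python sketch, which computes \(E_1\) by one-dimensional quadrature and \(E_2\) via the regularized incomplete gamma function. The parameter values at the end are arbitrary placeholders used only to demonstrate the evaluation.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, gammainc

def E1(r0, K, eps, d):
    """Close range term (3.3): c_d * eps^(-d/2) * int_0^{r0} f(r) r^(d-1) dr."""
    c_d = 2.0 ** (1.0 - d / 2.0) / gamma(d / 2.0)
    f = lambda r: (np.expm1((1.0 + eps) * K * r ** 3 / (6.0 * eps))
                   * np.exp(-r ** 2 / (2.0 * eps)))
    integral, _ = quad(lambda r: f(r) * r ** (d - 1), 0.0, r0)
    return c_d * eps ** (-d / 2.0) * integral

def E2(r0, delta, eps, d):
    """Far range term (3.4): delta^(-d/2) * (1 - Xi_d(delta * r0^2 / eps))."""
    return delta ** (-d / 2.0) * (1.0 - gammainc(d / 2.0, delta * r0 ** 2 / (2.0 * eps)))

# Arbitrary illustrative values of K, delta, eps, and d (not from the paper).
K, delta, eps, d = 0.5, 0.9, 0.01, 3
for r0 in [0.1, 0.3, 0.5, 1.0]:
    bound = E1(r0, K, eps, d) + E2(r0, delta, eps, d)
    print(f"r0 = {r0:.1f}:  E1 + E2 = {bound:.3e}")
```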

The following three propositions formalize the ideas described in the beginning of this section and constitute the proof of Theorem 3.4.

Proposition 3.6

(Fundamental estimate) The Laplace approximation \({\mathscr {L}_{\mu ^y}}\) of \(\mu ^y\) satisfies

$$\begin{aligned} {d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}}) \le \int _{\mathbb {R}^d} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x), \end{aligned}$$

where \(R_2(x) := I(x) - \frac{1}{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2\) for all \(x \in \mathbb {R}^d\).

Proof

For a fixed \(\varepsilon > 0\) we can estimate

$$\begin{aligned} 2{d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}})&= \int _{\mathbb {R}^d} \left| \frac{1}{Z}\exp \left( -\frac{1}{\varepsilon }I(x)\right) - \frac{1}{\widetilde{Z}}\exp \left( -\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \right| \mathrm {d}x \\&= \frac{1}{\widetilde{Z}} \int _{\mathbb {R}^d} \left| \frac{\widetilde{Z}}{Z}\exp \left( -\frac{1}{\varepsilon }I(x)\right) - \exp \left( -\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \right| \mathrm {d}x \\&\le \frac{1}{\widetilde{Z}} \left| \frac{\widetilde{Z}}{Z} - 1 \right| \int _{\mathbb {R}^d} \exp \left( -\frac{1}{\varepsilon }I(x)\right) \mathrm {d}x \\&\quad + \frac{1}{\widetilde{Z}} \int _{\mathbb {R}^d} \left| \exp \left( -\frac{1}{\varepsilon }I(x)\right) - \exp \left( -\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \right| \mathrm {d}x \\&= \frac{1}{\widetilde{Z}} \left| \widetilde{Z} - Z \right| + \int _{\mathbb {R}^d} \left| \exp \left( -\frac{1}{\varepsilon }I(x) + \frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x). \\&= \frac{1}{\widetilde{Z}} \left| \widetilde{Z} - Z \right| + \int _{\mathbb {R}^d} \left| \exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x). \end{aligned}$$

Now, the estimate

$$\begin{aligned} \left| Z - \widetilde{Z}\right|&\le \int _{\mathbb {R}^d} \left| \exp \left( -\frac{1}{\varepsilon }I(x)\right) - \exp \left( -\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \right| \mathrm {d}x \\&= \widetilde{Z} \int _{\mathbb {R}^d} \left| \exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1\right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x) \end{aligned}$$

yields the proposition. \(\square \)

Proposition 3.7

(Close range estimate) Suppose that Assumption 3.2 holds. Then it follows that

$$\begin{aligned} \int _{U(r_0)} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x) \le c_d \varepsilon ^{-\frac{d}{2}} \int _0^{r_0} f(r) r^{d-1} \mathrm {d}r \end{aligned}$$

for all \(r_0 \ge 0\), where f and \(c_d\) are defined as in Theorem 3.4.

Proposition 3.8

(Far range estimate) Suppose that Assumption 3.3 holds. Then we have

$$\begin{aligned} \int _{\mathbb {R}^d {\setminus } U(r_0)} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x) \le \delta ^{-\frac{d}{2}}\left( 1 - \varXi _d\bigg (\frac{\delta r_0^2}{\varepsilon }\bigg )\right) \end{aligned}$$
(3.6)

for all \(r_0 \ge 0\).

The proof of Theorem 3.4 is now very short.

Proof of Theorem 3.4

By Proposition 3.6 we have

$$\begin{aligned} {d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}}) \le \int _{\mathbb {R}^d} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x). \end{aligned}$$

Now, splitting up this integral into integrals over \(U(r_0)\) and its complement and applying Propositions 3.7 and 3.8 proves the statement. \(\square \)

Remark 3.9

  1.

    Because of \(I(\hat{x}) = 0\) and the necessary optimality condition \(DI(\hat{x}) = 0\), the function \(R_2(x) = I(x) - \frac{1}{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2\) defined in Proposition 3.6 is precisely the remainder of the second order Taylor polynomial of I around \(\hat{x}\). In Proposition 3.7, Assumption 3.2 is used to control \(R_2\) near the MAP estimate by bounding the third order differential of I. In Proposition 3.8, in turn, Assumption 3.3 is used to control \(R_2\) at a distance from \(\hat{x}\) by bounding it from below by \(-\frac{1 - \delta }{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2\).

  2.

    The constant \(K\ge 0\) in Assumption 3.2 quantifies the non-Gaussianity of the likelihood and the prior distribution and can be arbitrarily large. Assumption 3.3 bounds the unnormalized log-posterior density from above by a multiple of the unnormalized log-density of the Laplace approximation, where the constant \(\delta > 0\) represents the scaling factor and can be arbitrarily small. This restricts our results to posterior distributions whose tail does not decay slower than that of a Gaussian distribution. Assumption 3.3 can, for example, be violated if a prior distribution with heavier than Gaussian tails, such as a Cauchy distribution, is chosen and the forward mapping is linear but singular. Our main interest lies in inverse problems with a posterior distribution that is not too different from a Gaussian distribution, since this is a setting in which the Laplace approximation can be expected to yield reasonable results.

  3.

    In case of a linear inverse problem and a Gaussian prior distribution, the Laplace approximation is exact, so that Assumptions 3.2 and 3.3 are trivially satisfied with \(K= 0\) and \(\delta = 1\). We will see in Sect. 5 that Assumptions 3.2 and 3.3 are satisfied for nonlinear inverse problems with \(\delta \) and \(K\) as given in Propositions 5.6 and 5.7 if the prior distribution is Gaussian and the nonlinearity of the forward mapping is small enough. In this case, the quadratic lower bound on I in Assumption 3.3 restricts the nonlinearity of the forward mapping to be small enough such that the tail of the posterior distribution does not decay slower than that of a Gaussian distribution.

  4.

    Note that neither in Sect. 3 nor in Sect. 4 do we make use of the Gaussianity of the noise. Therefore, the results of these sections remain valid for non-Gaussian noise as long as the log-likelihood satisfies Assumptions 3.2 and 3.3. In case of noise with a log-density \(-\nu \in C^3({\mathbb {R}^d})\), the negative log-likelihood takes the form \(\varPhi (x) = \nu (y - G(x))\) and we have \(I(x) = \nu (y - G(x)) + \varepsilon R(x) + c\). Consider for example standard multivariate Cauchy noise, where

    $$\begin{aligned} \nu (\eta ) = - \ln \left[ \frac{C}{\left( 1 + {|\eta |}^2\right) ^\frac{d + 1}{2}} \right] = \frac{d + 1}{2} \ln \left( 1 + {|\eta |}^2\right) - \ln C \end{aligned}$$

    for all \(\eta \in {\mathbb {R}^d}\). The derivatives of up to third order of \(s \mapsto \ln (1 + s)\), \(s \ge 0\), are bounded since they are continuous and converge to 0 as s tends to infinity. By the smoothness of \(x \mapsto {|x |}^2\), \(\nu \) is therefore in \(C^3({\mathbb {R}^d})\) and

    $$\begin{aligned} {\Vert D^3\nu (x) \Vert } := \sup _{{|h_1 |}, {|h_2 |}, {|h_3 |} \le 1} \left|D^3\nu (x)(h_1,h_2,h_3) \right| \end{aligned}$$

    is uniformly bounded. In case of a linear forward mapping, the uniform boundedness transfers to \({\Vert D^3 \varPhi (x) \Vert }\) and we can estimate

    $$\begin{aligned} {\Vert D^3\varPhi (x) \Vert }_\varSigma \le \left\Vert \varSigma ^{\frac{1}{2}} \right\Vert ^{3}{\Vert D^3\varPhi (x) \Vert }\end{aligned}$$

    for any symmetric positive definite matrix \(\varSigma \).

  5.

    We make Assumptions 3.2 and 3.3 globally, i.e., for all \(x \in {\mathbb {R}^d}\), for the sake of simplicity. For a given \(r_0 \ge 0\), Theorem 3.4 remains valid if Assumption 3.2 only holds for \({\Vert x - \hat{x} \Vert }_\varSigma \le r_0\) and if Assumption 3.3 only holds for \({\Vert x - \hat{x} \Vert }_\varSigma \ge r_0\). This allows for prior distributions which are not supported on the whole space \({\mathbb {R}^d}\), as long as the support of the prior contains the set \(U(r_0)\) and \(R \in C^3(U(r_0))\). In this case, R and I are allowed to take values in \({\overline{\mathbb {R}}} := \mathbb {R}\cup \{\infty \}\) and we follow the convention \(\exp (-\infty ) = 0\).

  6.

    The constant \(K\) in Assumption 3.2 can be replaced by a radial bound \(\rho ({\Vert x - \hat{x} \Vert }_\varSigma )\) with a monotonically increasing function \(\rho \). This way, an estimate of the form (3.2) can be obtained with f replaced by

    $$\begin{aligned} \widetilde{f}(r) = \left( \exp \left( \frac{1 + \varepsilon }{6\varepsilon }\rho (r)r^3\right) - 1\right) \exp \left( -\frac{1}{2\varepsilon }r^2\right) . \end{aligned}$$

Remark 3.10

Both the unnormalized posterior density \(\exp (-\frac{1}{\varepsilon }I(x))\) and the unnormalized Gaussian density \(\exp (-\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2)\) attain their maximum 1 at \(\hat{x}\). The densities of \(\mu ^y\) and \({\mathscr {L}_{\mu ^y}}\) themselves, however, take the values 1/Z and \(1/\widetilde{Z}\) at \(\hat{x}\) due to the different normalization, see Fig. 1. For this reason, the probability of small balls around \(\hat{x}\) under \(\mu ^y\) and \({\mathscr {L}_{\mu ^y}}\) differs asymptotically by a factor of \(\widetilde{Z}/Z\). This has several consequences in case the normalization constants Z and \(\widetilde{Z}\) differ considerably.

On the one hand, credible regions around \(\hat{x}\) may have considerably different size under the posterior distribution and its Laplace approximation. On the other hand, the integrand

$$\begin{aligned} \frac{1}{2} \left|\frac{\exp (-\frac{1}{\varepsilon }I(x))}{Z} - \frac{\exp (-\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2)}{\widetilde{Z}} \right| \end{aligned}$$

of the total variation distance \({d_\text {TV}}(\mu ^y,{\mathscr {L}_{\mu ^y}})\) may, unlike the integrand of the fundamental estimate (3.1), have a significant amount of mass around \(\hat{x}\), see Fig. 1. This means that a significant portion of the error when approximating the probability of an event under \(\mu ^y\) by that under \({\mathscr {L}_{\mu ^y}}\) may be due to the difference in their densities near the MAP estimate \(\hat{x}\). So although the Laplace approximation is defined by the local properties of the posterior distribution at the MAP estimate, it is not necessarily a good local approximation around it.

A large difference in the normalization constants Z and \(\widetilde{Z}\) as mentioned above reflects that the log-posterior density cannot be approximated well globally by its second order Taylor polynomial around \(\hat{x}\). In the proof of Proposition 3.6, we saw that the difference in normalization is in fact bounded by the total variation of the unnormalized densities. The value of Proposition 3.6 lies in providing an estimate for the total variation error that only involves unnormalized densities.

Fig. 1 The probability densities of a posterior distribution \(\mu ^y\) and its Laplace approximation \({\mathscr {L}_{\mu ^y}}\) (left), as well as the integrands of the total variation distance between \(\mu ^y\) and \({\mathscr {L}_{\mu ^y}}\) and of the fundamental estimate (3.1) (right)

In the following subsections, we present the proofs of our close and far range estimates and characterize the optimal choice of \(r_0\).

3.1 Proof of Proposition 3.7

We consider the close range integral

$$\begin{aligned} \int _{U(r_0)} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x) \end{aligned}$$

over the \(\varSigma \)-norm ball with radius \(r_0 \ge 0\). The proof of our close range estimate is based upon the following estimate for the remainder term \(R_2(x)\).

Lemma 3.11

If Assumption 3.2 holds, then we have

$$\begin{aligned} {|R_2(x) |} \le \frac{1 + \varepsilon }{6} K {\Vert x-\hat{x} \Vert }_\varSigma ^3 \quad \text {for all}~x \in {\mathbb {R}^d}. \end{aligned}$$
(3.7)

Proof

We set \(h := x - \hat{x}\) and write the remainder of the second order Taylor polynomial of \(\varPhi \) in mean-value form as

$$\begin{aligned} R_{2,\varPhi }(x) := \varPhi (x) - \varPhi (\hat{x}) - D\varPhi (\hat{x})(h) - \frac{1}{2} D^2\varPhi (\hat{x})(h,h) = \frac{1}{6} D^3\varPhi (z)(h,h,h) \end{aligned}$$

for some \(z \in \hat{x}+ [0,1]h\). Since \(\hat{x}+ [0,1]h \subset U({\Vert x - \hat{x} \Vert }_\varSigma )\), we can now use the multilinearity of \(D^3\varPhi (z)\) to express \(R_{2,\varPhi }\) as

$$\begin{aligned} R_{2,\varPhi }(x) = \frac{1}{6} D^3\varPhi (z)\left( \frac{h}{{\Vert h \Vert }_\varSigma },\frac{h}{{\Vert h \Vert }_\varSigma },\frac{h}{{\Vert h \Vert }_\varSigma }\right) {\Vert h \Vert }_\varSigma ^3 \end{aligned}$$

for some \(z \in U({\Vert x - \hat{x} \Vert }_\varSigma )\), and estimate

$$\begin{aligned} {|R_{2,\varPhi }(x) |} \le \frac{1}{6} {\Vert D^3\varPhi (z) \Vert }_\varSigma {\Vert h \Vert }_\varSigma ^3 \le \frac{1}{6} K {\Vert h \Vert }_\varSigma ^3 \end{aligned}$$

using Assumption 3.2. By proceeding similarly for R, we now obtain

$$\begin{aligned} {|R_2(x) |} \le {|R_{2,\varPhi }(x) |} + \varepsilon {|R_{2,R}(x) |} \le \frac{1}{6} K {\Vert h \Vert }_\varSigma ^3 + \frac{\varepsilon }{6} K {\Vert h \Vert }_\varSigma ^3. \end{aligned}$$

\(\square \)

Now, we can prove our close range estimate.

Proof of Proposition 3.7

By Lemma 3.11 and (2.3), we have

$$\begin{aligned} \int _{U(r_0)} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x)&\le \frac{1}{\widetilde{Z}} \int _{U(r_0)} f({\Vert x - \hat{x} \Vert }_\varSigma ) \mathrm {d}x \\&= \frac{1}{\widetilde{Z}} \sqrt{\det \varSigma } \int _{\{u \in {\mathbb {R}^d}: {|u |} \le r_0\}} f({|u |}) \mathrm {d}u \\&= \varepsilon ^{-\frac{d}{2}}(2\pi )^{-\frac{d}{2}} \cdot d \kappa _d \int _0^{r_0} f(r) r^{d-1} \mathrm {d}r, \end{aligned}$$

since \({|u |} = {\Vert x - \hat{x} \Vert }_\varSigma \). Here

$$\begin{aligned} \kappa _d = \frac{\pi ^{d/2}}{\varGamma (d/2 + 1)} \end{aligned}$$

denotes the volume of the d-dimensional Euclidean unit ball (\(d\kappa _d\) is its surface area). Using the fundamental recurrence \(\varGamma (z+1) = z\varGamma (z)\), we can write

$$\begin{aligned} d\kappa _d = \frac{2\pi ^{d/2}}{\varGamma (d/2)}. \end{aligned}$$

\(\square \)

3.2 Proof of Proposition 3.8

Now, we consider the integral

$$\begin{aligned} \int _{{\mathbb {R}^d}{\setminus } U(r_0)} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x) \end{aligned}$$

over the space outside of a \(\varSigma \)-norm ball with radius \(r_0 \ge 0\). In the proof of our far range estimate, the following expression is used to describe the probability of \({\mathbb {R}^d}{\setminus } U(r_0)\) under a rescaled version of \({\mathscr {L}_{\mu ^y}}\). Let \(\varGamma (a,z)\) denote the upper incomplete gamma function.

Lemma 3.12

Let \(\nu = \mathscr {N}(\hat{x}, \delta ^{-1}\varepsilon \varSigma )\) with \(\delta > 0\). Then,

$$\begin{aligned} \nu \left( {\mathbb {R}^d}{\setminus } U(r_0)\right) = \frac{1}{\varGamma (d/2)}\varGamma \bigg (\frac{d}{2},\frac{\delta r_0^2}{2\varepsilon }\bigg ) \quad \text {for all }r_0 \ge 0. \end{aligned}$$

Proof

We compute the tail integral explicitly using a local parameter and polar coordinates. This yields

$$\begin{aligned} \nu (\{{\Vert x - \hat{x} \Vert }_\varSigma \ge r_0\})&= \frac{\delta ^{d/2}}{\varepsilon ^{d/2}(2\pi )^{d/2}\sqrt{\det \varSigma }} \int _{\{{\Vert x - \hat{x} \Vert }_\varSigma \ge r_0\}} \exp \left( -\frac{\delta }{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \mathrm {d}x \\&= \frac{\delta ^{d/2}}{\varepsilon ^{d/2}(2\pi )^{d/2}} \int _{\{{|u |} \ge r_0\}} \exp \left( -\frac{\delta }{2\varepsilon }{|u |}^2\right) \mathrm {d}u \\&= \frac{\delta ^{d/2}}{\varepsilon ^{d/2}(2\pi )^{d/2}} d\kappa _d \int _{r_0}^\infty \exp \left( -\frac{\delta }{2\varepsilon }r^2\right) r^{d-1} \mathrm {d}r. \end{aligned}$$

We can express this integral in terms of the upper incomplete gamma function by substituting \(s = \delta r^2/2\varepsilon \) (note that \(r'(s) = 2^{-1/2}\varepsilon ^{1/2}\delta ^{-1/2}s^{-1/2}\)) as

$$\begin{aligned} \int _{r_0}^\infty e^{-\frac{\delta }{2\varepsilon }r^2} r^{d-1} \mathrm {d}r = \int _{\frac{\delta r_0^2}{2\varepsilon }}^\infty e^{-s} 2^{\frac{d}{2}-1}\varepsilon ^{\frac{d}{2}}\delta ^{-\frac{d}{2}} s^{\frac{d}{2} - 1} \mathrm {d}s = 2^{\frac{d}{2} - 1}\varepsilon ^{\frac{d}{2}}\delta ^{-\frac{d}{2}} \varGamma \left( \frac{d}{2}, \frac{\delta r_0^2}{2\varepsilon }\right) . \end{aligned}$$

This leads to

$$\begin{aligned} \nu (\{{\Vert x - \hat{x} \Vert }_\varSigma \ge r_0\}) = \frac{d}{2}\frac{1}{\varGamma (d/2 + 1)}\varGamma \left( \frac{d}{2}, \frac{\delta r_0^2}{2\varepsilon }\right) . \end{aligned}$$

Now, using the fundamental recurrence \(\varGamma (z + 1) = z\varGamma (z)\) completes the proof. \(\square \)

Now, we can prove our far range estimate.

Proof of Proposition 3.8

Let \(x \in \mathbb {R}^d {\setminus } U(r_0)\). We distinguish between two cases. First, consider the case that \(R_2(x) \ge 0\). For \(t \ge 0\), the estimate \({|e^{-t} - 1 |} \le 1\) holds. This implies

$$\begin{aligned} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| \le 1. \end{aligned}$$

Next, consider the case that \(R_2(x) < 0\). By Assumption 3.3, we have

$$\begin{aligned} R_2(x) = I(x) - \frac{1}{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2 \ge \frac{\delta - 1}{2}{\Vert x - \hat{x} \Vert }_\varSigma ^2 \end{aligned}$$

for all \(x \in \mathbb {R}^d\). For \(t \le 0\) we have \({|e^{-t} - 1 |} = e^{-t} - 1 < e^{-t}\), and thus

$$\begin{aligned} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| \le \exp \left( -\frac{1}{\varepsilon }R_2(x)\right) \le \exp \left( -\frac{\delta - 1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) . \end{aligned}$$

Together, this shows that

$$\begin{aligned} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| \le \exp \left( -\frac{\delta - 1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \end{aligned}$$

for all \(x \in \mathbb {R}^d\). Now it follows that

$$\begin{aligned} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| \exp \left( -\frac{1}{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \le \exp \left( -\frac{\delta }{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \end{aligned}$$

for all \(x \in \mathbb {R}^d\). This yields

$$\begin{aligned}&\int _{\mathbb {R}^d {\setminus } U(r_0)} \left|\exp \left( -\frac{1}{\varepsilon }R_2(x)\right) - 1 \right| {\mathscr {L}_{\mu ^y}}(\mathrm {d}x) \\&\le \frac{1}{(2\pi \varepsilon )^{d/2}\sqrt{\det \varSigma }} \int _{\mathbb {R}^d {\setminus } U(r_0)} \exp \left( -\frac{\delta }{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \mathrm {d}x \end{aligned}$$

Now, the proposition follows from

$$\begin{aligned}&\frac{\delta ^{d/2}}{(2\pi \varepsilon )^{d/2}\sqrt{\det \varSigma }} \int _{\mathbb {R}^d {\setminus } U(r_0)} \exp \left( -\frac{\delta }{2\varepsilon }{\Vert x - \hat{x} \Vert }_\varSigma ^2\right) \mathrm {d}x \\&= \frac{1}{\varGamma (d/2)}\varGamma \bigg (\frac{d}{2}, \frac{\delta r_0^2}{2\varepsilon }\bigg ) = 1 - \varXi _d\bigg (\frac{\delta r_0^2}{\varepsilon }\bigg ), \end{aligned}$$

which in turn holds by Lemma 3.12 and the identity \(\varGamma (a,z) = \varGamma (a) - \gamma (a,z)\). \(\square \)

3.3 Optimal choice of the parameter

We have the following necessary optimality condition for the parameter \(r_0\) in Theorem 3.4.

Proposition 3.13

The optimal choice of \(r_0\) in the error bound (3.2) is either 0 or satisfies

$$\begin{aligned} \exp \left( \frac{(1 + \varepsilon )K}{6\varepsilon } r_0^3\right) - 1 - \exp \left( \frac{1 - \delta }{2\varepsilon }r_0^2\right) = 0. \end{aligned}$$

Proof

The terms \(E_1\) and \(E_2\) are differentiable on \([0,\infty )\). Clearly, the optimal \(r_0\) is either 0 or satisfies the identity

$$\begin{aligned} E'(r_0) = E_1'(r_0) + E_2'(r_0) = 0. \end{aligned}$$
(3.8)

We have that

$$\begin{aligned} E_1'(r_0) = c_d \varepsilon ^{-\frac{d}{2}}f(r_0) r_0^{d-1} = c_d \varepsilon ^{-\frac{d}{2}}\left( \exp \left( \frac{(1 + \varepsilon )K}{6\varepsilon } r_0^3\right) - 1\right) \exp \left( -\frac{1}{2\varepsilon } r_0^2\right) r_0^{d-1} \end{aligned}$$

and

$$\begin{aligned} E_2'(r_0) = -\frac{2\delta ^{1-\frac{d}{2}} r_0}{\varepsilon } \varXi _d'\left( \frac{\delta r_0^2}{\varepsilon }\right) = -\frac{2}{2^{d/2}\varGamma (d/2)} \cdot \frac{r_0^{d-1}}{\varepsilon ^{d/2}} \exp \left( -\frac{\delta }{2\varepsilon }r_0^2\right) \end{aligned}$$

Identity (3.8) now corresponds to

$$\begin{aligned} \exp \left( \frac{(1 + \varepsilon )K}{6\varepsilon } r_0^3\right) - 1 - \exp \left( \frac{1 - \delta }{2\varepsilon }r_0^2\right) = 0 \end{aligned}$$

which yields the result. \(\square \)

Remark 3.14

The right hand side of the far range estimate (3.6) can be written as

$$\begin{aligned} c_d \varepsilon ^{-\frac{d}{2}} \int _{r_0}^\infty \exp \left( -\frac{\delta }{2\varepsilon }r^2\right) r^{d-1} \mathrm {d}r, \end{aligned}$$

where \(c_d\) is defined as in Theorem 3.4. The optimal choice of \(r_0\) is therefore one for which the integrands \(f(r)r^{d-1}\) and \(\exp (-\delta r^2/2\varepsilon )r^{d-1}\) of the close and the far range estimate take the same value at \(r = r_0\).
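In practice, the balancing condition of Proposition 3.13 is a scalar root-finding problem and can be solved numerically, for instance as in the following sketch; the parameter values are again arbitrary placeholders, not taken from the paper.

```python
import numpy as np
from scipy.optimize import brentq

def optimal_r0(K, delta, eps):
    """Solve exp((1+eps)*K*r^3/(6*eps)) - 1 - exp((1-delta)*r^2/(2*eps)) = 0 for r > 0
    (the stationarity condition of Proposition 3.13)."""
    g = lambda r: (np.expm1((1.0 + eps) * K * r ** 3 / (6.0 * eps))
                   - np.exp((1.0 - delta) * r ** 2 / (2.0 * eps)))
    # g(0) = -1; enlarge the bracket until the cubic term makes g positive.
    hi = 1e-3
    while g(hi) < 0.0:
        hi *= 2.0
    return brentq(g, 1e-12, hi)

# Arbitrary illustrative parameters (not from the paper).
print("optimal r0 =", optimal_r0(K=0.5, delta=0.9, eps=0.01))
```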

4 Explicit error estimate

Here, we present a non-asymptotic error estimate in terms of \(K\), \(\delta \), \(\varepsilon \), and the problem dimension d. While Theorem 3.4 constitutes a non-asymptotic error estimate and is the sharpest of our three main results, it is not immediately clear how the non-Gaussianity of the likelihood and the prior distribution, as quantified by the constant \(K\) in Assumption 3.2, the noise level, and the problem dimension affect the error bound. The purpose of the following theorem is to make this influence more explicit.

Theorem 4.1

Suppose that I, \(\varPhi \), and R satisfy Assumptions 3.1–3.3. If K, \(\delta \), \(\varepsilon \), and d satisfy

$$\begin{aligned} \frac{6\delta ^\frac{3}{2}}{(1 + \varepsilon )\varepsilon ^\frac{1}{2} K}&\ge \max \left\{ 8d^\frac{3}{2}, \left( 8 \ln \left( \frac{2}{C(1 + \varepsilon )\varepsilon ^\frac{1}{2} K\delta ^\frac{d}{2}}\cdot \frac{\varGamma \big (\frac{d}{2}\big )}{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}\right) \right) ^\frac{3}{2}\right\} \end{aligned}$$
(4.1)

with \(C := \sqrt{2}e/3\), then

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y}, \mathscr {L}_{\mu ^{y}}\right) \le 2C(1 + \varepsilon )\varepsilon ^\frac{1}{2} K\frac{\varGamma \left( \frac{d}{2} + \frac{3}{2}\right) }{\varGamma \left( \frac{d}{2}\right) }. \end{aligned}$$
(4.2)

Remark 4.2

Condition (4.1) can be interpreted in the following way. For given d and \(\delta \), it imposes an upper bound on the noise level \(\varepsilon ^{1/2}\) and K, whereas for given \(\delta \), K, and \(\varepsilon \), it imposes an upper bound on the dimension d. As \(d \rightarrow \infty \), the ratio \(\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )/\varGamma \big (\frac{d}{2}\big )\) grows in the order of \(d^{3/2}\), see [28, pp. 67–68].
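A direct transcription of condition (4.1) and bound (4.2) into code makes the interplay of these quantities easy to explore. The following sketch uses arbitrary placeholder values of K, \(\delta \), \(\varepsilon \), and d, chosen so that the logarithm in (4.1) is positive; the gamma-function ratio is evaluated via gammaln for numerical stability.

```python
import numpy as np
from scipy.special import gammaln

def explicit_bound(K, delta, eps, d):
    """Check condition (4.1); if it holds, return the right hand side of (4.2)."""
    C = np.sqrt(2.0) * np.e / 3.0
    gamma_ratio = np.exp(gammaln(d / 2.0 + 1.5) - gammaln(d / 2.0))   # Gamma(d/2+3/2)/Gamma(d/2)
    a = (1.0 + eps) * np.sqrt(eps) * K
    lhs = 6.0 * delta ** 1.5 / a
    log_arg = 2.0 / (C * a * delta ** (d / 2.0) * gamma_ratio)        # assumed > 1 here
    rhs = max(8.0 * d ** 1.5, (8.0 * np.log(log_arg)) ** 1.5)
    if lhs < rhs:
        raise ValueError("condition (4.1) is violated for these parameters")
    return 2.0 * C * a * gamma_ratio

# Arbitrary illustrative parameters (not from the paper).
print("TV bound (4.2):", explicit_bound(K=0.5, delta=0.9, eps=1e-4, d=3))
```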

In order to prove this theorem, we introduce an exponential tail estimate for the Laplace approximation, which is a modified version of [25, Prop. 4].

Lemma 4.3

Let \(\nu = \mathscr {N}(\hat{x}, \delta ^{-1}\varepsilon \varSigma )\) with \(\delta > 0\). Then,

$$\begin{aligned} \nu (\mathbb {R}^d {\setminus } U(r)) = 1 - \varXi _d\bigg (\frac{\delta r^2}{\varepsilon }\bigg ) \le 2\exp \left( -\frac{\delta r^2}{8\varepsilon }\right) \end{aligned}$$

holds for all \(r \ge 2(d\varepsilon /\delta )^{1/2}\).

Proof

By Lemma 3.12, we have

$$\begin{aligned} \nu ({\mathbb {R}^d}{\setminus } U(r_0)) = \frac{1}{\varGamma (d/2)}\varGamma \bigg (\frac{d}{2},\frac{\delta r_0^2}{2\varepsilon }\bigg ) = 1 - \varXi _d\bigg (\frac{\delta r_0^2}{\varepsilon }\bigg ). \end{aligned}$$

Let \(x \sim \mathscr {N}(\hat{x}, \delta ^{-1}\varepsilon \varSigma )\). Then \(u := \varSigma ^{-1/2}(x - \hat{x}) \sim \mathscr {N}(0,\delta ^{-1}\varepsilon I_d)\). The concentration inequality for Gaussian measures yields

$$\begin{aligned} \mathbb {P}({|u |}> s + \mathbb {E}[{|u |}]) \le \mathbb {P}\big (\big |{|u |} - \mathbb {E}[{|u |}]\big |> s\big ) \le 2 \exp \left( -\frac{s^2}{2\sigma ^2} \right) \quad \text {for all }s > 0, \end{aligned}$$

where \(\sigma ^2 := \sup _{{|z |} \le 1} \mathbb {E}[{|{(z,u)} |}^2]\), see [13, Chapter 3]. Now,

$$\begin{aligned} \sigma ^2 = \lambda _\text {max}(\delta ^{-1}\varepsilon I_d) = \delta ^{-1}\varepsilon , \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}[{|u |}] \le \mathbb {E}[{|u |}^2]^\frac{1}{2} = \delta ^{-\frac{1}{2}}\varepsilon ^\frac{1}{2}{{\,\mathrm{tr}\,}}(I_d)^\frac{1}{2} = \delta ^{-\frac{1}{2}}\varepsilon ^\frac{1}{2} d^\frac{1}{2}. \end{aligned}$$

Applying this with \(s = r/2\) and using that \(\frac{r}{2} \ge \delta ^{-\frac{1}{2}}\varepsilon ^\frac{1}{2} d^\frac{1}{2} \ge \mathbb {E}[{|u |}]\), we obtain

$$\begin{aligned} \nu (\mathbb {R}^d {\setminus } U(r)) = \mathbb {P}({|u |}> r) \le \mathbb {P}\left( {|u |} > \frac{r}{2} + \mathbb {E}[{|u |}]\right) \le 2\exp \left( -\frac{\delta r^2}{8\varepsilon }\right) . \end{aligned}$$

\(\square \)
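The tail bound of Lemma 4.3 is easy to check numerically for concrete values of d, \(\delta \), and \(\varepsilon \): the following sketch compares the exact tail probability \(1 - \varXi _d(\delta r^2/\varepsilon )\) with the bound \(2\exp (-\delta r^2/8\varepsilon )\) for radii above the threshold \(2(d\varepsilon /\delta )^{1/2}\). The parameter values are arbitrary.

```python
import numpy as np
from scipy.special import gammainc

# Arbitrary illustrative parameters (not from the paper).
d, delta, eps = 10, 0.8, 0.05
r_min = 2.0 * np.sqrt(d * eps / delta)          # validity threshold of Lemma 4.3

for r in r_min * np.array([1.0, 1.5, 2.0, 3.0]):
    tail = 1.0 - gammainc(d / 2.0, delta * r ** 2 / (2.0 * eps))   # 1 - Xi_d(delta r^2 / eps)
    bound = 2.0 * np.exp(-delta * r ** 2 / (8.0 * eps))
    print(f"r = {r:.3f}:  tail = {tail:.3e}  <=  bound = {bound:.3e}: {tail <= bound}")
```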

Proof of Theorem 4.1

We choose

$$\begin{aligned} r_0 := \left( \frac{6\varepsilon }{(1 + \varepsilon )K}\right) ^\frac{1}{3}. \end{aligned}$$

According to Theorem 3.4, we then have

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y}, \mathscr {L}_{\mu ^{y}}\right) \le E_1(r_0;K,\varepsilon ,d) + E_2(r_0;\delta ,\varepsilon ,d). \end{aligned}$$

For all \(t \ge 0\), the exponential function satisfies the estimate

$$\begin{aligned} \exp (t) - 1 = \left( 1 - \exp (-t)\right) \exp (t) \le t\exp (t). \end{aligned}$$

Therefore, we have

$$\begin{aligned} \begin{aligned} E_1(r_0;K,\varepsilon ,d)&= \frac{2}{\varGamma \left( \frac{d}{2}\right) } (2\varepsilon )^{-\frac{d}{2}} \int _0^{r_0} \left( \exp \left( \frac{(1 + \varepsilon )K}{6\varepsilon }r^3\right) - 1\right) \exp \left( -\frac{1}{2\varepsilon }r^2\right) r^{d-1} \mathrm {d}r \\&\le \frac{2}{\varGamma \left( \frac{d}{2}\right) } (2\varepsilon )^{-\frac{d}{2} - 1} \frac{(1 + \varepsilon )K}{3} \int _0^{r_0} \exp \left( \frac{(1 + \varepsilon )K}{6\varepsilon } r^3 - \frac{1}{2\varepsilon }r^2\right) r^{d+2} \mathrm {d}r. \end{aligned} \end{aligned}$$

By the choice of \(r_0\), we have

$$\begin{aligned} \frac{(1 + \varepsilon )K}{6\varepsilon } r^3 \le 1 \quad \text {for all }r \in [0,r_0], \end{aligned}$$

so that the integral is bounded by

$$\begin{aligned} e \int _0^{r_0} \exp \left( -\frac{1}{2\varepsilon }r^2\right) r^{d+2} \mathrm {d}r. \end{aligned}$$

By substituting \(s = r^2/2\varepsilon \), we can in turn express this integral as

$$\begin{aligned} \frac{1}{2}(2\varepsilon )^{\frac{d}{2} + \frac{3}{2}} \int _0^{\frac{r_0^2}{2\varepsilon }} e^{-s}s^{\frac{d}{2} + \frac{1}{2}} \mathrm {d}s = \frac{1}{2}(2\varepsilon )^{\frac{d}{2} + \frac{3}{2}} \gamma \left( \frac{d}{2} + \frac{3}{2}, \frac{r_0^2}{2\varepsilon }\right) . \end{aligned}$$

Now, we use the inequality \(\gamma (a,z) \le \varGamma (a)\) to obtain that

$$\begin{aligned} E_1(r_0;K,\varepsilon ,d) \le \frac{2^\frac{1}{2} e}{3} (1 + \varepsilon ) \varepsilon ^\frac{1}{2} K\frac{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}{\varGamma \big (\frac{d}{2}\big )}. \end{aligned}$$

By condition (4.1), we have

$$\begin{aligned} r_0 = \left( \frac{6\varepsilon }{(1 + \varepsilon )K}\right) ^\frac{1}{3} \ge 2\left( \frac{d\varepsilon }{\delta }\right) ^\frac{1}{2}. \end{aligned}$$

Thus, we may apply Lemma 4.3, which yields

$$\begin{aligned} \begin{aligned} E_2(r_0;\delta ,\varepsilon ,d)&= \delta ^{-\frac{d}{2}} \left( 1 - \varXi _d\bigg (\frac{\delta r_0^2}{\varepsilon }\bigg )\right) \le 2\delta ^{-\frac{d}{2}} \exp \left( -\frac{\delta r_0^2}{8\varepsilon }\right) \\&= 2\delta ^{-\frac{d}{2}} \exp \left( -\frac{1}{8}\left( \frac{6\delta ^\frac{3}{2}}{(1 + \varepsilon )\varepsilon ^\frac{1}{2} K}\right) ^\frac{2}{3}\right) \le C(1 + \varepsilon )\varepsilon ^\frac{1}{2} K\frac{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}{\varGamma \big (\frac{d}{2}\big )} \end{aligned} \end{aligned}$$

by condition (4.1). Now, we obtain by summing up that

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y}, \mathscr {L}_{\mu ^{y}}\right) \le 2C (1 + \varepsilon )\varepsilon ^\frac{1}{2} K\frac{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}{\varGamma \big (\frac{d}{2}\big )}. \end{aligned}$$

\(\square \)

4.1 Asymptotic behavior for fixed and increasing problem dimension

Now, we describe the convergence of the Laplace approximation for a sequence of nonlinear problems that satisfy Assumptions 3.1–3.3 with varying bounds \(\{K_n\}_{n\in \mathbb {N}}\) and \(\{\delta _n\}_{n\in \mathbb {N}}\), respectively, and varying squared noise levels \(\{\varepsilon _n\}_{n\in \mathbb {N}}\), both in case of a fixed and an increasing problem dimension. We denote the data by \(y_n\), the negative log-prior density by \(R_n\), and the scaled negative log-likelihood by \(\varPhi _n\), and define \(I_n\) from \(\varPhi _n\), \(\varepsilon _n\), and \(R_n\) as in Sect. 2.

First, we consider the case that the problem dimension d remains constant.

Corollary 4.4

(Fixed problem dimension) Suppose that \(I_n\), \(\varPhi _n\), and \(R_n\) satisfy Assumptions 3.1–3.3. If \(\varepsilon _n^{1/2}K_n \rightarrow 0\) and if there exist \({\underline{\delta }} > 0\) and \(N_0 \in \mathbb {N}\) such that

$$\begin{aligned} \delta _n \ge {\underline{\delta }} \quad \text {and}\quad \varepsilon _n \le 1 \end{aligned}$$

for all \(n \ge N_0\), then there exist \(C = C(d) > 0\) and \(N_1 \ge N_0\) such that

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y_n}, \mathscr {L}_{\mu ^{y_n}}\right) \le C \varepsilon _n^\frac{1}{2}K_n \end{aligned}$$

for all \(n \ge N_1\).

Proof

Since \(\{\delta _n\}_{n\in \mathbb {N}}\) is bounded from below and \(\{\varepsilon _n\}_{n\in \mathbb {N}}\) is bounded from above, the left hand side of (4.1) is bounded from below by \(C_1/(\varepsilon _n^{1/2} K_n)\) for some \(C_1 > 0\). On the other hand, the right hand side of (4.1) is bounded from above by \((8 \ln (C_2/(\varepsilon _n^{1/2} K_n)))^{3/2}\) for large enough n and some \(C_2 > 0\), since \(\{\delta _n\}_{n\in \mathbb {N}}\) is bounded from below and \(1 + \varepsilon _n \ge 1\). Consequently, there exists \(N_1 \ge N_0\) such that condition (4.1) holds for all \(n \ge N_1\) by the convergence \(\varepsilon _n^{1/2} K_n \rightarrow 0\) and since \(\lim _{t \rightarrow \infty } t^{-2/3} \ln t = 0\). Now, Theorem 4.1 yields the proposition. \(\square \)

Remark 4.5

Corollary 4.4 covers two cases of particular interest: That of \(K_n \rightarrow 0\) while \(\varepsilon _n = \varepsilon \) remains constant, which yields a rate of \(K_n\), and that of \(\varepsilon _n \rightarrow 0\) while \(K_n = K\) remains constant, which yields a rate of \(\varepsilon _n^{1/2}\). The former case can, for example, occur if the sequence of forward mappings \(G_n\) converges pointwise towards a linear mapping, see Sect. 5. The convergence rate in the latter case, i.e., in the small noise limit, agrees with the rate established in [25, Theorem 2] if we set \(\varepsilon _n = \frac{1}{n}\).

Now, we consider the case of an increasing problem dimension \(d \rightarrow \infty \). To this end, we index \(K_d\), \(\delta _d\), \(\varepsilon _d\), and \(R_d\) by \(d \in \mathbb {N}\).

Corollary 4.6

(Increasing problem dimension) Suppose that \(I_d\), \(\varPhi _d\), and \(R_d\) satisfy Assumptions 3.1–3.3 and that \(\varepsilon _d^{1/2}K_d \rightarrow 0\). If there exists \(N_0 \in \mathbb {N}\) such that \(\delta _d \le e^{-1/2}\), \(\varepsilon _d \le 1\), and

$$\begin{aligned} \frac{3}{\varepsilon _d^\frac{1}{2}K_d} \ge \left( \frac{8}{\delta _d}\ln \frac{1}{\delta _d}\right) ^\frac{3}{2} d^\frac{3}{2} \end{aligned}$$
(4.3)

for all \(d \ge N_0\), then for every \(C > 2\sqrt{2}e/3\), there exists \(N_1 \ge N_0\) such that

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y_d}, \mathscr {L}_{\mu ^{y_d}}\right) \le C \varepsilon _d^\frac{1}{2}K_d d^\frac{3}{2} \end{aligned}$$

for all \(d \ge N_1\).

Proof

We can write condition (4.1) as

$$\begin{aligned} \left( \frac{6}{(1 + \varepsilon _d)\varepsilon _d^\frac{1}{2}K_d}\right) ^\frac{2}{3}&\ge \frac{4d}{\delta _d} \quad \text {and} \end{aligned}$$
(4.4)
$$\begin{aligned} \left( \frac{6}{(1 + \varepsilon _d)\varepsilon _d^\frac{1}{2}K_d}\right) ^\frac{2}{3}&\ge \frac{8}{\delta _d} \ln \left( \frac{2}{C(1 + \varepsilon _d)\varepsilon _d^\frac{1}{2}K_d} \cdot \frac{\varGamma \big (\frac{d}{2}\big )}{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}\right) + \frac{4d}{\delta _d} \ln \frac{1}{\delta _d}. \end{aligned}$$
(4.5)

By [28, pp. 67–68], we have

$$\begin{aligned} \lim _{d \rightarrow \infty } \frac{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}{\varGamma \big (\frac{d}{2}\big )}\left( \frac{d}{2}\right) ^{-\frac{3}{2}} = 1, \end{aligned}$$
(4.6)

so that the first term on the right hand side of (4.5) is bounded from above by

$$\begin{aligned} 8e^\frac{1}{2} \ln \frac{C_1}{\varepsilon _d^\frac{1}{2}K_d d^\frac{3}{2}} \le 8e^\frac{1}{2} \ln \frac{C_1}{\varepsilon _d^\frac{1}{2}K_d} \end{aligned}$$

for some \(C_1 > 0\), which in turn is bounded from above by

$$\begin{aligned} \frac{1}{2}\left( \frac{3}{\varepsilon _d^{1/2}K_d}\right) ^\frac{2}{3} \le \frac{1}{2}\left( \frac{6}{(1 + \varepsilon _d)\varepsilon _d^{1/2}K_d}\right) ^\frac{2}{3} \end{aligned}$$
(4.7)

for large enough d, due to the convergence \(\varepsilon _d^{1/2}K_d \rightarrow 0\), the boundedness from above of \(\{\varepsilon _d\}_{d\in \mathbb {N}}\), and since \(\lim _{t \rightarrow \infty } t^{-2/3} \ln t = 0\). By (4.3), the second term on the right hand side of (4.5) is bounded from above by (4.7) for all \(d \ge N_0\) as well. Due to the assumption \(\delta _d \le e^{-1/2}\) and (4.7), condition (4.3) also ensures that (4.4) is satisfied for all \(d \ge N_0\). Therefore, condition (4.1) is satisfied for large enough d. Now, Theorem 4.1 and (4.6) yield that for every \(C > 2\sqrt{2}e/3\) there exists \(N_1 \ge N_0\) such that

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y_d}, \mathscr {L}_{\mu ^{y_d}}\right) \le C \varepsilon _d^\frac{1}{2}K_d d^\frac{3}{2} \end{aligned}$$

for all \(d \ge N_1\). \(\square \)

5 Perturbed linear problems with Gaussian prior

In this section we consider the case that the forward mapping G is given by a linear mapping with a small nonlinear perturbation, i.e., that

$$\begin{aligned} G_\tau (x) = Ax + \tau F(x), \end{aligned}$$
(5.1)

with \(A \in \mathbb {R}^{d \times d}\), \(F \in C^3(\mathbb {R}^d,\mathbb {R}^d)\), and \(\tau \ge 0\). We then quantify the error of the Laplace approximation for small \(\tau \), that is, when the nonlinearity of the forward mapping is small, and for fixed problem dimension d. In order to isolate the effect of the nonlinearity on estimate (4.2), we consider the case when not only the noise, but also the prior distribution is Gaussian. This ensures that all non-Gaussianity in the posterior distribution results from the nonlinearity of \(G_\tau \).

We assign a prior distribution \(\mu = \mathscr {N}(m_0,\varSigma _0)\) with \(m_0 \in \mathbb {R}^d\) and symmetric, positive definite \(\varSigma _0 \in \mathbb {R}^{d \times d}\), i.e., we set

$$\begin{aligned} R(x)&:= -\ln \left( \frac{1}{(2\pi )^{d/2}\sqrt{\det \varSigma _0}} \exp \left( -\frac{1}{2}{\Vert x - m_0 \Vert }_{\varSigma _0}^2\right) \right) \nonumber \\&\; = \frac{1}{2} {\Vert x - m_0 \Vert }_{\varSigma _0}^2 + \frac{d}{2}\ln 2\pi + \frac{1}{2}\ln \det \varSigma _0. \end{aligned}$$
(5.2)

For each \(\tau \ge 0\) we denote the data by \(y_\tau \) and the scaled negative log-likelihood by \(\varPhi _\tau (x) = \frac{1}{2}{|G_\tau (x) - y_\tau |}^2\).

We make the following assumptions on the function \(I_\tau \) and the perturbation F. Let \(B(r) \subset \mathbb {R}^d\) denote the closed Euclidean ball with radius r around the origin.

Assumption 5.1

We assume that there exists \(\tau _0 > 0\) such that for all \(\tau \in [0,\tau _0]\),

$$\begin{aligned} I_\tau (x) = \frac{1}{2}{|Ax + \tau F(x) - y_\tau |}^2 + \frac{\varepsilon }{2}{\Vert x - m_0 \Vert }_{\varSigma _0}^2 + c_\tau \end{aligned}$$

has a unique minimizer \(\hat{x}_\tau \) with \(D^2I_\tau (\hat{x}_\tau ) > 0\). Furthermore, we assume that \(y_\tau \), \(\hat{x}_\tau \) and \(\varSigma _\tau := D^2I_\tau (\hat{x}_\tau )^{-1}\) converge as \(\tau \rightarrow 0\) with \(\lim _{\tau \rightarrow 0} \varSigma _\tau > 0\) and denote their limits by y, \(\hat{x}\), and \(\varSigma \), respectively.

Assumption 5.2

There exist constants \(C_0, \dots , C_3>0\) and \(\tau _0>0\) such that

$$\begin{aligned} {\Vert D^jF(x) \Vert }_{\varSigma _\tau } \le C_j, \qquad j=0,\dots ,3, \end{aligned}$$

for all \(x \in {\mathbb {R}^d}\) and \(\tau \in [0,\tau _0]\), and there exists \(M > 0\) such that

$$\begin{aligned} D^3 F \equiv 0 \qquad \text {on }\mathbb {R}^d {\setminus } B(M). \end{aligned}$$

The idea behind the following theorem is to make explicit how the nonlinearity of the forward mapping, as quantified by \(\tau \) and the constants \(C_0, \dots , C_3, M\) in Assumption 5.2, influences the total variation error bound of Theorem 4.1.

Theorem 5.3

Suppose that Assumptions 5.1 and 5.2 hold. Then, there exists \(\tau _1 \in (0,\tau _0]\) such that

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y_\tau },\mathscr {L}_{\mu ^{y_\tau }}\right) \le C(1 + \varepsilon )\varepsilon ^\frac{1}{2} \left( V(\tau )\tau + \frac{W}{2}\tau ^2\right) \end{aligned}$$

for all \(\tau \in [0,\tau _1]\), where

$$\begin{aligned} C&:= \frac{2}{3}\sqrt{2}e \frac{\varGamma \left( \frac{d}{2} + \frac{3}{2}\right) }{\varGamma \left( \frac{d}{2}\right) }, \\ V(\tau )&:= C_3\left( {\Vert A \Vert }M + {|y_\tau |}\right) + 3C_2\left\Vert A\varSigma _\tau ^\frac{1}{2} \right\Vert , \\ W&:= C_3 C_0 + 3 C_2 C_1. \end{aligned}$$

Moreover, \(\{V(\tau )\}_{\tau \in [0,\tau _1]}\) is bounded.
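
Given numerical values for the quantities appearing in the theorem, the right-hand side of the bound can be evaluated directly. The following Python sketch does so for placeholder values of \(C_0,\dots ,C_3\), M, \({\Vert A \Vert }\), \({|y_\tau |}\) and \({\Vert A\varSigma _\tau ^{1/2} \Vert }\); in an application these would be obtained from Assumptions 5.1 and 5.2 for the problem at hand.

```python
import math

def laplace_tv_bound(tau, d, eps, C0, C1, C2, C3, M, normA, abs_y, normA_Sigma_half):
    """Right-hand side of the bound in Theorem 5.3 for given (assumed) constants."""
    C = ((2.0 / 3.0) * math.sqrt(2.0) * math.e
         * math.gamma(d / 2 + 1.5) / math.gamma(d / 2))
    V = C3 * (normA * M + abs_y) + 3.0 * C2 * normA_Sigma_half
    W = C3 * C0 + 3.0 * C2 * C1
    return C * (1.0 + eps) * math.sqrt(eps) * (V * tau + 0.5 * W * tau ** 2)

# Placeholder constants; in practice they come from Assumptions 5.1 and 5.2.
print(laplace_tv_bound(tau=0.05, d=3, eps=0.01,
                       C0=1.0, C1=1.0, C2=1.0, C3=1.0,
                       M=2.0, normA=1.0, abs_y=1.0, normA_Sigma_half=1.0))
```

Note that the ratio \(\varGamma (\frac{d}{2} + \frac{3}{2})/\varGamma (\frac{d}{2})\) grows like \((d/2)^{3/2}\), which is the source of the dimension dependence discussed in Section 6.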

Remark 5.4

  1. 1.

    The choice of the upper bound \(\tau _1\) is made explicit in the proof of Theorem 5.3 and depends on d and \(\varepsilon \), among other things through \(\delta _0\) as defined in Proposition 5.6. The proof of Theorem 5.3 can be adapted to yield a result analogous to Corollary 4.6 in the case when the problem dimension d tends to \(\infty \) while the size \(\tau _d\) of the perturbation tends to 0. Then, \(\delta _{\tau _d}\) may converge to 0, and (4.3) imposes a bound on the rate at which \(\{\tau _d\}_{d\in \mathbb {N}}\) tends to 0.

  2. 2.

    By the boundedness and continuity of F, \(G_\tau \) converges locally uniformly toward A as \(\tau \rightarrow 0\). Together with Assumption 5.1, the fundamental theorem of \(\varGamma \)-convergence then implies that \(\hat{x}\) is the minimizer of

    $$\begin{aligned} I(x) = \frac{1}{2}{|Ax - y |}^2 + \frac{\varepsilon }{2}{\Vert x - m_0 \Vert }_{\varSigma _0}^2 + c \end{aligned}$$

    and that \(\varSigma = D^2I(\hat{x})^{-1}\).

  3. 3.

    Theorem 5.3 remains valid if the assumption that \(D^3F \equiv 0\) outside of a bounded set is replaced by

    $$\begin{aligned} {\Vert D^3F(x) \Vert }_{\varSigma _\tau } \le C_3 M {|x |}^{-1} \end{aligned}$$

    for all \(x \in \mathbb {R}^d {\setminus } \{0\}\) and \(\tau \in [0,\tau _0]\).

In order to prove Theorem 5.3, we first show that Assumptions 3.2 and 3.3 are satisfied for small enough \(\tau \) and determine the bounds \(K_\tau \) and \(\delta _\tau \). Then, we derive the error estimate for the perturbed linear case.

5.1 Verifying Assumption 3.3

We verify that Assumption 3.3 holds for small enough \(\tau \) and determine \(\delta _\tau \). First, we estimate \(I_\tau \) from below.

Lemma 5.5

For all \(\tau \ge 0\) and \(x \in {\mathbb {R}^d}\), we have

$$\begin{aligned} I_\tau (x)&\ge \frac{1}{2}{|G_\tau (x) - G_\tau (\hat{x}_\tau ) |}^2 + \frac{\varepsilon }{2}{\Vert x - \hat{x}_\tau \Vert }_{\varSigma _0}^2 \\&\quad - {|G_\tau (x) - G_\tau (\hat{x}_\tau ) - DG_\tau (\hat{x}_\tau )(x -\hat{x}_\tau ) |}\cdot {|G_\tau (\hat{x}_\tau ) - y_\tau |}. \end{aligned}$$

Proof

Since \(\hat{x}_\tau \) satisfies the necessary optimality condition

$$\begin{aligned} D\varPhi _\tau (\hat{x}_\tau ) + \varepsilon DR(\hat{x}_\tau ) = DI_\tau (\hat{x}_\tau ) = 0, \end{aligned}$$

we can write \(I_\tau (x)\) as

$$\begin{aligned} I_\tau (x) = \varPhi _\tau (x) - \varPhi _\tau (\hat{x}_\tau ) - D\varPhi _\tau (\hat{x}_\tau )(x - \hat{x}_\tau ) + \varepsilon \Big (R(x) - R(\hat{x}_\tau ) - DR(\hat{x}_\tau )(x - \hat{x}_\tau )\Big ) \end{aligned}$$

for all \(x \in \mathbb {R}^d\). For the log-likelihood, we have

$$\begin{aligned} \varPhi _\tau (x) - \varPhi _\tau (\hat{x}_\tau ) = \frac{1}{2} |G_\tau (x) - G_\tau (\hat{x}_\tau )|^2 + {(G_\tau (x) - G_\tau (\hat{x}_\tau ), G_\tau (\hat{x}_\tau ) - y_\tau )}, \end{aligned}$$

and

$$\begin{aligned} D\varPhi _\tau (\hat{x}_\tau )(x - \hat{x}_\tau ) = {(DG_\tau (\hat{x}_\tau )(x - \hat{x}_\tau ), G_\tau (\hat{x}_\tau ) - y_\tau )} \end{aligned}$$

for all \(x \in {\mathbb {R}^d}\). From this, we obtain

$$\begin{aligned} \begin{aligned}&\varPhi _\tau (x) - \varPhi _\tau (\hat{x}_\tau ) - D\varPhi _\tau (\hat{x}_\tau )(x - \hat{x}_\tau ) \\&\quad \ge \frac{1}{2}{|G_\tau (x) - G_\tau (\hat{x}_\tau ) |}^2 - {|G_\tau (x) - G_\tau (\hat{x}_\tau ) - DG_\tau (\hat{x}_\tau )(x -\hat{x}_\tau ) |}\cdot {|G_\tau (\hat{x}_\tau ) - y_\tau |} \end{aligned} \end{aligned}$$
(5.3)

using the Cauchy–Schwarz inequality. For the log-prior density, we have

$$\begin{aligned} DR(\hat{x}_\tau )(x - \hat{x}_\tau ) = {(\hat{x}_\tau - m_0,x - m_0)}_{\varSigma _0} - {\Vert \hat{x}_\tau - m_0 \Vert }_{\varSigma _0}^2 \end{aligned}$$

for all \(x \in {\mathbb {R}^d}\), and thus

$$\begin{aligned} R(x) - R(\hat{x}_\tau ) - DR(\hat{x}_\tau )(x - \hat{x}_\tau ) = \frac{1}{2}{\Vert x - \hat{x}_\tau \Vert }_{\varSigma _0}^2. \end{aligned}$$
(5.4)

Now, combining (5.3) with (5.4) multiplied by \(\varepsilon \) yields the claim. \(\square \)

Proposition 5.6

Suppose that Assumption 5.2 holds. Then there exists \(\tau _0 > 0\) such that \(I_\tau \) satisfies

$$\begin{aligned} I_\tau (x) \ge \frac{\delta _\tau }{2} {\Vert x-\hat{x}_\tau \Vert }_{\varSigma _\tau }^2 \end{aligned}$$

for all \(x\in \mathbb {R}^d\) and \(\tau \in [0,\tau _0]\), where

$$\begin{aligned} \delta _\tau := \gamma _1(\tau ) - \gamma _2(\tau ){|G_\tau (\hat{x}_\tau ) - y_\tau |} > 0 \end{aligned}$$

with

$$\begin{aligned} \gamma _1(\tau ) := \frac{1}{\left\Vert \varSigma _\tau ^{-\frac{1}{2}}(A^{\mathrm {T}}A + \varepsilon \varSigma _0^{-1})^{-\frac{1}{2}} \right\Vert ^2} - C_1^2\tau ^2 \quad \text {and}\quad \gamma _2(\tau ) := C_2\tau \end{aligned}$$

for all \(\tau \in [0,\tau _0]\). Furthermore, \(\lim _{\tau \rightarrow 0} \delta _\tau > 0\).

Proof

Let \(x \in {\mathbb {R}^d}\) be arbitrary. Then, there exists \(z_1 \in \mathbb {R}^d\) such that \(F(x) - F(\hat{x}_\tau ) = DF(z_1)(x - \hat{x}_\tau )\). Therefore,

$$\begin{aligned} {|F(x) - F(\hat{x}_\tau ) |} \le {\Vert DF(z_1) \Vert }_{\varSigma _\tau } {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau } \le C_1 {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau } \end{aligned}$$

by Assumption 5.2. Moreover, we have

$$\begin{aligned} {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau } \le \left\Vert \varSigma _\tau ^{-\frac{1}{2}} \left( A^{\mathrm {T}}A + \varepsilon \varSigma _0^{-1}\right) ^{-\frac{1}{2}} \right\Vert \cdot \left|\left( A^{\mathrm {T}}A + \varepsilon \varSigma _0^{-1}\right) ^\frac{1}{2} (x - \hat{x}_\tau ) \right|, \end{aligned}$$

and hence

$$\begin{aligned}&\frac{1}{2} {|G_\tau (x) - G_\tau (\hat{x}_\tau ) |}^2 + \frac{\varepsilon }{2} {\Vert x-\hat{x}_\tau \Vert }_{\varSigma _0}^2 \\&\quad \ge \frac{1}{2}{|A(x-\hat{x}_\tau ) |}^2 - \frac{\tau ^2}{2} {|F(x) - F(\hat{x}_\tau ) |}^2 + \frac{\varepsilon }{2} {\Vert x-\hat{x}_\tau \Vert }_{\varSigma _0}^2 \\&\quad \ge \frac{1}{2} \left( \left( A^{\mathrm {T}}A + \varepsilon \varSigma _0^{-1}\right) (x-\hat{x}_\tau ), x-\hat{x}_\tau \right) - \frac{\tau ^2C_1^2}{2} {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 \\&\quad \ge \frac{1}{2}\gamma _1(\tau ) {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 \end{aligned}$$

for all \(\tau \le \tau _0\). There also exists \(z_2 \in {\mathbb {R}^d}\) such that

$$\begin{aligned} G_\tau (x) - G_\tau (\hat{x}_\tau ) - DG_\tau (\hat{x}_\tau )(x - \hat{x}_\tau ) = \frac{1}{2} D^2G_\tau (z_2)(x - \hat{x}_\tau , x - \hat{x}_\tau ), \end{aligned}$$

and \(D^2G_\tau = \tau D^2F\). By Assumption 5.2, we thus have

$$\begin{aligned} \begin{aligned} {|G_\tau (x) - G_\tau (\hat{x}_\tau ) - DG_\tau (\hat{x}_\tau )(x - \hat{x}_\tau ) |}&\quad \le \frac{\tau }{2}{\Vert D^2 F(z_2) \Vert }_{\varSigma _\tau } {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 \\&\quad \le \frac{\tau C_2}{2} {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 = \frac{1}{2}\gamma _2(\tau ) {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 \end{aligned} \end{aligned}$$

for all \(\tau \le \tau _0\). Now, Lemma 5.5 yields that

$$\begin{aligned} I_\tau (x) \ge \frac{1}{2}\gamma _1(\tau ) {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 - \frac{1}{2}\gamma _2(\tau ) {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2 {|G_\tau (\hat{x}_\tau ) - y_\tau |} = \frac{\delta _\tau }{2} {\Vert x - \hat{x}_\tau \Vert }_{\varSigma _\tau }^2. \end{aligned}$$

It remains to show that \(\lim _{\tau \rightarrow 0} \delta _\tau > 0\). The convergence of \(y_\tau \) and \(\hat{x}_\tau \) yields

$$\begin{aligned} G_\tau (\hat{x}_\tau ) - y_\tau \rightarrow A\hat{x}- y. \end{aligned}$$

Now, it follows from \(\lim _{\tau \rightarrow 0} \gamma _2(\tau ) = 0\) and the convergence of \(\varSigma _\tau \) that

$$\begin{aligned} \lim _{\tau \rightarrow 0} \delta _\tau = \lim _{\tau \rightarrow 0} \gamma _1(\tau ) = \frac{1}{\left\Vert \varSigma ^{-\frac{1}{2}} (A^{\mathrm {T}}A + \varepsilon \varSigma _0^{-1})^{-\frac{1}{2}} \right\Vert ^2} > 0. \end{aligned}$$

\(\square \)
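
The quantity \(\delta _\tau \) is computable once \(\varSigma _\tau \) and the residual \({|G_\tau (\hat{x}_\tau ) - y_\tau |}\) are available, e.g., from a MAP computation. The following Python sketch evaluates \(\delta _\tau \) as defined in Proposition 5.6; the matrices and the residual norm in the example call are illustrative stand-ins, with the \(\tau = 0\) posterior covariance used in place of \(\varSigma _\tau \).

```python
import numpy as np

def sqrtm_spd(S):
    """Matrix square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

def delta_tau(tau, A, Sigma0, Sigma_tau, eps, C1, C2, residual_norm):
    """delta_tau from Proposition 5.6; Sigma_tau and residual_norm are
    assumed to be supplied, e.g., by a MAP computation."""
    H = A.T @ A + eps * np.linalg.inv(Sigma0)     # A^T A + eps * Sigma_0^{-1}
    op_norm = np.linalg.norm(np.linalg.inv(sqrtm_spd(Sigma_tau))
                             @ np.linalg.inv(sqrtm_spd(H)), 2)   # spectral norm
    gamma1 = 1.0 / op_norm ** 2 - C1 ** 2 * tau ** 2
    gamma2 = C2 * tau
    return gamma1 - gamma2 * residual_norm

# Illustrative call with assumed quantities.
d = 3
A = np.eye(d)
Sigma0 = np.eye(d)
eps = 0.01
Sigma_tau = np.linalg.inv(A.T @ A + eps * np.linalg.inv(Sigma0))  # tau = 0 covariance
print(delta_tau(tau=0.05, A=A, Sigma0=Sigma0, Sigma_tau=Sigma_tau,
                eps=eps, C1=1.0, C2=1.0, residual_norm=0.5))
```

With these stand-ins the operator norm equals 1, so the sketch returns \(\gamma _1(\tau ) = 1 - C_1^2\tau ^2\) reduced by the residual correction \(\gamma _2(\tau ){|G_\tau (\hat{x}_\tau ) - y_\tau |}\).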

5.2 Verifying Assumption 3.2

Now, we verify that Assumption 3.2 holds for small \(\tau \). The following proposition also describes how the nonlinearity of the forward mapping translates into non-Gaussianity of the likelihood, as quantified by the constant \(K_\tau \).

Proposition 5.7

Suppose that Assumption 5.2 holds. Then \(\varPhi _\tau \) and R satisfy Assumption 3.2 for all \(\tau \in [0,\tau _0]\) with

$$\begin{aligned} K_\tau := \tau \left( C_3 \left( {\Vert A \Vert } M + {|y_\tau |}\right) + 3 C_2 \left\Vert A\varSigma _\tau ^\frac{1}{2} \right\Vert \right) + \frac{\tau ^2}{2}\left( C_3 C_0 + 3 C_2 C_1\right) . \end{aligned}$$

Proof

We express the scaled negative log-likelihood for all \(x \in \mathbb {R}^d\) as

$$\begin{aligned} \varPhi _\tau (x)&= \frac{1}{2}{|Ax + \tau F(x) - y_\tau |}^2 \\&= \frac{1}{2}{|Ax - y_\tau |}^2 + \tau {(Ax - y_\tau , F(x))} + \frac{\tau ^2}{2}{|F(x) |}^2 \\&= \varPhi _0(x) + \tau \varPsi _{1}(x; \tau ) + \tau ^2\varPsi _2(x), \end{aligned}$$

where \(\varPsi _{1}(x; \tau ) := {(Ax - y_\tau , F(x))}\) and \(\varPsi _2(x) := \frac{1}{2}{|F(x) |}^2\). For the first term, we have \(D^3\varPhi _0(x) = 0\) for all \(x \in {\mathbb {R}^d}\) due to the linearity of A. The third differentials of \(\varPsi _{1},\varPsi _2\) can be stated explicitly as

$$\begin{aligned} D^3\varPsi _{1}(x; \tau )(h_1,h_2,h_3)&= {(D^3F(x)(h_1,h_2,h_3), Ax - y_\tau )} + {(D^2F(x)(h_1,h_2), Ah_3)} \\&\quad + {(D^2F(x)(h_2,h_3), Ah_1)} + {(D^2F(x)(h_1,h_3), Ah_2)}, \end{aligned}$$

and

$$\begin{aligned} D^3\varPsi _2(x)(h_1,h_2,h_3)&= \frac{1}{2}{(D^3F(x)(h_1,h_2,h_3), F(x))} + \frac{1}{2}{(D^2F(x)(h_1,h_2), DF(x)h_3)} \\&\quad + \frac{1}{2}{(D^2F(x)(h_2,h_3), DF(x)h_1)} + \frac{1}{2}{(D^2F(x)(h_1,h_3), DF(x)h_2)} \end{aligned}$$

for all \(x, h_1, h_2, h_3 \in {\mathbb {R}^d}\) and \(\tau \ge 0\). Therefore, we have

$$\begin{aligned} \begin{aligned} {\Vert D^3\varPsi _{1}(x; \tau ) \Vert }_{\varSigma _\tau }&\le {\Vert D^3F(x) \Vert }_{\varSigma _\tau } {|Ax - y_\tau |} + 3{\Vert D^2F(x) \Vert }_{\varSigma _\tau } \left\Vert A\varSigma _\tau ^\frac{1}{2} \right\Vert \\&\le C_3 \left( {\Vert A \Vert } M + {|y_\tau |}\right) + 3 C_2 \left\Vert A\varSigma _\tau ^\frac{1}{2} \right\Vert \end{aligned} \end{aligned}$$

for all \(x \in {\mathbb {R}^d}\) and \(\tau \le \tau _0\) by Assumption 5.2; here, we used that \(D^3F(x) = 0\) whenever \({|x |} > M\), so that the first term only needs to be bounded on B(M), where \({|Ax - y_\tau |} \le {\Vert A \Vert }M + {|y_\tau |}\). Moreover, we obtain

$$\begin{aligned} {\Vert D^3\varPsi _2(x) \Vert }_{\varSigma _\tau }&\le \frac{1}{2}{\Vert D^3F(x) \Vert }_{\varSigma _\tau } {|F(x) |} + \frac{3}{2}{\Vert D^2F(x) \Vert }_{\varSigma _\tau }{\Vert DF(x) \Vert }_{\varSigma _\tau } \\&\le \frac{1}{2} C_3 C_0 + \frac{3}{2} C_2 C_1 \end{aligned}$$

for all \(x \in {\mathbb {R}^d}\) and \(\tau \le \tau _0\). Now, it follows that

$$\begin{aligned} {\Vert D^3\varPhi _\tau (x) \Vert }_{\varSigma _\tau }&\le \tau {\Vert D^3\varPsi _1(x; \tau ) \Vert }_{\varSigma _\tau } + \tau ^2{\Vert D^3\varPsi _2(x) \Vert }_{\varSigma _\tau } \\&\le \tau \left( C_3 \left( {\Vert A \Vert } M + {|y_\tau |}\right) + 3 C_2 \left\Vert A\varSigma _\tau ^\frac{1}{2} \right\Vert \right) + \frac{\tau ^2}{2}\left( C_3 C_0 + 3 C_2 C_1\right) \end{aligned}$$

for all \(x \in {\mathbb {R}^d}\) and \(\tau \le \tau _0\). \(\square \)
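
The decomposition \(\varPhi _\tau = \varPhi _0 + \tau \varPsi _1 + \tau ^2\varPsi _2\) used in the proof is purely algebraic and can be checked numerically. The following Python sketch verifies it at a random point for an illustrative choice of A, F, \(y_\tau \) and \(\tau \), none of which are prescribed by the paper.

```python
import numpy as np

# Numerical check of Phi_tau = Phi_0 + tau*Psi_1 + tau^2*Psi_2 at a random point;
# A, F, y_tau and tau are illustrative choices.
rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
y_tau = rng.standard_normal(d)
tau = 0.3
F = np.tanh

def Phi(x):
    r = A @ x + tau * F(x) - y_tau
    return 0.5 * r @ r

def Phi0(x):
    r = A @ x - y_tau
    return 0.5 * r @ r

def Psi1(x):
    return (A @ x - y_tau) @ F(x)

def Psi2(x):
    return 0.5 * F(x) @ F(x)

x = rng.standard_normal(d)
print(abs(Phi(x) - (Phi0(x) + tau * Psi1(x) + tau ** 2 * Psi2(x))))  # ~1e-16
```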

5.3 Proof of Theorem 5.3

Proof of Theorem 5.3

We first note that Assumption 3.1 holds for all \(\tau \le \tau _0\) by the definition of \(G_\tau \) and R and by Assumption 5.1. Second, Assumption 3.3 holds for all \(\tau \le \tau _0\) by Proposition 5.6 with \(\delta _\tau \) as defined there, and \(\delta _0 = \lim _{\tau \rightarrow 0} \delta _\tau > 0\). This allows us to choose \(\tau _1 \le \tau _0\) such that \(\delta _\tau \ge \frac{1}{2}\delta _0 =: {\underline{\delta }}\) for all \(\tau \in [0,\tau _1]\). Third, Assumption 3.2 holds for all \(\tau \le \tau _0\) by Proposition 5.7 with \(K_\tau \) as defined there, and \(\lim _{\tau \rightarrow 0} K_\tau = 0\).

Since \(\{\delta _\tau \}_{\tau \in [0,\tau _1]}\) is bounded from below, the left hand side of condition (4.1) from Theorem 4.1 is bounded from below by \(\kappa _1/K_\tau \) for some \(\kappa _1 > 0\), and the right hand side of condition (4.1) is bounded from above by \((8\ln (\kappa _2/K_\tau ))^{3/2}\) for small enough \(\tau \) and some \(\kappa _2 > 0\). Therefore, we can choose \(\tau _2 \le \tau _1\) such that condition (4.1) is satisfied for all \(\tau \in [0,\tau _2]\). Now, Theorem 4.1 yields that

$$\begin{aligned} {d_\text {TV}}\left( \mu ^{y_\tau }, \mathscr {L}_{\mu ^{y_\tau }}\right)&\le \frac{2}{3}\sqrt{2}e(1 + \varepsilon )\varepsilon ^\frac{1}{2} \frac{\varGamma \big (\frac{d}{2} + \frac{3}{2}\big )}{\varGamma \big (\frac{d}{2}\big )} \\&\quad \cdot \left[ \left( C_3\left( {\Vert A \Vert }M + {|y_\tau |}\right) + 3C_2\left\Vert A\varSigma _\tau ^\frac{1}{2} \right\Vert \right) \tau + \frac{1}{2}\left( C_3 C_0 + 3 C_2 C_1\right) \tau ^2\right] \end{aligned}$$

for all \(\tau \in [0,\tau _2]\). By the convergence of \(y_\tau \) and \(\varSigma _\tau \), we can, moreover, choose \(\tau _3 \le \tau _2\) such that both \(\{{|y_\tau |}\}_{\tau \in [0,\tau _3]}\) and \(\{{\Vert A\varSigma _\tau ^{1/2} \Vert }\}_{\tau \in [0,\tau _3]}\) are bounded; in particular, \(\{V(\tau )\}_{\tau \in [0,\tau _3]}\) is bounded. The assertion of Theorem 5.3 thus holds with \(\tau _3\) taking the role of \(\tau _1\). \(\square \)

6 Outlook

In this paper we prove novel error estimates for the Laplace approximation applied to nonlinear Bayesian inverse problems. The error is measured in TV distance, and our estimates are non-asymptotic: they quantify effects beyond the noise level, such as the nonlinearity of the forward mapping and the problem dimension. Our central error estimate in Theorem 3.4 is of particular use for high-dimensional problems because it can be evaluated without integrating in \({\mathbb {R}^d}\). Our estimate in Theorem 4.1 makes the influence of the noise level, the nonlinearity of the forward operator, and the problem dimension explicit. Our estimate for perturbed linear problems in Theorem 5.3, in turn, specifies in more detail how the properties of the nonlinear perturbation affect the approximation error.

We point out that our central estimate diverges as the dimension increases for a fixed noise level and forward mapping, so that in this regime it provides no added value over the trivial TV upper bound of 1. This unsatisfactory observation is nevertheless natural: the limiting posterior and Laplace approximation (if well-defined) are mutually singular, and consequently the TV distance between them is maximal. Future work is therefore needed to establish similar bounds in distances that metrize weak convergence, such as the 1-Wasserstein distance. Such an effort would be aligned with recent developments in Bernstein–von Mises (BvM) theory that extend to nonparametric Bayesian inference and, in particular, Bayesian inverse problems.