On the Convergence of the Laplace Approximation and Noise-Level-Robustness of Laplace-based Monte Carlo Methods for Bayesian Inverse Problems

The Bayesian approach to inverse problems provides a rigorous framework for the incorporation and quantification of uncertainties in measurements, parameters and models. We are interested in designing numerical methods which are robust w.r.t. the size of the observational noise, i.e., methods which behave well in case of concentrated posterior measures. The concentration of the posterior is a highly desirable situation in practice, since it relates to informative or large data. However, it can pose a computational challenge for numerical methods based on the prior or reference measure. We propose to employ the Laplace approximation of the posterior as the base measure for numerical integration in this context. The Laplace approximation is a Gaussian measure centered at the maximum a-posteriori estimate and with covariance matrix depending on the logposterior density. We discuss convergence results of the Laplace approximation in terms of the Hellinger distance and analyze the efficiency of Monte Carlo methods based on it. In particular, we show that Laplace-based importance sampling and Laplace-based quasi-Monte-Carlo methods are robust w.r.t. the concentration of the posterior for large classes of posterior distributions and integrands whereas prior-based importance sampling and plain quasi-Monte Carlo are not. Numerical experiments are presented to illustrate the theoretical findings.


Introduction
The identification of unknown parameters from noisy observations arises in various areas of application, e.g., engineering systems, biological models, environmental systems. In recent years, Bayesian inference has become a popular approach to model the inverse problem [37], i.e., noisy observations are used to update the knowledge of unknown parameters from a prior distribution to the posterior distribution. The latter is then the solution of the Bayesian inverse problem and obtained by conditioning the prior distribution on the data. This approach is very appealing in various fields of applications, since uncertainty quantification can be performed, once the prior distribution is updated-barring the fact that Bayesian credible sets are not in a one-to-one correspondence to classical confidence sets, see [6,38].
To ensure the applicability of the Bayesian approach to computationally demanding models, there has been a lot of research effort towards improved algorithms allowing for effective sampling or integration w.r.t. the resulting posterior measure. For example, the computational burden of expensive forward or likelihood models can be reduced by surrogates or multilevel strategies [26,13,19,33] and for many classical sampling or integration methods such as Quasi-Monte Carlo [11], Markov chain Monte Carlo [5,31,40], and numerical quadrature [34,4] we now know modifications and conditions which ensure a dimensionindependent efficiency.
However, a third common challenge for many numerical methods has drawn surprisingly less attention so far: the challenge of concentrated posterior measures such as µ n (dx) = 1 Z n exp (−nΦ n (x)) µ 0 (dx), Z n := Here, n 1 and µ 0 denotes a reference or prior probability measure on R d and Φ n : R d → [0, ∞) are negative log-likelihood functions resulting, e.g., from n observations. From a modeling point of view the concentration effect of the posterior is a highly desirable situation due to large data sets and less remaining uncertainty about the parameter to be inferred. From a numerical point of view, on the other hand, this can pose a delicate situation, since standard methods may perform worse and worse if the concentration increases due to n → ∞. Hence, understanding how sampling or quadrature methods for µ n behave as n → ∞ is a crucial task with immediate benefits for practical purposes.
Numerical methods are often based on the prior µ 0 , since µ 0 is usually a simple measure allowing for direct sampling or explicit quadrature formulas. However, for large n most of the corresponding sample points or quadrature nodes will be placed in regions of low posterior importance missing the needle in the haystackthe minimizers of Φ n . An obvious way to circumvent this is to use a numerical integration w.r.t. another reference measure which can be straightforwardly computed or sampled from and concentrates around those minimizers and shrinks like the posterior measures µ n as n → ∞. In this paper we consider numerical methods based on a Gaussian approximation of µ n -the Laplace approximation.
When it comes to integration w.r.t. an increasingly concentrated function, the well-known and widely used Laplace's method provides explicit asymptotics for such integrals, i.e., under certain regularity conditions [42] we have for n → ∞ that where x ∈ R denotes the assumed unique minimizer of Φ : R d → R. This formula is derived by approximating Φ by its second-order Taylor polynomial at x . We could now use (2) and its application to Z n in order to derive that R d f (x) µ n (dx) → f (x ) as n → ∞. However, for finite n this is only of limited use, e.g., consider the computation of posterior probabilities where f is an indicator function. Thus, in practice we still rely on numerical integration methods in order to obtain a reasonable approximation of the posterior integrals R d f (x) µ n (dx). Nonetheless, the second-order Taylor approximation employed in Laplace's method provides us with (a guideline to derive) a Gaussian measure approximating µ n .
This measure itself is often called the Laplace approximation of µ n and will be denoted by L µn . Its mean is given by the maximum a-posteriori estimate (MAP) of the posterior µ n and its covariance is the inverse Hessian of the negative log posterior density. Both quantities can be computed efficiently by numerical optimization and since it is a Gaussian measure it allows for direct samplings and easy quadrature formulas. The Laplace approximation is widely used in optimal (Bayesian) experimental design to approximate the posterior distribution (see, for example, citealexanderian2016fast) and has been demonstrated to be particularly useful in the large data setting, see [24,32] and the references therein for more details. Moreover, in several recent publications the Laplace approximation was already proposed as a suitable reference measure for numerical quadrature [36,4] or importance sampling [1]. Note that preconditioning strategies based on Laplace approximation are also referred to as Hessian-based strategies due to the equivalence of the inverse covariance and the Hessian of the corresponding optimization problem, cp. [4]. In [36], the authors showed that a Laplace approximation-based adaptive Smolyak quadrature for Bayesian inference with affine parametric operator equations exhibits a convergence rate independent of the size of the noise, i.e., independent of n.
This paper extends the analysis in [36] for quadrature to the widely applied Laplace-based importance sampling and Laplace-based quasi-Monte Carlo (QMC) integration.
Before we investigate the scale invariance or robustness of these methods we examine the behaviour of the Laplace approximation and in particular, its density dµn dLµ n w.r.t. µ n . The reason behind is that, for importance sampling as well as QMC integration this density naturally appears in the methods, hence, if it deteriorates as n → ∞, this will be reflected in a deteriorating efficiency of the method. For example, for Φ n ≡ Φ the density w.r.t. the prior measure dµn dµ 0 = exp(−nΦ)/Z n deteriorates to a Dirac function at the minimizer x of Φ as n → ∞ which causes the shortcomings of Monte Carlo or QMC integration w.r.t. µ 0 as n → ∞. However, for the Laplace approximation we show that the density dµn dLµ n converges L µn -almost everywhere to 1 which in turn results in a robust-and actually improving-performance w.r.t. n of related numerical methods. In summary, the main results of this paper are the following: 1. Laplace Approximation: Given mild conditions the Laplace approximation L µn converges in Hellinger distance to µ n : d H (µ n , L µn ) ∈ O(n −1/2 ).
This result is closely related to the well-known Bernstein-von Mises theorem for the posterior consistency in Bayesian inference [39]. The significant difference here is that the covariance in the Laplace approximation depends on the data and the convergence holds for the particularly observed data whereas in the classical Bernstein-von Mises theorem the covariance is the inverse of the expected Fisher information matrix and the convergence is usually stated in probability.
• Prior-based Importance Sampling: The asymptotic variance of prior-based importance sampling, i.e., the prior µ 0 is used as the importance distribution, for computing the expectation of smooth integrands f ∈ L 2 µ 0 (R) w.r.t. such measures µ n deteriorates like n d/2−1 . • Laplace-based Importance Sampling. The (random) error e n,N (f ) of Laplace-based importance sampling for computing expectations of smooth integrands f ∈ L 2 µ 0 (R) w.r.t. such measures µ n using a fixed number of samples N ∈ N decays in probability almost like n −1/2 , i.e.,  • Prior-based Quasi-Monte Carlo: The root mean squared error estimate for computing integrals of the form (2) by QMC using randomly shifted Lattice rules deteriorates like n d/4 as n → ∞.
• Laplace-based Quasi-Monte Carlo: If the lattice rule is transformed by an affine mapping based on the mean and the covariance of the Laplace approximation, then the resulting root mean squared error decays like n −d/2 for integrals of the form (2).
The outline of the paper is as follows: in Section 2 we introduce the Laplace approximation for measures of the form (1) and the notation of the paper. In Section 2.2 we study the convergence of the Laplace approximation. We also consider the case of singular Hessians or perturbed Hessians and provide some illustrative numerical examples. At the end of the section, we shortly discuss the relation to the classical Bernsteinvon Mises theorem. The main results about importance sampling and QMC using the prior measure and the Laplace approximation, respectively, are then discussed in Section 3. We also briefly comment on existing results for numerical quadrature and provide several numerical examples illustrating our theoretical findings. The appendix collects the rather lengthy and technical proofs of the main results.

Convergence of the Laplace Approximation
We start with recalling the classical Laplace method for the asymptotics of integrals.
where D ⊆ R d is a possibly unbounded domain and let the following assumptions hold: 1. The integral J(n) converges absolutely for each n ∈ N.
2. There exists an x in the interior of D such that for every r > 0 there holds 3. The function f : D → R is (2p + 1) times continuously differentiable and Φ : R d → R is (2p + 1) times continuously differentiable (for some p ≥ 1) in a neighborhood of x and the Hessian H := ∇ 2 Φ(x ) is positive definite.
Then J(n) has an asymptotic expansion where c k ∈ R and, particularly, Let us assume that Φ(x ) = 0, then applying the theorem to both integrals gives where · A = A 1/2 · for a symmetric positive definite matrix A ∈ R d×d . In other words, the asymptotic behaviour of f e −nΦ dx, in particular, its convergence to zero, is the same as of the integral of f w.r.t. an unnormalized Gaussian density with mean in x and covariance (nH ) −1 . If we consider now probability measures µ n as in (1) but with Φ n ≡ Φ where Φ satisfies the assumptions of Theorem 1, and if we suppose that µ 0 possesses a continuous Lebesgue density π 0 : R → [0, ∞) with π 0 (x ) > 0, then Theorem 1 will imply that for sufficiently smooth f ∈ L 1 µ 0 (R) where N x,C = N (x, C) denotes a Gaussian measure with mean x ∈ R d and covariance C ∈ R d×d . Note that due to normalization we do need not to assume Φ(x ) = 0 here. This particular notion of "relative" weak convergence 1 of µ n to N x ,(nH ) −1 for the case Φ n ≡ Φ suggests to use N x ,(nH ) −1 as a Gaussian approximation to µ n . In the next subsection we derive similar Gaussian approximation for the general case Φ n ≡ Φ, whereas subsection 2.2 includes convergence results of the Laplace approximation in terms of the Hellinger distance.
Bayesian inference We present some context for the form of equation (1) in the following. Integrals of the form (1) arise naturally in the Bayesian setting for inverse problems with large amount of observational data or informative data. Note that the mathematical results for the Laplace approximation given in section 2 are derived in a much more general setting and are not restricted to integrals w.r.t. the posterior in the Bayesian inverse framework. We refer to [20,7] and the references therein for a detailed introduction to Bayesian inverse problems.
Consider a forward response operator G ∈ C(R d , R K ) mapping the unknown parameters x ∈ R d to the data space R K , where K ∈ N denotes the number of observations. We investigate the inverse problem of recovering unknown parameters x ∈ R d from noisy observation y ∈ R K given by where η ∼ N (0, Γ) is a Gaussian random variable with mean zero and covariance matrix Γ, which models the noise in the observations and in the model.
The Bayesian approach for this inverse problem of inferring x from y (which is ill-posed without further assumptions) works as follows: For fixed y ∈ R K we introduce the least-squares functional (or negative loglikelihood in the language of statistics) Φ(·; y) : with | · | Γ := |Γ − 1 2 · | denoting the weighted Euclidean norm in R K . The unknown parameter x is modeled as a R d -valued random variable with prior distribution µ 0 (independent of the observational noise η), which regularizes the problem and makes it well-posed by application of Bayes' theorem: The pair (x, y) is a jointly varying random variable on R d ×R K and hence the solution to the Bayesian inverse problem is the conditional or posterior distribution µ of x given the data y where the law µ is given by with the normalization constant Z := R d exp(−Φ(x; y))µ 0 (dx). If we assume a decaying noise-level by introducing a scaled noise covariance Γ n = 1 n Γ, the resulting noise model η n ∼ N (0, Γ n ) yields an n-dependent log-likelihood term which results in posterior measures µ n of the form (1) with Φ n (x) = Φ(x; y). Similarly, an increasing number n ∈ N of data y 1 , . . . , y n ∈ R k resulting from n observations of G(x) with independent noises η 1 , . . . , η n ∼ N (0, Γ) yields posterior measures µ n as in (1) with Φ n (x) = 1 n n j=1 Φ(x; y j ).
Hence, also the measures µ n in (1) are absolutely continuous w.r.t. Lebesgue measure, i.e., where I n : S 0 → R is given by In order to define the Laplace approximation of µ n we need the following basic assumption.
Assumption 1. There holds Φ n , π 0 ∈ C 2 (S 0 , R). Furthermore, I n has a unique minimizer x n ∈ S 0 satisfying where the latter denotes positive definiteness.
Remark 1. Assuming that min x∈S 0 I n (x) = 0 is just a particular (but helpful) normalization and in general not restrictive: If min x∈S 0 I n (x) = c > −∞, then we can simply set for which we obtain and min x∈S 0Î n (x) = min xΦn (x) − 1 n log π 0 (x) = 0. Given Assumption 1 we define the Laplace approximation of µ n as the following Gaussian measure Thus, we have where we can view as the second-order Taylor approximationĨ n = T 2 I n (x n ) of I n at x n . This point of view is crucial for analyzing the approximation Lebesgue density of the prior: µ 0 (dx) = π 0 (x)dx (4) Z 0 = 1 hypothetically: Normalization constant of π 0 µ n posterior probability measure (1), (5) Φ n (scaled) negative loglikelihood of posterior (1) I n (scaled) negative logdensity of posterior (6) π n (x) unnormalized Lebesgue density of the posterior Lemma 1 Z n normalization constant of π n (1) L µn Laplace approximation of µ n (8) I n = T 2 I n (x n ) (scaled) negative logdensity of L µn (9) π n unnormalized Lebesgue density of L µn Lemma 1 Z n normalization constant ofπ n (8) Notation and recurring equations. Before we continue, we collect recurring important definitions and where they can be found in Table 1 and provide the following important equations cheat sheet x − x n 2 ∇ 2 In(xn) )dx (relative to Lebesgue measure)

Convergence in Hellinger Distance
By a modification of Theorem 1 for integrals w.r.t. a weight e −nΦn(x) we may show a corresponding version of (3), i.e., for sufficiently smooth f ∈ L 1 µ 0 (R) However, in this section we study a stronger convergence of L µn to µ n , namely, w.r.t. the total variation distance d TV and the Hellinger distance d H . Given two probability measures µ, ν on R d and another probability measure ρ dominating µ and ν the total variation distance of µ and ν is given by and their Hellinger distance by It holds true that see [16,Equation (8)]. Note that, d TV (µ n , L µn ) → 0 implies that | dµn dLµ n − 1| dL µn → 0 whereas the weak convergence (10) by Laplace's method means | f dµn f dLµ n −1| → 0. In order to establish our convergence result, we require almost the same assumptions as in Theorem 1, but now uniformly w.r.t. n: Assumption 2. There holds Φ n , π 0 ∈ C 3 (S 0 , R) for all n and 1. there exist the limits in R d and R d×d , respectively, with H being positive definite and x belonging to the interior of S 0 .
2. For each r > 0 there exists an n r ∈ N, δ r > 0 and K r < ∞ such that The only additional assumptions in comparison to the classical convergence theorem of the Laplace method are about the third derivatives of π 0 and Φ n and the convergence of x n → x . We remark that (11) implies and, thus, also lim n→∞ C n = H −1 .
We start our analysis with the following helpful lemma.
Lemma 1. Let Assumption 1 and 2 be satisfied and let π n ,π n : R d → [0, ∞) denote the unnormalized Lebesgue densities of µ n and L µn , respectively, given by Then, for any p ∈ N R d π n (x) π n (x) Proof. We define the remainder term i.e., for x ∈ S 0 we have πn(x) πn(x) = exp(−nR n (x)). Moreover, note that for x ∈ S c 0 there holds π n (x) = 0. Thus, we obtain where we define for a given radius r > 0 J 0 (n) := L µn (S c 0 ), In Appendix B.1 we prove that which then yields the statement.
Lemma 1 provides the basis for our main convergence theorem. Proof. We start with For the first term there holds due to Lemma 1 For the second term on the right-hand side we obtain Furthermore, due to Lemma 1 there exists a c < ∞ such that This yields which concludes the proof.
Convergence of other Gaussian approximations. Let us consider now a sequence of arbitrary Gaussian approximationsμ n = N (a n , 1 n B n ) to the measures µ n in (1). Under which conditions on a n ∈ R d and B n ∈ R d×d do we still obtain the convergence d H (µ n ,μ n ) → 0? Of course, a n → x seems to be necessary but how about the covariances B n ? Due to the particular scaling of 1/n appearing in the covariance of L µn , one might suppose that for exampleμ n = N (x n , 1 n I d ) orμ n = N (x n , 1 n B) with an arbitrary symmetric and positive definite (spd) B ∈ R d×d should converge to µ n as n → ∞. However, since The following result shows that, in general,μ n = N (x n , 1 n I d ) orμ n = N (x n , 1 n B) do not converge to µ n .
Theorem 3. Let the assumptions of Lemma 1 be satisfied.
1. Forμ n := N (x n , 1 n B n ), n ∈ N, with spd B n , we have that If so and if C n − B n ∈ O(n −1 ), then we even have d H (µ n ,μ n ) ∈ O(n −1/2 ).
The proof is straightforward given the exact formula for the Hellinger distance of Gaussian measures and can be found in Appendix B.2. Thus, Theorem 3 tells us that, in general, the Gaussian measuresμ n = N (x n , 1 n I d ) do not converge to µ n as n → ∞ whereas it is easily seen thatμ n = N (x n , 1 n H ), indeed, do converge.
Relation to the Bernstein-von Mises theorem in Bayesian inference. The Bernstein-von Mises (BvM) theorem is a classical result in Bayesian inference and asymptotic statistics in R d stating the posterior consistency under mild assumptions [39]. Its extension to infinite-dimensional situations does not hold in general [8,14], but can be shown under additional assumptions [15,2,3,27]. In order to state the theorem we introduce the following setting: represents the negative log-likelihood function for observing y ∈ S y given a parameter value x ∈ R d . Assuming a prior measure µ 0 (dx) = π 0 (x)1 S 0 (x) dx for the unknown parameter, the resulting posterior after n observations y i of the independent Y i , i = 1, . . . , n, is of the form (1) with We will denote the corresponding posterior measure by µ y 1 ,...,yn n in order to highlight the dependence of the particular data y 1 , . . . , y n . The BvM theorem states now the convergence of this posterior to a sequence of Gaussian measures. This looks very similar to the statement of Theorem 2. However, the difference lies in the Gaussian measures as well as the kind of convergence. In its usual form the BvM theorem states under similar assumptions as for Theorem 2 that there holds in the large data limit where µ Y 1 ,...,Yn n is now a random measure depending on the n independent random variables Y 1 , . . . , Y n and where the convergence in probability is taken w.r.t. randomness of the Y i . Moreover,x n =x n (Y 1 , . . . , Y n ) denotes an efficient estimator of the true parameter x 0 ∈ S 0 -e.g., the maximum-likelihood or MAP estimatorand I x 0 denotes the Fisher information at the true parameter x 0 , i.e., Now both, the BvM theorem and Theorem 2, state the convergence of the posterior to a concentrating Gaussian measure where the rate of concentration of the latter (or better: of its covariance) is of order n −1 . Furthermore, also the rate of convergence in the BvM theorem can be shown to be of order n −1/2 [18]. However, the main differences are: • The BvM states convergence in probability (w.r.t. the randomness of the Y i ) and takes as basic covariance the inverse expected Hessian of the negative log likelihood at the data generating parameter value x 0 . Working with this quantity requires the knowledge of the true value x 0 and the covariance operator is obtained by marginalizing over all possible data outcomes Y . This Gaussian measure is not a practical tool to be used but rather a limiting distribution of a powerful theoretical result reconciling Bayesian and classical statistical theory. For this reason, the Gaussian approximation in the statement of the BvM theorem can be thought of as being a "prior" approximation (in the loosest meaning of the word). Additionally, a crucial requirement is that the problem is well-specified meaning that x 0 is an interior point of the prior support S 0 -although there exist results for misspecified models, see below.
• Theorem 2 states the convergence for given realizations y i and takes the Hessian of the negative log posterior density evaluated at the current MAP estimate x n and the current data y 1 , . . . , y n . This means that we do not need to know the true parameter value x 0 and we employ the actual data realization at hand rather than averaging over all outcomes. Hence, we argue that the Laplace approximation (as stated in this context) provides a "posterior" approximation converging to the Bayesian posterior as n → ∞.
Also, we require that the limit x = lim n→∞ x n is an interior point of the prior support S 0 .
The following example illustrates the difference between the two Gaussian measures: Let x 0 ∈ R be an unknown parameter. Consider n measurements y k ∈ R, k = 1, . . . , n, where y k is a realization of For the Bayesian inference we assume a prior N (0, τ 2 ) on x. Then the Bayesian posterior is of the form µ n (dx) ∝ exp(−nI n (x)) where The MAP estimator x n is the Laplace approximation's mean and can be computed numerically as a minimizer of I n (x). It can be shown that x n converges to x = x 0 for almost surely all realizations y k of Y k due to the strong law of large numbers. Now we take the Hessian (w.r.t. x) of I n , y k and evaluate it in x n to obtain the covariance of the Laplace approximation, and, thus, On the other hand we compute the Gaussian BvM approximation: The Fisher information is given as (recall that Φ is the loglikelihood term as defined above) and hence we get the Gaussian approximation Now we clearly see the difference between the two measures and how they will be asymptotically identical, since x n → x = x 0 due to consistency, 1 n n k=1 y k converging a.s. tox 3 0 due to the strong law of large numbers, and with the prior-dependent part vanishing for n → 0.
We note that in [21] a BvM theorem is proven without the assumption that x 0 belongs to the interior of S 0 . However, in this case the basic covariance is not the Fisher information but the Hessian of the mapping x → d KL (ν 0 ||ν x ) evaluated at its unique minimizer where d KL (ν 0 ||ν x ) denotes the Kullback-Leibler divergence of the data distribution ν x given parameter x ∈ S 0 w.r.t. the true data distribution ν 0 .
Remark 2. Having raised the issue whether the BvM approximation N (x n , n −1 I −1 x 0 ) or the Laplace one L µn are closer to a given posterior µ n , one can of course ask for the best Gaussian approximation of µ n w.r.t. a certain distance or divergence. Thus, we mention [25,29] where such a best approximation w.r.t. the Kullback-Leibler divergence is considered. The authors also treat the case of best Gaussian mixture approximations for multimodal distributions and state a BvM like convergence result for the large data (and small noise) limit. However, the computation of such a best approximation can become costly whereas the Laplace approximation can be obtained rather cheaply.

The case of singular Hessians
The assumption, that the Hessians H n = ∇ 2 Φ n (x n ) as well as their limit H are positive definite, is quite restrictive. For example, for Bayesian inference with more unknown parameters than observational information, this assumption is not satisfied. Hence, we discuss in this subsection the convergence of the Laplace approximation in case of singular Hessians H n and H . Nonetheless, we assume throughout the section that Assumption 1 is satisfied and the Laplace approximation L µn , thus, well-defined. This means in particular that we suppose a regularizing effect of the log prior density log π 0 on the minimization of I n (x) = Φ n (x)− 1 n log π 0 (x). We first discuss necessary conditions for the convergence of the Laplace approximation and subsequently state a positive result for Gaussian reference measures µ 0 .
Necessary conditions. Let us consider the simple case of Φ n ≡ Φ, i.e., the probability measures µ n are given by where we assume now that Φ : On the other hand, the associated Laplace approximations L µn will converge weakly to the Dirac measure δ M L on the linear manifold Hence, it is necessary for the convergence L µn → µ n in total variation or Hellinger distance that M Φ = M L , i.e., that the manifold of minimizers of Φ is linear. In order to ensure the latter, we state the following.
Assumption 3. Let X ⊆ R d be a linear subspace such that for a projection P X onto X there holds and let the restriction Φ n : X → R possesses a unique and nondegenerate global minimum for each n ∈ N.
For the case Φ n = Φ this assumption implies, that where X c denotes a complementary subspace to X , i.e., X ⊕ X c = R d and x ∈ X the unique minimizer of Φ over X . Besides that, Assumption 3 also yields that x H n x = 0 iff x ∈ X c . Hence, this also holds for the limit H = lim n→∞ H n and we obtain Moreover, since Assumption 3 yields where x X := P X x and x c := P X c x = x − x X , the marginal of µ n coincides with the marginal of µ 0 on X c . Hence, the Laplace approximation can only converge to µ n if this marginal is Gaussian. We, therefore, consider the special case of Gaussian reference measures µ 0 .
Convergence for Gaussian reference µ 0 . A useful feature of Gaussian reference measures µ 0 is that the Laplace approximation possesses a convenient representation via its density w.r.t. µ 0 .
Proposition 1 (cf. [41, Proposition 1]). Let Assumption 1 be satisfied and µ 0 be Gaussian. Then there holds where T 2 Φ n (·; x n ) denotes the Taylor polynomial of order 2 of Φ n at the point x n ∈ R d .
In fact, the representation (16) does only hold for reference measures µ 0 with Lebesgue density π 0 : R d → R d×d satisfying ∇ 3 log π 0 ≡ 0. Corollary 1. Let Assumption 1 be satisfied and µ 0 be Gaussian. Further, let Assumption 3 hold true and assume that the restriction Φ n : X → R and the marginal density π 0 on X satisfy Assumption 2 on X . Then the approximation result of Theorem 2 holds.
Proof. By using Proposition 1, we can express the Hellinger distance d H (µ n , L µn ) as follows We use now the decomposition We note, that due to Assumption 3, we have that We then obtain by disintegration and denotingΦ n ( where µ 0 (dx X ) denotes the marginal of µ 0 on X . Since Φ n and I n (x X ) = Φ n (x X ) − 1 n log π 0 (x X ), where π 0 (x X ) denotes the Lebesgue density of the marginal µ 0 (dx X ), satisfy the assumptions of Theorem 2 on S 0 ∩ X = X , the statement follows.
We provide some illustrative examples for the theoretical results stated in this subsection.
Example 1 (Divergence of the Laplace approximation in the singular case). We assume a Gaussian prior We plot the Lebesgue densities of the resulting µ n and L µn for n = 128 in the left and middle panel of Figure  1. The red line in both plots indicate the different manifolds around which µ n and L µn , respectively, concentrate as n → ∞. As M Φ = M L , we observe no convergence of the Laplace approximation as n → ∞, see the right panel of Figure 1. Here, the Hellinger distance is computed numerically by applying a tensorized trapezoidal rule on a suffieciently large subdomain of R 2 . Hellinger distance

Convergence of Laplace-Approximation
n -LA n rate: 0.00 Figure 1: Plots of the Lebesgue densities of µ n (left) and L µn (middle) for n = 128 as well as the Hellinger distance between µ n and L µn for Example 1. The red line in the left and middle panel represents the manifold M Φ and M L around which µ n and L µn , respectively, concentrate as n → ∞.
Example 2 (Convergence of the Laplace approximation in the singular case in the setting of Corollary 1). Again, we suppose a Gaussain prior µ 0 = N (0, In the left and middle panel of Figure 2 we present the Lebesgue densities of µ n and its Laplace approximation L µn for n = 25 and by the red line the manifolds M Φ = M L = x + X c . We observe the convergence guaranteed by Corollary 1 in the right panel of Figure 2 where we can also notice a preasymptotic phase with a shortly increasing Hellinger distance. Such a preasmyptotic phase is to be expected due to d H (µ n , L µn ) ∈ O(n −1/2 ) + O(e −nδr n d/2 ) as shown in the proof of Theorem 2.

Robustness of Laplace-based Monte Carlo Methods
In practice, we are often interested in expectations or integrals of quantities of interest f : For example, in Bayesian statistics the posterior mean (f (x) = x) or posterior probabilities (f (x) = 1 A (x), A ∈ B(R d )) are desirable quantities. Since µ n is seldom given in explicit form, numerical integration must be applied for approximating such integrals. To this end, since the reference or prior measure µ 0 is typically a well-known measure for which efficient numerical quadrature methods are available, the integral w.r.t. µ n is rewritten as two integrals w.r.t. µ 0 If then a quadrature rule such as . This might be a good approximation for a finite n ∈ N. However, as soon as n → ∞ the likelihood term exp(−nΦ n (x i )) will deteriorate and this will be reflected by a deteriorating efficiency of the quadrature scheme-not in terms of the convergence rate w.r.t. N , but w.r.t. the constant in the error estimate, as we will display later in examples.
If the Gaussian Laplace approximation L µn of µ n is used as the reference measure for numerical integration instead of µ 0 , we get the following approximation where π n andπ n denote the unnormalized Lebesgue density of µ n and L µn , respectively. This time, we can not only apply well-known quadrature and sampling rules for Gaussian measures, but moreover, we also know due to Lemma 1, that the ratio πn(x) πn(x) converges in mean w.r.t. L µn to 1. Hence, we do not expect a deteriorating efficiency of the numerical integration as n → ∞. On the contrary, as we subsequently discuss for several numerical integration methods, their efficiency for a finite number of samples N ∈ N will even improve as n → ∞ if they are based on the Laplace approximation L µn .
For the sake of simplicity, we consider the simple case of Φ n ≡ Φ + const in the following presentationnonetheless, the presented results can be extended to the general case given appropriate modifications of the assumptions. Thus, we consider probability measures µ n of the form where we assume that Φ satisfies the assumptions of Theorem 1. However, when dealing with the Laplace approximation of µ n and, particularly, with the ratios of the corresponding normalizing constants, it is helpful to use the following representation where ι n := min x∈S 0 Φ(x) − 1 n log π 0 (x) and Z n = e nιn R d e −nΦ(x) π 0 (x) dx. By this construction the resulting I n (x) := Φ n (x)− 1 n log π 0 (x) satisfies I n (x n ) = 0 as required in Assumption 1 for the construction of the Laplace approximation L µn .
Preliminaries. Before we start analyzing numerical methods based on the Laplace approximation as their reference measure, we take a closer look at the details of the asymptotic expansion for integrals provided in Theorem 1 and their implications for expectations w.r.t. µ n given in (21). 1. The coefficients: The proof of Theorem 1 in [42, Section IX.5] provides explicit expressions 2 for the coefficients c k ∈ R in the asymptotic expansion namely-given sufficient regularity of the involved mappings-that with h : Ω → U (x ) being a diffeomorphism between 0 ∈ Ω ⊂ R d and a particular neighborhood U (x ) of x mapping h(0) = x and such that det(∇h(0)) = 1. The diffeomorphism h is specified by the well-known Morse's Lemma and depends only on Φ. Moreover, the constants with ∆ denoting the Laplace operator and e 1 the first unit vector in R d , and more general, 2. The normalization constant of µ n : Theorem 1 implies that Hence, we obtain for the normalizing constant Z n in (21) that If we compare this to the normalizing constantZ n = n −d/2 det(2πC n ) of its Laplace approximation we get We now show that Z ñ Z n ∼ 1 +cn −1 .
This yields (25). 3. The expectation w.r.t. µ n : The expectation of a sufficiently smooth f ∈ L 1 µ 0 (R) w.r.t. µ n is given by and by applying the asymptotic expansion above to both integrals as well as the rule for the division of asymptotic expansions, we obtain that c 2 0 (π 0 ) c 0 (f π 0 ). 4. The variance w.r.t. µ n : The variance of a sufficiently smooth f ∈ L 2 µ 0 (R) w.r.t. µ n is given by We can apply now the previous results about E µn [f ] ∼ f (x ) +c 1 (f, π 0 )n −1 + . . . and E µn f 2 ∼ f 2 (x ) +c 1 (f 2 , π 0 )n −1 + . . . as well as the basic calculus for asymptotic expansions in order to get that A straightforward calculation using the explicit formulas for c 1 (f 2 π 0 ) and c 1 (f π 0 ) yields that hence, since ∇h(0) is regular by construction, the variance Var µn (f ) decays like n −1 provided that ∇f (x ) = 0-otherwise it decays (at least) like n −2 .

Importance Sampling
Importance sampling is a variant of Monte Carlo integration where an integral w.r.t. µ is rewritten as an integral w.r.t. a dominating importance distribution µ ν, i.e., The integral appearing on the righthand side is then approximated by Monte Carlo integration w.r.t. ν: given N independent draws x i , i = 1, . . . , N , according to ν we estimate Often the density or importance weight function w = dµ dν : R d → [0, ∞) is only known up to a normalizing constantw ∝ dµ dν . In this case, we can use self-normalized importance sampling As for Monte Carlo, there holds a strong law of large numbers (SLLN) for self-normalized importance sampling, i.e., where X i ∼ ν are i.i.d. , which follows from the ususal SLLN and the continuous mapping theorem. Moreover, by the classical central limit theorem (CLT) and Slutsky's theorem also a similar statement holds for self-normalized importance sampling: given that Thus, the asymptotic variance σ 2 µ,ν (f ) serves as a measure of efficiency for self-normalized importance sampling. To ensure a finite σ 2 µ,ν (f ) for many functions of interest f , e.g., bounded f , the importance distribution ν has to have heavier tails than µ such that the ratio dµ dν belongs to L 2 ν (R), see also [30,Section 3.3]. Moreover, if we even have dµ dν ∈ L ∞ ν (R) we can bound i.e., the ratio between the asymptotic variance of importance sampling w.r.t. ν and plain Monte Carlo w.r.t. µ can be bounded by the L ∞ ν -or supremum norm of the importance weight dµ dν . For the measures µ n a natural importance distribution (called ν above) which allows for direct sampling are the reference measure µ 0 and the Gaussian Laplace approximation L µn . We study the behaviour of the resulting asymptotic variances σ 2 µn,µ 0 (f ) and σ 2 µn,Lµ n (f ) in the following.
Prior importance sampling. First, we consider µ 0 as importance distribution. For this choice the importance weight function w n := dµn dµ 0 is given by (21). Concerning the bound in (27), we immediately notice, assuming w.l.o.g. that min x Φ(x) = Φ(x ) = 0, that w n L ∞ ∼ Z −1 n e nιn ∼cn d/2 explodes as n → ∞, cf. (24). Of course, that is just the deterioration of an upper bound, but in fact we can prove the following rather negative result.
Lemma 2. Given µ n as in (21) and the assumptions of Theorem 1, we have for sufficiently smooth f ∈ which then yields that Moreover, for simplicity we assume w.l.o.g. that Φ(x ) = 0. We study by analyzing the growth of the numerator and denominator w.r.t. n. Due to the prelimiaries presented above we know that e −2nιn Z 2 n ∼ c 2 0 n −d with c 0 = π 0 (x )/ det(2πH ) > 0. Concerning the numerator we start with decomposing where this time We derive now asymptotic expansions of these terms based on Theorem 1. It is easy to see that the assumptions of Theorem 1 are also fulfilled when considering integrals w.r.t. e −2nΦ . We start with J 1 and obtain due to f (x ) = 0 that where c 1 (f 2 ) ∈ R is the same as c 1 (f 2 π 0 ) in (22) but for 2Φ instead of Φ. Next, we consider J 2 and recall that due to f (x ) = 0 we have E µn [f ] ∼c 1 (f )n −1 where we omit the dependence on π 0 inc 1 (f ) =c 1 (f, π 0 ) for clarity. We obtain where c 1 (f ) ∈ R is again the same as c 1 (f π 0 ) in (22) but for 2Φ instead of Φ. Finally, we take a look at J 3 . To this end, we have with c 0 ∈ R denoting the coefficient c 0 (π 0 ) as in (22) but for 2Φ instead of Φ. Thus, we obtain by the basic calculus of asymptotic expansions that Hence, J 1 has the dominating power w.r.t. n and we have that At this point, we remark that due to the assumption ∇f (x ) = 0 we have c 1 (f 2 ) = 0: we know by (23) that ) and h denotes the diffeomorphism for 2Φ appearing in Morse's lemma and mapping 0 to x ; applying the product formula and using f (x ) = 0 we get that ∆F (x ) = ∆(f 2 (h (x ))); similarly, the chain rule and f (x ) = 0 yields ∆(f 2 (h (x ))) = 2 ∇h (0)∇f (x ) 2 ; since h is a diffeomorphishm ∇h (0) is regular and we get that c 1 (f 2 ) = 0. Thus, the statement follows by and by recalling that Var µn (f ) ∼ cn −1 because of ∇f (x ) = 0.
Thus, Lemma 2 tells us that the asymptotic variance of importance sampling for µ n with the reference µ 0 as importance distribution grows like n d/2−1 as n → ∞ for a large class of integrands. Hence, its efficiency deteriorates like n d/2−1 for d ≥ 3 as the target measures µ n become more concentrated.
Laplace-based importance sampling. We now consider the Laplace approximation L µn as importance distribution which yields the following importance weight function n for x ∈ S 0 . In order to ensure w n ∈ L 2 Lµ n (R) we need that Despite pathological counterexamples a sound requirement for w n ∈ L 2 Lµ n (R) is that for example by assuming that there exist δ, c 1 > 0, c 0 > 0, and n 0 ∈ N such that If the Lebesgue density π 0 of µ 0 is bounded, then (29) is equivalent to the existence of n 0 and ac 0 such that Unfortunately, condition (29) is not enough to ensure a well-behaved asymptotic variance σ 2 µn,Lµ n (f ) as n → ∞, since Although, we know due to (25) thatZ n Zn → 1 as n → ∞, the supremum norm of the importance weight w n of Laplace-based importance sampling will explode exponentially with n if min x R n (x) < 0. This can be sharpened to proving that even the asymptotic variance of Laplace-based importance sampling w.r.t. µ n as in (21) deteriorates exponentially as n → ∞ for many functions f : This means, except when Φ is basically strongly convex, the asymptotic variance of Laplace-based important sampling can explode exponentially or not even exist as n increases. However, in the good case, so to speak, we obtain the following.

Proposition 2.
Consider the measures µ n as in (21), let the assumptions of Theorem 1 be satisfied, and let there exist an n 0 ∈ N such that for all n ≥ n 0 we have Then we have for each f ∈ L 2 µ 0 (f ) Proof. The assumption (30) ensures that R n (x) = I n (x) −Ĩ n (x) ≥ 0 for each x ∈ S 0 . Thus, w n L ∞ =Z n Z n and the assertion follows by (27) and the fact that lim n→∞Z n Zn = 1 due to (25).
Condition (30) is for instance satisfied, if I n is strongly convex with a constant γ ≥ λ min (∇ 2 I n (x n )) where the latter denotes the smallest eigenvalue of the positive definite Hessian ∇ 2 I n (x n ). However, this assumption or even (30) is quite restrictive and, probably, hardly fulfilled for many interesting applications. Moreover, the success in practice of Laplace-based importance sampling is well-documented. How come that despite a possible infinite asymptotic variance Laplace-based importance sampling performs that well? In the following we refine our analysis and exploit the fact that the Laplace approximation concentrates around the minimizer of I n . Hence, with an increasing probability samples drawn from the Laplace approximation are in a small neighborhood of the minimizer. Thus, if I n is, e.g., only locally strongly convex-which the assumptions of Theorem 2 actually imply-then with a high probability the mean squared error might be small. We clarify these arguments in the following and present a positive result for Laplace-based importance sampling under mild assumptions but for a weaker error criterion than the decay of the mean squared error.
First we state a concentration result for N samples drawn from L µn which is an immediate consequence of Proposition 6. Proposition 3. Let N ∈ N be arbitrary and let X (n) i ∼ L µn be i.i.d. where i = 1, . . . , N . Then, for a sequence of radii r n ≥ r 0 n −q > 0, n ∈ N, with q ∈ (0, 1/2) we have Remark 4. In the following we require expectations w.r.t. restrictions of the measures µ n in (21) to shrinking balls B rn (x n ). To this end, we note that the statements of Theorem 1 also hold true for shrinking domains D n = B rn (x ) with r n = r 0 n −q as long as q < 1/2. This can be seen from the proof of Theorem 1 in [42, Section IX.5]. In particular, all coefficients in the asymptotic expansion for Dn f (x) exp(−nΦ(x))dx with sufficiently smooth f are the same as for D f (x) exp(−nΦ(x))dx and the difference between both integrals decays for increasing n like exp(−cn ) for an > 0 and c > 0. Concerning the balls B rn (x n ) with decaying radii r n = r 0 n −q , q ∈ [0, 1/2), we have due to x n − x ∈ O(n −1 )-see Remark 3that B rn/2 (x ) ⊂ B rn (x n ) ⊂ B 2rn (x ) for sufficiently large n. Thus, the facts for µ n as in (21) stated in the preliminaries before Section 3.1 do also apply to the restrictions of µ n to B rn (x n ) with r n = r 0 n −q , q ∈ [0, 1/2). In particular, the difference between E µn [f ] and E µn [f | B rn (x n )] decays faster than any negative power of n as n → ∞.
The next result shows that the mean absolute error of the Laplace-based importance behaves like n −(3q−1) conditional on all N samples belonging to shrinking balls B rn (x n ) with r n ∼ n −q , q ∈ (1/3, 1/2).
of the Laplace-based importance sampling with N samples for a sufficiently smooth f ∈ L 2 µ 0 (R) and B rn (x n ) with radii r n = r 0 n −q with q ∈ (1/3, 1/2) that Proof. We start with  To this end, we write the self-normalizing Laplace-based importance sampling estimator as and recall that w n is as in (28) andw n (x) = exp(−nR n (x)). Notice that Let us denote the event that X The first term in the last line can be bounded by the conditional variance of S n,N given X see Remark 4 and the preliminaries before Subsection 3.1. Thus, and it remains to study if E [ |(H n,N − 1) S n,N | | A n,N ] ∈ O(n −(3q−1) ). Given that X (n) 1 , . . . , X (n) N ∈ B rn (x n ) we can bound the values of the random variable H n,N for sufficiently large n: first, we have that Z n /Z n ∼ 1 +c 1 n −1 +c 2 n −2 + . . . as n → ∞ due to (25) and second Since |R n (x)| ≤ c 3 x − x n 3 for |x − x n | ≤ r n due to the local boundedness of the third derivative of I n and r n = r 0 n −q , we have that where c > 0. Thus, there exist α n ≤ 1 ≤ β n with α n ∼ e −cn 1−3q (1 +c 1 n −1 +c 2 n −2 + . . .) and β n ∼ e cn 1−3q (1 +c 1 n −1 +c 2 n −2 + . . .) such that Since e ±cn 1−3q (1 +c 1 n −1 +c 2 n −2 + . . .) = 1 ± cn 1−3q +c 1 n −1 ± . . . we get that for sufficiently large n there exists ac > 0 such that Hence, n. This concludes the proof.
We now present our main result for the Laplace-based importance sampling which states that the corresponding error decays in probability to zero as n → ∞ and the order of decay is arbitrary close to n −1/2 . Theorem 4. Let the assumptions of Lemma 3 be satisfied. Then, for any sample size N ∈ N the error e n,N (f ) of Laplace-based importance sampling for a sufficiently smooth f ∈ L 2 µ 0 (R) satisfies n δ e n,N (f ) Proof. Let 0 ≤ δ < 1/2 and > 0 be arbitrary. We need to show that lim n→∞ P n δ e n,N (f ) > = 0.
Again, let us denote the event that X N ∈ B rn (x n ) by A n,N for brevity. By Proposition 3 we obtain for radii r n = r 0 n −q with q ∈ (1/3, 1/2) that P n δ e n,N (f ) ≤ ≥ P n δ e n,N (f ) ≤ and X 1 , . . . , X N ∈ B rn (x n ) = P n δ e n,N (f ) ≤ | A n,N P(A n,N ) The second term on the righthand side in the last line obviously tends to 1 exponentially as n → ∞. Thus, it remains to prove that lim n→∞ P n δ e n,N ≤ | X 1 , . . . , X N ∈ B rn (x n ) = 1 To this end, we apply a conditional Markov inequality for the positive random variable e n,N (f ), i.e., where we used Lemma 3. Choosing q ∈ (1/3, 1/2) such that q > 1+δ 3 ∈ [1/3, 1/2) yields the statement.

Quasi-Monte Carlo Integration
We now want to approximate integrals as in (19) w.r.t. measures µ n (dx) ∝ exp(−nΦ(x))µ 0 (dx) as in (21) by Quasi-Monte Carlo methods. These will be used to estimate the ratio Z n /Z n by separately approximating the two integrals Z n and Z n in (19). The preconditioning strategy using the Laplace approximation will be explained exemplarily for Gaussian and uniform priors, two popular choices for Bayesian inverse problems. We start the discussion by first focusing on a uniform prior distribution µ 0 = U([− 1 2 , 1 2 ] d ). The integrals Z n and Z n are then where we set Θ n (x) := exp(−nΦ(x)) for brevity. We consider Quasi-Monte Carlo integration based on shifted Lattice rules: an N -point Lattice rule in the cube [− 1 2 , 1 2 ] d is based on points where z ∈ {1, . . . , N − 1} d denotes the so-called generating vector, ∆ is a uniformly distributed random shift on [− 1 2 , 1 2 ] d and frac denotes the fractional part (component-wise). These randomly shifted points provide unbiased estimators of the two integrals Z n and Z n in (31). Under the assumption that the quantity of interest f : R d → R is linear and bounded, we can focus in the following on the estimation of the normalization constant Z n , the results can be then straightforwardly generalized to the estimation of Z n . For the estimator Z n,QM C we have the following well-known error bound.
The norm Θ n γ in the convergence analysis depends on n, in particular, it can grow polynomially w.r.t. the concentration level n of the measures µ n as we state in the next result. The proof of Proposition 4 is rather technical and can be found in Appendix B.3. We remark that Proposition 4 just tells us that the root mean squared error estimate for QMC integration based on the prior measure explodes like n d/4 . This does in general not indicate that the error itself explodes; in fact the QMC integration error for the normalization constant is bounded by 1 in our setting. Nonetheless, Proposition 4 indicates that a naive Quasi-Monte Carlo integration based on the uniform prior µ 0 is not suitable for highly concentrated target or posterior measures µ n . We subsequently propose and study a Quasi-Monte Carlo integration based on the Laplace approximation L µn .
Laplace-based Quasi-Monte Carlo. To stabilize the numerical integration for concentrated µ n , we propose a preconditioning based on the Laplace approximation, i.e., an affine rescaling according to the mean and covariance of L µn . In the uniform case, the functionals I n are independent of n. The computation of the Laplace approximation requires therefore only one optimization to solve for x n = x = argmin x∈[− 1 In particular, the Laplace approximation of µ n is given by L µn = N (x , 1 n H −1 ) where H denotes the positive definite Hessian ∇ 2 Φ(x ). Hence, H allows for an orthogonal diagonalization H = QDQ with orthogonal matrix Q ∈ R d×d and diagonal matrix D = diag(λ 1 , . . . λ d ) ∈ R d×d , λ 1 ≥ . . . ≥ λ d > 0.
We now use this decomposition in order to construct an affine transformation which reverses the increasing concentration of µ n and yields a QMC approach robust w.r.t. n. This transformation is given by where τ ∈ (0, 1) is a truncation parameter. The idea of the transformation g n is to zoom into the parameter domain and thus, to counter the concentration effect. The domain will then be truncated to G n := The determinant of the Jacobian of g n is given by det(∇g n (x)) ≡ C trans,n = 2| ln τ | n d 2 det(H ) ∼ c τ n −d/2 . We will now explain how the parameter τ effects the truncation error. For given τ ∈ (0, 1), the Laplace approximation is used to determine the truncation effect: Thus, since due to the concentration effect of the Laplace approximation we have S 0 L µn (dx) → 1 exponentially with n, we get thus, the truncation error S 0 \Gn L µn (dx) becomes arbitrarly small for sufficiently small τ 1, since erf(t) → 1 as t → 1. If we apply now QMC integration using shifted Lattice rule in order to compute the integral over [− 1 2 , 1 2 ] d on the righthand side of (34), we obtain the following estimator forẐ n in (34): with x i as in (32). Concerning the norm Θ n • g n γ appearing in the error bound for |Ẑ n −Ẑ n,QM C | we have now the following result.
Proposition 5. Let the assumptions of Theorem 1 be satisfied. Then, for the norm Θ n • g n γ with g n as above there holds Θ n • g n γ ∈ O(1) as n → ∞.
Again, the proof is rather technical and can be found in Appendix B.4. This proposition yields now our main result.  N + d 2 N ) operations, such that for κ ∈ (1/2, 1] Proof. The triangle inequality leads to a separate estimation of the domain truncation error of the integral w.r.t. the posterior and the QMC approximation error, i.e.
The second term on the right hand side corresponds to the QMC approximation error. Thus, Theorem 5 and Proposition 5 imply where the term n − d 2 is due to C trans,n ∼ c λ n − d 2 . The domain truncation error can be estimated similar to the proof of Lemma 1: . The result follows by the proof of Lemma 1.
Remark 5. In the case of Gaussian priors, the transformation simplifies to w = g n (x) = x + n 1 2 QD − 1 2 x due to the unboundedness of the prior support. However, to show an analogous result to Corollary 2, uniform bounds w.r. to n on the norm of the mixed first order derivatives of the preconditioned posterior density Θ n (g n (T −1 x)) in a weighted Sobolev space, where T −1 denotes the inverse cumulative distribution function of the normal distribution, need to be proven. See [22] for more details on the weighted space setting in the Gaussian case. Then, a similar result to Corollary 2 follows straightforwardly from [22,Thm 5.2]. The numerical experiments shown in subsection 3.3 suggest that we can expect a noise robust behavior of Laplacebased QMC methods also in the Gaussian case. This will be subject to future work.
Remark 6. Note that the QMC analysis in Theorem 5 can be extended to an infinite dimensional setting, cp. [22] and the references therein for more details. This opens up the interesting possibility to generalize the above results to the infinite dimensional setting and to develop methods with convergence independent of the number of parameters and independent of the measurement noise. Furthermore, higher order QMC methods can be used for cases with smooth integrands, cp. [12,10,9], leading to higher convergence rates than the first order methods discussed here. In the uniform setting, it has been shown in [36] that the assumptions on the first order derivatives (and also higher order derivatives) of the transformed integrand are satisfied for Bayesian inverse problems related to a class of parametric operator equations, i.e., the proposed approach leads to a robust performance w.r.t. the size of the measurement noise for integrating w.r.t. posterior measure resulting from this class of forward problems. The theoretical analysis of this setting will be subject to future work.
Remark 7 (Numerical Quadrature). Higher regularity of the integrand allows to use higher order methods such as sparse quadrature and higher order QMC methods, leading to faster convergence rates. In the infinite dimensional Bayesian setting with uniform priors, we refer to [34,35] for more details on sparse quadrature for smooth integrands. In the case of uniform priors, the methodology introduced above can be used to bound the quadrature error for the preconditioned integrand by the truncation error and the sparse grid error.

Examples
In this subsection we present two examples illustrating our previous theoretical results for importance sampling and quasi-Monte Carlo integration based on the reference measure µ 0 and the Laplace approximation L µn of the target measure µ n . Both examples are Bayesian inference or inverse problems where the first one uses a toy forwad map and the second one is related to inference for a PDE model.

Algebraic Bayesian inverse problem
We consider inferring x ∈ [− 1 2 , 1 2 ] d for d = 1, 2, 3, 4 based on a uniform prior µ 0 = U([− 1 2 , 1 2 ] d ) and a realisation of the noisy observable of Y = F (X) + ε n where X ∼ µ 0 and the noise ε n ∼ N (0, n −1 Γ d ), We used y = G(0.25 · 1 d ) throughout where 1 d = (1, . . . , 1) ∈ R d . We then compute the posterior expectation of the quantity of interest f (x) = x 1 + · · · + x d . To this end, we employ importance sampling and quasi-Monte Carlo integration based on µ 0 and the Laplace approximation L µn as outlined in the precious subsections. We compare the output of these methods to a reference solution obtained by a brute-force tensor grid trapezoidal rule for integration. In particular, we estimate the root mean squared error (RMSE) of the methods and how it evolves as n increases.
Results for importance sampling. In order to be sufficiently close to the asymptotic limit, we use N = 10 5 samples for self-noramlized importance sampling. We run 1, 000 independent simulations of the importance sampling integration and compute the resulting empirical RMSE. In Figure 3 we present the results for increasing n and various d for prior-based and Laplace-based importance sampling. We obtain a good match to the theoretical results, i.e., the RMSE for choosing the prior measure as importance distribution behaves like n d/4−1/2 in accoradance to Lemma 2. Besides that the RMSE for choosing the Laplace approximation as importance distribution decays like n 1/2 after a preasymptotic phase. This is relates to the statement of Theorem Results for quasi-Monte Carlo. We use N = 2 10 quasi-Monte Carlo points for prior-and Laplace-based QMC. For the Laplace-case we employ a truncation parameter of τ = 10 −6 and discard all transformed points outside of the domain [− 1 2 , 1 2 ] d . Again, we run 1, 000 random shift simulations for both QMC methods and estimate the empirical RMSE. However, for QMC we report the relative RMSE, since, for example, the decay of the normalization constant Z n ∈ O(n −d/2 ) dominates the growth of the absolute error of prior QMC integration for the normalization constant. In Figure 4 and 5 we display the resulting relative RMSE for the quantity related integral Z n , the normalization constant Z n , i.e., and the resulting ratio Z n Zn for increasing n and various d for prior-based and Laplace-based QMC. For priorbased QMC we observe for dimensions d ≥ 2 a algebraic growth of the relative error. In the previous subsection we have proven that the corresponding classical error bound will explode which does not necessary imply that the error itself explodes-as we can see for d = 1. However, this simple example shows that also the error will often grow algebraically with increasing n. For the Laplace-based QMC we observe on the other hand in Figure 5 a decay of the relative empirical RMSE. By Corollary 2 we can expect that the relative errors stay bounded as n → ∞. This provides motivation for a further investigation. In particular, we will analysize the QMC ratio estimator for Z n Zn in a future work.

Bayesian inference for an elliptic PDE
In the following we illustrate the preconditioning ideas from the previous section by Bayesian inference with differential equations. To this end we consider the following model parametric elliptic problem with f (t) = 100 · t, t ∈ [0, 1], and diffusion coefficient  where ψ k (t) = 0.1 k sin(kπt) and the x k ∈ R, k = 1, . . . , d, are to be infered by noisy observations of the solution q at certain points t j ∈ [0, 1]. For d = 1, 2 these observations are taken at t 1 = 0.25 and t 2 = 0.75 and for d = 3 they are taken at t j ∈ {0.125, 0.25, 0.375, 0.6125, 0.75, 0.875}. We suppose an additive Gaussian observational noise ∼ N (0, Γ n ) with noise covariance Γ n = n −1 Γ obs and Γ obs ∈ R 2×2 or Γ obs ∈ R 7×7 , respectively, specified later on. In the following we place a uniform and a Gaussian prior µ 0 on R d and would like to integrate w.r.t. the resulting posterior µ n on R d which is of the form (21) with where F : R d → R 2 for d = 1, 2, and F : R d → R 7 for d = 3, respectively, denotes the mapping from the coefficients x := (x k ) d k=1 to the observations of the solution q of the elliptic problem above and the vector y ∈ R 2 or y ∈ R 7 , respectively, denotes the observational data resulting from y = F (x) + with as above. Our goal is then to compute the posterior expectation (i.e., w.r.t. µ n ) of the following quantity of interest f : R d → R: f (u) is the value of the solution q of the elliptic problem at t = 0.5.
Uniform prior. We place a uniform prior µ 0 = U([− 1 2 , 1 2 ] d ) for d = 1, 2 or 3 and choose Γ obs = 0.01I 2 for d = 1, 2 and Γ obs = 0.01I 7 for d = 3. We display the resulting posteriors µ n for d = 2 in Figure 6 illustrating the concentration effect of the posterior for various values of the noise scaling n and the resulting transformed posterior with Φ • g n based on Laplace approximation. The truncation parameter is set to τ = 10 −6 . We observe the almost quadratic behavior of the preconditioned posterior, as expected from the theoretical results. We are now interested in the numerical performance of the Importance Sampling and QMC method based on the prior distribution compared to the performance of the preconditioned versions based on Laplace approximation. The QMC estimators are constructed by an off-the-shelf generating vector (order-2 randomly shifted weighted Sobolev space), which can be downloaded from https://people.cs.kuleuven. be/˜dirk.nuyens/qmc-generators/ (exod2 base2 m20 CKN.txt). The reference solution used to estimate the error is based on a (tensor grid) trapezoidal rule with 10 6 points in 1D, 4 · 10 6 points in 2D in the original domain, i.e., the truncation error is quantified and in the transformed domain in 3D with 10 6 points. Figure 7 illustrates the robust behavior of the preconditioning strategy w.r.t. the scaling 1/n of the observational noise. Though we know from the theory that in the low dimensional case (1D, 2D), the importance sampling method based on the prior is expected to perform robust, we encounter numerical instabilities due to the finite number of samples used for the experiments. The importance sampling results are based on 10 6 sampling points, the QMC results on 8192 shifted lattice points with 2 6 random shifts. Figure 8 shows the RMSE of the normalization constant Z n using the preconditioned QMC approach with respect to the noise scaling 1/n. We observe a numerical confirmation of the predicted dependence of    We remark that the numerical experiments for the ratio suggest a behavior n −1/2 , i.e., a rate independent of the dimension d, of the RMSE for the preconditioned QMC approach, cp. Figure 7. This will be subject to future work.
Gaussian prior. We choose as prior µ 0 = N (0, I 2 ) for the coefficients x = (x 1 , x 2 ) ∈ R 2 forû 2 in the elliptic problem above. For the noise covariance we set this time Γ obs = I 2 . The performance of the prior based and preconditioned version of Importance Sampling is depicted in Figure 9 (left hand side). Clearly, the Laplace approximation as a preconditioner improves the convergence behavior; we observe a robust behavior w.r.t. the noise level.
The convergence of the QMC approach is depicted in Fig 9 (right hand side), showing a consistent behavior with the considerations in the previous section.  Figure 9: The (estimated) root mean square error (RMSE) of the approximation of the quantity of interest for different noise levels (n = 10 0 , . . . , 10 8 ) using the Importance Sampling strategy (left hand side) and QMC method (right hand side). The first row shows the result based on prior information (Gaussian prior), the second row the results using the Laplace approximation for preconditioning. The reference solution is based on a tensor grid Gauss-Hermite rule with 10 4 points for the preconditioned integrand using the Laplace approximation.

Conclusions and Outlook to Future Work
This paper makes a number of contributions in the development of numerical methods for Bayesian inverse problems, which are robust w.r.t. the size of the measurement noise or the concentration of the posterior measure, respectively. We analyzed the convergence of the Laplace approximation to the posterior distribution in Hellinger distance. This forms the basis for the design of variance robust methods. In particular, we proved that Laplace-based importance sampling behaves well in the small noise or large data size limit, respectively. For uniform priors, Laplace-based QMC methods have been developed with theoretically and numerically proven errors decaying with the noise level or concentration of the measure, repestively. Some future directions of this work include the development of noise robust Markov chain Monte Carlo methods and the combination of dimension independent and noise robust strategies. This will require the study of the Laplace approximation in infinite dimensions in a suitable setting. Finally, we could study in more details the error in the ratio estimator using Laplace-based QMC methods. The use of higher order QMC methods has been proven to be a promising direction for a broad class of Bayesian inverse problems and the design of noise robust versions is an interesting and potentially fruitful research direction. Proposition 6. Let Assumption 1 and Assumption 2 be satisfied. Then, for any r > 0 there exists an c r > 0 such that Proof. Let X n ∼ N 0, n −1 C n . Then L µn (B c r (x n )) = P ( X n > r) .
We now use the well-known concentration of Gaussian measures, namely, where σ 2 n := sup x ≤1 E |x X n | 2 , see [23,Chapter 3]. There holds σ 2 n = λ max (n −1 C n ) = n −1 C n and we get Due to Assumption 2, i.e., C n → H −1 > 0, there exists a finite constant 0 < c such that C n ≥ c for all n ∈ N. Analogously, there exist a constant K < ∞ such that tr (C n ) ≤ K for all n. The latter implies Hence, for an arbitrary r let n 0 be such that E [ X n ] ≤ r/2 for all n ≥ n 0 , which yields

B Proofs
We collect the rather technical proofs in this appendix.

B.1 Proof of Lemma 1
Recall that we want to bound and r > 0 is an at the moment arbitrary radius which will be specified in the following first paragraph.
Bounding J 1 . Due to Φ n , log π 0 ∈ C 3 (S 0 ; R), we have that for any x ∈ B r (x n ) there exists a ξ x,n ∈ B r (x n ) such that Moreover, since x n → x there exists an 0 < r 0 < ∞ such that Hence, the local uniform boundedness of ∇ 3 I n (·) , see Assumption 2, implies the existence of a finite c 3 > 0 such that for sufficiently large n, i.e., n ≥ n r , we have Thus, we obtain, due to | e −t −1| = 1 − e −t ≤ e t −1 for t ≥ 0, which yields Now, since C −1 n → H > 0, there exists for sufficiently large n a γ > 0 such that Hence, for x ∈ B r (x n ), i.e., x − x n ≤ r, we get By choosing r := γ 2c 3 we obtain further Let us now introduce the auxiliary Gaussian measure ν n := N 0, 1 nγ I with which we get There holds now due to the continuity of the determinant and H > 0. Moreover, since 1 − e −t ≤ t for t ≥ 0 we obtain with ξ ∼ N (0, I) that This yields J 1 (n) ∈ O(n −p/2 ) for the particular choice r = γ 2c 3 . In the following two paragraphs we will use exactly this particular radius.

This yields immediately in combination with
. For x ∈ M n , on the other hand, there holds

This yields
again due toZ n ∈ O(n −d/2 ) and the boundedness of (π 0 (x n )) n∈N . In summary, we get

B.2 Proof of Theorem 3
A straightforward calculation, see also [28,Exercise 1.14], yields that for Gaussian measures N a,Q := N (a, Q), N b,Q := N (b, Q) and N a,R := N (a, R) we have , By (12) we obtain forμ n := N (x n , 1 n B n ) that Due to the local Lipschitz continuity of the determinant and C −1 n − H → 0 we obtain the first statement forμ n := N (x n , 1 n B n ). Furthermore, by the triangle inequality and exploiting the local Lipschitz continuity of f (x) = 1 x and of the determinant, we obtain n with a generic constant c > 0. Moreover, due to the local Lipschitz continuity of the square root of a matrix, we get where we used that C The second item follows by applying the triangle inequality, expressing the Hellinger distance between N (x n , 1 n B n ) and N (a n , 1 n B n ) by d H (N an,n −1 Bn , N xn,n −1 Bn ) = √ 2 1 − exp n 8 x n − a n 2 B −1 n and estimating 1 − exp n 8 x n − a n 2 B −1 n ≤ n 8 x n − a n 2 B −1 n ≤ cn x n − a n 2 ∈ O(n −1 ), where we used the fact that the spd matrices B n converge to the spd matrix H , hence, the sequence of the smallest eigenvalue of B n is bounded away from zero.
By the application of the multivariate Faa di Bruno-formula to Θ n (x) = exp(−nΦ(x)) = u(v(x)) with u(t) = exp(−nt), v(x) = Φ(x), we obtain for ν ⊂ {1, . . . , d} that By setting x −ν := x {1,...,d}\ν we get where we shortened Π d := Π({1, . . . , d}). We will now investigate, how-i.e., to which power w.r.t. n- decays as n → ∞. To this end, we write and apply in the following Laplace's method in order to derive the asymptotics of We can extend this reasoning to arbitrary partitions P,P ∈ Π d . To this end, let |P | 1 := |{B ∈ P : |B| = 1}| denote the number of single blocks |B| = 1 in P ∈ Π d . Then, we know by the definition of h P,P that h P,P posseses a zero of order |P | 1 + |P | 1 in x . This in turn, implies that the first |P |