The forward–backward envelope for sampling with the overdamped Langevin algorithm

In this paper, we analyse a proximal method based on forward–backward splitting for sampling from distributions whose densities are not necessarily smooth. In particular, we study the non-asymptotic properties of the Euler–Maruyama discretization of the Langevin equation, where the forward–backward envelope is used to deal with the non-smooth part of the dynamics. An advantage of this envelope, compared to the widely used Moreau–Yosida envelope and the MYULA algorithm, is that it preserves the MAP estimator of the original non-smooth distribution. We also present a number of numerical experiments that support our theoretical findings.


Introduction
The problem of calculating expectations with respect to a probability distribution p on R^d is ubiquitous throughout applied mathematics, statistics, molecular dynamics, statistical physics and other fields. In practice, d is often large, which renders deterministic techniques, such as quadrature methods, computationally intractable. In contrast, probabilistic methods do not suffer from the curse of dimensionality and are often the method of choice when the dimension d is large. In particular, Markov chain Monte Carlo (MCMC) methods are based on the construction of a Markov chain in R^m with m ≥ d, for which the invariant distribution (or a suitable marginal thereof) coincides with the target distribution p [1].
Often, such Markov chains are based on the discretization of stochastic differential equations (SDEs). One such SDE, which is also the focus of this paper, is the (overdamped) Langevin equation

dX_t = −∇f(X_t) dt + √2 dW_t, (1)

where {W_t}_t is the standard d-dimensional Brownian motion and ∇f denotes the gradient of a continuously differentiable function f : R^d → R. Under mild assumptions on f, the dynamics of (1) are ergodic with respect to the distribution p ∝ e^{−f}. In particular, p is the invariant distribution of (1) [2]. The discretization of (1), however, requires special care, since the resulting discrete Markov chain might not be ergodic [3]. In addition, even when ergodic, the resulting discrete Markov chain often has an invariant distribution different from p, known as the numerical invariant distribution. The study of the asymptotic error between the numerical invariant distribution and the target distribution p has received considerable attention recently [4,5]. In particular, [4] investigated the effect of discretization on the convergence of the ergodic averages, and [5] presented general order conditions to ensure that the numerical invariant distribution accurately approximates the target distribution.
Another active line of research quantifies the non-asymptotic error between the numerical invariant distribution and the target distribution p. In particular, when p is a smooth and strongly log-concave distribution, [6] established non-asymptotic bounds in total variation distance for the Euler–Maruyama discretization of (1), commonly known as the unadjusted Langevin algorithm (ULA). These results have also been extended to the Wasserstein distance W_2 in [7,8,9,10,11], to name a few. Typically, these works study the number of iterations that the numerical integrator requires to achieve a desired accuracy, when applied to a target distribution p with a known condition number.
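To make the discussion concrete, here is a minimal sketch of the Euler–Maruyama discretization of (1), i.e., the ULA, for a toy smooth target. The function names and the Gaussian example are our own illustration, not part of the referenced works.

```python
import numpy as np

def ula(grad_f, x0, h, n_steps, rng):
    """Unadjusted Langevin algorithm (ULA): Euler-Maruyama scheme for
    dX_t = -grad f(X_t) dt + sqrt(2) dW_t, whose invariant law is p ∝ exp(-f)."""
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x - h * grad_f(x) + np.sqrt(2.0 * h) * noise
        samples[k] = x
    return samples

# Toy example: f(x) = ||x||^2 / 2, so the target p is the standard Gaussian.
rng = np.random.default_rng(0)
s = ula(grad_f=lambda x: x, x0=np.zeros(2), h=0.01, n_steps=20000, rng=rng)
```

For a fixed step size h the chain above is ergodic, but its invariant law is only a perturbation of p that vanishes as h → 0, which is precisely the numerical bias discussed in this section.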
In fact, the strong log-concavity of p above can be substantially relaxed. In particular, using a variant of the reflection coupling, the recent work [12] derived non-asymptotic bounds for the ULA in the Wasserstein distance W_1 when p is strictly log-concave outside of a ball in R^d. Similar results for the Wasserstein distance W_2 have also been presented in [13].
Within the class of log-concave distributions, a significant challenge for the Langevin diffusion in (1) arises when the target distribution p is not smooth and/or has a compact (convex) support in R^d. One approach to address this challenge is to replace the non-smooth distribution p with a smooth proxy obtained via the so-called Moreau–Yosida (MY) envelope. This new smooth density remains log-concave and is, hence, amenable to the non-asymptotic results discussed earlier. When the support of p is also compact, proximal Monte Carlo methods have been explored in [14,15,16]. It is also worth noting that [17] pursued a different approach for sampling from compactly-supported densities that does not involve the MY envelope.
A potential drawback of the above approach is that the MY envelope often does not preserve the maximum a posteriori (MAP) estimator. That is, the above approach alters the location at which the new (smooth) density attains its maximum. This is a well-known issue in the context of (non-smooth) convex optimization and is often resolved by appealing to the proximal gradient method. The latter can be understood as the Euler discretization of the so-called forward–backward (FB) envelope [18].
Contributions. This work explores and analyzes the use of the FB envelope for sampling from non-smooth and compactly-supported log-concave distributions. In analogy with proximal Langevin Monte Carlo, we replace the non-smooth density with a smooth proxy obtained via the FB envelope. In particular, this proxy is strongly log-concave over long distances.
Crucially, the new proxy also preserves the MAP estimator, under certain assumptions. However, this improvement comes at the cost of requiring additional smoothness for the smooth part of the density. Lastly, the strong convexity of the new proxy over long distances allows us to utilise the work of [12] to obtain non-asymptotic guarantees for our method in the Wasserstein distance W_1.
In addition to investigating the use of the FB envelope in sampling, this work makes the following contributions:
• It introduces a general theoretical framework for sampling from non-smooth densities by introducing the notion of admissible envelopes. The MY and FB envelopes are both instances of admissible envelopes.
• It proposes a new Langevin algorithm to sample from non-smooth densities, dubbed EULA, which generalizes MYULA. EULA can work with any admissible envelope (e.g., MY or FB) and can handle a family of increasingly more accurate envelopes rather than a fixed envelope.
Organization. The rest of the paper is organised as follows. Section 2 formalizes the problem of sampling from a non-smooth and compactly-supported log-concave distribution. As a proxy for this non-smooth distribution, its (smooth) Moreau–Yosida (MY) envelope is reviewed in Section 3. This section also explains the main limitation of the MY envelope, namely its inaccurate MAP estimation.
In Section 4, we introduce the forward–backward (FB) envelope, which overcomes this key shortcoming of the MY envelope. Section 5 introduces and analyses EULA, an extension of the popular ULA for sampling from a non-smooth distribution. EULA can be adapted to various envelopes. In particular, MYULA from [14] is a special case of EULA for the MY envelope. Section 6 proves the iteration complexity of EULA and Section 7 presents a few numerical examples to support the theory developed here.

Statement of the Problem
Consider a compact convex set K ⊂ R^d. For a pair of functions f̄ : R^d → R and ḡ : R^d → R, our objective in this work is to sample from the probability distribution

p(x) := e^{−f̄(x)−ḡ(x)} / ∫_K e^{−f̄(y)−ḡ(y)} dy for x ∈ K, and p(x) := 0 otherwise, (2)

whenever the ratio above is well-defined. In order to sample from p, we only have access to the gradient of f̄ and the proximal operator for ḡ, to be defined later. Our assumptions on K, f̄, ḡ are detailed below.
Assumption 2.1. We make the following assumptions:
(i) Assume that K ⊂ R^d is a compact convex set, enclosed in the ball B(0, R) for some R > 0. Here, B(0, r) is the Euclidean ball of radius r centered at the origin.
(ii) Assume also that f̄ : R^d → R is a convex function that is three-times continuously differentiable.
(iii) Assume lastly that ḡ : R^d → R is a convex function. Moreover, we assume that ḡ is continuous.
A few important remarks about Assumption 2.1 are in order. First, in the special case when f̄ is a convex quadratic [20], the assumption of thrice-differentiability above is trivially met and some of the developments below are simplified. However, our more general setup necessitates the thrice-differentiability above and results in more involved technical derivations below.
Second, instead of the two functions f̄, ḡ, it will be more convenient to work with two new functions f, g, without any loss of generality. More specifically, consider a convex function f that coincides with f̄ on the set K, has a compact support and a continuously differentiable Hessian.
For this function f, the compactness of K and the smoothness of f̄ in Assumption 2.1 together imply that f, ∇f, ∇²f are all Lipschitz-continuous functions. To summarize, for the function f described above, there exist nonnegative constants λ_0, λ_1, λ_2, λ_3 such that

|f(x)| ≤ λ_0, (3a)
‖∇f(x)‖_2 ≤ λ_1, (3b)
‖∇f(x) − ∇f(y)‖_2 ≤ λ_2 ‖x − y‖_2, (3c)
‖∇²f(x) − ∇²f(y)‖ ≤ λ_3 ‖x − y‖_2, (3d)

for every x, y ∈ R^d. Let us also define the proper closed convex function

g := ḡ + 1_K, (4)

where 1_K is the indicator function for the set K. That is, 1_K(x) = 0 if x ∈ K and 1_K(x) = ∞ otherwise. The compactness of K and the continuity of ḡ in Assumption 2.1 together imply that g is finite when its domain is restricted to the set K. Outside of the set K, g is infinite. To summarize, the new function g is lower semi-continuous and also satisfies

max_{x∈K} |g(x)| < ∞, and g(x) = ∞ for every x ∉ K. (5)

We can now revisit (2) and, using the new functions f, g, rewrite the definition of p as

p(x) := e^{−F(x)} / ∫ e^{−F(y)} dy, (6a)
F := f + g = f̄ + ḡ + 1_K. (6b)

Above, F is often referred to as the potential associated with p. The last identity above holds by construction. Indeed, on the set K, the functions f and f̄ coincide. Likewise, on the set K, the functions g and ḡ coincide. On the other hand, outside of the set K, both sides of the last equality above are infinite.
In view of (6b), we will often use f and f̄ interchangeably throughout this work, depending on the context. Likewise, we will use g and ḡ interchangeably. Note also that the integral in the denominator above is finite by Assumption 2.1. When there is no confusion, we will overload our notation and use p to also denote the probability measure associated with the density p.
Since g is not differentiable, F in (6b) is itself non-differentiable. In turn, this means that one cannot use gradient-based algorithms such as ULA to sample from p ∝ e^{−F} [10]. One way to deal with this issue is to replace F with a smooth function F_γ, which we will refer to as an envelope, to which we can then apply ULA. It is reasonable to require this envelope F_γ to fulfill the following admissibility assumptions.

Definition 2.2 (Admissible envelopes). For γ_0 > 0, the functions {F_γ : γ ∈ (0, γ_0)} are admissible envelopes of F if:
(i) there exists a function F_0 : R^d → R with ∫ e^{−F_0(x)} dx < ∞ such that F_γ ≥ F_0 for every γ ∈ (0, γ_0);
(ii) F_γ converges pointwise to F, i.e., lim_{γ→0} F_γ(x) = F(x) for every x ∈ R^d;
(iii) F_γ is λ_γ-smooth, i.e., there exists a constant λ_γ ≥ 0 such that ‖∇F_γ(x) − ∇F_γ(y)‖_2 ≤ λ_γ ‖x − y‖_2 for every x, y ∈ R^d.

If {F_γ : γ ∈ (0, γ_0)} are admissible envelopes of F, we can define the corresponding probability densities

p_γ ∝ e^{−F_γ}, γ ∈ (0, γ_0). (7)

Remark 2.3. In the definition above, property (i) implies that p_γ can be normalized. This observation, combined with property (ii), implies after an application of the dominated convergence theorem that

lim_{γ→0} ∫ φ dp_γ = ∫ φ dp for every bounded continuous function φ : R^d → R,

where the probability measures p and p_γ are defined in (6a) and (7), respectively. That is, p_γ converges weakly to p in the limit of γ → 0. (For completeness, the proof of this claim is included in the appendix.) In other words, we can use p_γ as a proxy for p, provided that γ is sufficiently small. Finally, as we will see shortly, property (iii) guarantees the convergence of the ULA to an invariant distribution close to p_γ, provided that the step size of the ULA is small.

Moreau-Yosida Envelope and Its Limitation
For γ > 0, let us define

F^MY_γ := f + g_γ, (MY)

where

g_γ(x) := min_z { g(z) + ‖x − z‖²_2 / (2γ) } (8)

is the Moreau–Yosida (MY) envelope of g. Somewhat inaccurately, we will also refer to F^MY_γ as the MY envelope of F, to distinguish F^MY_γ from its newer alternatives. It is well-known that g_γ is γ^{−1}-smooth and that g_γ converges pointwise to g in the limit of γ → 0. These facts enable us to establish the admissibility of MY envelopes, as detailed below. All proofs are deferred to the appendices. We note that the result below closely relates to [16, Proposition 1].

Proposition 3.1 (Admissibility of MY envelopes). Suppose that Assumption 2.1 is fulfilled. Then {F^MY_γ : γ > 0} are admissible envelopes of F in (6b). In particular, ∇F^MY_γ is (λ_2 + γ^{−1})-Lipschitz continuous and given by the expression

∇F^MY_γ(x) = ∇f(x) + (x − P_{γg}(x))/γ ∈ ∇f(x) + ∂g(P_{γg}(x)).

Above, λ_2 was defined in (3c), P_{γg} : R^d → R^d is the proximal operator associated with the function γg, namely

P_{γg}(x) := arg min_z { γ g(z) + ‖x − z‖²_2 / 2 }, (9)

and ∂g(z) is the subdifferential of g at z [21].
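As a concrete 1-D illustration (our own toy example, assuming the standard formulas (8) and (9) above), the MY envelope of g(x) = |x| can be computed through its proximal operator, the soft-thresholding map, and recovers the familiar Huber function:

```python
import numpy as np

def prox_abs(x, gamma):
    # Proximal operator of gamma * |.|: soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def my_envelope(g, prox_g, x, gamma):
    # g_gamma(x) = g(P_{gamma g}(x)) + (x - P_{gamma g}(x))^2 / (2 gamma)
    z = prox_g(x, gamma)
    return g(z) + (x - z) ** 2 / (2.0 * gamma)

# The MY envelope of |.| with gamma = 1 is the Huber function.
x = np.linspace(-3.0, 3.0, 7)
env = my_envelope(np.abs, prox_abs, x, gamma=1.0)
huber = np.where(np.abs(x) <= 1.0, x ** 2 / 2.0, np.abs(x) - 0.5)
```

As Proposition 3.1 indicates for the general case, this envelope is γ^{−1}-smooth and converges pointwise to |x| as γ → 0.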
Remark 3.2 (Connection to Nesterov's smoothing technique). Alternatively, we can also view the MY envelope through the lens of Nesterov's smoothing technique [22]. More specifically, if Assumption 2.1 is fulfilled, one can invoke a standard minimax theorem to verify that

F^MY_γ(x) = f(x) + max_z { ⟨x, z⟩ − g*(z) − γ‖z‖²_2 / 2 },

where g* is the Fenchel conjugate of g. The right-hand side above plays a key role in Nesterov's technique for minimizing the non-smooth function F in (6b).
In view of the admissibility of F^MY_γ by Proposition 3.1, applying ULA to the new potential F^MY_γ leads to a well-defined algorithm, see Remark 2.3. In addition, if γ is sufficiently small, p^MY_γ ∝ e^{−F^MY_γ} is close to the target distribution p by Remark 2.3. This technique is known as MYULA [14]. However, a limitation of the MY envelope is that the minimizers of F^MY_γ are not necessarily the same as the minimizers of F. In turn, the MAP estimator of p^MY_γ, denoted by x^MY_γ, might not coincide with the MAP estimator of p, except in the limit of γ → 0. That is, in general,

x^MY_γ = arg min F^MY_γ ≠ arg min F for γ > 0.

This observation is particularly problematic because, as we will see later, very small values of γ are often avoided in practice due to numerical stability issues.
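The MAP shift is easy to see in a 1-D toy problem of our own (not taken from the paper): take f(x) = x²/2 and g = 1_K with K = [1, 2], so the true MAP is x* = 1, while the MY proxy F^MY_γ(x) = x²/2 + dist(x, K)²/(2γ) has its minimizer at 1/(1 + γ), strictly outside the support:

```python
import numpy as np

gamma = 0.1
x = np.linspace(-1.0, 3.0, 400001)          # fine grid on [-1, 3]
f = x ** 2 / 2.0                            # smooth part
dist_K = np.maximum(np.maximum(1.0 - x, x - 2.0), 0.0)   # dist(x, [1, 2])
F_my = f + dist_K ** 2 / (2.0 * gamma)      # MY proxy of F = f + 1_K
x_map_my = x[np.argmin(F_my)]
# True MAP of p ∝ exp(-x^2/2) 1_{[1,2]} is x* = 1, but the MY proxy peaks
# at 1/(1 + gamma) ≈ 0.909, i.e., outside the support K.
```

Taking γ smaller moves the proxy mode back toward x* = 1, which is exactly the limit γ → 0 discussed above.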
In view of this discussion, our objective is to replace the MY envelope with a new envelope that has the same minimizers as F for all sufficiently small γ, and not just in the limit of γ → 0.

Forward-Backward Envelope
In this section, we study an envelope that addresses the limitations of the MY envelope. More specifically, for γ > 0, let us recall from [18] that the forward–backward (FB) envelope of the function F in (6b) is defined as

F^FB_γ(x) := f(x) − (γ/2)‖∇f(x)‖²_2 + g_γ(x − γ∇f(x)), (FB)

where g_γ was defined in (8). A number of useful properties of F^FB_γ are collected below for the convenience of the reader [18]. Recall that P_{γg} denotes the proximal operator associated with the function γg in (9).

Proposition 4.1 (Properties of the FB envelope). Suppose that Assumption 2.1 is fulfilled. For γ ∈ (0, 1/λ_2) and every x ∈ R^d, it holds that
(i) F(P_{γg}(x − γ∇f(x))) ≤ F^FB_γ(x) ≤ F(x), which relates the function F to its FB envelope.
(ii) F^FB_γ(x) = min_z { f(x) + ⟨∇f(x), z − x⟩ + g(z) + ‖z − x‖²_2 / (2γ) }; combined with the convexity and λ_2-smoothness of f, this yields M_{γ/(1−γλ_2)}F(x) ≤ F^FB_γ(x) ≤ M_γF(x), where M_γF(x) := min_z { F(z) + ‖z − x‖²_2 / (2γ) }, which relates the MY and FB envelopes of the function F.
(iii) F^FB_γ is continuously differentiable and its gradient is given by

∇F^FB_γ(x) = γ^{−1}(I_d − γ∇²f(x))(x − P_{γg}(x − γ∇f(x))).

(iv) arg min F^FB_γ = arg min F, i.e., the function F and its FB envelope have the same minimizers.
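Continuing the 1-D toy example from Section 3 (f(x) = x²/2, g = 1_{[1,2]}; our own illustration), and assuming the standard FB-envelope formula F^FB_γ(x) = f(x) − (γ/2)|f′(x)|² + g_γ(x − γf′(x)) from [18], a direct grid search confirms Proposition 4.1(iv): the minimizer of the FB envelope is exactly the MAP x* = 1 for γ < 1/λ_2 = 1:

```python
import numpy as np

gamma = 0.1                                  # requires gamma < 1/lambda_2 = 1
x = np.linspace(-1.0, 3.0, 400001)

def dist_K(y):
    # Euclidean distance to K = [1, 2].
    return np.maximum(np.maximum(1.0 - y, y - 2.0), 0.0)

f = x ** 2 / 2.0                             # smooth part, f'(x) = x
gf = x                                       # gradient of f
g_gamma = lambda y: dist_K(y) ** 2 / (2.0 * gamma)   # MY envelope of 1_K

# FB envelope of F = f + 1_K (formula assumed from [18]):
F_fb = f - (gamma / 2.0) * gf ** 2 + g_gamma(x - gamma * gf)
x_map_fb = x[np.argmin(F_fb)]
# The minimizer coincides with the true MAP x* = 1, unlike the MY proxy.
```

This is the behaviour for a fixed γ > 0, not merely in the limit γ → 0, which is the key advantage discussed next.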
In view of Proposition 4.1(iv), a remarkable property of the FB envelope is that the modes of p^FB_γ ∝ e^{−F^FB_γ} coincide with the modes of the target measure p ∝ e^{−F} for all sufficiently small γ, rather than only in the limit of γ → 0. Indeed, very small values of γ are often avoided in practice due to numerical stability issues. This observation signifies the advantage of the FB envelope over the MY envelope. Recall that the modes of the MY envelope coincide with those of p only in the limit of γ → 0, see Section 3. As a side note, let us remark that the proximal gradient descent algorithm for minimizing the (non-smooth) function F coincides with gradient descent (with variable metric) for minimizing the (smooth) function F^FB_γ, whenever γ is sufficiently small [18]. It is also easy to use Proposition 4.1 to check the admissibility of the FB envelopes, as summarized below.

Proposition 4.2 (Admissibility of FB envelopes). Suppose that Assumption 2.1 is fulfilled. Then {F^FB_γ : γ ∈ (0, γ_FB)} are admissible envelopes of F in (6b), where the constant γ_FB ∈ (0, 1/λ_2] is specified in (10). Moreover, it holds that

‖∇F^FB_γ(x) − ∇F^FB_γ(y)‖_2 ≤ λ^FB_γ ‖x − y‖_2, (11a)
⟨∇F^FB_γ(x) − ∇F^FB_γ(y), x − y⟩ ≥ μ^FB_γ ‖x − y‖²_2 − ρ^FB_γ, (11b)

for every x, y ∈ R^d, where the constants λ^FB_γ, μ^FB_γ, ρ^FB_γ are specified in the proof.
Equation (11) provides valuable information about the landscape of the FB envelope of F, which we now summarize. First, (11a) means that F^FB_γ is a λ^FB_γ-smooth function. The smoothness of F^FB_γ in (FB) is not surprising, since both f and g_γ are smooth functions. (Recall that g_γ is the MY envelope of g, which is known to be γ^{−1}-smooth.) Moreover, even though F^FB_γ is not necessarily a strongly convex function, (11b) implies that F^FB_γ behaves like a strongly convex function over long distances. As detailed in the proof, (11b) holds essentially because the MY envelope of the indicator function 1_K is the function dist(·, K)²/(2γ), which grows quadratically far away from the origin. Here, dist(·, K) is the distance to the set K.
It is worth noting that a result similar to Proposition 4.2 is implicit in [14]. That is, the MY envelope F^MY_γ also satisfies (11), albeit with different constants.
Remark 4.3 (Convergence in the Wasserstein metric). Recall from Remark 2.3 that p^FB_γ converges weakly to p in the limit of γ → 0. This weak convergence implies convergence in the Wasserstein metric by [23, Lemma 2.6]:

lim_{γ→0} W_1(p^FB_γ, p) = 0. (12)

We recall that, for two probability measures q_1 and q_2 that satisfy E_{x∼q_1}‖x‖_2 < ∞ and E_{y∼q_2}‖y‖_2 < ∞, their 1-Wasserstein or Kantorovich distance [24] is defined as

W_1(q_1, q_2) := inf_π E_{(x,y)∼π}‖x − y‖_2, (13)

where the infimum is over all couplings π of q_1 and q_2. With some abuse of notation, throughout this work, we will occasionally replace the probability measures with the corresponding probability densities or random variables.
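In one dimension, the W_1 distance between two empirical measures with equally many atoms has a closed form via order statistics (a standard fact about optimal transport on the line, not specific to this paper), which is handy for checking sampling experiments numerically:

```python
import numpy as np

def w1_empirical(xs, ys):
    """W_1 between two 1-D empirical measures with equally many atoms:
    the optimal coupling matches sorted samples (monotone rearrangement)."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000) + 1.0  # same law, shifted by 1
# For a pure location shift, W_1 equals the size of the shift.
```

For higher-dimensional samples one would instead use a dedicated optimal-transport solver; the 1-D shortcut suffices for the marginal comparisons in the experiments of Section 7.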
A non-asymptotic version of Remark 4.3 is presented below, which bounds the Wasserstein distance between the two probability measures p^FB_γ and p. In effect, the result below is the analogue, for the FB envelope, of [14, Proposition 5] for the MY envelope.
The key ingredient of their result is Steiner's formula for the volume of the set K + B(0, t) = {x : dist(x, K) ≤ t} for every t ≥ 0, where the sum is in the Minkowski sense. Essentially, our proof strategy is to use Proposition 4.1(ii) to relate the FB and MY envelopes and then invoke [14, Proposition 5].
Theorem 4.4 (Wasserstein distance between p^FB_γ and p). Suppose that Assumption 2.1 is fulfilled. For γ ∈ (0, γ_FB), the distance W_1(p^FB_γ, p) admits the bound in (14), where the quantities c_1, …, c^FB_5 and I_1, I_2 therein are specified in (15). Above, vol_i(K) is the i-th intrinsic volume of K, see [25]. In particular, the d-th intrinsic volume of K coincides with the standard volume of K, i.e., vol_d(K) = vol(K). Moreover, to keep the notation light, we have suppressed the dependence of c_1 to c^FB_5 on K, f, g, γ.
As a sanity check, consider the special case of g = 1_K, where 1_K is the indicator function for the set K. Then we can use (15) to verify that c_1, I_1(γ), I_2(γ) and I_2(γ/(1 − γλ_2)) all vanish as γ → 0. Consequently, both the left- and right-hand sides of (14) vanish if we send γ → 0. When g = 1_K, Theorem 4.4 is thus precisely the analogue of [14, Proposition 5]. Their work, however, does not cover the case of g ≠ 1_K.
In our result, when g ≠ 1_K and γ → 0, the right-hand side of (14) converges to the nonzero value e^{max_{x∈K} g(x)}, unlike the left-hand side of (14), which converges to zero by (12). Improving (14) in the case g ≠ 1_K appears to require highly restrictive assumptions on g, which we wish to avoid here. Moreover, in practice, very small values of γ are often avoided due to numerical stability issues. In this sense, improving (14) for very small values of γ might have limited practical value.
To summarize this section, the FB envelopes {F^FB_γ : γ ∈ (0, γ_FB)} are admissible and we can use them as a differentiable proxy for the non-smooth function F in (6b). Crucially, the FB envelope addresses the key limitation of the MY envelope: the modes of p^FB_γ ∝ e^{−F^FB_γ} coincide with the modes of p ∝ e^{−F} for all sufficiently small γ, rather than only in the limit of γ → 0.

EULA: Envelope Unadjusted Langevin Algorithm
We have so far introduced two smooth envelopes for the non-smooth function F in (6b), namely, the MY envelope F^MY_γ in (MY) and the FB envelope F^FB_γ in (FB). We also described in Section 4 the advantage of the FB envelope over the MY envelope. To keep our discussion general, below we consider admissible envelopes {F_γ : γ ∈ (0, γ_0)} for the target function F in (6b), see Definition 2.2.
Our discussion below can be specialized to either of the envelopes by setting F_γ = F^MY_γ or F_γ = F^FB_γ. For the time being, let us fix γ ∈ (0, γ_0). Unlike F, note that ∇F_γ exists and is Lipschitz continuous by Definition 2.2(iii). We can now use the ULA [10] to sample from p_γ ∝ e^{−F_γ}, as a proxy for the target measure p ∝ e^{−F}. The k-th iteration of the resulting algorithm is

x_{k+1} = x_k − h ∇F_γ(x_k) + √(2h) ζ_{k+1}, (16)

where h is the step size and ζ_{k+1} ∈ R^d is a standard Gaussian random vector, independent of {ζ_i}_{i≤k}. In particular, if we choose F_γ = F^MY_γ, then (16) coincides with MYULA from [14].
Under standard assumptions, to be reviewed later, the Markov chain {x_k}_{k≥0} in (16) has a unique invariant probability measure, which we denote by p_{γ,h}. There are two sources of error that contribute to the difference between p_{γ,h} and the target measure p in (6a), which we list below:
1. First, note that (16) is only intended to sample from p_γ, as a proxy for the target distribution p. That is, the first source of error is the difference between the two probability measures p_γ and p.
2. Second, the step size h is known to contribute to the difference between the two probability measures p_{γ,h} and p_γ, see [10]. This bias vanishes only in the limit of h → 0.
In fact, instead of (16), we study here a slightly more general algorithm that allows γ and h to vary. More specifically, for a nonincreasing sequence {γ_k}_{k≥0} and step sizes {h_k}_{k≥0}, the k-th iteration of this more general algorithm is

x_{k+1} = x_k − h_k ∇F_{γ_k}(x_k) + √(2h_k) ζ_{k+1}, (EULA)

where ζ_{k+1} ∈ R^d is a standard Gaussian random vector independent of {ζ_i}_{i≤k}.
In particular, if we set γ_k = γ in (EULA) for every k ≥ 0, then we retrieve (16). Alternatively, if {γ_k}_{k≥0} is a decreasing sequence, then F_{γ_k} becomes an increasingly better approximation of the target potential F as k increases, see Definition 2.2(ii). That is, (EULA) uses increasingly better approximations of the potential function F as k increases.
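A minimal sketch of the (EULA) iteration, in our own notation (the helper `grad_F_gamma` stands for the gradient of whichever admissible envelope is chosen; the Laplace-type example using the MY envelope of |x| is purely illustrative):

```python
import numpy as np

def eula(grad_F_gamma, gammas, steps, x0, rng):
    """EULA sketch: x_{k+1} = x_k - h_k * grad F_{gamma_k}(x_k)
                             + sqrt(2 h_k) * zeta_{k+1}."""
    x = np.asarray(x0, dtype=float)
    out = np.empty((len(steps), x.size))
    for k, (g, h) in enumerate(zip(gammas, steps)):
        x = x - h * grad_F_gamma(x, g) + np.sqrt(2.0 * h) * rng.standard_normal(x.size)
        out[k] = x
    return out

def grad_my_abs(x, gamma):
    # Gradient of the MY envelope of g(x) = |x| (target p ∝ exp(-|x|)):
    # (x - soft_threshold(x, gamma)) / gamma.
    return (x - np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)) / gamma

rng = np.random.default_rng(1)
n = 50_000
samples = eula(grad_my_abs, gammas=np.full(n, 0.1), steps=np.full(n, 0.01),
               x0=np.zeros(1), rng=rng)
```

Setting all entries of `gammas` equal recovers (16), while a decreasing schedule makes the chain track increasingly accurate envelopes of F, as described above.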
We next present the iteration complexity of (EULA) for admissible envelopes {F_γ : γ ∈ (0, γ_0)}, where admissibility was defined in Definition 2.2. The result below can be specialized to both MY and FB envelopes by setting F_γ = F^MY_γ or F_γ = F^FB_γ.

Theorem 5.1 (Iteration complexity of (EULA)). For γ_0 > 0, consider admissible envelopes {F_γ : γ ∈ (0, γ_0)} of F in (6b), see Definition 2.2. For μ_γ > 0 and ρ_γ ≥ 0, we additionally assume that F_γ satisfies the inequality

⟨∇F_γ(x) − ∇F_γ(y), x − y⟩ ≥ μ_γ ‖x − y‖²_2 − ρ_γ for every x, y ∈ R^d. (17)

Consider two sequences {γ_k}_{k≥0} ⊂ (0, γ_0) and {h_k}_{k≥0} ⊂ R_+. For the algorithm (EULA), let q_k denote the law of x_k for every integer k ≥ 0, i.e., x_k ∼ q_k. Then the W_1 distance between q_k and the target measure p ∝ e^{−F} in (6a) is bounded as in (18) for every k ≥ 0, provided that the step sizes satisfy (19). Above, c_0 ≥ 0.007 is a universal constant specified in [12, Equation (6.6)]. Moreover, the remaining quantities in (18) and (19) are specified in the proof.

Note that, when ρ_γ = 0, (17) requires F_γ to be μ_γ-strongly convex for every γ ∈ (0, γ_0). This can happen, for example, when f itself is a strongly convex function. A particularly important special case of Theorem 5.1 arises when we choose the FB envelope and use a fixed γ and step size h.

Corollary 5.2 (Iteration complexity of (EULA) for the FB envelope). Suppose that Assumption 2.1 is fulfilled. For the algorithm (EULA), suppose that γ_k = γ ∈ (0, γ_FB), F_{γ_k} = F^FB_γ, and h_k = h > 0 for every integer k ≥ 0, see (FB) and (10). In (EULA), also let q_k denote the law of x_k for every k ≥ 0, i.e., x_k ∼ q_k. Then the W_1 distance between q_k and the target measure p ∝ e^{−F} in (6a) is bounded as in (20) for every k ≥ 0, provided that the step size satisfies (21). Above, c_0 ≥ 0.007 is a universal constant specified in [12, Equation (6.6)]. The remaining quantities were defined in Proposition 4.2 and Theorem 4.4.
We remark that Corollary 5.2 for the FB envelope is the analogue of [14, Proposition 7] for the MY envelope. However, note that [14, Proposition 7] requires f to be strongly convex, whereas we merely assume f to be convex, see Assumption 2.1.

Proof of Theorem 5.1 (Iteration Complexity of EULA)
To begin, we let Q_k denote the Markov transition kernel associated with the Markov chain {x_k}_{k≥0}. This transition kernel is specified as

Q_k(x, ·) := Normal(x − h_k ∇F_{γ_k}(x), 2h_k I_d),

where Normal(a, B) is the Gaussian probability measure with mean a ∈ R^d and covariance matrix B ∈ R^{d×d}. Above, note that Q_k depends on both h_k and γ_k. We also let q_k denote the law of x_k, i.e., x_k ∼ q_k. Using standard notation, we can now write that

q_{k+1} = q_k Q_k. (22)

To be precise, (22) is equivalent to

q_{k+1}(A) = ∫ Q_k(x, A) q_k(dx) for every Borel set A ⊆ R^d. (23)

Recall that p_{γ_{k+1}} serves as a proxy for the target probability measure p. The W_1 distance between q_{k+1} and p_{γ_k} can be bounded as

W_1(q_{k+1}, p_{γ_k}) = W_1(q_k Q_k, p_{γ_k}) ≤ W_1(q_k Q_k, p_{γ_k} Q_k) + W_1(p_{γ_k} Q_k, p_{γ_k}), (24)

where the second line follows from the triangle inequality. We separately control each W_1 distance in the last line above. For the first distance, we plan to invoke Theorem 2.12 from [12], reviewed below for the convenience of the reader. It is worth noting that a result similar to the one below appears in [13, Corollary 2.4].
Proposition 6.1 ([12], Theorem 2.12). Suppose that the step size h_k satisfies the condition in (25), where c_0 ≥ 0.007 is a universal constant specified in [12, Equation (6.6)]. Then it holds that

W_Θ(q_k Q_k, p_{γ_k} Q_k) ≤ (1 − c_8 h_k) W_Θ(q_k, p_{γ_k}),

where W_Θ is defined similarly to (13) but with the ℓ_2-norm replaced by Θ. Above, to keep the notation light, we have suppressed the dependence of c_6, c_7 and c_8 on K, f, g, h_k. Moreover, the two metrics W_Θ and W_1 are related as

c_6 W_1 ≤ W_Θ ≤ c_7 W_1. (26)

For the second W_1 distance in the last line of (24), the following result is standard; see the appendix for the proof.

Lemma 6.2 (Discretization error). The one-step discretization error W_1(p_{γ_k} Q_k, p_{γ_k}) admits the bound in (27).

In fact, it is more common to write the left-hand side of (27) in terms of the Markov transition kernel of the corresponding Langevin diffusion, as discussed in the proof of Lemma 6.2. By combining Proposition 6.1 and Lemma 6.2, we can now revisit (24) and obtain the one-step recursion (28). Using the triangle inequality and (28), it immediately follows that W_Θ(q_{k+1}, p_{γ_{k+1}}) satisfies the recursion (29). By unwrapping (29) with the contraction factors (1 − c_8 h_j) and the metric equivalence (26), we arrive at the bound (30). Lastly, we can use (30), the triangle inequality

W_1(q_k, p) ≤ W_1(q_k, p_{γ_k}) + W_1(p_{γ_k}, p),

and Theorem 4.4 in order to bound the W_1 distance at iteration k to the target measure p, which completes the proof of Theorem 5.1.

Numerical experiments
A number of numerical experiments are presented below to support our theoretical contributions.

Truncated Gaussian
Our first numerical experiment deals with sampling from a truncated Gaussian distribution, restricted to a box K_d ⊂ R^d. For this problem, the potential U : R^d → R is specified as

U(x) = (1/2)⟨x, Σ^{−1} x⟩ + 1_{K_d}(x),

where, similarly to [14], the (i, j)-th entry of the covariance matrix is given by Σ_{i,j} := 1/(1 + |i − j|).
We now consider three scenarios, namely,
• d = 2 with K_2 = [0, 5] × [0, 0.5],
• d = 10 with K_10 = [0, 5] × [0, 0.5]^9,
• d = 100 with K_100 = [0, 5] × [0, 0.5]^99.
Using quadrature techniques, it is possible in the two-dimensional case (d = 2) to calculate exactly the mean and the covariance of the truncated Gaussian distribution, as well as the corresponding approximations obtained via the MY and FB envelopes. More specifically, Figure 1 uses MATLAB's integral2 command to plot these quantities for various values of γ, with horizontal lines showing the ground-truth values, also obtained by MATLAB's integral2 command. For small values of the parameter γ, we observe in Figure 1 that the FB envelope better approximates the mean of the first component than the MY envelope. However, the FB envelope tends to overestimate the variance. This can be understood by comparing the two envelopes: in the case of the MY envelope, the smoothing is more localized than for the FB envelope. Such explicit calculations are not tractable in higher dimensions, i.e., for d ∈ {10, 100}. Instead, we generate 10^6 samples from the truncated Gaussian distribution p by applying MYULA and FBULA. As our ground truth, we also generate 10^5 samples from p with the wall HMC (wHMC) [26]. In all three approaches, the initial 10% of the obtained samples are discarded as the burn-in period. In terms of the parameters, we set γ = 0.05 and fix h = 0.005 for all of our experiments.
The results are visualized in Figures 2 and 3. More specifically, Figure 2 corresponds to d = 10 and shows the estimates of E_p[x_i] for i ∈ {1, 2, 3}, obtained by MYULA, FBULA, and wHMC. Similarly, Figure 3 corresponds to d = 100. These figures indicate that, in all of these cases, FBULA provides a more accurate approximation of the expectation than MYULA.
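A drastically simplified 1-D analogue of this experiment (our own sketch, not the paper's MATLAB code): MYULA for a standard Gaussian truncated to [0, 5], where g = 1_K and the proximal operator of γ1_K is the projection onto K:

```python
import numpy as np

def myula_truncated_gaussian(lo, hi, gamma, h, n, rng):
    """MYULA sketch for p ∝ exp(-x^2/2) restricted to K = [lo, hi]."""
    x = np.zeros(1)
    out = np.empty(n)
    for k in range(n):
        proj = np.clip(x, lo, hi)            # prox of gamma * 1_K
        grad = x + (x - proj) / gamma        # grad f + grad of MY envelope of 1_K
        x = x - h * grad + np.sqrt(2.0 * h) * rng.standard_normal(1)
        out[k] = x[0]
    return out

rng = np.random.default_rng(2)
s = myula_truncated_gaussian(0.0, 5.0, gamma=0.05, h=0.005, n=100_000, rng=rng)
```

Because the constraint is only enforced through the quadratic penalty dist(·, K)²/(2γ), a small fraction of samples leaks outside K; this leakage, and the induced bias in the estimated mean, shrinks as γ → 0.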

Tomographic image reconstruction
We now study a tomographic image reconstruction problem. In this case, the true image is taken to be the Shepp–Logan phantom test image of dimension d = 128 × 128, to which we applied a Fourier operator F followed by a subsampling operator A, reducing the observed pixels by approximately 85%. Finally, zero-mean additive Gaussian noise ξ with standard deviation σ = 10^{−2} is added to produce an incomplete observation y = AFx + ξ, where y ∈ C^p. Note that p < d. With regards to the prior, we use the total-variation norm with an additional constraint on the range of the pixel values. This leads to the following posterior distribution:

π(x) ∝ exp( −‖y − AFx‖² / (2σ²) − βTV(x) − 1_{[0,1]^d}(x) ), (33)

with β = 100. Above, 1_{[0,1]^d} is the convex indicator function on the unit cube, as the pixel values for this experiment are scaled to the range [0, 1]. Following (4) and (6b), we have that f(x) = ‖y − AFx‖²/(2σ²) and g(x) = βTV(x) + 1_{[0,1]^d}(x). Figure 4(a) shows the Shepp–Logan phantom tomography test image for this experiment and Figure 4(b) shows the amplitude of the (noisy) Fourier coefficients collected in the observation vector y (in logarithmic scale). In this figure, black regions represent unobserved pixels.
We have set γ = 1/(5L_f), where L_f = 1/σ² = 10^4, for both MYULA and FBULA. Figure 5(a) shows the evolution of the values of log π(x) from (33) for both MYULA and FBULA with the step size h = 1/(L_f + 1/γ) = 1.67 × 10^{−5}. We observe that both methods converge at a similar rate. However, Figure 5(b) shows the evolution of the mean-squared error (MSE) between the ergodic mean of the samples and the true image x. Here, it can be seen that FBULA reaches a better MSE level than MYULA. We have also included in Figure 5(c),(d) the posterior mean estimated by both MYULA and FBULA.

A Proof of Proposition 3.1

The Lipschitz bound on ∇F^MY_γ follows from the gradient expression in Proposition 3.1 and the triangle inequality. We next verify that Definition 2.2(ii) holds. In one direction, it is easy to see that g_γ ≤ g for every γ > 0. In the other direction, we fix x ∈ R^d and distinguish two cases:
1. When x ∈ K, let us fix an arbitrary ε > 0. Because g(x) < ∞ and min_z g(z) > −∞ by (5), there exists a sufficiently small γ_ε > 0 such that (A.2) holds for every γ ≤ γ_ε. We now use this inequality as part of the argument leading to (A.7), which holds when x ∈ K and for every γ ≤ γ_ε. In view of (A.2) and (A.7), we conclude that ‖x − P_{γg}(x)‖_2 ≤ ε for every x ∈ K and every γ ≤ γ_ε. Since the choice of ε was arbitrary, we arrive at

lim inf_{γ→0} g_γ(x) ≥ g(x), (A.9)

where the last inequality follows from the fact that g is lower semi-continuous, see above Equation (5). If we combine (A.9) with the earlier observation that g_γ ≤ g for every γ, we reach the conclusion that lim_{γ→0} g_γ(x) = g(x), provided that x ∈ K. Lastly, after recalling the definitions of F and F^MY_γ in (6b) and (MY), respectively, we arrive at (A.10).
2. When x ∉ K, note that g(x) = ∞ by (5). Note also that

g_γ(x) ≥ dist(x, K)²/(2γ) − max_{z∈K} |g(z)|,

where dist(x, K) is the Euclidean distance from x to the set K. The maximum above is finite by (5). Above, by sending γ to zero, we immediately find that lim_{γ→0} g_γ(x) = ∞ = g(x), provided that x ∉ K.
Recalling the definitions of F and F_γ in (6b) and (MY), we conclude that (A.12) holds for x ∉ K. Together, (A.8) and (A.12) imply that lim_{γ→0} F_γ(x) = F(x) for every x ∈ R^d. Lastly, we now verify that Definition 2.2(i) is satisfied. Because, by Assumption 2.1(i), K is enclosed inside a ball of radius R centered at the origin, it holds that

F^MY_γ(x) ≥ f(x) + ((‖x‖_2 − R)_+)²/(2γ) − max_{z∈K} |g(z)|, (A.13)

where (a)_+ := max(a, 0). When ‖x‖_2 is sufficiently large, we can simplify (A.13).
In particular, (A.13) immediately implies a simplified lower bound on F^MY_γ. If ‖x‖_2 ≥ 2R, then we can use the convexity of f (Assumption 2.1(ii)) to further bound F^MY_γ from below for ‖x‖_2 ≥ 2R. We now set F_0 to be the resulting lower bound. By its construction, note that F_0 satisfies ∫ e^{−F_0(x)} dx < ∞ and F_γ ≥ F_0, as required in Definition 2.2(i). The former claim is true because F_0(x) is quadratic for large ‖x‖_2 and e^{−F_0} thus decays rapidly far away from the origin. This completes the proof of Proposition 3.1.

B Proof of Proposition 4.2
To establish the admissibility of the FB envelopes, we verify the requirements in Definition 2.2. We first verify that Definition 2.2(iii) holds. To begin, for x ∈ R d , let x̄ := x − γ∇f (x), T γ (x) := P γg (x̄), (B.17) for short. Above, recall that P γg is the proximal operator associated with the function γg; see (9). We can then apply the Moreau decomposition [28] to write that x̄ = P γg (x̄) + P γg * (•/γ) (x̄), (B.18) where g * is the Fenchel conjugate of g. For future reference, we record the following observations: Above, I d ∈ R d×d is the identity matrix. Recall the expression for ∇F FB γ from Proposition 4.1(iii). We next establish that F FB γ is smooth. Below, without loss of generality, we can assume that ‖y‖ 2 ≤ λ 0 ; otherwise, by (3), (B.23) still holds. This observation justifies our earlier restriction to the case where ‖y‖ 2 ≤ λ 0 . Next, we show the strong convexity of F FB γ over long distances. Again, without loss of generality, we assume below that ‖y‖ 2 ≤ λ 0 and write that where the second line uses Proposition 4.1(iii). When ‖x − y‖ 2 is sufficiently large, the first term in the last line above dominates the second term. In particular, it holds that In words, F FB γ behaves like a strongly convex function over long distances. It is not difficult to verify that Definition 2.2(ii) is also valid for the FB envelopes: to that end, one needs to combine Proposition 3.1 with the relation between the MY and FB envelopes in Proposition 4.1(ii).
Lastly, we verify that the requirement in Definition 2.2(i) is met for the FB envelopes. Below, suppose that γ ∈ (0, 1/λ 2 ). On the one hand, Proposition 4.1(ii) implies that
On the other hand, by Proposition 3.1, there exists F 0 such that e −F 0 is integrable. By combining the preceding two observations, we find that the FB envelopes satisfy Definition 2.2(i) for every γ ∈ (0, 1/λ 2 ). This completes the proof of Proposition 4.2.
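The Moreau decomposition invoked in (B.18) can be checked numerically. The sketch below uses the illustrative choice g = ‖·‖ 1 (not the g of the paper), for which P γg is componentwise soft-thresholding and the conjugate g * is the indicator of the ℓ ∞ unit ball, so the second term of the decomposition reduces to a clipped vector.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.7
x = rng.standard_normal(50)

# prox of gamma * ||.||_1: componentwise soft-thresholding
p_g = np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

# prox of the conjugate term: g* is the indicator of the l_inf unit ball,
# whose prox is the Euclidean projection, scaled as in the decomposition
p_conj = gamma * np.clip(x / gamma, -1.0, 1.0)

# Moreau decomposition: x = P_{gamma g}(x) + P_{gamma g*(./gamma)}(x)
assert np.allclose(x, p_g + p_conj)
```

The identity holds componentwise: where |x i | ≤ γ the first term vanishes and the second returns x i , and where |x i | > γ the two terms split x i into x i − γ sign(x i ) and γ sign(x i ).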

C Proof of Theorem 4.4
Let us define

q ∝ e −f −1 K , (C.26)

where 1 K is the indicator function on the set K. Note that the target distribution p in (6a) coincides with q in the special case where g = 1 K . Let dist(x, K) denote the distance from x to the set K. For γ > 0, we also define

1 K,γ (x) := dist(x, K) 2 / (2γ) (C.27)

to be the MY envelope of the indicator function on the set K. We denote the corresponding probability distribution by

q γ ∝ e −f −1 K,γ . (C.28)

When γ is sufficiently small, we may intuitively regard q γ as a proxy for q. We also recall our earlier notation for the convenience of the reader: p ∝ e −F and p FB γ ∝ e −F FB γ . Above, the functions F and F FB γ were defined in (6b) and (FB), respectively. Our objective is to control the distance W 1 (p FB γ , p). To begin, we use the triangle inequality to write that

W 1 (p FB γ , p) ≤ W 1 (p FB γ , q γ ) + W 1 (q γ , q) + W 1 (q, p). (C.30)

The following three lemmas each bound one of the terms on the right-hand side above.
Lemma C.1. It holds that where vol i (K) is the i-th intrinsic volume of K [25]. In particular, the d-th intrinsic volume of K coincides with the standard volume of K, i.e., vol d (K) = vol(K).
Lemma C.3. It holds that where To keep the notation light, we have suppressed above the dependence of c 1 , . . . , c 4 on K, f, and g. With Lemmas C.1-C.3 at hand, we revisit (C.30) and write that This completes the proof of Theorem 4.4.

D Proof of Lemma C.1
In order to upper bound the distance W 1 (q γ , p FB γ ), we will use the following simple result, which bounds the W 1 distance under small perturbations.
Lemma D.1. For constants α ≥ β and β ′ ≤ 1, consider two functions h 1 : R d → R and h 2 : R d → R that are related as Then it holds that As a sanity check, note that the right-hand side of (D.38) is always nonnegative because, by assumption, α ≥ β and β ′ ≤ 1. Moreover, if we set α = β = 0 and β ′ = 1, the right-hand side of (D.38) reduces to zero, as expected.
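A quick numerical sanity check of the spirit of Lemma D.1: a bounded perturbation of the potential should move the associated distribution only slightly in W 1 . The sketch below discretizes two densities on a grid and estimates the distance with SciPy; the potentials h 1 , h 2 and the perturbation size are hypothetical choices for illustration, not the quantities of the lemma.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Discretize two densities e^{-h1} and e^{-h2} on a fine grid, where h2 is
# a uniformly bounded perturbation of h1 (|h2 - h1| <= eps everywhere).
grid = np.linspace(-8.0, 8.0, 4001)
h1 = 0.5 * grid**2                     # standard Gaussian potential
eps = 0.05
h2 = h1 + eps * np.sin(grid)           # bounded perturbation

w1 = np.exp(-h1); w1 /= w1.sum()
w2 = np.exp(-h2); w2 /= w2.sum()

# W1 distance between the two discretized distributions
dist = wasserstein_distance(grid, grid, w1, w2)
assert dist < 0.5                      # small perturbation => small W1
```

The estimated distance is of the same order as the perturbation size eps, consistent with a perturbation bound of the type stated in the lemma.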
Let us now recall from (C.28) and (6a) that q γ ∝ e −f −1 K,γ and p FB γ ∝ e −F FB γ , respectively. Our plan is to invoke Lemma D.1 with the choice h 1 = f + 1 K,γ and h 2 = F FB γ . In turn, this plan requires us to verify (D.37) for this choice. We begin by relating the two functions as Above, note that max z∈K g(z) is finite by the construction of g; see (5). In the other direction, we similarly write that Above, min z f (z) is finite by the construction of f ; see (3). To summarize, for our choice of h 1 = f + 1 K,γ and h 2 = F FB γ , (D.37) is satisfied with With (D.41) at hand, we can invoke Lemma D.1 to find an upper bound (D.42) on W 1 (q γ , p FB γ ) in terms of the integrals of ‖x‖ 2 e −f (x)−dist(x,K) 2 /(2γ) and e −f (x)−dist(x,K) 2 /(2γ) . To complete the proof, we need the following result, which is proved in [14,29] using the well-known Steiner formula from integral geometry [25]. For completeness, a self-contained proof is given later.
Lemma D.2. It holds that where K c = R d \ K is the complement of the set K and vol i (K) is the i-th intrinsic volume of K [25]. In particular, the d-th intrinsic volume of K coincides with the standard volume of K, i.e., vol d (K) = vol(K).
In particular, note that I 1 (γ) and I 2 (γ) both vanish in the limit γ → 0, as expected. To control the first fraction in the last line of (D.42), we use Lemma D.2 to write that where the last line uses Lemma D.2. For the second fraction in the last line of (D.42), we similarly use Lemma D.2 to write that (D.45) We can now use (D.44) and (D.45) to upper bound the W 1 distance in (D.42) as which completes the proof of Lemma C.1.

E Proof of Lemma D.1
Let us first recall Theorem 6.15 in [24], which can be used to link the W 1 and TV distances of two rapidly decaying distributions.

F Proof of Lemma D.2
Let ι(A) = 1 if the claim A is true and ι(A) = 0 otherwise. Using the fact that we can rewrite the left-hand side of (D.43b) as Note that {x : dist(x, K) ≤ t} is the tube of radius t around the set K. We can represent this set more compactly as K + B(0, t), where the addition is in the Minkowski sense. With this in mind, we rewrite the last line above as We can now use Steiner's formula [25] to express the volume above as

vol(K + B(0, t)) = Σ i=0..d κ d−i vol i (K) t d−i , (F.56)

where κ i is the volume of the unit ball in R i and vol i (K) is the i-th intrinsic volume of K [25]. In particular, the d-th intrinsic volume of K coincides with its standard volume, i.e., vol d (K) = vol(K). Substituting the above identity back into (F.55), we find that which establishes (D.43b). Above, in the second line we used the identity To prove (D.43a), we write that To prove (D.43c), we use the fact that K ⊂ B(0, R) by Assumption 2.1(i), which implies that

dist(x, K) ≥ ‖x‖ 2 − R. (F.60)

In turn, we use (F.60) to write that We can again use Steiner's formula to calculate the volume of the tube of radius t around K, which appears in the last line above. Substituting from (F.56) into the last line above, we find that which establishes (D.43c). This completes the proof of Lemma D.2.
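Steiner's formula used above can be verified numerically in a simple case. The sketch below compares the formula for the unit square in R 2 , namely area + perimeter · t + π t 2 , against a Monte Carlo estimate of the tube volume; the sample size and tolerance are arbitrary illustrative choices.

```python
import math
import numpy as np

# Steiner's formula in R^2 for the unit square K = [0, 1]^2:
#   vol(K + B(0, t)) = vol_2(K) + kappa_1 vol_1(K) t + kappa_2 vol_0(K) t^2
#                    = 1 + 4 t + pi t^2,
# with intrinsic volumes vol_2 = 1 (area), vol_1 = 2 (half-perimeter),
# vol_0 = 1, and unit-ball volumes kappa_1 = 2, kappa_2 = pi.
def steiner_area(t):
    return 1.0 + 4.0 * t + math.pi * t * t

# Monte Carlo estimate of the tube volume vol{x : dist(x, K) <= t}.
rng = np.random.default_rng(1)
t = 0.5
box = 1.0 + 2.0 * t                      # the tube fits inside [-t, 1 + t]^2
pts = rng.uniform(-t, 1.0 + t, size=(400_000, 2))
proj = np.clip(pts, 0.0, 1.0)            # Euclidean projection onto the square
inside = np.linalg.norm(pts - proj, axis=1) <= t
mc_area = inside.mean() * box * box
assert abs(mc_area - steiner_area(t)) < 0.02
```

The projection onto the square is simply a componentwise clip, which makes dist(x, K), and hence membership in the tube, cheap to evaluate.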

G Proof of Lemma C.2
Recall from (C.26) that q ∝ e −f −1 K , where 1 K denotes the indicator function on the set K. Recall also from (C.28) that q γ ∝ e −f −1 K,γ , where 1 K,γ (x) = dist(x, K) 2 /(2γ) is the MY envelope of 1 K ; see (C.27). We will repeatedly use these two distributions in the proof. To begin, let us invoke Theorem E.1 to write that For the first integral in the last line above, we use the fact that K ⊂ B(0, R) by Assumption 2.1(i) and write that where the last line uses the fact that We continue to simplify the last line of (G.64) as For the second integral in the last line of (G.63), we can again use the definitions of q and q γ to write that where the first identity above uses the fact that q is supported on K. With (G.65) and (G.66) at hand, we revisit (G.63) and write that

W 1 (q γ , q) ≤ e max x f (x)−min x f (x) · ( R I 1 (γ) + I 2 (γ) ) / ( vol(K) + I 1 (γ) ), (G.67)

where we used (G.63), (G.65), and (G.66). This completes the proof of Lemma C.2.
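The MY envelope 1 K,γ of the indicator, which drives the bound above, is easy to evaluate whenever the projection onto K is available. The sketch below takes K to be a Euclidean ball (an illustrative choice of K) and checks that the envelope vanishes on K and blows up off K as γ → 0.

```python
import numpy as np

def my_indicator_ball(x, gamma, R=1.0):
    # MY envelope of the indicator of K = B(0, R):
    #   1_{K,gamma}(x) = dist(x, K)^2 / (2 gamma),
    # with dist(x, K) = max(||x||_2 - R, 0) for a centered ball.
    d = max(np.linalg.norm(x) - R, 0.0)
    return d * d / (2.0 * gamma)

x_in = np.array([0.3, -0.4])            # inside K: envelope is 0 for every gamma
x_out = np.array([2.0, 0.0])            # outside K: dist(x_out, K) = 1

assert my_indicator_ball(x_in, 0.01) == 0.0
vals = [my_indicator_ball(x_out, g) for g in (1.0, 0.1, 0.01)]
assert vals[0] < vals[1] < vals[2]      # monotone blow-up as gamma decreases
assert abs(vals[0] - 0.5) < 1e-12      # dist = 1 gives 1/(2*1) = 0.5 at gamma = 1
```

This is exactly the mechanism behind the proxy q γ : as γ → 0, the factor e −1 K,γ suppresses all mass outside K, so q γ collapses onto q.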

H Proof of Lemma C.3
Recall from (C.26) and (6a) that q ∝ e −f −1 K and p ∝ e −F , respectively. The function F was defined in (6b). In order to bound W 1 (q, p), our plan is to invoke Lemma D.1 with h 1 = f + 1 K and h 2 = F . As before, this means that we need to verify (D.37) for this choice. We begin by relating these two functions as ≤ max z∈K g(z) + f (x) + 1 K (x). (H.68) In the first line above, we used the fact that F and 1 K both take the value infinity outside of the set K. In the other direction, we write that We can now invoke Lemma D.1 to find that

W 1 (q, p) = W 1 ( e −f −1 K / ∫ e −f −1 K , e −F / ∫ e −F )
≤ ( e max x∈K g(x)−min x∈K g(x) − e min x∈K g(x)−max x∈K g(x) ) · ∫ ‖x‖ 2 e −f (x)−1 K (x) dx / ∫ e −f (x)−1 K (x) dx (Lemma D.1)
≤ ( e max x∈K g(x)−min x∈K g(x) − e min x∈K g(x)−max x∈K g(x) ) · R ∫ e −f (x)−1 K (x) dx / ∫ e −f (x)−1 K (x) dx (K ⊂ B(0, R))
= ( e max x∈K g(x)−min x∈K g(x) − e min x∈K g(x)−max x∈K g(x) ) R, (H.71)

where we used the fact that K ⊂ B(0, R) by Assumption 2.1(i) in the penultimate line above. This completes the proof of Lemma C.3.

where {B t } t≥0 is the standard Brownian motion. Above, note that the initial probability measure is p γ k ∝ e −F γ k . From Definition 2.2(iii), recall that F γ k is λ γ k -smooth and, moreover, coercive. To be concrete, the latter means that F γ k satisfies lim ‖x‖ 2 →∞ F γ k (x) = ∞. Therefore, Theorem 3.4 in [23] ensures that p γ k ∝ e −F γ k is the invariant probability measure of (I.72). In particular, because x 0 ∼ p γ k in (I.72), it holds that In the remainder of the proof, we will upper bound the right-hand side above, which can be thought of as the discretization error associated with (I.72). To begin, recall from (I.73) that x H k ∼ p γ k , and note that have the same distribution. Above, recall that ζ k+1 ∼ normal(0, I d ). In view of (I.77) and (I.78), we revisit (I.76) and write that (see (13))

Figure 1: This figure compares the MY and FB envelopes for the two-dimensional truncated Gaussian distribution p ∝ e −U −1 K 2 specified in Section 7.1. The horizontal lines in the left and right panels show, respectively, the expectation and variance of the first coordinate, namely, E p [x 1 ] and var p [x 1 ] = E p [x 1 2 ] − (E p [x 1 ]) 2 . The blue and red graphs in both panels show the estimated values of E p [x 1 ] and var p [x 1 ], obtained via the MY and FB envelopes. That is, the graphs on the left correspond to E p MY γ [x 1 ] and E p FB γ [x 1 ] for various values of γ. Similarly, the graphs on the right correspond to var p MY γ [x 1 ] and var p FB γ [x 1 ] for various values of γ.

Figure 2: This figure shows the boxplots for the expectations of x 1 , x 2 , x 3 for the truncated Gaussian distribution in dimension 10, obtained by MYULA, FBULA, and wHMC. The last approach serves as the ground truth.

Figure 3: Boxplots for the expectations of x 1 , x 2 , x 3 for the truncated Gaussian distribution in dimension 100, obtained by MYULA, FBULA, and wHMC. The last approach serves as the ground truth.

Figure 4: Tomography experiment: (a) True image x of dimension d = 128 × 128. (b) Incomplete and noisy observation y, amplitude of Fourier coefficients in logarithmic scale.

Figure 5: Tomography experiment: (a) Convergence to the typical set of the posterior distribution (33). (b) Evolution of the MSE in stationarity. Posterior mean of (33) as estimated with (c) MYULA and (d) FBULA, respectively, after 10 6 iterations.
I Proof of Lemma 6.2

Consider the stochastic differential equation

dx t = −∇F γ k (x t ) dt + √2 dB t , x 0 ∼ p γ k , (I.72)

so that x t ∼ p γ k for all t ≥ 0. (I.73) Let {P k } k≥0 denote the Markov transition kernel associated with the Markov chain {x H k } k≥0 , where

H k := h 0 + · · · + h k−1 (I.74)

is the elapsed time since initialization. Using this transition kernel, we can rewrite (I.73) as p γ k = p γ k P k for every k ≥ 0. (I.75) Finally, we can write the quantity of interest on the left-hand side of (27) as W 1 (p γ k Q k , p γ k ) = W 1 (p γ k Q k , p γ k P k ). (I.76) We bound this quantity as

W 1 (p γ k Q k , p γ k P k ) ≤ E ‖ ∫ H k H k+1 ( ∇F γ k (x t ) − ∇F γ k (x H k ) ) dt ‖ 2
≤ ∫ H k H k+1 E ‖ ∇F γ k (x t ) − ∇F γ k (x H k ) ‖ 2 dt (Minkowski's integral inequality)
≤ λ γ k ∫ H k H k+1 E ‖ x t − x H k ‖ 2 dt, (Definition 2.2(iii)) (I.79)

where the last line above uses the fact that F γ k is λ γ k -smooth. To estimate the expectation in the last line above, we write that

E ‖ x t − x H k ‖ 2 ≤ ( E ‖ ∫ H k t ∇F γ k (x s ) ds ‖ 2 2 ) 1/2 + ( E ‖ √2 (B t − B H k ) ‖ 2 2 ) 1/2 . (I.80)

To estimate the first expectation in the last line of (I.80), recall from the last line of (I.79) that t ∈ [H k , H k+1 ), and note that

∫ H k t E ‖ ∇F γ k (x s ) ‖ 2 2 ds = ∫ H k t E x∼p γ k ‖ ∇F γ k (x) ‖ 2 2 ds (see (I.73))
= (t − H k ) E x∼p γ k ‖ ∇F γ k (x) ‖ 2 2
≤ h k E x∼p γ k ‖ ∇F γ k (x) ‖ 2 2 (see (I.74))
= h k E x∼p γ k [ ∆F γ k (x) ] (integration by parts and p γ k ∝ e −F γ k )
≤ h k λ γ k d, (I.81)

where the last line above follows from ∆F γ k (x) = trace(∇ 2 F γ k (x)) and the fact that F γ k is λ γ k -smooth by Definition 2.2(iii). In particular, by the Cauchy–Schwarz inequality, E ‖ ∫ H k t ∇F γ k (x s ) ds ‖ 2 2 ≤ (t − H k ) ∫ H k t E ‖ ∇F γ k (x s ) ‖ 2 2 ds ≤ h k 2 λ γ k d. To estimate the second expectation in the last line of (I.80), note that

E ‖ √2 (B t − B H k ) ‖ 2 2 = 2 (t − H k ) d ≤ 2 h k d. (I.83)

We now plug the bounds (I.81) and (I.83) back into (I.80) to obtain that E ‖ x t − x H k ‖ 2 ≤ √(h k 2 λ γ k d) + √(2 h k d) = √(h k d) ( √(h k λ γ k ) + √2 ). By substituting the above bound back into (I.79), we arrive at

W 1 (p γ k Q k , p γ k ) ≤ λ γ k (H k+1 − H k ) √(h k d) ( √(h k λ γ k ) + √2 ),

which completes the proof of Lemma 6.2.
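For context, the discretization analysed above can be sketched in a few lines. The snippet below implements a MYULA-style chain (the baseline the paper compares FBULA against) for a one-dimensional truncated Gaussian; the potential, step size, and smoothing parameter are illustrative choices, not those of the paper.

```python
import numpy as np

# A MYULA-style chain for p ∝ exp(-f - 1_K) with f(x) = x^2/2 and K = [-1, 1].
# The non-smooth indicator is replaced by its MY envelope dist(x, K)^2/(2 gamma),
# whose gradient is (x - proj_K(x)) / gamma. All parameter values are illustrative.
rng = np.random.default_rng(2)
gamma, h, n = 1e-2, 5e-3, 200_000
noise = np.sqrt(2.0 * h) * rng.standard_normal(n)
samples = np.empty(n)
x = 0.0
for k in range(n):
    proj = min(max(x, -1.0), 1.0)        # prox of the indicator = projection onto K
    grad = x + (x - proj) / gamma        # grad f(x) + grad of the MY envelope
    x = x - h * grad + noise[k]          # Euler-Maruyama step
    samples[k] = x

# the smoothed target is symmetric about 0, so the ergodic mean is near 0
assert abs(samples.mean()) < 0.1
# almost all mass concentrates near K = [-1, 1]
assert np.mean(np.abs(samples) < 1.2) > 0.9
```

Note that the step size must respect the effective smoothness λ = L f + 1/γ of the surrogate potential (here 1 + 1/γ = 101, so h = 5 × 10⁻³ is well within the stable range), mirroring the choice h = 1/(L f + 1/γ) used in the tomography experiment.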