1 Introduction

The problem of calculating expectations with respect to a probability distribution \(\mathfrak {p}\) in \(\mathbb {R}^d\) is ubiquitous throughout applied mathematics, statistics, molecular dynamics, statistical physics and other fields.

In practice, the dimension d is often large, which renders deterministic techniques, such as quadrature methods, computationally intractable. In contrast, probabilistic methods do not suffer from this curse of dimensionality and are often the method of choice in high dimensions.

In particular, Markov chain Monte Carlo (MCMC) methods are based on the construction of a Markov chain in \(\mathbb {R}^{m}\) with \(m \ge d\), for which the invariant distribution (or its suitable marginal) coincides with the target distribution \(\mathfrak {p}\) (Brooks et al. 2011).

Often, such Markov chains are based on the discretization of stochastic differential equations (SDEs). One such SDE, which is also the focus of this paper, is the (overdamped) Langevin equation

$$\begin{aligned} {\text {d}}X_{t}=- \nabla f(X_{t}){\text {d}}t+\sqrt{2}{\text {d}}W_{t}, \end{aligned}$$
(1)

where \(\{W_t\}_t\) is the standard d-dimensional Brownian motion and \(\nabla f\) denotes the gradient of a continuously differentiable function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\). Under mild assumptions on f, the dynamics of (1) are ergodic with respect to the distribution \(\mathfrak {p}\propto e^{-f}\). In particular, \(\mathfrak {p}\) is the invariant distribution of (1); see Milstein and Tretyakov (2007).

The discretization of (1), however, requires special care because the resulting discrete Markov chain might not be ergodic anymore (Roberts and Tweedie 1996). In addition, even if ergodic, the resulting discrete Markov chain often has a different invariant distribution than \(\mathfrak {p}\), known as the numerical invariant distribution \({\widehat{\mathfrak {p}}}\). The study of the asymptotic error between the numerical invariant distribution \({\widehat{\mathfrak {p}}}\) and the target distribution \(\mathfrak {p}\) has received considerable attention recently (Mattingly et al. 2010; Abdulle et al. 2014). In particular, Mattingly et al. (2010) investigated the effect of discretization on the convergence of the ergodic averages, and Abdulle et al. (2014) presented general order conditions to ensure that the numerical invariant distribution accurately approximates the target distribution.

Another active line of research quantifies the nonasymptotic error between the numerical invariant distribution \({\widehat{\mathfrak {p}}}\) and the target distribution \(\mathfrak {p}\). In particular, when \(\mathfrak {p}\) is a smooth and strongly log-concave distribution, Dalalyan (2017) established non-asymptotic bounds in total variation distance for the Euler–Maruyama discretization of (1), commonly known as the unadjusted Langevin algorithm (ULA). These results have also been extended to the Wasserstein distance \(\textrm{W}_{2}\) and the KL divergence in Durmus and Moulines (2017), Dalalyan (2017), Durmus and Moulines (2019), Dalalyan and Karagulyan (2019), Durmus et al. (2019), to name a few. Typically, these works study the number of iterations that the numerical integrator would require to achieve a desired accuracy, when applied to a target distribution \(\mathfrak {p}\) with a known condition number.

In fact, the above strong log-concavity of \(\mathfrak {p}\) can be substantially relaxed. In particular, using a variant of the reflection coupling, the recent work (Eberle and Majka 2019) derives non-asymptotic bounds for the ULA in the Wasserstein distance \(\textrm{W}_1\), when \(\mathfrak {p}\) is strictly log-concave outside of a ball in \(\mathbb {R}^d\). Similar results for the Wasserstein distance \(\textrm{W}_2\) have been obtained in Majka et al. (2020).

Within the class of log-concave distributions, a significant challenge for the Langevin diffusion in (1) arises when the target distribution \(\mathfrak {p}\) is not smooth and/or has a compact (convex) support in \(\mathbb {R}^d\). One approach to address this challenge is to replace the non-smooth distribution \(\mathfrak {p}\) with a smooth proxy obtained via the so-called Moreau–Yosida (MY) envelope. This new smooth density remains log-concave and thus amenable to the non-asymptotic results discussed earlier. When the support of \(\mathfrak {p}\) is also compact, proximal Monte Carlo methods have been explored in Brosse et al. (2017), Pereyra (2016), Durmus et al. (2018). It is also worth noting that (Bubeck et al. 2018) pursued a different approach for sampling from compactly-supported densities that does not involve the MY envelope.

A potential drawback of the above approach is that the MY envelope often does not preserve the maximum a posteriori (MAP) estimator. That is, the above approach alters the location at which the new (smooth) density reaches its maximum. This is a well-known issue in the context of (non-smooth) convex optimization and is often resolved by appealing to the proximal gradient method. The latter can be understood as the Euler discretization of the gradient flow of the so-called forward–backward (FB) envelope (Stella et al. 2017).

1.1 Contributions

This work explores and analyzes the use of the FB envelope for sampling from non-smooth and compactly-supported log-concave distributions. In analogy with the Langevin proximal Monte Carlo, we replace the non-smooth density with a smooth proxy obtained via the FB envelope. In particular, this proxy is strongly log-concave over long distances.

Crucially, the new proxy also maintains the MAP estimator under certain assumptions. However, this improvement comes at the cost of requiring additional smoothness for the smooth part of the density. Lastly, the strong convexity of the new proxy over long distances allows us to utilise the work of Eberle and Majka (2019) to obtain non-asymptotic guarantees for our method in the Wasserstein distance \(\textrm{W}_1\).

In addition to investigating the use of the FB envelope in sampling, this work has the following contributions:

  • It introduces a general theoretical framework for sampling from non-smooth densities by introducing the notion of admissible envelopes. MY and FB envelopes are both instances of admissible envelopes.

  • It proposes a new Langevin algorithm to sample from non-smooth densities, dubbed EULA. EULA generalizes MYULA in the sense that it can work with any admissible envelope, e.g., MY or FB. EULA can also handle a family of increasingly more accurate envelopes rather than using a fixed envelope.

1.2 Organisation

The rest of the paper is organised as follows. Section 2 formalizes the problem of sampling from a non-smooth and compactly-supported log-concave distribution. As a proxy for this non-smooth distribution, its (smooth) Moreau–Yosida (MY) envelope is reviewed in Sect. 3. This section also explains the main limitation of MY envelope, i.e., its inaccurate MAP estimation. In Sect. 4, we introduce the forward–backward (FB) envelope which overcomes the key shortcoming of the MY envelope.

Section 5 introduces and analyses EULA, an extension of the popular ULA for sampling from a non-smooth distribution. EULA can be adapted to various envelopes. In particular, MYULA from Brosse et al. (2017) is a special case of EULA for the MY envelope. Section 6 proves the iteration complexity of EULA and Sect. 7 presents a few numerical examples to support the theory developed here.

2 Statement of the problem

Consider a compact convex set \(\textrm{K}\subset \mathbb {R}^d\). For a pair of functions \(\overline{f}:\mathbb {R}^d\rightarrow \mathbb {R}\) and \(\overline{g}:\mathbb {R}^d\rightarrow \mathbb {R}\), our objective in this work is to sample from the probability distribution

$$\begin{aligned} \mathfrak {p}(x) := {\left\{ \begin{array}{ll} \frac{e^{-\overline{f}(x)-\overline{g}(x)}}{\int _{\textrm{K}} e^{-\overline{f}(z)-\overline{g}(z)} \,{\text {d}}z } &{} x\in \textrm{K}, \\ 0 &{} x\notin \textrm{K}, \end{array}\right. } \end{aligned}$$
(2)

whenever the ratio above is well-defined. In order to sample from \(\mathfrak {p}\), we only have access to the gradient of \(\overline{f}\) and the proximal operator for \(\overline{g}\), to be defined later. Our assumptions on \(\textrm{K},\overline{f},\overline{g}\) are detailed below.

Assumption 1

We make the following assumptions:

  1. For radii \(R\ge r>0\), assume that \(\textrm{K}\subset \mathbb {R}^d\) is a compact convex body that satisfies \(\textrm{B}(0,r)\subset \textrm{K}\subset \textrm{B}(0,R)\). Here, \(\textrm{B}(0,r)\) is the Euclidean ball of radius r centered at the origin.

  2. Assume also that \(\overline{f}:\mathbb {R}^d\rightarrow \mathbb {R}\) is a convex function that is three-times continuously differentiable.

  3. Assume lastly that \(\overline{g}:\mathbb {R}^d\rightarrow (-\infty ,\infty ]\) is a proper closed convex function. Moreover, we assume that \(\overline{g}\) is continuous.

A few important remarks about Assumption 1 are in order. First, in the special case when \(\overline{f}\) is a convex quadratic (Luu et al. 2021), the assumption of thrice-differentiability above is trivially met and some of the developments below are simplified. However, our more general setup here necessitates the thrice-differentiability above and leads below to more involved technical derivations.

Second, for the sake of mathematical convenience, instead of the two functions \(\overline{f},\overline{g}\), we will work throughout this manuscript with two new functions f and g, without any loss of generality.

More specifically, instead of \(\overline{f}\), we consider a convex function f that coincides with \(\overline{f}\) on the set \(\textrm{K}\), has a compact support and a continuously differentiable Hessian. For this function f, the compactness of \(\textrm{K}\) and the smoothness of \(\overline{f}\) in Assumption 1 together imply that \(f,\nabla f,\nabla ^2 f\) are all Lipschitz-continuous functions. To summarize, for the function f described above, there exist nonnegative constants \(\lambda _0,\lambda _1,\lambda _2,\lambda _3\) such that

$$\begin{aligned}&f(x) = 0, \quad \text {if}\quad \Vert x\Vert _2 \ge \lambda _0, \end{aligned}$$
(3a)
$$\begin{aligned}&|f(x)-f(y) |\le \lambda _1 \Vert x-y\Vert _2, \end{aligned}$$
(3b)
$$\begin{aligned}&\Vert \nabla f(x) - \nabla f(y) \Vert _2 \le \lambda _2 \Vert x-y\Vert _2, \end{aligned}$$
(3c)
$$\begin{aligned}&\Vert \nabla ^2 f(x) - \nabla ^2 f(y) \Vert \le \lambda _3 \Vert x-y\Vert _2, \end{aligned}$$
(3d)

for every \(x,y\in \mathbb {R}^d\). In particular, the compactness of \(\textrm{K}\) and the differentiability of \(\overline{f}\) in Assumption 1 imply that f is finite, i.e., \(\max _x |f(x)|<\infty \).

Instead of \(\overline{g}\), we consider the proper closed convex function g, defined as

$$\begin{aligned} g := \overline{g}+1_\textrm{K}, \end{aligned}$$
(4)

where \(1_\textrm{K}\) is the indicator function for the set \(\textrm{K}\). That is, \(1_\textrm{K}(x)=0\) if \(x\in \textrm{K}\) and \(1_\textrm{K}(x)=\infty \) if \(x\notin \textrm{K}\). The compactness of \(\textrm{K}\) and the continuity of \(\overline{g}\) in Assumption 1 together imply that g is finite when its domain is restricted to the set \(\textrm{K}\); outside of the set \(\textrm{K}\), g is infinite. To summarize, the new function g is lower semi-continuous and also satisfies

$$\begin{aligned} \max _{x\in \textrm{K}} |g(x) |<\infty , \quad g(x)=\infty , \quad \text {if}\quad x\notin \textrm{K}. \end{aligned}$$
(5)

Using the new functions f and g, we rewrite the definition of \(\mathfrak {p}\) in (2) as

$$\begin{aligned} \mathfrak {p}(x)&:= \frac{e^{-F(x)}}{\int _{\mathbb {R}^d} e^{-F(z)} \,{\text {d}}z }, \end{aligned}$$
(6a)
$$\begin{aligned} F(x)&:=\overline{f}(x)+\overline{g}(x)+1_\textrm{K}(x)\nonumber \\&= f(x) + g(x). \end{aligned}$$
(6b)

Above, F is often referred to as the potential associated with \(\mathfrak {p}\). The last identity above holds by construction. Indeed, on the set \(\textrm{K}\), the functions f and \(\overline{f}\) coincide. Likewise, on the set \(\textrm{K}\), the functions g and \(\overline{g}\) coincide. Outside of the set \(\textrm{K}\), however, both sides of the last equality above are infinite. In particular, F is finite on \(\textrm{K}\) and infinite outside of this set.

In view of (6b), instead of \(\overline{f}\) and \(\overline{g}\), we can work with f and g without any change to the target distribution \(\mathfrak {p}\). This will be more convenient for the proofs. As a side note, Assumption 1 implies that \(\mathfrak {p}\) is finite; see after (3) and (5)–(6). The integral in the denominator in (6a) is also finite by Assumption 1. When there is no confusion, we will overload our notation and use \(\mathfrak {p}\) to also denote the probability measure associated with the law \(\mathfrak {p}\).

Since g is not differentiable, F in (6b) is itself non-differentiable. In turn, this means that one cannot provably use gradient-based algorithms (such as the ULA) to sample from \(\mathfrak {p}\propto \exp (-F)\) (Dalalyan and Karagulyan 2019). One way to deal with this issue is to replace F with a smooth function \(F_{\gamma }\), which we will refer to as an envelope. To this envelope we can then apply the ULA. It is reasonable to require the envelope \(F_\gamma \) to fulfill the following admissibility assumptions.

Definition 1

(Admissible envelopes) For \(\gamma ^0>0\), the functions \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\) are admissible envelopes of F if

  1. There exists a function \(F^0:\mathbb {R}^d\rightarrow [-\infty ,\infty ]\) such that \(e^{-F^0}\) is integrable, and \(F_\gamma \) dominates \(F^0\). That is, for every \(x\in \mathbb {R}^d\) and \(\gamma \in (0, \gamma ^0)\), it holds that \(F_\gamma (x) \ge F^0(x)\).

  2. \(F_\gamma :\mathbb {R}^d\rightarrow [-\infty ,\infty ]\) converges pointwise to F. That is, \( \lim _{\gamma \rightarrow 0} F_\gamma (x)= F(x)\) for every \(x\in \mathbb {R}^d\).

  3. \(F_\gamma \) is \(\lambda _\gamma \)-smooth, i.e., there exists a constant \(\lambda _{\gamma }\ge 0\) such that for all \(x,y\in \mathbb {R}^d\)

    $$\begin{aligned} \Vert \nabla F_\gamma (x)-\nabla F_\gamma (y)\Vert _2\le \lambda _{\gamma } \Vert x-y\Vert _2,\quad \gamma \in (0,\gamma ^0). \end{aligned}$$

If \(\{F_\gamma :\gamma \in (0, \gamma ^0)\}\) are admissible envelopes of F, we can define the corresponding probability densities: For every \(x\in \mathbb {R}^d\) and every \(\gamma \in (0,\gamma ^0)\), we define

$$\begin{aligned}&\mathfrak {p}_\gamma (x) := \frac{e^{-F_\gamma (x)}}{\int _{\mathbb {R}^d}e^{-F_\gamma (z)}\, {\text {d}}z}. \end{aligned}$$
(7)

Remark 1

Definition 1.1 implies that \(\mathfrak {p}_\gamma \) can be normalized and (7) is thus well-defined. This observation and Definition 1.2 together imply, after an application of the dominated convergence theorem, that

$$\begin{aligned} \lim _{\gamma \rightarrow 0}\mathfrak {p}_\gamma (x) = \mathfrak {p}(x), \quad x\in \mathbb {R}^d, \end{aligned}$$

where the probability measures \(\mathfrak {p}\) and \(\mathfrak {p}_\gamma \) are defined in (6a) and (7), respectively. That is, \(\mathfrak {p}_\gamma \) converges weakly to \(\mathfrak {p}\) in the limit of \(\gamma \rightarrow 0\). In other words, we can use \(\mathfrak {p}_\gamma \) as a proxy for \(\mathfrak {p}\), provided that \(\gamma \) is sufficiently small. Finally, as we will see shortly, Definition 1.3 will help us establish the convergence of the ULA to an invariant distribution close to \(\mathfrak {p}_\gamma \), provided that the step size of the ULA is small.
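The convergence in Remark 1 can be observed numerically. The sketch below is an illustration (not from the paper) for a hypothetical one-dimensional example: it discretizes the density (7) for the smoothed potential \(F_\gamma (x)=x^2/2+{\text {dist}}(x,[-1,1])^2/(2\gamma )\) (this is in fact the MY envelope reviewed in the next section, with \(f(x)=x^2/2\) and g the indicator of \(\textrm{K}=[-1,1]\)) and checks that the probability mass escaping \(\textrm{K}\) shrinks as \(\gamma \rightarrow 0\).

```python
import numpy as np

dx = 1e-3
x = np.arange(-3.0, 3.0, dx)

def density(gamma):
    # discretized version of (7) for the smoothed potential
    # F_gamma(x) = x**2/2 + dist(x, [-1, 1])**2 / (2*gamma)
    F = 0.5 * x**2 + np.maximum(np.abs(x) - 1.0, 0.0)**2 / (2 * gamma)
    p = np.exp(-F)
    return p / (p.sum() * dx)   # normalize on the grid

# probability mass outside K = [-1, 1] for a decreasing sequence of gamma
mass_out = [np.sum(density(g)[np.abs(x) > 1.0]) * dx for g in (0.5, 0.1, 0.01)]

# the escaping mass shrinks as gamma -> 0, consistent with p_gamma -> p
assert mass_out[0] > mass_out[1] > mass_out[2]
assert mass_out[2] < 0.1
```
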

3 Moreau–Yosida envelope and its limitation

For \(\gamma >0\), let us define

$$\begin{aligned} F^{MY }_\gamma (x) := f(x) + g_{\gamma }(x), \end{aligned}$$
(MY)

where

$$\begin{aligned} g_{\gamma }(x):=\min _{z \in \mathbb {R}^{d}} \left\{ g(z)+\frac{1}{2\gamma }\Vert x-z\Vert _2^{2} \right\} \end{aligned}$$
(8)

is the Moreau–Yosida (MY) envelope of g. Somewhat inaccurately, we will also refer to \(F_\gamma ^{MY}\) as the MY envelope of F, to distinguish \(F^{MY}_\gamma \) from its alternative later in this paper. It is well-known that \(g_\gamma \) is \(\gamma ^{-1}\)-smooth and that \(g_\gamma \) converges pointwise to g in the limit of \(\gamma \rightarrow 0\). These facts enable us to establish the admissibility of MY envelopes, as detailed below. All proofs are deferred to the appendices. We note that the result below closely relates to Durmus et al. (2018, Proposition 1). In effect, the result below restates the result in Durmus et al. (2018) in the framework of admissible envelopes.

Proposition 1

(Admissibility of MY envelopes) Suppose that Assumption 1 is fulfilled. Then \(\{F_\gamma ^{MY }:\gamma > 0\}\) are admissible envelopes of F in (6b). In particular, \(\nabla F_\gamma ^{MY }\) is \((\lambda _2+\gamma ^{-1})\)-Lipschitz continuous, and given by the expression

$$\begin{aligned}&\nabla F_\gamma ^{MY }(x) = \nabla f(x) + \frac{x- P_{\gamma g}(x)}{\gamma }, \nonumber \\&P_{\gamma g}(x) := {{\,\mathrm{\hbox {arg min}}\,}}_{z\in \mathbb {R}^d} \left\{ g(z) + \frac{1}{2\gamma }\Vert x-z\Vert _2^2\right\} . \end{aligned}$$
(9)

Above, \(\lambda _2\) was defined in (3c), and \(P_{\gamma g}:\mathbb {R}^d\rightarrow \mathbb {R}^d\) is the proximal operator associated with the function \(\gamma g\).
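To make (8) and (9) concrete, the following sketch (an illustration, not from the paper) evaluates the MY envelope and the gradient formula (9) for a hypothetical one-dimensional example with \(f(x)=x^2/2\) and \(g=1_{[-1,1]}\). In this case the proximal operator reduces to projection (clipping) and \(g_\gamma \) is the scaled squared distance to \([-1,1]\); the computed gradient is checked against a central finite difference of \(F^{MY}_\gamma =f+g_\gamma \).

```python
import numpy as np

gamma = 0.1

def f(x):
    return 0.5 * x**2              # smooth part of the potential

def grad_f(x):
    return x

def prox_g(x):
    # prox of gamma*g for g = indicator of K = [-1, 1]:
    # the prox of an indicator is the Euclidean projection (here, clipping)
    return np.clip(x, -1.0, 1.0)

def g_my(x):
    # MY envelope (8) of the indicator: squared distance to K over 2*gamma
    return np.maximum(np.abs(x) - 1.0, 0.0)**2 / (2 * gamma)

def F_my(x):
    return f(x) + g_my(x)          # the MY envelope (MY) of F

def grad_F_my(x):
    # gradient formula (9)
    return grad_f(x) + (x - prox_g(x)) / gamma

# check (9) against central finite differences, away from the boundary of K
h = 1e-6
for x in [-2.0, -0.5, 0.3, 1.7]:
    fd = (F_my(x + h) - F_my(x - h)) / (2 * h)
    assert abs(fd - grad_F_my(x)) < 1e-4
```

Inside K the prox term vanishes and \(\nabla F^{MY}_\gamma =\nabla f\); outside K the correction \((x-P_{\gamma g}(x))/\gamma \) pulls the iterates back toward K.
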

Remark 2

(Connection to Nesterov’s smoothing technique) Alternatively, we can also view the MY envelope through the lens of Nesterov’s smoothing technique (Nesterov 2005). More specifically, if Assumption 1 is fulfilled, one can invoke a standard minimax theorem to verify that

$$\begin{aligned} g_\gamma (x) = \max _{z\in \mathbb {R}^d}\left\{ \langle x,z\rangle - g^*(z) - \frac{\gamma }{2}\Vert z\Vert _2^2 \right\} , \quad x\in \mathbb {R}^d, \end{aligned}$$

where \(g^*\) is the Fenchel conjugate of g. This interpretation is central to Nesterov’s technique for minimizing the non-smooth function F in (6b).

In view of the admissibility of \(F^{MY }_{\gamma }\) by Proposition 1, we can apply the ULA to the new (smooth) potential \(F^{MY }_{\gamma }\) instead of the non-smooth F. In particular, if \(\gamma \) is sufficiently small, then \(\mathfrak {p}^{MY}_\gamma \propto \exp (-F_{\gamma }^{MY})\) would be close to the target distribution \(\mathfrak {p}\) by Remark 1. This technique is known as the MYULA (Brosse et al. 2017).

However, a limitation of the MY envelope is that the minimizers of \(F^{MY}_{\gamma }\) are not necessarily the same as the minimizers of F. In turn, the MAP estimator of \(\mathfrak {p}_\gamma ^{MY}\), denoted by \(x^{MY}_\gamma \), might not coincide with the MAP estimator of \(\mathfrak {p}\), except in the limit of \(\gamma \rightarrow 0\). That is,

$$\begin{aligned} \lim _{\gamma \rightarrow 0} F_{\gamma }^{MY}(x_\gamma ^{MY})= \min _x F(x). \end{aligned}$$

This observation is particularly problematic because, as we will see later, very small values of \(\gamma \) are often avoided in practice due to numerical stability issues.

In view of this discussion, our objective is to replace the MY envelope with a new envelope that has the same minimizers as F for all sufficiently small \(\gamma \), not just in the limit of \(\gamma \rightarrow 0\).

4 Forward–backward envelope

In this section, we will study an envelope that addresses the limitations of the MY envelope. More specifically, for \(\gamma >0\), let us recall from Stella et al. (2017) that the forward–backward (FB) envelope of the function F in (6b) is defined as

$$\begin{aligned} F_\gamma ^{FB }(x)&:= f(x) -\frac{\gamma }{2}\Vert \nabla f(x)\Vert _2^2 \\&\quad + g_\gamma (x-\gamma \nabla f(x)), \end{aligned}$$
(FB)

where \(g_\gamma \) was defined in (8). A number of useful properties of \(F^{FB}_\gamma \) are collected below for the convenience of the reader, borrowed from Stella et al. (2017). Recall that \(P_{\gamma g}\) in (9) denotes the proximal operator associated with the function \(\gamma g\).

Proposition 2

(Properties of the FB envelope) Suppose that Assumption 1 is fulfilled. For \(\gamma \in (0,1/\lambda _2)\) and every \(x\in \mathbb {R}^d\), it holds that

  1. \(F( P_{\gamma g}(x-\gamma \nabla f(x))) \le F_\gamma ^{FB }(x) \le F(x) \), which relates the function F to its FB envelope.

  2. \(F_{\frac{\gamma }{1-\gamma \lambda _2}}^{MY }(x) \le F_\gamma ^{FB }(x) \le F_\gamma ^{MY }(x)\), which relates the MY and FB envelopes of the function F.

  3. \(F_\gamma ^{FB }\) is continuously differentiable and its gradient is given by

    $$\begin{aligned} \nabla F_\gamma ^{FB }(x)= \frac{(I-\gamma \nabla ^2 f(x))}{\gamma } (x - P_{\gamma g}(x-\gamma \nabla f(x))). \end{aligned}$$
  4. \({{\,\mathrm{\hbox {arg min}}\,}}F^{FB}_\gamma = {{\,\mathrm{\hbox {arg min}}\,}}F\), i.e., the function F and its FB envelope have the same minimizers.

In view of Proposition 2.4, a remarkable property of the FB envelope is that the modes of the distribution \(\mathfrak {p}_\gamma ^{FB}\propto \exp (-F^{FB}_\gamma )\) coincide with the modes of the target distribution \(\mathfrak {p}\propto \exp (-F)\), for all sufficiently small \(\gamma \), rather than only in the limit of \(\gamma \rightarrow 0\). Indeed, very small values of \(\gamma \) are often avoided in practice due to numerical stability issues. This observation signifies the advantage of the FB envelope over the MY envelope. Recall that the modes of \(\mathfrak {p}^{MY}_\gamma \) coincide with those of \(\mathfrak {p}\) only in the limit of \(\gamma \rightarrow 0\); see Sect. 3.
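This mode-preservation property, and the corresponding shift incurred by the MY envelope, can be observed numerically. Below is an illustrative sketch (not from the paper) for a hypothetical one-dimensional example with \(f(x)=(x-2)^2/2\), \(\textrm{K}=[-1,1]\) and \(g=1_\textrm{K}\), so that the true MAP estimator is the boundary point \(x=1\); a grid search locates the minimizers of the two envelopes.

```python
import numpy as np

gamma = 0.2                      # in (0, 1/lambda_2); here lambda_2 = 1
x = np.arange(-2.0, 3.0, 1e-3)   # grid for a crude argmin search

def f(t):
    return 0.5 * (t - 2.0)**2    # smooth part; unconstrained minimum at t = 2

def grad_f(t):
    return t - 2.0

def g_my(t, gam):
    # MY envelope of the indicator of K = [-1, 1]
    return np.maximum(np.abs(t) - 1.0, 0.0)**2 / (2 * gam)

F_my = f(x) + g_my(x, gamma)                         # the (MY) envelope
F_fb = f(x) - gamma / 2 * grad_f(x)**2 \
       + g_my(x - gamma * grad_f(x), gamma)          # the (FB) envelope

x_my = x[np.argmin(F_my)]
x_fb = x[np.argmin(F_fb)]

# the true MAP is the boundary point x = 1; the MY envelope overshoots it,
# while the FB envelope recovers it (Proposition 2.4)
assert abs(x_fb - 1.0) < 1e-2
assert x_my > 1.05               # analytically, x_my = (2*gamma+1)/(gamma+1)
```

For this example one can solve the first-order conditions by hand: the MY minimizer is \((2\gamma +1)/(\gamma +1)\approx 1.17\) for \(\gamma =0.2\), whereas the FB minimizer is exactly 1.
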

As a side note, let us remark that the proximal gradient descent algorithm for minimizing the (non-smooth) function F coincides with gradient descent (with a variable metric) for minimizing the (smooth) function \(F^{FB}_\gamma \), whenever \(\gamma \) is sufficiently small (Stella et al. 2017). It is also easy to use Proposition 2 to check the admissibility of the FB envelopes, as summarized below.
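The gradient formula in Proposition 2.3 can likewise be sanity-checked by finite differences. The sketch below does so for the same hypothetical one-dimensional example (\(f(x)=(x-2)^2/2\), \(g=1_{[-1,1]}\)), avoiding the points where the prox switches branches.

```python
import numpy as np

gamma = 0.2                      # in (0, 1/lambda_2); here lambda_2 = 1

def f(t):      return 0.5 * (t - 2.0)**2   # smooth part; hessian = 1
def grad_f(t): return t - 2.0
def hess_f(t): return 1.0
def prox_g(t): return np.clip(t, -1.0, 1.0)   # prox of gamma*1_K, K = [-1, 1]
def g_my(t):   return np.maximum(np.abs(t) - 1.0, 0.0)**2 / (2 * gamma)

def F_fb(t):
    # definition (FB) of the forward-backward envelope
    return f(t) - gamma / 2 * grad_f(t)**2 + g_my(t - gamma * grad_f(t))

def grad_F_fb(t):
    # gradient formula of Proposition 2.3, specialized to the scalar case
    u = t - gamma * grad_f(t)    # forward (gradient) step
    return (1.0 - gamma * hess_f(t)) / gamma * (t - prox_g(u))

h = 1e-6
for t in [-1.2, 0.0, 0.5, 1.5, 2.5]:
    fd = (F_fb(t + h) - F_fb(t - h)) / (2 * h)
    assert abs(fd - grad_F_fb(t)) < 1e-4
```
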

Proposition 3

(Admissibility of FB envelopes) Suppose that Assumption 1 is fulfilled. Then \(\{F_\gamma ^{FB }:\gamma \in (0,\gamma ^{FB })\}\) are admissible envelopes of F in (6b), where

$$\begin{aligned} \gamma ^{FB }:=\frac{1}{2 \lambda _2+ 2\lambda _3(\lambda _0+R)}. \end{aligned}$$
(10)

Moreover,

$$\begin{aligned}&\left\Vert \nabla F^{FB}_\gamma (x) - \nabla F^{FB}_\gamma (y) \right\Vert_2 \nonumber \\&\quad \le \lambda _\gamma ^{FB} \Vert x-y\Vert _2, \end{aligned}$$
(11a)
$$\begin{aligned}&\langle x-y,\nabla F^{FB}_\gamma (x)-\nabla F^{FB}_\gamma (y) \rangle \nonumber \\&\quad \ge \mu ^{FB}_\gamma \Vert x-y\Vert _2^2, \end{aligned}$$
(11b)

for every \(x,y\in \mathbb {R}^d\), where the second inequality additionally requires that \(\Vert x-y\Vert _2 \ge \rho _\gamma ^{FB}\). Lastly,

$$\begin{aligned} \lambda ^{FB }_\gamma&:= \gamma ^{-1}+2\lambda _2+\lambda _3(\lambda _0+R),\\ \mu ^{FB}_\gamma&:= \lambda _2+ \lambda _3(\lambda _0+R), \\ \rho _\gamma ^{FB}&:= \frac{2R}{1-2\gamma ( \lambda _2+ \lambda _3(\lambda _0+R))}. \end{aligned}$$

Equation (11) provides valuable information about the landscape of the FB envelope of F, which we now summarize: (11a) means that \(F^{FB}_\gamma \) is a \(\lambda _\gamma ^{FB}\)-smooth function. The smoothness of \(F^{FB}_\gamma \) in (FB) is not surprising since both f and \(g_\gamma \) are smooth functions. (Recall that \(g_\gamma \) is the MY envelope of g, which is known to be \(\gamma ^{-1}\)-smooth.) Moreover, even though \(F^{FB}_\gamma \) is not necessarily a strongly convex function, (11b) implies that \(F^{FB}_\gamma \) behaves like a strongly convex function over long distances. As detailed in the proof, (11b) holds essentially because the MY envelope of the indicator function \(1_\textrm{K}\) is the “quadratic” function \(\tfrac{1}{2\gamma }{\text {dist}}(\cdot ,\textrm{K})^2\), which grows quadratically far away from the set \(\textrm{K}\). Here, \({\text {dist}}(\cdot ,\textrm{K})\) is the distance to the set \(\textrm{K}\).

It is worth noting that a similar result to Proposition 3 is implicit in Brosse et al. (2017) for the MY envelope. That is, the MY envelope \(F^{MY}_\gamma \) also satisfies (11), albeit with different constants.

Remark 3

(Convergence in the Wasserstein metric) Recall from Remark 1 that \(\mathfrak {p}_\gamma ^{FB}\) converges weakly to \(\mathfrak {p}\) in the limit of \(\gamma \rightarrow 0\). This weak convergence implies convergence in the Wasserstein metric by Bou-Rabee and Eberle (2020, Lemma 2.6), i.e.,

$$\begin{aligned} \lim _{\gamma \rightarrow 0} \textrm{W}_1(\mathfrak {p}_\gamma ^{FB},\mathfrak {p})=0. \end{aligned}$$
(12)

We recall here that, for two probability measures \(\mathfrak {q}_1\) and \(\mathfrak {q}_2\) that satisfy \(\mathbb {E}_{x\sim \mathfrak {q}_1}\Vert x\Vert _2<\infty \) and \(\mathbb {E}_{y\sim \mathfrak {q}_2}\Vert y\Vert _2 <\infty \), their 1-Wasserstein or Kantorovich distance (Villani 2009) is

$$\begin{aligned} \textrm{W}_1(\mathfrak {q}_1,\mathfrak {q}_2):= \inf _{\begin{array}{c} x\sim \mathfrak {q}_1\\ y\sim \mathfrak {q}_2 \end{array}} \mathbb {E}\Vert x-y\Vert _2. \end{aligned}$$
(13)

With some abuse of notation, throughout this work, we will occasionally replace the probability measures above with the corresponding probability distributions or random variables.
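As an aside, in one dimension the infimum in (13) between two equal-size empirical samples is attained by the monotone (sorted) coupling, which yields a simple estimator of \(\textrm{W}_1\). The sketch below, with hypothetical Gaussian data, illustrates this; for two Gaussians of equal variance the exact \(\textrm{W}_1\) is the shift between their means.

```python
import numpy as np

def w1_empirical(u, v):
    # 1-D empirical Wasserstein-1 distance between equal-size samples:
    # the optimal coupling in 1-D matches sorted samples monotonically
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v))

rng = np.random.default_rng(0)
n = 50_000
a = rng.normal(0.0, 1.0, n)
b = rng.normal(0.5, 1.0, n)      # the same law, shifted by 0.5

est = w1_empirical(a, b)
# W1 between N(0,1) and N(0.5,1) is exactly 0.5 (constant quantile shift)
assert abs(est - 0.5) < 0.05
```
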

A non-asymptotic version of Remark 3 is presented below, which bounds the Wasserstein distance between the two probability measures \(\mathfrak {p}^{FB}_\gamma \) and \(\mathfrak {p}\). The result below is the analogue for the FB envelope of Brosse (2017, Proposition 5), which was established for the MY envelope. The key ingredient of their result is Steiner's formula for the volume of the set \(\textrm{K}+\textrm{B}(0,t)=\{x: {\text {dist}}(x,\textrm{K})\le t\}\), where the sum is in the Minkowski sense. Essentially, our proof strategy is to use Proposition 2.2 to relate the FB and MY envelopes and then invoke Brosse (2017, Proposition 5).

Theorem 1

(Wasserstein distance between \(\mathfrak {p}^{FB}_\gamma \) and \(\mathfrak {p}\)) Suppose that Assumption 1 is fulfilled. For \(\gamma \in (0,\gamma ^{FB })\), it holds that

$$\begin{aligned}&\textrm{W}_1(\mathfrak {p}_\gamma ^{FB},\mathfrak {p})\nonumber \\&\le c_1+ \frac{ c_2 R I_1(\gamma )+ c_2 I_2(\gamma )+ c_3 I_2(\gamma /(1-\gamma \lambda _2)) }{\text {d}(\textrm{K})+I_1(\gamma )}+ c_4 R\nonumber \\&=: c_5^{FB}(\gamma ), \end{aligned}$$
(14)

where

$$\begin{aligned}&c_1:= \tfrac{e^{2\max _{x\in \textrm{K}} g(x)-\gamma \lambda _2 \min _z f(z)} \int _{\textrm{K}} \Vert x\Vert _2 e^{-(1-\gamma \lambda _2)f(x)} \, {\text {d}}x }{\int e^{-f(x)-\frac{{\text {dist}}(x,\textrm{K})^2}{2\gamma }}\,{\text {d}}x}\nonumber \\&\,\,\qquad -\tfrac{e^{\gamma \lambda _2 \min _z f(z)-2\max _{x\in \textrm{K}} g(x)} \int _\textrm{K}\Vert x\Vert _2 e^{-f(x)}\, {\text {d}}x}{\max \left( \int e^{-f(x) - \frac{{\text {dist}}(x,\textrm{K})^2}{2\gamma }}\, {\text {d}}x, \int e^{-(1-\gamma \lambda _2)f(x) - \frac{{\text {dist}}(x,\textrm{K})^2}{2\gamma /(1-\gamma \lambda _2)}}\, {\text {d}}x\right) }, \nonumber \\&c_2 := e^{\max _x f(x)-\min _x f(x)}, \nonumber \\&c_3:= e^{\max _x f(x) - \min _x f(x) +2\max _{x\in \textrm{K}}g(x)}, \nonumber \\&c_4:= e^{\max _{x\in \textrm{K}}g(x)-\min _{x\in \textrm{K}}g(x) }\nonumber \\&\,\,\qquad -e^{\min _{x\in \textrm{K}}g(x)-\max _{x\in \textrm{K}} g(x)}, \nonumber \\&I_1(\gamma ) := \sum _{i=0}^{d-1} \text {d}_{i}(\textrm{K}) \cdot (2\pi \gamma )^{\frac{d-i}{2}},\nonumber \\&I_2(\gamma ):= \sum _{i=0}^{d-1} \text {d}_i(\textrm{K})\cdot (2\pi \gamma )^{\frac{d-i}{2}} \left( \sqrt{\gamma (d-i+3)}+R \right). \end{aligned}$$
(15)

Above, \(\text {d}_i(\textrm{K})\) is the i-th intrinsic volume of \(\textrm{K}\); see Klain and Rota (1997). In particular, the d-th intrinsic volume of \(\textrm{K}\) coincides with the standard volume of \(\textrm{K}\), i.e., \(\text {d}_d(\textrm{K})=\text {d}(\textrm{K})\). Moreover, to keep the notation light, we have suppressed above the dependence of \(c_1,\ldots ,c_5\) on \(\textrm{K},f,g,\gamma \).

As a sanity check, consider the special case of \(g=1_\textrm{K}\), where \(1_\textrm{K}\) is the indicator function for the set \(\textrm{K}\). Then we can use (15) to verify that \(c_1\), \(I_1(\gamma )\), \(I_2(\gamma )\) and \(I_2(\gamma /(1-\gamma \lambda _2))\) all vanish when we send \(\gamma \rightarrow 0\). Consequently, both the left- and right-hand sides of (14) vanish if we send \(\gamma \rightarrow 0\). When \(g=1_\textrm{K}\), Theorem 1 is precisely the analogue of Brosse (2017, Proposition 5) for the FB envelope. Their work, however, does not cover the case of \(g\ne 1_\textrm{K}\).

When \(g\ne 1_\textrm{K}\) and \(\gamma \rightarrow 0\), the right-hand side of (14) converges to the nonzero value

$$\begin{aligned}&\left(e^{2\max _{x\in \textrm{K}} g(x)} - e^{-2\max _{x\in \textrm{K}} g(x)}\right) \frac{\int _\textrm{K}\Vert x\Vert _2e^{-f(x)} \, {\text {d}}x }{\int e^{-f(x)}\, {\text {d}}x} \\&\quad + \left( \frac{e^{\max _{x\in \textrm{K}} g(x)}}{e^{\min _{x\in \textrm{K}} g(x)}} -\frac{e^{\min _{x\in \textrm{K}} g(x)}}{ e^{\max _{x\in \textrm{K}} g(x)}}\right), \end{aligned}$$

unlike the left-hand side of (14), which converges to zero by (12). Improving (14) in the case \(g\ne 1_\textrm{K}\) appears to require highly restrictive assumptions on g, which we wish to avoid here. Loosely speaking, the technical difficulty can be described as follows: as the volume is obtained by integrating the indicator function, the Steiner formula is particularly suited to the case \(g=1_\textrm{K}\). When \(g\ne 1_\textrm{K}\), we are not aware of an analogue of Steiner's formula. At any rate, very small values of \(\gamma \) are often avoided in practice due to numerical stability issues. In this sense, improving (14) for very small values of \(\gamma \) might have limited practical value.

To summarize this section, we established that the FB envelopes \(\{F^{FB}_\gamma : \gamma \in (0,\gamma ^{FB})\}\) are admissible and we can thus use them as a differentiable proxy for the non-smooth function F in (6b). Crucially, the FB envelope addresses the key limitation of the MY envelope: the modes of \(\mathfrak {p}^{FB}_\gamma \propto \exp (-F^{FB}_\gamma )\) coincide with the modes of \(\mathfrak {p}\propto \exp (-F)\) for all sufficiently small \(\gamma \), rather than only in the limit of \(\gamma \rightarrow 0\).

5 EULA: envelope unadjusted Langevin algorithm

We have so far introduced two smooth envelopes for the non-smooth function F in (6b), namely, the MY envelope \(F^{MY}_\gamma \) in (MY) and the FB envelope \(F^{FB}_\gamma \) in (FB). We also described in Sect. 4 the advantage of the FB envelope over the MY envelope. To keep our discussion general, below we consider admissible envelopes \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\) for the target function F in (6b); see Definition 1. Our discussion below can be specialized to either envelope by setting \(F_\gamma =F^{MY}_\gamma \) or \(F_\gamma =F^{FB}_\gamma \).

For the time being, let us fix \(\gamma \in (0,\gamma ^0)\). Unlike F, note that \(\nabla F_\gamma \) exists and is Lipschitz continuous by Definition 1.3. We can now use the ULA (Dalalyan and Karagulyan 2019) to sample from \(\mathfrak {p}_\gamma \propto \exp (-F_\gamma )\), as a proxy for the target measure \(\mathfrak {p}\propto \exp (-F)\). The k-th iteration of the resulting algorithm is

$$\begin{aligned} x_{k+1} = x_k - h \nabla F_{\gamma }(x_k)+ \sqrt{2h} \zeta _{k+1}, \end{aligned}$$
(16)

where h is the step size and \(\zeta _{k+1} \in \mathbb {R}^d\) is a standard Gaussian random vector, independent of \(\{\zeta _i\}_{i\le k}\). In particular, if we choose \(F_\gamma =F^{MY}_\gamma \), then (16) coincides with the MYULA from Brosse et al. (2017).

Under standard assumptions, to be reviewed later, the Markov chain \(\{x_k\}_{k\ge 0}\) in (16) has a unique invariant probability measure, which we denote by \(\widehat{\mathfrak {p}}_{\gamma ,h}\). There are two sources of error that contribute to the difference between \(\widehat{\mathfrak {p}}_{\gamma ,h}\) and the target measure \(\mathfrak {p}\) in (6a), which we list below:

  1. First, note that (16) is only intended to sample from \(\mathfrak {p}_\gamma \), as a proxy for the target distribution \(\mathfrak {p}\). That is, the first source of error is the difference between the two probability measures \(\mathfrak {p}_\gamma \) and \(\mathfrak {p}\).

  2. Second, the step size h is known to contribute to the difference between the two probability measures \(\widehat{\mathfrak {p}}_{\gamma ,h}\) and \(\mathfrak {p}_\gamma \); see Dalalyan and Karagulyan (2019). This bias vanishes only in the limit \(h\rightarrow 0\).

In fact, instead of (16), we study here a slightly more general algorithm that allows \(\gamma \) and h to vary. More specifically, for a nonincreasing sequence \(\{\gamma _k\}_{k\ge 0}\) and step sizes \(\{h_k\}_{k\ge 0}\), the k-th iteration of this more general algorithm is

$$\begin{aligned} x_{k+1} = x_k - h_k \nabla F_{\gamma _k}(x_k) + \sqrt{2h_k} \zeta _{k+1}, \end{aligned}$$
(EULA)

where \(\zeta _{k+1} \in \mathbb {R}^d\) is a standard Gaussian random vector independent of \(\{\zeta _i\}_{i\le k}\). (EULA) stands for Envelope Unadjusted Langevin Algorithm.

In particular, if we set \(\gamma _k=\gamma \) in (EULA) for every \(k\ge 0\), then we retrieve (16). Alternatively, if \(\{\gamma _k\}_{k\ge 0}\) is a decreasing sequence, then \(F_{\gamma _k}\) becomes an increasingly better approximation of the target potential function F as k increases; see Definition 1.2. That is, (EULA) uses increasingly better approximations of the potential function F as k increases.
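The iteration above can be sketched as a short loop over the schedules \(\{\gamma _k\}\) and \(\{h_k\}\); the example gradient below uses the known closed form of the MY envelope of a quadratic (for \(F(x)=\Vert x\Vert ^2/2\), \(F^{MY}_\gamma (x)=\Vert x\Vert ^2/(2(1+\gamma ))\)), while the particular schedules are arbitrary placeholders.

```python
import numpy as np

def eula(x0, grad_F, gammas, hs, rng):
    """Run (EULA): x_{k+1} = x_k - h_k * grad F_{gamma_k}(x_k) + sqrt(2 h_k) * zeta_{k+1}.

    grad_F(gamma, x) evaluates the envelope gradient at smoothing level gamma."""
    x = np.asarray(x0, dtype=float)
    for gamma_k, h_k in zip(gammas, hs):
        zeta = rng.standard_normal(x.shape)
        x = x - h_k * grad_F(gamma_k, x) + np.sqrt(2.0 * h_k) * zeta
    return x

# MY envelope gradient of the quadratic F(x) = ||x||^2 / 2.
grad_my = lambda gamma, x: x / (1.0 + gamma)

gammas = 0.5 * 0.99 ** np.arange(1000)   # nonincreasing gamma_k, as in the text
hs = np.full(1000, 1e-2)                 # constant step sizes (a special case)
rng = np.random.default_rng(1)
x_final = eula(np.zeros(3), grad_my, gammas, hs, rng)
```

Setting `gammas` to a constant array recovers the fixed-\(\gamma \) iteration (16).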

We next present the iteration complexity of (EULA) for admissible envelopes \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\), where admissibility was defined in Definition 1. The result below can be specialized to the MY and FB envelopes by setting \(F_\gamma =F^{MY}_\gamma \) or \(F_\gamma =F^{FB}_\gamma \), respectively.

Theorem 2

(Iteration complexity of (EULA)) For \(\gamma ^0>0\), consider admissible envelopes \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\) of F in (6b); see Definition 1. For \(\mu _\gamma >0\) and \(\rho _\gamma \ge 0\), we additionally assume that \(F_\gamma \) satisfies

$$\begin{aligned}&\langle x-y,\nabla F_\gamma (x)-\nabla F_\gamma (y) \rangle \ge \mu _\gamma \Vert x-y\Vert _2^2, \end{aligned}$$
(17)

for all x, y with \(\Vert x-y\Vert _2\ge \rho _\gamma \) and every \(\gamma \in (0,\gamma ^0)\). Consider two sequences \(\{\gamma _k\}_{k\ge 0}\subset (0,\gamma ^0)\) and \(\{h_k\}_{k\ge 0} \subset \mathbb {R}_+\). For the algorithm (EULA), let \(\mathfrak {q}_k\) denote the law of \(x_k\) for every integer \(k\ge 0\); that is, \(x_k \sim \mathfrak {q}_k\). Then the \(\textrm{W}_1\) distance between \(\mathfrak {q}_k\) and the target measure \(\mathfrak {p}\propto e^{-F}\) in (6a) is bounded by

$$\begin{aligned} \textrm{W}_1(\mathfrak {q}_{k},\mathfrak {p})&\le e^{c_6c_7} \prod _{i=0}^{k-1} (1-c_8 h_i) \cdot \textrm{W}_1(\mathfrak {q}_0,\mathfrak {p}_{\gamma _{0}}) \\&\quad + e^{c_6c_7}\sum _{i=0}^{k-1} \alpha _i \prod _{j=i+1}^{k-1}(1-c_8h_j) + \textrm{W}_1(\mathfrak {p}_{\gamma _k},\mathfrak {p}), \end{aligned}$$

for every \(k\ge 0\), provided that

$$\begin{aligned} h_k \le \frac{1}{\lambda _{\gamma _k}} \min \left( \frac{1}{6}, \frac{\mu _{\gamma _k}}{\lambda _{\gamma _k}} , \frac{\lambda _{\gamma _k} \rho _{\gamma _k}^2}{3}, \frac{c_0^2}{970 \lambda _{\gamma _k} \rho _{\gamma _k}^2}\right). \end{aligned}$$
(18)

Above, \(c_0\ge 0.007\) is a universal constant specified in Eberle and Majka (2019, Eq. (6.6)). Moreover,

$$\begin{aligned}&c_6 := (1+h_k\lambda _{\gamma _{k}})\rho _{\gamma _k} \le 7 \rho _{\gamma _k}/6, \quad c_7:= 7 \lambda _{\gamma _k}\rho _{\gamma _k}/c_0, \\&c_8 := \min \left( \frac{\mu _{\gamma _k}}{2},\frac{245}{24c_0} (\lambda _{\gamma _k} \rho _{\gamma _k})^2 \right) e^{-\frac{49}{6c_0} \lambda _{\gamma _k} \rho _{\gamma _k}^2}, \\&\alpha _k := \lambda _{\gamma _{k}} \sqrt{h_k^3 d} \cdot \left( \sqrt{h_k \lambda _{\gamma _{k}} } +\sqrt{2}\right) \\&\qquad +\textrm{W}_1(\mathfrak {p}_{\gamma _k},\mathfrak {p})+\textrm{W}_1(\mathfrak {p}_{\gamma _{k+1}},\mathfrak {p}), \quad k\ge 0. \end{aligned}$$

When \(\rho _\gamma =0\), note that (17) requires \(F_\gamma \) to be \(\mu _\gamma \)-strongly convex for every \(\gamma \in (0,\gamma ^0)\). This can happen, for example, when f itself is strongly convex. A shortcoming of the step-size condition (18) in Theorem 2 is that it forces \(h_k=0\) whenever \(\rho _{\gamma _k}=0\). It becomes clear from our proof that we have inherited this shortcoming from Eberle and Majka (2019); see Proposition 4. Their work also offers a more involved result that does not suffer from this issue; see Eberle and Majka (2019, Theorem 2.12). In this work, however, we opted for their simplified result in Proposition 4, and this aspect could be improved.
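The step-size condition (18) and the constants \(c_6, c_7, c_8\) are straightforward to evaluate numerically; in the sketch below, the values of \(\lambda _\gamma \), \(\mu _\gamma \), \(\rho _\gamma \) are arbitrary placeholders chosen for illustration, not quantities from the paper.

```python
import numpy as np

C0 = 0.007  # lower bound on the universal constant of Eberle and Majka (2019, Eq. (6.6))

def max_step_size(lam, mu, rho, c0=C0):
    """Right-hand side of the step-size condition (18)."""
    return (1.0 / lam) * min(1.0 / 6.0, mu / lam,
                             lam * rho**2 / 3.0,
                             c0**2 / (970.0 * lam * rho**2))

def constants(lam, mu, rho, h, c0=C0):
    """The constants c_6, c_7, c_8 defined after Theorem 2."""
    c6 = (1.0 + h * lam) * rho
    c7 = 7.0 * lam * rho / c0
    c8 = (min(mu / 2.0, 245.0 / (24.0 * c0) * (lam * rho) ** 2)
          * np.exp(-49.0 / (6.0 * c0) * lam * rho**2))
    return c6, c7, c8

# Placeholder values lambda_gamma = mu_gamma = 1, rho_gamma = 0.1.
h_max = max_step_size(lam=1.0, mu=1.0, rho=0.1)
c6, c7, c8 = constants(lam=1.0, mu=1.0, rho=0.1, h=h_max)
```

Note that the third term of (18) vanishes as \(\rho \rightarrow 0\), which numerically reproduces the shortcoming discussed above.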

A particularly important special case of Theorem 2 is when we choose the FB envelope, and use a fixed \(\gamma \) and step size h.

Corollary 1

(Iteration complexity of (EULA) for the FB envelope) Suppose that Assumption 1 is fulfilled. For the algorithm (EULA), suppose that \(\gamma _k=\gamma \in (0,\gamma^{FB})\), \(F_{\gamma _k}=F^{FB}_\gamma \), and \(h_k=h>0\) for every integer \(k\ge 0\); see (FB) and (10). In (EULA), also let \(\mathfrak {q}_k\) denote the law of \(x_k\) for every \(k\ge 0\); that is, \(x_k \sim \mathfrak {q}_k\). Then the \(\textrm{W}_1\) distance between \(\mathfrak {q}_k\) and the target measure \(\mathfrak {p}\propto e^{-F}\) in (6a) is bounded by

$$\begin{aligned} \textrm{W}_1(\mathfrak {q}_{k},\mathfrak {p})&\le e^{c_6^{FB} c_7^{FB}} (1-c_8^{FB} h)^k \cdot \textrm{W}_1(\mathfrak {q}_0,\mathfrak {p}_{\gamma }) \nonumber \\&\quad + \frac{\alpha ^{FB} e^{c_6^{FB} c_7^{FB}}}{c_8^{FB} h}+ c_5^{FB}, \end{aligned}$$
(19)

for every \(k\ge 0\), provided that \(\gamma \in (0,\gamma ^0)\), and

$$\begin{aligned} h \le \frac{1}{\lambda _{\gamma }^{FB}} \min \left( \frac{1}{6}, \frac{\mu _{\gamma }^{FB}}{\lambda _{\gamma }^{FB}} , \frac{\lambda _{\gamma }^{FB} (\rho _{\gamma }^{FB})^2}{3}, \frac{c_0^2}{970 \lambda _{\gamma }^{FB} (\rho _{\gamma }^{FB})^2}\right). \end{aligned}$$

Above, \(c_0\ge 0.007\) is a universal constant specified in Eberle and Majka (2019, Eq. (6.6)). Moreover,

$$\begin{aligned}&c_6^{FB} := (1+h\lambda _{\gamma }^{FB})\rho _{\gamma }^{FB} \le 7 \rho _{\gamma }^{FB}/6, \\&c_7^{FB} := 7 \lambda _{\gamma }^{FB} \rho _{\gamma }^{FB}/c_0, \\&c_8^{FB} := \min \left( \frac{\mu _{\gamma }^{FB}}{2},\frac{245}{24c_0} (\lambda _{\gamma }^{FB} \rho _{\gamma }^{FB})^2 \right) e^{-\frac{49}{6c_0} \lambda _{\gamma }^{FB} (\rho _{\gamma }^{FB})^2},\\&\alpha ^{FB} := \lambda _{\gamma }^{FB} \sqrt{h^3 d} \cdot \left( \sqrt{h \lambda _{\gamma }^{FB} } +\sqrt{2}\right) +2c_5^{FB}. \end{aligned}$$

The remaining quantities were defined in Propositions 1 and 3.

We remark that Corollary 1 for the FB envelope is the analogue of Brosse et al. (2017, Proposition 7) for the MY envelope. Note, however, that Brosse et al. (2017, Proposition 7) requires f to be strongly convex, whereas we merely assume f to be convex; see Assumption 1.

6 Proof of Theorem 2

To begin, we let \(Q_k\) denote the Markov transition kernel associated with the Markov chain \(\{x_k\}_{k\ge 0}\). This transition kernel is specified as

$$\begin{aligned} Q_k(x,\cdot ) := \text {Normal}(x-h_k\nabla F_{\gamma _k}(x),2h_{k}I_d), \end{aligned}$$
(20)

where \(\text {Normal}(a,B)\) denotes the Gaussian probability measure with mean \(a\in \mathbb {R}^d\) and covariance matrix \(B\in \mathbb {R}^{d\times d}\). Above, note that \(Q_k\) depends on both \(h_k\) and \(\gamma _k\). We also let \(\mathfrak {q}_k\) denote the law of \(x_k\), i.e., \(x_k \sim \mathfrak {q}_k\). Using standard notation, we can now write that

$$\begin{aligned} \mathfrak {q}_{k+1}=\mathfrak {q}_k Q_{{k}}. \end{aligned}$$
(21)

To be precise, (21) is equivalent to

$$\begin{aligned} \mathfrak {q}_{k+1}({\text {d}}y) = \int \mathfrak {q}_{k}({\text {d}}x) Q_{k}( x,{\text {d}}y), \end{aligned}$$
(22)

understood as an identity between measures on \(\mathbb {R}^d\). Recall that \(\mathfrak {p}_{\gamma _{k+1}}\) serves as a proxy for the target probability measure \(\mathfrak {p}\). The \(\textrm{W}_1\) distance between \(\mathfrak {q}_{k+1}\) and \(\mathfrak {p}_{\gamma _{k}}\) can be bounded as

$$\begin{aligned} \textrm{W}_1(\mathfrak {q}_{k+1},\mathfrak {p}_{\gamma _{k}}) = \textrm{W}_1(\mathfrak {q}_{k}Q_{k},\mathfrak {p}_{\gamma _{k}}) \overset{(i)}{\le } \textrm{W}_1( \mathfrak {q}_{k}Q_{k},\mathfrak {p}_{\gamma _{k}}Q_{k}) + \textrm{W}_1( \mathfrak {p}_{\gamma _{k}}Q_{k}, \mathfrak {p}_{\gamma _{k}}), \end{aligned}$$
(23)

where (i) uses the triangle inequality. We separately control each \(\textrm{W}_1\) distance in the last line above. For the first distance, we plan to invoke Theorem 2.12 from Eberle and Majka (2019), reviewed below for the convenience of the reader. It is worth noting that a similar result to the one below appears in Majka et al. (2020, Corollary 2.4).

Proposition 4

(Eberle and Majka 2019, Theorem 2.12) Let

$$\begin{aligned}&c_6 := (1+h_k\lambda _{\gamma _{k}})\rho _{\gamma _k} \le 7 \rho _{\gamma _k}/6,\quad c_7:= 7 \lambda _{\gamma _k}\rho _{\gamma _k}/c_0, \\&\quad \theta (r) :=\int _0^r e^{-c_7\min (s,r_1)} {\text {d}}s, \quad \Theta (x,y):= \theta (\Vert x-y\Vert _2),\\&c_8 := \min \left( \frac{\mu _{\gamma _k}}{2},\frac{245}{24c_0} (\lambda _{\gamma _k} \rho _{\gamma _k})^2 \right) e^{-\frac{49}{6c_0} \lambda _{\gamma _k} \rho _{\gamma _k}^2}, \\&\quad h_k \le \frac{1}{\lambda _{\gamma _k}} \min \left( \frac{1}{6}, \frac{\mu _{\gamma _k}}{\lambda _{\gamma _k}} , \frac{\lambda _{\gamma _k} \rho _{\gamma _k}^2}{3}, \frac{c_0^2}{970 \lambda _{\gamma _k} \rho _{\gamma _k}^2}\right), \end{aligned}$$

where \(c_0\ge 0.007\) is a universal constant specified in Eberle and Majka (2019, Eq. (6.6)). Then it holds that

$$\begin{aligned} \textrm{W}_\Theta (\mathfrak {q}_{k}Q_{k},\mathfrak {p}_{\gamma _{k}}Q_{k}) \le (1-c_8 h_k) \cdot \textrm{W}_\Theta (\mathfrak {q}_k,\mathfrak {p}_{\gamma _{k}}), \end{aligned}$$
(24)

where \(\textrm{W}_\Theta \) is defined similarly to (13), with the \(\ell _2\)-norm replaced by \(\Theta \). Above, to keep the notation light, we have suppressed the dependence of \(c_6\), \(c_7\), and \(c_8\) on \(\textrm{K},f,g,h_k\). Moreover, the two metrics \(\textrm{W}_\Theta \) and \(\textrm{W}_1\) are related as

$$\begin{aligned} e^{-c_6c_7} \textrm{W}_1 \le \textrm{W}_\Theta \le \textrm{W}_1. \end{aligned}$$
(25)

As for the second \(\textrm{W}_1\) distance in the last line of (23), the following result is standard; see the appendix for its proof.

Lemma 1

(Discretization error) It holds that

$$\begin{aligned}&\textrm{W}_1( \mathfrak {p}_{\gamma _{k}}Q_{k}, \mathfrak {p}_{\gamma _{k}}) \nonumber \\&\le c_9 :=\lambda _{\gamma _{k}} \sqrt{h_k^3 d} \cdot \left( \sqrt{h_k \lambda _{\gamma _{k}} } +\sqrt{2}\right). \end{aligned}$$
(26)

In fact, it is more common to write the left-hand side of (26) in terms of the Markov transition kernel of the corresponding Langevin diffusion, as discussed in the proof of Lemma 1. By combining Proposition 4 and Lemma 1, we can now revisit (23) and write that

$$\begin{aligned}&\textrm{W}_\Theta (\mathfrak {q}_{k+1},\mathfrak {p}_{\gamma _{k}}) \nonumber \\&\overset{(i)}{\le } \textrm{W}_\Theta ( \mathfrak {q}_{k}Q_{k},\mathfrak {p}_{\gamma _{k}}Q_{k}) + \textrm{W}_\Theta ( \mathfrak {p}_{\gamma _{k}}Q_{k}, \mathfrak {p}_{\gamma _{k}}) \nonumber \\&\overset{(ii)}{\le }\ (1-c_8 h_k) \textrm{W}_\Theta (\mathfrak {q}_k,\mathfrak {p}_{\gamma _{k}}) + \textrm{W}_1( \mathfrak {p}_{\gamma _{k}}Q_{k}, \mathfrak {p}_{\gamma _{k}}) \nonumber \\&\overset{(iii)}{\le }\ (1-c_8h_k) \textrm{W}_\Theta (\mathfrak {q}_k,\mathfrak {p}_{\gamma _{k}}) +c_9, \end{aligned}$$
(27)

where (i) uses (23), (ii) uses Proposition 4 and (25), and (iii) uses Lemma 1. Using the triangle inequality, it immediately follows that

$$\begin{aligned}&\textrm{W}_\Theta (\mathfrak {q}_{k+1},\mathfrak {p}_{\gamma _{k+1}}) \nonumber \\&\overset{(i)}{\le }\ \textrm{W}_\Theta (\mathfrak {q}_{k+1},\mathfrak {p}_{\gamma _k}) +\textrm{W}_\Theta (\mathfrak {p}_{\gamma _k},\mathfrak {p}) +\textrm{W}_\Theta (\mathfrak {p}_{\gamma _{k+1}},\mathfrak {p}) \nonumber \\&\overset{(ii)}{\le }\ (1-c_8 h_k) \textrm{W}_\Theta (\mathfrak {q}_k,\mathfrak {p}_{\gamma _{k}}) + c_9 +\textrm{W}_\Theta (\mathfrak {p}_{\gamma _k},\mathfrak {p})\nonumber \\&\qquad +\textrm{W}_\Theta (\mathfrak {p}_{\gamma _{k+1}},\mathfrak {p}) \nonumber \\&\overset{(iii)}{\le }\ (1-c_8 h_k) \textrm{W}_\Theta (\mathfrak {q}_k,\mathfrak {p}_{\gamma _{k}}) + c_9 +\textrm{W}_1(\mathfrak {p}_{\gamma _k},\mathfrak {p})\nonumber \\&\qquad +\textrm{W}_1(\mathfrak {p}_{\gamma _{k+1}},\mathfrak {p}) \nonumber \\&=:(1-c_8 h_k) \textrm{W}_\Theta (\mathfrak {q}_k,\mathfrak {p}_{\gamma _{k}}) +\alpha _k, \end{aligned}$$
(28)

where (i) uses the triangle inequality, (ii) uses (27), and (iii) uses (25). By unwrapping (28), we find that

$$\begin{aligned}&\textrm{W}_1 (\mathfrak {q}_{k},\mathfrak {p}_{\gamma _{k}}) \nonumber \\&\overset{(i)}{\le }\ e^{c_6c_7} \textrm{W}_\Theta (\mathfrak {q}_{k},\mathfrak {p}_{\gamma _{k}}) \nonumber \\&\overset{(ii)}{\le }\ e^{c_6c_7} \prod _{i=0}^{k-1} (1-c_8h_i) \cdot \textrm{W}_\Theta (\mathfrak {q}_0,\mathfrak {p}_{\gamma _{0}}) \nonumber \\&\qquad + e^{c_6c_7}\sum _{i=0}^{k-1} \alpha _i \prod _{j=i+1}^{k-1}(1-c_8h_j) \nonumber \\&\overset{(iii)}{\le }\ e^{c_6c_7} \prod _{i=0}^{k-1} (1-c_8h_i) \cdot \textrm{W}_1(\mathfrak {q}_0,\mathfrak {p}_{\gamma _{0}}) \nonumber \\&\qquad + e^{c_6c_7}\sum _{i=0}^{k-1} \alpha _i \prod _{j=i+1}^{k-1}(1-c_8h_j), \end{aligned}$$
(29)

where (i) and (iii) use (25), and (ii) uses (28). Lastly, we can use (29) in order to bound the \(\textrm{W}_1\) distance at iteration k to the target measure \(\mathfrak {p}\) as

$$\begin{aligned}&\textrm{W}_1(\mathfrak {q}_{k},\mathfrak {p}) \nonumber \\&\overset{(i)}{\le }\ \textrm{W}_1(\mathfrak {q}_{k},\mathfrak {p}_{\gamma _{k}}) + \textrm{W}_1(\mathfrak {p}_{\gamma _{k}}, \mathfrak {p}) \nonumber \\&\overset{(ii)}{\le }\ e^{c_6c_7} \prod _{i=0}^{k-1} (1-c_8h_i) \cdot \textrm{W}_1(\mathfrak {q}_0,\mathfrak {p}_{\gamma _{0}}) + e^{c_6c_7}\cdot \nonumber \\&\sum _{i=0}^{k-1} \alpha _i \prod _{j=i+1}^{k-1}(1-c_8h_j) + \textrm{W}_1(\mathfrak {p}_{\gamma _k},\mathfrak {p}), \end{aligned}$$
(30)

where (i) uses the triangle inequality, and (ii) uses (29). This completes the proof of Theorem 2.

7 Numerical experiments

A few numerical experiments are presented below to support our theoretical contributions.

7.1 Truncated Gaussian

Our first numerical experiment deals with sampling from a Gaussian distribution truncated to a box \(K_{d} \subset \mathbb {R}^d\). For this problem, the potential \(U:\mathbb {R}^d\rightarrow \mathbb {R}\) is specified as

$$\begin{aligned} U(x):=\frac{1}{2} \left\langle x,\Sigma ^{-1}x \right\rangle +\iota _{K_{d}}(x). \end{aligned}$$
(31)

Here, similarly to Brosse et al. (2017), the (i, j)-th entry of the covariance matrix is given by

$$\begin{aligned} \Sigma _{i,j}:=\frac{1}{1+|i-j |}. \end{aligned}$$
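For reference, this covariance matrix can be assembled in a few lines; the sketch below is only an illustration of the definition above.

```python
import numpy as np

def covariance(d):
    """Covariance with entries Sigma_{i,j} = 1 / (1 + |i - j|)."""
    idx = np.arange(d)
    return 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))

# Sigma is a symmetric Toeplitz matrix with unit diagonal and
# slowly decaying off-diagonal correlations.
Sigma = covariance(10)
```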

We now consider three scenarios, namely,

  • \(d=2\) with \(K_{2}=[0,5] \times [0,1]\),

  • \(d=10\) with \(K_{10}=[0,5] \times [0,0.5]^{9}\),

  • \(d=100\) with \(K_{100}=[0,5] \times [0,0.5]^{99}\).
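The non-smooth part of (31) is the convex indicator \(\iota _{K_{d}}\), whose proximal operator (for any step size) is the Euclidean projection onto the box, i.e., coordinate-wise clipping. A minimal sketch, using the \(d=2\) box above:

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box prod_i [lo_i, hi_i]; this is the
    proximal operator of the convex indicator iota_{K_d} for any step size."""
    return np.clip(x, lo, hi)

# Box K_2 = [0, 5] x [0, 1] from the d = 2 scenario.
lo, hi = np.array([0.0, 0.0]), np.array([5.0, 1.0])
```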

Fig. 1

This figure compares the MY and FB envelopes for the two-dimensional truncated Gaussian distribution \(\mathfrak {p}\propto e^{-U-1_{K_2}}\) specified in Sect. 7.1. The horizontal lines in the top and bottom panels show, respectively, the expectation and variance of the first coordinate, namely, \(\mathbb {E}_\mathfrak {p}[x_1]\) and \(\textrm{var}_\mathfrak {p}[x_1]=\mathbb {E}_\mathfrak {p}[x_1^2]-(\mathbb {E}_\mathfrak {p}[x_1])^2\). The blue and red graphs in both panels show the estimated values of \(\mathbb {E}_\mathfrak {p}[x_1]\) and \(\textrm{var}_\mathfrak {p}[x_1]\), obtained via the MY and FB envelopes. That is, the top graphs correspond to \(\mathbb {E}_{\mathfrak {p}^{MY}_\gamma }[x_1]\) and \(\mathbb {E}_{\mathfrak {p}^{FB}_\gamma }[x_1]\), and the bottom graphs correspond to \(\textrm{var}_{\mathfrak {p}^{MY}_\gamma }[x_1]\) and \(\textrm{var}_{\mathfrak {p}^{FB}_\gamma }[x_1]\), for various values of the parameter \(\gamma \)

Fig. 2

This figure shows the boxplots for the expectations of \(x_{1},x_{2},x_{3}\) for the truncated Gaussian distribution in dimension 10 obtained by MYULA, FBULA, and wHMC. The last approach serves as the ground truth

Fig. 3

Boxplots for the expectations of \(x_{1},x_{2},x_{3}\) for the truncated Gaussian distribution in dimension 100, obtained by MYULA, FBULA, and wHMC. The last approach serves as the ground truth

Fig. 4

Tomography experiment: a True image x of dimension \(d=128\times 128\). b Incomplete and noisy observation y, amplitude of Fourier coefficients in logarithmic scale

Fig. 5

Tomography experiment: a Convergence to the typical set of the posterior distribution (32). b Evolution of the MSE in stationarity

Fig. 6

Tomography experiment: Posterior mean of (32) as estimated with a MYULA and b FBULA, respectively, after \(10^6\) iterations

Using quadrature techniques, it is possible in the two-dimensional case (\(d=2\)) to calculate, to high accuracy, the mean and the covariance of the truncated Gaussian distribution, as well as the corresponding approximations obtained via the MY and FB envelopes. More specifically, Fig. 1 uses MATLAB's integral2 command to plot the following quantities for various values of \(\gamma \):

$$\begin{aligned} \mathbb {E}_{\mathfrak {p}^{MY}_\gamma }[x_1]&:= \frac{ \int x_1 e^{-F^{MY}_\gamma (x)} \,{\text {d}}x}{\int e^{-F^{MY}_\gamma (x)}\,{\text {d}}x}, \\ \textrm{var}_{\mathfrak {p}^{MY}_\gamma }[x_1]&:= \frac{ \int x_1^2 e^{-F^{MY}_\gamma (x)} \,{\text {d}}x}{\int e^{-F^{MY}_\gamma (x)}\,{\text {d}}x}- (\mathbb {E}_{\mathfrak {p}^{MY}_\gamma }[x_1] )^2, \\ \mathbb {E}_{\mathfrak {p}^{FB}_\gamma }[x_1]&:= \frac{ \int x_1 e^{-F^{FB}_\gamma (x)} \,{\text {d}}x}{\int e^{-F^{FB}_\gamma (x)}\,{\text {d}}x}, \\ \textrm{var}_{\mathfrak {p}^{FB}_\gamma }[x_1]&:= \frac{ \int x_1^2 e^{-F^{FB}_\gamma (x)} \,{\text {d}}x}{\int e^{-F^{FB}_\gamma (x)}\,{\text {d}}x}- (\mathbb {E}_{\mathfrak {p}^{FB}_\gamma }[x_1] )^2. \end{aligned}$$

The horizontal lines in Fig. 1 show the ground truth values obtained by MATLAB’s integral2 command, i.e.,

$$\begin{aligned} \mathbb {E}_{\mathfrak {p}}[x_1]&:= \frac{ \int x_1 e^{-F(x)} \,{\text {d}}x}{\int e^{-F(x)}\,{\text {d}}x}, \\ \textrm{var}_{\mathfrak {p}}[x_1]&:= \frac{ \int x_1^2 e^{-F(x)} \,{\text {d}}x}{\int e^{-F(x)}\,{\text {d}}x}- (\mathbb {E}_{\mathfrak {p}}[x_1] )^2. \end{aligned}$$
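In the same spirit, the ground-truth quantities for \(d=2\) can be reproduced with a simple tensor-product quadrature, a Python stand-in for MATLAB's integral2; the grid resolution below is an arbitrary choice, and the indicator \(\iota _{K_2}\) simply restricts the integration domain to \(K_2=[0,5]\times [0,1]\).

```python
import numpy as np

# Sigma for d = 2 with Sigma_{i,j} = 1 / (1 + |i - j|), and its precision matrix.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
P = np.linalg.inv(Sigma)

# Tensor grid over the box K_2 = [0, 5] x [0, 1].
x1 = np.linspace(0.0, 5.0, 1001)
x2 = np.linspace(0.0, 1.0, 1001)
dx1, dx2 = x1[1] - x1[0], x2[1] - x2[0]
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
dens = np.exp(-0.5 * (P[0, 0] * X1**2 + 2 * P[0, 1] * X1 * X2 + P[1, 1] * X2**2))

def integral(vals):
    # Plain Riemann sum on the tensor grid; accurate enough for this illustration.
    return vals.sum() * dx1 * dx2

Z = integral(dens)                                   # normalizing constant
mean_x1 = integral(X1 * dens) / Z                    # E_p[x_1]
var_x1 = integral(X1**2 * dens) / Z - mean_x1**2     # var_p[x_1]
```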

For small values of the parameter \(\gamma \), we observe in Fig. 1 that the FB envelope approximates the mean of the first coordinate better than the MY envelope does. However, the FB envelope tends to overestimate the variance. This behavior can be understood by comparing the two envelopes: the smoothing induced by the MY envelope is more localized than that of the FB envelope.

Such explicit calculations are not tractable in higher dimensions, i.e., for \(d\in \{10,100\}\). Instead, we generate \(10^6\) samples from the truncated Gaussian distribution \(\mathfrak {p}\) by applying MYULA and FBULA. As our ground truth, we also generate \(10^{5}\) samples from \(\mathfrak {p}\) with the wall HMC (wHMC) algorithm (Pakman and Paninski 2014). In all three approaches, the initial 10% of the samples are discarded as the burn-in period, and each experiment is repeated 100 times. As for the parameters, we set \(\gamma =0.05\) and \(h=0.005\) in all of our experiments.

The results are visualized in Figs. 2 and 3. More specifically, Fig. 2 corresponds to \(d=10\) and shows the estimates of \(\mathbb {E}_\mathfrak {p}[x_i]\) for \(i\in \{1,2,3\}\), obtained by MYULA, FBULA, and wHMC. Similarly, Fig. 3 corresponds to \(d=100\). These figures indicate that, in all of these cases, FBULA provides a more accurate approximation of the expectations than MYULA.

7.2 Tomographic image reconstruction

We now study a tomographic image reconstruction problem. In this case, the true image is the Shepp–Logan phantom test image of dimension \(d=128 \times 128\) (Shepp and Logan 1974), to which we apply a Fourier operator F followed by a subsampling operator A that discards approximately 85% of the observed pixels. Finally, zero-mean additive Gaussian noise \(\xi \) with standard deviation \(\sigma = 10^{-2}\) is added to produce the incomplete observation \(y=AFx + \xi \), where \(y \in \mathbb {C}^p\) and \(p < d\). With regards to the prior, we use the total-variation norm with an additional constraint on the range of the pixel values. This leads to the following posterior distribution:

$$\begin{aligned} \pi (x) \propto \exp \left[ -\frac{\Vert y - AFx \Vert ^2}{2\sigma ^2} - \beta \text {TV}(x) - 1_{\left[ 0,1\right] ^d}(x) \right] , \end{aligned}$$
(32)

with \(\beta = 100\). Above, \(1_{[0,1]^d}\) is the convex indicator function of the unit cube, as the pixel values in this experiment are scaled to the range [0, 1], and \(\text {TV}(x)\) denotes the total-variation pseudo-norm (Rudin et al. 1992; Chambolle 2004). Following (4) and (6b), we have \(f(x)=\Vert y - AFx \Vert ^2 / 2\sigma ^2\) and \(g(x) = {\beta }\text {TV}(x) + 1_{\left[ 0,1\right] ^d}(x)\).
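To make the data-fidelity term concrete, the following sketch evaluates \(f(x)=\Vert y-AFx\Vert ^2/(2\sigma ^2)\). The image size, the random subsampling mask, and the test image below are hypothetical placeholders; the actual operators of the experiment are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 32                                    # small stand-in for the 128 x 128 image
sigma = 1e-2
x_true = rng.random((n, n))               # pixel values scaled to [0, 1]
mask = rng.random((n, n)) < 0.15          # hypothetical mask keeping ~15% of coefficients

def forward(x):
    """A F x: 2-D Fourier transform followed by subsampling."""
    return np.fft.fft2(x)[mask]

# Incomplete, noisy observation y = A F x + xi with complex Gaussian noise.
y = forward(x_true) + sigma * (rng.standard_normal(mask.sum())
                               + 1j * rng.standard_normal(mask.sum()))

def f(x):
    """Data-fidelity term f(x) = ||y - A F x||^2 / (2 sigma^2)."""
    r = y - forward(x)
    return np.vdot(r, r).real / (2.0 * sigma**2)
```

At the true image, f reduces to the squared noise norm over \(2\sigma ^2\), whereas a blank image incurs a much larger penalty.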

Figure 4a shows the Shepp–Logan phantom tomography test image for this experiment and Fig. 4b shows the amplitude of the (noisy) Fourier coefficients collected in the observation vector y (in logarithmic scale). In this figure, black regions represent unobserved pixels.

We set \(\gamma = 1/(5L_f)\), where \(L_f = 1/\sigma ^2 = 10^4\), for both MYULA and FBULA. Figure 5a shows the evolution of \(\log \pi (x)\) from (32) for both MYULA and FBULA with the step size \(h=1/(L_f + 1/\gamma ) = 1.67 \times 10^{-5}\). We observe that both methods converge at a similar rate in terms of the actual computational time required to run the algorithms. Figure 5b shows the evolution of the mean-squared error (MSE) between the ergodic mean of the samples and the true image x; here, it can be seen that FBULA reaches a better MSE level than MYULA. We have also included in Fig. 6a and b the posterior mean estimated by MYULA and FBULA, respectively.