# Proximal Markov chain Monte Carlo algorithms

## Abstract

This paper presents a new Metropolis-adjusted Langevin algorithm (MALA) that uses convex analysis to simulate efficiently from high-dimensional densities that are log-concave, a class of probability distributions that is widely used in modern high-dimensional statistics and data analysis. The method is based on a new first-order approximation for Langevin diffusions that exploits log-concavity to construct Markov chains with favourable convergence properties. This approximation is closely related to Moreau–Yoshida regularisations for convex functions and uses proximity mappings instead of gradient mappings to approximate the continuous-time process. The proposed method complements existing MALA methods in two ways. First, the method is shown to have very robust stability properties and to converge geometrically for many target densities for which other MALA are not geometric, or only if the step size is sufficiently small. Second, the method can be applied to high-dimensional target densities that are not continuously differentiable, a class of distributions that is increasingly used in image processing and machine learning and that is beyond the scope of existing MALA and HMC algorithms. To use this method it is necessary to compute or to approximate efficiently the proximity mappings of the logarithm of the target density. For several popular models, including many Bayesian models used in modern signal and image processing and machine learning, this can be achieved with convex optimisation algorithms and with approximations based on proximal splitting techniques, which can be implemented in parallel. The proposed method is demonstrated on two challenging high-dimensional and non-differentiable models related to image resolution enhancement and low-rank matrix estimation that are not well addressed by existing MCMC methodology.

## Keywords

Bayesian inference · Convex analysis · High-dimensional statistics · Markov chain Monte Carlo · Proximal algorithms · Signal processing

## 1 Introduction

With ever-increasing computational resources Monte Carlo sampling methods have become fundamental to modern statistical science and many of the disciplines it underpins. In particular, Markov chain Monte Carlo (MCMC) algorithms have emerged as a flexible and general purpose methodology that is now routinely applied in diverse areas ranging from statistical signal processing and machine learning to biology and social sciences. Monte Carlo sampling in high dimensions is generally challenging, especially in cases where standard techniques such as Gibbs sampling are not possible or ineffective. The most effective general purpose Monte Carlo methods for high-dimensional models are arguably the Metropolis-adjusted Langevin algorithms (MALA) (Robert and Casella 2004, p. 371) and Hamiltonian Monte Carlo (HMC) (Neal 2012), two classes of MCMC methods that use gradient mappings to capture local properties of the target density and explore the parameter space efficiently.

Advanced versions of MALA and HMC use other elements of differential calculus to achieve higher efficiency. For example, Yuan and Minka (2002) and Zhang and Sutton (2011) use Hessian matrices of the target density to capture higher-order information related to scale and correlation structure. Similarly, Girolami and Calderhead (2011) use differential geometry to lift these methods from Euclidean spaces to Riemannian manifolds where the target density is isotropic. In this paper we move away from differential calculus and explore the potential of convex analysis for MCMC sampling from distributions that are log-concave.

Log-concave distributions, also known as “convex models” outside the statistical literature, are widely used in high-dimensional statistics and data analysis and, among other things, play a central role in revolutionary techniques such as compressive sensing and image super-resolution (see Candès and Tao 2009; Candès and Wakin 2008; Chandrasekaran et al. 2012 for examples in machine learning, signal and image processing, and high-dimensional statistics). Performing inference in these models is a challenging problem that currently receives a lot of attention. A major breakthrough on this topic has been the adoption of convex analysis in high-dimensional optimisation, which led to the development of the so-called “proximal algorithms” that use proximity mappings of concave functions, instead of gradient mappings, to construct fixed point schemes and compute function maxima (see Combettes and Pesquet 2011; Parikh and Boyd 2014 for two recent tutorials on this topic). These algorithms are now routinely used to find the maximisers of posterior distributions that are log-concave and often non-smooth and very high dimensional (Afonso et al. 2011; Agarwal et al. 2012; Candès and Tao 2009; Candès et al. 2011; Chandrasekaran and Jordan 2013; Chandrasekaran et al. 2011; Pesquet and Pustelnik 2012).

In this paper we use convex analysis and proximal techniques to construct a new Langevin MCMC method for high-dimensional distributions that are log-concave and possibly not continuously differentiable. Our experiments show that the method is potentially useful for performing Bayesian inference in many models related to signal and image processing that are not well addressed by existing MCMC methodology, for example, non-differentiable models with synthesis and analysis Laplace priors, priors related to total-variation, nuclear and elastic-net norms or with constraints to convex sets, such as norm balls and the positive semidefinite cone.

The remainder of the paper is structured as follows: Section 2 specifies the class of distributions considered, defines some elements of convex analysis which are essential for our methods, and briefly recalls the unadjusted Langevin algorithm (ULA) and its Metropolised version MALA. In Sect. 3.1 we present a proximal ULA for log-concave distributions and study its geometric convergence properties. Following on from this, Section 3.2 presents a proximal MALA which inherits the favourable convergence properties of the unadjusted algorithm while guaranteeing convergence to the desired target density. Section 4 demonstrates the proposed methodology on two challenging high-dimensional applications related to image resolution enhancement and low-rank matrix estimation. Conclusions and potential extensions are finally discussed in Section 5. A MATLAB implementation of the proposed methods is available at http://www.maths.bris.ac.uk/~mp12320/code/ProxMCMC.zip.

## 2 Definitions and notations

### 2.1 Convex analysis

*proximity mapping* that is inexpensive to evaluate or to approximate.

### **Definition 2.1**

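(*Proximity mappings*) The \(\lambda \)-proximity mapping or proximal operator of a concave function \(g\) is defined for any \(\lambda > 0\) as (Moreau 1962)

\[ \hbox {prox}^\lambda _{g}(\varvec{x}) = \mathop {\hbox {argmax}}_{\varvec{u}\in {\mathbb {R}}^n}\, g(\varvec{u}) -\Vert \varvec{u}- \varvec{x}\Vert ^2/2\lambda . \qquad (2) \]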
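As a concrete instance of Definition 2.1, the sketch below evaluates the proximity mapping of the scalar Laplace log-density \(g(x) = -\alpha |x|\), for which the maximisation has the well-known closed form of soft-thresholding. The function name `prox_laplace` and the parameter `alpha` are illustrative choices, not notation from the paper.

```python
def prox_laplace(x, lam, alpha=1.0):
    """lam-proximity mapping of g(u) = -alpha*|u| (soft-thresholding).

    Solves argmax_u  g(u) - (u - x)**2 / (2*lam)  in closed form.
    """
    t = lam * alpha                      # shrinkage threshold
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0                           # points within the threshold map to the mode


# The mapping shrinks every point towards the mode at 0.
values = [prox_laplace(x, lam=0.5) for x in (-2.0, -0.3, 0.0, 0.3, 2.0)]
print(values)  # [-1.5, 0.0, 0.0, 0.0, 1.5]
```

Although \(g\) is not differentiable at the origin, the mapping is well defined and continuous everywhere, which is the feature the methods below rely on.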

### **Definition 2.2**

(*Moreau approximations*) For any \(\lambda > 0\), define the \(\lambda \)-Moreau approximation of \(\pi \) as the following density

### **Definition 2.3**

(*Class of distributions* \(\mathcal {E}(\beta ,\gamma )\)) We say that \(\pi \) belongs to the one-dimensional class of distributions with exponential tails \(\mathcal {E}(\beta ,\gamma )\) if for some \(u\), and some constants \(\gamma > 0\) and \(\beta > 0\), \(\pi \) takes the form

1. *Convergence to* \(\pi \). The approximation \(\pi _\lambda (\varvec{x})\) converges point-wise to \(\pi (\varvec{x})\) as \(\lambda \rightarrow 0\).
2. *Differentiability*. \(\pi _\lambda (\varvec{x})\) is continuously differentiable even if \(\pi \) is not, and its log-gradient is \(\nabla \log \pi _\lambda (\varvec{x}) = \{\hbox {prox}^\lambda _{g}(\varvec{x})-\varvec{x}\}/\lambda \).
3. *Subdifferential*. The point \(\{\hbox {prox}^\lambda _{g}(\varvec{x})-\varvec{x}\}/\lambda \) belongs to the subdifferential set of \(\log \pi \) at \(\hbox {prox}^\lambda _{g}(\varvec{x})\), i.e. \(\{\hbox {prox}^\lambda _{g}(\varvec{x})-\varvec{x}\}/\lambda \in \partial \log \pi \{\hbox {prox}^\lambda _{g}(\varvec{x})\}\) (Bauschke and Combettes 2011, Chap. 16). In addition, if \(\log \pi \) is differentiable at \(\hbox {prox}^\lambda _{g}(\varvec{x})\) then its subdifferential collapses to a single point, i.e. \(\{\hbox {prox}^\lambda _{g}(\varvec{x})-\varvec{x}\}/\lambda = \nabla \log \pi \{\hbox {prox}^\lambda _{g}(\varvec{x})\}\).
4. *Maximizers*. The set of maximizers of \(\pi _\lambda \) is equal to that of \(\pi \). Also, because \(\pi _\lambda \) is continuously differentiable, \(\nabla \log \pi _\lambda (\varvec{x}^*) = 0 \) implies that \(\varvec{x}^*\) is a maximizer of \(\pi \).
5. *Separability*. Assume that \(\pi (\varvec{x}) = \prod _{i=1}^n f_i(x_i)\) and let \({f_i}_\lambda \) be the \(\lambda \)-Moreau approximation of the marginal density \(f_i\). Then \(\pi _\lambda (\varvec{x}) = \prod _{i=1}^n {f_i}_\lambda (x_i)\).
6. *Exponential tails*. Assume that \(\pi \in \mathcal {E}(\beta ,\gamma )\) with \(\beta \ge 1\). Then \(\pi _\lambda \in \mathcal {E}(\beta ^\prime ,\gamma ^\prime )\) with \(\beta ^\prime = \min (\beta ,2)\).
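Property 2 can be checked numerically. The sketch below assumes the standard Moreau–Yoshida form \(\log \pi _\lambda (x) = \max _u \{g(u) - (u-x)^2/2\lambda \}\) up to an additive constant (an assumption made here for illustration), takes \(g(u) = -|u|\), and compares a central finite difference of \(\log \pi _\lambda \) with \(\{\hbox {prox}^\lambda _{g}(x)-x\}/\lambda \). All function names are illustrative.

```python
def prox_abs(x, lam):
    """lam-proximity mapping of g(u) = -|u| (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def log_moreau(x, lam):
    """Unnormalised log pi_lam(x) = max_u g(u) - (u - x)^2 / (2*lam),
    evaluated at the maximiser u = prox_abs(x, lam)."""
    u = prox_abs(x, lam)
    return -abs(u) - (u - x) ** 2 / (2.0 * lam)


def grad_log_moreau(x, lam):
    """Property 2: the gradient of log pi_lam is (prox(x) - x) / lam."""
    return (prox_abs(x, lam) - x) / lam


lam, h = 0.5, 1e-6
for x in (-3.0, -0.2, 0.1, 1.7):
    # Central finite difference of the envelope versus the closed form.
    fd = (log_moreau(x + h, lam) - log_moreau(x - h, lam)) / (2.0 * h)
    print(x, fd, grad_log_moreau(x, lam))
```

For this choice of \(g\) the approximation is a Huber-type density, with a quadratic log-density near the mode and linear (Laplace) tails, which is why its log-gradient is bounded and continuous even though \(g\) is not differentiable at 0.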

As mentioned previously, the methods proposed in this paper are useful for models that have proximity mappings which are easy to evaluate or to approximate numerically (see Sect. 3.2.3 for more details). This is the case for many statistical models used in high-dimensional data analysis, where statistical inference is often conducted using convex optimisation algorithms that also require computing proximity mappings (see Afonso et al. 2011; Becker et al. 2009; Chandrasekaran et al. 2012; Recht et al. 2010 for examples in image restoration, compressive sensing, low-rank matrix recovery and graphical model selection). For more details about the evaluation of these mappings, their properties, and lists of functions with known mappings please see Bauschke and Combettes (2011), Combettes and Pesquet (2011) and Parikh and Boyd (2014, Chap. 6). A library with MATLAB implementations of frequently used proximity mappings is available at https://github.com/cvxgrp/proximal.

### 2.2 Langevin Markov chain Monte Carlo

The sampling method presented in this paper is derived from the Langevin diffusion process and is related to other Langevin MCMC algorithms that we briefly recall below.

It is well known that MALA can be a very efficient sampling method, particularly in high-dimensional problems. However, it is also known that for certain classes of target densities ULA is transient and as a result MALA is not geometrically ergodic (Roberts and Tweedie 1996). Geometric ergodicity is important theoretically to guarantee the existence of a central limit theorem for the chains and practically because sub-geometric algorithms often fail to explore the parameter space properly. Another limitation of MALA and HMC methods is that they require \(\pi \in \mathcal {C}^1\). This limits their applicability in many popular image processing and machine learning models that are not smooth.

In the following section we present a new MALA method that uses proximity mappings and Moreau approximations to capture the log-concavity of the target density and construct chains with significantly better geometric convergence properties. We emphasise at this point that this is not the first work to consider modifications of MALA with better geometric convergence properties. For example, Roberts and Tweedie (1996) suggested using MALA with a truncated gradient to retain the efficiency of the Langevin proposal near the density’s mode and add robustness in the tails, though we have found this approach to be difficult to implement practically (this is illustrated in Sect. 3.2.4). Also, Casella et al. (2011) recently proposed three variations of MALA based on implicit discretisation schemes that are geometrically ergodic for one-dimensional distributions with super-exponential tails. For certain one-dimensional densities the methods presented in this paper are closely related to the partially implicit schemes of Casella et al. (2011). Manifold MALA (Girolami and Calderhead 2011) is also geometrically ergodic for a wide range of tail behaviours if \(\delta \) is sufficiently small (Łatuszyński et al. 2011).

## 3 Proximal MCMC

### 3.1 Proximal unadjusted Langevin algorithm

This section presents a proximal Metropolis-adjusted Langevin algorithm (P-MALA) that exploits convex analysis to sample efficiently from log-concave densities \(\pi \) of the form (1). In order to define this algorithm we first introduce the *proximal unadjusted Langevin algorithm* (P-ULA) that generates samples approximately distributed according to \(\pi \), and that will be used as proposal mechanism in P-MALA. We establish that P-ULA is geometrically ergodic in many cases for which ULA is transient or explosive and that P-MALA inherits these favourable properties, converging geometrically fast in many cases in which MALA does not.

*relaxed proximal point* iteration to maximise \(\log \pi \) with relaxation parameter \(\delta /2\lambda \), plus a stochastic perturbation given by \(\sqrt{\delta }Z\) (Rockafellar 1976). According to this second interpretation \(\lambda \) should not be smaller than \(\delta /2\), as this could lead to an unstable proximal point update that is expansive and therefore to an explosive Markov chain. We therefore define the optimal \(\lambda \) as the smallest value within the range of stable values \([\delta /2,\infty )\). Setting \(\lambda = \delta /2\) we obtain the P-ULA Markov chain

\[ Y^{(m+1)} = \hbox {prox}^{\delta /2}_{g}\left( Y^{(m)}\right) + \sqrt{\delta }\,Z^{(m)}, \qquad Z^{(m)} \sim \mathcal {N}(0,\mathbb {I}_n). \]
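A minimal one-dimensional sketch of this chain is given below. It assumes the recursion has the form \(Y^{(m+1)} = \hbox {prox}^{\delta /2}_{g}(Y^{(m)}) + \sqrt{\delta }Z^{(m)}\) with standard Gaussian \(Z^{(m)}\), and uses the non-differentiable Laplace target \(g(x) = -|x|\), whose proximity mapping is soft-thresholding. The function names, step size, chain length and seed are illustrative choices.

```python
import math
import random


def prox_laplace(x, lam):
    """lam-proximity mapping of g(u) = -|u| (soft-thresholding)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def p_ula(x0, delta, n_steps, prox, rng):
    """Run n_steps of P-ULA: Y <- prox^{delta/2}(Y) + sqrt(delta) * Z."""
    chain = [x0]
    y = x0
    for _ in range(n_steps):
        y = prox(y, delta / 2.0) + math.sqrt(delta) * rng.gauss(0.0, 1.0)
        chain.append(y)
    return chain


rng = random.Random(0)                      # fixed seed for reproducibility
chain = p_ula(x0=10.0, delta=0.2, n_steps=20000, prox=prox_laplace, rng=rng)
burned = chain[2000:]                       # discard the initial transient
print(sum(burned) / len(burned))            # sample mean of the retained states
```

Because no Metropolis–Hastings correction is applied, the samples follow the chain's own ergodic measure, which only approximates \(\pi \); the bias is controlled by \(\delta \).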

### **Theorem 3.1**

### *Proof*

The proof follows from the fact that \(\nabla \log \pi _{\delta /2}\) is continuous and \(\text {P-ULA}\) is \(\mu ^{Leb}\)-irreducible and weak Feller, and hence all compact sets are small (Meyn and Tweedie 1993, Chap. 6). Then, using Property 2, the conditions on \(S_d^+\) and \(S_d^-\) are equivalent to the conditions of part (a) of Theorem 3.1 of Roberts and Tweedie (1996) establishing that \(\text {P-ULA}\) is geometrically ergodic for \(d \in [0,1)\). For \(d = 1\) we proceed similarly to Property 6 and note that for approximations \(\pi _{\delta /2}\) with Gaussian tails we have that \(S_1^+ \in (-1,0)\) and \(S_1^- \in (0,1)\), thus part (b) of Theorem 3.1 of Roberts and Tweedie (1996) applies. Finally, notice from Property 2 that the values of \(d\), \(S_d^+\) and \(S_d^-\) are closely related to the tails of the approximation \(\pi _{\delta /2}\), i.e. \(\lim _{x\rightarrow \infty } \frac{\text {d}}{\text {d}x} \log \pi _{\delta /2}(x) = S_d^+ x^d + o(|x|^d)\) and \(\lim _{x\rightarrow -\infty } \frac{\text {d}}{\text {d}x}\log \pi _{\delta /2}(x) = S_d^- x^d + o(|x|^d)\). \(\square \)

Theorem 3.1 is most clearly illustrated when \(\pi \) belongs to the class \(\mathcal {E}(\beta ,\gamma )\). Recall that ULA is not ergodic if \(\beta >2\), and is ergodic only for \(\delta \) sufficiently small if \(\beta = 2\) (Roberts and Tweedie 1996).

### **Corollary 3.1**

Assume that \(\pi \in \mathcal {E}(\beta ,\gamma )\) and that (1) holds. Then \(\text {P-ULA}\) is geometrically ergodic for all \(\delta > 0\).

This result follows from the fact that (1) implies \(\beta \ge 1\) (distributions belonging to \(\mathcal {E}(\beta ,\gamma )\) with \(\beta < 1\) are not log-concave), which in turn implies that \(\pi _{\delta /2} \in \mathcal {E}(\beta ^\prime ,\gamma ^\prime )\) with \(\beta ^\prime = \min (\beta ,2)\) and some \(\gamma ^\prime > 0\). The geometric convergence of \(\text {P-ULA}\) is then established by checking that for \(d = \beta ^\prime - 1\) the limits \(S_d^+\) and \(S_d^-\) exist and verify the conditions of Theorem 3.1 for all \(\delta >0\).

The results presented above establish that, under certain conditions on \(\pi \), \(\text {P-ULA}\) converges geometrically to some unknown ergodic measure. To determine if this stationary measure is a good approximation of \(\pi \), and thus if P-ULA is a good proposal for a Metropolis–Hastings algorithm, we consider the more general question of how well \(\text {P-ULA}\) approximates the time-continuous diffusion \(Y(t)\) as a function of \(\delta \) [we consider strong mean-square convergence to \(Y(t)\) in the sense of Higham et al. (2003), which also implies the convergence of P-ULA’s ergodic measure to \(\pi \)].

### **Theorem 3.2**

### *Proof*

To prove the first result we use Property 3 to express P-ULA as a *split-step backward Euler* approximation of \(Y(t)\) (i.e. \(Y^{(m+1)} = Y^{+} + \sqrt{\delta } W^{(m)}\) with \(Y^{+} = \frac{\delta }{2}\nabla \log \pi \left( Y^{+}\right) + Y^{(m)}\)), and apply Theorem 3.3 of Higham et al. (2003), where we note that assumption (1) implies condition 3.1 of Higham et al. (2003). The second result follows from Theorem 4.7 of Higham et al. (2003). \(\square \)
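The identification used in this proof can be spelled out in one step. Assuming \(g = \log \pi \) up to an additive constant, and that \(g\) is differentiable at the maximiser, the first-order optimality condition of the proximity mapping with \(\lambda = \delta /2\) gives:

```latex
\begin{aligned}
Y^{+} = \hbox{prox}^{\delta/2}_{g}\bigl(Y^{(m)}\bigr)
  &= \mathop{\hbox{argmax}}_{\varvec{u}\in{\mathbb{R}}^n}\;
     g(\varvec{u}) - \Vert \varvec{u} - Y^{(m)}\Vert^2/\delta \\
\Longrightarrow\quad
0 &= \nabla g(Y^{+}) - \tfrac{2}{\delta}\bigl(Y^{+} - Y^{(m)}\bigr) \\
\Longleftrightarrow\quad
Y^{+} &= Y^{(m)} + \tfrac{\delta}{2}\,\nabla \log \pi\bigl(Y^{+}\bigr).
\end{aligned}
```

The drift is therefore evaluated at the end point of the step rather than at the starting point, which is what makes the scheme implicit. When \(g\) is not differentiable at \(Y^{+}\), the gradient is replaced by an element of the subdifferential, as stated in Property 3.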

### 3.2 Proximal Metropolis-adjusted Langevin algorithm

#### 3.2.1 Metropolis–Hastings correction

*P-MALA*. This is a Metropolis–Hastings chain \(X^{(m)}\) that uses \(\text {P-ULA}\) as proposal. Precisely, given \(X^{(m)}\), a candidate \(Y^{*}\) is generated by using one \(\text {P-ULA}\) transition
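A minimal one-dimensional sketch of this construction is given below for the target \(\pi (x) \propto \exp (-x^4)\). It assumes the usual Metropolis–Hastings acceptance probability \(\min \{1, \pi (Y^*)q(X|Y^*)/[\pi (X)q(Y^*|X)]\}\) with Gaussian proposal density \(q(\cdot |x) = \mathcal {N}(\hbox {prox}^{\delta /2}_{g}(x), \delta )\), and it evaluates the proximity mapping by Newton iterations on the optimality condition \(4\lambda u^3 + u - x = 0\). The helper names, step size and seed are illustrative choices rather than the paper's implementation.

```python
import math
import random


def log_pi(x):
    """Unnormalised log-density g(x) = -x**4."""
    return -x ** 4


def prox_quartic(x, lam, n_iter=50):
    """lam-proximity mapping of g(u) = -u**4.

    The maximiser solves 4*lam*u**3 + u - x = 0, a strictly increasing
    cubic with a unique real root; Newton's method is started at u = 0.
    """
    u = 0.0
    for _ in range(n_iter):
        f = 4.0 * lam * u ** 3 + u - x
        u -= f / (12.0 * lam * u ** 2 + 1.0)
    return u


def log_q(y, x, delta):
    """Log proposal density of y given x, up to an additive constant:
    y ~ N(prox^{delta/2}(x), delta)."""
    m = prox_quartic(x, delta / 2.0)
    return -(y - m) ** 2 / (2.0 * delta)


def p_mala(x0, delta, n_steps, rng):
    """Metropolis-Hastings chain that uses one P-ULA transition as proposal."""
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        y = prox_quartic(x, delta / 2.0) + math.sqrt(delta) * rng.gauss(0.0, 1.0)
        log_r = (log_pi(y) + log_q(x, y, delta)) - (log_pi(x) + log_q(y, x, delta))
        if rng.random() < math.exp(min(0.0, log_r)):   # accept with prob. min(1, r)
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n_steps


rng = random.Random(1)
chain, acc_rate = p_mala(x0=5.0, delta=0.5, n_steps=20000, rng=rng)
kept = chain[1000:]
second_moment = sum(x * x for x in kept) / len(kept)
print(acc_rate, second_moment)
```

The estimate of \(\mathrm {E}(x^2)\) can be compared with the exact value \(\Gamma (3/4)/\Gamma (1/4) \approx 0.338\) for this target.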

#### 3.2.2 Convergence properties

We provide two alternative sets of conditions for the geometric ergodicity of \(\text {P-MALA}\) and illustrate our results on the case where \(\pi \) belongs to the class \(\mathcal {E}(\beta ,\gamma )\), which we use as benchmark for comparison with other MALAs.

### **Theorem 3.3**

### *Proof*

To prove this result we use Theorem 5.14 of Bauschke and Combettes (2011) to show that if (1) holds then, for any \(\varvec{x}\), the mean candidate position \(\hbox {prox}^\lambda _g(\varvec{x})\) verifies the inequality \(\Vert \hbox {prox}^\lambda _g(\varvec{x})\Vert < \Vert \varvec{x}\Vert \). This result, together with the condition that \(A\) converges inwards in \(q\), implies that \(\text {P-MALA}\) is geometrically ergodic (Roberts and Tweedie 1996, Theorem 4.1). \(\square \)

### **Corollary 3.2**

Suppose that \(\pi \in {\mathcal {E}}(\beta ,\gamma )\) and that (1) holds. Then P-MALA is geometrically ergodic for all \(\delta > 0\).

Proving this result simply consists of checking that if \(\pi \in {\mathcal {E}}(\beta ,\gamma )\) and (1) holds then \(A\) converges inwards in \(q\) and therefore Theorem 3.3 applies, where we note that (1) implies that \(\beta \ge 1\).

Notice from Corollary 3.2 that P-MALA has very robust stability and convergence properties. For comparison, MALA is not geometrically ergodic for any \(\pi \in {\mathcal {E}}(\beta ,\gamma )\) with \(\beta > 2\) (Roberts and Tweedie 1996) and manifold MALA is geometrically ergodic for \(\pi \in {\mathcal {E}}(\beta ,\gamma )\) with \(\beta \ne 1\) only if \(\delta \) is sufficiently small (Łatuszyński et al. 2011). P-MALA inherits these robust convergence properties from P-ULA, or more precisely from the regularity properties of \(\pi _{\delta /2}\) that guarantee that P-ULA is always stable and geometrically ergodic. In particular, \(\log \pi _{\delta /2}\) decays at most quadratically, \(\nabla \log \pi _{\delta /2}\) always exists and is Lipschitz continuous, and the tails of \(\pi _{\delta /2}\) broaden with \(\delta \) such that \(Y_{\delta /2}(t)\) is always within the stability range of a forward Euler approximation with time step \(\delta \).

Moreover, the convergence properties of \(\text {P-MALA}\) can also be studied in the framework of random walk Metropolis–Hastings algorithms with bounded drift (Atchade 2006).

### **Theorem 3.4**

### *Proof*

The proof of this result follows from the proof of geometric ergodicity for the Shrinkage-thresholding MALA (Schreck et al. 2013), which is general to all Metropolis–Hastings algorithms with bounded drift, and where we note that the conditions on \(\pi \), together with the bounded drift condition \(\Vert \varvec{x}- \hbox {prox}^\lambda _{g}(\varvec{x})\Vert < R\), satisfy the assumptions of Theorem 4.1 of Schreck et al. (2013). \(\square \)

Notice that it is always possible to enforce the bounded drift condition by composing \(\hbox {prox}^\lambda _{g}(\varvec{x})\) with a projection onto an \(\ell _2\)-ball centred at \(\varvec{x}\) (this is equivalent to using a truncated gradient, as proposed by Roberts and Tweedie 1996). Also, it is possible to relax the smoothness assumption to \(\pi \in \mathcal {C}^0\) by adding assumptions A3 and A4 from Schreck et al. (2013).

Finally, similarly to other MH algorithms based on local proposals, P-MALA may be geometrically ergodic yet perform poorly if the proposal variance \(\delta \) is either too small or very large. Theoretical and experimental studies of MALA show that for many high-dimensional target densities the value of \(\delta \) should be set to achieve an acceptance rate of approximately 40–70 % (Pillai et al. 2012). These results do not apply directly to P-MALA. However, given the similarities between MALA and P-MALA, it is reasonable to assume that the values of \(\delta \) that are appropriate for MALA will generally also produce good results for P-MALA. In our experiments we have found that P-MALA performs well when \(\delta \) is set to achieve an acceptance rate of 40–60 %.

#### 3.2.3 Computation of the proximity mapping \(\hbox {prox}^{\delta /2}_{g}(\varvec{x})\)

The computational performance of P-MALA depends strongly on the capacity to evaluate efficiently \({\hbox {prox}}^{\delta /2}_{g}(\varvec{x}) = \mathop {\hbox {argmax}}_{\varvec{u}\in {\mathbb {R}}^n}\, g(\varvec{u}) -\Vert \varvec{u}- \varvec{x}\Vert ^2/\delta \). As mentioned previously, the computation of proximity mappings is the focus of significant research efforts because these operators are key to modern convex and non-convex optimisation. As a result, for many important models used in high-dimensional data analysis, signal and image processing, and statistical machine learning, there are now clever analytical or numerical techniques to evaluate these mappings efficiently (two examples of this are the total-variation and the nuclear-norm priors used in the experiments of Sect. 4). For a survey on the evaluation of proximity mappings and lists of some functions with known mappings please see Parikh and Boyd (2014, Chap. 6) and Combettes and Pesquet (2011).

The most general strategy for computing \(\hbox {prox}^{\delta /2}_{g}(\varvec{x})\) is to note that (2) is a convex optimisation problem that can frequently be solved or approximated quickly with state-of-the-art convex optimisation algorithms. Komodakis and Pesquet (2014) present these algorithms in the primal-dual framework and provide clear guidelines for parallel and distributed implementations. When applying these techniques within P-MALA it is important to use \(\varvec{x}\) to *hot-start* the optimisation, particularly in high-dimensional models where \({\hbox {prox}}^{\delta /2}_{g}(\varvec{x})\) is close to \(\varvec{x}\) because \(\delta \) has been set to a small value to achieve a good acceptance probability (recall that \({\hbox {prox}}^{\delta /2}_{g}(\varvec{x}) \rightarrow \varvec{x}\) when \(\delta \rightarrow 0\)).

*forward-backward* or *proximal gradient* algorithm (Combettes and Pesquet 2011). We found this approximation to be very accurate for high-dimensional models because, again, \(\delta \) is set to a small value and \(\hbox {prox}^{\delta /2}_{g}(\varvec{x})\) is close to \(\varvec{x}\), and as a result the approximation \(g_1(\varvec{u}) \approx g_1(\varvec{x}) + (\varvec{u}-\varvec{x})^T\nabla g_1(\varvec{x})\) is generally accurate. Approximation (12) is useful for instance in linear inverse problems of the form \(g(\varvec{x}) = -(\varvec{y}-H\varvec{x})^T \Sigma ^{-1}(\varvec{y}-H\varvec{x})/2 -\alpha \phi (\varvec{x})\) involving a Gaussian likelihood and a convex regulariser \(\phi (\varvec{x})\) with a tractable proximity mapping [\(\phi (\varvec{x})\) is often some norm, and norms generally have known and fast proximity mappings (Parikh and Boyd 2014, Chap. 6.5)]. Notice that many signal and image processing problems can be formulated in this way (Combettes and Pesquet 2011). Moreover, if \(g_1 \in \mathcal {C}^2\) it is also possible to use a second-order approximation

Finally, it is worth noting that although using an approximation of \(\hbox {prox}^{\delta /2}_{g}(\varvec{x})\) can potentially reduce P-MALA’s mixing speed, if the conditions for geometric ergodicity of Theorem 3.4 hold when \(\hbox {prox}^{\delta /2}_{g}(\varvec{x})\) is evaluated exactly, then P-MALA implemented with an approximate mapping also converges geometrically to \(\pi \) if the approximation error can be bounded by some \(R^\prime > 0\) or if \(\hbox {prox}^\lambda _{g}(\varvec{x})\) is followed by a projection to guarantee a bounded drift.
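The first-order (forward-backward) approximation discussed in this section can be illustrated on a scalar denoising model. The sketch assumes the splitting \(g = g_1 + g_2\) with smooth \(g_1(u) = -(y-u)^2/2\sigma ^2\) and \(g_2(u) = -\alpha |u|\), and that the approximation takes the form \(\hbox {prox}^{\delta /2}_{g}(x) \approx \hbox {prox}^{\delta /2}_{g_2}\{x + \delta \nabla g_1(x)/2\}\), which is what results from replacing \(g_1\) by its linearisation at \(x\). For this toy model the exact mapping is also available in closed form, so the two can be compared. All names and numerical values are illustrative.

```python
def soft(z, t):
    """Soft-thresholding: argmax_u  -t*|u| - (u - z)**2 / 2."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def prox_exact(x, lam, y, sigma2, alpha):
    """Exact lam-proximity mapping of g(u) = -(y-u)^2/(2*sigma2) - alpha*|u|."""
    c = 1.0 / sigma2 + 1.0 / lam          # curvature of the combined quadratic
    b = y / sigma2 + x / lam              # its linear coefficient
    return soft(b / c, alpha / c)


def prox_forward_backward(x, lam, y, sigma2, alpha):
    """Approximation: gradient step on g1, then proximity mapping of g2."""
    grad_g1 = (y - x) / sigma2
    return soft(x + lam * grad_g1, lam * alpha)


x, y, sigma2, alpha = 1.0, 2.0, 1.0, 0.5
for delta in (1.0, 0.1, 0.01):
    lam = delta / 2.0
    exact = prox_exact(x, lam, y, sigma2, alpha)
    approx = prox_forward_backward(x, lam, y, sigma2, alpha)
    print(delta, exact, approx, abs(exact - approx))
```

In this example the discrepancy shrinks rapidly as \(\delta \) decreases, in line with the observation that the approximation is accurate when \(\hbox {prox}^{\delta /2}_{g}(\varvec{x})\) is close to \(\varvec{x}\).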

#### 3.2.4 Illustrative example

For illustration we show an application of P-MALA to the density \(\pi (x) \propto \exp (-x^4)\) depicted in Fig. 1c. We compare our results with MALA, with the truncated gradient MALA (MALTA) (Roberts and Tweedie 1996), and with the simplified manifold MALA (SMMALA) (Girolami and Calderhead 2011). As explained previously, MALA is not geometrically ergodic for this target density due to the lighter-than-Gaussian tails. This can be cured by using MALTA, which is a bounded drift random walk Metropolis–Hastings algorithm constructed by replacing \(h(x) = \nabla \log \pi (x)\) in the MALA proposal with \(h_{\epsilon _1}(x) = \epsilon _1 h(x) / \max (\epsilon _1,\Vert h(x)\Vert )\) for some \(\epsilon _1 > 0\) (Atchade 2006). Although geometrically ergodic, MALTA can converge very slowly if the truncation threshold \(\epsilon _1\) is not set correctly. Setting good values for \(\epsilon _1\) can be difficult in practice, particularly because values that appear suitable in certain regions of the state space are unsuitable in others. Alternatively, manifold MALA implemented using the (regularised) inverse Hessian \(H^{-1}_{\epsilon _2}(x) = (12x^{2} + \epsilon _2)^{-1}\) is also geometrically ergodic if \(\delta \) is sufficiently small (for this example \(\delta \le 6\)) (Łatuszyński et al. 2011); however, this algorithm can also converge slowly if the value of \(\epsilon _2\) is not set properly.
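To make the comparison concrete, the sketch below evaluates the mean of each proposal at a point in the tail of \(\pi (x) \propto \exp (-x^4)\). It assumes the usual Langevin proposal mean \(x + \delta h(x)/2\) for MALA and the same expression with the truncated drift \(h_{\epsilon _1}\) for MALTA; the P-MALA mean is \(\hbox {prox}^{\delta /2}_{g}(x)\), computed here by Newton iterations (an implementation choice made for this illustration). The values \(\delta = 1\), \(\epsilon _1 = 10\) and \(x = 5\) are arbitrary.

```python
def h(x):
    """Gradient of log pi for pi(x) proportional to exp(-x**4)."""
    return -4.0 * x ** 3


def h_truncated(x, eps1):
    """MALTA drift: eps1 * h(x) / max(eps1, |h(x)|)."""
    return eps1 * h(x) / max(eps1, abs(h(x)))


def prox_quartic(x, lam, n_iter=50):
    """Root of 4*lam*u**3 + u - x = 0 (Newton's method from u = 0)."""
    u = 0.0
    for _ in range(n_iter):
        u -= (4.0 * lam * u ** 3 + u - x) / (12.0 * lam * u ** 2 + 1.0)
    return u


x, delta, eps1 = 5.0, 1.0, 10.0
mean_mala = x + 0.5 * delta * h(x)                    # forward Euler drift
mean_malta = x + 0.5 * delta * h_truncated(x, eps1)   # truncated drift
mean_pmala = prox_quartic(x, delta / 2.0)             # proximal (implicit) drift
print(mean_mala, mean_malta, mean_pmala)
```

With these values the MALA mean overshoots the mode by a wide margin (it equals \(-245\)), so proposals from this state land far out in the opposite tail. The proximal mean lies strictly between the mode and \(x\), and its step size adapts to \(x\) without requiring a truncation threshold.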

We observe in Fig. 2a–d that the chains generated with P-MALA and MALTA exhibit good mixing, that SMMALA has slower mixing, and that MALA has rejected all the proposed moves and failed to converge. We repeated this experiment using the initial state \(X^{(0)} = 5\) and the same values for \(\delta \), \(\epsilon _1\) and \(\epsilon _2\). The first \(250\) samples of each chain are displayed in Fig. 2e–h. Again, we observe the good mixing of P-MALA, the slower mixing of SMMALA, and the lack of ergodicity of MALA. However, we also observe that on this occasion MALTA got “stuck” at states where its mixing properties are very poor and failed to converge. We also repeated this experiment with HMC (not shown) and observed that it suffers from the same drawbacks as MALA.

## 4 Applications

This section demonstrates P-MALA on two challenging high-dimensional and non-smooth models that are widely used in statistical signal and image processing and that are not well addressed by existing MCMC methodology. The first example considers the computation of Bayesian credibility regions for an image resolution enhancement problem. The second example presents a graphical posterior predictive check of the popular *nuclear-norm* model for low-rank matrices.

### 4.1 Bayesian image deconvolution with a total-variation prior

We consider the linear observation model \(\varvec{y}= H\varvec{x}+ \varvec{w}\), where \(H\) is a linear operator representing the blur point spread function and \(\varvec{w}\) is the sample of a zero-mean white Gaussian vector with covariance matrix \(\sigma ^2\varvec{I}_n\) (Hansen et al. 2006). This inverse problem is usually ill-posed or ill-conditioned, i.e. either \(H\) does not admit an inverse or it is nearly singular, thus yielding highly noise-sensitive solutions. Bayesian image deconvolution methods address this difficulty by exploiting prior knowledge about \(\varvec{x}\) in order to obtain more robust estimates. One of the most widely used image priors for deconvolution problems is the improper total-variation norm prior, \(\pi (\varvec{x}) \propto \exp {\left( -\alpha \Vert \nabla _d \varvec{x}\Vert _1\right) }\), where \(\nabla _d\) denotes the discrete gradient operator that computes the vertical and horizontal differences between neighbour pixels. This prior encodes the fact that differences between neighbour image pixels are often very small and occasionally take large values (i.e. image gradients are nearly sparse). Based on this prior and on the linear observation model described above, the posterior distribution for \(\varvec{x}\) is given by

\[ \pi (\varvec{x}|\varvec{y}) \propto \exp {\left[ -\Vert \varvec{y}-H\varvec{x}\Vert ^2/2\sigma ^2 -\alpha \Vert \nabla _d \varvec{x}\Vert _1\right] }. \]
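The unnormalised log-posterior implied by the Gaussian observation model and the total-variation prior above can be evaluated as in the following sketch. It assumes an anisotropic discrete gradient with no wrap-around at the image borders and represents images as nested lists; the blur operator is passed as a callable, and the identity used in the example is only a placeholder for \(H\). All names are illustrative.

```python
def tv_norm(img):
    """l1 norm of the vertical and horizontal differences between
    neighbouring pixels of an image stored as a list of rows."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                total += abs(img[i + 1][j] - img[i][j])   # vertical difference
            if j + 1 < cols:
                total += abs(img[i][j + 1] - img[i][j])   # horizontal difference
    return total


def log_posterior(x, y, apply_H, sigma2, alpha):
    """Unnormalised log pi(x | y) = -||y - Hx||^2 / (2*sigma2) - alpha * TV(x)."""
    Hx = apply_H(x)
    sq_err = sum((y[i][j] - Hx[i][j]) ** 2
                 for i in range(len(y)) for j in range(len(y[0])))
    return -sq_err / (2.0 * sigma2) - alpha * tv_norm(x)


def identity(img):
    """Placeholder for the blur operator H (no blur)."""
    return img


y = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]        # observation with one vertical edge
flat = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # candidate image without the edge
print(log_posterior(y, y, identity, sigma2=0.01, alpha=2.0))
print(log_posterior(flat, y, identity, sigma2=0.01, alpha=2.0))
```

The first candidate fits the data exactly and pays only the prior penalty for its edge, whereas the flat candidate has zero total variation but a large data misfit.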

(ESS). P-MALA was almost twice as computationally expensive as MALA due to the overhead associated with evaluating the proximity mapping of \(g_2\) (the total computational times were 49 h for P-MALA and 28 h for MALA). However, because P-MALA is exploring the parameter space significantly faster than MALA, its time-normalised ESS was \(4.5\) times better than that of MALA (\(50.8\) and \(11.04\) samples per hour respectively), confirming the good performance of the proposed methodology. Preconditioning MALA with the (regularised) inverse Fisher information matrix \((H^T H + \epsilon \mathbb {I}_n)^{-1}\) led to poor mixing, possibly because most of the correlation structure in the posterior distributions comes from the non-differentiable prior \(\pi (\varvec{x}) \propto \exp {[-\alpha \Vert \nabla _d\varvec{x}\Vert _1]}\) and is not captured by this metric.

### 4.2 Nuclear-norm models for low-rank matrix estimation

In this experiment we use P-MALA to perform a graphical posterior predictive check of the widely used *nuclear norm* model for low-rank matrices (Fazel 2002). Simulating samples from distributions involving the nuclear norm is challenging because matrices are often high-dimensional and because this norm is not continuously differentiable, which makes it difficult to use gradient-based MCMC methods such as MALA and HMC. For simplicity we present our analysis in the context of matrix denoising; however, the approach can be easily applied to other low-rank matrix estimation problems such as matrix completion and decomposition (Candès and Plan 2009; Candès and Tao 2009; Chandrasekaran et al. 2011, 2012; Candès et al. 2011).
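For reference, the proximity mapping of a nuclear-norm term is available in closed form through singular value soft-thresholding. Writing the prior term as \(g_2(\varvec{x}) = -\alpha \Vert \varvec{x}\Vert _*\) (the notation \(g_2\) and \(\alpha \) is assumed here by analogy with the previous experiment) and the singular value decomposition of \(\varvec{x}\) as \(U\,\hbox {diag}(s_1,\ldots ,s_r)\,V^T\), one has:

```latex
\hbox{prox}^{\lambda}_{g_2}(\varvec{x})
  = \mathop{\hbox{argmax}}_{\varvec{u}\in{\mathbb{R}}^{n\times m}}
    \; -\alpha\Vert\varvec{u}\Vert_* - \Vert\varvec{u}-\varvec{x}\Vert_F^2/2\lambda
  = U\,\hbox{diag}\bigl\{\max(s_i - \lambda\alpha,\,0)\bigr\}\,V^T .
```

Each evaluation therefore costs one singular value decomposition, and singular values below the threshold \(\lambda \alpha \) are set exactly to zero.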

It is well documented that under certain conditions on the true rank and \(\sigma ^2\), the MAP estimate maximising (15) accurately recovers the true null and column spaces of \(\varvec{x}\) (Candès and Plan 2009; Candès and Tao 2009; Negahban and Wainwright 2012; Rahul et al. 2010). This has led to the general consensus that the nuclear-norm prior is a useful model for low-rank matrix estimation problems and that the errors introduced by using the convex approximation to the rank function do not have a significant effect on the inferences. Here we adopt a Bayesian model checking viewpoint and assess the nuclear-norm model by comparing the observation \(\varvec{y}\) to replicas \(\varvec{y}^\mathrm{rep}\) generated by drawing samples from the posterior predictive distribution \(f(\varvec{y}^\mathrm{rep}|\varvec{y}) = \int _{{\mathbb {R}}^{n\times m}} f(\varvec{y}^\mathrm{rep}|\varvec{x})\pi (\varvec{x}|\varvec{y})\text {d}\varvec{x}\), as recommended by Gelman et al. (2013, Chap. 6). This technique for checking the fit of a model to data is based on the intuition that “If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution” (Gelman et al. 2013, Chap. 6). In this paper we perform a graphical check and compare visually \(\varvec{y}\) and its replicas \(\varvec{y}^\mathrm{rep}\). In specific applications one could also use \(\varvec{y}^\mathrm{rep}\) to compute posterior predictive *p*-values that evaluate specific aspects of the model that are relevant to the application (Gelman et al. 2013, Chap. 6).

## 5 Conclusion

This paper studied a new Langevin MCMC algorithm that uses convex analysis, namely Moreau approximations and proximity mappings, to sample efficiently from high-dimensional densities \(\pi \) that are log-concave and possibly not continuously differentiable. This method is based on a new first-order approximation for Langevin diffusions that is constructed by first approximating the original diffusion \(Y(t)\) with an auxiliary Langevin diffusion \(Y_\lambda (t)\) with ergodic measure \(\pi _\lambda \), and then discretising \(Y_\lambda (t)\) using a forward Euler scheme with time step \(\delta =2\lambda \). The resulting Markov chain, P-ULA, is similar to ULA except that it uses proximity mappings of \(\log \pi \) instead of gradient mappings. This modification leads to a chain with favourable convergence properties that is geometrically ergodic in many cases for which ULA is transient or explosive. The proposed sampling method, P-MALA, combines P-ULA with a Metropolis–Hastings step guaranteeing convergence to the desired target density. It is shown that P-MALA inherits the favourable convergence properties of P-ULA and is geometrically ergodic in many cases for which MALA does not converge geometrically and for which manifold MALA is only geometric if the time step is sufficiently small. Moreover, because P-MALA uses proximity mappings instead of gradients it can be applied to target densities that are not continuously differentiable, whereas MALA and manifold MALA require \(\pi \in {\mathcal {C}}^1\) and \(\pi \in {\mathcal {C}}^2\), respectively, to perform well. Finally, P-MALA was validated and compared to other MCMC algorithms through illustrative examples and applications to real data, including two challenging high-dimensional experiments related to image deconvolution and low-rank matrix denoising. These experiments show that P-MALA can make Bayesian inference techniques practically feasible for high-dimensional and non-differentiable models that are not well addressed by the existing MCMC methodology.
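The structure summarised above (a proximal proposal followed by a Metropolis–Hastings correction) can be sketched on a toy target. The example assumes a Laplace-type density \(\pi (\varvec{x}) \propto \exp (-\Vert \varvec{x}\Vert _1)\), whose proximity mapping with \(\lambda = \delta /2\) is the soft-thresholding operator, and a Gaussian proposal centred at that mapping with covariance \(\delta I_n\). All function names, the step size and the chain length are illustrative choices rather than the settings used in the experiments.

```python
import numpy as np

def soft_threshold(x, thr):
    """Proximity mapping of thr * ||x||_1 (componentwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def log_target(x):
    """Unnormalised log-density of the assumed toy target exp(-||x||_1)."""
    return -np.sum(np.abs(x))

def log_proposal(y, x, delta):
    """Log-density (up to a constant) of N(prox(x), delta * I) evaluated at y."""
    mean = soft_threshold(x, delta / 2.0)
    return -np.sum((y - mean) ** 2) / (2.0 * delta)

def p_mala(x0, delta, n_iter, rng):
    """Proximal MALA sketch: proximal proposal plus Metropolis-Hastings step."""
    x = np.array(x0, dtype=float)
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for t in range(n_iter):
        # P-ULA style move: proximity mapping in place of a gradient step.
        y = soft_threshold(x, delta / 2.0) + np.sqrt(delta) * rng.standard_normal(x.size)
        # Metropolis-Hastings correction so that the chain targets pi exactly.
        log_ratio = (log_target(y) - log_target(x)
                     + log_proposal(x, y, delta) - log_proposal(y, x, delta))
        if np.log(rng.uniform()) < log_ratio:
            x = y
            n_accept += 1
        chain[t] = x
    return chain, n_accept / n_iter

rng = np.random.default_rng(0)
chain, acc_rate = p_mala(x0=np.zeros(2), delta=1.0, n_iter=5000, rng=rng)
```

Dropping the accept/reject step gives the unadjusted chain, which is cheaper per iteration but no longer targets \(\pi \) exactly.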

Moreover, although only directly applicable to log-concave distributions, P-MALA can be used within a Gibbs sampler to simulate from more complex models. For example, it can be easily applied to a large class of bilinear models of the form (14) in which there is uncertainty about the linear operator \(H\) (e.g. semi-blind image restoration), as these models can be conveniently split into two high-dimensional conditional densities that are log-concave. Similarly, its application to hierarchical models involving unknown regularisation or noise power hyper-parameters is also straightforward. Future work will focus on the application of P-MALA to the development of new statistical signal and image processing methodologies. In particular, we plan to develop a general set of tools for computing Bayesian estimators and credibility regions for high-dimensional convex linear and bilinear inverse problems, as well as stochastic optimisation algorithms for empirical Bayes estimation in signal and image processing. Another important perspective for future work is to investigate the rate of convergence of P-MALA as a function of the dimension of \(\varvec{x}\). This cannot be achieved with the mathematical techniques used in the proofs of Theorems 3.1, 3.3 and 3.4, and will require using a more appropriate set of techniques based on the Wasserstein framework (see Ottobre and Stuart 2014 for more details). Preliminary analyses suggest that P-MALA’s mixing time depends on the shape of (the tail of) \(\pi \), unlike the random walk Metropolis–Hastings algorithm and MALA whose scaling is, under some conditions, independent of \(\pi \).

Also, in some applications the performance of P-MALA could be improved by introducing some form of adaptation or preconditioning that captures the local geometry of the target density. This could be achieved by learning the density’s covariance structure online (Atchade 2006) or by using an appropriate position-dependent metric. For models with \(\pi \in \mathcal {C}^2\) this metric can be derived from the Fisher information matrix or the Hessian matrix as suggested in Girolami and Calderhead (2011), and for other log-concave densities perhaps by using preconditioning techniques from the convex optimisation literature, such as Marnissi et al. (2014). A key factor will be the availability of efficient algorithms for evaluating proximity mappings on non-canonical Euclidean spaces (i.e. defined using quadratic penalty functions of the form \((\varvec{u}-\varvec{x})^TA(\varvec{x})(\varvec{u}-\varvec{x})\) for some positive definite matrix \(A(\varvec{x})\)). This topic currently receives a lot of attention in the optimisation literature and is the focus of important engineering efforts. Alternatively, one could also consider extending our methods to other diffusions that are more robust to anisotropic target densities (Roberts and Stramer 2002; Stramer and Tweedie 1999a, b).
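To make the quadratic-penalty construction concrete, one possible way of writing such a preconditioned proximity mapping is sketched below. It assumes the convention in which the proximity mapping of the concave function \(\log \pi \) is defined as a maximiser. The superscript notation \(\lambda , A(\varvec{x})\) is illustrative and is not taken from the earlier sections.

```latex
\operatorname{prox}^{\lambda ,A(\varvec{x})}_{\log \pi }(\varvec{x})
  = \mathop{\mathrm{argmax}}_{\varvec{u}\in {\mathbb {R}}^n}
    \left\{ \log \pi (\varvec{u})
    - \frac{1}{2\lambda }(\varvec{u}-\varvec{x})^T A(\varvec{x})(\varvec{u}-\varvec{x}) \right\}
```

Under this convention, setting \(A(\varvec{x}) = I_n\) recovers the usual Euclidean penalty \(\Vert \varvec{u}-\varvec{x}\Vert ^2/2\lambda \).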

We emphasise at this point that P-MALA complements rather than replaces existing MALA and HMC methods by making high-dimensional simulation more efficient for target densities that are log-concave and have fast proximity mappings, in particular when they are not continuously differentiable. However, there are many models for which state-of-the-art MALA and HMC methods perform very well and for which P-MALA would not be applicable or computationally competitive.

Finally, we acknowledge that since the first preprint of this work (Pereyra 2013), two other works have independently proposed using proximity mappings in MCMC algorithms. These algorithms are similar to P-MALA in that they use thresholding operators within MALA and HMC algorithms (thresholding operators are a particular type of proximity mapping), but otherwise differ significantly from P-MALA. In particular, Schreck et al. (2013) consider a MALA for a variable selection problem and use thresholding/shrinking operators to design a proposal distribution with atoms at zero (i.e. that generates sparse vectors with positive probability). Chaari et al. (2014) also consider an algorithm for a similar variable selection problem related to signal processing. Similarly to Schreck et al. (2013), that algorithm also uses thresholding operators, but to approximate gradients within an HMC leap-frog integrator. However, because thresholding operators are not continuously differentiable, it is not clear whether this integrator preserves volume and, more crucially, whether the resulting HMC algorithm converges exactly to the desired target density.

## Footnotes

1. A vector \(\varvec{u}\in {\mathbb {R}}^n\) is a subgradient of the concave function \(g\) at the point \(\varvec{x}_0 \in {\mathbb {R}}^n\) if \(g(\varvec{x}) \le g(\varvec{x}_0) + (\varvec{x}-\varvec{x}_0)^T\varvec{u}\) for all \(\varvec{x}\in {\mathbb {R}}^n\). The set \(\partial g(\varvec{x}_0)\) of all such subgradients is called the subdifferential set of \(g\) at the point \(\varvec{x}_0\).

2. Note that bidimensional and tridimensional images can be represented as points in \({\mathbb {R}}^n\) via lexicographic ordering.

3. Recall that \(\text {ESS} = N\{1+2\sum _k \gamma (k)\}^{-1}\), where \(N\) is the total number of samples and \(\sum _k \gamma (k)\) is the sum of the \(K\) monotone sample auto-correlations, which we estimated with the initial monotone sequence estimator (Geyer 1992).
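The ESS formula in the last footnote can be sketched as follows for a scalar chain. This is one common reading of Geyer's initial monotone sequence estimator: sums of adjacent pairs of sample auto-correlations are truncated at the first non-positive pair and then forced to be non-increasing. The function names and the synthetic test chains are assumptions, and the sketch presumes a chain long enough for the estimated denominator to be positive.

```python
import numpy as np

def autocorrelations(x):
    """Sample auto-correlations gamma(0), ..., gamma(N-1) of a 1-D chain."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n
    return acov / acov[0]

def ess_initial_monotone(x):
    """ESS = N / (1 + 2 * sum_k gamma(k)) with an initial monotone sequence sum."""
    rho = autocorrelations(x)
    n = rho.size
    n_pairs = n // 2
    # Sums of adjacent pairs: gamma(0) + gamma(1), gamma(2) + gamma(3), ...
    pair_sums = rho[0:2 * n_pairs:2] + rho[1:2 * n_pairs:2]
    # Initial positive sequence: stop at the first non-positive pair sum.
    nonpos = np.nonzero(pair_sums <= 0.0)[0]
    if nonpos.size > 0:
        pair_sums = pair_sums[:nonpos[0]]
    # Initial monotone sequence: enforce non-increasing pair sums.
    pair_sums = np.minimum.accumulate(pair_sums)
    # Since gamma(0) = 1, 1 + 2 * sum_{k>=1} gamma(k) = -1 + 2 * sum of pair sums.
    tau = -1.0 + 2.0 * pair_sums.sum()
    return n / tau

rng = np.random.default_rng(0)
iid_chain = rng.standard_normal(2000)        # independent draws
innovations = rng.standard_normal(4000)
ar_chain = np.empty(4000)                    # strongly correlated AR(1) chain
ar_chain[0] = 0.0
for t in range(1, 4000):
    ar_chain[t] = 0.9 * ar_chain[t - 1] + innovations[t]

ess_iid = ess_initial_monotone(iid_chain)
ess_ar = ess_initial_monotone(ar_chain)
```

For independent draws the estimate should be close to \(N\), while a strongly correlated chain gives a much smaller value.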

## Notes

### Acknowledgments

The author would like to thank the editor and two anonymous reviewers for their valuable suggestions to improve the manuscript. The author is also grateful to Ioannis Papastathopoulos, Gersende Fort, Nick Whiteley, Peter Green, Jonathan Rougier, Guy Nason, Nicolas Dobigeon, Steve McLaughlin, Hadj Batatia and Jean-Christophe Pesquet for helpful comments. Marcelo Pereyra holds a Marie Curie Intra-European Fellowship for Career Development. This work was supported in part by the SuSTaIn program - EPSRC grant EP/D063485/1 - at the Department of Mathematics, University of Bristol, and by a French Ministry of Defence postdoctoral fellowship.

## References

- Afonso, M., Bioucas-Dias, J., Figueiredo, M.: An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems. IEEE Trans. Image Process. **20**(3), 681–695 (2011)
- Agarwal, A., Negahban, S., Wainwright, M.J.: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. **40**(5), 2452–2482 (2012)
- Atchade, Y.: An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodol. Comput. Appl. Probab. **8**(2), 235–254 (2006)
- Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
- Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. **4**(1), 1–39 (2009)
- Candès, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE **98**, 925–936 (2009)
- Candès, E.J., Tao, T.: The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inf. Theory **56**(5), 2053–2080 (2009)
- Candès, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE Signal Process. Mag. **25**(2), 21–30 (2008)
- Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM **58**(3), 11:1–11:37 (2011)
- Candès, E.J., Sing-Long, C.A., Trzasko, J.D.: Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Trans. Signal Process. **61**(19), 4643–4657 (2013)
- Casella, B., Roberts, G., Stramer, O.: Stability of partially implicit Langevin schemes and their MCMC variants. Methodol. Comput. Appl. Probab. **13**(4), 835–854 (2011)
- Chaari, L., Batatia, H., Chaux, C., Tourneret, J.-Y.: Sparse signal and image recovery using a proximal Bayesian algorithm. ArXiv e-prints (2014)
- Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imaging Vis. **20**(1–2), 89–97 (2004)
- Chandrasekaran, V., Jordan, M.I.: Computational and statistical tradeoffs via convex relaxation. Proc. Natl Acad. Sci. U.S.A. **110**(13), 1181–1190 (2013)
- Chandrasekaran, V., Sanghavi, S., Parrilo, P., Willsky, A.: Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. **21**(2), 572–596 (2011)
- Chandrasekaran, V., Parrilo, P.A., Willsky, A.S.: Latent variable graphical model selection via convex optimization. Ann. Stat. **40**(4), 1935–1967 (2012)
- Combettes, P., Wajs, V.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. **4**(4), 1168–1200 (2005)
- Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 181–212. Springer, New York (2011)
- Fazel, M.: Matrix rank minimization with applications. PhD thesis, Department of Electrical Engineering, Stanford University (2002)
- Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. Chapman and Hall/CRC, London (2013)
- Geyer, C.J.: Practical Markov chain Monte Carlo. Stat. Sci. **7**(4), 473–483 (1992)
- Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B **73**(2), 123–214 (2011)
- Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra, and Filtering. SIAM, Philadelphia (2006)
- Higham, D.J., Mao, X., Stuart, A.M.: Strong convergence of Euler-type methods for nonlinear stochastic differential equations. SIAM J. Numer. Anal. **40**(3), 1041–1063 (2003)
- Komodakis, N., Pesquet, J.-C.: Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems. ArXiv e-prints (2014)
- Łatuszyński, K., Roberts, G.O., Thiéry, A., Wolny, K.: Discussion of Riemann manifold Langevin and Hamiltonian Monte Carlo methods by Mark Girolami and Ben Calderhead. J. R. Stat. Soc. Ser. B **73**(2), 188–189 (2011)
- Marnissi, Y., Benazza-Benyahia, A., Chouzenoux, E., Pesquet, J.-C.: Majorize-Minimize adapted Metropolis-Hastings algorithm. Application to multichannel image recovery. In: 22nd European Signal Processing Conference (EUSIPCO 2014), Lisbon, Portugal (2014)
- Martinet, B.: Regularisation d’inéquations variationelles par approximations successives. Revue Fran. d’Automatique et Informatique Rech. Opérationelle **4**, 154–159 (1970)
- Mattingly, J., Stuart, A., Higham, D.: Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stoch. Proc. Appl. **101**(2), 185–232 (2002)
- Meyn, S., Tweedie, R.: Markov Chains and Stochastic Stability. Springer, London (1993)
- Moreau, J.-J.: Fonctions convexes duales et points proximaux dans un espace Hilbertien. C. R. Acad. Sci. Paris Sér. A Math. **255**, 2897–2899 (1962)
- Neal, R.: MCMC using Hamiltonian dynamics. ArXiv e-prints (2012)
- Negahban, S., Wainwright, M.J.: Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res. **13**, 1665–1697 (2012)
- Oliveira, J., Bioucas-Dias, J., Figueiredo, M.: Adaptive total variation image deblurring: a majorization-minimization approach. Signal Process. **89**(9), 1683–1693 (2009)
- Ottobre, M., Stuart, A.M.: Diffusion limit for the random walk Metropolis algorithm out of stationarity. ArXiv e-prints (2014)
- Papadopoulo, T., Lourakis, M.I.A.: Estimating the Jacobian of the singular value decomposition: theory and applications. In: Proceedings of the 6th European Conference on Computer Vision-Part I (ECCV ’00), pp. 554–570 (2000)
- Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. **1**(3), 123–231 (2014)
- Park, T., Casella, G.: The Bayesian lasso. J. Am. Stat. Assoc. **103**(482), 681–686 (2008)
- Pereyra, M.: Proximal Markov chain Monte Carlo algorithms. ArXiv e-prints (2013)
- Pesquet, J.-C., Pustelnik, N.: A parallel inertial proximal optimization method. Pac. J. Optim. **8**(2), 273–305 (2012)
- Pillai, N.S., Stuart, A.M., Thiéry, A.H.: Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Probab. **22**(6), 2320–2356 (2012)
- Rahul, M., Trevor, H., Robert, T.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. **11**, 2287–2322 (2010)
- Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev. **52**(3), 471–501 (2010)
- Robert, C.P., Casella, G.: Monte Carlo Statistical Methods, 2nd edn. Springer, New York (2004)
- Roberts, G., Stramer, O.: Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput. Appl. Probab. **4**, 337–357 (2002)
- Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli **2**(4), 341–363 (1996)
- Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. **14**, 877–898 (1976)
- Schreck, A., Fort, G., Le Corff, S., Moulines, E.: A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection. ArXiv e-prints (2013)
- Stramer, O., Tweedie, R.L.: Langevin-type models I: diffusions with given stationary distributions and their discretizations. Methodol. Comput. Appl. Probab. **1**, 283–306 (1999a)
- Stramer, O., Tweedie, R.L.: Langevin-type models II: self-targeting candidates for MCMC algorithms. Methodol. Comput. Appl. Probab. **1**, 307–328 (1999b)
- Yuan, Q., Minka, T.P.: Hessian-based Markov chain Monte Carlo Algorithms. Unpublished manuscript (2002)
- Zhang, Y., Sutton, C.: Quasi-Newton Markov chain Monte Carlo. In: Advances in Neural Information Processing Systems (NIPS) (2011)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.