Proximal Markov chain Monte Carlo algorithms

This paper presents a new Metropolis-adjusted Langevin algorithm (MALA) that uses convex analysis to simulate efficiently from high-dimensional densities that are log-concave, a class of probability distributions that is widely used in modern high-dimensional statistics and data analysis. The method is based on a new first-order approximation for Langevin diffusions that exploits log-concavity to construct Markov chains with favourable convergence properties. This approximation is closely related to Moreau–Yoshida regularisations for convex functions and uses proximity mappings instead of gradient mappings to approximate the continuous-time process. The proposed method complements existing MALA methods in two ways. First, the method is shown to have very robust stability properties and to converge geometrically for many target densities for which other MALA are not geometric, or only if the step size is sufficiently small. Second, the method can be applied to high-dimensional target densities that are not continuously differentiable, a class of distributions that is increasingly used in image processing and machine learning and that is beyond the scope of existing MALA and HMC algorithms. To use this method it is necessary to compute or to approximate efficiently the proximity mappings of the logarithm of the target density. For several popular models, including many Bayesian models used in modern signal and image processing and machine learning, this can be achieved with convex optimisation algorithms and with approximations based on proximal splitting techniques, which can be implemented in parallel. The proposed method is demonstrated on two challenging high-dimensional and non-differentiable models related to image resolution enhancement and low-rank matrix estimation that are not well addressed by existing MCMC methodology.


Introduction
With ever-increasing computational resources Monte Carlo sampling methods have become fundamental to modern statistical science and many of the disciplines it underpins.In particular, Markov chain Monte Carlo (MCMC) algorithms have emerged as a flexible and general purpose problems related to signal and image processing, and machine learning.The remainder of the paper is structured as follows: Section 2 specifies the class of distributions considered, defines some elements of convex analysis which are essential for our methods, and briefly recalls the unadjusted Langevin algorithm (ULA) and its metropolised version MALA.In Section 3.1 we present a proximal ULA for log-concave distributions and study its geometric convergence properties.Following on from this, Section 3.2 presents a proximal MALA which inherits the favourable convergence properties of the unadjusted algorithm while guaranteeing convergence to the desired target density.Section 4 demonstrates the proposed methodology on two challenging high-dimensional applications related to signal and image processing: audio compressive sensing and image resolution enhancement.Conclusions and potential extensions are finally discussed in Section 5. A MATLAB implementation of the proposed methods is available at http://www.maths.bris.ac.uk/ ~mp12320/code/ProxMCMC.zip.

Convex analysis
Let x ∈ R n and let π(dx) be a probability distribution which admits a density π(x) with respect to the usual n-dimensional Lebesgue measure.We consider the problem of simulating from target densities of the form where g : R n → [0, ∞) is a concave upper semicontinuous function such that lim x →∞ g(x) = −∞.
It is assumed that g(x) can be evaluated point-wise and that the normalising constant κ may be unknown.Although not denoted explicitly, g may depend on the value of an observation vector, for instance in Bayesian inference problems.The methods presented in this paper will require g to have a proximity mapping that is inexpensive to evaluate or to approximate.
Definition 2.1.Proximity mappings.The λ-proximity mapping or proximal operator of a concave function g is defined for any λ > 0 as (Parikh & Boyd 2014) prox λ g (x) = argmax Proximity mappings share many important properties with gradient mappings that are useful for devising fixed point methods, such as being firmly non-expansive, i.e., prox λ g (x) − prox λ g (y) 2 ≤ (x−y) T {prox λ g (x)−prox λ g (y)}, ∀x, y ∈ R n , and mappings points towards function maxima (Parikh & Boyd 2014, Ch. 2).These mappings have received significant attention in the recent convex optimisation literature and are used extensively in the proximal optimisation algorithms that underpin modern high-dimensional statistics, signal and image processing, and machine learning (Agarwal et al. 2012, Chandrasekaran & Jordan 2013, Combettes & Pesquet 2011).
Definition 2.3.Class of distributions E(β, γ) We say that π belongs to the one-dimensional class of distributions with exponential tails E(β, γ) if for some u, and some constants γ > 0 and β ≥ 0, π takes the form Moreau approximations have several properties that will be useful for constructing algorithms to simulate from π.

Maximizers:
The set of maximizers of π λ is equal to that of π.Also, because π λ is continuously differentiable, ∇ log π λ (x * ) = 0 implies that x * is a maximizer of π.
Properties 1-5 are extensions of well known results for Moreau-Yoshida envelope functions first established in Moreau (1962).Property 1 results from the fact that in the limit λ → 0 the term exp (− u − x 2 /2λ) tends to a Dirac delta function at x. Property 2 can be easily established by using the results of Section 2.3 of Combettes & Wajs (2005).Property 3 follows from the fact that prox λ g (x) is the maximiser of h(u) = log π(u) + u − x 2 /2λ and therefore 0 ∈ ∂h{prox λ g (x)} (Combettes & Wajs 2005, Lemma 2.5).Property 4 follows from Properties 2 and 3: if x * is a maximizer of π λ then from Property 2, prox λ π (x * ) = x * , and from Property 3, 0 ∈ ∂ log π(x * ).Then, Fermat's rule, generalized to subdifferentials, together with the fact that π is log-concave implies that x * is a maximizer of π.Property 5 results from the fact that the proximity mapping of the separable sum g & Boyd 2014, Ch. 2).Finally, to establish Property 6 we use (3) and ( 4) and note that for x sufficiently large, π λ has exponentially decreasing tails with exponent β = β if β ∈ [1, 2] and β = 2 if β > 2 (distributions with β < 1 are not log-concave).
As mentioned previously, the methods proposed in this paper are useful for models that have proximity mappings which are easy to evaluate or to approximate numerically.This is the case for many statistical models used in high-dimensional data analysis, where statistical inference is often conducted using convex optimisation algorithms that also require computing proximity mappings (see Afonso et al. (2011), Becker et al. (2009), Chandrasekaran et al. (2012), Recht et al. (2010) for examples in image restoration, compressive sensing, low-rank matrix recovery and graphical model selection).For more details about the evaluation of these mappings, their properties, and lists of functions with known mappings please see Parikh & Boyd (2014, Ch. 6), Bauschke & Combettes (2011) and Combettes & Pesquet (2011).A library with MATLAB implementations of frequently used proximity mappings is available on https://github.com/cvxgrp/proximal.

Langevin Markov chain Monte Carlo
The methods presented in this paper are derived from to the Langevin diffusion process and are related to other Langevin MCMC algorithms that we briefly recall below.
Suppose that π is everywhere non-zero and differentiable so that ∇ log π is well defined.Then let W be the n-dimensional Brownian motion and consider a Langevin diffusion process y = {y(t) : 0 ≤ t ≤ T } on R n that has π as stationary distribution.Such process is defined as the solution to the stochastic differential equation Under appropriate stability conditions, y(t) converges in distribution to π and is therefore potentially interesting for simulating from π. Unfortunately, direct simulation from y is only possible in very specific cases.A more general solution is to consider a discrete-time approximation of the Langevin diffusion process with step-size δ.For computational reasons a forward Euler approximation is typically used, resulting in the so-called unadjusted Langevin algorithm (ULA) ULA : where the parameter δ controls the discrete time increment as well as the variance of the Gaussian perturbation Z (m) .Under certain conditions on π and δ, ULA produces a good approximation of y and converges to an ergodic measure which is close to π.In MALA this approximation error is corrected by introducing a Metropolis-Hastings rejection step that guarantees converges to the correct target density π (Roberts & Tweedie 1996).
It is well known that ULA and MALA can be very efficient sampling methods, particularly in high dimensional problems.However, it is also known that for certain classes of target densities ULA is transient and as a result MALA is not geometrically ergodic (Roberts & Tweedie 1996).Geometric ergodicity is important theoretically to guarantee the existence of a central limit theorem for the chains and practically because sub-geometric algorithms often fail to explore the parameter space properly.
In the following section we present two new ULA and MALA methods that use proximity mappings and Moreau approximations to capture the log-concavity of the target density and construct chains with better geometric convergence properties.We emphasise at this point that this is not the first work to consider modifications of ULA and MALA with better geometric convergence properties.For example, Roberts & Tweedie (1996) suggested using MALA with a truncated gradient to retain the efficiency of the Langevin proposal near the density's mode and add robustness in the tails, though we have found this approach to be difficult to implement practically (this is illustrated in Section 3.2.2).Also, Casella et al. (2011) recently proposed three variations of MALA based on implicit discretisation schemes that are geometrically ergodic for one-dimensional distributions with super-experiential tails.For certain one-dimensional densities the methods presented in this paper are closely related to the partially implicit schemes of Casella et al. (2011).

Proximal unadjusted Langevin algorithm
This section presents two Langevin MCMC algorithms that exploit convex analysis to sample efficiently from log-concave densities π of the form (1). We first present a proximal unadjusted Langevin algorithm (P-ULA) that generates samples approximately distributed according to π.
We establish that this algorithm is ergodic and convergences geometrically in many cases for which ULA is transient or explosive.Following on from this, we propose a proximal Metropolis adjusted Langevin algorithm (P-MALA) that supplements P-ULA with a Metropolis-Hastings step that guarantees converges to π.We show that this algorithm inherits the favourable converges properties of P-ULA and converges to π geometrically fast in many cases for which MALA is not geometric.
A key element of this paper is to approximate the Langevin diffusion y with an auxiliary diffusion y λ which has invariant measure π λ , defined by the stochastic differential equation ( 5) with π replaced by its λ-Moreau approximation (3).We wish to use y λ to simulate from π λ , which we can make arbitrarily close to π by selecting a small value of λ.Direct simulation from y λ is typically infeasible and we therefore consider the forward Euler approximation (6) for y λ , From Property 2 we obtain that ( 7) is equal to This Markov chain has two alternative interpretations that provide insight on how to set λ. First, ( 8) is a discrete approximation of a Langevin diffusion with invariant measure π λ , and since we are interested is simulating from π, we should set λ to a small value to bring π λ close to π.Second, from a convex optimisation viewpoint, (8) coincides with a relaxed proximal point iteration to maximize π with relaxation parameter δ/2λ, plus a stochastic perturbation given by √ δZ (Rockafellar 1976).
According to this second interpretation, λ should not be smaller than δ/2 as this could lead to unstable proximal point update and to a transient chain.We therefore set λ to its smallest stable value, i.e., λ = δ/2, and obtain the P-ULA Markov chain P-ULA : In the remainder of this section we study the converges properties of P-ULA, which is essentially a forward Euler discretisation of an auxiliary diffusion y δ/2 approximating y.We show that under weak conditions P-ULA approximates y equally or more accurately than ULA and with better ergodicity properties.For example, P-ULA is geometrically ergodic in cases in which ULA is transient, such as when π has lighter-than-Gaussian tails.

Convergence properties
In a manner akin to Roberts & Tweedie (1996) and Casella et al. (2011), we study the geometric convergence properties of P-ULA for the case where π is one-dimensional and we illustrate our results on the class E(β, γ).Geometric convergence results for higher-dimensional models can be established by combining Lemma 3.1 below with Theorem 7.1 and Corollary 7.4 of Mattingly et al. (2002).
Theorem 3.1.Suppose that π is one-dimensional and that (1) holds.For some fixed d > 0, let Then P-ULA is geometrically ergodic if one of the following holds: Proof.The proof follows from the fact that ∇ log π δ/2 is continuous and P-ULA is µ Leb -irreductible and weak Feller, and hence all compact sets are small (Meyn & Tweedie 1993, Ch. 6).This result, with the conditions on S + d and S − d and Property 2, imply the conditions of Theorem 3.1 of Roberts & Tweedie (1996) and hence that P-ULA is geometrically ergodic.
The convergence properties of P-ULA are most clearly demonstrated when π belongs to the class E(β, γ), which is used in Roberts & Tweedie (1996) to study the convergence properties of ULA and in Casella et al. (2011) for variations of ULA based on implicit discretisation schemes.
We will show that P-ULA is geometrically ergodic for all the distributions in E(β, γ) for which ULA is ergodic and for many for which ULA is transient.
This result follows from the fact that (1) implies β ≥ 1 (distributions belonging to E(β, γ) with β < 1 are not log-concave), which in turn implies that π δ/2 ∈ E(β , γ ) with β = min(β, 2) and some γ > 0. The geometric convergence of P-ULA is then established by checking that for d = β − 1 the limits S + d and S d exist and verify the conditions of Theorem 3.1 for all δ > 0. Note that P-ULA is geometrically ergodic for all log-concave distributions in E(β, γ) regardless of the choice of δ.This is in high contrast with ULA, which is known to be explosive for all δ whenever π has lighter-than-Gaussian tails (β > 2), and for δ > 2/γ if β = 2 (Roberts & Tweedie 1996).
Theorem 3.1 above establishes that under certain conditions P-ULA converges geometrically fast to some unknown ergodic measure.Following this result, the question arises as to whether this unknown measure is a good approximation of π.Here we address this point by considering the more general question of how well P-ULA approximates the time-continuous Langevin diffusion y, whose stationary solution is π.We focus here on strong mean square convergence in the sense of Higham et al. (2003), which implies convergence in probability to π.Note that the following results are valid for multidimensional models.
We begin by making the following key observation relating P-ULA to the diffusion y.
Lemma 3.1.Suppose that π ∈ C 1 and that (1) holds.Let y be the Langevin diffusion process defined in (5), with stationary density π.Then the P-ULA chain is equivalent to a split-step backward Euler approximation (SSBE) of y with time step δ; that is, Proof.Proving this result simply consist of decomposing the transition kernel ( 9) into a split-step scheme and then using Property 3 to express the proximal update as an implicit or backwards Euler step.
Split-step backward Euler approximations are known to be generally significantly better behaved than forward schemes, but also to be more computationally expensive as they require solving implicit equations (Mattingly et al. 2002).However, for distributions that have efficient proximity mappings, using this particular implicit approximation comes at no or little computational overhead as the implicit update is replaced by one explicit evaluation of prox λ g .We note at this point that for one-dimensional densities π ∈ E(β, γ) with β ≥ 2 P-ULA is equivalent to a partially-implicit scheme UPILA 3 of Casella et al. (2011) that also uses a split-step Euler approximation.
We use Lemma 3.1 to study the accuracy of the P-ULA approximation of y as a function of the discretisation step δ.We show that under weak conditions P-ULA converges strongly to y at optimal rate O(δ), comparing favourably to ULA which can be transient and not converge to y. Theorem 3.2.Suppose that π ∈ C 2 and that (1) holds.Then there exists a continuous-extension Ȳ (t) of the P-ULA chain for which where y is the Langevin diffusion (5) with ergodic measure π.Moreover, if ∇ log π behaves polynomially then P-ULA converges strongly to y at optimal rate; that is, Proof.To prove the first result we use the equivalency between P-ULA and an SSBE approximation of y and apply Theorem 3.3 of Higham et al. (2003), where we note that assumption (1) implies condition 3.1 of Higham et al. (2003).The proof of the second result follows from Higham et al. (2003, Theorem 4.7).

Illustrative example
For illustration we show applications of P-ULA to the four log-concave distributions depicted in Figure 1, which span a broad range of tail behaviours comprising exponential (β = 1), Gaussian (β = 2), super-Gaussian (β = 4) and short tailed (β → ∞).The proximity mappings related to the Laplace and Gaussian distributions are prox δ/2 g (x) = sgn(x){max(|x|, δ/2) − δ/2} and prox δ/2 g (x) = x(1 + δ/2) −1 respectively.The mapping for the uniform distribution is the projection operator prox δ/2 g (x) = min{max(x, −0.5), 0.5}, and that of the quartic distribution is the solution to the depressed cubic equation 2δu 3 + u − x = 0.For each distribution we generated 110 000 samples using P-ULA with δ = 0.02 for the Laplace, Gaussian and quartic distributions, and δ = 0.002 for the uniform distribution.Figure 2 shows the histograms associated with each chain after removing the first 10 000 samples (burn-in period).We observe that the chains converged to an ergodic measure that is a good approximation of the desired target density.We also observe in Figure 2(d) that the P-ULA approximation of the uniform distribution has Gaussian tails, as described by Property 6 and Figure 1(d).This mismatch can be improved by reducing the value of δ.

Proximal MALA
P-ULA simulates samples from an approximation of π that depends on δ.A natural strategy to correct this approximation error is to supplement P-ULA with a Metropolis-Hasting acceptreject step guaranteeing convergence to π, leading to a proximal Metropolis adjusted Langevin algorithm (P-MALA).This is a Metropolis-Hastings chain X (m) which uses P-ULA as proposal chain.Precisely, given X (m) , a candidate X * is generated by using one P-ULA transition We accept this candidate and set where }, δI n is the P-ULA transition kernel given by ( 9).Otherwise, with probability 1 − r{X (m) , Y * }, we reject the proposition and set X (m+1) = X (m) .By the Hastings construction, the P-MALA chain converges to π in the total variation norm (this follows from the facts that the chain is irreducible, aperiodic and has π as invariant measure (Robert & Casella 2004, ch. 7)).Note that though (12) involves two proximity mappings, we only need to evaluate prox δ/2 g (X * ) at each iteration since prox δ/2 g {X (m) } is known from the algorithm's previous iteration.
In the remainder of this section we study the geometric ergodicity properties of P-MALA.

Convergence properties
We provide two alternative sets of conditions for the geometric ergodicity of P-MALA and illustrate our results on the case where π belongs to the class E(β, γ).
Theorem 3.3.Suppose that (1) holds.Let A(x) = {u : r(x, u) ≥ 1} be the acceptance region of P-MALA from point x, and I(x) = {u : x ≥ u } the region of points interior to x. Suppose that A converges inwards in q, i.e., lim x →∞ A(x)∆I(x) q(u|x)du = 0 where A(x)∆I(x) denotes the symmetric difference A(x) ∪ I(x) \ A(x) ∩ I(x).Then P-MALA is geometrically ergodic.
Proof.To prove this result we use Threorem 5.4 of Bauschke & Combettes (2011) to show that if (1) holds then, for any x, the mean candidate position prox λ g (x) verifies the inequality prox λ g (x) < x .This result, together with the condition that A converges inwards in q, imply that P-MALA is geometrically ergodic (Roberts & Tweedie 1996, Threorem 4.1).
To illustrate this result and compare the convergence properties of MALA and P-MALA we consider the case π ∈ E(β, γ).
Proving this result simply consists of checking that if π ∈ E(β, γ) and (1) holds then A converges inwards in q and therefore Theorem 3.3 applies, where we note that (1) implies that β ≥ 1.Again, we note the good convergence properties of P-MALA and recall that MALA is not geometrically ergodic for any δ when β > 2 or for δ > 2/γ when β = 2 (Roberts & Tweedie 1996).
Moreover, the convergence properties of P-MALA can also be studied in the framework of Random-walk Metropolis-Hastings algorithms with bounded drift (Atchade 2006).
Theorem 3.4.Suppose that π ∈ C 1 and that (1) holds.Assume that there exists R > 0 such that ∀x ∈ R n , x − prox δ/2 g(x) < R, and that π verifies the conditions Then P-MALA is geometrically ergodic.
Proof.The proof follows from the facts that the geometric ergodicity analysis for Shrinkagethresholding MALA (Schreck et al. 2013), which is adapted from (Atchade 2006), is general to all Metropolis-Hastings algorithms with bounded drift, and that the conditions on π, together with the bounded drift condition x − prox λ g (x) < R, satisfy the assumptions of Theorem 4.1 of (Schreck et al. 2013).Note that it is always possible to enforce the bounded drift condition by composing prox λ g (x) with a projection onto an 2 -ball centred at x (this is equivalent to using a truncated gradient as proposed in (Roberts & Tweedie 1996)).Also, it is possible to relax the smoothness assumption to π ∈ C 0 by adding assumptions A3 and A4 from (Schreck et al. 2013).
Finally, similarly to other MH algorithms based on local proposals, P-MALA may be geometrically ergodic yet perform poorly if the proposal variance δ is either too small or very large.
Theoretical and experimental studies of MALA show that for many high-dimensional target densities the value of δ should be set to achieve an acceptance rate of approximately 40% − 70% (Pillai et al. 2012).These results do not apply directly to P-MALA.However, given the similarities between MALA and P-MALA, it is reasonable to assume that the values of δ that are appropriate for MALA will generally also produce good results for P-MALA.In our experiments we have found that P-MALA performs well when δ is set to achieve an acceptance rate of 40% − 60%.

Illustrative example
For illustration we show an application of P-MALA to the density π(x) ∝ exp(−x 4 ) depicted in Figure 1(c).We compare our results with MALA, with Hamiltonian Monte Carlo (HMC) (Neal 2012), and with the truncated gradient MALA (T-MALA) introduced in Roberts & Tweedie (1996).As explained previously, MALA is not geometrically ergodic for this target density due to the lighter-than-Gaussian tails.This can be cured by using T-MALA, which is a bounded-drift random-walk Metropolis-Hastings algorithm constructed by replacing h(x) = ∇ log π(x) in the MALA proposal with h (x) = h{max( , h(x) )} −1 for some > 0 (Atchade 2006).Although geometrically ergodic, T-MALA can converge very slowly if the truncation threshold is not set correctly.Setting good values for can be difficult in practice, particularly because values that appear suitable in certain regions of the state space are unsuitable in others.To the best of our knowledge there are no theoretical results regarding the geometric ergodicity of HMC for this target density; however our experiments suggest that it suffers from the same drawbacks as MALA.
Figures 3(a)-(d) display the first 250 samples of three chains generated with P-MALA, MALA, HMC and T-MALA with initial state X (0) = 10.We implemented MALA, P-MALA and T-MALA using δ = 1, and for HMC we used δ = 0.1 and 10 leap-frog steps.The threshold value for T-MALA has been set to = 20 by conducting a series of pilot runs.We found that for smaller values T-MALA behaves like a Random-walk Metropolis-Hastings algorithm, and that for larger values, say 50, it rejects the proposed moves with very high probability and gets "stuck" at the initial state.We observe in Figures 3(a)-(d) that the chain generated by P-MALA and T-MALA exhibit good mixing, whereas MALA and HMC have rejected all the proposed moves.We repeated this experiment using the initial X (0) = 5 and the same values for δ and .The first 250 samples of each chain are displayed in Figures 3(e)-(h).Again, we observe the good mixing of P-MALA and that MALA and HMC have rejected all the proposed moves.However, we also observe that in this occasion T-MALA got "stuck" at a state where its mixing properties are very poor.

Applications
This sections provides two applications of P-ULA and P-MALA to high-dimensional log-concave models that are widely used in statistical signal and image processing.The first is an application of the Bayesian Lasso (Park & Casella 2008) to a variable selection problem and to audio compressive sensing.The second example considers the problem of computing Bayesian credibility regions for an image resolution enhancement problem.

Bayesian LASSO
Consider the problem of simulating samples from the Bayesian LASSO model (Park & Casella 2008) where y ∈ R p is an observation vector, A ∈ R p×n is a regressor matrix, x 1 = n i=1 |x i |, and where α and σ 2 are fixed hyper-parameters related to the degree of sparsity of x and to the noise variance.This model is used extensively in modern signal processing, for example for compressive sensing and dictionary-based signal reconstruction (Candès & Wakin 2008).At present, signal processing methods based on (13) use almost exclusively maximum-a-posteriori estimators of x that be computed efficiently in high-dimensional scenarios by using proximal optimisation algorithms (Parikh & Boyd 2014).Park & Casella (2008) recently studied the problem of simulating samples from (13) in order to compute Bayesian estimators and high-credibility regions for x (though not specifically for highdimensional applications).They proposed a data-augmentation (DA) Gibbs sampler that uses an appropriate auxiliary vector τ ∈ R n to draw samples from the augmented posterior π(x, τ |y) which admits (13) as marginal.This sampler converges quickly but has a high computational cost due to the need to sample τ which involves generating n univariate inverse Gaussian random variables.More recently, Hans (2009) proposed an orthogonalized Gibbs sampler for (13) that does not require auxiliary variables.For small data sets (i.e., p ≤ 10) this algorithm is more efficient than the DA Gibbs sampler.However, the computational complexity of the sampler scales poorly with the dimensions of y and for medium size data sets (i.e., p > 50) its performance becomes significantly worse than that of the DA Gibbs sampler (Hans 2009).
Here we propose to use P-ULA and P-MALA to simulate from ( 13), where we note that g(x) = − y − Ax 2 /2σ 2 − α x 1 is concave and that its proximity mapping can be efficiently approximated by using forward-backward splitting (Parikh & Boyd 2014); that is, by expressing g as the addition of a differentiable term g 1 (x) = − y − Ax 2 /2σ 2 and a term with an explicit proximity mapping g 2 (x) = −α x 1 and using the approximation prox δ/2 g (x) ≈ prox δ/2 g 2 {x + δ∇g 1 (x)/2}, which converges to the true proximity mapping of g as δ → 0. The proximity mapping of g 2 is the component-wise soft-thresholding operator with threshold αδ/2, i.e., prox

Diabetes data
In our first experiment we apply P-ULA and P-MALA to the diabetes data of Efron et al. (2004), which has p = 442 observations and n = 10 regressors.We use the DA Gibbs sampler of Park & Casella (2008) as a benchmark for the relative efficiency of our methods as well as to assess the accuracy of P-ULA, which converges to an approximation of (13). Figure 4(a) compares the posterior median and 90% credibility region estimates obtained with P-ULA, P-MALA and the Gibbs sampler.These estimates have been computed using 100 000 iterations (after 10 000 burn-in iterations) and using the marginal maximum likelihood estimates for σ and α.The value of δ for P-MALA was chosen to obtain an acceptance rate of around 50%, and for P-ULA it was set to σ 2 /3 by performing pilot runs.We observe that the three algorithms produced very similar estimates and that P-ULA (black error bars), converging to an approximation of (13), slightly overestimated the marginal credibility regions of x.We have observed this behaviour of P-ULA in other experiments and believe that it is due to the fact that Moreau approximations overapproximate the tails of the true density and underapproximate the bulk (this phenomenon is illustrated in Figure 1).Moreover, Figure 4(b) shows 1000-sample trace plots of the chains generated with each algorithm.We observe that the Gibbs sampler produced chains with lower autocorrelation than P-ULA and P-MALA.However, the Gibbs sampler was also significantly more computationally expensive: computing these estimates required 43.80 seconds for the Gibbs sampler, 1.34 seconds for P-MALA and 0.41 seconds for P-ULA (P-ULA is considerably faster than P-MALA because it does not need to evaluate the acceptance probability ( 12)).To compare the relative efficiency of these three algorithms we computed the effective sample size (ESS) for each chain divided by the central processor unit time required to generate the samples.Table 1 reports the time-normalised EES estimates for each covariate and for each algorithm (values computed by repeating simulations 10 times and taking averages).Recall that ESS = N {1 + 2 k γ(k)} −1 , where N is the total samples and k γ(k) is the sum of the K monotone sample auto-correlations which we estimated with the initial monotone sequence estimator (see Geyer (1992)).To ensure a fair comparison the three methods were implemented in the interpreted language MATLAB and using Enes Makalik's optimised implementation of the DA Gibbs sampler 1 .Simulations were performed on a workstation with an Intel Core i7 @3.4 GHz processor, an NVIDIA Quadro 600 graphic card and MATLAB R2013a.We observe (Table 1) that P-ULA is significantly more efficient than P-MALA and than the Gibbs sampler for all the covariates, and that it achieves an average time-normalised EES that is over 4 times higher than P-MALA and over 10 times higher than the Gibbs sampler (though P-ULA simulates from an approximation of ( 13)).We also observe than P-MALA is on average 2.5 times faster than the Gibbs sampler, but has a lower worst time-normalised EES (covariate x 5 ).This shortcoming of P-MALA is due to the fact that there is a single value of δ for all covariates; better EES results could certainly be obtained by introducing a preconditioning matrix that takes into account the anisotropy and correlations of the target density (Atchade 2006).The development of an adaptive P-MALA based on proximity mappings on non-canonical Euclidean spaces is currently under investigation.Table 1: Effective sample size per millisecond for the DA Gibbs sampler (Park & Casella 2008), P-MALA and P-ULA on the the diabetes data Efron et al. (2004).In order to demonstrate the potential of our methods for high-dimensional problems we now show an application to audio compressive sensing.The aim of compressive sensing is to reduce signal acquisition time by taking advantage of the fact that the signal that we wish to measure has a sparse representation in a given dictionary.This allows to speed-up the acquisition by only observing of the signal partially; the "blanks" are subsequently reconstructed by using the prior knowledge that the projection of the signal on the dictionary is dominated by a few large components (this is typically encoded by using a Laplace prior on each component).Precisely, suppose that we wish to acquire a j-dimensional time-discrete signal z ∈ R j that we expect to be sparse in some dictionary Φ ∈ R j×n , i.e, there exists a nearly sparse vector x ∈ R n such that z = Φx.Then, instead of acquiring z directly which would require measuring j univariate components, we use a carefully designed matrix Ψ ∈ R p×j , with p << j, and observe the "compressed" signal y = Ψz that only require taking p measurements.The desired signal z is then typically retrieved by computing the maximum-a-posteriori (MAP) estimate x obtained by maximising ( 13) with A = ΨΦ and then setting ẑ = Φx (see Candès & Wakin (2008) for more details about the compressive sensing framework).
Here we consider the problem of selecting the value of the hyper-parameter α in (13).The value of this parameter controls the degree of sparsity of x and has a strong impact on estimation performance.We adopt an empirical Bayes approach and seek to set α by maximum marginal likelihood estimation (Park & Casella 2008).This likelihood is intractable as it involves marginalising the high-dimensional vector x.In a manner akin to Park & Casella (2008), we address this difficulty by using a Monte Carlo expectation maximisation (EM) algorithm which we drive using P-MALA.In each iteration of this EM algorithm, the posterior expectation E π ( x 1 ) needed to update α is approximated by simulating from ( 13) with a P-MALA sampler using a value of α estimated from the samples of the previous iteration.Specifically, iteration k uses P-MALA to generate M samples from (13) with hyper-parameter α k−1 (i.e., the estimate from iteration k − 1), with M 0 burn-in iterations, and then computes the update rule .
In our experiments we used M 0 = 10 000, M = 10 000, and adjusted the value of δ during the burn-in period of each EM iteration to achieve an acceptance rate of approximately 30% − 60%.
Figure 5(a) shows the evolution of α for 30 iterations of the proposed P-MALA EM algorithm for the "Mary had a little lamb" audio experiment of Balzano et al. (2010).In this experiment an observation vector y of dimension p = 456 is obtained by projecting a MIDI-generated audio file z of dimension j = 319 725 onto a Bernulli random measurement matrix Ψ ∈ {0, 1} p×j (i.e., by observing 456 random entries of z) .The unknown parameter vector x has dimension n = 2 900 and is related to z by a dictionary matrix Φ whose row vectors correspond to different piano notes lasting a quarter-second long.As initialisation for our algorithm we used the heuristic estimate α = 0.1 A T y ∞ /σ 2 = 194 used in Balzano et al. (2010), with σ = 0.015.
We observe in Figure 5(a) that the EM algorithm required around 20 iterations to converge to the maximum marginal likelihood estimate (MMLE) α = 18.6.were computed with the optimisation method of Figueiredo et al. (2007).We observe that the empirical Bayes estimate is very accurate and has correctly identified the active components of x as well as their amplitude (using the MMLE α = 18.6 instead of α = 194 reduced by 360% the total reconstruction error, as measured by Φx 0 − Φx 2 ).Computing the maximum marginal estimate α with P-MALA EM required 216 seconds.Repeating this experiment using P-ULA instead of P-MALA produced an estimate α = 17.9 and required 99 seconds (with δ = σ 2 /5).Based on preliminary pilot runs we estimate that repeating this experiments with the Gibbs sampler of Park & Casella (2008) would require around 3 days (each iteration of this algorithm requires sampling n auxiliary inverse Gaussian variables and inverting an n × n matrix) .

Bayesian image deconvolution with a total-variation prior
In image deconvolution or deblurring problems, the goal is to recover an original image x ∈ R n from a blurred and noisy observed image y ∈ R n related to x by the linear observation model y = Hx+w, where H is a linear operator representing the blur point spread function and w is the sample of a zero-mean white Gaussian vector with covariance matrix σ 2 I n .This inverse problem is usually ill-posed or ill-conditioned, i.e., either H does not admit an inverse or it is nearly singular, thus yielding highly noise-sensitive solutions.Bayesian image deconvolution methods address this difficulty by exploiting prior knowledge about x in order to obtain more robust estimates.One of the most widely used image prior for deconvolution problems is the improper total-variation norm prior, π(x) ∝ exp (−α ∇ d x 1 ), where ∇ d denotes the discrete gradient operator that computes the vertical and horizontal differences between neighbour pixels.This prior encodes the fact that differences between neighbour image pixels are often very small and occasionally take large values (i.e., image gradients are nearly sparse).Based on this prior and on the linear observation model described above, the posterior distribution for x is given by Image processing methods based on ( 14) are almost exclusively based on MAP estimates of x that can be efficiency computed using proximal optimisation algorithms (Afonso et al. 2011).Here we consider the problem of computing credibility regions for x, which we use to assess the confidence in the restored image.Precisely, we note that ( 14) is log-concave and use P-MALA to compute marginal 90% credibility regions for each image pixel.There are several efficient approximations for the proximity mapping of g(x) = − y−Hx 2 /2σ 2 −α ∇x 1 based on different decompositions of g.Here we use a forward-backward approximation prox δ/2 g(x) ≈ prox δ/2 g 2 {x+δ∇g 1 (x)/2} based on the decomposition g(x) = g 1 (x)+g 2 (x), with g 1 (x) = − y−Hx 2 /2σ 2 and g 2 (x) = −α ∇x 1 , and where we note that g 1 has a Lipschitz continuous gradient and that the proximity mapping of g 2 can be efficiently approximated using Chambolle (2004).
Figure 6 presents an experiment with the "cameraman" image, which is a standard image to assess deconvolution methods (Oliveira et al. 2009).Figures 6(a) and (b) show the original cameraman image x 0 of size 128 × 128 and a blurred and noisy observation y, which we produced by convoluting x 0 with a uniform blur of size 9 × 9 and adding white Gaussian noise to achieve a blurred signal-to-noise ratio (BSNR) of 40dB (BRSN = 10 log 10 {var(Hx 0 )/σ 2 }).The MAP estimate of x obtained by maximising ( 14) is depicted in Figure 6(c).This estimate has been computed with the proximal optimisation algorithm of Afonso et al. (2011), and by using the technique of Oliveira et al. (2009) to determine the value of α.By comparing Figures 6(a) and 6(c) we observe that the MAP estimate is very accurate and restored the sharp edges and fine details in the image.Finally, Figure 6(d) shows the magnitude of the marginal 90% credibility regions for each pixels, as measured by the distance between the 95% and 5% quantile estimates.These estimates were computed from a 20 000-sample chain generated with P-MALA using a thinning factor of 1 000 to reduce the algorithm's memory foot-print and 1 million burn-in iterations.These credibility regions show that there is a background level of uncertainty of about 30 grey-levels, which is approximately 10% of the dynamic range of the image (256 grey-levels).More importantly, we observe that there is significantly more uncertainty concentrated at the discontinuities and object boundaries in the image.This reveals that model ( 14) is able to accurately detect the presence of sharp edges in the image but with some uncertainty about their exact location.
Therefore computing credibility regions could be particularly relevant in applications that use images to determine the location and the size of objects, or to compare the size of a same object appearing in two different images.For example, in oncological medical imaging, where deconvolution is increasingly used to improve the resolution of images that are subsequently used to assess the evolution of tumour boundaries over time and make treatment decisions.
Moreover, to asses the efficiency of P-MALA we repeated the experiment with a variation of MALA for partially non-differentiable target densities that uses only the gradient of the differentiable term of ( 14), i.e., ∇ log g 1 (x) = H T (y − Hx)/σ 2 (this variation of MALA was recently used in Schreck et al. (2013) for a Bayesian variable selection problem with a Bernulli-Laplace prior that is also non-differentiable). Figure 7 compares the first 20 lags of the sample autocorrelation function of the chains generated with P-MALA and MALA, computed using log π(x|y) as scalar summary.We observe that the chain produced with P-MALA has significantly lower autocorrelation and therefore higher effective sample size (EES).However, P-MALA was also significantly more computationally expensive than MALA due to the overhead associated with evaluating the proximity mapping of g 2 (the total computational times were 49 hours for P-MALA and 28 hours for MALA).Despite this additional computational cost, the time-normalised EES of P-MALA was 4.5 times higher than that of MALA (50.8 and 11.04 samples per hour respectively), showing the good performance of the proposed methodology.

Conclusion
This paper studied two new Langevin MCMC algorithms that use convex analysis, namely Moreau approximations and proximity mappings, to sample efficiently from high-dimensional densities π that are log-concave.These methods are based on a new first-order approximation for Langevin diffusions that is constructed by first approximating the original diffusion y with an auxiliary Langevin diffusion y λ with ergodic measure π λ , and then discretising y λ using a forward Euler scheme with time-step δ = 2λ.The resulting Markov chain, P-ULA, is similar to ULA except for the fact that it uses proximity mappings of log π instead of gradient mappings.This modification leads to a chain with favourable convergence properties that is geometrically ergodic in many cases for which ULA is transient or explosive.The second method presented in this paper, P-MALA, combines P-ULA with a Metropolis-Hastings step guaranteeing convergence to the desired target density.It is shown that P-MALA inherits the favourable convergence properties of P-ULA and is geometrically ergodic in many cases for which MALA does not converge geometrically.Moreover, the proposed methods were validated and compared to other MCMC algorithms through a series of illustrative examples and applications to real data, including two challenging high-dimensional experiments related to audio compressive sensing and to image deconvolution.These experiments show that P-ULA and P-MALA can make Bayesian inference techniques practically feasible for high-dimensional models that are not well addressed by the existing MCMC methodology.
Moreover, although only directly applicable to log-concave distributions, the proposed methods can be combined with Gibbs sampling to simulate from more complex models.For example, our methods can be easily applied to a large class of bilinear models of the form ( 14) in which there is uncertainty about the linear operator H (e.g., semi-blind image restoration), as this models can be conveniently split into two high-dimensional conditional densities that are log-concave.
Similarly, their application to hierarchical models involving unknown regularisation or noise power hyper-parameters is also straightforward.Future works will focus on the application of P-ULA and P-MALA to the development of new statistical signal and image processing methodologies.
In particular, we plan to develop a general set of tools for computing Bayesian estimators and credibility regions for high-dimensional convex linear and bilinear inverse problems, as well as stochastic optimisation algorithms for empirical Bayes estimation in signal and image processing.
Finallly, the proposed methods could certainly be improved by introducing some form of adaptation or preconditioning that captures the local geometry of the target density.This could be achieved by learning the covariance structure online (Atchade 2006) or by using an appropriate metric (Girolami & Calderhead 2011).Such methods will require evaluating proximity mappings on non-canonical Euclidean spaces, a topic that currently receives a lot of attention in the optimisation literature.Alternatively, one could also consider extending our methods to other diffusions that are more robust to anisotropic target densities (Roberts & Stramer 2002, Stramer & Tweedie 1999a,b).Lastly, we acknowledge that since the first preprint of this work (Pereyra 2013), two other works have independently proposed using proximity mappings in MCMC algorithms.Particularly, Schreck et al. (2013) presents a MALA for a model selection problem that uses thresholding operators, which are specific proximity mappings, to build proposal distributions with atoms at zero.Similarly, Chaari et al. (2014) describes a Metropolis-within-Gibbs sampler for model seleccion that uses soft-thresholding operators to evaluate subgradients in a Hamiltonian Monte Carlo leap-frog step.
Figure 5(b) compares the main 500 components of the true coefficient vector x 0 used to generate the original audio track (top), the empirical Bayes MAP estimate corresponding to α = 18.6 (middle), and the MAP estimate of Balzano et al. (2010) obtained with the heuristical value α = 194 (bottom).Both MAP estimates

Figure 5 :
Figure 5: Audio compressive sensing experiment for the "Mary had a little lamb" data of Balzano et al. (2010).(a) Sequence of α k estimates obtained with the P-MALA EM algorithm converging to the maximum marginal likelihood estimate (MLE) α = 18.6.(b) A segment of the true (top) and the estimated coefficient vectors using α = 18.6 (middle) and the heuristic value α = 0.1σ 2 A T y ∞ = 194.