Efficient stochastic optimisation by unadjusted Langevin Monte Carlo. Application to maximum marginal likelihood and empirical Bayesian estimation

Stochastic approximation methods play a central role in maximum likelihood estimation problems involving intractable likelihood functions, such as marginal likelihoods arising in problems with missing or incomplete data, and in parametric empirical Bayesian estimation. Combined with Markov chain Monte Carlo algorithms, these stochastic optimisation methods have been successfully applied to a wide range of problems in science and industry. However, this strategy scales poorly to large problems because of methodological and theoretical difficulties related to using high-dimensional Markov chain Monte Carlo algorithms within a stochastic approximation scheme. This paper proposes to address these difficulties by using unadjusted Langevin algorithms to construct the stochastic approximation. This leads to a highly efficient stochastic optimisation methodology with favourable convergence properties that can be quantified explicitly and easily checked. The proposed methodology is demonstrated with three experiments, including a challenging application to high-dimensional statistical audio analysis and a sparse Bayesian logistic regression with random effects problem.


Introduction
Maximum likelihood estimation (MLE) is central to modern statistical science. It is a cornerstone of frequentist inference [7], and also plays a fundamental role in parametric empirical Bayesian inference [11,13]. For simple statistical models, MLE can be performed analytically and exactly.
However, for most models, it requires using numerical computation methods, particularly optimisation schemes that iteratively seek to maximise the likelihood function and deliver an approximate solution. Following decades of active research in computational statistics and optimisation, there are now several computationally efficient methods to perform MLE in a wide range of classes of models [32,8].
In this paper we consider MLE in models involving incomplete or "missing" data, such as hidden, latent or unobserved variables, and focus on Expectation Maximisation (EM) optimisation methods [18], which are the predominant strategy in this setting. While the original EM methodology involved deterministic steps, modern EM methods are mainly stochastic [49]. In particular, they typically rely on a Robbins-Monro stochastic approximation (SA) scheme that uses a Monte Carlo stochastic simulation algorithm to approximate the gradients that drive the optimisation procedure [48,17,37,30]. In many cases, SA methods use Markov chain Monte Carlo (MCMC) algorithms, leading to a powerful general methodology which is simple to implement, has a detailed convergence theory [2], and can address a wide range of moderately low-dimensional models. Alternatively, some stochastic EM schemes use a Gibbs sampling algorithm [12]; however, this requires running several fully converged MCMC chains and can be significantly more computationally expensive as a result.
The expectations and demands on SA methods constantly rise as we seek to address larger problems and provide stronger theoretical guarantees on the solutions delivered. Unfortunately, existing SA methodology and theory do not scale well to large problems. The reasons are twofold. First, the family of MCMC kernels driving the SA scheme needs to satisfy uniform geometric ergodicity conditions that are usually difficult to verify for high-dimensional MCMC kernels. Second, the existing theory requires using asymptotically exact MCMC methods. In practice, these are usually high-dimensional Metropolis-Hastings methods such as the Metropolis-adjusted Langevin algorithm [51] or Hamiltonian Monte Carlo [33,22], which are difficult to calibrate within the SA scheme to achieve a prescribed acceptance rate. For these reasons, practitioners rarely use SA schemes in high-dimensional settings.
In this paper, we propose to address these limitations by using inexact MCMC methods to drive the SA scheme, particularly unadjusted Langevin algorithms, which have easily verifiable geometric ergodicity conditions and are easy to calibrate [21,15]. This allows us to design a high-dimensional stochastic optimisation scheme with favourable convergence properties that can be quantified explicitly and easily checked.
Our contributions are structured as follows: Section 2 formalises the class of MLE problems considered and presents the proposed stochastic optimisation method, which is based on a SA approach driven by an unadjusted Langevin algorithm. Section 3 presents three numerical experiments that demonstrate the proposed methodology in a variety of scenarios. Detailed theoretical convergence results for the method are reported in Section 4, which also describes a generalisation of the proposed methodology and theory to other inexact Markov kernels. The online supplementary material includes additional theoretical results and some details on computational aspects.

The stochastic optimisation via unadjusted Langevin method
The proposed Stochastic Optimisation via Unadjusted Langevin (SOUL) method is useful for solving maximum likelihood estimation problems involving intractable likelihood functions. The method is a SA iterative scheme that is driven by an unadjusted Langevin MCMC algorithm. Langevin algorithms are very efficient in high dimensions and lead to an SA scheme that inherits their favourable convergence properties.

Maximum marginal likelihood estimation
Let Θ be a convex closed set in R^{d_Θ}. The proposed optimisation method is well-suited for solving maximum likelihood estimation problems of the form

θ* ∈ arg max_{θ∈Θ} { log p(y|θ) − g(θ) },   p(y|θ) = ∫_{R^d} p(y, x|θ) dx,   (1)

where the parameter of interest θ is related to the observed data y ∈ Y by a likelihood function p(y, x|θ) involving an unknown quantity x ∈ R^d, which is removed from the model by marginalisation. More precisely, we consider problems where the resulting marginal likelihood p(y|θ) is computationally intractable, and focus on models where the dimension of x is large, making the computation of (1) even more difficult. For completeness, we allow the use of a penalty function g : Θ → R; setting g = 0 recovers the standard maximum likelihood estimator. As mentioned previously, the maximum marginal likelihood estimation problem (1) arises in problems involving latent or hidden variables [18]. It is also central to parametric empirical Bayes approaches that base their inferences on the pseudo-posterior distribution p(x|y, θ*) = p(y, x|θ*)/p(y|θ*) [11]. Moreover, the same optimisation problem arises in hierarchical Bayesian maximum-a-posteriori estimation of θ given y, with marginal posterior p(θ|y) ∝ p(y|θ)p(θ), where p(θ) denotes the prior for θ; in that case g(θ) = − log p(θ) [7].
Finally, in this paper we assume that log p(y, x|θ) is continuously differentiable with respect to x and θ, and that g is also continuously differentiable with respect to θ. A generalisation of the proposed methodology to non-smooth models is presented in a forthcoming paper [53] that focuses on non-smooth statistical imaging models.

Stochastic approximation methods
The scheme we propose to solve the optimisation problem (1) is derived in the SA framework [17], which we recall below.
Starting from any θ_0 ∈ Θ, SA schemes seek to solve (1) iteratively by computing a sequence (θ_n)_{n∈N} associated with the recursion

θ_{n+1} = Π_Θ[ θ_n + δ_{n+1} ( Δ_{θ_n} − ∇g(θ_n) ) ],   (2)

where Δ_{θ_n} is some estimator of the intractable gradient θ ↦ ∇_θ log p(y|θ) at θ_n, Π_Θ denotes the projection onto Θ, and (δ_n)_{n∈N*} ∈ (R*_+)^{N*} is a sequence of stepsizes. From an optimisation viewpoint, iteration (2) is a stochastic generalisation of the projected gradient ascent iteration [8] for models with intractable gradients. For n ∈ N, Monte Carlo estimators Δ_{θ_n} of ∇_θ log p(y|θ) at θ_n are derived from Fisher's identity ∇_θ log p(y|θ) = ∫_{R^d} ∇_θ log p(x, y|θ) p(x|y, θ) dx, leading to

Δ_{θ_n} = (1/m_n) Σ_{k=1}^{m_n} ∇_θ log p(X_k^n, y|θ_n),   (3)

where (m_n)_{n∈N} ∈ (N*)^N is a sequence of batch sizes and (X_k^n)_{k∈{1,...,m_n}} is either an exact Monte Carlo sample from p(x|y, θ_n) = p(x, y|θ_n)/p(y|θ_n), or a sample generated by a Markov chain targeting this distribution.
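As a concrete illustration of the recursion (2)-(3), the following Python sketch (our own toy example, not code from the paper; the model, stepsizes and variable names are assumptions for illustration) runs the SA scheme with g = 0 and m_n = 1 on a conjugate Gaussian model, x ~ N(θ, 1) and y|x ~ N(x, 1), for which the posterior can be sampled exactly and the marginal MLE is known in closed form (θ* = y):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model (illustrative assumption, not from the paper):
#   x ~ N(theta, 1), y | x ~ N(x, 1)  =>  y ~ N(theta, 2), so the MLE is theta* = y.
# Fisher's identity: grad_theta log p(y|theta) = E[ grad_theta log p(x, y|theta) | y ]
#                                              = E[ x - theta | y ].
y = 1.7
theta = 0.0
for n in range(1, 5001):
    # exact sample from the posterior p(x|y, theta) = N((theta + y)/2, 1/2)
    x = rng.normal((theta + y) / 2, np.sqrt(0.5))
    theta = theta + (1.0 / n**0.8) * (x - theta)   # SA update with delta_n = n^{-0.8}
    theta = min(max(theta, -100.0), 100.0)         # projection onto Theta = [-100, 100]
```

Because the gradient estimate x − θ is unbiased here, the iterates converge to the marginal MLE θ* = y; the next sections replace the exact posterior sampler with an unadjusted Langevin kernel.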
Given a sequence (θ_n)_{n=1}^N generated by using (2), an approximate solution of (1) can then be obtained by calculating, for example, the average of the iterates \bar{θ}_N = N^{−1} Σ_{n=1}^N θ_n. This estimate converges almost surely to a solution of (1) as N → ∞ provided that some conditions on p(y|θ), g, p(x|y, θ), (δ_n)_{n∈N} and Δ_{θ_n} are fulfilled. Indeed, following three decades of active research in computational statistics and applied probability, we now have a good understanding of how to construct efficient SA schemes, and of the conditions under which these schemes converge (see for example [6,29,20,1,44,2]). SA schemes have been successfully applied to maximum marginal likelihood estimation problems where the latent variable x has a low or moderately low dimension. However, they are seldom used when x is high-dimensional because this usually requires high-dimensional MCMC samplers that, unless carefully calibrated, exhibit poor convergence properties. Unfortunately, calibrating the samplers within a SA scheme is challenging because the target density p(x|y, θ_n) changes at each iteration. As a result, it is, for example, difficult to use Metropolis-Hastings algorithms that need to achieve a prescribed acceptance probability range. Additionally, the conditions for convergence of MCMC SA schemes are often difficult to verify for high-dimensional samplers. For these reasons, practitioners rarely use SA schemes in high-dimensional settings.
As mentioned previously, we propose to address these difficulties by using modern inexact Langevin MCMC samplers to drive (3). These samplers have received a lot of attention lately because they can exhibit excellent large-scale convergence properties and significantly outperform their Metropolised counterparts (see [23] for an extensive comparison in the context of Bayesian imaging models). Stimulated by developments in high-dimensional statistics and machine learning, we now have detailed theory for these algorithms, including explicit and easily verifiable geometric ergodicity conditions [21,15,26,16]. This allows us to design a stochastic optimisation scheme with favourable convergence properties that can be quantified explicitly and easily checked.

Langevin Markov chain Monte Carlo methods
Langevin MCMC schemes to sample from p(x|y, θ) are based on stochastic continuous dynamics (X_t^θ)_{t≥0} for which the target distribution p(x|y, θ) is invariant. Two fundamental examples are the Langevin dynamics, solution of the Stochastic Differential Equation (SDE)

dX_t^θ = ∇_x log p(X_t^θ|y, θ) dt + √2 dB_t,   (5)

and the kinetic Langevin dynamics, solution of

dX_t^θ = V_t^θ dt,   dV_t^θ = −V_t^θ dt + ∇_x log p(X_t^θ|y, θ) dt + √2 dB_t,

where (B_t)_{t≥0} is a standard d-dimensional Brownian motion. Under mild assumptions on p(x|y, θ), these two SDEs admit strong solutions for which p(x|y, θ) and p(x, v|y, θ) = p(x|y, θ) exp(−‖v‖²/2)/(2π)^{d/2} are the respective invariant probability measures. In addition, there are detailed explicit convergence results for (X_t^θ)_{t≥0} to p(x|y, θ), and for (X_t^θ, V_t^θ)_{t≥0} to p(x, v|y, θ), under different metrics [25,24]. However, sampling path solutions of these continuous-time dynamics is not feasible in general, so discretizations have to be used instead. In this paper, we mainly focus on the Euler-Maruyama discrete-time approximation of (5), known as the Unadjusted Langevin Algorithm (ULA) [51], given by

X_{k+1} = X_k + γ ∇_x log p(X_k|y, θ) + √(2γ) Z_{k+1},   (6)

where γ > 0 is the discretization time step and (Z_k)_{k∈N*} is an i.i.d. sequence of d-dimensional zero-mean Gaussian random variables with identity covariance matrix. We will use this Markov kernel to drive our SA schemes. Observe that (6) does not exactly target p(x|y, θ) because of the bias introduced by the discrete-time approximation. Computational statistical methods have traditionally addressed this issue by complementing (6) with a Metropolis-Hastings correction step to asymptotically remove the bias [51]. This correction usually deteriorates the convergence properties of the chain and may lead to poor non-asymptotic estimation results, particularly in very high-dimensional settings (see for example [23]). However, until recently, using (6) without a correction step was considered too risky. Fortunately, recent works have established detailed theoretical guarantees for (6) that do not require any correction [15,21].
A main contribution of this work is to extend these guarantees to SA schemes that are driven by these highly efficient but inexact samplers.
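As a minimal illustration of the ULA recursion (6) (a sketch under our own assumptions, not code from the paper), the following Python snippet samples from a Gaussian target, for which the discretization bias can be computed analytically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: N(mu, s2 * I_d), so grad_x log p(x) = -(x - mu) / s2.
mu, s2, gamma, d = 2.0, 1.0, 0.05, 3
x = np.zeros(d)
samples = []
for k in range(20000):
    # one ULA step: gradient drift plus sqrt(2*gamma) Gaussian noise
    x = x + gamma * (-(x - mu) / s2) + np.sqrt(2 * gamma) * rng.standard_normal(d)
    if k >= 2000:                 # discard burn-in
        samples.append(x.copy())
samples = np.asarray(samples)

# For this Gaussian target the ULA chain is itself Gaussian, with exact mean mu
# and stationary variance 2*s2 / (2 - gamma/s2) ~ 1.026: a small inflation due to
# the discretization bias that a Metropolis-Hastings correction would remove.
```

The bias is O(γ) here and vanishes as the stepsize decreases, which is the behaviour the theoretical guarantees for (6) quantify in general.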

The SOUL algorithm
We are now ready to present the proposed Stochastic Optimization via Unadjusted Langevin (SOUL) methodology. Let (δ_n)_{n∈N*} ∈ (R*_+)^{N*} and (m_n)_{n∈N} ∈ (N*)^N be the sequences of stepsizes and batch sizes defining the SA scheme (2)-(3). For any θ ∈ Θ and γ > 0, denote by R_{γ,θ} the Langevin Markov kernel (6) used to approximately sample from p(x|y, θ), and let (γ_n)_{n∈N} ∈ (R*_+)^N be the sequence of discretization time steps used.
Formally, starting from some X_0^0 ∈ R^d and θ_0 ∈ Θ, for n ∈ N and k ∈ {0, ..., m_n − 1}, we recursively define ({X_k^n : k ∈ {0, ..., m_n}}, θ_n)_{n∈N} on a probability space (Ω, F, P), where (X_k^n)_{k∈{0,...,m_n}} is a Markov chain with Markov kernel R_{γ_n,θ_n} and X_0^n = X_{m_{n−1}−1}^{n−1} given F_{n−1}, and

θ_{n+1} = Π_Θ[ θ_n + (δ_{n+1}/m_n) Σ_{k=1}^{m_n} ( ∇_θ log p(X_k^n, y|θ_n) − ∇g(θ_n) ) ],   (7)

where we recall that Π_Θ is the projection onto Θ, and for all n ∈ N, F_n is the σ-field generated by θ_0 and {(X_k^ℓ)_{k∈{0,...,m_ℓ}} : ℓ ∈ {0, ..., n}}. Note that such a construction is always possible by the Kolmogorov extension theorem [34, Theorem 5.16]; hence, for any n ∈ N, θ_{n+1} is F_n-measurable. Then, as mentioned previously, we compute a sequence of approximate solutions of (1) by calculating, for example, the weighted average of the iterates

\bar{θ}_N = (Σ_{n=1}^N δ_n)^{−1} Σ_{n=1}^N δ_n θ_n.   (8)

The pseudocode associated with the proposed SOUL method is presented in Algorithm 1 below. Observe that, for additional efficiency, instead of generating independent Markov chains at each SA iteration, we warm-start the chains by setting X_0^n = X_{m_{n−1}−1}^{n−1} for any n ∈ {1, ..., N}.

Algorithm 1
The Stochastic Optimization via Unadjusted Langevin (SOUL) method

1: Input: initial {θ_0, X_0^0}, stepsizes (δ_n)_{n∈{1,...,N}} and (γ_n)_{n∈{0,...,N−1}}, batch sizes (m_n)_{n∈{0,...,N−1}}, number of iterations N.
2: for n = 0 to N − 1 do
3:   if n ≥ 1 then
4:     Set X_0^n = X_{m_{n−1}−1}^{n−1}.   (warm-start)
5:   end if
6:   for k = 0 to m_n − 1 do
7:     Sample X_{k+1}^n = X_k^n + γ_n ∇_x log p(X_k^n|y, θ_n) + √(2γ_n) Z_{k+1}^n.
8:   end for
9:   Set θ_{n+1} = Π_Θ[θ_n + (δ_{n+1}/m_n) Σ_{k=1}^{m_n} (∇_θ log p(X_k^n, y|θ_n) − ∇g(θ_n))].
10: end for
11: Output: the weighted average \bar{θ}_N.

To conclude, Section 3 below demonstrates the proposed methodology with three numerical experiments related to high-dimensional logistic regression and statistical audio analysis with sparsity-promoting priors. A detailed theoretical analysis of the proposed SOUL method is reported in Section 4. More precisely, we establish that if the cost function f(θ) = g(θ) − log p(y|θ) defining (1) is convex, and if (γ_n)_{n∈N} and (δ_n)_{n∈N} go to 0 sufficiently fast, then E[f(\bar{θ}_N)] converges to min_Θ f, and we quantify the rate of convergence. Moreover, in the case where (γ_n)_{n∈N} is held fixed, i.e. γ_n = γ for all n ∈ N, we show convergence to a neighbourhood of the solution, in the sense that there exist explicit C, α > 0 such that lim sup_{N→+∞} E[f(\bar{θ}_N)] − min_Θ f ≤ Cγ^α. Finally, we also study the important case where f is not convex. In that case, we use the results of [37] to establish that (θ_n)_{n∈N} converges almost surely to a stationary point of the projected ordinary differential equation associated with ∇f and Θ. Because this result is highly technical, we postpone it to Appendix B in the supplementary document.
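The full SOUL loop can be sketched in Python as follows (a toy illustration under our own assumptions — the conjugate Gaussian model x ~ N(θ, 1), y|x ~ N(x, 1), with g = 0 and m_n = 1 — not the authors' implementation). The ULA chain is warm-started across SA iterations, i.e. carried over rather than re-initialised:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model (illustrative assumption): x ~ N(theta, 1), y | x ~ N(x, 1),
# so the marginal MLE is theta* = y. The posterior is sampled inexactly with ULA.
y = -0.5
theta, X = 0.0, 0.0
gamma = 0.05                        # fixed ULA stepsize gamma_n
N, m = 4000, 1                      # SA iterations and batch size m_n = 1
thetas = []
for n in range(1, N + 1):
    for _ in range(m):
        # ULA step targeting p(x|y, theta): grad_x log p = -(x - theta) - (x - y)
        X = X + gamma * (-(X - theta) - (X - y)) \
              + np.sqrt(2 * gamma) * rng.standard_normal()
    theta = theta + (1.0 / n**0.8) * (X - theta)   # SA step via Fisher's identity
    theta = float(np.clip(theta, -100, 100))       # projection onto Theta
    thetas.append(theta)

theta_bar = float(np.mean(thetas[N // 2:]))        # average of the late iterates
```

Even though each ULA sample is biased, the averaged iterates settle near the marginal MLE, which is the behaviour that the convergence results of Section 4 make precise.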

Numerical results
We now demonstrate the proposed methodology with three experiments chosen to illustrate a variety of scenarios. Section 3.1 presents an application to empirical Bayesian logistic regression, where (1) can be analytically shown to be a convex optimisation problem with a unique solution θ*, and where we benchmark our MLE estimate against the solution obtained by calculating the marginal likelihood p(y|θ) over a θ-grid using a harmonic mean estimator. Furthermore, Section 3.2 presents a challenging application related to statistical audio compressive sensing analysis, where we use SOUL to estimate a regularisation parameter that controls the degree of sparsity enforced, and where a main difficulty is the high dimensionality of the latent space (d = 2,900). Finally, Section 3.3 presents an application to a high-dimensional empirical Bayesian logistic regression with random effects for which the optimisation problem (1) is not convex. All experiments were carried out on an Intel i9-8950HK@2.90GHz workstation running Matlab R2018a.

Bayesian Logistic Regression
In this first experiment we illustrate the proposed methodology with an empirical Bayesian logistic regression problem [55,45]. We observe a set of covariates {v_i}_{i=1}^{d_y} ⊂ R^d and binary responses {y_i}_{i=1}^{d_y} ∈ {0, 1}, which we assume to be conditionally independent realisations of a logistic regression model: for any i ∈ {1, ..., d_y}, y_i given β and v_i has distribution Ber(s(v_i^T β)), where β ∈ R^d is the regression coefficient, Ber(α) denotes the Bernoulli distribution with parameter α ∈ [0, 1], and s(u) = e^u/(1 + e^u) is the cumulative distribution function of the standard logistic distribution. The prior for β is set to N(θ1_d, σ² I_d), the d-dimensional Gaussian distribution with mean θ1_d and covariance matrix σ² I_d, where θ is the parameter we seek to estimate, 1_d = (1, ..., 1) ∈ R^d, σ² = 5 and I_d is the d-dimensional identity matrix. Following an empirical Bayesian approach, the parameter θ is computed by maximum marginal likelihood estimation using Algorithm 1, with marginal likelihood given by

p(y|θ) = ∫_{R^d} Π_{i=1}^{d_y} s(v_i^T β)^{y_i} (1 − s(v_i^T β))^{1−y_i} N(β; θ1_d, σ² I_d) dβ.   (9)

Lemma 7 in Appendix A of the supplementary document shows that (9) is log-concave with respect to θ. We use the proposed SOUL methodology to estimate θ for the Wisconsin Diagnostic Breast Cancer dataset, for which d_y = 683 and d = 10, and where we suitably normalise the covariates. In order to assess the quality of our estimation results, we also calculate p(y|θ) over a grid of values for θ by using a truncated harmonic mean estimator.
To implement Algorithm 1, we derive the joint log-density log p(β, y|θ) and obtain the following expressions for the gradients used in the MCMC steps (6) and the SA steps (2) respectively:

∇_β log p(β, y|θ) = Σ_{i=1}^{d_y} (y_i − s(v_i^T β)) v_i − (β − θ1_d)/σ²,
∇_θ log p(β, y|θ) = 1_d^T (β − θ1_d)/σ².

For the MCMC steps, we use a fixed stepsize γ_n = 8.34 × 10^{−5} and batch size m_n = 1, for any n ∈ N. For the SA steps, we use the sequence of stepsizes δ_n = 60/n^{0.8}, Θ = [−100, 100] and θ_0 = 0. Finally, we first run 100 burn-in iterations with fixed θ_n = θ_0 to warm up the Markov chain, followed by 50 iterations of Algorithm 1 to warm up the iterates. This procedure is then followed by N = 10^6 iterations of Algorithm 1 to compute \bar{θ}_N. Figure 1(a) shows the first iterates of this sequence. Observe that the sequence initially oscillates, and then stabilises close to θ* after approximately 50 iterations. Figure 1(b) presents the iterates θ_n for n = 10^5, ..., 10^6. For completeness, Figure 2 shows the histograms corresponding to the marginal posteriors p(β_j|y, v, \bar{θ}_N), for j = 1, ..., 10, obtained as a by-product of Algorithm 1. In order to verify that the obtained estimate \bar{θ}_N is close to the true MLE θ*, we use a truncated harmonic mean estimator (THME) [50] to calculate the marginal likelihood p(y|θ) for a range of values of θ. Although obtaining the THME is usually computationally expensive, it is viable in this particular experiment because β is low-dimensional. More precisely, given n samples (β_i)_{i∈{1,...,n}} from p(β|y, θ), we obtain an approximation of p(y|θ) by computing

p̂(y|θ) = [ (n Vol(A))^{−1} Σ_{i=1}^n 1_A(β_i)/p(β_i, y|θ) ]^{−1},

where A is a d-dimensional ball centred at the posterior mean β̄ = n^{−1} Σ_{k=1}^n β_k, with radius set such that n^{−1} Σ_{i=1}^n 1_A(β_i) ≈ 0.4. Using n = 6 × 10^5 samples, we obtain the approximation shown in Figure 3(a), where in addition to the estimated points we also display a quadratic fit (corresponding to a Gaussian fit in linear scale), which we use to obtain an estimate of θ* (the obtained log-likelihood values are small because the dataset is large, d_y = 683).
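The gradient expressions above are easy to validate numerically. The following Python sketch (synthetic data, dimensions and variable names are our own assumptions, not the paper's setup) checks them against central finite differences of the joint log-density:

```python
import numpy as np

rng = np.random.default_rng(3)
d, dy, sigma2 = 4, 20, 5.0
V = rng.standard_normal((dy, d))            # covariates v_i as rows (synthetic)
yobs = rng.integers(0, 2, dy).astype(float) # synthetic binary responses
beta = rng.standard_normal(d)
theta = 0.7

def s(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_joint(beta, theta):
    u = V @ beta
    loglik = np.sum(yobs * u - np.log1p(np.exp(u)))        # logistic likelihood
    logprior = -np.sum((beta - theta) ** 2) / (2 * sigma2)  # N(theta*1, sigma2*I)
    return loglik + logprior

# closed-form gradients, as in the expressions above
grad_beta = V.T @ (yobs - s(V @ beta)) - (beta - theta) / sigma2
grad_theta = np.sum(beta - theta) / sigma2

# finite-difference checks
eps = 1e-6
num_theta = (log_joint(beta, theta + eps) - log_joint(beta, theta - eps)) / (2 * eps)
e0 = np.zeros(d); e0[0] = 1.0
num_beta0 = (log_joint(beta + eps * e0, theta) - log_joint(beta - eps * e0, theta)) / (2 * eps)
```

Such a check is cheap insurance before running 10^6 SOUL iterations, since a sign error in either gradient silently biases the whole SA scheme.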
To empirically study the estimation error involved, we replicate the experiment 10^3 times. Figure 3 shows the obtained histogram of {\bar{θ}_{N,i}}_{i=1}^{1000}, where we observe that all these estimates are very close to the true maximiser θ*. Note also that the distribution of the estimation error is close to a Gaussian distribution, as expected for a maximum likelihood estimator. There is also a small estimation bias of the order of 3%, which can be attributed to the discretization error of SDE (5), and potentially to a small error in the estimation of θ*.
We conclude this experiment by using SOUL to perform a predictive empirical Bayesian analysis of the binary responses. We split the original dataset into an 80% training set (y_train, v_train) of size d_train = 546 and a 20% test set (y_test, v_test) of size d_test = 137, and use SOUL to draw samples from the predictive distribution p(y_test|y_train, v_train, v_test, \bar{θ}_N). More precisely, we use SOUL to simultaneously calculate \bar{θ}_N and simulate from p(β|y_train, v_train, \bar{θ}_N), followed by simulation from p(y_test|β, y_train, v_train, v_test). We then estimate the maximum-a-posteriori predictive response ŷ_test, and measure prediction accuracy against the test dataset by computing the proportion of incorrectly predicted responses, obtaining an error of 2.2%. For comparison, Figure 4 below reports this error as a function of θ (the discontinuities arise because of the highly non-linear nature of the model). Observe that the estimate \bar{θ}_N produces a model with very good performance in this regard.

Statistical audio compression
Compressive sensing techniques exploit sparsity properties in the data to estimate signals from fewer samples than required by the Nyquist-Shannon sampling theorem [10,9]. Many real-world data admit a sparse representation on some basis or dictionary. Formally, consider an ℓ-dimensional time-discrete signal z ∈ R^ℓ that is sparse in some dictionary Ψ ∈ R^{ℓ×d}, i.e., there exists a latent vector x ∈ R^d with only a few non-zero components such that z = Ψx. This prior assumption can be modelled by using a smoothed-Laplace distribution [38]

p(x|θ) ∝ exp( −θ Σ_{i=1}^d h_λ(x_i) ),   (10)

where h_λ is the Huber function given for any u ∈ R by

h_λ(u) = u²/(2λ) if |u| ≤ λ, and h_λ(u) = |u| − λ/2 otherwise.   (11)

Acquiring z directly would call for measuring ℓ univariate components. Instead, a carefully designed measurement matrix M ∈ R^{p×ℓ}, with p ≪ ℓ, is used to directly observe a "compressed" signal Mz, which only requires taking p measurements. In addition, measurements are typically noisy, which results in an observation y ∈ R^p modelled as y = Mz + w, where we assume that the noise w has distribution N(0, σ² I_p); therefore the likelihood function is given by

p(y|x) ∝ exp( −‖y − MΨx‖²/(2σ²) ),

leading to the posterior distribution

p(x|y, θ) ∝ exp( −‖y − MΨx‖²/(2σ²) − θ Σ_{i=1}^d h_λ(x_i) ).

To recover z from y, we then compute the maximum-a-posteriori estimate

x̂_MAP = arg min_{x∈R^d} { ‖y − MΨx‖²/(2σ²) + θ Σ_{i=1}^d h_λ(x_i) },   (12)

and set ẑ_MAP = Ψx̂_MAP. Following decades of active research, there are now many convex optimisation algorithms that can be used to efficiently solve (12), even when d is very large [14,43]. However, the selection of the value of θ in (12) remains a difficult open problem. This parameter controls the degree of sparsity of x and has a strong impact on estimation performance.
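The Huber function (11) and its gradient, which drive both the prior (10) and the ULA steps for this model, can be implemented and sanity-checked as follows (a sketch; the function names are ours):

```python
import numpy as np

def huber(u, lam):
    """Huber function h_lam: quadratic near 0, linear in the tails."""
    return np.where(np.abs(u) <= lam, u ** 2 / (2 * lam), np.abs(u) - lam / 2)

def huber_grad(u, lam):
    """Gradient of h_lam: u/lam near 0, sign(u) in the tails."""
    return np.where(np.abs(u) <= lam, u / lam, np.sign(u))

# Negative log-posterior of the compressive sensing model (up to constants),
# as a function of x for given M, Psi, y, sigma and theta:
#   F(x) = ||y - M @ Psi @ x||**2 / (2 * sigma**2) + theta * huber(x, lam).sum()

lam = 0.1
u = np.array([0.05, 0.1, 0.2])
vals = huber(u, lam)       # continuous at |u| = lam: both branches give lam/2
grads = huber_grad(u, lam)
```

The quadratic region makes the prior differentiable everywhere, which is what permits the plain gradient-based ULA step; the linear tails preserve the sparsity-promoting behaviour of the Laplace prior.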
The Bayesian framework offers several strategies for estimating θ from the observation y. In this experiment we adopt an empirical Bayesian approach and use SOUL to compute the MLE θ , which is challenging given the high-dimensionality of the latent space.
To illustrate this approach, we consider the audio experiment proposed in [5] for the "Mary had a little lamb" song. The MIDI-generated audio file z has ℓ = 319,725 samples, but we only have access to a noisy observation vector y recording p = 456 random time points of the audio signal, corrupted by additive white Gaussian noise with σ = 0.015. The latent signal x has dimension d = 2,900 and is related to z by a dictionary matrix Ψ whose row vectors correspond to different piano notes lasting a quarter-second. The parameter λ for the prior (10) is set to λ = 4 × 10^{−5}. We used the heuristic θ_cs as the initial value for θ in our algorithm. To solve the optimisation problem (12) we use the Gradient Projection for Sparse Reconstruction (GPSR) algorithm proposed in [28]. We use this solver because it is the one used in the online MATLAB demonstration of [5]; more modern algorithms could be used as well. We implemented Algorithm 1 using a fixed stepsize γ_n = 6.9 × 10^{−6}, a fixed batch size m_n = 1, δ_n = 20 n^{−0.8}/d = 0.0069 n^{−0.8}, and 100 burn-in iterations.
The algorithm converged in approximately 500 iterations, which were computed in only 325 milliseconds. Figure 5 (left) shows the first 250 iterations of the sequence θ_n and of the weighted average \bar{θ}_n. Again, observe that the iterates oscillate for a few iterations and then quickly stabilise. Finally, Figure 5 (right) illustrates the quality of the resulting estimate \bar{θ}_N.

Sparse Bayesian logistic regression with random effects
Following on from the Bayesian logistic regression in Section 3.1, where p(y|θ) is log-concave and hence θ* is unique, we now consider a significantly more challenging sparse Bayesian logistic regression with random effects problem. In this experiment p(y|θ) is no longer log-concave, so SOUL can potentially get trapped in local maximisers. Furthermore, the dimension of θ in this experiment is very large (d_Θ = 1001), making the MLE problem even more challenging. This experiment was previously considered by [2], and we replicate their setup.
Let y = {y_i}_{i=1}^{d_y} ∈ {0, 1}^{d_y} be a vector of binary responses, which we model as d_y conditionally independent realisations of a random-effect logistic regression model: for any i ∈ {1, ..., d_y}, y_i given β and x has distribution Ber(s(v_i^T β + σ u_i^T x)), where {v_i}_{i=1}^{d_y} ⊂ R^p are covariate vectors, {u_i}_{i=1}^{d_y} ⊂ R^d are loading vectors, x ∈ R^d is a vector of random effects and σ > 0. In addition, recall that Ber(α) denotes the Bernoulli distribution with parameter α ∈ [0, 1] and s(u) = e^u/(1 + e^u) is the cumulative distribution function of the standard logistic distribution. The goal is to estimate the unknown parameters θ = (β, σ) ∈ R^p × (0, +∞), without knowing the value of x, which we assume to follow a standard Gaussian distribution, i.e. p(x) = exp(−‖x‖²/2)/(2π)^{d/2}. We estimate θ by MLE using Algorithm 1 to maximise (1), with marginal likelihood given by

p(y|θ) = ∫_{R^d} Π_{i=1}^{d_y} s(v_i^T β + σ u_i^T x)^{y_i} (1 − s(v_i^T β + σ u_i^T x))^{1−y_i} p(x) dx,

and we use the sparsity-promoting penalty function

g(θ) = λ Σ_{i=1}^{p} h_λ(β_i),   (13)

where h_λ is the Huber function defined in (11). We follow the procedure described in [2] to generate the observations {y_i}_{i=1}^{d_y}, with d_y = 500, p = 1000 and d = 5. The vector of regressors β_true is generated from the uniform distribution on [1, 5], and 98% of its coefficients are randomly set to zero. The variance σ_true of the random effect is set to 0.1, and the projection interval for the estimated σ is [10^{−5}, +∞). Finally, the parameter λ in (13) is set to λ = 30. We emphasise that θ is high-dimensional in this experiment (d_Θ = 1001), making the estimation problem particularly challenging.
The conditional log-likelihood function for this model is

log p(y|x, θ) = Σ_{i=1}^{d_y} { y_i (v_i^T β + σ u_i^T x) − log(1 + exp(v_i^T β + σ u_i^T x)) }.

To implement Algorithm 1 we use the gradients

∇_x log p(x, y|θ) = σ Σ_{i=1}^{d_y} (y_i − s(v_i^T β + σ u_i^T x)) u_i − x,
∇_β log p(x, y|θ) = Σ_{i=1}^{d_y} (y_i − s(v_i^T β + σ u_i^T x)) v_i,
∂_σ log p(x, y|θ) = Σ_{i=1}^{d_y} (y_i − s(v_i^T β + σ u_i^T x)) u_i^T x.

Finally, the gradient of the penalty function (13) is given componentwise by

∂_{β_i} g(θ) = λ sign(β_i) min(|β_i|/λ, 1),

where sign denotes the sign function, i.e. for any s ∈ R, sign(s) = |s|/s if s ≠ 0, and sign(s) = 0 otherwise. We use γ_n = 0.01, δ_n = n^{−0.95}/d = 0.2 × n^{−0.95}, a fixed batch size m_n = 1, and β_0 = 1_p and σ_0 = 1 as initial values. Moreover, we perform 10^4 burn-in iterations with a fixed value of θ_0 = (β_0, σ_0) to warm up the Markov chain, and a further 600 iterations of Algorithm 1 to warm-start the iterates. Following on from this, we run N = 5 × 10^4 iterations of Algorithm 1 to compute \bar{θ}_N. Computing this estimate required 25 seconds in total. Figure 6 shows the evolution of the iterates across iterations, where we used ‖β_n‖_0 as a summary statistic to track the number of active components. Because the Huber penalty (11) does not enforce exact sparsity on β, to estimate the number of active components we only consider values that are larger than a threshold τ (we used τ = 0.005). From Figure 6 we observe that σ̄_n converges to a value that is very close to σ_true, and that the number of active components is also accurately estimated. Moreover, Figure 7 shows that most active components were correctly identified. We also observe that β̄_n stabilises after approximately 6300 iterations, which corresponds to 6300 Monte Carlo samples since m_n = 1. This is in close agreement with the results presented in [2, Figure 5], where stabilisation is observed after a similar number of iterations of their highly specialised Polya-Gamma sampler.
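As in Section 3.1, the gradient expressions for this random-effects model can be validated against finite differences. The sketch below (synthetic data, dimensions and names are our own assumptions, with the model form stated above) checks the x- and σ-gradients of the joint log-density:

```python
import numpy as np

rng = np.random.default_rng(4)
dy, p, d = 15, 6, 3
V = rng.standard_normal((dy, p))            # covariate vectors v_i as rows
U = rng.standard_normal((dy, d))            # loading vectors u_i as rows
yobs = rng.integers(0, 2, dy).astype(float) # synthetic binary responses
beta = rng.standard_normal(p)
sigma, x = 0.3, rng.standard_normal(d)

def s(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_joint(beta, sigma, x):
    u = V @ beta + sigma * (U @ x)
    # logistic likelihood plus standard Gaussian prior on x (up to constants)
    return np.sum(yobs * u - np.log1p(np.exp(u))) - x @ x / 2

r = yobs - s(V @ beta + sigma * (U @ x))    # residuals y_i - s(.)
grad_x = sigma * (U.T @ r) - x
grad_beta = V.T @ r
grad_sigma = r @ (U @ x)

eps = 1e-6
num_sigma = (log_joint(beta, sigma + eps, x) - log_joint(beta, sigma - eps, x)) / (2 * eps)
e0 = np.zeros(d); e0[0] = 1.0
num_x0 = (log_joint(beta, sigma, x + eps * e0) - log_joint(beta, sigma, x - eps * e0)) / (2 * eps)
```

Here the σ-gradient is the one most prone to implementation mistakes, since σ enters multiplicatively inside the sigmoid; the finite-difference check catches such errors immediately.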
It is worth emphasising at this point that [2] considers the non-smooth penalty g(θ) = λ β 1 instead of (13). Consequently, instead of using the gradient of g, they resort to the so-called proximal operator of g [14]. The generalisation of the SOUL methodology proposed in this paper to models that have non-differentiable terms is addressed in Vidal and Pereyra [54], Vidal et al. [53].

Theoretical convergence analysis for SOUL, and generalisation to other inexact MCMC kernels (SOUK)
In this section we state our main theoretical results for SOUL. For completeness, we first present the results in a general stochastic optimisation setting and by considering a generic inexact MCMC sampler, and then show that our results apply to the specific MLE optimisation problem (1), and to the specific Langevin algorithm (6) used in SOUL.

Notations and convention
For a measurable function V : R^d → [1, +∞) and a finite signed measure ξ, the V-norm of ξ is defined as ‖ξ‖_V = sup_{|f|≤V} |∫ f dξ|. If V ≡ 1, then ‖·‖_V is the total variation norm, denoted ‖·‖_TV. Let µ be a finite signed measure; by the Hahn-Jordan theorem [19, Theorem D.1.3], there exists a pair of finite singular measures µ+, µ− such that µ = µ+ − µ−. The total variation measure is given by |µ| = µ+ + µ−. We recall that if f : U → R is twice differentiable at a point a ∈ R^d, its Laplacian is given by Δf(a) = Σ_{i=1}^d ∂²f(a)/∂a_i². Let (Ω, F, P) be a probability space. We write µ ≪ ν if µ is absolutely continuous with respect to ν, and denote by dµ/dν an associated density. Let µ, ν be two probability measures on (R^d, B(R^d)). The Kullback-Leibler divergence of µ from ν is defined by KL(µ|ν) = ∫ log(dµ/dν) dµ if µ ≪ ν, and KL(µ|ν) = +∞ otherwise. The complement of a set A ⊂ R^d is denoted by A^c. We take the convention that Π_{k=p}^n = 1 and Σ_{k=p}^n = 0 for n, p ∈ N with n < p. All densities are w.r.t. the Lebesgue measure unless stated otherwise.

Stochastic Optimization with inexact MCMC methods
We consider the problem of minimizing a function f : Θ → R with Θ ⊂ R d Θ under the following assumptions.
To minimize the objective function f, we suggest the use of a SA strategy which extends the one presented in Section 2. More precisely, motivated by the methodology described in Section 2, we propose a SA scheme which relies on biased estimates of ∇f(θ) obtained through a family of Markov kernels {K_{γ,θ} : γ ∈ (0, γ̄], θ ∈ Θ}, for γ̄ > 0, such that for any θ ∈ Θ and γ ∈ (0, γ̄], K_{γ,θ} admits an invariant probability distribution π_{γ,θ} on (R^d, B(R^d)). In the SOUL method, the Markov kernel K_{γ,θ} stands for R_{γ,θ} for any γ ∈ (0, γ̄] and θ ∈ Θ, where R_{γ,θ} is the Markov kernel associated with (6). We assume in addition that the bias associated with the use of this family of Markov kernels can be controlled with respect to γ uniformly in θ, i.e. for example that there exists C > 0 such that for all γ ∈ (0, γ̄] and θ ∈ Θ, ‖π_{γ,θ} − π_θ‖_TV ≤ Cγ^α with α > 0. Let now (δ_n)_{n∈N} ∈ (R*_+)^N and (m_n)_{n∈N} ∈ (N*)^N be sequences of stepsizes and batch sizes which will be used to define the sequence (θ_n)_{n∈N} similarly to (2) and (3), and let (γ_n)_{n∈N} ∈ (R*_+)^N be a sequence of stepsizes which will be used to get approximate samples from π_{θ_n}, similarly to (6). Starting from X_0^0 ∈ R^d and θ_0 ∈ Θ, we define on a probability space (Ω, F, P) the sequence ({X_k^n : k ∈ {0, ..., m_n}}, θ_n)_{n∈N} by the following recursion: for n ∈ N and k ∈ {0, ..., m_n − 1}, (X_k^n)_{k∈{0,...,m_n}} is a Markov chain with kernel K_{γ_n,θ_n} and X_0^n = X_{m_{n−1}−1}^{n−1} given F_{n−1}, and

θ_{n+1} = Π_Θ[ θ_n + (δ_{n+1}/m_n) Σ_{k=1}^{m_n} H_{θ_n}(X_k^n) ],   (14)

where H_θ is an integrand satisfying ∫_{R^d} H_θ(x) π_θ(dx) = −∇f(θ), Π_Θ is the projection onto Θ, and F_n is, for all n ∈ N, the σ-field generated by θ_0 and {(X_k^ℓ)_{k∈{0,...,m_ℓ}} : ℓ ∈ {0, ..., n}} given by (14). Note that such a construction is always possible by the Kolmogorov extension theorem [34, Theorem 5.16], and by (14), for any n ∈ N, θ_{n+1} is F_n-measurable. Then the sequence of approximate minimizers of f is given by \bar{θ}_N = (Σ_{n=1}^N δ_n)^{−1} Σ_{n=1}^N δ_n θ_n.
Under different sets of conditions on f, H, (δ_n)_{n∈N}, (γ_n)_{n∈N} and (m_n)_{n∈N}, we obtain that (θ_n)_{n∈N} converges almost surely to an element of arg min_Θ f. In this section we consider in particular the case where f is assumed to be convex. We establish that if (γ_n)_{n∈N} and (δ_n)_{n∈N} go to 0 sufficiently fast, then E[f(\bar{θ}_N)] − min_Θ f goes to 0 with a quantitative rate of convergence. In the case where (γ_n)_{n∈N} is held fixed, i.e. γ_n = γ for all n ∈ N, we show that E[f(\bar{θ}_N)] − min_Θ f converges to a neighbourhood of 0 whose size is an explicit function of γ. In the case where f is non-convex, we apply results from stochastic approximation [37] which establish that the sequence (θ_n)_{n∈N} converges almost surely to a stationary point of the projected ordinary differential equation associated with ∇f and Θ. We postpone this result to Appendix B, since it involves theoretical background that is beyond the scope of the main document.

Main results
We impose a stability condition on the stochastic process {(X^n_k)_{k∈{0,...,m_n}} : n ∈ N} defined by (14), and require that for any γ ∈ (0, γ̄] and θ ∈ Θ the iterates of K_{γ,θ} be close enough to π_θ after a sufficiently large number of iterations.

H1-(ii) is an ergodicity condition in V-norm for the Markov kernel K_{γ,θ}, uniform in θ ∈ Θ. There exists an extensive literature on the conditions under which a Markov kernel is ergodic [41,19]. H1-(iii) ensures that the distance between the invariant measure π_{γ,θ} of the Markov kernel K_{γ,θ} and π_θ can be controlled uniformly in θ. We show that this condition holds in the case of the Langevin Monte Carlo algorithm in Proposition 23.
We now state our main results.
Theorem 1 (Increasing batch size 1). Let {(X^n_k)_{k∈{0,...,m_n}} : n ∈ N} and (θ_n)_{n∈N} be given by (14). Assume in addition that H1 is satisfied and that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ V^{1/2}(x). Then the following statements hold: (a) (θ_n)_{n∈N} converges almost surely to some θ* ∈ arg min_Θ f. Proof. The proof is postponed to Appendix C.1.
Note that in (14), X^n_0 = X^{n−1}_{m_{n−1}} for n ∈ N*. This procedure is referred to as warm start in the sequel. An inspection of the proof of Theorem 1 shows that X^n_0 could be any random variable independent of F_{n−1} for any n ∈ N with sup_{n∈N*} E[V(X^n_0)] < +∞. This flexibility is not available in the fixed batch size setting of Theorem 3, where the warm-start procedure is crucial for convergence to occur.
In the case where K_{γ,θ} = R_{γ,θ} is the Markov kernel associated with the Langevin update (6), under appropriate conditions Proposition 23 shows that for any γ ∈ (0, γ̄] with γ̄ > 0, Ψ(γ) = O(γ^{1/2}). In that case, assume that there exist a, b, c > 0 such that for any n ∈ N*, δ_n = n^{−a}, γ_n = n^{−b} and m_n = n^c; then (16) is equivalent to an explicit condition on a, b, c. Suppose a ∈ [0, 1) is given; then this condition constrains the admissible values of b and c. This illustrates a trade-off between the intrinsic inaccuracy of our algorithm, through the family of Markov kernels in (14) which do not exactly target π_θ, and the minimization aim of our scheme. Note also that (δ_n)_{n∈N} is allowed to be constant. This case yields γ_n = n^{−2−ς_1} and m_n = n^{3+ς_2} with ς_2 > ς_1 > 0. In our next result we derive a non-asymptotic upper bound in the convex setting.
Theorem 2 (Increasing batch size 2). Assume A1, A2, A3 hold and f is convex. Let (γ_n)_{n∈N}, (δ_n)_{n∈N} be sequences of non-increasing positive real numbers and (m_n)_{n∈N} be a sequence of positive integers satisfying sup_{n∈N} δ_n < 1/L_f and sup_{n∈N} γ_n < γ̄. Let {(X^n_k)_{k∈{0,...,m_n}} : n ∈ N} be given by (14). Assume in addition that H1 is satisfied and that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ V^{1/2}(x). Then a non-asymptotic bound holds for any n ∈ N*, where B_1 and B_2 are given in Lemma 11 and Lemma 12 respectively.
Proof. The proof is postponed to Appendix C.2.
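To make the constant-δ example above concrete: under the assumption Ψ(γ) = O(γ^{1/2}), the bias contribution of iteration n scales like δ_n Ψ(γ_n) ≍ γ_n^{1/2} = n^{−1−ς_1/2}, which is summable for any ς_1 > 0. The short check below evaluates partial sums of this series for the (purely illustrative) fast-decaying choice ς_1 = 2, for which the series is Σ n^{−2} = π²/6 and the numerical stabilization is visible.

```python
import numpy as np

sigma1 = 2.0                          # ς₁; any ς₁ > 0 makes the series summable
def gamma_n(n):
    return n ** (-2.0 - sigma1)       # γ_n = n^(−2−ς₁), with δ_n held constant

# partial sums of δ_n Ψ(γ_n) with δ_n ≡ 1 and Ψ(γ) ∝ γ^{1/2}
n = np.arange(1, 100_001)
terms = np.sqrt(gamma_n(n))           # = n^(−1−ς₁/2), i.e. n^(−2) for ς₁ = 2
partial = np.cumsum(terms)
print(partial[999], partial[-1])      # partial sums at n = 10³ and n = 10⁵
```

The tail beyond n = 10³ contributes less than 10⁻³, consistent with summability of the bias term.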
We recall that in the case where K_{γ,θ} = R_{γ,θ} is the Markov kernel associated with the Langevin update (6), under appropriate conditions Proposition 23 shows that for any γ ∈ (0, γ̄] with γ̄ > 0, Ψ(γ) = O(γ^{1/2}). In that case, if there exist a, b, c ≥ 0 such that for any n ∈ N*, δ_n = n^{−a}, γ_n = n^{−b} and m_n = n^c, then (17) yields an explicit rate of convergence. Similarly, if the step size and the number of Markov chain iterates are fixed, i.e. for all n ∈ N, γ_n = γ and m_n = m with γ > 0 and m ∈ N*, we obtain a lim sup bound with an asymptotic bias depending on γ. However, if (m_n)_{n∈N} is constant, convergence cannot be obtained using Theorem 1. Strengthening the conditions of Theorem 1 and making use of the warm-start property of the algorithm, we can derive convergence in that case. We are now interested in the case where the batch size is fixed, i.e. m_n = m_0 for all n ∈ N. For ease of exposition we only consider m_0 = 1 and let X̃_{n+1} = X^n_1 for any n ∈ N; the general case can be adapted from the proof of the result stated below. More precisely, we consider the setting where the recursion (14) can be written for any n ∈ N as (20): X̃_{n+1} has distribution K_{γ_n,θ_n}(X̃_n, ·) conditionally on F̃_n, with θ_0 ∈ Θ, X̃_0 ∈ R^d, and where F̃_n is the corresponding filtration. We consider the following assumption on the family {H_θ : θ ∈ Θ}.

H2. There exists a measurable function
The following theorem ensures convergence properties for (θ_n)_{n∈N} similar to those of Theorem 1. The proof of this result is based on a generalization of [30, Lemma 4.2] to inexact MCMC schemes.
Theorem 3 (Fixed batch size 1). Assume A1, A2, A3, A4 hold and f is convex. Let γ̄ > 0, and let (γ_n)_{n∈N} and (δ_n)_{n∈N} be sequences of non-increasing positive real numbers satisfying sup_{n∈N} δ_n < 1/L_f, sup_{n∈N} γ_n < γ̄ and sup_{n∈N} |δ_{n+1} − δ_n| δ_n^{−2} < +∞. Let (X̃_n)_{n∈N} be given by (20). Assume in addition that H1 and H2 are satisfied. Then the following statements hold: (a) (θ_n)_{n∈N} converges almost surely to some θ* ∈ arg min_Θ f; (b) furthermore, almost surely there exists C ≥ 0 such that for any n ∈ N*, Σ_{k=1}^n δ_k {f(θ_k) − min_Θ f} ≤ C. Proof. The proof is postponed to Appendix C.3.
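The fixed batch size regime (m_0 = 1, one Langevin step per SA iteration) can be sketched on the same kind of tractable toy model commonly used for such illustrations: y | x ~ N(x, 1), x | θ ~ N(θ, 1), marginal MLE θ* = y, with Fisher's identity giving the one-sample gradient estimate H_θ(x) = θ − x. Schedules and constants are illustrative assumptions, not values from the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y | x ~ N(x, 1), x | theta ~ N(theta, 1); marginal MLE is theta* = y.
y = 2.0
theta_min, theta_max = -10.0, 10.0   # compact parameter set Θ

theta, x = 0.0, 0.0
for n in range(1, 5001):
    delta = 0.5 / n**0.6             # δ_n, non-increasing (illustrative)
    gamma = 0.2 / n**0.1             # γ_n, non-increasing (illustrative)
    # single ULA step (m_0 = 1): the warm start X̃_{n+1} ~ K_{γ_n,θ_n}(X̃_n, ·)
    grad = (x - y) + (x - theta)     # ∇_x U_θ(x)
    x += -gamma * grad + np.sqrt(2.0 * gamma) * rng.standard_normal()
    # projected SA step with the one-sample estimate H_θ(x) = θ - x
    theta = float(np.clip(theta - delta * (theta - x), theta_min, theta_max))

print(theta)  # should settle near the marginal MLE y = 2
```

With a single chain step per iteration, the warm start is doing all the work: x never restarts, so its law tracks π_{θ_n} as θ_n evolves.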
Theorem 4 (Fixed batch size 2). Assume A1, A2, A3, A4 hold and f is convex. Let (γ_n)_{n∈N}, (δ_n)_{n∈N} be sequences of non-increasing positive real numbers satisfying sup_{n∈N} δ_n < 1/L_f and sup_{n∈N} γ_n < γ̄. Let (X̃_n)_{n∈N} be given by (20). Assume in addition that H1 and H2 are satisfied. Then a non-asymptotic bound holds for any n ∈ N*, where C_1, C_2 and C_3 are given in Lemma 13, Lemma 16 and Lemma 15 respectively.
Proof. The proof is postponed to Appendix C.4.

Application to SOUL
We now apply our results to the SOUL methodology introduced in Section 2, where the Markov kernel R_{γ,θ} with γ ∈ (0, γ̄] and θ ∈ Θ is the Langevin Markov kernel associated with recursion (6). Setting for any θ ∈ Θ, π_θ = p(·|y, θ), we consider the following assumptions on the family of probability distributions (π_θ)_{θ∈Θ}.

L1. For any θ ∈ Θ, there exists U_θ : R^d → R such that π_θ admits a probability density function with respect to the Lebesgue measure proportional to x → exp(−U_θ(x)). In addition, (θ, x) → U_θ(x) is continuous, x → U_θ(x) is differentiable for all θ ∈ Θ, and there exists L ≥ 0 such that for any θ ∈ Θ, x → ∇_x U_θ(x) is L-Lipschitz.

In the case where K_{γ,θ} = R_{γ,θ} for any γ ∈ (0, γ̄] and θ ∈ Θ, the first line of (14) can be rewritten for any n ∈ N and k ∈ {0, . . . , m_n − 1} as X^n_{k+1} = X^n_k − γ_{n+1} ∇_x U_{θ_n}(X^n_k) + (2γ_{n+1})^{1/2} Z^n_{k+1}, given (γ_n)_{n∈N} ∈ (0, γ̄]^N, (m_n)_{n∈N} ∈ (N*)^N and (Z^n_k)_{n∈N, k∈{1,...,m_n}} a family of i.i.d. d-dimensional zero-mean Gaussian random variables with identity covariance matrix. In the following propositions, we show that the results above hold by deriving sufficient conditions under which H1 and H2 are satisfied.
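For a one-dimensional Gaussian target, the invariant law of the unadjusted Langevin kernel is available in closed form, which makes a bias control of the kind required by H1-(iii) explicit: one ULA step is an AR(1) recursion whose invariant variance exceeds the target variance by a term of order γ. (This Gaussian case is in fact stronger than the general O(γ^{1/2}) total-variation control of Proposition 23.) A minimal sketch, with all numerical values illustrative:

```python
# For a target N(0, s2), U(x) = x² / (2 s2) and one ULA step reads
#   X' = X - γ X / s2 + sqrt(2γ) Z,  Z ~ N(0, 1),
# i.e. an AR(1) recursion X' = a X + sqrt(2γ) Z with a = 1 - γ / s2.
# Its invariant law is N(0, v(γ)) with v(γ) = 2γ / (1 - a²).

def ula_invariant_variance(gamma, s2):
    a = 1.0 - gamma / s2
    return 2.0 * gamma / (1.0 - a * a)

s2 = 1.0
for gamma in (0.2, 0.1, 0.05):
    # for s2 = 1 the exact variance bias is γ / (2 - γ), i.e. O(γ)
    print(gamma, ula_invariant_variance(gamma, s2) - s2)
```

Halving γ roughly halves the bias, illustrating why the invariant measures π_{γ,θ} approach π_θ as γ → 0.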
Under L1, the Langevin diffusion defined by (5) admits a unique strong solution for any θ ∈ Θ. Consider now the following additional tail condition on U_θ, which ensures geometric ergodicity of R_{γ,θ} for any θ ∈ Θ and γ ∈ (0, γ̄], with γ̄ specified below.
L2. There exist m_1 > 0 and m_2, c, R_1 ≥ 0 such that for any θ ∈ Θ and x ∈ R^d,
L3. There exists L_U ≥ 0 such that for any x ∈ R^d and θ_1, θ_2 ∈ Θ,
Proof. The proof is postponed to Appendix C.5.
Proof. The proof is postponed to Appendix C.6.

B Non-convex objective function
In this section we turn to the case where f is non-convex. We recall that the normal space N(θ, M) of a sub-manifold M ⊂ R^{d_θ} at point θ is defined in terms of T(θ, M), the tangent space of the sub-manifold M at point θ, see [3]. Let {(X^n_k)_{k∈{0,...,m_n}} : n ∈ N} be given by (14). Assume in addition that H1 is satisfied. Then (θ_n)_{n∈N} defined by (14) converges almost surely to some θ* ∈ {θ ∈ Θ : ∇f(θ) + n = 0, n ∈ N(θ, ∂Θ)}.
Proof. The proof is an application of [37, Chapter 5, Theorem 2.3] using the decomposition of the error term considered in the proofs of Theorem 1 and Theorem 3. Indeed, we decompose the error term η_n defined by (24) as η_n = δM_n + B_n, where δM_n is a martingale increment. Then we only need to show that the corresponding sums converge, which is established in Lemma 11 and Lemma 12.

C Postponed proofs
We first derive the following technical lemmas.

Lemma 10. For any probability measures
1−a , which concludes the proof using that a ≤ 1.

C.1 Proof of Theorem 1
Consider (η_n)_{n∈N} defined for any n ∈ N by (24). The proof of Theorem 1 relies on the two following lemmas. We consider the decomposition, for any n ∈ N, η_n = η^{(1)}_n + η^{(2)}_n. We now give upper bounds on the moments of η^{(1)}_n and η^{(2)}_n.
Lemma 11. Assume A1, A2, A3, H1 and that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ V^{1/2}(x).
Proof. Using the definition of (F_n)_{n∈N}, see (15), the Markov property, H1-(ii)-(iii), Lemma 10, Jensen's inequality and the fact that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ V^{1/2}(x), we have the first bound for any n ∈ N*, where for the last inequality we have used Lemma 9. In a similar manner, we bound the second term. We conclude using H1-(i) and the fact that (a + b)² ≤ 2a² + 2b² for a, b ∈ R.
We now turn to the proof of Theorem 1.
(b) Applying [2, Theorem 3], the Cauchy–Schwarz inequality and A1, we obtain almost surely for any n ∈ N a bound which implies, by the proof of (a), that sup_{n∈N} Σ_{k=1}^n δ_k {f(θ_k) − min_Θ f} < +∞ almost surely. The proof is then completed upon dividing (39) by Σ_{k=1}^n δ_k.

C.2 Proof of Theorem 2
Proof. Taking the expectation in (39) and using that η^{(2)}_n is a martingale increment with respect to (F_n)_{n∈N}, we get the stated bound for every n ∈ N. Combining this result with Lemma 11 and Lemma 12 completes the proof.
(a) There exists C 3 ≥ 0 such that for any n ∈ N and θ ∈ Θ converges almost surely.
(b) Assume now (22). We show that almost surely the first term in (45) is absolutely convergent and the second term converges to 0.

Lemma 17. Assume A1, A2, A3, H1 and that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ V^{1/4}(x).
Proof. By a straightforward application of H1-(iii) and by (44), along with the fact that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ V^{1/4}(x), we have for any n ∈ N, E[‖η^{(d)}_n‖] ≤ Ψ(γ_n).
We now turn to the proof of Theorem 3.

C.4 Proof of Theorem 4
Proof. The proof is similar to the one of Theorem 2, using Lemma 13, Lemma 15, Lemma 16, Lemma 17 and the fact that η̃^{(a)}_n is an (F̃_n)_{n∈N}-martingale increment, see (21).

C.5 Proof of Theorem 5
In this section, we give the proof of Theorem 5 by showing that H1 holds. First, in Appendix C.5.1, we establish under L1 and L2 stability results, uniform in the parameter θ ∈ Θ, for the Langevin diffusion (5) and the associated Euler–Maruyama discretization (6), based on a Foster–Lyapunov drift condition with constants independent of θ. Then, in Appendix C.5.2, we show that the stability conditions we derive are sufficient to prove that H1 holds. The proof of Theorem 5 then consists in combining all these results and is presented in Appendix C.5.3. Under L1 and L2, for any θ ∈ Θ, (5) defines a Markov semi-group (P_{t,θ})_{t≥0}: for any x ∈ R^d and t ≥ 0, P_{t,θ}(x, ·) is the distribution of Y^θ_t, where (Y^θ_t)_{t≥0} is the solution of (5) with Y^θ_0 = x. Consider now the generator of (P_{t,θ})_{t≥0} for any θ ∈ Θ, defined for any f ∈ C²(R^d). We say that a Markov kernel R on R^d × B(R^d) satisfies a discrete Foster–Lyapunov drift condition D_d(V, λ, b) if there exist λ ∈ (0, 1), b ≥ 0 and a measurable function V : R^d → [1, +∞) such that RV ≤ λV + b. We say that a Markov semi-group (P_t)_{t≥0} on R^d × B(R^d) with extended infinitesimal generator (A, D(A)) (see e.g. [42] for the definition of (A, D(A))) satisfies a continuous drift condition D_c(V, ζ, β) if there exist ζ > 0, β ≥ 0 and a measurable function V : R^d → [1, +∞) such that AV ≤ −ζV + β.
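As a toy sanity check of the discrete condition D_d(V, λ, b), consider the ULA kernel R_γ targeting a standard Gaussian, with Lyapunov function V(x) = 1 + x²: here R_γV can be computed in closed form, and the values of λ and b below are hypothetical choices valid for this example only.

```python
import numpy as np

gamma = 0.1
a = 1.0 - gamma                 # one ULA step for N(0, 1): X' = a X + sqrt(2γ) Z

def RV(x):
    # R_γ V(x) = E[1 + (a x + sqrt(2γ) Z)²] = 1 + a² x² + 2γ for V(x) = 1 + x²
    return 1.0 + a * a * x * x + 2.0 * gamma

lam = a * a                     # λ ∈ (0, 1)
b = 1.0 - a * a + 2.0 * gamma   # b ≥ 0; here R_γ V = λ V + b holds with equality

for x in np.linspace(-5.0, 5.0, 11):
    assert RV(x) <= lam * (1.0 + x * x) + b + 1e-12
print("drift condition D_d(V, lambda, b) holds on the grid")
```

The constants (λ, b) here do not depend on any parameter θ, mirroring the uniformity over Θ required by the proofs.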
The proof of (57) for x ∈ B(0, r_e) and θ ∈ Θ is then completed upon using that e^a − e^b ≤ (a − b)e^a for all a, b ∈ R with a ≥ b.
Proof. First, by definition, for any x ∈ R^d, we have the stated expression. Therefore, by (55) and L2, we get the claimed bound for any θ ∈ Θ and x ∈ R^d. The proof is then complete upon using that for any x ∈ B(0, r_e)^c, ‖x‖/φ(x) ≥ 2^{−1/2}, and for any y ∈ R^d, ‖y‖/φ(y) ≤ 1.
Proof of Theorem 6. L1 and L2 ensure a uniform drift condition on R_{γ,θ}, see Proposition 18. Note that the Lyapunov function V defined in Proposition 18 satisfies sup_{x∈R^d}(1 + ‖x‖⁴)/V(x) < +∞. H2 is then a direct consequence of Proposition 24.