# Control variates for stochastic gradient MCMC

- 423 Downloads
- 1 Citations

## Abstract

It is well known that Markov chain Monte Carlo (MCMC) methods scale poorly with dataset size. A popular class of methods for solving this issue is stochastic gradient MCMC (SGMCMC). These methods use a noisy estimate of the gradient of the log-posterior, which reduces the per iteration computational cost of the algorithm. Despite this, there are a number of results suggesting that stochastic gradient Langevin dynamics (SGLD), probably the most popular of these methods, still has computational cost proportional to the dataset size. We suggest an alternative log-posterior gradient estimate for stochastic gradient MCMC which uses control variates to reduce the variance. We analyse SGLD using this gradient estimate, and show that, under log-concavity assumptions on the target distribution, the computational cost required for a given level of accuracy is independent of the dataset size. Next, we show that a different control-variate technique, known as zero variance control variates, can be applied to SGMCMC algorithms for free. This postprocessing step improves the inference of the algorithm by reducing the variance of the MCMC output. Zero variance control variates rely on the gradient of the log-posterior; we explore how the variance reduction is affected by replacing this with the noisy gradient estimate calculated by SGMCMC.

## Keywords

Stochastic gradient MCMC Langevin dynamics Scalable MCMC Control variates Computational cost Big data## 1 Introduction

Markov chain Monte Carlo (MCMC), one of the most popular methods for Bayesian inference, scales poorly with dataset size. This is because standard methods require the whole dataset to be evaluated at each iteration. Stochastic gradient MCMC (SGMCMC) are a class of MCMC algorithms that aim to account for this issue. The algorithms have recently gained popularity in the machine learning literature. These methods use efficient MCMC proposals based on discretised dynamics that use gradients of the log-posterior. They reduce the computational cost by replacing the gradients with an unbiased estimate which uses only a subset of the data, referred to as a minibatch. They also bypass the acceptance step by making small discretisation steps (Welling and Teh 2011; Chen et al. 2014; Ding et al. 2014; Dubey et al. 2016).

These new algorithms have been successfully applied to a range of state of the art machine learning problems (e.g. Patterson and Teh 2013; Wenzhe Li 2016). There is now software available to implement these methods (Tran et al. 2016; Baker et al. 2017). In particular, Baker et al. (2017) implements the control-variate methodology we discuss in this article.

This paper investigates stochastic gradient Langevin dynamics (SGLD), a popular SGMCMC algorithm that discretises the Langevin diffusion. There are a number of results suggesting that whilst SGLD has a lower per iteration computational cost compared with MCMC, its overall computational cost is proportional to the dataset size (Welling and Teh 2011; Nagapetyan et al. 2017). This motivates improving the computational cost of SGLD, which can be done by using control variates (Ripley 2009). Control variates can be applied to reduce the Monte Carlo variance of the gradient estimate in stochastic gradient MCMC algorithms. We refer to SGLD using this new control-variate gradient estimate as SGLD-CV.

We analyse the algorithm using the Wasserstein distance between the distribution defined by SGLD-CV and the true posterior distribution, by adapting recent results by Dalalyan and Karagulyan (2017). This is quite a strong assumption, but has become common for the analysis of these methods (see e.g. Durmus and Moulines 2016; Dalalyan and Karagulyan 2017; Nagapetyan et al. 2017) since it ensures that errors do not accumulate from one iteration to the next. We provide an empirical comparison on a wide variety of models that do not necessarily satisfy these conditions in order to fully explore the results. If the strongly log-concave condition does not hold, then provided the posterior contracts with the number of observations, there should still be some benefit from the variance reduction but we leave a full analysis in the nonconvex case for future work.

We get bounds on the Wasserstein distance between the target distribution and the distribution we sample from at a given step of SGLD-CV. These bounds are in terms of the tuning constants chosen when implementing SGLD-CV. Under assumptions on how the posterior changes as we get more data, we are able to show that, after an initialisation step, the cost of running SGLD-CV to a required level of accuracy does not grow with the number of data points. The initialistion step involves finding a centring value \(\hat{\theta }\) using optimisation and evaluating the gradient of the log-posterior at this value. Both these are *O*(*N*) calculations, where *N* is the dataset size, but we show in Sect. 3.4 these each require just a single pass through the dataset. We also suggest starting the algorithm from the centring value, essentially replacing the burn-in step of SGLD with the optimisation step. The experiments show this optimisation step is often more efficient than the burn-in step of SGLD. Our results in Sect. 3.3 quantifies the impact on performance of obtaining a poor centring value.

The use of control variates has also been shown to be important for other Monte Carlo algorithms for simulating from a posterior with a cost that is sublinear in the number of data points (Bardenet et al. 2017; Bierkens et al. 2016; Pollock et al. 2016; Nagapetyan et al. 2017). For previous work that suggests using control variates within SGLD, see Dubey et al. (2016) and Chen et al. (2017). These latter papers, whilst showing benefits of using control variates, do not show that the resulting algorithm can have sublinear cost in the number of data points. A recent paper, Nagapetyan et al. (2017), does investigate how SGLD-CV performs in the data limit under similar log-concavity assumptions on the posterior distribution. They have results that are qualitatively similar to ours, including the sublinear computational cost of SGLD-CV. Though they measure accuracy of the algorithm through the mean squared error of Monte Carlo averages rather than through the Wasserstein distance.

Not only can control variates be used to speed up stochastic gradient MCMC by enabling smaller minibatches to be used; we show also that they can be used to improve the inferences made from the MCMC output. In particular, we can use postprocessing control variates (Mira et al. 2013; Papamarkou et al. 2014; Friel et al. 2016) to produce MCMC samples with a reduced variance. The postprocessing methods rely on the MCMC output as well as gradient information. Since stochastic gradient MCMC methods already compute estimates of the gradient, we explore replacing the true gradient in the postprocessing step with these free estimates. We also show theoretically how this affects the variance reduction factor, and empirically demonstrate the variance reduction that can be achieved by using these postprocessing methods.

## 2 Stochastic gradient MCMC

Throughout this paper, we aim to make inference on a vector of parameters \(\theta \in \mathbb {R}^d\), with data \(\mathbf {x} = \{x_i\}_{i=1}^N\). We denote the probability density of \(x_i\) as \(p(x_i | \theta )\) and assign a prior density \(p(\theta )\). The resulting posterior density is then \(p( \theta | \mathbf {x} ) \propto p( \theta ) \prod _{i=1}^N p( x_i | \theta )\), which defines the posterior distribution \(\pi \). For brevity we write \(f_i(\theta ) = - \log p(x_i | \theta )\) for \(i = 1, \dots N\), \(f_0(\theta ) = - \log p(\theta )\) and \(f(\theta ) = - \log p(\theta | \mathbf {x})\).

Many MCMC algorithms are based upon discrete-time approximations to continuous-time dynamics, such as the Langevin diffusion, that are known to have the posterior as their invariant distribution. The approximate discrete-time dynamics are then used as a proposal distribution within a Metropolis–Hastings algorithm. The accept-reject step within such an algorithm corrects for any errors in the discrete-time dynamics. Examples of such an approach include the Metropolis-adjusted Langevin algorithm (MALA; see e.g. Roberts and Tweedie 1996) and Hamiltonian Monte Carlo (HMC; see e.g. Neal 2010).

### 2.1 Stochastic gradient Langevin dynamics

SGLD, first introduced by Welling and Teh (2011), is a minibatch version of the unadjusted Langevin algorithm (Roberts and Tweedie 1996). At each iteration, it creates an approximation of the true gradient of the log-posterior by using a small sample of data.

*d*-dimensional Wiener process. The stationary distribution of this diffusion is \(\pi \). This means that it will target the posterior exactly, but in practice we need to discretise the dynamics to simulate from it, which introduces error. A bottleneck for this simulation is that calculating \(\nabla f( \theta )\) is an

*O*(

*N*) operation. To get around this, Welling and Teh (2011) replace the log-posterior gradient with the following unbiased estimate

MALA uses a Metropolis–Hastings accept-reject step to correct for the discretisation of the Langevin process. Welling and Teh (2011) bypass this acceptance step, as it requires calculating \(p( \theta | \mathbf {x} )\) using the full dataset, and instead use an adaptive rather than fixed stepsize, where \(h_k \rightarrow 0\) as \(k \rightarrow \infty \). The motivation is that the noise in the gradient estimate disappears faster than the process noise, so eventually, the algorithm will sample the posterior approximately. In practice, we found the algorithm does not mix well when the stepsize is decreased to zero, so in general a fixed small stepsize *h* is used in practice, as suggested by Vollmer and Zygalakis (2016).

## 3 Control variates for SGLD efficiency

The SGLD algorithm has a reduced per iteration computational cost compared to traditional MCMC algorithms. However, there have been a number of results suggesting that the overall computational cost of SGLD is still *O*(*N*) (Welling and Teh 2011; Nagapetyan et al. 2017). What we mean by this is that in order for the algorithm to reach a desired distance from the true posterior, the minibatch size *n*, and the total number of iterations *K*, need to be set so that the computational cost *Kn* is *O*(*N*). The main reason for this result is that in order to control the variance in the gradient estimate, \(\nabla \hat{f}(\theta )\), we need *n* to increase linearly with *N*. Therefore, we would assume that reducing this variance would lead to an improved computational cost of the algorithm. A natural choice is to reduce this variance through control variates (Ripley 2009).

Control variates applied to SGLD have also been investigated by Dubey et al. (2016) and Chen et al. (2017), who show that the convergence bound of SGLD is reduced when they are used. Theoretical results, similar and independent to ours, show how control variates can improve the computational cost of SGLD (Nagapetyan et al. 2017).

In Sect. 3.1, we show how control variates can be used to reduce the variance in the gradient estimate in SGLD, leading to the algorithm SGLD-CV. Then, in Sect. 3.3, we analyse the Wasserstein distance between the distribution defined by SGLD-CV and the true posterior. There are a number of quantities that affect the performance of SGLD-CV, including the stepsize *h*, the number of iterations *K* and the minibatch size *n*. We provide sufficient conditions on *h*, *K* and *n* in order to bound the Wasserstein distance. We show under certain assumptions, the computational cost, measured as *Kn*, required to bound the Wasserstein distance is independent of *N*.

### 3.1 Control variates for SGMCMC

*S*is a random sample from \(\{1,\ldots ,N\}\) such that \(|S| = n\) and the expected number of times

*i*is sampled is \(np_i\). The standard implementation of control variates would set \(p_i=1/N\) for all

*i*. Yet we show below that there can be advantages in having these probabilities vary with

*i*; for example to give higher probabilities to sampling data points for which \(\nabla f_i(\theta ) - \nabla f_i(\hat{\theta })\) has higher variability.

Implementing this in practice means finding a suitable \(\hat{\theta }\), which we refer to as the *centring value*. We show below, under a strongly log-concave assumption, that the Wasserstein distance between the distribution defined by SGLD-CV and the posterior distribution can be bounded by some arbitrary constant; and that this can be done such that the computational cost *Kn* is *O*(1). For this to be the case though, we require both \(\hat{\theta }\) and the starting point of SGLD-CV, \(\theta _0\), to be \(O(N^{-\frac{1}{2}})\) distance from the posterior mean. This requires some preprocessing steps: an optimisation step to find \(\hat{\theta }\), and calculating \(f(\hat{\theta })\). These steps are both *O*(*N*), but we suggest starting the algorithm from \(\hat{\theta }\), meaning the optimisation step essentially replaces the burn-in of the chain. We find in the experiments that these initial steps are often faster than the burn-in of SGLD.

*stochastic optimisation*(Robbins and Monro 1951), and then calculate the full log-posterior gradient at this point \(\nabla f(\hat{\theta })\). We then start the algorithm from \(\hat{\theta }\). In our implementations we use a simple stochastic optimisation method, known as

*stochastic gradient descent*(SGD, see e.g. Bottou 2010). The method works similarly to the standard optimisation method gradient descent, but at each iteration replaces the true gradient of the function with an unbiased estimate. A single update of the algorithm is as follows:

We show in Sect. 3.4, under our assumptions of log-concavity of the posterior, that finding \(\hat{\theta }\) using SGD has a computational cost that is linear in *N*, and we can achieve the required accuracy with just a single pass through the data. As we then start SGLD-CV with this value for \(\theta \), we can view finding the centring value as a replacement for the burn-in phase of the algorithm, and we find, in practice, that the time to find a good \(\hat{\theta }\) is often quicker than the time it takes for SGLD to burn-in. One downside of this procedure is that the SGD algorithm, as well as the SGLD-CV algorithm itself needs to be tuned, which adds to the tuning burden.

In comparison with SGLD-CV, the SAGA algorithm by Dubey et al. (2016) also uses control variates to reduce the variance in the gradient estimate of SGLD. They show that this reduces the MSE of SGLD. The main difference is that their algorithm uses a previous state in the chain as the control variate, rather than an estimate of the mode. This means that SAGA does not require the additional optimisation step, so tuning should be easier. However we show in the experiments of Sect. 5, that the algorithm gets more easily stuck in local stationary points, especially during burn-in. For more complex examples, the algorithm was prohibitively slow to burn-in because of this tendency to get trapped. Dubey et al. (2016) also do not show that SAGA has favourable computational cost results.

### 3.2 Variance reduction

The improvements of using the control-variate gradient estimate (4) over the standard (2) become apparent when we calculate the variances of each. For our analysis, we make the assumption that the posterior is strongly log-concave, formally defined in Assumption 1. This has become a common assumption when analysing gradient based samplers that do not have an acceptance step (Durmus and Moulines 2016; Dalalyan and Karagulyan 2017). In all the following analysis, we use \(\left\| \cdot \right\| \) to denote the Euclidean norm.

### Assumption 1

*m*and

*M*, such that the following conditions hold for the negative log-posterior

We further need a Lipschitz condition for each of the likelihood terms in order to bound the variance of our control-variate estimator of the gradient.

### Assumption 2

Using Assumption 2, we are able to derive a bound on the variance of the gradient estimate of SGLD-CV. This bound is formally stated in Lemma 1.

### Lemma 1

*k*

*th*iteration, with stepsize

*h*and centring value \(\hat{\theta }\). Assume we estimate the gradient using the control-variate estimator with \(p_i=L_i/\sum _{j=1}^N L_j\) for \(i=1,\ldots ,N\). Define \(\xi _k := \nabla \tilde{f}(\theta _k) - \nabla f(\theta _k)\), so that \(\xi _k\) measures the noise in the gradient estimate \(\nabla \tilde{f}\) and has mean 0. Then, for all \(\theta _k, \hat{\theta } \in \mathbb {R}^d\), and all \(k = 1, \dots , K\), we have

In order to consider how SGLD-CV scales with *N*, we need to make assumptions on the properties of the posterior and how these change with *N*. To make discussions concrete, we will focus on the following, strong, assumption that each likelihood-term in the posterior is strongly log-concave. As we discuss later, our results apply under weaker conditions.

### Assumption 3

*L*and

*l*such that \(f_i\) satisfies the following conditions

Under this assumption the log-concavity constants, *m* and *M*, of the posterior both increase linearly with *N*, as shown by the following Lemma.

### Lemma 2

*f*, satisfies the following

We can see that the bound on the gradient estimate variance in (8) depends on the distance between \(\theta _k\) and \(\hat{\theta }\). Appealing to the Bernstein-von Mises theorem (Le Cam 2012), under standard asymptotics we would expect the distance \(\mathbb {E}\left\| \theta _k - \hat{\theta } \right\| ^2\) to be *O*(1 / *N*), if \(\hat{\theta }\) is within \(O(N^{-1/2})\) of the posterior mean, once the MCMC algorithm has burnt in. As *M* is *O*(*N*), this suggests that \(\mathbb {E}\left\| \xi _k \right\| ^2\) will be *O*(*N*).

*N*is large, we now compare this

*O*(

*N*) result for SGLD-CV, with a result on the variance of the simple estimator, \(\nabla \hat{f} (\theta )\). If we randomly pick some data point index

*I*and fix some point \(\theta = \vartheta \), then define \(V_j(\vartheta )\) to be the empirical variance of \(\partial _j f_I(\vartheta )\) given the data \(\mathbf {x}\); and set \(V(\vartheta ) = \sum _{j=1}^d V_j(\vartheta )\). Then, defining \(\hat{\xi }(\theta ) = \nabla \hat{f}(\theta ) - \nabla f(\theta )\), if we assume we are sampling the minibatch without replacement then

*N*, the true parameter value converges to \(\theta _0\). Then, we would expect that, for \(\vartheta \) close to \(\theta _0\), \(\mathbb {E}\left\| \hat{\xi }(\vartheta ) \right\| ^2 \approx \frac{N^2}{n} V(\theta _0)\), so that the estimator will be \(O(N^2)\). More precisely, if we assume that as \(N \rightarrow \infty \), the posterior converges to a point mass at some value \(\theta _0\), then we can choose \(\epsilon > 0\) and assume \(V(\vartheta ) \ge \sigma ^2 > 0\) for all \(\vartheta \) in an epsilon ball around \(\theta _0\). Then, there is a constant \(c > 0\) such that

*O*(

*N*) reduction in variance, and this plays a key part in the computational cost improvements we show in the next section.

### 3.3 Computational cost of SGLD-CV

In this section, we investigate how applying control variates to the gradient estimate of SGLD reduces the computational cost of the algorithm.

*N*is changed. For two measures \(\mu \) and \(\nu \) defined on the probability space \((\mathbb {R}^d, B(\mathbb {R}^d))\), and for a real number \(q > 0\), the distance \(W_q\) is defined by

One issue when working with the Wasserstein distance is that it is not invariant to transformations. For example, scaling all entries of \(\theta \) by a constant will scale the Wasserstein distance by the same constant. A linear transformation of the parameters will result in a posterior that is still strongly log-concave, but with different constants *m* and *M*. To account for this, we suggest measuring error by the quantity \(\sqrt{m} W_2\), which is invariant to scaling \(\theta \) by a constant. Theorem 1 of Durmus and Moulines (2016) bounds the standard deviation of any component of \(\theta \) by a constant times \(1/\sqrt{m}\), so we can view the quantity \(\sqrt{m} W_2\) as measuring the error on a scale that is relative to the variability of \(\theta \) under the posterior distribution.

There are a number of quantities that will affect the performance of SGLD and SGLD-CV. These include the stepsize *h*, the minibatch size *n* and the total number of iterations *K*. In the analysis to follow, we find conditions on *h*, *n* and *K* that ensure the Wasserstein distance between the distribution defined by SGLD-CV and the true posterior distribution \(\pi \) are less than some \(\epsilon > 0\). We use these conditions to calculate the computational cost, measured as *Kn*, required for this algorithm to reach the satisfactory error \(\epsilon \).

The first step is to find an upper bound on the Wasserstein distance between SGLD-CV and the posterior distribution in terms of *h*, *n*, *K* and the constants *m* and *M* declared in Assumption 1.

### Proposition 1

*K*

*th*iteration of the algorithm with stepsize

*h*, initial value \(\theta _0\), centring value \(\hat{\theta }\). Let the distribution of \(\theta _K\) be \(\nu _K\). Denote the expectation of \(\theta \) under the posterior distribution \(\pi \) by \(\bar{\theta }\). If \(h < \frac{2m}{2M^2 + m^2}\), then for all integers \(K \ge 0\),

*d*is the dimension of \(\theta _{k}\).

The proof of this proposition is closely related to the proof of Proposition 2 of Dalalyan and Karagulyan (2017). The extra complication comes from our bound on the variance of the estimator of the gradient; which depends on the current state of the SGLD-CV algorithm, rather than being bounded by a global constant.

We can now use Proposition 1 to find conditions on *K*, *h* and *n* in terms of the constants *M* and *m* such that the Wasserstein distance is bounded by some positive constant \(\epsilon _0/\sqrt{m}\) at its final iteration *K*.

### Theorem 1

*h*, initial value \(\theta _0\), centring value \(\hat{\theta }\). Let the distribution of \(\theta _K\) be \(\nu _K\). Denote the expectation of \(\theta \) under the posterior distribution \(\pi \) by \(\bar{\theta }\). Define \(R := M / m\). Then, for any \(\epsilon _0 > 0\), if the following conditions hold:

*d*is the dimension of \(\theta _k\), then \(W_2(\nu _K, \pi ) \le \epsilon _0/\sqrt{m}\).

As a corollary of this result, we have the following, which gives a bound on the computational cost of SGLD, as measured by *Kn*, to achieve a required bound on the Wasserstein distance.

### Corollary 1

The constants, \(C_1\) and \(C_2\), in the bound on *Kn*, depend on \(\epsilon _0\) and \(R=M/m\). It is simple to show that both constants are increasing in *R*. Under Assumption 3, we have that *R* is a constant as *N* increases. Corollary 1 suggests that provided \(\left\| \theta _0 - \bar{\theta } \right\| < c / \sqrt{m}\) and \(\left\| \hat{\theta } - \bar{\theta } \right\| < c / \sqrt{m}\), for some constant *c*; then the computational cost of SGLD-CV will be bounded by a constant. Since we suggest starting SGLD-CV at \(\hat{\theta }\), then technically we just need this to hold for \(\left\| \hat{\theta } - \bar{\theta } \right\| \). Under Assumption 3, we have that *m* increases linearly with *N*, so this corresponds to needing \(\left\| \hat{\theta } - \bar{\theta } \right\| <c_1/\sqrt{N}\) as *N* increases. Additionally, by Theorem 1 of Durmus and Moulines (2016), we have that the variance of the posterior scales like \(1/m=1/N\) as *N* increases, so we can interpret the \(1/\sqrt{N}\) factor as being a measure of the spread of the posterior as *N* increases. The form of the corollary makes it clear that a similar argument would apply under weaker assumptions than Assumption 3. We only need that the ratio of the log-concavity constants, *M* / *m*, of the posterior remains bounded as *N* increases.

This corollary also gives insight into the computational penalty you pay for a poor choice of \(\theta _0\) or \(\hat{\theta }\). The bound on the computational cost will increase logarithmically with \(\left\| \theta _0-\bar{\theta } \right\| \) and linearly with \(\left\| \hat{\theta }-\bar{\theta } \right\| \).

Similar bounds for the computational cost of SGLD have been derived (e.g. Nagapetyan et al. 2017; Chatterji et al. 2018). For example, Chatterji et al. (2018) state the computational cost of SGLD in order to sample from a distribution with \(W_2\) distance that is within \(\epsilon \) distance of the target is \(O(1/\epsilon ^2)\). For our required level of accuracy, \(\epsilon =\epsilon _0/\sqrt{m}\), this corresponds to a bound on the computational cost that is *O*(*N*), as compared to *O*(1) for SGLD-CV. However, a formal proof that SGLD is *O*(*N*) requires showing that any implementation of SGLD with *Kn* that is sublinear in *N* cannot reach arbitrary accuracy \(\epsilon \). This requires a lower bound on the \(W_2\) distance, rather than an upper bound. Whilst the result of Chatterji et al. (2018) is just an upper bound on the \(W_2\) distance, there are empirical results that suggest a linear computational cost for SGLD, including those in Nagapetyan et al. (2017), as well as our own experiments in Sect. 5.

### 3.4 Setup costs

There are a number of known results on the convergence of SGD under the strongly log-concave conditions of Assumption 1. These will allow us to quantify the setup cost of finding the point \(\hat{\theta }\) in this setting. More complex cases are explored empirically in the experiments in Sect. 5. Lemma 3 due to Nemirovski et al. (2009) quantifies the convergence of the final point of SGD.

### Lemma 3

*K*iterations. Suppose \(\mathbb {E}\left\| \nabla \hat{f}(\theta ) \right\| ^2 \le D^2\) and denote the true mode of

*f*by \(\theta ^*\). Then it holds that

By using a similar argument to (9), we would expect that \(D^2\) is \(O(N^2/n)\). This means that under Assumption 3, we will need to process the full dataset once before the SGD algorithm has converged to a mode \(\hat{\theta }\) within \(O(N^{-\frac{1}{2}})\) of the posterior mean. It follows that, for these cases there are two one off *O*(*N*) setup costs, one to find an acceptable mode \(\hat{\theta }\) and one to find the full log-posterior gradient at this point \(\nabla f(\hat{\theta })\).

## 4 Postprocessing control variates

Mira et al. (2013) introduce a way of choosing \(h(\theta )\) almost automatically by using the gradient of the log-posterior. Choosing \(h(\cdot )\) in this manner is referred to as a zero variance (ZV) control variate. Friel et al. (2016) showed that, under mild conditions, we can replace the log-posterior gradient with an unbiased estimate and still have a valid control variate. SGMCMC methods produce unbiased estimates of the log-posterior gradient, and so it follows that these gradient estimates can be applied as ZV control variates. For the rest of this section, we focus our attention on SGLD, but these ideas are easily extendable to other stochastic gradient MCMC algorithms. We refer to SGLD with these postprocessing control variates as SGLD-ZV.

*Laplace operator*\(\frac{ \partial ^2 }{ \partial \theta _1^2 } + \dots + \frac{ \partial ^2 }{ \partial \theta _d^2 }\). In order to get the best variance reduction, we simply have to optimise the coefficients of the polynomial

*Q*(.). In practice, first or second degree polynomials \(Q(\theta )\) often provide good variance reduction (Mira et al. 2013). For the rest of this section, we focus on first degree polynomials, so \(Q(\theta ) = \mathbf {a}^T \theta \), but the ideas are easily extendable to higher orders (Papamarkou et al. 2014).

*z*that is independent of \(\theta \), depending only on the minibatch choice. See “Appendix” for the full calculation.

The efficiency of ZV control variates in reducing the variance of our MCMC sample is directly affected by using an estimate of the gradient rather than the truth. For the remainder of this section, we investigate how the choice of the gradient estimate, and the minibatch size *n*, affects the variance reduction.

### Assumption 4

\(\mathbb {V}{\mathrm {ar}}[ \phi (\theta )] < \infty \) and \(\mathbb {V}{\mathrm {ar}}[ \hat{\psi } (\theta )] < \infty \). \(\mathbb {E}_{\theta | \mathbf {x}} \left\| \nabla f_i(\theta ) \right\| ^2\) is bounded by some constant \(\sigma \) for all \(i = 0, \dots N\), \(\theta \in \mathbb {R}^d\).

### Theorem 2

*R*, and the optimal variance reduction using SGLD gradient estimates to be \(\hat{R}\). Then, we have that

The proof of this result is given in the “Appendix”. An important consequence of Theorem 2 is that if we use the standard SGLD gradient estimate, then the denominator of (15) is *O*(*n* / *N*), so our variance reduction diminishes as *N* gets large. However, if we use the SGLD-CV estimate instead, then under standard asymptotics, the denominator of (15) is *O*(*n*), so the variance reduction does not diminish with increasing dataset size. The same holds for SAGA, and other control-variate algorithms that reduce the gradient error to *O*(*N*). It follows that for best results, we recommend using the ZV postprocessing step after running the SGLD-CV algorithm, especially for large *N*. The ZV postprocessing step can be immediately applied in exactly the same way to other stochastic gradient MCMC algorithms, such as SGHMC and SGNHT (Chen et al. 2014; Ding et al. 2014).

It is worth noting that there are some storage constraints for SGLD-ZV. This algorithm requires storing the full MCMC chain, as well as the gradient estimates at each iteration. So the storage cost is twice the storage cost of a standard SGMCMC run. However, in some high-dimensional cases, the required SGMCMC test statistic is estimated on the fly using the most recent output of the chain and thus reducing the storage costs. We suggest that if the dimensionality is not too high, then the additional storage cost of recording the gradients to apply the ZV postprocessing step can offer significant variance reduction for free. However, for very high-dimensional parameters, the cost associated with storing the gradients may preclude the use of the ZV step.

## 5 Experiments

In the experiments to follow, we compare SGLD-CV to SGLD in a number of different scenarios. To empirically demonstrate the scalability advantages, we infer each model for three different dataset sizes. The preprocessing steps for SGLD-CV are included in the plots. We include the method SAGA (Dubey et al. 2016) in our results, an alternative variance reduction method for SGLD.

We also provide box-plots of the log predictive density of iterations of SGLD-CV before and after it has been postprocessed using ZV control variates, to demonstrate the variance reduction and improved predictive performance of the method.

A fixed stepsize scheme is used for all methods, and these are tuned using a grid search (for SGLD-CV, both SGD and SGLD steps are tuned using a grid search). Full details of the tuning constants and alternative results using a decreasing stepsize scheme are given in “Appendix”. The minibatch size is fixed across all the dataset sizes. All results are timed when the algorithms are run on four cores of a 2.3 GHz Intel Xeon CPU.

### 5.1 Logistic regression

We used the cover type dataset (Blackard and Dean 1999), which has 581012 observations, which we split into a training and test set. First, we run SGLD, SGLD-CV and SAGA on the dataset, all with minibatch size 500. The method SAGA was discussed at the end of Sect. 3.1. To empirically support the scalability results of Theorem 1, we fit the model 3 times. In each fit, the dataset size is varied, from about 1% of the full dataset to the full dataset size *N*. The performance is measured by calculating the log predictive density on a held out test set every 10 iterations. Some of our examples are high-dimensional, so our performance measure aims to reduce dimensionality whilst still capturing important quantities such as the variance of the chain. We include the burn-in of SGLD and SAGA, to contrast with the optimisation step and full gradient calculation required for SGLD-CV, which is included in the total computational time.

The results are plotted against time in Fig. 1. The results illustrate the efficiency gains of SGLD-CV over SGLD as the dataset size increases, as expected from Theorem 1. SAGA outperforms SGLD-CV in this example because SAGA converges quickly in this simple setting. In the more complicated examples to follow, we show that SAGA can get stuck in local stationary points.

### 5.2 Probabilistic matrix factorisation

A common recommendation system task is to predict a user’s rating of a set of items, given previous ratings and the ratings of other users. The end goal is to recommend new items that the user will rate highly. Probabilistic matrix factorisation (PMF) is a popular method to train these models (Mnih and Salakhutdinov 2008). As the matrix of ratings is sparse, over-fitting is a common issue in these systems, and Bayesian approaches are a way to account for this (Chen et al. 2014; Ahn et al. 2015).

*R*, of size \(L \times M\), where

*L*is the number of users and

*M*is the number of items. Each entry contains a rating of a particular item by that user, with a lot of missing entries. The aim is to predict the values of the missing entries. This is done by factorising

*R*into two matrices

*U*and

*V*, so that \(R \approx U^T V\), where

*U*and

*V*are size \(D \times L\) and \(D \times M\), respectively. Here,

*D*is some user-specified parameter, which we choose to be 20. We use the Movielens dataset ml-100k

^{1}, which contains 100,000 ratings from almost 1000 users and 1,700 movies. We use batch sizes of 5000, with a larger minibatch size chosen due to the high-dimensional parameter space. As before, we compare performance by calculating the predictive distribution on a held out dataset every 10 iterations. Full details of chosen hyperparameters are detailed in “Appendix”.

We investigate the scaling results of SGLD-CV and SAGA versus SGLD by varying the dataset size. We do this by limiting the number of users in the dataset, ranging from 100 users to the full 943. The results are given in Fig. 3. In this example, SAGA converges slowly in comparison even with SGLD. In fact, the algorithm converges slowly in all our more complex experiments. The problem is particularly bad for large *N*. This is likely a result of the starting point for SAGA being far from the posterior mode. Empirically, we found that the gradient direction and magnitude can update very slowly in these cases. This is not an issue for simpler examples such as logistic regression, but for more complex examples we believe it could be a sign that the algorithm is getting stuck in, or moving slowly through, local modes where the gradient is comparatively flatter. The problem appears to be made worse for large *N* when it takes longer to update \(g_\alpha \). This is an example where the optimisation step of SGLD-CV is an advantage, as the algorithm is immediately started close to the posterior mode and so the efficiency gains are quickly noted. This issue with SAGA could be related to the starting point condition for SGLD-CV as detailed in Corollary 1. Due to the form of the Wasserstein bound, it is likely that SAGA would have a similar starting point condition.

### 5.3 Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is an example of a topic model used to describe collections of documents by sets of discovered topics (Blei et al. 2003). The input consists of a matrix of word frequencies in each document, which is very sparse, motivating the use of a Bayesian approach to avoid over-fitting.

Due to storage constraints, it was not feasible to apply SGLD-ZV to this problem, so we focus on SGLD-CV. We scraped approximately 80,000 documents from Wikipedia, and used the 1000 most common words to form our document-word matrix input. We used a similar formulation to Patterson and Teh (2013), though we did not use a Riemannian sampler, which should test the methods due to the gradient instability. We also instead use the expanded-natural parameterisation, since according to empirical results by Patterson and Teh (2013) that was the best performing parameterisation for the non-Riemannian case. We set the number of topics *K* to be 20. Full details of hyperparameters are given in “Appendix”.

Once again in our comparison of SGLD, SGLD-CV and SAGA, we vary the dataset size, this time by changing the number of documents used in fitting the model, from 10,000 to the full 81,538. We use batch sizes of 50 documents. We measure the performance of LDA using the *perplexity* on held out words from each document, a standard performance measure for this model, for more detail see Patterson and Teh (2013). The results are given in Fig. 5. Here, the scalability improvements of using SGLD-CV over SGLD are clear as the dataset size increases. This time the batch size is small compared to the dataset size, which probably makes the scalability improvements more obvious. The sudden drop in perplexity for the SGLD-CV plot occurs at the switch from the stochastic optimisation step to SGLD-CV. This is likely a result of the algorithm making efficient use of the Gibbs step to simulate the latent topics.

An interesting aspect of this problem is that it appears to have a pronounced local mode where each of the methods become trapped (this can be seen by the blip in the plot at a perplexity of around 850). SGLD-CV is the first to escape followed by SGLD, but SAGA takes a long time to escape. This is probably due to a similar aspect as the one discussed in the previous experiment (Sect. 5.2). Similar to the previous experiment, we find that whilst SAGA seems trapped, its gradient estimate changes very little, which could be a sign that the algorithm is moving very slowly through an area with a relatively flat gradient, such as a local mode. A simple solution would be to start SAGA closer to the mode using a stochastic optimisation scheme.

## 6 Discussion

We have used control variates for stochastic gradient MCMC to reduce the variance in the gradient estimate. We have shown that in the strongly log-concave setting, and under standard asymptotics, this proposed SGLD-CV algorithm reduces the computational cost of stochastic gradient Langevin dynamics to *O*(1). Our theoretical results give results on the computational cost under nonstandard asymptotics also, and show there should be some benefit provided distance between the centring value \(\hat{\theta }\) and the posterior mean inversely depends on *N*. The algorithm relies on a setup cost that estimates the posterior mode which replaces the burn-in of SGLD. We have explored the cost of this step both theoretically and empirically. We have empirically supported these scalability results on a variety of interesting and challenging problems from the statistics and machine learning literature using real-world datasets. The simulation study also revealed that SGLD-CV was less susceptible to getting stuck in local stationary points than an alternative method that performs variance reduction using control variates, SAGA (Dubey et al. 2016). An interesting future extension would be to reduce the start-up cost of SGLD-CV, along with introducing automatic stepsize tuning.

We showed that stochastic gradient MCMC methods calculate all the information needed to apply zero variance postprocessing control variates. This improves the inference of the output by reducing its variance. We explored how the variance reduction is affected by the minibatch size and the gradient estimate, and show using SGLD-CV or SAGA rather than SGLD can achieve a better variance reduction. We demonstrated this variance reduction empirically. A limitation of these postprocessing control variates is they require the whole chain, which can lead to high storage costs if the dimensionality of the sample space is high. Future work could explore ways to reduce the storage costs of stochastic gradient MCMC.

## Footnotes

## Notes

### Acknowledgements

The first author gratefully acknowledges the support of the EPSRC funded EP/L015692/1 STOR-i Centre for Doctoral Training. This work was supported by EPSRC grants EP/K014463/1, EP/R018561/1, EP/S00159X/1 and EP/R01860X/1 and ONR Grant N00014-15-1-2380 and NSF CAREER Award IIS-1350133.

## Supplementary material

## References

- Ahn, S., Korattikara, A., Liu, N., Rajan, S., Welling, M.: Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 9–18. ACM, New York (2015)Google Scholar
- Baker, J., Fearnhead, P., Fox, E.B., Nemeth, C.: sgmcmc: An R package for stochastic gradient Markov Chain Monte Carlo. J. Stat. Softw. (2017). https://arxiv.org/abs/1710.00578
- Bardenet, R., Doucet, A., Holmes, C.: On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res.
**18**(47), 1–43 (2017)MathSciNetzbMATHGoogle Scholar - Bierkens, J., Fearnhead, P., Roberts, G.: The zig-zag process and super-efficient sampling for Bayesian analysis of big data. (2016). https://arxiv.org/abs/1607.03188
- Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agric.
**24**(3), 131–151 (1999)CrossRefGoogle Scholar - Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res.
**3**, 993–1022 (2003)zbMATHGoogle Scholar - Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of the 19th international conference on computational statistics, pp. 177–187. Springer, Berlin (2010)Google Scholar
- Chatterji, N.S., Flammarion, N., Ma, Y.-A., Bartlett, P.L., Jordan, M.I.: On the theory of variance reduction for stochastic gradient Monte Carlo. (2018). https://arxiv.org/abs/1802.05431v1
- Chen, C., Wang, W., Zhang, Y., Su, Q., Carin, L.: A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. (2017). https://arxiv.org/abs/1709.01180
- Chen, T., Fox, E., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In: Proceedings of the 31st international conference on machine learning PMLR, pp. 1683–1691 (2014)Google Scholar
- Dalalyan, A.S., Karagulyan, A.G.: User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. (2017). https://arxiv.org/abs/1710.00095
- Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R.D., Neven, H.: Bayesian sampling using stochastic gradient thermostats. In: Advances in neural information processing systems 27, pp. 3203–3211. Curran Associates, Inc, Red Hook (2014)Google Scholar
- Dubey, K.A., Reddi, S.J., Williamson, S.A., Poczos, B., Smola, A.J., Xing, E.P.: Variance reduction in stochastic gradient Langevin dynamics. In: Advances in neural information processing systems, vol. 29, pp. 1154–1162. Curran Associates, Inc, Red Hook (2016)Google Scholar
- Durmus, A., Moulines, E.: High-dimensional bayesian inference via the unadjusted langevin algorithm. (2016). https://hal.archives-ouvertes.fr/hal-01304430/
- Friel, N., Mira, A., Oates, C.: Exploiting multi-core architectures for reduced-variance estimation with intractable likelihoods. Bayesian Anal.
**11**(1), 215–245 (2016)MathSciNetCrossRefzbMATHGoogle Scholar - Le Cam, L.: Asymptotic Methods in Statistical Decision Theory. Springer, Berlin (2012)Google Scholar
- Mira, A., Solgi, R., Imparato, D.: Zero variance Markov chain Monte Carlo for Bayesian estimators. Stat. Comput.
**23**(5), 653–662 (2013)MathSciNetCrossRefzbMATHGoogle Scholar - Mnih, A., Salakhutdinov, R.R.: Probabilistic matrix factorization. In: Advances in neural information processing systems, vol. 20, pp. 1257–1264. Curran Associates, Inc, Red Hook (2008)Google Scholar
- Nagapetyan, T., Duncan, A., Hasenclever, L., Vollmer, S.J., Szpruch, L., Zygalakis, K.: The true cost of stochastic gradient Langevin dynamics. (2017). https://arxiv.org/abs/1706.02692
- Neal, R.M.: MCMC using Hamiltonian dynamics. In: Brooks, S., Gelman, A., Jones, G.L., Meng, X.-L. (eds.) Handbook of Markov Chain Monte Carlo. Chapman & Hall, Boca Raton (2010)Google Scholar
- Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim.
**19**(4), 1574–1609 (2009)MathSciNetCrossRefzbMATHGoogle Scholar - Papamarkou, T., Mira, A., Girolami, M.: Zero variance differential geometric Markov chain Monte Carlo algorithms. Bayesian Anal.
**9**(1), 97–128 (2014)MathSciNetCrossRefzbMATHGoogle Scholar - Patterson, S., Teh, Y.W.: Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In: Advances in neural information processing systems, vol. 26, pp. 3102–3110. Curran Associates, Inc, Red Hook (2013)Google Scholar
- Pollock, M., Fearnhead, P., Johansen, A.M., Roberts, G.O.: The scalable Langevin exact algorithm: Bayesian inference for big data. (2016). https://arxiv.org/abs/1609.03436
- Ripley, B.D.: Stochastic Simulation. Wiley, Hoboken (2009)zbMATHGoogle Scholar
- Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat.
**22**(3), 400–407 (1951)MathSciNetCrossRefzbMATHGoogle Scholar - Roberts, G.O., Tweedie, R.L.: Exponential convergence of langevin distributions and their discrete approximations. Bernoulli
**2**(4), 341–363 (1996)MathSciNetCrossRefzbMATHGoogle Scholar - Teh, Y.W., Thiéry, A.H., Vollmer, S.J.: Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res.
**17**(7), 1–33 (2016)MathSciNetzbMATHGoogle Scholar - Tran, D., Kucukelbir, A., Dieng, A.B., Rudolph, M., Liang, D., Blei, D.M.: Edward: a library for probabilistic modeling, inference, and criticism. (2016). https://arxiv.org/abs/1610.09787
- Vollmer, S.J., Zygalakis, K.C.: (Non-) asymptotic properties of stochastic gradient Langevin dynamics. J. Mach. Learn. Res.
**17**(159), 1–48 (2016)Google Scholar - Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th international conference on machine learning PMLR, pp. 681–688 (2011)Google Scholar
- Wenzhe Li, Sungjin Ahn, M.W.: Scalable MCMC for mixed membership stochastic blockmodels. In: Proceedings of the 19th international conference on artificial intelligence and statistics PMLR, pp. 723–731 (2016)Google Scholar

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.