Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference

Abstract

Gradient-based Monte Carlo sampling algorithms, such as Langevin dynamics and Hamiltonian Monte Carlo, are important methods for Bayesian inference. In large-scale settings, full gradients are not affordable, so stochastic gradients evaluated on mini-batches are used as a replacement. In order to reduce the high variance of noisy stochastic gradients, Dubey et al. (in: Advances in neural information processing systems, pp 1154–1162, 2016) applied the standard variance reduction technique to stochastic gradient Langevin dynamics and obtained both theoretical and experimental improvements. In this paper, we apply these variance reduction techniques to Hamiltonian Monte Carlo and achieve better theoretical convergence results than variance-reduced Langevin dynamics. Moreover, we apply the symmetric splitting scheme in our variance-reduced Hamiltonian Monte Carlo algorithms to further improve the theoretical results. The experimental results are consistent with the theory: variance-reduced Hamiltonian Monte Carlo demonstrates better performance than variance-reduced Langevin dynamics in Bayesian regression and classification tasks on real-world datasets.

Introduction

Gradient-based Monte Carlo algorithms are useful tools for sampling posterior distributions. Similar to gradient descent algorithms, gradient-based Monte Carlo generates posterior samples iteratively using the gradient of log-likelihood.

Langevin dynamics (LD) and Hamiltonian Monte Carlo (HMC) (Duane et al. 1987; Neal et al. 2011) are two important examples of gradient-based Monte Carlo sampling algorithms that are widely used in Bayesian inference. Since calculating the likelihood on large datasets is expensive, stochastic gradients (Robbins and Monro 1951) are used in place of full gradients, and stochastic gradient counterparts of both Langevin dynamics and Hamiltonian Monte Carlo have been developed (Welling and Teh 2011; Chen et al. 2014). Stochastic gradient Hamiltonian Monte Carlo (SGHMC) usually converges faster than stochastic gradient Langevin dynamics (SGLD) in practical machine learning tasks such as covariance estimation of a bivariate Gaussian and Bayesian neural networks for classification on the MNIST dataset, as demonstrated in Chen et al. (2014). A similar phenomenon was also observed in Chen et al. (2015), where SGHMC and SGLD were compared on both synthetic and real-world datasets. Intuitively, compared with SGLD, SGHMC has a momentum term that may enable it to explore the parameter space of the posterior distribution much faster when the gradient of the log-likelihood becomes small.

Very recently, Dubey et al. (2016) borrowed the standard variance reduction techniques from the stochastic optimization literature (Johnson and Zhang 2013; Defazio et al. 2014) and applied them to SGLD, obtaining two variance-reduced SGLD algorithms (called SAGA-LD and SVRG-LD) with improved theoretical results and practical performance. Given the superiority of SGHMC over SGLD in terms of convergence rate in a wide range of machine learning tasks, it is natural to ask whether such variance reduction techniques can be applied to SGHMC to achieve better results than variance-reduced SGLD. The challenge is that SGHMC is more complicated than SGLD: it contains an extra momentum term (which tries to explore faster) and a friction term (which controls the noise that SGHMC introduces relative to HMC). Note that the friction term makes SGHMC inherently different from SGLD, since LD itself already contains noise and thus extends directly to SGLD, whereas HMC itself is deterministic. To the best of our knowledge, there is no existing work that even proves SGHMC is better than SGLD. So in this paper we provide new approaches and insights in our analysis to prove that variance-reduced SGHMC is better than variance-reduced SGLD, owing to the momentum and friction terms.

In fact, variance reduction appears to be more effective in this stochastic Bayesian inference setting than in stochastic optimization. Intuitively, in nonconvex optimization the full-gradient method (no variance) may converge to a saddle point or a poor local minimum (not a global minimum), and the variance of the stochastic gradient estimator may help escape saddle points or bad local minima; thus one may not always want to reduce the variance. In Bayesian inference, however, the full-gradient method (no variance) always converges to the stationary posterior distribution, so reducing the variance of the stochastic gradient estimator yields a more accurate approximation of the posterior. Note that in large-scale settings, full gradients (no variance) are not affordable, and thus stochastic gradients evaluated on mini-batches are used as a replacement.

Our contribution

  1. We propose two variance-reduced versions of Hamiltonian Monte Carlo algorithms (called SVRG-HMC and SAGA-HMC) using the standard approaches from Johnson and Zhang (2013) and Defazio et al. (2014). Compared with SVRG/SAGA-LD (Dubey et al. 2016), our algorithms guarantee improved theoretical convergence results due to the extra momentum term in HMC (see Corollary 3).

  2. Moreover, we combine the proposed SVRG/SAGA-HMC algorithms with the symmetric splitting scheme (Chen et al. 2015; Leimkuhler and Shang 2016) to extend them to 2nd-order integrators, which further improves the dependency on the step size (see the difference between Theorems 2 and 5). We denote these two algorithms as SVRG2nd-HMC and SAGA2nd-HMC.

  3. Finally, we evaluate the proposed algorithms on real-world datasets and compare them with SVRG/SAGA-LD (Dubey et al. 2016); as it turns out, our algorithms converge markedly faster than the benchmarks (vanilla SGHMC and SVRG/SAGA-LD).

Related work

Langevin dynamics and Hamiltonian Monte Carlo are two important sampling algorithms that are widely used in Bayesian inference. Many works have studied how to develop variants of them with improved performance, especially scalability to large datasets. Welling and Teh (2011) started this direction with the notable stochastic gradient Langevin dynamics (SGLD). Ahn et al. (2012) proposed a modification to SGLD reminiscent of Fisher scoring to better estimate the gradient noise variance, with lower classification error rates on the HHP and MNIST datasets. Chen et al. (2014) developed the stochastic gradient version of HMC (SGHMC) with a quite nontrivial approach different from SGLD. Ding et al. (2014) further improved SGHMC with new dynamics to better control the gradient noise, and the proposed stochastic gradient Nosé-Hoover thermostat (SGNHT) outperforms SGHMC on the MNIST dataset.

Various other settings of Markov chain Monte Carlo (MCMC) have also been considered. Girolami and Calderhead (2011) enhanced LD and HMC by exploiting the Riemannian structure of the target distribution, yielding Riemannian manifold LD and HMC (RMLD and RMHMC, respectively). Byrne and Girolami (2013) developed geodesic Monte Carlo (GMC), which is applicable to Riemannian manifolds with no global coordinate system. Large-scale variants of RMLD, RMHMC and GMC with stochastic gradients were developed by Patterson and Teh (2013), Ma et al. (2015) and Liu et al. (2016), respectively. Ahn et al. (2014) studied the behaviour of stochastic gradient MCMC algorithms for distributed posterior inference. Very recently, Zou et al. (2018) used a stochastic variance-reduced HMC for sampling from smooth and strongly log-concave distributions, which requires f to be smooth and strongly convex. In this paper, we do not assume f is strongly convex or even convex, and we also use an efficient discretization scheme to further improve the convergence results. Besides, their results are measured in the 2-Wasserstein distance, while ours are measured in mean squared error. Note that variance reduction techniques have already been used in the nonconvex optimization literature (see e.g., Allen-Zhu and Hazan 2016; Reddi et al. 2016; Li and Li 2018; Ge et al. 2019; Li 2019), where they achieved improved convergence results.

Preliminary

Let \(X = \{x_i\}_{i=1}^n\) be a d-dimensional dataset that follows the distribution \(\Pr (X|\theta ) = \prod _{i=1}^n \Pr (x_i|\theta )\). Then, we are interested in sampling the posterior distribution \(\Pr (\theta |X)\propto \Pr (\theta )\prod _{i=1}^n \Pr (x_i|\theta )\) based on Hamiltonian Monte Carlo algorithms. Let [n] denote the set \(\{1, 2, \ldots , n\}\). Define \(f(\theta ) = \sum _{i=1}^n f_i(\theta )-\log \Pr (\theta )\), where \(f_i(\theta ) = -\log \Pr (x_i|\theta )\) and \(i \in [n]\). Similar to Dubey et al. (2016), we assume that each \(f_i\) is L-smooth and G-Lipschitz, for all \(i\in [n]\).

The general algorithmic framework maintains two sequences for \(t = 0, 1, \ldots , T-1\) by the following discrete time procedure:

$$\begin{aligned} p_{t+1}&= (1-Dh)p_t -h\tilde{\nabla }_t + \sqrt{2Dh}\cdot \xi _t \end{aligned}$$
(1)
$$\begin{aligned} \theta _{t+1}&= \theta _t + hp_{t+1} \end{aligned}$$
(2)

and then returns the samples \(\{\theta _1, \theta _2, \ldots , \theta _T\}\) as an approximation to the stationary distribution \(\Pr (\theta |X)\). Here \(\theta _t\) is the parameter we wish to sample and \(p_t\) is an auxiliary variable conventionally called the “momentum”. Moreover, h is the step size, D is a constant independent of \(\theta \) and p, \(\xi _t\sim \mathrm {N}(0, I_d)\), and \(\tilde{\nabla }_t\) is a mini-batch approximation of the full gradient \(\nabla f(\theta _t)\). If we set \(\tilde{\nabla }_t = \frac{n}{b}\sum _{i\in I} \nabla f_i(\theta _t)\), where I is a b-element index set drawn uniformly at random (with replacement) from \(\{1, 2, \ldots , n\}\) as introduced in Robbins and Monro (1951), then the algorithm becomes SGHMC.
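To make the recursion concrete, the following NumPy sketch implements updates (1)–(2) with the Robbins–Monro mini-batch gradient; it is only an illustration under the assumption that the caller supplies a per-example gradient oracle grad_f_i for \(\nabla f_i\), and the prior gradient \(-\nabla \log \Pr (\theta )\) can be folded in exactly as in the variance-reduced estimators introduced later.

```python
import numpy as np

def sghmc(grad_f_i, n, d, h, D, b, T, seed=0):
    """Minimal SGHMC sketch following updates (1)-(2).

    grad_f_i(theta, i) is assumed to return the gradient of
    f_i(theta) = -log Pr(x_i | theta) at theta.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)          # theta_0 = 0 (as in our analysis)
    p = np.zeros(d)              # momentum p_0 = 0
    samples = []
    for t in range(T):
        I = rng.integers(0, n, size=b)                            # mini-batch, with replacement
        g_tilde = (n / b) * sum(grad_f_i(theta, i) for i in I)    # stochastic gradient of f
        xi = rng.standard_normal(d)                               # xi_t ~ N(0, I_d)
        p = (1 - D * h) * p - h * g_tilde + np.sqrt(2 * D * h) * xi   # update (1)
        theta = theta + h * p                                          # update (2)
        samples.append(theta.copy())
    return samples
```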

The above discrete time procedure provides an approximation to the continuous Hamiltonian Monte Carlo diffusion process \((\theta , p)\):

$$\begin{aligned} \mathrm {d}\theta&= p\mathrm {d}t \end{aligned}$$
(3)
$$\begin{aligned} \mathrm {d}p&= -\nabla _\theta f(\theta ) \mathrm {d}t - Dp \mathrm {d}t+ \sqrt{2D} \mathrm {d}W \end{aligned}$$
(4)

Here W is a Wiener process. According to Chen et al. (2015), the stationary joint distribution of \((\theta , p)\) is \(\pi (\theta ,p) \propto e^{-f(\theta ) - \frac{p^\intercal p}{2}}\).

How do we evaluate the quality of the samples \(\{\theta _1, \theta _2, \ldots , \theta _T\}\)? Assuming \(\phi : \mathbb {R}^d\rightarrow \mathbb {R}\) is a smooth test function, we wish to upper bound the Mean-Squared Error (MSE) \(\mathbb {E}(\hat{\phi } - \bar{\phi })^2\), where \(\hat{\phi } = \frac{1}{T}\sum _{t=1}^T \phi (\theta _t)\) is the empirical average and \(\bar{\phi } = \mathbb {E}_{\theta \sim \Pr (\theta \mid X)}\phi (\theta )\) is the population average. So the objective of our algorithms is to carefully design the stochastic approximation \(\tilde{\nabla }_t\) of \(\nabla f(\theta _t)\) so that \(\mathbb {E}(\hat{\phi } - \bar{\phi })^2\) decreases as quickly as possible.

To study how the choice of \(\tilde{\nabla }_t\) influences the value of \(\mathbb {E}(\hat{\phi } - \bar{\phi })^2\), define \(\psi (\theta , p)\) to be the solution to the Poisson equation \(\mathcal {L}\psi = \phi (\theta ) - \bar{\phi }\), \(\mathcal {L}\) being the generator of Hamiltonian Monte Carlo diffusion process. In order to analyze the theoretical convergence results related to the MSE \(\mathbb {E}(\hat{\phi } - \bar{\phi })^2\), we inherit the following assumption from Chen et al. (2015).

Assumption 1

(Chen et al. 2015) Function \(\psi \) is bounded up to 3rd-order derivatives by some real-valued function \(\varGamma (\theta , p)\), i.e., \(\Vert \mathcal {D}^k \psi \Vert \le C_k \varGamma ^{q_k}\), where \(\mathcal {D}^k\) denotes the kth-order derivative, for \(k = 0, 1, 2, 3\) and \(C_k, q_k > 0\). Furthermore, the expectation of \(\varGamma \) on \(\{(\theta _t, p_t) \}\) is bounded, i.e., \(\sup _t \mathbb {E}[\varGamma ^q(\theta _t, p_t)] < \infty \), and \(\varGamma \) is smooth in the sense that \(\sup _{s\in (0, 1)}\varGamma ^q(s\theta + (1-s)\theta ^\prime , s p+ (1-s)p^\prime )\le C(\varGamma ^q(\theta , p) + \varGamma ^q(\theta ^\prime , p^\prime ))\) for all \(\theta , p, \theta ^\prime , p^\prime \) and \(q\le \max _k 2q_k\), for some constant \(C > 0\).

Define operator \(\varDelta V_t = (\tilde{\nabla }_t - \nabla f(\theta _t))\cdot \nabla \) for all \(t = 0, 1, 2, \ldots , T-1\). When the above assumption holds, we have the following theorem by Chen et al. (2015).

For the rest of this paper, for any two values \(A, B > 0\), we say \(A\lesssim B\) if \(A = O(B)\), where the notation \(O(\cdot )\) only hides a constant factor independent of the algorithm’s parameters T, n, D, h, G, b.

Theorem 1

(Chen et al. 2015) Let \(\tilde{\nabla }_t\) be an unbiased estimate of \(\nabla f(\theta _t)\) for all t. Then under Assumption 1, for a smooth test function \(\phi \), the MSE of SGHMC is bounded in the following way:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2\lesssim \frac{\frac{1}{T}\sum _{t=0}^{T-1}\mathbb {E}(\varDelta V_t \psi (\theta _t, p_t))^2}{T} + \frac{1}{Th} + h^2 \end{aligned}$$
(5)

Similar to the [A2] assumption in Dubey et al. (2016), we also need the following assumption, which relates \(\varDelta V_t\psi (\theta , p)\) to the difference \(\Vert \tilde{\nabla }_t- \nabla f(\theta _t)\Vert ^2\).

Assumption 2

\((\varDelta V_t \psi (\theta _t, p_t))^2\lesssim \Vert \tilde{\nabla }_t - \nabla f(\theta _t)\Vert ^2\) for all \(0\le t < T\).

Combined with Theorem 1, Assumption 2 immediately yields the following corollary.

Corollary 1

Under Assumptions 1 and 2, we have:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2\lesssim \frac{\frac{1}{T}\sum _{t=0}^{T-1}\mathbb {E}\Vert \tilde{\nabla }_t - \nabla f(\theta _t)\Vert ^2}{T} + \frac{1}{Th} + h^2 \end{aligned}$$

As we mentioned before, if we take \(\tilde{\nabla }_t\) to be the Robbins and Monro approximation of \(\nabla f(\theta _t)\) (Robbins and Monro 1951), then the algorithm becomes SGHMC, and the following corollary holds since all \(f_i\)’s are G-Lipschitz and \(\mathbb {E}(X - \mathbb {E}X)^2\le \mathbb {E}X^2\) for any random variable X.

Corollary 2

Under Assumptions 1 and 2, the MSE of SGHMC is bounded as:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2&\lesssim \frac{\frac{1}{T}\sum _{t=0}^{T-1}\mathbb {E}\Vert \tilde{\nabla }_t - \nabla f(\theta _t)\Vert ^2}{T} + \frac{1}{Th} + h^2 \nonumber \\&\le \frac{n^2 G^2}{bT} + \frac{1}{Th} + h^2 \end{aligned}$$
(6)

Variance reduction for Hamiltonian Monte Carlo

In this section, we introduce two versions of variance-reduced Hamiltonian Monte Carlo based on SVRG (Johnson and Zhang 2013) and SAGA (Defazio et al. 2014) respectively.

SVRG-HMC

In this subsection, we propose the SVRG-HMC algorithm (see Algorithm 1), which is based on the SVRG algorithm. As can be seen from Line 8 of Algorithm 1, we use \(\tilde{\nabla }_{tK + k} = -\nabla \log \Pr (\theta _{tK + k}) + \frac{n}{b}\sum _{i\in I}\big (\nabla f_i(\theta _{tK + k})- \nabla f_i(w)\big ) + g\) as the stochastic estimate of the full gradient \(\nabla f(\theta _{tK + k})\), where \(w = \theta _{tK}\) is the snapshot parameter of the current epoch and \(g = \sum _{i=1}^n \nabla f_i(w)\).

Note that we initialize \(\theta _0, p_0\) to be zero vectors in the algorithm only to simplify the theoretical analysis. It would still work with an arbitrary initialization.

(Algorithm 1: SVRG-HMC)
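Since the algorithm figure is not reproduced here, the following Python sketch illustrates how the estimator of Line 8 is used inside the updates (1)–(2); the oracles grad_f_i and grad_log_prior, as well as the hyperparameter values, are assumptions made for illustration and not the paper's reference implementation.

```python
import numpy as np

def svrg_hmc(grad_f_i, grad_log_prior, n, d, h, D, b, K, T_epochs, seed=0):
    """Sketch of SVRG-HMC: T_epochs outer loops, each with K inner steps."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    p = np.zeros(d)
    samples = []
    for t in range(T_epochs):
        w = theta.copy()                                   # snapshot theta_{tK}
        g = sum(grad_f_i(w, i) for i in range(n))          # full likelihood gradient at the snapshot
        for k in range(K):
            I = rng.integers(0, n, size=b)
            # Line 8: variance-reduced stochastic gradient
            g_tilde = (-grad_log_prior(theta)
                       + (n / b) * sum(grad_f_i(theta, i) - grad_f_i(w, i) for i in I)
                       + g)
            xi = rng.standard_normal(d)
            p = (1 - D * h) * p - h * g_tilde + np.sqrt(2 * D * h) * xi   # momentum update
            theta = theta + h * p
            samples.append(theta.copy())
    return samples
```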

The following theorem shows the convergence result for MSE of SVRG-HMC (Algorithm 1). We defer all the proofs to Appendix B.

Theorem 2

Under Assumptions 1 and 2, the MSE of SVRG-HMC is bounded as:

$$\begin{aligned} \mathbb {E}[(\hat{\phi } - \bar{\phi })^2] \lesssim \min \left\{ \frac{n^2G^2}{bT}, \frac{L^2 n^2 K^2h^2}{bT}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}}\right) ^2\right\} + \frac{1}{Th} + h^2 \end{aligned}$$
(7)

To see how SVRG-HMC (Algorithm 1) compares with SVRG-LD (Dubey et al. 2016), we restate their result as follows.

Theorem 3

(Dubey et al. 2016) Under Assumptions 1 and 2, the MSE of SVRG-LD is bounded as:

$$\begin{aligned} \mathbb {E}[(\hat{\phi } - \bar{\phi })^2] \lesssim \frac{\min \{n^2G^2, n^2K^2(n^2L^2h^2G^2 + hd) \}}{bT} + \frac{1}{Th} + h^2 \end{aligned}$$
(8)

We assume \(n^2G^2 > n^2K^2(n^2L^2 h^2G^2 + hd)\); otherwise the MSE upper bound of SVRG-LD would be equal to that of SGHMC [see (6) and (8)]. We then omit the identical terms (i.e., the second and third terms) in the RHS of (7) and (8), and obtain the following lemma.

Lemma 1

Let \(\mathsf {R}^{\mathsf {HMC}}= \frac{L^2 n^2 K^2h^2}{bT}\big (\frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}}\big )^2\) [i.e., the first term in the RHS of (7)] and \(\mathsf {R}^{\mathsf {LD}}= \frac{n^2K^2(n^2L^2h^2G^2 + hd)}{bT}\) [i.e., the first term in the RHS of (8)]. Then the following inequality holds.

$$\begin{aligned} \mathsf {R}^{\mathsf {HMC}}\le \max \left\{ \frac{1}{D^2}, \frac{L}{nK}\right\} \mathsf {R}^{\mathsf {LD}} \end{aligned}$$
(9)

In particular, if \(D\ge {1}/{L\sqrt{h}}\), then (9) becomes:

$$\begin{aligned} \mathsf {R}^{\mathsf {HMC}}\le \frac{L}{nK} \mathsf {R}^{\mathsf {LD}} \end{aligned}$$
(10)

Note that K is suggested to be 2n by Johnson and Zhang (2013) or n/b by Dubey et al. (2016). We obtain the following corollary from Lemma 1.

Corollary 3

If K is n / b as suggested by Dubey et al. (2016), then (10) becomes:

$$\begin{aligned} \mathsf {R}^{\mathsf {HMC}}\le \frac{bL}{n^2} \mathsf {R}^{\mathsf {LD}} \end{aligned}$$
(11)

In other words, SVRG-HMC is \(O(\frac{n^2}{bL})\) times faster than SVRG-LD in terms of the convergence bound related to the variance reduction [i.e., the first terms in the RHS of (7) and (8)]. Note that n is the size of the dataset, which can be very large; b is the mini-batch size, which is usually a small constant; and L is the Lipschitz smoothness parameter of the \(f_i(\theta )\)’s.

We also want to mention that the convergence proof for SVRG-HMC (i.e., Theorem 2) is somewhat more difficult than that for SVRG-LD (Dubey et al. 2016) due to the momentum variable p (see Line 9 of Algorithm 1). Concretely, the main part of the proof in both SVRG-LD and SVRG-HMC is to bound the variance, and in both cases the variance can be bounded by the adjacent distances \(\{\Vert {\theta _t-\theta _{t-1}}\Vert ^2\}\). In SVRG-LD (Dubey et al. 2016), each \(\Vert {\theta _t-\theta _{t-1}}\Vert ^2\) for \(t\in [T]\) can be bounded directly. However, due to the momentum variable p in our SVRG-HMC, the distances \(\{\Vert {\theta _t-\theta _{t-1}}\Vert ^2\}_{t\in [T]}\) are more strongly correlated, so we cannot bound each \(\Vert {\theta _t-\theta _{t-1}}\Vert ^2\) independently as in SVRG-LD. Instead, we bound the variance as a whole, i.e., we bound the summation \(\sum _{t\in [T]}\Vert {\theta _t-\theta _{t-1}}\Vert ^2\). This yields a quadratic inequality due to the correlation among \(\{\Vert {\theta _t-\theta _{t-1}}\Vert ^2\}_{t\in [T]}\), which we then solve to bound the variance.

SAGA-HMC

In this subsection, we propose the SAGA-HMC algorithm by applying the SAGA framework (Defazio et al. 2014) to the Hamiltonian Monte Carlo. The details are described in Algorithm 2. Similar to the SVRG-HMC, we initialize \(\theta _0, p_0\) to be zero vectors in the algorithm only to simplify the analysis; it would still work with an arbitrary initialization.

(Algorithm 2: SAGA-HMC)
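As with SVRG-HMC, the following minimal Python sketch illustrates the SAGA-style estimator, which keeps a table of the most recently computed gradient of every \(f_i\) (the source of the higher memory cost mentioned later); the oracles grad_f_i and grad_log_prior and the exact bookkeeping order are illustrative assumptions rather than a verbatim transcription of Algorithm 2.

```python
import numpy as np

def saga_hmc(grad_f_i, grad_log_prior, n, d, h, D, b, T, seed=0):
    """Sketch of SAGA-HMC: reuse a table of stale per-example gradients."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    p = np.zeros(d)
    g_table = np.stack([grad_f_i(theta, i) for i in range(n)])  # gradients at alpha^i = theta_0
    g_sum = g_table.sum(axis=0)                                 # running sum over the table
    samples = []
    for t in range(T):
        I = rng.integers(0, n, size=b).tolist()
        fresh = {i: grad_f_i(theta, i) for i in set(I)}         # gradients at the current theta
        g_tilde = (-grad_log_prior(theta)
                   + (n / b) * sum(fresh[i] - g_table[i] for i in I)
                   + g_sum)
        xi = rng.standard_normal(d)
        p = (1 - D * h) * p - h * g_tilde + np.sqrt(2 * D * h) * xi
        theta = theta + h * p
        for i, g_new in fresh.items():                          # refresh the touched table entries
            g_sum = g_sum - g_table[i] + g_new
            g_table[i] = g_new
        samples.append(theta.copy())
    return samples
```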

The following theorem shows the convergence result for MSE of SAGA-HMC. The proof is deferred to Appendix B.

Theorem 4

Under Assumptions 1 and 2, the MSE of SAGA-HMC is bounded as:

$$\begin{aligned} \mathbb {E}[(\hat{\phi } - \bar{\phi })^2] \lesssim \min \left\{ \frac{n^2G^2}{bT}, \frac{L^2 n^4h^2 }{T b^3}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^4h^3b^{-3}}\right) ^2\right\} + \frac{1}{Th} + h^2 \end{aligned}$$

Note that SAGA-HMC can be compared with SAGA-LD (Dubey et al. 2016) in a very similar manner to SVRG-HMC (i.e., Lemma 1 and Corollary 3). Thus we omit such a repetition.

Variance-reduced SGHMC with symmetric splitting

Symmetric splitting is a numerically efficient method introduced in Leimkuhler and Shang (2016) to accelerate gradient-based algorithms. We note that one additional advantage of SGHMC over SGLD is that SGHMC can be combined with symmetric splitting while SGLD cannot (Chen et al. 2015). So it is quite natural to combine symmetric splitting with the proposed SVRG-HMC and SAGA-HMC respectively to see if any further improvements can be obtained.

The symmetric splitting scheme breaks the original recursion into 5 steps:

$$\begin{aligned} \theta _t^{(1)}&= \theta _t + \frac{h}{2}p_t \end{aligned}$$
(12)
$$\begin{aligned} p_t^{(1)}&= e^{-Dh/2}p_t\end{aligned}$$
(13)
$$\begin{aligned} p_t^{(2)}&= p_t^{(1)} -h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t \end{aligned}$$
(14)
$$\begin{aligned} p_{t+1}&= e^{-Dh/2}p_t^{(2)}\end{aligned}$$
(15)
$$\begin{aligned} \theta _{t+1}&= \theta _t^{(1)} + \frac{h}{2}p_{t+1} \end{aligned}$$
(16)

If we eliminate the intermediate variables, then

$$\begin{aligned} p_{t+1}&= e^{-Dh/2} \left( e^{-Dh/2}p_t - h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t\right) \end{aligned}$$
(17)
$$\begin{aligned} \theta _{t+1}&= \theta _t + \frac{h}{2} p_{t+1} + \frac{h}{2}p_t \end{aligned}$$
(18)

As before, \(\xi _t\sim \mathrm {N}(0, I_d)\). Note that the stochastic gradient \(\tilde{\nabla }_t\) is computed at \(\theta _t^{(1)}\) (which is \(\theta _t + \frac{h}{2}p_t\)) instead of \(\theta _t\) [see (1), (12) and (14)]. As shown in Chen et al. (2015), this symmetric splitting scheme is a 2nd-order local integrator. It therefore improves the dependency of the MSE on the step size h: the third term in the RHS of (5) becomes \(h^4\), a higher-order term than the original \(h^2\). This means that we can allow a larger step size h when using the symmetric splitting scheme (note that \(h<1\)).
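A minimal sketch of one step of the collapsed updates (17)–(18) is given below; g_tilde_at stands for any of the stochastic gradient estimators above and is an assumed placeholder, evaluated at \(\theta _t + \frac{h}{2}p_t\) as required.

```python
import numpy as np

def splitting_step(theta, p, g_tilde_at, h, D, rng):
    """One symmetric-splitting step, i.e., updates (17)-(18).

    g_tilde_at(x) is assumed to return a stochastic estimate of the
    gradient of f at x; note it is evaluated at theta + (h/2) * p.
    """
    g_tilde = g_tilde_at(theta + 0.5 * h * p)
    xi = rng.standard_normal(theta.shape[0])
    p_new = np.exp(-D * h / 2) * (np.exp(-D * h / 2) * p
                                  - h * g_tilde
                                  + np.sqrt(2 * D * h) * xi)    # update (17)
    theta_new = theta + 0.5 * h * (p_new + p)                   # update (18)
    return theta_new, p_new
```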

Similarly, we can further improve the convergence results for SVRG/SAGA-HMC by combining the symmetric splitting scheme. We give the details of the algorithms and theoretical results for SVRG2nd-HMC and SAGA2nd-HMC in the following subsections.

SVRG2nd-HMC

In this subsection, we propose the SVRG2nd-HMC algorithm (see Algorithm 3) by combining our SVRG-HMC (Algorithm 1) with the symmetric splitting scheme.

(Algorithm 3: SVRG2nd-HMC)

The convergence result for SVRG2nd-HMC is provided in Theorem 5. It shows that the dependency of MSE on step size h can be improved from \(h^2\) to \(h^4\) [see (7) and (19)].

Theorem 5

Under Assumptions 1 and 2, the MSE of SVRG2nd-HMC is bounded as:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2\lesssim \min \left\{ \frac{n^2G^2}{bT}, \frac{L^2 n^2 K^2h^2}{bT}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}}\right) ^2\right\} + \frac{1}{Th} + h^4 \end{aligned}$$
(19)

SAGA2nd-HMC

In this subsection, we propose the SAGA2nd-HMC algorithm (see Algorithm 4) by combining our SAGA-HMC (Algorithm 2) with the symmetric splitting scheme. The convergence result and algorithm details are described below.

Theorem 6

Under Assumptions 1 and 2, the MSE of SAGA2nd-HMC is bounded as:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2\lesssim \min \left\{ \frac{n^2G^2}{bT}, \frac{L^2 n^4h^2 }{Tb^3}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^4h^3b^{-3}}\right) ^2\right\} + \frac{1}{Th} + h^4 \end{aligned}$$
(Algorithm 4: SAGA2nd-HMC)

Experiment

We present experimental results in this section. We compare the proposed SVRG-HMC (Algorithm 1), as well as its symmetric splitting variant SVRG2nd-HMC (Algorithm 3), against SVRG-LD (Dubey et al. 2016) on Bayesian regression, Bayesian classification and Bayesian neural networks. The experimental results of the SAGA variants (Algorithms 2 and 4) are almost the same as those of the SVRG variants; we report the corresponding SAGA experiments in Appendix A. In accordance with the theoretical analysis, all algorithms use a fixed step size h, and all HMC-based algorithms use a fixed friction parameter D; a grid search is performed to select the best step size and friction parameter for each algorithm. The mini-batch size b is chosen to be 10 (the same as for SVRG/SAGA-LD, Dubey et al. 2016) for all algorithms, and K is set to n/b.

The experiments are conducted on real-world UCI datasets.Footnote 1 The datasets used in our experiments are summarized in Tables 1 and 2 (Sect. 5.3). For each dataset (regression or classification), we partition the data into training (70%), validation (10%) and test (20%) sets. The validation set is used to select the step size, as well as the friction parameter for HMC-based algorithms, in an eightfold manner.

Table 1 Summary of standard UCI datasets for Bayesian regression and classification
Table 2 Summary of larger UCI datasets for Bayesian neural networks experiments

Bayesian regression

In this subsection we study the performance of the aforementioned algorithms on Bayesian linear regression. Suppose we are provided with inputs \(Z = \{(x_i, y_i)\}_{i=1}^n\), where \(x_i\in \mathbb {R}^d\) and \(y_i\in \mathbb {R}\). The distribution of \(y_i\) given \(x_i\) is modelled as \(\Pr (y_i|x_i) = \mathrm {N}(\beta ^\intercal x_i, \sigma ^2)\), where the unknown parameter \(\beta \) follows a prior distribution \(\mathrm {N}(0, I_d)\). The gradients of the log-likelihood and log-prior can thus be calculated as \(\nabla _\beta \log \Pr (y_i| x_i, \beta ) = (y_i - \beta ^\intercal x_i)x_i\) and \(\nabla _\beta \log \Pr (\beta ) = -\beta \). The average test Mean-Squared Error (MSE) is reported in Fig. 1.
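For illustration, the gradient oracles used by the samplers sketched above can be instantiated for this model as follows; the arrays x and y stand for a training split of one of the UCI datasets, and the noise variance sigma2 is an assumed hyperparameter (the formulas above correspond to \(\sigma ^2 = 1\)).

```python
import numpy as np

def make_regression_grads(x, y, sigma2=1.0):
    """x: (n, d) design matrix, y: (n,) targets -- placeholders for a dataset."""
    def grad_f_i(beta, i):
        # f_i(beta) = -log Pr(y_i | x_i, beta), so
        # grad f_i(beta) = -(y_i - beta^T x_i) x_i / sigma2
        return -(y[i] - x[i] @ beta) * x[i] / sigma2

    def grad_log_prior(beta):
        # beta ~ N(0, I_d)  =>  grad log Pr(beta) = -beta
        return -beta

    return grad_f_i, grad_log_prior
```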

As can be observed from Fig. 1, SVRG-HMC, as well as its symmetric splitting counterpart SVRG2nd-HMC, converges markedly faster than SVRG-LD within the first pass through the whole dataset. The performance of SVRG2nd-HMC is usually similar to (no worse than) that of SVRG-HMC, and it turns out that a slightly larger step size can be chosen for SVRG2nd-HMC, which is also consistent with our theoretical results (i.e., a larger step size is allowed).

Fig. 1

Performance comparison of SVRG variants on Bayesian regression tasks. The x-axis and y-axis represent number of passes through the entire training dataset and average test MSE respectively. For the bike dataset, we have omitted the first 10 MSE values from the diagram because otherwise the diagram would scale badly as MSE values are very large in the first several iterations

Bayesian classification

In this subsection we study classification tasks using Bayesian logistic classification. Suppose we are given input data \(Z = \{(x_i, y_i)\}\), where \(x_i\in \mathbb {R}^d\) and \(y_i\in \{0, 1\}\). The distribution of the output \(y_i\) is modelled as \(\Pr (y_i = 1) = 1 / (1 + \exp (-\beta ^\intercal x_i))\), where the model parameter \(\beta \) follows a prior distribution \(\mathrm {N}(0, I_d)\). Then the gradients of the log-likelihood and log-prior can be written as \(\nabla _\beta \log \Pr (y_i| x_i, \beta ) = \big (y_i - 1/(1 + \exp (-\beta ^\intercal x_i))\big )x_i\) and \(\nabla _\beta \log \Pr (\beta ) = -\beta \). The average test log-likelihood is reported in Fig. 2.
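The gradient oracles for the logistic model can be sketched analogously (again, x and y are placeholders for a training split of a UCI dataset).

```python
import numpy as np

def make_classification_grads(x, y):
    """x: (n, d) features, y: (n,) labels in {0, 1}."""
    def grad_f_i(beta, i):
        # grad log Pr(y_i | x_i, beta) = (y_i - sigmoid(beta^T x_i)) x_i,
        # and f_i = -log Pr(y_i | x_i, beta), hence the minus sign.
        sig = 1.0 / (1.0 + np.exp(-(x[i] @ beta)))
        return -(y[i] - sig) * x[i]

    def grad_log_prior(beta):
        return -beta    # N(0, I_d) prior

    return grad_f_i, grad_log_prior
```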

Similar to the Bayesian regression results, SVRG-HMC, as well as its symmetric splitting counterpart SVRG2nd-HMC, converges markedly faster than SVRG-LD on the Bayesian classification tasks. Also, the experimental results suggest that SVRG2nd-HMC converges more quickly than SVRG-HMC, which is consistent with Theorems 2 and 5.

In sum, among our four algorithms, we recommend SVRG2nd/SAGA2nd-HMC due to their better theoretical results (Theorems 2, 4, 5 and 6) and practical performance (Figs. 1, 2, 4, 5) compared with SVRG/SAGA-HMC. Furthermore, we recommend SVRG2nd-HMC over SAGA2nd-HMC, since SAGA2nd-HMC incurs a high memory cost and its implementation is slightly more complicated.

Fig. 2

Performance comparison of SVRG variants on Bayesian classification tasks. The x-axis and y-axis represent number of passes through the entire training dataset and average test log-likelihood respectively

Fig. 3

Performance comparison of vanilla SGHMC, SVRG-LD, SVRG-HMC, SVRG2nd-HMC on regression tasks using Bayesian neural networks. The x-axis and y-axis represent number of passes through the entire training dataset and average test RMSE respectively

Fig. 4

Performance comparison of vanilla SGHMC, SVRG-LD, SVRG-HMC, SVRG2nd-HMC on classification tasks using Bayesian neural networks. The x-axis and y-axis represent number of passes through the entire training dataset and average test log-likelihood respectively

Bayesian neural networks

To show the scalability of variance-reduced HMC to larger datasets and its applicability to nonconvex problems and more complicated models, we study Bayesian neural network tasks. In our experiments, the model is a neural network with one hidden layer of 50 hidden units (100 hidden units for the 'susy' dataset) with ReLU activation, denoted by \(f_{NN}\). Its unknown parameter \(\beta \) follows a prior distribution \(\mathrm {N}(0, \sigma _p^2 I_d)\). Let \(t_i=f_{NN}(x_i, \beta )\) denote the output of the neural network with parameter value \(\beta \) and input \(x_i\). The experiments are conducted on the larger UCI regression and classification datasets described in Table 2. Suppose we are provided with inputs \(Z = \{(x_i, y_i)\}_{i=1}^n\), where \(x_i\in \mathbb {R}^d\). In regression tasks, \(y_i\in \mathbb {R}\), \(t_i\in \mathbb {R}\), and the distribution of \(y_i\) given \(x_i\) is modelled as \(\Pr (y_i|x_i) = \mathrm {N}(t_i, \sigma _l^2)\). In binary classification tasks, \(y_i\in \{0,1\}\), \(t_i\in \mathbb {R}\), and \(\Pr (y_i = 1) = 1 / (1 + \exp (-t_i))\). In K-class classification tasks (\(K\ge 3\)), \(y_i\in \{1,2,\ldots ,K\}\), \(t_i\in \mathbb {R}^K\), and \(\Pr (y_i = c) = \exp (t_{ic}) / \sum _{m=1}^K\exp (t_{im})\). The code for the experiments is implemented in TensorFlow. We conduct experiments for vanilla SGHMC and the SVRG variants of LD and HMC. The test Root-Mean-Square Error (RMSE) for regression tasks is reported in Fig. 3, and the average test log-likelihood for classification tasks is reported in Fig. 4.

The experimental results show that SVRG/SVRG2nd-HMC outperforms vanilla SGHMC and SVRG-LD, often by a significant gap. In particular, this means that the variance reduction technique indeed helps the convergence of SGHMC, i.e., the performance gap between SVRG/SVRG2nd-HMC and SVRG-LD in Figs. 1, 2, 3 and 4 does not come only from the superiority of HMC over LD. Similar to Sects. 5.1 and 5.2, the performance of SVRG2nd-HMC is usually similar to (no worse than) that of SVRG-HMC, and our experiments found that sometimes a slightly larger step size can be chosen for SVRG2nd-HMC (while the same step size drives SVRG-HMC to NaN), which is also consistent with our theoretical results in Theorem 5.

Conclusion

In this paper, we propose four variance-reduced Hamiltonian Monte Carlo algorithms, namely SVRG-HMC, SAGA-HMC, SVRG2nd-HMC and SAGA2nd-HMC, for Bayesian inference. The proposed algorithms guarantee improved theoretical convergence results and converge markedly faster than the benchmarks (vanilla SGHMC and SVRG/SAGA-LD) in practice. In conclusion, SVRG2nd/SAGA2nd-HMC are preferable to SVRG/SAGA-HMC according to our theoretical and experimental results. We would like to note that our variance-reduced Hamiltonian Monte Carlo samplers are not Markovian procedures; fortunately, our theoretical analysis does not rely on any properties of Markov processes, so this does not affect the correctness of Theorems 2, 4, 5 and 6.

For future work, it would be interesting to study whether our analysis can be applied to vanilla SGHMC without variance reduction; to the best of our knowledge, there is no existing work proving that SGHMC is better than SGLD. On the other hand, we note that the stochastic gradient thermostat (Ding et al. 2014) can outperform both SGLD and SGHMC. It might be interesting to study whether a variance-reduced variant of the stochastic gradient thermostat could also beat SVRG-LD and SVRG/SVRG2nd-HMC, both theoretically and experimentally.

Notes

  1. The UCI datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html.

References

  1. Ahn, S., Korattikara, A., & Welling, M. (2012). Bayesian posterior sampling via stochastic gradient fisher scoring. In Proceedings of the 29th international conference on machine learning (pp. 1771–1778).

  2. Ahn, S., Shahbaba, B., & Welling, M. (2014). Distributed stochastic gradient MCMC. In International conference on machine learning (pp. 1044–1052).

  3. Allen-Zhu, Z., & Hazan, E. (2016). Variance reduction for faster non-convex optimization. In International conference on machine learning (pp. 699–707).

  4. Byrne, S., & Girolami, M. (2013). Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4), 825–845.


  5. Chen, C., Ding, N., & Carin, L. (2015). On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in neural information processing systems (pp. 2278–2286).

  6. Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International conference on machine learning (pp. 1683–1691).

  7. Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems (pp. 1646–1654).

  8. Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In Advances in neural information processing systems (pp. 3203–3211).

  9. Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.


  10. Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., & Xing, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. In Advances in neural information processing systems (pp. 1154–1162).

  11. Ge, R., Li, Z., Wang, W., & Wang, X. (2019). Stabilized SVRG: Simple variance reduction for nonconvex optimization. In Conference on learning theory.

  12. Girolami, M., & Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2), 123–214.


  13. Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems (pp. 315–323).

  14. Leimkuhler, B., & Shang, X. (2016). Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2), A712–A736.


  15. Li, Z. (2019). SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. arXiv preprint arXiv:1904.09265.

  16. Li, Z., & Li, J. (2018). A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in neural information processing systems (pp. 5569–5579).

  17. Liu, C., Zhu, J., & Song, Y. (2016). Stochastic gradient geodesic MCMC methods. In Advances in neural information processing systems (pp. 3009–3017).

  18. Ma, Y. A., Chen, T., & Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in neural information processing systems (pp. 2917–2925).

  19. Neal, R. M., et al. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 139–188.


  20. Patterson, S., & Teh, Y.W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in neural information processing systems (pp. 3102–3110).

  21. Reddi, S. J., Hefny, A., Sra, S., Póczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning (pp. 314–323).

  22. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.

  23. Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (pp. 681–688).

  24. Zou, D., Xu, P., & Gu, Q. (2018). Stochastic variance-reduced Hamilton Monte Carlo methods. arXiv preprint arXiv:1802.04791.


Acknowledgements

Funding was provided by National Basic Research Program of China (Grant No. 2015CB358700), National Natural Science Foundation of China (Grant Nos. 61772297, 61632016, 61761146003) and Microsoft Research Asia. We would like to thank Chang Liu for useful discussions.

Author information


Correspondence to Jian Li.

Additional information


Editors: Jesse Davis, Elisa Fromont, Bjorn Bringmann, Derek Greene.

Appendices

A SAGA experiments

In this appendix, we report the corresponding experimental results of SAGA variants (i.e., SAGA-LD, SAGA-HMC and SAGA2nd-HMC) for Bayesian regression and Bayesian classification tasks (Figs. 5, 6). The settings are the same as those in Sects. 5.1 and 5.2.

Fig. 5

Performance comparison of SAGA variants on Bayesian regression tasks

Fig. 6

Performance comparison of SAGA variants on Bayesian classification tasks

B proofs

In this appendix, we provide the detailed proofs for Corollary 2, Theorem 2, Lemma 1, and Theorem 4, 5 and 6.

B.1 Proof of Corollary 2

To prove this corollary, it is sufficient to show \(\mathbb {E}\Vert \tilde{\nabla }_t - \nabla f(\theta _t)\Vert ^2\le n^2G^2/b\). Recall that \(\tilde{\nabla }_t = \frac{n}{b}\sum _{i\in I} \nabla f_i(\theta _t)\), I being a b-element index set uniformly randomly drawn (with replacement) from \(\{1, 2, \ldots , n\}\) and \(\nabla f(\theta _t)=\sum _{j=1}^n \nabla f_j(\theta _t)\). Now, we prove this inequality as follows:

$$\begin{aligned} \mathbb {E}_I\Vert \tilde{\nabla }_t - \nabla f(\theta _t)\Vert ^2&=\mathbb {E}_I\left\| \frac{n}{b}\sum _{i\in I} \nabla f_i(\theta _t) -\sum _{j=1}^n \nabla f_j(\theta _t) \right\| ^2 \\&=n^2\mathbb {E}_I\left\| \frac{1}{b}\sum _{i\in I} \nabla f_i(\theta _t) -\frac{1}{n}\sum _{j=1}^n \nabla f_j(\theta _t) \right\| ^2 \\&=n^2\mathbb {E}_I\left\| \frac{1}{b}\sum _{i\in I} \left( \nabla f_i(\theta _t) -\frac{1}{n}\sum _{j=1}^n \nabla f_j(\theta _t)\right) \right\| ^2 \\&=\frac{n^2}{b^2}\mathbb {E}_I\left\| \sum _{i\in I} \left( \nabla f_i(\theta _t) -\frac{1}{n}\sum _{j=1}^n \nabla f_j(\theta _t)\right) \right\| ^2 \\&=\frac{n^2}{b}\mathbb {E}_i\left\| \nabla f_i(\theta _t) -\frac{1}{n}\sum _{j=1}^n \nabla f_j(\theta _t) \right\| ^2 \\&\le \frac{n^2}{b}\mathbb {E}_i\left\| \nabla f_i(\theta _t)\right\| ^2 \\&\le \frac{n^2 G^2}{b} \end{aligned}$$

where the last equality uses that the b summands are i.i.d. with zero mean, and the two inequalities hold since \(\mathbb {E}(X - \mathbb {E}X)^2\le \mathbb {E}X^2\) for any random variable X and each \(f_i\) is G-Lipschitz. \(\square \)

B.2 Proof of Theorem 2

According to Corollary 1, we have:

$$\begin{aligned} \mathbb {E}[(\hat{\phi } - \bar{\phi })^2]\lesssim \frac{1}{T^2}\sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2] + \frac{1}{Th} + h^2 \end{aligned}$$
(20)

where \(\varDelta _t = \tilde{\nabla }_{t} - \nabla f(\theta _t)\) is the additive error in estimating the full gradient \(\nabla f(\theta _t)\). To analyze the effect of the variance reduction technique, we need to upper bound the summation \(\sum _{t=0}^{T-1}\mathbb {E}[\Vert \varDelta _t\Vert ^2]\).

Unpacking the definition of \(\varDelta _t\) and \(\tilde{\nabla }_t\), we have:

$$\begin{aligned}&\sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2] \nonumber \\&\quad = \sum _{t=0}^{T-1} \mathbb {E}\left[ \left\| -\nabla \log \Pr (\theta _t) + \frac{n}{b}\sum _{i\in I}\left( \nabla f_i(\theta _t) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) + g - \nabla f(\theta _t) \right\| ^2\right] \nonumber \\&\quad = \sum _{t=0}^{T-1} n^2\mathbb {E}\left[ \left\| \frac{1}{b}\sum _{i\in I}\left( \nabla f_i(\theta _t) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) \right. \right. \nonumber \\&\qquad \left. \left. - \frac{1}{n}\sum _{j=1}^n \left( \nabla f_j(\theta _t) - \nabla f_j\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) \right\| ^2\right] \nonumber \\&\quad \le \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}\left\| \nabla f_i(\theta _t) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right\| ^2 \end{aligned}$$
(21)

Inequality (21) follows from \(\mathbb {E}(X - \mathbb {E}X)^2\le \mathbb {E}X^2\) for any random variable X. In the rightmost summation, the index i is picked uniformly at random from \([n] = \{1, 2, \ldots , n\}\).

Then, we bound the RHS of (21) as follows:

$$\begin{aligned} \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}\left\| \nabla f_i(\theta _t) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right\| ^2&\le \frac{n^2}{b}\sum _{t=0}^{T-1}L^2\mathbb {E}\Vert \theta _t - \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\Vert ^2 \nonumber \\&\le \frac{L^2 n^2}{b}\sum _{t=0}^{T-1} K\sum _{j=\left\lfloor \frac{t}{K}\right\rfloor K }^{t-1}\mathbb {E}\Vert \theta _{j+1} - \theta _j\Vert ^2 \nonumber \\&\le \frac{L^2 n^2 K^2}{b}\sum _{t=0}^{T-1}\mathbb {E}\Vert \theta _{t+1} - \theta _t\Vert ^2 \end{aligned}$$
(22)

The first inequality is by L-smoothness of all \(f_i\)’s, and the second one is by Cauchy’s inequality.

By our algorithm, \(\Vert \theta _{t+1} - \theta _t\Vert ^2 = h^2\Vert p_{t+1}\Vert ^2\), so we need to upper bound \(\mathbb {E}\Vert p_{t+1}\Vert ^2\) for each \(0\le t<T\).

By the recursion of \(p_{t+1}\),

$$\begin{aligned}&\mathbb {E}\Vert p_{t+1}\Vert ^2 \\&\quad = \mathbb {E}\Vert (1-Dh)p_t - h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t\Vert ^2\\&\quad = \mathbb {E}\Vert (1-Dh)p_t - h\nabla f(\theta _t) - h\varDelta _t + \sqrt{2Dh}\xi _t\Vert ^2\\&\quad = \mathbb {E}\Vert (1-Dh)p_t - h\nabla f(\theta _t)\Vert ^2 + h^2\mathbb {E}[\Vert \varDelta _t\Vert ^2] + 2Dhd\\&\quad \le (1-Dh)^2\mathbb {E}\Vert p_t\Vert ^2 + 2Gnh(1-Dh)\sqrt{\mathbb {E}\Vert p_t\Vert ^2} + h^2n^2G^2 \\&\qquad + h^2\mathbb {E}[\Vert \varDelta _t\Vert ^2] + 2Dhd \end{aligned}$$

The third equality holds because \(\mathbb {E}[\varDelta _t] = \mathbb {E}[\xi _t] = 0\) and \(\varDelta _t\) and \(\xi _t\) are independent. The first inequality takes advantage of \(\Vert \nabla f\Vert \le nG\) and \(\mathbb {E}\Vert p_t\Vert \le \sqrt{\mathbb {E}\Vert p_t\Vert ^2}\).

Define \(S = \sum _{t=1}^T\mathbb {E}[\Vert p_t\Vert ^2]\). Then, taking a grand summation over \(t = 0, 1, \ldots , T-1\),

$$\begin{aligned} S&\le (1-Dh)^2 S + 2nGh(1-Dh)\sum _{t=0}^{T-1}\sqrt{\mathbb {E}\Vert p_t\Vert ^2}\\&\quad + T(h^2n^2G^2 + 2Dhd) + h^2\sum _{t=0}^{T-1}\mathbb {E}[\Vert \varDelta _t\Vert ^2]\\&\le (1-Dh)^2 S + 2nGh(1-Dh)\sqrt{T} \sqrt{S}\\&\quad + T(h^2n^2G^2 + 2Dhd) + \frac{L^2 n^2 K^2h^4}{b}S \end{aligned}$$

The second inequality again contains an implicit Cauchy’s inequality.

Rearranging the terms we have:

$$\begin{aligned} \left( 1 - (1-Dh)^2 - \frac{L^2n^2K^2h^4}{b}\right) \frac{S}{T} - 2nGh(1-Dh)\sqrt{\frac{S}{T}} - (h^2n^2G^2 + 2Dhd)\le 0 \end{aligned}$$

Solving a quadratic equation with respect to \(\sqrt{S / T}\) and ignoring constant factors, we have:

$$\begin{aligned} \sqrt{\frac{S}{T}} \lesssim \frac{nG + \sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}} \lesssim \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}} \end{aligned}$$
(23)
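For completeness, (23) follows from the elementary bound on the admissible values of a quadratic: if \(Ax^2 - Bx - C\le 0\) with \(A, B, C > 0\) and \(x = \sqrt{S/T}\ge 0\), then

$$\begin{aligned} x\le \frac{B + \sqrt{B^2 + 4AC}}{2A}\le \frac{B}{A} + \sqrt{\frac{C}{A}} \end{aligned}$$

Here \(A = 1 - (1-Dh)^2 - \frac{L^2n^2K^2h^4}{b}\), \(B = 2nGh(1-Dh)\) and \(C = h^2n^2G^2 + 2Dhd\) are read off from the preceding display; plugging them in and ignoring constant factors yields (23).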

From (21) and (22), we have:

$$\begin{aligned} \sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2]\le \frac{L^2 n^2 K^2}{b}\sum _{t=0}^{T-1}\mathbb {E}\Vert \theta _{t+1} - \theta _t\Vert ^2 \end{aligned}$$

Recalling that \(\Vert \theta _{t+1} - \theta _t\Vert ^2 = h^2\Vert p_{t+1}\Vert ^2\) and \(S=\sum _{t=1}^T\mathbb {E}[\Vert p_t\Vert ^2]\), we have:

$$\begin{aligned} \frac{1}{T}\sum _{t=1}^T \mathbb {E}[\Vert \varDelta _t\Vert ^2] \lesssim \frac{L^2 n^2 K^2h^2}{b}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}}\right) ^2 \end{aligned}$$
(24)

On the other hand, we can bound (21) as follows:

$$\begin{aligned} \sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2]&\le \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}\left\| \nabla f_i(\theta _t) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right\| ^2 \nonumber \\&\le \frac{4Tn^2G^2}{b} \end{aligned}$$
(25)

where (25) holds since all \(f_i\)’s are G-Lipschitz, together with the inequality \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).

Now, the proof of Theorem 2 is finished by combining (20), (24) and (25). \(\square \)

B.3 Proof of Lemma 1

Since \(G^2> K^2(n^2L^2 h^2G^2 + hd) > K^2n^2L^2h^2G^2\), we have \(h < \frac{1}{nKL}\). Therefore, we have

$$\begin{aligned}&D - \frac{h^3L^2n^2K^2}{b} \\&\quad > D - \frac{1}{n^3K^3L^3}\frac{L^2n^2K^2}{b} \\&\quad = D - \frac{1}{nKbL} \\&\quad \gg D - 0.1 \ge 0.9 D \end{aligned}$$

where the last line holds since \(nKbL\gg 10\) and \(D>1\) (see Line 1 of Algorithm 1). Thus, the proof is reduced to comparing \(h^2L^2n^2G^2 + hd\) and \(\frac{h^2L^2 n^2 G^2}{D^2} + h^2L^2d\) asymptotically.

Clearly,

$$\begin{aligned}&\frac{h^2L^2 n^2 G^2}{D^2} + h^2L^2d \\&\quad \le \frac{h^2L^2 n^2 G^2}{D^2} + \frac{1}{nKL}hL^2 d \\&\quad = \frac{h^2L^2 n^2 G^2}{D^2} + hd\frac{L}{nK}\\&\quad \le \max \{\frac{1}{D^2}, \frac{L}{nK}\}(h^2L^2n^2G^2 + hd) \end{aligned}$$

Note that if \(D\ge {1}/{L\sqrt{h}}\), then \(\max \{\frac{1}{D^2}, \frac{L}{nK}\}\) turns to be \(\frac{L}{nK}\) which is very small. The reason is that:

$$\begin{aligned} D^2&\ge \frac{1}{L^2h} \\&\ge \frac{1}{L^2\frac{1}{nKL}} \\&= \frac{nK}{L} \end{aligned}$$

where the second inequality holds due to \(h < \frac{1}{nKL}\) which is mentioned above. \(\square \)

B.4 Proof of Theorem 4

Defining \(\varDelta _t = \tilde{\nabla }_t - \nabla f(\theta _t)\), it suffices to upper bound \(\sum _{t=0}^{T-1}\mathbb {E}[\Vert \varDelta _t\Vert ^2]\) according to (20). Unpacking the definition of \(\varDelta _t\) and \(\tilde{\nabla }_t\), we have:

$$\begin{aligned} \sum _{t=0}^{T-1}\mathbb {E}[\Vert \varDelta _t\Vert ^2]&= \sum _{t=0}^{T-1}\mathbb {E}[\Vert \tilde{\nabla }_t - \nabla f(\theta _t)\Vert ^2]\nonumber \\&= \sum _{t=0}^{T-1}\mathbb {E}\left[ \left\| \frac{n}{b}\sum _{i\in I}(\nabla f_i(\theta _t) - \nabla f_i(\alpha _t^i)) -\sum _{j=1}^n(\nabla f_j(\theta _t) - \nabla f_j(\alpha _t^j))\right\| ^2\right] \nonumber \\&= \sum _{t=0}^{T-1}n^2 \mathbb {E}\left[ \left\| \frac{1}{b}\sum _{i\in I}(\nabla f_i(\theta _t) - \nabla f_i(\alpha _t^i)) - \frac{1}{n}\sum _{j=1}^n(\nabla f_j(\theta _t) - \nabla f_j(\alpha _t^j)) \right\| ^2\right] \nonumber \\&\le \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}[\Vert \nabla f_i(\theta _t) - \nabla f_i(\alpha _t^i)\Vert ^2] \nonumber \\&\le \frac{L^2 n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}[\Vert \theta _t - \alpha _t^i\Vert ^2] \end{aligned}$$
(26)

The first inequality is because \(\mathbb {E}(X - \mathbb {E}X)^2\le \mathbb {E}X^2\) for any random variable X; the second inequality holds due to L-smoothness of \(f_i\)’s.

Let \(\gamma = 1 - (1 - 1/n)^b\). Next we upper bound each \(\mathbb {E}[\Vert \theta _t - \alpha _t^i\Vert ^2]\) in the following manner.

$$\begin{aligned} \mathbb {E}[\Vert \theta _t - \alpha _t^i\Vert ^2]&= \sum _{j=0}^{t-1}\mathbb {E}[\Vert \theta _t - \theta _j\Vert ^2] \Pr (\alpha _t^i = \theta _j)\\&= \sum _{j=0}^{t-1}\mathbb {E}[\Vert \theta _t - \theta _j\Vert ^2](1-\gamma )^{t-j-1}\gamma \\&= h^2\sum _{j=0}^{t-1} \mathbb {E}[\Vert p_t + p_{t-1} + \cdots + p_{j+1}\Vert ^2] (1-\gamma )^{t-j-1}\gamma \\&\le h^2\sum _{j=0}^{t-1} (t-j) (1-\gamma )^{t-j-1}\gamma (\mathbb {E}[\Vert p_t\Vert ^2] + \mathbb {E}[\Vert p_{t-1}\Vert ^2] + \cdots + \mathbb {E}[\Vert p_{j+1}\Vert ^2])\\&= h^2\sum _{j=1}^{t}\mathbb {E}[\Vert p_j\Vert ^2] \sum _{k=0}^{j-1}(t-k)(1-\gamma )^{t-k-1}\gamma \\&\le h^2\sum _{j=1}^{t}\mathbb {E}[\Vert p_j\Vert ^2] \sum _{k=0}^\infty (t-j+k+1)(1-\gamma )^{t-j+k}\gamma \\&< h^2\sum _{j=1}^t \mathbb {E}[\Vert p_j\Vert ^2] \left( \frac{1}{\gamma } + t - j\right) (1-\gamma )^{t-j} \end{aligned}$$

The second equality is by direct calculation \(\Pr (\alpha _t^i = \theta _j) = (1-\gamma )^{t-j-1}\gamma \); the first inequality is a direct application of Cauchy’s inequality; the last inequality is a weighted summation of geometric series \((1-\gamma )^{t-j+k}, k\ge 0\).

Summing over all t and i, we then have:

$$\begin{aligned} \sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}[\Vert \theta _t - \alpha _t^i\Vert ^2]&\le h^2\sum _{t=0}^{T-1}\sum _{j=1}^t \mathbb {E}[\Vert p_j\Vert ^2] \left( \frac{1}{\gamma } + t - j\right) (1-\gamma )^{t-j}\\&\le h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\sum _{j=t}^{T-1} \left( \frac{1}{\gamma } + j-t\right) (1-\gamma )^{j-t}\\&\le h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\sum _{j=t}^\infty \left( \frac{1}{\gamma } + j-t\right) (1-\gamma )^{j-t}\\&\le h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\frac{2}{\gamma ^2} \\&= \frac{2}{\gamma ^2}h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\\&\le \frac{8h^2n^2}{b^2}\sum _{t=1}^{T} \mathbb {E}[\Vert p_t\Vert ^2] \end{aligned}$$

The last inequality holds because \((1-1/n)^b \le \frac{1}{1 + \frac{b}{n-1}}\), so \(\gamma = 1 - (1-1/n)^b \ge \frac{\frac{b}{n-1}}{1 + \frac{b}{n-1}} > \frac{b}{2n}\), since the mini-batch size b is smaller than the dataset size n.

Similar to the previous subsection, we derive an upper bound on \(\sum _{t=1}^{T}\mathbb {E}[\Vert p_t\Vert ^2]\). By the recursion for \(p_{t+1}\), we have:

$$\begin{aligned} \mathbb {E}\Vert p_{t+1}\Vert ^2&= \mathbb {E}\Vert (1-Dh)p_t - h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t\Vert ^2\\&= \mathbb {E}\Vert (1-Dh)p_t - h\nabla f(\theta _t) - h\varDelta _t + \sqrt{2Dh}\xi _t\Vert ^2\\&= \mathbb {E}\Vert (1-Dh)p_t - h\nabla f(\theta _t)\Vert ^2 + h^2\mathbb {E}[\Vert \varDelta _t\Vert ^2] + 2Dhd\\&\le (1-Dh)^2\mathbb {E}\Vert p_t\Vert ^2 + 2Gnh(1-Dh)\sqrt{\mathbb {E}\Vert p_t\Vert ^2} + h^2n^2G^2 \\&\quad + h^2\mathbb {E}[\Vert \varDelta _t\Vert ^2] + 2Dhd \end{aligned}$$

The third equality holds because \(\mathbb {E}[\varDelta _t] = \mathbb {E}[\xi _t] = 0\) and \(\varDelta _t\) and \(\xi _t\) are independent. The first inequality takes advantage of \(\Vert \nabla f\Vert \le nG\) and \(\mathbb {E}\Vert p_t\Vert \le \sqrt{\mathbb {E}\Vert p_t\Vert ^2}\).

Define \(S = \sum _{t=1}^T\mathbb {E}[\Vert p_t\Vert ^2]\). Then, taking a grand summation over \(t = 0, 1, \ldots , T-1\),

$$\begin{aligned} S&\le (1-Dh)^2 S + 2nGh(1-Dh)\sum _{t=0}^{T-1}\sqrt{\mathbb {E}\Vert p_t\Vert ^2}\\&\quad + T(h^2n^2G^2 + 2Dhd) + h^2\sum _{t=0}^{T-1}\mathbb {E}[\Vert \varDelta _t\Vert ^2]\\&\le (1-Dh)^2 S + 2nGh(1-Dh)\sqrt{T} \sqrt{S}\\&\quad + T(h^2n^2G^2 + 2Dhd) + \frac{8h^4L^2n^4}{b^3}S \end{aligned}$$

Rearranging the terms we have:

$$\begin{aligned}&\left( 1-(1-Dh)^2 - \frac{8h^4L^2n^4}{b^3}\right) \frac{S}{T} - 2nGh(1-Dh)\sqrt{\frac{S}{T}} - (h^2n^2G^2 + 2Dhd) \le 0 \end{aligned}$$

Solving a quadratic equation with respect to \(\sqrt{S / T}\) and ignoring constant factors, we have:

$$\begin{aligned} \sqrt{\frac{S}{T}} \lesssim \frac{nG}{D-h^3L^2n^4b^{-3}} + \frac{\sqrt{n^2G^2 + D^2d}}{D-h^3L^2n^4b^{-3}} \lesssim \frac{\sqrt{n^2G^2 + D^2d}}{D-h^3L^2n^4b^{-3}} \end{aligned}$$

Plugging it in

$$\begin{aligned} \sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2]\le \frac{8h^2L^2 n^4 }{b^3}\sum _{t=0}^{T-1}\mathbb {E}\Vert p_t\Vert ^2 \end{aligned}$$

we have:

$$\begin{aligned} \frac{1}{T}\sum _{t=1}^T \mathbb {E}[\Vert \varDelta _t\Vert ^2] \lesssim \frac{h^2L^2 n^4 }{b^3}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-h^3L^2n^4b^{-3}}\right) ^2 \end{aligned}$$
(27)

Similar to (25), we can bound (26) as follows:

$$\begin{aligned}&\sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2] \nonumber \\&\quad \le \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}[\Vert \nabla f_i(\theta _t) - \nabla f_i(\alpha _t^i)\Vert ^2] \nonumber \\&\quad \le \frac{4Tn^2G^2}{b} \end{aligned}$$
(28)

where (28) holds since all \(f_i\)’s are G-Lipschitz, together with the inequality \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).

Now, the proof of Theorem 4 is finished by combining (20), (27) and (28). \(\square \)

B.5 Proof of Theorem 5

To prove this theorem, we need the following theorem from Chen et al. (2015). The theorem shows that SGHMC with symmetric splitting can improve the dependency of MSE on step size h, thus allowing larger step size and faster MSE convergence.

Theorem 7

(Chen et al. 2015) Let \(\tilde{\nabla }_t\) be an unbiased estimate of \(\nabla f(\theta _t)\) for all t. Then under Assumption 1, for a smooth test function \(\phi \), the MSE of SGHMC with symmetric splitting is bounded in the following way:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2\lesssim \frac{\frac{1}{T}\sum _{t=0}^{T-1}\mathbb {E}(\varDelta V_t \psi (\theta _t, p_t))^2}{T} + \frac{1}{Th} + h^4 \end{aligned}$$
(29)

Now, we define \(\varDelta _t = \tilde{\nabla }_t - \nabla f(\theta _t + \frac{h}{2}p_t)\). According to Assumption 2, we have:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2 \lesssim \frac{1}{T^2}\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2+ \frac{1}{Th} + h^4 \end{aligned}$$
(30)

According to (30), we mainly need to bound the term \(\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2\) for our SVRG2nd-HMC algorithm. First, we unfold the definition of \(\tilde{\nabla }_t\),

$$\begin{aligned} \tilde{\nabla }_t&= -\nabla \log \Pr \left( \theta _t + \frac{h}{2}p_t\right) + \frac{n}{b}\sum _{i\in I}\left( \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) \\&\quad + \sum _{i=1}^n \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \end{aligned}$$

Then,

$$\begin{aligned} \mathbb {E}\Vert \varDelta _t\Vert ^2&= \mathbb {E}\left\| \frac{n}{b}\sum _{i\in I}\left( \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) \right. \nonumber \\&\quad \qquad \left. - \sum _{i=1}^n\left( \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) \right\| ^2 \nonumber \\&\le \mathbb {E}\left\| \frac{n}{b}\sum _{i\in I}\left( \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right) \right\| ^2 \nonumber \\&\le \frac{n^2}{b}\mathbb {E}_{i\in [n]}\left\| \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right\| ^2 \nonumber \\&\le \frac{n^2L^2}{b} \mathbb {E}\left\| \theta _t + \frac{h}{2}p_t - \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} - \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right\| ^2 \nonumber \\&\le \frac{n^2L^2K}{b}\sum _{j=\left\lfloor \frac{t}{K}\right\rfloor K}^{t-1} \mathbb {E}\left\| \theta _{j+1} + \frac{h}{2}p_{j+1} - \theta _j - \frac{h}{2}p_j\right\| ^2 \end{aligned}$$
(31)

Taking a summation we have:

$$\begin{aligned} \sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2&\le \frac{n^2L^2K}{b}\sum _{t=0}^{T-1}\sum _{j=\left\lfloor \frac{t}{K}\right\rfloor K}^{t-1} \mathbb {E}\left\| \theta _{j+1} + \frac{h}{2}p_{j+1} - \theta _j - \frac{h}{2}p_j\right\| ^2\\&\le \frac{n^2L^2K^2}{b}\sum _{t=0}^{T-1}\mathbb {E}\left\| \theta _{t+1} + \frac{h}{2}p_{t+1} - \theta _t - \frac{h}{2}p_t\right\| ^2\\&= \frac{n^2L^2K^2h^2}{b}\sum _{t=0}^{T-1}\mathbb {E}\Vert p_{t+1}\Vert ^2 \end{aligned}$$

The last equality follows by the recursion \(\theta _{t+1} = \theta _t + \frac{h}{2}p_{t+1} + \frac{h}{2}p_t\).

By definition of \(p_{t+1}\) we have:

$$\begin{aligned}&\mathbb {E}\Vert p_{t+1}\Vert ^2 \\&\quad = \mathbb {E}\Vert e^{-Dh/2}\left( e^{-Dh/2}p_t - h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t\right) \Vert ^2\\&\quad = e^{-Dh} \mathbb {E}\Vert e^{-Dh/2}p_t - h\varDelta _t - h\nabla f\left( \theta _t + \frac{h}{2}p_t\right) + \sqrt{2Dh}\xi _t\Vert ^2\\&\quad = e^{-Dh} \left( \mathbb {E}\Vert e^{-Dh/2} p_t - h\nabla f\left( \theta _t + \frac{h}{2}p_t\right) \Vert ^2 + \mathbb {E}\Vert \sqrt{2Dh}\xi _t\Vert ^2 + \mathbb {E}\Vert h\varDelta _t\Vert ^2\right) \\&\quad = e^{-Dh} \left( e^{-Dh}\mathbb {E}\Vert p_t\Vert ^2 + 2e^{-Dh/2}nG\mathbb {E}\Vert p_t\Vert + n^2G^2h^2 + 2Dhd + h^2\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \\&\quad \le \left( 1-\frac{Dh}{4}\right) ^2 \left( \left( 1-\frac{Dh}{4}\right) ^2\mathbb {E}\Vert p_t\Vert ^2 + 2\left( 1-\frac{Dh}{4}\right) nG\sqrt{\mathbb {E}\Vert p_t\Vert ^2} \right. \\&\qquad \left. + n^2G^2h^2 + 2Dhd + h^2\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \end{aligned}$$
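
As a point of reference, one iteration of the symmetric-splitting scheme whose momentum recursion is expanded above can be sketched as follows. Here stoch_grad stands for the (variance-reduced) estimate \(\tilde{\nabla }_t\), evaluated at the half-step point \(\theta _t + \frac{h}{2}p_t\); the function and parameter names are illustrative assumptions:

```python
import numpy as np

def splitting_sghmc_step(theta, p, h, D, stoch_grad, rng):
    """One symmetric-splitting SGHMC step matching the recursion used above:
    half position update, friction half-step, gradient + injected noise,
    friction half-step, half position update."""
    theta = theta + 0.5 * h * p                    # theta_t + (h/2) p_t
    p = np.exp(-0.5 * D * h) * p                   # first friction half-step
    xi = rng.standard_normal(p.shape)              # xi_t ~ N(0, I)
    p = p - h * stoch_grad(theta) + np.sqrt(2.0 * D * h) * xi
    p = np.exp(-0.5 * D * h) * p                   # second friction half-step: p_{t+1}
    theta = theta + 0.5 * h * p                    # theta_{t+1}
    return theta, p
```

Composing the two friction half-steps with the gradient and noise injection reproduces \(p_{t+1} = e^{-Dh/2}(e^{-Dh/2}p_t - h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t)\), and the two half position updates give the recursion \(\theta _{t+1} = \theta _t + \frac{h}{2}p_{t+1} + \frac{h}{2}p_t\) used above.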

Define \(S = \sum _{t=1}^T \mathbb {E}\Vert p_t\Vert ^2\) and \(M=n^2G^2h^2 + 2Dhd\). Summing the above inequality over \(t = 0, 1, 2, \ldots , T-1\), we have:

$$\begin{aligned} S&\le \left( 1 - \frac{Dh}{4}\right) ^2 \left( \left( 1 - \frac{Dh}{4}\right) ^2S + 2\left( 1 - \frac{Dh}{4}\right) nG\sum _{t=0}^{T-1}\sqrt{\mathbb {E}\Vert p_t\Vert ^2} +TM\right. \\&\quad \left. +h^2\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \\&\le \left( 1 - \frac{Dh}{4}\right) ^2 \left( \left( 1 - \frac{Dh}{4}\right) ^2S + 2\left( 1 - \frac{Dh}{4}\right) nG\sqrt{TS}+TM +h^2\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2 \right) \\&\le \left( 1 - \frac{Dh}{4}\right) ^2 \left( \left( 1 - \frac{Dh}{4}\right) ^2S + 2\left( 1 - \frac{Dh}{4}\right) nG\sqrt{TS}+ TM + \frac{n^2L^2K^2h^4}{b}S\right) \end{aligned}$$

Rewriting it as a quadratic inequality with respect to \(\sqrt{\frac{S}{T}}\), we have:

$$\begin{aligned}&\left( 1 - \left( 1 - \frac{Dh}{4}\right) ^4 - \left( 1 - \frac{Dh}{4}\right) ^2\frac{n^2L^2K^2h^4}{b}\right) \frac{S}{T} - 2\left( 1 - \frac{Dh}{4}\right) ^3nG\sqrt{\frac{S}{T}} \\&\quad - \left( 1-\frac{Dh}{4}\right) ^2M\le 0 \end{aligned}$$
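
To solve it, note that for constants \(A > 0\), \(B \ge 0\) and \(C \ge 0\), any \(x \ge 0\) satisfying \(Ax^2 - 2Bx - C \le 0\) must lie below the positive root:

$$\begin{aligned} x \le \frac{B + \sqrt{B^2 + AC}}{A} \end{aligned}$$

(Roughly speaking, the condition \(A > 0\) corresponds to the positivity of the denominator \(D-L^2n^2K^2h^3b^{-1}\) in the bound below.)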

Applying this with \(x = \sqrt{\frac{S}{T}}\) and ignoring constant factors, we obtain:

$$\begin{aligned} \sqrt{\frac{S}{T}}&\lesssim \frac{nG + \sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}} \\&\lesssim \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}} \end{aligned}$$

As in the proof of Theorem 2, it follows that:

$$\begin{aligned} \frac{1}{T}\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2 \lesssim \frac{L^2 n^2 K^2h^2}{b}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-L^2n^2K^2h^3b^{-1}}\right) ^2 \end{aligned}$$
(32)

Similar to (25), we can bound (31) as follows:

$$\begin{aligned} \mathbb {E}[\Vert \varDelta _t\Vert ^2]&\le \frac{n^2}{b}\mathbb {E}_{i\in [n]}\left\| \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i\left( \theta _{\left\lfloor \frac{t}{K}\right\rfloor K} + \frac{h}{2}p_{\left\lfloor \frac{t}{K}\right\rfloor K}\right) \right\| ^2 \nonumber \\&\le \frac{4n^2G^2}{b} \end{aligned}$$
(33)

where (33) holds because all \(f_i\)'s are G-Lipschitz and by Cauchy's inequality \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).

Now, the proof of Theorem 5 is finished by combining (30), (32) and (33). \(\square \)

B.6 Proof of Theorem 6

Similar to the proof of Theorem 5, we define \(\varDelta _t = \tilde{\nabla }_t - \nabla f(\theta _t + \frac{h}{2}p_t)\). By Assumption 2 and Inequality (29), we have:

$$\begin{aligned} \mathbb {E}(\hat{\phi } - \bar{\phi })^2&\lesssim \frac{\frac{1}{T}\sum _{t=0}^{T-1}\mathbb {E}(\varDelta V_t\psi (\theta _t, p_t))^2}{T} + \frac{1}{Th} + h^4 \\&\le \frac{1}{T^2}\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2+ \frac{1}{Th} + h^4 \end{aligned}$$
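
Before bounding \(\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2\), it may help to sketch the SAGA-style estimator \(\tilde{\nabla }_t\) analyzed here, which maintains for every index i the half-step point \(\alpha _t^i\) at which \(\nabla f_i\) was last evaluated (an implementation could equivalently store the gradients themselves), together with the running sum \(\sum _j \nabla f_j(\alpha _t^j)\). The sketch below is illustrative; the names, storage layout, and the assumption of distinct mini-batch indices are ours, not the authors':

```python
def saga2nd_gradient(theta_half, alpha, table_sum, grad_prior, grad_f_i,
                     batch_idx, n):
    """SAGA-style estimate of the drift at the half-step point theta_half.

    alpha     : array of shape (n, d); alpha[i] is the point where grad f_i
                was last evaluated (the alpha_t^i of the proof)
    table_sum : sum_j grad_f_j(alpha[j]), maintained incrementally
    Returns the gradient estimate and the updated (alpha, table_sum).
    """
    b = len(batch_idx)
    new_grads = {i: grad_f_i(theta_half, i) for i in batch_idx}
    old_grads = {i: grad_f_i(alpha[i], i) for i in batch_idx}
    correction = sum(new_grads[i] - old_grads[i] for i in batch_idx)
    grad_est = grad_prior(theta_half) + (n / b) * correction + table_sum
    for i in batch_idx:                  # update the table after forming the estimate
        table_sum = table_sum + new_grads[i] - old_grads[i]
        alpha[i] = theta_half
    return grad_est, alpha, table_sum
```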

Unpacking the definitions of \(\varDelta _t\) and \(\tilde{\nabla }_t\), we have:

$$\begin{aligned}&\sum _{t=0}^{T-1}\mathbb {E}[\Vert \varDelta _t\Vert ^2] \nonumber \\&\quad = \sum _{t=0}^{T-1}\mathbb {E}\left[ \left\| \tilde{\nabla }_t - \nabla f\left( \theta _t + \frac{h}{2}p_t\right) \right\| ^2\right] \nonumber \\&\quad = \sum _{t=0}^{T-1}\mathbb {E}\left[ \left\| \frac{n}{b}\sum _{i\in I}\left( \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i(\alpha _t^i)\right) \right. \right. \nonumber \\&\qquad \left. \left. - \sum _{j=1}^n\left( \nabla f_j\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_j(\alpha _t^j)\right) \right\| ^2\right] \nonumber \\&\quad = \sum _{t=0}^{T-1}n^2 \mathbb {E}\left[ \left\| \frac{1}{b}\sum _{i\in I}\left( \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i(\alpha _t^i)\right) \right. \right. \nonumber \\&\qquad \left. \left. - \frac{1}{n}\sum _{j=1}^n\left( \nabla f_j\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_j(\alpha _t^j)\right) \right\| ^2\right] \nonumber \\&\quad \le \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}\left[ \left\| \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i(\alpha _t^i)\right\| ^2\right] \end{aligned}$$
(34)
$$\begin{aligned}&\quad \le \frac{L^2 n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}\left[ \left\| \theta _t + \frac{h}{2}p_t - \alpha _t^i\right\| ^2\right] \end{aligned}$$
(35)

Let \(\gamma = 1 - (1 - 1/n)^b\) be the probability that a fixed index i appears in a mini-batch of size b. Since \(\alpha _t^i\) is the half-step point of the last iteration before t at which index i was sampled, \(\Pr (\alpha _t^i = \theta _j + \frac{h}{2}p_j) = (1-\gamma )^{t-j-1}\gamma \). Then,

$$\begin{aligned}&\mathbb {E}\left[ \left\| \theta _t + \frac{h}{2}p_t - \alpha _t^i\right\| ^2\right] \\&\quad = \sum _{j=0}^{t-1}\mathbb {E}\left[ \left\| \theta _t + \frac{h}{2}p_t - \theta _j - \frac{h}{2}p_j \right\| ^2\right] \Pr \left( \alpha _t^i = \theta _j + \frac{h}{2}p_j\right) \\&\quad = \sum _{j=0}^{t-1}\mathbb {E}\left[ \left\| \theta _t + \frac{h}{2}p_t - \theta _j - \frac{h}{2}p_j \right\| ^2\right] (1-\gamma )^{t-j-1}\gamma \\&\quad = h^2\sum _{j=0}^{t-1} \mathbb {E}[\Vert p_t + p_{t-1} + \cdots + p_{j+1}\Vert ^2] (1-\gamma )^{t-j-1}\gamma \\&\quad \le h^2\sum _{j=0}^{t-1} (t-j) (1-\gamma )^{t-j-1}\gamma (\mathbb {E}[\Vert p_t\Vert ^2] + \mathbb {E}[\Vert p_{t-1}\Vert ^2] + \cdots + \mathbb {E}[\Vert p_{j+1}\Vert ^2])\\&\quad = h^2\sum _{j=1}^{t}\mathbb {E}[\Vert p_j\Vert ^2] \sum _{k=0}^{j-1}(t-k)(1-\gamma )^{t-k-1}\gamma \\&\quad \le h^2\sum _{j=1}^{t}\mathbb {E}[\Vert p_j\Vert ^2] \sum _{k=0}^\infty (t-j+k+1)(1-\gamma )^{t-j+k}\gamma \\&\quad < h^2\sum _{j=1}^t \mathbb {E}[\Vert p_j\Vert ^2] \left( \frac{1}{\gamma } + t - j\right) (1-\gamma )^{t-j} \end{aligned}$$

Summing over all t and i, we then have:

$$\begin{aligned} \sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}[\Vert \theta _t + \frac{h}{2}p_t - \alpha _t^i\Vert ^2]&\le h^2\sum _{t=0}^{T-1}\sum _{j=1}^t \mathbb {E}[\Vert p_j\Vert ^2] \left( \frac{1}{\gamma } + t - j\right) (1-\gamma )^{t-j}\nonumber \\&\le h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\sum _{j=t}^{T-1} \left( \frac{1}{\gamma } + j-t\right) (1-\gamma )^{j-t}\nonumber \\&\le h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\sum _{j=t}^\infty \left( \frac{1}{\gamma } + j-t\right) (1-\gamma )^{j-t}\nonumber \\&\le \frac{2}{\gamma ^2}h^2\sum _{t=1}^{T-1} \mathbb {E}[\Vert p_t\Vert ^2]\nonumber \\&\le \frac{8h^2n^2}{b^2}\sum _{t=1}^{T} \mathbb {E}[\Vert p_t\Vert ^2] \end{aligned}$$
(36)
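
The second-to-last inequality in (36) uses the elementary geometric-series sums \(\sum _{k=0}^\infty (1-\gamma )^k = \frac{1}{\gamma }\) and \(\sum _{k=0}^\infty k(1-\gamma )^k = \frac{1-\gamma }{\gamma ^2}\), so that

$$\begin{aligned} \sum _{k=0}^{\infty }\left( \frac{1}{\gamma } + k\right) (1-\gamma )^{k} = \frac{1}{\gamma ^2} + \frac{1-\gamma }{\gamma ^2} = \frac{2-\gamma }{\gamma ^2} \le \frac{2}{\gamma ^2} \end{aligned}$$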

The last inequality holds because \((1-1/n)^b \le \frac{1}{1 + \frac{b}{n-1}}\) (by Bernoulli's inequality, since \(1 - 1/n = \frac{1}{1 + \frac{1}{n-1}}\)), and thus \(\gamma = 1 - (1-1/n)^b \ge \frac{\frac{b}{n-1}}{1 + \frac{b}{n-1}} = \frac{b}{n-1+b} > \frac{b}{2n}\), where the final step uses that the mini-batch size b is at most the dataset size n.

Now, we derive an upper bound on \(\sum _{t=1}^{T}\mathbb {E}[\Vert p_t\Vert ^2]\). By recursion of \(p_{t+1}\)’s, we have:

$$\begin{aligned}&\mathbb {E}\Vert p_{t+1}\Vert ^2 \\&\quad = \mathbb {E}\Vert e^{-Dh/2} (e^{-Dh/2}p_t - h\tilde{\nabla }_t + \sqrt{2Dh}\xi _t)\Vert ^2\\&\quad = e^{-Dh} \mathbb {E}\left\| e^{-Dh/2}p_t - h\varDelta _t - h\nabla f\left( \theta _t + \frac{h}{2}p_t\right) + \sqrt{2Dh}\xi _t\right\| ^2\\&\quad = e^{-Dh} \left( \mathbb {E}\Vert e^{-Dh/2} p_t - h\nabla f\left( \theta _t + \frac{h}{2}p_t\right) \Vert ^2 + \mathbb {E}\Vert \sqrt{2Dh}\xi _t\Vert ^2 + \mathbb {E}\Vert h\varDelta _t\Vert ^2\right) \\&\quad = e^{-Dh} \left( e^{-Dh}\mathbb {E}\Vert p_t\Vert ^2 + 2e^{-Dh/2}nG\mathbb {E}\Vert p_t\Vert + n^2G^2h^2 + 2Dhd + h^2\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \\&\quad \le \left( 1-\frac{Dh}{4}\right) ^2 \left( \left( 1-\frac{Dh}{4}\right) ^2\mathbb {E}\Vert p_t\Vert ^2 + 2\left( 1-\frac{Dh}{4}\right) nG\sqrt{\mathbb {E}\Vert p_t\Vert ^2} + n^2G^2h^2 \right. \\&\qquad \left. + 2Dhd + h^2\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \end{aligned}$$

As before, define \(S = \sum _{t=1}^T \mathbb {E}\Vert p_t\Vert ^2\) and \(M=n^2G^2h^2 + 2Dhd\). Summing the above inequality over \(t = 0, 1, 2, \ldots , T-1\), we have:

$$\begin{aligned} S&\le \left( 1 - \frac{Dh}{4}\right) ^2 \left( \left( 1 - \frac{Dh}{4}\right) ^2S + 2\left( 1 - \frac{Dh}{4}\right) nG\sum _{t=0}^{T-1}\sqrt{\mathbb {E}\Vert p_t\Vert ^2} + TM \right. \\&\quad \left. + h^2\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \\&\le \left( 1 - \frac{Dh}{4}\right) ^2 \left( \left( 1 - \frac{Dh}{4}\right) ^2S + 2\left( 1 - \frac{Dh}{4}\right) nG\sqrt{TS}+ TM + h^2\sum _{t=0}^{T-1}\mathbb {E}\Vert \varDelta _t\Vert ^2\right) \\&\le \left( 1 - \frac{Dh}{4}\right) ^2 \left( \left( 1 - \frac{Dh}{4}\right) ^2S + 2\left( 1 - \frac{Dh}{4}\right) nG\sqrt{TS}+ TM + \frac{8n^4L^2h^4}{b^3}S\right) \end{aligned}$$

Similar to the proof of Theorem 5, we solve a quadratic inequality with respect to \(\sqrt{\frac{S}{T}}\) and obtain:

$$\begin{aligned} \sqrt{\frac{S}{T}}\lesssim \frac{\sqrt{n^2G^2 + D^2d}}{D-h^3L^2n^4b^{-3}} \end{aligned}$$
(37)

From (35), (36), (37) and the definition of S, we have:

$$\begin{aligned}&\sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2] \nonumber \\&\quad \le \frac{8h^2L^2 n^4 }{b^3}\sum _{t=1}^{T}\mathbb {E}\Vert p_t\Vert ^2 \nonumber \\&\quad \le \frac{8h^2L^2 n^4T}{b^3}\left( \frac{\sqrt{n^2G^2 + D^2d}}{D-h^3L^2n^4b^{-3}}\right) ^2 \end{aligned}$$
(38)

Similar to (25), we can bound (34) as follows:

$$\begin{aligned}&\sum _{t=0}^{T-1} \mathbb {E}[\Vert \varDelta _t\Vert ^2] \nonumber \\&\quad \le \frac{n^2}{b}\sum _{t=0}^{T-1}\mathbb {E}_{i\in [n]}\left[ \left\| \nabla f_i\left( \theta _t + \frac{h}{2}p_t\right) - \nabla f_i(\alpha _t^i)\right\| ^2\right] \nonumber \\&\quad \le \frac{4Tn^2G^2}{b} \end{aligned}$$
(39)

where (39) holds because all \(f_i\)'s are G-Lipschitz and by Cauchy's inequality \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).

Now, the proof of Theorem 6 is finished by combining (30), (38) and (39). \(\square \)

