
An adaptively weighted stochastic gradient MCMC algorithm for Monte Carlo simulation and global optimization


We propose an adaptively weighted stochastic gradient Langevin dynamics (AWSGLD) algorithm for Bayesian learning in big-data problems. The proposed algorithm is scalable and possesses a self-adjusting mechanism: it adaptively flattens the high-energy region and protrudes the low-energy region during simulations, so that both Monte Carlo simulation and global optimization can be greatly facilitated in a single run. The self-adjusting mechanism renders the algorithm essentially immune to local traps. Theoretically, by showing the stability of the mean-field system and verifying the existence and regularity properties of the solution of the Poisson equation, we establish the convergence of the AWSGLD algorithm, including both the convergence of the self-adapting parameters and the convergence of the weighted averaging estimators. Empirically, the AWSGLD algorithm is tested on multiple benchmark datasets, including CIFAR100 and SVHN, for both optimization and uncertainty estimation tasks. The numerical results indicate its great potential in Monte Carlo simulation and global optimization for modern machine learning tasks.




  1. If the distribution of interest \(\pi _T\) is flat, we can choose \(\tau =T\) by default; otherwise, we run at a higher temperature \(\tau >T\) to facilitate exploration, and a reweighting scheme is applied later to recover the distribution \(\pi _{T}\).

  2. If \(\zeta \ne 1\), the mean-field system becomes nonlinear and \({\varvec{\theta }}\) might not be able to converge to the global equilibrium \({\varvec{\theta }}_{\star }\). In this case, one may run SGLD (or its variant) sufficiently long to get a good initialization of \({\varvec{\theta }}_0\).

  3. Setting \(m=\infty \) leads to a continuous version of the method, for which a kernel method, as used in the continuous contour Monte Carlo algorithm (Liang 2007), is needed for estimating the gradient multiplier and updating the parameter \({\varvec{\theta }}\). For the current version of the algorithm, an excessively large value of m can lead to numerical instability, as the gradient multiplier can be large around the global optima.

  4. The wide adoption of data augmentation, such as random flipping and random cropping, in DNN training implicitly includes more data and leads to a concentrated posterior. See Wenzel et al. (2020), Aitchison (2021) for details.

  5. Various data augmentation techniques are included, such as random flipping, cropping and erasing (Zhong et al. 2020).

  6. \({{\widetilde{{\varvec{\Theta }}}}}\subset {\varvec{\Theta }}\) is a close neighborhood of \({\varvec{\theta }}_{\star }\) on which a linear approximation is valid.

  7. The region-wise weighted density function \(\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}})\propto \frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\) might affect the smoothness constant of the underlying energy function, which leads to a perturbation in (28) that depends on \({\varvec{\theta }}\) but changes little for a small change of \({\varvec{\theta }}\). The same logic applies to (27) and (24).
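The reweighting scheme mentioned in footnote 1 can be illustrated with a minimal sketch: samples drawn at a higher temperature \(\tau >T\) are importance-reweighted back to \(\pi _T\) with weights proportional to \(\exp \{-U(x)(1/T-1/\tau )\}\). The double-well energy, the temperatures, and the rejection sampler below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    # toy double-well energy (illustrative, not from the paper)
    return (x**2 - 1.0)**2

T, tau = 0.2, 1.0  # target and simulation temperatures (assumed values)

# draw from pi_tau on [-2, 2] by rejection sampling (a stand-in for an MCMC
# sampler); the envelope constant is 1 because min U = 0
xs = rng.uniform(-2.0, 2.0, size=200_000)
samples = xs[rng.uniform(size=xs.size) < np.exp(-U(xs) / tau)]

# importance weights recovering pi_T from the pi_tau samples
log_w = -U(samples) * (1.0 / T - 1.0 / tau)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# weighted estimate of E_{pi_T}[x^2] versus a fine-grid ground truth
est = np.sum(w * samples**2)
grid = np.linspace(-2.0, 2.0, 20001)
p = np.exp(-U(grid) / T)
p /= p.sum()
truth = np.sum(p * grid**2)
print(abs(est - truth))  # small reweighting error
```

Subtracting `log_w.max()` before exponentiating keeps the weights numerically stable when \(1/T-1/\tau \) is large.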


  • Ahn, S., Balan, A.K., Welling, M.: Bayesian posterior sampling via stochastic gradient Fisher scoring. In: International Conference on Machine Learning (ICML) (2012a)

  • Ahn, S., Korattikara, A., Welling, M.: Bayesian posterior sampling via stochastic gradient Fisher scoring. In: International Conference on Machine Learning (ICML) (2012b)

  • Aitchison, L.: A statistical theory of cold posteriors in deep neural networks. In: International Conference on Learning Representation (ICLR) (2021)

  • Andrieu, C., Moulines, E., Priouret, P.: Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. 44, 283–312 (2005)

  • Belkin, M.: Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numer. 30, 203–248 (2021)

  • Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, Berlin (1990)

  • Berg, B.A., Neuhaus, T.: Multicanonical algorithms for first order phase transitions. Phys. Lett. B 267, 249–253 (1991)

  • Chen, C., Carlson, D., Gan, Z., Li, C., Carin, L.: Bridging the gap between stochastic gradient MCMC and stochastic optimization. In: International Conference on Artificial Intelligence and Statistics, AISTATS 2016 (2016)

  • Chen, C., Ding, N., Carin, L.: On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In: Advances in Neural Information Processing Systems (NIPS) (2015)

  • Chen, T., Fox, E.B., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In: International Conference on Machine Learning (ICML) (2014)

  • Chen, Y., Chen, J., Dong, J., Peng, J., Wang, Z.: Accelerating nonconvex learning via replica exchange Langevin diffusion. In: International Conference on Learning Representation (ICLR) (2019)

  • Deng, W., Feng, Q., Gao, L., Liang, F., Lin, G.: Non-convex learning via replica exchange stochastic gradient MCMC. In: International Conference on Machine Learning (ICML) (2020a)

  • Deng, W., Feng, Q., Karagiannis, G., Lin, G., Liang, F.: Accelerating convergence of replica exchange stochastic gradient MCMC via variance reduction. In: International Conference on Learning Representation (ICLR) (2021)

  • Deng, W., Lin, G., Liang, F.: A contour stochastic gradient Langevin dynamics algorithm for simulations of multi-modal distributions. In: Advances in Neural Information Processing Systems (NeurIPS) (2020b)

  • Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R.D., Neven, H.: Bayesian sampling using stochastic gradient thermostats. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

  • Erdogdu, M.A., Mackey, L., Shamir, O.: Global non-convex optimization with discretized diffusions. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)

  • Fort, G., Jourdain, B., Kuhn, E., Lelièvre, T., Stoltz, G.: Convergence of the Wang–Landau algorithm. Math. Comput. 84, 2297–2327 (2015)

  • Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 156–163 (1991)

  • Geyer, C.J., Thompson, E.A.: Annealing Markov Chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 90, 909–920 (1995)

  • Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion). J. R. Stat. Soc. B 73, 123–214 (2011)

  • Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)

  • He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  • Hesselbo, B., Stinchcombe, R.: Monte Carlo simulation and global optimization without parameters. Phys. Rev. Lett. 74(12), 2151–2155 (1995)

  • Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: International Conference on Computer Vision (ICCV) (2009)

  • Kirkpatrick, S., Gelatt, C.D., Jr., Vecchi, M.P.: Optimization by simulated annealing. Science 220, 671–680 (1983)

  • Laguna, M., Martí, R.: Experimental testing of advanced scatter search designs for global optimization of multimodal functions. J. Glob. Optim. 33, 235–255 (2005)

  • LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)

  • Li, C., Chen, C., Carlson, D.E., Carin, L.: Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In: AAAI Conference on Artificial Intelligence (AAAI) (2016)

  • Liang, F.: Generalized 1/k-ensemble algorithm. Phys. Rev. E 69, 66701–66707 (2004)

  • Liang, F.: A generalized Wang–Landau algorithm for Monte Carlo computation. J. Am. Stat. Assoc. 100, 1311–1327 (2005)

  • Liang, F.: Continuous contour Monte Carlo for marginal density estimation with an application to a spatial statistical model. J. Comput. Graph. Stat. 16, 608–632 (2007)

  • Liang, F.: On the use of stochastic approximation Monte Carlo for Monte Carlo integration. Stat. Probab. Lett. 79, 581–587 (2009)

  • Liang, F., Liu, C., Carroll, R.J.: Stochastic approximation in Monte Carlo computation. J. Am. Stat. Assoc. 102, 305–320 (2007)

  • Liu, C., Zhu, L., Belkin, M.: Loss landscapes and optimization in over-parameterized non-linear systems and neural networks (2021). arXiv:2003.00307v2

  • Lu, X., Perrone, V., Hasenclever, L., Teh, Y.W., Vollmer, S.: Relativistic Monte Carlo. In: the 20th International Conference on Artificial Intelligence and Statistics (2017)

  • Ma, Y.-A., Chen, T., Fox, E.B.: A complete recipe for stochastic gradient MCMC. In: Advances in Neural Information Processing Systems (NeurIPS) (2015)

  • Maddox, W., Garipov, T., Izmailov, P., Vetrov, D., Wilson, A.G.: A simple baseline for Bayesian uncertainty in deep learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

  • Mangoubi, O., Vishnoi, N.K.: Convex optimization with unbounded nonconvex oracles using simulated annealing. In: Conference on Learning Theory (COLT) (2018)

  • Marinari, E., Parisi, G.: Simulated tempering: a new Monte Carlo scheme. Europhys. Lett. 19, 451–458 (1992)

  • Mattingly, J., Stuart, A., Higham, D.: Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stoch. Process. Appl. 101, 185–232 (2002)

  • Mattingly, J.C., Stuart, A.M., Tretyakov, M.: Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM J. Numer. Anal. 48, 552–577 (2010)

  • Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091 (1953)

  • Neal, R.M.: MCMC Using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo, vol. 54, pp. 113–162. Chapman & Hall/CRC, London (2012)

  • Nemeth, C., Fearnhead, P.: Stochastic gradient Markov Chain Monte Carlo. J. Am. Stat. Assoc. 116, 433–450 (2021)

  • Patterson, S., Teh, Y.W.: Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 2, pp. 3102–3110. NIPS’13. Curran Associates Inc, Red Hook (2013)

  • PyTorch. CyclicLR in PyTorch (2019).

  • Raginsky, M., Rakhlin, A., Telgarsky, M.: Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. Proc. Mach. Learn. Res. 65, 1–30 (2017)

  • Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  • Robert, C., Casella, G.: Monte Carlo Statistical Methods. Springer, Berlin (2004)

  • Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2, 341–363 (1996)

  • Saatci, Y., Wilson, A.G.: Bayesian GAN. In: Advances in Neural Information Processing Systems (NIPS), pp. 3622–3631 (2017)

  • Sato, I., Nakagawa, H.: Approximation analysis of stochastic gradient Langevin dynamics by using Fokker–Planck equation and Itô process. In: International Conference on Machine Learning (ICML) (2014)

  • Simsekli, U., Badeau, R., Cemgil, T., Richard, G.: Stochastic quasi-Newton Langevin Monte Carlo. In: International Conference on Machine Learning, vol. 48 (2016)

  • Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 Winter Conference on Applications of Computer Vision (2017)

  • Swendsen, R.H., Wang, J.-S.: Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett. 57, 2607–2609 (1986)

  • TensorFlow. TensorFlow Addons Optimizers: CyclicalLearningRate (2021).

  • Vollmer, S.J., Zygalakis, K.C., Teh, Y.W.: Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17, 1–48 (2016)

  • Wang, F., Landau, D.P.: Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 86, 2050–2053 (2001)

  • Weinhart, T., Singh, A., Thornton, A.: Perturbation theory & stability analysis. Slides (2010)

  • Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: International Conference on Machine Learning (ICML) (2011)

  • Wenzel, F., Roth, K., Veeling, B.S., Światkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., Nowozin, S.: How good is the Bayes posterior in deep neural networks really? In: International Conference on Machine Learning (ICML) (2020)

  • Xu, P., Chen, J., Zou, D., Gu, Q.: Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In: Advances in Neural Information Processing Systems (NIPS) (2018)

  • Ye, N., Zhu, Z., Mantiuk, R.K.: Langevin dynamics with continuous tempering for training deep neural networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 618–626. Curran Associates Inc., Red Hook (2017)

  • Zagoruyko, S., Komodakis, N.: Wide residual networks. In: Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12 (2016)

  • Zhang, R., Li, C., Zhang, J., Chen, C., Wilson, A.G.: Cyclical stochastic gradient MCMC for Bayesian deep learning. In: International Conference on Learning Representation (ICLR) (2020)

  • Zhang, X., Jiang, Y., Peng, H., Tu, K., Goldwasser, D.: Semi-supervised structured prediction with neural CRF autoencoder. In: Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 1701–1711 (2017)

  • Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: AAAI Conference on Artificial Intelligence, vol. 34 (2020)



Liang’s research is supported in part by the Grant DMS-2015498 from the National Science Foundation and the Grants R01-GM117597 and R01-GM126089 from the National Institutes of Health. Lin and Deng would like to acknowledge the support from the National Science Foundation (DMS-2053746, DMS-1555072, and DMS-1736364), Brookhaven National Laboratory Subcontract (382247), and the U.S. Department of Energy Office of Science Advanced Scientific Computing Research (DE-SC0021142). The authors thank the editor, associate editor, and referees for their constructive comments, which have led to significant improvement of this paper.

Author information


Corresponding author

Correspondence to Faming Liang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Convergence analysis for AWSGLD

Stochastic approximation

The stochastic approximation algorithm (Robbins and Monro 1951) is the prototype of many adaptive algorithms; it aims to solve an expectation equation of the form

$$\begin{aligned} \begin{aligned} h({\varvec{\theta }})&=\int _{{\mathcal {X}}} {{\widetilde{H}}}({\varvec{\theta }}, {\varvec{{{\varvec{x}}}}})\varpi _{{\varvec{\theta }}}(\hbox {d}{\varvec{{{\varvec{x}}}}})=0, \end{aligned} \end{aligned}$$

where \({{\varvec{x}}}\in {\mathcal {X}}\subset {\mathbb {R}}^d\), \({\varvec{\theta }}\in {\varvec{\Theta }}\subset {\mathbb {R}}^{m}\), \(\varpi _{{\varvec{\theta }}}({{\varvec{x}}})\) denotes a distribution parameterized by \({\varvec{\theta }}\), and \({{\widetilde{H}}}({\varvec{\theta }},{{\varvec{x}}})\) and \(h({\varvec{\theta }})\) are called the random-field and mean-field functions, respectively. The algorithm works by iterating between the following two steps:

  (i) Simulate \({\varvec{x}}_{k+1}\) from the transition kernel \(\Pi _{{\varvec{\theta _{k}}}}({\varvec{x}}_{k}, \cdot )\), which admits \(\varpi _{{\varvec{\theta }}_{k}}({\varvec{x}})\) as the invariant distribution;

  (ii) Update \({\varvec{\theta }}_k\) by setting \({\varvec{\theta }}_{k+1}={\varvec{\theta }}_{k}+\omega _{k+1} {{\widetilde{H}}}({\varvec{\theta }}_{k}, {\varvec{x}}_{k+1})\).

In particular, the algorithm samples \({{\varvec{x}}}\) from the transition kernel \(\Pi _{{\varvec{\theta _{k}}}}(\cdot , \cdot )\) rather than exactly from the distribution \(\varpi _{{\varvec{\theta }}_{k}}(\cdot )\), which leads to a Markovian state-dependent noise \({{\widetilde{H}}}({\varvec{\theta }}_k, {{\varvec{x}}}_{k+1})-h({\varvec{\theta }}_k)\).
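As a minimal illustration of the two-step recursion (not the AWSGLD setting itself), consider solving \(h(\theta )=\mathbb {E}[X]-\theta =0\) with \({{\widetilde{H}}}(\theta ,x)=x-\theta \). Here step (i) degenerates to i.i.d. sampling; in the adaptive-MCMC setting it would instead be a single MCMC step targeting a \(\theta \)-dependent distribution. The Gaussian data and step-size choice are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Robbins-Monro sketch: solve h(theta) = E[X] - theta = 0, whose root is the mean.
theta = 0.0
for k in range(1, 100_001):
    x = rng.normal(loc=3.0, scale=1.0)   # step (i): simulate x_{k+1}
    omega = 1.0 / k                      # step sizes: sum omega = inf, sum omega^2 < inf
    theta = theta + omega * (x - theta)  # step (ii): theta_{k+1} = theta_k + omega * H(theta_k, x_{k+1})

print(theta)  # close to the root theta_* = 3.0
```

With \(\omega _k=1/k\), the iterate is exactly the running sample mean, so convergence here is just the law of large numbers.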

Poisson equation

For any \({\varvec{\theta }}\in \Theta \), \({{\varvec{x}}}\in {\mathcal {X}}\), there exists a function \(\mu _{{\varvec{\theta }}}\) on \({\mathcal {X}}\) that solves the Poisson equation

$$\begin{aligned} \mu _{{\varvec{\theta }}}({{\varvec{x}}})-\Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({{\varvec{x}}}) ={\widetilde{H}}({\varvec{\theta }},{{\varvec{x}}})-h({\varvec{\theta }}), \end{aligned}$$

where \(\Pi _{{\varvec{\theta }}}\) denotes a probability transition kernel with \(\Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({{\varvec{x}}})=\int _{{\mathcal {X}}} \mu _{{\varvec{\theta }}}({{\varvec{x}}}')\Pi _{{\varvec{\theta }}}({{\varvec{x}}},{{\varvec{x}}}') \hbox {d} {{\varvec{x}}}'\). The solution to the Poisson equation exists when the following series converges:

$$\begin{aligned} \mu _{{\varvec{\theta }}}({{\varvec{x}}}):=\sum _{k\ge 0} \Pi _{{\varvec{\theta }}}^k ({{\widetilde{H}}}({\varvec{\theta }}, {{\varvec{x}}})-h({\varvec{\theta }})), \end{aligned}$$

where \(\Pi _{{\varvec{\theta }}}^k ({{\widetilde{H}}}({\varvec{\theta }}, {{\varvec{x}}})-h({\varvec{\theta }}))=\int ({{\widetilde{H}}}({\varvec{\theta }}, {{\varvec{y}}})-h({\varvec{\theta }})) \Pi _{{\varvec{\theta }}}^k({{\varvec{x}}}, \hbox {d}{{\varvec{y}}})\). That is, the consistency of the estimator \({\varvec{\theta }}\) can be established by controlling the perturbations of \(\sum _{k \ge 0} \Pi _{{\varvec{\theta }}}^k ({{\widetilde{H}}}({\varvec{\theta }}, {{\varvec{x}}})-h({\varvec{\theta }}))\) via imposing some regularity conditions on \(\mu _{{\varvec{\theta }}}(\cdot )\). Toward this goal, Benveniste et al. (1990) gave the following regularity conditions on \(\mu _{{\varvec{\theta }}}(\cdot )\) to ensure the convergence of the adaptive algorithm:
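For a finite-state chain, the series solution and the Poisson equation itself can be checked directly. The 3-state transition matrix and test function below are hypothetical choices, not objects from the paper:

```python
import numpy as np

# a 3-state ergodic transition matrix (hypothetical example)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# stationary distribution: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

H = np.array([1.0, -2.0, 0.5])  # random-field values, one per state
h = pi @ H                      # mean-field value
c = H - h                       # centered function, pi @ c = 0

# Poisson solution via the (truncated) series mu = sum_k P^k c;
# the terms decay geometrically because the chain is ergodic and c is centered
mu = np.zeros(3)
term = c.copy()
for _ in range(500):
    mu += term
    term = P @ term

residual = np.max(np.abs((mu - P @ mu) - c))
print(residual)  # ~ 0: mu - P mu = H - h holds
```

The centering by \(h\) is what makes the series converge; summing \(\Pi ^k {{\widetilde{H}}}\) without it would diverge.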

There exist a function \(V: {\mathcal {X}}\rightarrow [1,\infty )\), and a constant \(C<\infty \) such that for all \({\varvec{\theta }}, {\varvec{\theta }}'\in {\varvec{{\varvec{\Theta }}}}\),

$$\begin{aligned} \begin{aligned}&\Vert \Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({{\varvec{x}}})\Vert \le C V({{\varvec{x}}}),\\&\Vert \Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({{\varvec{x}}}) -\Pi _{{\varvec{\theta '}}}\mu _{{\varvec{\theta '}}}({{\varvec{x}}})\Vert \le C\Vert {\varvec{\theta }}-{\varvec{\theta }}'\Vert V({{\varvec{x}}}), \\&\mathbb {E}[V({{\varvec{x}}})]<\infty , \end{aligned} \end{aligned}$$

where \(\Vert \cdot \Vert \) denotes the standard Euclidean norm. Notably, only first-order smoothness is required. In contrast, the ergodicity theory of Mattingly et al. (2010) and Vollmer et al. (2016) relies on the much stronger fourth-order smoothness.

AWSGLD algorithm

Recall that under the mini-batch setting, the random-field function \({{\widetilde{H}}}({\varvec{\theta }},{{\varvec{x}}})=({{\widetilde{H}}}_1({\varvec{\theta }},{{\varvec{x}}}), \ldots , {{\widetilde{H}}}_m({\varvec{\theta }},{{\varvec{x}}}))\) of the algorithm is given by

$$\begin{aligned} {{\widetilde{H}}}_i({\varvec{\theta }},{{\varvec{x}}})={\theta }({{\tilde{J}}}({{\varvec{x}}}))\left( 1_{i\ge {{\tilde{J}}}({{\varvec{x}}})}-{\theta }(i)\right) , \quad i=1,2,\ldots ,m.\nonumber \\ \end{aligned}$$

The resulting algorithm is now described as follows:

  (i) SGLD step: sample \({\varvec{x}}_{k+1}={{\varvec{x}}}_k- \epsilon \nabla _{{{\varvec{x}}}} {{\widetilde{L}}}({{\varvec{x}}}_k, {\varvec{\theta }}_k)+{\mathcal {N}}({0, 2\epsilon \tau {\varvec{I}}})\);

  (ii) Parameter updating step: update \({\varvec{\theta }}_{k+1}={\varvec{\theta }}_{k}+\omega _{k+1} {{\widetilde{H}}}({\varvec{\theta }}_{k}, {\varvec{x}}_{k+1})\),

where \(\epsilon \) is the learning rate, \(\omega _{k+1}\) is the step size, and the adaptive gradient follows

$$\begin{aligned} \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}_k,{\varvec{\theta }}_k)= & {} \frac{N}{n} \left[ 1+ \frac{\zeta \tau }{\Delta u} \left( \log \theta _k({\tilde{J}}({{\varvec{x}}}_k))\right. \right. \nonumber \\&\left. \left. -\log \theta _k(({\tilde{J}}({{\varvec{x}}}_k)-1)\vee 1) \right) \right] \nabla _{{{\varvec{x}}}} {{\widetilde{U}}}({{\varvec{x}}}_k).\nonumber \\ \end{aligned}$$
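The two steps can be sketched on a one-dimensional toy problem. Everything below is an illustrative assumption: a double-well energy with full-batch gradients (so \(N/n=1\) and \({{\tilde{J}}}=J\)), a uniform energy partition, and hand-picked step sizes. It is a schematic sketch, not the paper's tuned algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def U(x):
    # toy full-data energy (double well); minibatch noise is omitted here
    return (x**2 - 1.0)**2

def gradU(x):
    return 4.0 * x * (x**2 - 1.0)

# energy partition: bin index J(x) in {1, ..., m} on a uniform grid of width du
m, u0, du = 20, 0.0, 0.5
def J(x):
    return int(np.clip(np.ceil((U(x) - u0) / du), 1, m))

tau, zeta, eps = 1.0, 1.0, 1e-3
theta = np.full(m, 1.0 / m)   # self-adapting weights theta_k
idx = np.arange(1, m + 1)
x = 0.0

for k in range(1, 50_001):
    j = J(x)
    # adaptive gradient multiplier from the log-difference of neighboring weights
    mult = 1.0 + zeta * tau / du * (np.log(theta[j - 1]) - np.log(theta[max(j - 2, 0)]))
    # (i) SGLD step with the weighted gradient
    x = x - eps * mult * gradU(x) + np.sqrt(2.0 * eps * tau) * rng.normal()
    # (ii) parameter update theta_{k+1} = theta_k + omega_{k+1} H(theta_k, x_{k+1})
    jn = J(x)
    H = theta[jn - 1] * ((idx >= jn).astype(float) - theta)
    theta = np.clip(theta + 10.0 / (k + 100.0) * H, 1e-10, 1.0)

# for zeta = 1 the weights should approximate cumulative bin probabilities of
# pi_tau: increasing in the bin index, with theta[-1] near 1
print(theta[0], theta[-1])
```

The clipping of \({\varvec{\theta }}\) to a compact set mirrors Assumption A1 below, which keeps the logarithms in the multiplier well defined.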

Convergence of self-adapting parameters

The convergence analysis rests on the following assumptions:

Assumption A1

(Compactness) \({\varvec{\Theta }}\) is a compact space in \((0, 1]^m\) with \(\inf _{{\varvec{\Theta }}}\theta (i) >0\) for any \(i\in \{1,2,\ldots ,m\}\).

To ease the proof, we make a slightly stronger assumption that \({\varvec{\theta }}_k\) is contained in a compact space such that \(\inf _{{\varvec{\Theta }}, i\in \{1,2,\ldots , m\}} \theta (i)>0\) holds. Admittedly, how to choose such a compact space is delicate. To relax this assumption, we refer interested readers to Fort et al. (2015), where, for a similar algorithm, a recurrence property is established for the sequence \(\{{\varvec{\theta }}_k\}_{k\ge 1}\): the sequence visits a desired compact space often enough, which renders its almost-sure convergence.

Assumption A2

(Smoothness) \(U({\varvec{x}})\) is M-smooth; that is, there exists a constant \(M>0\) such that for any \({{\varvec{x}}}, {{\varvec{x}}}'\in {\mathcal {X}}\),

$$\begin{aligned} \begin{aligned} \Vert \nabla _{{{\varvec{x}}}} U({{\varvec{x}}})-\nabla _{{{\varvec{x}}}} U({\varvec{{{\varvec{x}}}}}')\Vert&\le M\Vert {{\varvec{x}}}-{{\varvec{x}}}'\Vert . \end{aligned} \end{aligned}$$

The smoothness of \(\nabla _{{{\varvec{x}}}} U({{\varvec{x}}})\) is a standard assumption in studying the convergence of SGLD.

Assumption A3

(Dissipativity) There exist constants \({\tilde{m}}>0\) and \({\tilde{b}}\ge 0\) such that for any \({{\varvec{x}}}\in {\mathcal {X}}\) and \({\varvec{\theta }}\in {\varvec{\Theta }}\),

$$\begin{aligned} \begin{aligned}&\langle \nabla _{{{\varvec{x}}}} L({{\varvec{x}}}, {\varvec{\theta }}), {{\varvec{x}}}\rangle \ge {\tilde{m}}\Vert {{\varvec{x}}}\Vert ^2-{\tilde{b}}. \end{aligned} \end{aligned}$$

This assumption has been widely used in establishing the geometric ergodicity of dynamical systems (Mattingly et al. 2002; Raginsky et al. 2017; Xu et al. 2018). It ensures that the sampler drifts toward the origin regardless of the position of the current point.
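As a concrete check, the condition can be verified for the toy energy \(U(x)=(x^2-1)^2\), written in the equivalent drift form \(\langle -\nabla U(x), x\rangle \le {\tilde{b}}-{\tilde{m}}\Vert x\Vert ^2\) (the drift of the sampler is \(-\nabla U\)). The constants \({\tilde{m}}=1\) and \({\tilde{b}}=25/16\) were derived by hand for this toy example; they are not constants from the paper.

```python
import numpy as np

def gradU(x):
    # gradient of the toy double-well U(x) = (x^2 - 1)^2
    return 4.0 * x * (x**2 - 1.0)

m_tilde, b_tilde = 1.0, 25.0 / 16.0  # hand-derived candidate constants

xs = np.linspace(-50.0, 50.0, 2_000_001)
lhs = -gradU(xs) * xs                # <-grad U(x), x>: inner product with the drift
rhs = b_tilde - m_tilde * xs**2
print(np.all(lhs <= rhs + 1e-9))     # the drift pushes back toward the origin
```

Algebraically, \(rhs-lhs=4x^4-5x^2+25/16=(2x^2-5/4)^2\ge 0\), with equality at \(x^2=5/8\), so the chosen constants are tight.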

Assumption A4

(Gradient noise) The stochastic gradient is unbiased; that is,

$$\begin{aligned} {\mathbb {E}[\nabla _{{{\varvec{x}}}}{{\widetilde{U}}}({{\varvec{x}}}_{k})-\nabla _{{{\varvec{x}}}} U({{\varvec{x}}}_{k})]=0,} \end{aligned}$$

In addition, there exist constants \(M>0\) and \(B>0\) such that the second moment of the stochastic gradient is bounded by

$$\begin{aligned} \mathbb {E}[\Vert \nabla _{{{\varvec{x}}}}{{\widetilde{L}}}({{\varvec{x}}}_{k}, {\varvec{\theta }}_{k})-\nabla _{{{\varvec{x}}}} L({{\varvec{x}}}_{k}, {\varvec{\theta }}_{k})\Vert ^2]\le M^2 \Vert {{\varvec{x}}}_k\Vert ^2+B^2, \end{aligned}$$

where the expectation \(\mathbb {E}[\cdot ]\) is taken with respect to the distribution of the gradient noise.

The gradient noise assumption has been used in Raginsky et al. (2017) and Xu et al. (2018).

Lemma 1 establishes a stability condition for AWSGLD, which describes the dynamics of the mean-field system and implies a potential convergence of \({\varvec{\theta }}_k\).

Lemma 1

(Stability) Suppose Assumptions A1–A4 hold, the learning rate \(\epsilon \) is small enough, the minibatch size n is large enough, and an appropriate partition of the sample space is used such that \(\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}=o\left( \int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\right) \) holds for \(i\ge 2\). Then, for all \({\varvec{\theta }}\in {\varvec{\Theta }}\) if \(\zeta =1\), or for \({\varvec{\theta }}\in {{\widetilde{{\varvec{\Theta }}}}}\) (see footnote 6) otherwise, the mean-field function \(h({\varvec{\theta }})\) satisfies \(\langle h({\varvec{\theta }}), {\varvec{\theta }}-{{\widehat{{\varvec{\theta }}}}}_{\star }\rangle \lesssim -\phi \Vert {\varvec{\theta }}-{{\widehat{{\varvec{\theta }}}}}_{\star }\Vert ^2\), where \(\phi >0\) denotes a constant, \({{\widehat{{\varvec{\theta }}}}}_{\star }={\varvec{\theta }}_{\star }+{\mathcal {O}}\left( \sup _{{{\varvec{x}}}}\mathrm {Var}(\xi _n({{\varvec{x}}}))+\epsilon +\frac{1}{m}\right) \), \(\mathrm {Var}(\xi _n(\cdot ))\) denotes the variance of the noise \(\xi _n\) of \({{\widetilde{U}}}(\cdot )\) based on batch size n, and \({\varvec{\theta }}_{\star }\) is a solution to the ideal mean-field system without perturbations

$$\begin{aligned} \theta _{\star }(i)= & {} \left( \bigg (1+o(1_{\zeta \ne 1})\bigg )\frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right) ^{\frac{1}{\zeta }}\nonumber \\&\quad \text { for } i\in \{1,2,\cdots , m\} \text { and }{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}:=\sum _{i=1}^m \frac{\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}}{\theta _{\star }(i)^{\zeta -1}}.\nonumber \\ \end{aligned}$$
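Before the proof, the fixed-point formula can be verified numerically for \(\zeta =1\) on a toy discrete partition, where bin masses play the role of \(\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\). The Dirichlet-sampled masses are an illustrative assumption; the check computes the mean field exactly under the weighted measure.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy setup: m energy bins with bin masses p[i] under pi_tau (assumed)
m = 5
p = rng.dirichlet(np.ones(m))
zeta = 1.0

# candidate fixed point: for zeta = 1, Z_tilde = sum(p) = 1, so theta_star
# is the cumulative bin probability
theta_star = np.cumsum(p) ** (1.0 / zeta)

def mean_field(theta):
    # exact mean field under varpi_theta propto pi_tau / theta(J)^zeta
    w = p / theta**zeta
    w = w / w.sum()
    h = np.empty(m)
    for i in range(m):
        # H_i(theta, x) = theta(J(x)) * (1_{i >= J(x)} - theta(i)), bin-averaged
        h[i] = np.sum(w * theta * ((np.arange(m) <= i).astype(float) - theta[i]))
    return h

print(np.max(np.abs(mean_field(theta_star))))  # ~ 0: theta_star solves h = 0
```

The cancellation is exactly the \(\text {I}_1=0\) computation in the proof below, specialized to a discrete state space.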


We first denote the invariant measure simulated from (8) by \(\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}})\propto \frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\), where \(\Psi _{{\varvec{\theta }}}\) is defined in (4). We also denote the invariant measure simulated from the corresponding continuous-time diffusion process by \(\varpi _{{\varvec{\theta }}}({{\varvec{x}}})\). In addition, we define a piecewise constant function \({\widetilde{\Psi }}_{{\varvec{\theta }}}(u)\) and the associated theoretical measure \(\varpi _{{\widetilde{\Psi }}_{{\varvec{\theta }}}}(x)\) as follows:

$$\begin{aligned} {\widetilde{\Psi }}_{{\varvec{\theta }}}(u)=\sum _{i=1}^m \theta (i) 1_{u_{i-1} < u \le u_{i}}, \quad \varpi _{{\widetilde{\Psi }}_{{\varvec{\theta }}}}({{\varvec{x}}}) \propto \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(J({{\varvec{x}}}))}. \end{aligned}$$

For each \(i \in \{1,2,\ldots ,m\}\), the random field \({{\widetilde{H}}}_i({\varvec{\theta }},{{\varvec{x}}})={\theta }({{\tilde{J}}}({{\varvec{x}}}))\left( 1_{i\ge {{\tilde{J}}}({{\varvec{x}}})}-{\theta }(i)\right) \) is a biased estimator of \( H_i({\varvec{\theta }},{{\varvec{x}}})={\theta }( J({{\varvec{x}}}))\left( 1_{i\ge J({{\varvec{x}}})}-{\theta }(i)\right) \). By Lemma 4, we have

$$\begin{aligned} \mathbb {E}[{\widetilde{H}}_i({\varvec{\theta }},{{\varvec{x}}})-H_i({\varvec{\theta }},{{\varvec{x}}})] ={\mathcal {O}}(\mathrm {Var}(\xi _n({{\varvec{x}}}))), \end{aligned}$$

which is caused by evaluating the energy with a mini-batch of data of size n. We now compute the mean-field function \(h({\varvec{\theta }})\) with respect to the probability measure \(\varpi _{{\varvec{\theta }}}({{\varvec{x}}})\) simulated from the diffusion process:

$$\begin{aligned} \begin{aligned} h_i({\varvec{\theta }})&=\int _{{\mathcal {X}}} {{\widetilde{H}}}_i({\varvec{\theta }},{{\varvec{x}}}) \varpi _{{\varvec{\theta }}}({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\\&=\int _{{\mathcal {X}}} H_i({\varvec{\theta }},{{\varvec{x}}}) \varpi _{{\varvec{\theta }}}({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}+{\mathcal {O}}(\mathrm {Var}(\xi _n({{\varvec{x}}})))\\&= \int _{{\mathcal {X}}} H_i({\varvec{\theta }},{{\varvec{x}}}) \bigg ( \underbrace{\varpi _{{\widetilde{\Psi }}_{\varvec{\theta }}}({{\varvec{x}}})}_{\text {I}_1} \underbrace{-\varpi _{{\widetilde{\Psi }}_{\varvec{\theta }}}({{\varvec{x}}}) +\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}})}_{\text {I}_2: {{\mathrm{piecewise~approximation}}}}\\&\qquad \underbrace{-\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}}) +\varpi _{{\varvec{\theta }}}({{\varvec{x}}})}_{\text {I}_3: {{{\mathrm{discretization}}}}}\bigg ) \hbox {d}{{\varvec{x}}}+{\mathcal {O}}(\mathrm {Var}(\xi _n({{\varvec{x}}}))). \end{aligned} \end{aligned}$$

For the first term \(\text {I}_1\): we first denote \({{\widetilde{Z}}}_{{\varvec{\theta }}}:=\sum _{i=1}^m \frac{\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}}{\theta (i)^{\zeta -1}}\) and have that

$$\begin{aligned} \begin{aligned}&\int _{{\mathcal {X}}} H_i({\varvec{\theta }},{{\varvec{x}}}) \varpi _{{\widetilde{\Psi }}_{\varvec{\theta }}}({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\\&\quad =\frac{1}{Z_{{\varvec{\theta }}}} \int _{{\mathcal {X}}} {\theta }(J({{\varvec{x}}}))\left( 1_{i\ge J({{\varvec{x}}})}-{\theta }(i)\right) \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(J({{\varvec{x}}}))} \hbox {d}{{\varvec{x}}}\\&\quad =\frac{1}{Z_{{\varvec{\theta }}}}\left[ \sum _{k=1}^m \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta -1}(k)} 1_{k\le i} \hbox {d}{{\varvec{x}}}\right. \\&\qquad \left. -\theta (i)\sum _{k=1}^m\int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta -1}(k)}\hbox {d}{{\varvec{x}}}\right] \\&\quad =\frac{1}{Z_{{\varvec{\theta }}}}\left[ \sum _{k=1}^i \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta -1}(k)} \hbox {d}{{\varvec{x}}}-\theta (i){{\widetilde{Z}}}_{{\varvec{\theta }}} \right] =0. \end{aligned} \end{aligned}$$

Here, \(Z_{{\varvec{\theta }}}=\sum _{i=1}^m \frac{\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}}{\theta (i)^{\zeta }}\) denotes the normalizing constant of \(\varpi _{{\widetilde{\Psi }}_{\varvec{\theta }}}({{\varvec{x}}})\). According to Lemma 5, the mean-field system (26) has a solution that satisfies

$$\begin{aligned} \theta _{\star }(i)= & {} \left( \bigg (1+o(1_{\zeta \ne 1})\bigg )\frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right) ^{\frac{1}{\zeta }}\\&\quad \text {for } i\in \{1,2,\cdots , m\}. \end{aligned}$$

For the term \(\text {I}_2\), by Lemma 3 and the boundedness of \(H({\varvec{\theta }},{{\varvec{x}}})\), we have

$$\begin{aligned} \int _{{\mathcal {X}}} H_i({\varvec{\theta }},{{\varvec{x}}}) (-\varpi _{{\widetilde{\Psi }}_{{\varvec{\theta }}}}({{\varvec{x}}})+\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}})) \hbox {d}{{\varvec{x}}}= {\mathcal {O}}\left( \frac{1}{m}\right) . \end{aligned}$$

For the term \(\text {I}_3\), it follows that

$$\begin{aligned} \int _{{\mathcal {X}}} H_i({\varvec{\theta }},{{\varvec{x}}}) \left( -\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}})+\varpi _{{\varvec{\theta }}}({{\varvec{x}}})\right) \hbox {d}{{\varvec{x}}}={\mathcal {O}}(\epsilon ), \end{aligned}$$

where the order \({\mathcal {O}}(\epsilon )\) follows from Theorem 6 of Sato and Nakagawa (2014), which quantifies the approximation error of SGLD for an integral with a bounded integrand.

Plugging (26), (27), and (28) into (25), for all \({\varvec{\theta }}\in {\varvec{\Theta }}\) if \(\zeta =1\), or for a properly initialized \({\varvec{\theta }}\in {{\widetilde{{\varvec{\Theta }}}}}\subset {\varvec{\Theta }}\) otherwise (owing to the local nature of the mean-field system), we have

$$\begin{aligned} h_i({\varvec{\theta }})\propto Z_{{\varvec{\theta }}}^{-1} \left[ \varepsilon \beta _i(\theta )+\theta _{\star }(i)-\theta (i)\right] , \end{aligned}$$

where \(\varepsilon ={\mathcal {O}}\left( \mathrm {Var}(\xi _n({{\varvec{x}}}))+\epsilon +\frac{1}{m}\right) \) and \(\beta _i(\theta )\) is a regular perturbation term that satisfies \(Z_{{\varvec{\theta }}}^{-1}\varepsilon \beta _i(\theta )={\mathcal {O}}\left( \mathrm {Var}(\xi _n({{\varvec{x}}}))\right. \left. +\epsilon +\frac{1}{m}\right) \).

To solve the ODE system with small disturbances, we apply standard techniques from perturbation theory (Weinhart et al. 2010) and study the stability, with respect to the mean-field system \(h({\varvec{\theta }})\), of the solution \({{\widehat{{\varvec{\theta }}}}}_{\star }\) of (29), which satisfies \(\varepsilon {\varvec{\beta }}({{\widehat{{\varvec{\theta }}}}}_{\star })+{\varvec{\theta }}_{\star }-{{\widehat{{\varvec{\theta }}}}}_{\star }=0\). It follows that

$$\begin{aligned} h_i({\varvec{\theta }})&\propto Z_{{\varvec{\theta }}}^{-1} \left[ \varepsilon \beta _i(\theta ) +\theta _{\star }(i)-\theta (i)\right] \nonumber \\&=Z_{{\varvec{\theta }}}^{-1} \left[ \varepsilon \beta _i(\theta )-\varepsilon \beta _i ({{\widehat{\theta }}}_{\star })+\varepsilon \beta _i({{\widehat{\theta }}}_{\star }) +\theta _{\star }(i)-\theta (i)\right] \nonumber \\&=Z_{{\varvec{\theta }}}^{-1} \left[ {\mathcal {O}}(\varepsilon )(\theta (i) -{{\widehat{\theta }}}_{\star }(i))+{{\widehat{\theta }}}_{\star }(i)-\theta (i)\right] \nonumber \\&=Z_{{\varvec{\theta }}}^{-1} \big (1-{\mathcal {O}}(\varepsilon )\big ) \left( {{\widehat{\theta }}}_{\star }(i)-\theta (i)\right) , \end{aligned}$$

where \({\varvec{\beta }}(\cdot )\) satisfies a smoothness condition (see Footnote 7). Next, we justify that \({{\widehat{{\varvec{\theta }}}}}_{\star }\) is an asymptotically stable equilibrium of the mean-field system. Considering the positive-definite Lyapunov function \({\mathbb {V}}({\varvec{\theta }})=\frac{1}{2}\Vert {{\widehat{{\varvec{\theta }}}}}_{\star }-{\varvec{\theta }}\Vert ^2\) for the mean-field system \(h({\varvec{\theta }})=Z_{{\varvec{\theta }}}^{-1} \big (1-{\mathcal {O}}(\varepsilon )\big )\left( {{\widehat{{\varvec{\theta }}}}}_{\star }-{\varvec{\theta }}\right) \), we have

$$\begin{aligned} \begin{aligned} \langle h({\varvec{\theta }}), \nabla {\mathbb {V}}({\varvec{\theta }})\rangle&\propto -Z_{{\varvec{\theta }}}^{-1} \big (1-{\mathcal {O}}(\varepsilon )\big )\Vert {\varvec{\theta }}- {{\widehat{{\varvec{\theta }}}}}_{\star }\Vert ^2\\&\le -\phi \Vert {\varvec{\theta }}- {{\widehat{{\varvec{\theta }}}}}_{\star }\Vert ^2, \end{aligned} \end{aligned}$$

where \(\phi =\inf _{{\varvec{\theta }}} Z_{{\varvec{\theta }}}^{-1}\big (1-{\mathcal {O}}(\varepsilon )\big )\). By Assumption A1 and a small enough \(\varepsilon \), we have \(\phi >0\), which implies that \({{\widehat{{\varvec{\theta }}}}}_{\star }\) is an asymptotically stable equilibrium of the mean-field system \(h({\varvec{\theta }})\). \(\square \)
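A minimal simulation of the Euler-discretized mean-field flow \(\hbox {d}{\varvec{\theta }}/\hbox {d}t = h({\varvec{\theta }})\) illustrates this contraction; the equilibrium values and the constant standing in for \(\phi \) below are arbitrary placeholders:

```python
# Sketch: Euler-discretized mean-field flow d theta/dt ∝ (theta_hat_star - theta),
# illustrating that the Lyapunov function V = 0.5 * ||theta_hat_star - theta||^2
# decreases monotonically along the flow (toy equilibrium, hypothetical values).
theta_hat_star = [0.6, 0.85, 0.95, 1.0]
theta = [0.1, 0.2, 0.3, 0.4]           # arbitrary initialization
dt, phi = 0.1, 1.0                     # phi stands in for inf_theta Z^{-1}(1 - O(eps))

def lyapunov(th):
    return 0.5 * sum((s - t) ** 2 for s, t in zip(theta_hat_star, th))

values = [lyapunov(theta)]
for _ in range(200):
    theta = [t + dt * phi * (s - t) for s, t in zip(theta_hat_star, theta)]
    values.append(lyapunov(theta))

assert all(b <= a for a, b in zip(values, values[1:]))  # V is non-increasing
print(values[-1])  # ~0: theta has contracted to theta_hat_star
```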

Assumption A5

(Step size) The step size \(\{\omega _{k}\}_{k\in {\mathbb {N}}}\) is a positive, decreasing sequence of real numbers such that

$$\begin{aligned} \omega _{k}\rightarrow 0, \ \ \sum _{k=1}^{\infty } \omega _{k}=&+\infty ,\ \ \liminf _{k\rightarrow \infty } 2\phi \dfrac{\omega _{k}}{\omega _{k+1}}\nonumber \\&+\dfrac{\omega _{k+1}-\omega _{k}}{\omega ^2_{k+1}}>0. \end{aligned}$$

According to Benveniste et al. (1990), we can choose \(\omega _{k}:=\frac{A}{k^{\alpha }+B}\) for some \(\alpha \in (\frac{1}{2}, 1]\) and some suitable constants \(A>0\) and \(B>0\).
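A quick numerical sanity check of the A5 conditions for this choice of step size (the constants below are illustrative, not the paper's experimental settings):

```python
# Numerical sanity check of the A5 step-size conditions for
# omega_k = A / (k^alpha + B); the constants below are illustrative only.
A, B, alpha, phi = 1.0, 10.0, 0.75, 1.0

def omega(k):
    return A / (k ** alpha + B)

# Evaluate 2*phi*omega_k/omega_{k+1} + (omega_{k+1} - omega_k)/omega_{k+1}^2,
# whose liminf must be positive.
terms = [2 * phi * omega(k) / omega(k + 1)
         + (omega(k + 1) - omega(k)) / omega(k + 1) ** 2
         for k in range(1, 20001)]
print(min(terms))  # bounded away from 0 (the tail tends to 2*phi)
```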

To prove the convergence of the self-adapting parameters, we also need to establish regularity conditions on the solution of the Poisson equation, which control the perturbations caused by the state-dependent noise in stochastic approximation.

Lemma 2

(Solution of Poisson equation) Given Assumptions A1–A4 and a sufficiently small learning rate \(\epsilon \), there exists a solution \(\mu _{{\varvec{\theta }}}(\cdot )\) on \({\mathcal {X}}\) to the Poisson equation

$$\begin{aligned} \mu _{{\varvec{\theta }}}({\varvec{x}})-\Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({\varvec{x}}) ={{\widetilde{H}}}({\varvec{\theta }}, {\varvec{x}})-h({\varvec{\theta }}). \end{aligned}$$

In addition, there exists a constant \(C>0\) such that for all \({\varvec{\theta }}, {\varvec{\theta }}'\in {\varvec{{\varvec{\Theta }}}}\),

$$\begin{aligned} \begin{aligned} \mathbb {E}[\Vert \Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({{\varvec{x}}})\Vert ]&\le C,\\ \mathbb {E}[\Vert \Pi _{{\varvec{\theta }}}\mu _{{\varvec{\theta }}}({{\varvec{x}}})-\Pi _{{\varvec{\theta }}'} \mu _{{\varvec{\theta '}}}({{\varvec{x}}})\Vert ]&\le C\Vert {\varvec{\theta }}-{\varvec{\theta }}'\Vert . \end{aligned} \end{aligned}$$


The proof hinges on verifying the drift (DRI) conditions (Section 6, Andrieu et al. (2005)), which guarantee the existence and the regularity of the solution of the Poisson equation and thereby control the perturbations.

(DRI) For any \({\varvec{\theta }}\in {\varvec{\Theta }}\), the smoothness assumption A2 implies that \(U({{\varvec{x}}})\) is continuously differentiable almost everywhere. Combining the dissipative assumption A3 with the proof of Theorem 2.1 (Roberts and Tweedie 1996), we can show that the discrete dynamical system is irreducible and aperiodic. Moreover, there exists a Lyapunov function \(V({{\varvec{x}}})=1+\Vert {{\varvec{x}}}\Vert ^2\) such that for any compact subset \(\mathcal {{\varvec{K}}}\subset {\varvec{\Theta }}\), we can verify the three drift conditions as follows:

(DRI1) By Corollary 7.5 (Mattingly et al. 2002), a small enough learning rate \(\epsilon \), the smoothness assumption A2, and the dissipative assumption A3 imply that AWSGLD satisfies the minorization condition. That is, there exist a constant \(\eta >0\), a probability measure \(\nu (\cdot )\), and a set \({\mathcal {C}}\) with \(\nu ({\mathcal {C}})=1\) such that the following inequality holds:

$$\begin{aligned} P_{{\varvec{\theta }}\in \mathcal {{{\varvec{K}}}}}(x, A)\ge \eta \nu (A)\quad \forall A\in {\mathcal {X}}, {{\varvec{x}}}\in {\mathcal {C}}, \end{aligned}$$

where \(P_{{\varvec{\theta }}}({{\varvec{x}}}, {{\varvec{y}}}):=\frac{1}{2\sqrt{(4\pi \epsilon )^{d/2}}}\mathbb {E}\big [e^{-\frac{\Vert {{\varvec{y}}}-{{\varvec{x}}}+\epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }})\Vert ^2}{4\epsilon }}|{{\varvec{x}}}\big ]\) denotes the transition kernel of the Markov process produced by AWSGLD given the parameter \({\varvec{\theta }}\) and a learning rate \(\epsilon \), and the expectation is taken over the stochastic noise of the adaptive gradient \(\nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }})\) (19). Combined with Assumption A4, the uniform \(L^2\) upper bound for any given \({\varvec{\theta }}\in {\varvec{\Theta }}\) follows from Lemma 3.2 (Raginsky et al. 2017). It then follows from Theorem 7.2 (Mattingly et al. 2002) that there exist \({{\tilde{\alpha }}}\in (0,1)\) and \({{\tilde{\beta }}}\ge 0\) such that

$$\begin{aligned} P_{{\varvec{\theta }}\in \mathcal {{{\varvec{K}}}}}V({{\varvec{x}}})\le {{\tilde{\alpha }}} V({{\varvec{x}}})+{{\tilde{\beta }}}. \end{aligned}$$

By the definition of \(V({{\varvec{x}}})=1+\Vert {{\varvec{x}}}\Vert ^2\), there exists a constant \(\kappa ={{\tilde{\alpha }}}+{{\tilde{\beta }}}\) such that

$$\begin{aligned} P_{{\varvec{\theta }}\in \mathcal {{{\varvec{K}}}}}V({{\varvec{x}}})\le \kappa V({{\varvec{x}}}). \end{aligned}$$

Having verified conditions (I), (II), and (III), we have proved the drift condition (DRI1).

(DRI2) Next we proceed to verify the boundedness and Lipschitz conditions on the random-field function \({{\widetilde{H}}}({\varvec{\theta }},{{\varvec{x}}})\), where each subcomponent is defined as \({{\widetilde{H}}}_i({\varvec{\theta }},{{\varvec{x}}})={\theta }({{\tilde{J}}}({{\varvec{x}}}))\left( 1_{i\ge {{\tilde{J}}}({{\varvec{x}}})}-{\theta }(i)\right) \). By the compactness Assumption A1 and the definition of \(V({{\varvec{x}}})=1+\Vert {{\varvec{x}}}\Vert ^2\), it is clear that

$$\begin{aligned} \sup _{{\varvec{\theta }}\in \mathcal {{{\varvec{K}}}}\subset (0, 1]^m}\Vert H({\varvec{\theta }}, {{\varvec{x}}})\Vert \le m V({{\varvec{x}}}). \end{aligned}$$

For any \({\varvec{\theta }}_1, {\varvec{\theta }}_2\in \mathcal {{{\varvec{K}}}}\) and a fixed \({{\varvec{x}}}\in {\mathcal {X}}\), it suffices to verify only the i-th index that maximizes \(|\theta _1(j)-\theta _2(j)|\) for \(j\in \{1,2,\ldots ,m\}\), i.e., \(i=\arg \max _j |\theta _1(j)-\theta _2(j)|\). Therefore,

$$\begin{aligned} \begin{aligned}&|{{\widetilde{H}}}_i({\varvec{\theta }}_1,{{\varvec{x}}})- {{\widetilde{H}}}_i({\varvec{\theta }}_2,{{\varvec{x}}})|\\&\quad =\Big |{\theta _1} ({{\tilde{J}}}({{\varvec{x}}}))\left( 1_{i\ge {{\tilde{J}}}({{\varvec{x}}})}-{\theta _1}(i)\right) \\&\qquad -{\theta _2} ({{\tilde{J}}}({{\varvec{x}}}))\left( 1_{i\ge {{\tilde{J}}}({{\varvec{x}}})}-{\theta _2}(i)\right) \Big |\\&\quad \le |{\theta _1} ({{\tilde{J}}}({{\varvec{x}}}))-{\theta _2} ({{\tilde{J}}}({{\varvec{x}}}))|\\&\qquad +|{\theta _1} ({{\tilde{J}}}({{\varvec{x}}})){\theta _1}(i)-{\theta _2} ({{\tilde{J}}}({{\varvec{x}}})){\theta _2}(i)|\\&\quad \le \max _{j}\Big (|{\theta _1} (j)-{\theta _2} (j)|+{\theta _1} (j)|{\theta _1}(i)-{\theta _2}(i)|\\&\qquad +|{\theta _1} (j)-{\theta _2} (j)|\theta _2(i)\Big )\\&\quad \le 3|\theta _1(i)-\theta _2(i)|, \\ \end{aligned} \end{aligned}$$

where the last inequality holds as \(\theta _1(j), \theta _2(i)\in (0, 1]\).
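The bound above is scale-free and can be spot-checked numerically; the partition size \(m=8\) and the random draws of \({\varvec{\theta }}\) below are arbitrary:

```python
import random

# Spot-check of the Lipschitz bound
#   |H_i(theta1, x) - H_i(theta2, x)| <= 3 * max_j |theta1(j) - theta2(j)|,
# where H_i(theta, x) = theta(J(x)) * (1_{i >= J(x)} - theta(i)).
# The partition size m and the random draws are arbitrary.
random.seed(0)
m = 8

def H(theta, J, i):
    return theta[J] * ((1.0 if i >= J else 0.0) - theta[i])

worst = 0.0
for _ in range(10000):
    t1 = [random.uniform(0.05, 1.0) for _ in range(m)]
    t2 = [random.uniform(0.05, 1.0) for _ in range(m)]
    J = random.randrange(m)
    gap = max(abs(a - b) for a, b in zip(t1, t2))
    for i in range(m):
        worst = max(worst, abs(H(t1, J, i) - H(t2, J, i)) / gap)
print(worst)  # never exceeds the Lipschitz constant 3
```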

(DRI3) Finally, we study the smoothness of the transition kernel \(P_{{\varvec{\theta }}}({{\varvec{x}}}, {{\varvec{y}}})\) with respect to \({\varvec{\theta }}\). For any \({\varvec{\theta }}_1, {\varvec{\theta }}_2\in \mathcal {{{\varvec{K}}}}\) and fixed \({{\varvec{x}}}\) and \({{\varvec{y}}}\), we have

$$\begin{aligned} \begin{aligned}&|P_{{\varvec{\theta }}_1}({{\varvec{x}}}, {{\varvec{y}}})-P_{{\varvec{\theta }}_2}({{\varvec{x}}}, {{\varvec{y}}})|\\&\quad =\Big |\frac{1}{2\sqrt{(4\pi \epsilon )^{d/2}}}\mathbb {E}\big [e^{-\frac{\Vert {{\varvec{y}}}-{{\varvec{x}}}+\epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }}_1)\Vert ^2}{4\epsilon }} |{{\varvec{x}}}\big ]\\&\qquad -\frac{1}{2\sqrt{(4\pi \epsilon )^{d/2}}}\mathbb {E}\big [e^{-\frac{\Vert {{\varvec{y}}}-{{\varvec{x}}}+\epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }}_2)\Vert ^2}{4\epsilon }}|{{\varvec{x}}}\big ]\Big |\\&\quad \lesssim |\Vert {{\varvec{y}}}-{{\varvec{x}}}+\epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }}_1)\Vert ^2 -\Vert {{\varvec{y}}}-{{\varvec{x}}}+\epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }}_2)\Vert ^2|\\&\quad \lesssim \Vert \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }}_1) - \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}, {\varvec{\theta }}_2)\Vert \\&\quad \lesssim \Vert {\varvec{\theta }}_1-{\varvec{\theta }}_2\Vert ,\\ \end{aligned} \end{aligned}$$

where the first inequality (up to a finite constant) follows by \(\Vert e^{{{\varvec{x}}}}-e^{{{\varvec{y}}}}\Vert \lesssim \Vert {{\varvec{x}}}-{{\varvec{y}}}\Vert \) for any \({{\varvec{x}}}\), \({{\varvec{y}}}\) in a compact space; the last inequality follows by the definition of the adaptive gradient (19) and \(\Vert \log ({{\varvec{x}}})-\log ({{\varvec{y}}})\Vert \lesssim \Vert {{\varvec{x}}}-{{\varvec{y}}}\Vert \) by the compactness assumption A1.

For \(f:{\mathcal {X}}\rightarrow {\mathbb {R}}^d\), define the norm \(\Vert f\Vert _V=\sup _{{{\varvec{x}}}\in {\mathcal {X}}} \frac{|f({{\varvec{x}}})|}{V({{\varvec{x}}})}\). Following the same technique of Liang et al. (2007) (page 319), we can verify the last drift condition

$$\begin{aligned}&\Vert P_{{\varvec{\theta }}_1}f-P_{{\varvec{\theta }}_2}f\Vert _V\le C\Vert f\Vert _V \Vert {\varvec{\theta }}_1-{\varvec{\theta }}_2\Vert , \\&\quad \forall f\in {\mathcal {L}}_V:=\{f: {\mathcal {X}}\rightarrow {\mathbb {R}}^d, \Vert f\Vert _V<\infty \}. \end{aligned}$$

Having verified conditions (I)–(VI), we have established the drift conditions (DRI) in Section 6 of Andrieu et al. (2005), which completes the proof. \(\square \)

Proof of Theorem 1

Combining the stability condition in Lemma 1 and the regularity of the solution of Poisson equation in Lemma 2, the convergence of \({\varvec{\theta }}_k\) can be derived directly following techniques in Theorem 24 (page 246) (Benveniste et al., 1990). \(\square \)

Technical Lemmas

Lemma 3

Suppose Assumption A1 holds, and \(u_1\) and \(u_{m-1}\) are fixed such that \(\Psi _{{\varvec{\theta }}}(u_1)>\nu \) and \(\Psi _{{\varvec{\theta }}}(u_{m-1})>1-\nu \) for some small constant \(\nu >0\). For any bounded function \(f({{\varvec{x}}})\), we have

$$\begin{aligned} \int _{{\mathcal {X}}} f({{\varvec{x}}})\left( \varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}}) -\varpi _{{{\widetilde{\Psi }}}_{{\varvec{\theta }}}}({{\varvec{x}}})\right) \hbox {d}{{\varvec{x}}}={\mathcal {O}}\left( \frac{1}{m}\right) . \end{aligned}$$


Recall that \(\varpi _{{{\widetilde{\Psi }}}_{{\varvec{\theta }}}}({{\varvec{x}}})= \frac{1}{Z_{{\varvec{\theta }}}} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(J({{\varvec{x}}}))}\) and \(\varpi _{\Psi _{{\varvec{\theta }}}}({{\varvec{x}}})=\frac{1}{Z_{\Psi _{{\varvec{\theta }}}}}\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\). Since \(f({{\varvec{x}}})\) is bounded, it suffices to show

$$\begin{aligned} \begin{aligned}&\int _{{\mathcal {X}}} \frac{1}{ Z_{{\varvec{\theta }}}} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(J({{\varvec{x}}}))} -\frac{1}{Z_{\Psi _{{\varvec{\theta }}}}}\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))} \hbox {d}{{\varvec{x}}}\\&\quad \le \int _{{\mathcal {X}}} \left| \frac{1}{ Z_{{\varvec{\theta }}}} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(J({{\varvec{x}}}))} -\frac{1}{ Z_{{\varvec{\theta }}}}\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}} (U({{\varvec{x}}}))}\right| \hbox {d}{{\varvec{x}}}\\&\qquad +\int _{{\mathcal {X}}}\left| \frac{1}{ Z_{{\varvec{\theta }}}} \frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))} -\frac{1}{Z_{\Psi _{{\varvec{\theta }}}}}\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}} (U({{\varvec{x}}}))}\right| \hbox {d}{{\varvec{x}}}\\&\quad =\underbrace{\frac{1}{ Z_{{\varvec{\theta }}}}\sum _{i=1}^m \int _{{\mathcal {X}}_i} \left| \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(i)}-\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\right| \hbox {d}{{\varvec{x}}}}_{\text {I}_1}\\&\qquad +\underbrace{\sum _{i=1}^m\left| \frac{1}{ Z_{{\varvec{\theta }}}} -\frac{1}{Z_{\Psi _{{\varvec{\theta }}}}}\right| \int _{{\mathcal {X}}_i}\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))} \hbox {d}{{\varvec{x}}}}_{\text {I}_2}\\&\quad ={\mathcal {O}}\left( \frac{1}{m}\right) , \end{aligned} \end{aligned}$$

where \( Z_{{\varvec{\theta }}}{=}\sum _{i=1}^m \int _{{\mathcal {X}}_i} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta (i)^{\zeta }}\hbox {d}{{\varvec{x}}}\), \(Z_{\Psi _{{\varvec{\theta }}}}{=}\sum _{i=1}^{m}\int _{{\mathcal {X}}_i} \frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\hbox {d}{{\varvec{x}}}\), and \(\Psi _{{\varvec{\theta }}}(u)\) is a piecewise continuous function defined in (4).

By Assumption A1, \(\inf _{{\varvec{\Theta }}}\theta (i)>0\) for any i. Further, by the mean-value theorem, which implies \(|x^{\zeta }-y^{\zeta }|\lesssim |x-y| z^{\zeta }\) for any \(\zeta >0, x\le y\) and \(z\in [x, y]\subset [u_1, \infty )\), we have

$$\begin{aligned} \begin{aligned} \text {I}_1&=\frac{1}{ Z_{{\varvec{\theta }}}}\sum _{i=1}^m \int _{{\mathcal {X}}_i} \left| \frac{\theta ^{\zeta }(i)-\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}{\theta ^{\zeta }(i) \Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\right| \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\\&\lesssim \frac{1}{ Z_{{\varvec{\theta }}}}\sum _{i=1}^m \int _{{\mathcal {X}}_i} \frac{|\Psi _{{\varvec{\theta }}}(u_{i-1})-\Psi _{{\varvec{\theta }}}(u_i)|}{\theta ^{\zeta }(i)}\pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\\&\le \max _i |\Psi _{{\varvec{\theta }}}(u_{i}-\Delta u)-\Psi _{{\varvec{\theta }}}(u_i)| \frac{1}{ Z_{{\varvec{\theta }}}}\sum _{i=1}^m \int _{{\mathcal {X}}_i} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(i)}\hbox {d}{{\varvec{x}}}\\&=\max _i |\Psi _{{\varvec{\theta }}}(u_{i} -\Delta u)-\Psi _{{\varvec{\theta }}}(u_i)|\lesssim \Delta u={\mathcal {O}}\left( \frac{1}{m}\right) , \end{aligned} \end{aligned}$$

where the last inequality follows by Taylor expansion, and the last equality follows as \(u_1\) and \(u_{m-1}\) are fixed. Similarly, we have

$$\begin{aligned} \text {I}_2&= \left| \frac{1}{ Z_{{\varvec{\theta }}}}-\frac{1}{Z_{\Psi _{{\varvec{\theta }}}}} \right| Z_{\Psi _{{\varvec{\theta }}}}=\frac{ |Z_{\Psi _{{\varvec{\theta }}}}- Z_{{\varvec{\theta }}}|}{ Z_{{\varvec{\theta }}}}\nonumber \\&\le \frac{1}{ Z_{{\varvec{\theta }}}}\sum _{i=1}^m \int _{{\mathcal {X}}_i} \left| \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta }(i)}-\frac{\pi _{\tau }({{\varvec{x}}})}{\Psi ^{\zeta }_{{\varvec{\theta }}}(U({{\varvec{x}}}))}\right| \hbox {d}{{\varvec{x}}}=\text {I}_1={\mathcal {O}}\left( \frac{1}{m}\right) .\nonumber \\ \end{aligned}$$

The proof can then be concluded by combining the orders of \(\text {I}_1\) and \(\text {I}_2\). \(\square \)
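A toy one-dimensional experiment, with \(U(x)=x\) on \([0,1]\), \(\zeta =1\), and an illustrative smooth weight function (none of which come from the paper), shows the \({\mathcal {O}}(1/m)\) decay of the discrepancy between the piecewise-constant and continuously interpolated normalizations:

```python
import math

# Toy 1-D check of Lemma 3: the discrepancy between normalizing pi by the
# piecewise-constant theta(J(x)) and by its continuous interpolation
# Psi(U(x)) decays like O(1/m). The setup (U(x) = x on [0,1], zeta = 1,
# theta(u) = 0.2 + 0.8u) is illustrative only.
def discrepancy(m, n=20000):
    xs = [(j + 0.5) / n for j in range(n)]
    pi = [math.exp(-x) for x in xs]                 # unnormalized pi_tau
    theta = lambda u: 0.2 + 0.8 * u                 # smooth positive weight
    # piecewise-constant weight theta(J(x)) vs. continuous weight Psi(U(x))
    pc = [pi[j] / theta(math.ceil(xs[j] * m) / m) for j in range(n)]
    ct = [pi[j] / theta(xs[j]) for j in range(n)]
    Zp, Zc = sum(pc), sum(ct)
    return sum(abs(a / Zp - b / Zc) for a, b in zip(pc, ct))

d1, d2 = discrepancy(10), discrepancy(20)
print(d1, d2)  # halving the bin width roughly halves the gap
```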

Lemma 4

Stochastic approximation leads to a small bias depending on the variance of the stochastic energy estimator \({{\widetilde{U}}}(\cdot )\). For each component \(i\in \{1,2,\cdots , m\}\), we have

$$\begin{aligned} |\mathbb {E}[{{\widetilde{H}}}_i({\varvec{\theta }},{{\varvec{x}}})]- H_i({\varvec{\theta }},{{\varvec{x}}})|={\mathcal {O}}\left( \mathrm {Var}(\xi _n({{\varvec{x}}}))\right) , \end{aligned}$$

where \(\mathbb {E}[\cdot ]\) denotes the expectation with respect to the random noise \(\xi _n(\cdot )\) defined by \({{\widetilde{U}}}(\cdot )- U(\cdot )\).


Note that \(H_i({\varvec{\theta }},{{\varvec{x}}})\) can be interpreted as a nonlinear transformation \(\Phi \) that maps \( U({{\varvec{x}}})\) to (0, 1]. Accordingly, we can upper bound the bias in mini-batch settings as follows:

$$\begin{aligned}&|\mathbb {E}[{{\widetilde{H}}}_i({\varvec{\theta }},{{\varvec{x}}})]- H_i({\varvec{\theta }},{{\varvec{x}}})|\nonumber \\&\quad =\left| \int \Phi (U({{\varvec{x}}})+\xi _n({{\varvec{x}}}))-\Phi (U({{\varvec{x}}}))\hbox {d}\mu (\xi _n({{\varvec{x}}}))\right| \nonumber \\&\quad =\left| \int \xi _n({{\varvec{x}}}) \Phi '(U({{\varvec{x}}}))+\frac{\xi _n({{\varvec{x}}})^2}{2} \Phi ''(u) \hbox {d}\mu (\xi _n({{\varvec{x}}}))\right| \nonumber \\&\quad \le \left| \int \xi _n({{\varvec{x}}}) \Phi '(U({{\varvec{x}}}))\hbox {d}\mu (\xi _n({{\varvec{x}}}))\right| \nonumber \\&\qquad +\left| \frac{\Phi ''(u)}{2}\int \xi _n({{\varvec{x}}})^2 \hbox {d}\mu (\xi _n({{\varvec{x}}}))\right| \nonumber \\&\quad \le \sqrt{\int \xi ^2_n({{\varvec{x}}}) \hbox {d}\mu (\xi _n({{\varvec{x}}}))\int \Phi '(U({{\varvec{x}}}))^{2}\hbox {d}\mu (\xi _n({{\varvec{x}}}))}\nonumber \\&\qquad +\left| \frac{\Phi ''(u)}{2}\int \xi _n({{\varvec{x}}})^2 \hbox {d}\mu (\xi _n({{\varvec{x}}}))\right| \nonumber \\&\quad = {\mathcal {O}}\left( \mathrm {Var}(\xi _n({{\varvec{x}}}))\right) , \end{aligned}$$

where \(\xi _n({{\varvec{x}}})\) denotes the stochastic noise associated with the energy estimator with mean 0 and \(\mu (\xi _n({{\varvec{x}}}))\) denotes the probability measure of \(\xi _n({{\varvec{x}}})\); Taylor expansion is considered for the second equality with u being a value between \(U({{\varvec{x}}})\) and \(U({{\varvec{x}}})+\xi _n({{\varvec{x}}})\); \(\Phi '(U({{\varvec{x}}}))={\mathcal {O}}( \frac{\theta (J({{\varvec{x}}}))-\theta (J({{\varvec{x}}})-1)}{\Delta u})\) by definition and is bounded due to the compactness of \(\Theta \); similar properties also hold for \(\Phi ''(\cdot )\). The last inequality follows by Cauchy–Schwarz inequality, which eventually leads to the order of \(\mathrm {Var}(\xi _n({{\varvec{x}}}))\). \(\square \)
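The second-order Taylor argument can be illustrated numerically with a generic smooth transform (a logistic squash, not the paper's \(\Phi \)) and Gaussian noise: doubling the noise scale, i.e., quadrupling the variance, roughly quadruples the bias, consistent with the \({\mathcal {O}}\left( \mathrm {Var}(\xi _n({{\varvec{x}}}))\right) \) order:

```python
import math, random

# Illustration of the Taylor argument in Lemma 4: for a smooth transform Phi
# of the energy (here a generic logistic squash, not the paper's Phi) and
# mean-zero Gaussian noise xi, the bias E[Phi(U + xi)] - Phi(U) scales with Var(xi).
random.seed(0)

def phi(u):
    return 1.0 / (1.0 + math.exp(-u))  # smooth transform into (0, 1)

U = 0.3

def bias(sigma, n=100000):
    # Antithetic pairs (xi, -xi) cancel the first-order term and sharpen the estimate.
    s = 0.0
    for _ in range(n):
        g = random.gauss(0.0, sigma)
        s += 0.5 * (phi(U + g) + phi(U - g))
    return abs(s / n - phi(U))

b1, b2 = bias(0.1), bias(0.2)
print(b2 / b1)  # ~4: the bias grows linearly in the noise variance
```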

Lemma 5

Given an appropriate partition of the sample space such that \(\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}=o\left( \int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\right) \) for \(i\ge 2\), the mean-field system \(\sum _{k=1}^i \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta -1}(k)} \hbox {d}{{\varvec{x}}}-\theta (i){{\widetilde{Z}}}_{{\varvec{\theta }}}=0\) without perturbations has a solution satisfying the following property:

$$\begin{aligned} \theta _{\star }(i)=\left( \frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}+o\bigg (\frac{1_{\zeta \ne 1}\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\right) ^{\frac{1}{\zeta }}.\nonumber \\ \end{aligned}$$


We prove by induction that (5) is a desired solution.

  1. (i)

    When \(i=1\), it is clear that \(\theta _{\star }(1)=\left( \frac{\int _{{\mathcal {X}}_1} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right) ^{\frac{1}{\zeta }}\).

  2. (ii)

    Suppose that for indices 1 through \(i\), the solution to \(\sum _{k=1}^i \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta ^{\zeta -1}(k)} \hbox {d}{{\varvec{x}}}-\theta (i){{\widetilde{Z}}}_{{\varvec{\theta }}}=0\) satisfies

    $$\begin{aligned} \theta _{\star }(i)= & {} \left( \frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right. \\&\left. +\,o\bigg (\frac{1_{\zeta \ne 1}\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\right) ^{\frac{1}{\zeta }}. \end{aligned}$$
  3. (iii)

    For index \(i+1\), \(\theta _{\star }(i+1)\) solves \(\sum _{k=1}^{i+1} \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta _{\star }^{\zeta -1}(k)} \hbox {d}{{\varvec{x}}}-\theta _{ \star }(i+1) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}=0\), which means that

    $$\begin{aligned}&\sum _{k=1}^{i} \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta _{\star }^{\zeta -1}(k)} \hbox {d}{{\varvec{x}}}- \frac{\int _{\cup _{k=1}^{i}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{\theta _{\star }^{\zeta -1}(i+1)}} \\&\quad + \frac{\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{\theta _{\star }^{\zeta -1}(i+1)}} -\theta _{ \star }(i+1) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}=0 \end{aligned}$$

Notice that in step (ii), we have \(\sum _{k=1}^{i} \int _{{\mathcal {X}}_k} \frac{\pi _{\tau }({{\varvec{x}}})}{\theta _{\star }^{\zeta -1}(k)} \hbox {d}{{\varvec{x}}}= \theta _{ \star }(i) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}\). It follows that

$$\begin{aligned}&\theta _{ \star }(i) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}-\frac{\int _{\cup _{k=1}^{i}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{\theta _{\star }^{\zeta -1}(i+1)}} + \frac{\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{\theta _{\star }^{\zeta -1}(i+1)}} \\&\quad -\theta _{ \star }(i+1) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}=0. \end{aligned}$$

Rearranging terms, we have

$$\begin{aligned}&\underbrace{-\frac{\int _{\cup _{k=1}^{i}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{\theta _{\star }^{\zeta -1}(i+1)}}+\frac{\int _{\cup _{k=1}^{i}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{\theta _{\star }^{\zeta -1}(i)}}}_{\text {remainder}}\\&\quad -\frac{\int _{\cup _{k=1}^{i}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}-\theta ^{\zeta }_{ \star }(i) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}{{\theta _{\star }^{\zeta -1}(i)}}\\&\quad + \frac{\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}-\theta ^{\zeta }_{ \star }(i+1) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}{\theta _{\star }^{\zeta -1}(i+1)}=0. \end{aligned}$$

Since \(\frac{{\theta _{\star }^{\zeta -1}(i+1)}}{{\theta _{\star }^{\zeta -1}(i)}}\left( \int _{\cup _{k=1}^{i}{\mathcal {X}}_{k}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}-\theta ^{\zeta }_{ \star }(i) {{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}\right) =o\bigg (1_{\zeta \ne 1}\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\bigg )\), we claim that one solution of \(\theta _{\star }(i+1)\) satisfies \(\left( \!\frac{\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right. \left. \!+o\bigg (\frac{1_{\zeta \ne 1}\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\!\right) ^{\!\frac{1}{\zeta }}\) if we can show \(\frac{\theta _{\star }^{\zeta -1}(i+1)-\theta _{\star }^{\zeta -1}(i)}{\theta _{\star }^{\zeta -1}(i)}\int _{\cup _{k=1}^{i}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}=o\bigg (\frac{1_{\zeta \ne 1}\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\), where \(o(\cdot )\) denotes a little-o notation.

To proceed, consider the mean value theorem for a proper \({{\widetilde{\theta }}}\in [\theta _{\star }(i), \theta _{\star }(i+1)]\) such that

$$\begin{aligned} \begin{aligned}&\theta _{\star }^{\zeta -1}(i+1)-\theta _{\star }^{\zeta -1}(i) = (\theta _{\star }^{\zeta }(i+1))^{\frac{\zeta -1}{\zeta }} -(\theta _{\star }^{\zeta }(i))^{\frac{\zeta -1}{\zeta }}\\&\quad =\left( \frac{\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}} +o\bigg (\frac{1_{\zeta \ne 1}\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\right. \\&\qquad \left. -\frac{\int _{\cup _{k=1}^{i}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right) \frac{\zeta -1}{\zeta }({{\widetilde{\theta }}}^{\zeta })^{\frac{-1}{\zeta }}\\&\quad \approx \frac{\zeta -1}{\zeta }\frac{\int _{{\mathcal {X}}_{i+1}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}{{\widetilde{\theta }}}^{-1}. \end{aligned} \end{aligned}$$

Combining \(\theta _{\star }(i)=\left( \frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right. \left. +o\bigg (\frac{1_{\zeta \ne 1}\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\right) ^{\frac{1}{\zeta }}\), we have

$$\begin{aligned} \begin{aligned}&\frac{\theta _{\star }^{\zeta -1}(i+1)-\theta _{\star }^{\zeta -1}(i)}{\theta _{\star }^{\zeta -1}(i)}\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\\&\quad \approx \frac{\zeta -1}{\zeta } \frac{\int _{{\mathcal {X}}_{i+1}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}} {{\widetilde{\theta }}}^{-1}\left( \frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}} \right) ^{-1 + \frac{1}{\zeta }}\\&\qquad \int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\\&\quad =\frac{\zeta -1}{\zeta }\int _{{\mathcal {X}}_{i+1}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}{{\widetilde{\theta }}}^{-1}\left( \frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right) ^{\frac{1}{\zeta }}\\&\quad = {\mathcal {O}}\left( \frac{\zeta -1}{\zeta }\int _{{\mathcal {X}}_{i+1}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\right) , \end{aligned} \end{aligned}$$

where the big-O notation follows since \({{\widetilde{\theta }}}={\mathcal {O}}\left( \left( \frac{\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\right) ^{\frac{1}{\zeta }}\right) \). Consequently, the desired conclusion \(\frac{\theta _{\star }^{\zeta -1}(i+1)-\theta _{\star }^{\zeta -1}(i)}{\theta _{\star }^{\zeta -1}(i)}\int _{\cup _{k=1}^i{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}{=}o\bigg (\frac{\int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}}{{{\widetilde{Z}}}_{{\varvec{\theta }}_{\star }}}\bigg )\) holds, since appropriately chosen high-energy regions generally carry exponentially smaller probability mass, so that \(\int _{{\mathcal {X}}_{i+1}} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}=o\left( \int _{\cup _{k=1}^{i+1}{\mathcal {X}}_k} \pi _{\tau }({{\varvec{x}}}) \hbox {d}{{\varvec{x}}}\right) \). \(\square \)

Ergodicity and weighted averaging estimators

We aim to analyze the deviation of the weighted averaging estimate \([\sum _{i=1}^k\theta _{i}^{\zeta }( {\tilde{J}}({{\varvec{x}}}_i)) \pi ^{1-1/\tau }(x_i) f({{\varvec{x}}}_i)]/\) \([\sum _{i=1}^k \theta _i^{\zeta }({\tilde{J}}({{\varvec{x}}}_i)) \pi ^{1-1/\tau }(x_i)]\) from the posterior expectation \(\int _{{\mathcal {X}}}f({{\varvec{x}}})\pi (\hbox {d}{{\varvec{x}}})\) for a bounded function \(f({{\varvec{x}}})\). To carry out this analysis, we first study the convergence of the posterior sample mean \(\frac{1}{k}\sum _{i=1}^k f({{\varvec{x}}}_i)\) to the posterior expectation \({\bar{f}}=\int _{{\mathcal {X}}}f({{\varvec{x}}})\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\). The key tool we employ in the analysis is again the Poisson equation, which characterizes the fluctuation between \(f({{\varvec{x}}})\) and \({{\bar{f}}}\):

$$\begin{aligned} {\mathcal {L}}g({{\varvec{x}}})=f({{\varvec{x}}})-{{\bar{f}}}, \end{aligned}$$

where \(g({{\varvec{x}}})\) is the solution to the Poisson equation, and \({\mathcal {L}}\) is the infinitesimal generator of the Langevin diffusion

$$\begin{aligned} {\mathcal {L}}g:=\langle \nabla g, \nabla L(\cdot , {{\widehat{{\varvec{\theta }}}}}_{\star })\rangle +\tau \Delta g. \end{aligned}$$
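To make the role of the weights concrete, the following toy sketch (discrete states, \(\tau =1\), hypothetical \(\pi \) and \({\varvec{\theta }}\)) draws samples from the flattened distribution \(\varpi \propto \pi /\theta ^{\zeta }\) and reweights them to recover \(\mathbb {E}_{\pi }[f]\):

```python
import random

# Toy illustration of the weighted averaging estimator (tau = 1, so the
# importance weight reduces to theta^zeta(J(x))): samples drawn from the
# flattened distribution varpi ∝ pi / theta^zeta are reweighted to recover
# the posterior expectation E_pi[f]. States and probabilities are hypothetical.
random.seed(1)
states = [0, 1, 2, 3]
pi = [0.55, 0.30, 0.10, 0.05]          # target distribution pi
theta = [0.55, 0.85, 0.95, 1.00]       # self-adapting weights (one stratum each)
zeta = 1.0

# Flattened sampling distribution varpi ∝ pi(x) / theta(J(x))^zeta.
w = [p / t ** zeta for p, t in zip(pi, theta)]
varpi = [v / sum(w) for v in w]

f = lambda x: float(x)                  # bounded test function
samples = random.choices(states, weights=varpi, k=200000)

num = sum(theta[x] ** zeta * f(x) for x in samples)
den = sum(theta[x] ** zeta for x in samples)
exact = sum(p * f(x) for x, p in zip(states, pi))
print(num / den, exact)  # the weighted average approaches E_pi[f] = 0.65
```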

By imposing the following regularity conditions on the function \(g({{\varvec{x}}})\), we can control the perturbation \(\frac{1}{k}\sum _{i=1}^k f({{\varvec{x}}}_i)-{{\bar{f}}}\) and thereby establish the convergence of the sample average.

Regularity Condition: There exist a sufficiently smooth function \(g({{\varvec{x}}})\), as defined in (38), and a function \({\mathcal {V}}({{\varvec{x}}})\) such that \(\Vert D^k g\Vert \lesssim {\mathcal {V}}^{p_k}({{\varvec{x}}})\) for some constants \(p_k>0\), where \(k\in \{0,1,2,3\}\). In addition, \({\mathcal {V}}^p\) has a bounded expectation, i.e., \(\sup _{{{\varvec{x}}}} \mathbb {E}[{\mathcal {V}}^p({{\varvec{x}}})]<\infty \), and \({\mathcal {V}}^p\) is smooth, i.e., \(\sup _{s\in [0, 1]} {\mathcal {V}}^p(s{{\varvec{x}}}+(1-s){{\varvec{y}}})\lesssim {\mathcal {V}}^p({{\varvec{x}}})+{\mathcal {V}}^p({{\varvec{y}}})\) for all \({{\varvec{x}}},{{\varvec{y}}}\in {\mathcal {X}}\) and \(p\le 2\max _k\{p_k\}\).

Although the above regularity condition has been used in the literature, see, e.g., Chen et al. (2015), it is hard to verify for practical algorithms. Vollmer et al. (2016) provides a set of verifiable conditions, but under stronger assumptions. To address this issue, Erdogdu et al. (2018) shows that the regularity conditions can be verified under the dissipativity and smoothness assumptions (Xu et al. 2018); the latter resembles the conclusion of Lemma 2. In what follows, we present a lemma which is mainly adapted from Theorem 2 of Chen et al. (2015) with a fixed learning rate \(\epsilon \).

Lemma 6

(Convergence of the Averaging Estimators) Suppose Assumptions A1–A5 hold. For any bounded function \(f({{\varvec{x}}})\) and any \(k>k_0\),

$$\begin{aligned} \begin{aligned}&\left| \mathbb {E}\left[ \frac{\sum _{i=1}^k f({{\varvec{x}}}_i)}{k}\right] -{\int _{{\mathcal {X}}}f({{\varvec{x}}})\varpi _{{\widetilde{\Psi }} _{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})} \hbox {d}{{\varvec{x}}}\right| \\&\quad ={\mathcal {O}}\left( \frac{1}{k\epsilon }+\epsilon +\sqrt{\frac{\sum _{i=1}^k \omega _i}{k}}\right) , \end{aligned} \end{aligned}$$

where \(k_0\) is a sufficiently large constant, \({\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}}) }= \frac{1}{Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}} \frac{\pi _{\tau }({{\varvec{x}}})}{{{\widehat{\theta }}}_{\star }^{\zeta }(J({{\varvec{x}}}))}\), and \(Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}=\sum _{i=1}^m \frac{\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}}{{{\widehat{\theta }}}_{\star }^{\zeta }(i)}\).
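A toy numerical check of these definitions may be helpful. The subregion masses and the value of \({{\widehat{{\varvec{\theta }}}}}_{\star }\) below are made-up numbers for illustration; the point is that \(Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\) normalizes the reweighted density, so the subregion masses of \(\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}\) sum to one and are flatter than those of \(\pi _{\tau }\):

```python
import numpy as np

# Hypothetical subregion masses of pi_tau and a fixed point theta_star
# over an m = 4 partition (made-up numbers, for illustration only)
pi_mass = np.array([0.70, 0.20, 0.08, 0.02])
theta_star = np.array([0.55, 0.25, 0.15, 0.05])
zeta = 1.0

# Z = sum_i (mass of X_i under pi_tau) / theta_star(i)^zeta
Z = np.sum(pi_mass / theta_star**zeta)

# Subregion masses under the flattened density varpi
varpi_mass = (pi_mass / theta_star**zeta) / Z
assert np.isclose(varpi_mass.sum(), 1.0)  # varpi is a probability density
```

Here the largest subregion mass shrinks from 0.70 under \(\pi _{\tau }\) to a smaller value under \(\varpi \), illustrating the flattening effect of the self-adjusting mechanism.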


We rewrite the AWSGLD algorithm as follows:

$$\begin{aligned} \begin{aligned} {\varvec{x}}_{k+1}&={{\varvec{x}}}_k- \epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}_k, {{\widehat{{\varvec{\theta }}}}}_k)+{\mathcal {N}}({0, 2\epsilon \tau {\varvec{I}}})\\&={{\varvec{x}}}_k- \epsilon \left( \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}_k, {{\widehat{{\varvec{\theta }}}}}_{\star })+{\Upsilon }({{\varvec{x}}}_k, {\varvec{\theta }}_k, {{\widehat{{\varvec{\theta }}}}}_{\star })\right) \\&\quad +{\mathcal {N}}({0, 2\epsilon \tau {\varvec{I}}}), \end{aligned} \end{aligned}$$

where \(\nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}},{\varvec{\theta }})\) is as defined in Sect. A.3, and the bias term is given by \({\Upsilon }({{\varvec{x}}}_k,{\varvec{\theta }}_k,{{\widehat{{\varvec{\theta }}}}}_{\star })=\nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}_k,{\varvec{\theta }}_k)-\nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}_k,{{\widehat{{\varvec{\theta }}}}}_{\star })\). By Jensen’s inequality and Theorem 1, we have

$$\begin{aligned} \Vert \mathbb {E}[\Upsilon ({{\varvec{x}}}_k,{\varvec{\theta }}_k,{{\widehat{{\varvec{\theta }}}}}_{\star })]\Vert&\lesssim \mathbb {E}[\Vert {\varvec{\theta }}_k-{{\widehat{{\varvec{\theta }}}}}_{\star }\Vert ]\le \sqrt{\mathbb {E}[\Vert {\varvec{\theta }}_k-{{\widehat{{\varvec{\theta }}}}}_{\star }\Vert ^2]}\nonumber \\&\le {\mathcal {O}}\left( \sqrt{\omega _{k}}\right) . \end{aligned}$$

The ergodic average of SGLD with biased gradients and a fixed learning rate has been studied in Theorem 2 of Chen et al. (2015) under the above regularity condition. Applying this theorem and combining it with (39), we have

$$\begin{aligned} \begin{aligned}&\left| \mathbb {E}\left[ \frac{\sum _{i=1}^k f({{\varvec{x}}}_i)}{k}\right] -\int _{{\mathcal {X}}}f({{\varvec{x}}}) \varpi _{\Psi _{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\right| \\&\quad \le {\mathcal {O}}\left( \frac{1}{k\epsilon }+\epsilon +\frac{\sum _{i=1}^k \Vert \mathbb {E}[\Upsilon ({{\varvec{x}}}_k,{\varvec{\theta }}_k,{{\widehat{{\varvec{\theta }}}}}_{\star })]\Vert }{k}\right) \\&\quad \le {\mathcal {O}}\left( \frac{1}{k\epsilon }+\epsilon +\sqrt{\frac{\sum _{i=1}^k \omega _i}{k}}\right) . \end{aligned} \end{aligned}$$

For any bounded function \(f({{\varvec{x}}})\), we have \(|\int _{{\mathcal {X}}}f({{\varvec{x}}}) \varpi _{\Psi _{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})\hbox {d}{{\varvec{x}}}- \int _{{\mathcal {X}}}f({{\varvec{x}}}) \varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})\hbox {d}{{\varvec{x}}}|= {\mathcal {O}}(\frac{1}{m})\) by Lemma 3. By the triangle inequality, we have

$$\begin{aligned} \begin{aligned}&\left| \mathbb {E}\left[ \frac{\sum _{i=1}^k f({{\varvec{x}}}_i)}{k}\right] -\int _{{\mathcal {X}}}f({{\varvec{x}}}) {\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})} \hbox {d}{{\varvec{x}}}\right| \\&\quad \le {\mathcal {O}}\left( \frac{1}{k\epsilon }+\epsilon +\sqrt{\frac{\sum _{i=1}^k \omega _i}{k}}+ \frac{1}{m}\right) , \end{aligned} \end{aligned}$$

which concludes the proof. \(\square \)
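The Langevin update rewritten at the start of the proof has the generic form \({\varvec{x}}_{k+1}={{\varvec{x}}}_k- \epsilon \nabla _{{{\varvec{x}}}} {\widetilde{L}}({{\varvec{x}}}_k, {{\widehat{{\varvec{\theta }}}}}_k)+{\mathcal {N}}({0, 2\epsilon \tau {\varvec{I}}})\). A minimal sketch of one such step, with the adaptively weighted stochastic gradient supplied by the caller (the function and argument names are our own), is:

```python
import numpy as np

def awsgld_step(x, grad_L, epsilon, tau, rng):
    """One Langevin step: x_{k+1} = x_k - eps * grad_L(x_k) + N(0, 2*eps*tau*I).

    grad_L is a callable returning the stochastic gradient
    nabla_x L~(x, theta_k); the adaptive weights theta_k are assumed
    to be folded into grad_L by the caller.
    """
    noise = rng.normal(0.0, np.sqrt(2.0 * epsilon * tau), size=x.shape)
    return x - epsilon * grad_L(x) + noise
```

With \(\tau =0\) the step degenerates to plain stochastic gradient descent, which is a convenient sanity check.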

Finally, we are ready to prove Theorem 2 concerning the convergence of the weighted averaging estimator \(\frac{\sum _{i=1}^k\theta _{i} ^{\zeta }({\tilde{J}}({{\varvec{x}}}_i)) f({{\varvec{x}}}_i)}{\sum _{i=1}^k\theta _{i}^{\zeta }( {\tilde{J}}({{\varvec{x}}}_i))}\) to the posterior mean \(\int _{{\mathcal {X}}}f({{\varvec{x}}})\pi _{\tau }(\hbox {d}{{\varvec{x}}})\).

Proof of Theorem 2

Applying the triangle inequality and \(|\mathbb {E}[x]|\le \mathbb {E}[|x|]\), we have

$$\begin{aligned} \begin{aligned}&\left| \mathbb {E}\left[ \frac{\sum _{i=1}^k\theta _{i} ^{\zeta }( {\tilde{J}}({{\varvec{x}}}_i)) f({{\varvec{x}}}_i)}{\sum _{i=1}^k\theta _{i}^{\zeta }( {\tilde{J}}({{\varvec{x}}}_i))}\right] -\int _{{\mathcal {X}}}f({{\varvec{x}}})\pi _{\tau }(\hbox {d}{{\varvec{x}}})\right| \\&\quad \le \underbrace{\mathbb {E}\left[ \left| \frac{\sum _{i=1}^k\theta _{i}^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i))f({{\varvec{x}}}_i)}{\sum _{i=1}^k\theta _{i}^{\zeta }({\tilde{J}}({{\varvec{x}}}_i)) }-\frac{Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\sum _{i=1}^k\theta _{i}^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i)) f({{\varvec{x}}}_i)}{k}\right| \right] }_{\text {J}_1}\\&\qquad + \underbrace{\mathbb {E}\left[ \frac{Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}{k} \sum _{i=1}^k\left| \theta _i^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i))-{{\widehat{\theta }}}_{\star }^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i)) \right| \cdot |f({{\varvec{x}}}_i)|\right] }_{\text {J}_2}\\&\qquad +\underbrace{\left| \mathbb {E}\left[ \frac{Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}{k} \sum _{i=1}^k{{\widehat{\theta }}}_{\star }^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i)) f({{\varvec{x}}}_i)\right] -\int _{{\mathcal {X}}}f({{\varvec{x}}})\pi _{\tau }(\hbox {d}{{\varvec{x}}})\right| }_{\text {J}_3}. \end{aligned} \end{aligned}$$

For the term \(\text {J}_1\), by the boundedness of \({\varvec{\Theta }}\) and f and the assumption \(\inf _{{\varvec{\Theta }}}\theta ^{\zeta }(i)>0\), we have

$$\begin{aligned} \text {J}_1&=\mathbb {E}\left[ \left| \frac{\sum _{i=1}^k\theta _{i}^{\zeta }({\tilde{J}}({{\varvec{x}}}_i)) f({{\varvec{x}}}_i)}{\sum _{i=1}^k\theta _{i}^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i)) }\left( 1-\sum _{i=1}^k\frac{\theta _i^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i)) }{k}Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\right) \right| \right] \nonumber \\&\lesssim \mathbb {E}\left[ \left| Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\frac{{\sum _{i=1}^k\theta _{i}^{\zeta } ({\tilde{J}}({{\varvec{x}}}_i)) }}{k}-1\right| \right] \nonumber \\&=\mathbb {E}\left[ \left| Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\sum _{i=1}^m \frac{\sum _{j=1}^k\left( \theta _j^{\zeta }(i)-{{\widehat{\theta }}}_{\star }^{\zeta }(i) +{{\widehat{\theta }}}_{\star }^{\zeta }(i)\right) 1_{ {\tilde{J}}({{\varvec{x}}}_j)=i}}{k}-1\right| \right] \nonumber \\&\le \underbrace{\mathbb {E}\left[ Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\sum _{i=1}^m \frac{\sum _{j=1}^k\left| \theta _j^{\zeta }(i)-{{\widehat{\theta }}}_{\star }^{\zeta }(i)\right| 1_{{\tilde{J}}({{\varvec{x}}}_j)=i}}{k} \right] }_{\text {J}_{4}}\nonumber \\&\quad + \underbrace{\mathbb {E}\left[ \left| Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\sum _{i=1}^m \frac{{{\widehat{\theta }}}_{\star }^{\zeta }(i)\sum _{j=1}^k 1_{{\tilde{J}}({{\varvec{x}}}_j)=i}}{k}-1\right| \right] }_{\text {J}_{5}}. \end{aligned}$$

For the term \(\text {J}_{4}\), by the mean value theorem and the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \text {J}_{4}&\lesssim \frac{1}{k}\mathbb {E}\left[ \sum _{j=1}^k\sum _{i=1}^m\left| \theta _j^{\zeta }(i)-{{\widehat{\theta }}}_{\star }^{\zeta }(i)\right| \right] \nonumber \\&\lesssim \frac{1}{k}\mathbb {E}\left[ \sum _{j=1}^k\sum _{i=1}^m\left| \theta _j(i)-{{\widehat{\theta }}}_{\star }(i)\right| \right] \nonumber \\&\lesssim \frac{1}{k}\sqrt{\sum _{j=1}^k\mathbb {E}\left[ \left\| {\varvec{\theta }}_j-{{\widehat{{\varvec{\theta }}}}}_{\star }\right\| ^2\right] }, \end{aligned}$$

where the compactness of \({\varvec{\Theta }}\) has been used in deriving the second inequality.

For the term \(\text {J}_{5}\), by the following relation

$$\begin{aligned} 1&=\sum _{i=1}^m\int _{{\mathcal {X}}_i} \pi _{\tau }({{\varvec{x}}})\hbox {d}{{\varvec{x}}}=\sum _{i=1}^m\int _{{\mathcal {X}}_i} {{\widehat{\theta }}}_{\star }^{\zeta }(i) \frac{\pi _{\tau }({{\varvec{x}}})}{{{\widehat{\theta }}}_{\star }^{\zeta }(i)}\hbox {d}{{\varvec{x}}}\\&=Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\int _{{\mathcal {X}}} \sum _{i=1}^m {{\widehat{\theta }}}_{\star }^{\zeta }(i) 1_{{\tilde{J}}({{\varvec{x}}})=i}\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})\hbox {d}{{\varvec{x}}}, \end{aligned}$$

it is easy to derive that

$$\begin{aligned} \begin{aligned} \text {J}_{5}&=\mathbb {E}\left( \left| Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\sum _{i=1}^m \frac{{{\widehat{\theta }}}_{\star }^{\zeta }(i)\sum _{j=1}^k 1_{{\tilde{J}}({{\varvec{x}}}_j)=i}}{k}\right. \right. \\&\quad \left. \left. -Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\int _{{\mathcal {X}}} \sum _{i=1}^m {{\widehat{\theta }}}_{\star }^{\zeta }(i) 1_{{\tilde{J}}({{\varvec{x}}})=i}\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}} ({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\right| \right) \\&=Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }} \mathbb {E}\left( \left| \frac{1}{k}\sum _{j=1}^k \left( \sum _{i=1}^m{{\widehat{\theta }}}_{\star }^{\zeta }(i) 1_{{\tilde{J}}({{\varvec{x}}}_j)=i}\right) \right. \right. \\&\quad \left. \left. -\int _{{\mathcal {X}}} \left( \sum _{i=1}^m {{\widehat{\theta }}}_{\star }^{\zeta }(i) 1_{{\tilde{J}}({{\varvec{x}}})=i}\right) \varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}} ({{\varvec{x}}})\hbox {d}{{\varvec{x}}}\right| \right) \\&={\mathcal {O}}\left( \frac{1}{k\epsilon }+\epsilon +\sqrt{\frac{\sum _{i=1}^k \omega _i}{k}}+\frac{1}{{m}} \right) , \end{aligned} \end{aligned}$$

where the last equality follows from Lemma 6 as the step function \(\sum _{i=1}^m {{\widehat{\theta }}}_{\star }^{\zeta }(i) 1_{{\tilde{J}}({{\varvec{x}}})=i}\) is bounded.

For the term \(\text {J}_2\), by the mean value theorem and Cauchy–Schwarz inequality, we have

$$\begin{aligned} \begin{aligned} \text {J}_2&\lesssim \mathbb {E}\left[ \frac{1}{k}\sum _{i=1}^k\left| \theta _{i} ^{\zeta }({\tilde{J}}({{\varvec{x}}}_i)) -{{\widehat{\theta }}}_{\star }^{\zeta }( {\tilde{J}}({{\varvec{x}}}_i))\right| \right] \\&\lesssim \frac{1}{k}\mathbb {E}\biggl [ \sum _{j=1}^k\sum _{i=1}^m\bigl | \theta _j(i)-{{\widehat{\theta }}}_{\star }(i)\bigr | \biggr ]\\&\lesssim \frac{1}{k}\sqrt{\sum _{j=1}^k\mathbb {E}\left[ \left\| {\varvec{\theta }}_j-{{\widehat{{\varvec{\theta }}}}}_{\star }\right\| ^2\right] }. \end{aligned} \end{aligned}$$

For the last term \(\text {J}_3\), we first decompose \(\int _{{\mathcal {X}}} f({{\varvec{x}}}) \pi _{\tau }(\hbox {d}{{\varvec{x}}})\) into integrals over the m disjoint subregions:

$$\begin{aligned}&\int _{{\mathcal {X}}} f({{\varvec{x}}}) \pi _{\tau }(\hbox {d}{{\varvec{x}}})\nonumber \\&\quad =\int _{\cup _{j=1}^m {\mathcal {X}}_j} f({{\varvec{x}}}) \pi _{\tau }(\hbox {d}{{\varvec{x}}})\nonumber \\&\quad =\sum _{j=1}^m\int _{{\mathcal {X}}_j}{{\widehat{\theta }}}_{\star }^{\zeta }(j) f({{\varvec{x}}})\frac{\pi _{\tau }(\hbox {d}{{\varvec{x}}})}{{{\widehat{\theta }}}_{\star }^{\zeta }(j)}\nonumber \\&\quad =Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\int _{{\mathcal {X}}} \sum _{j=1}^m {{\widehat{\theta }}}_{\star }^{\zeta }(j)f({{\varvec{x}}}) 1_{ {\tilde{J}}({{\varvec{x}}})=j}\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})(\hbox {d}{{\varvec{x}}}). \end{aligned}$$

Plugging (43) into the term \(\text {J}_3\), we have

$$\begin{aligned} \begin{aligned} \text {J}_3&=\left| \mathbb {E}\left[ \frac{Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}{k} \sum _{i=1}^k\sum _{j=1}^m{{\widehat{\theta }}}_{\star }^{\zeta }(j) f({{\varvec{x}}}_i)1_{ {\tilde{J}}({{\varvec{x}}}_i)=j }\right] \right. \\&\quad \left. -\int _{{\mathcal {X}}}f({{\varvec{x}}})\pi _{\tau }(\hbox {d}{{\varvec{x}}})\right| \\&= Z_{{{\widehat{{\varvec{\theta }}}}}_{\star }}\left| \mathbb {E}\left[ \frac{1}{k}\sum _{i=1}^k \left( \sum _{j=1}^m{{\widehat{\theta }}}_{\star }^{\zeta }(j) f({{\varvec{x}}}_i)1_{ {\tilde{J}}({{\varvec{x}}}_i)=j }\right) \right] \right. \\&\quad \left. -\int _{{\mathcal {X}}} \left( \sum _{j=1}^m{{\widehat{\theta }}}_{\star }^{\zeta }(j) f({{\varvec{x}}})1_{{\tilde{J}}({{\varvec{x}}})=j }\right) \varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})(\hbox {d}{{\varvec{x}}})\right| .\\ \end{aligned} \end{aligned}$$

Applying Lemma 6 to the bounded function \(g({{\varvec{x}}})=\sum _{j=1}^m{{\widehat{\theta }}}_{\star }^{\zeta }(j) f({{\varvec{x}}})1_{ {\tilde{J}}({{\varvec{x}}})=j }\) yields

$$\begin{aligned} \begin{aligned}&\left| \mathbb {E}\left[ \frac{1}{k}\sum _{i=1}^k g({{\varvec{x}}}_i)\right] -\int _{{\mathcal {X}}} g({{\varvec{x}}})\varpi _{{\widetilde{\Psi }}_{{{\widehat{{\varvec{\theta }}}}}_{\star }}}({{\varvec{x}}})(\hbox {d}{{\varvec{x}}})\right| \\&\quad ={\mathcal {O}}\left( \frac{1}{k\epsilon }+{\epsilon }+\sqrt{\frac{\sum _{i=1}^k \omega _i}{k}}+\frac{1}{{m}}\right) . \end{aligned} \end{aligned}$$

Plugging (45) into (44) and combining \(\text {J}_{1}\), \(\text {J}_{2}\), \(\text {J}_{4}\), \(\text {J}_5\) and Theorem 1, we have

$$\begin{aligned} \begin{aligned}&\left| \mathbb {E}\left[ \frac{\sum _{i=1}^k\theta _{i} ^{\zeta }({\tilde{J}}({{\varvec{x}}}_i)) f({{\varvec{x}}}_i)}{\sum _{i=1}^k\theta _{i}^{\zeta }( {\tilde{J}}({{\varvec{x}}}_i))}\right] -\int _{{\mathcal {X}}}f({{\varvec{x}}})\pi _{\tau }(\hbox {d}{{\varvec{x}}})\right| \\&\quad ={\mathcal {O}}\left( \frac{1}{k\epsilon }+{\epsilon }+\sqrt{\frac{\sum _{i=1}^k \omega _i}{k}}+\frac{1}{{m}} \right) , \end{aligned} \end{aligned}$$

which concludes the proof of the theorem. \(\square \)

Experimental settings

Background on SGHMC

SGHMC (Chen et al. 2014; Ma et al. 2015) is a natural extension of Hamiltonian Monte Carlo (Neal 2012) to big data problems. One variant (Saatci and Wilson 2017) follows the dynamics

$$\begin{aligned} \left\{ \begin{array}{lr} {{\varvec{x}}}\leftarrow {{\varvec{x}}}+ {\varvec{\gamma }}, \\ {\varvec{\gamma }} \leftarrow (1-{{\tilde{\alpha }}}){\varvec{\gamma }} - \epsilon \frac{N}{n}\nabla _{{{\varvec{x}}}} {\widetilde{U}}({{\varvec{x}}})+ {\mathcal {N}}(0, 2\epsilon ({{\tilde{\alpha }}}-{{\tilde{\beta }}})), \end{array} \right. \nonumber \\ \end{aligned}$$

where \(\epsilon \) is the learning rate; \({{\tilde{\alpha }}}\) denotes the friction term and is often set to 0.1; \({\varvec{\gamma }}\in {\mathbb {R}}^d\) denotes the auxiliary momentum variable and can be resampled periodically; and \({{\tilde{\beta }}}=\frac{1}{2} \epsilon {\varvec{\widehat{B}}}\), where \({\varvec{\widehat{B}}}\) can be estimated via empirical Fisher information (Ahn et al. 2012b) and is set to \(\frac{1}{1000}{\varvec{I}}\) in our experiments.
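The dynamics above can be sketched as a single update function. This is an illustrative implementation under our own naming conventions; with \({{\tilde{\alpha }}}={{\tilde{\beta }}}\) the injected noise vanishes, which gives a convenient deterministic check:

```python
import numpy as np

def sghmc_step(x, gamma, grad_U, epsilon, alpha, beta, N, n, rng):
    """One SGHMC update following the dynamics above.

    grad_U returns the minibatch gradient of U~; N/n rescales it to the
    full data set. alpha is the friction term (e.g. 0.1) and beta the
    estimated noise correction (beta = eps/2000 when B_hat = I/1000,
    matching the setting in the text).
    """
    x_new = x + gamma  # position update: x <- x + gamma
    noise = rng.normal(0.0, np.sqrt(2.0 * epsilon * (alpha - beta)),
                       size=x.shape)
    gamma_new = (1.0 - alpha) * gamma - epsilon * (N / n) * grad_U(x_new) + noise
    return x_new, gamma_new
```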

Table 3 Hyperparameters used in the experiments on 10 non-convex functions, where AW-10 and AW-100 are short for AWSGLD with a partition of 10 subregions and 100 subregions, respectively; d is the dimension, \(\epsilon \) is the learning rate, \(\tau \) is the temperature, \(\Delta u\) is the energy bandwidth, and \(\zeta \) is a tuning parameter. The run terminates when the sampler hits the target set \(\{{{\varvec{x}}}: U({{\varvec{x}}})\le U_{\min } +\varrho \}\)
Table 4 CPU time cost to achieve the target accuracy. The results are averaged over 10 trials. \(\infty \) indicates that the running time is not reported because the number of iterations exceeds 100,000

Sample space exploration

For the first example, all algorithms are run for 100,000 iterations. The default learning rate is 5e−5. pSGLD, SGHMC, and cycSGLD adopt a low temperature \(\tau =2\), while high-temperature SGLD and AWSGLD adopt a high temperature of 20. For pSGLD, the smoothing factor \(\alpha \) and the regularizer \(\lambda \) (to control extreme values) of Li et al. (2016) are set to 0.999 and 0.1, respectively. For SGHMC, we fix the momentum at 0.9 and resample the velocity variable from a Gaussian distribution every 1000 steps. For cycSGLD, we choose 10 cycles, with the learning rate in each cycle decaying from 1e−4 to 3e−5. For AWSGLD, we set \(\omega _k=\frac{0.1}{k^{0.6}+1000}\) and \(\zeta =5\) and partition the energy space (0, 20] into 30 subregions.
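The AWSGLD step-size schedule and energy partition used here can be sketched as follows. The clipping of out-of-range energies to the boundary bins is our own assumption, as the text only specifies the partitioned interval:

```python
import numpy as np

def omega(k):
    """Step-size schedule omega_k = 0.1 / (k**0.6 + 1000) from the text."""
    return 0.1 / (k**0.6 + 1000.0)

def subregion_index(energy, u_min=0.0, u_max=20.0, m=30):
    """Map an energy value to one of m equal-width subregions partitioning
    (u_min, u_max]; out-of-range energies are clipped to the boundary bins
    (the clipping rule is an assumption)."""
    width = (u_max - u_min) / m
    idx = int(np.ceil((energy - u_min) / width))
    return min(max(idx, 1), m)
```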

For the Griewank function, we inherit most of the settings from the previous example and run the algorithms for 200,000 iterations. The base learning rate is 0.005. The default high and low temperatures are 2 and 0.2, respectively. For cycSGLD, the learning rate in each cycle goes from 0.02 to 0.005; for AWSGLD, we set \(\zeta =3\) and partition the energy space (0,2] into 30 subregions.
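For the cyclical schedule of cycSGLD, one common choice is a cosine decay within each cycle; the cosine shape is an assumption here, since the text fixes only the per-cycle endpoints. Using the second example's settings (200,000 iterations, 10 cycles, 0.02 down to 0.005):

```python
import numpy as np

def cyclic_lr(k, total_iters=200_000, n_cycles=10, lr_max=0.02, lr_min=0.005):
    """Cyclical learning rate for cycSGLD: within each cycle the rate
    decays from lr_max to lr_min (cosine shape assumed)."""
    cycle_len = total_iters // n_cycles
    t = (k % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + np.cos(np.pi * t))
```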

Optimization of multimodal functions

For AWSGLD, we set \(\omega _k=\frac{100}{k^{0.75}+1000}\) for most of the functions, except for the first Rastrigin function, which adopts \(\omega _k=\frac{200}{k^{0.75}+1000}\). The remaining hyperparameters are set as in Table 3.


Deng, W., Lin, G. & Liang, F. An adaptively weighted stochastic gradient MCMC algorithm for Monte Carlo simulation and global optimization. Stat Comput 32, 58 (2022).



Keywords

  • Adaptive stochastic gradient Langevin dynamics
  • Dynamic importance sampling
  • Local traps
  • Stochastic approximation