Decentralized Bayesian learning with Metropolis-adjusted Hamiltonian Monte Carlo

Kungurtsev, Vyacheslav; Cobb, Adam; Javidi, Tara; Jalaian, Brian

doi:10.1007/s10994-023-06345-6

Published: 20 June 2023

Machine Learning, Volume 112, pages 2791–2819 (2023)

Vyacheslav Kungurtsev¹ (ORCID: 0000-0003-2229-8824), Adam Cobb², Tara Javidi³ & Brian Jalaian⁴

Abstract

Federated learning performed by a decentralized network of agents is becoming increasingly important with the prevalence of embedded software on autonomous devices. Bayesian approaches to learning offer more information about the uncertainty of a random quantity, and Langevin and Hamiltonian methods are effective at sampling from an uncertain distribution with large parameter dimensions. Such methods have only recently appeared in the decentralized setting, and they either exclusively use stochastic gradient Langevin and Hamiltonian Monte Carlo approaches, which require a diminishing stepsize to asymptotically sample from the posterior and are known in practice to characterize uncertainty less faithfully than constant stepsize methods with a Metropolis adjustment, or assume strong convexity properties of the potential function. We present the first approach to incorporating constant stepsize Metropolis-adjusted HMC in the decentralized sampling framework, show theoretical guarantees for consensus and probability distance to the posterior stationary distribution, and demonstrate its effectiveness numerically on standard real-world problems, including decentralized learning of neural networks, which is known to be highly non-convex.

Availability of data

Please see all the code in the repo: https://github.com/AdamCobb/dblmahmc.

Notes

However, unlike Parayil et al. (2020), who use stochastic gradients, we apply their method without taking stochastic gradients.

References

Akyildiz, Ö.D., & Sabanis, S. (2020). Nonasymptotic analysis of stochastic gradient Hamiltonian Monte Carlo under local conditions for nonconvex optimization. arXiv preprint arXiv:2002.05465.
Berahas, A. S., Bollapragada, R., Keskar, N. S., & Wei, E. (2018). Balancing communication and computation in distributed optimization. IEEE Transactions on Automatic Control, 64(8), 3141–3155.
Betancourt, M. (2015). The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. arXiv preprint arXiv:1502.01510.
Bou-Rabee, N., Eberle, A., & Zimmer, R. (2020). Coupling and convergence for Hamiltonian Monte Carlo. The Annals of Applied Probability, 30(3), 1209–1250.
Chau, H. N., & Rásonyi, M. (2022). Stochastic gradient Hamiltonian Monte Carlo for non-convex learning. Stochastic Processes and their Applications, 149, 341–368.
Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In: International Conference on Machine Learning, pp. 1683–1691. PMLR.
Chen, X., Du, S. S., & Tong, X. T. (2020). On stationary-point hitting time and ergodicity of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 21(68), 1–41.
Cobb, A.D., & Jalaian, B. (2020). Scaling Hamiltonian Monte Carlo inference for Bayesian neural networks with symmetric splitting. arXiv preprint arXiv:2010.06772.
Di Lorenzo, P., & Scutari, G. (2016). NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2), 120–136.
Durmus, A., Moulines, E., & Saksman, E. (2017). On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166.
Durmus, A., & Moulines, E. (2019). High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A), 2854–2882.
Gao, X., Gürbüzbalaban, M., & Zhu, L. (2021). Global convergence of stochastic gradient Hamiltonian Monte Carlo for nonconvex stochastic optimization: nonasymptotic performance bounds and momentum-based acceleration. Operations Research, 70, 2931–2947.
Gürbüzbalaban, M., Gao, X., Hu, Y., & Zhu, L. (2020). Decentralized stochastic gradient Langevin dynamics and Hamiltonian Monte Carlo. arXiv preprint arXiv:2007.00590.
Harrison, D., Jr., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81–102.
Hsieh, C.-J., Si, S., & Dhillon, I. (2014). A divide-and-conquer solver for kernel support vector machines. In International Conference on Machine Learning, pp. 566–574. PMLR.
Izmailov, P., Vikram, S., Hoffman, M.D., & Wilson, A.G. (2021). What are Bayesian neural network posteriors really like? arXiv preprint arXiv:2104.14421.
Kolesov, A., & Kungurtsev, V. (2021). Decentralized Langevin dynamics over a directed graph. arXiv preprint arXiv:2103.05444.
Kungurtsev, V. (2020). Stochastic gradient Langevin dynamics on a distributed network. arXiv preprint arXiv:2001.00665.
Lalitha, A., Wang, X., Kilinc, O., Lu, Y., Javidi, T., & Koushanfar, F. (2019). Decentralized Bayesian learning over graphs. arXiv preprint arXiv:1905.10466.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Mesquita, D., Blomstedt, P., & Kaski, S. (2020). Embarrassingly parallel MCMC using deep invertible transformations. In Uncertainty in Artificial Intelligence, pp. 1244–1252. PMLR.
Parayil, A., Bai, H., George, J., & Gurram, P. (2020). Decentralized Langevin dynamics for Bayesian learning. Advances in Neural Information Processing Systems, 33, 15978–15989.
Pu, S., & Nedić, A. (2020). Distributed stochastic gradient tracking methods. Mathematical Programming, 187, 409–457.
Roberts, G. O., Tweedie, R. L., et al. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4), 341–363.
Shoham, N., Avidor, T., Keren, A., Israel, N., Benditkis, D., Mor-Yosef, L., & Zeitak, I. (2019). Overcoming forgetting in federated learning on non-iid data. arXiv preprint arXiv:1910.07796.
Teh, Y. W., Thiery, A. H., & Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17, 1–33.
Welling, M., & Teh, Y.W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688. Citeseer.
Zhang, Y., Liang, P., & Charikar, M. (2017). A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pp. 1980–2022. PMLR.
Zou, D., & Gu, Q. (2021). On the convergence of Hamiltonian Monte Carlo with stochastic gradients. In International Conference on Machine Learning, pp. 13012–13022. PMLR.

Funding

VK was supported by the OP VVV project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics”. AC and BJ were sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Author information

Vyacheslav Kungurtsev and Adam Cobb contributed equally to the paper.

Authors and Affiliations

Department of Computer Science, Czech Technical University, Prague, Czechia
Vyacheslav Kungurtsev
SRI International, Menlo Park, USA
Adam Cobb
Electrical and Computer Engineering, University of California, San Diego, San Diego, USA
Tara Javidi
DEVCOM Army Research Laboratory, Adelphi, USA
Brian Jalaian

Contributions

VK contributed to the theoretical analysis and algorithmic formulation. AC contributed to the numerical experiments and code. VK, AC, and TJ contributed equally to the discussions and writing of the manuscript. BJ contributed partially to editing and conceptual discussions in preparation of the manuscript.

Corresponding author

Correspondence to Vyacheslav Kungurtsev.

Ethics declarations

Conflicts

The authors have no relevant financial or non-financial interests to disclose.

Ethics approval

The authors believe there are no potential ethical considerations with this work.

Consent to participate and publish

All authors consent to the process of review and publication of the entire manuscript.

Additional information

Editor: Derek Greene

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Proofs of theoretical results

Appendix 1.1: Bounding approximate to centralized HMC

We first bound the distance in probability for the chain governing the update for $\hat{\varvec{\omega }}^t$ in (7) and for $\tilde{\varvec{\omega }}^t$ in (6).

Note that by construction, it always holds that,

$$\begin{aligned} \begin{array}{l} \tilde{\varvec{g}}^t = {\textbf{1}}_m\otimes \sum _i \nabla U(\tilde{\varvec{\omega }};X_i,Y_i)={\bar{G}}(\tilde{\varvec{\omega }}),\text { and} \\ \tilde{\aleph }^t = {\textbf{1}}_m\otimes \sum _i ({\textbf{p}}_i^t)^T\left( \nabla ^2_{\varvec{\omega }^2} U(\tilde{\varvec{\omega }};X_i,Y_i)\right) ({\textbf{p}}_i^t) \end{array} \end{aligned}$$

(16)

Thus the only cause of a discrepancy between the chains for $\tilde{\varvec{\omega }}^t$ and $\hat{\varvec{\omega }}^t$ is the truncation of the potential at the second order to compute the acceptance probability. In particular we know that the error in this case is simply the Taylor expansion error, which is bounded by,

$$\begin{aligned} \left| \frac{\epsilon ^3}{6} \frac{\partial ^3 U}{\partial \varvec{\omega }^3} [\varvec{p}^t][\varvec{p}^t][\varvec{p}^t]\right| \le \frac{\epsilon ^3 L_3 \Vert \varvec{p}^t\Vert ^3}{6} \end{aligned}$$

(17)

where $L_3$ is given in Assumption 3.1.
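The bound (17) is the standard Lagrange remainder of a second-order Taylor expansion. As a minimal numerical sketch, assume a hypothetical one-dimensional potential $U(\omega )=\sin (\omega )$, whose third derivative is bounded by $L_3=1$; this potential and the values of `omega`, `p`, and `eps` are illustrative choices and do not come from the paper.

```python
import math

def taylor2_error(U, dU, d2U, omega, p, eps):
    """Absolute error of the second-order Taylor expansion of U at omega along eps * p."""
    exact = U(omega + eps * p)
    approx = U(omega) + eps * p * dU(omega) + 0.5 * (eps * p) ** 2 * d2U(omega)
    return abs(exact - approx)

# Hypothetical 1-D potential U(w) = sin(w); its third derivative is -cos(w), so L_3 = 1.
L3 = 1.0
U, dU, d2U = math.sin, math.cos, lambda w: -math.sin(w)

omega, p = 0.3, 1.7  # illustrative position and momentum
for eps in (0.5, 0.1, 0.02):
    err = taylor2_error(U, dU, d2U, omega, p, eps)
    bound = eps ** 3 * L3 * abs(p) ** 3 / 6  # right-hand side of (17) in one dimension
    print(f"eps={eps}: error={err:.3e}, bound={bound:.3e}")
```

In this toy setting the truncation error stays below $\epsilon ^3 L_3 |p|^3/6$ for every stepsize tried, and both quantities shrink cubically in $\epsilon$.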

Thus the discrepancy between $\hat{\varvec{\omega }}$ and $\tilde{\varvec{\omega }}$ arises only when a proposal is accepted in one chain and rejected in the other. The probability of this event is bounded by (17), and the resulting error is bounded by the size of the step, that is, $\epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))$ (with ${\bar{G}}(\hat{\varvec{\omega }}^t)$ in the other case). We now prove the main result of this section. Following the notation of Algorithm 1, we refer to $\tilde{\varvec{\omega }}^*$ and $\hat{\varvec{\omega }}^*$ as the proposed parameter samples obtained from the Euler update of $\tilde{\varvec{\omega }}^t$ and $\hat{\varvec{\omega }}^t$, respectively.

Proof of Theorem 3.1

For notational brevity, we let,

$$\begin{aligned} \begin{array}{l} {\mathcal {M}}^t:= {\mathcal {M}}(\tilde{\varvec{\omega }}^{*},\tilde{\varvec{\omega }}^{t},\tilde{\aleph }^{t+1},u^t),\text { and}\\ \hat{{\mathcal {M}}}^t:= \hat{{\mathcal {M}}}(\hat{\varvec{\omega }}^{*},\hat{\varvec{\omega }}^{t},u^t) \end{array} \end{aligned}$$

Performing the recursion, we have that,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\hat{\varvec{\omega }}^{t+1}\right\| \le {\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{t}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{t})\right] {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t}-\hat{\varvec{\omega }}^{t}\right\| \\ \qquad +{\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{*}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{*})\right] {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))+\tilde{\varvec{\omega }}^t-\hat{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\hat{\varvec{\omega }}^t))\right\| \\ \qquad +{\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{*}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{t})\right] {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))+\tilde{\varvec{\omega }}^t-\hat{\varvec{\omega }}^t\right\| \\ \qquad +{\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{t}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{*})\right] {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\hat{\varvec{\omega }}^t))+\hat{\varvec{\omega }}^t-\tilde{\varvec{\omega }}^t\right\| \end{array} \end{aligned}$$

i.e., we partition the discrepancy according to acceptance or rejection in each chain and bound the discrepancy associated with each of the four outcomes. Using the triangle inequality and the fact that these events are mutually exclusive and exhaustive, so that their probabilities add to one, we have,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\hat{\varvec{\omega }}^{t+1}\right\| \le {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t}-\hat{\varvec{\omega }}^{t}\right\| \\ \qquad + {\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{*}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{*})\right] {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))-\epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\hat{\varvec{\omega }}^t))\right\| \\ \qquad +{\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{*}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{t})\right] {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))\right\| \\ \qquad +{\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{t}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{*})\right] {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\hat{\varvec{\omega }}^t))\right\| \end{array} \end{aligned}$$

(18)

Now, let us bound the difference in the acceptance probabilities for the two chains as follows,

$$\begin{aligned} \begin{array}{l} {\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{*}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{t})\right] + {\mathbb {P}}\left[ ({\mathcal {M}}^t = \tilde{\varvec{\omega }}^{t}) \cap (\hat{{\mathcal {M}}}^t = \hat{\varvec{\omega }}^{*})\right] \\ = \left| \exp \{-\epsilon ^2\varvec{\aleph }^t -\epsilon ^2\Vert \varvec{g}^t\Vert ^2\}-\exp \{-H(\hat{\varvec{\omega }}^*,\varvec{p}^*)+H(\hat{\varvec{\omega }}^t,\varvec{p}^t)\} \right| \\ = \exp \left\{ -H(\hat{\varvec{\omega }}^*,\varvec{p}^*)+H(\hat{\varvec{\omega }}^t,\varvec{p}^t)\right\} \left| \exp \left\{ H(\hat{\varvec{\omega }}^*,\varvec{p}^*)-H(\hat{\varvec{\omega }}^t,\varvec{p}^t)-\epsilon ^2\varvec{\aleph }^t -\epsilon ^2\Vert \varvec{g}^t\Vert ^2\right\} -1\right| \\ \le e \left| H(\hat{\varvec{\omega }}^*,\varvec{p}^*)-H(\hat{\varvec{\omega }}^t,\varvec{p}^t)-\epsilon ^2\varvec{\aleph }^t -\epsilon ^2\Vert \varvec{g}^t\Vert ^2\right| \\ \le e \left[ \epsilon ^2 \left| \Vert \varvec{g}^t\Vert ^2-\Vert \hat{\varvec{g}}^t\Vert ^2\right| +\epsilon ^3 L_3 \Vert \varvec{p}^t\Vert ^2\right] \\ \le e \left[ \epsilon ^2 (\Vert \varvec{g}^t\Vert -\Vert \hat{\varvec{g}}^t\Vert )(\Vert \varvec{g}^t\Vert +\Vert \hat{\varvec{g}}^t\Vert -\Vert \hat{\varvec{g}}^t\Vert +\Vert \hat{\varvec{g}}^t\Vert )+\epsilon ^3 L_3 \Vert \varvec{p}^t\Vert ^2\right] \\ \le e \left[ 2 \epsilon ^2 L_2 \Vert \hat{\varvec{\omega }}^t-\varvec{\omega }^t\Vert U+\epsilon ^3 L_3 M^{(2)}\right] \end{array} \end{aligned}$$
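The first equality in the display above relies on both chains using the same uniform draw $u^t$ in the Metropolis step: with a shared $u^t$, exactly one chain accepts precisely when $u^t$ falls between the two acceptance probabilities. The following sketch checks this on a deterministic grid of $u$ values. The acceptance probabilities `a_tilde` and `a_hat` are hypothetical numbers chosen for illustration, and the acceptance rule `u < a` is an assumed convention for the accept test.

```python
def disagreement_fraction(a_tilde, a_hat, n_grid=100_000):
    """Fraction of shared uniform draws u for which exactly one chain accepts."""
    count = 0
    for k in range(n_grid):
        u = (k + 0.5) / n_grid  # midpoint grid standing in for u ~ Uniform(0, 1)
        accept_tilde = u < a_tilde
        accept_hat = u < a_hat
        count += accept_tilde != accept_hat
    return count / n_grid

# Hypothetical acceptance probabilities for the two coupled chains.
a_tilde, a_hat = 0.83, 0.79
frac = disagreement_fraction(a_tilde, a_hat)
print(frac, abs(a_tilde - a_hat))
```

Up to the grid resolution, the disagreement fraction matches the absolute difference of the two acceptance probabilities, which is the quantity the chain of inequalities goes on to bound.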

We also have the following norm bound on the expected parameter update,

$$\begin{aligned} \begin{array}{rl} {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))\right\| &{}\le \epsilon {\mathbb {E}}\left\| \varvec{p}^t\right\| +\epsilon ^2{\mathbb {E}}\left\| {\bar{G}}(\tilde{\varvec{\omega }}^t)-{\bar{G}}(\hat{\varvec{\omega }}^t)\right\| +\epsilon ^2{\mathbb {E}}\left\| {\bar{G}}(\hat{\varvec{\omega }}^t)\right\| \\ {} &{}\le \epsilon M^{(1)}+\epsilon ^2L_2 {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^t-\hat{\varvec{\omega }}^t\right\| +\epsilon ^2U_g \end{array} \end{aligned}$$

In the case of dual acceptance,

$$\begin{aligned} {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\tilde{\varvec{\omega }}^t))-\epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\hat{\varvec{\omega }}^t))\right\| \le \epsilon ^2 L_2 \Vert \tilde{\varvec{\omega }}^t-\hat{\varvec{\omega }}^t\Vert \end{aligned}$$

Combining these last three bounds with (18), we finally have

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\hat{\varvec{\omega }}^{t+1}\right\| \le \left[ 1+\epsilon ^2 L_2+2e\epsilon ^3L_2 (M^{(1)}+\epsilon U_g)+e\epsilon ^5 L_2L_3 M^{(2)}\right] {\mathbb {E}} \Vert \tilde{\varvec{\omega }}^t-\hat{\varvec{\omega }}^t\Vert \\ \qquad + 2e\epsilon ^4 L_2^2 ({\mathbb {E}} \Vert \tilde{\varvec{\omega }}^t-\hat{\varvec{\omega }}^t\Vert )^2 + e\epsilon ^4 L_3 M^{(2)}(M^{(1)}+\epsilon U_g) \end{array} \end{aligned}$$

Thus, with $\tilde{\varvec{\omega }}^0=\hat{\varvec{\omega }}^0$ we have that, for any $\epsilon$, if $T(\epsilon )\in {\mathbb {N}}$ is sufficiently small such that,

$$\begin{aligned} \begin{array}{l} A(\epsilon )^{T(\epsilon )} B(\epsilon ) \le 1,\\ A(\epsilon ):= 1+\epsilon ^2 L_2+2e\epsilon ^3L_2 (M^{(1)}+\epsilon U_g)+e\epsilon ^5 L_2L_3 M^{(2)}+2e\epsilon ^4 L_2^2,\\ B(\epsilon ):= e\epsilon ^4 L_3 M^{(2)}(M^{(1)}+\epsilon U_g) \end{array} \end{aligned}$$

we have that for $t\le T(\epsilon )$,

$$\begin{aligned} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\hat{\varvec{\omega }}^{t+1}\right\| \le \sum \limits _{s=0}^t A(\epsilon )^s B(\epsilon ) \end{aligned}$$

(19)

$\square$
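The bound (19) is what one obtains by unrolling a linear recursion $e_{s+1}\le A(\epsilon )e_s+B(\epsilon )$ from $e_0=0$. As a reading of the proof rather than a statement from it, the $2e\epsilon ^4L_2^2$ term in $A(\epsilon )$ appears to absorb the squared error term, which is valid while the error stays below one. The sketch below evaluates $A(\epsilon )$ and $B(\epsilon )$ and compares the unrolled recursion with the geometric sum; all constants are hypothetical values chosen only for illustration.

```python
import math

def A(eps, L2, L3, M1, M2, Ug):
    """Growth factor A(eps) from the condition preceding (19)."""
    e = math.e
    return (1 + eps**2 * L2 + 2 * e * eps**3 * L2 * (M1 + eps * Ug)
            + e * eps**5 * L2 * L3 * M2 + 2 * e * eps**4 * L2**2)

def B(eps, L3, M1, M2, Ug):
    """Additive term B(eps) from the condition preceding (19)."""
    return math.e * eps**4 * L3 * M2 * (M1 + eps * Ug)

def unrolled_bound(a, b, t):
    """Iterate e_{s+1} = a * e_s + b from e_0 = 0 for t + 1 steps."""
    err = 0.0
    for _ in range(t + 1):
        err = a * err + b
    return err

# Hypothetical constants, not taken from the paper.
consts = dict(L2=1.0, L3=1.0, M1=1.0, M2=1.0, Ug=1.0)
eps, t = 0.05, 20
a = A(eps, **consts)
b = B(eps, consts["L3"], consts["M1"], consts["M2"], consts["Ug"])
geometric_sum = sum(a**s * b for s in range(t + 1))  # right-hand side of (19)
print(unrolled_bound(a, b, t), geometric_sum)
```

Because $A(\epsilon )>1$, the bound grows with $t$, which is why the result is stated only up to a horizon $T(\epsilon )$.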

Appendix 1.2: Consensus between decentralized and averaged HMC

Now we relate the process as generated by Algorithm 1 to the average dynamics as given by (5).

Proof of Theorem 3.2

Consider the recursion in expected L2 error.

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\Vert \bar{\varvec{\omega }}^{t+1}-\varvec{\omega }^{t+1}\Vert + {\mathbb {E}}\Vert \bar{\varvec{g}}^{t+1}-\varvec{g}^{t+1}\Vert + {\mathbb {E}}\Vert {{\bar{\aleph }}}^{t+1}-\aleph ^{t+1}\Vert \\ \le {\mathbb {E}}\left\| {\textbf{W}}{\mathcal {M}}({\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1})-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T){\mathcal {M}}({\varvec{ \omega }}^t-\epsilon (\varvec{p}^t+\epsilon {\varvec{ g}}^{t+1}),{\varvec{\omega }}^t,\aleph ^{t+1})\right\| \\ \quad + {\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left( \bar{\varvec{g}}^t-\varvec{g}^t\right) \right\| \\ \quad +{\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left[ G( \varvec{\omega }^t)-G(\bar{\varvec{\omega }}^t)-G( \varvec{\omega }^{t-1})+G(\bar{\varvec{\omega }}^{t-1})\right] \right\| \\ \quad + {\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left( {{\bar{\aleph }}}^t-\aleph ^t\right) \right\| \\ \quad +{\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left[ H(\varvec{\omega }^t)-H(\bar{\varvec{\omega }}^t)-H(\varvec{\omega }^{t-1})+H(\bar{\varvec{\omega }}^{t-1})\right] \right\| \\ \le {\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left( \bar{\varvec{g}}^t-\varvec{g}^t\right) \right\| + \epsilon ^2\Vert {\textbf{g}}^{t+1}-\bar{{\textbf{g}}}^{t+1}\Vert \\ \qquad +\epsilon {\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \varvec{p}^t\right\| \\ \quad +(1+L_2+L_3){\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left( \bar{\varvec{\omega }}^t-\varvec{\omega }^t\right) \right\| \\ \quad 
+(L_2+L_3){\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left( \bar{\varvec{\omega }}^{t-1}-\varvec{\omega }^{t-1}\right) \right\| \\ \quad + {\mathbb {E}}\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \left( {{\bar{\aleph }}}^t-\aleph ^t\right) \right\| \\ \le 2\beta {\mathbb {E}}\left\| \bar{\varvec{g}}^t-\varvec{g}^t\right\| +2\epsilon \beta {\mathbb {E}}\left\| \varvec{p}^t\right\| \\ \quad +2(1+L_2+L_3)\beta {\mathbb {E}}\left\| \bar{\varvec{\omega }}^t-\varvec{\omega }^t\right\| +2(L_2+L_3)\beta {\mathbb {E}}\left\| \left( \bar{\varvec{\omega }}^{t-1}-\varvec{\omega }^{t-1}\right) \right\| \\ \quad + 2\beta {\mathbb {E}}\left\| {{\bar{\aleph }}}^t-\aleph ^t\right\| \end{array} \end{aligned}$$

(20)

where we have used that ${\textbf{W}} \bar{\varvec{g}}^t = \frac{1}{m}\left( \varvec{I}\otimes \varvec{1} \varvec{1}^T\right) \varvec{g}^t= \frac{1}{m}\left( \varvec{I}\otimes \varvec{1} \varvec{1}^T\right) \bar{\varvec{g}}^t$, etc. throughout and, e.g., Di Lorenzo & Scutari (2016, Lemma 6) for the fact that $\left\| \left( {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right) \right\| \le \beta$. In the last inequality we subtracted $\epsilon ^2\Vert {\textbf{g}}^{t+1}-\bar{{\textbf{g}}}^{t+1}\Vert$ from both sides and lower bounded the left-hand side by half of its original value; this is the source of the factor 2 on the right-hand side.
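The contraction $\left\| {\textbf{W}}-\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\right\| \le \beta$ can be illustrated on a small example. The sketch below assumes a hypothetical ring of $m=4$ agents, each holding a single scalar parameter, with a symmetric doubly stochastic mixing matrix; for this particular matrix the eigenvalues are $1, \frac{1}{2}, 0, \frac{1}{2}$, so $\beta =\frac{1}{2}$. The graph, the weights, and the helper names are assumptions made for illustration, not the setup used in the paper.

```python
import math
import random

def ring_mixing_matrix(m):
    """Symmetric doubly stochastic matrix for a ring: 1/2 on self, 1/4 on each neighbor."""
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        W[i][i] = 0.5
        W[i][(i - 1) % m] += 0.25
        W[i][(i + 1) % m] += 0.25
    return W

def consensus_residual(W, x):
    """Return (W - (1/m) 1 1^T) x for one scalar parameter per agent."""
    m = len(x)
    mean = sum(x) / m
    return [sum(W[i][j] * x[j] for j in range(m)) - mean for i in range(m)]

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))

m = 4
W = ring_mixing_matrix(m)
beta = 0.5  # second-largest eigenvalue magnitude of this particular W

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(m)]
mean = sum(x) / m
lhs = norm(consensus_residual(W, x))           # ||(W - J) x||
rhs = beta * norm([xi - mean for xi in x])     # beta * ||x - mean(x) 1||
print(lhs, rhs)
```

One mixing step shrinks the disagreement among agents by at least the factor $\beta$, which is the mechanism that drives the recursion (20).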

Now the recursion implies that, using induction on the iterates,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\Vert \bar{\varvec{\omega }}^{t+1}-\varvec{\omega }^{t+1}\Vert + {\mathbb {E}}\Vert \bar{\varvec{g}}^{t+1}-\varvec{g}^{t+1}\Vert + {\mathbb {E}}\Vert {{\bar{\aleph }}}^{t+1}-\aleph ^{t+1}\Vert \\ \le \sum \limits _{s=0}^t \left( 2\beta (3+2L_2+2 L_3)\right) ^s\epsilon M^{(1)} \le \frac{\epsilon M^{(1)}}{1-2\beta (3+2L_2+2 L_3)} \end{array} \end{aligned}$$

$\square$

Appendix 1.3: Bounding averaged to approximate HMC

Proof of Theorem 5.3

We have,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{g}}^{t+1}-\bar{\varvec{g}}^{t+1}\right\| \le \\ {\mathbb {E}}\left\| \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\left[ \tilde{\varvec{g}}^{t}-\bar{\varvec{g}}^t+G(\tilde{\varvec{\omega }}^t)-G(\bar{\varvec{\omega }}^t)+G(\bar{\varvec{\omega }}^t)-G({\varvec{\omega }}^t)\right. \right. \\ \qquad \left. \left. +G(\tilde{\varvec{\omega }}^{t-1})-G(\bar{\varvec{\omega }}^{t-1})+G(\bar{\varvec{\omega }}^{t-1})-G({\varvec{\omega }}^{t-1})\right] \right\| \\ \le {\mathbb {E}}\left\| \tilde{\varvec{g}}^{t}-\bar{\varvec{g}}^t\right\| + L_2 {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^t\right\| +L_2{\mathbb {E}} \left\| {\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^t\right\| \\ \qquad + L_2 {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| +L_2 {\mathbb {E}}\left\| {\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| \end{array} \end{aligned}$$

(21)

By the same argument we have,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{{\aleph }}^{t+1}-\bar{{\aleph }}^{t+1}\right\| \le {\mathbb {E}}\left\| \tilde{{\aleph }}^{t}-\bar{{\aleph }}^t\right\| + L_2{\mathbb {E}} \left\| \tilde{\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^t\right\| +L_2 {\mathbb {E}}\left\| {\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^t\right\| \\ \qquad + L_2{\mathbb {E}} \left\| \tilde{\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| +L_2 {\mathbb {E}} \left\| {\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| \end{array} \end{aligned}$$

(22)

Now we derive the difference in the parameters. In the first inequality below, we split the difference into the old parameter values, the update to the old parameter values, and the exhaustive cases in which one proposed parameter is accepted and the other is not.

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\bar{\varvec{\omega }}^{t+1}\right\| \le \\ {\mathbb {E}}\left\| \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\left[ {\mathcal {M}}(\varvec{\omega }^t-\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1},u^t) - {\mathcal {M}}(\tilde{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1}),\tilde{\varvec{\omega }}^t,{\tilde{\aleph }}^{t+1},u^t)\right] \right\| \\ \le {\mathbb {E}}\left\| \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)(\varvec{\omega }^t-\tilde{\varvec{\omega }}^t)\right\| + \epsilon ^2{\mathbb {E}}\left\| \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)(\varvec{g}^{t+1}-\tilde{\varvec{g}}^{t+1})\right\| \\ \quad +{\mathbb {P}}\left[ ({\mathcal {M}}(\varvec{\omega }^t-\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1},u^t) = \varvec{\omega }^t+\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1})) \right. \\ \qquad \left. \cap ({\mathcal {M}}(\tilde{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1}),\tilde{\varvec{\omega }}^t,{\tilde{\aleph }}^{t+1},u^t) = \tilde{\varvec{\omega }}^{t})\right] \\ \qquad \times {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon (\varvec{g}^{t+1}-\bar{\varvec{g}}^{t+1}+\bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}+\tilde{\varvec{g}}^{t+1}-\hat{\varvec{g}}^{t+1}+\hat{\varvec{g}}^{t+1}))\right\| \\ \qquad +{\mathbb {P}}\left[ ({\mathcal {M}}(\varvec{\omega }^t-\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1},u^t) = \varvec{\omega }^t) \right. \\ \qquad \left. \cap ({\mathcal {M}}(\tilde{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1}),\tilde{\varvec{\omega }}^t,{\tilde{\aleph }}^{t+1},u^t) = \tilde{\varvec{\omega }}^{t}+\epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1}))\right] \\ \qquad \times {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1})\right\| \end{array}\end{aligned}$$

(23)

Clearly,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)(\varvec{\omega }^t-\tilde{\varvec{\omega }}^t)\right\| + \epsilon ^2{\mathbb {E}}\left\| \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)(\varvec{g}^{t+1}-\tilde{\varvec{g}}^{t+1})\right\| \\ \qquad \le {\mathbb {E}}\left\| \bar{\varvec{\omega }}^t-\tilde{\varvec{\omega }}^t\right\| +\epsilon ^2{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| \end{array}\end{aligned}$$

Now we must bound the discrepancy in the probability of acceptance. We do this in the same way as in the proof of Theorem 3.1. We use ${\mathcal {M}}^t$ and $\tilde{{\mathcal {M}}}^t$ as shorthand for ${\mathcal {M}}(\varvec{\omega }^t-\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1},u^t)$ and ${\mathcal {M}}(\tilde{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1}),\tilde{\varvec{\omega }}^t,{\tilde{\aleph }}^{t+1},u^t)$, respectively. We compute the bounds,

$$\begin{aligned} \begin{array}{l} {\mathbb {P}}\left[ ({\mathcal {M}}^t = \varvec{\omega }^{*}) \cap (\tilde{{\mathcal {M}}}^t = \tilde{\varvec{\omega }}^{t})\right] + {\mathbb {P}}\left[ ({\mathcal {M}}^t = {\varvec{\omega }}^{t}) \cap (\tilde{{\mathcal {M}}}^t = \tilde{\varvec{\omega }}^{*})\right] \\ = \left| \exp \{-\epsilon ^2\varvec{\aleph }^t -\epsilon ^2\Vert \varvec{g}^t\Vert ^2\}-\exp \{-\epsilon ^2\tilde{\varvec{\aleph }}^t -\epsilon ^2\Vert \tilde{\varvec{g}}^t\Vert ^2\} \right| \\ = \exp \left\{ -\epsilon ^2\varvec{\aleph }^t -\epsilon ^2\Vert \varvec{g}^t\Vert ^2\right\} \left| \exp \left\{ \epsilon ^2\varvec{\aleph }^t +\epsilon ^2\Vert \varvec{g}^t\Vert ^2-\epsilon ^2\tilde{\varvec{\aleph }}^t -\epsilon ^2\Vert \tilde{\varvec{g}}^t\Vert ^2\right\} -1\right| \\ \le e \epsilon ^2\left| \varvec{\aleph }^t +\Vert \varvec{g}^t\Vert ^2-\tilde{\varvec{\aleph }}^t -\Vert \tilde{\varvec{g}}^t\Vert ^2\right| \\ \le e \epsilon ^2 \left[ \left| \Vert \varvec{g}^t\Vert ^2-\Vert \tilde{\varvec{g}}^t\Vert ^2\right| +\Vert \varvec{\aleph }^t-\tilde{\varvec{\aleph }}^t\Vert \right] \\ \le e \epsilon ^2\left[ (\varvec{g}^t-\tilde{\varvec{g}}^t)^T(\varvec{g}^t+\tilde{\varvec{g}}^t-\tilde{\varvec{g}}^t+\tilde{\varvec{g}}^t-2\hat{\varvec{g}}^t+2\hat{\varvec{g}}^t)+\Vert \varvec{\aleph }^t-\tilde{\varvec{\aleph }}^t\Vert \right] \\ \le e \epsilon ^2\left[ \Vert \varvec{g}^t-\tilde{\varvec{g}}^t\Vert ^2+\Vert \varvec{g}^t-\tilde{\varvec{g}}^t\Vert (\Vert \tilde{\varvec{g}}^t-\hat{\varvec{g}}^t\Vert +2U_g)+\Vert \varvec{\aleph }^t-\tilde{\varvec{\aleph }}^t\Vert \right] \end{array} \end{aligned}$$

Next, we see that we have already prepared the expression to bound the following term,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon (\varvec{g}^{t+1}-\bar{\varvec{g}}^{t+1}+\bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}+\tilde{\varvec{g}}^{t+1}-\hat{\varvec{g}}^{t+1}+\hat{\varvec{g}}^{t+1}))\right\| \\ \qquad \le \epsilon \left( M^{(1)}+\epsilon \left( {\mathbb {E}}\left\| \varvec{g}^{t+1}-\bar{\varvec{g}}^{t+1}\right\| +{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| +{\mathbb {E}}\left\| \tilde{\varvec{g}}^{t+1}-\hat{\varvec{g}}^{t+1}\right\| +U_g\right) \right) \end{array} \end{aligned}$$

We see that in the parentheses the terms ${\mathbb {E}}\left\| \varvec{g}^{t+1}-\bar{\varvec{g}}^{t+1}\right\|$ and ${\mathbb {E}}\left\| \tilde{\varvec{g}}^{t+1}-\hat{\varvec{g}}^{t+1}\right\|$ already have a bound due to the previous two theorems. Next,

$$\begin{aligned} {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1})\right\| = {\mathbb {E}}\left\| \epsilon (\varvec{p}^t+\epsilon (\tilde{\varvec{g}}^{t+1}-\hat{\varvec{g}}^{t+1}+\hat{\varvec{g}}^{t+1}))\right\| \le \epsilon M^{(1)}+\epsilon ^2 U_g+\epsilon ^2 {\mathbb {E}}\Vert \tilde{\varvec{g}}^{t+1}-\hat{\varvec{g}}^{t+1}\Vert \end{aligned}$$

Putting all of these expressions together we finally get,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\bar{\varvec{\omega }}^{t+1}\right\| \\ \quad \le {\mathbb {E}}\left\| \bar{\varvec{\omega }}^t-\tilde{\varvec{\omega }}^t\right\| +\epsilon ^2{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| \\ \qquad + e \epsilon ^3\left[ {\mathbb {E}}\Vert \varvec{g}^t-\tilde{\varvec{g}}^t\Vert ^2+{\mathbb {E}}\Vert \varvec{g}^t-\tilde{\varvec{g}}^t\Vert (K_i(\epsilon ,t)+2U)+K_c\right] \\ \qquad \times \left[ \left( M^{(1)}+\epsilon \left( K_c+{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| +K_i(\epsilon ,t+1)+U_g\right) \right) +M^{(1)}+\epsilon U_g+\epsilon K_i(\epsilon ,t+1)\right] \end{array} \end{aligned}$$

Now suppose that there exists a $T_2(\epsilon )$ such that for $t\le T_2(\epsilon )$, it holds that $\max \left\{ {\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| ,\Vert \varvec{g}^t-\tilde{\varvec{g}}^t\Vert \right\} \le 1$. Such a $T_2(\epsilon )$ clearly exists for sufficiently small $\epsilon$, as the bound holds trivially for $t=0$. We derive the exact requirement on $T_2(\epsilon )$ in the sequel. Under this assumption we have,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\bar{\varvec{\omega }}^{t+1}\right\| \\ \quad \le {\mathbb {E}}\left\| \bar{\varvec{\omega }}^t-\tilde{\varvec{\omega }}^t\right\| +\epsilon ^2{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| \\ \qquad + e \epsilon ^3\left[ {\mathbb {E}}\Vert \bar{\varvec{g}}^t-\tilde{\varvec{g}}^t\Vert +K_c+K_i(\epsilon ,t)+2U+K_c\right] \\ \qquad \quad \times \left[ \left( M^{(1)}+\epsilon \left( K_c+1+K_i(\epsilon ,t+1)+U_g\right) \right) +M^{(1)}+\epsilon U_g+\epsilon K_i(\epsilon ,t+1)\right] \end{array} \end{aligned}$$

(24)

Finally, combining (21) with (22) and (24) we obtain,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\bar{\varvec{\omega }}^{t+1}\right\| +{\mathbb {E}}\left\| \tilde{\varvec{g}}^{t+1}-\bar{\varvec{g}}^{t+1}\right\| +{\mathbb {E}}\left\| \tilde{{\aleph }}^{t+1}-\bar{{\aleph }}^{t+1}\right\| \\ \quad \le {\mathbb {E}}\left\| \tilde{\varvec{g}}^{t}-\bar{\varvec{g}}^t\right\| + L_2 {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^t\right\| +L_2K_c \\ \qquad \qquad \qquad \qquad + L_2 {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| +L_2 K_c \\ \qquad +{\mathbb {E}}\left\| \tilde{{\aleph }}^{t}-\bar{{\aleph }}^t\right\| + L_2{\mathbb {E}} \left\| \tilde{\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^t\right\| +L_2 K_c \\ \qquad \qquad \qquad + L_2{\mathbb {E}} \left\| \tilde{\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| +L_2 K_c \\ \qquad +{\mathbb {E}}\left\| \bar{\varvec{\omega }}^t-\tilde{\varvec{\omega }}^t\right\| +\epsilon ^2{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\| \\ \qquad + e \epsilon ^3\left[ {\mathbb {E}}\Vert \bar{\varvec{g}}^t-\tilde{\varvec{g}}^t\Vert +K_c+K_i(\epsilon ,t)+2U+K_c\right] \\ \qquad \quad \times \left[ \left( M^{(1)}+\epsilon \left( K_c+1+K_i(\epsilon ,t+1)+U_g\right) \right) +M^{(1)}+\epsilon U_g+\epsilon K_i(\epsilon ,t+1)\right] \end{array} \end{aligned}$$

(25)

Let ${\bar{K}}(\epsilon ,t)=\left( M^{(1)}+\epsilon \left( K_c+1+K_i(\epsilon ,t+1)+U_g\right) \right) +M^{(1)}+\epsilon U_g+\epsilon K_i(\epsilon ,t+1)$. Next, note that the inequality implies a monotonically increasing bound, so we can apply $\left\| {\varvec{\omega }}^{t-1}-\bar{\varvec{\omega }}^{t-1}\right\| \le \left\| {\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^{t}\right\|$. Then, subtracting $\epsilon ^2{\mathbb {E}}\left\| \bar{\varvec{g}}^{t+1}-\tilde{\varvec{g}}^{t+1}\right\|$ from both sides and dividing by $1-\epsilon ^2$ we get,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\bar{\varvec{\omega }}^{t+1}\right\| +{\mathbb {E}}\left\| \tilde{\varvec{g}}^{t+1}-\bar{\varvec{g}}^{t+1}\right\| +{\mathbb {E}}\left\| \tilde{{\aleph }}^{t+1}-\bar{{\aleph }}^{t+1}\right\| \\ \quad \le (1-\epsilon ^2)^{-1}\left[ (1+4L_2){\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t}-\bar{\varvec{\omega }}^{t}\right\| +(1+e\epsilon ^3 {\bar{K}}(\epsilon ,t)){\mathbb {E}}\left\| \tilde{\varvec{g}}^{t}-\bar{\varvec{g}}^{t}\right\| \right. \\ \qquad +\left. {\mathbb {E}}\left\| \tilde{{\aleph }}^{t}-\bar{{\aleph }}^{t}\right\| +3L_2 K_c+{\bar{K}}(\epsilon ,t)+e\epsilon ^3(3K_c+K_i(\epsilon ,t)+2U_g) \right] \end{array} \end{aligned}$$

Now, defining,

$$\begin{aligned} \begin{array}{l} A_2(\epsilon ,t):= (1-\epsilon ^2)^{-1}\left[ 1+4L_2+e\epsilon ^3 {\bar{K}}(\epsilon ,t)\right] \\ B_2(\epsilon ,t):= (1-\epsilon ^2)^{-1}\left[ 3L_2 K_c+{\bar{K}}(\epsilon ,t)+e\epsilon ^3(3K_c+K_i(\epsilon ,t)+2U)\right] \\ T_2(\epsilon ):= \max \left\{ T\in {\mathbb {N}}:\, \sum \limits _{t=0}^T \prod _{s=t+1}^T A_2(\epsilon ,s) B_2(\epsilon ,t) \le 1\right\} \end{array} \end{aligned}$$

We obtain the main result. $\square$
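The horizon $T_2(\epsilon )$ defined above can be evaluated numerically once $A_2$ and $B_2$ are known. The following is a minimal sketch, assuming both are supplied as Python callables of $t$ with $\epsilon$ held fixed. The function name `horizon_t2`, the cap `t_max`, and the constants in the usage line are illustrative placeholders and do not come from the paper.

```python
from typing import Callable, Optional


def horizon_t2(A2: Callable[[int], float],
               B2: Callable[[int], float],
               t_max: int = 10**6) -> Optional[int]:
    """Largest T <= t_max with sum_{t=0}^T prod_{s=t+1}^T A2(s) * B2(t) <= 1.

    The partial sums satisfy S_T = A2(T) * S_{T-1} + B2(T) with S_0 = B2(0).
    When A2 >= 1 and B2 >= 0 the sums are nondecreasing in T, so stopping at
    the first violation returns the maximum. Returns None if T = 0 already
    violates the bound.
    """
    s = B2(0)
    if s > 1.0:
        return None
    T = 0
    while T < t_max:
        s_next = A2(T + 1) * s + B2(T + 1)
        if s_next > 1.0:
            break
        s, T = s_next, T + 1
    return T


# Hypothetical constants: A2 = 1 and B2 = 0.25 give S_T = 0.25 * (T + 1).
print(horizon_t2(lambda t: 1.0, lambda t: 0.25))  # 3
```

Since $A_2(\epsilon ,t)\ge 1$ for $\epsilon <1$ and $B_2(\epsilon ,t)\ge 0$ by their definitions, the early stop in the loop is consistent with the maximum in the definition of $T_2(\epsilon )$.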

1.4 Appendix 1.4: Coupling and contraction outside a finite ball

In this Section we prove Theorem 3.4.

Proof

Let ${\mathcal {A}}_{\omega }$ and ${\mathcal {A}}_{\nu }$ be the acceptance probabilities corresponding to the updates of $\varvec{\omega }$ and $\varvec{\nu }$, respectively. Noting that the error q is unknown and cannot be coupled in the two chains, we compute,

$$\begin{aligned} \begin{array}{l} \Vert \bar{{\mathcal {M}}}(\bar{\varvec{\omega }}+\epsilon (\bar{\varvec{p}}+\epsilon \nabla U({\bar{\omega }})),\bar{\varvec{\omega }},u,q_{\omega }) - \bar{{\mathcal {M}}}(\bar{\varvec{\nu }}+\epsilon (\bar{\varvec{p}}+\epsilon \nabla U({\bar{\nu }})),\bar{\varvec{\nu }},u, q_{\nu })\Vert ^2 \\ \le {\textbf{1}}_{{\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }} \Vert \bar{\varvec{\omega }}-\epsilon ^2 \nabla U(\bar{\varvec{\omega }})-\epsilon ^2 q_{\omega }- \bar{\varvec{\nu }}+\epsilon ^2 \nabla U(\bar{\varvec{\nu }})+\epsilon ^2 q_{\nu }\Vert ^2 \\ \qquad + {\textbf{1}}_{{\mathcal {A}}^c_{\omega }\cup {\mathcal {A}}^c_{\nu }}\left( \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ^2+ \max \left\{ \Vert \epsilon (\bar{\varvec{p}}+\epsilon \nabla U(\bar{\varvec{\omega }})+\epsilon q_{\omega })\Vert ^2,\Vert \epsilon (\bar{\varvec{p}}+\epsilon \nabla U(\bar{\varvec{\nu }})+\epsilon q_{\nu })\Vert ^2\right\} \right) \end{array} \end{aligned}$$

From Lemma 3.1 we recall that,

$$\begin{aligned} \Vert \bar{\varvec{\omega }}-\epsilon ^2 \nabla U(\bar{\varvec{\omega }})-\epsilon ^2 q_{\omega }- \bar{\varvec{\nu }}+\epsilon ^2 \nabla U(\bar{\varvec{\nu }})+\epsilon ^2 q_{\nu }\Vert ^2 \le (1-\epsilon ^2 K/4)\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ^2 \end{aligned}$$

On the other hand, by Assumption,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \max \left\{ \Vert \epsilon (\bar{\varvec{p}}+\epsilon \nabla U(\bar{\varvec{\omega }})+\epsilon q_{\omega })\Vert ^2,\Vert \epsilon (\bar{\varvec{p}}+\epsilon \nabla U(\bar{\varvec{\nu }})+\epsilon q_{\nu })\Vert ^2\right\} \vert {\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }\right] \\ \le \epsilon ^2 {\mathbb {E}}[\Vert \bar{\varvec{p}}\Vert ^2\vert {\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }] +\epsilon ^4 L_2 R_2+\epsilon ^4 K_T(\epsilon ) \end{array} \end{aligned}$$

but

$$\begin{aligned} {\mathbb {E}}[\Vert \bar{\varvec{p}}\Vert ^2\vert {\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }]{\mathbb {P}}[{\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }]\le {\mathbb {E}}\Vert \bar{\varvec{p}}\Vert ^2 =M^{(2)} \end{aligned}$$
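The inequality above is an instance of the law of total expectation applied to the nonnegative random variable $\Vert \bar{\varvec{p}}\Vert ^2$. Writing ${\mathcal {B}}:={\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }$ (a shorthand introduced only for this remark), the step can be spelled out as,

$$\begin{aligned} {\mathbb {E}}\Vert \bar{\varvec{p}}\Vert ^2 = {\mathbb {E}}[\Vert \bar{\varvec{p}}\Vert ^2\vert {\mathcal {B}}]{\mathbb {P}}[{\mathcal {B}}]+{\mathbb {E}}[\Vert \bar{\varvec{p}}\Vert ^2\vert {\mathcal {B}}^c]{\mathbb {P}}[{\mathcal {B}}^c] \ge {\mathbb {E}}[\Vert \bar{\varvec{p}}\Vert ^2\vert {\mathcal {B}}]{\mathbb {P}}[{\mathcal {B}}] \end{aligned}$$

where the second term is dropped because it is nonnegative.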

Now we must bound ${\mathbb {P}}[{\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }]$. We proceed,

$$\begin{aligned} \begin{array}{l} {\mathbb {P}}[{\mathcal {A}}_{\omega }^c\cup {\mathcal {A}}^c_{\nu }]\le 2 {\mathbb {P}}[{\mathcal {A}}_{\omega }^c] \le 1-\exp \left\{ -\epsilon ^2 {\bar{\aleph }}-\epsilon ^2\Vert \bar{\varvec{g}}\Vert ^2+\epsilon ^2 {\bar{\aleph }}-\epsilon ^2 \aleph +\epsilon ^2\Vert \bar{\varvec{g}}\Vert ^2-\epsilon ^2\Vert \varvec{g}\Vert ^2\right\} \\ \qquad \le \epsilon ^2 K_a(\epsilon ) \end{array} \end{aligned}$$

where $K_a(\epsilon )$ depends on $L_2$ and $K_T(\epsilon )$ and the last step follows similar reasoning as in the proof of Theorem 3.1. Putting these bounds together, we get,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\Vert \bar{{\mathcal {M}}}(\bar{\varvec{\omega }}+\epsilon (\bar{\varvec{p}}+\epsilon \nabla U({\bar{\omega }})),\bar{\varvec{\omega }},u,q_{\omega }) - \bar{{\mathcal {M}}}(\bar{\varvec{\nu }}+\epsilon (\bar{\varvec{p}}+\epsilon \nabla U(\bar{\varvec{\nu }})),\bar{\varvec{\nu }},u, q_{\nu })\Vert ^2 \\ \quad \le (1-\epsilon ^2 K/4+\epsilon ^2 K_a(\epsilon )){\mathbb {E}}\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ^2 +\epsilon ^2 K_a(\epsilon )\left( \epsilon ^2 M^{(2)}+\epsilon ^4 L_2 R_2+\epsilon ^4 K_T(\epsilon )\right) \end{array} \end{aligned}$$

which satisfies the conclusion for sufficiently small $\epsilon$ relative to ${\mathcal {R}}^2\le \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ^2/4$. $\square$

1.5 Appendix 1.5: Global coupling and contraction

Here we prove Theorem 3.5. The proof is based on the original proof of Bou-Rabee (2020, Theorem 2.4).

Proof

For $\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \ge 2{\mathcal {R}}$, there is the straightforward application of Theorem 3.4, where we write,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) \right] \\ \quad \le \rho '\left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\mathbb {E}}\left| \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| -\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right| \\ \quad \le -\frac{1}{8}K\epsilon ^2 \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \le -\frac{1}{8} K\epsilon ^2\inf \limits _{r>0}\frac{r\rho '(r)}{\rho (r)} \\ \quad \le -\frac{1}{40}K\epsilon ^2 (1+{\mathcal {R}}/\epsilon )e^{-\frac{5{\mathcal {R}}}{2\epsilon }} \end{array} \end{aligned}$$

Define the event ${\mathcal {C}}:=\{\bar{\varvec{r}}-\bar{\varvec{p}}=\gamma (\bar{\varvec{\omega }}-\bar{\varvec{\nu }})\}$. Now for the case of $\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| <2{\mathcal {R}}$ we consider the exhaustive decomposition,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) \right] \\ \quad = {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}( {\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}} )\right] \\ \qquad + {\mathbb {E}}\left[ \rho \left( \min (R_1,\left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| )\right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}({\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}^c)\right] \\ \qquad + {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \min (R_1,\left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| )\right) {\textbf{1}}( {\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}^c)\right] \\ \qquad + {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}( ({\mathcal {A}}^c_{\omega }\cap {\mathcal {A}}_{\nu })\cup ({\mathcal {A}}_{\omega }\cap {\mathcal {A}}^c_{\nu }))\right] \end{array} \end{aligned}$$

Now for the first expression, under the event ${\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}$ it holds that,

$$\begin{aligned} \begin{array}{l} \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| =\left\| \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q}_{\omega })_1-\varvec{L}(\bar{\varvec{\nu }},\varvec{r},\varvec{q}_{\nu })_1\right\| \\ \quad = \left\| \bar{\varvec{\omega }}+\epsilon (\varvec{p}+\epsilon \nabla U(\bar{\varvec{\omega }})+\epsilon \varvec{q}_{\omega })-\bar{\varvec{\nu }}-\epsilon (\varvec{r}+\epsilon \nabla U(\bar{\varvec{\nu }})+\epsilon \varvec{q}_{\nu })\right\| \\ \quad = \left\| (1-\epsilon \gamma )(\bar{\varvec{\omega }}-\bar{\varvec{\nu }})+\epsilon ^2 (\nabla U(\bar{\varvec{\omega }})+\varvec{q}_{\omega })-\epsilon ^2 (\nabla U(\bar{\varvec{\nu }})+ \varvec{q}_{\nu })\right\| \\ \quad \le (1-\epsilon \gamma +\epsilon ^2 L_2)\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +\epsilon ^2 K_T(\epsilon ) \end{array} \end{aligned}$$

and thus by the concavity of $\rho$,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}( {\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}} )\right] \\ \quad \le \left[ (-\epsilon \gamma +\epsilon ^2 L_2)\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +\epsilon ^2 K_T(\epsilon )\right] \rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ) [1-{\mathbb {P}}[{\mathcal {A}}^c_{\omega }]-{\mathbb {P}}[{\mathcal {A}}^c_{\nu }]-{\mathbb {P}}[{\mathcal {C}}^c]] \end{array} \end{aligned}$$

Now as in the proof of Theorem 3.4, we have that $\max ({\mathbb {P}}[{\mathcal {A}}_{\omega }^c],{\mathbb {P}}[{\mathcal {A}}_{\nu }^c])\le \epsilon ^2 K_a(\epsilon )$. Next, recall that by construction and Bou-Rabee (2020, Lemma 3.7) together with $\gamma {\mathcal {R}}\le 1/4$ we have that ${\mathbb {P}}[{\mathcal {C}}^c]\le \frac{\gamma \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert }{\sqrt{2\pi }}\le \frac{1}{4\sqrt{2\pi }}< \frac{1}{10}$. Thus, we can make $\epsilon$ sufficiently small depending on $\gamma$, $L_2$ and the form of $K_a(\epsilon )$ such that,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}( {\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}} )\right] \\ \quad \le -\frac{3}{5}\epsilon \gamma \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert )+ \epsilon ^2 K_T(\epsilon )\rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ) \end{array} \end{aligned}$$

Next, it holds that for $s\le R_1$, $\rho (s)-\rho (r)\le \frac{1}{a}\rho '(r)$ and thus by Bou-Rabee (2020, Lemma 3.7) and the definition of $a=1/\epsilon$, we have that,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \min (R_1,\left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| )\right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}({\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}^c)\right] \\ \quad \le \frac{1}{a}\rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ){\mathbb {P}}[{\mathcal {C}}^c] < \frac{2}{5}\gamma \epsilon \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ) \end{array} \end{aligned}$$

Next, under ${\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}^c$, by the definition of $\varvec{r}$,

$$\begin{aligned} \begin{array}{l} \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| =\left\| \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q}_{\omega })_1-\varvec{L}(\bar{\varvec{\nu }},\varvec{r},\varvec{q}_{\nu })_1\right\| \\ \quad = \left\| \bar{\varvec{\omega }}+\epsilon (\varvec{p}+\epsilon \nabla U(\bar{\varvec{\omega }})+\epsilon \varvec{q}_{\omega })-\bar{\varvec{\nu }}-\epsilon (\varvec{r}+\epsilon \nabla U(\bar{\varvec{\nu }})+\epsilon \varvec{q}_{\nu })\right\| \\ \quad = \left\| (1-\epsilon \gamma )(\bar{\varvec{\omega }}-\bar{\varvec{\nu }})+\epsilon ^2 (\nabla U(\bar{\varvec{\omega }})+\varvec{q}_{\omega })-\epsilon ^2 (\nabla U(\bar{\varvec{\nu }})+ \varvec{q}_{\nu })\right\| + \epsilon \Vert \bar{\varvec{p}}-\bar{\varvec{r}}\Vert \\ \quad \le (1-\epsilon \gamma /2)\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +\epsilon ^2 K_T(\epsilon )+4\epsilon \left\| (\bar{\varvec{\omega }}-\bar{\varvec{\nu }})\cdot \bar{\varvec{p}}/\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \right\| \end{array} \end{aligned}$$

Thus, using the fact that by construction $R_1\ge \frac{5}{4}(1+\gamma \epsilon )\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert$ and the properties of the coupling (see the proof of Bou-Rabee (2020, Theorem 2.4)),

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \min (R_1,\left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| )\right) {\textbf{1}}( {\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}^c)\right] \\ \quad \le \rho '(R_1){\mathbb {E}}\left[ \left( \left\| \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q}_{\omega })_1-\varvec{L}(\bar{\varvec{\nu }},\varvec{r},\varvec{q}_{\nu })_1\right\| -R_1\right) ^+{\textbf{1}}({\mathcal {A}}_{\omega }\cap {\mathcal {A}}_{\nu }\cap {\mathcal {C}}^c)\right] \\ \quad \le \left( \frac{5}{4}\gamma \epsilon \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +\epsilon ^2 K_T(\epsilon )+4\epsilon M^{(1)}\right) \rho '(R_1) \\ \quad \le \left( \frac{5}{4}\gamma \epsilon \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +\epsilon ^2 K_T(\epsilon )\right) e^{-a(R_1-2{\mathcal {R}})}\rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ) \\ \quad \le \frac{1}{20}\left( \frac{5}{4}\gamma \epsilon \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +\epsilon ^2 K_T(\epsilon )\right) \rho '(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ) \end{array} \end{aligned}$$

Finally,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| -\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| {\textbf{1}}( ({\mathcal {A}}^c_{\omega }\cap {\mathcal {A}}_{\nu })\cup ({\mathcal {A}}_{\omega }\cap {\mathcal {A}}^c_{\nu }))\right] \\ \quad = {\mathbb {E}}\left[ \left\| \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q}_{\omega })_1-\bar{\varvec{\omega }}\right\| {\textbf{1}} ({\mathcal {A}}^c_{\omega }\cap {\mathcal {A}}_{\nu })\right] +{\mathbb {E}}\left[ \left\| \varvec{L}(\bar{\varvec{\nu }},\varvec{r},\varvec{q}_{\nu })_1-\bar{\varvec{\nu }}\right\| {\textbf{1}}({\mathcal {A}}_{\omega }\cap {\mathcal {A}}^c_{\nu })\right] \\ \quad \le \left[ \epsilon M^{(1)}+\epsilon ^2 R_2+\epsilon ^2 K_T(\epsilon )\right] \epsilon ^2 K_a(\epsilon ) \end{array} \end{aligned}$$

and so, again by the concavity of $\rho (\cdot )$,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) {\textbf{1}}( ({\mathcal {A}}^c_{\omega }\cap {\mathcal {A}}_{\nu })\cup ({\mathcal {A}}_{\omega }\cap {\mathcal {A}}^c_{\nu }))\right] \\ \quad \le \left[ \epsilon M^{(1)}+\epsilon ^2 R_2+\epsilon ^2 K_T(\epsilon )\right] \epsilon ^2 K_a(\epsilon ) \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \end{array} \end{aligned}$$

Finally, putting all of these bounds together, we obtain,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\left[ \rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) -\rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) \right] \\ \quad \le \left[ -\frac{3}{5}+\frac{2}{5}+\frac{5}{80}\right] \gamma \epsilon \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| )\\ \qquad +\left[ \epsilon ^2 K_T(\epsilon )+\frac{1}{20}\epsilon ^2 K_T(\epsilon )\right] \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \\ \qquad + \left[ \epsilon M^{(1)}+\epsilon ^2 R_2+\epsilon ^2 K_T(\epsilon )\right] \epsilon ^2 K_a(\epsilon )\rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \\ \quad = -\frac{11}{80} \gamma \epsilon \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \\ \qquad + \epsilon ^2\left[ \frac{21}{20}K_T(\epsilon )+\epsilon M^{(1)}K_a(\epsilon )+\epsilon ^2 (R_2+K_T(\epsilon ))K_a(\epsilon )\right] \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \end{array} \end{aligned}$$
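The rational coefficients collected in the last display can be checked with exact arithmetic. The short sketch below only verifies the constants; the variable names are labels chosen here for the three contributions and do not appear in the paper.

```python
from fractions import Fraction
from math import pi, sqrt

# Coefficients multiplying gamma * eps * ||omega - nu|| * rho'(||omega - nu||).
coupled = Fraction(-3, 5)                      # event A_omega, A_nu and C
uncoupled = Fraction(2, 5)                     # event A_omega, A_nu and C^c, capped at R_1
overshoot = Fraction(1, 20) * Fraction(5, 4)   # excess beyond R_1

total = coupled + uncoupled + overshoot
print(total)      # -11/80

# Coefficient multiplying eps^2 * K_T(eps) * rho'(||omega - nu||).
kt_coeff = Fraction(1) + Fraction(1, 20)
print(kt_coeff)   # 21/20

# Numerical constants used when bounding P[C^c].
assert 1 / sqrt(2 * pi) < 2 / 5
assert 1 / (4 * sqrt(2 * pi)) < 1 / 10
```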

Noting that,

$$\begin{aligned} -\frac{11}{80} \gamma \epsilon \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \le -\frac{11}{80} \gamma \epsilon \inf \limits _{r\le 2{\mathcal {R}}} \frac{r \rho '(r)}{\rho (r)} \rho (\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| )\le -\frac{11}{160} e^{-2{\mathcal {R}}/\epsilon } \rho (\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \end{aligned}$$

and

$$\begin{aligned} \rho '(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \le \sup \frac{\rho '(r)}{\rho (r)}\rho (\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) = a\rho (\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| ) \end{aligned}$$

we obtain the final result. $\square$

Appendix 2. Experiment hyperparameters

1.1 Linear regression

We set the doubly stochastic matrix ${\textbf{W}} = \frac{1}{N_a}{\textbf{1}}_4$, where the number of agents is $N_a = 4$ and ${\textbf{1}}_4$ is the $4\times 4$ matrix of ones. We run the experiment over 9 seeds for $T = 10^5$ iterations. Hardware: MacBook Pro, Processor: 2.6 GHz 6-Core Intel Core i7, Memory: 16 GB.
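As an illustration of this mixing matrix, the following minimal sketch builds ${\textbf{W}} = \frac{1}{N_a}{\textbf{1}}$ in plain Python, checks that it is doubly stochastic, and applies one averaging step to scalar agent values. The helper names `uniform_mixing_matrix`, `is_doubly_stochastic` and `mix`, and the example values, are assumptions made for this sketch and are not taken from the authors' code.

```python
from typing import List

Matrix = List[List[float]]


def uniform_mixing_matrix(n_agents: int) -> Matrix:
    """W = (1 / n_agents) * ones(n_agents, n_agents)."""
    return [[1.0 / n_agents] * n_agents for _ in range(n_agents)]


def is_doubly_stochastic(W: Matrix, tol: float = 1e-12) -> bool:
    """Nonnegative entries, with every row and every column summing to one."""
    n = len(W)
    nonneg = all(w >= 0.0 for row in W for w in row)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in W)
    cols_ok = all(abs(sum(W[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    return nonneg and rows_ok and cols_ok


def mix(W: Matrix, values: List[float]) -> List[float]:
    """One consensus step: each agent takes the W-weighted average."""
    return [sum(w * v for w, v in zip(row, values)) for row in W]


W = uniform_mixing_matrix(4)
print(is_doubly_stochastic(W))          # True
print(mix(W, [1.0, 2.0, 3.0, 4.0]))     # [2.5, 2.5, 2.5, 2.5]
```

With this choice of ${\textbf{W}}$ every agent reaches the network average after a single mixing step, which corresponds to a fully connected communication graph.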

Centralized HMC: $\epsilon = 4\times 10^{-4}$, $L = 1$, prior precision $= 1.0$.
Decentralized HMC: $\epsilon = 4\times 10^{-4}$, prior precision $= 1.0$. We switch off the MH step for the first $10^3$ steps to ensure that the Taylor approximation is only applied from a point closer to the target distribution.
Decentralized ULA: $\epsilon = 3\times 10^{-7}$. Following the notation of Parayil et al. (2020): $\beta _0 = 0.48,\delta _1 = 0.01, \delta _2 = 0.55, b_1 = 230, b_2 = 230$.
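The practice of switching off the MH step for an initial number of iterations can be sketched on a toy problem. The code below is a generic single-chain HMC sampler with one leapfrog step on the potential $U(x)=x^2/2$, using the standard convention that the target density is proportional to $e^{-U}$. It is not the decentralized algorithm of this paper and does not use the Taylor approximation of the acceptance ratio; the function name `hmc_chain`, the starting point, and the step size in the usage lines are assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def hmc_chain(n_steps: int, eps: float, mh_off_steps: int,
              seed: int = 0) -> Tuple[List[float], int]:
    """One-leapfrog-step HMC on U(x) = x^2 / 2; MH test skipped early on."""
    rng = random.Random(seed)
    x = 5.0                          # deliberately far from the mode
    samples: List[float] = []
    n_rejected = 0
    for t in range(n_steps):
        p = rng.gauss(0.0, 1.0)      # resample the momentum
        # Leapfrog: half kick, full drift, half kick (grad U(x) = x).
        p_half = p - 0.5 * eps * x
        x_new = x + eps * p_half
        p_new = p_half - 0.5 * eps * x_new
        # Change in the Hamiltonian H(x, p) = U(x) + p^2 / 2.
        dH = 0.5 * (x_new ** 2 + p_new ** 2) - 0.5 * (x ** 2 + p ** 2)
        if t < mh_off_steps:
            accept = True            # MH step switched off: always move
        else:
            accept = rng.random() < math.exp(min(0.0, -dH))
        if accept:
            x = x_new
        else:
            n_rejected += 1
        samples.append(x)
    return samples, n_rejected


samples, n_rejected = hmc_chain(n_steps=2000, eps=0.1, mh_off_steps=1000)
print(len(samples), n_rejected)
```

During the first `mh_off_steps` iterations every proposal is accepted, so rejections can only occur afterwards, once the chain has had the chance to move toward the high-probability region.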

1.2 Logistic regression

Partial observation. We set the doubly stochastic matrix ${\textbf{W}} = \frac{1}{N_a}{\textbf{1}}_4$, where the number of agents is $N_a = 4$ and ${\textbf{1}}_4$ is the $4\times 4$ matrix of ones. We run the experiment over 9 seeds for $T = 8\times 10^3$ iterations. Hardware: GeForce RTX 2080 Ti.

Centralized HMC: $\epsilon = 0.001$, $L = 1$, prior precision $= 100.0$.
Decentralized HMC: $\epsilon = 5\times 10^{-4}$, prior precision $= 100.0$. We switch off the MH step for the first $2\times 10^3$ steps to ensure that the Taylor approximation is only applied from a point closer to the target distribution.
Decentralized ULA: $\epsilon = 1\times 10^{-5}$. Following the notation of Parayil et al. (2020): $\beta _0 = 0.48,\delta _1 = 0.01, \delta _2 = 0.55, b_1 = 230, b_2 = 230$.

Ring network. We set the doubly stochastic matrix ${\textbf{W}} = ({\textbf{I}} + {\textbf{A}})\frac{1}{N_a}$, where the number of agents is $N_a = 5$, ${\textbf{I}}$ is the identity matrix, and ${\textbf{A}}$ is the adjacency matrix of a ring-shaped graph. We run the experiment over 9 seeds for $T = 1\times 10^4$ iterations. Hardware: GeForce RTX 2080 Ti.
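The ring construction can be sketched with exact rational arithmetic. The code below assumes an undirected ring in which each agent is connected to exactly two neighbours; the text does not spell out the graph beyond calling it ring-shaped, so this is an assumption, as are the helper names. Under that reading, the rows of ${\textbf{I}}+{\textbf{A}}$ sum to 3, so scaling by $1/N_a$ with $N_a=5$ gives row sums of $3/5$, whereas scaling by $1/3$ (one over degree plus one) gives row sums of one.

```python
from fractions import Fraction
from typing import List


def ring_adjacency(n: int) -> List[List[int]]:
    """Adjacency matrix of an undirected ring on n >= 3 nodes."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][(i + 1) % n] = 1
        A[i][(i - 1) % n] = 1
    return A


def scaled_ring_matrix(n: int, scale: Fraction) -> List[List[Fraction]]:
    """(I + A) * scale for the ring adjacency A."""
    A = ring_adjacency(n)
    return [[(A[i][j] + (1 if i == j else 0)) * scale for j in range(n)]
            for i in range(n)]


def row_sums(W: List[List[Fraction]]) -> List[Fraction]:
    return [sum(row, Fraction(0)) for row in W]


n_agents = 5
print(row_sums(scaled_ring_matrix(n_agents, Fraction(1, n_agents))))
print(row_sums(scaled_ring_matrix(n_agents, Fraction(1, 3))))
```

Both matrices are symmetric, so the column sums equal the row sums in each case.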

Centralized HMC: $\epsilon = 0.001$, $L = 1$, prior precision $= 100.0$.
Decentralized HMC: $\epsilon = 0.003$, prior precision $= 100.0$. We switch off the MH step for the first $1\times 10^3$ steps to ensure that the Taylor approximation is only applied from a point closer to the target distribution.
Decentralized ULA: $\epsilon = 1\times 10^{-4}$. Following the notation of Parayil et al. (2020): $\beta _0 = 0.48,\delta _1 = 0.01, \delta _2 = 0.55, b_1 = 230, b_2 = 230$.

1.3 Bayesian neural network

We set the doubly stochastic matrix ${\textbf{W}} = \frac{1}{N_a}{\textbf{1}}_2$, where the number of agents is $N_a = 2$ and ${\textbf{1}}_2$ is the $2\times 2$ matrix of ones. We run the experiment for $T = 5\times 10^5$ iterations. Hardware: GeForce RTX 2080 Ti.

Decentralized HMC: $\epsilon = 7\times 10^{-5}$, prior precision $= 10.0$. We switch off the MH step for the first $2\times 10^3$ steps to ensure that the Taylor approximation is only applied from a point closer to the target distribution.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kungurtsev, V., Cobb, A., Javidi, T. et al. Decentralized Bayesian learning with Metropolis-adjusted Hamiltonian Monte Carlo. Mach Learn 112, 2791–2819 (2023). https://doi.org/10.1007/s10994-023-06345-6

Received: 17 January 2022
Revised: 11 April 2023
Accepted: 17 April 2023
Published: 20 June 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10994-023-06345-6
