Abstract
Statistical models with constrained probability distributions are abundant in machine learning. Some examples include regression models with norm constraints (e.g., Lasso), probit models, many copula models, and latent Dirichlet allocation (LDA). Bayesian inference for probability distributions confined to constrained domains can be quite challenging for commonly used sampling algorithms. In this work, we propose a novel augmentation technique that handles a wide range of constraints by mapping the constrained domain to a sphere in an augmented space. By moving freely on the surface of this sphere, sampling algorithms handle constraints implicitly and generate proposals that remain within the boundaries when mapped back to the original space. Our proposed method, called Spherical Augmentation, provides a mathematically natural and computationally efficient framework for sampling from constrained probability distributions. We show the advantages of our method over state-of-the-art sampling algorithms, such as exact Hamiltonian Monte Carlo, using several examples including truncated Gaussian distributions, Bayesian Lasso, Bayesian bridge regression, reconstruction of a quantized stationary Gaussian process, and LDA for topic modeling.
Notes
1. Note, \(\mathbf{v}^{\mathsf{T}}\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})\mathbf{v} \le \Vert\mathbf{v}\Vert_2^2 \le \Vert\tilde{\mathbf{v}}\Vert_2^2 = \mathbf{v}^{\mathsf{T}}\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})\mathbf{v}\).
2.
3. The stochastic gradient for sg-wallLMC is \([(n^*_{kw}+\beta-1/2) + \pi_{kw}(n^*_{k\cdot}+W(\beta-1/2))]/n^*_{k\cdot}\), calculated with (2.45).
4. The same argument applies when \(T_k=1\), i.e., \(\varvec{\theta}\) is the first point outside the domain among \(\{\varvec{\theta}_k\}\).
References
Y. Ahmadian, J.W. Pillow, L. Paninski, Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Comput. 23(1), 46–96 (2011)
S. Ahn, Y. Chen, M. Welling, Distributed and adaptive darting Monte Carlo through regenerations, in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) (2013)
S. Ahn, B. Shahbaba, M. Welling, Distributed stochastic gradient MCMC, in International Conference on Machine Learning (2014)
S. Amari, H. Nagaoka, Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191 (Oxford University Press, Oxford, 2000)
C. Andrieu, E. Moulines, On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab. 16(3), 1462–1505 (2006)
C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 72(3), 269–342 (2010)
M.J. Beal, Variational algorithms for approximate Bayesian inference. Ph.D. thesis, University College London, London (2003)
A. Beskos, G. Roberts, A. Stuart, J. Voss, MCMC methods for diffusion bridges. Stoch. Dyn. 8(03), 319–350 (2008)
A. Beskos, F.J. Pinski, J.M. Sanz-Serna, A.M. Stuart, Hybrid Monte Carlo on Hilbert spaces. Stoch. Process. Appl. 121(10), 2201–2230 (2011)
D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
A.E. Brockwell, Parallel Markov chain Monte Carlo simulation by pre-fetching. J. Comput. Gr. Stat. 15, 246–261 (2006)
M.A. Brubaker, M. Salzmann, R. Urtasun, A family of MCMC methods on implicitly defined manifolds, in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), vol. 22, ed. by N.D. Lawrence, M.A. Girolami (2012), pp. 161–172
S. Byrne, M. Girolami, Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 40(4), 825–845 (2013)
B. Calderhead, M. Sustik, Sparse approximate manifolds for differential geometric MCMC, in Advances in Neural Information Processing Systems, vol. 25, ed. by P. Bartlett, F. Pereira, C. Burges, L. Bottou, K. Weinberger (2012), pp. 2888–2896
O. Cappé, R. Douc, A. Guillin, J.M. Marin, C.P. Robert, Adaptive importance sampling in general mixture classes. Stat. Comput. 18(4), 447–459 (2008)
N. Chopin, A sequential particle filter method for static models. Biometrika 89(3), 539–552 (2002)
S.L. Cotter, G.O. Roberts, A. Stuart, D. White, MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28(3), 424–446 (2013)
R.V. Craiu, J. Rosenthal, C. Yang, Learn from thy neighbor: parallel-chain and regional adaptive MCMC. J. Am. Stat. Assoc. 104(488), 1454–1466 (2009)
P. Del Moral, A. Doucet, A. Jasra, Sequential Monte Carlo samplers. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 68(3), 411–436 (2006)
N. de Freitas, P. Højen-Sørensen, M. Jordan, R. Stuart, Variational MCMC, in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01 (Morgan Kaufmann Publishers Inc., San Francisco, 2001), pp. 120–127
M.P. do Carmo, Riemannian Geometry, 1st edn (Birkhäuser, Boston, 1992)
R. Douc, C.P. Robert, A vanilla Rao–Blackwellization of Metropolis–Hastings algorithms. Ann. Stat. 39(1), 261–277 (2011)
S. Duane, A.D. Kennedy, B.J. Pendleton, D. Roweth, Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)
I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics 35(2), 109–135 (1993)
A. Gelfand, L. van der Maaten, Y. Chen, M. Welling, On herding and the cycling perceptron theorem. Adv. Neural Inf. Process. Syst. 23, 694–702 (2010)
C.J. Geyer, Practical Markov Chain Monte Carlo. Stat. Sci. 7(4), 473–483 (1992)
W.R. Gilks, G.O. Roberts, S.K. Sahu, Adaptive Markov chain Monte Carlo through regeneration. J. Am. Stat. Assoc. 93(443), 1045–1054 (1998)
M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B (with discussion) 73(2), 123–214 (2011)
G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins University Press, Baltimore, 1996)
C. Hans, Bayesian lasso regression. Biometrika 96(4), 835–845 (2009)
M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2010), pp. 856–864
M.D. Hoffman, A. Gelman, The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)
K. Kurihara, M. Welling, N. Vlassis, Accelerated variational Dirichlet process mixtures, in Advances of Neural Information Processing Systems – NIPS, vol. 19 (2006)
S. Lan, B. Shahbaba, Spherical Hamiltonian Monte Carlo for constrained target distributions, in The 31st International Conference on Machine Learning, Beijing (2014), pp. 629–637
S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami, Markov chain Monte Carlo from Lagrangian dynamics. J. Comput. Gr. Stat. 24(2), 357–378 (2015)
B. Leimkuhler, S. Reich, Simulating Hamiltonian Dynamics (Cambridge University Press, Cambridge, 2004)
J. Møller, A. Pettitt, K. Berthelsen, R. Reeves, An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. Biometrika 93, 451–458 (2006)
I. Murray, R.P. Adams, D.J. MacKay, Elliptical slice sampling. JMLR:W&CP 9, 541–548 (2010)
P. Mykland, L. Tierney, B. Yu, Regeneration in Markov chain samplers. J. Am. Stat. Assoc. 90(429), 233–241 (1995)
P. Neal, G.O. Roberts, Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodol. Comput. Appl. Probab. 10(2), 277–297 (2008)
P. Neal, G.O. Roberts, W.K. Yuen, Optimal scaling of random walk Metropolis algorithms with discontinuous target densities. Ann. Appl. Probab. 22(5), 1880–1927 (2012)
R.M. Neal, Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
R.M. Neal, Bayesian Learning for Neural Networks (Springer, Secaucus, 1996)
R.M. Neal, Slice sampling. Ann. Stat. 31(3), 705–767 (2003)
R.M. Neal, The short-cut metropolis method. Technical Report 0506, Department of Statistics, University of Toronto (2005)
R.M. Neal, MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo, ed. by S. Brooks, A. Gelman, G. Jones, X.L. Meng (Chapman and Hall/CRC, 2011), pp. 113–162
A. Pakman, L. Paninski, Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Gr. Stat. 23(2), 518–542 (2014)
T. Park, G. Casella, The Bayesian lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008)
S. Patterson, Y.W. Teh, Stochastic gradient Riemannian Langevin dynamics on the probability simplex, in Advances in Neural Information Processing Systems (2013), pp. 3102–3110
N.S. Pillai, A.M. Stuart, A.H. Thiéry, Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Probab. 22(6), 2320–2356 (2012)
J.G. Propp, D.B. Wilson, Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Struct. Algorithms 9(1–2), 223–252 (1996)
R. Douc, A. Guillin, J.-M. Marin, C.P. Robert, Minimum variance importance sampling via population Monte Carlo. ESAIM: Probab. Stat. 11, 427–447 (2007)
G.O. Roberts, S.K. Sahu, Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. J. R. Stat. Soc. Ser. B 59, 291–317 (1997)
G.O. Roberts, A. Gelman, W.R. Gilks, Weak convergence and optimal scaling of random walk metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
G.O. Roberts, J.S. Rosenthal, Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001)
B. Shahbaba, S. Lan, W.O. Johnson, R.M. Neal, Split Hamiltonian Monte Carlo. Stat. Comput. 24(3), 339–349 (2014)
C. Sherlock, G.O. Roberts, Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 15(3), 774–798 (2009)
M. Spivak, Calculus on Manifolds, vol. 1 (WA Benjamin, New York, 1965)
M. Spivak, A Comprehensive Introduction to Differential Geometry, vol. 1, 2nd edn (Publish or Perish, Inc., Houston, 1979)
Y.W. Teh, D. Newman, M. Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2006), pp. 1353–1360
R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
H.M. Wallach, I. Murray, R. Salakhutdinov, D. Mimno, Evaluation methods for topic models, in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009), pp. 1105–1112
G.R. Warnes, The normal kernel coupler: an adaptive Markov Chain Monte Carlo method for efficiently sampling from multi-modal distributions. Technical Report No. 395, University of Washington (2001)
M. Welling, Herding dynamic weights to learn, in Proceeding of International Conference on Machine Learning (2009)
M. Welling, Y.W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in Proceedings of the 28th International Conference on Machine Learning (ICML) (2011), pp. 681–688
M. West, On scale mixtures of normal distributions. Biometrika 74(3), 646–648 (1987)
Y. Zhang, C. Sutton, Quasi-Newton methods for Markov Chain Monte Carlo, in Advances in Neural Information Processing Systems, ed. by J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, K.Q. Weinberger, vol. 24 (2011), pp. 2393–2401
Acknowledgments
SL is supported by EPSRC Programme Grant, Enabling Quantification of Uncertainty in Inverse Problems (EQUIP), EP/K034154/1. BS is supported by NSF grant IIS-1216045 and NIH grant R01-AI107034.
Appendix
Spherical Geometry
We first discuss the geometry of the D-dimensional sphere \(\mathscr{S}^D:=\{\tilde{\varvec{\theta}}\in\mathbb{R}^{D+1}: \Vert\tilde{\varvec{\theta}}\Vert_2=1\}\hookrightarrow\mathbb{R}^{D+1}\) under two coordinate systems, namely Cartesian and spherical coordinates. Since \(\mathscr{S}^D\) can be embedded in (injectively and differentiably mapped to) \(\mathbb{R}^{D+1}\), we first introduce the concept of an 'induced metric'.
Definition 2.1
(induced metric) If \(\mathscr{D}^{d}\) can be embedded in \(\mathscr{M}^{m}\) \((m>d)\) by \(f: U\subset\mathscr{D}\hookrightarrow\mathscr{M}\), then one can define the induced metric \(g_{\mathscr{D}}\) on \(T\mathscr{D}\) through the metric \(g_{\mathscr{M}}\) defined on \(T\mathscr{M}\):
Remark 2.2
For any \(f: U\subset \mathscr {S}^D\hookrightarrow \mathbb R^{D+1}\), we can define the induced metric through dot product on \(\mathbb R^{D+1}\). More specifically,
where \((Df)_{(D+1)\times D}\) is the Jacobian matrix of the mapping f. A metric induced from the dot product on Euclidean space is called a "canonical metric". This observation leads to the following simple fact, which lays the foundation of Spherical HMC.
Proposition 2.7.1
(Energy invariance) The kinetic energy \(\frac{1}{2}\langle\mathbf{v},\mathbf{v}\rangle_{\mathbf{G}(\varvec{\theta})}\) is invariant under the choice of coordinate system.
Proof
For any \(\mathbf{v}\in T_{\varvec{\theta}}\mathscr{D}\), let \(\varvec{\theta}(t)\) be a curve such that \(\dot{\varvec{\theta}}(0)=\mathbf{v}\). Denote the pushforward of \(\mathbf{v}\) by the embedding map \(f:\mathscr{D}\rightarrow\mathscr{M}\) as \(\tilde{\mathbf{v}}:=f_*(\mathbf{v})=\frac{d}{dt}(f\circ\varvec{\theta})(0)\). Then we have
That is, regardless of the form the energy takes under a particular coordinate system, its value equals that on the embedded manifold. In particular, when \(\mathscr{M}=\mathbb{R}^{D+1}\), the right-hand side simplifies to \(\frac{1}{2}\Vert\tilde{\mathbf{v}}\Vert_2^2\). \(\square\)
Canonical Metric in Cartesian Coordinates
Now consider the D-dimensional unit ball \(\mathscr{B}_\mathbf{0}^D(1):=\{\varvec{\theta}\in\mathbb{R}^D: \Vert\varvec{\theta}\Vert_2\le 1\}\). Here, \(\{\varvec{\theta}, \mathscr{B}_\mathbf{0}^D(1)\}\) can be viewed as the Cartesian coordinate system for \(\mathscr{S}^D\). The coordinate mapping \(T_{\mathscr{B}\rightarrow\mathscr{S}_+}:\varvec{\theta}\mapsto\tilde{\varvec{\theta}}=(\varvec{\theta},\theta_{D+1})\) in (2.5) can be viewed as the embedding map into \(\mathbb{R}^{D+1}\), and the Jacobian matrix of \(T_{\mathscr{B}\rightarrow\mathscr{S}_+}\) is \(dT_{\mathscr{B}\rightarrow\mathscr{S}_+}=\frac{d\tilde{\varvec{\theta}}}{d{\varvec{\theta}}^{\mathsf{T}}}=\begin{bmatrix}\mathbf{I}_{D}\\ -{\varvec{\theta}}^{\mathsf{T}}/\theta_{D+1}\end{bmatrix}\). Therefore the canonical metric of \(\mathscr{S}^D\) in Cartesian coordinates is \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})=\mathbf{I}_D+\varvec{\theta}\varvec{\theta}^{\mathsf{T}}/\theta_{D+1}^2\).
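As a quick numerical check of this derivation (a minimal NumPy sketch of ours, not part of the chapter), one can verify that \((Df)^{\mathsf{T}}(Df)\) for the embedding \(\varvec{\theta}\mapsto(\varvec{\theta},\sqrt{1-\Vert\varvec{\theta}\Vert_2^2})\) reproduces this metric:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
theta = rng.uniform(-0.4, 0.4, size=D)       # a point inside the unit ball
theta_last = np.sqrt(1 - theta @ theta)      # θ_{D+1} on the upper hemisphere

# Jacobian of the embedding θ ↦ (θ, sqrt(1 − ‖θ‖²)): [I_D; −θᵀ/θ_{D+1}]
J = np.vstack([np.eye(D), -theta / theta_last])

G_induced = J.T @ J                          # pullback of the Euclidean dot product
G_formula = np.eye(D) + np.outer(theta, theta) / theta_last**2
print(np.allclose(G_induced, G_formula))     # True
```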
Another way to obtain the metric is through the first fundamental form \(ds^{2}\) (i.e., squared infinitesimal length of a curve) for \(\mathscr {S}^D\), which can be expressed in terms of the differential form \(d\varvec{\theta }\) and the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\),
On the other hand, \(ds^{2}\) can also be obtained as follows [59]:
Equating the two expressions above yields the form of the canonical metric \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})\) in Eq. (2.61). This viewpoint also provides a natural interpretation of the length of a tangent vector. For any vector \(\tilde{\mathbf{v}}=(\mathbf{v},v_{D+1})\in T_{\tilde{\varvec{\theta}}}\mathscr{S}^D=\{\tilde{\mathbf{v}}\in\mathbb{R}^{D+1}: {\tilde{\varvec{\theta}}}^{\mathsf{T}}\tilde{\mathbf{v}}=0\}\), one can think of \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})\) as a means of expressing the length of \(\tilde{\mathbf{v}}\) in terms of \(\mathbf{v}\),
This indeed verifies the energy invariance of Proposition 2.7.1.
The following proposition provides the analytic forms of the determinant and the inverse of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\).
Proposition 2.7.2
The determinant and the inverse of the canonical metric are \(|\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})|=\theta_{D+1}^{-2}\) and \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})^{-1}=\mathbf{I}_D-\varvec{\theta}\varvec{\theta}^{\mathsf{T}}\), respectively.
Proof
The determinant of the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) is given by the matrix determinant lemma
The inverse of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) is obtained by the Sherman-Morrison-Woodbury formula [29]
\(\square \)
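Both closed forms are easy to check numerically; the following sketch (ours) compares them against direct computation:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 5
theta = rng.uniform(-0.3, 0.3, size=D)
theta_last = np.sqrt(1 - theta @ theta)
G = np.eye(D) + np.outer(theta, theta) / theta_last**2

# Matrix determinant lemma with u = θ/θ_{D+1}: |I + uuᵀ| = 1 + uᵀu = 1/θ_{D+1}²
print(np.isclose(np.linalg.det(G), 1 / theta_last**2))                    # True
# Sherman–Morrison: (I + uuᵀ)⁻¹ = I − uuᵀ/(1 + uᵀu) = I − θθᵀ
print(np.allclose(np.linalg.inv(G), np.eye(D) - np.outer(theta, theta)))  # True
```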
Corollary 2.1
The volume adjustment for the change of measure in (2.6) is \(d\varvec{\theta}_{\mathscr{S}_c}=|\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})|^{1/2}\,d\varvec{\theta}_{\mathscr{B}}=d\varvec{\theta}_{\mathscr{B}}/|\theta_{D+1}|\).
Proof
The canonical measure can be defined through the Riesz representation theorem by means of a positive linear functional on the space \(C_0(\mathscr{S}^D)\) of compactly supported continuous functions on \(\mathscr{S}^D\) [21, 59]. More precisely, there is a unique positive Borel measure \(\mu_c\) such that for any coordinate chart \((\mathscr{B}_\mathbf{0}^D(1),T_{\mathscr{B}\rightarrow\mathscr{S}_+})\),
where \(\mu _c=d\varvec{\theta }_{\mathscr {S}_c}\), and \(d\varvec{\theta }_{\mathscr {B}}\) is the Euclidean measure. Therefore we have
Alternatively, \(\left| \frac{d\varvec{\theta }_{\mathscr {B}}}{d\varvec{\theta }_{\mathscr {S}_c}}\right| =|\theta _{D+1}|\). \(\square \)
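As a concrete illustration (our example), for \(D=2\) this identity says that integrating \(1/|\theta_3|\) over the unit disk recovers the area of the upper hemisphere: \(\int_{\mathscr{B}^2}|\theta_3|^{-1}\,d\varvec{\theta}=\int_0^{2\pi}\!\int_0^1 \frac{r}{\sqrt{1-r^2}}\,dr\,d\phi=2\pi\). A midpoint-rule check:

```python
import numpy as np

# Midpoint rule for ∫₀^{2π} ∫₀¹ r/√(1−r²) dr dφ; the endpoint singularity is integrable
N = 200_000
r = (np.arange(N) + 0.5) / N
approx = 2 * np.pi * np.sum(r / np.sqrt(1 - r**2)) / N
print(approx, 2 * np.pi)    # ≈ 6.28 each: the area of the upper hemisphere of S²
```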
Geodesic on a Sphere in Cartesian Coordinates
To find the geodesic on a sphere, we need to solve the following equations:
for which we first need to calculate the Christoffel symbols \(\varvec{\varGamma}_{\mathscr{S}_c}(\varvec{\theta})\). Note that the (i, j)-th element of \(\mathbf{G}_{\mathscr{S}_c}\) is \(g_{ij}=\delta_{ij}+\theta_i\theta_j/\theta_{D+1}^2\), and the (i, j, k)-th element of \(d\mathbf{G}_{\mathscr{S}_c}\) is \(g_{ij,k}=(\delta_{ik}\theta_j+\theta_i\delta_{jk})/\theta_{D+1}^2+2\theta_i\theta_j\theta_k/\theta_{D+1}^4\). Therefore
Using these results, we can write Eq. (2.66) as \(\dot{\mathbf{{v}}} = -\mathbf{{v}}^{\mathsf {T}} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}}\varvec{\theta } = -\Vert \tilde{\mathbf{v}}\Vert _2^2\varvec{\theta }\). Further, we have
Therefore, we can rewrite the geodesic equations (2.65), (2.66) with augmented components as
Multiplying both sides of Eq. (2.68) by \({\tilde{\mathbf{v}}}^{\mathsf{T}}\) gives \(\frac{d}{dt}\Vert\tilde{\mathbf{v}}\Vert_2^2=0\), so \(\Vert\tilde{\mathbf{v}}\Vert_2\) is constant along the geodesic and the above system of differential equations has the great-circle solution \(\tilde{\varvec{\theta}}(t)=\tilde{\varvec{\theta}}(0)\cos(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)+\frac{\tilde{\mathbf{v}}(0)}{\Vert\tilde{\mathbf{v}}(0)\Vert_2}\sin(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)\), \(\tilde{\mathbf{v}}(t)=-\tilde{\varvec{\theta}}(0)\Vert\tilde{\mathbf{v}}(0)\Vert_2\sin(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)+\tilde{\mathbf{v}}(0)\cos(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)\).
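The following sketch (ours) implements this great-circle solution and confirms the invariants used above: \(\tilde{\varvec{\theta}}(t)\) stays on the sphere, \(\tilde{\mathbf{v}}(t)\) stays tangent, and \(\Vert\tilde{\mathbf{v}}\Vert_2\) is conserved.

```python
import numpy as np

def geodesic_step(theta_t, v_t, t):
    """Exact great-circle flow on S^D: the solution of the geodesic equations."""
    nv = np.linalg.norm(v_t)
    theta_new = theta_t * np.cos(nv * t) + v_t / nv * np.sin(nv * t)
    v_new = -theta_t * nv * np.sin(nv * t) + v_t * np.cos(nv * t)
    return theta_new, v_new

rng = np.random.default_rng(4)
theta = rng.normal(size=4); theta /= np.linalg.norm(theta)   # θ̃ ∈ S³
v = rng.normal(size=4); v -= theta * (theta @ v)             # ṽ ∈ T_θ̃ S³

theta1, v1 = geodesic_step(theta, v, 0.7)
print(np.isclose(np.linalg.norm(theta1), 1.0),               # on the sphere
      np.isclose(theta1 @ v1, 0.0),                          # still tangent
      np.isclose(np.linalg.norm(v1), np.linalg.norm(v)))     # ‖ṽ‖₂ conserved
```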
Round Metric in Spherical Coordinates
Consider the D-dimensional hyperrectangle \(\mathscr{R}_\mathbf{0}^D:=[0,\pi]^{D-1}\times[0,2\pi)\) and the corresponding spherical coordinate system, \(\{\varvec{\theta},\mathscr{R}_\mathbf{0}^D\}\), for \(\mathscr{S}^D\). The coordinate mapping \(T_{\mathscr{R}_\mathbf{0}\rightarrow\mathscr{S}}: \varvec{\theta}\mapsto\mathbf{x},\; x_d=\cos(\theta_d)\prod_{i=1}^{d-1}\sin(\theta_i),\, d=1,\ldots,D+1\) (with \(\theta_{D+1}=0\)) can be viewed as the embedding map into \(\mathbb{R}^{D+1}\), and the Jacobian matrix of \(T_{\mathscr{R}_\mathbf{0}\rightarrow\mathscr{S}}\) is \(\frac{d\mathbf{x}}{d{\varvec{\theta}}^{\mathsf{T}}}\) with (d, j)-th element \([-\tan(\theta_d)\delta_{dj}+\cot(\theta_j)I(j<d)]x_d\). The metric of \(\mathscr{S}^D\) induced in spherical coordinates is called the round metric, denoted \(\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})\), whose (i, j)-th element is as follows
Therefore, \(\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})={{\mathrm{diag}}}[1,\sin^2(\theta_1),\ldots,\prod_{d=1}^{D-1}\sin^2(\theta_d)]\). Another way to obtain \(\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})\) is through the coordinate change, writing the first fundamental form as \(ds^2=d\theta_1^2+\sin^2(\theta_1)\,d\theta_2^2+\cdots+\prod_{d=1}^{D-1}\sin^2(\theta_d)\,d\theta_D^2\).
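A finite-difference check (our sketch) that the Jacobian of this embedding indeed induces the diagonal round metric:

```python
import numpy as np

def embed(theta):
    """Spherical coordinates → R^{D+1}: x_d = cos(θ_d) ∏_{i<d} sin(θ_i), θ_{D+1} = 0."""
    th = np.append(theta, 0.0)
    sines = np.concatenate([[1.0], np.cumprod(np.sin(th[:-1]))])
    return np.cos(th) * sines

def jacobian(theta, h=1e-6):
    """Central-difference Jacobian of the embedding (numerical, for checking only)."""
    cols = [(embed(theta + h * e) - embed(theta - h * e)) / (2 * h)
            for e in np.eye(len(theta))]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
D = 4
theta = rng.uniform(0.2, np.pi - 0.2, size=D)

J = jacobian(theta)
G_round = np.diag(np.concatenate([[1.0], np.cumprod(np.sin(theta[:-1])**2)]))
print(np.allclose(J.T @ J, G_round, atol=1e-5))   # True: JᵀJ is the round metric
```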
Similar to Corollary 2.1, we have
Proposition 2.7.3
The volume adjustment for the change of measure in (2.9) is \(d\varvec{\theta}_{\mathscr{S}_r}=|\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})|^{1/2}\,d\varvec{\theta}_{\mathscr{R}}=\prod_{d=1}^{D-1}\sin^{D-d}(\theta_d)\,d\varvec{\theta}_{\mathscr{R}}\).
Jacobian of the Transformation Between q-Norm Domains
The following proposition gives the weights needed for the transformation from \(\mathscr {Q}^D\) to \(\mathscr {B}_\mathbf{0}^D(1)\).
Proposition 2.7.4
The Jacobian determinant (weight) of \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is as follows:
Proof
Note
The Jacobian matrix for \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is
Therefore the Jacobian determinant of \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is
\(\square \)
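The following sketch illustrates the transformation numerically. The coordinate-wise form \(\beta_d={{\mathrm{sign}}}(\theta_d)|\theta_d|^{2/q}\) used here is our reading of \(T_{\mathscr{B}\rightarrow\mathscr{Q}}\), so treat it as an assumption to be checked against the chapter's definition; since the map acts coordinate-wise, its Jacobian determinant factorizes accordingly.

```python
import numpy as np

q = 1.5
def T_B_to_Q(theta):
    """Assumed coordinate-wise map: unit 2-norm ball → unit q-norm domain."""
    return np.sign(theta) * np.abs(theta) ** (2 / q)

rng = np.random.default_rng(6)
D = 3
theta = rng.uniform(0.1, 0.5, size=D) * rng.choice([-1, 1], size=D)

# Diagonal map ⇒ Jacobian determinant ∏_d (2/q) |θ_d|^{2/q − 1}
jac_analytic = np.prod(2 / q * np.abs(theta) ** (2 / q - 1))

h = 1e-6
J = np.column_stack([(T_B_to_Q(theta + h * e) - T_B_to_Q(theta - h * e)) / (2 * h)
                     for e in np.eye(D)])
print(np.isclose(np.linalg.det(J), jac_analytic))   # True

beta = T_B_to_Q(theta)
print(np.sum(np.abs(beta) ** q) <= 1)               # β stays in the q-norm domain
```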
The following proposition gives the weights needed for the change of domains from \(\mathscr {R}^D\) to \(\mathscr {B}_\mathbf{0}^D(1)\).
Proposition 2.7.5
The Jacobian determinant (weight) of \(T_{\mathscr {B}\rightarrow \mathscr {R}}\) is as follows:
Proof
First, we note
The corresponding Jacobian matrices are
where \(\mathbf{e}_{\arg\max|\varvec{\theta}|}\) is the vector whose \((\arg\max|\varvec{\theta}|)\)-th element is 1 and all others are 0. Therefore,
\(\square \)
Splitting Hamiltonian (Lagrangian) Dynamics on \(\mathscr{S}^D\)
Splitting Hamiltonian dynamics to improve HMC is a well-studied topic [13, 36, 56]. Splitting the Lagrangian dynamics used in our approach, on the other hand, has not, to the best of our knowledge, been discussed in the literature. Therefore, we prove the validity of our splitting method by starting with the well-understood method of splitting the Hamiltonian [13],
The corresponding systems of differential equations,
can be written in terms of Lagrangian dynamics in \((\varvec{\theta }, \mathbf{{v}})\) as follows:
We have solved the second dynamics (on the right) in Appendix “Geodesic on a Sphere in Cartesian Coordinates”. To solve the first dynamics, we note that
Therefore, we have
where \(\begin{bmatrix} \mathbf{I}\\ -\frac{{\varvec{\theta }(0)}^{\mathsf {T}}}{\theta _{D+1}(0)}\end{bmatrix} [\mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}]=\begin{bmatrix} \mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}\\ -\theta _{D+1}(0) {\varvec{\theta }(0)}^{\mathsf {T}}\end{bmatrix} = \begin{bmatrix} \mathbf{I}\\ \mathbf{0}^{\mathsf {T}}\end{bmatrix} - \tilde{\varvec{\theta }}(0) {\varvec{\theta }(0)}^{\mathsf {T}}\).
Finally, we note that \(\Vert \tilde{\varvec{\theta }}(t)\Vert _2=1\) if \(\Vert \tilde{\varvec{\theta }}(0)\Vert _2=1\) and \(\tilde{\mathbf{{v}}}(t)\in T_{\tilde{\varvec{\theta }}(t)} \mathscr {S}_c^D\) if \(\tilde{\mathbf{{v}}}(0)\in T_{\tilde{\varvec{\theta }}(0)} \mathscr {S}_c^D\).
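Putting the two split dynamics together gives a leapfrog-style integrator on the sphere. The sketch below is our reconstruction with a toy potential \(U(\varvec{\theta})=\frac{1}{2}\Vert\varvec{\theta}\Vert_2^2\) (not the chapter's code) and checks the two closing invariants:

```python
import numpy as np

def grad_U_tilde(theta_t):
    # Toy potential U(θ) = ½‖θ‖²; gradient padded to (∇_θ U, 0) ∈ R^{D+1}
    return np.append(theta_t[:-1], 0.0)

def split_step(theta_t, v_t, eps):
    """Half velocity update (first dynamics) + exact geodesic flow (second)."""
    project = lambda th, g: g - th * (th @ g)        # onto the tangent space T_θ̃ S^D
    v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
    nv = np.linalg.norm(v_t)
    theta_t, v_t = (theta_t * np.cos(nv * eps) + v_t / nv * np.sin(nv * eps),
                    -theta_t * nv * np.sin(nv * eps) + v_t * np.cos(nv * eps))
    v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
    return theta_t, v_t

rng = np.random.default_rng(7)
theta = rng.normal(size=5); theta /= np.linalg.norm(theta)
v = rng.normal(size=5); v -= theta * (theta @ v)

for _ in range(100):
    theta, v = split_step(theta, v, 0.05)
print(np.isclose(np.linalg.norm(theta), 1.0),        # ‖θ̃(t)‖₂ = 1 preserved
      np.isclose(theta @ v, 0.0))                    # ṽ(t) stays in T_θ̃ S^D
```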
Error Analysis of Spherical HMC
Following [36], we now show that the discretization error \(e_n=\Vert\mathbf{z}(t_n)-\mathbf{z}^{(n)}\Vert=\Vert({\varvec{\theta}}(t_n),\mathbf{v}(t_n))-({\varvec{\theta}}^{(n)},\mathbf{v}^{(n)})\Vert\) (i.e., the difference between the true solution and the numerical solution) is \(\mathscr{O}(\varepsilon^3)\) locally and \(\mathscr{O}(\varepsilon^2)\) globally, where \(\varepsilon\) is the discretization step size. Here, we assume that \(\mathbf{f}({\varvec{\theta}},\mathbf{v}):=\mathbf{v}^{\mathsf{T}}\varvec{\varGamma}({\varvec{\theta}})\mathbf{v}+\mathbf{G}({\varvec{\theta}})^{-1}\nabla_{\varvec{\theta}}U({\varvec{\theta}})\) is smooth; hence, \(\mathbf{f}\) and its derivatives are uniformly bounded as \(\mathbf{z}=({\varvec{\theta}},\mathbf{v})\) evolves within a finite time duration T. We expand the true solution \(\mathbf{z}(t_{n+1})\) around \(t_n\):
We first consider Spherical HMC in Cartesian coordinates, where \(\mathbf{{f}}({\varvec{\theta }},\mathbf{{v}})=\Vert \tilde{\mathbf{{v}}}\Vert ^2\varvec{\theta } + [\mathbf{I}-\varvec{\theta }{\varvec{\theta }}^{\mathsf {T}}] \nabla _{\varvec{\theta }} U(\varvec{\theta })\). From Eq. (2.34) we have
Now we expand Eq. (2.35) using Taylor series as follows:
Substituting (2.77) in the above equations yields
With the above results, we have
where for the last equality we need to show \({(\mathbf{{v}}^{(n)})}^{\mathsf {T}} \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)})= -2\frac{d}{dt}\Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2\). This can be proved as follows:
Therefore we have
The local error is
where \(M_k = c_k\sup _{t\in [0,T]}\Vert \nabla ^k \mathbf{f}({\varvec{\theta }}(t),\mathbf{{v}}(t))\Vert ,\; k=1,2\) for some constants \(c_k>0\). Accumulating the local errors by iterating the above inequality for \(L=T/\varepsilon \) steps provides the following global error:
For Spherical HMC in spherical coordinates, we conjecture that the integrator of Algorithm 2 still has third-order local error and second-order global error; one can follow the same argument as above to verify this.
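The claimed orders can also be probed numerically. The sketch below (ours, reusing the toy integrator from the previous section) compares trajectories against a fine-step reference; consecutive error ratios near 4 under step-halving indicate \(\mathscr{O}(\varepsilon^2)\) global error.

```python
import numpy as np

def grad_U_tilde(th):                      # toy potential U(θ) = ½‖θ‖²
    return np.append(th[:-1], 0.0)

def integrate(theta_t, v_t, eps, L):
    project = lambda th, g: g - th * (th @ g)
    for _ in range(L):
        v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
        nv = np.linalg.norm(v_t)
        theta_t, v_t = (theta_t * np.cos(nv * eps) + v_t / nv * np.sin(nv * eps),
                        -theta_t * nv * np.sin(nv * eps) + v_t * np.cos(nv * eps))
        v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
    return theta_t

rng = np.random.default_rng(8)
theta = rng.normal(size=4); theta /= np.linalg.norm(theta)
v = rng.normal(size=4); v -= theta * (theta @ v)

T = 1.0
ref = integrate(theta, v, T / 2**14, 2**14)          # fine-step reference solution
errs = [np.linalg.norm(integrate(theta, v, T / 2**k, 2**k) - ref) for k in (5, 6, 7)]
print([e1 / e2 for e1, e2 in zip(errs, errs[1:])])   # ratios ≈ 4 ⇒ O(ε²) global
```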
Bounce in Diamond: Wall HMC for 1-Norm Constraint
Reference [46] discusses the Wall HMC method for the \(\infty\)-norm constraint only; however, a similar approach can be derived for the 1-norm constraint. As shown in the left panel of Fig. 2.13, given the current state \(\varvec{\theta}_0\), HMC makes a proposal \(\varvec{\theta}\). The sampler may hit the boundary while moving from \(\varvec{\theta}_0\) towards \(\varvec{\theta}\). To determine the hit point 'X', we solve for \(t\in(0,1)\) such that
One can find the hitting time using the bisection method. A more efficient method, however, is to find the orthant in which the sampler hits the boundary, i.e., to find the normal direction \(\mathbf{n}\) with elements \(\pm 1\). Then we can find t,
Therefore the hit point is \(\varvec{\theta }_0'=\varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)t^*\) and consequently the reflection point is
where \(\mathbf{n}^*:=\mathbf{n}/\Vert \mathbf{n}\Vert _2\) and \(\mathbf{n}^{\mathsf {T}}\varvec{\theta }_0'=1\) because \(\varvec{\theta }_0'\) is on the boundary with the normal direction \(\mathbf{n}^*\).
In general, it is difficult to directly determine the intersection of the segment \(\varvec{\theta}-\varvec{\theta}_0\) with the boundary. Instead, we can find its intersections with the coordinate planes \(\{\varvec{\pi}_d\}_{d=1}^D\), where \(\varvec{\pi}_d:=\{\varvec{\theta}\in\mathbb{R}^D \mid \theta^d=0\}\). The intersection times are \(\mathbf{T}=\{\theta_0^d/(\theta_0^d-\theta^d)\mid\theta_0^d\ne\theta^d\}\). We keep those between 0 and 1 and sort them in ascending order (Fig. 2.13, right panel). Then, we find the intersection points \(\{\varvec{\theta}_k:=\varvec{\theta}_0+(\varvec{\theta}-\varvec{\theta}_0)T_k\}\) that violate the constraint \(\Vert\varvec{\theta}\Vert_1\le 1\). Denote the first intersection point outside the constrained domain as \(\varvec{\theta}_k\). The signs of \(\varvec{\theta}_k\) and \(\varvec{\theta}_{k-1}\) determine the orthant of the hit point \(\varvec{\theta}_0'\).
Note that for each \(d\in\{1,\ldots,D\}\), \(({{\mathrm{sign}}}(\varvec{\theta}_k^d),{{\mathrm{sign}}}(\varvec{\theta}_{k-1}^d))\) cannot be \((+,-)\) or \((-,+)\); otherwise there would exist an intersection point \(\varvec{\theta}^*:=\varvec{\theta}_0+(\varvec{\theta}-\varvec{\theta}_0)T^*\) with some coordinate plane \(\varvec{\pi}_{d^*}\) between \(\varvec{\theta}_{k-1}\) and \(\varvec{\theta}_k\), and \(T_{k-1}<T^*<T_k\) would contradict the ordering of \(\mathbf{T}\) (see footnote 4). Therefore any point (including \(\varvec{\theta}_0'\)) between \(\varvec{\theta}_{k-1}\) and \(\varvec{\theta}_k\) must have the same sign as \({{\mathrm{sign}}}({{\mathrm{sign}}}(\varvec{\theta}_k)+{{\mathrm{sign}}}(\varvec{\theta}_{k-1}))\); that is
After moving from \(\varvec{\theta}\) to \(\varvec{\theta}'\), we check whether \(\varvec{\theta}'\) satisfies the constraint. If it does not, we repeat the above procedure with \(\varvec{\theta}_0\leftarrow\varvec{\theta}_0'\) and \(\varvec{\theta}\leftarrow\varvec{\theta}'\) until the final state falls inside the constrained domain. Finally, we adjust the velocity direction by
Algorithm 4 summarizes the above steps.
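The sketch below is our reading of this reflection procedure (helper names are ours; the final velocity adjustment is omitted). It keeps bouncing the segment \(\varvec{\theta}_0\rightarrow\varvec{\theta}\) off the faces of the diamond \(\Vert\varvec{\theta}\Vert_1\le 1\) until the endpoint lands inside.

```python
import numpy as np

def bounce_in_diamond(theta0, theta):
    """Reflect the segment θ0 → θ off the faces of {‖θ‖₁ ≤ 1} until it ends inside."""
    while np.sum(np.abs(theta)) > 1:
        d = theta - theta0
        # Crossing times with the coordinate planes {θ^d = 0}, kept if in (0, 1)
        with np.errstate(divide='ignore', invalid='ignore'):
            T = theta0 / (theta0 - theta)
        ts = np.concatenate([np.sort(T[(T > 0) & (T < 1)]), [1.0]])
        pts = theta0[None, :] + np.outer(ts, d)
        # First point along the segment that violates the 1-norm constraint
        k = np.argmax(np.sum(np.abs(pts), axis=1) > 1)
        prev = theta0 if k == 0 else theta0 + d * ts[k - 1]
        # Orthant (face normal, entries ±1) from the signs on both sides of the hit
        n = np.sign(np.sign(pts[k]) + np.sign(prev))
        t_star = (1 - n @ theta0) / (n @ d)           # the face satisfies nᵀθ = 1
        hit = theta0 + d * t_star
        n_unit = n / np.linalg.norm(n)
        # Reflect the remaining part of the segment across the face
        theta0, theta = hit, theta - 2 * n_unit * (n_unit @ (theta - hit))
    return theta0, theta

rng = np.random.default_rng(9)
theta0 = rng.uniform(-0.3, 0.3, size=2)               # current state, inside
proposal = theta0 + rng.normal(scale=0.8, size=2)     # possibly outside
_, final = bounce_in_diamond(theta0, proposal)
print(np.sum(np.abs(final)) <= 1 + 1e-12)             # final state is inside
```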
© 2016 Springer International Publishing Switzerland

Cite this chapter
Lan, S., Shahbaba, B. (2016). Sampling Constrained Probability Distributions Using Spherical Augmentation. In: Minh, H., Murino, V. (eds) Algorithmic Advances in Riemannian Geometry and Applications. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-45026-1_2

DOI: https://doi.org/10.1007/978-3-319-45026-1_2
Print ISBN: 978-3-319-45025-4
Online ISBN: 978-3-319-45026-1