
Sampling Constrained Probability Distributions Using Spherical Augmentation


Part of the book series: Advances in Computer Vision and Pattern Recognition ((ACVPR))

Abstract

Statistical models with constrained probability distributions are abundant in machine learning. Some examples include regression models with norm constraints (e.g., Lasso), probit, many copula models, and latent Dirichlet allocation (LDA). Bayesian inference involving probability distributions confined to constrained domains could be quite challenging for commonly used sampling algorithms. In this work, we propose a novel augmentation technique that handles a wide range of constraints by mapping the constrained domain to a sphere in the augmented space. By moving freely on the surface of this sphere, sampling algorithms handle constraints implicitly and generate proposals that remain within boundaries when mapped back to the original space. Our proposed method, called Spherical Augmentation, provides a mathematically natural and computationally efficient framework for sampling from constrained probability distributions. We show the advantages of our method over state-of-the-art sampling algorithms, such as exact Hamiltonian Monte Carlo, using several examples including truncated Gaussian distributions, Bayesian Lasso, Bayesian bridge regression, reconstruction of quantized stationary Gaussian process, and LDA for topic modeling.


Notes

  1. Note, \(\mathbf{v}^{\mathsf {T}}{} \mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })\mathbf{{v}}\le \Vert \mathbf{{v}}\Vert _2^2\le \Vert \tilde{\mathbf{{v}}}\Vert _2^2 = \mathbf{{v}}^{\mathsf {T}}{} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}}\).

  2. The Metropolis test is performed for this toy example as in [49], but is omitted in the stochastic mini-batch algorithms in Sect. 2.5.5.

  3. The stochastic gradient for sg-wallLMC is \([(n^*_{kw} + \beta -1/2) + \pi _{kw}(n^*_{k\cdot }+W(\beta -1/2))]/n^*_{k\cdot }\) calculated with (2.45).

  4. The same argument applies when \(T_k=1\), i.e. \(\varvec{\theta }\) is the first point outside the domain among \(\{\varvec{\theta }_k\}\).

References

  1. Y. Ahmadian, J.W. Pillow, L. Paninski, Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Comput. 23(1), 46–96 (2011)
  2. S. Ahn, Y. Chen, M. Welling, Distributed and adaptive darting Monte Carlo through regenerations, in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) (2013)
  3. S. Ahn, B. Shahbaba, M. Welling, Distributed stochastic gradient MCMC, in International Conference on Machine Learning (2014)
  4. S. Amari, H. Nagaoka, Methods of Information Geometry, Translations of Mathematical Monographs, vol. 191 (Oxford University Press, Oxford, 2000)
  5. C. Andrieu, E. Moulines, On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab. 16(3), 1462–1505 (2006)
  6. C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 72(3), 269–342 (2010)
  7. M.J. Beal, Variational algorithms for approximate Bayesian inference. Ph.D. thesis, University College London, London (2003)
  8. A. Beskos, G. Roberts, A. Stuart, J. Voss, MCMC methods for diffusion bridges. Stoch. Dyn. 8(03), 319–350 (2008)
  9. A. Beskos, F.J. Pinski, J.M. Sanz-Serna, A.M. Stuart, Hybrid Monte Carlo on Hilbert spaces. Stoch. Process. Appl. 121(10), 2201–2230 (2011)
  10. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
  11. A.E. Brockwell, Parallel Markov chain Monte Carlo simulation by pre-fetching. J. Comput. Gr. Stat. 15, 246–261 (2006)
  12. M.A. Brubaker, M. Salzmann, R. Urtasun, A family of MCMC methods on implicitly defined manifolds, in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), vol. 22, ed. by N.D. Lawrence, M.A. Girolami (2012), pp. 161–172
  13. S. Byrne, M. Girolami, Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 40(4), 825–845 (2013)
  14. B. Calderhead, M. Sustik, Sparse approximate manifolds for differential geometric MCMC, in Advances in Neural Information Processing Systems, vol. 25, ed. by P. Bartlett, F. Pereira, C. Burges, L. Bottou, K. Weinberger (2012), pp. 2888–2896
  15. O. Cappé, R. Douc, A. Guillin, J.M. Marin, C.P. Robert, Adaptive importance sampling in general mixture classes. Stat. Comput. 18(4), 447–459 (2008)
  16. N. Chopin, A sequential particle filter method for static models. Biometrika 89(3), 539–552 (2002)
  17. S.L. Cotter, G.O. Roberts, A. Stuart, D. White et al., MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28(3), 424–446 (2013)
  18. R.V. Craiu, J. Rosenthal, C. Yang, Learn from thy neighbor: parallel-chain and regional adaptive MCMC. J. Am. Stat. Assoc. 104(488), 1454–1466 (2009)
  19. P. Del Moral, A. Doucet, A. Jasra, Sequential Monte Carlo samplers. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 68(3), 411–436 (2006)
  20. N. de Freitas, P. Højen-Sørensen, M. Jordan, R. Stuart, Variational MCMC, in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01 (Morgan Kaufmann Publishers Inc., San Francisco, 2001), pp. 120–127
  21. M.P. do Carmo, Riemannian Geometry, 1st edn. (Birkhäuser, Boston, 1992)
  22. R. Douc, C.P. Robert, A vanilla Rao-Blackwellization of Metropolis-Hastings algorithms. Ann. Stat. 39(1), 261–277 (2011)
  23. S. Duane, A.D. Kennedy, B.J. Pendleton, D. Roweth, Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)
  24. I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics 35(2), 109–135 (1993)
  25. A. Gelfand, L. van der Maaten, Y. Chen, M. Welling, On herding and the cycling perceptron theorem. Adv. Neural Inf. Process. Syst. 23, 694–702 (2010)
  26. C.J. Geyer, Practical Markov chain Monte Carlo. Stat. Sci. 7(4), 473–483 (1992)
  27. W.R. Gilks, G.O. Roberts, S.K. Sahu, Adaptive Markov chain Monte Carlo through regeneration. J. Am. Stat. Assoc. 93(443), 1045–1054 (1998)
  28. M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B (with discussion) 73(2), 123–214 (2011)
  29. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins University Press, Baltimore, 1996)
  30. C. Hans, Bayesian lasso regression. Biometrika 96(4), 835–845 (2009)
  31. M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2010), pp. 856–864
  32. M.D. Hoffman, A. Gelman, The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)
  33. K. Kurihara, M. Welling, N. Vlassis, Accelerated variational Dirichlet process mixtures, in Advances in Neural Information Processing Systems, vol. 19 (2006)
  34. S. Lan, B. Shahbaba, Spherical Hamiltonian Monte Carlo for constrained target distributions, in The 31st International Conference on Machine Learning, Beijing (2014), pp. 629–637
  35. S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami, Markov chain Monte Carlo from Lagrangian dynamics. J. Comput. Gr. Stat. 24(2), 357–378 (2015)
  36. B. Leimkuhler, S. Reich, Simulating Hamiltonian Dynamics (Cambridge University Press, Cambridge, 2004)
  37. J. Møller, A. Pettitt, K. Berthelsen, R. Reeves, An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. Biometrika 93, 451–458 (2006)
  38. I. Murray, R.P. Adams, D.J. MacKay, Elliptical slice sampling. JMLR: W&CP 9, 541–548 (2010)
  39. P. Mykland, L. Tierney, B. Yu, Regeneration in Markov chain samplers. J. Am. Stat. Assoc. 90(429), 233–241 (1995)
  40. P. Neal, G.O. Roberts, Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodol. Comput. Appl. Probab. 10(2), 277–297 (2008)
  41. P. Neal, G.O. Roberts, W.K. Yuen, Optimal scaling of random walk Metropolis algorithms with discontinuous target densities. Ann. Appl. Probab. 22(5), 1880–1927 (2012)
  42. R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
  43. R.M. Neal, Bayesian Learning for Neural Networks (Springer, Secaucus, 1996)
  44. R.M. Neal, Slice sampling. Ann. Stat. 31(3), 705–767 (2003)
  45. R.M. Neal, The short-cut Metropolis method. Technical Report 0506, Department of Statistics, University of Toronto (2005)
  46. R.M. Neal, MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo, ed. by S. Brooks, A. Gelman, G. Jones, X.L. Meng (Chapman and Hall/CRC, 2011), pp. 113–162
  47. A. Pakman, L. Paninski, Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Gr. Stat. 23(2), 518–542 (2014)
  48. T. Park, G. Casella, The Bayesian lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008)
  49. S. Patterson, Y.W. Teh, Stochastic gradient Riemannian Langevin dynamics on the probability simplex, in Advances in Neural Information Processing Systems (2013), pp. 3102–3110
  50. N.S. Pillai, A.M. Stuart, A.H. Thiéry et al., Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Probab. 22(6), 2320–2356 (2012)
  51. J.G. Propp, D.B. Wilson, Exact sampling with coupled Markov chains and applications to statistical mechanics (1996), pp. 223–252
  52. R. Douc, A. Guillin, J.M. Marin, C.P. Robert, Minimum variance importance sampling via population Monte Carlo. ESAIM: Probab. Stat. 11, 427–447 (2007)
  53. G.O. Roberts, S.K. Sahu, Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. J. R. Stat. Soc. Ser. B 59, 291–317 (1997)
  54. G.O. Roberts, A. Gelman, W.R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
  55. G.O. Roberts, J.S. Rosenthal et al., Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001)
  56. B. Shahbaba, S. Lan, W.O. Johnson, R.M. Neal, Split Hamiltonian Monte Carlo. Stat. Comput. 24(3), 339–349 (2014)
  57. C. Sherlock, G.O. Roberts, Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 15(3), 774–798 (2009)
  58. M. Spivak, Calculus on Manifolds, vol. 1 (W.A. Benjamin, New York, 1965)
  59. M. Spivak, A Comprehensive Introduction to Differential Geometry, vol. 1, 2nd edn. (Publish or Perish, Inc., Houston, 1979)
  60. Y.W. Teh, D. Newman, M. Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2006), pp. 1353–1360
  61. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
  62. H.M. Wallach, I. Murray, R. Salakhutdinov, D. Mimno, Evaluation methods for topic models, in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009), pp. 1105–1112
  63. G.R. Warnes, The normal kernel coupler: an adaptive Markov chain Monte Carlo method for efficiently sampling from multi-modal distributions. Technical Report No. 395, University of Washington (2001)
  64. M. Welling, Herding dynamical weights to learn, in Proceedings of the International Conference on Machine Learning (2009)
  65. M. Welling, Y.W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in Proceedings of the 28th International Conference on Machine Learning (ICML) (2011), pp. 681–688
  66. M. West, On scale mixtures of normal distributions. Biometrika 74(3), 646–648 (1987)
  67. Y. Zhang, C. Sutton, Quasi-Newton methods for Markov chain Monte Carlo, in Advances in Neural Information Processing Systems, vol. 24, ed. by J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, K.Q. Weinberger (2011), pp. 2393–2401


Acknowledgments

SL is supported by EPSRC Programme Grant, Enabling Quantification of Uncertainty in Inverse Problems (EQUIP), EP/K034154/1. BS is supported by NSF grant IIS-1216045 and NIH grant R01-AI107034.

Author information

Correspondence to Shiwei Lan.

Appendix

Spherical Geometry

We first discuss the geometry of the D-dimensional sphere \(\mathscr {S}^D:=\{\tilde{\varvec{\theta }} \in \mathbb R^{D+1}: \Vert \tilde{\varvec{\theta }}\Vert _2= 1\}\hookrightarrow \mathbb R^{D+1}\) under two coordinate systems, namely the Cartesian and the spherical coordinates. Since \(\mathscr {S}^D\) can be embedded in (injectively and differentiably mapped into) \(\mathbb R^{D+1}\), we first introduce the concept of an 'induced metric'.

Definition 2.1

(induced metric) If \(\mathscr {D}^{d}\) can be embedded into \(\mathscr {M}^{m}\) \((m>d)\) by \(f: U\subset \mathscr {D}\hookrightarrow \mathscr {M}\), then one can define the induced metric, \(g_{\mathscr {D}}\), on \(T\mathscr {D}\) through the metric \(g_{\mathscr {M}}\) defined on \(T\mathscr {M}\):

$$\begin{aligned} g_{\mathscr {D}}(\varvec{\theta })(\mathbf{u},\mathbf{{v}})=g_{\mathscr {M}}(f(\varvec{\theta }))(df_{\varvec{\theta }}(\mathbf{u}),df_{\varvec{\theta }}(\mathbf{{v}})),\quad \mathbf{u},\mathbf{{v}}\in T_{\varvec{\theta }}\mathscr {D} \end{aligned}$$
(2.58)

Remark 2.2

For any \(f: U\subset \mathscr {S}^D\hookrightarrow \mathbb R^{D+1}\), we can define the induced metric through dot product on \(\mathbb R^{D+1}\). More specifically,

$$\begin{aligned} g_{\mathscr {S}}(\mathbf{u},\mathbf{{v}}) = {[(Df)\mathbf{u}]}^{\mathsf {T}} (Df)\mathbf{{v}} = \mathbf{u}^{\mathsf {T}} [{(Df)}^{\mathsf {T}} (Df)] \mathbf{{v}} \end{aligned}$$
(2.59)

where \((Df)_{(D+1)\times D}\) is the Jacobian matrix of the mapping f. A metric induced from the dot product on a Euclidean space is called a “canonical metric”. This observation leads to the following simple fact, which lays down the foundation of Spherical HMC.

Proposition 2.7.1

(Energy invariance) The kinetic energy \(\frac{1}{2}\langle \mathbf{{v}},\mathbf{{v}}\rangle _{\mathbf{G}(\varvec{\theta })}\) is invariant to the choice of coordinate system.

Proof

For any \(\mathbf{{v}}\in T_{\varvec{\theta }}\mathscr {D}\), take a curve \(\varvec{\theta }(t)\) such that \(\varvec{\theta }(0)=\varvec{\theta }\) and \(\dot{\varvec{\theta }}(0)=\mathbf{{v}}\). Denote the pushforward of \(\mathbf{{v}}\) by the embedding map \(f:\mathscr {D}\rightarrow \mathscr {M}\) as \(\tilde{\mathbf{{v}}}:=f_*(\mathbf{v})=\frac{d}{dt}(f\circ \varvec{\theta })(0)\). Then we have

$$\begin{aligned} \frac{1}{2}\langle \mathbf{{v}},\mathbf{{v}}\rangle _{\mathbf{G}(\varvec{\theta })} = \frac{1}{2} g_{\mathscr {M}}(f(\varvec{\theta }))(\tilde{\mathbf{{v}}},\tilde{\mathbf{{v}}}) \end{aligned}$$
(2.60)

That is, regardless of the form of the energy under a coordinate system, its value is the same as the one in the embedded manifold. In particular, when \(\mathscr {M}=\mathbb R^{D+1}\), the right hand side simplifies to \(\frac{1}{2} \Vert \tilde{\mathbf{{v}}}\Vert _2^2\). \(\square \)

Canonical Metric in Cartesian Coordinates

Now consider the D-dimensional ball \(\mathscr {B}_\mathbf{0}^D(1):=\{\varvec{\theta }\in \mathbb R^D: \Vert \varvec{\theta }\Vert _2\le 1\}\). Here, \(\{\varvec{\theta }, \mathscr {B}_\mathbf{0}^D(1)\}\) can be viewed as the Cartesian coordinate system for \({\mathscr {S}}^D\). The coordinate mapping \(T_{\mathscr {B}\rightarrow \mathscr {S}_+}:\varvec{\theta }\mapsto \tilde{\varvec{\theta }}=(\varvec{\theta },\theta _{D+1})\) in (2.5) can be viewed as the embedding map into \(\mathbb R^{D+1}\), and the Jacobian matrix of \(T_{\mathscr {B}\rightarrow \mathscr {S}_+}\) is \(dT_{\mathscr {B}\rightarrow \mathscr {S}_+}=\frac{d\tilde{\varvec{\theta }}}{d{\varvec{\theta }}^{\mathsf {T}}}=\begin{bmatrix}\mathbf{I}_{D}\\ -{\varvec{\theta }}^{\mathsf {T}}/\theta _{D+1}\end{bmatrix}\). Therefore the canonical metric of \({\mathscr {S}}^D\) in Cartesian coordinates, \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\), is

$$\begin{aligned} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta }) = {dT}^{\mathsf {T}}_{\mathscr {B}\rightarrow \mathscr {S}_+} dT_{\mathscr {B}\rightarrow \mathscr {S}_+} = \mathbf{I}_D + \frac{\varvec{\theta } {\varvec{\theta }}^{\mathsf {T}}}{\theta _{D+1}^2} = \mathbf{I}_D + \frac{\varvec{\theta } {\varvec{\theta }}^{\mathsf {T}}}{1-\Vert \varvec{\theta }\Vert _2^2} \end{aligned}$$
(2.61)

Another way to obtain the metric is through the first fundamental form \(ds^{2}\) (i.e., squared infinitesimal length of a curve) for \(\mathscr {S}^D\), which can be expressed in terms of the differential form \(d\varvec{\theta }\) and the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\),

$$\begin{aligned} ds^{2} = \langle d\varvec{\theta }, d\varvec{\theta } \rangle _{\mathbf{G}_{\mathscr {S}_c}} = d{\varvec{\theta }}^{\mathsf {T}}{} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })d\varvec{\theta } \end{aligned}$$

On the other hand, \(ds^{2}\) can also be obtained as follows [59]:

$$\begin{aligned} ds^2 = \sum _{i=1}^D d\theta _i^2 + (d(\theta _{D+1}(\varvec{\theta })))^2 = d{\varvec{\theta }}^{\mathsf {T}} d\varvec{\theta } + \frac{({\varvec{\theta }}^{\mathsf {T}} d\varvec{\theta })^2}{1-\Vert \varvec{\theta }\Vert _2^2} = d{\varvec{\theta }}^{\mathsf {T}} [\mathbf{I} + \varvec{\theta } {\varvec{\theta }}^{\mathsf {T}}/\theta _{D+1}^2] d\varvec{\theta } \end{aligned}$$

Equating the above two quantities yields the form of the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) as in Eq. (2.61). This viewpoint provides a natural way to explain the length of a tangent vector. For any vector \(\tilde{\mathbf{{v}}}=(\mathbf{{v}},v_{D+1})\in T_{\tilde{\varvec{\theta }}}{\mathscr {S}}^D=\{\tilde{\mathbf{{v}}}\in \mathbb R^{D+1}: {\tilde{\varvec{\theta }}}^{\mathsf {T}}\tilde{\mathbf{{v}}}=0\}\), one could think of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) as a means of expressing the length of \(\tilde{\mathbf{v}}\) in terms of \(\mathbf{{v}}\),

$$\begin{aligned} \mathbf{{v}}^{\mathsf {T}}{} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}} = \Vert \mathbf{{v}}\Vert _2^2 + \frac{\mathbf{{v}}^{\mathsf {T}}\varvec{\theta } {\varvec{\theta }}^{\mathsf {T}} \mathbf{{v}}}{\theta _{D+1}^2} = \Vert \mathbf{{v}}\Vert _2^2 + \frac{ (-\theta _{D+1} v_{D+1})^2}{\theta _{D+1}^2} = \Vert \mathbf{{v}}\Vert _2^2 + v_{D+1}^2 = \Vert \tilde{\mathbf{{v}}}\Vert _2^2 \end{aligned}$$
(2.62)

This indeed verifies the energy invariance stated in Proposition 2.7.1.
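
To make (2.62) concrete, here is a small numerical check (our illustration, not part of the original chapter; it assumes NumPy): it builds the canonical metric at a point of the unit ball and confirms that the induced squared norm of a tangent vector equals the Euclidean squared norm of its augmented counterpart.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5

# A point strictly inside the unit ball and its augmented coordinate on the sphere.
theta = rng.normal(size=D)
theta *= 0.8 / np.linalg.norm(theta)
theta_Dp1 = np.sqrt(1.0 - theta @ theta)

# Canonical metric G_{S_c}(theta) = I + theta theta^T / theta_{D+1}^2, Eq. (2.61).
G_c = np.eye(D) + np.outer(theta, theta) / theta_Dp1**2

# A tangent vector v and its augmented version with v_{D+1} = -theta^T v / theta_{D+1}.
v = rng.normal(size=D)
v_tilde = np.append(v, -theta @ v / theta_Dp1)

# Eq. (2.62): v^T G_{S_c} v = ||v_tilde||_2^2, i.e. the kinetic energy is coordinate-free.
print(v @ G_c @ v, v_tilde @ v_tilde)   # the two numbers agree up to round-off
```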

The following proposition provides the analytic forms of the determinant and the inverse of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\).

Proposition 2.7.2

The determinant and the inverse of the canonical metric are as follows:

$$\begin{aligned} |\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })| = \theta _{D+1}^{-2}, \qquad \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })^{-1} = \mathbf{I}_D - \varvec{\theta } {\varvec{\theta }}^{\mathsf {T}} \end{aligned}$$
(2.63)

Proof

The determinant of the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) is given by the matrix determinant lemma

$$\begin{aligned} |\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })| = \det \left[ \mathbf{I}_D + \frac{\varvec{\theta } {\varvec{\theta }}^{\mathsf {T}}}{\theta _{D+1}^2}\right] = 1+ \frac{{\varvec{\theta }}^{\mathsf {T}} \varvec{\theta }}{\theta _{D+1}^2} = \frac{1}{\theta _{D+1}^2} \end{aligned}$$

The inverse of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) is obtained by the Sherman-Morrison-Woodbury formula [29]

$$\begin{aligned} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })^{-1} = \left[ \mathbf{I}_D + \frac{\varvec{\theta } {\varvec{\theta }}^{\mathsf {T}}}{\theta _{D+1}^2} \right] ^{-1} = \mathbf{I}_D - \frac{\varvec{\theta } {\varvec{\theta }}^{\mathsf {T}}/\theta _{D+1}^2}{1+{\varvec{\theta }}^{\mathsf {T}}\varvec{\theta }/\theta _{D+1}^2} = \mathbf{I}_D - \varvec{\theta } {\varvec{\theta }}^{\mathsf {T}} \end{aligned}$$

\(\square \)
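
The identities in Proposition 2.7.2 are also easy to verify numerically; the following short Python check (an illustration we add here, assuming NumPy) compares the closed forms in (2.63) with a direct computation:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

theta = rng.normal(size=D)
theta *= 0.6 / np.linalg.norm(theta)
theta_Dp1 = np.sqrt(1.0 - theta @ theta)

# Canonical metric in Cartesian coordinates, Eq. (2.61).
G_c = np.eye(D) + np.outer(theta, theta) / theta_Dp1**2

# Eq. (2.63): |G_{S_c}| = theta_{D+1}^{-2} and G_{S_c}^{-1} = I - theta theta^T.
print(np.linalg.det(G_c), theta_Dp1**-2)                                     # determinants agree
print(np.allclose(np.linalg.inv(G_c), np.eye(D) - np.outer(theta, theta)))   # True
```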

Corollary 2.1

The volume adjustment of changing measure in (2.6) is

$$\begin{aligned} \left| \frac{d\varvec{\theta }_{\mathscr {B}}}{d\varvec{\theta }_{\mathscr {S}_c}}\right| =|\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })|^{-\frac{1}{2}}=|\theta _{D+1}| \end{aligned}$$
(2.64)

Proof

The canonical measure can be defined through the Riesz representation theorem by using a positive linear functional on the space \(C_0(\mathscr {S}^D)\) of compactly supported continuous functions on \(\mathscr {S}^D\) [21, 59]. More precisely, there is a unique positive Borel measure \(\mu _c\) such that for (any) coordinate chart \((\mathscr {B}_\mathbf{0}^D(1),T_{\mathscr {B}\rightarrow \mathscr {S}_+})\),

$$\begin{aligned} \int _{\mathscr {S}_+^D} f(\tilde{\varvec{\theta }}) d\varvec{\theta }_{\mathscr {S}_c} = \int _{\mathscr {B}_\mathbf{0}^D(1)} f(\varvec{\theta }) \sqrt{|\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })|} d\varvec{\theta }_{\mathscr {B}} \end{aligned}$$

where \(\mu _c=d\varvec{\theta }_{\mathscr {S}_c}\), and \(d\varvec{\theta }_{\mathscr {B}}\) is the Euclidean measure. Therefore we have

$$\begin{aligned} \left| \frac{d\varvec{\theta }_{\mathscr {S}_c}}{d\varvec{\theta }_{\mathscr {B}}}\right| = |\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })|^{\frac{1}{2}} = |\theta _{D+1}|^{-1} \end{aligned}$$

Alternatively, \(\left| \frac{d\varvec{\theta }_{\mathscr {B}}}{d\varvec{\theta }_{\mathscr {S}_c}}\right| =|\theta _{D+1}|\). \(\square \)

Geodesic on a Sphere in Cartesian Coordinates

To find the geodesic on a sphere, we need to solve the following equations:

$$\begin{aligned} \dot{\varvec{\theta }}&= \mathbf{{v}} \end{aligned}$$
(2.65)
$$\begin{aligned} \dot{\mathbf{{v}}}&= -\mathbf{{v}}^{\mathsf {T}}\varvec{\varGamma }_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}} \end{aligned}$$
(2.66)

for which we need to calculate the Christoffel symbols, \(\varvec{\varGamma }_{\mathscr {S}_c}(\varvec{\theta })\), first. Note that the (ij)-th element of \(\mathbf{G}_{\mathscr {S}_c}\) is \(g_{ij} = \delta _{ij} + \theta _i\theta _j/\theta _{D+1}^2\), and the (ijk)-th element of \(d\mathbf{G}_{\mathscr {S}_c}\) is \(g_{ij,k} = (\delta _{ik}\theta _j + \theta _i\delta _{jk})/\theta _{D+1}^2 + 2\theta _i\theta _j\theta _k/\theta _{D+1}^4\). Therefore

$$\begin{aligned} \begin{aligned} \varGamma _{ij}^k&= \frac{1}{2}g^{kl}[g_{lj,i}+g_{il,j}-g_{ij,l}]\\&= (\delta ^{kl}-\theta ^k\theta ^l)\theta _l/\theta _{D+1}^2[\delta _{ij}+\theta _i\theta _j/\theta _{D+1}^2]\\&= \theta _k[\delta _{ij}+\theta _i\theta _j/\theta _{D+1}^2] = [\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\otimes \varvec{\theta }]_{ijk} \end{aligned} \end{aligned}$$

Using these results, we can write Eq. (2.66) as \(\dot{\mathbf{{v}}} = -\mathbf{{v}}^{\mathsf {T}} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}}\varvec{\theta } = -\Vert \tilde{\mathbf{v}}\Vert _2^2\varvec{\theta }\). Further, for the augmented components we have

$$\begin{aligned} \dot{\theta }_{D+1} = -\frac{{\varvec{\theta }}^{\mathsf {T}}\dot{\varvec{\theta }}}{\theta _{D+1}} = -\frac{{\varvec{\theta }}^{\mathsf {T}}{} \mathbf{{v}}}{\theta _{D+1}} = v_{D+1}, \qquad \dot{v}_{D+1} = \frac{d}{dt}\left( -\frac{{\varvec{\theta }}^{\mathsf {T}}{} \mathbf{{v}}}{\theta _{D+1}}\right) = -\Vert \tilde{\mathbf{{v}}}\Vert _2^2\,\theta _{D+1} \end{aligned}$$

Therefore, we can rewrite the geodesic equations (2.65), (2.66) with augmented components as

$$\begin{aligned} \dot{\tilde{\varvec{\theta }}}&= \tilde{\mathbf{{v}}} \end{aligned}$$
(2.67)
$$\begin{aligned} \dot{\tilde{\mathbf{{v}}}}&= -\Vert \tilde{\mathbf{{v}}}\Vert _2^2 \tilde{\varvec{\theta }} \end{aligned}$$
(2.68)

Multiplying both sides of Eq. (2.68) by \({\tilde{\mathbf{{v}}}}^{\mathsf {T}}\) to obtain \(\frac{d}{dt}\Vert \tilde{\mathbf{{v}}}\Vert _2^2=0\), we can solve the above system of differential equations as follows:

$$\begin{aligned} \tilde{\varvec{\theta }}(t)&= \tilde{\varvec{\theta }}(0) \cos (\Vert \tilde{\mathbf{{v}}}(0)\Vert _{2} t) + \frac{\tilde{\mathbf{{v}}}(0)}{\Vert \tilde{\mathbf{{v}}}(0)\Vert _{2}} \sin (\Vert \tilde{\mathbf{{v}}}(0)\Vert _{2} t)\\ \tilde{\mathbf{{v}}}(t)&= -\tilde{\varvec{\theta }}(0) \Vert \tilde{\mathbf{{v}}}(0)\Vert _{2} \sin (\Vert \tilde{\mathbf{{v}}}(0)\Vert _{2} t) + \tilde{\mathbf{{v}}}(0) \cos (\Vert \tilde{\mathbf{{v}}}(0)\Vert _{2} t) \end{aligned}$$
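
The closed-form geodesic above is simply a great-circle rotation; a minimal Python sketch (ours, not the authors' implementation) shows that it preserves the sphere constraint, the tangency of the velocity, and the speed \(\Vert \tilde{\mathbf{{v}}}\Vert _2\):

```python
import numpy as np

def geodesic_flow(theta_tilde, v_tilde, t):
    """Great-circle geodesic flow on S^D, following the closed-form solution above."""
    speed = np.linalg.norm(v_tilde)
    theta_t = theta_tilde * np.cos(speed * t) + (v_tilde / speed) * np.sin(speed * t)
    v_t = -theta_tilde * speed * np.sin(speed * t) + v_tilde * np.cos(speed * t)
    return theta_t, v_t

rng = np.random.default_rng(2)
theta_tilde = rng.normal(size=6)
theta_tilde /= np.linalg.norm(theta_tilde)           # a point on S^5
v_tilde = rng.normal(size=6)
v_tilde -= (theta_tilde @ v_tilde) * theta_tilde     # project into the tangent space

th, v = geodesic_flow(theta_tilde, v_tilde, t=0.7)
print(np.linalg.norm(th), th @ v, np.linalg.norm(v) - np.linalg.norm(v_tilde))
# stays on the sphere (norm 1), v remains tangent, and the speed is conserved
```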

Round Metric in the Spherical Coordinates

Consider the D-dimensional hyperrectangle \(\mathscr {R}_\mathbf{0}^D:=[0,\pi ]^{D-1}\times [0,2\pi )\) and the corresponding spherical coordinate system, \(\{\varvec{\theta }, \mathscr {R}_\mathbf{0}^D\}\), for \(\mathscr {S}^D\). The coordinate mapping \(T_{\mathscr {R}_\mathbf{0}\rightarrow \mathscr {S}}: \varvec{\theta }\mapsto \mathbf{x},\; x_d = \cos (\theta _d)\prod _{i=1}^{d-1}\sin (\theta _i),\, d=1,\ldots , D+1\), (\(\theta _{D+1}=0\)) can be viewed as the embedding map into \(\mathbb R^{D+1}\), and the Jacobian matrix of \(T_{\mathscr {R}_\mathbf{0}\rightarrow \mathscr {S}}\) is \(\frac{d\mathbf{x}}{d{\varvec{\theta }}^{\mathsf {T}}}\) with the (d, j)-th element \([-\tan (\theta _d)\delta _{dj}+\cot (\theta _j)I(j<d)]x_d\). The metric of \({\mathscr {S}}^D\) induced in the spherical coordinates is called the round metric, denoted \(\mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })\), whose (i, j)-th element is as follows

$$\begin{aligned} \begin{aligned}&\mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })_{ij}\\&= \sum _{d=1}^{D+1} [-\tan (\theta _d)\delta _{di}+\cot (\theta _i)I(i<d)][-\tan (\theta _d)\delta _{dj}+\cot (\theta _j)I(j<d)]x_d^2\\&={\left\{ \begin{array}{ll} -\tan (\theta _j)\cot (\theta _i)x_j^2 + \cot (\theta _i)\cot (\theta _j)\sum _{d>j} x_d^2 =0,&{} i<j\\ \tan ^2(\theta _i)x_i^2 + \cot ^2(\theta _i) \sum _{d>i} x_d^2 = (\tan ^2(\theta _i)+1)x_i^2 = \prod _{d=1}^{i-1}\sin ^2(\theta _d),&{} i=j \end{array}\right. }\\&= \prod _{d=1}^{i-1}\sin ^2(\theta _d)\, \delta _{ij} \end{aligned} \end{aligned}$$
(2.69)

Therefore, \(\mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })={{\mathrm{diag}}}[1,\sin ^2(\theta _1),\ldots ,\prod _{d=1}^{D-1}\sin ^2(\theta _d)]\). Another way to obtain \(\mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })\) is through the coordinate change

$$\begin{aligned} \mathbf{G}_{\mathscr {S}_r}(\varvec{\theta }) = \frac{d{\varvec{\theta }}^{\mathsf {T}}_{\mathscr {S}_c}}{d\varvec{\theta }_{\mathscr {S}_r}}{} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta }) \frac{d\varvec{\theta }_{\mathscr {S}_c}}{d{\varvec{\theta }}^{\mathsf {T}}_{\mathscr {S}_r}} \end{aligned}$$
(2.70)

Similar to Corollary 2.1, we have

Proposition 2.7.3

The volume adjustment of changing measure in (2.9) is

$$\begin{aligned} \left| \frac{d\varvec{\theta }_{\mathscr {R}_\mathbf{0}}}{d\varvec{\theta }_{\mathscr {S}_r}}\right| =|\mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })|^{-\frac{1}{2}}=\prod _{d=1}^{D-1}\sin ^{-(D-d)}(\theta _{d}) \end{aligned}$$
(2.71)
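
As a sanity check of (2.71) (an added illustration, not from the original text), one can build the diagonal round metric numerically and compare \(|\mathbf{G}_{\mathscr {S}_r}(\varvec{\theta })|^{-1/2}\) with the product formula:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 5
theta = rng.uniform(0.1, np.pi - 0.1, size=D)   # spherical coordinates (last one unused by the metric)

# Round metric: G_{S_r} = diag(1, sin^2(theta_1), ..., prod_{d<D} sin^2(theta_d)).
diag = np.array([np.prod(np.sin(theta[:i]) ** 2) for i in range(D)])
G_r = np.diag(diag)

# Eq. (2.71): |G_{S_r}|^{-1/2} = prod_{d=1}^{D-1} sin(theta_d)^{-(D-d)}.
lhs = np.linalg.det(G_r) ** -0.5
rhs = np.prod([np.sin(theta[d - 1]) ** -(D - d) for d in range(1, D)])
print(lhs, rhs)   # the two volume-adjustment weights agree
```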

Jacobian of the Transformation Between q-Norm Domains

The following proposition gives the weights needed for the transformation from \(\mathscr {Q}^D\) to \(\mathscr {B}_\mathbf{0}^D(1)\).

Proposition 2.7.4

The Jacobian determinant (weight) of \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is as follows:

$$\begin{aligned} |dT_{\mathscr {B}\rightarrow \mathscr {Q}}| = \left( \frac{2}{q}\right) ^D \left( \prod _{i=1}^{D}|\theta _i|\right) ^{2/q-1} \end{aligned}$$
(2.72)

Proof

Note

$$\begin{aligned} T_{\mathscr {B}\rightarrow \mathscr {Q}}:\, \varvec{\theta }\mapsto \varvec{\beta }=\mathrm {sgn}(\varvec{\theta })|\varvec{\theta }|^{2/q} \end{aligned}$$

The Jacobian matrix for \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is

$$\begin{aligned} \frac{d\varvec{\beta }}{d{\varvec{\theta }}^{\mathsf {T}}} = \frac{2}{q}\mathrm {diag}(|\varvec{\theta }|^{2/q-1}) \end{aligned}$$

Therefore the Jacobian determinant of \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is

$$\begin{aligned} |dT_{\mathscr {B}\rightarrow \mathscr {Q}}| = \left| \frac{d\varvec{\beta }}{d{\varvec{\theta }}^{\mathsf {T}}}\right| = \left( \frac{2}{q}\right) ^D \left( \prod _{i=1}^{D}|\theta _i|\right) ^{2/q-1} \end{aligned}$$

\(\square \)
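
For illustration (our sketch; the helper names are ours), the map \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) and the logarithm of its Jacobian weight (2.72) can be coded directly; working on the log scale avoids underflow when some \(|\theta _i|\) are small:

```python
import numpy as np

def ball_to_qnorm(theta, q):
    """T_{B->Q}: theta |-> sgn(theta)|theta|^{2/q}, mapping the 2-norm ball to the q-norm ball."""
    return np.sign(theta) * np.abs(theta) ** (2.0 / q)

def log_jacobian_ball_to_qnorm(theta, q):
    """log|dT_{B->Q}| from Eq. (2.72), computed on the log scale."""
    D = theta.size
    return D * np.log(2.0 / q) + (2.0 / q - 1.0) * np.sum(np.log(np.abs(theta)))

rng = np.random.default_rng(4)
theta = rng.uniform(-0.5, 0.5, size=3)      # a point inside the unit 2-norm ball
q = 1.0                                     # e.g. a Lasso-type 1-norm constraint
beta = ball_to_qnorm(theta, q)
print(np.sum(np.abs(beta)) <= 1.0)          # the image satisfies the q-norm constraint (here q = 1)
print(log_jacobian_ball_to_qnorm(theta, q))
```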

The following proposition gives the weights needed for the change of domains from \(\mathscr {R}^D\) to \(\mathscr {B}_\mathbf{0}^D(1)\).

Proposition 2.7.5

The Jacobian determinant (weight) of \(T_{\mathscr {B}\rightarrow \mathscr {R}}\) is as follows:

$$\begin{aligned} |dT_{\mathscr {B}\rightarrow \mathscr {R}}| = \frac{\Vert \varvec{\theta }\Vert _2^D}{\Vert \varvec{\theta }\Vert _{\infty }^D} \prod _{i=1}^D \frac{u_i-l_i}{2} \end{aligned}$$
(2.73)

Proof

First, we note

$$\begin{aligned} T_{\mathscr {B}\rightarrow \mathscr {R}} = T_{\mathscr {C}\rightarrow \mathscr {R}}\circ T_{\mathscr {B}\rightarrow \mathscr {C}}:\, \varvec{\theta }\mapsto \varvec{\beta }'=\varvec{\theta } \frac{\Vert \varvec{\theta }\Vert _2}{\Vert \varvec{\theta }\Vert _{\infty }}\mapsto \varvec{\beta }=\frac{\mathbf{u}-\mathbf{l}}{2}\varvec{\beta }'+\frac{\mathbf{u}+\mathbf{l}}{2} \end{aligned}$$

The corresponding Jacobian matrices are

$$\begin{aligned} \frac{d\varvec{\beta }}{d{(\varvec{\beta }')}^{\mathsf {T}}} = \mathrm {diag}\left( \frac{\mathbf{u}-\mathbf{l}}{2}\right) ,\qquad \frac{d\varvec{\beta }'}{d{\varvec{\theta }}^{\mathsf {T}}} = \frac{\Vert \varvec{\theta }\Vert _2}{\Vert \varvec{\theta }\Vert _{\infty }}\,\mathbf{I}_D + \varvec{\theta }\left[ \frac{\varvec{\theta }}{\Vert \varvec{\theta }\Vert _2\Vert \varvec{\theta }\Vert _{\infty }} - \frac{\Vert \varvec{\theta }\Vert _2}{\Vert \varvec{\theta }\Vert _{\infty }^2}\,\mathrm {sgn}(\theta _{\arg \max |\varvec{\theta }|})\,\mathbf{e}_{\arg \max |\varvec{\theta }|}\right] ^{\mathsf {T}} \end{aligned}$$

where \(\mathbf{e}_{\arg \max |\varvec{\theta }|}\) is a vector with \((\arg \max |\varvec{\theta }|)\)-th element 1 and all others 0. The rank-one term in the second Jacobian vanishes when contracted with \(\varvec{\theta }\), so by the matrix determinant lemma its determinant is \((\Vert \varvec{\theta }\Vert _2/\Vert \varvec{\theta }\Vert _{\infty })^D\). Therefore,

$$\begin{aligned} |dT_{\mathscr {B}\rightarrow \mathscr {R}}| = |dT_{\mathscr {C}\rightarrow \mathscr {R}}|\, |dT_{\mathscr {B}\rightarrow \mathscr {C}}| = \left| \frac{d\varvec{\beta }}{d{(\varvec{\beta }')}^{\mathsf {T}}}\right| \left| \frac{d\varvec{\beta }'}{d{\varvec{\theta }}^{\mathsf {T}}}\right| = \frac{\Vert \varvec{\theta }\Vert _2^D}{\Vert \varvec{\theta }\Vert _{\infty }^D} \prod _{i=1}^D \frac{u_i-l_i}{2} \end{aligned}$$

\(\square \)
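
Similarly, a short Python sketch (ours; helper names are hypothetical) of \(T_{\mathscr {B}\rightarrow \mathscr {R}}=T_{\mathscr {C}\rightarrow \mathscr {R}}\circ T_{\mathscr {B}\rightarrow \mathscr {C}}\) and the log of the Jacobian weight (2.73):

```python
import numpy as np

def ball_to_box(theta, l, u):
    """T_{B->R}: scale the unit ball onto the unit cube, then affinely onto [l, u]."""
    beta_c = theta * np.linalg.norm(theta, 2) / np.linalg.norm(theta, np.inf)   # T_{B->C}
    return 0.5 * (u - l) * beta_c + 0.5 * (u + l)                               # T_{C->R}

def log_jacobian_ball_to_box(theta, l, u):
    """log|dT_{B->R}| from Eq. (2.73)."""
    D = theta.size
    return D * (np.log(np.linalg.norm(theta, 2)) - np.log(np.linalg.norm(theta, np.inf))) \
        + np.sum(np.log(0.5 * (u - l)))

rng = np.random.default_rng(5)
theta = rng.uniform(-0.4, 0.4, size=4)          # a point inside the unit ball
l, u = np.array([0., -1., 2., 0.]), np.array([1., 1., 5., 3.])
beta = ball_to_box(theta, l, u)
print(np.all((l <= beta) & (beta <= u)))        # True: the image lies in the hyperrectangle
print(log_jacobian_ball_to_box(theta, l, u))
```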

Splitting Hamiltonian (Lagrangian) Dynamics on \({\varvec{\mathscr {S}}}^D\)

Splitting the Hamiltonian dynamics to improve HMC is a well-studied topic [13, 36, 56]. Splitting the Lagrangian dynamics (used in our approach), on the other hand, has not been discussed in the literature, to the best of our knowledge. Therefore, we prove the validity of our splitting method by starting with the well-understood splitting of the Hamiltonian [13],

$$\begin{aligned} H^*(\varvec{\theta },\mathbf{p}) = \frac{1}{2} U(\varvec{\theta }) + \frac{1}{2} \mathbf{p}^{\mathsf {T}} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })^{-1}{} \mathbf{p} + \frac{1}{2}U(\varvec{\theta }) \end{aligned}$$

The corresponding systems of differential equations,

$$\begin{aligned} \left\{ \begin{array}{l} \dot{\varvec{\theta }} = \mathbf{0}\\ \dot{\mathbf{p}} = -\frac{1}{2}\nabla _{\varvec{\theta }} U(\varvec{\theta }) \end{array}\right. \qquad \text {and}\qquad \left\{ \begin{array}{l} \dot{\varvec{\theta }} = \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })^{-1}{} \mathbf{p}\\ \dot{\mathbf{p}} = -\frac{1}{2}\nabla _{\varvec{\theta }}\left[ \mathbf{p}^{\mathsf {T}}{} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })^{-1}{} \mathbf{p}\right] \end{array}\right. \end{aligned}$$

can be written in terms of Lagrangian dynamics in \((\varvec{\theta }, \mathbf{{v}})\) as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \dot{\varvec{\theta }} = \mathbf{0}\\ \dot{\mathbf{{v}}} = -\frac{1}{2}{} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })^{-1}\nabla _{\varvec{\theta }} U(\varvec{\theta }) = -\frac{1}{2}[\mathbf{I}-\varvec{\theta }{\varvec{\theta }}^{\mathsf {T}}]\nabla _{\varvec{\theta }} U(\varvec{\theta }) \end{array}\right. \qquad \text {and}\qquad \left\{ \begin{array}{l} \dot{\varvec{\theta }} = \mathbf{{v}}\\ \dot{\mathbf{{v}}} = -\mathbf{{v}}^{\mathsf {T}}\varvec{\varGamma }_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}} \end{array}\right. \end{aligned}$$

We have solved the second dynamics (on the right) in Appendix "Geodesic on a Sphere in Cartesian Coordinates". To solve the first dynamics, we note that \(\tilde{\varvec{\theta }}\) stays fixed and the augmented velocity evolves as

$$\begin{aligned} \dot{\tilde{\mathbf{{v}}}} = \begin{bmatrix} \mathbf{I}\\ -\frac{{\varvec{\theta }}^{\mathsf {T}}}{\theta _{D+1}}\end{bmatrix}\dot{\mathbf{{v}}} = -\frac{1}{2}\begin{bmatrix} \mathbf{I}\\ -\frac{{\varvec{\theta }}^{\mathsf {T}}}{\theta _{D+1}}\end{bmatrix} [\mathbf{I}-\varvec{\theta }{\varvec{\theta }}^{\mathsf {T}}] \nabla _{\varvec{\theta }} U(\varvec{\theta }) \end{aligned}$$

Therefore, we have

$$\begin{aligned} \tilde{\varvec{\theta }}(t)&= \tilde{\varvec{\theta }}(0)\\ \tilde{\mathbf{{v}}}(t)&= \tilde{\mathbf{{v}}}(0) -\frac{t}{2} \begin{bmatrix} \mathbf{I}\\ -\frac{{\varvec{\theta }(0)}^{\mathsf {T}}}{\theta _{D+1}(0)}\end{bmatrix} [\mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}] \nabla _{\varvec{\theta }} U(\varvec{\theta }) \end{aligned}$$

where \(\begin{bmatrix} \mathbf{I}\\ -\frac{{\varvec{\theta }(0)}^{\mathsf {T}}}{\theta _{D+1}(0)}\end{bmatrix} [\mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}]=\begin{bmatrix} \mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}\\ -\theta _{D+1}(0) {\varvec{\theta }(0)}^{\mathsf {T}}\end{bmatrix} = \begin{bmatrix} \mathbf{I}\\ \mathbf{0}^{\mathsf {T}}\end{bmatrix} - \tilde{\varvec{\theta }}(0) {\varvec{\theta }(0)}^{\mathsf {T}}\).

Finally, we note that \(\Vert \tilde{\varvec{\theta }}(t)\Vert _2=1\) if \(\Vert \tilde{\varvec{\theta }}(0)\Vert _2=1\) and \(\tilde{\mathbf{{v}}}(t)\in T_{\tilde{\varvec{\theta }}(t)} \mathscr {S}_c^D\) if \(\tilde{\mathbf{{v}}}(0)\in T_{\tilde{\varvec{\theta }}(0)} \mathscr {S}_c^D\).
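
Putting the two split flows together gives one integrator step of Spherical HMC in Cartesian coordinates. The following Python sketch (our illustration, not the reference implementation; the callable grad_U, assumed to return \(\nabla _{\varvec{\theta }}U\), and the step size eps are our names) applies the exact update from the \(U/2\) parts before and after the exact geodesic flow, and checks that the result stays on the sphere with a tangent velocity:

```python
import numpy as np

def spherical_hmc_step(theta_tilde, v_tilde, grad_U, eps):
    """One step: 'kick' from the U/2 part, exact geodesic flow, then another 'kick'."""
    D = theta_tilde.size - 1

    def kick(th, v):
        theta, g = th[:D], grad_U(th[:D])
        g_proj = g - theta * (theta @ g)                    # (I - theta theta^T) grad U
        dv_tilde = np.append(g_proj, -th[D] * (theta @ g))  # augmented component, cf. the identity above
        return v - 0.5 * eps * dv_tilde

    v_tilde = kick(theta_tilde, v_tilde)
    speed = np.linalg.norm(v_tilde)                          # geodesic (great-circle) flow for time eps
    theta_new = theta_tilde * np.cos(speed * eps) + v_tilde / speed * np.sin(speed * eps)
    v_tilde = -theta_tilde * speed * np.sin(speed * eps) + v_tilde * np.cos(speed * eps)
    theta_tilde = theta_new
    v_tilde = kick(theta_tilde, v_tilde)
    return theta_tilde, v_tilde

# Example: standard Gaussian potential U(theta) = ||theta||^2 / 2 restricted to the unit ball.
grad_U = lambda theta: theta
th = np.append(np.array([0.3, -0.2, 0.1]), np.sqrt(1 - 0.14))
v = np.array([0.5, 0.1, -0.3, 0.0])
v -= (th @ v) * th                                           # tangent initial velocity
th, v = spherical_hmc_step(th, v, grad_U, eps=0.1)
print(np.linalg.norm(th), th @ v)                            # remains on S^D, v stays tangent
```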

Error Analysis of Spherical HMC

Following [36], we now show that the discretization error \(e_n=\Vert \mathbf{{z}}(t_n) - \mathbf{{z}}^{(n)}\Vert =\Vert ({\varvec{\theta }}(t_n),\mathbf{{v}}(t_n)) - ({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\Vert \) (i.e., the difference between the true solution and the numerical solution) is \({\mathscr {O}}(\varepsilon ^3)\) locally and \(\mathscr {O}(\varepsilon ^2)\) globally, where \(\varepsilon \) is the discretization step size. Here, we assume that \(\mathbf{{f}}({\varvec{\theta }},\mathbf{{v}}):= \mathbf{{v}}^{\textsf {T}}\varvec{\varGamma }({\varvec{\theta }}) \mathbf{{v}} + \mathbf{{G}}({\varvec{\theta }})^{-1} \nabla _{\varvec{\theta }}U({\varvec{\theta }})\) is smooth; hence, \(\mathbf{{f}}\) and its derivatives are uniformly bounded as \(\mathbf{{z}}=({\varvec{\theta }},\mathbf{{v}})\) evolves within finite time duration T. We expand the true solution \(\mathbf{{z}}(t_{n+1})\) at \(t_n\):

$$\begin{aligned} \mathbf{{z}}(t_{n+1})&= \mathbf{{z}}(t_n)+\dot{\mathbf{{z}}}(t_n)\varepsilon +\frac{1}{2}\ddot{\mathbf{{z}}}(t_n)\varepsilon ^2 +\mathscr {O}(\varepsilon ^3)\nonumber \\&= \left[ \begin{array}{ll}{\varvec{\theta }}(t_n)\\ \mathbf{{v}}(t_n)\end{array}\right] + \left[ \begin{array}{cl}{} \mathbf{{v}}(t_n)\\ -\mathbf{{f}}({\varvec{\theta }}(t_n),\mathbf{{v}}(t_n))\end{array}\right] \varepsilon + \frac{1}{2} \left[ \begin{array}{ll}-\mathbf{{f}}({\varvec{\theta }}(t_n),\mathbf{{v}}(t_n)) \\ -\dot{\mathbf{{f}}}({\varvec{\theta }}(t_n),\mathbf{{v}}(t_n))\end{array}\right] \varepsilon ^2 +\mathscr {O}(\varepsilon ^3) \end{aligned}$$
(2.76)

We first consider Spherical HMC in Cartesian coordinates, where \(\mathbf{{f}}({\varvec{\theta }},\mathbf{{v}})=\Vert \tilde{\mathbf{{v}}}\Vert ^2\varvec{\theta } + [\mathbf{I}-\varvec{\theta }{\varvec{\theta }}^{\mathsf {T}}] \nabla _{\varvec{\theta }} U(\varvec{\theta })\). From Eq. (2.34) we have

$$\begin{aligned} \mathbf{{v}}^{(n+1/2)}&= \mathbf{{v}}^{(n)} -\frac{\varepsilon }{2} (\mathbf{I}-\varvec{\theta }^{(n)}{(\varvec{\theta }^{(n)})}^{\mathsf {T}}) \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)})\nonumber \\ \Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2&= \Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2 -\varepsilon {(\mathbf{{v}}^{(n)})}^{\mathsf {T}} \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)}) + \mathscr {O}(\varepsilon ^2) \end{aligned}$$
(2.77)

Now we expand Eq. (2.35) using Taylor series as follows:

$$\begin{aligned} \varvec{\theta }^{(n+1)} =&\, \varvec{\theta }^{(n)}[1-\Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2\varepsilon ^2/2+\mathscr {O}(\varepsilon ^4)]\\&+ \mathbf{{v}}^{(n+1/2)}\varepsilon [1 - \Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2\varepsilon ^2/3! + \mathscr {O}(\varepsilon ^4)]\\ \mathbf{{v}}^{(n+3/4)} =&-\varvec{\theta }^{(n)} \Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2 \varepsilon [1 - \Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2\varepsilon ^2/3! + \mathscr {O}(\varepsilon ^4)]\\&+ \mathbf{{v}}^{(n+1/2)} [1-\Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2\varepsilon ^2/2+\mathscr {O}(\varepsilon ^4)] \end{aligned}$$

Substituting (2.77) in the above equations yields

$$\begin{aligned} \begin{aligned} \varvec{\theta }^{(n+1)}&=\varvec{\theta }^{(n)} + \mathbf{{v}}^{(n+1/2)}\varepsilon - \varvec{\theta }^{(n)}\Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2\varepsilon ^2/2 +\mathscr {O}(\varepsilon ^3)\\&= \varvec{\theta }^{(n)} + \mathbf{{v}}^{(n)}\varepsilon - \frac{1}{2} \mathbf{f}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)}) \varepsilon ^2 + \mathscr {O}(\varepsilon ^3)\\ \mathbf{{v}}^{(n+3/4)}&= \mathbf{{v}}^{(n+1/2)} -\varvec{\theta }^{(n)} \Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2 \varepsilon - \mathbf{{v}}^{(n+1/2)} \Vert \tilde{\mathbf{{v}}}^{(n+1/2)}\Vert ^2\varepsilon ^2/2 +\mathscr {O}(\varepsilon ^3)\\&= \mathbf{{v}}^{(n)} - [(\mathbf{I}-\varvec{\theta }^{(n)}{(\varvec{\theta }^{(n)})}^{\mathsf {T}}) \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)})/2 + \varvec{\theta }^{(n)} \Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2]\varepsilon \\&\quad + [\varvec{\theta }^{(n)} {(\mathbf{{v}}^{(n)})}^{\mathsf {T}} \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)}) - \mathbf{{v}}^{(n)} \Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2/2]\varepsilon ^2 + \mathscr {O}(\varepsilon ^3) \end{aligned} \end{aligned}$$

With the above results, we have

$$\begin{aligned} \mathbf{{v}}^{(n+1)}&=\, \mathbf{{v}}^{(n+3/4)} -\frac{\varepsilon }{2} (\mathbf{I}-\varvec{\theta }^{(n+1)}{(\varvec{\theta }^{(n+1)})}^{\mathsf {T}}) \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n+1)})\\&= \mathbf{{v}}^{(n)} - \mathbf{f}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\varepsilon + [\varvec{\theta }^{(n)} {(\mathbf{{v}}^{(n)})}^{\mathsf {T}} \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)}) - \mathbf{{v}}^{(n)} \Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2/2]\varepsilon ^2 + \mathscr {O}(\varepsilon ^3)\\&- \frac{1}{2}[ (\mathbf{I}-\varvec{\theta }^{(n)}{(\varvec{\theta }^{(n)})}^{\mathsf {T}}) \nabla _{\varvec{\theta }}^2 U(\varvec{\theta }^{(n)}) \mathbf{{v}}^{(n)} - (\varvec{\theta }^{(n)}{(\mathbf{{v}}^{(n)})}^{\mathsf {T}}+\mathbf{{v}}^{(n)}{(\varvec{\theta }^{(n)})}^{\mathsf {T}}) \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)}) ]\varepsilon ^2 \\&= \mathbf{{v}}^{(n)} - \mathbf{f}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\varepsilon - \frac{1}{2} \dot{\mathbf{f}}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\varepsilon ^2 + \mathscr {O}(\varepsilon ^3) \end{aligned}$$

where for the last equality we need to show \(\frac{d}{dt}\Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2 = -2{(\mathbf{{v}}^{(n)})}^{\mathsf {T}} \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)})\). This can be proved as follows:

$$\begin{aligned} \frac{d}{dt}\Vert \tilde{\mathbf{{v}}}\Vert ^2&= \frac{d}{dt} [ \Vert \mathbf{{v}}\Vert ^2 + v_{D+1}^2] = 2[-\mathbf{{v}}^{\mathsf {T}}{} \mathbf{f} + v_{D+1}\dot{v}_{D+1}]\\&= 2\left[ -\mathbf{{v}}^{\mathsf {T}}{} \mathbf{f} +\left( -\frac{{\dot{\varvec{\theta }}}^{\mathsf {T}} \mathbf{{v}}+ {\varvec{\theta }}^{\mathsf {T}}\dot{\mathbf{{v}}}}{\theta _{D+1}} + \frac{{\varvec{\theta }}^{\mathsf {T}}{} \mathbf{{v}}}{\theta _{D+1}^{2}}\dot{\theta }_{D+1}\right) v_{D+1}\right] \\&= -2\left[ {\left( \mathbf{{v}} -\frac{v_{D+1}}{\theta _{D+1}}\varvec{\theta }\right) }^{\mathsf {T}}{} \mathbf{f} + \frac{v_{D+1}}{\theta _{D+1}} \Vert \tilde{\mathbf{{v}}}\Vert ^2 \right] \\&= -2\left[ \left( \mathbf{{v}}^{\mathsf {T}}\varvec{\theta } -\frac{v_{D+1}}{\theta _{D+1}}(\Vert \varvec{\theta }\Vert ^2-1)\right) \Vert \tilde{\mathbf{{v}}}\Vert ^2 + {\left( \mathbf{{v}} -\frac{v_{D+1}}{\theta _{D+1}}\varvec{\theta }\right) }^{\mathsf {T}}[\mathbf{I}-\varvec{\theta }{\varvec{\theta }}^{\mathsf {T}}] \nabla _{\varvec{\theta }} U(\varvec{\theta })\right] \\&= -2\left[ \mathbf{{v}}^{\mathsf {T}}\nabla _{\varvec{\theta }} U(\varvec{\theta }) + \left( -\mathbf{{v}}^{\mathsf {T}}\varvec{\theta } -\frac{v_{D+1}}{\theta _{D+1}}(1-\Vert \varvec{\theta }\Vert ^2)\right) {\varvec{\theta }}^{\mathsf {T}}\nabla _{\varvec{\theta }} U(\varvec{\theta })\right] \\&=-2 \mathbf{{v}}^{\mathsf {T}}\nabla _{\varvec{\theta }} U(\varvec{\theta }) \end{aligned}$$

Therefore we have

$$\begin{aligned} \mathbf{z}^{(n+1)} := \begin{bmatrix}{\varvec{\theta }}^{(n+1)}\\\mathbf{{v}}^{(n+1)}\end{bmatrix} = \begin{bmatrix}{\varvec{\theta }}^{(n)}\\\mathbf{{v}}^{(n)}\end{bmatrix} +\begin{bmatrix}{} \mathbf{{v}}^{(n)}\\ -\mathbf{f}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\end{bmatrix}\varepsilon + \frac{1}{2} \begin{bmatrix}-\mathbf{f}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\\ -\dot{\mathbf{f}}({\varvec{\theta }}^{(n)},\mathbf{{v}}^{(n)})\end{bmatrix}\varepsilon ^2 + \mathscr {O}(\varepsilon ^3) \end{aligned}$$
(2.78)

The local error is

$$\begin{aligned} \begin{aligned} e_{n+1}&=\Vert \mathbf{z}(t_{n+1}) - \mathbf{z}^{(n+1)}\Vert \\&= \left\| \begin{bmatrix}{\varvec{\theta }}(t_n)-{\varvec{\theta }}^{(n)}\\ \mathbf{{v}}(t_n)-\mathbf{{v}}^{(n)}\end{bmatrix} + \begin{bmatrix}{} \mathbf{{v}}(t_n)-\mathbf{{v}}^{(n)}\\ -[\mathbf{f}(t_n)- \mathbf{f}^{(n)}]\end{bmatrix}\varepsilon + \frac{1}{2} \begin{bmatrix}-[\mathbf{f}(t_n)- \mathbf{f}^{(n)}]\\ -[\dot{\mathbf{f}}(t_n)- \dot{\mathbf{f}}^{(n)}]\end{bmatrix}\varepsilon ^2 + \mathscr {O}(\varepsilon ^3) \right\| \\&\le (1+M_1\varepsilon +M_2\varepsilon ^2) e_n + \mathscr {O}(\varepsilon ^3) \end{aligned} \end{aligned}$$
(2.79)

where \(M_k = c_k\sup _{t\in [0,T]}\Vert \nabla ^k \mathbf{f}({\varvec{\theta }}(t),\mathbf{{v}}(t))\Vert ,\; k=1,2\) for some constants \(c_k>0\). Accumulating the local errors by iterating the above inequality for \(L=T/\varepsilon \) steps provides the following global error:

$$\begin{aligned} \begin{aligned} e_{L+1}&\le (1+M_1\varepsilon +M_2\varepsilon ^2) e_L + \mathscr {O}(\varepsilon ^3) \le (1+M_1\varepsilon +M_2\varepsilon ^2)^2 e_{L-1} + 3\mathscr {O}(\varepsilon ^3)\le \cdots \\&\le (1+M_1\varepsilon +M_2\varepsilon ^2)^L e_1 + L\mathscr {O}(\varepsilon ^3) \le (e^{M_1T}+T)\varepsilon ^2 \rightarrow 0,\quad as\; \varepsilon \rightarrow 0 \end{aligned} \end{aligned}$$
(2.80)

For Spherical HMC in the spherical coordinates, we conjecture that the integrator of Algorithm 2 still has order 3 local error and order 2 global error. One can follow the same argument as above to verify this.

Bounce in Diamond: Wall HMC for 1-Norm Constraint

Reference [46] discusses the Wall HMC method for the \(\infty \)-norm constraint only. We can, however, derive a similar approach for the 1-norm constraint. As shown in the left panel of Fig. 2.13, given the current state \(\varvec{\theta }_0\), HMC makes a proposal \(\varvec{\theta }\); the trajectory from \(\varvec{\theta }_0\) towards \(\varvec{\theta }\) may hit the boundary. To determine the hit point 'X', we need to solve for \(t\in (0,1)\) such that

$$\begin{aligned} \Vert \varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)t\Vert _1=\sum _{d=1}^D|\theta _0^d+(\theta ^d-\theta _0^d)t|=1 \end{aligned}$$
(2.81)

One can find the hitting time using the bisection method. However, a more efficient method is to find the orthant in which the sampler hits the boundary, i.e., find the normal direction \(\mathbf{n}\) with elements being \(\pm 1\). Then, we can find t,

$$\begin{aligned} \Vert \varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)t\Vert _1 = \mathbf{n}^{\mathsf {T}}[\varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)t]=1 \implies t^* = \frac{1-\mathbf{n}^{\mathsf {T}}\varvec{\theta }_0}{\mathbf{n}^{\mathsf {T}}(\varvec{\theta }-\varvec{\theta }_0)} \end{aligned}$$
(2.82)
Fig. 2.13 Wall HMC bounces in the 1-norm constraint domain. Left: given the current state \(\varvec{\theta }_0\), Wall HMC proposes \(\varvec{\theta }\), but bounces off the boundary and reaches \(\varvec{\theta }'\) instead. Right: determining the hitting time by monitoring the first intersection point with the coordinate planes that violates the constraint

Therefore the hit point is \(\varvec{\theta }_0'=\varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)t^*\) and consequently the reflection point is

$$\begin{aligned} \varvec{\theta }'=\varvec{\theta }-2\mathbf{n}^*\langle \mathbf{n}^*,\varvec{\theta }-\varvec{\theta }_0'\rangle = \varvec{\theta }-2\mathbf{n}(\mathbf{n}^{\mathsf {T}}\varvec{\theta }-1)/D \end{aligned}$$
(2.83)

where \(\mathbf{n}^*:=\mathbf{n}/\Vert \mathbf{n}\Vert _2\) and \(\mathbf{n}^{\mathsf {T}}\varvec{\theta }_0'=1\) because \(\varvec{\theta }_0'\) is on the boundary with the normal direction \(\mathbf{n}^*\).

It is in general difficult to directly determine the intersection of the segment from \(\varvec{\theta }_0\) to \(\varvec{\theta }\) with the boundary. Instead, we can find its intersections with the coordinate planes \(\{\varvec{\pi }_d\}_{d=1}^D\), where \(\varvec{\pi }_d:=\{\varvec{\theta }\in \mathbb R^D\,|\,\theta ^d=0\}\). The intersection times are defined as \(\mathbf{T}=\{\theta _0^d/(\theta _0^d-\theta ^d)\,|\,\theta _0^d\ne \theta ^d\}\). We keep those between 0 and 1 and sort them in ascending order (Fig. 2.13, right panel). Then, we find the intersection points \(\{\varvec{\theta }_k:=\varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)T_k\}\) that violate the constraint \(\Vert \varvec{\theta }\Vert _1\le 1\). Denote the first intersection point outside the constrained domain as \(\varvec{\theta }_k\). The signs of \(\varvec{\theta }_k\) and \(\varvec{\theta }_{k-1}\) determine the orthant of the hitting point \(\varvec{\theta }_0'\).

Note, for each \(d\in \{1,\ldots ,D\}\), \(({{\mathrm{sign}}}(\varvec{\theta }_k^d),{{\mathrm{sign}}}(\varvec{\theta }_{k-1}^d))\) cannot be \((+,-)\) or \((-,+)\); otherwise there exists an intersection point \(\varvec{\theta }^*:=\varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)T^*\) with some coordinate plane \(\varvec{\pi }_{d^*}\) between \(\varvec{\theta }_k\) and \(\varvec{\theta }_{k-1}\), and then \(T_{k-1}<T^*<T_k\) contradicts the ordering of \(\mathbf{T}\) (see Note 4). Therefore any point (including \(\varvec{\theta }_0'\)) between \(\varvec{\theta }_k\) and \(\varvec{\theta }_{k-1}\) must have the same sign as \({{\mathrm{sign}}}({{\mathrm{sign}}}(\varvec{\theta }_k)+{{\mathrm{sign}}}(\varvec{\theta }_{k-1}))\); that is

$$\begin{aligned} \mathbf{n} = {{\mathrm{sign}}}({{\mathrm{sign}}}(\varvec{\theta }_k)+{{\mathrm{sign}}}(\varvec{\theta }_{k-1})) \end{aligned}$$
(2.84)

After moving from \(\varvec{\theta }\) to \(\varvec{\theta }'\), we examine whether \(\varvec{\theta }'\) satisfies the constraint. If it does not, we repeat the above procedure with \(\varvec{\theta }_0\leftarrow \varvec{\theta }_0'\) and \(\varvec{\theta }\leftarrow \varvec{\theta }'\) until the final state is inside the constrained domain. Then we adjust the velocity direction by

$$\begin{aligned} \mathbf{{v}} \leftarrow (\varvec{\theta }'-\varvec{\theta }_0')\frac{\Vert \mathbf{{v}}\Vert }{\Vert \varvec{\theta }'-\varvec{\theta }_0'\Vert } \end{aligned}$$
(2.85)

Algorithm 4 summarizes the above steps.
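
As an illustration of the bouncing procedure (a Python sketch we add here, reconstructed from Eqs. (2.81)–(2.84) with hypothetical names, not the book's Algorithm 4), the following function reflects a proposal back into the 1-norm ball; the velocity adjustment (2.85) would be applied afterwards:

```python
import numpy as np

def bounce_1norm(theta0, theta):
    """Reflect a proposal off the boundary of the 1-norm ball, following Eqs. (2.81)-(2.84).
    A sketch assuming the current state theta0 is strictly inside the domain."""
    while np.sum(np.abs(theta)) > 1.0:
        d = theta - theta0
        # intersection times with the coordinate planes, kept if they fall in (0, 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            T = theta0 / (theta0 - theta)
        T = np.sort(T[(T > 0) & (T < 1)])
        T = np.append(T, 1.0)
        # first intersection point (with a coordinate plane, or the endpoint) outside the domain
        k = next(i for i, t in enumerate(T) if np.sum(np.abs(theta0 + d * t)) > 1.0)
        prev = theta0 + d * (T[k - 1] if k > 0 else 0.0)
        curr = theta0 + d * T[k]
        n = np.sign(np.sign(curr) + np.sign(prev))           # orthant normal, Eq. (2.84)
        t_star = (1.0 - n @ theta0) / (n @ d)                # hitting time, Eq. (2.82)
        theta0 = theta0 + d * t_star                         # hit point on the boundary
        theta = theta - 2.0 * n * (n @ theta - 1.0) / theta0.size   # reflection, Eq. (2.83)
    return theta0, theta

# toy usage: a proposal that overshoots the 1-norm ball in 2D
theta0 = np.array([0.2, 0.1])
theta = np.array([0.9, 0.6])            # ||theta||_1 = 1.5 > 1, so it must bounce
hit, reflected = bounce_1norm(theta0, theta)
print(np.sum(np.abs(hit)), np.sum(np.abs(reflected)))   # hit point on the boundary; proposal back inside
```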



Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Lan, S., Shahbaba, B. (2016). Sampling Constrained Probability Distributions Using Spherical Augmentation. In: Minh, H., Murino, V. (eds) Algorithmic Advances in Riemannian Geometry and Applications. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-45026-1_2

  • DOI: https://doi.org/10.1007/978-3-319-45026-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45025-4

  • Online ISBN: 978-3-319-45026-1