Abstract
Statistical models with constrained probability distributions are abundant in machine learning. Some examples include regression models with norm constraints (e.g., Lasso), probit models, many copula models, and latent Dirichlet allocation (LDA). Bayesian inference for probability distributions confined to constrained domains can be quite challenging for commonly used sampling algorithms. In this work, we propose a novel augmentation technique that handles a wide range of constraints by mapping the constrained domain to a sphere in an augmented space. By moving freely on the surface of this sphere, sampling algorithms handle constraints implicitly and generate proposals that remain within the boundaries when mapped back to the original space. Our proposed method, called Spherical Augmentation, provides a mathematically natural and computationally efficient framework for sampling from constrained probability distributions. We show the advantages of our method over state-of-the-art sampling algorithms, such as exact Hamiltonian Monte Carlo, using several examples including truncated Gaussian distributions, Bayesian Lasso, Bayesian bridge regression, reconstruction of a quantized stationary Gaussian process, and LDA for topic modeling.
Notes
1. Note, \(\mathbf{v}^{\mathsf{T}}\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})\mathbf{v} \le \Vert\mathbf{v}\Vert_2^2 \le \Vert\tilde{\mathbf{v}}\Vert_2^2 = \mathbf{v}^{\mathsf{T}}\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})\mathbf{v}\).
2.
3. The stochastic gradient for sg-wallLMC is \([(n^*_{kw}+\beta-1/2) + \pi_{kw}(n^*_{k\cdot}+W(\beta-1/2))]/n^*_{k\cdot}\), calculated with (2.45).
4. The same argument applies when \(T_k=1\), i.e., \(\varvec{\theta}\) is the first point outside the domain among \(\{\varvec{\theta}_k\}\).
References
Y. Ahmadian, J.W. Pillow, L. Paninski, Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Comput. 23(1), 46–96 (2011)
S. Ahn, Y. Chen, M. Welling, Distributed and adaptive darting Monte Carlo through regenerations, in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) (2013)
S. Ahn, B. Shahbaba, M. Welling, Distributed stochastic gradient MCMC, in International Conference on Machine Learning (2014)
S. Amari, H. Nagaoka, Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191 (Oxford University Press, Oxford, 2000)
C. Andrieu, E. Moulines, On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab. 16(3), 1462–1505 (2006)
C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 72(3), 269–342 (2010)
M.J. Beal, Variational algorithms for approximate Bayesian inference. Ph.D. thesis, University College London, London (2003)
A. Beskos, G. Roberts, A. Stuart, J. Voss, MCMC methods for diffusion bridges. Stoch. Dyn. 8(03), 319–350 (2008)
A. Beskos, F.J. Pinski, J.M. Sanz-Serna, A.M. Stuart, Hybrid Monte Carlo on Hilbert spaces. Stoch. Process. Appl. 121(10), 2201–2230 (2011)
D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
A.E. Brockwell, Parallel Markov chain Monte Carlo simulation by pre-fetching. J. Comput. Gr. Stat. 15, 246–261 (2006)
M.A. Brubaker, M. Salzmann, R. Urtasun, A family of MCMC methods on implicitly defined manifolds, in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), vol. 22, ed. by N.D. Lawrence, M.A. Girolami (2012), pp. 161–172
S. Byrne, M. Girolami, Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 40(4), 825–845 (2013)
B. Calderhead, M. Sustik, Sparse approximate manifolds for differential geometric MCMC, in Advances in Neural Information Processing Systems, vol. 25, ed. by P. Bartlett, F. Pereira, C. Burges, L. Bottou, K. Weinberger (2012), pp. 2888–2896
O. Cappé, R. Douc, A. Guillin, J.M. Marin, C.P. Robert, Adaptive importance sampling in general mixture classes. Stat. Comput. 18(4), 447–459 (2008)
N. Chopin, A sequential particle filter method for static models. Biometrika 89(3), 539–552 (2002)
S.L. Cotter, G.O. Roberts, A. Stuart, D. White, MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28(3), 424–446 (2013)
R.V. Craiu, J. Rosenthal, C. Yang, Learn from thy neighbor: parallel-chain and regional adaptive MCMC. J. Am. Stat. Assoc. 104(488), 1454–1466 (2009)
P. Del Moral, A. Doucet, A. Jasra, Sequential Monte Carlo samplers. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 68(3), 411–436 (2006)
N. de Freitas, P. Højen-Sørensen, M. Jordan, R. Stuart, Variational MCMC, in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01 (Morgan Kaufmann Publishers Inc., San Francisco, 2001), pp. 120–127
M.P. do Carmo, Riemannian Geometry, 1st edn (Birkhäuser, Boston, 1992)
R. Douc, C.P. Robert, A vanilla Rao–Blackwellization of Metropolis–Hastings algorithms. Ann. Stat. 39(1), 261–277 (2011)
S. Duane, A.D. Kennedy, B.J. Pendleton, D. Roweth, Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)
I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics 35(2), 109–135 (1993)
A. Gelfand, L. van der Maaten, Y. Chen, M. Welling, On herding and the cycling perceptron theorem. Adv. Neural Inf. Process. Syst. 23, 694–702 (2010)
C.J. Geyer, Practical Markov Chain Monte Carlo. Stat. Sci. 7(4), 473–483 (1992)
W.R. Gilks, G.O. Roberts, S.K. Sahu, Adaptive Markov chain Monte Carlo through regeneration. J. Am. Stat. Assoc. 93(443), 1045–1054 (1998)
M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B (with discussion) 73(2), 123–214 (2011)
G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins University Press, Baltimore, 1996)
C. Hans, Bayesian lasso regression. Biometrika 96(4), 835–845 (2009)
M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2010), pp. 856–864
M.D. Hoffman, A. Gelman, The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)
K. Kurihara, M. Welling, N. Vlassis, Accelerated variational Dirichlet process mixtures, in Advances of Neural Information Processing Systems – NIPS, vol. 19 (2006)
S. Lan, B. Shahbaba, Spherical Hamiltonian Monte Carlo for constrained target distributions, in The 31st International Conference on Machine Learning, Beijing (2014), pp. 629–637
S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami, Markov chain Monte Carlo from Lagrangian dynamics. J. Comput. Gr. Stat. 24(2), 357–378 (2015)
B. Leimkuhler, S. Reich, Simulating Hamiltonian Dynamics (Cambridge University Press, Cambridge, 2004)
J. Møller, A. Pettitt, K. Berthelsen, R. Reeves, An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. Biometrika 93, 451–458 (2006)
I. Murray, R.P. Adams, D.J. MacKay, Elliptical slice sampling. JMLR:W&CP 9, 541–548 (2010)
P. Mykland, L. Tierney, B. Yu, Regeneration in Markov chain samplers. J. Am. Stat. Assoc. 90(429), 233–241 (1995)
P. Neal, G.O. Roberts, Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodol. Comput. Appl. Probab. 10(2), 277–297 (2008)
P. Neal, G.O. Roberts, W.K. Yuen, Optimal scaling of random walk Metropolis algorithms with discontinuous target densities. Ann. Appl. Probab. 22(5), 1880–1927 (2012)
R.M. Neal, Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
R.M. Neal, Bayesian Learning for Neural Networks (Springer, Secaucus, 1996)
R.M. Neal, Slice sampling. Ann. Stat. 31(3), 705–767 (2003)
R.M. Neal, The short-cut metropolis method. Technical Report 0506, Department of Statistics, University of Toronto (2005)
R.M. Neal, MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo, ed. by S. Brooks, A. Gelman, G. Jones, X.L. Meng (Chapman and Hall/CRC, 2011), pp. 113–162
A. Pakman, L. Paninski, Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Gr. Stat. 23(2), 518–542 (2014)
T. Park, G. Casella, The Bayesian lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008)
S. Patterson, Y.W. Teh, Stochastic gradient Riemannian Langevin dynamics on the probability simplex, in Advances in Neural Information Processing Systems (2013), pp. 3102–3110
N.S. Pillai, A.M. Stuart, A.H. Thiéry, Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Probab. 22(6), 2320–2356 (2012)
J.G. Propp, D.B. Wilson, Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Struct. Algorithms 9(1–2), 223–252 (1996)
R. Douc, A. Guillin, J.-M. Marin, C.P. Robert, Minimum variance importance sampling via population Monte Carlo. ESAIM: Probab. Stat. 11, 427–447 (2007)
G.O. Roberts, S.K. Sahu, Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. J. R. Stat. Soc. Ser. B 59, 291–317 (1997)
G.O. Roberts, A. Gelman, W.R. Gilks, Weak convergence and optimal scaling of random walk metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
G.O. Roberts, J.S. Rosenthal, Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001)
B. Shahbaba, S. Lan, W.O. Johnson, R.M. Neal, Split Hamiltonian Monte Carlo. Stat. Comput. 24(3), 339–349 (2014)
C. Sherlock, G.O. Roberts, Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 15(3), 774–798 (2009)
M. Spivak, Calculus on Manifolds, vol. 1 (WA Benjamin, New York, 1965)
M. Spivak, A Comprehensive Introduction to Differential Geometry, vol. 1, 2nd edn (Publish or Perish, Inc., Houston, 1979)
Y.W. Teh, D. Newman, M. Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2006), pp. 1353–1360
R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
H.M. Wallach, I. Murray, R. Salakhutdinov, D. Mimno, Evaluation methods for topic models, in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009), pp. 1105–1112
G.R. Warnes, The normal kernel coupler: an adaptive Markov Chain Monte Carlo method for efficiently sampling from multi-modal distributions. Technical Report No. 395, University of Washington (2001)
M. Welling, Herding dynamic weights to learn, in Proceeding of International Conference on Machine Learning (2009)
M. Welling, Y.W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in Proceedings of the 28th International Conference on Machine Learning (ICML) (2011), pp. 681–688
M. West, On scale mixtures of normal distributions. Biometrika 74(3), 646–648 (1987)
Y. Zhang, C. Sutton, Quasi-Newton methods for Markov Chain Monte Carlo, in Advances in Neural Information Processing Systems, ed. by J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, K.Q. Weinberger, vol. 24 (2011), pp. 2393–2401
Acknowledgments
SL is supported by EPSRC Programme Grant, Enabling Quantification of Uncertainty in Inverse Problems (EQUIP), EP/K034154/1. BS is supported by NSF grant IIS-1216045 and NIH grant R01-AI107034.
Appendix
Spherical Geometry
We first discuss the geometry of the D-dimensional sphere \(\mathscr{S}^D:=\{\tilde{\varvec{\theta}}\in\mathbb{R}^{D+1}: \Vert\tilde{\varvec{\theta}}\Vert_2=1\}\hookrightarrow\mathbb{R}^{D+1}\) under two coordinate systems, namely Cartesian and spherical coordinates. Since \(\mathscr{S}^D\) can be embedded in (injectively and differentiably mapped to) \(\mathbb{R}^{D+1}\), we first introduce the concept of an 'induced metric'.
Definition 2.1
(induced metric) If \(\mathscr{D}^{d}\) can be embedded in \(\mathscr{M}^{m}\) \((m>d)\) by \(f: U\subset\mathscr{D}\hookrightarrow\mathscr{M}\), then one can define the induced metric \(g_{\mathscr{D}}\) on \(T\mathscr{D}\) through the metric \(g_{\mathscr{M}}\) defined on \(T\mathscr{M}\):
Remark 2.2
For any \(f: U\subset \mathscr {S}^D\hookrightarrow \mathbb R^{D+1}\), we can define the induced metric through dot product on \(\mathbb R^{D+1}\). More specifically,
where \((Df)_{(D+1)\times D}\) is the Jacobian matrix of the mapping f. A metric induced from the dot product on Euclidean space is called a "canonical metric". This observation leads to the following simple fact, which lays the foundation of Spherical HMC.
Proposition 2.7.1
(Energy invariance) The kinetic energy \(\frac{1}{2}\langle\mathbf{v},\mathbf{v}\rangle_{\mathbf{G}(\varvec{\theta})}\) is invariant under the choice of coordinate system.
Proof
For any \(\mathbf{v}\in T_{\varvec{\theta}}\mathscr{D}\), let \(\varvec{\theta}(t)\) be a curve such that \(\dot{\varvec{\theta}}(0)=\mathbf{v}\). Denote the pushforward of \(\mathbf{v}\) by the embedding map \(f:\mathscr{D}\rightarrow\mathscr{M}\) as \(\tilde{\mathbf{v}}:=f_*(\mathbf{v})=\frac{d}{dt}(f\circ\varvec{\theta})(0)\). Then we have
That is, regardless of the form the energy takes under a particular coordinate system, its value equals that on the embedded manifold. In particular, when \(\mathscr{M}=\mathbb{R}^{D+1}\), the right-hand side simplifies to \(\frac{1}{2}\Vert\tilde{\mathbf{v}}\Vert_2^2\). \(\square\)
Canonical Metric in Cartesian Coordinates
Now consider the D-dimensional unit ball \(\mathscr{B}_\mathbf{0}^D(1):=\{\varvec{\theta}\in\mathbb{R}^D: \Vert\varvec{\theta}\Vert_2\le 1\}\). Here, \(\{\varvec{\theta}, \mathscr{B}_\mathbf{0}^D(1)\}\) can be viewed as the Cartesian coordinate system for \(\mathscr{S}^D\). The coordinate mapping \(T_{\mathscr{B}\rightarrow\mathscr{S}_+}:\varvec{\theta}\mapsto\tilde{\varvec{\theta}}=(\varvec{\theta},\theta_{D+1})\) in (2.5) can be viewed as the embedding map into \(\mathbb{R}^{D+1}\), and the Jacobian matrix of \(T_{\mathscr{B}\rightarrow\mathscr{S}_+}\) is \(dT_{\mathscr{B}\rightarrow\mathscr{S}_+}=\frac{d\tilde{\varvec{\theta}}}{d{\varvec{\theta}}^{\mathsf{T}}}=\begin{bmatrix}\mathbf{I}_{D}\\ -{\varvec{\theta}}^{\mathsf{T}}/\theta_{D+1}\end{bmatrix}\). Therefore the canonical metric of \(\mathscr{S}^D\) in Cartesian coordinates is \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})=\mathbf{I}_D+\varvec{\theta}\varvec{\theta}^{\mathsf{T}}/\theta_{D+1}^2\).
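As a quick numerical check of this derivation (a minimal NumPy sketch of ours, not part of the chapter), one can verify that \((Df)^{\mathsf{T}}(Df)\) for the embedding \(\varvec{\theta}\mapsto(\varvec{\theta},\sqrt{1-\Vert\varvec{\theta}\Vert_2^2})\) reproduces this metric:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
theta = rng.uniform(-0.4, 0.4, size=D)       # a point inside the unit ball
theta_last = np.sqrt(1 - theta @ theta)      # θ_{D+1} on the upper hemisphere

# Jacobian of the embedding θ ↦ (θ, sqrt(1 − ‖θ‖²)): [I_D; −θᵀ/θ_{D+1}]
J = np.vstack([np.eye(D), -theta / theta_last])

G_induced = J.T @ J                          # pullback of the Euclidean dot product
G_formula = np.eye(D) + np.outer(theta, theta) / theta_last**2
print(np.allclose(G_induced, G_formula))     # True
```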
Another way to obtain the metric is through the first fundamental form \(ds^{2}\) (i.e., squared infinitesimal length of a curve) for \(\mathscr {S}^D\), which can be expressed in terms of the differential form \(d\varvec{\theta }\) and the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\),
On the other hand, \(ds^{2}\) can also be obtained as follows [59]:
Equating the two expressions above yields the form of the canonical metric \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})\) in Eq. (2.61). This viewpoint also provides a natural interpretation of the length of a tangent vector. For any vector \(\tilde{\mathbf{v}}=(\mathbf{v},v_{D+1})\in T_{\tilde{\varvec{\theta}}}\mathscr{S}^D=\{\tilde{\mathbf{v}}\in\mathbb{R}^{D+1}: {\tilde{\varvec{\theta}}}^{\mathsf{T}}\tilde{\mathbf{v}}=0\}\), one can think of \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})\) as a means of expressing the length of \(\tilde{\mathbf{v}}\) in terms of \(\mathbf{v}\),
This indeed verifies the energy invariance of Proposition 2.7.1.
The following proposition provides the analytic forms of the determinant and the inverse of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\).
Proposition 2.7.2
The determinant and the inverse of the canonical metric are \(|\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})|=\theta_{D+1}^{-2}\) and \(\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})^{-1}=\mathbf{I}_D-\varvec{\theta}\varvec{\theta}^{\mathsf{T}}\), respectively.
Proof
The determinant of the canonical metric \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) is given by the matrix determinant lemma
The inverse of \(\mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\) is obtained by the Sherman-Morrison-Woodbury formula [29]
\(\square \)
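Both closed forms are easy to check numerically; the following sketch (ours) compares them against direct computation:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 5
theta = rng.uniform(-0.3, 0.3, size=D)
theta_last = np.sqrt(1 - theta @ theta)
G = np.eye(D) + np.outer(theta, theta) / theta_last**2

# Matrix determinant lemma with u = θ/θ_{D+1}: |I + uuᵀ| = 1 + uᵀu = 1/θ_{D+1}²
print(np.isclose(np.linalg.det(G), 1 / theta_last**2))                    # True
# Sherman–Morrison: (I + uuᵀ)⁻¹ = I − uuᵀ/(1 + uᵀu) = I − θθᵀ
print(np.allclose(np.linalg.inv(G), np.eye(D) - np.outer(theta, theta)))  # True
```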
Corollary 2.1
The volume adjustment for the change of measure in (2.6) is \(d\varvec{\theta}_{\mathscr{S}_c}=|\mathbf{G}_{\mathscr{S}_c}(\varvec{\theta})|^{1/2}\,d\varvec{\theta}_{\mathscr{B}}=d\varvec{\theta}_{\mathscr{B}}/|\theta_{D+1}|\).
Proof
The canonical measure can be defined through the Riesz representation theorem by means of a positive linear functional on the space \(C_0(\mathscr{S}^D)\) of compactly supported continuous functions on \(\mathscr{S}^D\) [21, 59]. More precisely, there is a unique positive Borel measure \(\mu_c\) such that for any coordinate chart \((\mathscr{B}_\mathbf{0}^D(1),T_{\mathscr{B}\rightarrow\mathscr{S}_+})\),
where \(\mu _c=d\varvec{\theta }_{\mathscr {S}_c}\), and \(d\varvec{\theta }_{\mathscr {B}}\) is the Euclidean measure. Therefore we have
Alternatively, \(\left| \frac{d\varvec{\theta }_{\mathscr {B}}}{d\varvec{\theta }_{\mathscr {S}_c}}\right| =|\theta _{D+1}|\). \(\square \)
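As a concrete illustration (our example), for \(D=2\) this identity says that integrating \(1/|\theta_3|\) over the unit disk recovers the area of the upper hemisphere: \(\int_{\mathscr{B}^2}|\theta_3|^{-1}\,d\varvec{\theta}=\int_0^{2\pi}\!\int_0^1 \frac{r}{\sqrt{1-r^2}}\,dr\,d\phi=2\pi\). A midpoint-rule check:

```python
import numpy as np

# Midpoint rule for ∫₀^{2π} ∫₀¹ r/√(1−r²) dr dφ; the endpoint singularity is integrable
N = 200_000
r = (np.arange(N) + 0.5) / N
approx = 2 * np.pi * np.sum(r / np.sqrt(1 - r**2)) / N
print(approx, 2 * np.pi)    # ≈ 6.28 each: the area of the upper hemisphere of S²
```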
Geodesic on a Sphere in Cartesian Coordinates
To find the geodesic on a sphere, we need to solve the following equations:
for which we first need to calculate the Christoffel symbols \(\varvec{\varGamma}_{\mathscr{S}_c}(\varvec{\theta})\). Note that the (i, j)-th element of \(\mathbf{G}_{\mathscr{S}_c}\) is \(g_{ij}=\delta_{ij}+\theta_i\theta_j/\theta_{D+1}^2\), and the (i, j, k)-th element of \(d\mathbf{G}_{\mathscr{S}_c}\) is \(g_{ij,k}=(\delta_{ik}\theta_j+\theta_i\delta_{jk})/\theta_{D+1}^2+2\theta_i\theta_j\theta_k/\theta_{D+1}^4\). Therefore
Using these results, we can write Eq. (2.66) as \(\dot{\mathbf{{v}}} = -\mathbf{{v}}^{\mathsf {T}} \mathbf{G}_{\mathscr {S}_c}(\varvec{\theta })\mathbf{{v}}\varvec{\theta } = -\Vert \tilde{\mathbf{v}}\Vert _2^2\varvec{\theta }\). Further, we have
Therefore, we can rewrite the geodesic equations (2.65), (2.66) with augmented components as
Multiplying both sides of Eq. (2.68) by \({\tilde{\mathbf{v}}}^{\mathsf{T}}\) gives \(\frac{d}{dt}\Vert\tilde{\mathbf{v}}\Vert_2^2=0\), so \(\Vert\tilde{\mathbf{v}}\Vert_2\) is constant along the geodesic and the above system of differential equations has the great-circle solution \(\tilde{\varvec{\theta}}(t)=\tilde{\varvec{\theta}}(0)\cos(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)+\frac{\tilde{\mathbf{v}}(0)}{\Vert\tilde{\mathbf{v}}(0)\Vert_2}\sin(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)\), \(\tilde{\mathbf{v}}(t)=-\tilde{\varvec{\theta}}(0)\Vert\tilde{\mathbf{v}}(0)\Vert_2\sin(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)+\tilde{\mathbf{v}}(0)\cos(\Vert\tilde{\mathbf{v}}(0)\Vert_2 t)\).
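The following sketch (ours) implements this great-circle solution and confirms the invariants used above: \(\tilde{\varvec{\theta}}(t)\) stays on the sphere, \(\tilde{\mathbf{v}}(t)\) stays tangent, and \(\Vert\tilde{\mathbf{v}}\Vert_2\) is conserved.

```python
import numpy as np

def geodesic_step(theta_t, v_t, t):
    """Exact great-circle flow on S^D: the solution of the geodesic equations."""
    nv = np.linalg.norm(v_t)
    theta_new = theta_t * np.cos(nv * t) + v_t / nv * np.sin(nv * t)
    v_new = -theta_t * nv * np.sin(nv * t) + v_t * np.cos(nv * t)
    return theta_new, v_new

rng = np.random.default_rng(4)
theta = rng.normal(size=4); theta /= np.linalg.norm(theta)   # θ̃ ∈ S³
v = rng.normal(size=4); v -= theta * (theta @ v)             # ṽ ∈ T_θ̃ S³

theta1, v1 = geodesic_step(theta, v, 0.7)
print(np.isclose(np.linalg.norm(theta1), 1.0),               # on the sphere
      np.isclose(theta1 @ v1, 0.0),                          # still tangent
      np.isclose(np.linalg.norm(v1), np.linalg.norm(v)))     # ‖ṽ‖₂ conserved
```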
Round Metric in Spherical Coordinates
Consider the D-dimensional hyperrectangle \(\mathscr{R}_\mathbf{0}^D:=[0,\pi]^{D-1}\times[0,2\pi)\) and the corresponding spherical coordinate system, \(\{\varvec{\theta},\mathscr{R}_\mathbf{0}^D\}\), for \(\mathscr{S}^D\). The coordinate mapping \(T_{\mathscr{R}_\mathbf{0}\rightarrow\mathscr{S}}: \varvec{\theta}\mapsto\mathbf{x},\; x_d=\cos(\theta_d)\prod_{i=1}^{d-1}\sin(\theta_i),\, d=1,\ldots,D+1\) (with \(\theta_{D+1}=0\)) can be viewed as the embedding map into \(\mathbb{R}^{D+1}\), and the Jacobian matrix of \(T_{\mathscr{R}_\mathbf{0}\rightarrow\mathscr{S}}\) is \(\frac{d\mathbf{x}}{d{\varvec{\theta}}^{\mathsf{T}}}\) with (d, j)-th element \([-\tan(\theta_d)\delta_{dj}+\cot(\theta_j)I(j<d)]x_d\). The metric of \(\mathscr{S}^D\) induced in spherical coordinates is called the round metric, denoted \(\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})\), whose (i, j)-th element is as follows
Therefore, \(\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})={{\mathrm{diag}}}[1,\sin^2(\theta_1),\ldots,\prod_{d=1}^{D-1}\sin^2(\theta_d)]\). Another way to obtain \(\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})\) is through the coordinate change, writing the first fundamental form as \(ds^2=d\theta_1^2+\sin^2(\theta_1)\,d\theta_2^2+\cdots+\prod_{d=1}^{D-1}\sin^2(\theta_d)\,d\theta_D^2\).
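A finite-difference check (our sketch) that the Jacobian of this embedding indeed induces the diagonal round metric:

```python
import numpy as np

def embed(theta):
    """Spherical coordinates → R^{D+1}: x_d = cos(θ_d) ∏_{i<d} sin(θ_i), θ_{D+1} = 0."""
    th = np.append(theta, 0.0)
    sines = np.concatenate([[1.0], np.cumprod(np.sin(th[:-1]))])
    return np.cos(th) * sines

def jacobian(theta, h=1e-6):
    """Central-difference Jacobian of the embedding (numerical, for checking only)."""
    cols = [(embed(theta + h * e) - embed(theta - h * e)) / (2 * h)
            for e in np.eye(len(theta))]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
D = 4
theta = rng.uniform(0.2, np.pi - 0.2, size=D)

J = jacobian(theta)
G_round = np.diag(np.concatenate([[1.0], np.cumprod(np.sin(theta[:-1])**2)]))
print(np.allclose(J.T @ J, G_round, atol=1e-5))   # True: JᵀJ is the round metric
```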
Similar to Corollary 2.1, we have
Proposition 2.7.3
The volume adjustment for the change of measure in (2.9) is \(d\varvec{\theta}_{\mathscr{S}_r}=|\mathbf{G}_{\mathscr{S}_r}(\varvec{\theta})|^{1/2}\,d\varvec{\theta}_{\mathscr{R}}=\prod_{d=1}^{D-1}\sin^{D-d}(\theta_d)\,d\varvec{\theta}_{\mathscr{R}}\).
Jacobian of the Transformation Between q-Norm Domains
The following proposition gives the weights needed for the transformation from \(\mathscr {Q}^D\) to \(\mathscr {B}_\mathbf{0}^D(1)\).
Proposition 2.7.4
The Jacobian determinant (weight) of \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is as follows:
Proof
Note
The Jacobian matrix for \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is
Therefore the Jacobian determinant of \(T_{\mathscr {B}\rightarrow \mathscr {Q}}\) is
\(\square \)
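The following sketch illustrates the transformation numerically. The coordinate-wise form \(\beta_d={{\mathrm{sign}}}(\theta_d)|\theta_d|^{2/q}\) used here is our reading of \(T_{\mathscr{B}\rightarrow\mathscr{Q}}\), so treat it as an assumption to be checked against the chapter's definition; since the map acts coordinate-wise, its Jacobian determinant factorizes accordingly.

```python
import numpy as np

q = 1.5
def T_B_to_Q(theta):
    """Assumed coordinate-wise map: unit 2-norm ball → unit q-norm domain."""
    return np.sign(theta) * np.abs(theta) ** (2 / q)

rng = np.random.default_rng(6)
D = 3
theta = rng.uniform(0.1, 0.5, size=D) * rng.choice([-1, 1], size=D)

# Diagonal map ⇒ Jacobian determinant ∏_d (2/q) |θ_d|^{2/q − 1}
jac_analytic = np.prod(2 / q * np.abs(theta) ** (2 / q - 1))

h = 1e-6
J = np.column_stack([(T_B_to_Q(theta + h * e) - T_B_to_Q(theta - h * e)) / (2 * h)
                     for e in np.eye(D)])
print(np.isclose(np.linalg.det(J), jac_analytic))   # True

beta = T_B_to_Q(theta)
print(np.sum(np.abs(beta) ** q) <= 1)               # β stays in the q-norm domain
```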
The following proposition gives the weights needed for the change of domains from \(\mathscr {R}^D\) to \(\mathscr {B}_\mathbf{0}^D(1)\).
Proposition 2.7.5
The Jacobian determinant (weight) of \(T_{\mathscr {B}\rightarrow \mathscr {R}}\) is as follows:
Proof
First, we note
The corresponding Jacobian matrices are
where \(\mathbf{e}_{\arg\max|\varvec{\theta}|}\) is the vector whose \((\arg\max|\varvec{\theta}|)\)-th element is 1 and all others are 0. Therefore,
\(\square \)
Splitting Hamiltonian (Lagrangian) Dynamics on \(\mathscr{S}^D\)
Splitting Hamiltonian dynamics to improve HMC is a well-studied topic [13, 36, 56]. Splitting the Lagrangian dynamics used in our approach, on the other hand, has not, to the best of our knowledge, been discussed in the literature. Therefore, we prove the validity of our splitting method by starting with the well-understood method of splitting the Hamiltonian [13],
The corresponding systems of differential equations,
can be written in terms of Lagrangian dynamics in \((\varvec{\theta }, \mathbf{{v}})\) as follows:
We have solved the second dynamics (on the right) in Appendix “Geodesic on a Sphere in Cartesian Coordinates”. To solve the first dynamics, we note that
Therefore, we have
where \(\begin{bmatrix} \mathbf{I}\\ -\frac{{\varvec{\theta }(0)}^{\mathsf {T}}}{\theta _{D+1}(0)}\end{bmatrix} [\mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}]=\begin{bmatrix} \mathbf{I}-\varvec{\theta }(0){\varvec{\theta }(0)}^{\mathsf {T}}\\ -\theta _{D+1}(0) {\varvec{\theta }(0)}^{\mathsf {T}}\end{bmatrix} = \begin{bmatrix} \mathbf{I}\\ \mathbf{0}^{\mathsf {T}}\end{bmatrix} - \tilde{\varvec{\theta }}(0) {\varvec{\theta }(0)}^{\mathsf {T}}\).
Finally, we note that \(\Vert \tilde{\varvec{\theta }}(t)\Vert _2=1\) if \(\Vert \tilde{\varvec{\theta }}(0)\Vert _2=1\) and \(\tilde{\mathbf{{v}}}(t)\in T_{\tilde{\varvec{\theta }}(t)} \mathscr {S}_c^D\) if \(\tilde{\mathbf{{v}}}(0)\in T_{\tilde{\varvec{\theta }}(0)} \mathscr {S}_c^D\).
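Putting the two split dynamics together gives a leapfrog-style integrator on the sphere. The sketch below is our reconstruction with a toy potential \(U(\varvec{\theta})=\frac{1}{2}\Vert\varvec{\theta}\Vert_2^2\) (not the chapter's code) and checks the two closing invariants:

```python
import numpy as np

def grad_U_tilde(theta_t):
    # Toy potential U(θ) = ½‖θ‖²; gradient padded to (∇_θ U, 0) ∈ R^{D+1}
    return np.append(theta_t[:-1], 0.0)

def split_step(theta_t, v_t, eps):
    """Half velocity update (first dynamics) + exact geodesic flow (second)."""
    project = lambda th, g: g - th * (th @ g)        # onto the tangent space T_θ̃ S^D
    v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
    nv = np.linalg.norm(v_t)
    theta_t, v_t = (theta_t * np.cos(nv * eps) + v_t / nv * np.sin(nv * eps),
                    -theta_t * nv * np.sin(nv * eps) + v_t * np.cos(nv * eps))
    v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
    return theta_t, v_t

rng = np.random.default_rng(7)
theta = rng.normal(size=5); theta /= np.linalg.norm(theta)
v = rng.normal(size=5); v -= theta * (theta @ v)

for _ in range(100):
    theta, v = split_step(theta, v, 0.05)
print(np.isclose(np.linalg.norm(theta), 1.0),        # ‖θ̃(t)‖₂ = 1 preserved
      np.isclose(theta @ v, 0.0))                    # ṽ(t) stays in T_θ̃ S^D
```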
Error Analysis of Spherical HMC
Following [36], we now show that the discretization error \(e_n=\Vert\mathbf{z}(t_n)-\mathbf{z}^{(n)}\Vert=\Vert({\varvec{\theta}}(t_n),\mathbf{v}(t_n))-({\varvec{\theta}}^{(n)},\mathbf{v}^{(n)})\Vert\) (i.e., the difference between the true solution and the numerical solution) is \(\mathscr{O}(\varepsilon^3)\) locally and \(\mathscr{O}(\varepsilon^2)\) globally, where \(\varepsilon\) is the discretization step size. Here, we assume that \(\mathbf{f}({\varvec{\theta}},\mathbf{v}):=\mathbf{v}^{\mathsf{T}}\varvec{\varGamma}({\varvec{\theta}})\mathbf{v}+\mathbf{G}({\varvec{\theta}})^{-1}\nabla_{\varvec{\theta}}U({\varvec{\theta}})\) is smooth; hence, \(\mathbf{f}\) and its derivatives are uniformly bounded as \(\mathbf{z}=({\varvec{\theta}},\mathbf{v})\) evolves within a finite time duration T. We expand the true solution \(\mathbf{z}(t_{n+1})\) around \(t_n\):
We first consider Spherical HMC in Cartesian coordinates, where \(\mathbf{{f}}({\varvec{\theta }},\mathbf{{v}})=\Vert \tilde{\mathbf{{v}}}\Vert ^2\varvec{\theta } + [\mathbf{I}-\varvec{\theta }{\varvec{\theta }}^{\mathsf {T}}] \nabla _{\varvec{\theta }} U(\varvec{\theta })\). From Eq. (2.34) we have
Now we expand Eq. (2.35) using Taylor series as follows:
Substituting (2.77) in the above equations yields
With the above results, we have
where for the last equality we need to show \({(\mathbf{{v}}^{(n)})}^{\mathsf {T}} \nabla _{\varvec{\theta }} U(\varvec{\theta }^{(n)})= -2\frac{d}{dt}\Vert \tilde{\mathbf{{v}}}^{(n)}\Vert ^2\). This can be proved as follows:
Therefore we have
The local error is
where \(M_k = c_k\sup _{t\in [0,T]}\Vert \nabla ^k \mathbf{f}({\varvec{\theta }}(t),\mathbf{{v}}(t))\Vert ,\; k=1,2\) for some constants \(c_k>0\). Accumulating the local errors by iterating the above inequality for \(L=T/\varepsilon \) steps provides the following global error:
For Spherical HMC in spherical coordinates, we conjecture that the integrator of Algorithm 2 still has third-order local error and second-order global error; one can follow the same argument as above to verify this.
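The claimed orders can also be probed numerically. The sketch below (ours, reusing the toy integrator from the previous section) compares trajectories against a fine-step reference; consecutive error ratios near 4 under step-halving indicate \(\mathscr{O}(\varepsilon^2)\) global error.

```python
import numpy as np

def grad_U_tilde(th):                      # toy potential U(θ) = ½‖θ‖²
    return np.append(th[:-1], 0.0)

def integrate(theta_t, v_t, eps, L):
    project = lambda th, g: g - th * (th @ g)
    for _ in range(L):
        v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
        nv = np.linalg.norm(v_t)
        theta_t, v_t = (theta_t * np.cos(nv * eps) + v_t / nv * np.sin(nv * eps),
                        -theta_t * nv * np.sin(nv * eps) + v_t * np.cos(nv * eps))
        v_t = v_t - eps / 2 * project(theta_t, grad_U_tilde(theta_t))
    return theta_t

rng = np.random.default_rng(8)
theta = rng.normal(size=4); theta /= np.linalg.norm(theta)
v = rng.normal(size=4); v -= theta * (theta @ v)

T = 1.0
ref = integrate(theta, v, T / 2**14, 2**14)          # fine-step reference solution
errs = [np.linalg.norm(integrate(theta, v, T / 2**k, 2**k) - ref) for k in (5, 6, 7)]
print([e1 / e2 for e1, e2 in zip(errs, errs[1:])])   # ratios ≈ 4 ⇒ O(ε²) global
```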
Bounce in Diamond: Wall HMC for 1-Norm Constraint
Reference [46] discusses the Wall HMC method for the \(\infty\)-norm constraint only; however, a similar approach can be derived for the 1-norm constraint. As shown in the left panel of Fig. 2.13, given the current state \(\varvec{\theta}_0\), HMC makes a proposal \(\varvec{\theta}\). The sampler may hit the boundary while moving from \(\varvec{\theta}_0\) towards \(\varvec{\theta}\). To determine the hit point 'X', we solve for \(t\in(0,1)\) such that
One can find the hitting time using the bisection method. A more efficient method, however, is to find the orthant in which the sampler hits the boundary, i.e., to find the normal direction \(\mathbf{n}\) with elements \(\pm 1\). Then we can find t,
Therefore the hit point is \(\varvec{\theta }_0'=\varvec{\theta }_0+(\varvec{\theta }-\varvec{\theta }_0)t^*\) and consequently the reflection point is
where \(\mathbf{n}^*:=\mathbf{n}/\Vert \mathbf{n}\Vert _2\) and \(\mathbf{n}^{\mathsf {T}}\varvec{\theta }_0'=1\) because \(\varvec{\theta }_0'\) is on the boundary with the normal direction \(\mathbf{n}^*\).
In general, it is difficult to directly determine the intersection of the segment \(\varvec{\theta}-\varvec{\theta}_0\) with the boundary. Instead, we can find its intersections with the coordinate planes \(\{\varvec{\pi}_d\}_{d=1}^D\), where \(\varvec{\pi}_d:=\{\varvec{\theta}\in\mathbb{R}^D \mid \theta^d=0\}\). The intersection times are \(\mathbf{T}=\{\theta_0^d/(\theta_0^d-\theta^d)\mid\theta_0^d\ne\theta^d\}\). We keep those between 0 and 1 and sort them in ascending order (Fig. 2.13, right panel). Then, we find the intersection points \(\{\varvec{\theta}_k:=\varvec{\theta}_0+(\varvec{\theta}-\varvec{\theta}_0)T_k\}\) that violate the constraint \(\Vert\varvec{\theta}\Vert_1\le 1\). Denote the first intersection point outside the constrained domain as \(\varvec{\theta}_k\). The signs of \(\varvec{\theta}_k\) and \(\varvec{\theta}_{k-1}\) determine the orthant of the hit point \(\varvec{\theta}_0'\).
Note that for each \(d\in\{1,\ldots,D\}\), \(({{\mathrm{sign}}}(\varvec{\theta}_k^d),{{\mathrm{sign}}}(\varvec{\theta}_{k-1}^d))\) cannot be \((+,-)\) or \((-,+)\); otherwise there would exist an intersection point \(\varvec{\theta}^*:=\varvec{\theta}_0+(\varvec{\theta}-\varvec{\theta}_0)T^*\) with some coordinate plane \(\varvec{\pi}_{d^*}\) between \(\varvec{\theta}_{k-1}\) and \(\varvec{\theta}_k\), and \(T_{k-1}<T^*<T_k\) would contradict the ordering of \(\mathbf{T}\) (see footnote 4). Therefore any point (including \(\varvec{\theta}_0'\)) between \(\varvec{\theta}_{k-1}\) and \(\varvec{\theta}_k\) must have the same sign as \({{\mathrm{sign}}}({{\mathrm{sign}}}(\varvec{\theta}_k)+{{\mathrm{sign}}}(\varvec{\theta}_{k-1}))\); that is
After moving from \(\varvec{\theta}\) to \(\varvec{\theta}'\), we check whether \(\varvec{\theta}'\) satisfies the constraint. If it does not, we repeat the above procedure with \(\varvec{\theta}_0\leftarrow\varvec{\theta}_0'\) and \(\varvec{\theta}\leftarrow\varvec{\theta}'\) until the final state falls inside the constrained domain. Finally, we adjust the velocity direction by
Algorithm 4 summarizes the above steps.
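The sketch below is our reading of this reflection procedure (helper names are ours; the final velocity adjustment is omitted). It keeps bouncing the segment \(\varvec{\theta}_0\rightarrow\varvec{\theta}\) off the faces of the diamond \(\Vert\varvec{\theta}\Vert_1\le 1\) until the endpoint lands inside.

```python
import numpy as np

def bounce_in_diamond(theta0, theta):
    """Reflect the segment θ0 → θ off the faces of {‖θ‖₁ ≤ 1} until it ends inside."""
    while np.sum(np.abs(theta)) > 1:
        d = theta - theta0
        # Crossing times with the coordinate planes {θ^d = 0}, kept if in (0, 1)
        with np.errstate(divide='ignore', invalid='ignore'):
            T = theta0 / (theta0 - theta)
        ts = np.concatenate([np.sort(T[(T > 0) & (T < 1)]), [1.0]])
        pts = theta0[None, :] + np.outer(ts, d)
        # First point along the segment that violates the 1-norm constraint
        k = np.argmax(np.sum(np.abs(pts), axis=1) > 1)
        prev = theta0 if k == 0 else theta0 + d * ts[k - 1]
        # Orthant (face normal, entries ±1) from the signs on both sides of the hit
        n = np.sign(np.sign(pts[k]) + np.sign(prev))
        t_star = (1 - n @ theta0) / (n @ d)           # the face satisfies nᵀθ = 1
        hit = theta0 + d * t_star
        n_unit = n / np.linalg.norm(n)
        # Reflect the remaining part of the segment across the face
        theta0, theta = hit, theta - 2 * n_unit * (n_unit @ (theta - hit))
    return theta0, theta

rng = np.random.default_rng(9)
theta0 = rng.uniform(-0.3, 0.3, size=2)               # current state, inside
proposal = theta0 + rng.normal(scale=0.8, size=2)     # possibly outside
_, final = bounce_in_diamond(theta0, proposal)
print(np.sum(np.abs(final)) <= 1 + 1e-12)             # final state is inside
```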
© 2016 Springer International Publishing Switzerland

Cite this chapter
Lan, S., Shahbaba, B. (2016). Sampling Constrained Probability Distributions Using Spherical Augmentation. In: Minh, H., Murino, V. (eds) Algorithmic Advances in Riemannian Geometry and Applications. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-45026-1_2

DOI: https://doi.org/10.1007/978-3-319-45026-1_2
Print ISBN: 978-3-319-45025-4
Online ISBN: 978-3-319-45026-1