Abstract
Hamiltonian Monte Carlo (HMC) algorithms, which combine numerical approximation of Hamiltonian dynamics on finite intervals with stochastic refreshment and Metropolis correction, are popular sampling schemes, but it is known that they may suffer from slow convergence in the continuous time limit. A recent paper of Bou-Rabee and Sanz-Serna (Ann Appl Prob, 27:2159-2194, 2017) demonstrated that this issue can be addressed by simply randomizing the duration parameter of the Hamiltonian paths. In this article, we use the same idea to enhance the sampling efficiency of a constrained version of HMC, with potential benefits in a variety of application settings. We demonstrate both the conservation of the stationary distribution and the ergodicity of the method. We also compare the performance of various schemes in numerical studies of model problems, including an application to high-dimensional covariance estimation.
1 Introduction and motivation
Efficient sampling of high dimensional probability distributions is required for Bayesian inference and is a challenge in many fields including biological modelling (Wilkinson 2007), economic modelling (Greenberg 2012), machine learning with large data sets (Pakman et al. 2017; Barber 2012) and molecular dynamics (Perez et al. 2015). A popular approach is Markov chain Monte Carlo, which defines a Markov chain \(X_{i+1} \sim p(\cdot \mid X_{i})\) with invariant measure \(\mu \) and from which we may estimate expected values from the relation \(\mathbb {E}_{X \sim \mu } f(X) \approx \frac{1}{N}\sum ^{N}_{i=1}f(X_{i})\); however, convergence of such averages can be slow for high dimensional and multimodal distributions (see e.g. Quiroz et al. (2018)). Recent attempts to address this problem include the local bouncy particle sampler of Bouchard-Côté et al. (2018) and the Zig-Zag process of Bierkens et al. (2019). These methods can be viewed as piecewise deterministic Markov processes (PDMPs), see (Vanetti et al. 2018). The Randomized Hamiltonian Monte Carlo (RHMC), proposed in Bou-Rabee and Sanz-Serna (2017) and further studied in Deligiannidis et al. (2021), evolves a Hamiltonian flow for a duration drawn from an exponential distribution. In standard HMC the choice of integration time is a challenging task (see Hoffman and Gelman 2014) and mixing can be inefficient for some choices of integration time. By contrast, RHMC does not suffer from this problem as randomization of the duration prevents periodicities. This strategy has been studied from both analytic and numerical perspectives in Bou-Rabee and Sanz-Serna (2017). Other recent algorithms have been proposed which build on this idea (for example Riou-Durand and Vogrinc (2022) and Kleppe (2022)). We remark that RHMC is a special case of Andersen dynamics, which is popular in the molecular dynamics literature (see (Bou-Rabee and Eberle 2022)[Remark 2.2] and Andersen (1980)).
Andersen dynamics has been studied in Bou-Rabee and Eberle (2022); Weinan and Li (2008) and Li (2007).
The algorithms discussed above are targeted to sampling from distributions in Euclidean space. The need to work with Riemannian manifolds is motivated by applications where constraints are imposed from modelling considerations or are introduced in order to restrict sampling to a relevant subdomain derived from statistical analysis (see (Brubaker et al. 2012)). Examples of manifolds include products of spheres or orthogonal matrices which arise in applications in protein configuration modelling with the Fisher-Bingham distribution (Hamelryck et al. 2006), texture analysis using distributions over rotations (Kunze and Schaeben 2004) and fixed-rank matrix factorization for collaborative filtering (Meyer et al. 2011; Salakhutdinov and Mnih 2008). Methods that sample from probability distributions on manifolds have been considered and studied in Hartmann (2008); Brubaker et al. (2012); Byrne and Girolami (2013); Girolami and Calderhead (2011); Lelièvre et al. (2012); Zappa et al. (2018); Diaconis et al. (2013); Lee and Vempala (2018); Lelièvre et al. (2019) and Laurent and Vilmart (2021). In this article, we focus on manifolds defined by algebraic constraints. In order to maintain the constraints, in practice one needs to perform projections at each step of the algorithm, an additional overhead compared to Euclidean MCMC algorithms.
In this paper we propose the Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC) method, an RHMC scheme for Riemannian manifolds. We establish invariance of the desired measure, under a compactness assumption, for the continuous-time (small-stepsize limit) PDMP version of our method, where the algorithm is rejection free. Our approach to proving invariance is based on the PDMP framework established in Durmus et al. (2021): we construct an approximation of RT-RMHMC by consistently truncating the velocity distribution and verify all the conditions required in Durmus et al. (2021) for the approximation; we have not found such a construction technique in the literature. Further, we demonstrate the invariance of the discretized method with Metropolis-Hastings adjustment and prove its ergodicity. We show in numerical experiments that this method has improved robustness, demonstrating for example that the convergence rate is relatively flat in the choice of mean time parameter; these results mirror those obtained for the Euclidean version of the method. Moreover, we compare RT-RMHMC to a constrained underdamped Langevin integrator (g-BAOAB) introduced in Leimkuhler and Matthews (2016).
To our knowledge, there is no theoretical or numerical treatment of RHMC in the manifold setting, and there has been no theoretical treatment of Riemannian Hamiltonian Monte Carlo methods in the continuous time setting. We provide a first result establishing invariance of a continuous time Riemannian Hamiltonian Monte Carlo method in the compact setting. A biased RHMC method was recently introduced (see (Kleppe 2022)) which has event rates that depend on the position in the state space; these state dependent event rates can be incorporated into our Riemannian RHMC framework when the framework is unadjusted. We note that in the appendix of that article, a version of RHMC is introduced in the setting of adapting the metric for sampling on Euclidean space, but not for working on a Riemannian manifold.
The remainder of this article is organized as follows. In the next section we describe the algorithm and provide invariance in the continuous time setting under a compactness assumption. Section 3 considers the numerical implementation with and without Metropolis test. Section 4 provides conservation of the stationary distribution of the discretized algorithm and the ergodicity of the method with Metropolis-Hastings adjustment. Section 5 discusses numerical experiments and Section 6 gives some thoughts on future developments. We include several appendices addressing the generator, the invariance of the target measure and the irreducibility of the scheme, from which ergodicity follows.
2 Algorithm
Let \((\mathcal {M},g)\) be a d-dimensional Riemannian manifold and \(T\mathcal {M}\) denote its tangent bundle. Let G(x) denote the positive definite matrix associated to the metric g at \(x \in \mathcal {M}\). Consider a target distribution on \(\mathcal {M}\) with density
\[\pi _{\mathcal {H}}(x) = Z_{\mathcal {M}}^{-1}\exp {(-U_{\mathcal {H}}(x))}\]
with respect to \(\sigma _{\mathcal {M}}(dx)\), the surface measure (Hausdorff measure) of \(\mathcal {M}\) defined by \(\sigma _{\mathcal {M}}(dx) = \sqrt{\det {G(x)}}dx\) and \(Z_{\mathcal {M}} = \int _{\mathcal {M}}\exp {(- U_{\mathcal {H}}(x))}\sigma _{\mathcal {M}}(dx)\), which we assume to be finite. Consider an extension of the distribution to \(T\mathcal {M}\) as
\[\mu (dz) = Z_{T\mathcal {M}}^{-1}\exp {(-H(x,v))}\,\lambda _{T\mathcal {M}}(dz),\]
where \(\lambda _{T\mathcal {M}}(dz)\) is the Liouville measure of \(T\mathcal {M}\), H is defined by
\[H(x,v) = U_{\mathcal {H}}(x) + \tfrac{1}{2}v^{T}G(x)^{-1}v\]
for \((x,v) \in T\mathcal {M}\) and \(Z_{T\mathcal {M}} = \int _{T\mathcal {M}}\exp {(-H(x,v))}\)\(\lambda _{T \mathcal {M}}(dz),\) which is finite when \(Z_{\mathcal {M}}\) is. We have that
\[\mu (dz) = \pi _{\mathcal {H}}(dx)\,\psi (x)(dv),\]
where \(\psi (x)(dv)\) is simply the Gaussian measure on \(T_{x}\mathcal {M}\) given by
\[\psi (x)(dv) = (2\pi )^{-d/2}\det (G(x))^{-1/2}\exp {\left( -\tfrac{1}{2}v^{T}G(x)^{-1}v\right) }\,\sigma _{T_{x}\mathcal {M}}(dv)\]
in local coordinates and \(\sigma _{T_{x}\mathcal {M}}(dv)\) is the Lebesgue measure on \(T_{x}\mathcal {M}\). In particular we have that \(\mu \) has marginal distribution \(\pi _{\mathcal {H}}\) with respect to the Hausdorff measure (Girolami and Calderhead 2011; Byrne and Girolami 2013; Lelievre et al. 2010[Section 3.3.2]).
We will define a stochastic process which is a Riemannian version of the Randomized Hamiltonian Monte Carlo of Bou-Rabee and Sanz-Serna (2017). The stochastic process follows constrained Hamiltonian dynamics for a random duration \(t \sim \exp {(\lambda )}\), for some rate \(\lambda > 0\), before an event. This event is a random velocity refreshment under the distribution \(\psi (x)\).
Algorithm 1 defines Randomized time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC) with rate parameter \(\lambda > 0\), and Hamiltonian dynamics governed by the Hamiltonian \(H(x,v) = U_{\mathcal {H}}(x) + \frac{1}{2}v^{T}G(x)^{-1}v\) defined on \(T\mathcal {M}\). This stochastic process has invariant measure \(\mu (z) = \exp {(-H(z))}\) with respect to the Liouville measure on \(T\mathcal {M}\).
To sample from a distribution \(\pi \) with respect to the Hausdorff measure we define \(U = - \log {\pi }\) under the assumption that \(\pi \) is integrable on \(\mathcal {M}\).
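One pass of this process can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes the exact Hamiltonian flow is available as a callable `flow` (an idealisation; in practice it is approximated numerically, see Section 3), and `sample_velocity` draws from the Gaussian \(\psi(x)\) on the tangent space.

```python
import numpy as np

def rt_rmhmc(x0, v0, flow, sample_velocity, rate, n_events, rng=None):
    """Sketch of RT-RMHMC: follow the Hamiltonian flow for an
    Exp(rate)-distributed duration, then fully refresh the velocity.

    `flow(x, v, t)` is assumed to return the exact Hamiltonian flow at
    time t; `sample_velocity(x)` draws from psi(x) on the tangent space.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, v = np.asarray(x0, float), np.asarray(v0, float)
    samples = []
    for _ in range(n_events):
        t = rng.exponential(1.0 / rate)   # random duration ~ Exp(rate)
        x, v = flow(x, v, t)              # deterministic Hamiltonian motion
        v = sample_velocity(x)            # velocity refreshment event
        samples.append(x.copy())
    return np.array(samples)
```

For instance, on the unit circle with zero potential the exact flow is a rotation, and the chain remains on the manifold by construction.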
We can define the generator for this stochastic process as
\[\mathcal {L}f(z) = X_{H}f(z) + \lambda \left( Qf(z) - f(z)\right) ,\]
where
\[Qf(x,v) = \int _{T_{x}\mathcal {M}}f(x,v')\,\psi (x)(dv')\]
is the transition kernel for a completely randomized velocity refreshment according to a Gaussian distribution on the tangent space \(T_{x}\mathcal {M}\) and \(X_{H}\) is the Hamiltonian vector field associated to H, which may be defined by \(X_{H} = \left( \frac{\partial H}{\partial v_{i}},-\frac{\partial H}{\partial x_{i}}\right) \) in local coordinates. In the Appendix, we will prove that this is the generator of this stochastic process in Section A and invariance of the measure in Section B under a compactness assumption. Our main theoretical result about Algorithm 1 is the following.
Corollary 1
(Invariant measure for RT-RMHMC) Let \((P_{t})_{t \ge 0}\) be the transition semigroup of a simulation of Algorithm 1 with characteristics \((\varphi ,\lambda ,Q)\) on \(T\mathcal {M}\) and Hamiltonian \(H \in C^{2}(T\mathcal {M})\), where \((\mathcal {M},g)\) is a compact smooth Riemannian manifold and \(\varphi \) is the Hamiltonian flow associated to the Hamiltonian. Let \(\mu \) be the measure on \((T\mathcal {M}, \mathcal {B}(T\mathcal {M}))\) given by
\[\mu (dz) = Z_{T\mathcal {M}}^{-1}\exp {(-H(z))}\,\lambda _{T\mathcal {M}}(dz),\]
where \(d\lambda _{T\mathcal {M}}\) is the Liouville measure of \(T\mathcal {M}\). Then \(\mu \) is invariant for RT-RMHMC.
3 Constrained symplectic integrator and Metropolis-Hastings adjustment
In this section, we will state some more broadly implementable versions of Algorithm 1 that are applicable when the Hamiltonian dynamics cannot be solved exactly. We start with a brief introduction to Lagrangian and Hamiltonian dynamics with constraints based on Lee et al. (2017)[Chapter 3].
Consider manifolds \(\mathcal {M}\) embedded in \(\mathbb {R}^{d}\) that can be described by algebraic equations
\[c_{i}(x) = 0, \qquad i = 1,\dots ,m,\]
where \(c_{i}: \mathbb {R}^{d} \rightarrow \mathbb {R}\), \(i= 1,...,m\), are continuously differentiable functions with linearly independent gradients for all \(x \in \mathcal {M}\).
We refer to such a submanifold as an algebraic constraint manifold. We can express the Euler-Lagrange equations as an orthogonal projection of the Euler-Lagrange equations in \(\mathbb {R}^{d}\) onto the constraint manifold, hence we have
\[\frac{d}{dt}\frac{\partial L}{\partial {\dot{x}}} - \frac{\partial L}{\partial x} + \sum ^{m}_{i=1}\lambda _{i}\nabla c_{i}(x) = 0, \qquad c(x) = 0,\]
where \(\lambda _{i}\) are Lagrange multipliers for each of the constraints. We can then define an augmented Lagrangian function \(L^{a}:T\mathcal {M} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) by \(L^{a}(x,{\dot{x}},\lambda ) = L(x,{\dot{x}}) - \sum ^{m}_{i=1} \lambda _{i}c_{i}(x)\). Then the Euler-Lagrange equations can be expressed as
\[\frac{d}{dt}\frac{\partial L^{a}}{\partial {\dot{x}}} - \frac{\partial L^{a}}{\partial x} = 0, \qquad c(x) = 0,\]
and the augmented Hamiltonian function \(H^{a}:T^{*}\mathcal {M} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) as \(H^{a}(x,\mu ,\lambda ) = \mu \cdot {\dot{x}} - L^{a}(x,{\dot{x}},\lambda )\), and we therefore obtain Hamilton’s equations (see Hartmann 2007)
\[{\dot{x}} = \frac{\partial H^{a}}{\partial \mu }, \qquad {\dot{\mu }} = -\frac{\partial H^{a}}{\partial x}, \qquad c(x) = 0.\]
We next introduce a new formulation of RT-RMHMC for constraint manifolds which we will use for numerical simulation. Note that a constraint manifold Hamiltonian Monte Carlo method was introduced in Brubaker et al. (2012), but with a deterministic duration parameter. We will use the same notation as that used in Brubaker et al. (2012) to introduce randomized time into this algorithm.
Let us denote our constraints \(c(x):= (c_{1}(x),...,c_{m}(x))^{T}\) and let \(C(x) = \frac{\partial c}{\partial x}\) denote the Jacobian of the constraints, which we assume to have full rank everywhere. Define a Hamiltonian of the constrained system as \(H(x,v) = U_{\mathcal {H}}(x) + K(v)\), where \(K(v) = \frac{1}{2}v^{T}G(x)^{-1}v\) is the kinetic energy and v lies in the cotangent space, \(\mathcal {T}^{*}_{x}\mathcal {M} = \{ v \mid C(x) \frac{\partial H}{\partial v}(x,v) = 0\}\). The dynamics of the constrained system in terms of the Hamiltonian is thus given by
\[{\dot{x}} = \frac{\partial H}{\partial v}, \qquad {\dot{v}} = -\frac{\partial H}{\partial x} - C(x)^{T}\lambda , \qquad c(x) = 0,\]
where we remark that we can naturally identify the tangent and cotangent spaces and bundles.
We let \(\pi _{\mathcal {H}}\) be our target measure with respect to the Hausdorff measure and \(U_\mathcal {H}(x) = -\log \pi _{\mathcal {H}}(x)\) be the potential energy of our constrained system. We can then simulate the constrained Hamiltonian dynamics using the RATTLE scheme (Andersen 1983). However, if we know \(\pi _{\mathcal {H}}\) explicitly we can avoid computation of the metric tensor by assuming our system is isometrically embedded in Euclidean space. Under this assumption we can then consider Algorithm 2, which is an explicit algorithm for simulation of Randomized time constrained Hamiltonian Monte Carlo (RT-CHMC). We will discuss and justify the embedding assumption further in Sect. 3.1.
In Algorithm 2 we sample the Gaussian distribution on the tangent space at a point on \(\mathcal {M}\). We can do this by sampling a vector whose components are independent standard normal random variables and then projecting this orthogonally. To orthogonally project a momentum vector onto \(T^{*}\mathcal {M}\) and correctly resample the momentum in Algorithm 2 at \(x \in \mathcal {M}\) we apply the projector
\[P_{\mathcal {M}}(x) = I - C(x)^{T}\left( C(x)C(x)^{T}\right) ^{-1}C(x).\]
Proposition 1
If \(v' \sim \mathcal {N}(0,I)\) then \(v = P_{\mathcal {M}}(x)v'\) is distributed according to \(v \sim \mathcal {N}(0,I \mid C(x)v = 0)\).
Proof
Can be found in Graham et al. (2022). \(\square \)
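Proposition 1 can be checked numerically. The following is a minimal sketch assuming the standard orthogonal projector \(P = I - C^{T}(CC^{T})^{-1}C\) onto \(\{v : C(x)v = 0\}\); the function names are illustrative, and the test constraint is the unit sphere with \(C(x) = 2x^{T}\).

```python
import numpy as np

def tangent_projector(C):
    """Orthogonal projector P = I - C^T (C C^T)^{-1} C onto {v : C v = 0}."""
    C = np.atleast_2d(C)
    n = C.shape[1]
    return np.eye(n) - C.T @ np.linalg.solve(C @ C.T, C)

def sample_tangent_gaussian(x, jac_c, rng):
    """Draw v' ~ N(0, I) in the ambient space and project onto the
    tangent space at x, as in Proposition 1."""
    P = tangent_projector(jac_c(x))
    return P @ rng.standard_normal(x.shape[0])
```

The projector is symmetric and idempotent, so the projected vector is exactly tangent (up to floating point error) at the cost of one small linear solve.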
3.1 Embedded manifolds
We next introduce the theory of manifold embeddings as it was presented in Byrne and Girolami (2013) to show that numerical simulation of RT-CHMC is in fact simulation of RT-RMHMC on constraint manifolds.
If we know the form of the distribution \(\pi _{\mathcal {H}}\) with respect to the Hausdorff measure, then we can avoid the computation of the metric tensor and the lack of a global coordinate system (Byrne and Girolami 2013). We achieve this using isometric embeddings, remarking that every Riemannian manifold can be isometrically embedded in Euclidean space due to the Nash embedding theorem (Nash 1956). If we have an isometric embedding \(\xi : \mathcal {M} \rightarrow \mathbb {R}^{n}\), then considering a path q(t) on \(\mathcal {M}\), the path \(x(t) = \xi (q(t))\) is such that \({\dot{x}}_{i}(t) = \sum _{j} \frac{\partial x_{i}}{\partial q_{j}} {\dot{q}}_{j}(t)\). The phase space (q, p), where \({\dot{q}} = G^{-1}p,\) can then be transformed to the embedded phase space (x, v), where
\[v = X{\dot{q}} = XG^{-1}p, \qquad X_{ij} = \frac{\partial x_{i}}{\partial q_{j}},\]
since \(G = X^{T}X\) due to the fact that the embedding is isometric and preserves inner products (see (Byrne and Girolami 2013)). Now the Hamiltonian (Eq. 2) is
\[H(x,v) = U_{\mathcal {H}}(x) + \tfrac{1}{2}v^{T}v\]
in terms of coordinates (x, v). When considering sampling of the velocities in Algorithm 1 and Algorithm 2, since \(p \sim \mathcal {N}(0,G(q))\), we have
\[v = XG^{-1}p \sim \mathcal {N}\left( 0, X(X^{T}X)^{-1}X^{T}\right) ,\]
where \(X(X^{T}X)^{-1}X^{T}\) is the orthogonal projection onto the tangent space of the embedded manifold (Byrne and Girolami 2013). Therefore we can sample from \(\mathcal {N}(0,I)\) and project onto the tangent space to obtain a necessary sample. The Hamiltonian is thus expressed in a form which is independent of the metric (provided we know the density with respect to the Hausdorff measure). We next introduce the RATTLE scheme for numerical integration on a manifold (Leimkuhler and Reich 2004)[Chapter 7]:
\[\begin{aligned} v_{n+1/2} &= v_{n} - \tfrac{\Delta t}{2}\nabla U_{\mathcal {H}}(x_{n}) - C(x_{n})^{T}\lambda ^{n}_{(r)},\\ x_{n+1} &= x_{n} + \Delta t\,v_{n+1/2}, \qquad c(x_{n+1}) = 0,\\ v_{n+1} &= v_{n+1/2} - \tfrac{\Delta t}{2}\nabla U_{\mathcal {H}}(x_{n+1}) - C(x_{n+1})^{T}\lambda ^{n+1}_{(v)}, \qquad C(x_{n+1})v_{n+1} = 0, \end{aligned}\]
where we solve for \(\lambda ^{n}_{(r)}\) and \(\lambda ^{n+1}_{(v)}\) at each iteration so that the iterates lie in the tangent bundle. We solve for \(\lambda ^{n}_{(r)}\) (a non-linear system of equations) by cycling through the constraints, adjusting one multiplier at each iteration. Denote by \(C_{i}\) the ith row of C. We first initialize
\[Q := x_{n} + \Delta t\left( v_{n} - \tfrac{\Delta t}{2}\nabla U_{\mathcal {H}}(x_{n})\right) .\]
Next we cycle through the list of constraints one after another as follows: for each \(i = 1,...,m\) compute
\[\Delta \Lambda _{i} = \frac{c_{i}(Q)}{C_{i}(Q)\,C_{i}(x_{n})^{T}}\]
and update Q by \(Q:= Q - C_{i}(x_{n})^{T}\Delta \Lambda _{i}\) until \(|c_{i}(Q)|<tol\) for all \(i =1,...,m\), where tol is a certain prescribed tolerance. Then we set \(x_{n+1} = Q\) and have \(x_{n+1} \in \mathcal {M}\) within the tolerance. (Note that other stopping criteria could be used (see (Ortega and Rheinboldt 2000)).) We solve for \(\lambda ^{n+1}_{(v)}\) by solving the linear system:
\[C(x_{n+1})\left( v_{n+1/2} - \tfrac{\Delta t}{2}\nabla U_{\mathcal {H}}(x_{n+1}) - C(x_{n+1})^{T}\lambda ^{n+1}_{(v)}\right) = 0.\]
Once the linear system has been solved we obtain \((x_{n+1},\)\(v_{n+1}) \in T^{*}\mathcal {M}\).
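The RATTLE step with the SHAKE-style sweep over the constraints described above can be sketched as follows. This is an illustrative implementation for a generic constraint function, not the authors' code; the convergence safeguards (`tol`, `max_iter`) are assumptions.

```python
import numpy as np

def rattle_step(x, v, grad_U, c, jac_c, dt, tol=1e-10, max_iter=100):
    """One RATTLE step on {c(x) = 0}: a Gauss-Seidel sweep over the
    constraints for the position multipliers, then a linear solve
    projecting the new velocity onto the tangent space."""
    C0 = np.atleast_2d(jac_c(x))
    v_half = v - 0.5 * dt * grad_U(x)
    Q = x + dt * v_half
    # Sweep over the constraints, adjusting one multiplier at a time.
    for _ in range(max_iter):
        if np.max(np.abs(np.atleast_1d(c(Q)))) < tol:
            break
        for i in range(C0.shape[0]):
            Ci_Q = np.atleast_2d(jac_c(Q))[i]
            dlam = np.atleast_1d(c(Q))[i] / (Ci_Q @ C0[i])
            Q = Q - C0[i] * dlam
    x_new = Q
    v_half = (x_new - x) / dt            # half-step velocity consistent with the projection
    # Velocity update plus tangent projection (linear solve for lambda_v).
    C1 = np.atleast_2d(jac_c(x_new))
    w = v_half - 0.5 * dt * grad_U(x_new)
    lam = np.linalg.solve(C1 @ C1.T, C1 @ w)
    v_new = w - C1.T @ lam
    return x_new, v_new
```

On the unit sphere (one constraint, \(c(x) = \Vert x\Vert ^2 - 1\)), iterating this step keeps the position on the manifold and the velocity in the tangent space to within the prescribed tolerance.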
Theorem 2
Let \(\mathcal {M}\) be a constraint manifold and let \(H \in C^{2}(T\mathcal {M})\). The RATTLE numerical integrator of the Hamiltonian system defined by H in \(T\mathcal {M}\) is symmetric, symplectic and of order 2. Further, it respects the manifold constraints.
Proof
Given in Leimkuhler and Skeel (1994). \(\square \)
3.2 Metropolis-Hastings adjustment
Let \(\Psi ^{L}_{\Delta t}: T\mathcal {M} \rightarrow T\mathcal {M}\) be the numerical integrator defined by L steps of RATTLE with stepsize \(\Delta t\). This integrator approximates the Hamiltonian dynamics. For theoretical purposes we will also define the map \(N: T\mathcal {M} \rightarrow T \mathcal {M}\) which negates the momentum term, i.e. \(N(x,v) \equiv (x,-v)\). Note that this leaves the Hamiltonian invariant and, because the momentum is resampled, it has no effect on the samples from \(\pi _{\mathcal {H}}\). We will define the following Metropolized RT-RMHMC, where we sample \(T \sim \exp {(\lambda )}\) and fix a maximum stepsize \(\Delta t_{\max {}}\) below the stability threshold of the numerical integrator. Then we choose the number of leapfrog steps L to be \(\lceil T/\Delta t_{\max {}} \rceil \). Having chosen L in this way, we set \(\Delta t = T/L \le \Delta t_{\max }\). At each step we perform L RATTLE steps with stepsize \(\Delta t\). We propose this method of discretisation, instead of purely randomizing the stepsize and fixing a number of leapfrog steps, to avoid numerical instabilities in the numerical integrator. One could also fix a stepsize within the numerical stability threshold of the integrator and simply sample an integer number of leapfrog steps geometrically to randomize the time. However, our proposed method relates more closely to the continuous dynamics without the issues due to numerical instabilities.
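The duration discretisation above (sample \(T \sim \exp(\lambda)\), set \(L = \lceil T/\Delta t_{\max }\rceil \) and \(\Delta t = T/L\)) can be sketched in a few lines; the function name is illustrative.

```python
import math
import numpy as np

def randomized_duration(rate, dt_max, rng):
    """Sample T ~ Exp(rate) and split it into L equal steps with
    stepsize T / L <= dt_max, where L = ceil(T / dt_max)."""
    T = rng.exponential(1.0 / rate)
    L = max(1, math.ceil(T / dt_max))
    return L, T / L
```

By construction the returned stepsize never exceeds `dt_max`, so the integrator operates below its stability threshold while the total integration time L * (T/L) reproduces the sampled duration T exactly.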
Remark 1
For large choices of stepsize \(\Delta t\) it has been shown that \(\Psi ^{L}_{\Delta t}\) is not reversible when RATTLE is used to integrate on the manifold, see (Lelièvre et al. 2019; Zappa et al. 2018). In Lelièvre et al. (2019) it is proposed to combat this by incorporating a reversibility check into the Metropolis-Hastings adjustment, although in practice such checks may be neglected in favor of an implicit assumption that \(\Delta t\) is sufficiently small to avoid non-reversibility issues. We will investigate this further in Sect. 5.
In light of Remark 1, we include \(\text {Rev}(\cdot )\) as an additional (optional) accept-reject condition which implements a reversibility check (following (Lelièvre et al. 2019)). In numerical experiments we examine the stepsize threshold at which the reversibility condition fails (see Fig. 1).
Remark 2
Our framework can be adapted to handle inequality constraints by incorporating an additional rejection condition in the Metropolis-Hastings step, which rejects samples that fall outside the feasible region.
This will be used in our application in Section 5.4 to impose a half-normal prior on some dimensions of our Bayesian model.
4 Ergodicity
We will now prove ergodicity and exact invariance of the desired measure for the discrete time algorithm with Metropolis-Hastings adjustment. We establish ergodicity under two assumptions, using the same technique as Brubaker et al. (2012) and restating some of their results.
Proposition 3
Let \(\Psi ^{L}_{\Delta t}\) be as defined above with \(\Delta t_{max} > 0\). Then \(\mu \) is invariant with respect to the Markov kernel proposed in Algorithm 3.
Proof
See Section C of the Appendix. \(\square \)
Assumption 1
Let \(\mathcal {M} = \{x \in \mathbb {R}^{n} \mid c(x) = 0 \}\) be a Riemannian manifold which is connected and smooth. We assume that \(\nicefrac {\partial c}{\partial x}\) is full rank everywhere.
Assumption 2
Let \(\mathcal {M}\) be a Riemannian manifold which satisfies Assumption 1. For \(x \in \mathcal {M}\) we define \(\mathcal {B}_{r}(x) = \{x' \in \mathcal {M} \mid d(x',x) \le r \}\) to be the geodesic ball of radius r of x. We assume that there exists a \(r > 0\) such that for every \(x \in \mathcal {M}\) and \(x' \in \mathcal {B}_{r}(x)\) there exists a unique choice of Lagrange multipliers and velocity \(v \in T_{x}\mathcal {M}\), \(v' \in T_{x'}\mathcal {M}\) for which \((v',x') = \Psi ^{L}_{\Delta t}(v,x)\) for sufficiently small \(\Delta t\).
Theorem 4
(Accessibility) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumption 1 holds. For any \(x_{0}, x_{1} \in \mathcal {M}\) and \(\Delta t\) sufficiently small, there exist finite \(v_{0} \in T \mathcal {M}\), \(v_{1} \in T \mathcal {M}\) and Lagrange multipliers \(\lambda _0\), \(\lambda _1\) such that \((v_1, x_1 ) = \Psi _{\Delta t}(v_{0},x_{0}).\)
Proof
Found in Brubaker et al. (2012)[Theorem 2] and is an extension of the results of Marsden and West (2001)[Theorem 2.1.1] and Hairer et al. (2006)[Theorem 5.6, Section IX.5.2]. \(\square \)
Theorem 5
(\(\mu \)-irreducible) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumptions 1 and 2 hold. Then for any \(x \in \mathcal {M}\) and any measurable set \(A \subset \mathcal {M}\) with positive measure, there exists an \(n \in \mathbb {N}\) such that
\[K^{n}(x,A) > 0,\]
where K denotes the marginal transition kernel defined on \(\mathcal {M}\) of Algorithm 3.
Proof
See Section C of the Appendix. \(\square \)
Lemma 6
(Aperiodic) Let \(U \in C^{2}(\mathcal {M})\) and under Assumptions 1 and 2 Algorithm 3 is aperiodic.
Proof
Proof given in Brubaker et al. (2012)[Lemma 1]. \(\square \)
Theorem 7
(Ergodicity) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumptions 1 and 2 hold. Then for \(\mu -\)almost all starting values x
\[\lim _{N \rightarrow \infty }\frac{1}{N}\sum ^{N}_{i=1}f(X^{i}) = \mathbb {E}_{\pi _{\mathcal {H}}}(f) \quad \text {for all } f \in L^{1}(\pi _{\mathcal {H}}).\]
Proof
Since Algorithm 3 is \(\mu -\)invariant by Proposition 3, \(\mu -\)irreducible by Theorem 5 and aperiodic by Lemma 6, the required result holds by Tierney (1994)[Theorem 1]. \(\square \)
5 Numerical results
We perform numerical simulations of the RT-CHMC algorithm and compare to the CHMC algorithm of Brubaker et al. (2012); Girolami and Calderhead (2011), specifically exploring the underlying dynamics of the two processes. MCMC schemes are used to approximate expected values of certain functions f over some distribution with probability density function \(\pi \)
\[\mathbb {E}_{\pi }(f) = \int f(x)\pi (x)\,dx,\]
where we can estimate this quantity using our MCMC scheme by
\[\mathbb {E}_{\pi }(f) \approx \frac{1}{N}\sum ^{N}_{i=1}f(X^{i}),\]
where \(X^{i}\) is the Markov chain from our MCMC method. We quantify the convergence rate associated to approximation of \(\mathbb {E}_{\pi }(f)\) by considering the integrated autocorrelation function and effective sample size.
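The integrated autocorrelation (IAC) and effective sample size can be estimated from the chain. A minimal sketch for a scalar observable follows; the simple fixed-truncation estimator of the lag sum is an assumption for illustration, not the averaging procedure of the paper's Appendix E.

```python
import numpy as np

def integrated_autocorrelation(f_samples, max_lag):
    """Estimate the integrated autocorrelation time
    tau = 1 + 2 * sum_{k=1}^{max_lag} rho(k) of a scalar chain f(X^i)."""
    f = np.asarray(f_samples, float)
    f = f - f.mean()
    n = len(f)
    var = f @ f / n
    rho = np.array([(f[:n - k] @ f[k:]) / (n * var)
                    for k in range(1, max_lag + 1)])
    return 1.0 + 2.0 * rho.sum()

def effective_sample_size(f_samples, max_lag):
    """ESS = N / tau: the number of effectively independent samples."""
    return len(f_samples) / integrated_autocorrelation(f_samples, max_lag)
```

For white noise tau is close to 1, while a correlated chain (e.g. an AR(1) process with coefficient 0.5, for which tau = 3) yields a proportionally smaller effective sample size.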
5.1 g-BAOAB
As a comparison method we implemented the g-BAOAB integrator of Leimkuhler and Matthews (2016), a numerical integrator for constrained underdamped Langevin dynamics. Constrained underdamped Langevin dynamics can be described by
\[{\dot{x}} = v, \qquad {\dot{v}} = -\nabla U_{\mathcal {H}}(x) - \gamma v + \sqrt{2\gamma }\,R(t) - C(x)^{T}\lambda , \qquad c(x) = 0, \quad C(x)v = 0,\]
where \(\gamma \) is a friction coefficient and R(t) is a vector-valued, stationary, zero-mean Gaussian process. The numerical integrator g-BAOAB is a splitting method for such dynamics, which uses similar constrained integrators to those of RT-CHMC. We note that g-BAOAB is a biased sampling algorithm due to the error in the numerical integrator. For a full description of g-BAOAB and a discussion of the sampling error we refer to Leimkuhler and Matthews (2016).
5.2 Computational cost
The cost of propagation steps is essentially the same as that of g-BAOAB. In situations where there are relatively few constraints compared to the dimension of the parameter space, the constraint cost should be affordable. This holds even in complex applications like molecular dynamics of proteins, where the constraint solver rarely introduces a cost greater than a few percent of the total cost of, say, long-ranged force evaluations (Xie et al. 2000). Obviously the practicality of the scheme will depend on the problem itself. The cost of solving the constraints using the semi-explicit SHAKE/RATTLE or geodesic integrator steps is expected to be lower than the cost to integrate the equations of CHMC in the general metric setting (using unconstrained ODEs), since the nonseparable Hamiltonian structure demands an implicit symplectic integration.
5.3 Test examples
We next provide examples of distributions on implicitly defined manifolds embedded in Euclidean space, with the distributions defined with respect to the Hausdorff measure of the manifold. We will consider two types of constraint manifolds: spheres and Stiefel manifolds.
5.3.1 Bingham-Von Mises-Fisher distribution on \(S^{n}\)
The first test case is the Bingham-Von Mises-Fisher (BVMF) distribution defined on the \(n-\)dimensional sphere embedded in \(\mathbb {R}^{n+1}\), that is \(S^{n}:= \{x \in \mathbb {R}^{n+1} \mid \sum ^{n+1}_{i=1}x^{2}_{i} = 1 \}\). The BVMF distribution is the exponential family on \(S^{n} \subset \mathbb {R}^{n+1}\) with density of the form
\[\pi _{\mathcal {H}}(x) \propto \exp {\left( c^{T}x + x^{T}Ax\right) },\]
where \(c \in \mathbb {R}^{n+1}\) and \(A \in M_{n+1}(\mathbb {R})\) is a symmetric matrix.
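The unnormalized log-density can be evaluated directly. A minimal sketch, assuming the standard BVMF parametrization \(\pi (x) \propto \exp (c^{T}x + x^{T}Ax)\) with \(\Vert x\Vert = 1\) and symmetric A; the function name is illustrative.

```python
import numpy as np

def bvmf_log_density(x, c, A):
    """Unnormalized Bingham-von Mises-Fisher log-density on the sphere:
    log pi(x) = c^T x + x^T A x + const, for |x| = 1 and symmetric A."""
    x = np.asarray(x, float)
    return c @ x + x @ A @ x
```

This is the quantity \(-\log \pi _{\mathcal {H}}\) (up to an additive constant and a sign) whose IAC is tracked in the experiments below.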
We compare the integrated autocorrelation (IAC) of \(-\log \pi _{\mathcal {H}}\) of the RT-CHMC method to that of the CHMC method introduced in Brubaker et al. (2012) for a number of distributions with parameters defined in the captions. We also compare the maximum IAC of \(x_{i}\) for \(i = 1,...,n\) to compare the worst efficiency of the mixing in all dimensions. We compare methods by setting the event rate parameter \(\lambda \) of RT-CHMC to be the deterministic duration parameter of CHMC (running the dynamics for this duration before momentum randomization). We then compute the integrated autocorrelation of \(-\log \pi _{\mathcal {H}}\) and \(x_{i}\) for \(i = 1,...,n\) for the two methods for varying choices of \(\lambda \) by a Monte Carlo averaging procedure as described in Appendix E. Regarding the reversibility issue for large choices of stepsize (as discussed in Section 3.2), for the geometries and distributions chosen the behaviour is as shown in Fig. 1: reversibility failures appear abruptly once the step-size crosses a threshold, and below this threshold all generated samples satisfy the reversibility conditions. We simply chose stepsizes below this threshold in our simulations.
The results are presented in Fig. 2. We choose the stepsize in RATTLE to be \(\Delta t = 0.01\) and sample \(N = 1,000,000\) events with a burn-in of \(10\%\) of samples before we compute the Monte Carlo average. We also use lags of up to \(M = N/50\) (2 percent of the number of samples) to estimate the IAC. As our choice of \(\Delta t\) is small, the acceptance rate is high, so this process is close to the continuous version; the IAC therefore compares the efficiency of the underlying continuous processes.
Remark 3
Due to the symmetry in the \(x_3\)-coordinate (\(x_3 \mapsto -x_3\)) in the BVMF distribution with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\), the distribution is bimodal. However in practice we consider this a unimodal distribution as the probability of transferring between modes is extremely small. This can be seen in Fig. 3, as the second mode is at \(\theta = \pi \). Our dynamics stay near the mode at (0, 0, 1) and do not visit the other mode in all our simulations.
In our first example and Fig. 2 we can see that the quality of samples varies irregularly with the duration parameter when a deterministic duration is used, and is nearly uniform across a wide interval for a randomized duration with the same expected value. This is illustrated in Fig. 3, where a small change in the duration parameter dramatically slows convergence due to very slow mixing. The fact that CHMC behaves erratically for large mean duration parameters may not be very surprising to some readers, as the theoretical convergence bound for HMC without randomization requires a limit on the duration T (see (Mangoubi and Smith 2017)). This is due to the fact that when T is set too large, the coupling argument breaks down.
We next compare efficiencies using the metric of gradient evaluations per effective sample, which tells us the number of gradient evaluations needed for one independent sample in estimating our observables. We compare this metric for varying choices of step-size up to the point where the reversibility condition is broken and the numerical integrator becomes unstable. Our observables will be \(-\log \pi _{\mathcal {H}}\) and \(x_{i}\) for \(i = 1,...,n\). We can see in Figs. 4 and 5 that CHMC (with deterministic time) exhibits the same behaviour as in Fig. 2 for all choices of step-size, which is not the case for RT-CHMC. We next compare the efficiency of the method with the g-BAOAB constrained Langevin integrator. We find in Figures 4 and 5 that g-BAOAB outperforms RT-CHMC for large choices of the friction parameter \(\gamma \) (for this example \(\gamma = 50\)). g-BAOAB has no Metropolis-Hastings adjustment and hence is a biased sampling method. The bias in the samples creates errors in computed observables. For large choices of \(\gamma \), this bias is dramatically reduced, but use of high friction may slow convergence of metastable systems.
To explore this we next consider a bimodal distribution from Byrne and Girolami (2013) in Fig. 6. It is shown in Fig. 6 that g-BAOAB incurs bias for large stepsizes and convergence is slow for large choices of the friction parameter for this metastable system. The figure also shows that this is not the case for RT-CHMC. In Fig. 4 we choose step-sizes up to which the integrator is reversible and stable.
The efficiency of the methods with their optimal choices of parameters is comparable, but RT-CHMC is much less sensitive to the choice of parameters (stepsize and number of leapfrog steps) than CHMC, so RT-CHMC is more reliable from this point of view. This is important as it is hard to know an appropriate choice of parameters a priori and the integration length between samples might have to be arbitrarily small for CHMC to be efficient.
5.3.2 Von Mises-Fisher distribution on \(\mathbb {V}_{d,p}\)
Definition 1
A Stiefel manifold \(\mathbb {V}_{d,p}\) is the set of \(d \times p\) matrices X such that \(X^{T}X = I\).
These arise in many statistical problems which are discussed in Byrne and Girolami (2013). Applications include dimensionality reduction such as is used in factor analysis, principal component analysis (Jolliffe 2002) and directional statistics (Mardia et al. 2000). These are a generalisation of orthogonal groups. The von Mises-Fisher distribution on Stiefel manifolds is defined by the density
\[\pi _{\mathcal {H}}(X) \propto \exp {\left( \text {tr}(F^{T}X)\right) } = \exp {\Big ( \sum _{i}f_{i}^{T}x_{i}\Big ) },\]
where \(f_{i}\) and \(x_{i}\) are the columns of F and X respectively. We compute IAC estimates for two example distributions with varying duration parameters. In the simulations we use a step-size of \(\Delta t = 0.01\) and 100,000 samples in each IAC estimate. The results are shown in Fig. 7. We see behaviour similar to the easier example on the sphere: in both examples it is clear that CHMC is much more sensitive to the mean duration, and hence to the step-size and number of leapfrog steps. We note that \(\text {Skew}(2,-45,-4)\) denotes the \(3 \times 3\) skew-symmetric matrix with upper triangular entries \(2,-45\) and \(-4\).
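A sketch of evaluating the (unnormalized) log-density and of the \(\text {Skew}\) notation above, assuming the standard matrix von Mises-Fisher form \(p(X) \propto \exp (\textrm{tr}(F^{T}X))\):

```python
import numpy as np

def vmf_log_density(X, F):
    """Unnormalized log-density of the matrix von Mises-Fisher
    distribution on the Stiefel manifold: log p(X) = tr(F^T X) + const."""
    return np.trace(F.T @ X)

def skew(a, b, c):
    """3 x 3 skew-symmetric matrix with upper triangular entries a, b, c."""
    return np.array([[0.0,  a,   b],
                     [-a,  0.0,  c],
                     [-b,  -c, 0.0]])

F = skew(2.0, -45.0, -4.0)
X = np.eye(3)                         # the identity is a point of V_{3,3}
print(vmf_log_density(X, F))          # tr(F^T) = 0 for skew-symmetric F
```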
5.4 High dimensional covariance estimation
In many statistical applications involving high dimensional data sets it is necessary to estimate sample covariances. This is challenging when the number of dimensions is larger than the number of data points, as the sample covariance estimator performs poorly in such cases. Lam (2020) provides a review of high-dimensional covariance estimation and its applications in principal component analysis (Shen et al. 2016), cosmological data analysis (Joachimi 2017) and finance (Lam 2016), focusing on the setting where the matrix dimension is diverging or even larger than the sample size. In this setting one needs to estimate the population covariance matrix \(\Sigma \) of a set of n p-dimensional data vectors, which we assume are drawn from an underlying distribution.
One estimator is the sample covariance matrix, defined by \(\Sigma _{S} = \nicefrac {1}{n}\sum ^{n}_{k=1}({\textbf {x}}_{k} - \overline{{\textbf {x}}})({\textbf {x}}_{k} - \overline{{\textbf {x}}})^{T}\), where \(\overline{{\textbf {x}}} = \nicefrac {1}{n}\sum ^{n}_{k = 1}{} {\textbf {x}}_{k}\) is the sample mean. However, this is a poor estimator of \(\Sigma \) when p is large compared to the sample size n (due to rank deficiency). One way to combat this is to consider regularized covariance matrix estimators, which incorporate structural assumptions on the covariance matrix \(\Sigma \).
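A quick numerical sketch (with hypothetical sizes) of the rank deficiency mentioned above: with n samples in p > n dimensions, \(\Sigma _{S}\) has rank at most \(n-1\) and is singular:

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of n data vectors stored as the rows of X (n x p)."""
    xbar = X.mean(axis=0)
    centred = X - xbar
    return centred.T @ centred / X.shape[0]

rng = np.random.default_rng(2)
n, p = 10, 40                       # fewer samples than dimensions
X = rng.standard_normal((n, p))
S = sample_covariance(X)
rank = np.linalg.matrix_rank(S)
print(rank < p)                     # singular: rank <= n - 1
```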
One such method assumes the structure of a low rank matrix plus a sparse matrix (see Ross (1976) and Lam (2020)). This structure is known as a spiked covariance structure and has been studied in Bouchard et al. (2020), Lam (2020) and Cai et al. (2015), with interesting applications to finance (Fan et al. 2008), chemometrics (Kritchman and Nadler 2008), and astronomy (Joachimi 2017). The covariance matrix \(\Sigma \) is assumed to be expressible in the form
$$\begin{aligned} \Sigma = XD_{1}X^{T} + D_{2}, \end{aligned}$$
where X is an element of the Stiefel manifold \(\mathbb {V}_{p,m}\), i.e. a \(p \times m\) matrix with orthonormal columns, with \(p \gg m\), and \(D_{i}\), \(i = 1, 2\), are diagonal matrices of dimensions \(m \times m\) and \(p \times p\) respectively. The motivation for this structure is the assumption that lower dimensional variables \({\textbf {y}}_{i}\) can describe the data \({\textbf {x}}_{i}\) via \({\textbf {x}}_{i} = X{\textbf {y}}_{i} + \varvec{\epsilon }_{i}\). We then have that
$$\begin{aligned} \Sigma = \textrm{Cov}({\textbf {x}}_{i}) = X\Sigma _{y}X^{T} + \Sigma _{\epsilon }, \end{aligned}$$
which we interpret as a low rank matrix (rank m) plus a sparse matrix. We take \(\Sigma _y\) and \(\Sigma _{\epsilon }\) to be diagonal, which is an approximation of the spiked covariance structure (Chamberlain and Rothschild 1982).
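The "low rank plus sparse" structure can be assembled directly; the sizes and values below are hypothetical, chosen only to illustrate the construction:

```python
import numpy as np

def spiked_covariance(X, d1, d2):
    """Sigma = X diag(d1) X^T + diag(d2), with X on the Stiefel
    manifold V_{p,m}, d1 of length m and d2 of length p."""
    return X @ np.diag(d1) @ X.T + np.diag(d2)

rng = np.random.default_rng(3)
p, m = 12, 3
X, _ = np.linalg.qr(rng.standard_normal((p, m)))   # orthonormal columns
d1 = np.array([5.0, 3.0, 2.0])                     # spike strengths
d2 = np.full(p, 0.5)                               # sparse (diagonal) part
Sigma = spiked_covariance(X, d1, d2)
# rank-m part plus a full-rank diagonal => Sigma is positive definite
print(np.all(np.linalg.eigvalsh(Sigma) > 0))       # True
```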
We assume a uniform prior on X with respect to the Hausdorff measure on \(\mathbb {V}_{p,m}\), and half-normal priors on the diagonal entries of \(D_{i}\), \(i = 1,2\), to ensure positive definiteness. We also consider the following likelihood for the covariance estimation
We introduce the posterior distribution \(p(\Sigma \mid {\textbf {x}}_{1},...,{\textbf {x}}_{p}) \propto \mathcal {L}( \Sigma \mid {\textbf {x}}_{1},...,{\textbf {x}}_{p}) p(\Sigma )\), where \(p(\Sigma ) = p(X)p(D_{1})p(D_{2})\) for \(X \sim \mathcal {U}(\mathbb {V}_{p,m})\), \(D_{1_{jj}} \sim \mathcal {N}_{+}(0,\sigma ^{2}_{1})\) for \(j = 1,...,m\) and \(D_{2_{jj}} \sim \mathcal {N}_{+}(0,\sigma ^{2}_{2})\) for \(j = 1,...,p\) and \(\mathcal {N}_{+}\) denotes the half-normal distribution. Define the potential \(U: \mathbb {V}_{p,m} \times \mathbb {R}^{m} \times \mathbb {R}^{p} \mapsto \mathbb {R}\) by \(U(X,{\textbf {d}}_{1},{\textbf {d}}_{2}) = -\log {\mathcal {L}( \Sigma (X,{\textbf {d}}_{1},{\textbf {d}}_{2}) \mid {\textbf {x}}_{1},...,}\)\({{\textbf {x}}_{p})} -\log {p(X)} - \log {p({\textbf {d}}_{1})}-\log {p({\textbf {d}}_{2})}\) with forces given by
where \([B^{kl}]_{ij} = -(\Sigma ^{-1})_{ik}(\Sigma ^{-1})_{lj}\) and
We also have that
for \(i = 1,2\) and where
for \(j_1 = 1,\ldots ,m\) and \(j_2 = 1,\ldots ,p\). We use the likelihood and its gradient to implement our RT-CHMC algorithm for such models.
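As a sketch of how such a potential can be assembled in practice — assuming, purely for illustration, a zero-mean Gaussian likelihood (the paper's actual likelihood is the one stated above) together with the half-normal priors — one might write:

```python
import numpy as np

def spiked_sigma(X, d1, d2):
    """Sigma(X, d1, d2) = X diag(d1) X^T + diag(d2)."""
    return X @ np.diag(d1) @ X.T + np.diag(d2)

def potential(X, d1, d2, data, sigma1=1.0, sigma2=1.0):
    """U = -log likelihood - log prior.  The zero-mean Gaussian likelihood
    is an illustrative assumption; the uniform prior on X contributes
    only a constant and is dropped."""
    Sigma = spiked_sigma(X, d1, d2)
    _, logdet = np.linalg.slogdet(Sigma)
    Sinv = np.linalg.inv(Sigma)
    n = data.shape[0]
    # negative log-likelihood: 0.5 * (n log det Sigma + sum_i x_i^T Sigma^{-1} x_i)
    nll = 0.5 * n * logdet + 0.5 * np.einsum('ij,jk,ik->', data, Sinv, data)
    # half-normal log-priors on the diagonal entries d1, d2
    nlp = 0.5 * np.sum(d1 ** 2) / sigma1 ** 2 + 0.5 * np.sum(d2 ** 2) / sigma2 ** 2
    return nll + nlp

rng = np.random.default_rng(4)
p, m, n = 8, 2, 20
X, _ = np.linalg.qr(rng.standard_normal((p, m)))
data = rng.standard_normal((n, p))
U = potential(X, np.array([2.0, 1.0]), np.full(p, 0.5), data)
print(np.isfinite(U))
```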
5.4.1 Covariance estimation for cosmological data
We consider an application of high dimensional covariance estimation in cosmological data analysis introduced in Joachimi (2017) and discussed in Lam (2020). The data are taken from Joachimi (2017) and consist of covariances of two-point correlation functions of cosmic weak lensing, simulated using coupled log-normal random fields from angular power spectra; for further information we refer the reader to Joachimi (2017). We test the method using 2p/3 data vectors, where p is the dimension of the data vectors, so we are in the setting where the dimension of the covariance matrix is larger than the number of samples. For our low rank plus sparse structure we choose \(m = p/6 \ll p\), and to ensure fast convergence we normalize the data entry-wise and initialize our Markov chain via an eigenvalue decomposition of the sample covariance. We initialize the chain as \(\Sigma _{0} = XD_{1}X^{T} + D_{2} \approx \Sigma _{S}\), where \(\Sigma _{S}\) is the sample covariance matrix, \(D_{2}\) is the diagonal of \(\Sigma _{S}\), and \(XD_{1}X^{T}\) is the truncated eigenvalue decomposition of \(\Sigma _{S} - D_{2}\) retaining the m largest eigenvalues. Once the covariance of the normalized data has been estimated, it can easily be rescaled to match the original data via entry-wise multiplication with the outer product of the entry-wise standard deviations. We compare our method to a maximum a posteriori (MAP) estimate of the covariance matrix, obtained with a simple constrained gradient descent algorithm with Lagrange multipliers to ensure that the “low-rank plus sparse” structure is maintained. We compare the Bayesian and MAP approaches using a relative Frobenius norm and a covariance metric introduced by Förstner and Moonen (2003), defined by
$$\begin{aligned} d(A,B) = \sqrt{\sum _{i} \ln ^{2}{\lambda _{i}(A,B)}}, \end{aligned}$$
where A and B are covariance matrices and \(\lambda _{i}(A,B)\) are the generalized eigenvalues from \(\textrm{det}(\lambda A - B)= 0\). As pointed out in Förstner and Moonen (2003), this covariance metric is affine invariant and invariant to inversion.
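A sketch of this metric, assuming the usual form \(d(A,B) = \sqrt{\sum _{i}\ln ^{2}\lambda _{i}(A,B)}\), using a Cholesky whitening in place of a generalized eigensolver:

```python
import numpy as np

def foerstner_metric(A, B):
    """Covariance metric of Foerstner & Moonen (2003), assuming the form
    d(A, B) = sqrt(sum_i ln^2 lambda_i), where the lambda_i are the
    generalized eigenvalues of det(lambda * A - B) = 0 (A, B SPD)."""
    L = np.linalg.cholesky(A)
    Linv = np.linalg.inv(L)
    M = Linv @ B @ Linv.T            # same spectrum as the pencil (A, B)
    lam = np.linalg.eigvalsh(M)      # real and positive for SPD inputs
    return np.sqrt(np.sum(np.log(lam) ** 2))

A = np.diag([1.0, 2.0])
B = np.diag([np.e, 2.0 * np.e])     # B = e * A, so every lambda_i = e
print(foerstner_metric(A, B))       # close to sqrt(1 + 1) = sqrt(2)
```

The inversion invariance noted in the text is easy to check here: replacing A and B by their inverses maps each \(\lambda _{i}\) to \(1/\lambda _{i}\), leaving \(\ln ^{2}\lambda _{i}\) unchanged.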
Note that in Fig. 8 we have not included the sample estimate with 40 samples, because the sample covariance matrix is then rank deficient and hence cannot be inverted. We notice from Fig. 8 and Table 1 that the MAP and posterior expectation estimates perform well when estimating the covariance matrix using only 40 data points, but lose accuracy under inversion. The posterior expectation obtained by sampling retains more structure when inverted and provides a more accurate estimate according to the metric of Förstner and Moonen (2003).
It is clear from Table 1 that using the posterior means does not sacrifice accuracy compared to using the MAP estimators. An additional benefit of the Bayesian approach is that we can compute posterior standard deviations for each component of the covariance estimator, which provide error estimates; this is illustrated in Fig. 9. By comparing these standard deviations with the covariance estimates, we can get a sense of the relative error we are making. This information can be useful when deciding how many data samples are needed to reach a satisfactory level of accuracy in estimating the covariance matrix. Since in practice we do not have access to the true covariance matrix, there is no straightforward way to compute error estimates based on the MAP estimator, and it is difficult to tell whether sufficient accuracy has been reached.
Further, it is clear from Fig. 10 that CHMC still suffers from the robustness issues that were present in the lower-dimensional examples, while RT-CHMC maintains its robustness in this application.
6 Conclusion and future work
In this work we have introduced Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC), a robust alternative to the Riemannian Manifold Hamiltonian Monte Carlo methods introduced by Girolami and Calderhead (2011) and Brubaker et al. (2012). We establish invariance of the desired measure under a compactness assumption in the continuous (small step-size limit) setting. We provide a Metropolis-adjusted version of RT-RMHMC in the discrete setting and prove invariance and ergodicity of the adjusted discretized algorithm. We show on a number of numerical examples arising in applications that RT-RMHMC is more robust with respect to parameter choice, and we provide an example demonstrating that our Riemannian manifold sampling method can be used for high-dimensional covariance estimation. We expect the stability with respect to the choice of parameters to be especially valuable in poorly conditioned problems, where RMHMC would require very short time steps for stability, which may lead to random walk behaviour and highly inefficient mixing in some principal directions.
In terms of future developments for RT-RMHMC, the next steps would be to establish invariance of the measure in the non-compact setting and, further, to establish (geometric) ergodicity of RT-RMHMC, which Bou-Rabee and Sanz-Serna (2017) have already established in the Euclidean setting. One could then seek optimal choices of integration parameters and step-size. Another possibility would be to establish mixing time guarantees for RT-RMHMC via a coupling argument as in Bou-Rabee et al. (2020) and Mangoubi and Smith (2018). Mangoubi and Smith (2018) establish rapid mixing guarantees for a geodesic walk algorithm on manifolds with positive curvature, which is RMHMC for the uniform distribution; a similar coupling argument may yield mixing time guarantees for RT-RMHMC on manifolds with positive curvature.
The C code for RT-CHMC and CHMC for each application is available at https://github.com/PAWhalley/Randomized-Time-Riemannian-Manifold-Hamiltonian-Monte-Carlo.
References
Andersen, H.C.: Molecular dynamics simulations at constant pressure and/or temperature. J. Chem. Phys. 72(4), 2384–2393 (1980)
Andersen, H.C.: Rattle: a “velocity’’ version of the Shake algorithm for molecular dynamics calculations. J. Comput. Phys. 52(1), 24–34 (1983)
Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University Press, Cambridge (2012)
Bierkens, J., Fearnhead, P., Roberts, G.: The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat. 47(3), 1288–1320 (2019)
Böttcher, B., Schilling, R., Wang, J.: Lévy Matters III. Lecture Notes in Mathematics. Springer, Cham (2013)
Bouchard, F., Breloy, A., Ginolhac, G., Pascal, F.: Riemannian framework for robust covariance matrix estimation in spiked models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5979–5983 (2020). IEEE
Bouchard-Côté, A., Vollmer, S.J., Doucet, A.: The bouncy particle sampler: a non-reversible rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113(522), 855–867 (2018)
Bou-Rabee, N., Eberle, A.: Couplings for Andersen dynamics. In: Annales de l’Institut Henri Poincare (B) Probabilites et Statistiques, vol. 58, pp. 916–944 (2022). Institut Henri Poincaré
Bou-Rabee, N., Sanz-Serna, J.M.: Randomized Hamiltonian Monte Carlo. Ann. Appl. Probab. 27(4), 2159–2194 (2017)
Bou-Rabee, N., Eberle, A., Zimmer, R.: Coupling and convergence for Hamiltonian Monte Carlo. Ann. Appl. Probab. 30(3), 1209–1250 (2020)
Brubaker, M., Salzmann, M., Urtasun, R.: A family of MCMC methods on implicitly defined manifolds. In: Artificial Intelligence and Statistics, pp. 161–172 (2012). PMLR
Byrne, S., Girolami, M.: Geodesic Monte Carlo on embedded manifolds. Scandinavian J. Stat. 40(4), 825–845 (2013)
Cai, T., Ma, Z., Wu, Y.: Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Relat. Fields 161(3), 781–815 (2015)
Robert, C.P., Casella, G.: Monte Carlo Statistical Methods, 2nd edn. Springer, New York (2004)
Chamberlain, G., Rothschild, M.: Arbitrage, Factor Structure, and Mean-variance Analysis on Large Asset markets. National Bureau of Economic Research Cambridge, Cambridge (1982)
Chicone, C.: Ordinary Differential Equations with Applications. Texts in Applied Mathematic. Springer, New York (2006)
Davis, M.H.A.: Markov Models & Optimization. CRC Press, Boca Raton (1993)
Deligiannidis, G., Paulin, D., Bouchard-Côté, A., Doucet, A.: Randomized Hamiltonian Monte Carlo as scaling limit of the bouncy particle sampler and dimension-free convergence rates. Ann. Appl. Probab. 31(6), 2612–2662 (2021)
Diaconis, P., Holmes, S., Shahshahani, M.: Sampling from a manifold. Adv. Modern Stat. Theory Appl Festschrift Honor Morris L Eaton 10, 102–125 (2013)
Durmus, A., Guillin, A., Monmarché, P.: Piecewise deterministic Markov processes and their invariant measures. Ann. de l’Institut Henri Poincaré, Probabilités et Statistiques 57(3), 1442–1475 (2021)
Ethier, S.N., Kurtz, T.G.: Markov Processes. Characterization and Convergence. Wiley, Hoboken (1986)
Fan, J., Fan, Y., Lv, J.: High dimensional covariance matrix estimation using a factor model. J. Econ. 147(1), 186–197 (2008)
Förstner, W., Moonen, B.: A metric for covariance matrices. In: Geodesy-the Challenge of the 3rd Millennium, pp. 299–309. Springer, Berlin (2003)
Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. Royal Stat. Soc. Series B Stat. Method. 73(2), 123–214 (2011)
Graham, M.M., Thiery, A.H., Beskos, A.: Manifold Markov chain Monte Carlo methods for Bayesian inference in diffusion models. J. Royal Stat. Soc Series B (Stat. Methodol) 84(4), 1229–1256 (2022)
Greenberg, E.: Introduction to Bayesian Econometrics, 2nd edn. Cambridge University Press, Cambridge (2012)
Grothaus, M., Mertin, M.C.: Hypocoercivity of Langevin-type dynamics on abstract smooth manifolds. Stoch. Process. Appl. 146, 22–59 (2022)
Guillemin, V., Pollack, A.: Differential Topology. Prentice-Hall Inc, Englewood Cliffs, N.J. (1974)
Hairer, E., Hochbruck, M., Iserles, A., Lubich, C.: Geometric numerical integration. Oberwolfach Rep. 3(1), 805–882 (2006)
Hamelryck, T., Kent, J.T., Krogh, A.: Sampling realistic protein conformations using local structural bias. PLoS Comput. Biol. 2(9), 131 (2006)
Hartmann, C.: Model reduction in classical molecular dynamics. Freie Universität Berlin (2007)
Hartmann, C.: An ergodic sampling scheme for constrained Hamiltonian systems with applications to molecular dynamics. J. Stat. Phys. 130(4), 687–711 (2008)
Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)
Joachimi, B.: Non-linear shrinkage estimation of large-scale structure covariance. Mon. Notices Royal Astron. Soc. Lett. 466(1), 83–87 (2017)
Jolliffe, I.: Generalizations and adaptations of principal component analysis. In: Principal Component Analysis. Springer Series in Statistics, 2nd edn. Springer, New York (2002)
Kleppe, T.S.: Connecting the dots: numerical randomized Hamiltonian Monte Carlo with state-dependent event rates. J. Comput. Graph. Stat. 31, 1–16 (2022)
Kritchman, S., Nadler, B.: Determining the number of components in a factor model from limited noisy data. Chemom. Intell. Lab. Syst. 94(1), 19–32 (2008)
Kunze, K., Schaeben, H.: The Bingham distribution of quaternions and its spherical radon transform in texture analysis. Math. Geol. 36(8), 917–943 (2004)
Lam, C.: Nonparametric eigenvalue-regularized precision or covariance matrix estimator. Ann. Stat. 44(3), 928–953 (2016)
Lam, C.: High-dimensional covariance matrix estimation. Wiley Interdiscip Rev: Comput Stat 12(2), 1485 (2020)
Laurent, A., Vilmart, G.: Order conditions for sampling the invariant measure of ergodic stochastic differential equations on manifolds. Found. Comput. Math. 22, 1–47 (2021)
Lee, Y.T., Vempala, S.S.: Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1115–1121 (2018)
Lee, T., Leok, M., McClamroch, N.H.: Global Formulations of Lagrangian and Hamiltonian Dynamics on Manifolds. Springer, Cham (2017)
Leimkuhler, B., Matthews, C.: Efficient molecular dynamics using geodesic integration and solvent-solute splitting. Proc. Royal Soc. A Math. Phys. Eng. Sci. 472(2189), 20160138 (2016)
Leimkuhler, B., Reich, S.: Simulating Hamiltonian Dynamics. Cambridge University Press, Cambridge (2004)
Leimkuhler, B.J., Skeel, R.D.: Symplectic numerical integrators in constrained Hamiltonian systems. J. Comput. Phys. 112(1), 117–125 (1994)
Lelievre, T., Rousset, M., Stoltz, G.: Free Energy Computations: A Mathematical Perspective. Imperial College Press, London (2010)
Lelièvre, T., Rousset, M., Stoltz, G.: Langevin dynamics with constraints and computation of free energy differences. Math. Comput. 81(280), 2071–2125 (2012)
Lelièvre, T., Rousset, M., Stoltz, G.: Hybrid Monte Carlo methods for sampling probability measures on submanifolds. Numerische Mathematik 143(2), 379–421 (2019)
Li, D.: On the rate of convergence to equilibrium of the Andersen thermostat in molecular dynamics. J. Stat. Phys. 129(2), 265–287 (2007)
Mangoubi, O., Smith, A.: Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114 (2017)
Mangoubi, O., Smith, A.: Rapid mixing of geodesic walks on manifolds with positive curvature. Ann. Appl. Probab. 28(4), 2501–2543 (2018)
Mardia, K.V., Jupp, P.E., Mardia, K.: Directional Statistics. Wiley series in probability and statistics, Wiley, Chichester (2000)
Marsden, J.E., West, M.: Discrete mechanics and variational integrators. Acta Numerica 10, 357–514 (2001)
Meyer, G., Bonnabel, S., Sepulchre, R.: Linear Regression under Fixed-Rank Constraints: A Riemannian Approach. In: Proceedings of the 28th International Conference on Machine Learning, pp. 545–552 (2011)
Nash, J.: The imbedding problem for Riemannian manifolds. Ann. Math. 63, 20–63 (1956)
Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables, vol. 30. SIAM, Philadelphia (2000)
Pakman, A., Gilboa, D., Carlson, D., Paninski, L.: Stochastic bouncy particle sampler. In: International Conference on Machine Learning, pp. 2741–2750 (2017). PMLR
Perez, A., MacCallum, J.L., Dill, K.A.: Accelerating molecular simulations of proteins using Bayesian inference on weak information. Proc. Natl. Acad. Sci. 112(38), 11846–11851 (2015)
Quiroz, M., Kohn, R., Villani, M., Tran, M.-N.: Speeding up MCMC by efficient data subsampling. J. Am. Stat. Assoc. 114, 831–843 (2018)
Riou-Durand, L., Vogrinc, J.: Metropolis Adjusted Langevin Trajectories: a robust alternative to Hamiltonian Monte Carlo. arXiv preprint arXiv:2202.13230 (2022)
Ross, S.A.: The arbitrage theory of capital asset pricing. J. Econ. Theory 13, 341–360 (1976)
Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: Proceedings of the 25th International Conference on Machine Learning, pp. 880–887 (2008)
Shen, D., Shen, H., Marron, J.: A general framework for consistency of principal component analysis. J. Mach. Learn. Res. 17(1), 5218–5251 (2016)
Tierney, L.: Markov Chains for exploring posterior distributions. Ann. Stat. 22(4), 1701–1728 (1994)
Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., Doucet, A.: Piecewise-Deterministic Markov Chain Monte Carlo. arXiv preprint arXiv:1707.05296 (2018)
Weinan, E., Li, D.: The Andersen thermostat in molecular dynamics. Commun. Pure Appl. Math. 61(1), 96–136 (2008)
Wilkinson, D.J.: Bayesian methods in bioinformatics and computational systems biology. Brief. Bioinf. 8(2), 109–116 (2007)
Xie, D., Ridgway Scott, L., Schlick, T.: Analysis of the SHAKE-SOR algorithm for constrained molecular dynamics simulations. Methods Appl. Anal. 7(3), 577–590 (2000)
Zappa, E., Holmes-Cerfon, M., Goodman, J.: Monte Carlo on Manifolds: sampling densities and integrating functions. Commun. Pure Appl. Math. 71(12), 2609–2647 (2018)
Acknowledgements
The authors would like to thank the anonymous referees for their valuable feedback and suggestions, which have improved the quality and presentation of the paper. The authors acknowledge the support of the Engineering and Physical Sciences Research Council Grant EP/S023291/1 (MAC-MIGS Centre for Doctoral Training).
Funding
The research leading to these results received support of the Engineering and Physical Sciences Research Council Grant EP/S023291/1 (MAC-MIGS Centre for Doctoral Training).
Author information
Contributions
P.A.W., D.P. and B.L. developed the theory, conceptualized the algorithm and the simulation tests. P.A.W., D.P. and B.L. wrote the main manuscript text. D.P. and B.L. supervised the project.
Ethics declarations
Conflicts of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Appendices
Appendix A: Generator of RT-RMHMC
To prove that the generator of this stochastic process takes the form of Equation (3), and that the measure (Equation (1)) is invariant under RT-RMHMC, we use the framework of Durmus et al. (2021), viewing RT-RMHMC as a piecewise deterministic Markov process (PDMP) defined on \(T \mathcal {M}\). For a general state space \(\mathcal {Z}\), a \(\mathcal {Z}\)-valued continuous-time PDMP \((\varphi ,\lambda ,Q)\) consists of the following components:
- a differential flow \(\varphi : \mathbb {R}_{+} \times \mathcal {Z} \mapsto \mathcal {Z}\) which satisfies the semigroup property and is measurable; moreover, it is continuously differentiable with respect to time and a \(C^{1}\)-diffeomorphism of \(\mathcal {Z}\);
- an event rate \(\lambda : \mathcal {Z} \rightarrow \mathbb {R}^{+}\), which is measurable and locally bounded;
- an inhomogeneous Markov transition kernel \(Q: \mathbb {R}_{+} \times \mathcal {Z} \times \mathcal {B}(\mathcal {Z}) \rightarrow [0,1]\), such that for all \(A \in \mathcal {B}(\mathcal {Z})\), \((t,z) \mapsto Q(t,z,A)\) is measurable and for all \((t,z) \in \mathbb {R}_{+} \times \mathcal {Z}\), \(Q(t,z,\cdot ) \in \mathcal {P}(\mathcal {Z})\),
where \(\mathcal {B}(\mathcal {Z})\) denotes the \(\sigma \)-algebra on the space \(\mathcal {Z}\) and \(\mathcal {P}(\mathcal {Z})\) denotes the space of probability measures on the space \(\mathcal {Z}\). For RT-RMHMC we consider \(\mathcal {Z} = T\mathcal {M}\).
Definition 2
For a PDMP \(Z = (Z_{t})_{t\ge 0}\), we call \(\tau _{\infty }(Z) = \inf \{t \ge 0 \mid Z_{t} = \infty \}\) the explosion time of the process \((Z_t)_{t \ge 0}\). A process \((Z_{t})_{t \ge 0}\) is said to be non-explosive if \(\tau _{\infty }(Z) = + \infty \) almost surely. PDMP characteristics are said to be non-explosive if, for every initial distribution, the associated PDMP is non-explosive.
Since the event rate \(\lambda \) of RT-RMHMC is constant and bounded, RT-RMHMC is non-explosive. As RT-RMHMC is a non-explosive PDMP, we can use the theory of Durmus et al. (2021)[Sections 7 and 8] to establish the generator and the invariance of the desired measure.
We use \(\mathcal {L}\) to denote the generator of a PDMP and \(D(\mathcal {L})\) to denote its domain, which informally consists of all continuous functions vanishing at infinity for which the generator is well-defined; for more information we refer the reader to Durmus et al. (2021). Under the assumption that the expected number of events in any time interval [0, t] is finite, it is shown in Davis (1993)[Theorem 26.14] that a non-explosive PDMP has generator \(\mathcal {L}\) with domain \(D(\mathcal {L})\) satisfying, for all \(f \in D(\mathcal {L})\) and \(x \in T\mathcal {M}\),
$$\begin{aligned} \mathcal {L}f(x) = D_{\varphi }f(x) + \lambda (x)\int _{T\mathcal {M}}\left( f(y)-f(x)\right) Q(x,dy), \end{aligned}$$
where \(D_{\varphi }\) is the vector field associated with the flow \(\varphi \).
If we let \(N_{t}\) denote the number of events in the interval [0, t], then for RT-RMHMC we have \(\mathbb {E}_{x}(N_{t}) = \lambda t < \infty \), where \(\mathbb {E}_{x}\) denotes the expected value when the stochastic process starts from the initial condition x. Therefore all the assumptions of Davis (1993)[Theorem 26.14] and Durmus et al. (2021)[Section 7] are satisfied, and the generator of RT-RMHMC is given by
$$\begin{aligned} \mathcal {L}f = X_{H}f + \lambda \left( Qf - f\right) , \end{aligned}$$
where \(X_{H}\) is the Hamiltonian vector field and Q is the transition kernel for the Gaussian distribution induced by the metric G(x) on the tangent space of \(x \in \mathcal {M}\).
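The finiteness of the expected event count used above can be illustrated with a small simulation (a sketch, independent of the paper's code): with \(\text {Exp}(\lambda )\) durations between refreshments, \(\mathbb {E}_{x}(N_{t}) = \lambda t\).

```python
import numpy as np

def count_events(rng, lam, t_max):
    """Count refreshment events in [0, t_max] with Exp(lam) durations."""
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1.0 / lam)
        if t > t_max:
            return n
        n += 1

rng = np.random.default_rng(5)
lam, t_max, reps = 2.0, 10.0, 20000
counts = [count_events(rng, lam, t_max) for _ in range(reps)]
# sample mean of N_t should be close to lam * t_max = 20
print(abs(np.mean(counts) - lam * t_max) < 0.2)
```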
Appendix B: invariant measure
To prove that \(\mu \) is an invariant measure of RT-RMHMC it is sufficient to show that \(\int \mathcal {L}f(x) d\mu = 0\) for all \(f \in D(\mathcal {L})\). As it is difficult to consider \(D(\mathcal {L})\), one approach is to show that \(C^{1}_{c}(T\mathcal {M})\) is a core of the generator and that \(\int \mathcal {L}f(x)d\mu = 0\) for all \(f \in C^{1}_{c}(T\mathcal {M})\), where \(C^{k}_{c}(T\mathcal {M})\) denotes the space of k times differentiable functions \(f: T\mathcal {M} \rightarrow \mathbb {R}\) with compact support.
Theorem 8
(Infinitesimal Invariance of RT-RMHMC) Let \(\mathcal {M}\) be a smooth Riemannian manifold with metric g and let \((P_{t})_{t \ge 0}\) be the semigroup of RT-RMHMC defined on \(\mathcal {M}\) with potential \(U \in C^{2}(\mathcal {M})\) and Hamiltonian \(H = U + K \in C^{2}(T\mathcal {M})\). Let \(\mu \) be the measure on \((T\mathcal {M}, \mathcal {B}(T\mathcal {M}))\) defined by
$$\begin{aligned} \mu (dz) \propto \exp \left( -H(z)\right) d\lambda _{T\mathcal {M}}(z), \end{aligned}$$
where \(d\lambda _{T\mathcal {M}}\) is the Liouville measure of \(T\mathcal {M}\). Then for all \(f \in C^{1}_{c}(T\mathcal {M})\)
$$\begin{aligned} \int _{T\mathcal {M}} \mathcal {L}f \, d\mu = 0, \end{aligned}$$
where \(\mathcal {L}\) is the generator of RT-RMHMC.
Proof
We have that
$$\begin{aligned} \int _{T\mathcal {M}}\mathcal {L}f \, d\mu = \int _{T\mathcal {M}} X_{H}f \, d\mu + \lambda \int _{T\mathcal {M}} (Q - I)f \, d\mu , \end{aligned}$$
where I is the identity operator. We now consider these two integrals separately. For the first integral: since \(\mu \) is a Liouville measure, it is invariant under the Hamiltonian flow by Liouville's theorem (see for example Grothaus and Mertin (2022)), and hence the first integral is identically zero. Considering the second integral, we have
where C(x) depends on x but not on v or \(\xi \). The last integral is identically zero, as it can be separated into the difference of two identical integrals by a change of variables. Therefore we have that
$$\begin{aligned} \int _{T\mathcal {M}}\mathcal {L}f \, d\mu = 0, \end{aligned}$$
and \(\mu \) is an infinitesimally invariant measure. \(\square \)
We next demonstrate that \(C^{1}_{c}(T\mathcal {M})\) is a core of the strong generator by showing that certain conditions established in Durmus et al. (2021) hold under the assumption that \(\mathcal {M}\) is compact. To do so we compactly approximate RT-RMHMC by a better-behaved PDMP, with characteristics \((\varphi ,\lambda ,Q^{\epsilon })\), which satisfies Assumption A3 of Durmus et al. (2021) and has a Feller transition semigroup \((P_{t})_{t \ge 0}\). We then use this approximation to show that RT-RMHMC is Feller and that \(C^{1}_{c}(T\mathcal {M})\) is a core of the strong generator of RT-RMHMC, whose transition semigroup \((P_{t})_{t \ge 0}\) is viewed as a semigroup on \(C_{0}(T\mathcal {M})\). Here \(C_{0}(T\mathcal {M})\) denotes the space of continuous functions \(f: T\mathcal {M} \rightarrow \mathbb {R}\) that vanish at infinity; it is a Banach space when equipped with the \(\Vert \cdot \Vert _{\infty }\) norm.
We first approximate our PDMP \((\varphi ,\lambda ,Q)\) (RT-RMHMC) with the PDMP with characteristics \((\varphi ,\lambda , Q^{\epsilon })\) in the sense that
where \(Q^{\epsilon }\) is constructed as a Markov kernel corresponding to a consistently truncated Gaussian distribution on each tangent space as follows.
Define \(G: \mathcal {M} \times \mathbb {R} \rightarrow \mathbb {R}\) by
where \(\psi (x)\) denotes the probability density function of the Gaussian distribution on \(T_{x}\mathcal {M}\) defined by \(\psi (x)(dv) \propto \exp {(-\frac{1}{2}v^{T}G(x)^{-1}v)}dv\), known as the Maxwellian distribution. Then we have that \(\nicefrac {\partial G}{\partial a} \ne 0\) due to the fact that \(G(x, \cdot )\) is strictly increasing. By the implicit function theorem there exists a unique continuously differentiable function \(M: \mathcal {M} \rightarrow \mathbb {R}\) such that \(G(x,M(x)) = 0\) for all \(x \in \mathcal {M}\). We define the transition kernel as follows:
where
is the truncated Maxwellian distribution. Then we have that for any \((x,v) \in T\mathcal {M}\) and \(A \in \mathcal {B}(T\mathcal {M})\)
Lemma 9
(Continuity of Semigroup) Let \((\mathcal {M},g)\) be a smooth Riemannian manifold, and let \(U \in C^{1}(\mathcal {M})\), so that \(H \in C^{1}(T\mathcal {M})\). Let \((P_{t})_{t \ge 0}\) be the transition semigroup of \((\varphi ,\lambda ,Q^{\epsilon })\); then
$$\begin{aligned} \lim _{t \rightarrow 0} P_{t}f(z) = f(z) \quad \text {for all } f \in C_{0}(T\mathcal {M}), \ z \in T\mathcal {M}. \end{aligned}$$
Proof
Let \((Z_{t})_{t \ge 0}\) denote a sample path of \((\varphi ,\lambda ,Q^{\epsilon })\). We have that
where \(S_{1}\) is the time of the first event and \(\varphi _{t}(z)\) is the solution of the Hamiltonian flow. If H is continuously differentiable everywhere then \(\varphi _{t}(z)\) is well defined for all \(t>0\), and \(\varphi _{t}(z) \rightarrow z\) as \(t \rightarrow 0\) (see for example Chicone (2006)[Theorem 1.186]). \(\square \)
Lemma 10
Let \((\mathcal {M},g)\) be a compact Riemannian manifold, and let \(U \in C^{1}(\mathcal {M})\). Let \((\varphi ,\lambda ,Q^{\epsilon })\) be the PDMP approximation of RT-RMHMC defined above. The set of all possible sample paths of \((\varphi ,\lambda ,Q^{\epsilon })\) with initial condition \((X_{0},V_{0})\) is contained in a compact set.
Proof
Let M(x) denote the continuous function in the definition of \(Q^{\epsilon }\) which controls the truncation of the Gaussian distribution. M(x) is a continuous function on a compact set and hence bounded by some \(M_{\epsilon }\). We further choose \(M_{\epsilon }\) such that \(|V_{0}\vert _{g} \le M_{\epsilon }\). Define the set
Since \(\mathcal {M}\) is compact, \(U_{\epsilon }\) is a compact subset of \(T\mathcal {M}\) by Lemma 13. H restricted to \(U_{\epsilon }\) is bounded by \(M_{H}\), as it is continuous on a compact set. Moreover, H is constant between event times of the PDMP, by the definition of the Hamiltonian flow. Therefore the values of the Hamiltonian along the PDMP \((X_{t},V_{t})\) are determined by its values at the event points \((X_{t_{i}},V_{t_{i}})\), for event times \(t_{i}\), \(i = 1,2,\ldots \), whose inter-event durations are \(\text {Exp}(\lambda )\)-distributed. At each event time \(t_{i}\) we have \((X_{t_{i}},V_{t_{i}}) \in U_{\epsilon }\), where \((X_{t_{i}},V_{t_{i}}) \sim Q^{\epsilon }(X_{t_{i}-},V_{t_{i}-}, \cdot )\). Therefore we can bound the Hamiltonian by \(M_{H}\) on \(\{(X_{t},V_{t})\mid t \ge 0 \}\). Now we have that
for all t. Therefore
which is compact by Lemma 13. \(\square \)
We have the following assumption from Durmus et al. (2021), which we use to establish Proposition 11.
Definition 3
Durmus et al. (2021)(Definition 16) We say that a homogeneous differential flow \(\varphi \) on \(T\mathcal {M}\) and a homogeneous Markov kernel Q on \(T\mathcal {M}\) are compactly compatible if for all compact sets \(K \subset T\mathcal {M}\) and \(T \ge 0\), there exists a compact set \(\tilde{K} \subset T \mathcal {M}\) satisfying: for all \(n \in \mathbb {N}^{*}\), \((t_{i})_{i \in \llbracket 1,n \rrbracket } \in \mathbb {R}^{n}_{+}, \sum ^{n}_{i=1}t_{i} \le T,\) there exists a sequence \((K_{i})_{i \in \llbracket 1,n \rrbracket }\) of compact sets of \(T\mathcal {M}\) such that, setting \(K_{0} = K,\)
1. for all \(i \in \llbracket 1, n \rrbracket ,\) \(K_{i}\) only depends on \((t_{j})_{j \in \llbracket 1,n \rrbracket }\) and \(\cup ^{n}_{i=0} K_{i} \subset \tilde{K}\);
2. for all \(i \in \llbracket 0, n-1 \rrbracket \), \(s_{i+1} \in [0,t_{i+1}]\) and \(s_{n+1} \in [0,T - \sum ^{n}_{j=1} t_{j}],\)
$$\begin{aligned}{} & {} \bigcup _{x \in K_{i}}\text {supp}\{Q(\varphi _{t_{i+1}}(x),\cdot ) \}\subset K_{i+1}, \ \varphi _{s_{i+1}}(K_{i}) \subset \tilde{K}, \\{} & {} \varphi _{s_{n+1}}(K_{n}) \subset \tilde{K}. \end{aligned}$$
Assumption 3
Durmus et al. (2021)(A3) The homogeneous characteristics \((\varphi ,\lambda ,Q)\) satisfy
1. the flow \(\varphi \) and the Markov kernel Q are compactly compatible;
2. \(\lambda \in C^{1}(T\mathcal {M})\) and for all \(f \in C^{1}(T\mathcal {M})\), \(\lambda Q^{\epsilon }f \in C^{1}(T\mathcal {M})\) and there exists a locally bounded function \(\Psi :T\mathcal {M} \rightarrow \mathbb {R}_{+}\) such that for all \(x \in K\),
$$\begin{aligned} \lambda \Vert \nabla (Q^{\epsilon }f)(x) \Vert\le & {} \Vert \Psi \Vert _{\infty ,K}\sup \{|f(y)\vert \\{} & {} +\Vert \nabla f(y) \Vert : y \in \text {supp}\{Q^{\epsilon }(x,\cdot ) \}\}; \end{aligned}$$
3. \((t,x) \mapsto \varphi _{t}(x) \in C^{1}(\mathbb {R}_{+} \times T\mathcal {M})\) and for all compact \(K \subset T\mathcal {M}\) and \(t \ge 0\),
$$\begin{aligned}\sup {\{\Vert \nabla \varphi _{s}(x)\Vert \mid s \in [0,t], x \in K \}} < +\infty . \end{aligned}$$
Proposition 11
(Feller and Core of Generator) Let \((P_{t})_{t \ge 0}\) be the transition semigroup of \((\varphi ,\lambda ,Q^{\epsilon })\) on \(T\mathcal {M}\), where \((\mathcal {M},g)\) is a compact smooth Riemannian manifold and \(\varphi \) is the Hamiltonian flow associated to the Hamiltonian \(H \in C^{2}(T\mathcal {M})\). Then, \((P_{t})_{t \ge 0}\) is Feller and \(C^{1}_{c}(T\mathcal {M})\) is a core for the strong generator of \((P_{t})_{t \ge 0}\) seen as a semigroup on \(C_{0}(T\mathcal {M})\).
Proof
If we show that \((\varphi ,\lambda ,Q^{\epsilon })\) satisfies Assumption 3, then by Durmus et al. (2021)[Theorem 17] \((P_{t})_{t \ge 0}\) satisfies the Feller property. Once the Feller property is established, by Lemma 9 and the fact that \(T\mathcal {M}\) is a complete metric space, Böttcher et al. (2013)[Lemma 1.4] yields strong continuity of \((P_{t})_{t \ge 0}\), so that \((P_{t})_{t \ge 0}\) is Feller. Furthermore, that \(C^{1}_{c}(T\mathcal {M})\) is a core for the strong generator of \((P_{t})_{t \ge 0}\), seen as a semigroup on \(C_{0}(T\mathcal {M})\), is a consequence of Durmus et al. (2021)[Theorem 17] and Ethier and Kurtz (1986)[Proposition 3.3, Chapter 1]. We now establish Assumption 3.
For any compact set \(K \subset T\mathcal {M}\), by compactness \(|v\vert _{g} \le M_{K}\) for some constant \(M_{K} \ge 0\) and for all v such that \((\cdot ,v) \in K\). Then, by the same argument as that of Lemma 10, but choosing \(M_{\epsilon }\) larger than \(M_{K}\), all PDMPs starting in K are contained in a compact set \(\tilde{K}\). We can define \(K_{0} = K\) and \(K_{i} = \tilde{K}\) for all \(i \ge 1.\) Then the flow \(\varphi \) and \(Q^{\epsilon }\) are compactly compatible and hence Assumption 3(1) holds.
We show Assumption 3(2) as follows. Trivially, \(\lambda \in C^{1}(T\mathcal {M})\). Since the metric is smooth and the truncated Gaussian distribution has a smooth transition kernel, we have \(Q^{\epsilon }f \in C^{1}(T\mathcal {M})\). First we note that
which is compact by Lemma 13. For all continuously differentiable functions \(f: T\mathcal {M} \rightarrow \mathbb {R}\), with \((x,y) \in T\mathcal {M}\), we define
Therefore it is sufficient to show that for all compact sets \(K \subset T \mathcal {M}\), and for all \((x,y) \in K\),
where \(\Psi : T \mathcal {M} \rightarrow \mathbb {R}\) is bounded on compact sets of \(T \mathcal {M}\). Define \(\Vert \cdot \Vert _{\infty ,M(x)} \equiv \Vert \cdot \Vert _{\infty ,B(0,M(x))}\). Since all functions considered are \(C^1\), and hence bounded on all compact sets of \(T \mathcal {M}\), we have for all \((x,y) \in T \mathcal {M}\) the following computation, which uses the dominated convergence theorem, Leibniz's integral rule and a spherical coordinate system:
where \(C,C_1,C_2\) and \(C_3\) are generic constants varying line by line, \(\nabla _{x}\) denotes the differential operator with respect to position on \(\mathcal {M}\), and we have bounded \(\nabla _{x}\psi \) uniformly on \(\{ (x,y) \mid x \in \mathcal {M}, |y\vert _{g} \le M(x) \} \subset T \mathcal {M}\). Therefore we have the required result by setting \(\Psi = C_{1} + C_{2}\). Finally we have to show Assumption 3(3), where we use the fact that \(\varphi \) is continuously differentiable when \(U \in C^{2}(\mathcal {M})\), and that for any compact set \(K \subset T \mathcal {M}\) we have \(|v\vert _{g} \le M_{K}\) for all \((x,v) \in K\). Then, by the same argument as that of Lemma 10, we can define a larger constant such that all PDMPs starting in K have bounded velocity and hence are contained in a compact set \(\tilde{K}\). Hence Assumption 3(3) holds by the fact that a continuous function on a compact set is bounded. \(\square \)
Theorem 12
(RT-RMHMC Feller and Core) Let \((P_{t})_{t \ge 0}\) be the transition semigroup of \((\varphi ,\lambda ,Q)\) on \(T\mathcal {M}\), where \((\mathcal {M},g)\) is a compact smooth Riemannian manifold and \(\varphi \) is the Hamiltonian flow associated to the Hamiltonian \(H \in C^{2}(T\mathcal {M})\). Then, \((P_{t})_{t \ge 0}\) is Feller and \(C^{1}_{c}(T\mathcal {M})\) is a core for the strong generator of \((P_{t})_{t \ge 0}\) seen as a semigroup on \(C_{0}(T\mathcal {M})\).
Proof
By construction of \(Q^{\epsilon }\) we have the property that
Using Durmus et al. (2021)[Theorem 11], Proposition 11, Durmus et al. (2021)[Theorem 17] and the same argument as Durmus et al. (2021)[Theorem 21] we have the required result. \(\square \)
Corollary 2
(Invariant measure for RT-RMHMC) Let \((P_{t})_{t \ge 0}\) be the transition semigroup of \((\varphi ,\lambda ,Q)\) on \(T\mathcal {M}\), where \((\mathcal {M},g)\) is a compact smooth Riemannian manifold and \(\varphi \) is the Hamiltonian flow associated to the Hamiltonian \(H \in C^{2}(T\mathcal {M})\). Let \(\mu \) be the measure on \((T\mathcal {M}, \mathcal {B}(T\mathcal {M}))\) given by
where \(d\lambda _{T\mathcal {M}}\) is the Liouville measure of \(T\mathcal {M}\). Then \(\mu \) is invariant for RT-RMHMC.
Appendix C: Proof of invariance and \(\mu \)-irreducibility for the Metropolized algorithm
Proof of Proposition 3
Let \(P_{1}\) be the Markov kernel corresponding to the first step. Resampling from the Gaussian measure on the tangent space leaves \(\pi _{\mathcal {H}}\) invariant, since \(\pi _{\mathcal {H}}\) has marginal \(\phi (x)\), and therefore leaves \(\mu \) invariant.
Let \(P_{2}\) be the Markov kernel corresponding to the second step (the combination of sampling the time duration, the deterministic step by \(\Psi \), and the Metropolis–Hastings accept-reject step). Let L be an arbitrary number of RATTLE steps. We will check that \(\mu \) is reversible with respect to \(P_{2}\), and hence also invariant.
\(P_{2}\) is reversible with respect to \(\mu \) if for every measurable bounded function \(f:T \mathcal {M} \times T \mathcal {M} \rightarrow \mathbb {R}\)
For \(P_2\) we have that \(P_{2}(z_{1},dz_{2} )\) is non-zero if and only if \(z_{2} = \Psi (z_{1} )\) or \(z_{2} = z_{1}\), hence we have that
Now let \(z_2 = \Psi (z_1),\) then due to the momentum reversal map N, we have that \(z_{1} = \Psi (z_{2}) = \Psi (\Psi (z_1 )),\) and by the volume preserving property of \(\Psi \) (preserving the Liouville measure), we have that
and using this property we have that the first part of the above sum can be written as
Now considering the second part of the sum: through the change of variables \(z_{1} = z_{2}\), and combining this with the above result, we obtain the required equality. Therefore \(\mu \) is reversible with respect to \(P_{2}\). Taking f to be an indicator function, we obtain invariance of \(\mu \) under \(P_{2}\). Since this calculation was independent of time, \(\mu \) is invariant with respect to the Markov kernel of the algorithm. \(\square \)
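For reference, the reversibility condition verified above is the standard detailed-balance identity, written here in the proof's notation (this display is our own summary, not a formula from the paper):

```latex
\int_{T\mathcal{M}} \int_{T\mathcal{M}} f(z_{1},z_{2})\,
  \mu(\mathrm{d}z_{1})\, P_{2}(z_{1},\mathrm{d}z_{2})
\;=\;
\int_{T\mathcal{M}} \int_{T\mathcal{M}} f(z_{2},z_{1})\,
  \mu(\mathrm{d}z_{1})\, P_{2}(z_{1},\mathrm{d}z_{2}).
```

Because \(P_{2}(z_{1},\cdot )\) charges only \(\{\Psi (z_{1})\}\) and \(\{z_{1}\}\), the left-hand side splits into an acceptance part, handled by the facts that \(\Psi \) is self-inverse (via the momentum reversal N) and preserves the Liouville measure, and a rejection part, where \(z_{2} = z_{1}\) and the identity is immediate.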
Proof of Theorem 5
Based on Brubaker et al. (2012)[Theorem 3]. Fix \(\Delta t > 0\) sufficiently small such that our assumption holds. A measurable set \(A \subset \mathcal {M}\) of positive measure is contained in a compact set K, which can be covered by \(\{B_{r/2}(x) \mid x \in K \}\). Then for some \(x' \in K\), \(B_{r/2}(x') \cap A\) has positive measure. We can connect x and \(x'\) by a sequence of points \(x_{0},\ldots ,x_{i},\ldots ,x_{n}\) for \(0 \le i \le n\), defined on the geodesic between \(x_{0} = x\) and \(x_{n} = x'\), such that \(d(x_{i},x_{i+1}) \le r/2\). By Theorem 4 we can find unique \(v_{0},\ldots ,v_{n}\) such that \((x_{i+1},v_{i+1}) = \Psi ^{L}_{\Delta t}(x_{i},v_{i})\). We have that
due to Theorem 4 and the fact that \(\phi (x_{i})(v_{i}) > 0 \). Considering the final step, by the triangle inequality we have \(|x_{n-1} - \tilde{x}\vert < r\) for all \(\tilde{x} \in B_{r/2}(x') \cap A\). Hence, by the same reasoning and Theorem 4, \(K(x_{n-1}, \tilde{x}) > 0\) for all \(\tilde{x} \in B_{r/2}(x') \cap A\). Using the facts that \(K(x_{i},x_{i+1}) > 0\) for all \(0 \le i \le n-2\) and that \(K(x_{n-1}, \tilde{x}) > 0\) for all \(\tilde{x} \in B_{r/2}(x') \cap A\), we have \(K^{n}(x,\tilde{x}) > 0\) for all \(\tilde{x} \in B_{r/2}(x') \cap A\) and
\(\square \)
Appendix D: Additional results
Lemma 13
Let \((\mathcal {M},g)\) be a smooth k-dimensional Riemannian manifold, let \(K \subset \mathcal {M}\) be compact and let \(R \in C^{1}(\mathcal {M})\) such that \(R(x)>0\) for all \(x \in \mathcal {M}\), then the set
is a compact subset of \(T\mathcal {M}\).
Lemma 13 can be proved by showing that the embedding of \(\tilde{K}_{R}\) is closed and bounded. Closure follows by showing that the limit of any convergent sequence is contained in \(\tilde{K}_{R}\), using the derivative of the local parametrisation as defined in Guillemin and Pollack (1974)[Page 50].
Appendix E: Integrated autocorrelation and ESS
If the MCMC method converges quickly, the variance \(\sigma ^{2}({\overline{f}})\) of the estimator is small. From the central limit theorem we know that as \(N \rightarrow \infty \),
and hence
where the quantity a is known as the asymptotic variance. We have the following result
where \(\sigma ^{2}(f)\) is the variance of f under the distribution \(\pi \) and is independent of the MCMC scheme used (see (Casella 2004)[Chapter 12] for an in depth study).
We also have
which is known as the integrated autocorrelation (IAC). If all samples are independent, then \(\tau _{f} = 1\). Generally, MCMC schemes generate correlated samples, and the larger the value of \(\tau _{f}\), the more correlated the samples are; the closer \(\tau _{f}\) is to 1, the higher the quality of the MCMC samples produced. Note that we use \(X^{\bullet }\) to denote the random variables in a Markov chain and \(X_{\bullet }\) to denote the outputs of an MCMC scheme. In the following numerics we approximate the IAC by a Monte Carlo method; that is, we create a finite chain \(\{f_{n} \}^{N}_{n=1} = \{f(X_{n}) \}^{N}_{n=1}\) from the MCMC schemes we want to test. We estimate
where
and
We have that
for some sufficiently large M such that \(M \ll N\). Note that in practice one uses a fast Fourier transform to calculate \(c_{f}(\cdot )\), as it is much more computationally efficient.
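As an illustration of the FFT approach, the following Python sketch estimates the empirical autocovariances \(c_{f}(k)\) and the truncated IAC from a scalar chain. The function names and the fixed truncation lag are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def autocov_fft(f):
    """Empirical autocovariances c_f(k), k = 0, ..., N-1, via an FFT.
    Zero-padding to a power of two >= 2N-1 avoids circular wrap-around."""
    f = np.asarray(f, dtype=float)
    n = len(f)
    fc = f - f.mean()
    m = 1 << (2 * n - 1).bit_length()
    F = np.fft.rfft(fc, m)
    return np.fft.irfft(F * np.conj(F), m)[:n] / n

def iac(f, M=100):
    """Estimate tau_f = 1 + 2 * sum_{k=1}^{M} c_f(k)/c_f(0), with M << N."""
    acov = autocov_fft(f)
    rho = acov / acov[0]
    M = min(len(f) - 1, M)
    return 1.0 + 2.0 * float(np.sum(rho[1:M + 1]))
```

For an i.i.d. chain the estimate is close to 1, while a positively correlated chain (e.g. an AR(1) process with coefficient 0.9, for which \(\tau _{f} = 19\)) gives a value well above 1.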
We now define an additional metric of quality of samples known as effective sample size (ESS) which is defined as
for a sample of size N. This metric expresses that a sample of size N from an MCMC algorithm has the efficiency of \(N_{\text {eff}}\) independent samples for computing the Monte Carlo average of f.
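A minimal sketch of the ESS computation, using the truncated-sum IAC estimator described above with direct sums (the helper name and the test chains are illustrative):

```python
import numpy as np

def ess(f, M=100):
    """Effective sample size N_eff = N / tau_f, with tau_f estimated by a
    truncated sum of empirical autocorrelations."""
    f = np.asarray(f, dtype=float)
    n = len(f)
    fc = f - f.mean()
    c0 = np.dot(fc, fc) / n
    tau = 1.0 + 2.0 * sum(np.dot(fc[:n - k], fc[k:]) / (n * c0)
                          for k in range(1, M + 1))
    return n / tau

rng = np.random.default_rng(0)
iid = rng.standard_normal(20000)       # independent samples: N_eff close to N
ar = np.empty(20000)                   # AR(1), a = 0.9: tau = (1+a)/(1-a) = 19
ar[0] = 0.0
for t in range(1, 20000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the i.i.d. chain \(N_{\text {eff}} \approx N\), while the correlated AR(1) chain carries roughly N/19 independent samples' worth of information.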
Whalley, P.A., Paulin, D. & Leimkuhler, B. Randomized time Riemannian Manifold Hamiltonian Monte Carlo. Stat Comput 34, 48 (2024). https://doi.org/10.1007/s11222-023-10303-6