1 Introduction and motivation

Efficient sampling of high dimensional probability distributions is required for Bayesian inference and is a challenge in many fields including biological modelling (Wilkinson 2007), economic modelling (Greenberg 2012), machine learning with large data sets (Pakman et al. 2017; Barber 2012) and molecular dynamics (Perez et al. 2015). A popular approach is Markov chain Monte Carlo, which defines a Markov chain \(X_{i+1} \sim p(\cdot \mid X_{i})\) with invariant measure \(\mu \) and from which we may estimate expected values via the relation \(\mathbb {E}_{X \sim \mu } f(X) \approx \frac{1}{N}\sum ^{N}_{i=1}f(X_{i})\); however, convergence of such averages can be slow for high dimensional and multimodal distributions (see e.g. Quiroz et al. (2018)). Recent attempts to address this problem include the local bouncy particle sampler of Bouchard-Côté et al. (2018) and the Zig-Zag process of Bierkens et al. (2019). These methods can be viewed as piecewise deterministic Markov processes (PDMPs); see Vanetti et al. (2018). Randomized Hamiltonian Monte Carlo (RHMC), proposed in Bou-Rabee and Sanz-Serna (2017) and further studied in Deligiannidis et al. (2021), evolves a Hamiltonian flow for a duration drawn from an exponential distribution. In standard HMC the choice of integration time is a challenging task (see Hoffman and Gelman 2014) and mixing can be inefficient for some choices of integration time. By contrast, RHMC does not suffer from this problem, as randomization of the duration prevents periodicities. This strategy has been studied from both analytic and numerical perspectives in Bou-Rabee and Sanz-Serna (2017). Other recent algorithms build on this idea (for example Riou-Durand and Vogrinc (2022) and Kleppe (2022)). We remark that RHMC is a special case of Andersen dynamics, which is popular in the molecular dynamics literature (see (Bou-Rabee and Eberle 2022)[Remark 2.2] and Andersen (1980)). Andersen dynamics has been studied in Bou-Rabee and Eberle (2022); Weinan and Li (2008) and Li (2007).

The algorithms discussed above are targeted to sampling from distributions in Euclidean space. The need to work with Riemannian manifolds is motivated by applications where constraints are imposed from modelling considerations or are introduced in order to restrict sampling to a relevant subdomain derived from statistical analysis (see (Brubaker et al. 2012)). Examples of manifolds include products of spheres or orthogonal matrices, which arise in applications in protein configuration modelling with the Fisher-Bingham distribution (Hamelryck et al. 2006), texture analysis using distributions over rotations (Kunze and Schaeben 2004) and fixed-rank matrix factorization for collaborative filtering (Meyer et al. 2011; Salakhutdinov and Mnih 2008). Methods that sample from probability distributions on manifolds have been studied in Hartmann (2008); Brubaker et al. (2012); Byrne and Girolami (2013); Girolami and Calderhead (2011); Lelièvre et al. (2012); Zappa et al. (2018); Diaconis et al. (2013); Lee and Vempala (2018); Lelièvre et al. (2019) and Laurent and Vilmart (2021). In this article, we focus on manifolds defined by algebraic constraints. To maintain the constraints, in practice one must perform projections at each step of the algorithm, an additional overhead relative to Euclidean MCMC algorithms.

In this paper we propose the Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC) method, an RHMC scheme for Riemannian manifolds. We establish invariance of the desired measure, under a compactness assumption, for the continuous-time (small stepsize limit) PDMP version of our method, where the algorithm is rejection free. Our proof of invariance is based on the PDMP framework established in Durmus et al. (2021): we construct an approximation of RT-RMHMC by consistently truncating the velocity distribution and verify all the conditions required in Durmus et al. (2021) for the approximation; we have not found such a construction technique in the literature. Further, we demonstrate invariance of the discretized method with Metropolis-Hastings adjustment and prove its ergodicity. We show in numerical experiments that this method has improved robustness, demonstrating for example that the convergence rate is relatively flat in the choice of mean time parameter; these results mirror those obtained for the Euclidean version of the method. Moreover, we compare RT-RMHMC to a constrained underdamped Langevin integrator (g-BAOAB) introduced in Leimkuhler and Matthews (2016).

To our knowledge, there is no theoretical or numerical treatment of RHMC in the manifold setting, and there has been no theoretical treatment of Riemannian Hamiltonian Monte Carlo methods in continuous time. We provide a first result establishing invariance for a continuous-time Riemannian Hamiltonian Monte Carlo method in the compact setting. A biased RHMC method was recently introduced (see (Kleppe 2022)) whose event rates depend on the position in the state space; such state-dependent event rates can be incorporated into our Riemannian RHMC framework when the framework is unadjusted. We note that in the appendix of that article, a version of RHMC is introduced in the setting of adapting the metric for sampling on Euclidean space, but not for working on a Riemannian manifold.

The remainder of this article is organized as follows. In the next section we describe the algorithm and establish invariance in the continuous-time setting under a compactness assumption. Section 3 considers the numerical implementation with and without a Metropolis test. Section 4 establishes invariance of the stationary distribution for the discretized algorithm and ergodicity of the method with Metropolis-Hastings adjustment. Section 5 discusses numerical experiments and Section 6 gives some thoughts on future developments. We include several appendices addressing the generator, the invariance of the target measure and the irreducibility of the scheme, from which ergodicity follows.

2 Algorithm

Let \((\mathcal {M},g)\) be a d-dimensional Riemannian manifold and \(T\mathcal {M}\) denote its tangent bundle. Let G(x) denote the positive definite matrix associated to the metric g at \(x \in \mathcal {M}\). Consider a target distribution on \(\mathcal {M}\) with density

$$\begin{aligned} \pi _{\mathcal {H}}(x) = \frac{1}{Z_{\mathcal {M}}}\exp {(- U_{\mathcal {H}}(x))}, \end{aligned}$$

with respect to \(\sigma _{\mathcal {M}}(dx)\), the surface measure (Hausdorff measure) of \(\mathcal {M}\) defined by \(\sigma _{\mathcal {M}}(dx) = \sqrt{\det {G(x)}}dx\) and \(Z_{\mathcal {M}} = \int _{\mathcal {M}}\exp {(- U_{\mathcal {H}}(x))}\sigma _{\mathcal {M}}(dx)\), which we assume to be finite. Consider an extension of the distribution to \(T\mathcal {M}\) as

$$\begin{aligned} \mu (dz) = \frac{1}{Z_{T\mathcal {M}}} \exp {(-H(x,v))}\lambda _{T \mathcal {M}}(dz), \end{aligned}$$
(1)

where \(\lambda _{T\mathcal {M}}(dz)\) is the Liouville measure of \(T\mathcal {M}\), H is defined by

$$\begin{aligned} H(x,v)&= U_{\mathcal {H}}(x) + \frac{1}{2}v^{T}G(x)^{-1}v \nonumber \\&= U(x) + \frac{1}{2}\log {\{(2\pi )^{d}\det {G(x)}\}} + \frac{1}{2}v^{T}G(x)^{-1}v, \end{aligned}$$
(2)

for \((x,v) \in T\mathcal {M}\) and \(Z_{T\mathcal {M}} = \int _{T\mathcal {M}}\exp {(-H(x,v))}\lambda _{T \mathcal {M}}(dz)\), which is finite when \(Z_{\mathcal {M}}\) is. We have that

$$\begin{aligned}\mu (dz) = \pi _{\mathcal {H}}(x)\sigma _{\mathcal {M}}(dx)\psi (x)(dv), \end{aligned}$$

where \(\psi (x)(dv)\) is simply the Gaussian measure on \(T_{x}\mathcal {M}\) given by

$$\begin{aligned} \psi (x)(dv) = \frac{1}{\sqrt{(2\pi )^{d}\det {G(x)}}}\exp {\left( -\frac{1}{2}v^{T}G(x)^{-1}v\right) } \sigma _{T_{x}\mathcal {M}}(dv), \end{aligned}$$

in local coordinates, and \(\sigma _{T_{x}\mathcal {M}}(dv)\) is the Lebesgue measure on \(T_{x}\mathcal {M}\). In particular, \(\mu \) has marginal distribution \(\pi _{\mathcal {H}}\) with respect to the Hausdorff measure (Girolami and Calderhead 2011; Byrne and Girolami 2013; Lelièvre et al. 2010[Section 3.3.2]).

We will define a stochastic process which is a Riemannian version of the Randomized Hamiltonian Monte Carlo of Bou-Rabee and Sanz-Serna (2017). The stochastic process follows constrained Hamiltonian dynamics for a time duration \(t \sim \exp {(\lambda )}\), an exponential random variable with rate \(\lambda > 0\), before an event. This event is a random velocity refreshment from the distribution \(\psi (x)\).

Algorithm 1 defines Randomized time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC) with rate parameter \(\lambda > 0\), and Hamiltonian dynamics governed by the Hamiltonian \(H(x,v) = U_{\mathcal {H}}(x) + \frac{1}{2}v^{T}G(x)^{-1}v\) defined on \(T\mathcal {M}\). This stochastic process has invariant measure \(\mu (dz) \propto \exp {(-H(z))}\) with respect to the Liouville measure on \(T\mathcal {M}\).

Algorithm 1 RT-RMHMC
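As a minimal illustration of Algorithm 1, the following Python sketch shows its event-driven structure; `hamiltonian_flow` (an exact flow map on the tangent bundle) and `sample_tangent_gaussian` (a sampler for \(\psi (x)\)) are hypothetical placeholders to be supplied for the problem at hand.

```python
import numpy as np

def rt_rmhmc(x0, hamiltonian_flow, sample_tangent_gaussian, lam, n_events, rng=None):
    """Sketch of RT-RMHMC (Algorithm 1), assuming the Hamiltonian flow is exact.

    hamiltonian_flow(x, v, t)  -- hypothetical exact flow map on T M
    sample_tangent_gaussian(x) -- hypothetical sampler for the Gaussian psi(x) on T_x M
    lam                        -- event rate lambda > 0 of the exponential durations
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    samples = []
    for _ in range(n_events):
        v = sample_tangent_gaussian(x)        # velocity refreshment from psi(x)
        t = rng.exponential(scale=1.0 / lam)  # random duration t ~ Exp(lambda)
        x, v = hamiltonian_flow(x, v, t)      # follow Hamiltonian dynamics for time t
        samples.append(x)
    return np.array(samples)
```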

To sample from a distribution \(\pi \) with respect to the Hausdorff measure we define \(U_{\mathcal {H}} = - \log {\pi }\) under the assumption that \(\pi \) is integrable on \(\mathcal {M}\).

We can define the generator for this stochastic process as

$$\begin{aligned} \mathcal {L}f(z) = X_{H}(f(z)) + \lambda (Qf(z) - f(z)), \end{aligned}$$
(3)

where

$$\begin{aligned} Qf(x,v):= \frac{1}{\sqrt{(2\pi )^{d}\det {G}(x)}}\int _{T_{x}\mathcal {M}} \exp {\left( -\frac{1}{2}\xi ^{T}G(x)^{-1}\xi \right) }f(x,\xi ) d\xi , \end{aligned}$$

is the transition kernel for a completely randomized velocity refreshment according to a Gaussian distribution on the tangent space \(T_{x}\mathcal {M}\), and \(X_{H}\) is the Hamiltonian vector field associated to H, which may be written \(X_{H} = \left( \frac{\partial H}{\partial v_{i}},-\frac{\partial H}{\partial x_{i}}\right) \) in local coordinates. We prove in Appendix A that \(\mathcal {L}\) is the generator of this stochastic process, and in Appendix B that the measure is invariant under a compactness assumption. Our main theoretical result about Algorithm 1 is the following.

Corollary 1

(Invariant measure for RT-RMHMC) Let \((P_{t})_{t \ge 0}\) be the transition semigroup of a simulation of Algorithm 1 with characteristics \((\varphi ,\lambda ,Q)\) on \(T\mathcal {M}\) and Hamiltonian \(H \in C^{2}(T\mathcal {M})\), where \((\mathcal {M},g)\) is a compact smooth Riemannian manifold and \(\varphi \) is the Hamiltonian flow associated to the Hamiltonian. Let \(\mu \) be the measure on \((T\mathcal {M}, \mathcal {B}(T\mathcal {M}))\) given by

$$\begin{aligned} \mu (dz) \propto e^{-H(x,v)}d\lambda _{T \mathcal {M}}(z),\end{aligned}$$

where \(d\lambda _{T\mathcal {M}}\) is the Liouville measure of \(T\mathcal {M}\). Then \(\mu \) is invariant for RT-RMHMC.

3 Constrained symplectic integrator and Metropolis-Hastings adjustment

In this section, we present more broadly implementable versions of Algorithm 1 that are applicable when the Hamiltonian dynamics cannot be solved exactly. We start with a brief introduction to Lagrangian and Hamiltonian dynamics with constraints based on Lee et al. (2017)[Chapter 3].

Consider manifolds \(\mathcal {M}\) embedded in \(\mathbb {R}^{d}\) that can be described by algebraic equations

$$\begin{aligned} \mathcal {M}:=\{x \in \mathbb {R}^{d} \mid c_{i}(x) = 0, i = 1,...,m \} \subset \mathbb {R}^{d}, \end{aligned}$$

where \(c_{i}: \mathbb {R}^{d} \rightarrow \mathbb {R}\), \(i= 1,...,m\), are continuously differentiable functions with linearly independent gradients for all \(x \in \mathcal {M}\).

We refer to such a submanifold as an algebraic constraint manifold. The Euler-Lagrange equations can be expressed as an orthogonal projection of the Euler-Lagrange equations in \(\mathbb {R}^{d}\) onto the constraint manifold; hence we have

$$\begin{aligned} \frac{d}{dt}\left( \frac{\partial L(x,{\dot{x}})}{\partial {\dot{x}}}\right) - \frac{\partial L(x,{\dot{x}})}{\partial x} + \sum ^{m}_{i=1}\lambda _{i}\frac{\partial c_{i}(x)}{\partial x} = 0, \end{aligned}$$

where \(\lambda _{i}\) are Lagrange multipliers for each of the constraints. We can then define an augmented Lagrangian function \(L^{a}:T\mathcal {M} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) by \(L^{a}(x,{\dot{x}},\lambda ) = L(x,{\dot{x}}) - \sum ^{m}_{i=1} \lambda _{i}c_{i}(x)\). Then the Euler-Lagrange equations can be expressed as

$$\begin{aligned} \frac{d}{dt}\left( \frac{\partial L^{a}(x,{\dot{x}},\lambda )}{\partial {\dot{x}}}\right) - \frac{\partial L^{a}(x,{\dot{x}},\lambda )}{\partial x} = 0, \end{aligned}$$

and we define the augmented Hamiltonian function \(H^{a}:T^{*}\mathcal {M} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) via the Legendre transform \(H^{a}(x,\mu ,\lambda ) = \mu \cdot {\dot{x}} - L^{a}(x,{\dot{x}},\lambda )\); we therefore obtain Hamilton's equations (see Hartmann 2007)

$$\begin{aligned} {\dot{x}} = \frac{\partial H^{a}(x,\mu ,\lambda )}{\partial \mu } \qquad {{\dot{\mu }}} = -\frac{\partial H^{a}(x,\mu ,\lambda )}{\partial x}. \end{aligned}$$

We next introduce a new formulation of RT-RMHMC for constraint manifolds which we will use for numerical simulation. Note that a constraint manifold Hamiltonian Monte Carlo method was introduced in Brubaker et al. (2012), but with a deterministic duration parameter. We will use the same notation as that used in Brubaker et al. (2012) to introduce randomized time into this algorithm.

Let us denote our constraints by \(c(x):= (c_{1}(x),...,c_{m}(x))^{T}\) and let \(C(x) = \frac{\partial c}{\partial x}\) denote the Jacobian of the constraints, which we assume to have full rank everywhere. Define the Hamiltonian of the constrained system as \(H(x,v) = U_{\mathcal {H}}(x) + K(v)\), where \(K(v) = \frac{1}{2}v^{T}G(x)^{-1}v\) is the kinetic energy and v lies in the cotangent space \(\mathcal {T}^{*}_{x}\mathcal {M} = \{ v \mid C(x) \frac{\partial H}{\partial v}(x,v) = 0\}\). The dynamics of the constrained system in terms of the Hamiltonian is thus given by

$$\begin{aligned} {\dot{v}} = -\frac{\partial H}{\partial x} + C(x)^{T}\lambda ,\qquad {\dot{x}} = \frac{\partial H}{\partial v},\qquad \text {such that } c(x) = 0, \end{aligned}$$

where we remark that we can naturally identify the tangent and cotangent spaces and bundles.

We let \(\pi _{\mathcal {H}}\) be our target measure with respect to the Hausdorff measure and \(U_\mathcal {H}(x) = -\log \pi _{\mathcal {H}}(x)\) be the potential energy of our constrained system. We can then simulate the constrained Hamiltonian dynamics using the RATTLE scheme (Andersen 1983). However, if we know \(\pi _{\mathcal {H}}\) explicitly we can avoid computation of the metric tensor by assuming our system is isometrically embedded in Euclidean space. Under this assumption we can then consider Algorithm 2, which is an explicit algorithm for simulation of Randomized time constrained Hamiltonian Monte Carlo (RT-CHMC). We will discuss and justify the embedding assumption further in Sect. 3.1.

Algorithm 2 RT-CHMC

In Algorithm 2 we sample from the Gaussian distribution on the tangent space at a point of \(\mathcal {M}\). We can do this by sampling a vector whose components are independent standard normal random variables and projecting it orthogonally. To orthogonally project a momentum vector onto \(T^{*}\mathcal {M}\), and thus correctly resample the momentum in Algorithm 2 at \(x \in \mathcal {M}\), we apply the projector

$$\begin{aligned}P_{\mathcal {M}}(x):= I - C(x)^T (C(x)C(x)^T)^{-1} C(x).\end{aligned}$$

Proposition 1

If \(v' \sim \mathcal {N}(0,I)\) then \(v = P_{\mathcal {M}}(x)v'\) is distributed according to \(v \sim \mathcal {N}(0,I \mid C(x)v = 0)\).

Proof

Can be found in Graham et al. (2022). \(\square \)
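On a constraint manifold this refreshment can be realised in a few lines of numpy; the sketch below assumes a user-supplied Jacobian function `C` and is a direct transcription of the projector \(P_{\mathcal {M}}\) and Proposition 1.

```python
import numpy as np

def project_tangent(C_x, v):
    """Orthogonal projection of v onto {v : C(x) v = 0}."""
    # Solve (C C^T) w = C v and subtract C^T w,
    # i.e. apply P = I - C^T (C C^T)^{-1} C without forming P explicitly.
    w = np.linalg.solve(C_x @ C_x.T, C_x @ v)
    return v - C_x.T @ w

def sample_tangent_gaussian(C, x, rng):
    """Sample N(0, I | C(x) v = 0) by projecting a standard normal (Proposition 1)."""
    v_prime = rng.standard_normal(x.shape)
    return project_tangent(C(x), v_prime)
```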

3.1 Embedded manifolds

We next introduce the theory of manifold embeddings as it was presented in Byrne and Girolami (2013) to show that numerical simulation of RT-CHMC is in fact simulation of RT-RMHMC on constraint manifolds.

If we know the form of the distribution \(\pi _{\mathcal {H}}\) with respect to the Hausdorff measure, then we can avoid computing the metric tensor and sidestep the absence of a global coordinate system (Byrne and Girolami 2013). We achieve this using isometric embeddings, remarking that every Riemannian manifold can be isometrically embedded in Euclidean space by the Nash embedding theorem (Nash 1956). If we have an isometric embedding \(\xi : \mathcal {M} \rightarrow \mathbb {R}^{n}\), then considering a path q(t) on \(\mathcal {M}\), the path \(x(t) = \xi (q(t))\) satisfies \({\dot{x}}_{i}(t) = \sum _{j} \frac{\partial x_{i}}{\partial q_{j}} {\dot{q}}_{j}(t)\). The phase space (q, p), where \({\dot{q}} = G^{-1}p,\) can then be transformed to the embedded phase space (x, v), where

$$\begin{aligned} v = {\dot{x}} = XG(q)^{-1}p = X(X^{T}X)^{-1}p, \text { where } X_{ij} = \frac{\partial x_{i}}{\partial q_{j}}, \end{aligned}$$

since \(G = X^{T}X\), because the embedding is isometric and preserves inner products (see (Byrne and Girolami 2013)). Now the Hamiltonian (Eq. 2) is

$$\begin{aligned} H(x,v) = -\log { \pi _{\mathcal {H}}(x)} + \frac{1}{2}v^{T}v, \end{aligned}$$

in terms of coordinates (x, v). When sampling the velocities in Algorithm 1 and Algorithm 2, since \(p \sim \mathcal {N}(0,G(q))\), we have

$$\begin{aligned} v \sim \mathcal {N}(0,X(X^{T}X)^{-1}X^{T}), \end{aligned}$$

where \(X(X^{T}X)^{-1}X^{T}\) is the orthogonal projection onto the tangent space of the embedded manifold (Byrne and Girolami 2013). Therefore we can sample from \(\mathcal {N}(0,I)\) and project onto the tangent space to obtain the required sample. The Hamiltonian is thus expressed in a form which is independent of the metric (provided we know the density with respect to the Hausdorff measure). We next introduce the RATTLE scheme for numerical integration on a manifold (Leimkuhler and Reich 2004)[Chapter 7]:

$$\begin{aligned} v_{n+1/2}&= v_{n} -\frac{\Delta t}{2}\nabla _{x}U(x_{n}) - \frac{\Delta t}{2}C(x_{n})^{T} \lambda ^{n}_{(r)},\\ x_{n+1}&= x_{n} + \Delta t\, v_{n+1/2}, \qquad c(x_{n+1}) = 0,\\ v_{n+1}&= v_{n+1/2} -\frac{\Delta t}{2}\nabla _{x}U(x_{n+1}) - \frac{\Delta t}{2} C(x_{n+1})^{T} \lambda ^{n+1}_{(v)}, \qquad C(x_{n+1})v_{n+1} = 0, \end{aligned}$$

where we solve for \(\lambda ^{n}_{(r)}\) and \(\lambda ^{n+1}_{(v)}\) at each iteration so that the iterates lie in the tangent bundle. We solve for \(\lambda ^{n}_{(r)}\) (a non-linear system of equations) by cycling through the constraints, adjusting one multiplier at each iteration. Denote by \(C_{i}\) the ith row of C. We first initialize

$$\begin{aligned} Q:= {\overline{x}}_{n+1} = x_{n} + \Delta t v_{n} - \frac{\Delta t^{2}}{2} \nabla _{x}U(x_{n}). \end{aligned}$$

Next we cycle through the list of constraints one after another as follows: for each \(i = 1,...,m\) compute

$$\begin{aligned} \Delta \Lambda _{i}:= \frac{c_{i}(Q)}{C_{i}(Q)C_{i}(x_{n})^{T}}, \end{aligned}$$

and update Q by \(Q:= Q - C_{i}(x_{n})^{T}\Delta \Lambda _{i}\) until \(|c_{i}(Q)|<tol\) for all \(i =1,...,m\), where tol is a prescribed tolerance. Then we set \(x_{n+1} = Q\) and have \(x_{n+1} \in \mathcal {M}\) within the tolerance. (Note that other stopping criteria could be used; see (Ortega and Rheinboldt 2000).) We solve for \(\lambda ^{n+1}_{(v)}\) by solving the linear system:

$$\begin{aligned} \left( C(x_{n+1})C(x_{n+1})^{T} \right) \lambda ^{n+1}_{(v)} = C(x_{n+1})\left( \frac{2}{\Delta t} v_{n+1/2} - \nabla _{x}U(x_{n+1}) \right) . \end{aligned}$$

Once the linear system has been solved we obtain \((x_{n+1}, v_{n+1}) \in T^{*}\mathcal {M}\).
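The following numpy sketch assembles one RATTLE step from the pieces above; `grad_U`, `c` and `C` (the potential gradient, constraint function and constraint Jacobian) are user-supplied callables, and the position constraints are solved by the coordinate-wise iteration just described.

```python
import numpy as np

def rattle_step(x, v, dt, grad_U, c, C, tol=1e-10, max_iter=100):
    """One RATTLE step on the constraint manifold c(x) = 0 (a sketch)."""
    Cx = C(x)
    # Drift with half kick, then project back onto c(x) = 0 by the SHAKE-type loop.
    Q = x + dt * v - 0.5 * dt**2 * grad_U(x)
    for _ in range(max_iter):
        if np.max(np.abs(c(Q))) < tol:
            break
        for i in range(Cx.shape[0]):            # cycle through the constraints
            dlam = c(Q)[i] / (C(Q)[i] @ Cx[i])  # one-multiplier update (Delta Lambda_i)
            Q = Q - Cx[i] * dlam
    x_new = Q
    v_half = (x_new - x) / dt                   # recover v_{n+1/2} from the drift
    # Half kick at the new position, then project the velocity onto the tangent
    # space, which is equivalent to the linear solve for lambda_(v) above.
    v_new = v_half - 0.5 * dt * grad_U(x_new)
    Cn = C(x_new)
    w = np.linalg.solve(Cn @ Cn.T, Cn @ v_new)
    return x_new, v_new - Cn.T @ w
```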

Theorem 2

Let \(\mathcal {M}\) be a constraint manifold and let \(H \in C^{2}(T\mathcal {M})\). The RATTLE numerical integrator of the Hamiltonian system defined by H on \(T\mathcal {M}\) is symmetric, symplectic and of order 2. Further, it respects the manifold constraints.

Proof

Given in Leimkuhler and Skeel (1994). \(\square \)

3.2 Metropolis-Hastings adjustment

Let \(\Psi ^{L}_{\Delta t}: T\mathcal {M} \rightarrow T\mathcal {M}\) be the numerical integrator defined by L steps of RATTLE with stepsize \(\Delta t\), which approximates the Hamiltonian dynamics. For theoretical purposes we also define the map \(N: T\mathcal {M} \rightarrow T \mathcal {M}\) which negates the momentum, i.e. \(N(x,v) \equiv (x,-v)\). This leaves the Hamiltonian invariant and, since the momentum is resampled, has no effect on the samples from \(\pi _{\mathcal {H}}\). We define the following Metropolized RT-RMHMC: we sample \(T \sim \exp {(\lambda )}\) and fix a maximum stepsize \(\Delta t_{\max {}}\) below the stability threshold of the numerical integrator. We then choose the number of leapfrog steps L to be \(\lceil T/\Delta t_{\max {}} \rceil \) and set \(\Delta t = T/L \le \Delta t_{\max }\). At each step we perform L RATTLE steps with stepsize \(\Delta t\). We propose this method of discretisation, rather than purely randomizing the stepsize with a fixed number of leapfrog steps, to avoid numerical instabilities in the integrator. One could also fix a stepsize within the numerical stability threshold of the integrator and simply sample a geometrically distributed number of leapfrog steps to randomize the time. However, our proposed method relates more closely to the continuous dynamics without the issues caused by numerical instabilities.
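A minimal sketch of this duration-to-stepsize bookkeeping, assuming \(\Delta t_{\max }\) has been chosen below the integrator's stability threshold:

```python
import math
import numpy as np

def randomized_steps(lam, dt_max, rng):
    """Draw T ~ Exp(lambda) and split it into L RATTLE steps of size dt <= dt_max."""
    T = rng.exponential(scale=1.0 / lam)
    L = max(1, math.ceil(T / dt_max))  # number of leapfrog/RATTLE steps
    dt = T / L                         # stepsize, guaranteed <= dt_max
    return L, dt
```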

Fig. 1 Ratio of \(10^6\) samples which do not satisfy the reversibility check, for different choices of \(\Delta t\), for the BVMF distribution with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\)

Remark 1

For large choices of stepsize \(\Delta t\) it has been shown that \(\Psi ^{L}_{\Delta t}\) is not reversible when RATTLE is used to integrate on the manifold; see (Lelièvre et al. 2019; Zappa et al. 2018). Lelièvre et al. (2019) propose to combat this by incorporating a reversibility check into the Metropolis-Hastings adjustment, although in practice such checks may be neglected in favor of an implicit assumption that \(\Delta t\) is sufficiently small to avoid non-reversibility issues. We investigate this further in Sect. 5.

In light of Remark 1, we include \(\text {Rev}(\cdot )\) as an additional (optional) accept-reject condition which implements a reversibility check (following (Lelièvre et al. 2019)). In numerical experiments we examine the stepsize threshold at which the reversibility condition fails (see Fig. 1).

Algorithm 3 RT-CHMC with Metropolis-Hastings step

Remark 2

Our framework can be adapted to handle inequality constraints by incorporating an additional rejection condition in the Metropolis-Hastings step, which rejects samples that violate the inequality constraints.

This will be used in our application in Section 5.4 to impose a half-normal prior on some dimensions of our Bayesian model.

4 Ergodicity

We now prove exact invariance of the desired measure and ergodicity for the discrete-time algorithm with Metropolis-Hastings adjustment. We establish ergodicity under two assumptions, following the same technique as Brubaker et al. (2012) and restating some of their results.

Proposition 3

Assume \(\Psi ^{L}_{\Delta t}\) is the RATTLE integrator with \(\Delta t \le \Delta t_{\max }\) for some \(\Delta t_{\max } > 0\). Then \(\mu \) is invariant with respect to the Markov kernel proposed in Algorithm 3.

Proof

See Section C of the Appendix. \(\square \)

Assumption 1

Let \(\mathcal {M} = \{x \in \mathbb {R}^{n} \mid c(x) = 0 \}\) be a connected, smooth Riemannian manifold. We assume that \(\nicefrac {\partial c}{\partial x}\) has full rank everywhere.

Assumption 2

Let \(\mathcal {M}\) be a Riemannian manifold which satisfies Assumption 1. For \(x \in \mathcal {M}\) we define \(\mathcal {B}_{r}(x) = \{x' \in \mathcal {M} \mid d(x',x) \le r \}\) to be the geodesic ball of radius r about x. We assume that there exists an \(r > 0\) such that for every \(x \in \mathcal {M}\) and \(x' \in \mathcal {B}_{r}(x)\) there exists a unique choice of Lagrange multipliers and velocities \(v \in T_{x}\mathcal {M}\), \(v' \in T_{x'}\mathcal {M}\) for which \((v',x') = \Psi ^{L}_{\Delta t}(v,x)\) for sufficiently small \(\Delta t\).

Theorem 4

(Accessibility) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumption 1 holds. For any \(x_{0}, x_{1} \in \mathcal {M}\) and \(\Delta t\) sufficiently small, there exist finite \(v_{0} \in T \mathcal {M}\), \(v_{1} \in T \mathcal {M}\) and Lagrange multipliers \(\lambda _0\), \(\lambda _1\) such that \((v_1, x_1 ) = \Psi _{\Delta t}(v_{0},x_{0}).\)

Proof

Found in Brubaker et al. (2012)[Theorem 2] and is an extension of the results of Marsden and West (2001)[Theorem 2.1.1] and Hairer et al. (2006)[Theorem 5.6, Section IX.5.2]. \(\square \)

Theorem 5

(\(\mu \)-irreducible) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumptions 1 and 2 hold. Then for any \(x \in \mathcal {M}\) and any measurable set \(A \subset \mathcal {M}\) with positive measure, there exists an \(n \in \mathbb {N}\) such that

$$\begin{aligned}K^{n}(x,A) > 0, \end{aligned}$$

where K denotes the marginal transition kernel defined on \(\mathcal {M}\) of Algorithm 3.

Proof

See Section C of the Appendix. \(\square \)

Lemma 6

(Aperiodic) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumptions 1 and 2 hold. Then Algorithm 3 is aperiodic.

Proof

Proof given in Brubaker et al. (2012)[Lemma 1]. \(\square \)

Theorem 7

(Ergodicity) Let \(U \in C^{2}(\mathcal {M})\) and suppose Assumptions 1 and 2 hold. Then for \(\mu \)-almost all starting values x,

$$\begin{aligned}\lim _{t \rightarrow \infty } \int _{\mathcal {M}}|K^{t}(x,y) - \pi _{\mathcal {H}}(y)| \sigma _{\mathcal {M}}(dy) = 0. \end{aligned}$$

Proof

Since Algorithm 3 is \(\mu \)-invariant by Proposition 3, \(\mu \)-irreducible by Theorem 5 and aperiodic by Lemma 6, the required result holds by Tierney (1994)[Theorem 1]. \(\square \)

5 Numerical results

We perform numerical simulations of the RT-CHMC algorithm and compare to the CHMC algorithm of Brubaker et al. (2012); Girolami and Calderhead (2011), specifically exploring the underlying dynamics of the two processes. MCMC schemes are used to approximate expected values of certain functions f over some distribution with probability density function \(\pi \)

$$\begin{aligned} \mathbb {E}_{\pi }(f) = \int f(x) \pi (x)dx, \end{aligned}$$

where we can estimate this quantity using our MCMC scheme by

$$\begin{aligned} \mathbb {E}_{\pi }(f) \approx {\overline{f}}:= \frac{1}{M} \sum ^{M}_{i=1} f(X^{i}), \end{aligned}$$

where \((X^{i})\) is the Markov chain produced by our MCMC method. We quantify the convergence rate associated with approximating \(\mathbb {E}_{\pi }(f)\) by considering the integrated autocorrelation function and effective sample size.
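As a rough sketch of this diagnostic, the IAC \(\tau \approx 1 + 2\sum _{k=1}^{M}{\hat{\rho }}_{k}\) and the effective sample size \(N/\tau \) can be estimated from a scalar chain as follows, truncating the autocorrelation sum at a maximum lag M as we do in our experiments:

```python
import numpy as np

def iac_and_ess(f_chain, max_lag):
    """Estimate the integrated autocorrelation time and ESS of a scalar chain."""
    f = np.asarray(f_chain, dtype=float)
    f = f - f.mean()
    n = len(f)
    var = np.dot(f, f) / n
    # Empirical autocorrelations rho_k for lags k = 1, ..., max_lag.
    rho = np.array([np.dot(f[:n - k], f[k:]) / ((n - k) * var)
                    for k in range(1, max_lag + 1)])
    tau = 1.0 + 2.0 * np.sum(rho)  # integrated autocorrelation time
    return tau, n / tau            # (IAC, effective sample size)
```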

5.1 g-BAOAB

As a comparison method we implemented the g-BAOAB integrator of Leimkuhler and Matthews (2016), a numerical integrator for constrained underdamped Langevin dynamics. Constrained underdamped Langevin dynamics can be described by

$$\begin{aligned} {\dot{x}}&= v, \\ 0&= C(x)v,\\ {\dot{v}}&= -\nabla _{x} U(x) - \gamma v + \sqrt{2\gamma }R(t) - C(x)^{T}\lambda ,\\ 0&= c(x) , \end{aligned}$$

where \(\gamma \) is a friction coefficient and R(t) is a vector-valued, stationary, zero-mean Gaussian process. The numerical integrator g-BAOAB is a splitting method for such dynamics, which uses constrained integration steps similar to those of RT-CHMC. We note that g-BAOAB is a biased sampling algorithm due to the error in the numerical integrator. For a full description of g-BAOAB and a discussion of the sampling error we refer to Leimkuhler and Matthews (2016).

5.2 Computational cost

The cost of the propagation steps is essentially the same as for g-BAOAB. In situations where there are relatively few constraints compared to the dimension of the parameter space, the constraint cost should be affordable. This holds even in complex applications like molecular dynamics of proteins, where the constraint solver rarely introduces a cost greater than a few percent of the total cost of, say, long-ranged force evaluations (Xie et al. 2000). Obviously the practicality of the scheme will depend on the problem itself. The cost of solving the constraints using the semi-explicit SHAKE/RATTLE or geodesic integrator steps is expected to be lower than the cost of integrating the equations of CHMC in the general metric setting (using unconstrained ODEs), since the nonseparable Hamiltonian structure demands implicit symplectic integration.

5.3 Test examples

We next provide examples of distributions on implicitly defined manifolds embedded in Euclidean space, with the distributions defined with respect to the Hausdorff measure of the manifold. We will consider two types of constraint manifolds: spheres and Stiefel manifolds.

5.3.1 Bingham-Von Mises-Fisher distribution on \(S^{n}\)

The first test case is the Bingham-Von Mises-Fisher (BVMF) distribution defined on the \(n-\)dimensional sphere embedded in \(\mathbb {R}^{n+1}\), that is \(S^{n}:= \{x \in \mathbb {R}^{n+1} \mid \sum ^{n+1}_{i=1}x^{2}_{i} = 1 \}\). The BVMF distribution is the exponential family on \(S^{n} \subset \mathbb {R}^{n+1}\) with density of the form

$$\begin{aligned} \pi _{\mathcal {H}}(x) \propto \exp {\{ c^{T}x + x^{T}Ax\}}, \end{aligned}$$

where \(c \in \mathbb {R}^{n+1}\) and \(A \in M_{n+1}(\mathbb {R})\) is a symmetric matrix.
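To make the test concrete, the ingredients RT-CHMC needs for this target — potential, gradient, constraint and constraint Jacobian in the embedded coordinates of \(\mathbb {R}^{n+1}\) — can be written in a few lines; a minimal numpy sketch (the factory name `make_bvmf` is our own illustrative choice):

```python
import numpy as np

# BVMF ingredients on the sphere c(x) = |x|^2 - 1 = 0, in embedded coordinates.
def make_bvmf(A, c_vec):
    U = lambda x: -(c_vec @ x + x @ A @ x)          # U_H = -log pi_H (up to a constant)
    grad_U = lambda x: -(c_vec + 2.0 * A @ x)       # uses symmetry of A
    constraint = lambda x: np.array([x @ x - 1.0])  # c(x) = 0 defines S^n
    jacobian = lambda x: 2.0 * x[None, :]           # C(x) = dc/dx, a 1 x (n+1) matrix
    return U, grad_U, constraint, jacobian
```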

We compare the integrated autocorrelation (IAC) of \(-\log \pi _{\mathcal {H}}\) of the RT-CHMC method to that of the CHMC method introduced in Brubaker et al. (2012) for a number of distributions with parameters defined in the captions. We also compare the maximum IAC of \(x_{i}\) for \(i = 1,...,n\) to assess the worst-case mixing over all dimensions. We compare methods by setting the mean duration \(\lambda ^{-1}\) of RT-CHMC to be the deterministic duration parameter of CHMC (running the dynamics for this duration before momentum randomization). We then compute the IAC of \(-\log \pi _{\mathcal {H}}\) and \(x_{i}\) for \(i = 1,...,n\) for the two methods for varying choices of \(\lambda \) by a Monte Carlo averaging procedure as described in Appendix E. Regarding the reversibility issue for large choices of stepsize (as discussed in Section 3.2): for the geometries and distributions chosen, the behaviour is as in Fig. 1, where there is a dramatic change in reversibility failure for a small change in step-size; below this point all generated samples satisfy the reversibility conditions. We simply chose stepsizes below this threshold in our simulations.

The results are presented in Fig. 2. We choose the stepsize in RATTLE to be \(\Delta t = 0.01\) and sample \(N = 1,000,000\) events with a burn-in of \(10\%\) of the samples before computing the Monte Carlo average. We also use lags of up to \(M = N/50\), i.e. 2 percent of the number of samples, to estimate the IAC. As our choice of \(\Delta t\) is small, the acceptance rate is high, so this process is close to the continuous version; the IAC thus compares the efficiency of the continuous processes.

Fig. 2 IAC estimates for different choices of \(\lambda \) for the BVMF distribution with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\), averaged over 20 independent runs. Left: IAC of \(-\log \pi _{\mathcal {H}}\). Right: maximum IAC over \(x_{1},x_{2}\) and \(x_{3}\)

Remark 3

Due to the symmetry in the \(x_3\)-coordinate (\(x_3 \mapsto -x_3\)) of the BVMF distribution with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\), the distribution is bimodal. However, in practice we treat it as unimodal, since the probability of transitioning between modes is extremely small. This can be seen in Fig. 3, as the second mode is at \(\theta = \pi \). Our dynamics stay near the mode at (0, 0, 1) and do not visit the other mode in any of our simulations.

Fig. 3 Contour plot of \(-\log \pi _{\mathcal {H}}\) for the BVMF distribution with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\); the axes are a 2D parameterisation of \(S^2\). The points are 2000 samples after an 8000-sample burn-in. Upper left: RT-CHMC for \(\lambda ^{-1} = 0.09\). Upper right: CHMC for \(\lambda ^{-1} = 0.09\). Lower left: RT-CHMC for \(\lambda ^{-1} = 0.1\). Lower right: CHMC for \(\lambda ^{-1} = 0.1\)

Fig. 4 Gradient evaluations per ESS for \(-\log \pi _{\mathcal {H}}\) for the BVMF distribution using 100,000 samples with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\), for varying choices of step-size. Upper left: CHMC. Upper right: RT-CHMC. Lower left: g-BAOAB. Lower right: g-BAOAB and RT-CHMC

Fig. 5 Maximum gradient evaluations per ESS over \(x_1, x_2\) and \(x_3\) for the BVMF distribution using 100,000 samples with parameters \(A = \text {diag}(-1000,0,1000)\) and \(c = (100,0,0)\), for varying choices of step-size. Upper left: CHMC. Upper right: RT-CHMC. Lower left: g-BAOAB. Lower right: g-BAOAB and RT-CHMC

In our first example (Fig. 2) we see that the quality of samples varies erratically with the duration parameter when a deterministic duration is used, and is nearly uniform across a wide interval for a randomized duration with the same expected value. This is illustrated in Fig. 3, where a small change in the duration parameter dramatically slows convergence of the dynamics due to very slow mixing. The fact that CHMC behaves erratically for large mean duration parameters may not be very surprising to some readers, as the theoretical convergence bound for HMC without randomization requires a limit on the duration T (see (Mangoubi and Smith 2017)). This is because when T is set too large, the coupling argument breaks down.

We next compare efficiencies using the metric of gradient evaluations per effective sample size, which tells us the number of gradient evaluations needed for one independent sample in estimating our observables. We compare this metric for varying choices of step-size up to the point where the reversibility condition is broken and the numerical integrator becomes unstable. Our observables are \(-\log \pi _{\mathcal {H}}\) and \(x_{i}\) for \(i = 1,...,n\). We see in Figs. 4 and 5 that CHMC (with deterministic time) exhibits the same behaviour as in Fig. 2 for all choices of step-size, which is not the case for RT-CHMC. We next compare the efficiency of the method with the g-BAOAB constrained Langevin integrator: we find in Figs. 4 and 5 that g-BAOAB outperforms RT-CHMC for large choices of the friction parameter \(\gamma \) (for this example \(\gamma = 50\)). g-BAOAB has no Metropolis-Hastings adjustment and hence is a biased sampling method; the bias in the samples creates errors in computed observables. For large choices of \(\gamma \), this bias is dramatically reduced, but use of high friction may slow convergence for metastable systems.

To explore this we next consider a bimodal distribution from Byrne and Girolami (2013) in Fig. 6. The figure shows that g-BAOAB incurs bias for large stepsizes and converges slowly for large choices of the friction parameter on this metastable system, and that this is not the case for RT-CHMC. In Fig. 4 we choose step-sizes up to which the integrator is reversible and stable.

Fig. 6 Monte Carlo average of \(-\log \pi _{\mathcal {H}}\) for the BVMF distribution with parameters \(A = \text {diag}(-20,-10,0,10,20)\) and \(c = (40,0,0,0,0)\) with \(N = 5 \times 10^6\) samples. Left: 10 step RT-CHMC. Middle: g-BAOAB with \(\gamma = 2\). Right: g-BAOAB with \(dt = 0.01\)

The efficiency of the methods with their optimal choices of parameters is comparable, but RT-CHMC is much less sensitive to the choice of parameters (stepsize and number of leapfrog steps) than CHMC, so RT-CHMC is more reliable from this point of view. This is important as it is hard to know an appropriate choice of parameters a priori, and the integration length between samples might have to be arbitrarily small for CHMC to be efficient.

5.3.2 Von Mises-Fisher distribution on \(\mathbb {V}_{d,p}\)

Definition 1

The Stiefel manifold \(\mathbb {V}_{d,p}\) is the set of \(d \times p\) matrices X such that \(X^{T}X = I\).

Stiefel manifolds arise in many statistical problems, as discussed in Byrne and Girolami (2013). Applications include dimensionality reduction such as is used in factor analysis, principal component analysis (Jolliffe 2002) and directional statistics (Mardia et al. 2000). They are a generalisation of the orthogonal groups. The von Mises-Fisher distribution on Stiefel manifolds is defined by the density

$$\begin{aligned} p_{vMF}(X) \propto \exp {(Tr(F^{T}X))} = \exp {(\langle f_{1},x_{1} \rangle +... + \langle f_{p},x_{p}\rangle )}, \end{aligned}$$

where \(x_{i}\) and \(f_{i}\) are the columns of X and F respectively. We compute IAC estimates for two example distributions for varying duration parameters. In the simulations we use a stepsize of \(\Delta t = 0.01\) and 100,000 samples in each IAC estimate. The results are shown in Fig. 7, and a sketch of the target's ingredients follows below. We see similar behaviour to the simpler example on the sphere: in both examples it is clear that CHMC is much more sensitive to the mean duration, and hence to the stepsize and number of leapfrog steps. We note that \(\text {Skew}(2,-45,-4)\) denotes the 3 by 3 skew-symmetric matrix with upper triangular entries \(2,-45\) and \(-4\).
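The corresponding ingredients for the Stiefel case can be sketched along the same lines, treating X as a \(d \times p\) matrix and taking the independent entries of \(X^{T}X - I\) as the constraints (the helper name `make_vmf` is illustrative; the constraint Jacobian can be assembled analogously):

```python
import numpy as np

# vMF ingredients on the Stiefel manifold V_{d,p}: X^T X = I.
def make_vmf(F):
    U = lambda X: -np.trace(F.T @ X)  # U_H = -log p_vMF (up to a constant)
    grad_U = lambda X: -F             # gradient in the embedded coordinates
    def constraint(X):                # independent entries of X^T X - I
        G = X.T @ X - np.eye(X.shape[1])
        return G[np.triu_indices(X.shape[1])]
    return U, grad_U, constraint
```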

Fig. 7 IAC estimates of \(-\log \pi _{\mathcal {H}}\) for different choices of \(\lambda \) for the VMF distribution on O(3) and \(\mathbb {V}_{18,3}\), averaged over 20 independent runs. Left: VMF distribution on O(3) with parameters \(F = A:= \text {Skew}(2,-45,-4)\). Right: VMF distribution on \(\mathbb {V}_{18,3}\) with parameters \(F = [I,-A,I,-A,I,-A]^{T}\)

5.4 High dimensional covariance estimation

In many statistical applications analysing high dimensional data sets it is necessary to estimate covariance matrices. This can be challenging when the number of dimensions is larger than the number of data points, as the sample covariance estimator performs poorly in such cases. Lam (2020) provides a review of high-dimensional covariance estimation and applications in principal component analysis (Shen et al. 2016), cosmological data analysis (Joachimi 2017) and finance (Lam 2016). Lam (2020) focuses on the setup where the matrix dimension is diverging or even larger than the sample size. In this setting one needs to estimate the population covariance matrix \(\Sigma \) of a set of n, \(p-\)dimensional data vectors, which we assume are drawn from an underlying distribution.

One estimator is the sample covariance matrix, defined by \(\Sigma _{S} = \nicefrac {1}{n}\sum ^{n}_{k=1}({\textbf {x}}_{k} - \overline{{\textbf {x}}})({\textbf {x}}_{k} - \overline{{\textbf {x}}})^{T}\), where \(\overline{{\textbf {x}}} = \nicefrac {1}{n}\sum ^{n}_{k = 1}{\textbf {x}}_{k}\) is the sample mean. However, this is a poor estimator of \(\Sigma \) when p is large compared to the sample size n (due to rank deficiency). A way to combat this is to consider regularized covariance matrix estimators which incorporate structural assumptions on the covariance matrix \(\Sigma \).

One such method assumes the structure of a low rank matrix plus a sparse matrix (see (Ross 1976) and Lam (2020)). This structure is known as a spiked covariance structure and has been studied in Bouchard et al. (2020); Lam (2020) and Cai et al. (2015). There have been interesting applications to finance (Fan et al. 2008), chemometrics (Kritchman and Nadler 2008), and astronomy (Joachimi 2017). The covariance matrix \(\Sigma \) is assumed to be expressible in the form

$$\begin{aligned} \Sigma = X D_{1} X^{T} + D_2, \end{aligned}$$

where \(X \in \mathbb {V}_{p,m}\) is a \(p \times m\) matrix with orthonormal columns for \(p \gg m\), and \(D_{i}\) for \(i = 1, 2\) are diagonal matrices of dimensions \(m \times m\) and \(p \times p\) respectively. The motivation for this structure is the assumption that lower dimensional variables \({\textbf {y}}_{i}\) can describe the data \({\textbf {x}}_{i}\) such that \({\textbf {x}}_{i} = X{\textbf {y}}_{i} + \varvec{\epsilon }_{i}\), where X is a \(p \times m\) matrix with orthogonal columns. We have that

$$\begin{aligned} \Sigma = X\Sigma _y X^{T} + \Sigma _{\epsilon }, \end{aligned}$$

which we interpret as a low rank matrix (rank m) plus a sparse matrix. We take \(\Sigma _y\) and \(\Sigma _{\epsilon }\) to be diagonal, which is an approximation of the spiked covariance structure (Chamberlain and Rothschild 1982).

Assume a uniform prior on X with respect to the Hausdorff measure on \(\mathbb {V}_{p,m}\). Further assume a half-normal prior on the diagonal entries of \(D_{i}\) for \(i = 1,2\) to ensure positive definiteness. We also consider the following likelihood for the covariance estimation

$$\begin{aligned} \mathcal {L}(\Sigma \mid {\textbf {x}}_{1},...,{\textbf {x}}_{n} ) = (2\pi )^{-np/2} \prod ^{n}_{i=1} \det {(\Sigma )}^{-1/2} \exp {\left( -\frac{1}{2}({\textbf {x}}_{i} - \overline{{\textbf {x}}})^{T}\Sigma ^{-1}({\textbf {x}}_{i} - \overline{{\textbf {x}}})\right) }. \end{aligned}$$

We introduce the posterior distribution \(p(\Sigma \mid {\textbf {x}}_{1},...,{\textbf {x}}_{n}) \propto \mathcal {L}( \Sigma \mid {\textbf {x}}_{1},...,{\textbf {x}}_{n}) p(\Sigma )\), where \(p(\Sigma ) = p(X)p(D_{1})p(D_{2})\) for \(X \sim \mathcal {U}(\mathbb {V}_{p,m})\), \(D_{1_{jj}} \sim \mathcal {N}_{+}(0,\sigma ^{2}_{1})\) for \(j = 1,...,m\) and \(D_{2_{jj}} \sim \mathcal {N}_{+}(0,\sigma ^{2}_{2})\) for \(j = 1,...,p\), where \(\mathcal {N}_{+}\) denotes the half-normal distribution. Define the potential \(U: \mathbb {V}_{p,m} \times \mathbb {R}^{m} \times \mathbb {R}^{p} \rightarrow \mathbb {R}\) by \(U(X,{\textbf {d}}_{1},{\textbf {d}}_{2}) = -\log {\mathcal {L}( \Sigma (X,{\textbf {d}}_{1},{\textbf {d}}_{2}) \mid {\textbf {x}}_{1},...,{\textbf {x}}_{n})} -\log {p(X)} - \log {p({\textbf {d}}_{1})}-\log {p({\textbf {d}}_{2})}\) with forces given by

$$\begin{aligned} \frac{\partial U}{\partial X_{ij}} = \frac{\partial U}{\partial \Sigma _{kl}}\frac{\partial \Sigma _{kl}}{\partial X_{ij}} = \left( \frac{1}{2}n(\Sigma ^{-1})^{T}_{kl} + \frac{1}{2}\sum ^{n}_{r = 1}({\textbf {x}}_{r} - \overline{{\textbf {x}}})^{T}B^{kl}({\textbf {x}}_{r} - \overline{{\textbf {x}}})\right) \frac{\partial \Sigma _{kl}}{\partial X_{ij}}, \end{aligned}$$
Fig. 8 Covariance estimates of astronomical data of 60-dimensional data vectors. \(\Sigma \) is the true covariance using all 2000 data points. \(\Sigma _{MAP}\) is a maximum a posteriori covariance estimate and \({{\hat{\Sigma }}}\) is a posterior expectation estimate of \(\Sigma \) using 40 data points. \({{\hat{\Sigma }}}^{-1}\) is a posterior expectation estimate of \(\Sigma ^{-1}\) using 40 data points. The posterior expectation estimate uses priors of \(\sigma _1 = \sigma _2 = 2\) after normalisation and 500,000 samples from RT-CHMC. a): \(\Sigma \). b): \(\Sigma _{MAP}\). c): \({{\hat{\Sigma }}}\). d): \(\ln {|\Sigma ^{-1}|}\). e): \(\ln {|\Sigma _{MAP}^{-1}|}\). f): \(\ln {|{{\hat{\Sigma }}}^{-1}|}\)

where \([B^{kl}]_{ij} = -(\Sigma ^{-1})_{ik}(\Sigma ^{-1})_{lj}\) and

$$\begin{aligned} \frac{\partial \Sigma _{kl}}{\partial X_{ij}} = {\left\{ \begin{array}{ll} X_{kj}D^{1}_{jj} &{} k \ne i, l = i,\\ D^{1}_{jj}X_{lj} &{} k = i, l \ne i, \\ 2X_{ij}D^{1}_{jj} &{} k = i, l = i,\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

We also have that

$$\begin{aligned} \frac{\partial U}{\partial d_{i_{j}}} = \frac{\partial U}{\partial \Sigma _{kl}}\frac{\partial \Sigma _{kl}}{\partial d_{i_{j}}} + \frac{ d_{i_{j}}}{\sigma ^{2}_{i}}, \end{aligned}$$

for \(i = 1,2\) and where

$$\begin{aligned} \frac{\partial \Sigma _{kl}}{\partial d_{1_{j_1}}} = X_{kj_1}X_{lj_1} \quad \text {and} \quad \frac{\partial \Sigma _{kl}}{\partial d_{2_{j_2}}} = {\left\{ \begin{array}{ll} 1 &{} \text {if } k = l = j_2\\ 0 &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

for \(j_1 = 1,...,m\) and \(j_2 = 1,...,p\). We will use the likelihood and its gradient for implementing our RT-CHMC algorithm for such models.
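As an illustration, a numpy sketch of the resulting potential \(U\) (up to additive constants) follows, assuming the data vectors have already been centred and that the half-normal support constraint is handled by the rejection step of Remark 2 rather than inside \(U\):

```python
import numpy as np

def potential(X, d1, d2, xs, sigma1, sigma2):
    """U = -log posterior (up to a constant) for Sigma = X diag(d1) X^T + diag(d2).

    xs -- centred data matrix of shape (n, p); positivity of d1, d2 is
    enforced separately by the rejection condition of Remark 2.
    """
    n = xs.shape[0]
    Sigma = X @ np.diag(d1) @ X.T + np.diag(d2)
    _, logdet = np.linalg.slogdet(Sigma)
    # sum_i x_i^T Sigma^{-1} x_i via a single linear solve.
    quad = np.sum(xs * np.linalg.solve(Sigma, xs.T).T)
    neg_log_like = 0.5 * n * logdet + 0.5 * quad
    neg_log_prior = 0.5 * np.sum(d1**2) / sigma1**2 + 0.5 * np.sum(d2**2) / sigma2**2
    return neg_log_like + neg_log_prior
```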

5.4.1 Covariance estimation for cosmological data

We consider an application of high dimensional covariance estimation in cosmological data analysis introduced in Joachimi (2017) and discussed in Lam (2020). The data is taken from Joachimi (2017) and consists of covariances of two-point correlation functions of cosmic weak lensing. It is simulated using coupled log-normal random fields from angular power spectra. For further information we refer the reader to Joachimi (2017). We test this method using 2n/3 data vectors, where n is the dimension of the data vectors; we are therefore in the setting where the dimension of the covariance matrix is larger than the number of samples. For our low rank plus sparse structure we choose \(m = p/6 \ll p\), and to ensure fast convergence we normalize the data entry-wise and initialize our Markov chain via an eigenvalue decomposition of the sample covariance. We initialize our Markov chain as \(\Sigma _{0} = XD_{1}X^{T} + D_{2} \approx \Sigma _{S}\), using the sample covariance matrix \(\Sigma _{S}\), where \(D_2\) is the diagonal of \(\Sigma _{S}\) and \(XD_{1}X^{T}\) corresponds to the eigenvalue decomposition of \(\Sigma _{S} - D_{2}\) truncated to the m largest eigenvalues. After the covariance of the normalized data is estimated it can easily be rescaled to match the real data via entry-wise multiplication with the outer product of the entry-wise standard deviations. We compare our method to a maximum a posteriori (MAP) estimate of the covariance matrix, which uses a simple constrained gradient descent algorithm with Lagrange multipliers to ensure that the “low-rank plus sparse” structure is maintained. We compare the Bayesian and MAP approaches using a relative Frobenius norm and a covariance metric introduced by Förstner and Moonen (2003) which is defined by

Table 1 A comparison using three metrics between the MAP estimate and the posterior expectation estimate using 500,000 samples from RT-CHMC
Fig. 9 A comparison between the log of the absolute error and the log of the posterior standard deviations in each component. Left: \(\ln |\Sigma - {{\hat{\Sigma }}}|\). Right: \(\ln {\Sigma _{SD}}\)

$$\begin{aligned} d(A,B) = \sqrt{\sum ^{n}_{i=1} \ln ^{2}{\lambda _{i}(A,B)}}, \end{aligned}$$

where A and B are covariance matrices and \(\lambda _{i}(A,B)\) are the generalized eigenvalues from \(\textrm{det}(\lambda A - B)= 0\). As pointed out in Förstner and Moonen (2003), this covariance metric is affine invariant and invariant to inversion.
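This metric reduces to a generalized symmetric eigenvalue problem; a minimal sketch using scipy, assuming A and B are symmetric positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def forstner_distance(A, B):
    """Förstner-Moonen distance between SPD covariance matrices A and B."""
    # Generalized eigenvalues lambda_i solving det(lambda * A - B) = 0,
    # i.e. the eigenvalues of B relative to A.
    lam = eigh(B, A, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```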

Note that in Fig. 8 we have not included the sample estimate with 40 samples because the sample covariance matrix is rank deficient and hence cannot be inverted. We notice from Fig. 8 and Table 1 that the MAP and posterior expectation estimates perform well when estimating the covariance matrix using only 40 data points, but lose accuracy under inversion. The posterior expectation computed by sampling seems to retain more structure when inverted and provides a more accurate estimate according to the metric of Förstner and Moonen (2003).

It is clear from Table 1 that using the posterior means does not sacrifice accuracy compared to using the MAP estimators. An additional benefit of the Bayesian approach is that we can compute posterior standard deviations for each component of the covariance estimator, which gives error estimates; this is illustrated in Fig. 9. By comparing these standard deviations with the covariance estimates, we can get a sense of the relative error we are making. This information can be useful when deciding on the number of data samples needed to reach a satisfactory level of accuracy in estimating the covariance matrix. Since in practice we do not have access to the true covariance matrix, there is no straightforward way to compute error estimates based on the MAP estimator, and it is challenging to assess whether we have reached sufficient accuracy.

Further, it is clear from Fig. 10 that CHMC still suffers from the robustness issues that were present in the lower-dimensional examples, while RT-CHMC remains robust in this application.

Fig. 10 IAC estimates for different choices of \(\lambda \) for the Bayesian posterior estimation of the upper left quadrant of the covariance matrix for CHMC and RT-CHMC, averaged over 20 independent runs

6 Conclusion and future work

In this work we have introduced Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC), a robust alternative to the Riemannian manifold Hamiltonian Monte Carlo methods introduced by Girolami and Calderhead (2011) and Brubaker et al. (2012). We establish invariance of the desired measure under a compactness assumption in the continuous (small stepsize limit) setting. We provide a Metropolis-adjusted version of RT-RMHMC in the discrete setting and prove invariance and ergodicity of the adjusted discretized algorithm. We show that RT-RMHMC is a more robust method with respect to parameter choice on a number of numerical examples arising in applications, and provide an example demonstrating that our Riemannian manifold sampling method can be used for high-dimensional covariance estimation. We expect the stability with respect to the choice of parameters to be especially valuable in poorly conditioned problems, where RMHMC would require very short time steps for stability, which may lead to random walk behaviour and highly inefficient mixing in some principal directions.

In terms of future developments for RT-RMHMC, the next step would be to establish invariance of the measure in the non-compact setting and, further, to establish (geometric) ergodicity of RT-RMHMC, as already established in Bou-Rabee and Sanz-Serna (2017) for the Euclidean setting. One could then seek optimal choices of integration parameters and step-size. Another possibility would be to establish mixing time guarantees for RT-RMHMC by a coupling argument as in Bou-Rabee et al. (2020) and Mangoubi and Smith (2018). Mangoubi and Smith (2018) establish rapid mixing guarantees for a geodesic walk algorithm on manifolds with positive curvature, which is RMHMC for the uniform distribution; one may be able to use a similar coupling argument to guarantee mixing times for RT-RMHMC on manifolds with positive curvature.

The C code for RT-CHMC and CHMC for each application is available at https://github.com/PAWhalley/Randomized-Time-Riemannian-Manifold-Hamiltonian-Monte-Carlo.