Randomized Time Riemannian Manifold Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) algorithms which combine numerical approximation of Hamiltonian dynamics on finite intervals with stochastic refreshment and Metropolis correction are popular sampling schemes, but it is known that they may suffer from slow convergence in the continuous time limit. A recent paper of Bou-Rabee and Sanz-Serna (Ann. Appl. Prob., 27:2159-2194, 2017) demonstrated that this issue can be addressed by simply randomizing the duration parameter of the Hamiltonian paths. In this article, we use the same idea to enhance the sampling efficiency of a constrained version of HMC, with potential benefits in a variety of application settings. We demonstrate both the conservation of the stationary distribution and the ergodicity of the method. We also compare the performance of various schemes in numerical studies of model problems, including an application to high-dimensional covariance estimation.


Introduction and Motivation
Efficient sampling of high-dimensional probability distributions is required for Bayesian inference and is a challenge in many fields, including biological modelling ([1]), economic modelling ([2]), machine learning with large data sets ([3,4]) and molecular dynamics ([5]). A popular approach is Markov chain Monte Carlo (MCMC), which defines a Markov chain X_{i+1} ∼ p(· | X_i) with invariant measure µ and from which we may estimate expected values via E_{X∼µ}[f(X)] ≈ (1/N) Σ_{i=1}^N f(X_i); however, convergence of such averages can be slow for high-dimensional and multimodal distributions (see e.g. [6]). Recent attempts to address this problem include the local bouncy particle sampler of [7] and the Zig-Zag process of [8]; these methods can be viewed as piecewise deterministic Markov processes (PDMPs), see [9]. Randomized Hamiltonian Monte Carlo (RHMC), proposed in [10] and further studied in [11], evolves a Hamiltonian flow for a duration drawn from an exponential distribution. In standard HMC the choice of integration time is a challenging task (see [12]), and mixing can be inefficient for some choices of integration time. By contrast, RHMC does not suffer from this problem, as randomization of the duration prevents periodicities. This strategy has been studied from both analytic and numerical perspectives in [10], and other recent algorithms build on the same idea (for example [13] and [14]).
The algorithms discussed above are targeted at sampling from distributions in Euclidean space. The need to work with Riemannian manifolds is motivated by applications where constraints are imposed from modelling considerations or are introduced in order to restrict sampling to a relevant subdomain derived from statistical analysis (see [15]). Examples of such manifolds include products of spheres or sets of orthogonal matrices, which arise in protein configuration modelling with the Fisher-Bingham distribution ([16]), texture analysis using distributions over rotations ([17]) and fixed-rank matrix factorization for collaborative filtering ([18], [19]). Methods that sample from probability distributions on manifolds have been considered in [20], [15], [21], [22], [23], [24], [25], [26] and [27]. In this article, we focus on manifolds defined by algebraic constraints. In order to maintain the constraints, in practice one needs to perform projections at each step of the algorithm, an additional overhead compared to Euclidean MCMC algorithms.
In this paper we propose the Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC) method, an RHMC scheme for Riemannian manifolds. We establish invariance of the desired measure, under a compactness assumption, for the continuous-time (small-stepsize-limit) PDMP version of our method, where the algorithm is rejection-free. Further, we demonstrate invariance and prove ergodicity for the discretized method with Metropolis-Hastings adjustment. We show in numerical experiments that this method has improved robustness, demonstrating for example that the convergence rate is relatively flat in the choice of mean time parameter; these results mirror those obtained for the Euclidean version of the method. Moreover, we compare RT-RMHMC to g-BAOAB, a constrained underdamped Langevin integrator introduced in [28].
To our knowledge, there is no theoretical or numerical treatment of RHMC in the manifold setting, and there has been no theoretical treatment of Riemannian Hamiltonian Monte Carlo methods in the continuous-time setting. We provide a first result establishing invariance of a continuous-time Riemannian Hamiltonian Monte Carlo method in the compact setting. A biased RHMC method was recently introduced (see [14]) with event rates that depend on the position in the state space; these state-dependent event rates can be incorporated into our Riemannian RHMC framework when the framework is unadjusted. We note that the appendix of that article introduces a version of RHMC in which the metric is adapted for sampling on Euclidean space, but not for working on a Riemannian manifold.
The remainder of this article is organised as follows. In the next section we describe the algorithm and establish invariance in the continuous-time setting under a compactness assumption. Section 3 considers the numerical implementation with and without Metropolis test. Section 4 establishes conservation of the stationary distribution for the discretized algorithm and the ergodicity of the method with Metropolis-Hastings adjustment. Section 5 discusses numerical experiments and Section 6 gives some thoughts on future developments. We include several appendices addressing the generator, the invariance of the target measure and the irreducibility of the scheme, from which ergodicity follows.

Algorithm
Let (M, g) be a d-dimensional Riemannian manifold and T M denote its tangent bundle. Let G(x) denote the positive definite matrix associated to the metric g at x ∈ M. Consider a target distribution on M with density π_H(x) = Z_M^{-1} exp(−U_H(x)) with respect to σ_M(dx), the surface measure (Hausdorff measure) of M, defined by σ_M(dx) = √(det G(x)) dx, where Z_M = ∫_M exp(−U_H(x)) σ_M(dx) is assumed to be finite. Consider the extension of the distribution to T M given by µ(dz) = Z^{-1} exp(−H(z)) λ_{T M}(dz), where λ_{T M}(dz) is the Liouville measure of T M and the Hamiltonian H is defined by H(x, v) = U_H(x) + (1/2) v^T G(x)^{-1} v. Here ψ(x)(dv) denotes the Gaussian measure on T_x M given in local coordinates by ψ(x)(dv) ∝ exp(−(1/2) v^T G(x)^{-1} v) σ_{T_x M}(dv), where σ_{T_x M}(dv) is the Lebesgue measure on T_x M. In particular, µ has marginal distribution π_H with respect to the Hausdorff measure ([22], [21], [29, Section 3.3.2]). We will define a stochastic process which is a Riemannian version of the Randomized Hamiltonian Monte Carlo of [10]. The stochastic process follows the Hamiltonian dynamics for a duration t ∼ Exp(λ), for some rate λ > 0, before an event. This event is a random velocity refreshment under the distribution ψ(x).
Algorithm 1 defines Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC) with rate parameter λ > 0 and Hamiltonian dynamics governed by the Hamiltonian H(x, v) = U_H(x) + (1/2) v^T G(x)^{-1} v defined on T M. This stochastic process has invariant measure µ with density proportional to exp(−H(z)) with respect to the Liouville measure on T M.

Algorithm 1: RT-RMHMC
• Initialise x_0 arbitrarily on M.
• For k = 0, 1, 2, ...:
  – Refresh the velocity: sample v_k ∼ ψ(x_k).
  – Sample a duration t_k ∼ Exp(λ).
  – Evolve the Hamiltonian flow for time t_k: (x_{k+1}, ·) = ϕ_{t_k}(x_k, v_k).

To sample from a distribution π with respect to the Hausdorff measure we define U_H = − log π, under the assumption that π is integrable on M.
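As an illustration, the loop above can be sketched in Python for the special case of the uniform distribution on S^2 ⊂ R^3, where the Hamiltonian flow reduces to the exactly solvable geodesic (great-circle) flow. The function names and the choice of target are ours, not part of the algorithm specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def tangent_gaussian(x):
    # velocity refreshment: sample z ~ N(0, I) in R^3 and project onto T_x S^2
    z = rng.standard_normal(3)
    return z - np.dot(z, x) * x

def geodesic_flow(x, v, t):
    # exact Hamiltonian flow on the unit sphere when U_H is constant (uniform target):
    # great-circle motion with speed |v|
    s = np.linalg.norm(v)
    if s == 0.0:
        return x, v
    return (np.cos(s * t) * x + np.sin(s * t) * v / s,
            -s * np.sin(s * t) * x + np.cos(s * t) * v)

def rt_rmhmc_uniform_sphere(n_events, lam=1.0):
    x = np.array([0.0, 0.0, 1.0])
    samples = []
    for _ in range(n_events):
        v = tangent_gaussian(x)          # event: full velocity refreshment
        t = rng.exponential(1.0 / lam)   # random duration t ~ Exp(lam)
        x, v = geodesic_flow(x, v, t)
        samples.append(x.copy())
    return np.array(samples)

chain = rt_rmhmc_uniform_sphere(2000)
```

Because the flow is exact, this toy chain is rejection-free, mirroring the continuous-time process.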
We can define the generator of this stochastic process as Lf = X_H f + λ(Qf − f), where Q is the transition kernel for a completely randomized velocity refreshment according to a Gaussian distribution on the tangent space T_x M, and X_H is the Hamiltonian vector field associated to H. In the Supplementary Material we prove that this is indeed the generator of the process (Section A) and establish invariance of the measure under a compactness assumption (Section B).
Our main theoretical result about Algorithm 1 is the following.
Corollary 1 (Invariant measure for RT-RMHMC) Let (P_t)_{t≥0} be the transition semigroup of a simulation of Algorithm 1 with characteristics (ϕ, λ, Q) on T M and Hamiltonian H ∈ C^2(T M), where (M, g) is a compact smooth Riemannian manifold and ϕ is the Hamiltonian flow associated to the Hamiltonian. Let µ be the measure on (T M, B(T M)) given by µ(dz) = Z^{-1} exp(−H(z)) dλ_{T M}(z), where λ_{T M} is the Liouville measure of T M. Then µ is invariant for RT-RMHMC.

Constrained Symplectic Integrator and Metropolis-Hastings adjustment
In this section, we state some more broadly implementable versions of Algorithm 1 that are applicable when the Hamiltonian dynamics cannot be solved exactly. We start with a brief introduction to Lagrangian and Hamiltonian dynamics with constraints, based on [30, Chapter 3]. Consider manifolds M embedded in R^d that can be described by algebraic equations g_i(x) = 0, i = 1, ..., m, where the g_i are continuously differentiable functions with linearly independent gradients ∇g_i(x) for all x ∈ M.
We refer to such a submanifold as an algebraic constraint manifold. The Euler-Lagrange equations on M can be expressed as an orthogonal projection of the Euler-Lagrange equations in R^d onto the constraint manifold; hence we have

d/dt (∂L/∂ẋ) − ∂L/∂x = − Σ_{i=1}^m λ_i ∇g_i(x),

where the λ_i are Lagrange multipliers, one for each of the constraints. We can then define an augmented Lagrangian function L_a(x, ẋ, λ) = L(x, ẋ) − Σ_{i=1}^m λ_i g_i(x), in terms of which the Euler-Lagrange equations take their unconstrained form, and the augmented Hamiltonian function H_a(x, p, λ) = H(x, p) + Σ_{i=1}^m λ_i g_i(x), from which we obtain Hamilton's equations (see [31])

dx/dt = ∂H_a/∂p,  dp/dt = −∂H_a/∂x,  g_i(x) = 0 for i = 1, ..., m.

We next introduce a new formulation of RT-RMHMC for constraint manifolds which we will use for numerical simulation. Note that a constrained-manifold Hamiltonian Monte Carlo method was introduced in [15], but with a deterministic duration parameter. We will use the same notation as [15] to introduce randomized time into this algorithm.
Let us denote our constraints by c(x) := (g_1(x), ..., g_m(x))^T and let C(x) = ∂c/∂x denote the Jacobian of the constraints, which we assume to have full rank everywhere. Define the Hamiltonian of the constrained system as H(x, v) = U_H(x) + (1/2) v^T v. The dynamics of the constrained system in terms of this Hamiltonian is then given by

dx/dt = v,  dv/dt = −∇U_H(x) − C(x)^T λ,  subject to c(x) = 0 and C(x)v = 0,

where we remark that we can naturally identify the tangent and cotangent spaces and bundles.
Let π_H be our target measure with respect to the Hausdorff measure, and let U_H(x) = − log π_H(x) be the potential energy of our constrained system. We can simulate the constrained Hamiltonian dynamics using the methods of [32]. However, if we know π_H explicitly, we can avoid computation of the metric tensor by assuming our system is isometrically embedded in Euclidean space. Under this assumption we can consider Algorithm 2, an explicit algorithm for simulation of Randomized Time Constrained Hamiltonian Monte Carlo (RT-CHMC). We discuss and justify the embedding assumption further in Section 3.1.
In Algorithm 2 we sample the Gaussian distribution on the tangent space at a point of M. We can do this by sampling a Gaussian distributed vector and then projecting it orthogonally. To orthogonally project a momentum vector onto T*M, and thereby correctly resample the momentum in Algorithm 2 at x ∈ M, we apply the projector P(x) = I − C(x)^T (C(x) C(x)^T)^{-1} C(x); a proof that this produces the correct tangent-space Gaussian can be found in [33].
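A minimal numerical sketch of this projection (our own illustration, using the sphere constraint as an example):

```python
import numpy as np

def tangent_projector(C):
    # P = I - C^T (C C^T)^{-1} C: orthogonal projector onto ker C(x), i.e. the tangent space
    m, d = C.shape
    return np.eye(d) - C.T @ np.linalg.solve(C @ C.T, C)

# illustrative single constraint: the sphere c(x) = x^T x - 1, with Jacobian C(x) = 2 x^T
x = np.array([0.6, 0.0, 0.8])
C = 2.0 * x.reshape(1, -1)
P = tangent_projector(C)
v = P @ np.array([1.0, 2.0, 3.0])   # in practice the projected vector is a N(0, I) draw
```

The projector is symmetric and idempotent, and the projected vector satisfies the velocity constraint C(x)v = 0.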

Embedded Manifolds
We next introduce the theory of manifold embeddings as it was presented in [21] to show that numerical simulation of RT-CHMC is in fact simulation of RT-RMHMC on constraint manifolds.
If we know the form of the distribution π_H with respect to the Hausdorff measure, then we can avoid the computation of the metric tensor and the lack of a global coordinate system ([21]). We achieve this using isometric embeddings, remarking that every Riemannian manifold can be isometrically embedded in Euclidean space by the Nash embedding theorem ([34]). If we have an isometric embedding ξ : M → R^n, then for a path q(t) on M the path x(t) = ξ(q(t)) satisfies ẋ_i(t) = Σ_j (∂x_i/∂q_j) q̇_j(t), that is, ẋ = X q̇ with X = ∂ξ/∂q. The phase space (q, p), where q̇ = G^{-1} p, can then be transformed to the embedded phase space (x, v), with v = X G^{-1} p, since G = X^T X due to the fact that the embedding is isometric and preserves inner products (see [21]). Now the Hamiltonian (eq. 2) is in terms of the coordinates (x, v). When considering the sampling of the velocities in Algorithm 1 and Algorithm 2, since p ∼ N(0, G(q)) we have v ∼ N(0, X(X^T X)^{-1} X^T), where X(X^T X)^{-1} X^T is the orthogonal projection onto the tangent space of the embedded manifold ([21]). Therefore we can sample from N(0, I) and project onto the tangent space to obtain the required sample. The Hamiltonian is thus expressed in a form which is independent of the metric (provided we know the density with respect to the Hausdorff measure).

We now introduce the numerical integration (RATTLE) scheme [35, Chapter 7]:

v_{n+1/2} = v_n − (∆t/2) ∇U_H(x_n) − C(x_n)^T λ_n^{(r)},
x_{n+1} = x_n + ∆t v_{n+1/2},
v_{n+1} = v_{n+1/2} − (∆t/2) ∇U_H(x_{n+1}) − C(x_{n+1})^T λ_{n+1}^{(v)},

where we solve for λ_n^{(r)} and λ_{n+1}^{(v)} at each iteration so that the iterates lie in the tangent bundle, i.e. c(x_{n+1}) = 0 and C(x_{n+1}) v_{n+1} = 0. We solve for λ_n^{(r)} (a nonlinear system of equations) by cycling through the constraints, adjusting one multiplier at each iteration. Denote by C_i the ith row of C. We first initialise Q to the unconstrained position update, and then cycle through the list of constraints one after another as follows: for each i = 1, ..., m, while |g_i(Q)| > tol, compute ∆λ_i = g_i(Q) / (C_i(Q) C_i(x_n)^T) and update Q by Q ← Q − ∆λ_i C_i(x_n)^T, where tol is a certain prescribed tolerance. Then we set x_{n+1} = Q and have x_{n+1} ∈ M within the tolerance. (Note that other stopping criteria could be used; see [36].) We solve for λ_{n+1}^{(v)} by solving the linear system

C(x_{n+1}) C(x_{n+1})^T λ_{n+1}^{(v)} = C(x_{n+1}) (v_{n+1/2} − (∆t/2) ∇U_H(x_{n+1})).

Once the linear system has been solved we obtain (x_{n+1}, v_{n+1}) ∈ T*M.
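The scheme above can be sketched as follows for a single spherical constraint; the quadratic (Bingham-type) potential and all function names are illustrative choices of ours, and the position multiplier is found by the constraint-cycling iteration just described (trivial here, as there is only one constraint).

```python
import numpy as np

def grad_U(x):
    # illustrative quadratic (Bingham-type) potential U_H(x) = x^T A x
    A = np.diag([1.0, 2.0, 3.0])
    return 2.0 * A @ x

def c(x):          # sphere constraint c(x) = x^T x - 1
    return np.dot(x, x) - 1.0

def C(x):          # constraint Jacobian, a 1 x 3 row
    return 2.0 * x.reshape(1, -1)

def rattle_step(x, v, dt, tol=1e-12, max_iter=100):
    v = v - 0.5 * dt * grad_U(x)           # first half kick
    q = x + dt * v                         # unconstrained position update
    for _ in range(max_iter):              # SHAKE-style iteration for the position multiplier
        r = c(q)
        if abs(r) < tol:
            break
        dlam = r / float(C(q) @ C(x).T)
        q = q - dlam * C(x).ravel()
    v = (q - x) / dt                       # velocity consistent with the projected position
    x = q
    v = v - 0.5 * dt * grad_U(x)           # second half kick
    Cx = C(x)
    lam = np.linalg.solve(Cx @ Cx.T, Cx @ v)   # velocity multiplier (linear solve)
    v = v - Cx.T @ lam
    return x, v

x0 = np.array([1.0, 0.0, 0.0])
v0 = np.array([0.0, 0.3, -0.2])            # tangent at x0 since x0 . v0 = 0
x1, v1 = rattle_step(x0, v0, dt=0.01)
```

After one step both the position constraint c(x) = 0 and the velocity constraint C(x)v = 0 hold to the prescribed tolerance.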
Theorem 2 Let M be a constraint manifold and let H ∈ C^2(T M). The RATTLE numerical integrator of the Hamiltonian system defined by H on T M is symmetric, symplectic and of order 2. Further, it respects the manifold constraints.

Metropolis Hastings Adjustment
Let Ψ_{∆t}^L : T M → T M be the numerical integrator defined by L steps of RATTLE with stepsize ∆t; this integrator approximates the Hamiltonian dynamics. For theoretical purposes we also define the map N : T M → T M which negates the momentum, i.e. N(x, v) = (x, −v). Note that this leaves the Hamiltonian invariant and, because the momentum is resampled, it has no effect on the samples from π_H. We define the following Metropolised RT-RMHMC: we sample T ∼ Exp(λ) and fix a maximum time length ∆t_max below the stability threshold of the numerical integrator. We then choose the number of leapfrog steps L = ⌈T/∆t_max⌉ and set ∆t = T/L ≤ ∆t_max. At each step we perform L RATTLE steps with stepsize ∆t. We propose this method of discretisation, instead of purely randomising the stepsize with a fixed number of leapfrog steps, to avoid numerical instabilities in the integrator. One could also fix a stepsize within the numerical stability threshold of the integrator and simply sample a geometrically distributed number of leapfrog steps to randomise the time; however, our proposed method relates more closely to the continuous dynamics without the issues due to numerical instabilities.
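The step-count selection just described can be expressed in a few lines (the helper name is ours); rounding L up guarantees ∆t ≤ ∆t_max while L∆t reproduces the sampled duration exactly.

```python
import math
import random

def duration_to_steps(T, dt_max):
    # choose L = ceil(T / dt_max) RATTLE steps, then dt = T / L <= dt_max
    L = max(1, math.ceil(T / dt_max))
    return L, T / L

random.seed(0)
T = random.expovariate(1.0)          # T ~ Exp(lambda) with rate lambda = 1
L, dt = duration_to_steps(T, dt_max=0.05)
```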
Remark 1 For large choices of stepsize ∆t it has been shown that Ψ_{∆t}^L is not reversible when RATTLE is used to integrate on the manifold; see [24,26]. In [26] it is proposed to combat this by incorporating a reversibility check into the Metropolis-Hastings adjustment, although in practice such checks may be neglected in favour of an implicit assumption that ∆t is sufficiently small to avoid non-reversibility issues. We investigate this further in Section 5.
In light of Remark 1, we include Rev(·) as an additional (optional) accept-reject condition which implements a reversibility check (following [26]). In numerical experiments we examine the stepsize threshold at which the reversibility condition fails (see Fig. 1).

Algorithm 3: RT-CHMC with Metropolis-Hastings step
• Initialise ∆t_max within the stability threshold.
• Initialise x_0 arbitrarily on M and sample v_0 ∼ N(0, I) projected onto T_{x_0}M.
• For k = 0, 1, 2, ...:
  – Sample T ∼ Exp(λ); set L = ⌈T/∆t_max⌉ and ∆t = T/L.
  – Propose (x*, v*) = Ψ_{∆t}^L(x_k, v_k).
  – Accept the proposal with probability min(1, exp(H(x_k, v_k) − H(x*, v*))), optionally also requiring Rev(x*, v*); otherwise set (x*, v*) = N(x_k, v_k).
  – Set x_{k+1} = x* and refresh v_{k+1} ∼ N(0, I) projected onto T_{x_{k+1}}M.

Remark 2 Our framework can be adapted to handle inequality constraints by incorporating an additional rejection condition in the Metropolis-Hastings step, which rejects proposals that are not within the boundary. This will be used in our application in Section 5.3 to impose a half-normal prior on some dimensions of our Bayesian model.

Ergodicity
We now prove ergodicity and exact invariance of the desired measure for the discrete-time algorithm with Metropolis-Hastings adjustment. We establish ergodicity under two assumptions, by the same technique as [15], restating some of their results.
Proposition 3 Assume that Ψ_{∆t}^L is reversible for stepsizes below ∆t_max > 0. Then µ is invariant with respect to the Markov kernel proposed in Algorithm 3.
Proof See Section C of the Supplementary Material.
Assumption 1 Let M be a Riemannian manifold which is connected, smooth and differentiable. We assume that ∂c/∂x is full rank everywhere.
Assumption 2 Let M be a Riemannian manifold which satisfies Assumption 1. For x ∈ M we define B_r(x) = {x' ∈ M | d(x', x) ≤ r} to be the geodesic ball of radius r about x. We assume that there exists an r > 0 such that for every x ∈ M and x' ∈ B_r(x) there exists a unique choice of Lagrange multipliers and velocities v ∈ T_xM, v' ∈ T_{x'}M for which (x', v') = Ψ_{∆t}^L(x, v) for sufficiently small ∆t.
Theorem 5 (µ-irreducibility) Let U ∈ C^2(M) and suppose Assumptions 1 and 2 hold. Then for any x ∈ M and any measurable set A ⊂ M with positive measure there exists an n ∈ N such that K^n(x, A) > 0, where K denotes the marginal transition kernel on M of Algorithm 3.
Proof See Section C of the Supplementary Material.

Numerical results
We perform numerical simulations of the RT-RMHMC algorithm and compare to the RMHMC algorithm of [15,22], specifically exploring the underlying dynamics of the two processes. MCMC schemes are used to approximate expected values E_π(f) = ∫ f(x) π(x) dx of certain functions f over some distribution with pdf π; we estimate this quantity using our MCMC scheme via (1/N) Σ_{i=1}^N f(X_i), where the X_i form the Markov chain of our MCMC method. We quantify the convergence rate associated to the approximation of E_π(f) by considering the integrated autocorrelation function and the effective sample size.

g-BAOAB
As a comparison method we implemented the g-BAOAB integrator of [28], a numerical integrator for constrained underdamped Langevin dynamics. Constrained underdamped Langevin dynamics augments the constrained Hamiltonian dynamics of Section 3 with friction and fluctuation terms −γv + √(2γ) R(t), where γ is a friction coefficient and R(t) is a vector-valued, stationary, zero-mean Gaussian process. The numerical integrator g-BAOAB is a splitting method for such dynamics, which uses constrained sub-integrators similar to those of RT-RMHMC. We note that g-BAOAB is a biased sampling algorithm due to the error in the numerical integrator. For a full description of g-BAOAB and a discussion of the sampling error we refer to [28].

Test Examples
We next provide examples of distributions on implicitly defined manifolds embedded in Euclidean space, with the distributions defined with respect to the Hausdorff measure of the manifold.We will consider two types of constraint manifolds: spheres and Stiefel manifolds.

Bingham-Von Mises-Fisher distribution on S^n
The first test case is the Bingham-Von Mises-Fisher (BVMF) distribution defined on the n-dimensional sphere S^n = {x ∈ R^{n+1} : x^T x = 1} embedded in R^{n+1}. The BVMF distribution is the exponential family on S^n ⊂ R^{n+1} with density of the form π_H(x) ∝ exp(µ^T x + x^T A x), for a vector µ and a symmetric matrix A. We compare the integrated autocorrelation (IAC) of − log π_H for the RT-RMHMC method with that of the RMHMC method introduced in [22], for a number of distributions with parameters defined in the captions. We also compare the maximum IAC of x_i for i = 1, ..., n, to compare the worst-case mixing efficiency over all dimensions. We compare methods by setting the duration parameter λ of RT-RMHMC to be the deterministic duration parameter of RMHMC (running the dynamics for this duration before momentum randomization). We then compute the integrated autocorrelation of − log π_H and of x_i for i = 1, ..., n for the two methods, for varying choices of λ, by the Monte Carlo averaging procedure described in Section E. Regarding the reversibility issue for large stepsizes (discussed in Section 3.2): for the geometries and distributions chosen, the behaviour is as in Figure 1, where there is a dramatic change in reversibility failure for a small change in stepsize; before this point, all generated samples satisfy the reversibility conditions. We simply chose stepsizes below this threshold in our simulations.
The results are presented in Figure 2. We choose the stepsize in RATTLE to be ∆t = 0.001 and sample N = 1,000,000 events, discarding the first 10% of samples as burn-in before computing the Monte Carlo average. We use lags of up to M = N/50 (2 percent of the number of samples) to estimate the IAC. As our choice of ∆t is small, the acceptance rate is high, so this process is close to the continuous version, and the IAC comparison reflects the efficiency of the continuous processes. In our first example (Figure 2) we see that the quality of samples varies irregularly with the duration parameter when a deterministic duration is used, and is nearly uniform across a wide interval for a randomized duration with the same expected value. This is illustrated in Figure 3, where a small change in the duration parameter dramatically slows convergence of the dynamics due to very slow mixing. The fact that RMHMC behaves erratically for large mean duration parameters may not be very surprising to some readers, as the theoretical convergence bound for HMC without randomization requires a limit on the duration T (see [41]); when T is set too large, the coupling argument breaks down.
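For reference, a simple truncated-sum IAC estimator of the kind used here (names ours) can be written down and sanity-checked against an AR(1) chain, whose exact IAC is (1 + φ)/(1 − φ):

```python
import numpy as np

def integrated_autocorrelation(x, max_lag):
    # tau = 1 + 2 * sum_{k=1}^{M} rho_k, with empirical autocorrelations rho_k
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = xc @ xc / n
    tau = 1.0
    for k in range(1, max_lag + 1):
        tau += 2.0 * (xc[:n - k] @ xc[k:]) / (n * var)
    return tau

rng = np.random.default_rng(0)

# white noise has tau = 1
tau_iid = integrated_autocorrelation(rng.standard_normal(100000), max_lag=50)

# AR(1) chain with phi = 0.5 has tau = (1 + phi)/(1 - phi) = 3
phi, n = 0.5, 200000
eps = rng.standard_normal(n)
y = np.empty(n)
y[0] = eps[0]
for i in range(1, n):
    y[i] = phi * y[i - 1] + eps[i]
tau_ar1 = integrated_autocorrelation(y, max_lag=100)
```

The truncation lag M trades bias against variance, which is why the experiments cap it at a small fraction of N.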
We next compare efficiencies using the metric of gradient evaluations per effective sample, which tells us the number of gradient evaluations needed for one independent sample in estimating our observables. We compare this metric for varying choices of stepsize, up to the point where the reversibility condition is broken and the numerical integrator becomes unstable. Our observables are − log π_H and x_i for i = 1, ..., n. We see in Figures 4 and 5 that RMHMC (with deterministic time) exhibits the same behaviour as in Figure 2 for all choices of stepsize, which is not the case for RT-RMHMC. We next compare the efficiency of the method with the g-BAOAB constrained Langevin integrator; we find in Figures 4 and 5 that g-BAOAB outperforms RT-RMHMC for large choices of the friction parameter γ (for this example γ = 50). g-BAOAB has no Metropolis-Hastings adjustment and hence is a biased sampling method; the bias in the samples creates errors in computed observables. For large choices of γ this bias is dramatically reduced, but the use of high friction may slow convergence for metastable systems.
To explore this we next consider a bimodal distribution from [21] in Figure 6. Figure 6 shows that g-BAOAB incurs bias for large stepsizes and that convergence is slow for large choices of the friction parameter for this metastable system; the figure also shows that this is not the case for RT-RMHMC. In Figure 4 we choose stepsizes up to which the integrator is reversible and stable. The efficiency of the methods with their optimal choices of parameters is comparable, but RT-RMHMC is much less sensitive to the choice of parameters (stepsize and number of leapfrog steps) than RMHMC, so RT-RMHMC is more reliable from this point of view. This is important, as it is hard to know an appropriate choice of parameters a priori, and the integration length between samples might have to be arbitrarily small for RMHMC to be efficient.
Von Mises-Fisher distribution on V_{d,p}

Definition 1 A Stiefel manifold V_{d,p} is the set of d × p matrices X such that X^T X = I.
These arise in many statistical problems, which are discussed in [21]. Applications include dimensionality reduction, as used in factor analysis and principal component analysis ([42]), and directional statistics ([43]). Stiefel manifolds are a generalisation of the orthogonal groups. The von Mises-Fisher distribution on Stiefel manifolds is defined by the density π_H(X) ∝ exp(tr(F^T X)) = exp(Σ_i f_i^T x_i), where x_i and f_i are the columns of X and F respectively. We compute IAC estimates for two example distributions for varying duration parameters. In the simulations we use a stepsize of ∆t = 0.001 and 100,000 samples in each IAC estimate. The results are shown in Figure 7. We see similar behaviour to the easier example on the sphere: in both examples it is clear that RMHMC is much more sensitive to the mean duration, and hence to the stepsize and number of leapfrog steps. We note that Skew(2, −45, −4) denotes the 3 × 3 skew-symmetric matrix with upper triangular entries 2, −45 and −4.
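A small numerical illustration of working with V_{d,p} (our own sketch): points can be generated by QR orthonormalisation, and ambient Gaussian draws projected onto the tangent space {V : X^T V + V^T X = 0} using the standard projection V = Z − X sym(X^T Z).

```python
import numpy as np

rng = np.random.default_rng(1)

def random_stiefel(d, p):
    # a point of V_{d,p}: orthonormalise a Gaussian d x p matrix by QR
    Q, _ = np.linalg.qr(rng.standard_normal((d, p)))
    return Q

def project_tangent(X, Z):
    # orthogonal projection of an ambient matrix Z onto T_X V_{d,p}
    S = X.T @ Z
    return Z - X @ (S + S.T) / 2.0

X = random_stiefel(5, 2)
V = project_tangent(X, rng.standard_normal((5, 2)))
```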

High Dimensional Covariance Estimation
In many statistical applications for analysing high-dimensional data sets it is necessary to estimate sample covariances. This can be challenging when the number of dimensions is larger than the number of data points, as the sample covariance estimator does not work well in such cases. [44] provides a review of high-dimensional covariance estimation and applications in principal component analysis ([45]), cosmological data analysis ([46]) and finance ([47]), focusing on the setting where the matrix dimension is diverging or even larger than the sample size. In this setting one needs to estimate the population covariance matrix Σ of a set of n p-dimensional data vectors, which we assume are drawn from an underlying distribution.
One estimator is the sample covariance matrix, defined by Σ_S = (1/(n−1)) Σ_{k=1}^n (x_k − x̄)(x_k − x̄)^T, where x̄ = (1/n) Σ_{k=1}^n x_k is the sample mean. However, this is a poor estimator of Σ when p is large compared to the sample size n (due to rank deficiency). A way to combat this is to consider regularised covariance matrix estimators which incorporate structural assumptions on the covariance matrix Σ.
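The rank deficiency is easy to verify numerically (a sketch with synthetic data; with n samples the centred data matrix has rank at most n − 1):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_covariance(X):
    # Sigma_S = (1/(n-1)) sum_k (x_k - xbar)(x_k - xbar)^T, for rows x_k of X
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (n - 1)

n, p = 40, 60                       # fewer samples than dimensions
X = rng.standard_normal((n, p))
S = sample_covariance(X)
rank = np.linalg.matrix_rank(S)     # at most n - 1 < p, so S is singular
```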
One such method assumes the structure of a low-rank matrix plus a sparse matrix (see [48] and [44]). This structure is known as a spiked covariance structure and has been studied in [49], [44] and [50], with applications to finance ([51]), chemometrics ([52]) and astronomy ([46]). The covariance matrix Σ is assumed to be expressible in the form Σ = X D_1 X^T + D_2, where X lies in the Stiefel manifold V_{p,m} with p ≫ m, and D_i for i = 1, 2 are diagonal matrices of dimensions m × m and p × p respectively. The motivation for this structure is the assumption that lower-dimensional variables y_i can describe the data x_i, in the sense that x_i = X y_i + ε_i, where X is a p × m matrix with orthogonal columns. We then have Σ = X Σ_y X^T + Σ_ε, which we interpret as a low-rank matrix (rank m) plus a sparse matrix. We take Σ_y and Σ_ε to be diagonal, which is an approximation of the spiked covariance structure [53]. We assume a uniform prior on X with respect to the Hausdorff measure on V_{p,m}, and half-normal priors on the diagonal entries of D_i for i = 1, 2 to ensure positive definiteness. We consider the Gaussian likelihood L(Σ | x_1, ..., x_n) for the covariance estimation and introduce the posterior distribution p(Σ | x_1, ..., x_n) ∝ L(Σ | x_1, ..., x_n) p(Σ), where p(Σ) = p(X) p(D_1) p(D_2) for X ∼ U(V_{p,m}), D_{1,jj} ∼ N^+(0, σ_1^2) for j = 1, ..., m and D_{2,jj} ∼ N^+(0, σ_2^2) for j = 1, ..., p, and N^+ denotes the half-normal distribution. We define the potential U_H(Σ) = − log L(Σ | x_1, ..., x_n) − log p(Σ); its gradients with respect to X and the diagonal entries of D_i, i = 1, 2 (for j_1 = 1, ..., m and j_2 = 1, ..., p) follow by direct differentiation. We use the likelihood and its gradient for implementing our RT-RMHMC algorithm for such models.
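A quick sketch of assembling a covariance of this form (the numerical values are arbitrary illustrations, not the priors' parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 12, 3

# X in the Stiefel manifold V_{p,m} via QR orthonormalisation
X, _ = np.linalg.qr(rng.standard_normal((p, m)))
D1 = np.diag(np.abs(rng.standard_normal(m)) + 1.0)          # m x m, positive diagonal
D2 = np.diag(np.abs(rng.standard_normal(p)) * 0.1 + 0.05)   # p x p, positive diagonal

low_rank = X @ D1 @ X.T          # rank-m "spiked" component
Sigma = low_rank + D2            # low rank plus (diagonal) sparse part
```

Positivity of the diagonal entries, which the half-normal priors enforce, guarantees that Sigma is symmetric positive definite.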

Covariance estimation for cosmological data
We consider an application of high-dimensional covariance estimation in cosmological data analysis, introduced in [46] and discussed in [44]. The data are taken from [46] and consist of covariances of two-point correlation functions of cosmic weak lensing, simulated using coupled log-normal random fields from angular power spectra; for further information we refer the reader to [46]. We test this method using 2n/3 data vectors, where n is the dimension of the data vectors, so we are in the setting where the dimension of the covariance matrix is larger than the number of samples. For our low-rank-plus-sparse structure we choose m = p/6 ≪ p, and to ensure fast convergence we normalize the data entry-wise and initialise our Markov chain via an eigenvalue decomposition of the sample covariance. We initialise the Markov chain as Σ_0 = X D_1 X^T + D_2 ≈ Σ_S, using the sample covariance matrix Σ_S, where D_2 is the diagonal of Σ_S and X D_1 X^T corresponds to the eigenvalue decomposition of Σ_S − D_2 truncated to the m largest eigenvalues. After the covariance of the normalized data is estimated, it can easily be rescaled to match the real data via entry-wise multiplication with the outer product of the entry-wise standard deviations. We compare our method to a maximum a posteriori (MAP) estimate of the covariance matrix, computed using a simple constrained gradient descent algorithm with Lagrange multipliers to ensure that the low-rank-plus-sparse structure is maintained. We compare the Bayesian and MAP approaches using a relative Frobenius norm and a covariance metric introduced by [54], defined by d(A, B) = sqrt(Σ_i ln² λ_i(A, B)), where A and B are covariance matrices and the λ_i(A, B) are the generalized eigenvalues from det(λA − B) = 0. As pointed out in [54], this covariance metric is affine invariant and invariant to inversion. Note that in Figure 8 we have not included the sample estimate with 40 samples, because the sample covariance matrix is rank deficient and hence not invertible. We notice from Figure 8 and Table 1 that the MAP and posterior expectation estimates perform well when estimating the covariance matrix using only 40 data points, but lose accuracy under inversion. The posterior expectation obtained by sampling retains more structure when inverted and provides a more accurate estimate according to the metric of [54].

Table 1 A comparison, using three metrics, between the MAP estimate and the posterior expectation estimate using 500,000 samples from RT-RMHMC: the relative Frobenius norm of the estimate and of its inverse, and the covariance metric introduced by [54].

Metric MAP Posterior Expectation
It is clear from Table 1 that using the posterior means does not sacrifice accuracy compared to using the MAP estimators. An additional benefit of the Bayesian approach is that we can compute posterior standard deviations for each component of the covariance estimator, which gives error estimates; this is illustrated in Figure 9. By comparing these standard deviations with the covariance estimates, we can get a sense of the relative error we are making. This information can be useful when deciding on the number of data samples needed to reach a satisfactory level of accuracy in estimating the covariance matrix. Since in practice we do not have access to the true covariance matrix, there is no straightforward way to compute error estimates based on the MAP estimator, and it is difficult to assess whether we have reached sufficient accuracy.
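The covariance metric of [54] used in Table 1 can be computed by whitening with a Cholesky factor (a sketch of ours, assuming A is symmetric positive definite); the inversion invariance noted above follows because the generalized eigenvalues of the inverted pair are the reciprocals 1/λ_i, leaving the squared logarithms unchanged.

```python
import numpy as np

def forstner_metric(A, B):
    # d(A, B) = sqrt(sum_i ln^2 lambda_i), with lambda_i solving det(lambda A - B) = 0;
    # computed via the ordinary eigenvalues of L^{-1} B L^{-T}, where A = L L^T
    L = np.linalg.cholesky(A)
    Li = np.linalg.inv(L)
    lam = np.linalg.eigvalsh(Li @ B @ Li.T)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

rng = np.random.default_rng(4)
M1 = rng.standard_normal((4, 4))
M2 = rng.standard_normal((4, 4))
A = M1 @ M1.T + 4 * np.eye(4)    # two random SPD matrices
B = M2 @ M2.T + 4 * np.eye(4)
d = forstner_metric(A, B)
```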

Conclusion and Future Work
In this work we have introduced Randomized Time Riemannian Manifold Hamiltonian Monte Carlo (RT-RMHMC), a robust alternative to the Riemannian Manifold Hamiltonian Monte Carlo methods introduced by [22] and [15]. We establish invariance of the desired measure under a compactness assumption in the continuous (small-stepsize-limit) setting. We provide a Metropolis-adjusted version of RT-RMHMC in the discrete setting and prove invariance and ergodicity of the adjusted discretized algorithm. We show that RT-RMHMC is a more robust method with respect to parameter choice on a number of numerical examples arising in applications, and we provide an example demonstrating that our Riemannian manifold sampling method can be used for high-dimensional covariance estimation. We expect that stability with respect to the choice of parameters is especially needed in poorly conditioned problems, where RMHMC would require very short time steps for stability, which may lead to random-walk behaviour and highly inefficient mixing in some principal directions.
In terms of future developments for RT-RMHMC, the next step would be to establish invariance of the measure in the non-compact setting and, further to this, to establish (geometric) ergodicity of RT-RMHMC, which is already established in [10] for the Euclidean setting. One could then find optimal choices of integration parameters and step size. Another possibility would be to establish mixing time guarantees for RT-RMHMC by a coupling argument as in [55] and [56]. In [56], rapid mixing guarantees are established for a geodesic walk algorithm on manifolds with positive curvature, which is RMHMC for the uniform distribution. One may be able to use a similar coupling argument to guarantee mixing times for RT-RMHMC on manifolds with positive curvature.
if τ∞(Z) = +∞ almost surely. PDMP characteristics are said to be non-explosive if the associated PDMP is non-explosive for every initial distribution.
Since the event rate λ of RT-RMHMC is constant and bounded, RT-RMHMC is non-explosive. As RT-RMHMC is a non-explosive PDMP, we can use the theory of [57][Sections 7 and 8] to establish the generator and the invariance of the desired measure.
Under the assumption that the expected number of events in any time interval [0, t] is finite, it is shown in [58][Theorem 26.14] that a non-explosive PDMP has extended generator A with domain D(A) characterized for all f ∈ D(A) and x ∈ T M. If we let N_t denote the number of events in the interval [0, t], then for RT-RMHMC we have E_x(N_t) = λt < ∞, where E_x denotes the expected value given that the stochastic process starts from initial condition x. Therefore all the assumptions of [58][Theorem 26.14] and [57][Section 7] are satisfied, and the generator of RT-RMHMC is given by

Lf(z) = X_H f(z) + λ (Qf(z) − f(z)),

where X_H is the Hamiltonian vector field and Q is the transition kernel of the Gaussian distribution induced by the metric G(x) on the tangent space at x ∈ M.
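Since refreshment events arrive at the constant rate λ, the number of events N_t in [0, t] is Poisson distributed with mean λt. A minimal simulation (illustrative only, with arbitrary values of λ and t) confirms E_x(N_t) = λt empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t_max, n_paths = 2.0, 5.0, 20000   # arbitrary rate, horizon, sample size

def count_events(rng, lam, t_max):
    """Count refreshment events in [0, t_max] with Exp(lam) inter-event durations."""
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1.0 / lam)
        if t > t_max:
            return n
        n += 1

counts = [count_events(rng, lam, t_max) for _ in range(n_paths)]
print(np.mean(counts))  # ≈ lam * t_max
```

The empirical mean number of events per path concentrates around λt = 10, consistent with the finiteness assumption E_x(N_t) = λt < ∞ used above.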

Appendix B Invariant measure
To prove that µ is an invariant measure of RT-RMHMC it is sufficient to show that ∫ Lf(z) dµ(z) = 0 for all f ∈ D(L). As it is difficult to characterize D(L) directly, one approach is to show that C^1_c(T M) is a core of the generator and that ∫ Lf(z) dµ(z) = 0 for all f ∈ C^1_c(T M), where C^k_c(T M) denotes the space of k times differentiable functions f : T M → R with compact support.
Theorem 8 (Infinitesimal Invariance of RT-RMHMC) Let M be a smooth Riemannian manifold with metric g and let (P_t)_{t≥0} be the semigroup of RT-RMHMC defined on M with potential U ∈ C^2(M) and Hamiltonian H = U + K ∈ C^2(T M). Let µ be the measure on (T M, B(T M)) defined by µ(dz) ∝ e^{−H(z)} dλ_{T M}(z), where dλ_{T M} is the Liouville measure of T M. Then ∫ Lf(z) dµ(z) = 0 for all f ∈ C^1_c(T M), where L is the generator of RT-RMHMC.
Proof We have that

∫ Lf dµ = ∫ X_H f dµ + λ ∫ (Qf − f) dµ.

We will now consider these two integrals separately. Considering the first integral: since the density of µ with respect to the Liouville measure is a function of H alone, µ is invariant under the Hamiltonian flow by Liouville's theorem, and hence the first integral is identically zero. Now considering the second integral, the kernel Q resamples the velocity from the Maxwellian distribution, which is, up to a normalizing constant C(x) depending on x, the conditional distribution of µ on the fibre T_x M. Therefore ∫ Qf dµ = ∫ f dµ, the second integral also vanishes, and µ is an infinitesimally invariant measure.
We will next demonstrate that C^1_c(T M) is a core of D(A) by showing that certain conditions established in [57] hold under the assumption that M is compact. To show that C^1_c(T M) is a core of D(A) we compactly approximate RT-RMHMC by a more well-behaved PDMP with characteristics (ϕ, λ, Q′) satisfying Assumption A3 from [57] and possessing a Feller transition semigroup (P′_t)_{t≥0}. We then use this approximation to show that RT-RMHMC is Feller and that C^1_c(T M) is a core of the strong generator of RT-RMHMC, whose transition semigroup (P_t)_{t≥0} is seen as a semigroup on C_0(T M). Note that C_0(T M) denotes the space of continuous functions f : T M → R that vanish at infinity; C_0(T M) is a Banach space when equipped with the ‖·‖_∞ norm.
We first approximate our PDMP (ϕ, λ, Q) (RT-RMHMC) by the PDMP with characteristics (ϕ, λ, Q′), where Q′ is constructed as a Markov kernel corresponding to a consistently truncated Gaussian distribution on each tangent space, as follows. Let ψ(x) denote the probability density function of the Gaussian distribution on T_x M defined by ψ(x)(dv) ∝ exp(−(1/2) v^T G(x)^{−1} v) dv, known as the Maxwellian distribution, and define the auxiliary function G′(x, a) in terms of ψ(x). Then ∂G′/∂a ≠ 0, due to the fact that G′(x, ·) is strictly increasing. By the implicit function theorem there exists a unique continuously differentiable function M : M → R such that G′(x, M(x)) = 0 for all x ∈ M. We define the transition kernel Q′ through the truncated Maxwellian distribution, namely the restriction of ψ(x) to velocities with |v|_g ≤ M(x), normalized accordingly; Q′((x, v), A) is then defined for any (x, v) ∈ T M and A ∈ B(T M).

Lemma 9 (Continuity of Semigroup) Let (M, g) be a smooth Riemannian manifold, and let U ∈ C^1(M), so that H ∈ C^1(T M). Let (P′_t)_{t≥0} be the transition semigroup of (ϕ, λ, Q′). Then P′_t f(z) → f(z) as t → 0 for all f ∈ C_0(T M) and z ∈ T M.

Proof Let (Z_t)_{t≥0} denote a sample path of (ϕ, λ, Q′). On the event {t < S_1}, where S_1 is the time of the first event, we have Z_t = ϕ_t(z), the solution of the Hamiltonian flow. If H is continuously differentiable everywhere then ϕ_t(z) is well defined for all t > 0, and ϕ_t(z) → z as t → 0 (see for example [59][Theorem 1.186]).
Lemma 10 Let (M, g) be a compact Riemannian manifold, and let U ∈ C^1(M). Let (ϕ, λ, Q′) be the PDMP approximation of RT-RMHMC defined above. Then the set of all possible sample paths of (ϕ, λ, Q′) with initial condition (X_0, V_0) is contained in a compact set.
Proof Let M(x) denote the continuous function in the definition of Q′ which controls the truncation of the Gaussian distribution. M(x) is a continuous function on a compact set and hence bounded by some constant M′; we further choose M′ such that |V_0|_g ≤ M′. Define the set U = {(x, v) ∈ T M : |v|_g ≤ M′}. Since M is compact, U is a compact subset of T M by Lemma 13. H restricted to U is bounded by some M_H, as it is continuous on a compact set. We also have that H is constant between event times of the PDMP, by the definition of Hamiltonian flow. Therefore the Hamiltonian along the PDMP (X_t, V_t) takes values determined by the image of the event states (X_{t_i}, V_{t_i}), which is compact by Lemma 13.
We have the following assumption from [57], which we use to establish Proposition 11.
Definition 3 [57][Definition 16] We say that a homogeneous differential flow ϕ on T M and a homogeneous Markov kernel Q on T M are compactly compatible if for all compact sets K ⊂ T M and T ≥ 0, there exists a compact set K′ ⊂ T M satisfying: for all n ∈ N*, (t_i)_{i=1,...,n} ∈ R^n_+ with Σ_{i=1}^n t_i ≤ T, there exists a sequence (K_i)_{i=1,...,n} of compact sets of T M such that, setting K_0 = K, 1. for all i ∈ {1, ..., n}, K_i only depends on (t_j)_{j=1,...,n} and ∪_{i=0}^n K_i ⊂ K′.

Assumption 3 [57][A3] The homogeneous characteristics (ϕ, λ, Q) satisfy 1. the flow ϕ and the Markov kernel Q are compactly compatible; 2. λ ∈ C^1(T M) and for all f ∈ C^1(T M), λQf ∈ C^1(T M) and there exists a locally bounded function Ψ :

Proposition 11 (Feller and Core of Generator) Let (P′_t)_{t≥0} be the transition semigroup of (ϕ, λ, Q′) on T M, where (M, g) is a compact smooth Riemannian manifold and ϕ is the Hamiltonian flow associated to the Hamiltonian H ∈ C^2(T M). Then (P′_t)_{t≥0} is Feller and C^1_c(T M) is a core for the strong generator of (P′_t)_{t≥0} seen as a semigroup on C_0(T M).
Proof If we prove that (ϕ, λ, Q′) satisfies Assumption 3, then by [57][Theorem 17] (P′_t)_{t≥0} satisfies the Feller property. Once the Feller property is established, Lemma 9 and the fact that T M is a complete metric space give, by [60][Lemma 1.4], strong continuity of (P′_t)_{t≥0}, so that (P′_t)_{t≥0} is Feller. That C^1_c(T M) is a core for the strong generator of (P′_t)_{t≥0}, seen as a semigroup on C_0(T M), is then a consequence of [57][Theorem 17] and [61][Proposition 3.3, Chapter 1]. We now establish Assumption 3.
For any compact set K ⊂ T M, compactness gives |v|_g ≤ M_K for some constant M_K ≥ 0 and all v with (·, v) ∈ K. Then, by the same argument as in Lemma 10, but choosing M′ larger than M_K, all sample paths starting in K are contained in a compact set K′. We can take K_0 = K and K_i = K′ for all i ≥ 1. Then the flow ϕ and the kernel Q′ are compactly compatible, and hence part 1 of Assumption 3 holds.
We show part 2 of Assumption 3 as follows. Trivially we have λ ∈ C^1(T M). We have taken the metric to be smooth and hence, as the truncated Gaussian distribution has a smooth transition kernel, λQ′f is continuously differentiable; moreover the support of the truncated Maxwellian is compact by Lemma 13. For all continuously differentiable functions f : T M → R and (x, y) ∈ T M, we define Q′f(x, y) = ∫ f(x, y′) ψ′(x)(dy′), where ψ′(x) is the truncated Maxwellian.
Therefore it is sufficient to show that for all compact sets K ⊂ T M and all (x, y) ∈ K, the gradient ∇_x Q′f(x, y) is bounded. Since all functions considered are C^1, and hence bounded on all compact sets of T M, this follows from a computation on ∫ ∇_x ψ′(x)(y′) dy′ using the dominated convergence theorem, Leibniz's integral rule and a spherical coordinate system.

Theorem 12 (RT-RMHMC Feller and Core) Let (P_t)_{t≥0} be the transition semigroup of (ϕ, λ, Q) on T M, where (M, g) is a compact smooth Riemannian manifold and ϕ is the Hamiltonian flow associated to the Hamiltonian H ∈ C^1(T M). Then (P_t)_{t≥0} is Feller and C^1_c(T M) is a core for the strong generator of (P_t)_{t≥0} seen as a semigroup on C_0(T M).
Proof By construction of Q′, the approximating kernel agrees with Q up to the truncation, and the statement follows from Proposition 11.

Corollary 1 (Invariant measure for RT-RMHMC) Let (P_t)_{t≥0} be the transition semigroup of (ϕ, λ, Q) on T M, where (M, g) is a compact smooth Riemannian manifold and ϕ is the Hamiltonian flow associated to the Hamiltonian H ∈ C^2(T M). Let µ be the measure on (T M, B(T M)) given by µ(dz) ∝ e^{−H(x,v)} dλ_{T M}(z), where dλ_{T M} is the Liouville measure of T M. Then µ is invariant for RT-RMHMC.
Appendix C Proof of invariance and µ-irreducibility for the Metropolized algorithm

Proof of Proposition 3 Let P_1 be the Markov kernel corresponding to the first step.
It is clear that resampling from the Gaussian measure on the tangent space keeps π_H invariant, as the resampled velocity is independent and distributed according to ψ(x); therefore P_1 keeps µ invariant. Let P_2 be the Markov kernel corresponding to the second step (the combination of sampling the time duration, the deterministic step by Ψ and the Metropolis-Hastings accept-reject step). Let L be an arbitrary number of RATTLE steps. We will check that µ is reversible with respect to P_2 and hence also invariant. P_2 is reversible with respect to µ if for every measurable bounded function f : T M × T M → R,

∫∫ f(z_1, z_2) µ(dz_1) P_2(z_1, dz_2) = ∫∫ f(z_1, z_2) µ(dz_2) P_2(z_2, dz_1).
For P_2 we have that P_2(z_1, dz_2) is non-zero only if z_2 = Ψ(z_1) or z_2 = z_1, and hence the integral splits into two contributions. Applying a change of variables to the second part of the sum and combining the two resulting equations gives the required identity. We therefore have that µ is reversible with respect to P_2 and, by the same argument with f taken to depend on one variable only, µ is invariant with respect to P_2. Since this calculation was independent of the time duration, µ is invariant with respect to the Markov kernel of the full algorithm.
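As a rough illustration of the Metropolized scheme (a duration sampled from an exponential distribution, a deterministic integration step, then an accept-reject correction), the sketch below substitutes a plain Euclidean leapfrog integrator for the manifold RATTLE map Ψ and targets a standard Gaussian; all parameter values are arbitrary choices for the example, not recommendations.

```python
import numpy as np

def rt_hmc_step(x, rng, grad_U, U, lam=1.0, dt=0.1):
    """One randomized-time Metropolized step (Euclidean leapfrog as a
    stand-in for the RATTLE/manifold integrator of the text)."""
    v = rng.normal(size=x.shape)              # momentum refreshment
    t = rng.exponential(1.0 / lam)            # random duration ~ Exp(lam)
    L = max(1, int(np.ceil(t / dt)))          # number of integrator steps
    x_new, v_new = x.copy(), v.copy()
    v_new -= 0.5 * dt * grad_U(x_new)         # leapfrog: initial half kick
    for _ in range(L):
        x_new += dt * v_new                   # drift
        v_new -= dt * grad_U(x_new)           # kick
    v_new += 0.5 * dt * grad_U(x_new)         # undo the extra half kick
    dH = (U(x_new) + 0.5 * v_new @ v_new) - (U(x) + 0.5 * v @ v)
    if rng.random() < np.exp(min(0.0, -dH)):  # Metropolis accept-reject
        return x_new
    return x                                   # reject: stay at x

# usage: sample a 1D standard Gaussian, U(x) = x^2 / 2
rng = np.random.default_rng(2)
U = lambda x: 0.5 * x @ x
grad_U = lambda x: x
x = np.zeros(1)
samples = []
for _ in range(5000):
    x = rt_hmc_step(x, rng, grad_U, U)
    samples.append(x[0])
print(np.mean(samples), np.var(samples))
```

Because the duration is resampled independently at each step, no single deterministic map is iterated; this is the mechanism that avoids the periodicities discussed in the introduction.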
Proof of Theorem 5 Based on [15][Theorem 3]. Fix ∆t > 0 sufficiently small such that our assumption holds. For a measurable set A ⊂ M, we may assume A is contained in a compact set K, which can be covered by {B_{r/2}(x) | x ∈ K}. Then for some x′ ∈ K, B_{r/2}(x′) ∩ A has positive measure. We can connect x and x′ by a sequence of points x_0, ..., x_i, ..., x_n for 0 ≤ i ≤ n, defined on the geodesic between

Fig. 9 A comparison between the log of the absolute error and the log of the posterior standard deviations in each component. Left: ln |Σ̂ − Σ|. Right: ln Σ_SD.