Two-scale coupling for preconditioned Hamiltonian Monte Carlo in infinite dimensions

We derive non-asymptotic quantitative bounds for convergence to equilibrium of the exact preconditioned Hamiltonian Monte Carlo algorithm (pHMC) on a Hilbert space. As a consequence, explicit and dimension-free bounds for pHMC applied to high-dimensional distributions arising in transition path sampling and path integral molecular dynamics are given. Global convexity of the underlying potential energies is not required. Our results are based on a two-scale coupling which is contractive in a carefully designed distance.


Introduction
Hamiltonian or Hybrid Monte Carlo (HMC) methods are a class of Markov Chain Monte Carlo (MCMC) methods originating in statistical physics [20] which have become increasingly popular in various application areas [47,52,59,11,63]. Their success is in particular due to empirically observed convergence acceleration compared to more traditional, random-walk based methods. The basic idea in HMC is to define an MCMC method with the help of an artificial ‡ N. B.-R. was supported by the National Science Foundation under Grant No. DMS-1816378 and the Alexander von Humboldt Foundation. § A.E. has been supported by the Hausdorff Center for Mathematics. Gefördert durch die Deutsche Forschungsgemeinschaft (DFG) im Rahmen der Exzellenzstrategie des Bundes und der Länder -GZ 2047/1, Projekt-ID 390685813.
Hamiltonian dynamics whose only purpose is to accelerate convergence to equilibrium. This Hamiltonian dynamics is designed to leave invariant a product of the target measure and a fictitious Gaussian measure in an artifical velocity variable. First rigorous theoretical results supporting the empirical evidence have only been established recently. In particular, geometric ergodicity has been verified in [10,22,48], and quantitative convergence bounds have been derived in the strongly convex case in [51], and under more general assumptions in [9], both by applying coupling methods.
Since many applications are high dimensional, a key issue is to understand the dependence of the convergence bounds on the dimension. Here, we study the problem of dimension dependence for a special class of models that is relevant for several important applications including Path Integral Molecular Dynamics (PIMD) [14,57,56,15,17,16,31,32,49,43], Transition Path Sampling (TPS) [64,62,6,58], and Bayesian inverse problems [41,67,18,7]. For the class of models we consider, a corresponding HMC Markov chain relying on a preconditioned Hamiltonian dynamics can be defined directly on the infinite dimensional state space [3]. This suggests that one might hope for dimension-free convergence bounds for the corresponding Markov chains on finite-dimensional discretizations of the state space. Corresponding dimension-free convergence rates to equilibrium have been established for the preconditioned Crank-Nicholson (pCN) algorithm [36] and for the Metropolis-adjusted Langevin algorithm (MALA) [24], but a corresponding result for HMC is not known so far.
The goal of this paper is to fill this gap. To this end we extend the coupling approach developed for HMC in the finite dimensional case in [9], and combine it with a two-scale coupling approach for stochastic dynamics on infinite dimensional Hilbert spaces that originates in [54,53,33,34] and has been further developed in [70]. The splitting into "low modes" and "high modes" in the two-scale coupling can be traced back to contraction results for the stochastic Navier-Stokes equations [54], and analogous results in the deterministic setting [28]; see [53] for a detailed review.
Our object of study is the exact preconditioned HMC algorithm (pHMC) with fixed durations on a Hilbert space, i.e., the (preconditioned) Hamiltonian dynamics is exactly integrated (or, in practical terms, the integration is carried out with very small step sizes). Here, preconditioning corresponds to an appropriate choice of the kinetic energy which involves picking the mass operator equal to the stiffness operator (or inverse covariance) associated to the Gaussian reference measure of the target probability measure. This choice of kinetic energy ensures that the corresponding pHMC algorithm is more amenable to numerical approximation and Metropolis-adjustment than HMC without preconditioning [3,11].
We prove that the transition kernel of the Markov chain induced by the pHMC algorithm is contracting in a suitable Wasserstein/Kantorovich metric with an explicit contraction rate. This rate depends on the duration of the Hamiltonian flow, the eigenvalues of the covariance operator of the Gaussian reference measure, and the regularity of the preconditioned Hamiltonian dynamics. The results are given in a more general setting that includes pHMC as a special case, and also covers other types of dynamics and preconditioning strategies. As a consequence of our general results, we derive dimension-free bounds for pHMC applied to finite-dimensional approximations arising in TPS and PIMD.
Before stating our results in detail, we conclude with a brief outlook. The results below apply only to pHMC with exact integration of the Hamiltonian dynamics. In practice, the Hamiltonian dynamics is numerically approximated, to obtain numerical versions of pHMC that are implementable on a computer. The time integrator of choice for pHMC is the symmetric splitting integrator introduced in [3]. Unlike other splittings for the Hamiltionian dynamics, this approximation has an acceptance rate that is uniform with respect to the spatial step size associated with the discretization of the Hilbert space [11, §8]. Time discretization creates a bias in the invariant measure that can be avoided by a Metropolis adjustment [69,11]. We would expect that for unadjusted numerical HMC based on the integrator proposed in [3], similar contraction results as stated below hold if the time step size is chosen sufficiently small (but independently of the dimension). Under additional regularity assumptions, one could also hope for dimension free bounds for the Metropolis adjusted version. First steps in this direction are carried out in [9, §2.5.4] in the finite dimensional case, and in [61, §4] in a strongly convex infinite dimensional case, but a full study in the general case would be lengthy and go beyond the scope of the current work.
Alternatively to preconditioning, it is also possible (but more tricky) to implement non-preconditioned HMC, which corresponds to injecting white noise in the velocity variable. In this case, the corresponding Hamiltonian dynamics is highly oscillatory in high modes [60]. Therefore, convergence bounds for exact HMC without preconditioning on an infinite dimensional Hilbert space can be expected to hold only if the durations are randomized [59], and in numerical implementations, strongly stable integrators [43] have to be used in order to be able to choose the step size independently of the dimension. Furthermore, scaling limit results show that for Metropolis adjusted HMC applied to i.i.d. product measures on high dimensional state spaces, the step size has to be chosen of order O(d −1/4 ) to avoid degeneracy of the acceptance probabilities [30,42,2].
We now state our main results in Section 2, and consider applications to TPS and PIMD in Section 3. The remaining sections contain the proofs of all results.

Main results
Let H be a separable and real Hilbert space with inner product ·, · and norm |·|. Let C : H → H be a positive compact symmetric linear operator. By the spectral theorem, the eigenfunctions {e i } i∈N of C form a complete orthonormal basis of H with corresponding eigenvalues {λ i } i∈N which we arrange in descending order, i.e., λ 1 ≥ λ 2 ≥ · · · . The positivity condition means that λ j > 0 for all j ∈ N, and by compactness, if dim(H) = ∞ then lim j→∞ λ j = 0. Any function x ∈ H can be represented in spectral coordinates by the expansion Moreover, for all s ∈ R, the operator C s is defined via the spectral decomposition of C. We introduce the family of inner products and norms given by We will now introduce the pHMC method for approximate sampling from a probability measure µ that has a density w.r.t. a Gaussian measure µ 0 on one of the Hilbert spaces H s . Afterwards, in Section 2.2, we will introduce a more general family of Markov chains on Hilbert spaces that includes the Markov chain associated to pHMC as a special case. In Section 2.3, we introduce a new coupling for these Markov chains that combines ideas from [9] and [70]. Then in Sections 2.4 and 2.5, we state our main contraction result for these couplings, and derive quantitative error bounds.

Exact Preconditioned Hamiltonian Monte Carlo
Let µ 0 = N (0, C) denote the centered Gaussian measure whose covariance operator w.r.t. the inner product ·, · is C [5]. If C is trace class then µ 0 is supported on H. More generally, we fix s ∈ (−∞, 1) and assume that µ 0 is supported on the corresponding Hilbert space H s . This is ensured by the following assumption: Assumption 2.1. The operator C 1−s is trace class, i.e., A realization ξ from µ 0 can be generated using the expansion For ξ ∼ µ 0 , Assumption 2.1 implies E |ξ| 2 s = trace(C 1−s ) < ∞, and thus, ξ is indeed a Gaussian random variable on H s .

Remark 2.2.
To avoid confusion, we stress that the covariance operator of a Gaussian measure is a non-intrinsic object that depends on the choice of an inner product. In particular, the covariance operator of µ 0 w.r.t. the H s inner product is C 1−s . Nonetheless, in what follows, we always define the covariance operator with respect to the H inner product, and in this sense, the measure µ 0 has covariance operator C.
Exact preconditioned Hamiltonian Monte Carlo (pHMC) is an MCMC method for approximate sampling from probability distributions on a Hilbert space that have the general form where U is a function on a Hilbert space on which the Gaussian measure µ 0 is supported. The pHMC method generates a Markov chain on this Hilbert space with transition step Here ξ ∼ N (0, C), and the duration T : Ω → R + is in general an independent random variable with a given distribution ν (e.g. ν = δ r or ν = Exp(λ −1 )). We will only consider the case where T ∈ (0, ∞) is a given deterministic constant. Moreover, is the exact flow of the Hilbert space valued ODE given by Formally, (2.5) is a preconditioned Hamiltonian dynamics for the Hamiltonian where the covariance operator C is used for preconditioning. A key property of (2.5) is that it leaves invariant the probability measurê on phase space, and in turn, this implies that the transition kernel of pHMC defined by π(x, B) = P[X (x) ∈ B] leaves µ in (2.3) invariant [3].
Below, our key assumption in this setup will be that U is a gradient Lipschitz function on the Hilbert space H s where the reference measure µ 0 is supported: The target measure µ is a probability measure on H s that is absolutely continuous with respect to µ 0 . The relative density is proportional to for all x, y, h ∈ H s for some finite positive constant L g .
In Assumption 2.3, ∂ h U denotes the directional derivative of U in direction h. We also use the notation DU to denote the differential of U , i.e., ( [3] except here the domain H s of the potential energy is defined in terms of the covariance operator itself rather than in terms of an auxiliary operator with related eigenfunctions and eigenvalues.

General setting
We now introduce a more general setup that includes the Markov chain induced by pHMC as a special case. We fix s ∈ R, and we assume that b : ) denote the exact flow of the Hilbert space valued ODE given by As above, we fix a constant duration T ∈ (0, ∞) and consider the Markov chain on H s with transition step where C is a linear operator on H with the same eigenfunctions as C.
Assumption 2.5. C : H → H is a symmetric linear operator with eigenfunctions {e i } i∈N and corresponding eigenvalues { λ i } i∈N . Moreover, the operator CC −s is trace class, i.e., For ξ ∼ N (0, C), this implies E |ξ| 2 s = trace( CC −s ) < ∞, and thus, ξ in (2.10) is a Gaussian random variable on H s . Let π(x, B) = P[X (x) ∈ B] denote the corresponding transition kernel. In particular, in the case where b is given by (2.8) and C = C, we recover the Markov chain associated to pHMC.
Our main result rests on the assumption that the Hilbert space H s can be split into a finite dimensional subspace H s, ("the low modes") and its orthogonal complement H s,h ("the high modes") such that b(x) is close to a linear map on H s,h . More precisely, fix n ∈ N. Let H s, := span{e 1 , . . . , e n }, and let H s,h denote its orthogonal complement, i.e., H s,h is the closure in H s of span{e n+1 , e n+2 , . . . }. Thus H s = H s, ⊕ H s,h . For any x ∈ H s , we denote by x and x h the orthogonal projections onto H s, and H s,h , respectively. Assumption 2.6. b is a function from H s to H s such that b(0) = 0. Moreover it satisfies the following conditions: Condition (B1) is a global Lipschitz condition. Since b(0) = 0, it implies the linear growth condition |b(x)| s ≤ L |x| s , and this condition and (B3) imply that K ≤ L. Condition (B2) says that in the high modes, b(x) behaves essentially as a linear drift. Finally, Condition (B3) is a standard drift condition which implies that the Markov chain has a Foster-Lyapunov function. It is similar to other conditions in the literature that consider Markov processes on unbounded spaces based on second-order dynamical systems including Hypothesis (H2) in [10], Equation (13) of [65], Hypothesis 1.1 in [68], Condition 3.1 in [55], and Assumption 1.2 in [25]. The proof of this lemma is given in Section 5.

Two-scale coupling
We now introduce a coupling for the transition steps of two copies of the Markov chain starting at different initial conditions x and y. We use a synchronous coupling of the high modes in H s,h and a different coupling for the low modes in H s, that together enable us to derive a weak form of contractivity. Note that the covariance operator C has a bounded inverse on the finite dimensional subspace H . Therefore, for h ∈ H , the Gaussian measure N (h, C) is absolutely continuous w.r.t. N (0, C) with relative density (2.14) Let γ > 0 be a positive constant. The precise value of the parameter γ will be chosen in an appropriate way below. The coupling transition step is given by with ξ ∼ N (0, C) and η defined in high/low components as η h := ξ h and Here U ∼ Unif(0, 1) is independent of ξ, z := x − y, and the reflection operator R is defined by Due to Assumption 2.6 (B2), the component in H s,h of the resulting coupled dynamics is contracting in a finite time interval as a result of the linear part of the drift in (2.9). Moreover, the coupling of the components of the initial velocities in H s, is similar to the coupling in [9] which is inspired by a related coupling for second order Langevin diffusions [25]. It is defined in such a way that ξ − η = −γz occurs with the maximal possible probability. As illustrated in Figure 1, and proven later in Lemma 4.3, the reason for this choice is that the projection of the difference process on H s, , i.e., q t (x, ξ)−q t (y, η), is contracting in a finite time interval if the difference ξ −η of the initial velocities is negatively proportional to the difference of the initial positions x −y . Note that if b(x) = 0 or b(x) = −x then the optimal choices of γ would be γ = T −1 and γ = cot(T ), respectively, because for these choices, In the case where ξ −η = −γz , a reflection coupling is applied. The corresponding reflection R is an orthogonal transformation w.r.t. the inner product x, y C = C −1/2 x, C −1/2 y induced by the covariance operator C on H .
In order to verify that (X (x, y), Y (x, y)) is indeed a coupling of the transition probabilities π(x, ·) and π(y, ·), we remark that the distribution of η is N (0, C) since, by definition of η in (2.16) and a change of variables, for any measurable set B. Here a ∧ b denotes the minimum of real numbers a and b, I B (·) denotes the indicator function for the set B, and we have used that N (0, C) is invariant under the reflection R, Rz = −z , and by (2.14), where d TV is the total variation distance. Hence, by the coupling characterization of the total variation distance, η = ξ + γz does indeed hold with the maximal possible probability. Note that if z is not in the reproducing kernel Hilbert space of the covariance operator C then the probability of the event η = ξ + γz in (2.18) tends to one as the number of low modes increases. This explains why it is necessary to split the Hilbert space and apply a two-scale coupling.

Contractivity
We now state our main contraction bound for the coupling introduced above. We first define a norm |||·||| α on H s where the high modes are weighted by α > 0: Thus |||·||| α and |·| s are equivalent norms with Remark 2.8. If the dimension is infinite then the operator C −1 C s is unbounded on H s , because its inverse is trace class. Nonetheless, |||x||| α is a well-defined norm for any x ∈ H s because the operator C − 1 2 C s 2 appearing in |||x||| α only acts on the projection x of x onto the finite dimensional space H s, .
As we will see below, even when U is non-convex, we can still obtain contractivity with respect to a semimetric ρ : and where R > 0, a > 0, and > 0 are parameters to be specified below. The semimetric ρ is similar to the one introduced in [9] in order to prove contractivity of the HMC transition step in the finite dimensional case. In general, ρ is not a metric, since the triangle inequality might be violated. Note that f is nondecreasing, and constant when r ≥ R.
Remark 2.9. The semimetric (2.22) incorporates, in a multiplicative or weighted way, the quadratic Foster-Lyapunov function for pHMC from Lemma 2.7 with weight . The idea to use semimetrics of this general form to study contraction properties of Markov processes goes back to [35] and [12]; see also [26].
Lemma 2.7 implies that the coupling transition (x, y) → (X (x, y), Y (x, y)) also has a quadratic Foster-Lyapunov function: We fix a finite, positive constant R satisfying In our main result below, we choose α := 4σ max L. In this case, the choice of R in (2.24) guarantees that a strict drift condition holds for all (x, y) satisfying |||x − y||| α ≥ R, because by (2.21) and since L ≥ 1, The asymptotic strict drift condition in (2.25) allows us to split the proof of contractivity into two parts: (i) |||x − y||| α ≥ R where any coupling is contracting in ρ due to (2.25), and (ii) |||x − y||| α < R, where ρ is contracting due to the specially designed two-scale coupling.
Theorem 2.10. Suppose that Assumption 2.6 holds. Let T > 0 satisfy Let α, γ, a, and be given by Then for any x, y ∈ H s , we have The proof of this theorem is given in §6.
Remark 2.11. The rate in (2.33) is similar to the rate in the finite-dimensional case found in Theorem 2.3 of Ref. [9]. The main difference is that the condition on LT 2 in (2.27) now reflects the effect of preconditioning.

Quantitative bounds for distance to the invariant measure
Theorem 2.10 establishes global contractivity of the transition kernel π(x, dy) w.r.t. the Kantorovich distance based on the underlying semimetric ρ, which for probability measures ν, η on H s is defined as where the infimum is over all couplings γ of ν and η. Moreover, it implies quantitative bounds for the standard L 1 Wasserstein distance Corollary 2.12. Suppose that Assumption 2.6 holds. Let T ∈ (0, R) satisfy (2.27). Then for any k ∈ N and for any probability measures ν, η on H s , where the rate c and the constant are given explicitly by (2.33) and (2.31), and In particular, for a given constant δ ∈ (0, ∞), the L 1 Wasserstein distance The corollary is a rather direct consequence of Theorem 2.10. A short proof is included in §6.
Remark 2.13 (Quantitative bounds for ergodic averages). MCMC methods are often applied to approximate expectation values w.r.t. the target distribution by ergodic averages of the Markov chain. Our results (e.g. (2.34)) directly imply completely explicit bounds for bias and variances, as well as explicit concentration inequalities for these ergodic averages in the case of pHMC. Indeed, the general results by Joulin and Ollivier [40] show that such bounds follow directly from an L 1 Wasserstein contraction w.r.t. an arbitrary metric ρ, which is precisely the statement shown above.

Transition Path Sampling
Here we discuss the use of pHMC in transition path sampling (TPS). As an application of Theorem 2.10, we obtain dimension-free contraction rates for exact preconditioned HMC in this context. Fix a time horizon τ > 0 (not to be confused with the duration parameter in preconditioned HMC which we denote by T ). The aim of TPS [64,37,4,38] is to sample from a diffusion bridge or conditioned diffusion, i.e., from the conditional law ν a,b of the solution given both initial and final conditions Here Ψ : R d → R is a given potential energy function and W is a d-dimensional standard Brownian motion. TPS is particularly relevant to molecular dynamics where the states a and b represent different configurations of a molecular system [6,58,62].
We first recenter: Let µ = ν•θ −1 M denote the law of the recentered bridge where θ M (x) = x−M is the translation on path space by the mean M(t) = a+(t/τ )(b− a) of the Brownian bridge from a to b. Then by Girsanov's theorem, the measure µ is absolutely continuous with respect to the law µ 0 of the Brownian bridge from 0 to 0 [39,4]. Moreover, the measure µ 0 is the centered Gaussian measure on the Hilbert space where ∆ D is the Dirichlet Laplacian, and the relative density of µ with respect to µ 0 is proportional to exp(−U (x)) where the function U (x) is defined in terms of the so-called path potential energy function G : R d → R as follows In the main convergence result given below, we make the following regularity assumption on G.
Assumption 3.1. The function G : R d → R is continuously differentiable. Moreover, ∇G(0) = 0, and ∇G is uniformly bounded and globally Lipschitz continuous, i.e., there exist finite constants M G , L G such that for all x, y ∈ R d , This regularity assumption frequently holds in molecular dynamics applications, since the configuration space of molecular systems is usually taken to be a fixed cubic box with periodic boundary conditions [1,29,66,13,23,8,45]. In this case, we can lift the TPS problem to the covering space R d by extending the path potential to a periodic function on this space. Thus after recentering the coordinate system, Assumption 3.1 is satisfied whenever G is C 2 .
To implement TPS on a computer, we use the finite difference method to approximate the infinite-dimensional distribution µ(dx) ∝ exp(−U (x))N (0, C)(dx) by a finite-dimensional probability measure µ m . Other approximations, e.g., Galerkin or finite-element, are also possible and should yield similar results. We focus on the finite difference method because it is widely used in practice. Discretize the interval [0, τ ] into m + 2 evenly spaced grid points The space of paths on R d is then approximated by the finite-dimensional space R md . Specifically, we write x ∈ R md as where the j-th component x j+1:j+d := (x j+1 , . . . , x j+d ) is a d-dimensional vector that can be viewed as an approximation of x(t j ) for j = 1, . . . , m. The Dirichlet Laplacian ∆ D is approximated by the md × md Dirichlet Laplacian matrix ∆ D,m with (i, j)-th entry The covariance operator C is approximated by the md×md matrix C = −∆ −1 D,m , and the Hilbert space H is represented by R md with inner product given by the weighted dot product x, y = τ m+1 x • y. The functional (3.2) is discretized as Note that if the vector x contains the grid values of a smooth function x, then U m (x) → U (x) as m → ∞. In summary, the infinite-dimensional path distribution µ(dx) is approximated by the finite-dimensional probability measure µ m (dx) with non-normalized density exp −U m (x) − 1 2 x, C −1 x . To approximately sample from µ m , we use pHMC with transition step in (2.10). This corresponds to a Markov chain on R md with transition step . Let π m denote the transition kernal of (3.4).
Theorem 3.2 (Transition Path Sampling). Suppose that Assumption 3.1 holds. Let κ := 2(τ 2 /π 2 )L G , m = √ 3κ , n = m d, and m = (m + 1)π/2 . Let R, c, C and be defined as Suppose that the duration parameter T ∈ (0, R) satisfies Then for any m > m , k ∈ N, and probability measure ν m on R md , Remark 3.3. Note that the upper bound in (3.11) depends on dimension only through the initial distribution. The dimension independence in the other terms of this bound reflects that the finite-dimensional pHMC algorithm in (3.4) converges to an infinite-dimensional pHMC algorithm whose transition kernel satisfies an infinite-dimensional analog of this quantitative bound.
A proof of this result is given in §7.1.

Path Integral Molecular Dynamics
Here we discuss the use of pHMC for path-integral molecular dynamics (PIMD), and as an application of Theorem 2.10, obtain dimension-free contraction rates for preconditioned HMC in this context. PIMD is used to compute exact Boltzmann properties and approximate dynamical properties of quantum mechanical systems [14]. The technique is based on Feynman's path-integral formulation of quantum statistical mechanics [27], and the observation that the quantum Boltzmann statistical mechanics of a quantum system can be reproduced by the classical Boltzmann statistical mechanics of a ring-polymer system [14].
Consider N interacting quantum particles in 3D with potential energy operator given byV = V (q 1 , . . . ,q N ) (3.12) whereq i is the three-dimensional position operator of particle i and V : R d → R is a potential energy function where d = 3N [32]. The thermal equilibrium properties of this system are described by the quantum mechanical Boltzmann partition function, where β is an inverse temperature parameter. For some a > 0, suppose that the potential energy function can be written as where G : R d → R. Then the partition function Q can be written as the expected value of a Gaussian random variable on loop space as follows and the covariance operator C a of the Gaussian reference measure is defined in terms of the Laplacian with periodic boundary conditions ∆ P on L 2 ([0, β], R d ) as follows C a = (−∆ P + aI) −1 , where I is the identity operator and the potential energy U (x) is given by The probability measures µ 0 and µ(dx) ∝ exp(−U (x)) N (0, C a )(dx) are supported on the loop space consisting of all periodic continuous paths x : [0, β] → R d . They are similar to the corresponding measures considered for Transition Path Sampling, but there is an additional, artificially introduced parameter a appearing in C a . This parameter is essential because ∆ P is not invertible since it has a zero (leading) eigenvalue corresponding to the 'centroid mode' [49].
To implement PIMD on a computer, we use finite-differences to truncate the infinite-dimensional path distribution µ to a finite-dimensional one µ m by discretizing the interval [0, β] into m + 1 grid points t j = βj/m, j = 0, . . . , m . (3.16) The space of loops on R d is approximated by the finite-dimensional space R md . Specifically, we write x ∈ R md as The periodic Laplacian ∆ P is approximated by the md×md discrete periodic Laplacian matrix ∆ P,m with (i, j)-th entry Naturally, the covariance operator C a is approximated by the md × md matrix C a = (−∆ P,m + aI md×md ) −1 where I md×md is the md × md identity matrix, and the infinite-dimensional Hilbert space H is represented by R md with inner product given by the weighted scalar product x, y = β m x • y. The functional in (3.2) is discretized as G(x j+1:j+d ) .
In this context, pHMC generates a Markov chain on R md with invariant measure µ m and with transition step given by Suppose that the duration parameter T ∈ (0, R) satisfies Then for any m > m , k ∈ N, and probability measure ν m on R md , (3.11) holds for the transition kernel of (3.17).
A proof of this result is given in §7.2.

Numerical Illustration of Couplings
Before turning to the proofs of our main results, we test the two-scale coupling defined by (2.15) numerically on the following distributions.
• A TPS distribution with the three-well path potential energy function illustrated in Figure 2 (a). The initial conditions of the components of the coupling are taken to be paths that pass through the two channels that connect the bottom two wells located at x ± ≈ (±1.048, −0.042). • A PIMD distribution where the underlying potential energy is the negative logarithm of the normal mixture density illustrated in Figure 2 (b). The mixture components are twenty two-dimensional Gaussian distributions with covariance matrix given by the 2 × 2 identity matrix and with mean vectors given by 20 independent samples from the uniform distribution over the rectangle [0, 10] × [0, 10]. The energy barriers are not large. The potential energy in this example is adapted from [46,44]. The initial paths are taken to be two unit circles one centered at (1, 1) and the other centered at (9,9). The parameter a is selected to be 0.1. • A PIMD distribution where the underlying potential energy is the negative logarithm of the Laplace mixture density illustrated in Figure 2 (c). The mixture components are twenty two-dimensional (regularized) Laplace distributions using the same covariance matrix and mean vectors as in the preceding example. However, unlike the preceding example, in this example the underlying potential is only asymptotically convex. The initial paths are taken to be two unit circles one centered at (1, 1) and the other centered at (9,9). The parameter a is selected to be 0.1. • A PIMD distribution where the underlying potential energy is the bananashaped potential energy illustrated in Figure 2 (d). This function is highly non-convex and unimodal with a global minimum at the point (1, 1). This minimum lies in a long, narrow, banana shaped valley. The initial paths are taken to be small circles centered at (±4, 16). The parameter a is selected to be 1.0.
For the TPS and PIMD distributions we use the finite-dimensional approximations described in §3.1 and §3.2, respectively. The resulting semi-discrete evolution equations are discretized in time using a strongly stable symmetric splitting integrator [3,43]. We describe this integrator in the specific context of TPS, since a very similar method is used for PIMD with C a replacing C in the dynamics, and the covariance matrix (m/β)C a replacing ((m + 1)/τ )C in the velocity randomization step. First, split (3.
with corresponding flows explicitly given by ). Given a time step size ∆t > 0, and using these exact solutions, a ∆t step of the symmetric splitting integrator we use is given by In order to mitigate the effect of periodicities or near periodicities in the underlying dynamics, we choose the number of integration steps to be geometrically distributed with mean T /∆t. The idea of duration randomization has a long history [19,21,50,13,10,11]. The initial velocity is taken to be an md-dimensional standard normal vector with covariance matrix ((m + 1)/τ )C and a Metropolis accept/reject step is added to ensure the algorithm leaves invariant µ m [69,11]. In summary, we use the following transition step in the simulations.
Algorithm 3.1 (Numerical Randomized pHMC). Denote by T > 0 the duration parameter and let ψ ∆t be the time integrator described in (3.25). Given the current state of the chain x ∈ R md , the algorithm outputs the next state of the chain X ∈ R md as follows.
∆t (x, ξ), and given ξ and k, γ is a Bernoulli random variable with parameter α defined as We stress that ξ and k from (Step 1) and (Step 2) are mutually independent and independent of the state of the Markov chain associated to pHMC. We pick the time step size ∆t of the integrator sufficiently small to ensure that 99% of proposal moves are accepted on average in (Step 3).
Realizations of the coupling process are shown in Figure 2. We chose parameters only for visualization purposes. The different components of the coupling are shown as different color dots. The insets of the figures show the distance between the components of the coupling as a function of the number of steps. Figure 3 shows the average time after which the distance between the components of the coupling is for the first time within 10 −12 . To produce this figure, we generated one hundred samples of the coupled process for one hundred different values of the duration parameter T . As indicated in the figure legends, the coupling parameter γ is set equal to either: • γ = 0 which corresponds to a synchronous coupling of the initial velocities; • γ = T −1 which corresponds to the optimal coupling of the initial velocities when b(x) = 0; and, • γ = cot(T ) which corresponds to the optimal coupling of the initial velocities when b(x) = −x.

A priori bounds
In this section we gather several bounds for the dynamics and for the coupling that will be crucial in the proof of our main result.

Bounds for the dynamics
In the following, we assume throughout that Assumption 2.6 is satisfied, and Recall that φ t = (q t , v t ) denotes the flow of (2.9). With the exception of using a different norm, the proofs of Lemmas 4.1 and 4.2 below are identical to the proofs of Lemmas 3.1 and 3.2 in Ref. [9] and therefore not repeated.
In particular, In particular,
Proof. Property (i) follows from the fact that f is concave. Since f is nondecreasing and constant for r ≥ R, (ii) is trivially true in the cases r ≤ r and r ≥ R. In the case r < min( r, R), Combining these cases gives (ii). Let Property (iii) then follows because g decreases with x, lim x→0 g(x) = 1 and g(x) ≥ max(1, x)e − max(1,x) .

Proof of Foster-Lyapunov drift condition
Before giving the proof of Lemma 2.7, here are some preparatory results. Using the shorthand notation (t) = |q t (x, ξ)| Hence, by Assumption 2.6 (B1) and (B3), we have the differential inequalities The following formula comes from two applications of integration by parts and is valid for any k ∈ N and for any twice differentiable function g : R → R, dr .
The Cauchy-Schwarz inequality, (6.11) and (6.12) now imply where in the last step we used 1 − x ≤ e −x with x = c 1 /4.    (2.30), f (r) = (1 − e − min(r,R)/T )T . Let x, y ∈ H s , and let r = |||x − y||| α . Suppose first that r ≤ T . Then f (r) ≥ r/e, and thus where in the last step we used (2.21). Now suppose that r > T . Then since also R ≥ T by assumption, f (r) ≥ (1 − e −1 )T , and thus we obtain Combining both cases and noting that by (2.28), α ≥ σ min , we see that which implies an analogue bound for the corresponding Wasserstein distances W s,1 and W ρ . Conversely, since f (r) ≤ T for all r, Therefore, with C defined by (2.36), we obtain

Proofs of results for TPS
To prove Theorem 3.2, we compare the eigenvalues of C to C. Note that these eigenvalues each have multiplicity d, and to account for this, define the index function ϕ(k, j) = d(k − 1) + j. Then the eigenvalues of C are and the eigenvalues of C are The following lemma helps estimate the error of the eigenvalues of the approximation C relative to those of C.
Proof of Theorem 3.2. This result is an application of Corollary 2.12. Since C is a finite-dimensional matrix, Assumption 2.1 holds for C with s = 0, and since we choose C = C, Assumption 2.5 also holds with s = 0. Therefore, to apply Corollary 2.12, it suffices to check that: (i) Assumption 2.6 holds with dimension-free constants L, n, K, and A, (ii) the dimension-free R defined in (3.6) satisfies condition (2.24), and (iii) the dimension-free condition (3.10) on the duration T implies (2.27) holds. We then invoke Corollary 2.12 to conclude convergence in the standard L 1 Wasserstein distance.

Proofs of results for PIMD
To prove Theorem 3.5, we compare the eigenvalues of C a to C a . The leading eigenvalue of C a has multiplicity d, while all of the other eigenvalues have multiplicity 2d. If m is odd, then the leading eigenvalue of both C a has multiplicity d, while all of the other eigenvalues of C a have multiplicity 2d. However, if m is even, then the trailing and leading eigenvalues of C a have multiplicity d, while all of the other eigenvalues of of C a have multiplicity 2d. To account for these multiplicities, it is helpful to define the index function if k = 1, 1 ≤ j ≤ d, 1 a + ω 2 k if k > 1, 1 ≤ j ≤ 2d. (7.11) For all m ∈ N and k ∈ {1, . . . , (m + 1)/2 }, the eigenvalues of C a are Here we have introduced θ k := (k − 1)π m and ω 2 k := 4(k − 1) 2 π 2 β 2 . (7.13) Note that the definition in (7.12) includes odd or even values of m. The following lemma estimates the error of the eigenvalues of C a relative to those of C a .
Taking square roots of both sides then gives (E2).
Proof of Theorem 3.5. This proof is very similar to the Proof of Theorem 3.2 with some differences which are highlighted below.