Non-reversible Metropolis-Hastings

The classical Metropolis-Hastings (MH) algorithm can be extended to generate non-reversible Markov chains. This is achieved by means of a modification of the acceptance probability, using the notion of vorticity matrix. The resulting Markov chain is non-reversible. Results from the literature on asymptotic variance, large deviations theory and mixing time are mentioned, and in the case of a large deviations result, adapted, to explain how non-reversible Markov chains have favorable properties in these respects. We provide an application of NRMH in a continuous setting by developing the necessary theory and applying, as first examples, the theory to Gaussian distributions in three and nine dimensions. The empirical autocorrelation and estimated asymptotic variance for NRMH applied to these examples show significant improvement compared to MH with identical stepsize.


Introduction
The Metropolis-Hastings (MH) algorithm Metropolis et al. (1953), Hastings (1970) is a Markov chain Monte-Carlo (MCMC) method of profound importance to many fields of mathematics such as Bayesian inference and statisti-B Joris Bierkens j.bierkens@warwick.ac.uk 1 Department of Statistics, University of Warwick, Coventry CV4 7AL, UK cal mechanics Diaconis and Saloff-Coste (1998), Diaconis (2008), Levin et al. (2009). The applicability of MH to a particular computational problem depends on the efficiency of the Markov chain that is generated by the algorithm. The chains generated by the classical Metropolis-Hastings algorithm are reversible, or, in other words, satisfy detailed balance; in fact, this reversibility is instrumental in showing that the resulting chains have the right invariant probability distribution.
However, non-reversible Markov chains may have better properties in terms of mixing behavior or asymptotic variance. This can be shown experimentally in special cases Suwa and Todo (2010), Turitsyn et al. (2011), Vucelja (2014), theoretically in special cases Diaconis et al. (2000), Neal (2004), and in fact, also in general Sun et al. (2010), Chen and Hwang (2013), with respect to asymptotic variance. See also Rey-Bellet and Spiliopoulos (2014) for improved asymptotic variance of non-reversible diffusion processes on compact manifolds.
There exist two basic approaches to the construction of non-reversible chains from reversible chains: one can 'lift' the Markov chain to a larger state space Diaconis et al. (2000), Neal (2004), Turitsyn et al. (2011), Vucelja (2014), or one can introduce non-reversibility without altering the state space Sun et al. (2010). In continuous spaces, the hybrid or Hamiltonian Monte Carlo Horowitz (1991), Neal (2011) is closely related to the lifting approach in discrete spaces. Other noteworthy publications on non-reversible Markov chains are Wilmer (1999), Geyer and Mira (2000).
In this paper we consider the second type of creating non-reversibility, i.e. without augmenting the state space. In discrete spaces this can, in principle, be achieved by changing transition probabilities (see Remark 2.2). However this may be have computational disadvantages, since it requires access to all transition probabilities. Furthermore, there is no such analogue in continuous spaces (crudely speaking, because all transition probabilities to specific states are zero).
To remedy these issues, in this paper MH is extended to 'non-reversible Metropolis-Hastings' (NRMH) which allows for non-reversible transitions. The main idea of this paper is to modify the acceptance ratio, which is further discussed in Sect. 2. For pedagogical purposes, the theory is first developed for discrete state spaces. It is shown how the acceptance probability of MH, can be adjusted so that the resulting chain in NRMH has a specified 'vorticity', and therefore, will be non-reversible. Any Markov chain satisfying a symmetric structure condition can be constructed by NRMH, which establishes the generality of the algorithm. Theoretical advantages of finite state space non-reversible chains in terms of improved asymptotic variance and large deviations estimates are briefly mentioned in Sect. 3. In particular we recall a result from Sun et al. (2010) that adding non-reversibility decreases asymptotic variance. Also we present a variation on a result by Rey-Bellet and Spiliopoulos (2014) on large deviations from the invariant distribution.
As mentioned above, for continuous state spaces it was so far not clear how general non-reversible Markov chains (i.e. discrete time, for arbitrary target density) could be constructed. One of the main advantages of NRMH is that it also applies in the setting of continuous state spaces, and thus provides a partial solution to this problem, as will be discussed and verified experimentally in Sect. 4. In particular we implement a non-reversible version of the Metropolis Adjusted Langevin Algorithm (MALA) for Gaussian multivariate target distributions. Finally, conclusions and directions of further research are discussed in Sect. 5.

Notation
We will consider both finite and infinite-dimensional vectors and matrices. The constant vector with all elements equal to 1 will be denoted by 1; the dimensionality of 1 should always be clear from the context. Similarly the identity matrix of any dimension will be denoted by I . For sets V, S, with V ⊂ S the indicator function of V is denoted by 1 V : S → R. The transpose of a matrix A is denoted by A . The Euclidean vector norm on R n , as well as its induced matrix norm, will be denoted by · . For a matrix A ∈ R n , the spectrum is denoted by σ (A). The spectral bound and spectral radius of A are denoted by s(A) = max{Re λ : λ ∈ σ (A)} and r(A) = max{|λ| : λ ∈ σ (A)}, respectively.

Metropolis-Hastings generalized to obtain non-reversible chains
As a preliminary to non-reversible Metropolis-Hastings, we require the notion of vorticity matrix, which is introduced in Sect. 2.1. The classical Metropolis-Hastings algorithm, discussed in Sect. 2.2, is extended using the notion of vorticity matrix to a non-reversible version in Sect. 2.3.

Non-reversible Markov chains and vorticity
Let P = (P(x, y)) denote a matrix of transition probabilities of a Markov chain on a finite or countable state space S. A distribution on S is a vector with positive elements in L 1 (S), and is not necessarily normalized, i.e. it is not necessarily the case that x∈S π(x) = 1. If π is a distribution such that x∈S π(x) = 1, then we call π a probability distribution. We will always assume that π(x) > 0 for all x ∈ S. A (probability) distribution π on S is said to be an invariant (probability) distribution of P if π P = π , i.e. x∈S π(x)P(x, y) = π(y) for all y ∈ S. A distribution π on S is said to satisfy the detailed balance condition with respect to P if diag(π )P = P diag(π ), i.e. π(x)P(x, y) = P(y, x)π(y) for all x, y ∈ S. If there exists a distribution π on S that satisfies detailed balance with respect to P, then P is said to be reversible. As is well known, and straightforward to check, if π satisfies detailed balance with respect to P, then π is invariant for P. A chain which does not satisfy the detailed balance condition with respect to its invariant distribution is called non-reversible. In a certain sense, this is a misnomer: we may obtain a time reversed Markov chain P by defining P(x, y) := π(y)P (y,x) π (x) . In fact, P is the adjoint of P with respect to the inner product (·, ·) π , defined by ( f, g) π = x∈S f (x)g(x)π(x). For further background material on Markov chains the reader is referred to Levin et al. (2009).
Consider a non-reversible Markov chain P. Let K = 1 2 (P + P). Then K is a reversible Markov chain with invariant distribution π . We will also define a vorticity matrix as (essentially) the skew-symmetric part of P: or in matrix notation One can think of as (a transformation of) the skewsymmetric part of P: = diag(π )(P − P). The following simple observations are fundamental to this paper. Lemma 2.1 Let P be the transition matrix of a Markov chain on S and let π be a distribution on S. Let be defined by (1). Then: (i) is skew-symmetric, i.e. = − ; (ii) π satisfies detailed balance with respect to P if and only if = 0; (iii) π is invariant for P if and only if 1 = 0.
Proof (i) and (ii) are immediate. As for (iii), note that which is zero for all x if and only if π is invariant for P.
In light of Lemma 2.1, a matrix ∈ R n×n which is skewsymmetric and satisfies 1 = 0 is called a vorticity matrix. If is related by (1) to a Markov chain P with invariant distribution π , it is called the vorticity of P and π . It will be a key ingredient in the construction of a non-reversible version of Metropolis-Hastings. Remark 2.2 A direct way of constructing a non-reversible chain P from a reversible chain K and a vorticity matrix , is by letting P(x, y) = K (x, y) + 1 2π(x) (x, y), provided that P is a probability matrix (i.e. has nonnegative entries). This is discussed in e.g. Sun et al. (2010). In order to make a transition from a state x, one has to compute entries of K (x, ·) and (x, ·) to determine the transition probabilities. This approach has no alternative in uncountable state spaces. This is the main reason for wishing to develop a method that does not depend on the construction mentioned in this remark.

Metropolis-Hastings
In the Metropolis-Hastings (MH) algorithm a reversible Markov chain P 0 with a given invariant distribution π is constructed. We will assume, mainly for simplicity, throughout this paper that π(x) > 0 for all x ∈ S. As an ingredient for the construction of P 0 , a Markov chain Q is used, satisfying the symmetric structure condition In other words, whenever a transition from x to y has positive probability, the reverse probability also has positive probability. The Hastings ratio R 0 (x, y) is defined as π(x)Q(x,y) , for all x, y ∈ S for which π(x)Q(x, y) = 0, 1 otherwise. (3) With this definition of R 0 , acceptance probabilities are defined as and transition probabilities P 0 (x, y) are defined by It is a straightforward exercise to show that the chain P 0 has π as its invariant distribution. An important step is the observation that R 0 (x, y) ≤ 1 if and only if R 0 (y, x) ≥ 1, which will be a recurring phenomenon in the sequel.

Non-reversible Metropolis-Hastings
We will now discuss how this framework can be extended to construct Markov chains that are, in general, non-reversible. Let ∈ R n×n be a vorticity matrix, and let Q be the transition matrix of a Markov chain, satisfying (2). Again, π : S → (0, ∞) is some distribution that is not necessarily normalized and has only positive entries. We define for x, y ∈ S the non-reversible Hastings ratio as and let, analogously to MH, the acceptance probabilities A be A (x, y) := min 1, R (x, y) .
Entries of can be negative. In order to avoid the situation that A becomes negative, we will explicitly constrain vorticity matrix to satisfy (x, y) ≥ −π(y)Q(y, x) for all x, y ∈ S.
In particular, by the symmetric structure condition (2), should have zeroes wherever Q has zeroes. As with Metropolis-Hastings, the transition probabilities P (x, y) are defined by Note that indeed P is a matrix of transition probabilities. For = 0, A and therefore P reduce to A 0 and P 0 , so that the chosen notation is consistent.
In order to check that the proposed Markov chain has π as its invariant density, we need to verify that , π and P are related through (1). As a crucial step, we employ the following lemma, in analogy with Metropolis-Hastings. (2), π a distribution that is nowhere zero, such that (8) holds. Let R be as above. Then R (y, x) > 1 if and only if R (x, y) < 1 for any x, y ∈ S for which Q(x, y) = 0.

Lemma 2.3 Let be a vorticity matrix, Q a matrix of transition probabilities satisfying
so that R (y, x) > 1, using that is skew-symmetric. The converse direction is analogous.
Using the previous lemma, it is now straightforward to show that is the vorticity matrix of (P , π).
Proof If x, y are such that R (x, y) < 1. Then A (x, y) = R (x, y), and A (y, x) = 1. Therefore using Lemma 2.3. so that (1) is satisfied for such x and y. The cases in which R (x, y) = 1 or > 1 are analogous.
Finally, since by the assumption that is a vorticity matrix, 1 = 0, and hence Lemma 2.1 (iii) gives that π is invariant for P . We have obtained our main result.
Theorem 2.5 Let Q be a Markov chain, a vorticity matrix, and π a distribution on S that is everywhere positive, such that (2) and (8) are satisfied. Let P be defined through (6), (7) and (9). Then P has π as invariant distribution and as its vorticity matrix.
Remark 2. 6 We will refer to a combination (Q, , π) which satisfies (2) and (8) as a compatible combination. Especially verifying condition (8) requires some knowledge about π , but these do not need to be exact: it will suffice to have acces to a lower bound for π . In the proof of Theorem 4.2 the analogue of this condition for continuous spaces is checked as an example.
Remark 2.7 Once we have access to a compatible combination of proposal chain Q, vorticity matrix and target distribution π , the NRMH algorithm has similar favorable properties as MH, in that only local information is required: Q, π and only need to be evaluated at the current and proposed state, and no normalization of π is required. In Sect. 4 this will become even more important when NRMH is applied to a problem in continuous state space.

General observations on NRMH
The following trivial observation serves to indicate the generality of this approach. It asserts that every Markov chain may be build, in a trivial way, by the described procedure.

Proposition 2.8 Let P be a Markov chain with invariant distribution π and corresponding vorticity , satisfying (2).
If we use Q = P as proposal distribution and as vorticity matrix in the NRMH algorithm with target distribution π , then the resulting Markov chain P is identical to P.
Proof It suffices to note, that by Q = P and (1), A (x, y) = 1 for all x, y ∈ S, x = y. Remark 2.9 If, for some pair (x, y) ∈ S × S, (8) holds with equality, the transition probability P (x, y) = 0 even when Q(x, y) = 0. Therefore irreducibility of Q does not imply irreducibility of P , unless we impose the stronger condition: for all x, y ∈ S.
As noted in Remark 2.2, one may alternatively construct any non-reversible chain P by 'adding' a vorticity matrix to a reversible chain K . We may translate one approach into the other, as follows:

Proposition 2.10 Consider irreducible transition kernels Q(x, y) and H (x, y), related by Q(x, y)
, both satisfying the symmetric structure condition (2). Suppose Q, π and satisfy (8). Suppose P 1 is the Markov kernel obtained from NRMH, using Q as proposal chain and as vorticity matrix. Suppose P 2 is the Markov kernel given by P 2 (x, y) = K (x, y) + 1 2π(x) (x, y), where K is the classical Metropolis-Hastings kernel obtained by using H as proposal chain (and π as target distribution).
Both P 1 and P 2 have zeros on the off-diagonal entries for which Q has zeros. Since both P 1 and P 2 represent transition probabilities, this also fixes their diagonal elements, which concludes the proof.

Advantages of non-reversible Markov chains in finite state spaces
Non-reversible Markov chains may offer important computational advantages compared to reversible chains. We will briefly discuss some of these advantages as they apply to finite state space Markov chains. It is to our knowledge an open question how to extend the results of Sects. 3.1 and 3.2 in a generic way to uncountable and continuous state spaces.

Asymptotic variance
Consider a Markov chain P on S with invariant probability distribution μ.
In this case σ 2 f is called the asymptotic variance.
We will in this section work under the assumption that S is finite. In this case, for any f : S → R and irreducible P, a CLT is satisfied [see e.g. Roberts and Rosenthal (2004)]. The following result is obtained in Sun et al. (2010), for a more extensive argument see Chen and Hwang (2013).
Proposition 3.1 Let K be a transition matrix of an irreducible reversible Markov chain with invariant probability distribution μ. Let be a non-zero vorticity matrix and let P = K + 1 2 diag(μ) −1 be the transition matrix of an irreducible Markov chain. For any f : S → R and denote by σ 2 f,K and σ 2 f,P the asymptotic variances of f with respect to the transition matrices K and P, respectively.
Then for all f : S → R, we have σ 2 f,P ≤ σ 2 f,K , and there exists an f such that σ 2 f,P < σ 2 f,K . In words, adding non-reversibility decreases asymptotic variance.

Large deviations
In Rey-Bellet and Spiliopoulos (2014) it is noted that nonreversible diffusions on compact manifolds have favorable properties in terms of large deviations of the occupation measure from the invariant distribution. Inspired by their result, we present a simple (but to our knowledge novel) result in the same direction for finite state spaces.
As in the previous section, assume S is finite. We may transform a discrete time Markov chain on S into a continuous time chain by making subsequent transitions after random waiting times that have independent Exp(λ) distributions. The discrete chain with transition matrix P then transforms into a continuous time chain with generator The occupation measure of the resulting Markov process is defined as L t = 1 t t 0 δ X s ds. If G is irreducible, the occupation measure satisfies for large time a large deviation principle with rate function i.e. informally, for large t, for A ⊂ P(S), Here P(S) is the set of probability distributions on S. The rate function satisfies the properties that (i) I G ≥ 0, (ii) I G is strictly convex, and (iii) where π is the invariant distribution of G. Hence I G quantifies the probability of a large deviation of the occupation measure from the invariant distribution for large t. See Hollander (2000) for further details.
The following result shows that deviations for nonreversible continuous time chains from the invariant distribution are asymptotically less likely than for the corresponding reversible chain.

Proposition 3.2 Suppose G admits a decomposition of the
Hence it suffices to prove the result for G = K + 1 2 , with K symmetric and anti-symmetric.
The first term is equal to the the rate function I K (μ) for the empirical measure in the symmetric case, see e.g. (Hollander 2000, Theorem IV.14). Since is skew-symmetric, the second term vanishes. Hence taking the supremum over u gives a value larger than or equal to I K (μ). Let u be given by which proves the second statement in the proposition. If μ(x) = 0 for some x, the proof carries over by only summing over indices for which μ(x) > 0.

Mixing time and spectral gap
Non-reversibility in a Markov chain can have very favorable effects on mixing time and spectral gap, but we are not aware of general results in this direction. The reader is referred to Chen et al. (1999), Diaconis et al. (2000), Levin et al. (2009) for theoretical results in this direction, and to e.g. Sun et al. (2010), Turitsyn et al. (2011), Vucelja (2014 for less rigorous and/or numerical results. In these experimental results, nonreversibility especially seems to improve sampling in the case of sampling from a multimodal distribution.

NRMH in Euclidean space
In this section we explain how to extend the idea of nonreversible Metropolis-Hastings algorithm to a Euclidean state space. In Appendix 1 it is discussed how Metropolis-Hastings can be applied in the case of a general measurable space, and the discussion in this section is a special case of this.

General setting
Suppose Q is a Markov transition kernel on R n which has density q(x, y) with respect to Lebesgue measure, i.e. Q(x, rmdy) = q(x, y)dy. We will be interested in sampling from a distribution on R n with Lebesgue density π . We do not require that R n π(x) dx = 1. Let γ : R n × R n → R be Lebesgue measurable and furthermore suppose γ satisfies and A×R n γ (x, y) dx dy = 0, for all A ∈ B(R n ).
Here B(R n ) denotes the Borel σ -algebra generated by open sets in R n . Furthermore suppose that and γ (x, y) + π(y)q(y, x) ≥ 0, for all x, y ∈ R n for which π(x)q(x, y) = 0.
Define the Hastings ratio acceptance probabilities A(x, y) := min(1, R(x, y)), and transition kernel The proof of the following theorem can be found in Appendix 1.
In analogy with the discrete state space setting we call γ the vorticity density of (P, π). If γ = 0 on a set of positive Lebesgue measure then P is non-reversible, i.e. there exist sets B 1 , B 2 ⊂ R n such that

Langevin diffusions for sampling in Euclidean space
The application of non-reversible sampling methods in Euclidean space is a relatively unexplored area. In this short review section we will discuss the use of Langevin diffusions for simulating from a target density, and discuss the potential role of NRMH in within this context. Assume that the target density π is continuously differentiable. It is well known that π is invariant for the Langevin diffusion where W is a standard Brownian motion in R n . A natural (discrete time) Markov chain for sampling π is then the Euler-Maruyama discretization of the Langevin diffusion, Here h is a suitable stepsize. This discretization is approximately correct if h is chosen to be sufficiently small. However, such a choice of h results in slow convergence to equilibrium of the Markov chain. When h is large, then the discretization results in large discrepancy between (17) and (18). As a result also the respective invariant distributions will be different and in particular the invariant distribution of (18) will not correspond to the desired distribution with density π . It is customary to correct for this by considering the Euler-Maruyama discretization as a proposal for Metropolis-Hastings, resulting in the Metropolis Adjusted Langevin Algorithm [MALA, Roberts and Tweedie (1996), Roberts and Rosenthal (1998)]. Any diffusion of the form with S ∈ R n×n skew-symmetric, has π as invariant density. If S = 0 then the diffusion is non-reversible. In analogy with Sect. 3 one may hope that such a non-reversible diffusion has advantages compared to the reversible Langevin diffusion. In fact for multivariate Gaussian target distributions these advantages are clear Hwang et al. (1993), Lelièvre et al. (2013), as we will discuss below. More generally 1 the probability of large deviations of the empirical distribution from the invariant distribution are reduced for non-reversible diffusions Rey-Bellet and Spiliopoulos (2014). When discretizing (19) it is again necessary to correct for discretization error by a MH accept/reject step. However, since MH generates reversible chains, one should expect that also the favourable properties of non-reversibility are destroyed. Instead, an implementation of NRMH should be able to preserve these favourable properties of non-reversible diffusions. We will illustrate this for multivariate Gaussian distributions.

Non-reversible Metropolis-Hastings for sampling multivarite Gaussian distributions
Consider as target distribution a centered normal distribution with positive definite covariance matrix V . In this case the Langevin diffusion becomes the Ornstein-Uhlenbeck process where (W (t)) is an n-dimensional standard Brownian motion. In Hwang et al. (1993), it is shown that adding a 'nonreversible' component of the form −SV −1 to the drift, with S skew-symmetric, can improve convergence of the sample covariance. Therefore we will instead consider the Ornstein-Uhlenbeck process with modified drift Also, s(B) ≤ s(−V −1 ) for any choice of S. In other words, adding a non-reversible term increases the speed of convergence to equilibrium. In Lelièvre et al. (2013), it is established that it is possible to choose S optimally, resulting in s(B) = − tr(V −1 )/n. By choosing S in such a way the convergence of the chain is effectively governed by the average of the eigenvalues, which should be compared to the reversible case in which the 'worst' eigenvalue determines the speed of convergence. We will apply the theory developed in Sect. 4.1 to the time discretization of the non-reversible Ornstein-Uhlenbeck process. To be able to satisfy (15) later on, we will require flexibility in the magnitude of the drift multiplier B and the diffusivity. We consider the time discretization of (21), with step size h > 0, is where (Z k ) are i.i.d. standard normal and σ > 0. For σ = 1 this is the usual Euler-Maruyama discretization. The transition kernel of the Euler-Maruyama discretization will serve as our proposal distribution Q(x, dy) = q(x, y) dy, and we will first determine the vorticity density γ of Q.
Let ρ denote the density of the invariant probability distribution of Q. Let f (x, y) = ρ(x)q(x, y). The vorticity density of the proposal chain is As target density we have It is clear that γ satisfies (11) and (12). Also since π and q are non-degenerate, (13) and (14) are satisfied. The same statements hold trivially for scalar multiples of γ . Verification of (15) requires more effort. We provide a sufficient condition. The proof of this result is provided in Appendix 1.
Remark 4.3 How should one choose c, h and σ ? It seems reasonable to choose σ 2 equal to the maximal allowed value in (26), so that the deviation from the Euler-Maruyama discretization (which has σ = 1) is minimal; i.e. let To maximize the non-reversibility effects in the acceptance probability one should choose c as large as possible, i.e. c = σ n (h). A heuristic estimate for the scaling of the expected step size is given by the step size h times the multiplicative factor in the vorticity, c = σ n (h) in the acceptance probability. Note that hσ n (h) = 0 for h = 0 and h = 2 C 2 .

Maximization of hσ n (h) with respect to h yields
as long as C 1 < C 2 (which is the case in which S and V do not commute). This expression satisfies the condition h < 2 C 2 . A first order Taylor approximation around 1/n yields the simplified expression h ≈ 4 C 2 (n+2) < 2 C 2 . The corresponding value of σ 2 (h) is to first order equal to σ 2 (h) ≈ 1 − 2C 1 C 2 (n+2) . In case C 1 = C 2 , the optimal value of h is given by h = 4 (n+2)C 2 , with σ 2 (h) = 1 − 2 n+2 .
To summarize, a general procedure for applying nonreversible Metropolis-Hastings may be described by the steps listed in Fig. 1.

Numerical experiments
Below we carry out two experiments illustrating the approach above. For the obtained Markov chain realization (X 1 , . . . , X P ), we will obtain an estimate of the decorrelation by considering the empirical autocorrelation function (EACF) r defined by where i ranges over the coordinates i = 1, . . . , n, and where μ i = 1 P P p=1 X i p is the empirical average of the i-th coordinate. A fast decaying EACF indicates that the samples generated by the Markov chain are quickly decorrelating.

Three-dimensional example
In this example, from Hwang et al. (1993), we take as target covariance structure V a diagonal matrix with diag(V ) = 1, 1, 1/4 s. The optimal nonlinear drift is obtained by letting We choose the parameter values in accordance with Remark 4.3, resulting in The performance of NRMH is compared to MH with identical step-size h, and reversible proposal distribution Q(x, dy) ∼ N (I − hV −1 )x, 2h I . In Fig. 2 the EACFs for this 3-dimensional example are plotted. Here we see that NRMH helps to decrease the autocorrelations of the slowly decorrelating components in MH (here, the first two components). It achieves this without increasing autocorrelations of components that are already quickly decorrelating (here, the third component).
Using the algorithm described in Lelièvre et al. (2013) an optimal non-reversible drift B = −(I + S)V −1 can be computed. For reversible dynamics, we have s(−V −1 ) = −1.0444, while for the optimal non-reversible dynamics, s(B) = − tr V −1 /n = −3.2891. In Fig. 3 the EACFs for this 9-dimensional example are plotted. One can clearly see the typical effect of adding non-reversibility: the autocorrelation of the worst coordinates is improved so that it becomes on par with that of the fastest decorrelating coordinates. In this case, choosing c, h and σ as in Remark 4.3 results in values c = 0.4313, h = 7.0822 × 10 −4 and σ = 0.9108.
In a numerical example with 10 7 proposed transitions this leads to acceptance ratios displayed in Table 2. Using the batch means method (by dividing the sample trajectory in √ n trajectories of length √ n and assuming the √ n are independent), we can estimate asymptotic variance. The resulting estimates for asymptotic variance of the different components are given in Table 1. It should be noted that the notion of asymptotic variance is only defined in case a CLT holds [see Roberts and Rosenthal (2004)], which strictly speaking is an open question in this setting.
It is well known that for optimal convergence in Metropolis Adjusted Langevin (MALA), the stepsize should be tuned so that the ratio of accepted proposals is approximately equal to 0.574 Roberts and Rosenthal (1998). Compared to this, the acceptance ratios in our example, given in Table 2, are fairly high. In particular the Metropolis-Hastings chain can be improved significantly by increasing step size h, and so we are currently comparing NRMH with a sub-optimal tuning of MH.
The results of this experimental section should therefore be considered as a proof of concept of NRMH in continuous state spaces, rather than as an advertisement for its immediate practicality. The experiments do illustrate the faster decorrelation of NRMH in comparison to MH (for fixed step-size).
It is an open question if the framework of NRMH can be extended so that NRMH would become competitive with optimally tuned MH. Note that NRMH improves the (estimated) asymptotic variance for almost all components (and often significantly), except for component #7

Discussion
The efficiency increase of non-reversible Markov chains in MCMC can be significant, in terms of either asymptotic variance or mixing properties, as remarked in this paper. NRMH extends the MCMC-toolbox with a method to utilize these benefits. In particular for continuous state spaces it was, to our knowledge, not known how to construct non-reversible Markov chains for MCMC sampling (taking into account the necessity of a correction step when using time discretization of diffusions). Using the theory developed in Sect. 4, NRMH can be applied to general distributions on R n as follows. For a target density function π , suppose there exists a Gaussian distribution N (0, V ) with density function π 0 on R n , satisfying kπ 0 ≤ π on R n for some k > 0. Then if γ is a suitable vorticity density for sampling from N (0, V ), using proposal density q(x, y), we have for γ := kγ that γ (x, y) + π(x)q(x, y) = kγ (x, y) + π(x)q(x, y) ≥ k γ (x, y) + π 0 (x)q(x, y) ≥ 0, so that (15) is satisfied for the combination for this choice of π , γ and q, and Theorem 4.1 applies. Such a suitable choice of γ may be determined as described in Sect. 4.3.
The approach outlined in Sect. 4 should be considered as a first attempt at implementing the NRMH framework for continuous spaces. As discussed, in order to use the framework one needs to verify the non-negativity condition which leads to technical challenges. In particular we expect that much progress is possible in weakening the conditions of Theorem 4.2. The current form of that proposition results in a relatively small step size h, which obstructs fast convergence of NRMH. As mentioned before, it is an open question whether NRMH in continuous spaces can be made competitive with optimally tuned MALA.
The theoretical discussion of Sect. 3 and the numerical experiment of Sect. 4.4 illustrate how efficiency can be improved by employing non-reversible Metropolis-Hastings. In view of these encouraging results it will hopefully be possible to extend the result to more general settings. The practical application of NRMH depends on the identification of suitable vorticity structures that are compatible with proposal chains, and establishing these in practical examples provides a promising and challenging direction of research.
Analysis of non-reversible Markov chains is difficult, essentially because self-adjointness is lost. Without selfadjointness, it is much more difficult to connect spectral theory to mixing properties of chains. It seems that a good way of understanding benefits of non-reversible sampling is by studying Cesaro averages [see Levin et al. (2009) and e.g. the result on large deviations in Sect. 3.2]. The results of Sect. 3 which establish that non-reversible chains have better asymptotic variance or large deviations properties, are so far qualitative in nature (i.e. fail to quantify the amount of improvement). To obtain quantitative results is an important challenge that remains to be addressed. Also, it is object of further study how these results carry over to countable and uncountable state spaces. In particular, the question under what conditions the resulting chains are geometrically ergodic and/or satisfy a CLT should be considered.
Then is a signed measure on S × S, satisfying and We will call a signed measure on S × S satisfying (30), (31) a vorticity measure. If is related to π and P by (29), it is called the vorticity measure of (P, π).
Let Q(x, dy) and let F Q and B Q as defined above with P replaced by Q. Let be a vorticity measure. The Markov chain Q will play the role of proposal chain, and the role of target vorticity.

Definition 5.1 (Absolute continuity of )
We can use the Jordan decomposition (Rudin 1987, Sect. 6.6) to decompose into two non-signed measures + := 1 2 (| | + ) and − = 1 2 (| | − ), so that = + − − . We say that is absolutely continuous with respect to some measure M on S × S, denoted by M, if − M and + M. If M, we define the Radon-Nikodym derivative of with respect to M by Assuming B Q F Q and F Q , we define the nonreversible Hastings ratio R(x, y) In order for R(x, y) to be nonnegative, we have to impose the condition   (B, A) for A, B ∈ S. Suppose for x, y ∈ S, R(x, y) ≤ 1, so that We compute where (34) was used to establish the inequality.
Define the acceptance probability by and define a transition kernel P by P(x, dy) = A(x, y)Q(x, dy) Note that, for some measurable function h : S → R, This shows that the measure 1 − S A(x, z)Q(x, dz) δ x (dy)π(dx) in (36) is symmetric with respect to interchanging x and y, and therefore has no contribution in the verification of (37 Then π is invariant for P and is the vorticity measure of (P, π).
Proof Lemma 5.3 establishes that is the vorticity measure of (P, π). By 31, for A ∈ S, so that π is invariant for P.
Proof of Theorem 4.1 Define the signed measure , and measures F Q and B Q on S × S by (dx, dy) = γ (x, y) dx dy, B Q (dx, dy) = π(dy)Q(y, dx).
Then by (11), (12), is a vorticity measure, and by assumptions (13) and (14), F Q and B Q are equivalent measures, and is absolutely continuous with respect to F Q . Furthermore R(x, y) is a version of d dF Q + dB Q dF Q , and by assumption (15), We see that all conditions of Theorem 5.4 are satisfied, so that the stated results follow.

Appendix 2: Proof of Theorem 4.2
Define an inner product ·, · V on R n by x, y V = x, V −1 y , where ·, · is the Euclidean inner product. Let · V denote the induced norm. Let B := −(I + S)V −1 .
Proof Suppose λ ∈ σ (I + h B) and let x ∈ C n , x = 0, be an eigenvector corresponding to λ. Without loss of generality assume that x V = V −1/2 x = 1. We have By skew-symmetry of S, Hence Hence the requirement |λ| < 1 translates into the inequality or, after some rearranging, Looking at the denominator, using Cauchy-Schwartz, It follows that if h satisfies (38), then it satisfies (39), and therefore r(I + h B) < 1.
If (38) holds, by (Lancaster and Tismenetsky 1985, Theorem 13.2.1) there exists a unique solution R = R(σ ) to the discrete Lyapunov Eq. (23). Recall f (x, y) = ρ(x)q(x, y), where ρ is the density of N (0, R) and q(x, ·) the density function of N ((I + h B)x, (2hσ 2 )I ). Hence f is a Gaussian density function with mean zero and covariance matrix Lemma 5.6 Suppose (38) holds. Then det M = (2hσ 2 ) n det(R) Proof By standard result on determinants of block matrices, In the argument of the second determinant we recognize (23), from which we obtain det M = det(R) det(2hσ 2 I n ) = (2hσ 2 ) n det(R). (iii) For h < 2 C 2 we have B R B ≤ 2σ 2 C 1 2 − hC 2 .
Therefore if h < 2 C 2 , then (38) holds, and therefore R is well-defined. From the proof of Lemma 5.7, R = σ 2 V + h ∞ 0 e Bs B R B e B s ds. Hence, using (ii), The result follows after rearranging, using the following equality (which holds for any matrix K ) to express B V in terms of the · -norm: V −1/2 K V 1/2 y y = V −1/2 K V 1/2 .
which is equivalent to the stated result.