# Non-reversible Metropolis-Hastings


## Abstract

The classical Metropolis-Hastings (MH) algorithm can be extended to generate non-reversible Markov chains. This is achieved by means of a modification of the acceptance probability, using the notion of vorticity matrix. The resulting Markov chain is non-reversible. Results from the literature on asymptotic variance, large deviations theory and mixing time are mentioned and, in the case of a large deviations result, adapted, to explain how non-reversible Markov chains have favorable properties in these respects. We provide an application of NRMH in a continuous setting by developing the necessary theory and applying, as first examples, the theory to Gaussian distributions in three and nine dimensions. The empirical autocorrelation and estimated asymptotic variance for NRMH applied to these examples show significant improvement compared to MH with identical stepsize.

## Keywords

Markov Chain Monte Carlo (MCMC) · Metropolis-Hastings · Non-reversible Markov processes · Asymptotic variance · Large deviations · Langevin sampling

## Mathematics Subject Classification

65C40 · 60J20

## 1 Introduction

The Metropolis-Hastings (MH) algorithm Metropolis et al. (1953), Hastings (1970) is a Markov chain Monte-Carlo (MCMC) method of profound importance to many fields of mathematics such as Bayesian inference and statistical mechanics Diaconis and Saloff-Coste (1998), Diaconis (2008), Levin et al. (2009). The applicability of MH to a particular computational problem depends on the efficiency of the Markov chain that is generated by the algorithm. The chains generated by the classical Metropolis-Hastings algorithm are *reversible*, or, in other words, satisfy *detailed balance*; in fact, this reversibility is instrumental in showing that the resulting chains have the right invariant probability distribution.

However, non-reversible Markov chains may have better properties in terms of mixing behavior or asymptotic variance. This can be shown experimentally in special cases Suwa and Todo (2010), Turitsyn et al. (2011), Vucelja (2014), theoretically in special cases Diaconis et al. (2000), Neal (2004), and in fact, also in general Sun et al. (2010), Chen and Hwang (2013), with respect to asymptotic variance. See also Rey-Bellet and Spiliopoulos (2014) for improved asymptotic variance of non-reversible diffusion processes on compact manifolds.

There exist two basic approaches to the construction of non-reversible chains from reversible chains: one can ‘lift’ the Markov chain to a larger state space Diaconis et al. (2000), Neal (2004), Turitsyn et al. (2011), Vucelja (2014), or one can introduce non-reversibility without altering the state space Sun et al. (2010). In continuous spaces, the *hybrid* or *Hamiltonian Monte Carlo* Horowitz (1991), Neal (2011) is closely related to the lifting approach in discrete spaces. Other noteworthy publications on non-reversible Markov chains are Wilmer (1999), Geyer and Mira (2000).

In this paper we consider the second way of creating non-reversibility, i.e. without augmenting the state space. In discrete spaces this can, in principle, be achieved by changing transition probabilities directly (see Remark 2.2). However, this may have computational disadvantages, since it requires access to all transition probabilities. Furthermore, there is no analogue of this construction in continuous spaces (crudely speaking, because all transition probabilities to specific states are zero).

To remedy these issues, in this paper MH is extended to ‘non-reversible Metropolis-Hastings’ (NRMH), which allows for non-reversible transitions. The main idea of this paper is to modify the acceptance ratio, which is further discussed in Sect. 2. For pedagogical purposes, the theory is first developed for discrete state spaces. It is shown how the acceptance probability of MH can be adjusted so that the resulting chain in NRMH has a specified ‘vorticity’, and is therefore non-reversible. Any Markov chain satisfying a symmetric structure condition can be constructed by NRMH, which establishes the generality of the algorithm. Theoretical advantages of finite state space non-reversible chains in terms of improved asymptotic variance and large deviations estimates are briefly mentioned in Sect. 3. In particular we recall a result from Sun et al. (2010) that adding non-reversibility decreases asymptotic variance. Also we present a variation on a result by Rey-Bellet and Spiliopoulos (2014) on large deviations from the invariant distribution.

As mentioned above, for continuous state spaces it was so far not clear how general non-reversible Markov chains (i.e. discrete time, for arbitrary target density) could be constructed. One of the main advantages of NRMH is that it also applies in the setting of continuous state spaces, and thus provides a partial solution to this problem, as will be discussed and verified experimentally in Sect. 4. In particular we implement a non-reversible version of the Metropolis Adjusted Langevin Algorithm (MALA) for Gaussian multivariate target distributions. Finally, conclusions and directions of further research are discussed in Sect. 5.

### 1.1 Notation

We will consider both finite and infinite-dimensional vectors and matrices. The constant vector with all elements equal to 1 will be denoted by \(\mathbbm {1}\); the dimensionality of \(\mathbbm {1}\) should always be clear from the context. Similarly the identity matrix of any dimension will be denoted by *I*. For sets *V*, *S*, with \(V \subset S\) the indicator function of *V* is denoted by \(\mathbbm {1}_V : S \rightarrow \mathbb {R}\). The transpose of a matrix *A* is denoted by \(A^{\prime }\). The Euclidean vector norm on \(\mathbb {R}^n\), as well as its induced matrix norm, will be denoted by \(\Vert \cdot \Vert \). For a matrix \(A \in \mathbb {R}^n\), the spectrum is denoted by \(\sigma (A)\). The spectral bound and spectral radius of *A* are denoted by \(\mathfrak {s}(A) = \max \{ {\text {Re}}\lambda : \lambda \in \sigma (A)\}\) and \(\mathfrak {r}(A) = \max \{ |\lambda | : \lambda \in \sigma (A)\}\), respectively.

## 2 Metropolis-Hastings generalized to obtain non-reversible chains

As a preliminary to non-reversible Metropolis-Hastings, we require the notion of *vorticity matrix*, which is introduced in Sect. 2.1. The classical Metropolis-Hastings algorithm, discussed in Sect. 2.2, is extended using the notion of vorticity matrix to a non-reversible version in Sect. 2.3.

### 2.1 Non-reversible Markov chains and vorticity

Let \(P = (P(x,y))\) denote a matrix of transition probabilities of a Markov chain on a finite or countable state space *S*. A *distribution* on *S* is a vector with positive elements in \(L^1(S)\), and is not necessarily normalized, i.e. it is not necessarily the case that \(\sum _{x \in S} \pi (x) = 1\). If \(\pi \) is a distribution such that \(\sum _{x \in S} \pi (x) = 1\), then we call \(\pi \) a *probability distribution*. We will always assume that \(\pi (x) > 0\) for all \(x \in S\). A (probability) distribution \(\pi \) on *S* is said to be an *invariant (probability) distribution* of *P* if \(\pi ^{\prime }P = \pi ^{\prime }\), i.e. \(\sum _{x \in S} \pi (x) P(x,y) = \pi (y)\) for all \(y \in S\). A distribution \(\pi \) on *S* is said to satisfy the *detailed balance condition* with respect to *P* if \({\text {diag}}(\pi ) P = P^{\prime } {\text {diag}}(\pi )\), i.e. \(\pi (x) P(x,y) = P(y,x) \pi (y)\) for all \(x, y \in S\). If there exists a distribution \(\pi \) on *S* that satisfies detailed balance with respect to *P*, then *P* is said to be *reversible*. As is well known, and straightforward to check, if \(\pi \) satisfies detailed balance with respect to *P*, then \(\pi \) is invariant for *P*. A chain which does not satisfy the detailed balance condition with respect to its invariant distribution is called *non-reversible*. In a certain sense, this is a misnomer: we may obtain a time reversed Markov chain \(\widehat{P}\) by defining \(\widehat{P}(x,y) := \frac{ \pi (y) P(y,x)}{\pi (x)}\). In fact, \(\widehat{P}\) is the adjoint of *P* with respect to the inner product \((\cdot ,\cdot )_{\pi }\), defined by \((f,g)_{\pi } = \sum _{x \in S} f(x) g(x) \pi (x)\). For further background material on Markov chains the reader is referred to Levin et al. (2009).

Given a Markov chain *P* with invariant distribution \(\pi \), let \(K = \frac{1}{2}( P + \widehat{P})\). Then *K* is a reversible Markov chain with invariant distribution \(\pi \). We will also define a vorticity matrix \(\Gamma \) as (essentially) the skew-symmetric part of *P*:

\(\Gamma = {\text {diag}}(\pi ) (P - \widehat{P}).\)   (1)

The following simple observations are fundamental to this paper.

### **Lemma 2.1**

Let *P* be the transition matrix of a Markov chain on *S* and let \(\pi \) be a distribution on *S*. Let \(\Gamma \) be defined by (1). Then:

- (i) \(\Gamma \) is skew-symmetric, i.e. \(\Gamma = - \Gamma ^{\prime }\);
- (ii) \(\pi \) satisfies detailed balance with respect to *P* if and only if \(\Gamma = 0\);
- (iii) \(\pi \) is invariant for *P* if and only if \(\Gamma \mathbbm {1} = 0\).

### *Proof*

Properties (i) and (ii) follow immediately from the definition of \(\Gamma \). For (iii), note that \((\Gamma \mathbbm {1})(x) = \sum _{y \in S} \left[ \pi (x) P(x,y) - \pi (y) P(y,x) \right] = \pi (x) - \sum _{y \in S} \pi (y) P(y,x)\), which vanishes for every *x* if and only if \(\pi \) is invariant for *P*. \(\square \)

In light of Lemma 2.1, a matrix \(\Gamma \in \mathbb {R}^{n \times n}\) which is skew-symmetric and satisfies \(\Gamma \mathbbm {1} = 0\) is called a *vorticity matrix*. If \(\Gamma \) is related by (1) to a Markov chain *P* with invariant distribution \(\pi \), it is called the *vorticity of* *P* *and* \(\pi \). It will be a key ingredient in the construction of a non-reversible version of Metropolis-Hastings.
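To make these definitions concrete, here is a small self-contained Python sketch (the 3-state chain is purely illustrative) that computes the time reversal \(\widehat{P}\), the additive reversibilization *K* and the vorticity \(\Gamma \), and checks the properties stated in Lemma 2.1.

```python
# Vorticity of a finite Markov chain (illustrative 3-state example).
# P is doubly stochastic, so the uniform distribution pi is invariant.
P = [[0.5, 0.4, 0.1],
     [0.1, 0.5, 0.4],
     [0.4, 0.1, 0.5]]
pi = [1/3, 1/3, 1/3]
n = len(P)

# Time reversal: Phat(x, y) = pi(y) P(y, x) / pi(x).
Phat = [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]

# Additive reversibilization K = (P + Phat) / 2 and
# vorticity Gamma = diag(pi) (P - Phat), as in (1).
K = [[0.5 * (P[x][y] + Phat[x][y]) for y in range(n)] for x in range(n)]
Gamma = [[pi[x] * (P[x][y] - Phat[x][y]) for y in range(n)] for x in range(n)]

# Lemma 2.1: Gamma is skew-symmetric and its rows sum to zero.
assert all(abs(Gamma[x][y] + Gamma[y][x]) < 1e-12
           for x in range(n) for y in range(n))
assert all(abs(sum(Gamma[x])) < 1e-12 for x in range(n))

# K satisfies detailed balance with respect to pi.
assert all(abs(pi[x] * K[x][y] - pi[y] * K[y][x]) < 1e-12
           for x in range(n) for y in range(n))
```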

### *Remark 2.2*

A direct way of constructing a non-reversible chain *P* from a reversible chain *K* and a vorticity matrix \(\Gamma \) is by letting \(P(x,y) = K(x,y) + \frac{1}{2 \pi (x)}\Gamma (x,y)\), provided that *P* is a probability matrix (i.e. has nonnegative entries). This is discussed in e.g. Sun et al. (2010). In order to make a transition from a state *x*, one has to compute the entries of \(K(x,\cdot )\) and \(\Gamma (x,\cdot )\) to determine the transition probabilities. Moreover, this approach has no analogue in uncountable state spaces. These are the main reasons for wishing to develop a method that does not depend on the construction mentioned in this remark.
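The construction of this remark is easy to verify numerically. The following sketch (with an illustrative choice of *K*, \(\Gamma \) and uniform \(\pi \)) builds \(P = K + \frac{1}{2}{\text {diag}}(\pi )^{-1}\Gamma \) and checks that \(\pi \) remains invariant while detailed balance is lost.

```python
# Direct construction of a non-reversible chain from a reversible chain K
# and a vorticity matrix Gamma (Remark 2.2): P = K + diag(pi)^{-1} Gamma / 2.
n = 3
pi = [1/3, 1/3, 1/3]
K = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]          # symmetric, hence reversible w.r.t. uniform pi
a = 0.1
Gamma = [[0, a, -a],
         [-a, 0, a],
         [a, -a, 0]]              # skew-symmetric, rows sum to zero

P = [[K[x][y] + Gamma[x][y] / (2 * pi[x]) for y in range(n)] for x in range(n)]

# P is a probability matrix (entries stay nonnegative for this choice of a) ...
assert all(P[x][y] >= 0 for x in range(n) for y in range(n))
assert all(abs(sum(P[x]) - 1) < 1e-12 for x in range(n))
# ... pi is still invariant ...
assert all(abs(sum(pi[x] * P[x][y] for x in range(n)) - pi[y]) < 1e-12
           for y in range(n))
# ... but detailed balance now fails, i.e. the chain is non-reversible.
assert any(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) > 1e-6
           for x in range(n) for y in range(n))
```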

### 2.2 Metropolis-Hastings

In Metropolis-Hastings, a proposal chain *Q* is used, satisfying the *symmetric structure* condition

\(Q(x,y) > 0 \text { if and only if } Q(y,x) > 0, \quad x, y \in S,\)   (2)

i.e. if a transition from *x* to *y* has positive probability, then the reverse transition also has positive probability. The *Hastings ratio* \(R_0(x,y)\) is defined, for \(x, y \in S\) with \(Q(x,y) > 0\), as \(R_0(x,y) = \frac{\pi (y) Q(y,x)}{\pi (x) Q(x,y)}\). A proposed transition from *x* to *y* is accepted with probability \(A_0(x,y) = \min (1, R_0(x,y))\), and rejected otherwise, in which case the chain remains in *x*. The resulting chain satisfies detailed balance with respect to \(\pi \), so that \(\pi \) is invariant.

### 2.3 Non-reversible Metropolis-Hastings

We will now discuss how this framework can be extended to construct Markov chains that are, in general, non-reversible. Let \(\Gamma \in \mathbb {R}^{n \times n}\) be a vorticity matrix, and let *Q* be the transition matrix of a Markov chain, satisfying (2). Again, \(\pi : S \rightarrow (0,\infty )\) is some distribution that is not necessarily normalized and has only positive entries.

We define the *non-reversible Hastings ratio* as

\(R_{\Gamma }(x,y) = \frac{\Gamma (x,y) + \pi (y) Q(y,x)}{\pi (x) Q(x,y)}\)   (6)

for \(x, y \in S\) with \(Q(x,y) \ne 0\), where it is understood that \(R_{\Gamma }(x,y)\) need not be defined where *Q* has zeroes. As with Metropolis-Hastings, a proposed transition from *x* to *y* is accepted with probability

\(A_{\Gamma }(x,y) = \min \left( 1, R_{\Gamma }(x,y)\right) .\)   (7)

In order for \(R_{\Gamma }\) to be nonnegative we impose the condition

\(\Gamma (x,y) + \pi (y) Q(y,x) \ge 0 \quad \text {for all } x, y \in S \text { with } Q(x,y) > 0.\)   (8)

The transition probabilities \(P_{\Gamma }(x,y)\) are then defined by \(P_{\Gamma }(x,y) = Q(x,y) A_{\Gamma }(x,y)\) for \(x \ne y\), with the diagonal entries \(P_{\Gamma }(x,x)\) chosen such that the rows of \(P_{\Gamma }\) sum to one.   (9)

In order to check that the proposed Markov chain has \(\pi \) as its invariant distribution, we need to verify that \(\Gamma \), \(\pi \) and \(P_{\Gamma }\) are related through (1). As a crucial step, we employ the following lemma, in analogy with Metropolis-Hastings.

### **Lemma 2.3**

Let \(\Gamma \) be a vorticity matrix, *Q* a matrix of transition probabilities satisfying (2), \(\pi \) a distribution that is nowhere zero, such that (8) holds. Let \(R_{\Gamma }\) be as above. Then \(R_{\Gamma }(y,x) > 1\) if and only if \(R_{\Gamma }(x,y) < 1\) for any \(x,y \in S\) for which \(Q(x,y) \ne 0\).

### *Proof*

Note that by skew-symmetry of \(\Gamma \), \(\pi (x) Q(x,y) \left[ R_{\Gamma }(x,y) - 1 \right] = - \pi (y) Q(y,x) \left[ R_{\Gamma }(y,x) - 1 \right] \). Since \(\pi \) and *Q* are positive in the relevant entries, the claim follows. \(\square \)

Using this lemma, it is now straightforward to show that \(\Gamma \) is the vorticity matrix of \((P_{\Gamma },\pi )\).

### **Lemma 2.4**

Let *Q* be a Markov chain, \(\Gamma \) a vorticity matrix, and \(\pi \) a distribution on *S*, such that (2) and (8) are satisfied. Let \(P_{\Gamma }\) be defined through (6), (7) and (9). Then (1) holds for \((P_{\Gamma }, \pi )\), i.e. \(\pi (x) P_{\Gamma }(x,y) - \pi (y) P_{\Gamma }(y,x) = \Gamma (x,y)\) for all \(x, y \in S\).

### *Proof*

Suppose *x*, *y* are such that \(R_{\Gamma }(x,y) < 1\). Then \(A_{\Gamma }(x,y) = R_{\Gamma }(x,y)\), and by Lemma 2.3, \(A_{\Gamma }(y,x) = 1\). Therefore \(\pi (x) P_{\Gamma }(x,y) - \pi (y) P_{\Gamma }(y,x) = \pi (x) Q(x,y) R_{\Gamma }(x,y) - \pi (y) Q(y,x) = \Gamma (x,y)\) for such *x* and *y*. The cases in which \(R_{\Gamma }(x,y)=1\) or \(R_{\Gamma }(x,y)>1\) are analogous. \(\square \)

Finally, since \(\Gamma \) is a vorticity matrix we have \(\Gamma \mathbbm {1} = 0\), and hence Lemma 2.1 (iii) gives that \(\pi \) is invariant for \(P_{\Gamma }\). We have obtained our main result.

### **Theorem 2.5**

Let *Q* be a Markov chain, \(\Gamma \) a vorticity matrix, and \(\pi \) a distribution on *S* that is everywhere positive, such that (2) and (8) are satisfied. Let \(P_{\Gamma }\) be defined through (6), (7) and (9). Then \(P_{\Gamma }\) has \(\pi \) as invariant distribution and \(\Gamma \) as its vorticity matrix.
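Theorem 2.5 can be checked directly on a small example. The sketch below assumes the non-reversible Hastings ratio has the form \(R_{\Gamma }(x,y) = (\Gamma (x,y) + \pi (y) Q(y,x))/(\pi (x) Q(x,y))\) with acceptance probability \(\min (1, R_{\Gamma })\) (a form consistent with Lemmas 2.3 and 2.4); it builds \(P_{\Gamma }\) explicitly and verifies both conclusions of the theorem.

```python
# NRMH on a 3-state space: build the transition matrix P_Gamma explicitly.
# The acceptance rule A = min(1, R_Gamma) with
#   R_Gamma(x,y) = (Gamma(x,y) + pi(y) Q(y,x)) / (pi(x) Q(x,y))
# is an assumption of this sketch, consistent with Lemmas 2.3 and 2.4.
n = 3
pi = [0.5, 0.3, 0.2]
Q = [[0.2, 0.4, 0.4],
     [0.4, 0.2, 0.4],
     [0.4, 0.4, 0.2]]             # symmetric proposal, satisfies (2)
eps = 0.02
Gamma = [[0, eps, -eps],
         [-eps, 0, eps],
         [eps, -eps, 0]]          # small vorticity so that (8) holds

def A(x, y):
    R = (Gamma[x][y] + pi[y] * Q[y][x]) / (pi[x] * Q[x][y])
    assert R >= 0                 # non-negativity condition (8)
    return min(1.0, R)

P = [[Q[x][y] * A(x, y) if x != y else 0.0 for y in range(n)]
     for x in range(n)]
for x in range(n):                # rejected mass stays at the current state
    P[x][x] = 1.0 - sum(P[x])

# pi is invariant for P_Gamma ...
assert all(abs(sum(pi[x] * P[x][y] for x in range(n)) - pi[y]) < 1e-12
           for y in range(n))
# ... and Gamma is its vorticity: pi(x)P(x,y) - pi(y)P(y,x) = Gamma(x,y).
for x in range(n):
    for y in range(n):
        if x != y:
            assert abs(pi[x] * P[x][y] - pi[y] * P[y][x] - Gamma[x][y]) < 1e-12
```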

### *Remark 2.6*

We will refer to a combination \((Q, \Gamma ,\pi )\) which satisfies (2) and (8) as a *compatible combination*. Especially verifying condition (8) requires some knowledge about \(\pi \), but this knowledge need not be exact: it will suffice to have access to a lower bound for \(\pi \). In the proof of Theorem 4.2 the analogue of this condition for continuous spaces is checked as an example.

### *Remark 2.7*

Once we have access to a compatible combination of proposal chain *Q*, vorticity matrix \(\Gamma \) and target distribution \(\pi \), the NRMH algorithm has similar favorable properties as MH, in that only local information is required: *Q*, \(\pi \) and \(\Gamma \) only need to be evaluated at the current and proposed state, and no normalization of \(\pi \) is required. In Sect. 4 this will become even more important when NRMH is applied to a problem in continuous state space.

### 2.4 General observations on NRMH

The following trivial observation serves to indicate the generality of this approach. It asserts that every Markov chain may be built, in a trivial way, by the described procedure.

### **Proposition 2.8**

Let *P* be a Markov chain with invariant distribution \(\pi \) and corresponding vorticity \(\Gamma \), satisfying (2). If we use \(Q = P\) as proposal distribution and \(\Gamma \) as vorticity matrix in the NRMH algorithm with target distribution \(\pi \), then the resulting Markov chain \(P_{\Gamma }\) is identical to *P*.

### *Proof*

It suffices to note that, by \(Q = P\) and (1), \(A_{\Gamma }(x,y) = 1\) for all \(x, y \in S\), \(x \ne y\). \(\square \)

### *Remark 2.9*

Note that irreducibility of *Q* does not imply irreducibility of \(P_{\Gamma }\), unless we impose the stronger condition that the inequality in (8) is strict whenever \(Q(x,y) > 0\).

As noted in Remark 2.2, one may alternatively construct any non-reversible chain *P* by ‘adding’ a vorticity matrix to a reversible chain *K*. We may translate one approach into the other, as follows:

### **Proposition 2.10**

Consider irreducible transition kernels *Q*(*x*, *y*) and *H*(*x*, *y*), related by \(Q(x,y) = H(x,y) + \frac{1}{2 \pi (x)} \Gamma (x,y)\), both satisfying the symmetric structure condition (2). Suppose *Q*, \(\pi \) and \(\Gamma \) satisfy (8). Suppose \(P_1\) is the Markov kernel obtained from NRMH, using *Q* as proposal chain and \(\Gamma \) as vorticity matrix. Suppose \(P_2\) is the Markov kernel given by \(P_2(x,y) = K(x,y) + \frac{1}{2\pi (x)} \Gamma (x,y)\), where *K* is the classical Metropolis-Hastings kernel obtained by using *H* as proposal chain (and \(\pi \) as target distribution). Then \(P_1=P_2\).

### *Proof*

A direct computation shows that \(P_1(x,y) = P_2(x,y)\) for all \(x \ne y\) with \(Q(x,y) > 0\); both kernels vanish wherever *Q* has zeros. Since both \(P_1\) and \(P_2\) represent transition probabilities, this also fixes their diagonal elements, which concludes the proof. \(\square \)

## 3 Advantages of non-reversible Markov chains in finite state spaces

Non-reversible Markov chains may offer important computational advantages compared to reversible chains. We will briefly discuss some of these advantages as they apply to finite state space Markov chains. It is to our knowledge an open question how to extend the results of Sects. 3.1 and 3.2 in a generic way to uncountable and continuous state spaces.

### 3.1 Asymptotic variance

Consider a Markov chain *P* on *S* with invariant *probability* distribution \(\mu \). Let \(f : S \rightarrow \mathbb {R}\). We say that *f* satisfies a Central Limit Theorem (CLT) if there is a \(\sigma ^2_f < \infty \) such that the normalized sum \(n^{-1/2} \sum _{i=1}^n [f(X_i) - \mu (f)]\) converges weakly to a \(N(0,\sigma _f^2)\) distribution. In this case \(\sigma _f^2\) is called the asymptotic variance.

We will in this section work under the assumption that *S* is finite. In this case, for any \(f : S\rightarrow \mathbb {R}\) and irreducible *P*, a CLT is satisfied [see e.g. Roberts and Rosenthal (2004)]. The following result is obtained in Sun et al. (2010), for a more extensive argument see Chen and Hwang (2013).

### **Proposition 3.1**

Let *K* be a transition matrix of an irreducible reversible Markov chain with invariant probability distribution \(\mu \). Let \(\Gamma \) be a non-zero vorticity matrix and let \(P = K + {\frac{1}{2}} {\text {diag}}(\mu )^{-1} \Gamma \) be the transition matrix of an irreducible Markov chain. For \(f : S \rightarrow \mathbb {R}\), denote by \(\sigma _{f,K}^2\) and \(\sigma _{f,P}^2\) the asymptotic variances of *f* with respect to the transition matrices *K* and *P*, respectively.

Then for all \(f : S \rightarrow \mathbb {R}\), we have \(\sigma _{f,P}^2 \le \sigma _{f,K}^2\), and there exists an *f* such that \(\sigma _{f,P}^2 < \sigma _{f,K}^2\).

In words, adding non-reversibility decreases asymptotic variance.
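For a finite chain the asymptotic variance can be computed exactly through the fundamental matrix \(Z = (I - P + \mathbbm {1}\mu ^{\prime })^{-1}\), using the standard identity \(\sigma _f^2 = \langle \bar{f}, (2Z - I)\bar{f} \rangle _{\mu }\) with \(\bar{f} = f - \mu (f)\), which holds for ergodic chains whether reversible or not. The following sketch illustrates Proposition 3.1 on a 3-state example.

```python
import numpy as np

# Exact asymptotic variance of an ergodic finite chain via the fundamental
# matrix Z = (I - P + 1 mu')^{-1}:  sigma_f^2 = <fbar, (2Z - I) fbar>_mu.
def asymptotic_variance(P, mu, f):
    n = len(mu)
    Pi = np.outer(np.ones(n), mu)          # rank-one matrix with rows mu
    Z = np.linalg.inv(np.eye(n) - P + Pi)
    fbar = f - mu @ f
    return fbar @ np.diag(mu) @ (2 * Z - np.eye(n)) @ fbar

mu = np.full(3, 1/3)
K = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])          # reversible w.r.t. uniform mu
Gamma = 0.1 * np.array([[0, 1, -1],
                        [-1, 0, 1],
                        [1, -1, 0]])        # vorticity matrix
P = K + Gamma / (2 * mu[:, None])           # non-reversible perturbation

f = np.array([1.0, 0.0, -1.0])
s_K = asymptotic_variance(K, mu, f)
s_P = asymptotic_variance(P, mu, f)
assert s_P < s_K                            # Proposition 3.1 in action
```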

### 3.2 Large deviations

In Rey-Bellet and Spiliopoulos (2014) it is noted that non-reversible diffusions on compact manifolds have favorable properties in terms of large deviations of the occupation measure from the invariant distribution. Inspired by their result, we present a simple (but to our knowledge novel) result in the same direction for finite state spaces.

Suppose *S* is finite. We may transform a discrete time Markov chain on *S* into a continuous time chain by making subsequent transitions after random waiting times that have independent \(\mathrm {Exp}(\lambda )\) distributions. The discrete chain with transition matrix *P* then transforms into a continuous time chain with generator \(G = \lambda (P - I)\). If *G* is irreducible, the occupation measure \(L_t = \frac{1}{t} \int _0^t \delta _{X(s)} \, \mathrm {d}s\) satisfies for large time a large deviation principle with rate function \(I_G\): roughly speaking, \(\mathbb {P}(L_t \in A) \approx \exp \left( -t \inf _{\mu \in A} I_G(\mu ) \right) \) for large *t*, for \(A \subset \mathcal P(S)\), where \(\mathcal P(S)\) denotes the set of probability distributions on *S*. The rate function satisfies the properties that (i) \(I_G \ge 0\), (ii) \(I_G\) is strictly convex, and (iii) \(I_G(\mu ) = 0\) if and only if \(\mu = \pi \), where \(\pi \) is the invariant distribution of *G*. Hence \(I_G\) quantifies the probability of a large deviation of the occupation measure from the invariant distribution for large *t*. See den Hollander (2000) for further details.
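The passage from discrete to continuous time is elementary to verify; a short sketch with an illustrative 3-state chain:

```python
# Turning a discrete-time chain into a continuous-time chain: attaching
# independent Exp(lam) waiting times to the chain P yields the generator
# G = lam * (P - I).  Illustrative 3-state example.
lam = 1.0
P = [[0.5, 0.4, 0.1],
     [0.1, 0.5, 0.4],
     [0.4, 0.1, 0.5]]
pi = [1/3, 1/3, 1/3]
n = len(P)

G = [[lam * (P[x][y] - (1.0 if x == y else 0.0)) for y in range(n)]
     for x in range(n)]

# Generator properties: nonnegative off-diagonal rates, rows sum to zero,
# and pi G = 0, so pi is invariant for the continuous-time chain as well.
assert all(G[x][y] >= 0 for x in range(n) for y in range(n) if x != y)
assert all(abs(sum(G[x])) < 1e-12 for x in range(n))
assert all(abs(sum(pi[x] * G[x][y] for x in range(n))) < 1e-12
           for y in range(n))
```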

The following result shows that deviations for non-reversible continuous time chains from the invariant distribution are asymptotically less likely than for the corresponding reversible chain.

### **Proposition 3.2**

Suppose *G* admits a decomposition of the form \(G(x,y) = K(x,y) + \frac{1}{2 \pi (x)} \Gamma (x,y)\), where \(\pi (x) K(x,y) = \pi (y) K(y,x)\) and \(\Gamma (x,y) = -\Gamma (y,x)\) for all \(x \ne y\). Then \(I_G(\mu ) \ge I_K(\mu )\) for all \(\mu \in \mathcal P(S)\), and the inequality is strict if \(\Gamma u^{\star } \ne 0\), where \(u^{\star }\) denotes the maximizer in the Donsker-Varadhan variational representation of \(I_K(\mu )\).

### *Proof*

Suppose first that \(\mu (x) > 0\) for all *x*. By writing \(\widetilde{K}(x,y) = \pi (x) K(x,y)\), \(\widetilde{\mu }(x) = \mu (x) / \pi (x)\), the Donsker-Varadhan variational expression for \(I_G(\mu )\) decomposes into a part involving \(\widetilde{K}\), with \(\widetilde{K}\) symmetric, and a part involving \(\Gamma \), with \(\Gamma \) anti-symmetric. Evaluating the supremum at the maximizer \(u^{\star }\) for the reversible chain then gives a value larger than or equal to \(I_K(\mu )\), with strict inequality if \(\Gamma u^{\star } \ne 0\).

If \(\mu (x) = 0\) for some *x*, the proof carries over by only summing over indices for which \(\mu (x) > 0\). \(\square \)

### 3.3 Mixing time and spectral gap

Non-reversibility in a Markov chain can have very favorable effects on mixing time and spectral gap, but we are not aware of general results in this direction. The reader is referred to Chen et al. (1999), Diaconis et al. (2000), Levin et al. (2009) for theoretical results in this direction, and to e.g. Sun et al. (2010), Turitsyn et al. (2011), Vucelja (2014) for less rigorous and/or numerical results. In these experimental results, non-reversibility especially seems to improve sampling in the case of sampling from a multimodal distribution.

## 4 NRMH in Euclidean space

In this section we explain how to extend the idea of non-reversible Metropolis-Hastings algorithm to a Euclidean state space. In Appendix 1 it is discussed how Metropolis-Hastings can be applied in the case of a general measurable space, and the discussion in this section is a special case of this.

### 4.1 General setting

Suppose *Q* is a Markov transition kernel on \(\mathbb {R}^n\) which has density *q*(*x*, *y*) with respect to Lebesgue measure, i.e. \(Q(x, \mathrm {d}y) = q(x,y) \, \mathrm {d}y\). We will be interested in sampling from a distribution on \(\mathbb {R}^n\) with Lebesgue density \(\pi \). We do not require that \(\int _{\mathbb {R}^n} \pi (x) \, \mathrm {d}x = 1\). Let \(\gamma : \mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}\) be Lebesgue measurable and furthermore suppose \(\gamma \) satisfies conditions (11)–(15), the continuous analogues of the conditions on the vorticity matrix in the discrete setting.

### **Theorem 4.1**

With the above notation and definitions, and assuming conditions (11), (12), (13), (14) and (15) are satisfied, *P* is a Markov transition kernel with invariant density \(\pi \).

In this case \(\gamma \) is called the *vorticity density* of \((P,\pi )\). If \(\gamma \ne 0\) on a set of positive Lebesgue measure then *P* is non-reversible, i.e. there exist sets \(B_1, B_2 \subset \mathbb {R}^n\) such that \(\int _{B_1} \pi (x) P(x, B_2) \, \mathrm {d}x \ne \int _{B_2} \pi (x) P(x, B_1) \, \mathrm {d}x\).

### 4.2 Langevin diffusions for sampling in Euclidean space

The application of non-reversible sampling methods in Euclidean space is a relatively unexplored area. In this short review section we will discuss the use of Langevin diffusions for simulating from a target density, and discuss the potential role of NRMH within this context.

Suppose \(\pi \) is a smooth positive target density on \(\mathbb {R}^n\). Consider the *Langevin diffusion*

\(\mathrm {d}X(t) = \nabla \log \pi (X(t)) \, \mathrm {d}t + \sqrt{2} \, \mathrm {d}W(t),\)   (17)

where *W* is a standard Brownian motion in \(\mathbb {R}^n\); under mild conditions this diffusion has \(\pi \) as invariant density. A natural (discrete time) Markov chain for sampling \(\pi \) is then the *Euler-Maruyama discretization* of the Langevin diffusion,

\(X_{k+1} = X_k + h \nabla \log \pi (X_k) + \sqrt{2 h} \, Z_{k+1},\)   (18)

with \((Z_k)\) independent standard normal random vectors, where *h* is a suitable stepsize. This discretization is approximately correct if *h* is chosen to be sufficiently small. However, such a choice of *h* results in slow convergence to equilibrium of the Markov chain. When *h* is large, the discretization results in a large discrepancy between (17) and (18). As a result the respective invariant distributions will also differ, and in particular the invariant distribution of (18) will not correspond to the desired distribution with density \(\pi \). It is customary to correct for this by considering the Euler-Maruyama discretization as a proposal for Metropolis-Hastings, resulting in the Metropolis Adjusted Langevin Algorithm [MALA, Roberts and Tweedie (1996), Roberts and Rosenthal (1998)].

^{1}the probability of large deviations of the empirical distribution from the invariant distribution are reduced for non-reversible diffusions Rey-Bellet and Spiliopoulos (2014).

When discretizing (19) it is again necessary to correct for discretization error by a MH accept/reject step. However, since MH generates reversible chains, one should expect that also the favourable properties of non-reversibility are destroyed. Instead, an implementation of NRMH should be able to preserve these favourable properties of non-reversible diffusions. We will illustrate this for multivariate Gaussian distributions.

### 4.3 Non-reversible Metropolis-Hastings for sampling multivariate Gaussian distributions

Suppose the target distribution is a centered Gaussian \(N(0,V)\) with covariance matrix *V*. In this case the Langevin diffusion becomes the Ornstein-Uhlenbeck process

\(\mathrm {d}X(t) = -V^{-1} X(t) \, \mathrm {d}t + \sqrt{2} \, \mathrm {d}W(t),\)   (20)

where (*W*(*t*)) is an *n*-dimensional standard Brownian motion. In Hwang et al. (1993), it is shown that adding a ‘nonreversible’ component of the form \(-S V^{-1}\) to the drift, with *S* skew-symmetric, can improve convergence of the sample covariance. Therefore we will instead consider the Ornstein-Uhlenbeck process with modified drift matrix \(B = -(I+S)V^{-1}\),

\(\mathrm {d}X(t) = B X(t) \, \mathrm {d}t + \sqrt{2} \, \mathrm {d}W(t),\)   (21)

with *S* skew-symmetric. For any choice of skew-symmetric *S*, this diffusion keeps \(\pi \) invariant. The convergence to equilibrium of the diffusion is governed by the spectral bound, \({\mathfrak s}(B) := \max \{ {\text {Re}}\lambda : \lambda \in \sigma (B)\}\). More specifically, \({\mathfrak s}(B) \le {\mathfrak s}(-V^{-1})\) for any skew-symmetric *S*. In other words, adding a non-reversible term increases the speed of convergence to equilibrium. In Lelièvre et al. (2013), it is established that it is possible to choose *S* optimally, resulting in \({\mathfrak s}(B) = - {\text {tr}} (V^{-1}) / n\). By choosing *S* in such a way the convergence of the chain is effectively governed by the average of the eigenvalues, which should be compared to the reversible case in which the ‘worst’ eigenvalue determines the speed of convergence.
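The effect of the skew-symmetric perturbation on the spectral bound is easy to observe numerically. The following sketch uses an illustrative two-dimensional *V* and a particular choice of *S* (not claimed to be the optimal one).

```python
import numpy as np

# Effect of a skew-symmetric perturbation on the spectral bound of the
# Ornstein-Uhlenbeck drift: compare s(-V^{-1}) with s(-(I + S)V^{-1}).
V = np.diag([1.0, 4.0])
S = np.array([[0.0, 1.0],
              [-1.0, 0.0]])                # skew-symmetric

def spectral_bound(B):
    return max(np.linalg.eigvals(B).real)

s_rev = spectral_bound(-np.linalg.inv(V))               # reversible drift
s_nonrev = spectral_bound(-(np.eye(2) + S) @ np.linalg.inv(V))

assert s_nonrev < s_rev < 0      # faster convergence to equilibrium
```

For this particular example the perturbed spectral bound happens to equal \(-{\text {tr}}(V^{-1})/n = -0.625\), the optimal value established in Lelièvre et al. (2013).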

An advantage of the discrete time setting of NRMH is the freedom in choosing the relation between the drift matrix *B* and the diffusivity. We consider the time discretization of (21) with step size \(h > 0\),

\(X_{k+1} = (I + hB) X_k + \sqrt{2 h \sigma ^2} \, Z_{k+1},\)   (22)

with \((Z_k)\) independent standard normal random vectors in \(\mathbb {R}^n\) and \(\sigma ^2 > 0\) a diffusivity parameter, which we use as proposal chain *Q*. Write \(A = I + hB\). Provided \(\mathfrak {r}(I+h B) < 1\), the invariant probability distribution of *Q* is the centered normal distribution with covariance *R*, where *R* is the unique positive definite matrix solution to the *discrete time Lyapunov equation* [see e.g. (Lancaster and Tismenetsky 1985, Theorem 13.2.1)] \(R = A R A^{\prime } + 2 h \sigma ^2 I\). Let \(\rho \) denote the invariant density of *Q*. Let \(f(x,y) = \rho (x) q(x,y)\). The vorticity density of the proposal chain is \(\gamma (x,y) = f(x,y) - f(y,x)\). Since \(\rho \) and *q* are non-degenerate, (13) and (14) are satisfied. The same statements hold trivially for scalar multiples of \(\gamma \).
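The invariant covariance *R* can be computed numerically by vectorizing the Lyapunov equation. In the sketch below, the linear-Gaussian form of the proposal and the equation \(R = A R A^{\prime } + 2 h \sigma ^2 I\) are assumptions of the sketch, consistent with \(\mathfrak {r}(A) < 1\).

```python
import numpy as np

# Invariant covariance of a linear-Gaussian proposal chain
#   X_{k+1} = A X_k + sqrt(2 h sigma^2) * noise,   A = I + h B,
# obtained by solving the discrete Lyapunov equation R = A R A' + 2 h sigma^2 I
# via vectorization: (I - kron(A, A)) vec(R) = vec(2 h sigma^2 I).
h, sigma2 = 0.05, 1.0
V = np.diag([1.0, 4.0])
S = np.array([[0.0, 1.0], [-1.0, 0.0]])       # illustrative skew-symmetric S
B = -(np.eye(2) + S) @ np.linalg.inv(V)
A = np.eye(2) + h * B
assert np.max(np.abs(np.linalg.eigvals(A))) < 1    # spectral radius condition

n = A.shape[0]
C = 2 * h * sigma2 * np.eye(n)
vecR = np.linalg.solve(np.eye(n * n) - np.kron(A, A), C.reshape(-1))
R = vecR.reshape(n, n)

assert np.allclose(R, A @ R @ A.T + C)             # solves the Lyapunov eq.
assert np.all(np.linalg.eigvalsh((R + R.T) / 2) > 0)   # positive definite
```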

Verification of (15) requires more effort. We provide a sufficient condition. The proof of this result is provided in Appendix 1.

### **Theorem 4.2**

Suppose the step size *h*, the diffusivity \(\sigma ^2\) and the constant \(c > 0\) are chosen such that condition (26), involving constants \(C_1\) and \(C_2\) depending on *S* and *V*, is satisfied. Then \((Q, c\gamma , \pi )\) is a compatible combination for the target distribution *N*(0, *V*).

### *Remark 4.3*

How should we choose the parameters *c*, *h* and \(\sigma \)? It seems reasonable to choose \(\sigma ^2\) equal to the maximal allowed value in (26), so that the deviation from the Euler-Maruyama discretization (which has \(\sigma = 1\)) is minimal, and to choose *c* as large as possible, i.e. \(c = \sigma ^n(h)\). A heuristic estimate for the scaling of the expected step size is given by the step size *h* times the multiplicative factor \(c = \sigma ^n(h)\) in the acceptance probability. Note that \(h \sigma ^n(h) = 0\) for \(h = 0\) and \(h = \frac{2}{C_2}\). Maximization of \(h \sigma ^n(h)\) with respect to *h* yields an explicit expression for the optimal *h* (assuming *S* and *V* do not commute). This expression satisfies the condition \(h < \frac{2}{C_2}\). A first order Taylor approximation around \(1/n\) yields the simplified expression \(h \approx \frac{4}{C_2 (n+2)} < \frac{2}{C_2}\). The corresponding value of \(\sigma ^2(h)\) is to first order equal to \(\sigma ^2(h) \approx 1 - \frac{2 C_1}{C_2(n+2)}\).

In case \(C_1 = C_2\), the optimal value of *h* is given by \(h = \frac{4}{(n+2)C_2}\), with \(\sigma ^2(h) = 1 - \frac{2}{n+2}\).

### 4.4 Numerical experiments

To assess sampling efficiency we consider, for each coordinate, the *empirical autocorrelation function (EACF)* at lag *r*, defined as the empirical correlation between \(X_p^i\) and \(X_{p+r}^i\) over a sample path \((X_p)_{p=1}^{P}\), where *i* ranges over the coordinates \(i = 1, \ldots , n\), and where \(\widehat{\mu }^i = \frac{1}{P} \sum _{p=1}^P X_p^i\) is the empirical average of the *i*-th coordinate. A fast decaying EACF indicates that the samples generated by the Markov chain are quickly decorrelating.
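A sketch of one common way to compute the EACF (the normalization by the empirical variance is an assumption of the sketch; the text's exact convention may differ), tested on an AR(1) chain whose autocorrelation at lag *r* is approximately \(a^r\):

```python
import random

# Empirical autocorrelation function at lag r for one coordinate of a chain,
# normalized by the empirical variance (one common convention).
def eacf(xs, r):
    P = len(xs)
    mu = sum(xs) / P
    var = sum((x - mu) ** 2 for x in xs) / P
    cov = sum((xs[p] - mu) * (xs[p + r] - mu) for p in range(P - r)) / P
    return cov / var

random.seed(0)
# AR(1) chain x_{k+1} = a x_k + noise: EACF(r) is approximately a**r.
a = 0.9
xs, x = [], 0.0
for _ in range(100000):
    x = a * x + random.gauss(0, 1)
    xs.append(x)

assert abs(eacf(xs, 0) - 1.0) < 1e-12      # lag 0: correlation one
assert abs(eacf(xs, 1) - a) < 0.05         # lag 1: approximately a
```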

#### 4.4.1 Three-dimensional example

We first consider as target distribution the centered Gaussian \(N(0,V)\) in \(\mathbb {R}^3\), with *V* a diagonal matrix with \({\text {diag}}(V) = (1, 1, 1/4)\). The optimal non-reversible drift is obtained by choosing *S* as in Lelièvre et al. (2013). We compare NRMH using this drift with MALA using the same step size *h* and reversible proposal distribution \(Q(x,\mathrm {d}y) \sim N\big ((I - h V^{-1})x, 2h I\big )\). In Fig. 2 the EACFs for this 3-dimensional example are plotted. Here we see that NRMH helps to decrease the autocorrelations of the slowly decorrelating components in MH (here, the first two components). It achieves this without increasing autocorrelations of components that are already quickly decorrelating (here, the third component).

#### 4.4.2 Nine-dimensional example

As a second example we consider a nine-dimensional Gaussian target distribution with covariance matrix *V*. Choosing *c*, *h* and \(\sigma \) as in Remark 4.3 results in the parameter values used in the experiments reported below.

Estimated asymptotic variances for the 9-dimensional example of Sect. 4.4.2, based on an MCMC trajectory of \(10^7\) steps

| Component | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| NRMH | 599.96 | 661.17 | 40.80 | 572.26 | 159.05 | 27.35 | 230.41 | 401.98 | 718.64 |
| MH | 1315.3 | 1522.2 | 47.156 | 1473.3 | 876.46 | 28.316 | 204.05 | 708.83 | 1578.2 |

It is well known that for optimal convergence in Metropolis Adjusted Langevin (MALA), the stepsize should be tuned so that the ratio of accepted proposals is approximately equal to 0.574 Roberts and Rosenthal (1998). Compared to this, the acceptance ratios in our example, given in Table 2, are fairly high. In particular, the Metropolis-Hastings chain could be improved significantly by increasing the step size *h*, so that the comparison above is made against a sub-optimally tuned MH.

Acceptance ratios for 9-dimensional example of Sect. 4.4.2, for a sample path consisting of \(10^7\) proposals, and with a step size \(h = 7.0822 \times 10^{-4}\)

| | Acceptance ratio |
|---|---|
| NRMH | 0.7383 |
| MH | 0.9343 |

## 5 Discussion

The efficiency increase of non-reversible Markov chains in MCMC can be significant, in terms of either asymptotic variance or mixing properties, as remarked in this paper. NRMH extends the MCMC-toolbox with a method to utilize these benefits. In particular for continuous state spaces it was, to our knowledge, not known how to construct non-reversible Markov chains for MCMC sampling (taking into account the necessity of a correction step when using time discretization of diffusions).

A natural question is how to apply NRMH to more general target distributions in \(\mathbb {R}^n\). Suppose the target density \(\pi \) is bounded below by a multiple of a Gaussian density: there exists a Gaussian distribution *N*(0, *V*) with density function \(\pi _0\) on \(\mathbb {R}^n\), satisfying \(k \pi _0 \le \pi \) on \(\mathbb {R}^n\) for some \(k > 0\). Then if \(\gamma \) is a suitable vorticity density for sampling from *N*(0, *V*), using proposal density *q*(*x*, *y*), we have for \(\widetilde{\gamma }:= k \gamma \) that the non-negativity condition (15) is satisfied with respect to \(\pi \) and *q*, and Theorem 4.1 applies. Such a suitable choice of \(\gamma \) may be determined as described in Sect. 4.3.

The approach outlined in Sect. 4 should be considered as a first attempt at implementing the NRMH framework for continuous spaces. As discussed, in order to use the framework one needs to verify the non-negativity condition which leads to technical challenges. In particular we expect that much progress is possible in weakening the conditions of Theorem 4.2. The current form of that proposition results in a relatively small step size *h*, which obstructs fast convergence of NRMH. As mentioned before, it is an open question whether NRMH in continuous spaces can be made competitive with optimally tuned MALA.

The theoretical discussion of Sect. 3 and the numerical experiment of Sect. 4.4 illustrate how efficiency can be improved by employing non-reversible Metropolis-Hastings. In view of these encouraging results it will hopefully be possible to extend the result to more general settings. The practical application of NRMH depends on the identification of suitable vorticity structures that are compatible with proposal chains, and establishing these in practical examples provides a promising and challenging direction of research.

Analysis of non-reversible Markov chains is difficult, essentially because self-adjointness is lost. Without self-adjointness, it is much more difficult to connect spectral theory to mixing properties of chains. It seems that a good way of understanding benefits of non-reversible sampling is by studying Cesàro averages [see Levin et al. (2009) and e.g. the result on large deviations in Sect. 3.2]. The results of Sect. 3, which establish that non-reversible chains have better asymptotic variance or large deviations properties, are so far qualitative in nature (i.e. they fail to quantify the amount of improvement). To obtain quantitative results is an important challenge that remains to be addressed. Also, it is the object of further study how these results carry over to countable and uncountable state spaces. In particular, the question under what conditions the resulting chains are geometrically ergodic and/or satisfy a CLT should be considered.

## Acknowledgments

I am grateful to Prof. Pavliotis (Imperial College, London) for making available the code for computing the optimal non-reversible drift (as discussed in Lelièvre et al. (2013)). I also wish to acknowledge valuable discussions with Prof. Hilbert J. Kappen (Radboud University), Dr. Kevin Sharp (University of Oxford) and Prof. Gareth Roberts (University of Warwick). I thank the reviewers and editor for their valuable suggestions, which have had a significant impact upon the paper. This research has received support from the European Union project # FP7-ICT-270327 (Complacs) and the EPSRC under the CRiSM Grant: EP/D002060/1.

## References

- Chen, T.-L., Hwang, C.-R.: Accelerating reversible Markov chains. Stat. Prob. Lett. **83**(9), 1956–1962 (2013)
- Chen, F., Lovász, L., Pak, I.: Lifting Markov chains to speed up mixing. In: Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pp. 275–281. ACM (1999)
- den Hollander, F.: Large Deviations. Fields Institute Monographs, vol. 14. American Mathematical Society, Providence, RI (2000)
- Diaconis, P., Holmes, S., Neal, R.M.: Analysis of a nonreversible Markov chain sampler. Ann. Appl. Prob. **10**, 726–752 (2000)
- Diaconis, P.: The Markov chain Monte Carlo revolution. Bull. Am. Math. Soc. **46**(2), 179–205 (2008)
- Diaconis, P., Saloff-Coste, L.: What do we know about the Metropolis algorithm? J. Comput. Syst. Sci. **57**, 20–36 (1998)
- Geyer, C.J., Mira, A.: On non-reversible Markov chains. In: Monte Carlo Methods (Toronto, ON, 1998), Fields Inst. Commun., vol. 26, pp. 95–110. Amer. Math. Soc., Providence, RI (2000)
- Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika **57**(1), 97–109 (1970)
- Hwang, C.R., Hwang-Ma, S.Y., Sheu, S.J.: Accelerating Gaussian diffusions. Ann. Appl. Prob. **3**(3), 897–913 (1993)
- Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1990)
- Horowitz, A.M.: A generalized guided Monte Carlo algorithm. Phys. Lett. B **268**(2), 247–252 (1991)
- Lelièvre, T., Nier, F., Pavliotis, G.A.: Optimal non-reversible linear drift for the convergence to equilibrium of a diffusion. J. Stat. Phys. **152**(2), 237–274 (2013)
- Levin, D.A., Peres, Y., Wilmer, E.L.: Markov Chains and Mixing Times. American Mathematical Society, Providence (2009)
- Lancaster, P., Tismenetsky, M.: The Theory of Matrices, 2nd edn. Computer Science and Applied Mathematics. Academic Press, Orlando, FL (1985)
- Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. **21**(6), 1087 (1953)
- Neal, R.M.: Improving asymptotic variance of MCMC estimators: non-reversible chains are better. Technical report no. 0406, Department of Statistics, University of Toronto (2004)
- Neal, R.: MCMC using Hamiltonian dynamics. In: Brooks, S., Gelman, A., Jones, G.L., Meng, X.L. (eds.) Handbook of Markov Chain Monte Carlo, pp. 113–162. Chapman & Hall/CRC, Boca Raton (2011)
- Rey-Bellet, L., Spiliopoulos, K.: Irreversible Langevin samplers and variance reduction: a large deviation approach. Preprint (2014)
- Roberts, G.O., Rosenthal, J.S.: Optimal scaling of discrete approximations to Langevin diffusions. J. Royal Stat. Soc. **60**(1), 255–268 (1998)
- Roberts, G.O., Rosenthal, J.S.: General state space Markov chains and MCMC algorithms. Prob. Surv. **1**, 20–71 (2004)
- Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli **2**(4), 341–363 (1996)
- Rudin, W.: Real and Complex Analysis, 3rd edn. McGraw-Hill, New York (1987)
- Sun, Y., Gomez, F., Schmidhuber, J.: Improving the asymptotic performance of Markov chain Monte-Carlo by inserting vortices. In: Lafferty, J., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 2235–2243 (2010)
- Suwa, H., Todo, S.: Markov chain Monte Carlo method without detailed balance. Phys. Rev. Lett. **105**, 120603 (2010)
- Turitsyn, K.S., Chertkov, M., Vucelja, M.: Irreversible Monte Carlo algorithms for efficient sampling. Physica D **240**(4–5), 410–414 (2011)
- Vucelja, M.: Lifting—a nonreversible Markov chain Monte Carlo algorithm. Preprint (2014)
- Wilmer, E.L.: Exact rates of convergence for some simple non-reversible Markov chains. PhD thesis, Harvard University (1999)
- Yosida, K.: Functional Analysis, 6th edn. Springer, Berlin (1980)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.