1 Introduction

The Metropolis-Hastings (MH) algorithm Metropolis et al. (1953), Hastings (1970) is a Markov chain Monte-Carlo (MCMC) method of profound importance to many fields of mathematics such as Bayesian inference and statistical mechanics Diaconis and Saloff-Coste (1998), Diaconis (2008), Levin et al. (2009). The applicability of MH to a particular computational problem depends on the efficiency of the Markov chain that is generated by the algorithm. The chains generated by the classical Metropolis-Hastings algorithm are reversible, or, in other words, satisfy detailed balance; in fact, this reversibility is instrumental in showing that the resulting chains have the right invariant probability distribution.

However, non-reversible Markov chains may have better properties in terms of mixing behavior or asymptotic variance. This can be shown experimentally in special cases Suwa and Todo (2010), Turitsyn et al. (2011), Vucelja (2014), theoretically in special cases Diaconis et al. (2000), Neal (2004), and in fact, also in general Sun et al. (2010), Chen and Hwang (2013), with respect to asymptotic variance. See also Rey-Bellet and Spiliopoulos (2014) for improved asymptotic variance of non-reversible diffusion processes on compact manifolds.

There exist two basic approaches to the construction of non-reversible chains from reversible chains: one can ‘lift’ the Markov chain to a larger state space Diaconis et al. (2000), Neal (2004), Turitsyn et al. (2011), Vucelja (2014), or one can introduce non-reversibility without altering the state space Sun et al. (2010). In continuous spaces, the hybrid or Hamiltonian Monte Carlo Horowitz (1991), Neal (2011) is closely related to the lifting approach in discrete spaces. Other noteworthy publications on non-reversible Markov chains are Wilmer (1999), Geyer and Mira (2000).

In this paper we consider the second type of creating non-reversibility, i.e. without augmenting the state space. In discrete spaces this can, in principle, be achieved by changing transition probabilities (see Remark 2.2). However this may be have computational disadvantages, since it requires access to all transition probabilities. Furthermore, there is no such analogue in continuous spaces (crudely speaking, because all transition probabilities to specific states are zero).

To remedy these issues, in this paper MH is extended to ‘non-reversible Metropolis-Hastings’ (NRMH) which allows for non-reversible transitions. The main idea of this paper is to modify the acceptance ratio, which is further discussed in Sect. 2. For pedagogical purposes, the theory is first developed for discrete state spaces. It is shown how the acceptance probability of MH, can be adjusted so that the resulting chain in NRMH has a specified ‘vorticity’, and therefore, will be non-reversible. Any Markov chain satisfying a symmetric structure condition can be constructed by NRMH, which establishes the generality of the algorithm. Theoretical advantages of finite state space non-reversible chains in terms of improved asymptotic variance and large deviations estimates are briefly mentioned in Sect. 3. In particular we recall a result from Sun et al. (2010) that adding non-reversibility decreases asymptotic variance. Also we present a variation on a result by Rey-Bellet and Spiliopoulos (2014) on large deviations from the invariant distribution.

As mentioned above, for continuous state spaces it was so far not clear how general non-reversible Markov chains (i.e. discrete time, for arbitrary target density) could be constructed. One of the main advantages of NRMH is that it also applies in the setting of continuous state spaces, and thus provides a partial solution to this problem, as will be discussed and verified experimentally in Sect. 4. In particular we implement a non-reversible version of the Metropolis Adjusted Langevin Algorithm (MALA) for Gaussian multivariate target distributions. Finally, conclusions and directions of further research are discussed in Sect. 5.

1.1 Notation

We will consider both finite and infinite-dimensional vectors and matrices. The constant vector with all elements equal to 1 will be denoted by \(\mathbbm {1}\); the dimensionality of \(\mathbbm {1}\) should always be clear from the context. Similarly the identity matrix of any dimension will be denoted by I. For sets VS, with \(V \subset S\) the indicator function of V is denoted by \(\mathbbm {1}_V : S \rightarrow \mathbb {R}\). The transpose of a matrix A is denoted by \(A^{\prime }\). The Euclidean vector norm on \(\mathbb {R}^n\), as well as its induced matrix norm, will be denoted by \(\Vert \cdot \Vert \). For a matrix \(A \in \mathbb {R}^n\), the spectrum is denoted by \(\sigma (A)\). The spectral bound and spectral radius of A are denoted by \(\mathfrak {s}(A) = \max \{ {\text {Re}}\lambda : \lambda \in \sigma (A)\}\) and \(\mathfrak {r}(A) = \max \{ |\lambda | : \lambda \in \sigma (A)\}\), respectively.

2 Metropolis-Hastings generalized to obtain non-reversible chains

As a preliminary to non-reversible Metropolis-Hastings, we require the notion of vorticity matrix, which is introduced in Sect. 2.1. The classical Metropolis-Hastings algorithm, discussed in Sect. 2.2, is extended using the notion of vorticity matrix to a non-reversible version in Sect. 2.3.

2.1 Non-reversible Markov chains and vorticity

Let \(P = (P(x,y))\) denote a matrix of transition probabilities of a Markov chain on a finite or countable state space S. A distribution on S is a vector with positive elements in \(L^1(S)\), and is not necessarily normalized, i.e. it is not necessarily the case that \(\sum _{x \in S} \pi (x) = 1\). If \(\pi \) is a distribution such that \(\sum _{x \in S} \pi (x) = 1\), then we call \(\pi \) a probability distribution. We will always assume that \(\pi (x) > 0\) for all \(x \in S\). A (probability) distribution \(\pi \) on S is said to be an invariant (probability) distribution of P if \(\pi ^{\prime }P = \pi ^{\prime }\), i.e. \(\sum _{x \in S} \pi (x) P(x,y) = \pi (y)\) for all \(y \in S\). A distribution \(\pi \) on S is said to satisfy the detailed balance condition with respect to P if \({\text {diag}}(\pi ) P = P^{\prime } {\text {diag}}(\pi )\), i.e. \(\pi (x) P(x,y) = P(y,x) \pi (y)\) for all \(x, y \in S\). If there exists a distribution \(\pi \) on S that satisfies detailed balance with respect to P, then P is said to be reversible. As is well known, and straightforward to check, if \(\pi \) satisfies detailed balance with respect to P, then \(\pi \) is invariant for P. A chain which does not satisfy the detailed balance condition with respect to its invariant distribution is called non-reversible. In a certain sense, this is a misnomer: we may obtain a time reversed Markov chain \(\widehat{P}\) by defining \(\widehat{P}(x,y) := \frac{ \pi (y) P(y,x)}{\pi (x)}\). In fact, \(\widehat{P}\) is the adjoint of P with respect to the inner product \((\cdot ,\cdot )_{\pi }\), defined by \((f,g)_{\pi } = \sum _{x \in S} f(x) g(x) \pi (x)\). For further background material on Markov chains the reader is referred to Levin et al. (2009).

Consider a non-reversible Markov chain P. Let \(K = \frac{1}{2}( P + \widehat{P})\). Then K is a reversible Markov chain with invariant distribution \(\pi \). We will also define a vorticity matrix \(\Gamma \) as (essentially) the skew-symmetric part of P:

$$\begin{aligned} \Gamma (x, y) := \pi (x) P(x,y) - \pi (y) P(y,x), \quad x, y \in S, \end{aligned}$$
(1)

or in matrix notation

$$\begin{aligned} \Gamma = {\text {diag}}(\pi ) P - P^{\prime } {\text {diag}}(\pi ). \end{aligned}$$

One can think of \(\Gamma \) as (a transformation of) the skew-symmetric part of P: \(\Gamma = {\text {diag}}(\pi ) (P - \widehat{P})\). The following simple observations are fundamental to this paper.

Lemma 2.1

Let P be the transition matrix of a Markov chain on S and let \(\pi \) be a distribution on S. Let \(\Gamma \) be defined by (1). Then:

  1. (i)

    \(\Gamma \) is skew-symmetric, i.e. \(\Gamma = - \Gamma ^{\prime }\);

  2. (ii)

    \(\pi \) satisfies detailed balance with respect to P if and only if \(\Gamma = 0\);

  3. (iii)

    \(\pi \) is invariant for P if and only if \(\Gamma \mathbbm {1} = 0\).

Proof

(i) and (ii) are immediate. As for (iii), note that

$$\begin{aligned} (\Gamma \mathbbm {1})(x)= & {} \sum _{y \in S} \left[ \pi (x) P(x, y) - \pi (y) P(y,x)\right] \\= & {} \pi (x) - \sum _{y \in S} \pi (y) P(y,x), \end{aligned}$$

which is zero for all x if and only if \(\pi \) is invariant for P. \(\square \)

In light of Lemma 2.1, a matrix \(\Gamma \in \mathbb {R}^{n \times n}\) which is skew-symmetric and satisfies \(\Gamma \mathbbm {1} = 0\) is called a vorticity matrix. If \(\Gamma \) is related by (1) to a Markov chain P with invariant distribution \(\pi \), it is called the vorticity of P and \(\pi \). It will be a key ingredient in the construction of a non-reversible version of Metropolis-Hastings.

Remark 2.2

A direct way of constructing a non-reversible chain P from a reversible chain K and a vorticity matrix \(\Gamma \), is by letting \(P(x,y) = K(x,y) + \frac{1}{2 \pi (x)}\Gamma (x,y)\), provided that P is a probability matrix (i.e. has nonnegative entries). This is discussed in e.g. Sun et al. (2010). In order to make a transition from a state x, one has to compute entries of \(K(x,\cdot )\) and \(\Gamma (x,\cdot )\) to determine the transition probabilities. This approach has no alternative in uncountable state spaces. This is the main reason for wishing to develop a method that does not depend on the construction mentioned in this remark.

2.2 Metropolis-Hastings

In the Metropolis-Hastings (MH) algorithm a reversible Markov chain \(P_0\) with a given invariant distribution \(\pi \) is constructed. We will assume, mainly for simplicity, throughout this paper that \(\pi (x) > 0\) for all \(x \in S\). As an ingredient for the construction of \(P_0\), a Markov chain Q is used, satisfying the symmetric structure condition

$$\begin{aligned} Q (y,x) = 0 \quad \text{ whenever } \quad Q(x,y) = 0, \quad x, y \in S. \end{aligned}$$
(2)

In other words, whenever a transition from x to y has positive probability, the reverse probability also has positive probability. The Hastings ratio \(R_0(x,y)\) is defined as

$$\begin{aligned} R_0(x,y):=\left\{ \begin{array}{ll} \frac{\pi (y) Q(y,x)}{\pi (x) Q(x,y)}, &{}\quad \text{ for } \text{ all }\,x,y\in S\,\text{ for } \text{ which }\\ &{}\quad \,\pi (x)Q(x,y)\ne 0, \\ 1 &{}\quad \text{ otherwise. } \end{array} \right. \end{aligned}$$
(3)

With this definition of \(R_0\), acceptance probabilities are defined as

$$\begin{aligned} A_0(x,y) := \min \big ( 1, R_0(x,y) \big ), \end{aligned}$$
(4)

and transition probabilities \(P_0(x,y)\) are defined by

$$\begin{aligned} P_0(x,y):=\left\{ \begin{array}{ll} Q(x,y)A_0(x,y),&{}\quad x\ne y,\\ Q(x,x)+\sum _{z\ne x}Q(x,z)&{}\\ \quad \times (1-A_0(x,z)),&{}\quad x=y. \end{array}\right. \end{aligned}$$
(5)

It is a straightforward exercise to show that the chain \(P_0\) has \(\pi \) as its invariant distribution. An important step is the observation that \(R_0(x,y) \le 1\) if and only if \(R_0(y,x) \ge 1\), which will be a recurring phenomenon in the sequel.

2.3 Non-reversible Metropolis-Hastings

We will now discuss how this framework can be extended to construct Markov chains that are, in general, non-reversible. Let \(\Gamma \in \mathbb {R}^{n \times n}\) be a vorticity matrix, and let Q be the transition matrix of a Markov chain, satisfying (2). Again, \(\pi : S \rightarrow (0,\infty )\) is some distribution that is not necessarily normalized and has only positive entries.

We define for \(x, y \in S\) the non-reversible Hastings ratio as

$$\begin{aligned} R_{\Gamma }(x,y) := \left\{ \begin{array}{ll} \frac{\Gamma (x,y) + \pi (y) Q(y,x)}{\pi (x) Q(x,y)}, &{}\quad \text{ if }\,\pi (x) Q(x,y) \ne 0, \\ 1 &{}\quad \text{ otherwise, } \end{array} \right. \end{aligned}$$
(6)

and let, analogously to MH, the acceptance probabilities \(A_{\Gamma }\) be

$$\begin{aligned} A_{\Gamma }(x,y):=\min \big (1,R_{\Gamma }(x,y)\big ). \end{aligned}$$
(7)

Entries of \(\Gamma \) can be negative. In order to avoid the situation that \(A_{\Gamma }\) becomes negative, we will explicitly constrain vorticity matrix \(\Gamma \) to satisfy

$$\begin{aligned} \Gamma (x,y) \ge - \pi (y) Q(y,x) \quad \text{ for } \text{ all }\,x, y \in S. \end{aligned}$$
(8)

Note that (8) implies, by skew-symmetry of \(\Gamma \), that

$$\begin{aligned} - \pi (y) Q(y,x) \le \Gamma (x,y) \le \pi (x) Q(x,y) \quad \text{ for } \text{ all } \ x, y \in S. \end{aligned}$$

In particular, by the symmetric structure condition (2), \(\Gamma \) should have zeroes wherever Q has zeroes. As with Metropolis-Hastings, the transition probabilities \(P_{\Gamma }(x,y)\) are defined by

$$\begin{aligned} P_{\Gamma }(x,y):=\left\{ \begin{array}{ll} Q(x,y) A_{\Gamma }(x,y), &{}\quad x \ne y, \\ Q(x,x) +\sum _{z \ne x} Q(x,z)&{}\\ \quad \times \big (1 - A_{\Gamma }(x,z)\big ), &{}\quad x = y. \end{array}\right. \end{aligned}$$
(9)

Note that indeed \(P_{\Gamma }\) is a matrix of transition probabilities. For \(\Gamma = 0\), \(A_{\Gamma }\) and therefore \(P_{\Gamma }\) reduce to \(A_0\) and \(P_0\), so that the chosen notation is consistent.

In order to check that the proposed Markov chain has \(\pi \) as its invariant density, we need to verify that \(\Gamma \), \(\pi \) and \(P_{\Gamma }\) are related through (1). As a crucial step, we employ the following lemma, in analogy with Metropolis-Hastings.

Lemma 2.3

Let \(\Gamma \) be a vorticity matrix, Q a matrix of transition probabilities satisfying (2), \(\pi \) a distribution that is nowhere zero, such that (8) holds. Let \(R_{\Gamma }\) be as above. Then \(R_{\Gamma }(y,x) > 1\) if and only if \(R_{\Gamma }(x,y) < 1\) for any \(x,y \in S\) for which \(Q(x,y) \ne 0\).

Proof

Suppose \(R_{\Gamma }(x,y) < 1\), i.e. \(\Gamma (x,y) + \pi (y) Q(y,x) < \pi (x) Q(x,y)\). Then

$$\begin{aligned} \pi (x) Q(x,y)+\Gamma (y,x)= & {} \pi (x) Q(x,y)-\Gamma (x,y)\\> & {} \pi (y)Q(y,x), \end{aligned}$$

so that \(R_{\Gamma }(y,x) > 1\), using that \(\Gamma \) is skew-symmetric. The converse direction is analogous. \(\square \)

Using the previous lemma, it is now straightforward to show that \(\Gamma \) is the vorticity matrix of \((P_{\Gamma },\pi )\).

Lemma 2.4

Let Q be a Markov chain, \(\Gamma \) a vorticity matrix, and \(\pi \) a distribution on S, such that (2) and (8) are satisfied. Let \(P_{\Gamma }\) be defined through (6), (7) and (9). Then (1) holds for \((P_{\Gamma }, \pi )\), i.e.

$$\begin{aligned} \Gamma (x,y)=\pi (x)P_{\Gamma }(x,y)-\pi (y)P_{\Gamma }(y,x), \quad \text{ for } \text{ all }\,x,y \in S. \end{aligned}$$

Proof

If x, y are such that \(R_{\Gamma }(x,y) < 1\). Then \(A_{\Gamma }(x,y) = R_{\Gamma }(x,y)\), and \(A_{\Gamma }(y,x) = 1\). Therefore

$$\begin{aligned} \pi (x) P_{\Gamma }(x,y)&= \pi (x) Q(x,y) A_{\Gamma }(x,y)\\&= \pi (x)Q(x,y) \left( \frac{\Gamma (x,y) + \pi (y) Q(y,x)}{\pi (x) Q(x,y)}\right) \\&=\Gamma (x,y) + \pi (y) Q(y,x) = \Gamma (x,y) \\&\quad + \pi (y) A_{\Gamma }(y,x)Q(y,x) = \Gamma (x,y)\\&\quad + \pi (y) P_{\Gamma }(y,x), \end{aligned}$$

using Lemma 2.3. so that (1) is satisfied for such x and y. The cases in which \(R_{\Gamma }(x,y)=1\) or \(>1\) are analogous. \(\square \)

Finally, since by the assumption that \(\Gamma \) is a vorticity matrix, \(\Gamma \mathbbm {1} = 0\), and hence Lemma 2.1 (iii) gives that \(\pi \) is invariant for \(P_{\Gamma }\). We have obtained our main result.

Theorem 2.5

Let Q be a Markov chain, \(\Gamma \) a vorticity matrix, and \(\pi \) a distribution on S that is everywhere positive, such that (2) and (8) are satisfied. Let \(P_{\Gamma }\) be defined through (6), (7) and (9). Then \(P_{\Gamma }\) has \(\pi \) as invariant distribution and \(\Gamma \) as its vorticity matrix.

Remark 2.6

We will refer to a combination \((Q, \Gamma ,\pi )\) which satisfies (2) and (8) as a compatible combination. Especially verifying condition (8) requires some knowledge about \(\pi \), but these do not need to be exact: it will suffice to have acces to a lower bound for \(\pi \). In the proof of Theorem 4.2 the analogue of this condition for continuous spaces is checked as an example.

Remark 2.7

Once we have access to a compatible combination of proposal chain Q, vorticity matrix \(\Gamma \) and target distribution \(\pi \), the NRMH algorithm has similar favorable properties as MH, in that only local information is required: Q, \(\pi \) and \(\Gamma \) only need to be evaluated at the current and proposed state, and no normalization of \(\pi \) is required. In Sect. 4 this will become even more important when NRMH is applied to a problem in continuous state space.

2.4 General observations on NRMH

The following trivial observation serves to indicate the generality of this approach. It asserts that every Markov chain may be build, in a trivial way, by the described procedure.

Proposition 2.8

Let P be a Markov chain with invariant distribution \(\pi \) and corresponding vorticity \(\Gamma \), satisfying (2). If we use \(Q = P\) as proposal distribution and \(\Gamma \) as vorticity matrix in the NRMH algorithm with target distribution \(\pi \), then the resulting Markov chain \(P_{\Gamma }\) is identical to P.

Proof

It suffices to note, that by \(Q = P\) and (1), \(A_{\Gamma }(x,y) = 1\) for all \(x, y \in S\), \(x \ne y\). \(\square \)

Remark 2.9

If, for some pair \((x, y) \in S \times S\), (8) holds with equality, the transition probability \(P_{\Gamma }(x,y) = 0\) even when \(Q(x,y) \ne 0\). Therefore irreducibility of Q does not imply irreducibility of \(P_{\Gamma }\), unless we impose the stronger condition:

$$\begin{aligned} \Gamma (x,y) > - \pi (y) Q(y,x) \qquad \text{ for } \text{ all }\,x,y\in S. \end{aligned}$$
(10)

As noted in Remark 2.2, one may alternatively construct any non-reversible chain P by ‘adding’ a vorticity matrix to a reversible chain K. We may translate one approach into the other, as follows:

Proposition 2.10

Consider irreducible transition kernels Q(xy) and H(xy), related by \(Q(x,y) = H(x,y) + \frac{1}{2 \pi (x)} \Gamma (x,y)\), both satisfying the symmetric structure condition (2). Suppose Q, \(\pi \) and \(\Gamma \) satisfy (8). Suppose \(P_1\) is the Markov kernel obtained from NRMH, using Q as proposal chain and \(\Gamma \) as vorticity matrix. Suppose \(P_2\) is the Markov kernel given by \(P_2(x,y) = K(x,y) + \frac{1}{2\pi (x)} \Gamma (x,y)\), where K is the classical Metropolis-Hastings kernel obtained by using H as proposal chain (and \(\pi \) as target distribution). Then \(P_1=P_2\).

Proof

We compute, for \(x \ne y\), with \(Q(x,y) \ne 0\),

$$\begin{aligned}&\pi (x)P_1(x,y)\\&\quad =\pi (x)\min \left( 1,\frac{\Gamma (x,y)+\pi (y)Q(y,x)}{\pi (x)Q(x,y)}\right) Q(x,y)\\&\quad =\min \left( \pi (x) Q(x,y), \Gamma (x,y) + \pi (y) Q(y,x) \right) \\&\quad =\min \left( \pi (x) H(x,y) + {\frac{1}{2}} \Gamma (x,y), \Gamma (x,y)\right. \\&\left. \qquad \,+ \pi (y) H(y,x) + {\frac{1}{2}} \Gamma (y,x) \right) \\&\quad = \min \left( \pi (x) H(x,y), \pi (y) H(y,x) \right) + {\frac{1}{2}} \Gamma (x,y)\\&\quad = \pi (x) \min \left( 1, \frac{\pi (y) H(y,x)}{\pi (x) H(x,y)}\right) H(x,y) + {\frac{1}{2}} \Gamma (x,y)\\&\quad = \pi (x) P_2(x,y). \end{aligned}$$

Both \(P_1\) and \(P_2\) have zeros on the off-diagonal entries for which Q has zeros. Since both \(P_1\) and \(P_2\) represent transition probabilities, this also fixes their diagonal elements, which concludes the proof. \(\square \)

3 Advantages of non-reversible Markov chains in finite state spaces

Non-reversible Markov chains may offer important computational advantages compared to reversible chains. We will briefly discuss some of these advantages as they apply to finite state space Markov chains. It is to our knowledge an open question how to extend the results of Sects. 3.1 and 3.2 in a generic way to uncountable and continuous state spaces.

3.1 Asymptotic variance

Consider a Markov chain P on S with invariant probability distribution \(\mu \). Let \(f : S \rightarrow \mathbb {R}\). We say that f satisfies a Central Limit Theorem (CLT) if there is a \(\sigma ^2_f < \infty \) such that the normalized sum \(n^{-1/2} \sum _{i=1}^n [f(X_i) - \mu (f)]\) converges weakly to a \(N(0,\sigma _f^2)\) distribution. In this case \(\sigma _f^2\) is called the asymptotic variance.

We will in this section work under the assumption that S is finite. In this case, for any \(f : S\rightarrow \mathbb {R}\) and irreducible P, a CLT is satisfied [see e.g. Roberts and Rosenthal (2004)]. The following result is obtained in Sun et al. (2010), for a more extensive argument see Chen and Hwang (2013).

Proposition 3.1

Let K be a transition matrix of an irreducible reversible Markov chain with invariant probability distribution \(\mu \). Let \(\Gamma \) be a non-zero vorticity matrix and let \(P = K + {\frac{1}{2}} {\text {diag}}(\mu )^{-1} \Gamma \) be the transition matrix of an irreducible Markov chain. For any \(f : S \rightarrow \mathbb {R}\) and denote by \(\sigma _{f,K}^2\) and \(\sigma _{f,P}^2\) the asymptotic variances of f with respect to the transition matrices K and P, respectively.

Then for all \(f : S \rightarrow \mathbb {R}\), we have \(\sigma _{f,P}^2 \le \sigma _{f,K}^2\), and there exists an f such that \(\sigma _{f,P}^2 < \sigma _{f,K}^2\).

In words, adding non-reversibility decreases asymptotic variance.

3.2 Large deviations

In Rey-Bellet and Spiliopoulos (2014) it is noted that non-reversible diffusions on compact manifolds have favorable properties in terms of large deviations of the occupation measure from the invariant distribution. Inspired by their result, we present a simple (but to our knowledge novel) result in the same direction for finite state spaces.

As in the previous section, assume S is finite. We may transform a discrete time Markov chain on S into a continuous time chain by making subsequent transitions after random waiting times that have independent \(\mathrm {Exp}(\lambda )\) distributions. The discrete chain with transition matrix P then transforms into a continuous time chain with generator

$$\begin{aligned} G(x,y)=\left\{ \begin{array}{ll} \lambda P(x,y) &{}\quad \text{ if }\,x\ne y, \\ -\lambda \sum _{z\ne x}P(x,z) &{}\quad \text{ if }\,x=y. \end{array}\right. \end{aligned}$$

The occupation measure of the resulting Markov process is defined as \(L_t = \frac{1}{t} \int _0^t \delta _{X_s} \ d s\). If G is irreducible, the occupation measure satisfies for large time a large deviation principle with rate function

$$\begin{aligned} I_G(\mu ) = \sup _{u > 0} \left( -\sum _{x \in S}\mu _i\frac{(Gu)(x)}{u(x)} \right) , \quad \mu \in \mathcal P(S), \end{aligned}$$

i.e. informally, for large t, for \(A \subset \mathcal P(S)\),

$$\begin{aligned} \mathbb {P}(L_t \in A) \approx \exp \left( - t \inf _{\mu \in A} I_G(\mu ) \right) . \end{aligned}$$

Here \({\mathcal {P}}(S)\) is the set of probability distributions on S. The rate function satisfies the properties that (i) \(I_G \ge 0\), (ii) \(I_G\) is strictly convex, and (iii) \(I_G(\mu ) = 0\) if and only if \(\mu = \pi \), where \(\pi \) is the invariant distribution of G. Hence \(I_G\) quantifies the probability of a large deviation of the occupation measure from the invariant distribution for large t. See Hollander (2000) for further details.

The following result shows that deviations for non-reversible continuous time chains from the invariant distribution are asymptotically less likely than for the corresponding reversible chain.

Proposition 3.2

Suppose G admits a decomposition of the form \(G(x,y) = K(x,y) + \frac{1}{2 \pi (x)} \Gamma (x,y)\), where \(\pi (x) K(x,y) = \pi (y) K(y,x)\) and \(\Gamma (x,y) = -\Gamma (y,x)\) for all \(x \ne y\). Then \(I_G(\mu ) \ge I_K(\mu )\) for all \(\mu \in \mathcal P(X)\), and the inequality is strict if \(\Gamma u^{\star } \ne 0\).

Proof

Assume \(\mu (x) > 0\) for all x. By writing \(\widetilde{K}(x,y) = \pi (x) K(x,y)\), \(\widetilde{\mu }(x) = \mu (x) / \pi (x)\), we have

$$\begin{aligned} \sum _{x=1}^n \mu (x) \frac{ (G u)(x)}{u(x)} = \sum _{x=1}^n \frac{\widetilde{\mu }(x)}{u(x)} \left[ (\widetilde{K} u)(x) + \frac{1}{2}(\Gamma u)(x) \right] . \end{aligned}$$

Hence it suffices to prove the result for \(G = K + \frac{1}{2}\Gamma \), with K symmetric and \(\Gamma \) anti-symmetric.

Now take \(u(x) = \sqrt{\mu (x)}\). Then

$$\begin{aligned} - \sum _{x=1}^n \mu (x)\frac{(G u)(x)}{u(x)}&=-\sum _{x,y=1}^n \sqrt{\mu (x)} K(x,y) \sqrt{\mu (y)}\\&\quad -\frac{1}{2}\sum _{x,y=1}^n \sqrt{\mu (x)} \Gamma (x,y) \sqrt{\mu (y)}. \end{aligned}$$

The first term is equal to the the rate function \(I_K(\mu )\) for the empirical measure in the symmetric case, see e.g. (Hollander 2000, Theorem IV.14). Since \(\Gamma \) is skew-symmetric, the second term vanishes. Hence taking the supremum over u gives a value larger than or equal to \(I_K(\mu )\).

Let \(u^{\star }\) be given by \(u^{\star }(x) = \sqrt{\mu (x)/\pi (x)}\). Then

$$\begin{aligned}&\left. \nabla _u \left( -\sum _{x=1}^n \mu (x) \left( \frac{\frac{1}{2 \pi (x)}\sum _{y=1}^n \Gamma (x,y) u(y) + \sum (y) K(x,y) u(y)}{u(x)} \right) \right) \right| _{u^{\star }}\\&\quad \quad = \Gamma u^{\star }, \end{aligned}$$

which proves the second statement in the proposition.

If \(\mu (x) = 0\) for some x, the proof carries over by only summing over indices for which \(\mu (x) > 0\). \(\square \)

3.3 Mixing time and spectral gap

Non-reversibility in a Markov chain can have very favorable effects on mixing time and spectral gap, but we are not aware of general results in this direction. The reader is referred to Chen et al. (1999), Diaconis et al. (2000), Levin et al. (2009) for theoretical results in this direction, and to e.g. Sun et al. (2010), Turitsyn et al. (2011), Vucelja (2014) for less rigorous and/or numerical results. In these experimental results, non-reversibility especially seems to improve sampling in the case of sampling from a multimodal distribution.

4 NRMH in Euclidean space

In this section we explain how to extend the idea of non-reversible Metropolis-Hastings algorithm to a Euclidean state space. In Appendix 1 it is discussed how Metropolis-Hastings can be applied in the case of a general measurable space, and the discussion in this section is a special case of this.

4.1 General setting

Suppose Q is a Markov transition kernel on \(\mathbb {R}^n\) which has density q(xy) with respect to Lebesgue measure, i.e. \(Q(x, {rm dy}) = q(x,y) \mathrm{dy}\). We will be interested in sampling from a distribution on \(\mathbb {R}^n\) with Lebesgue density \(\pi \). We do not require that \(\int _{\mathbb {R}^n} \pi (x) \mathrm{\ d x} = 1\). Let \(\gamma : \mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}\) be Lebesgue measurable and furthermore suppose \(\gamma \) satisfies

$$\begin{aligned} \gamma (x,y) = - \gamma (y,x) \quad \text{ for } \text{ all } \ x,y \in \mathbb {R}^n, \end{aligned}$$
(11)

and

$$\begin{aligned} \int _{A \times \mathbb {R}^n} \gamma (x, y) \mathrm{\ d x \, dy }= 0, \quad \text{ for } \text{ all } \quad A \in \mathcal B(\mathbb {R}^n). \end{aligned}$$
(12)

Here \({\mathcal {B}}(\mathbb {R}^n)\) denotes the Borel \(\sigma \)-algebra generated by open sets in \(\mathbb {R}^n\). Furthermore suppose that

$$\begin{aligned}&\gamma (x,y)=0 \nonumber \\&\quad \text{ for } \text{ all }\,x, y \in \mathbb {R}^n\,\text{ for } \text{ which }\,\pi (x) q(x,y) = 0, \end{aligned}$$
(13)
$$\begin{aligned}&\pi (x)q(x,y)=0\nonumber \\&\quad \text{ if } \text{ and } \text{ only } \text{ if } \quad \pi (y) q(y,x) = 0, \quad \text{ for } \text{ all }\ x, y \in \mathbb {R}^n,\nonumber \\ \end{aligned}$$
(14)

and

$$\begin{aligned}&\gamma (x,y) + \pi (y) q(y,x) \ge 0,\nonumber \\&\quad \text{ for } \text{ all } \quad x, y \in \mathbb {R}^n\,\text{ for } \text{ which }\,\pi (x) q(x,y) \ne 0. \end{aligned}$$
(15)

Define the Hastings ratio

$$\begin{aligned} \small R(x,y):=\left\{ \begin{array}{ll} \frac{\gamma (x,y) + \pi (y) q(y,x)}{\pi (x) q(x,y)}, &{}\quad \pi (x) q(x,y) \ne 0, \\ 1, &{}\quad \pi (x)q(x,y) = 0, \end{array}\right. \quad (x, y \in \mathbb {R}^n),\nonumber \\ \end{aligned}$$
(16)

acceptance probabilities \(A(x,y) := \min (1, R(x,y))\), and transition kernel

$$\begin{aligned} P(x,B):= & {} \int _B A(x,y)q(x,y)\,\mathrm{dy}\\&+\left( 1-\int _S A(x,y)q(x,y)\,\mathrm{dy}\right) \mathbbm {1}_{x \in B}. \end{aligned}$$

The proof of the following theorem can be found in Appendix 1.

Theorem 4.1

With the above notation and definitions, and assuming conditions (11), (12), (13), (14) and (15) are satisfied, P is a Markov transition kernel with invariant density \(\pi \).

In analogy with the discrete state space setting we call \(\gamma \) the vorticity density of \((P,\pi )\). If \(\gamma \ne 0\) on a set of positive Lebesgue measure then P is non-reversible, i.e. there exist sets \(B_1, B_2 \subset \mathbb {R}^n\) such that

$$\begin{aligned} \int _{B_1}\left\{ \int _{B_2}\pi (x)P(x,\mathrm{dy})\right\} \,\mathrm{dx}\ne \int _{B_2}\left\{ \int _{B_1}\pi (x)P(x,\mathrm{dy})\right\} \,\mathrm{dx}. \end{aligned}$$

4.2 Langevin diffusions for sampling in Euclidean space

The application of non-reversible sampling methods in Euclidean space is a relatively unexplored area. In this short review section we will discuss the use of Langevin diffusions for simulating from a target density, and discuss the potential role of NRMH in within this context.

Assume that the target density \(\pi \) is continuously differentiable. It is well known that \(\pi \) is invariant for the Langevin diffusion

$$\begin{aligned} \mathrm{d X}(t) = \nabla (\log \pi )(X(t)) \mathrm{\ d t} + \sqrt{2} \mathrm{\ d W}(t), \quad t \ge 0. \end{aligned}$$
(17)

where W is a standard Brownian motion in \(\mathbb {R}^n\). A natural (discrete time) Markov chain for sampling \(\pi \) is then the Euler-Maruyama discretization of the Langevin diffusion,

$$\begin{aligned} X_{k+1} \sim N(X_k + h \nabla (\log \pi )(X_n), 2 h), \quad k = 0, 1, 2, \ldots .\nonumber \\ \end{aligned}$$
(18)

Here h is a suitable stepsize. This discretization is approximately correct if h is chosen to be sufficiently small. However, such a choice of h results in slow convergence to equilibrium of the Markov chain. When h is large, then the discretization results in large discrepancy between (17) and (18). As a result also the respective invariant distributions will be different and in particular the invariant distribution of (18) will not correspond to the desired distribution with density \(\pi \). It is customary to correct for this by considering the Euler-Maruyama discretization as a proposal for Metropolis-Hastings, resulting in the Metropolis Adjusted Langevin Algorithm [MALA, Roberts and Tweedie (1996), Roberts and Rosenthal (1998)].

Any diffusion of the form

$$\begin{aligned} \mathrm{dX}(t)=-(I+S)\nabla (\log \pi )(X(t))\mathrm{\,dt}+\sqrt{2}\,\mathrm{d W}(t),\quad t\ge 0,\nonumber \\ \end{aligned}$$
(19)

with \(S \in \mathbb {R}^{n \times n}\) skew-symmetric, has \(\pi \) as invariant density. If \(S \ne 0\) then the diffusion is non-reversible. In analogy with Sect. 3 one may hope that such a non-reversible diffusion has advantages compared to the reversible Langevin diffusion. In fact for multivariate Gaussian target distributions these advantages are clear Hwang et al. (1993), Lelièvre et al. (2013), as we will discuss below. More generallyFootnote 1 the probability of large deviations of the empirical distribution from the invariant distribution are reduced for non-reversible diffusions Rey-Bellet and Spiliopoulos (2014).

When discretizing (19) it is again necessary to correct for discretization error by a MH accept/reject step. However, since MH generates reversible chains, one should expect that also the favourable properties of non-reversibility are destroyed. Instead, an implementation of NRMH should be able to preserve these favourable properties of non-reversible diffusions. We will illustrate this for multivariate Gaussian distributions.

4.3 Non-reversible Metropolis-Hastings for sampling multivarite Gaussian distributions

Consider as target distribution a centered normal distribution with positive definite covariance matrix V. In this case the Langevin diffusion becomes the Ornstein-Uhlenbeck process

$$\begin{aligned} \mathrm{d X}(t) = -V^{-1} X(t) \, \mathrm{d t} + \sqrt{2} \, \mathrm{d W}(t), \end{aligned}$$
(20)

where (W(t)) is an n-dimensional standard Brownian motion. In Hwang et al. (1993), it is shown that adding a ‘nonreversible’ component of the form \(-S V^{-1}\) to the drift, with S skew-symmetric, can improve convergence of the sample covariance. Therefore we will instead consider the Ornstein-Uhlenbeck process with modified drift

$$\begin{aligned} \mathrm{d X}(t) = B X(t) \mathrm{\ d t} + \sqrt{2}\,\mathrm{d W}(t), \end{aligned}$$
(21)

where \(B := -(I+S)V^{-1}\) with S skew-symmetric. For any choice of skew-symmetric S, this diffusion keeps \(\pi \) invariant. The convergence to equilibrium of the diffusion is governed by the spectral bound, \({\mathfrak s}(B) := \max \{ {\text {Re}}\lambda : \lambda \in \sigma (B)\}\). More specifically,

$$\begin{aligned} {\text {Cov}}(X(t))= & {} 2 \int _0^t e^{B s} e^{B^{\prime }s}\,\mathrm{ds}\\&\rightarrow 2\int _0^{\infty } e^{Bs}e^{B^{\prime }s}\,\mathrm{ds}=V\qquad (\text{ as }\,t \rightarrow \infty ), \end{aligned}$$

with rate of convergence

$$\begin{aligned} \frac{1}{t} \ln \left\| \int _t^{\infty } e^{Bs} e^{B^{\prime }s} \mathrm{\ d s} \right\| \rightarrow 2 {\mathfrak s}(B). \end{aligned}$$

Also, \({\mathfrak s}(B) \le {\mathfrak s}(-V^{-1})\) for any choice of S. In other words, adding a non-reversible term increases the speed of convergence to equilibrium. In Lelièvre et al. (2013), it is established that it is possible to choose S optimally, resulting in \({\mathfrak s}(B) = - {\text {tr}} (V^{-1}) / n\). By choosing S in such a way the convergence of the chain is effectively governed by the average of the eigenvalues, which should be compared to the reversible case in which the ‘worst’ eigenvalue determines the speed of convergence.

We will apply the theory developed in Sect. 4.1 to the time discretization of the non-reversible Ornstein-Uhlenbeck process. To be able to satisfy (15) later on, we will require flexibility in the magnitude of the drift multiplier B and the diffusivity. We consider the time discretization of (21), with step size \(h > 0\), is

$$\begin{aligned} X_{k+1} = X_k + h B X_k + \sqrt{2h \sigma } Z_{k+1}, \end{aligned}$$
(22)

where \((Z_k)\) are i.i.d. standard normal and \(\sigma > 0\). For \(\sigma = 1\) this is the usual Euler-Maruyama discretization. The transition kernel of the Euler-Maruyama discretization will serve as our proposal distribution \(Q(x,\mathrm{dy}) = q(x,y) \, \mathrm{d y}\), and we will first determine the vorticity density \(\gamma \) of Q.

Let \(\mathfrak {r}(A) = \max \{ |\lambda | : \lambda \in \sigma (A)\}\) denote the spectral radius of a square matrix A. Provided \(\mathfrak {r}(I+h B) < 1\), the invariant probability distribution of Q is the centered normal distribution with covariance R, where R is the unique positive definite matrix solution to the discrete time Lyapunov equation [see e.g. (Lancaster and Tismenetsky 1985, Theorem 13.2.1)]

$$\begin{aligned} R = 2 h \sigma ^2 I + (I+h B) R(I+h B^{\prime }). \end{aligned}$$
(23)

Let \(\rho \) denote the density of the invariant probability distribution of Q. Let \(f(x,y) = \rho (x) q(x,y)\). The vorticity density of the proposal chain is

$$\begin{aligned} \gamma (x,y) := f(x,y) - f(y,x). \end{aligned}$$
(24)

As target density we have

$$\begin{aligned} \pi (x) = \left( (2 \pi )^n \det V\right) ^{-1/2} \exp \left( - \frac{1}{2}x^{\prime } V^{-1} x \right) , \quad x \in \mathbb {R}^n. \end{aligned}$$

It is clear that \(\gamma \) satisfies (11) and (12). Also since \(\pi \) and q are non-degenerate, (13) and (14) are satisfied. The same statements hold trivially for scalar multiples of \(\gamma \).

Verification of (15) requires more effort. We provide a sufficient condition. The proof of this result is provided in Appendix 1.

Theorem 4.2

Define constants \(0 < C_1 \le C_2\) by

$$\begin{aligned} C_1= & {} \Vert V^{-1/2} (I+S) V^{-1} (I-S) V^{1/2}\Vert \quad \text{ and }\nonumber \\&\quad C_2 = \Vert V^{-1/2} (I+S) V^{-1/2} \Vert ^2 \Vert V\Vert . \end{aligned}$$
(25)

Suppose \(c > 0\), \(h > 0\) and \(\sigma > 0\) satisfy

$$\begin{aligned} h < \frac{2}{C_2}, \quad \sigma ^2 \le \frac{2 - h C_2}{2 - h (C_2 - C_1)}, \quad \text{ and } \quad c \le \sigma ^n. \end{aligned}$$
(26)

Then \(\widetilde{\gamma }(x,y) := c \gamma (x,y)\), with \(\gamma \) as constructed above, satisfies (15), and is therefore a vorticity density compatible with proposal distribution \(Q(x,\mathrm{dy}) \sim N((I + h B)x, 2 h \sigma ^2 I)\) and invariant distribution N(0, V).

Remark 4.3

How should one choose c, h and \(\sigma \)? It seems reasonable to choose \(\sigma ^2\) equal to the maximal allowed value in (26), so that the deviation from the Euler-Maruyama discretization (which has \(\sigma = 1\)) is minimal; i.e. let

$$\begin{aligned} \sigma = \sigma (h) = \sqrt{\frac{2 - h C_2}{2 - h \big (C_2 - C_1\big )}}. \end{aligned}$$
(27)

To maximize the non-reversibility effects in the acceptance probability one should choose c as large as possible, i.e. \(c = \sigma ^n(h)\). A heuristic estimate for the scaling of the expected step size is given by the step size h times the multiplicative factor in the vorticity, \(c = \sigma ^n(h)\) in the acceptance probability. Note that \(h \sigma ^n(h) = 0\) for \(h = 0\) and \(h = \frac{2}{C_2}\). Maximization of \(h \sigma ^n(h)\) with respect to h yields

$$\begin{aligned} h = \frac{2}{C_2} + \frac{(n+2)C_1}{2 C_2\big (C_2-C_1\big )} - \frac{\sqrt{(n-2)^2 C_1^2 + 8 n C_1 C_2}}{2 C_2\big (C_2-C_1\big )}\nonumber \\ \end{aligned}$$
(28)

as long as \(C_1 < C_2\) (which is the case in which S and V do not commute). This expression satisfies the condition \(h < \frac{2}{C_2}\). A first order Taylor approximation around 1 / n yields the simplified expression \(h \approx \frac{4}{C_2 (n+2)} < \frac{2}{C_2}\). The corresponding value of \(\sigma ^2(h)\) is to first order equal to \(\sigma ^2(h) \approx 1 - \frac{2 C_1}{C_2(n+2)}\).

In case \(C_1 = C_2\), the optimal value of h is given by \(h = \frac{4}{(n+2)C_2}\), with \(\sigma ^2(h) = 1 - \frac{2}{n+2}\).

To summarize, a general procedure for applying non-reversible Metropolis-Hastings may be described by the steps listed in Fig. 1.

Fig. 1
figure 1

Non-reversible Metropolis-Hastings with Ornstein-Uhlenbeck proposals

4.4 Numerical experiments

Below we carry out two experiments illustrating the approach above. For the obtained Markov chain realization \((X_1, \ldots , X_P)\), we will obtain an estimate of the decorrelation by considering the empirical autocorrelation function (EACF) r defined by

$$\begin{aligned} r^i(k) = \frac{1}{P-k} \sum _{p=1}^{P-k} \left( X_p^i - {\widehat{\mu }}^i\right) \left( X_{p+k}^i-{\widehat{\mu }}^i\right) , \end{aligned}$$

where i ranges over the coordinates \(i = 1, \ldots , n\), and where \(\widehat{\mu }^i = \frac{1}{P} \sum _{p=1}^P X_p^i\) is the empirical average of the i-th coordinate. A fast decaying EACF indicates that the samples generated by the Markov chain are quickly decorrelating.

4.4.1 Three-dimensional example

In this example, from Hwang et al. (1993), we take as target covariance structure V a diagonal matrix with \({\text {diag}}(V) = \left( {\begin{array}{lll} {1,1,1/4} \\ \end{array} } \right) s\). The optimal nonlinear drift is obtained by letting

$$\begin{aligned} S = \left( {\begin{array}{lll} 0 &{}\quad {\sqrt{3} } &{}\quad 1 \\ { - \sqrt{3} } &{}\quad 0 &{}\quad 1 \\ { - 1} &{}\quad { - 1} &{} \quad 0 \\ \end{array} } \right) \end{aligned}$$

We choose the parameter values in accordance with Remark 4.3, resulting in

$$\begin{aligned} c = 0.5333, \quad h = 0.0334, \quad \sigma = 0.8109. \end{aligned}$$

The performance of NRMH is compared to MH with identical step-size h, and reversible proposal distribution \(Q(x,dy) \sim N\big ((I - h V^{-1})x, 2h I\big )\). In Fig. 2 the EACFs for this 3-dimensional example are plotted. Here we see that NRMH helps to decrease the autocorrelations of the slowly decorrelating components in MH (here, the first two components). It achieves this without increasing autocorrelations of components that are already quickly decorrelating (here, the third component).

Fig. 2
figure 2

Empirical autocorrelation functions for 3-dimensional example of Sect. 4.4.1. The blue curve represents the non-reversible Metropolis Hastings method. The green curve represents reversible proposals in conjunction with the usual MH acceptance rule. The step sizes h for all approaches are taken to be the same. These plots are based on an MCMC trajectory of \(10^6\) steps. (Color figure online)

4.4.2 Nine-dimensional example

Here we generated a random diagonal matrix V, with

$$\begin{aligned} {\text {diag}}(V)= & {} \left( 0.8147,0.9058,0.1270,0.9134,0.6324,\right. \\&\left. 0.0975,0.2785,0.5469,0.9575 \right) . \end{aligned}$$

Using the algorithm described in Lelièvre et al. (2013) an optimal non-reversible drift \(B = -(I+S)V^{-1}\) can be computed. For reversible dynamics, we have \({\mathfrak s}(-V^{-1}) = -1.0444\), while for the optimal non-reversible dynamics, \({\mathfrak s}(B) = -{\text {tr}} V^{-1}/n = - 3.2891\). In Fig. 3 the EACFs for this 9-dimensional example are plotted. One can clearly see the typical effect of adding non-reversibility: the autocorrelation of the worst coordinates is improved so that it becomes on par with that of the fastest decorrelating coordinates.

Fig. 3
figure 3

Empirical autocorrelation functions for the 9-dimensional example of Sect. 4.4.2. The blue curve represents the non-reversible Metropolis Hastings method. The green curve represents reversible proposals in conjunction with the usual MH acceptance rule. These plots are based on an MCMC trajectory of \(10^7\) steps. (Color figure online)

In this case, choosing c, h and \(\sigma \) as in Remark 4.3 results in values

$$\begin{aligned} c = 0.4313, \quad h = 7.0822 \times 10^{-4} \quad \text{ and } \quad \sigma = 0.9108. \end{aligned}$$

In a numerical example with \(10^7\) proposed transitions this leads to acceptance ratios displayed in Table 2. Using the batch means method (by dividing the sample trajectory in \(\sqrt{n}\) trajectories of length \(\sqrt{n}\) and assuming the \(\sqrt{n}\) are independent), we can estimate asymptotic variance. The resulting estimates for asymptotic variance of the different components are given in Table 1. It should be noted that the notion of asymptotic variance is only defined in case a CLT holds [see Roberts and Rosenthal (2004)], which strictly speaking is an open question in this setting.

Table 1 Estimated asymptotic variances for the 9-dimensional example of Sect. 4.4.2, based on an MCMC trajectory of \(10^7\) steps

It is well known that for optimal convergence in Metropolis Adjusted Langevin (MALA), the stepsize should be tuned so that the ratio of accepted proposals is approximately equal to 0.574 Roberts and Rosenthal (1998). Compared to this, the acceptance ratios in our example, given in Table 2, are fairly high. In particular the Metropolis-Hastings chain can be improved significantly by increasing step size h, and so we are currently comparing NRMH with a sub-optimal tuning of MH.

The results of this experimental section should therefore be considered as a proof of concept of NRMH in continuous state spaces, rather than as an advertisement for its immediate practicality. The experiments do illustrate the faster decorrelation of NRMH in comparison to MH (for fixed step-size). It is an open question if the framework of NRMH can be extended so that NRMH would become competitive with optimally tuned MH.

Table 2 Acceptance ratios for 9-dimensional example of Sect. 4.4.2, for a sample path consisting of \(10^7\) proposals, and with a step size \(h = 7.0822 \times 10^{-4}\)

5 Discussion

The efficiency increase of non-reversible Markov chains in MCMC can be significant, in terms of either asymptotic variance or mixing properties, as remarked in this paper. NRMH extends the MCMC-toolbox with a method to utilize these benefits. In particular for continuous state spaces it was, to our knowledge, not known how to construct non-reversible Markov chains for MCMC sampling (taking into account the necessity of a correction step when using time discretization of diffusions).

Using the theory developed in Sect. 4, NRMH can be applied to general distributions on \(\mathbb {R}^n\) as follows. For a target density function \(\pi \), suppose there exists a Gaussian distribution N(0, V) with density function \(\pi _0\) on \(\mathbb {R}^n\), satisfying \(k \pi _0 \le \pi \) on \(\mathbb {R}^n\) for some \(k > 0\). Then if \(\gamma \) is a suitable vorticity density for sampling from N(0, V), using proposal density q(xy), we have for \(\widetilde{\gamma }:= k \gamma \) that

$$\begin{aligned} {\widetilde{\gamma }}(x,y) + \pi (x) q(x,y)= & {} k \gamma (x,y) + \pi (x)q(x,y)\\\ge & {} k \big (\gamma (x,y) + \pi _0(x) q(x,y)\big ) \ge 0, \end{aligned}$$

so that (15) is satisfied for the combination for this choice of \(\pi \), \(\widetilde{\gamma }\) and q, and Theorem 4.1 applies. Such a suitable choice of \(\gamma \) may be determined as described in Sect. 4.3.

The approach outlined in Sect. 4 should be considered as a first attempt at implementing the NRMH framework for continuous spaces. As discussed, in order to use the framework one needs to verify the non-negativity condition which leads to technical challenges. In particular we expect that much progress is possible in weakening the conditions of Theorem 4.2. The current form of that proposition results in a relatively small step size h, which obstructs fast convergence of NRMH. As mentioned before, it is an open question whether NRMH in continuous spaces can be made competitive with optimally tuned MALA.

The theoretical discussion of Sect. 3 and the numerical experiment of Sect. 4.4 illustrate how efficiency can be improved by employing non-reversible Metropolis-Hastings. In view of these encouraging results it will hopefully be possible to extend the result to more general settings. The practical application of NRMH depends on the identification of suitable vorticity structures that are compatible with proposal chains, and establishing these in practical examples provides a promising and challenging direction of research.

Analysis of non-reversible Markov chains is difficult, essentially because self-adjointness is lost. Without self-adjointness, it is much more difficult to connect spectral theory to mixing properties of chains. It seems that a good way of understanding benefits of non-reversible sampling is by studying Cesaro averages [see Levin et al. (2009) and e.g. the result on large deviations in Sect. 3.2]. The results of Sect. 3 which establish that non-reversible chains have better asymptotic variance or large deviations properties, are so far qualitative in nature (i.e. fail to quantify the amount of improvement). To obtain quantitative results is an important challenge that remains to be addressed. Also, it is object of further study how these results carry over to countable and uncountable state spaces. In particular, the question under what conditions the resulting chains are geometrically ergodic and/or satisfy a CLT should be considered.