1 Introduction

Time reversibility is a fundamental property of many statistical laws of nature. Inspired by Schrödinger [1], Kolmogorov was the first [2], in his celebrated work [3, 4], to investigate this notion in the context of Markov chains and diffusion processes. Reversible chains also find numerous applications in computer science, for instance in queuing networks [5] or Markov Chain Monte Carlo sampling algorithms [6]. For example, a random walk over a weighted network corresponds to a reversible Markov chain [7, Section 3.2].

Reversible Markov operators enjoy a considerably richer mathematical structure than their non-reversible counterparts, enabling a wide range of analytical tools and techniques. Indeed, the significance of reversibility spans surprisingly many areas of mathematics, from spectral theory [8, Chapter 12] to abstract algebra [9]. For instance, the mixing time of a reversible Markov chain, i.e. the time to guarantee closeness to stationarity, is controlled up to logarithmic factors by the inverse of its absolute spectral gap (the gap between its largest eigenvalue and the largest magnitude among its remaining eigenvalues). The diversity of the existing tools and analyses prompts our first question: can reversibility also be treated from an information geometry perspective?

Through the lens of information geometry, the manifold of all irreducible Markov kernels forms both an exponential family (e-family) and a mixture family (m-family). Our second, natural question is whether we can find subfamilies of irreducible kernels that enjoy similar geometric properties; in other words, can we find submanifolds that are autoparallel with respect to affine connections of interest? For instance, the set of doubly-stochastic matrices is known to form an m-family [10], while a tree model is an e-family of Markov kernels if and only if it is an FSMX model [11].

In this article, we will answer these two questions, see that reversible irreducible Markov chains enjoy the structure of both exponential and mixture families, and explore their geometric properties.

1.1 Related work

The concept of exponential tilting of stochastic matrices using Perron-Frobenius (PF) theory can be traced back to the work of Miller [12]. The large deviation theory for Markov chains, whose crowning achievement is showing that the convex conjugate of the log-PF root of the tilted kernel essentially controls the large deviation rate, was further developed by Donsker and Varadhan [13], Gärtner [14], and Dembo and Zeitouni [15]. Csiszár et al. [16] seem to be the first to recognize the exponential structure of the set of irreducible Markov kernels, in the context of information projections. Independently, Ito and Amari [17] implicitly introduced the notion of asymptotic exponential families, and exhibited irreducible Markov kernels as an example. Takeuchi and Barron [18] later formalized this definition (see also Takeuchi and Kawabata [19]), and Takeuchi and Nagaoka [20] subsequently proved that exponential families and their asymptotic counterparts are equivalent. Nakagawa and Kanaya [21] formally defined the exponential family of irreducible Markov chains and Nagaoka [22] later gave a full treatment in the language of information geometry, proving its dually flat structure. A notable collection of works has also explored the implications of this geometric structure for problems related to parameter estimation [10], hypothesis testing [21, 23], large deviation theory [24], and hidden Markov models [25, 26].

We refer the reader to Levin et al. [8] and Amari and Nagaoka [27] for thorough treatments of the theory of Markov chains and information geometry.

1.2 Outline and main results

In Section 2, we begin with a primer on reversible Markov chains, define exponential and mixture families, and briefly discuss the importance of affine structures for our analysis of exponential families. In Section 3, we define a time-reversal operation on parametric families, and show in Proposition 1 that both m-families and e-families are closed under this transformation. In Section 4 we introduce the concept of a reversible e-family, and provide a characterization (Theorem 2) of such a family in terms of its carrier kernel and set of generator functions. Adapting the Kolmogorov criterion, we show that the necessary and sufficient conditions can be verified in a time that depends polynomially on the number of states. In Section 5, we prove that the set of all reversible and irreducible transition kernels is both an m-family and an e-family (Theorem 3), construct a basis (Theorem 4), and derive a parametrization (Theorem 5) of the entire set of reversible kernels. In Section 6, we investigate information projections of an irreducible Markov chain onto its reversible submanifold. We show that the projections verify Pythagorean identities, and obtain closed-form expressions (Theorem 7). Additionally, we prove that the projections are always equidistant from an irreducible Markov kernel and its time-reversal (bisection property, Proposition 2). In Section 7, we show that reversible edge measures also form an e-family in distributions over pairs (Theorem 8). In Section 8, we briefly compare the geometric properties of reversible chains with several other natural families of Markov kernels. Finally, in Section 9, we characterize the reversible family as both the smallest exponential family that comprises symmetric kernels (Theorem 9), and the smallest mixture family that contains memoryless Markov kernels (Theorem 10).

2 Preliminaries

For \(m \in {\mathbb {N}}\) we write \([m] = \left\{ 1, 2, \dots , m \right\} \). Let \({\mathcal {X}}\) be a set such that \(\left| {\mathcal {X}} \right| = m < \infty \), identified with [m], where to avoid trivialities, we also assume that \(m > 1\). We denote by \({\mathcal {P}}({\mathcal {X}})\) the probability simplex over \({\mathcal {X}}\), and \({\mathcal {P}}_{+}({\mathcal {X}}) = \left\{ \mu \in {\mathcal {P}}({\mathcal {X}}) :\forall x \in {\mathcal {X}}, \mu (x) > 0 \right\} \). All vectors will be written as row-vectors, unless otherwise stated. For real matrices A and B, \(\rho (A)\) is the spectral radius of A, f[A] for \(f :{\mathbb {R}}\rightarrow {\mathbb {R}}\) is the entry-wise application of f to A, \(A \circ B\) is the Hadamard product of A and B, and \(A > 0\) (resp. \(A \ge 0\)) means that A is an entry-wise positive (resp. non-negative) matrix. We will routinely identify a function \(f :{\mathcal {X}}^2 \rightarrow {\mathbb {R}}\) with the linear operator \(f :{\mathbb {R}}^ {\mathcal {X}} \rightarrow {\mathbb {R}}^{\mathcal {X}}\).

2.1 Irreducible Markov chains

We let \(({\mathcal {X}}, {\mathcal {E}})\) be a strongly connected directed graph, where \({\mathcal {X}}\) is the set of vertices, and \({\mathcal {E}} \subset {\mathcal {X}}^2\) the set of edges. Let \({\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\) be the set of all real functions over the set \({\mathcal {E}}\), identified with the totality of functions over \({\mathcal {X}}^2\) that are null outside of \({\mathcal {E}}\), and let \({\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}}) \subset {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\) be the subset of positive functions over \({\mathcal {E}}\). Similarly, we define \({\mathcal {P}}({\mathcal {E}}) = {\mathcal {P}}({\mathcal {X}}^2) \cap {\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}})\), the set of distributions whose mass is concentrated on the edge set \({\mathcal {E}}\). We write \({\mathcal {W}}({\mathcal {X}})\) for the set of row-stochastic transition kernels over the state space \({\mathcal {X}}\), and \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) for the subset of irreducible kernels whose support is \({\mathcal {E}}\), i.e.

$$\begin{aligned} \begin{aligned} {\mathcal {W}}({\mathcal {X}})&\triangleq \left\{ P \in {\mathbb {R}}^{{\mathcal {X}}^2} :P \ge 0, \forall x \in {\mathcal {X}}, \sum _{x' \in {\mathcal {X}}} P(x, x') = 1 \right\} ,\\ {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})&\triangleq {\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}}) \cap {\mathcal {W}}({\mathcal {X}}), \end{aligned} \end{aligned}$$

and where \(P(x,x')\) corresponds to the transition probability from state x to state \(x'\). For \(P \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\), there exists a unique \(\pi \in {\mathcal {P}}_+({\mathcal {X}})\), such that \(\pi P =\pi \) [8, Corollary 1.17], which we call the stationary distribution of P. When \({\mathcal {E}} = {\mathcal {X}}^2\) and if there is no ambiguity about the space under consideration, we may write more simply \({\mathcal {F}}, {\mathcal {F}}_+\) instead of \({\mathcal {F}}({\mathcal {X}}, {\mathcal {X}}^2), {\mathcal {F}}_+({\mathcal {X}}, {\mathcal {X}}^2)\) (a similar notation will apply to all subsequently defined spaces).

2.2 Reversibility

For an irreducible kernel P, we write \(Q = {{\,\mathrm{diag}\,}}(\pi ) P\) for the edge measure matrix, [8, (7.5)], which corresponds to stationary pair-probabilities of P, i.e. \(Q(x, x') = {\mathbb {P}}_{\pi }\left( X_t = x, X_{t+1} = x' \right) \), and denote the set of irreducible edge measures by

$$\begin{aligned} {\mathcal {Q}}({\mathcal {X}}, {\mathcal {E}}) \triangleq \left\{ {{\,\mathrm{diag}\,}}(\pi ) P :P \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}), \pi P = \pi \right\} \subset {\mathcal {P}}({\mathcal {E}}). \end{aligned}$$

Note that this definition is equivalent to

$$\begin{aligned} {\mathcal {Q}}({\mathcal {X}}, {\mathcal {E}}) = \left\{ Q \in {\mathcal {P}}({\mathcal {E}}) :\sum _{x' \in {\mathcal {X}}}Q(x, x') = \sum _{x' \in {\mathcal {X}}}Q(x', x) \right\} . \end{aligned}$$
(1)

We further denote \(P^\star \) for the uniquely defined time-reversal of P, that verifies \(P^\star (x, x') = \pi (x') P(x', x) / \pi (x)\), and write \(Q^\star = Q ^\intercal \) for its corresponding edge measure, where \(^\intercal \) denotes matrix transposition. When Q is symmetric (i.e. \(Q^\star = Q \)), the chain verifies the detailed balance equation,

$$\begin{aligned} \pi (x) P(x, x') = \pi (x') P(x', x), \end{aligned}$$

i.e. \(P^\star = P\), and we say that the Markov chain is reversible. Observe that in this case, for P irreducible over \({\mathcal {E}}\), the edge set must also be symmetric (\({\mathcal {E}} = {\mathcal {E}}^\star \), where \({\mathcal {E}}^\star \triangleq \left\{ (x, x') \in {\mathcal {X}}^2 :(x' , x) \in {\mathcal {E}} \right\} \)). We write \({\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\) for the set of all reversible kernels that are irreducible over \(({\mathcal {X}}, {\mathcal {E}})\). For \(f, g \in {\mathbb {R}}^{\mathcal {X}}\), \(\langle f, g \rangle _\pi \triangleq \sum _{x \in {\mathcal {X}}} f(x) g(x) \pi (x)\) defines an inner product. We call \(\ell _2(\pi )\) the corresponding Hilbert space. The time-reversal is the adjoint operator of P in \(\ell _2(\pi )\), i.e. the unique linear operator that verifies \(\langle P f, g \rangle _\pi = \langle f, P^\star g \rangle _\pi , \forall f,g \in {\mathbb {R}}^{\mathcal {X}}\) (represented here as column vectors). As a consequence, when P is reversible, it is also self-adjoint in \(\ell _2(\pi )\), and the spectrum of P is real.
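
For illustration, the stationary distribution, time-reversal, and detailed balance test can be computed numerically in a few lines; the following is a minimal sketch, assuming numpy, a dense representation, and illustrative function names.

```python
import numpy as np

def stationary_distribution(P):
    """Unique pi > 0 with pi P = pi, for an irreducible row-stochastic P."""
    w, V = np.linalg.eig(P.T)                      # right eigenvectors of P^T = left of P
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))]) # eigenvalue closest to 1
    return pi / pi.sum()

def time_reversal(P):
    """P*(x, x') = pi(x') P(x', x) / pi(x)."""
    pi = stationary_distribution(P)
    return P.T * pi[None, :] / pi[:, None]

def is_reversible(P, tol=1e-10):
    """Detailed balance: the edge measure Q = diag(pi) P is symmetric."""
    Q = stationary_distribution(P)[:, None] * P
    return np.allclose(Q, Q.T, atol=tol)
```

For a reversible kernel, time_reversal(P) coincides with P up to numerical error.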

2.3 Mixture family and exponential family

For later convenience we consider the following three equivalent definitions of a mixture family.

Definition 1

(m-family of transition kernels) We say that a family of irreducible transition kernels \({\mathcal {V}}_m\) is a mixture family (m-family) of irreducible transition kernels on \(({\mathcal {X}}, {\mathcal {E}})\) when one of the following (equivalent) statements (i), (ii), (iii) holds.

  1. (i)

    [28] There exist affinely independent \(Q_0, Q_1, \dots , Q_d \in {\mathcal {Q}}({\mathcal {X}}, {\mathcal {E}})\) such that

    $$\begin{aligned} {\mathcal {V}}_m = \left\{ P_\xi \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}) :Q_\xi = \sum _{i = 1}^{d} \xi ^i Q_i + (1 - \sum _{i=1}^{d}\xi ^i) Q_0, \xi \in \varXi \right\} , \end{aligned}$$

    where \(\varXi = \left\{ \xi \in {\mathbb {R}}^d :Q_\xi (x,x') > 0, \forall (x,x') \in {\mathcal {E}} \right\} \), and \(Q_\xi \) is the edge measure that pertains to \(P_\xi \).

  2. (ii)

    [27, 2.35] There exist \(C, F_1, \dots , F_d \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\), such that \(C, C + F_1, \dots , C + F_d\) are affinely independent,

    $$\begin{aligned} \sum _{x,x'} C(x,x') = 1, \qquad \sum _{x,x'} F_i(x,x') = 0, \forall i \in [d], \end{aligned}$$

    and

    $$\begin{aligned} {\mathcal {V}}_m = \left\{ P_\xi \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}) :Q_\xi = C + \sum _{i =1}^{d} \xi ^i F_i, \xi \in \varXi \right\} \end{aligned}$$

    where \(\varXi = \left\{ \xi \in {\mathbb {R}}^d:Q_\xi (x,x') > 0, \forall (x,x') \in {\mathcal {E}} \right\} \), and \(Q_\xi \) is the edge measure that pertains to \(P_\xi \).

  3. (iii)

    [10, Section 4.2] There exist \(k \in {\mathbb {N}}, g_1, \dots , g_k \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\) and \(c_1, \dots , c_k \in {\mathbb {R}}\), such that

    $$\begin{aligned} {\mathcal {V}}_m = \left\{ P \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}) :\sum _{x,x'} Q(x,x') g_i(x,x') = c_i, \forall i \in [k] \right\} . \end{aligned}$$

Note that \(\varXi \) is an open set, \(\xi \) is called the mixture parameter and d is the dimension of the family \({\mathcal {V}}_m\).

Definition 2

(e-family of transition kernels) Let \(\varTheta \subset {\mathbb {R}}^d\), be some connected parameter space that contains an open ball centered at 0. We say that the parametric family of irreducible transition kernels

$$\begin{aligned} {\mathcal {V}}_e = \left\{ P_\theta :\theta = (\theta ^1, \dots , \theta ^d) \in \varTheta \right\} \end{aligned}$$

is an exponential family (e-family) of transition kernels on \(({\mathcal {X}}, {\mathcal {E}})\) with natural parameter \(\theta \), whenever

  1. (i)

    For all \(\theta \in \varTheta \), \(P_\theta \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) .

  2. (ii)

    There exist functions

    $$\begin{aligned} \begin{aligned} K :{\mathcal {X}} \times {\mathcal {X}}&\rightarrow {\mathbb {R}}, \\ R :\varTheta \times {\mathcal {X}}&\rightarrow {\mathbb {R}}, \\ g_1, \dots , g_d :{\mathcal {X}} \times {\mathcal {X}}&\rightarrow {\mathbb {R}}, \\ \psi :\varTheta&\rightarrow {\mathbb {R}}, \end{aligned} \end{aligned}$$

    such that \(\forall (x,x', \theta ) \in {\mathcal {X}}^2 \times \varTheta \),

    $$\begin{aligned} \log P_\theta (x, x') = K(x, x') + \sum _{i = 1}^{d} \theta ^i g_i(x, x') + R(\theta , x') - R(\theta , x) - \psi (\theta ),\qquad \end{aligned}$$
    (2)

    when \((x, x') \in {\mathcal {E}}\), and \(P_\theta (x, x') = 0\) otherwise.

When fixing some \(\theta \in \varTheta \), we may later write for convenience \(\psi _\theta \) for \(\psi (\theta )\) and \(R_\theta \) for \(R(\theta , \cdot ) \in {\mathbb {R}}^{\mathcal {X}}\). The carrier kernel K, the collection of generator functions \(g_1, \dots , g_d\) and the parameter range \(\varTheta \) define the family entirely. The remaining functions \(R_\theta \) and \(\psi _\theta \) will be determined uniquely by PF theory, from the constraint of \(P_\theta \) being row-stochastic (see for example the proof of Proposition 1). In fact, we can define the mapping \({\mathfrak {s}}\) that constructs a proper irreducible stochastic matrix from any linear operator defined by an irreducible matrix over \(({\mathcal {X}}, {\mathcal {E}})\).

$$\begin{aligned} \begin{aligned} {\mathfrak {s}}:{\mathcal {F}}_+({\mathcal {X}},{\mathcal {E}})&\rightarrow {\mathcal {W}}({\mathcal {X}},{\mathcal {E}}) \\ {\widetilde{P}}(x,x')&\mapsto P(x,x') = \frac{{\widetilde{P}}(x,x') v(x')}{\rho ({\widetilde{P}})v(x)}, \end{aligned} \end{aligned}$$
(3)

where \(\rho ({\widetilde{P}})\) and v are respectively the PF root and right PF eigenvector of \({\widetilde{P}}\).
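
For concreteness, a minimal numerical sketch of the mapping \({\mathfrak {s}}\), assuming numpy and a dense irreducible non-negative matrix (the function name is illustrative only):

```python
import numpy as np

def stochastic_rescaling(P_tilde):
    """The mapping s: rescale an irreducible non-negative matrix into a row-stochastic
    kernel, P(x, x') = P_tilde(x, x') v(x') / (rho v(x)), with rho the PF root and v
    the right PF eigenvector of P_tilde."""
    w, V = np.linalg.eig(P_tilde)
    i = np.argmax(np.real(w))                       # the PF root is the largest real eigenvalue
    rho, v = np.real(w[i]), np.abs(np.real(V[:, i]))
    return P_tilde * v[None, :] / (rho * v[:, None])
```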

Remark 1

In Feigin et al. [29], Küchler and Sørensen [30], Hudson [31], Stefanov [32], Küchler and Sørensen [33], Sørensen [34], an exponential family of transition kernels has the form

$$\begin{aligned} \begin{aligned} \log P_\theta (x, x') = K(x, x') + \sum _{i = 1}^{d} \theta ^i g_i(x, x') - \phi (\theta , x'), \end{aligned} \end{aligned}$$

for some function \(\phi :\varTheta \times {\mathcal {X}} \rightarrow {\mathbb {R}}\). Our Definition 2 however follows the one of Nagaoka [22], Hayashi and Watanabe [10], Watanabe and Hayashi [23], that is endowed with a more compelling geometrical structure [10, Remark 3].

Following the information geometry philosophy [27], we view the e-families or m-families that we defined as d-dimensional submanifolds of \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) with corresponding chart maps \(\theta , \xi :{\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}) \rightarrow {\mathbb {R}}^d\). We can give more geometrical, parametrization-free definitions of e-families and m-families of irreducible transition kernels over \(({\mathcal {X}}, {\mathcal {E}})\), as autoparallel submanifolds of \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) with respect to the e-connection and m-connection [22, Section 6]. We will prefer, however, to mostly cast our analysis in the language of linear algebra, and defer analysis of the relationship with differential geometry concepts to Section 5.3. This choice is motivated by the existence of a known correspondence between affine functions over \({\mathcal {E}}\) and the manifold \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) [22] that we now describe. Denote,

$$\begin{aligned} \begin{aligned} {\mathcal {N}}({\mathcal {X}}, {\mathcal {E}}) \triangleq \bigg \{&h \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}}) :\exists (c, f) \in ({\mathbb {R}}, {\mathbb {R}}^{\mathcal {X}}), \\&\forall (x, x') \in {\mathcal {E}}, h(x, x') = f(x') - f(x) + c \bigg \}. \end{aligned} \end{aligned}$$
(4)

Then \({\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\) defines a \(\left| {\mathcal {E}} \right| \)-dimensional vector space, while \({\mathcal {N}}({\mathcal {X}}, {\mathcal {E}})\) is an \(\left| {\mathcal {X}} \right| \)-dimensional vector space [22, Section 3]. Introducing the mapping,

$$\begin{aligned} \begin{aligned} \varDelta :{\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})&\rightarrow {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}) \\ f&\mapsto \varDelta (f) = {\mathfrak {s}}(\exp \circ f), \end{aligned} \end{aligned}$$
(5)

we see from the expression at (3) that \(\varDelta \) gives a diffeomorphism from the quotient linear space

$$\begin{aligned} {\mathcal {G}}({\mathcal {X}}, {\mathcal {E}}) \triangleq {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})/{\mathcal {N}}({\mathcal {X}}, {\mathcal {E}}) \end{aligned}$$

to \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\), and a subset \({\mathcal {V}}\) of \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) is an e-family if and only if there exists an affine subspace \({\mathcal {A}}\) of the quotient space \({\mathcal {G}}({\mathcal {X}}, {\mathcal {E}})\) such that \({\mathcal {V}} = \varDelta ({\mathcal {A}})\) (we identify a coset with a representative function in that coset). In this case, the correspondence is one-to-one, and the dimensions of the affine space and the submanifold coincide [22, Theorem 2]. In particular, this entails that \(\dim {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}) = \left| {\mathcal {E}} \right| - \left| {\mathcal {X}} \right| \) [22, Corollary 1].

Remark 2

For Definition 2, unless stated otherwise, we will henceforth assume that the \(g_i\) form an independent family in \({\mathcal {G}}({\mathcal {X}}, {\mathcal {E}})\). This will ensure that the family is well-behaved in the sense of Hayashi and Watanabe [10, Lemma 4.1].

3 Time-reversal of parametric families

We begin by extending the definition of a time-reversal to families of Markov chains.

Definition 3

(Time-reversal family) We say that the family of irreducible transition kernels \({\mathcal {V}}^\star \) is the time-reversal of the family of irreducible transition kernels \({\mathcal {V}}\) when \({\mathcal {V}}^\star = \left\{ P^\star :P \in {\mathcal {V}} \right\} \), where \(P^\star \) denotes the time-reversal of P.

We now state the fundamental fact that the property of being an e-family or an m-family of transition kernels is preserved under this time-reversal operation.

Proposition 1

The following statements hold.

Time reversal of m-family: Let \({\mathcal {V}}_m\) be an m-family over \(({\mathcal {X}}, {\mathcal {E}})\), then \({\mathcal {V}}_m^\star \) is an m-family over \(({\mathcal {X}}, {\mathcal {E}}^\star )\). Furthermore, if \({\mathcal {V}}_m\) is the m-family generated by \(Q_0, Q_1, \dots , Q_d \in {\mathcal {Q}}({\mathcal {X}}, {\mathcal {E}})\) (following the notation at Definition 1-(i)), then the time-reversal m-family is given by

$$\begin{aligned} {\mathcal {V}}_m^\star = \left\{ P_\xi \in {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}}^\star ) :Q_\xi = \sum _{i = 1}^{d} \xi ^i Q_i^\star + (1 - \sum _{i=1}^{d}\xi ^i) Q_0^\star , \xi \in \varXi ^\star \right\} , \end{aligned}$$

where \(Q_\xi \) pertains to \(P_\xi \) and with

$$\begin{aligned} \varXi ^\star = \left\{ \xi \in {\mathbb {R}}^d :Q_\xi (x,x') > 0, \forall (x,x') \in {\mathcal {E}}^\star \right\} = \varXi . \end{aligned}$$

Time reversal of e-family: Let \({\mathcal {V}}_e\) be an e-family over \(({\mathcal {X}}, {\mathcal {E}})\), then \({\mathcal {V}}_e^\star \) is an e-family over \(({\mathcal {X}}, {\mathcal {E}}^\star )\). Furthermore, if \({\mathcal {V}}_e\) is the e-family generated by K and \(g_1, \dots , g_d\) (following the notation at Definition 2), then the time-reversal e-family is given by \({\mathcal {V}}^\star = \left\{ P^\star _\theta : \theta \in \varTheta \right\} \) such that

$$\begin{aligned} \log P^\star _\theta (x, x') = K(x', x) + \sum _{i = 1}^{d} \theta ^i g_i(x', x) + L_\theta (x') - L_\theta (x) - \psi _\theta , \end{aligned}$$

when \((x, x') \in {\mathcal {E}}^\star \), \(P_\theta ^\star (x, x') = 0\) otherwise, and where \(L_\theta \) is such that \(\exp [L_\theta ]\) is the right PF eigenvector of the non-negative irreducible matrix

$$\begin{aligned} {\widetilde{P}}_\theta ^\star (x,x') = \exp \left( K(x', x) + \sum _{i = 1}^{d} \theta ^i g_i(x', x) \right) . \end{aligned}$$

Proof

Since the edge measure \(Q^\star _\xi \) of the time-reversal \(P^\star _\xi \) is the transpose of \(Q_\xi \) corresponding to \(P_\xi \), it is easy to obtain the expression of the time-reversal, and to see that \({\mathcal {V}}_m^\star \) is a mixture family. It remains to show that this also holds true for e-families. From the definition of an exponential family (2), and the requirement that \(P_\theta \) be row-stochastic, it must be that for any \(x \in {\mathcal {X}}\),

$$\begin{aligned} \sum _{x' \in {\mathcal {X}}} \exp (K(x, x') + \sum _{i = 1}^{d}\theta ^i g_i(x, x')) e^{R_\theta (x')} = e^{\psi _\theta } e^{R_\theta (x)}, \end{aligned}$$

or more concisely, writing \({\widetilde{P}}_\theta (x,x') = \exp (K(x, x') + \sum _{i = 1}^{d}\theta ^i g_i(x, x'))\) for \(x, x' \in {\mathcal {E}}\) and \({\widetilde{P}}_\theta (x,x') = 0\) otherwise, \({\widetilde{P}}_\theta \exp [R_\theta ] = e^{\psi _\theta } \exp [R_\theta ]\). By positivity of the exponential function, the vector \(\exp [R_\theta ] \in {\mathbb {R}}^{{\mathcal {X}}}\) is positive. Thus, from the PF theorem, \(e^{\psi _\theta }\) corresponds to the spectral radius of \({\widetilde{P}}_\theta \), and \(\exp [R_\theta ]\) its (right) associated eigenvector. There must therefore also exist a left positive eigenvector, which we denote by \(\exp [L_\theta ]\), such that

$$\begin{aligned} \exp [L_\theta ] {\widetilde{P}}_\theta = e^{\psi _\theta } \exp [L_\theta ]. \end{aligned}$$

Defining the positive normalized measure

$$\begin{aligned} \begin{aligned} \pi _\theta (x) \triangleq \frac{\exp (L_\theta (x) + R_\theta (x))}{\sum _{x'' \in {\mathcal {X}}} \exp (L_\theta (x'') + R_\theta (x''))}, \end{aligned} \end{aligned}$$
(6)

it is easily verified that \(\pi _\theta \) is the stationary distribution of \(P_\theta \). Notice that \(\theta \), K and \(g_i\) determine uniquely \(L_\theta , R_\theta , \psi _\theta \) and \(\pi _\theta \) by the PF theorem. Recall that the adjoint of a transition kernel P can be written \(P^\star (x, x') = \pi (x')P(x', x)/\pi (x)\), thus we can compute the time-reversal as

$$\begin{aligned} \begin{aligned} P_\theta (x', x) \frac{\pi _\theta (x')}{\pi _\theta (x)}&= \exp \left( K(x', x) + L_\theta (x') - L_\theta (x) + \sum _{i = 1}^{d}\theta ^i g_i(x', x) - \psi _\theta \right) , \\ \end{aligned} \end{aligned}$$

when \((x', x) \in {\mathcal {E}}\), and 0 for \((x', x) \not \in {\mathcal {E}}\). The requirements of Definition 2 for an e-family are all fulfilled, which concludes the proof. \(\square \)

Remark 3

Recall that for a distribution \(\mu \in {\mathcal {P}}({\mathcal {X}})\), we can by exponential change of measure – also known as exponential tilting – construct the natural exponential family of \(\mu \):

$$\begin{aligned} \mu _\theta (x) = \mu (x)\exp ( \theta x- A(\theta )), \end{aligned}$$

where \(A(\theta )\) is a normalization function that ensures \(\mu _\theta \in {\mathcal {P}}({\mathcal {X}})\) for all \(\theta \in {\mathbb {R}}\). The idea of exponential change of measure for distributions can be traced back to Chernoff [35], and was later termed tilting [36, 37]. Similarly, given some function \(g :{\mathcal {X}}^2 \rightarrow {\mathbb {R}}\) we can tilt an irreducible kernel P (e.g. Miller [12]), by first constructing \({\tilde{P}}_\theta (x,x') = P(x,x') e^{\theta g(x,x')}\), and then rescaling the newly obtained irreducible matrix with the mapping \({\mathfrak {s}}\). When \(\theta = 0\), notice that we recover the original P. But while in our definition,

$$\begin{aligned} P_\theta = {\mathfrak {s}}(P \circ \exp [\theta g]) = \frac{1}{\rho (P \circ \exp [\theta g])} {{\,\mathrm{diag}\,}}v_\theta ^{-1} (P \circ \exp [\theta g]) {{\,\mathrm{diag}\,}}v_\theta \end{aligned}$$

denotes the kernel obtained by tilting with the right PF eigenvector \(v_\theta \), we could alternatively define the Markov kernel \(P'_\theta \) by tilting P with the left PF eigenvector \(u_\theta \):

$$\begin{aligned} P'_\theta = \frac{1}{\rho (P \circ \exp [\theta g])} {{\,\mathrm{diag}\,}}u_\theta ^{-1} (P \circ \exp [\theta g]) ^\intercal {{\,\mathrm{diag}\,}}u_\theta . \end{aligned}$$

Observe that the right and left tilted versions of P with identical \(\theta \) share the same stationary distribution \(\pi _\theta \propto u_\theta \circ v_\theta \) (6) and that they are in fact each other’s time-reversal (\(P'_\theta = P^\star _\theta \)), i.e. they form a pair of adjoint linear operators over the space \(\ell _2(\pi _\theta )\).
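
These right and left tiltings are straightforward to reproduce numerically; below is a minimal sketch, assuming numpy, full support, and illustrative helper names. One can then check that P_left coincides with the time-reversal of P_right computed from pi.

```python
import numpy as np

def pf_root_and_eigenvectors(M):
    """PF root with right (v) and left (u) PF eigenvectors of an irreducible M >= 0."""
    w, V = np.linalg.eig(M)
    wl, U = np.linalg.eig(M.T)
    rho = np.max(np.real(w))
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    u = np.abs(np.real(U[:, np.argmax(np.real(wl))]))
    return rho, v, u

def tilt(P, g, theta):
    """Right- and left-tilted kernels of P by exp(theta * g), together with their
    common stationary distribution pi_theta (proportional to u o v, cf. (6))."""
    M = P * np.exp(theta * g)                      # P o exp[theta g]
    rho, v, u = pf_root_and_eigenvectors(M)
    P_right = M * v[None, :] / (rho * v[:, None])
    P_left = M.T * u[None, :] / (rho * u[:, None])
    pi = u * v / np.sum(u * v)
    return P_right, P_left, pi
```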

4 Reversible exponential families

The previous section extended the time-reversal operation to parametric families of transition kernels. It is then natural to investigate its fixed points, i.e. parametric families that remain invariant under this transformation. We say that an irreducible e-family \({\mathcal {V}}({\mathcal {X}}, {\mathcal {E}})\) is reversible when every \(P \in {\mathcal {V}}({\mathcal {X}}, {\mathcal {E}})\) is reversible. In this case, \({\mathcal {E}}\) coincides with \({\mathcal {E}}^\star \) and \({\mathcal {V}}^\star ({\mathcal {X}}, {\mathcal {E}}^\star )\) with \({\mathcal {V}}({\mathcal {X}}, {\mathcal {E}})\). Observe first that an e-family obtained from tilting a reversible P is not generally reversible, making it clear that the reversible nature of the family cannot be determined solely by the properties of the carrier kernel K. It is however easy to see that an e-family is reversible when K and all the generator functions \(g_1, \dots , g_d\) are symmetric. Moreover, for a state space \({\mathcal {X}}\) of size 2, any exponential family is reversible regardless of symmetry, showing that this sufficient condition is not necessary in general. In this section, we give a complete characterization of this invariant set. Additionally, we explore the algorithmic cost of checking whether this property is verified from the description of the carrier kernel and generators of a given e-family. Before diving into the general theory of reversible e-families, let us consider the following simple examples.

Example 1

(Lazy random walk on the m-cycle) For \(\mathcal{X}=[m]\), and

$$\begin{aligned} {\mathcal {E}}_{\mathsf {cy}} = \left\{ (i,j) \in [m]^2 :\left| i - j \right| \mod (m - 1) \in \left\{ 0, 1 \right\} \right\} , \end{aligned}$$

let

$$\begin{aligned} \begin{aligned} P_\theta (x,x^\prime )&= \exp \bigg ( \theta _1 \sum _{i=1}^m \delta _i(x)\delta _i(x^\prime )\\&+ \theta _2 \sum _{i=1}^m \big (\delta _i(x)\delta _{i+1}(x^\prime ) - \delta _{i+1}(x)\delta _i(x^\prime ) \big ) - \psi (\theta ) \bigg ), \end{aligned} \end{aligned}$$

where \(\psi (\theta )=\log \big ( e^{\theta _1}+e^{\theta _2} + e^{-\theta _2}\big )\) and \(\delta _{m+1}=\delta _1\). This e-family \(\{P_\theta \}_{\theta \in {\mathbb {R}}^2}\) corresponds to the set of biased lazy random walks on the m-cycle given by

$$\begin{aligned} P_\theta (x,x)&= \frac{e^{\theta _1}}{e^{\theta _1}+e^{\theta _2}+e^{-\theta _2}}, \\ P_\theta (x,x+1)&= \frac{e^{\theta _2}}{e^{\theta _1}+e^{\theta _2}+e^{-\theta _2}}, \\ P_\theta (x+1,x)&= \frac{e^{-\theta _2}}{e^{\theta _1}+e^{\theta _2}+e^{-\theta _2}}. \end{aligned}$$

Observe that \(P^\star _{(\theta _1, \theta _2)} = P_{(\theta _1, - \theta _2)}\), and thus \(\{P_\theta \}_{\theta \in {\mathbb {R}}^2}\) is not a reversible e-family. The subfamily \(\{ P_\theta : \theta _1 \in {\mathbb {R}}, \theta _2=0 \}\) of unbiased lazy random walks on the m-cycle, however, forms a reversible e-family.
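
The identity \(P^\star _{(\theta _1, \theta _2)} = P_{(\theta _1, - \theta _2)}\) is easy to verify numerically; a minimal sketch, assuming numpy and \(m \ge 3\):

```python
import numpy as np

def cycle_walk(m, t1, t2):
    """Biased lazy random walk of Example 1 on the m-cycle (states 0, ..., m-1, m >= 3)."""
    Z = np.exp(t1) + np.exp(t2) + np.exp(-t2)
    P = np.zeros((m, m))
    for x in range(m):
        P[x, x] = np.exp(t1) / Z               # stay put
        P[x, (x + 1) % m] = np.exp(t2) / Z     # step forward
        P[x, (x - 1) % m] = np.exp(-t2) / Z    # step backward
    return P

m, t1, t2 = 5, 0.3, 0.7
P = cycle_walk(m, t1, t2)
pi = np.full(m, 1.0 / m)                       # the walk is doubly stochastic, so pi is uniform
P_star = P.T * pi[None, :] / pi[:, None]       # time-reversal
assert np.allclose(P_star, cycle_walk(m, t1, -t2))
```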

Example 2

(Birth-and-death chains) For \(\mathcal{X}=[m]\) and \(\mathcal{E}_{\mathsf {bd}}=\{ (i,j) : |i-j| \le 1\}\), a Markov kernel having its support on \(\mathcal{E}_{\mathsf {bd}}\) is referred to as a birth-and-death chain. Since every birth-and-death chain is reversible [8, Section 2.5], \(\mathcal{W}(\mathcal{X},\mathcal{E}_{\mathsf {bd}})\) is a reversible e-family.

We first recall Kolmogorov’s characterization of reversibility, which will be instrumental in our argument. For \({\mathcal {E}} \subset {\mathcal {X}}^2\) such that \(({\mathcal {X}}, {\mathcal {E}})\) is a strongly connected directed graph, we write \(\varGamma ({\mathcal {X}}, {\mathcal {E}})\) for the set of finite directed closed paths in the graph \(({\mathcal {X}}, {\mathcal {E}})\). Formally, we treat \(\gamma \) as a map \([n] \rightarrow {\mathcal {E}}\) such that \(\gamma (t) = (x_t, x_{t+1})\) with \(x_{n+1} = x_1\) and we write \(\left| \gamma \right| = n\) for the length of the path. For each \(\gamma \in \varGamma ({\mathcal {X}}, {\mathcal {E}})\), we also introduce the reverse closed path \(\gamma ^\star \in \varGamma ({\mathcal {X}}, {\mathcal {E}}^\star )\) given by \(\gamma ^\star (t) = (x_{t+1}, x_t)\). Namely, if \(\gamma \in \varGamma ({\mathcal {X}}, {\mathcal {E}})\), we can write \(\gamma \) informally as a succession of edges such that the starting and finishing states agree (i.e. as an element of \({\mathcal {E}}^n\)).

$$\begin{aligned} \gamma = ((x_1, x_2), (x_2, x_3), \dots , (x_{n - 1}, x_n) ,(x_n, x_1) ), x_i \in {\mathcal {X}}, \forall i \in [n]. \end{aligned}$$

Note that \(\gamma \) is not necessarily a cycle, i.e. in our definition, multiple occurrences of the same point of the space are allowed.

Theorem 1

(Kolmogorov’s criterion [3]) Let P be irreducible over \(({\mathcal {X}}, {\mathcal {E}})\). Then P is reversible if and only if for all \(\gamma \in \varGamma ({\mathcal {X}}, {\mathcal {E}})\),

$$\begin{aligned} \prod _{t = 1 }^{\left| \gamma \right| } P(\gamma (t)) = \prod _{t = 1}^{\left| \gamma \right| } P(\gamma ^\star (t)). \end{aligned}$$

Example 3

When \(\left| {\mathcal {X}} \right| = 2\), all chains are reversible. For \(\left| {\mathcal {X}} \right| = 3\), only one equation needs to be verified for P to be reversible:

$$\begin{aligned} P(1,2)P(2,3)P(3,1) = P(1,3)P(3,2)P(2,1). \end{aligned}$$
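
A direct numerical check of this single equation, as a minimal sketch assuming numpy and 0-indexed states:

```python
import numpy as np

def kolmogorov_three_states(P, tol=1e-12):
    """Kolmogorov's criterion for |X| = 3 reduces to one equation (0-indexed states)."""
    return np.isclose(P[0, 1] * P[1, 2] * P[2, 0],
                      P[0, 2] * P[2, 1] * P[1, 0], atol=tol)
```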

We now extend the definition of reversibility to arbitrary irreducible functions in \({\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}})\) (non-negative on \({\mathcal {X}}^2\) and positive exactly on \({\mathcal {E}}\)), based on Kolmogorov’s criterion, and further introduce the concept of log-reversibility for functions in \({\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\), which considers sums instead of products.

Definition 4

(Reversible and log-reversible functions) Let \({\mathcal {E}} \subset {\mathcal {X}}^2\) such that \({\mathcal {E}} = {\mathcal {E}}^\star \).

  • reversible: A function \(h \in {\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}})\) is reversible whenever it satisfies that,

    $$\begin{aligned} \prod _{t = 1 }^{\left| \gamma \right| } h(\gamma (t)) = \prod _{t = 1}^{\left| \gamma \right| } h(\gamma ^\star (t)). \end{aligned}$$

    for all finite directed closed paths \(\gamma \in \varGamma ({\mathcal {X}}, {\mathcal {E}})\).

  • log-reversible: A function \(h \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\) is log-reversible whenever it satisfies that,

    $$\begin{aligned} \sum _{t = 1 }^{\left| \gamma \right| } h(\gamma (t)) = \sum _{t = 1}^{\left| \gamma \right| } h(\gamma ^\star (t)). \end{aligned}$$

    for all finite directed closed paths \(\gamma \in \varGamma ({\mathcal {X}}, {\mathcal {E}})\).

Remark These definitions do not rely on connectedness properties of \({\mathcal {E}}\) per se, but we will assume irreducibility nonetheless. Observe that when h is represented by an irreducible row-stochastic matrix, the definitions of reversibility of h as a function and as a Markov operator coincide by Kolmogorov’s criterion (Theorem 1). Clearly, for \(h \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\), \(\exp [h]\) being reversible is equivalent to h being log-reversible. We could endow the set of positive reversible functions on \({\mathcal {E}}\) with a group structure by considering the standard multiplicative operation on functions. We choose, however, to construct and focus instead on the vector space of log-reversible functions (Lemma 4).

Lemma 1

Let \(h \in {\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}})\) be such that \({{\,\mathrm{rank}\,}}(h) = 1\). Then h is a reversible function.

Proof

Consider a closed path \(\gamma \in \varGamma ({\mathcal {X}}, {\mathcal {E}})\). Writing \(h = u_h ^\intercal v_h\), we successively have that

$$\begin{aligned} \begin{aligned} \prod _{t = 1 }^{\left| \gamma \right| } h(\gamma (t))&= \prod _{t = 1}^{\left| \gamma \right| } u_h(x_t)v_h(x_{t+1}) = \prod _{t = 1}^{\left| \gamma \right| } u_h(x_t) \prod _{t = 1}^{\left| \gamma \right| } v_h(x_t) = \prod _{t = 1}^{\left| \gamma \right| } h(\gamma ^\star (t)).\\ \end{aligned} \end{aligned}$$

\(\square \)

Theorem 2

(Characterization of reversible e-family) Let \({\mathcal {V}}({\mathcal {X}}, {\mathcal {E}})\) be an irreducible e-family of Markov chains, with natural parametrization \(\theta \), generated by K and \((g_i)_{i \in [d]}\). The following two statements are equivalent.

  1. (i)

    \({\mathcal {V}}({\mathcal {X}}, {\mathcal {E}})\) is reversible.

  2. (ii)

    \({\mathcal {E}} = {\mathcal {E}}^\star \) and \({\mathcal {V}}({\mathcal {X}}, {\mathcal {E}})\) is such that the carrier kernel K and generator functions \(g_i, \forall i \in [d]\) are all log-reversible functions.

Proof

We apply Kolmogorov’s criterion to some arbitrary family member. Let \(\gamma \) be some finite closed path in \(({\mathcal {X}}, {\mathcal {E}})\),

$$\begin{aligned} \prod _{t = 1 }^{\left| \gamma \right| } P_\theta (\gamma (t)) = \prod _{t = 1}^{\left| \gamma \right| } P_\theta (\gamma ^\star (t)). \end{aligned}$$

Rewriting the left-hand side,

$$\begin{aligned} \begin{aligned}&\prod _{t =1 }^{\left| \gamma \right| } P_\theta (\gamma (t)) \\&= \prod _{t = 1}^{\left| \gamma \right| } \exp \left( K(x_t, x_{t+1}) + R(\theta , x_{t+1}) - R(\theta , x_t) + \sum _{i = 1}^{d}\theta ^i g_i(x_t, x_{t+1}) - \psi (\theta ) \right) \\&= \exp (-\left| \gamma \right| \psi (\theta )) \exp \left( \sum _{t = 1}^{\left| \gamma \right| } \left[ K(x_t, x_{t+1}) + \sum _{i = 1}^{d}\theta ^i g_i(x_t, x_{t+1}) \right] \right) . \\ \end{aligned} \end{aligned}$$

Proceeding in a similar way with the right-hand side, we obtain

$$\begin{aligned} \begin{aligned}&\sum _{t =1 }^{\left| \gamma \right| } K(x_t, x_{t+1}) + \sum _{i = 1}^{d}\theta ^i \sum _{t =1 }^{\left| \gamma \right| } g_i(x_t, x_{t+1}) \\ =&\sum _{t = 1}^{\left| \gamma \right| } K(x_{t+1}, x_{t}) + \sum _{i = 1}^{d}\theta ^i \sum _{t = 1}^{\left| \gamma \right| } g_i(x_{t+1}, x_{t}). \\ \end{aligned} \end{aligned}$$

When K and the \(g_i\) are log-reversible, this equality is verified for any closed path, and every member of the family is therefore reversible. Conversely, suppose that every member of the family is reversible. Taking \(\theta = 0\) yields the log-reversibility requirement for K. Further taking \(\theta ^i = \delta _{i}(j)\) for \(j \in [d]\) shows that \(K + g_j\) is log-reversible, which combined with the log-reversibility of K yields the requirement for \(g_j\). \(\square \)

This path checking approach, although mathematically convenient, is not algorithmically efficient. In order to determine whether a full-support kernel –or function– is reversible, the number of distinct Kolmogorov equations that must be checked is

$$\begin{aligned} \sum _{k = 3}^{\left| {\mathcal {X}} \right| } {\left( {\begin{array}{c}\left| {\mathcal {X}} \right| \\ k\end{array}}\right) } \frac{(k - 1)!}{2}, \qquad [38, Proposition~2.1] \end{aligned}$$

which corresponds to the maximal number of cycles (i.e. closed paths such that the only repeated vertices are the first and last one) in a complete graph over \(\left| {\mathcal {X}} \right| \) nodes. Such a testing algorithm rapidly becomes intractable as \(\left| {\mathcal {X}} \right| \) increases. However, for Markov kernels, we know that this is equivalent to verifying the detailed balance equation, which can be achieved in (at most) polynomial time \({\mathcal {O}}(\left| {\mathcal {X}} \right| ^3)\), by solving a linear system in order to find \(\pi \). We show that this idea naturally extends to verifying reversibility of functions, enabling us to design an algorithm of time complexity \({\mathcal {O}}(\left| {\mathcal {X}} \right| ^3)\).

Lemma 2

Let \(h \in {\mathcal {F}}_+({\mathcal {X}}, {\mathcal {E}})\) be irreducible. Then h is reversible if and only if \(\varPi _h \circ h ^\intercal \) is a symmetric matrix, with \(\varPi _h = v_h ^\intercal u_h\) the PF projection of h, where \(u_h \) and \(v_h\) are respectively the left and right PF eigenvectors of h, normalized such that \(u_h v_h ^\intercal = 1\).

Proof

Treat h as the linear operator \(h :{\mathbb {R}}^{{\mathcal {X}}} \rightarrow {\mathbb {R}}^{{\mathcal {X}}}\). Suppose first that h is reversible. We apply PF theory, which guarantees that the following Cesàro averages converge [39, Example 8.3.2] to some positive projection,

$$\begin{aligned} \lim _{n \rightarrow \infty } \frac{1}{n} \sum _{k = 0}^{n} \left( \frac{h}{\rho (h)} \right) ^k = \varPi _h = v_h ^\intercal u_h, \end{aligned}$$
(7)

Fix \((x, x') \in {\mathcal {X}}^2\) such that \(h(x, x') \ne 0\). For \(k \in {\mathbb {N}}\), we write \(\varGamma _k(x,x') \subset \varGamma ({\mathcal {X}}, {\mathcal {E}})\) the set of all directed closed paths \(\gamma \) with \((x_1, x_2, \dots , x_k) \in {\mathcal {X}}^k\) such that

$$\begin{aligned} \gamma = ((x, x_1), (x_1, x_2), \dots , (x_{k-1}, x_k), (x_k, x'), (x', x)). \end{aligned}$$

For any such closed path, it holds (perhaps vacuously if \(\varGamma _k(x,x') = \emptyset \)) that

$$\begin{aligned} \begin{aligned} h(x, x_1) \cdots h(x_{k}, x') h(x', x)&= h(x, x')h(x', x_k) \cdots h(x_1, x). \end{aligned} \end{aligned}$$

Summing this equality over all possible paths in \(\varGamma _k(x, x')\) (i.e. summing over all \((x_1, \dots , x_k) \in {\mathcal {X}}^k\), with the assumption that \(h(x,x') = 0\) whenever \((x,x') \not \in {\mathcal {E}}\)), we obtain

$$\begin{aligned} h^k(x, x') h(x', x) = h(x, x')h^k(x', x). \end{aligned}$$

In the case where \(h(x, x') = 0\), the above equation holds by symmetry of \({\mathcal {E}}\). For \(n \in {\mathbb {N}}\), appropriately rescaling on both sides with the PF root, summing over all \(k \in \left\{ 0, \dots ,n \right\} \) and taking the limit at \(n \rightarrow \infty \), (7) yields detailed balance equations with respect to the projection \(\varPi _h\),

$$\begin{aligned} \begin{aligned} \varPi _h(x, x') h(x', x)&= h(x, x') \varPi _h(x', x), \\ \end{aligned} \end{aligned}$$

or in other words, reversibility of h implies symmetry of \(\varPi _h \circ h ^\intercal \).

To prove the converse, we now suppose that this symmetry holds, with \(\varPi _h\) the PF projection of h. We know that \({{\,\mathrm{rank}\,}}(\varPi _h) = {{\,\mathrm{rank}\,}}(v_h ^\intercal u_h) = 1\), and that \(\varPi _h\) is positive. Consider some finite directed closed path \(\gamma \). Rearranging products yields

$$\begin{aligned} \begin{aligned} \prod _{t = 1 }^{\left| \gamma \right| } h(\gamma (t))&= \left( \prod _{t = 1 }^{\left| \gamma \right| } \frac{\varPi _h(\gamma (t))}{\varPi _h(\gamma ^\star (t))} \right) \prod _{t = 1 }^{\left| \gamma \right| } h(\gamma ^\star (t)), \\ \end{aligned} \end{aligned}$$

but the first factor on the right-hand side equals one, from the fact that rank-one functions are always reversible (Lemma 1). This concludes the proof of the lemma. \(\square \)

Notice that we can define \(\pi _h(x) \triangleq u_h(x)/v_h(x)\) the positive entry-wise ratio of the PF eigenvectors. We can then restate Lemma 2 in terms of the familiar detailed balance equation \(\pi _h(x) h(x, x') = \pi _h(x') h(x', x)\).
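
Lemma 2 thus yields an \({\mathcal {O}}(\left| {\mathcal {X}} \right| ^3)\) test; the following is a minimal numerical sketch, assuming numpy and a dense non-negative irreducible h (the function name is illustrative only).

```python
import numpy as np

def is_reversible_function(h, tol=1e-10):
    """Test of Lemma 2: h is reversible iff Pi_h o h^T is symmetric,
    where Pi_h = v_h^T u_h is the PF projection of h."""
    w, V = np.linalg.eig(h)
    wl, U = np.linalg.eig(h.T)
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))   # right PF eigenvector
    u = np.abs(np.real(U[:, np.argmax(np.real(wl))]))  # left PF eigenvector
    u = u / np.dot(u, v)                                # normalize so that u v^T = 1
    Pi = np.outer(v, u)                                 # Pi_h(x, x') = v(x) u(x')
    M = Pi * h.T                                        # Hadamard product Pi_h o h^T
    return np.allclose(M, M.T, atol=tol)
```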

Corollary 1

Let \(h \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\) some irreducible function. h is log-reversible if and only if there exists \(f \in {\mathbb {R}}^{\mathcal {X}}\) such that \(\forall x, x' \in {\mathcal {X}}\), \(h(x, x') = h(x', x) + f(x') - f(x)\).

Remark: when h is known to be reversible, one can compute \(\pi _h\) in \({\mathcal {O}}(\left| {\mathcal {X}} \right| )\), by adapting the technique of [40]; unfortunately, it is not possible to check for reversibility using this method. If the space becomes large, the reader can consider iterative (power) methods to compute the PF projector, potentially further reducing the verification time cost. We end this section with a technical lemma that will allow us in later sections to swiftly compute expectations of functions under certain reversibility or skew-symmetry properties.

Lemma 3

Let P be irreducible with associated edge measure matrix Q. For a function \(g :{\mathcal {X}}^2 \rightarrow {\mathbb {R}}\), we write \(Q[g] = \sum _{x, x' \in {\mathcal {X}}} Q(x, x') g(x, x')\).

  1. (i)

    If g is log-reversible, \(Q[g] = Q^\star [g]\).

  2. (ii)

    If g is skew-symmetric and P is reversible, \(Q[g] = 0\).

  3. (iii)

    If there exists \(f \in {\mathbb {R}}^{\mathcal {X}}\) such that for all \(x, x' \in {\mathcal {X}}\), \(g(x,x') = f(x') - f(x)\), \(Q[g] = 0\) (regardless of P being reversible).

Proof

Claim (iii) follows from the defining property of the edge measure Q:

$$\begin{aligned} \begin{aligned} Q[g]&= \sum _{x, x' \in {\mathcal {X}}} Q(x, x')(f(x') - f(x)) \\&= \sum _{x' \in {\mathcal {X}}} f(x') \sum _{x \in {\mathcal {X}}} Q(x, x') - \sum _{x \in {\mathcal {X}}} f(x) \sum _{x' \in {\mathcal {X}}} Q(x, x') \\&= \sum _{x' \in {\mathcal {X}}} f(x') \pi (x') - \sum _{x \in {\mathcal {X}}} f(x) \pi ( x) = 0.\\ \end{aligned} \end{aligned}$$

From Corollary 1, claim (iii), and re-indexing,

$$\begin{aligned} \begin{aligned} Q[g] = Q[g ^\intercal ] + \sum _{x,x' \in {\mathcal {X}}} Q(x,x')(f(x') - f(x)) = Q^\star [g], \end{aligned} \end{aligned}$$

which yields (i). To prove (ii), consider g such that \(g(x', x) = - g(x, x')\). Then by re-indexing and symmetry of Q,

$$\begin{aligned} \begin{aligned} Q[g]&= \sum _{x', x \in {\mathcal {X}}} Q(x', x) g(x', x) = - \sum _{x', x \in {\mathcal {X}}} Q(x, x') g(x, x') = - Q[g].\\ \end{aligned} \end{aligned}$$

\(\square \)

5 The e-family of reversible Markov kernels

In Section 5.1, we begin by analyzing the affine structure of the space of log-reversible functions, derive its dimension, construct a basis, and deduce that the manifold of all irreducible reversible Markov kernels forms an exponential family. The dimension of this family confirms the well-known fact that the number of free parameters for a reversible kernel is only about half of what is required for the general case, hence that reversible chains serve in a sense as a “natural intermediate” [41, Section 5] in terms of model complexity. In Section 5.2, we proceed to derive a systematic parametrization of the manifold \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\), similar in spirit to the one given in Ito and Amari [17], and in Nagaoka [22, Example 1]. In Section 5.3, we connect our results to general differential geometry, and point out that reversible kernels form a doubly autoparallel submanifold in \({\mathcal {W}} ({\mathcal {X}}, {\mathcal {E}})\). Finally, we conclude with a brief discussion on reversible geodesics (Section 5.4).

5.1 Affine structures

Identifying \({\mathcal {X}}\) with [m], we can endow the set with the natural order induced from \({\mathbb {N}}\). In this section, we will henceforth assume that \({\mathcal {E}}\) is symmetric, and consider the following subsets of \({\mathcal {E}}\),

$$\begin{aligned} \begin{aligned} T_-({\mathcal {E}})&\triangleq \left\{ (x, x') \in {\mathcal {E}} :x' > x \right\} , \\ T_+({\mathcal {E}})&\triangleq \left\{ (x, x') \in {\mathcal {E}} :x' < x \right\} , \\ T_0({\mathcal {E}})&\triangleq \left\{ (x, x') \in {\mathcal {E}} :x' = x \right\} , \\ \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} T({\mathcal {E}})&\triangleq \big \{(x, x') \in {\mathcal {E}} :x' \le x, (x, x') \ne (m, x_\star ), x_\star = {\mathop {{{\,\mathrm{\hbox {arg min}}\,}}}_{x \in {\mathcal {X}}}} \left\{ (m, x) \in {\mathcal {E}} \right\} \big \}. \end{aligned} \end{aligned}$$

We immediately observe that the following cardinality relations hold

$$\begin{aligned} \begin{aligned} \left| T_+({\mathcal {E}}) \right|&= \left| T_-({\mathcal {E}}) \right| , \\ \left| T({\mathcal {E}}) \right|&= \left| T_+({\mathcal {E}}) \right| + \left| T_0({\mathcal {E}}) \right| - 1 = \frac{\left| {\mathcal {E}} \right| + \left| T_0({\mathcal {E}}) \right| }{2} - 1,\\ \end{aligned} \end{aligned}$$
(8)

and that from irreducibility, \(x_\star \ne m\). The last expression in (8) highlights the fact that \(\left| T({\mathcal {E}}) \right| \) is independent of any ordering of elements of \({\mathcal {X}}\). Note also that the element \((m, x_\star )\) in the definition of \(T({\mathcal {E}})\) plays no special role, and could be replaced with any other element of \(T_0({\mathcal {E}}) \cup T_+({\mathcal {E}})\). We define the sets of symmetric and log-reversible functions (Definition 4) over the graph \(({\mathcal {X}}, {\mathcal {E}})\), respectively by

$$\begin{aligned} \begin{aligned} {\mathcal {F}} _{\mathsf {sym}}({\mathcal {X}}, {\mathcal {E}})&\triangleq \left\{ h \in {\mathcal {F}} ({\mathcal {X}}, {\mathcal {E}}) :\forall x, x' \in {\mathcal {X}}, h(x, x') = h(x', x) \right\} , \\ {\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})&\triangleq \left\{ h \in {\mathcal {F}} ({\mathcal {X}}, {\mathcal {E}}) :h \text { is log-reversible } \right\} . \end{aligned} \end{aligned}$$

We note that \({\mathcal {F}} _{\mathsf {sym}}({\mathcal {X}}, {\mathcal {E}})\) is isomorphic to the vector space of symmetric matrices whose entries are null outside of \({\mathcal {E}}\), thus \(\dim {\mathcal {F}} _{\mathsf {sym}}({\mathcal {X}}, {\mathcal {E}}) = \left| T_+({\mathcal {E}}) \right| + \left| T_0({\mathcal {E}}) \right| \). We now show that \({\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\) is also a vector space, and that it contains \({\mathcal {N}} ({\mathcal {X}}, {\mathcal {E}})\) defined at (4).

Lemma 4

The following vector subspace inclusions hold:

$$\begin{aligned} {\mathcal {N}} ({\mathcal {X}}, {\mathcal {E}}) {\mathop { \subset }\limits ^{(i)}} {\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}) {\mathop { \subset }\limits ^{(ii)}} {\mathcal {F}} ({\mathcal {X}}, {\mathcal {E}}). \end{aligned}$$

Proof

To verify (ii), we argue that \({\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\) is closed under linear combinations, from properties of the sum. The fact that the null function is trivially log-reversible concludes this claim. For (i), consider an element \(h \in {\mathcal {N}}({\mathcal {X}}, {\mathcal {E}})\), such that \(h(x, x') = f(x') - f(x) + c\). Then \(h(x, x') = h(x', x) + 2f(x') - 2f(x)\), and from Corollary 1, \(h \in {\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\), thus the inclusion holds. The set \({\mathcal {N}}({\mathcal {X}}, {\mathcal {E}})\) is closed under linear combinations by properties of sums again, and taking \(f = 0, c = 0\) is allowed, whence claim (i). \(\square \)

Remark 4

In fact, defining

$$\begin{aligned} \begin{aligned} {\mathcal {N}}_0({\mathcal {X}}, {\mathcal {E}}) \triangleq \bigg \{&h \in {\mathcal {F}}({\mathcal {X}}, {\mathcal {E}}) :\exists f \in {\mathbb {R}}^{\mathcal {X}}, \\&\forall (x, x') \in {\mathcal {E}}, h(x, x') = f(x') - f(x) \bigg \}, \end{aligned} \end{aligned}$$
(9)

Corollary 1 implies that \({\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}) = {\mathcal {F}} _{\mathsf {sym}}({\mathcal {X}}, {\mathcal {E}}) \oplus {\mathcal {N}}_0({\mathcal {X}}, {\mathcal {E}})\).

It is then possible to further define the quotient space of reversible generator functions

$$\begin{aligned} {\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}) \triangleq {\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})/{\mathcal {N}}({\mathcal {X}}, {\mathcal {E}}). \end{aligned}$$

Theorem 3

The following statements hold.

  1. (i)

    The set of reversible generators can be endowed with a \(\left| T({\mathcal {E}}) \right| \)-dimensional vector space structure.

  2. (ii)

    The set of irreducible and reversible Markov kernels over \(({\mathcal {X}}, {\mathcal {E}})\) forms an e-family of dimension \(\left| T({\mathcal {E}}) \right| \).

Proof

Let g be a log-reversible function over \(({\mathcal {X}},{\mathcal {E}})\). From Corollary 1, there exists \(f \in {\mathbb {R}}^{\mathcal {X}}\) such that \(g(x, x') = g(x', x) + f(x') - f(x)\), or writing \(h(x, x') = g(x, x') + {\tilde{f}}(x) - {\tilde{f}}(x')\) with \({\tilde{f}} = f/2\) (i.e. \(h = (g + g ^\intercal )/2\)), it holds that \(h(x, x') = h(x', x)\), i.e. h is symmetric. The space \({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}},{\mathcal {E}})\) thus also corresponds to the alternative quotient space

$$\begin{aligned} {\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}},{\mathcal {E}}) \cong {\mathcal {F}} _{\mathsf {sym}}({\mathcal {X}},{\mathcal {E}}) / {\mathbb {R}}, \end{aligned}$$
(10)

and as a consequence \(\dim {\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}},{\mathcal {E}}) = \dim {\mathcal {F}} _{\mathsf {sym}}({\mathcal {X}},{\mathcal {E}}) - 1 = \left| T({\mathcal {E}}) \right| \). This concludes the proof of (i). Let \(P \in \varDelta ({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}))\), and recall the definition (5) of the diffeomorphism \(\varDelta \). By Theorem 2, \(P \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\). Conversely, let \(P \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\). Then by the Kolmogorov criterion (Theorem 1), \(\log [P] \in {\mathcal {F}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\), and there exist \((g, f, c) \in {\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}) \times {\mathbb {R}}^{\mathcal {X}} \times {\mathbb {R}}\) such that for any \(x, x' \in {\mathcal {X}}, \log P(x,x') = g(x,x') + f(x') - f(x) + c\) (where c is unique, f is unique up to an additive constant, and both can be recovered from PF theory). In other words, there exists \(g \in {\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\), with \(P = {\mathfrak {s}}(\exp [g]) = \varDelta (g)\), hence \(P \in \varDelta ({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}))\), proving that

$$\begin{aligned} \varDelta ({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})) = {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}}). \end{aligned}$$

Claim (ii) then follows from Nagaoka [22, Theorem 2], as discussed at the end of Section 2. \(\square \)

Corollary 2

For the set of positive Markov kernels, \(\left| T_0({\mathcal {E}}) \right| = \left| {\mathcal {X}} \right| \) and \(\left| {\mathcal {E}} \right| = \left| {\mathcal {X}} \right| ^2\), thus \(\dim {\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {X}}^2) = \left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| +1)/2 - 1\). This is in line with the known number of degrees of freedom of reversible Markov chains [9, 41].

Theorem 4

The family of functions \(g_{i j} = \delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i\), for \((i,j) \in T({\mathcal {E}})\), forms a basis of \({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\).

Proof

We begin by proving the independence of the family in the quotient space \({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\). Since \(g_{ij}\) is symmetric in the sense that \(g_{ij} = g_{ij} ^\intercal \), it trivially verifies the log-reversibility property, thus belongs to \({\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\). Let now \(g \in {\mathcal {G}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\) be such that

$$\begin{aligned} g = \sum _{(i, j) \in T({\mathcal {E}})} \alpha _{i j} g_{ij}, \end{aligned}$$

with \(\alpha _{ij} \in {\mathbb {R}}\), for any \((i,j) \in T({\mathcal {E}})\), and suppose that \(g = 0_{{\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})}\). Our first step is to observe that necessarily \(g = 0_{{\mathcal {F}} ({\mathcal {X}},{\mathcal {E}} )}\), i.e. g must be the null vector in the ambient space. Let us suppose for contradiction that there exist \((f, c) \in ({\mathbb {R}}^{\mathcal {X}}, {\mathbb {R}})\) such that \(g(x, x') = f(x') - f(x) + c\) and either \(c \ne 0\) or f is not constant over \({\mathcal {X}}\). Since by definition, \((m, x_\star ), (x_\star , m) \not \in T({\mathcal {E}})\),

$$\begin{aligned} \begin{aligned} 0 = g(m, x_\star ) = f(m) - f(x_\star ) + c, \\ 0 = g(x_\star , m) = f(x_\star ) - f(m) + c, \end{aligned} \end{aligned}$$

therefore summing the latter equalities yields \(c = 0\), thus f cannot be constant. But then, g is both symmetric and skew-symmetric, which leads to a contradiction, and \(g = 0_{{\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})}\). Since the family \(\left\{ g_{ij} :(i,j) \in T({\mathcal {E}}) \right\} \) is independent in the ambient space \({\mathcal {F}}({\mathcal {X}}, {\mathcal {E}})\), the coefficients \(\alpha _{ij}, (i,j) \in T({\mathcal {E}})\) must be null, and as a result, the family is also linearly independent in \({\mathcal {G}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\). Finally, since from Theorem 3, \(\left| T({\mathcal {E}}) \right| = \dim {\mathcal {G}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\), the family is maximally independent, hence constitutes a basis of the quotient vector space. \(\square \)

Remark 5

An alternative way of showing the linear independence of the family \(\left\{ g_{ij} :(i,j) \in T({\mathcal {E}}) \right\} \) in Theorem 4 consists in verifying that (i) the family is independent in \({\mathcal {F}}_{\mathsf {sym}}\), (ii) \({\mathbb {R}}\ \not \subset {{\,\mathrm{span}\,}}\left\{ g_{ij} :(i,j) \in T({\mathcal {E}}) \right\} \), and then invoking (10).
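
To make the construction of Theorem 4 concrete, the index set \(T({\mathcal {E}})\) and the basis functions \(g_{ij}\) can be enumerated explicitly; the following minimal sketch, assuming numpy and states labelled 1 to m, also checks the dimension count of Corollary 2 in the full-support case.

```python
import numpy as np
from itertools import product

def T_edges(E, m):
    """The index set T(E): lower-triangular edges (x' <= x) of the symmetric edge set E,
    with the distinguished edge (m, x_star) removed."""
    x_star = min(xp for xp in range(1, m + 1) if (m, xp) in E)
    return [(x, xp) for (x, xp) in E if xp <= x and (x, xp) != (m, x_star)]

def basis_function(i, j, m):
    """g_ij = delta_i^T delta_j + delta_j^T delta_i (so g_ii carries a 2 on the diagonal)."""
    g = np.zeros((m, m))
    g[i - 1, j - 1] += 1.0
    g[j - 1, i - 1] += 1.0
    return g

m = 4
E = set(product(range(1, m + 1), repeat=2))     # full support
T = T_edges(E, m)
assert len(T) == m * (m + 1) // 2 - 1           # dimension of W_rev(X, X^2), cf. Corollary 2
```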

5.2 Parametrization of the manifold of reversible kernels

Recall that from [22, Example 1], in the complete graph case (\({\mathcal {E}} = {\mathcal {X}}^2\)), we can find an explicit parametrization for \({\mathcal {W}}({\mathcal {X}}, {\mathcal {X}}^2)\). Indeed, picking any \(x_\star \in {\mathcal {X}}\), we can easily verify, by treating separately the two cases \(x'= x_\star \) and \(x' \ne x_\star \), that

$$\begin{aligned} \begin{aligned} \log P(x, x') =&\sum _{i = 1}^{\left| {\mathcal {X}} \right| } \sum _{\begin{array}{c} j = 1 \\ j \ne x_\star \end{array}}^{\left| {\mathcal {X}} \right| } \log \frac{P(i, j) P(j, x_\star )}{P(i, x_\star )P(x_\star , x_\star )} \delta _i(x)\delta _j(x') \\&+ \log P(x, x_\star ) - \log P(x', x_\star ) + \log P(x_\star , x_\star ). \end{aligned} \end{aligned}$$
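
This identity is easy to verify numerically; a minimal sketch, assuming numpy, 0-indexed states, and the distinguished state \(x_\star = 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, xs = 4, 0                                    # xs plays the role of x_star
P = rng.random((m, m))
P /= P.sum(axis=1, keepdims=True)               # a positive row-stochastic kernel
L = np.log(P)

recon = np.empty((m, m))
for x in range(m):
    for xp in range(m):
        # the double sum collapses to one term, since delta_i(x) delta_j(x') selects (i, j) = (x, x')
        s = 0.0 if xp == xs else L[x, xp] + L[xp, xs] - L[x, xs] - L[xs, xs]
        recon[x, xp] = s + L[x, xs] - L[xp, xs] + L[xs, xs]

assert np.allclose(recon, L)                    # recovers log P in both cases x' = x_star and x' != x_star
```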

In the remainder of this section, we show how to derive a similar parametrization for \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\). We start by recalling the definition of the expectation parameter of an exponential family of kernels. For an e-family \({\mathcal {V}}_e\), following the notation of Definition 2, we define

$$\begin{aligned} \eta _i(\theta ) \triangleq Q_\theta [g_i] = \sum _{x, x' \in {\mathcal {X}}} Q_\theta (x, x') g_i(x,x'), \end{aligned}$$

and call \(\eta = (\eta _1, \dots , \eta _d)\) the expectation parameter of the family. We will first derive \(\eta \) and later convert to the natural parameter \(\theta \) using the following lemma.
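
Before stating that lemma, here is a minimal numerical sketch, assuming numpy and full support, of how \(\eta (\theta )\) can be computed from a carrier kernel K and a list of generators.

```python
import numpy as np

def expectation_parameter(K, gs, theta):
    """eta_i(theta) = Q_theta[g_i] for the e-family generated by carrier K and generators gs."""
    M = np.exp(K + sum(t * g for t, g in zip(theta, gs)))   # tilde P_theta
    w, V = np.linalg.eig(M)
    wl, U = np.linalg.eig(M.T)
    rho = np.max(np.real(w))
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))        # right PF eigenvector
    u = np.abs(np.real(U[:, np.argmax(np.real(wl))]))       # left PF eigenvector
    P = M * v[None, :] / (rho * v[:, None])                 # P_theta = s(tilde P_theta)
    pi = u * v / np.sum(u * v)                              # stationary distribution, cf. (6)
    Q = pi[:, None] * P                                     # edge measure Q_theta
    return np.array([np.sum(Q * g) for g in gs])
```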

Lemma 5

For a given exponential family, we can express the chart transition maps between the expectation and natural parameters \(\theta \circ \eta ^{-1}\) and \(\eta \circ \theta ^{-1}\). Extending the notation at Lemma 3,

  1. (i)
    $$\begin{aligned}\eta _i(\theta ) = Q_\theta [g_i] = \sum _{x, x' \in {\mathcal {X}}} Q_\theta (x,x') g_i(x,x') .\end{aligned}$$
  2. (ii)
    $$\begin{aligned} \begin{aligned} \theta ^i(\eta )&= \left( \frac{\partial }{\partial \eta _i} Q_\eta \right) \left[ \log P_\eta - K \right] \\&= \sum _{x, x' \in {\mathcal {X}}} \left( \frac{\partial }{\partial \eta _i} Q_\eta (x, x') \right) \left( \log P_\eta (x, x') - K(x, x') \right) . \end{aligned} \end{aligned}$$

    In particular, when the carrier kernel verifies \(K = 0\), we more simply have

    $$\begin{aligned} \theta ^i(\eta ) = \left( \frac{\partial }{\partial \eta _i} Q_\eta \right) \left[ \log P_\eta \right] . \end{aligned}$$

Proof

It is well-known that \(\eta _i(\theta ) = \frac{\partial }{\partial \theta ^i} \psi _\theta = Q_\theta [g_i]\) [10, Lemma 5.1], [22, Theorem 4], [21, (28)], therefore we only need to show (ii). Let \(g_1, g_2, \dots , g_d\) be a collection of independent functions of \({\mathcal {G}} ({\mathcal {X}}, {\mathcal {E}})\). Consider the exponential family as in Definition 2. Recall that for two transition kernels \(P_1, P_2\) respectively irreducible over \(({\mathcal {X}}, {\mathcal {E}}_1)\) and \(({\mathcal {X}}, {\mathcal {E}}_2)\), and with stationary distributions \(\pi _1\) and \(\pi _2\), the information divergence of \(P_1\) from \(P_2\) is given by

$$\begin{aligned} D\left( P_1 \parallel P_2\right) = {\left\{ \begin{array}{ll} \sum _{(x,x') \in {\mathcal {E}}_1} \pi _1(x) P_1(x, x') \log \frac{P_1(x, x')}{P_2(x, x')}, &{}\text { when } {\mathcal {E}}_1 { \subset } {\mathcal {E}}_2, \\ \infty &{}\text { otherwise}. \end{array}\right. } \end{aligned}$$
(11)

Writing \(P_0\) for \(P_\theta \) when \(\theta = 0\),

$$\begin{aligned} \begin{aligned}&D\left( P_\theta \parallel P_0\right) \\&= \sum _{x, x' \in {\mathcal {X}}} Q_\theta (x,x') \log \frac{P_\theta (x, x')}{P_0(x, x')} \\&= \sum _{x, x' \in {\mathcal {X}}} Q_\theta (x,x') \left[ \sum _{i = 1}^{d}\theta ^i g_i(x,x') + R_\theta (x') - R_\theta (x) - \psi _\theta - R_0(x') + R_0(x) + \psi _0 \right] \\&= \sum _{i = 1}^{d} \theta ^i \eta _i - \psi _\theta + \psi _0, \end{aligned} \end{aligned}$$

where for the last equality we used (i) of the present lemma and Lemma 3-(iii). Moreover, by a direct computation,

$$\begin{aligned} Q_\theta \left[ - \log P_0 \right] = \psi _0 - Q_\theta [K]. \end{aligned}$$

Thus, the potential function is given by

$$\begin{aligned} \begin{aligned} \varphi (\eta )&\triangleq \sum _{i = 1}^{d} \theta ^i \eta _i - \psi _\theta = Q_\theta \left[ \log P_\theta \right] - Q_\theta [K] = Q_\eta \left[ \log P_\eta \right] - Q_\eta [K]. \end{aligned} \end{aligned}$$
(12)

By taking the derivative, we recover \(\frac{\partial }{\partial \eta _i} \varphi (\eta ) = \theta ^i(\eta )\) [22, (17)]. Moreover, from (12), we have that

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \eta _i} \varphi (\eta )&= \frac{\partial }{\partial \eta _i} \left( Q_\eta \left[ \log P_\eta \right] - Q_\eta [K] \right) \\&= \left( \frac{\partial }{\partial \eta _i} Q_\eta \right) \left[ \log P_\eta - K \right] + Q_\eta \left[ \frac{\partial \log P_\eta }{\partial \eta _i} \right] - Q_\eta \left[ \frac{\partial K }{\partial \eta _i} \right] \\&= \left( \frac{\partial }{\partial \eta _i} Q_\eta \right) \left[ \log P_\eta - K \right] , \\ \end{aligned} \end{aligned}$$

where for the last equality, we used the fact that \(Q_\eta [ \partial K / \partial \eta _i] = 0\), and that from \(P_\eta \) being stochastic,

$$\begin{aligned} \sum _{x,x' \in {\mathcal {X}}} Q_\eta (x,x') \frac{\partial }{\partial \eta _i} \log P_\eta (x,x') = \sum _{x,x' \in {\mathcal {X}}} \pi _\eta (x) \frac{\partial }{\partial \eta _i} P_\eta (x,x') = 0. \end{aligned}$$

This finishes proving (ii) of the lemma. \(\square \)

Theorem 5

Let \(P \in {\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\), with stationary distribution \(\pi \). Using the basis \(g_{ij} = \delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i\), we can write Q, the edge measure matrix associated with P, as a member of the m-family of reversible kernels,

$$\begin{aligned} \begin{aligned} Q&= \frac{g_\star }{2} + \sum _{(i,j) \in T({\mathcal {E}})} (g_{ij} - g_\star )\frac{Q(i,j)}{1 + \delta _i(j)}, \end{aligned} \end{aligned}$$

where \(g_\star = \delta _{m} ^\intercal \delta _{x_\star } + \delta _{x_\star } ^\intercal \delta _{m}\), and we can write P as a member of the e-family,

$$\begin{aligned} \begin{aligned} \log P(x, x')&= \sum _{(i,j) \in T({\mathcal {E}})} \frac{1}{2} \log \frac{P(i, j)P(j, i)}{P(m, x_\star )P(x_\star , m )}\left( \frac{g_{ij}(x, x') }{1 + \delta _i(j)}\right) \\&+ \frac{1}{2} \log \pi (x') - \frac{1}{2} \log \pi (x) + \frac{1}{2} \log P(m, x_\star )P(x_\star , m ), \end{aligned} \end{aligned}$$

when \((x,x') \in {\mathcal {E}}\), \(P(x,x') = 0\) otherwise, and where \(x_\star = {{\,\mathrm{\hbox {arg min}}\,}}_{x \in {\mathcal {X}}} \left\{ (m, x) \in {\mathcal {E}} \right\} \).

Proof

Let us consider the basis

$$\begin{aligned} g_{i j} = \delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i, \end{aligned}$$

and taking \(K = 0\), we are looking for a parametrization of the type

$$\begin{aligned} \begin{aligned} {\widetilde{P}}_\theta (x, x')&= \exp \left( \sum _{(i,j) \in T({\mathcal {E}})} \theta ^{ij} g_{ij}(x, x') \right) , \\ P_\theta (x, x')&= {\mathfrak {s}}({\widetilde{P}}_\theta )(x, x') = {\widetilde{P}}_\theta (x, x') \exp \left( R_\theta (x') - R_\theta (x) - \psi _\theta \right) , \end{aligned} \end{aligned}$$

where \(\exp \psi _\theta \) and \(\exp [R_\theta ]\) are respectively the PF root and right PF eigenvector of \({\widetilde{P}}_\theta \). We first derive a parametrization of the edge measure \(Q_\eta \) as a member of an m-family (following Definition 1-(ii) with respect to the expectation parameter \(\eta \)). For \((i, j) \in {\mathcal {X}}^2\), by Lemma 5-(i),

$$\begin{aligned} \eta _{ij} = Q_\eta [g_{ij}] = Q_\eta (i,j) + Q_\eta (j, i) = 2Q_\eta ( i,j), \end{aligned}$$

and thus, from symmetry of \(Q_\eta \) and since \(Q_\eta \in {\mathcal {P}}({\mathcal {X}}^2)\),

$$\begin{aligned} \begin{aligned} Q_\eta (i, j) = {\left\{ \begin{array}{ll} 0 &{} \text { when } (i,j) \not \in {\mathcal {E}} \\ \eta _{ij} /2 &{} \text { when } (i, j) \in T({\mathcal {E}}), i \ne j \\ \eta _{ji}/2 &{} \text { when } (j, i) \in T({\mathcal {E}}), i \ne j \\ \eta _{ii}/2 &{} \text { when } (i, i) \in T_0({\mathcal {E}}) \\ { \frac{1}{2}\left( 1 - \sum _{(i,j) \in T({\mathcal {E}})}\frac{\eta _{ij}}{1 + \delta _i(j)} \right) } &{} \text { when } (i, j) \in \left\{ (m, x_\star ), (x_\star , m) \right\} , \end{array}\right. } \end{aligned} \end{aligned}$$

and more compactly, for \((x, x') \in {\mathcal {X}}^2\),

$$\begin{aligned} \begin{aligned} Q_\eta&= \frac{g_\star }{2} + \sum _{(i,j) \in T({\mathcal {E}})} \frac{g_{ij} - g_\star }{2(1 + \delta _i(j))}\eta _{ij}, \end{aligned} \end{aligned}$$

where \(g_\star \) is defined as in the statement of the theorem. We differentiate by \(\eta _{i j}\) for \((i,j) \in T({\mathcal {E}})\), to obtain

$$\begin{aligned} \frac{\partial Q_\eta }{\partial \eta _{ij}} = \frac{ g_{ij} - g_\star }{2(1 + \delta _{i}(j))}. \end{aligned}$$

Invoking (ii) of Lemma 5, we convert the expectation parametrization to a natural one,

$$\begin{aligned} \theta ^{ij}(\eta ) = \sum _{(x, x') \in {\mathcal {X}}^2} \left( \frac{\partial }{\partial \eta _{ij}}Q_\eta (x,x') \right) \log P(x, x'), \end{aligned}$$

so that

$$\begin{aligned} \begin{aligned} \theta ^{ij}(\eta )&= \frac{1}{1 + \delta _{i}(j)} \sum _{(x, x') \in {\mathcal {X}}^2} \left( \frac{ g_{ij}(x,x') - g_\star (x, x') }{2}\right) \log P(x, x') \\&= \frac{1}{2(1 + \delta _{i}(j))} \log \frac{P(i, j)P(j, i)}{P(m, x_\star )P(x_\star , m)}. \\ \end{aligned} \end{aligned}$$

Notice that \({\widetilde{P}}_\theta = {\widetilde{P}}_\theta ^\intercal \), hence the right and left PF eigenvectors are identical, i.e. \(R_\theta = L_\theta \), and, as is known (see (6)), the stationary distribution is given by \(\pi _\theta = \exp [2 R_\theta ]/\sum _{x \in {\mathcal {X}}} \exp (2 R_\theta (x))\). In fact, we can easily verify that the right PF eigenvector is given by \(\exp [R_\theta ] = \sqrt{\pi }\), and that the PF root is

$$\begin{aligned} \exp \psi _\theta = (P(m , x_\star ) P(x_\star , m))^{-1/2}. \end{aligned}$$

Indeed, letting \(x \in {\mathcal {X}}\), from detailed balance of P, we have

$$\begin{aligned} \begin{aligned} \sum _{x' \in {\mathcal {X}}} {\widetilde{P}}_\theta (x,x') \sqrt{\pi (x')}&= \sum _{x' \in {\mathcal {X}}} \sqrt{\frac{P(x , x') P(x', x)}{P(m , x_\star ) P(x_\star , m)}} \sqrt{\pi (x')} = \frac{\sqrt{\pi (x)}}{\sqrt{P(m , x_\star ) P(x_\star , m)}}. \end{aligned} \end{aligned}$$

\(\square \)
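As a sanity check, the parametrization of Theorem 5 can be verified numerically. The sketch below assumes numpy and takes, purely for illustration, a full-support reversible chain with m the last state and \(x_\star \) the first; it rebuilds the natural parameters \(\theta ^{ij}\), tilts, and confirms that stochastic rescaling recovers P, that the PF root equals \((P(m, x_\star )P(x_\star , m))^{-1/2}\), and that the right PF eigenvector is proportional to \(\sqrt{\pi }\).

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    return v / v.sum()

def rescale(W):
    # stochastic rescaling s(W) by the PF root and the right PF eigenvector
    w, V = np.linalg.eig(W)
    k = np.argmax(np.real(w))
    rho, v = np.real(w[k]), np.abs(np.real(V[:, k]))
    return W * v[None, :] / (rho * v[:, None])

# a reversible kernel with full support: random walk on a weighted complete graph
C = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 1.0],
              [3.0, 1.0, 2.0]])                  # symmetric conductances
P = C / C.sum(axis=1, keepdims=True)
pi = stationary(P)
n = P.shape[0]
m, x_star = n - 1, 0                             # distinguished pair, excluded from T(E)

# natural parameters of Theorem 5 and the corresponding tilted matrix
logW = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        if (i, j) == (x_star, m):
            continue                             # (m, x_star), (x_star, m) not in T(E)
        theta = np.log(P[i, j] * P[j, i] / (P[m, x_star] * P[x_star, m])) / (2 * (1 + (i == j)))
        g = np.zeros((n, n)); g[i, j] += 1.0; g[j, i] += 1.0
        logW += theta * g
W = np.exp(logW)

rho = np.max(np.real(np.linalg.eigvals(W)))
print(np.allclose(rescale(W), P))                               # tilting + rescaling recovers P
print(np.isclose(rho, (P[m, x_star] * P[x_star, m]) ** -0.5))   # PF root
print(np.allclose(W @ np.sqrt(pi), rho * np.sqrt(pi)))          # right PF eigenvector ~ sqrt(pi)
```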

5.3 The doubly autoparallel submanifold of reversible kernels

Recall that we can view \({\mathcal {W}} = {\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) as a smooth manifold of dimension \(d = \dim {\mathcal {W}} = \left| {\mathcal {E}} \right| - \left| {\mathcal {X}} \right| \). For each \(P \in {\mathcal {W}}\), we can then consider the tangent plane \(T_P\) at P, endowed with a d-dimensional vector space structure. Together with the manifold, we define an information geometric structure consisting of a Riemannian metric, called the Fisher information metric \({\mathfrak {g}}\), and a pair of torsion-free affine connections \(\nabla ^{(e)}\) and \(\nabla ^{(m)}\) respectively called e-connection and m-connection, that are dual with respect to \({\mathfrak {g}}\), i.e. for any vector fields \(X, Y, Z \in \varGamma (T{\mathcal {W}})\),

$$\begin{aligned} \begin{aligned} X {\mathfrak {g}}(Y, Z) = {\mathfrak {g}}(\nabla ^{(e)}_X Y, Z) + {\mathfrak {g}}( Y, \nabla ^{(m)}_X Z), \end{aligned} \end{aligned}$$

where \(\varGamma (T {\mathcal {W}})\) is the set of all sections over the tangent bundle. We now review an explicit construction for \({\mathfrak {g}}, \nabla ^{(m)}, \nabla ^{(e)}\).

Construction in the natural chart map.

Consider a parametric family \({\mathcal {V}} = \left\{ P_\theta :\theta \in \varTheta \right\} \) with \(\varTheta \) an open subset of \({\mathbb {R}}^d\). For any \(n \in {\mathbb {N}}\), we define the path measure \(Q_\theta ^{(n)} \in {\mathcal {P}}({\mathcal {X}}^n)\) induced from the kernel \(P_\theta \),

$$\begin{aligned} Q_\theta ^{(n)}(x_1, x_2, \dots , x_n) = \pi _\theta (x_1) \prod _{t = 1}^{n - 1} P_\theta (x_t, x_{t+1}). \end{aligned}$$

Nagaoka [22] defines the Fisher metric as

$$\begin{aligned} \begin{aligned} {\mathfrak {g}}_{ij}(\theta )&\triangleq \sum _{(x, x') \in {\mathcal {E}}} Q_\theta (x, x') \partial _i \log P_\theta (x, x') \partial _j \log P_\theta (x, x'), \\&= \sum _{(x, x') \in {\mathcal {E}}} \partial _i \log P_\theta (x, x') \partial _j Q_\theta (x, x'), \\&= \lim _{n \rightarrow \infty } \frac{1}{n} {\mathfrak {g}}_{ij}^{ n}(\theta ), \end{aligned} \end{aligned}$$

and the dual affine e/m-connections of \(\left\{ P_\theta :\theta \in \varTheta \right\} \) by their Christoffel symbols,

$$\begin{aligned} \begin{aligned} \varGamma ^{(e)}_{ij, k}(\theta )&\triangleq \sum _{(x, x') \in {\mathcal {E}}} \partial _i \partial _j \log P_\theta (x, x') \partial _k Q_\theta (x, x') = \lim _{n \rightarrow \infty } \frac{1}{n} \varGamma ^{(e), n}_{ij, k}(\theta ), \\ \varGamma ^{(m)}_{ij, k}(\theta )&\triangleq \sum _{(x, x') \in {\mathcal {E}}} \partial _i \partial _j Q_\theta (x, x') \partial _k \log P_\theta (x, x') = \lim _{n \rightarrow \infty } \frac{1}{n} \varGamma ^{(m), n}_{ij, k}(\theta ), \\ \end{aligned} \end{aligned}$$

where \({\mathfrak {g}}_{ij}^{n}(\theta ), \varGamma ^{(e), n}_{ij, k}(\theta ), \varGamma ^{(m), n}_{ij, k}(\theta )\) are the Fisher metric, and Christoffel symbols of the e/m-connections that pertain to the distribution family \(\left\{ Q_\theta ^{(n) } \right\} _{\theta \in \varTheta }\).
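The two expressions above for \({\mathfrak {g}}_{ij}(\theta )\) can be compared numerically. The following finite-difference sketch (assuming numpy; the one-parameter tilted family, base kernel and generator are illustrative choices, not taken from the text) evaluates both sums for a single coordinate and confirms that they agree up to discretization error.

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    return v / v.sum()

def rescale(W):
    w, V = np.linalg.eig(W)
    k = np.argmax(np.real(w))
    rho, v = np.real(w[k]), np.abs(np.real(V[:, k]))
    return W * v[None, :] / (rho * v[:, None])

# one-parameter family P_theta = s(P0 * exp(theta * g)), an illustrative e-family
P0 = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.5, 0.3],
               [0.3, 0.3, 0.4]])
g = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])                  # a symmetric generator

def P_of(t):
    return rescale(P0 * np.exp(t * g))

def Q_of(t):
    P = P_of(t)
    return stationary(P)[:, None] * P

theta, h = 0.3, 1e-5
dlogP = (np.log(P_of(theta + h)) - np.log(P_of(theta - h))) / (2 * h)   # central differences
dQ = (Q_of(theta + h) - Q_of(theta - h)) / (2 * h)

g11_a = np.sum(Q_of(theta) * dlogP * dlogP)      # first expression of the Fisher metric
g11_b = np.sum(dlogP * dQ)                       # second expression
print(g11_a, g11_b)                              # agree up to discretization error
```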

Autoparallelity.

Connections allow us to talk about covariant derivatives and parallelism of vector fields.

Definition 5

A submanifold \({\mathcal {V}}\) is called autoparallel in \({\mathcal {W}}\) with respect to a connection \(\nabla \), when for any vector fields \(X, Y \in \varGamma (T {\mathcal {V}})\), it holds that

$$\begin{aligned} \nabla _X Y \in \varGamma (T {\mathcal {V}}). \end{aligned}$$

A submanifold \({\mathcal {V}}\) of \({\mathcal {W}}\) is then an e-family (resp. m-family) if and only if it is autoparallel with respect to \(\nabla ^{(e)}\) (resp. \(\nabla ^{(m)}\)) [22, Theorem 6]. As the manifold of reversible kernels is both an e-family and an m-family, it is called doubly autoparallel [42, Definition 1].

Theorem 6

The manifold \({\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\) of irreducible and reversible Markov chains over \(({\mathcal {X}}, {\mathcal {E}})\) is a doubly autoparallel submanifold in \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\) with dimension

$$\begin{aligned} \dim {\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}}) = \frac{\left| {\mathcal {E}} \right| + \left| T_0({\mathcal {E}}) \right| }{2} - 1, \end{aligned}$$

where \(T_0({\mathcal {E}}) = \left\{ (x,x') \in {\mathcal {E}} :x = x' \right\} \).

Proof

The set of reversible Markov chains is an e-family (Theorem 3), and an m-family (Theorem 5). \(\square \)

5.4 Reversible geodesics

In this section, we let \(P_0\) and \(P_1\) be two irreducible reversible kernels over \(({\mathcal {X}}, {\mathcal {E}})\), and discuss the geodesics that connect them with respect to \(\nabla ^{(e)}\) and \(\nabla ^{(m)}\). Although this is already guaranteed (see for example Ohara and Ishi [42, Proposition 1]), we offer alternative elementary proofs that any kernel lying on either the e- or the m-geodesic is irreducible and reversible.

m-geodesics.

By irreducibility, there exist unique \(Q_0, Q_1 \in {\mathcal {Q}}({\mathcal {X}}, {\mathcal {E}})\) corresponding to \(P_0, P_1\). Moreover, by reversibility, \(Q_0\) and \(Q_1\) are symmetric. We let

$$\begin{aligned} G_m(P_0, P_1) \triangleq \left\{ P_\xi :Q_\xi = \xi Q_1 + (1 - \xi ) Q_0 :\xi \in [0, 1] \right\} , \end{aligned}$$
(13)

be the m-geodesic (auto-parallel curve with respect to the m-connection) connecting \(P_0\) and \(P_1\). Then \(G_m(P_0, P_1)\) forms an m-family of dimension 1. For any \(\xi \in [0,1]\), the matrix \(Q_\xi \) is symmetric as a convex combination of two symmetric matrices. Moreover, \(Q_\xi \) vanishes exactly where \(Q_0\) and \(Q_1\), i.e. \(P_0\) and \(P_1\), vanish, that is, outside of \({\mathcal {E}}\). Furthermore, writing \(\pi _0\) (resp. \(\pi _1\)) for the unique stationary distribution of \(P_0\) (resp. \(P_1\)),

$$\begin{aligned} \begin{aligned} \sum _{x'} Q_\xi (x, x')&= \xi \pi _1(x) + (1 - \xi ) \pi _0(x), \\ \sum _{x} Q_\xi (x, x')&= \xi \pi _1(x') + (1 - \xi ) \pi _0(x'), \end{aligned} \end{aligned}$$

thus \(Q_\xi \) always defines a proper associated irreducible stochastic kernel \(P_\xi \).

e-geodesics.

We consider the auto-parallel curve with respect to the e-connection that connects \(P_0\) and \(P_1\),

$$\begin{aligned} G_e(P_0, P_1) \triangleq \left\{ P_0(x, x') \exp \left( \theta \log \frac{P_{1}(x, x')}{P_0(x, x')} + R_\theta (x') - R_\theta (x) - \psi _\theta \right) : \theta \in [0,1] \right\} . \end{aligned}$$
(14)

The set \(G_e(P_0, P_1)\) forms an e-family of dimension 1. Indeed, from Theorem 2, and since \(P_0\) and \(P_1\) are reversible by hypothesis, it suffices to verify that \((x, x') \mapsto P_1(x, x') / P_0(x, x')\) is a reversible function over \(({\mathcal {X}}, {\mathcal {E}})\). This follows from a simple application of the Kolmogorov criterion (Theorem 1).
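Both geodesics are easy to evaluate numerically. The sketch below (assuming numpy; the two reversible kernels are illustrative random walks on weighted complete graphs) computes one point on the m-geodesic and one on the e-geodesic, and checks that each is reversible, in line with the two arguments above.

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    return v / v.sum()

def rescale(W):
    w, V = np.linalg.eig(W)
    k = np.argmax(np.real(w))
    rho, v = np.real(w[k]), np.abs(np.real(V[:, k]))
    return W * v[None, :] / (rho * v[:, None])

def edge_measure(P):
    return stationary(P)[:, None] * P

def is_reversible(P):
    Q = edge_measure(P)
    return np.allclose(Q, Q.T)

# two reversible kernels over the same (full) edge set
C0 = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 3.0], [1.0, 3.0, 1.0]])
C1 = np.array([[2.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 3.0]])
P0 = C0 / C0.sum(axis=1, keepdims=True)
P1 = C1 / C1.sum(axis=1, keepdims=True)

# m-geodesic: mix the edge measures, then recover the kernel by row normalization
xi = 0.3
Q_xi = xi * edge_measure(P1) + (1 - xi) * edge_measure(P0)
P_m_geo = Q_xi / Q_xi.sum(axis=1, keepdims=True)

# e-geodesic: entrywise geometric interpolation, followed by stochastic rescaling
theta = 0.3
P_e_geo = rescale(P0 ** (1 - theta) * P1 ** theta)

print(is_reversible(P_m_geo), is_reversible(P_e_geo))   # both points are reversible
```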

6 Reversible information projections

Reversible Markov kernels, as self-adjoint linear operators, enjoy a set of powerful yet brittle spectral properties. The eigenvalues are real, the second largest in magnitude controls the time to stationarity of the Markov process [8, Chapter 12], and all of them are stable under perturbation and estimation [43]. However, any deviation from reversibility carries steep consequences: the spectrum can suddenly become complex and partially loses its control over the mixing time. Furthermore, eigenvalue perturbation results that were dimensionless [44, Corollary 4.10 (Weyl's inequality)] now come at a cost possibly exponential in the dimension [44, Theorem 1.4 (Ostrowski-Elsner)]. For a given irreducible P with stationary distribution \(\pi \), it is therefore interesting to find the closest reversible representative, so as to enable Hilbert space techniques. Computing the closest reversible transition kernel with respect to a norm induced from an inner product was considered in Nielsen and Weber [45], who showed that the problem reduces to a convex minimization problem with a unique solution.

In this section, we examine this problem under a different notion of distance. We consider information projections onto the reversible family of transition kernels \({\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\), for some symmetric edge set \({\mathcal {E}}\). We define the m-projection and the e-projection of P onto the set of reversible transition kernels \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\) respectively as

$$\begin{aligned} \begin{aligned} P_m \triangleq {\mathop {{{\,\mathrm{\hbox {arg min}}\,}}}_{{\bar{P}} \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})}} D\left( P \parallel {\bar{P}}\right) , \qquad P_e \triangleq {\mathop {{{\,\mathrm{\hbox {arg min}}\,}}}_{{\bar{P}} \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})}} D\left( {\bar{P}} \parallel P\right) , \end{aligned} \end{aligned}$$

where \(D\left( \cdot \parallel \cdot \right) \) is the information divergence defined in (11). These two generally distinct projections (D is not symmetric in its arguments) correspond to the closest reversible chains when information divergence is taken as the measure of distance. Under a careful choice of the connection graph of the reversible family, we derive closed-form expressions for \(P_m\) and \(P_e\), along with Pythagorean identities, as illustrated in Fig. 1.

Theorem 7

Let P be irreducible over \(({\mathcal {X}}, {\mathcal {E}})\).

Fig. 1  Information projections \(P_e\) and \(P_m\) of P onto \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {X}}^2)\) in the full support case (\({\mathcal {E}} = {\mathcal {X}}^2\)) (Theorem 7), Pythagorean identities (Theorem 7), and the bisection property (Proposition 2)

m-projection.

The m-projection \(P_m\) of P onto \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cup {\mathcal {E}}^\star )\) is given by

$$\begin{aligned} \begin{aligned} P_m&= \frac{P + P^\star }{2}. \end{aligned} \end{aligned}$$

Moreover, for any \({\bar{P}} \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cup {\mathcal {E}}^\star )\), \(P_m\) satisfies the following Pythagorean identity.

$$\begin{aligned} D\left( P \parallel {\bar{P}}\right) = D\left( P \parallel P_m\right) + D\left( P_m \parallel {\bar{P}}\right) . \end{aligned}$$

e-projection. When \({\mathcal {E}} \cap {\mathcal {E}}^\star \) is a strongly connected directed graph, the e-projection \(P_e\) of P onto \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cap {\mathcal {E}}^\star )\) is given by

$$\begin{aligned} \begin{aligned} P_e&= {\mathfrak {s}}({\widetilde{P}}_e), \qquad \text { with } {\widetilde{P}}_e(x,x') = \sqrt{P(x,x' )P^\star (x,x')}, \end{aligned} \end{aligned}$$

and where \({\mathfrak {s}}\) is the stochastic rescaling mapping defined at (3). Moreover, for any \({\bar{P}} \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cap {\mathcal {E}}^\star )\), \(P_e\) satisfies the following Pythagorean identity.

$$\begin{aligned} D\left( {\bar{P}} \parallel P\right) = D\left( {\bar{P}} \parallel P_e\right) + D\left( P_e \parallel P\right) . \end{aligned}$$

Proof

Our first order of business is to show that \(P_m\) and \(P_e\) belong respectively to \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cup {\mathcal {E}}^\star )\) and \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cap {\mathcal {E}}^\star )\). It is easy to see that \(P_m(x, x') > 0\) exactly when \((x, x')\) or \((x', x)\) belongs to \({\mathcal {E}}\), hence \(P_m \in {\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}} \cup {\mathcal {E}} ^\star )\), and that \(P_e(x, x') > 0\) whenever \((x, x')\) belongs to both \({\mathcal {E}}\) and \({\mathcal {E}}^\star \). Moreover, since the time-reversal operation preserves the stationary distribution of an irreducible chain, \(P_m\) has the same stationary distribution \(\pi _m = \pi \), and a straightforward computation shows that \(P_m\) satisfies the detailed balance equation. To prove reversibility of \(P_e\), we rewrite

$$\begin{aligned} \begin{aligned} \log P_e(x, x')&= \frac{1}{2}\log P(x,x')P(x',x) - \log \rho ({\widetilde{P}}_e) \\&+ \log \left( \sqrt{\pi (x')} v_e(x') \right) - \log \left( \sqrt{\pi (x)} v_e(x) \right) . \end{aligned} \end{aligned}$$

From Corollary 1, \(\log [P_e] \in {\mathcal {F}}_{\mathsf {rev} }({\mathcal {X}},{\mathcal {E}})\), thus \(P_e \in {\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}},{\mathcal {E}})\).

To prove optimality of \(P_m\), it suffices to verify the following Pythagorean identity

$$\begin{aligned} D\left( P \parallel {\bar{P}}\right) = D\left( P \parallel P_m\right) + D\left( P_m \parallel {\bar{P}}\right) . \end{aligned}$$

Writing \(Q_m = {{\,\mathrm{diag}\,}}(\pi ) P_m\), notice that \(P_m = (P + P^\star )/2\) is equivalent to \(Q_m = (Q + Q^\star )/2\). We then have

$$\begin{aligned}&D\left( P \parallel P_m\right) + D\left( P_m \parallel {\bar{P}}\right) - D\left( P \parallel {\bar{P}}\right) \\&= \sum _{x, x' \in {\mathcal {X}}} \bigg ( Q(x,x') \log \frac{P(x,x')}{P_m(x, x')} + Q_m(x,x') \log \frac{P_m(x,x')}{{\bar{P}}(x,x')} \\&\qquad - Q(x,x') \log \frac{P(x,x')}{{\bar{P}}(x,x')} \bigg ) \\&= \sum _{x, x' \in {\mathcal {X}}} \left( Q_m(x,x') - Q(x,x') \right) \log \frac{P_m(x,x')}{{\bar{P}}(x,x')} \\&= \sum _{x, x' \in {\mathcal {X}}} \left( \frac{ Q^\star (x,x') - Q(x,x')}{2}\right) \log \frac{P_m(x,x')}{{\bar{P}}(x,x')} \\&= \frac{1}{2} Q^\star \left[ \log (P_m / {\bar{P}}) \right] - \frac{1}{2} Q\left[ \log (P_m / {\bar{P}}) \right] = 0, \\ \end{aligned}$$

where the last equality stems from (i) of Lemma 3 and reversibility of \(P_m\) and \({\bar{P}}\). Similarly, to prove optimality of \(P_e\), it suffices to verify that

$$\begin{aligned} D\left( {\bar{P}} \parallel P\right) = D\left( {\bar{P}} \parallel P_e\right) + D\left( P_e \parallel P\right) . \end{aligned}$$

By reorganizing terms

$$\begin{aligned} \begin{aligned} D\left( {\bar{P}} \parallel P_e\right) + D\left( P_e \parallel P\right) - D\left( {\bar{P}} \parallel P\right) = {\bar{Q}}\left[ \log (P / P_e) \right] - Q_e\left[ \log (P / P_e) \right] . \\ \end{aligned} \end{aligned}$$

From the definition of \(P_e(x,x')\),

$$\begin{aligned} \begin{aligned} \log \frac{P(x,x')}{P_e(x,x')} = \frac{1}{2}\log \frac{P(x,x')}{P(x',x)} + \frac{1}{2}\log \frac{\pi (x)}{\pi (x')} + \log \frac{v_e(x)}{v_e(x')} + \log \rho ({\widetilde{P}}_e). \\ \end{aligned} \end{aligned}$$

The first three terms being skew-symmetric, reversibility of \({\bar{P}}\) and (ii) of Lemma 3 yield that

$$\begin{aligned} \begin{aligned} {\bar{Q}} \left[ \log (P / P_e) \right] = \log \rho ({\widetilde{P}}_e). \\ \end{aligned} \end{aligned}$$

By a similar argument, \(Q_e \left[ \log ( P /P_e) \right] = \log \rho ({\widetilde{P}}_e)\), which concludes the proof. \(\square \)

In other words, the m-projection is given by the natural additive reversiblization [46, (2.4)] of P, while the e-projection is achieved by some newly defined exponential reversiblization of P.
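The closed forms of Theorem 7, their Pythagorean identities, and the bisection property of Proposition 2 below can all be checked numerically. The following is a minimal sketch, assuming numpy; the non-reversible kernel P and the reversible comparison kernel \({\bar{P}}\) are full-support examples chosen purely for illustration.

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    return v / v.sum()

def rescale(W):
    w, V = np.linalg.eig(W)
    k = np.argmax(np.real(w))
    rho, v = np.real(w[k]), np.abs(np.real(V[:, k]))
    return W * v[None, :] / (rho * v[:, None])

def edge_measure(P):
    return stationary(P)[:, None] * P

def div(P1, P2):
    # information divergence D(P1 || P2), assuming full support
    return np.sum(edge_measure(P1) * np.log(P1 / P2))

# a non-reversible kernel with full support
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6],
              [0.5, 0.2, 0.3]])
pi = stationary(P)
P_star = edge_measure(P).T / pi[:, None]         # time reversal P*(x,x') = pi(x')P(x',x)/pi(x)

P_m = (P + P_star) / 2                           # m-projection: additive reversiblization
P_e = rescale(np.sqrt(P * P_star))               # e-projection: exponential reversiblization

# Pythagorean identities against an arbitrary reversible kernel Pbar
C = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 3.0], [1.0, 3.0, 2.0]])
Pbar = C / C.sum(axis=1, keepdims=True)
print(np.isclose(div(P, Pbar), div(P, P_m) + div(P_m, Pbar)))
print(np.isclose(div(Pbar, P), div(Pbar, P_e) + div(P_e, P)))

# bisection property (Proposition 2)
print(np.isclose(div(P, P_m), div(P_star, P_m)), np.isclose(div(P_e, P), div(P_e, P_star)))
```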

The difference between the m-projection and the e-projection is illustrated in the following example.

Example 4

Let us consider the family of biased lazy random walks \(P_\theta = P_{(\theta _1, \theta _2)}\), given in Example 1. Note that \({\mathcal {E}}={\mathcal {E}}^\star \). The m-projection \(P_m\) of \(P_\theta \) onto \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\) is the unbiased lazy random walk given by \(P_m = P_{(\theta ', 0)}\) with \(\theta ' = \theta _1 - \log \cosh \theta _2\), i.e.

$$\begin{aligned} P_m(x,x) = \frac{e^{\theta _1}}{e^{\theta _1}+e^{\theta _2}+e^{-\theta _2}}, \end{aligned}$$
$$\begin{aligned} P_m(x,x+1) = P_m(x+1,x)= \frac{e^{\theta _2}+e^{-\theta _2}}{2(e^{\theta _1}+e^{\theta _2}+e^{-\theta _2})}. \end{aligned}$$

On the other hand, the e-projection \(P_e\) of \(P_\theta \) onto \({\mathcal {W}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {E}})\) is the unbiased lazy random walk given by \(P_e = P_{(\theta _1, 0)}\), i.e.

$$\begin{aligned} P_e(x,x) = \frac{e^{\theta _1}}{e^{\theta _1}+2},~~~ P_e(x,x+1) = P_e(x+1,x)= \frac{1}{e^{\theta _1}+2}. \end{aligned}$$

Remark 6

We observe that, although the m-projection preserves the stationary distribution, this is not true for \(P_e\), which exhibits a stationary distribution \(\pi _e\) generally different from \(\pi \). Furthermore, while the solution for the m-projection is always properly defined by taking the union of the edge sets, our expression for the e-projection requires additional constraints on the connection graph of P. Indeed, by taking the intersection \({\mathcal {E}} \cap {\mathcal {E}}^\star \), we always obtain a symmetric set, but we can lose strong connectedness. We note but do not pursue the fact that reversibility can be defined for the less well-behaved set of reducible chains. In this case, \(\pi \) need not be unique, or could take null values, and the kernel could have a complex spectrum.

Finally, we show that for any irreducible P, both its reversible projections \(P_m\) and \(P_e\) are equidistant from P and its time-reversal \(P^\star \) (see also Fig. 1).

Proposition 2

(Bisection property) Let P be irreducible, and let \(P_m\) (resp. \(P_e\)) be the m-projection (resp. e-projection) of P onto \({\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\).

$$\begin{aligned} \begin{aligned} D\left( P \parallel P_m\right) = D\left( P^\star \parallel P_m\right) , \qquad D\left( P_e \parallel P\right) = D\left( P_e \parallel P^\star \right) . \\ \end{aligned} \end{aligned}$$

Proof

For \(P_1\) irreducible over \(({\mathcal {X}}, {\mathcal {E}}_1)\) and \(P_2\) irreducible over \(({\mathcal {X}}, {\mathcal {E}}_2)\), it is easy to see that

$$\begin{aligned} {\mathcal {E}}_1 \subset {\mathcal {E}}_2 \implies D\left( P_1 \parallel P_2\right) = D\left( P_1^\star \parallel P_2^\star \right) . \end{aligned}$$

Then take \(P_2 = P_m\) for the first equality, and \(P_1 = P_e\) for the second. \(\square \)

7 The e-family of reversible edge measures

Recall that \({\mathcal {P}}({\mathcal {X}}^2)\), the set of all distributions over \({\mathcal {X}}^2\), forms an e-family [27, Example 2.8]. For some e-family of irreducible transition kernels \({\mathcal {V}}_e { \subset } {\mathcal {W}}({\mathcal {X}}, {\mathcal {X}}^2)\), one may wonder whether the corresponding family of edge measures also forms an e-family of distributions in \({\mathcal {P}}({\mathcal {X}}^2)\). We begin by illustrating that this holds in particular for the e-family obtained by tilting a memoryless Markov kernel.

Example 5

Consider the degenerate Markov kernel corresponding to an iid process \(P(x, x') = \pi (x')\) for \(\pi \in {\mathcal {P}}({\mathcal {X}})\). For a given function \(g :{\mathcal {X}} \rightarrow {\mathbb {R}}\), and \(\theta \in {\mathbb {R}}\), construct \({\widetilde{P}}_\theta (x, x') = P(x, x') e^{\theta g(x')} = \pi (x') e^{\theta g(x')}\). Then \(v_\theta = {\varvec{1}}\) is right eigenvector of \({\widetilde{P}}_\theta \) with eigenvalue \(\rho (\theta ) = \sum _{x' \in {\mathcal {X}}} \pi (x') e^{\theta g(x')}\). Letting \(\pi _\theta (x) = \pi (x) e^{\theta g(x)}/\rho (\theta )\), we see that \(\pi _\theta \) is the left PF eigenvector of \({\widetilde{P}}_\theta \), and the stationary distribution of the rescaled \(P_\theta \). We can therefore write,

$$\begin{aligned} Q_\theta (x,x') = \exp \left( \log \pi (x)\pi (x') + \theta (g(x) + g(x')) - 2 \log \rho (\theta ) \right) , \end{aligned}$$

thus \(\left\{ Q_\theta \right\} _{\theta \in \varTheta }\) forms an exponential family of distributions over \({\mathcal {X}}^2\). This fact can be further understood in the following manner. An e-family of distributions \(\left\{ \pi _\theta \right\} _{\theta }\) induces an e-family of memoryless Markov kernels \(\left\{ P_\theta \right\} _\theta \) with \(P_\theta (x,x') = \pi _\theta (x')\) (see Lemma 7 for a proof of this fact for the set of all memoryless kernels), and thus with edge measures \(Q_\theta (x,x') = \pi _\theta (x) \pi _\theta (x')\). Since the 2-iid extension \(\left\{ \pi _\theta (x)\pi _\theta (x') \right\} _\theta \) of the e-family \(\left\{ \pi _\theta \right\} _\theta \) is also an e-family, it follows that \(\left\{ Q_\theta (x,x') \right\} _\theta \) forms an e-family.

In the remainder of this section, we show that the subset of positive reversible edge measures \({\mathcal {Q}}_{\mathsf {rev} } = {\mathcal {Q}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {X}}^2)\), induced from the e-family of reversible positive kernels, forms a submanifold of \({\mathcal {P}}({\mathcal {X}}^2)\) that is autoparallel with respect to the e-connection, i.e. \({\mathcal {Q}}_{\mathsf {rev} }\) is an e-family of distributions over pairs. Our proof will rely on the definition of a Markov map.

Definition 6

(e.g. Nagaoka [47]) We say that \(M :{\mathcal {P}}({\mathcal {X}}) \rightarrow {\mathcal {P}}({\mathcal {Y}})\) is a Markov map, when there exists a transition kernel \(P_M\) from \({\mathcal {X}}\) to \({\mathcal {Y}}\) (also called a channel) such that for any \(\mu \in {\mathcal {P}}({\mathcal {X}})\),

$$\begin{aligned} M(\mu ) = \sum _{x \in {\mathcal {X}}} P_M(x, \cdot ) \mu (x). \end{aligned}$$

Let \({\mathcal {U}}\) and \({\mathcal {V}}\) be smooth submanifolds (statistical models) of \({\mathcal {P}}({\mathcal {X}})\) and \( {\mathcal {P}}({\mathcal {Y}})\) respectively. When there exists a pair of Markov maps \(M :{\mathcal {P}}({\mathcal {X}}) \rightarrow {\mathcal {P}}({\mathcal {Y}})\), \(N :{\mathcal {P}}({\mathcal {Y}}) \rightarrow {\mathcal {P}}({\mathcal {X}})\) such that their restrictions \(M|_{\mathcal {U}}\), \(N|_{\mathcal {V}}\) are bijections between \({\mathcal {U}}\) and \({\mathcal {V}}\), and are the inverse mappings of each other, we say that \({\mathcal {U}}\) and \({\mathcal {V}}\) are Markov equivalent, and write \({\mathcal {U}} \cong {\mathcal {V}}\).

Lemma 6

It holds that

$$\begin{aligned} {\mathcal {Q}}_{\mathsf {rev} } ({\mathcal {X}}, {\mathcal {X}}^2) \cong {\mathcal {P}}\left( \left[ \frac{ \left| {\mathcal {X}} \right| \left( \left| {\mathcal {X}} \right| + 1 \right) }{2} \right] \right) . \end{aligned}$$

Proof

Identify \({\mathcal {X}} = [m]\), and consider \(Q \in {\mathcal {Q}}_{\mathsf {rev} }\) such that

$$\begin{aligned} \begin{aligned} Q = \begin{pmatrix} \eta _{11} &{} \frac{\eta _{12}}{2} &{} \frac{\eta _{13}}{2} &{} \dots &{} \frac{\eta _{1m}}{2} \\ \frac{\eta _{12}}{2} &{} \eta _{22} &{} \frac{\eta _{23}}{2} &{} \dots &{} \frac{\eta _{2m}}{2} \\ \vdots &{} &{} \ddots &{} &{} \vdots \\ \vdots &{} &{} &{} \eta _{(m-1)(m-1)} &{} \frac{\eta _{(m-1)m}}{2} \\ \frac{\eta _{1m}}{2} &{} \frac{\eta _{2m}}{2} &{} \dots &{} \frac{\eta _{(m-1)m}}{2} &{} \eta _{mm} \\ \end{pmatrix} \end{aligned} \end{aligned}$$

and where \(\eta _{mm} = 1 - \sum _{i \le j, (i,j) \ne (m, m)} \eta _{ij}\). We flatten Q into a row vector,

$$\begin{aligned} Q = \left( \eta _{11}, \eta _{22}, \dots , \eta _{mm}, \frac{\eta _{12}}{2}, \frac{\eta _{12}}{2}, \dots , \underbrace{\frac{\eta _{ij}}{2}, \frac{\eta _{ij}}{2},}_{(i,j) :i < j} \dots , \frac{\eta _{(m-1)m}}{2}, \frac{\eta _{(m-1)m}}{2} \right) . \end{aligned}$$

Let the matrix E with \(m(m-1)/2\) columns and \(m(m-1)\) rows be such that,

$$\begin{aligned} \begin{aligned} E ^\intercal =\begin{pmatrix} 1 &{} 1 &{} 0 &{} 0 &{} \dots &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 1 &{} \dots &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{}0 &{} 0 &{} \dots &{} 1 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} \dots &{} 0 &{} 0 &{} 1 &{} 1 \\ \end{pmatrix} . \end{aligned} \end{aligned}$$

Block matrix multiplication yields

$$\begin{aligned} Q \begin{pmatrix} I_m &{} 0 \\ 0 &{} E \end{pmatrix} = \left( \eta _{11}, \eta _{22}, \dots , \eta _{mm}, \eta _{12}, \dots , \eta _{ij}, \dots , \eta _{(m-1)m} \right) \in {\mathcal {P}}\left( \left[ \frac{m(m+1)}{2} \right] \right) , \end{aligned}$$

and, further, for \(F = \frac{1}{2} E ^\intercal \), it holds that \(FE = I_{m(m-1)/2}\). Thus the mappings defined by \(\begin{pmatrix} I_m &{} 0 \\ 0 &{} E \end{pmatrix}\) and \(\begin{pmatrix} I_m &{} 0 \\ 0 &{} \frac{1}{2}E ^\intercal \end{pmatrix}\) are Markov maps and verify \(\begin{pmatrix} I_m &{} 0 \\ 0 &{} \frac{1}{2}E ^\intercal \end{pmatrix} \begin{pmatrix} I_m &{} 0 \\ 0 &{} E \end{pmatrix} = I_{m(m+1)/2}\). This finishes proving the claim. \(\square \)

Theorem 8

The set \({\mathcal {Q}}_{\mathsf {rev} } \) forms an e-family and an m-family of \({\mathcal {P}}({\mathcal {X}}^2)\) with dimension \(\left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| + 1)/2 - 1\). Moreover, \({\mathcal {Q}}\) does not form an e-family in \({\mathcal {P}}({\mathcal {X}}^2)\) (except when \(\left| {\mathcal {X}} \right| = 2\)).

Proof

Since \({\mathcal {Q}}_{\mathsf {rev} } { \subset } {\mathcal {P}}({\mathcal {X}}^2)\), the claim stems from the equivalence between (i) and (ii) of Nagaoka [47, Theorem 1], an application of Lemma 6, and the fact that \(\dim {\mathcal {P}} \left( \left[ \frac{\left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| + 1)}{2} \right] \right) = \frac{\left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| + 1)}{2} - 1\). In order to prove that \({\mathcal {Q}}\) is not an e-family in \({\mathcal {P}}({\mathcal {X}}^2)\), we first construct the following pair of edge measures over three states.

$$\begin{aligned} \begin{aligned} Q_0^{(3)} = \frac{1}{13}\begin{pmatrix} 1 &{} 1 &{} 2 \\ 2 &{} 1 &{} 2 \\ 1 &{} 3 &{} 1 \end{pmatrix}, \qquad Q_1^{(3)} = \frac{1}{13}\begin{pmatrix} 1 &{} 3 &{} 1 \\ 2 &{} 1 &{} 2 \\ 2 &{} 1 &{} 1 \end{pmatrix}. \end{aligned} \end{aligned}$$

Computing the point on the e-geodesic in \({\mathcal {P}}({\mathcal {X}}^2)\) at parameter value 1/2 yields

$$\begin{aligned} \begin{aligned} Q_{1/2}^{(3)} \propto \begin{pmatrix} 1 &{} \sqrt{3} &{} \sqrt{2} \\ 2 &{} 1 &{} 2 \\ \sqrt{2} &{} \sqrt{3} &{} 1 \end{pmatrix}, \end{aligned} \end{aligned}$$

which does not belong to \({\mathcal {Q}}\). We can readily expand the above example to general state space size, \(m > 3\), by considering the one-padded versions of the above \(Q_i^{(m)} \propto \begin{pmatrix} Q_i^{(3)} &{} {\varvec{1}}_3 ^\intercal {\varvec{1}}_{m - 3} \\ {\varvec{1}}_{m - 3} ^\intercal {\varvec{1}}_{3} &{} {\varvec{1}}_{m - 3} ^\intercal {\varvec{1}}_{m - 3} \end{pmatrix}\), for \(i \in \left\{ 0, 1 \right\} \). \(\square \)
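The counterexample above is immediate to reproduce; a minimal numpy sketch follows, computing the midpoint of the e-geodesic in \({\mathcal {P}}({\mathcal {X}}^2)\) and checking that its row and column marginals disagree, so that it cannot be an edge measure.

```python
import numpy as np

Q0 = np.array([[1, 1, 2], [2, 1, 2], [1, 3, 1]]) / 13
Q1 = np.array([[1, 3, 1], [2, 1, 2], [2, 1, 1]]) / 13

# midpoint of the e-geodesic in P(X^2): entrywise geometric mean, renormalized
Q_half = np.sqrt(Q0 * Q1)
Q_half /= Q_half.sum()

# an edge measure must have matching row and column marginals (both equal to pi)
print(Q_half.sum(axis=1))    # row marginals
print(Q_half.sum(axis=0))    # column marginals: they differ, so Q_half is not in Q
```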

Remark 7

  1. (i)

    Nagaoka [47, Theorem 1-(iv)], actually proves the stronger result that \({\mathcal {Q}}_{\mathsf {rev} }\) forms an \(\alpha \)-family in \({\mathcal {P}}({\mathcal {X}}^2)\), for any \(\alpha \in {\mathbb {R}}\) (see Amari and Nagaoka [27, Section 2.6] for a definition of \(\alpha \)-families).

  2. (ii)

    We note but do not pursue here the fact that a more refined treatment over some irreducible edge set \({\mathcal {E}} \subsetneq {\mathcal {X}}^2\) is possible.

8 Comparison of remarkable families of Markov chains

We briefly compare the geometric properties of reversible kernels with that of several other remarkable families of Markov chains, and compile a summary in Table 1.

Family of all kernels irreducible over \(({\mathcal {X}}, {\mathcal {E}})\): \({\mathcal {W}}({\mathcal {X}}, {\mathcal {E}})\). This family is known to form both an e-family and an m-family of dimension \(\left| {\mathcal {E}} \right| - \left| {\mathcal {X}} \right| \) [22, Corollary 1].

Table 1 Summary of geometric properties of submanifolds of irreducible Markov kernels (\(\left| {\mathcal {X}} \right| \ge 3\))

Family of all reversible kernels irreducible over \(({\mathcal {X}}, {\mathcal {E}})\): \({\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\). We show in Theorem 3 and Theorem 6 that \({\mathcal {W}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {E}})\) is both an e-family and an m-family of dimension \(\left| T({\mathcal {E}}) \right| \), where

$$\begin{aligned} \left| T({\mathcal {E}}) \right| = \frac{\left| {\mathcal {E}} \right| + \left| T_0({\mathcal {E}}) \right| }{2} - 1, \end{aligned}$$

with \(T_0({\mathcal {E}}) = \left\{ (x,x') \in {\mathcal {E}} :x = x' \right\} \).

Family of positive memoryless (iid) kernels: \({\mathcal {W}} _{\mathsf {iid}}({\mathcal {X}}, {\mathcal {X}}^2)\). This family comprises degenerate irreducible kernels that correspond to iid processes, i.e. where all rows are equal to the stationary distribution. Notice that for \(P \in {\mathcal {W}} _{\mathsf {iid}}\), irreducibility forces P to be positive. We show that \({\mathcal {W}} _{\mathsf {iid}}\) is an e-family of dimension \(\left| {\mathcal {X}} \right| - 1\) (Lemma 7), but not an m-family (Lemma 8).

Lemma 7

\({\mathcal {W}}_{\mathsf {iid} }\) forms an e-family of dimension \(\left| {\mathcal {X}} \right| - 1\).

Proof

For \({\mathcal {X}} = [m]\), let us consider the following parametrization proposed by Ito and Amari [17]:

$$\begin{aligned} \begin{aligned} \log P(x,x') =&\sum _{i = 1}^{m - 1} \log \frac{P(m, i)P(i, m)}{P(m, m)P(m, m)} \delta _i(x') \\&+ \sum _{i = 1}^{m-1} \sum _{j = 1}^{m-1} \log \frac{P(i, j) P(m,m)}{P(m,j)P(i,m)} \delta _i(x) \delta _j(x') \\&+ \log P(x, m) - \log P(x', m) + \log P(m, m). \end{aligned} \end{aligned}$$

This corresponds to the basis

$$\begin{aligned} \begin{aligned} g_i&= {\varvec{1}}^\intercal \delta _i, \qquad i \in [m-1], \\ g_{ij}&= \delta _i ^\intercal \delta _j, \qquad i,j \in [m-1], \\ \end{aligned} \end{aligned}$$

with parameters

$$\begin{aligned} \begin{aligned} \theta ^i&= \log \frac{P(m, i)P(i, m)}{P(m,m)P(m,m)}, \qquad \theta ^{ij} = \log \frac{P(i, j)P(m, m)}{P(m,j)P(i,m)}. \\ \end{aligned} \end{aligned}$$

Let P be irreducible with stationary distribution \(\pi \). Suppose first that P is memoryless, i.e. for all \(x, x' \in {\mathcal {X}}\), \(P(x,x') = \pi (x')\). In this case, for all \(i,j \in [m - 1]\), the coefficient \(\theta ^{ij}\) vanishes, and for all \(i \in [m - 1]\), it holds that \(\theta ^i = \log \left( \pi (i)/\pi (m) \right) \), so that we can write more simply

$$\begin{aligned} \begin{aligned} \log P(x,x') =&\sum _{i = 1}^{m - 1} \log \frac{\pi (i)}{\pi (m)} \delta _i(x') + \log \pi (m). \end{aligned} \end{aligned}$$
(15)

Conversely, now suppose that \(\theta ^{i j} = 0\) for any \(i,j \in [m - 1]\). Then the matrix

$$\begin{aligned} {\widetilde{P}}(x,x') = \exp \left( \sum _{i = 1}^{m - 1} \log \frac{P(m, i)P(i, m)}{P(m, m)P(m, m)} \delta _i(x') \right) \end{aligned}$$

has rank one, the right PF eigenvector is constant, and P is memoryless. As a result, \({\mathcal {W}} _{\mathsf {iid}}\) is the sub-family of \({\mathcal {W}}\) obtained by setting \(\theta ^{ij} = 0\) for every \(i,j \in [m - 1]\), hence an e-family of dimension \(m - 1\). \(\square \)

Lemma 8

\({\mathcal {W}}_{\mathsf {iid} }\) does not form an m-family.

Proof

We first prove the case \(\left| {\mathcal {X}} \right| = 2\). For \(p \in (0, 1)\) with \(p \ne 1/2\), consider

$$\begin{aligned} \begin{aligned} P_0 = \begin{pmatrix} p &{} 1 - p \\ p &{} 1 - p \end{pmatrix}, \qquad P_1 = \begin{pmatrix} 1 - p &{} p \\ 1 - p &{} p \end{pmatrix}. \end{aligned} \end{aligned}$$

Computing the corresponding edge measures,

$$\begin{aligned} \begin{aligned} Q_0 = \begin{pmatrix} p^2 &{} p(1 - p) \\ p(1 - p) &{} (1 - p)^2 \end{pmatrix}, \qquad Q_1 = \begin{pmatrix} (1 - p)^2 &{} p(1-p) \\ p(1 - p) &{} p^2 \end{pmatrix}. \end{aligned} \end{aligned}$$

But then if we let

$$\begin{aligned} \begin{aligned} Q_{1/2} = \frac{1}{2} Q_{0} + \frac{1}{2} Q_1 = \begin{pmatrix} \frac{1}{2}(p^2 + (1- p)^2) &{} p(1 - p) \\ p(1 - p) &{} \frac{1}{2}(p^2 + (1- p)^2) \end{pmatrix}, \end{aligned} \end{aligned}$$

we see that the stationary distribution is \(\pi _{1/2} = {\varvec{1}}/ 2\), and

$$\begin{aligned} P_{1/2} = \begin{pmatrix} p^2 + (1 - p)^2 &{} 2p(1-p) \\ 2p(1-p) &{} p^2 + (1 - p)^2 \end{pmatrix}. \end{aligned}$$

But since \(p \ne 1/2\), the two rows of \(P_{1/2}\) differ, hence \(P_{1/2}\) does not belong to \({\mathcal {W}} _{\mathsf {iid}}\), and the family is not an m-family. The proof can be extended to the more general \({\mathcal {X}} = [m], m >2 \) by considering instead the two kernels defined by \(\pi _{p} = ( p, 1 - p, 1, \dots , 1 )/(m - 1)\) and \(\pi _{1 - p}\) for \(p \in (0, 1), p \ne 1/2\). \(\square \)
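The two-state counterexample is likewise immediate to reproduce; a minimal numpy sketch follows (the value of p is an arbitrary illustrative choice).

```python
import numpy as np

p = 0.3
P0 = np.array([[p, 1 - p], [p, 1 - p]])          # two memoryless kernels
P1 = np.array([[1 - p, p], [1 - p, p]])

def edge_measure_iid(P):
    pi = P[0]                                    # for a memoryless kernel, pi is the common row
    return pi[:, None] * P

Q_half = (edge_measure_iid(P0) + edge_measure_iid(P1)) / 2
pi_half = Q_half.sum(axis=1)
P_half = Q_half / pi_half[:, None]
print(P_half)                                    # its two rows differ: P_half is not memoryless
```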

For simplicity, in the remainder of this section, we mostly consider the full support case.

Family of positive doubly-stochastic kernels: \({\mathcal {W}} _{\mathsf {bis}}({\mathcal {X}}, {\mathcal {X}}^2)\). Recall that a kernel P is said to be doubly-stochastic, or bi-stochastic, when P and \(P ^\intercal \) are both stochastic matrices. In this case, the stationary distribution is always uniform. It is known that the set of doubly stochastic Markov chains forms an m-family of dimension \((\left| {\mathcal {X}} \right| - 1)^2\) [10, Example 4]. However, as a consequence of Lemma 10, it does not form an e-family (except when \(\left| {\mathcal {X}} \right| = 2\)).

Family of positive symmetric kernels: \({\mathcal {W}} _{\mathsf {sym}}({\mathcal {X}}, {\mathcal {X}}^2)\). A Markov kernel is symmetric when \(P = P ^\intercal \), hence this family lies at the intersection between the reversible and doubly-stochastic families of Markov kernels, which are both m-families. This implies that symmetric kernels also form an m-family. In fact, Lemma 9 shows that the dimension of this family is \(\left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| - 1)/2\). Lemma 10, however, shows that \({\mathcal {W}} _{\mathsf {sym}}\) only forms an e-family for \(\left| {\mathcal {X}} \right| = 2\).

Lemma 9

\({\mathcal {W}}_{\mathsf {sym} }\) forms an m-family of dimension \(\left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| - 1)/2\).

Proof

To prove the claim, we will rely on Definition 1-(ii) of a mixture family. Consider the functions \(s_0 :{\mathcal {X}}^2 \rightarrow {\mathbb {R}}\) and \(s_{ij} :{\mathcal {X}}^2 \rightarrow {\mathbb {R}}\) for \(i,j \in {\mathcal {X}}, i > j\), such that for any \(x,x' \in {\mathcal {X}}\), \(s_0(x,x') = \delta _x(x')/\left| {\mathcal {X}} \right| \) and \(s_{ij} = \delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i - \delta _i ^\intercal \delta _i - \delta _j ^\intercal \delta _j\). Letting Q be the edge measure of some \(P \in {\mathcal {W}} _{\mathsf {sym}}\), we verify that for any \(x,x' \in {\mathcal {X}}\),

$$\begin{aligned} \begin{aligned} Q(x,x')&= s_0(x,x') + \sum _{\begin{array}{c} i,j \in {\mathcal {X}} \\ i > j \end{array} } s_{ij}(x,x') Q(i, j), \end{aligned} \end{aligned}$$
(16)

and moreover

$$\begin{aligned} \begin{aligned} \sum _{x,x' \in {\mathcal {X}}} s_0(x,x') = 1, \qquad \sum _{x,x' \in {\mathcal {X}}} s_{i j}(x,x') = 0, \forall i,j \in {\mathcal {X}}, i > j. \\ \end{aligned} \end{aligned}$$

It remains to show that the \(s_0, s_0 + s_{ij}\), for \(i > j\), are affinely independent, or equivalently, that the \(s_{ij}\), for \(i > j\), are linearly independent. Let \(s = \sum _{i > j} \alpha _{ij} s_{ij}\) with \(\alpha _{ij} \in {\mathbb {R}}\), for any \(i > j\), be such that \(s = 0\). For any \(i > j\), taking \(x = i, x'=j\) yields \(\alpha _{ij} = 0\), thus the family is independent, hence constitutes a basis, and the dimension is \(\left| \left\{ i,j \in {\mathcal {X}} :i > j \right\} \right| = \left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| - 1) / 2\). \(\square \)
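The representation (16) is easy to check numerically; the following sketch (assuming numpy; the symmetric kernel below is an illustrative choice) verifies it on a small symmetric chain.

```python
import numpy as np

m = 3
P = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.6, 0.2],
              [0.3, 0.2, 0.5]])     # a symmetric stochastic kernel
Q = P / m                           # its edge measure (the stationary law is uniform)

def s(i, j, m):
    # mixture generator attached to the pair (i, j), i > j
    M = np.zeros((m, m))
    M[i, j] += 1; M[j, i] += 1; M[i, i] -= 1; M[j, j] -= 1
    return M

rhs = np.eye(m) / m                 # s_0(x, x') = delta_x(x') / |X|
for i in range(m):
    for j in range(i):
        rhs += s(i, j, m) * Q[i, j]
print(np.allclose(Q, rhs))          # the affine representation (16) holds
```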

Lemma 10

For \(\left| {\mathcal {X}} \right| \ge 2\),

  1. (i)

    The set \({\mathcal {W}} _{\mathsf {sym}}\) does not form an e-family, unless \(\left| {\mathcal {X}} \right| = 2\).

  2. (ii)

    The set \({\mathcal {W}} _{\mathsf {bis}}\) does not form an e-family, unless \(\left| {\mathcal {X}} \right| = 2\).

Proof

We first treat the case \(\left| {\mathcal {X}} \right| = 2\) for (i) and (ii). Notice that

$$\begin{aligned} P_\theta = \begin{pmatrix} \frac{e^\theta }{1 + e^\theta } &{} \frac{1}{1 + e^\theta } \\ \frac{1}{1 + e^\theta } &{} \frac{e^\theta }{1 + e^\theta } \end{pmatrix} \end{aligned}$$

for \(\theta \in {\mathbb {R}}\) satisfies \(P_\theta \in {\mathcal {W}}_{\mathsf {sym}}\), and that the latter expression exhausts all irreducible symmetric chains. We can therefore write

$$\begin{aligned} {\mathcal {W}} _{\mathsf {sym}}= \left\{ P_\theta :P_\theta (x,x') = \exp \left( \delta _x(x') \theta - \log (e^\theta + 1) \right) , \theta \in {\mathbb {R}} \right\} , \end{aligned}$$

which follows the definition at (2) of an e-family with carrier kernel \(K = 0\), generator \(g(x,x') = \delta _x(x')\), natural parameter \(\theta \), \(R_\theta = 0\), and potential function \(\psi _\theta = \log (e^\theta + 1)\). Furthermore, for \(\left| {\mathcal {X}} \right| = 2\), it is easy to see that the symmetric and doubly-stochastic families coincide, hence \({\mathcal {W}} _{\mathsf {bis}}\) is also an e-family.

We now prove (i) for \(\left| {\mathcal {X}} \right| = 3\). We will consider two positive symmetric Markov kernels \(P_0\) and \(P_1\), and look at the e-geodesic

$$\begin{aligned} G_e(P_0, P_1) \triangleq \left\{ {\mathfrak {s}}\left( {\widetilde{P}}_\theta \right) : {\widetilde{P}}_\theta = P_0(x, x')^{1 - \theta } P_{1}(x, x')^\theta , \theta \in [0,1] \right\} , \end{aligned}$$

where the map \({\mathfrak {s}}\), defined in (3), enforces stochasticity. The matrix \(P_\theta = {\mathfrak {s}}({\widetilde{P}}_\theta )\) is symmetric if and only if the right PF eigenvector of \({\widetilde{P}}_\theta \) is constant. This, in turn, is equivalent to the row sums of \({\widetilde{P}}_\theta \) being all equal. Consider the two symmetric kernels

$$\begin{aligned} P_0 = \begin{pmatrix} \alpha &{} 2/3 - \alpha &{} 1/3 \\ 2/3 - \alpha &{} \alpha &{} 1/3 \\ 1/3 &{} 1/3 &{} 1/3 \end{pmatrix}, \; \; P_1 = \frac{1}{3}\begin{pmatrix} 1 &{} 1 &{} 1 \\ 1 &{} 1 &{} 1 \\ 1 &{} 1 &{} 1 \\ \end{pmatrix}, \end{aligned}$$

with free parameter \(\alpha \ne 1/3\), and let us inspect the curve at parameter \(\theta = 1/2\). For \(P_{1/2}\) to be symmetric, it is necessary that

$$\begin{aligned} \sqrt{\alpha } + \sqrt{2/3 - \alpha } = 2\sqrt{1/3}, \end{aligned}$$

whose unique solution is precisely \(\alpha = 1/3\). Invoking Nagaoka [22, Corollary 3] finishes proving (i) for \(\left| {\mathcal {X}} \right| = 3\). We extend the proof to \(\left| {\mathcal {X}} \right| \ge 4\) using the padding argument of Theorem 8, considering \(\frac{1}{m}\begin{pmatrix} 3 P_i &{} {\varvec{1}}^\intercal {\varvec{1}}\\ {\varvec{1}}^\intercal {\varvec{1}}&{} {\varvec{1}}^\intercal {\varvec{1}}\end{pmatrix}\). Suppose now for contradiction that (ii) is false, i.e. that bi-stochastic matrices form an e-family. Take then any e-geodesic between two arbitrary symmetric kernels. The latter operators being reversible, so is the geodesic. But a kernel that is both reversible and doubly stochastic has uniform stationary distribution, hence is symmetric; the geodesic must therefore be composed entirely of symmetric matrices, which contradicts (i). \(\square \)
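The failure of symmetry at the midpoint of the e-geodesic, on which the proof of (i) relies, can be observed directly; a minimal numpy sketch follows (the value \(\alpha = 1/2\) is an arbitrary choice with \(\alpha \ne 1/3\)).

```python
import numpy as np

def rescale(W):
    # stochastic rescaling s(W) by the PF root and the right PF eigenvector
    w, V = np.linalg.eig(W)
    k = np.argmax(np.real(w))
    rho, v = np.real(w[k]), np.abs(np.real(V[:, k]))
    return W * v[None, :] / (rho * v[:, None])

alpha = 0.5                                      # any alpha in (0, 2/3) with alpha != 1/3
P0 = np.array([[alpha, 2 / 3 - alpha, 1 / 3],
               [2 / 3 - alpha, alpha, 1 / 3],
               [1 / 3, 1 / 3, 1 / 3]])
P1 = np.full((3, 3), 1 / 3)

P_half = rescale(np.sqrt(P0 * P1))               # point at theta = 1/2 on the e-geodesic
print(np.allclose(P_half, P_half.T))             # False: the midpoint is not symmetric
```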

Remark 8

For \(\left| {\mathcal {X}} \right| \ge 3\), the following hierarchies hold:

$$\begin{aligned}&{\mathcal {W}} _{\mathsf {iid}}{\mathop {\subsetneq }\limits ^{\text {e-family}}} {\mathcal {W}}_{\mathsf {rev} } {\mathop {\subsetneq }\limits ^{\text {e-family}}} {\mathcal {W}},\\&{\mathcal {W}} _{\mathsf {sym}}{\mathop {\subsetneq }\limits ^{\text {m-family}}} {\mathcal {W}}_{\mathsf {rev} } {\mathop {\subsetneq }\limits ^{\text {m-family}}} {\mathcal {W}},\\&{\mathcal {W}} _{\mathsf {sym}}{\mathop {\subsetneq }\limits ^{\text {m-family}}} {\mathcal {W}} _{\mathsf {bis}}{\mathop {\subsetneq }\limits ^{\text {m-family}}} {\mathcal {W}}. \end{aligned}$$

9 Generation of the reversible family

In this final section, we consider the family of positive Markov kernels \({\mathcal {W}} = {\mathcal {W}}({\mathcal {X}},{\mathcal {X}}^2)\), i.e. where the support \({\mathcal {E}} = {\mathcal {X}}^2\). We first show that \({\mathcal {W}}_{\mathsf {rev} }\) is in a sense the smallest exponential family that contains \({\mathcal {W}} _{\mathsf {sym}}\), the family of symmetric Markov kernels. Our notion of minimality relies on the following definition of the exponential hull of some submanifold of \({\mathcal {W}}\).

Definition 7

(Exponential hull) Let \({\mathcal {V}} { \subset } {\mathcal {W}}\).

$$\begin{aligned} \begin{aligned} {{\,\mathrm{e-hull}\,}}({\mathcal {V}}) = \Bigg \{ {\mathfrak {s}}({\widetilde{P}})&:\log [{\widetilde{P}}] = \sum _{i = 1}^{k} \alpha _i \log [P_i ], \\&k \in {\mathbb {N}}, \alpha _1, \dots , \alpha _k \in {\mathbb {R}}, \sum _{i = 1}^{k} \alpha _i = 1, P_1, \dots P_k \in {\mathcal {V}} \Bigg \}, \end{aligned} \end{aligned}$$

where \({\mathfrak {s}}\) is defined in (3).

Remark: When \(U = \frac{1}{\left| {\mathcal {X}} \right| } {\varvec{1}}^\intercal {\varvec{1}}\in {\mathcal {V}}\), the constraint \(\sum _{i = 1}^{k} \alpha _i = 1\) is redundant. Indeed, since U corresponds to the origin in e-coordinates, the linear hull and affine hull coincide in this case.

By (10), the reversible family is generated by symmetric functions. Even though not all symmetric functions correspond to symmetric kernels, the reversible family is spanned by the symmetric kernels as follows.

Theorem 9

For \(\left| {\mathcal {X}} \right| \ge 3\), it holds that

$$\begin{aligned} {{\,\mathrm{e-hull}\,}}({\mathcal {W}}_{\mathsf {sym} }) = {\mathcal {W}}_{\mathsf {rev} } . \end{aligned}$$

Proof

We begin by proving the inclusion \({{\,\mathrm{e-hull}\,}}({\mathcal {W}} _{\mathsf {sym}}) { \subset } {\mathcal {W}}_{\mathsf {rev} }\). Let \(P \in {{\,\mathrm{e-hull}\,}}({\mathcal {W}} _{\mathsf {sym}})\), then there exists a positive \({\widetilde{P}} \in {\mathcal {F}}_+\), and \(k \in {\mathbb {N}}, \alpha _1, \dots , \alpha _k \in {\mathbb {R}}, P_1, \dots , P_k \in {\mathcal {W}} _{\mathsf {sym}}\) such that \(\log {\widetilde{P}}(x,x') = \sum _{i = 1}^{k} \alpha _i \log P_i(x,x')\). Observe that the function \(\sum _{i = 1}^{k} \alpha _i \log P_i(x,x')\) is symmetric in x and \(x'\), thus \(\log [{\widetilde{P}}]\) is a symmetric function, and \(P = {\mathfrak {s}}({\widetilde{P}})\) is reversible.

We now prove the second inclusion \({\mathcal {W}}_{\mathsf {rev} } { \subset } {{\,\mathrm{e-hull}\,}}({\mathcal {W}} _{\mathsf {sym}})\). We let

$$\begin{aligned} {\mathcal {H}} = {{\,\mathrm{span}\,}}\left( \left\{ \log [P] :P \in {\mathcal {W}}_{\mathsf {sym}} \right\} \cup {\mathcal {N}} \right) . \end{aligned}$$

Recall from Theorem 4 that the functions \(g_{ij} = \delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i\) , for \((i,j) \in T({\mathcal {X}}^2)\), form a basis of the quotient space \({\mathcal {G}}_{\mathsf {rev} }({\mathcal {X}}, {\mathcal {X}}^2)\). It suffices therefore to show that \(\left\{ g_{ij} :(i, j) \in T({\mathcal {X}}^2) \right\} { \subset } {\mathcal {H}}\). Introduce a free parameter \(t \in (0, 1), t \ne 1/2\), and let us fix \((i,j) \in T_+({\mathcal {X}}^2)\). Consider \(P_{ij,t} \in {\mathcal {W}} _{\mathsf {sym}}\) defined as follows

$$\begin{aligned} P_{ij,t}(x,x') \triangleq {\left\{ \begin{array}{ll} 2(1 - t)/\left| {\mathcal {X}} \right| &{} \text { if } (x,x') \in \left\{ (i,i), (j,j) \right\} , \\ 2t/\left| {\mathcal {X}} \right| &{} \text { if } (x,x') \in \left\{ (i,j), (j,i) \right\} , \\ 1/\left| {\mathcal {X}} \right| &{} \text { otherwise, } \end{array}\right. } \end{aligned}$$

and the functions \({\hat{h}}_{ij}, {\tilde{h}}_{ij}\)

$$\begin{aligned} \begin{aligned} {\hat{h}}_{ij}&= \log \left| {\mathcal {X}} \right| + \log \left[ P_{ij,t} \right] = a (\delta _i ^\intercal \delta _i + \delta _j ^\intercal \delta _j) + b (\delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i), \\ {\tilde{h}}_{ij}&= \log \left| {\mathcal {X}} \right| + \log \left[ P_{ij,1 - t} \right] = b (\delta _i ^\intercal \delta _i + \delta _j ^\intercal \delta _j) + a (\delta _i ^\intercal \delta _j + \delta _j ^\intercal \delta _i), \\ \end{aligned} \end{aligned}$$

where for simplicity we wrote \(a = \log 2(1 - t)\) and \(b = \log 2t \ne a\). Since the function \(((x, x') \mapsto \log \left| {\mathcal {X}} \right| ) \in {\mathcal {N}}\), we have \({\hat{h}}_{ij}, {\tilde{h}}_{ij} \in {\mathcal {H}}\). Notice that we can write

$$\begin{aligned} g_{ij} = \frac{b {\hat{h}}_{ij} - a {\tilde{h}}_{ij}}{b^2 - a^2}, \end{aligned}$$

hence also \(g_{ij} \in {\mathcal {H}}\). Introduce the function

$$\begin{aligned} h_{ij} = \frac{a {\hat{h}}_{ij} - b {\tilde{h}}_{ij}}{a^2 - b^2} = \delta _i ^\intercal \delta _i + \delta _j ^\intercal \delta _j \in {\mathcal {H}}, \end{aligned}$$

and observe that we can rewrite the identity \(I = {\varvec{1}}^\intercal {\varvec{1}}- \sum _{(i,j) \in T_+({\mathcal {X}}^2)} g_{ij}\) with \({\varvec{1}}^\intercal {\varvec{1}}\) being a constant function. It follows that \(I \in {\mathcal {H}}\), and for any \(j \in {\mathcal {X}}\), we can express

$$\begin{aligned} g_{jj} = \frac{2}{{ \left| {\mathcal {X}} \right| - 2}} \left( \sum _{\begin{array}{c} i \in {\mathcal {X}} \\ i > j \end{array}} h_{ij} + \sum _{\begin{array}{c} i \in {\mathcal {X}} \\ i < j \end{array}} h_{ji} - I \right) \in {\mathcal {H}}. \end{aligned}$$

As a result, \(\left\{ g_{ij} :(i,j) \in T({\mathcal {X}}^2) \right\} { \subset } {\mathcal {H}}\), and the theorem follows. \(\square \)
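The linear-algebraic step recovering \(g_{ij}\) and \(h_{ij}\) from the two symmetric kernels \(P_{ij,t}\) and \(P_{ij,1-t}\) can be checked numerically; a minimal numpy sketch follows (the values of m, t, i, j are arbitrary illustrative choices).

```python
import numpy as np

m, t = 4, 0.25                                   # |X| >= 3 and t != 1/2
a, b = np.log(2 * (1 - t)), np.log(2 * t)

def P_sym(i, j, t, m):
    # the symmetric kernel P_{ij,t} used in the proof
    P = np.full((m, m), 1 / m)
    P[i, i] = P[j, j] = 2 * (1 - t) / m
    P[i, j] = P[j, i] = 2 * t / m
    return P

i, j = 2, 0
h_hat = np.log(m) + np.log(P_sym(i, j, t, m))
h_til = np.log(m) + np.log(P_sym(i, j, 1 - t, m))

E = lambda k, l: np.outer(np.eye(m)[k], np.eye(m)[l])   # delta_k^T delta_l
g_ij = (b * h_hat - a * h_til) / (b ** 2 - a ** 2)
h_ij = (a * h_hat - b * h_til) / (a ** 2 - b ** 2)
print(np.allclose(g_ij, E(i, j) + E(j, i)))             # recovers g_{ij}
print(np.allclose(h_ij, E(i, i) + E(j, j)))             # recovers h_{ij}
```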

Remark 9

Observe that in the above proof, it is crucial that \(\left| {\mathcal {X}} \right| \ge 3\). For \(\left| {\mathcal {X}} \right| =2 \), we can only have \(h_{21} = \begin{pmatrix} 1 &{} 0 \\ 0 &{} 1 \end{pmatrix}\), and cannot construct \(g_{11} = \begin{pmatrix} 1 &{} 0 \\ 0 &{} 0 \end{pmatrix}\) nor \(g_{22} = \begin{pmatrix} 0 &{} 0 \\ 0 &{} 1 \end{pmatrix}\). This is consistent with the observation that for \(\left| {\mathcal {X}} \right| =2 \), \({\mathcal {W}} _{\mathsf {sym}}\) is itself an e-family (Lemma 10), so that \({{\,\mathrm{e-hull}\,}}({\mathcal {W}} _{\mathsf {sym}}) = {\mathcal {W}} _{\mathsf {sym}} \subsetneq {\mathcal {W}}_{\mathsf {rev} }\).

Secondly, we show that \({\mathcal {W}}_{\mathsf {rev} }\) is also the smallest mixture family that contains \({\mathcal {W}} _{\mathsf {iid}}\), the family of Markov kernels that correspond to iid processes. For this, we define minimality in terms of a mixture hull.

Definition 8

(Mixture hull) Let \({\mathcal {V}} { \subset } {\mathcal {W}}\).

$$\begin{aligned} \begin{aligned} {{\,\mathrm{m-hull}\,}}({\mathcal {V}}) = \Bigg \{ P&:Q \in {\mathcal {Q}}, Q = \sum _{i = 1}^{k} \alpha _i Q_i, \\&k \in {\mathbb {N}}, \alpha _1, \dots , \alpha _k \in {\mathbb {R}}, P_1, \dots , P_k \in {\mathcal {V}}\Bigg \}, \end{aligned} \end{aligned}$$

where Q (resp. \(Q_i\)) pertains to P (resp. \(P_i\)).

Theorem 10

It holds that

$$\begin{aligned} {{\,\mathrm{m-hull}\,}}({\mathcal {W}}_{\mathsf {iid} } ) = {\mathcal {W}}_{\mathsf {rev} }. \end{aligned}$$

Proof

Let \(P \in {{\,\mathrm{m-hull}\,}}({\mathcal {W}} _{\mathsf {iid}})\), then the corresponding edge measure can be expressed as a linear combination \(\sum _{i = 1}^{k} \alpha _i Q_i\), with \(k \in {\mathbb {N}}, \alpha _1, \dots , \alpha _k \in {\mathbb {R}}\), and where the \(Q_i\) pertain to some degenerate iid kernels \(P_i = {\varvec{1}}^\intercal \pi _i\). This implies that \(Q_i(x,x') = \pi _i(x) \pi _i(x')\), hence \(Q_i\) is symmetric. In turn, Q is symmetric, i.e. P is reversible, and \({{\,\mathrm{m-hull}\,}}({\mathcal {W}} _{\mathsf {iid}}) { \subset } {\mathcal {W}}_{\mathsf {rev} }\).

We now prove the converse inclusion \({\mathcal {W}}_{\mathsf {rev} } { \subset } {{\,\mathrm{m-hull}\,}}({\mathcal {W}} _{\mathsf {iid}})\). For \((i,j) \in {\mathcal {X}}^2\), \(i \ge j\), and \(\varepsilon \in [0, 1]\), consider the mixture distribution

$$\begin{aligned} \pi _{ij, \varepsilon } = \frac{\varepsilon }{\left| {\mathcal {X}} \right| } {\varvec{1}}+ (1 - \varepsilon ) \frac{\delta _i + \delta _j}{2} \in {\mathcal {P}}({\mathcal {X}}). \end{aligned}$$

A direct computation yields that the pair probabilities of the iid process can be written as

$$\begin{aligned} \begin{aligned} Q_{i j, \varepsilon }(x, x') =&\frac{\varepsilon ^2}{\left| {\mathcal {X}} \right| ^2} + \frac{\varepsilon (1 - \varepsilon )}{2 \left| {\mathcal {X}} \right| } \left\{ \delta _i(x) + \delta _j(x) + \delta _i(x') + \delta _j(x') \right\} + \frac{(1 - \varepsilon )^2}{4} \\&\times \left\{ \delta _i(x)\delta _i(x') + \delta _i(x)\delta _j(x') + \delta _j(x)\delta _i(x') + \delta _j(x)\delta _j(x')\right\} . \end{aligned} \end{aligned}$$

We first show that \(\left\{ Q_{ij,0} :i \ge j \right\} \) forms a basis of \({\mathcal {F}}_{\mathsf {sym}}\). Let \(\left\{ \alpha _{ij} \in {\mathbb {R}}:i \ge j \right\} \) be such that \(\sum _{i \ge j} \alpha _{ij} Q_{ij, 0} = 0\). Consider first \(x, x' \in {\mathcal {X}}\) such that \(x > x'\).

$$\begin{aligned} \sum _{i \ge j} \alpha _{ij} Q_{ij, 0}(x,x') = \frac{1}{4}\sum _{i \ge j} \alpha _{ij} \delta _i(x)\delta _j(x') = \frac{1}{4}\alpha _{xx'} = 0 \text { and } \alpha _{xx'} = 0. \end{aligned}$$

By a similar argument for the case \(x < x'\), we obtain that \(\alpha _{xx'} = 0\) for any \(x \ne x'\). Inspecting now the diagonal for \(x \in {\mathcal {X}}\),

$$\begin{aligned} \sum _{i \ge j} \alpha _{ij} Q_{ij, 0}(x,x) = \sum _{i \in {\mathcal {X}}} \alpha _{ii} Q_{ii, 0}(x,x) = \sum _{i \in {\mathcal {X}}} \alpha _{ii} \delta _i(x) = \alpha _{xx} = 0. \end{aligned}$$

This implies that the family \(\left\{ Q_{ij, 0} :i \ge j \right\} \) is independent. Since \(\dim {\mathcal {F}} _{\mathsf {sym}}= \left| {\mathcal {X}} \right| (\left| {\mathcal {X}} \right| + 1) /2 = \left| \left\{ Q_{ij, 0} :i \ge j \right\} \right| \), it is maximally so, thus forms a basis. However, the basis elements do not pertain to kernels in \({\mathcal {W}} _{\mathsf {iid}}\), since the distributions \(\pi _{ij, 0}\) are not positive. We therefore examine the case \(\varepsilon > 0\), and leverage the property that in normed vector spaces, finite linearly independent systems are stable under small perturbations (see Lemma 11, reported below for convenience) in order to exhibit a basis that does arise from \({\mathcal {W}} _{\mathsf {iid}}\).

Lemma 11

(Costara and Popa [48, p.9, Exercise 35]) Let \(n \in {\mathbb {N}}\), let X be a normed vector space, and let \(x_1, \dots , x_n\) be n linearly independent elements of X. Then there exists \(\eta > 0\) such that if \(y_1, y_2, \dots , y_n\) satisfy \(\left\| y_i \right\| < \eta \) for \(i = 1, \dots , n\), then \(x_1 + y_1, x_2 + y_2, \dots , x_n + y_n\) are also n linearly independent elements of X.

Let us consider \(({\mathcal {F}} _{\mathsf {sym}}, \left\| \cdot \right\| _{1,1})\), the space of real symmetric matrices equipped with the entry-wise \(\ell _1\) norm. For any \(i \ge j\) and for any \(\varepsilon \in (0, 1)\),

$$\begin{aligned}&\left\| Q_{ij, \varepsilon } - Q_{ij, 0} \right\| _{1,1} \triangleq \sum _{x,x' \in {\mathcal {X}}} \left| Q_{ij, \varepsilon }(x,x') - Q_{ij, 0}(x,x') \right| \\&\le \left| {\mathcal {X}} \right| ^2 \left| \frac{\varepsilon ^2}{\left| {\mathcal {X}} \right| ^2} \right| + \left| \frac{\varepsilon (1 - \varepsilon )}{2 \left| {\mathcal {X}} \right| } \right| \sum _{x,x'} \left| \delta _i(x) + \delta _j(x) + \delta _i(x') + \delta _j(x') \right| \\&+ \left| \frac{\varepsilon (2 - \varepsilon )}{4} \right| \sum _{x,x' \in {\mathcal {X}}} \left| \delta _i(x)\delta _i(x') + \delta _i(x)\delta _j(x') + \delta _j(x)\delta _i(x') + \delta _j(x)\delta _j(x') \right| \\&\le \varepsilon ^2 + 2 \varepsilon \left| {\mathcal {X}} \right| + 2 \varepsilon \left| {\mathcal {X}} \right| ^2, \end{aligned}$$

thus

$$\begin{aligned} \left\| Q_{ij, \varepsilon } - Q_{ij, 0} \right\| _{1,1} \le 5 \varepsilon \left| {\mathcal {X}} \right| ^2. \end{aligned}$$

Let \(\eta \) be as defined in Lemma 11, with respect to the basis \(\left\{ Q_{ij,0} :i \ge j \right\} \), and choose \(0< \varepsilon < \frac{\eta }{5 \left| {\mathcal {X}} \right| ^2}\). Then \(\left\| Q_{ij, \varepsilon } - Q_{ij, 0} \right\| _{1,1} < \eta \), thus the family \(\left\{ Q_{ij, \varepsilon } :i \ge j \right\} \) is also a basis of \({\mathcal {F}} _{\mathsf {sym}}\) whose elements pertain to kernels in \({\mathcal {W}} _{\mathsf {iid}}\), whence the theorem. \(\square \)