Segregating Markov chains

Dealing with finite Markov chains in discrete time, the focus often lies on convergence behavior and one tries to make different copies of the chain meet as fast as possible and then stick together. There is, however, a very peculiar kind of discrete finite Markov chain, for which two copies started in different states can be coupled to meet almost surely in finite time, yet their distributions keep a total variation distance bounded away from 0, even in the limit as time goes off to infinity. We show that the supremum of total variation distance kept in this context is $\frac12$.


Introduction
When the long time behavior of Markov chains is analyzed, one of the most common strategies is to couple several copies of the chain started in different states. In doing so, one standard approach is to define two copies of a Markov chain (started in different states) on a common probability space, correlated in such a way that they are likely to meet within some moderate time, and glue them together as soon as this happens.
This idea is so predominant that little attention has been directed away from such couplings; in the standard reference [6] it was even claimed erroneously that any coupling of two Markov chains with the same transition probabilities can be modified so that the two chains stay together at all times after their first simultaneous visit to a single state. A counterexample to this statement was given in [8]: If a coupling of two copies of the same Markov chain is changed in such a way that the second copy mimics the behavior of the first one once they meet, the altered individual process might no longer be a copy of the given chain.
For the sake of simplicity, we restrict our considerations to time-homogeneous Markov chains evolving in discrete time and on countable state spaces, except for Remark 1 and Theorem 4.3, where we discuss how the argument used to derive Theorem 4.1 applies to more general settings as well. So let $X = (X_n)_{n\in\mathbb{N}_0}$ denote a Markov chain on a countable state space $S$ with transition probabilities $\{P(r,s) = \mathbb{P}(X_{n+1} = s \mid X_n = r);\ r, s \in S,\ n \in \mathbb{N}_0\}$. While $\mathcal{L}(X_n)$ will be used as shorthand notation for the distribution of $X_n$ in general, we will denote the distribution of $X_n$ given $X_0 = x$, i.e. for a copy of the chain started in $x \in S$, by $P^n(x,\cdot)$.
In what follows, we want to describe and investigate the kind of Markov chain that was first introduced and analyzed by Häggström [3]: a chain in which two copies started in different states can be coupled such that they almost surely meet, but their distributions do not come arbitrarily close to one another with respect to total variation distance (cf. Definition 1). This phenomenon, which is somewhat counterintuitive in the light of the usual coupling constructions, will be referred to as segregation of two states. Further, we consider the constant
$$\kappa := \sup \lim_{n\to\infty} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV}, \qquad (1)$$
where the supremum is taken over finite Markov chain transition matrices $P$ and states $x$ and $y$, such that two copies of the chain corresponding to $P$, one started in $x$ and the other in $y$, can be coupled to meet a.s. in finite time. To put it briefly, the main result of this paper is that $\kappa$ equals $\frac12$. As a preparation, the second section deals with the concept of couplings in general and convergence of Markov chains. Much of this is standard, but there is also the lesser known but crucial distinction between Markovian and faithful couplings. Section 3 presents Häggström's result ($\kappa \ge 3 - 2\sqrt2$) and puts the idea of segregating Markov chains into a broader context.
In Section 4, more precisely in Theorem 4.1, we prove that the value $\frac12$ is an upper bound on $\kappa$.
In Section 5, a constructive and intuitively accessible example of a Markov chain is given that segregates two states such that the total variation distance kept can be pushed arbitrarily close to $\frac1e$. This improves on the example in [3] and serves as a warmup for the more technical and implicit construction in the last section.
Finally, in Section 6 we introduce and employ the idea of separation to show that for any $\varepsilon > 0$, there exist Markov chains segregating two states $x$ and $y$ such that copies started in these states can be coupled to meet almost surely while their distributions $P^n(x,\cdot)$ and $P^n(y,\cdot)$ have a total variation distance of at least $\frac12 - \varepsilon$ for all $n \in \mathbb{N}$, see Theorem 6.2. Together with the upper bound from Section 4, this establishes our main result, Theorem 6.1, stating that $\kappa = \frac12$.

Preliminaries: convergence and couplings
In order to quantify the difference between two probability measures (such as the distributions of two copies of a Markov chain at a fixed time) there are quite a few distance measures. The so-called total variation distance is among the most common ones.

Definition 1
Let $\mu$ and $\nu$ be two probability distributions on a countable set $S$. The total variation distance between the two measures is then defined as
$$\|\mu - \nu\|_{TV} := \sup_{A \subseteq S} |\mu(A) - \nu(A)|.$$
This notion of distance is used in most of the standard convergence theorems on finite Markov chains as well (e.g. see Thm. 4.9 in [6]): Suppose $(X_n)_{n\in\mathbb{N}_0}$ is an irreducible and aperiodic Markov chain on a finite state space $S$. Then there exists a unique limiting distribution $\pi$ on $S$, called the stationary distribution, as well as constants $\alpha \in (0,1)$ and $C > 0$ such that
$$\|P^n(x,\cdot) - \pi\|_{TV} \le C\,\alpha^n$$
for all $x \in S$, $n > 0$.
If the distribution of a Markov chain at time $n$ converges to the same distribution $\pi$ as $n$ tends to infinity, irrespective of its starting distribution, a standard way to measure the speed of convergence is the variation distance
$$d(n) := \sup_{x\in S} \|P^n(x,\cdot) - \pi\|_{TV}.$$
Sometimes it is more convenient, however, to consider the related function $\bar d(n) := \sup_{x,y\in S} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV}$.
Both these functions, $d$ and $\bar d$, are non-increasing in $n$ and $\bar d$ is in addition submultiplicative, i.e. $\bar d(m+n) \le \bar d(m)\cdot \bar d(n)$. Submultiplicativity need not hold for $d$, but can be verified for $2d$ instead. Furthermore, it holds that $d(n) \le \bar d(n) \le 2\,d(n)$. For proofs of the elementary facts just stated, we refer to Lemma 2.20 in [1]. Note that there $S$ is assumed to be finite, but the arguments immediately transfer to countable $S$.
On the basis of the notion of distance $d$, the central concept of mixing time is defined, loosely speaking, as the time it takes until the effect of the starting distribution has begun to disappear substantially.

Definition 2
Given a Markov chain for which the distribution of $X_n$ converges to a fixed distribution $\pi$ (irrespective of the distribution of $X_0$), define its mixing time by $t_{mix} := \min\{n \in \mathbb{N}_0;\ d(n) \le \frac14\}$.
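To make the quantities above concrete, the following minimal Python sketch evaluates the worst-case distance to stationarity, the worst-case distance between two starting states, and the mixing time for a small chain. The 3-state transition matrix, its stationary distribution and all helper names are hypothetical and chosen only for illustration; total variation distance is computed as half the $\ell^1$ distance, which agrees with the supremum over subsets on a countable space.

```python
def tv(mu, nu):
    """Total variation distance of two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

def n_step(P, n):
    """Return the n-step transition matrix P^n as a list of rows."""
    size = len(P)
    rows = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        rows = [[sum(row[k] * P[k][j] for k in range(size)) for j in range(size)]
                for row in rows]
    return rows

def d(P, pi, n):
    """Worst-case distance to the stationary distribution pi at time n."""
    return max(tv(row, pi) for row in n_step(P, n))

def d_bar(P, n):
    """Worst-case distance between two starting states at time n."""
    rows = n_step(P, n)
    return max(tv(r1, r2) for r1 in rows for r2 in rows)

def t_mix(P, pi, max_n=1000):
    """Smallest n with d(n) <= 1/4 (searched up to max_n)."""
    for n in range(max_n + 1):
        if d(P, pi, n) <= 0.25:
            return n
    raise ValueError("no mixing within max_n steps")

# Hypothetical irreducible, aperiodic example chain (an assumption, not from the text)
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]  # satisfies pi P = pi for this matrix
```

Calling `t_mix(P, pi)` on this toy chain, or comparing `d(P, pi, n)` with `d_bar(P, n)` for a few values of `n`, gives a quick numerical feel for the inequalities stated before Definition 2.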
As already mentioned, the tool that often makes proofs about convergence of Markov chains both short and elegant is the coupling approach. Let us therefore properly define this standard concept and then highlight which additional properties a coupling can have.

Definition 3
We define a coupling of two copies of a Markov chain on $S$ to be a process $((X_n, Y_n))_{n\in\mathbb{N}_0}$ on $S \times S$, with the property that both $(X_n)_{n\in\mathbb{N}_0}$ and $(Y_n)_{n\in\mathbb{N}_0}$ are Markov chains on $S$ with the same transition probabilities (but possibly different starting distributions).
If the process $((X_n, Y_n))_{n\in\mathbb{N}_0}$ is itself a Markov chain (not necessarily time-homogeneous), it is called a Markovian coupling.
In order to get good estimates on mixing times, it is often important to align the long-term behavior of the chain started in different states. To do so, one wants to make sure that the two coupled chains stay together once they meet, more precisely: if $X_m = Y_m$, then $X_n = Y_n$ for all $n \ge m$. Couplings with this property are sometimes called "sticky" couplings. As noted in the introduction, it is however not possible to modify every coupling in such a way that it becomes sticky by simply gluing together the two copies once they meet, see Prop. 3 in [8] for an example. The crucial property is the following:

Definition 4
A Markovian coupling $((X_n, Y_n))_{n\in\mathbb{N}_0}$ of two copies of a Markov chain with transition probabilities $\{P(r,s);\ r, s \in S\}$ is called faithful if
$$\mathbb{P}\big(X_{n+1} = x_{n+1} \mid (X_n, Y_n) = (x_n, y_n)\big) = P(x_n, x_{n+1}) \quad\text{and}\quad \mathbb{P}\big(Y_{n+1} = y_{n+1} \mid (X_n, Y_n) = (x_n, y_n)\big) = P(y_n, y_{n+1})$$
for all $x_n, x_{n+1}, y_n, y_{n+1} \in S$ and $n \in \mathbb{N}_0$.
It should be mentioned that the term "Markovian coupling" is used in [6] to describe what we just defined as faithful coupling. However, since we actually want to focus on couplings that are not faithful (but may still be Markov chains, as both the example in Section 5 and the one in [3] are), we make this distinction by adopting the definitions in [8] and deviate from the notions in [6].
In order to understand what makes faithful couplings special, note that in general a coupling of two copies $X$ and $Y$ of a Markov chain with transition probabilities $\{P(r,s);\ r, s \in S\}$ fulfills
$$\sum_{y_n\in S} \mathbb{P}\big(X_{n+1} = x_{n+1} \mid (X_n, Y_n) = (x_n, y_n)\big) \cdot \mathbb{P}(Y_n = y_n \mid X_n = x_n) = P(x_n, x_{n+1})$$
for all $x_n, x_{n+1} \in S$, $n \in \mathbb{N}_0$, and likewise
$$\sum_{x_n\in S} \mathbb{P}\big(Y_{n+1} = y_{n+1} \mid (X_n, Y_n) = (x_n, y_n)\big) \cdot \mathbb{P}(X_n = x_n \mid Y_n = y_n) = P(y_n, y_{n+1})$$
for all $y_n, y_{n+1} \in S$, $n \in \mathbb{N}_0$. So the extra condition on a faithful coupling amounts to $\mathbb{P}\big(X_{n+1} = x_{n+1} \mid (X_n, Y_n) = (x_n, y_n)\big)$ being constant in $y_n$ and $\mathbb{P}\big(Y_{n+1} = y_{n+1} \mid (X_n, Y_n) = (x_n, y_n)\big)$ being constant in $x_n$. It is immediate to check that any faithful coupling can indeed be transformed into a sticky coupling by just letting the chains run according to the given coupling until they meet and then running them together as two identical copies of the same chain, without affecting the marginals. Exploiting this fact leads to the estimate
$$\|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} \le \mathbb{P}(\tau > n) \qquad (2)$$
for any faithful coupling of two copies, $X$ started in $x$ and $Y$ started in $y$, where $\tau := \inf\{n \ge 0;\ X_n = Y_n\}$ is the first meeting time of the coupled chains (cf. Thm. 1 in [8]).
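The condition singled out above (the conditional law of each coordinate's next step, given the current pair, must be the chain's own transition row) can be checked mechanically for a time-homogeneous Markovian coupling on a finite state space. The sketch below is an illustration under that assumption; the kernel representation as nested dictionaries, the function name `is_faithful` and both example kernels are hypothetical and not taken from the paper.

```python
def is_faithful(P, Q, tol=1e-12):
    """Check the faithfulness condition for a joint transition kernel.

    P is the transition matrix of the single chain (list of rows).
    Q[(x, y)] maps successor pairs (x2, y2) to their joint probability.
    """
    states = range(len(P))
    for (x, y), row in Q.items():
        for s in states:
            # marginal probability that the first / second coordinate moves to s
            x_marg = sum(pr for (x2, _), pr in row.items() if x2 == s)
            y_marg = sum(pr for (_, y2), pr in row.items() if y2 == s)
            if abs(x_marg - P[x][s]) > tol or abs(y_marg - P[y][s]) > tol:
                return False
    return True

# Hypothetical two-state chain
P = [[0.7, 0.3], [0.4, 0.6]]

# Independent coupling: both coordinates move independently according to P
independent = {(x, y): {(a, b): P[x][a] * P[y][b] for a in range(2) for b in range(2)}
               for x in range(2) for y in range(2)}

# A kernel whose first coordinate, at the pair (0, 1), does not follow row 0 of P
skewed = {pair: dict(row) for pair, row in independent.items()}
skewed[(0, 1)] = {(0, 0): 0.4, (1, 1): 0.6}
```

Here `is_faithful(P, independent)` holds, while the `skewed` kernel fails the check at the pair `(0, 1)`. Whether such a kernel still defines a coupling depends on the joint law of the pair, which this check does not examine.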

Chains that meet and separate
If two copies of a Markov chain are coupled, but the coupling is not sticky, clearly they can meet in one state and separate afterwards. As mentioned above, if the coupling is not faithful (i.e. violates the conditions given in Definition 4), in some cases it cannot be transformed into a sticky coupling by simply letting the two copies coalesce once they meet. As a byproduct, Häggström [3] observed an even stronger form of incompatibility of two coupled copies of a chain that meet. He gives an example of a finite reducible Markov chain with the following property: Two copies of the chain, started in different states $x$ and $y$, can be coupled in such a way that they meet almost surely in finite time, while the total variation distance of their distributions never drops below a fixed positive value. More precisely, he shows (see Prop. 4.1 in [3]):
Proposition 3.1 There exists a finite state Markov chain such that for two of its states $x$ and $y$ we have that
$$\lim_{n\to\infty} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} > 0,$$
while on the other hand there exists a Markovian coupling of the chains $X = (X_n)_{n\in\mathbb{N}_0}$ and $Y = (Y_n)_{n\in\mathbb{N}_0}$, starting at $X_0 = x$ and $Y_0 = y$, with the property that their first meeting time $\tau = \inf\{n \ge 0;\ X_n = Y_n\}$ is finite with probability 1.
Note that for any Markov chain and any two states $x$ and $y$, the sequence $\big(\|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV}\big)_{n\in\mathbb{N}_0}$ is non-increasing. This, together with the fact that the total variation distance is always non-negative, guarantees the existence of $\lim_{n\to\infty} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV}$.
In fact, the reducible Markov chain in the example given in [3] comprises only 6 states (see Figure 1). For $p \in \big[1 - \frac{1}{\sqrt2}, \frac{1}{\sqrt2}\big]$, two copies started in $x$ and $y$ can be coupled such that their first meeting time is a.s. less than or equal to 2 (for the explicit calculations, see Prop. 4.1 in [3]). The copies will reach one of the two absorbing states ($a$ and $b$) after two steps and the probability that the chain started in $x$ lands in $a$ is $1 - 2p(1-p)$, in $b$ accordingly $2p(1-p)$. By symmetry, for the chain started in $y$ it is precisely reversed. So for $n \ge 2$, $P^n(x,\cdot)$ and $P^n(y,\cdot)$ are unchanging and different if $p \ne \frac12$. Choosing $p = \frac12\sqrt2$ (or $p = 1 - \frac12\sqrt2$) maximizes their total variation distance at $3 - 2\sqrt2 \approx 0.17157$. As mentioned in the introduction, we will say that a Markov chain with the property described in Proposition 3.1 segregates the states $x$ and $y$. From the convergence theorem we know that such things cannot happen for irreducible finite Markov chains (not even periodic ones). In addition to that, even if the chain is either reducible or infinite, a coupling that lets two copies started in different states meet almost surely, while the total variation distance of their distributions stays bounded away from 0 for all time, cannot be faithful due to the coupling inequality (2).
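The numbers in this example are easy to reproduce. The sketch below uses only what the text states (landing probabilities $1 - 2p(1-p)$ and $2p(1-p)$, mirrored for the second starting state); the function names are hypothetical. The admissible range for $p$ is taken here to be the set where $2\max(p, 1-p)^2 \le 1$, which is an assumption based on the two-state discussion in Section 6 rather than on the calculation in [3].

```python
import math

def two_step_tv(p):
    """TV distance between the two limiting laws on the absorbing states {a, b}."""
    land_in_a_from_x = 1 - 2 * p * (1 - p)
    land_in_a_from_y = 2 * p * (1 - p)  # mirrored by symmetry
    # on a two-point space the TV distance is the difference in one coordinate
    return abs(land_in_a_from_x - land_in_a_from_y)

def can_meet_within_two_steps(p):
    """Assumed meeting criterion: 2 * max(p, 1 - p)^2 <= 1 (see lead-in)."""
    return 2 * max(p, 1 - p) ** 2 <= 1 + 1e-12

def best_p_on_grid(steps=10_000):
    """Grid search for the admissible p with the largest TV distance."""
    grid = [i / steps for i in range(steps + 1)]
    admissible = [p for p in grid if can_meet_within_two_steps(p)]
    return max(admissible, key=two_step_tv)
```

Evaluating `two_step_tv(1 / math.sqrt(2))` returns $(1 - \sqrt2)^2 = 3 - 2\sqrt2$, and the grid search ends up at one of the two endpoints of the admissible interval.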

The upper bound
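From the previous section we know that there exist finite reducible Markov chains that segregate two states $x$ and $y$. A natural question in this respect is how large a total variation distance between $P^n(x,\cdot)$ and $P^n(y,\cdot)$ can be kept, under the condition that two copies started in $x$ and $y$ respectively can be coupled to meet in finite time almost surely; in other words, the value of $\kappa$ as defined in (1). The example in [3] shows $\kappa \ge 3 - 2\sqrt2$; the following theorem establishes $\frac12$ as an upper bound.
Theorem 4.1 Consider a Markov chain on the countable state space $S$ and two fixed states $x$ and $y$. Further, we denote by $X = (X_n)_{n\in\mathbb{N}_0}$ and $Y = (Y_n)_{n\in\mathbb{N}_0}$ two coupled copies of the chain, started in $x$ and $y$ respectively, and their first meeting time by $\tau$. If $X$ and $Y$ can be coupled in such a way that $\mathbb{P}(\tau < \infty) = 1$, it holds that
$$\lim_{n\to\infty} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} \le \tfrac12.$$
This result is an immediate implication of the following proposition:
Proposition 4.2 Let $X$, $Y$ and $\tau$ be as above. Then it follows that
$$\|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} \le 1 - \tfrac12\,\mathbb{P}(\tau \le n) \quad\text{for all } n \in \mathbb{N}_0.$$
Proof: Fix $n \in \mathbb{N}_0$ and a subset $A \subseteq S$ by means of which we define the processes $(M_t)_{t=0}^n$ and $(N_t)_{t=0}^n$ given by
$$M_t := P^{n-t}(X_t, A) \quad\text{and}\quad N_t := P^{n-t}(Y_t, A).$$
It is easily checked that these processes are martingales (with respect to the filtrations generated by $X$ and $Y$ respectively). Further let $B_x$ and $B_y$ denote the events that $M_t \ge \frac12$ and $N_t < \frac12$ respectively for all $0 \le t \le n$. As the event $B_x \cap B_y$ implies $M_t \ne N_t$ and with that $X_t \ne Y_t$ for all $0 \le t \le n$ (almost surely), it follows that $\{\tau \le n\}$ is (up to a nullset) contained in the union of $B_x^c$ and $B_y^c$. Next, we define
$$\tau_x := \inf\{0 \le t \le n;\ M_t < \tfrac12\} \quad\text{and}\quad \tau_y := \inf\{0 \le t \le n;\ N_t \ge \tfrac12\},$$
where the infimum is understood to be $n$ if the corresponding set is empty. Note that $\tau_x$ and $\tau_y$ are stopping times for $M_t$ and $N_t$ respectively. Since $(M_t)_{t=0}^n$ and $(N_t)_{t=0}^n$ are bounded martingales, the Optional Stopping Theorem (see for example Cor. 17.7 in [6]) gives the estimates
$$P^n(x, A) = \mathbb{E}\,M_0 = \mathbb{E}\,M_{\tau_x} \le \mathbb{P}(B_x) + \tfrac12\cdot\mathbb{P}(B_x^c) = 1 - \tfrac12\cdot\mathbb{P}(B_x^c)$$
and
$$P^n(y, A) = \mathbb{E}\,N_0 = \mathbb{E}\,N_{\tau_y} \ge \tfrac12\cdot\mathbb{P}(B_y^c).$$
Combining these two inequalities, we get
$$P^n(x, A) - P^n(y, A) \le 1 - \tfrac12\big(\mathbb{P}(B_x^c) + \mathbb{P}(B_y^c)\big) \le 1 - \tfrac12\,\mathbb{P}(B_x^c \cup B_y^c) \le 1 - \tfrac12\,\mathbb{P}(\tau \le n).$$
Finally, maximizing the left-hand side over all subsets $A \subseteq S$ yields
$$\|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} \le 1 - \tfrac12\,\mathbb{P}(\tau \le n).$$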
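The step "the process is a martingale, so its expectation at any time equals $P^n(x, A)$" can be checked numerically on a small example. The sketch below assumes the process in the proof is $M_t = P^{n-t}(X_t, A)$, the natural reading of Remark 1, and computes $\mathbb{E}\,M_t$ for every $t$ by combining the law of $X_t$ with the vector of $(n-t)$-step probabilities of landing in $A$. The chain, the set $A$ and the function names are hypothetical.

```python
def law_at(P, x, t):
    """Distribution of X_t given X_0 = x, as a list over states."""
    size = len(P)
    mu = [1.0 if s == x else 0.0 for s in range(size)]
    for _ in range(t):
        mu = [sum(mu[r] * P[r][s] for r in range(size)) for s in range(size)]
    return mu

def prob_into(P, A, k):
    """Vector h with h[r] = P^k(r, A)."""
    size = len(P)
    h = [1.0 if s in A else 0.0 for s in range(size)]
    for _ in range(k):
        h = [sum(P[r][s] * h[s] for s in range(size)) for r in range(size)]
    return h

def expected_M(P, x, A, n, t):
    """E[M_t] for M_t = P^{n-t}(X_t, A) and X_0 = x."""
    mu = law_at(P, x, t)
    h = prob_into(P, A, n - t)
    return sum(m * v for m, v in zip(mu, h))

# Hypothetical example data
P = [[0.6, 0.4, 0.0],
     [0.1, 0.5, 0.4],
     [0.0, 0.3, 0.7]]
A = {0, 2}
values = [expected_M(P, 0, A, 5, t) for t in range(6)]
```

All entries of `values` coincide up to rounding, and `values[0]` is $P^5(0, A)$; this constancy of the expectation is what the Optional Stopping Theorem extends to the stopping times used in the proof.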

Remark 1
Reading carefully through the proof of Proposition 4.2, one may notice that the martingale argument used essentially does not require our general assumptions of time-homogeneity and countable state space. For time-inhomogeneous chains, $M_t := \mathbb{P}(X_n \in A \mid X_t)$ can no longer be written as $P^{n-t}(X_t, A)$ (likewise for $N_t$), but this does not impair the argument and we can again conclude
$$\|\mathcal{L}(X_n) - \mathcal{L}(Y_n)\|_{TV} \le 1 - \tfrac12\,\mathbb{P}(\tau \le n),$$
where $\mathcal{L}(X_n)$ denotes the distribution of $X_n$. Given an uncountable state space, the first meeting time $\tau$ is no longer measurable by default. If we add this as an extra condition, however, the above proof (with the minor modification that only measurable sets $A$ are considered) extends to this setting as well.
To fully exhaust the range of validity of this argument, let us leave the default preconditions for a moment and consider continuous-time Markov processes in full generality (as e.g. Kallenberg [5] defines them). In this case, we have to add further technical assumptions to save the line of reasoning and result of Proposition 4.2. For a measurable set $A \subseteq S$, time horizon $T > 0$ and $0 \le t \le T$, define (similar to the above)
$$M_t := \mathbb{P}(X_T \in A \mid X_t) \quad\text{and}\quad N_t := \mathbb{P}(Y_T \in A \mid Y_t). \qquad (4)$$

Theorem 4.3
Consider a continuous-time Markov process (not necessarily time-homogeneous) with general state space S. Let X = (X t ) t≥0 and Y = (Y t ) t≥0 denote two coupled copies of the process, that are started in fixed states x and y respectively, and let τ denote their first meeting time. Fix a time horizon T > 0 and assume that {τ ≤ T } is measurable. If for all measurable sets A ⊆ S it is possible to choose versions of the martingales (M t ) t∈[0,T ] and (N t ) t∈[0,T ] , as defined in (4), that are a.s. continuous from the right, while having the property that for all Proof: Again, we fix some measurable subset A ⊆ S and define the martingales (4), with the two additional properties stated in the theorem. Following the proof of Proposition 4.2 (literally, besides replacing n by T ) we can still conclude that Using the Optional Stopping Theorem for continuous-time martingales (see e.g. Thm. (3.2) in [7]) we can conclude just as above.

Remark 2
A simple way to ensure the assumed properties of (M t ) t∈[0,T ] and (N t ) t∈[0,T ] is to consider a topology on S and to require two things: first, that the Markov process a.s. has right-continuous sample paths and second that the transition probabilities P( As an aside, it might be worth noting that the analogue of Proposition 4.2 is not valid if we drop the additional assumptions completely, as it might be possible then to alter trajectories without changing transition probabilities. For instance, consider the process where ξ ∼ unif(0, 1).
Two independent copies, started at 0 and 1 respectively and using independent copies of ξ, will almost surely meet, but the total variation distance of their distributions stays 1 for all t ∈ [0, 1].
Besides these generalizations, the statement from Proposition 4.2 can also be used to get upper bounds on mixing times (similar to the usual approach, see for example Cor. 5.3 in [6]), replacing the coupling inequality (2) as starting point. In doing so, we pay by an additional factor $\frac12$ in front of $\mathbb{P}(\tau \le n)$, but can in return employ any kind of coupling, not only faithful ones. It remains to be seen whether this will ever turn out useful in practice, as basically all standard coupling constructions are faithful. However, we want to mention at this point that non-Markovian couplings have already proved to be useful in applications, cf. [4] for instance.

Proposition 4.4
Consider a Markov chain X = (X n ) n∈N0 with the property that L(X n ) converges to a fixed distribution π irrespectively of the starting distribution. Further, suppose that for some α ∈ (0, 1] and each pair of states x, y ∈ S there exists a (not necessarily faithful or even Markovian) coupling ((X n , Y n )) n∈N0 of two copies of the chain started in x and y respectively, such that the first meeting time τ of the two coupled processes fulfills P(τ ≤ n) ≥ α. Then .
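Proof: From Proposition 4.2 we can conclude that $\bar d(n) \le 1 - \frac{\alpha}{2}$. Consequently, as $\bar d$ is submultiplicative and dominates $d$, we get for any $k \in \mathbb{N}$:
$$d(kn) \le \bar d(kn) \le \bar d(n)^k \le \big(1 - \tfrac{\alpha}{2}\big)^k.$$
For $k$ large enough that $\big(1 - \frac{\alpha}{2}\big)^k \le \frac14$, this estimate becomes $d(kn) \le \frac14$, i.e. $t_{mix} \le kn$.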
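The arithmetic behind this kind of bound can be sketched as follows. The sketch assumes the reading that Proposition 4.2 gives a worst-case total variation distance of at most $1 - \frac{\alpha}{2}$ at time $n$, and that submultiplicativity then gives at most $(1 - \frac{\alpha}{2})^k$ at time $kn$; the function names are hypothetical and the resulting number is only the bound that this reading produces, not necessarily the exact expression in the proposition.

```python
def repetitions_needed(alpha):
    """Smallest k with (1 - alpha/2)^k <= 1/4, for alpha in (0, 1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    k = 1
    while (1 - alpha / 2) ** k > 0.25:
        k += 1
    return k

def mixing_time_bound(alpha, n):
    """Upper bound k * n on the mixing time under the assumed reading."""
    return repetitions_needed(alpha) * n
```

For instance, a coupling that meets within `n` steps with probability at least one half needs five repetitions of the time window under this reading, since $0.75^4 > \frac14 \ge 0.75^5$.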

A simple example that narrows the gap
In this section, we will present another finite state Markov chain that improves on the value of $3 - 2\sqrt2$ established by Häggström [3]. To begin with, let us prepare a lemma, which will be useful when the total variation distance in our example of a finite segregating Markov chain is to be assessed.
Consider a sequence of independent Bernoulli trials, each with success probability p < 1. The distribution of the number of successful attempts until r failures have occurred is called the negative binomial distribution with parameters r and p and commonly denoted by NB(r, p).

Lemma 5.1
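For $\mu := NB(1, p)$ and $\nu := NB(2, p)$, it holds that
$$\|\mu - \nu\|_{TV} = \Big(\Big\lfloor \tfrac{p}{1-p} \Big\rfloor + 1\Big)\,(1-p)\; p^{\lfloor p/(1-p) \rfloor + 1}.$$
In particular, for $p = \frac{m}{m+1}$ with $m \in \mathbb{N}_0$, we get $\|\mu - \nu\|_{TV} = p^{m+1} = \big(1 - \frac{1}{m+1}\big)^{m+1}$.
Proof: A standard calculation shows that for two probability distributions $\mu$ and $\nu$ on a discrete space $S$, their total variation distance can be calculated as
$$\|\mu - \nu\|_{TV} = \sum_{s\in S:\ \mu(s) \ge \nu(s)} \big(\mu(s) - \nu(s)\big),$$
see for example Remark 4.3 in [6]. For $\mu = NB(1, p)$ and $\nu = NB(2, p)$, we have
$$\mu(k) = (1-p)\,p^k \quad\text{and}\quad \nu(k) = (k+1)\,(1-p)^2\,p^k \quad\text{for } k \in \mathbb{N}_0,$$
so that $\mu(k) \ge \nu(k)$ if and only if $k \le \frac{p}{1-p}$. Consequently, using the well-known formulas for a finite geometric sum and its derivative, we can compute the total variation distance and get
$$\|\mu - \nu\|_{TV} = \sum_{k=0}^{\lfloor p/(1-p) \rfloor} \big((1-p)\,p^k - (k+1)\,(1-p)^2\,p^k\big) = \Big(\Big\lfloor \tfrac{p}{1-p} \Big\rfloor + 1\Big)\,(1-p)\; p^{\lfloor p/(1-p) \rfloor + 1}. \qquad (5)$$
If $p = \frac{m}{m+1}$ for some $m \in \mathbb{N}_0$, the number $\frac{p}{1-p} = m$ is an integer and an elementary simplification of the expression to the right in (5) verifies the final claim.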
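A direct numerical check of the lemma is straightforward. The sketch below sums the positive part of $\mu - \nu$ term by term and compares it with a closed form; the closed form used, $(\lfloor p/(1-p) \rfloor + 1)(1-p)\,p^{\lfloor p/(1-p) \rfloor + 1}$, is what one obtains by carrying out the geometric sums, and is stated here as a derived assumption rather than quoted from the lemma. Function names are hypothetical.

```python
import math

def nb_pmf(r, p, k):
    """P(Z = k) for Z ~ NB(r, p): k successes before the r-th failure."""
    return math.comb(k + r - 1, k) * p ** k * (1 - p) ** r

def tv_direct(p, cutoff=2000):
    """Sum of the positive part of NB(1, p) - NB(2, p) over k < cutoff."""
    return sum(max(nb_pmf(1, p, k) - nb_pmf(2, p, k), 0.0) for k in range(cutoff))

def tv_closed_form(p):
    """Closed form obtained from the geometric sums (derived, see lead-in)."""
    K = math.floor(p / (1 - p))
    return (K + 1) * (1 - p) * p ** (K + 1)
```

The positive terms only occur for $k \le p/(1-p)$, so the cutoff is harmless as long as it exceeds that value. For $p = \frac{m}{m+1}$ both functions agree with $p^{m+1}$ up to rounding, even if floating point arithmetic puts $p/(1-p)$ slightly below $m$, because the term at $k = m$ vanishes.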
Let us now use this lemma to establish the following result: For all $\varepsilon > 0$, there exists a Markov chain segregating two states $x$ and $y$ such that
$$\lim_{n\to\infty} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} \ge \tfrac1e - \varepsilon.$$
Proof: Let us consider the finite reducible MC depicted in Figure 2. The state space $S$ comprises $3m + 5$ states, among which are the two initial states $x$ and $y$ as well as the $m + 2$ absorbing states labeled $0, \dots, m$ and $>$.
[Figure 2: the finite reducible Markov chain with initial states $x$ and $y$ and absorbing states $0, \dots, m$ and $>$.]
It is immediate to check that the two copies, started in $x$ and $y$ respectively, will hit an absorbing state after at most $m + 2$ steps and that on $\{0, \dots, m\}$ the distributions $P^{m+2}(x,\cdot)$ and $P^{m+2}(y,\cdot)$ coincide with $NB(1, p)$ and $NB(2, p)$ respectively. The probabilities to land in the state labeled $>$ are $P^{m+2}(x, >) = \mathbb{P}(Z_1 > m)$ and $P^{m+2}(y, >) = \mathbb{P}(Z_2 > m)$, where $Z_1 \sim NB(1, p)$ and $Z_2 \sim NB(2, p)$. Choosing $p = \frac{m}{m+1}$, we can conclude from Lemma 5.1 that $\|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV} = p^{m+1}$ for all $n \ge m + 2$. Taking $m$ large enough, more precisely such that $\big(1 - \frac{1}{m+1}\big)^{m+1} > \frac1e - \varepsilon$, will establish the claim if we can present a coupling that ensures that the two copies started in $x$ and $y$ will meet with probability 1, either before or when they hit an absorbing state.
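How large $m$ has to be for a given $\varepsilon$ can be read off numerically. The following sketch searches for the smallest $m$ with $\big(1 - \frac{1}{m+1}\big)^{m+1} > \frac1e - \varepsilon$; the function name is hypothetical, and the search relies on the standard fact that this sequence increases towards $\frac1e$.

```python
import math

def smallest_m(eps):
    """Smallest m >= 1 with (1 - 1/(m+1))^(m+1) > 1/e - eps, for eps > 0."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    m = 1
    while (1 - 1 / (m + 1)) ** (m + 1) <= 1 / math.e - eps:
        m += 1
    return m
```

Since the chain in this construction has $3m + 5$ states, the returned value also indicates how the size of the example grows as $\varepsilon$ shrinks.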
In order to establish such a coupling, let $Y = (Y_n)_{n\in\mathbb{N}_0}$ be a copy of the MC started in $y$. The copy $X = (X_n)_{n\in\mathbb{N}_0}$ started in $x$ mimics all movements of $Y$, with the delay of one step, until it finally hits an absorbing state: First, it will move downwards until the two processes meet; in particular this implies that its first step is downwards with probability 1, as $x \ne y$. Then, once $X_n = Y_n$ for some $1 \le n \le m + 1$, the next step of the process $X$ is to move to the right to an absorbing state, i.e. $X_{n+1} = n - 1$. If $Y$ never moves to the right, neither does $X$ and both finally end up in the state $>$.
First of all, we need to check whether the two coordinate processes are indeed copies of the MC given in Figure 2: It is obvious from our construction that it suffices to verify this for the process $X$. The way $X$ is defined (to move downwards in the first step and then always imitate the previous move of $Y$ until ending up in an absorbing state) gives the right marginals due to the structure of the MC: As all the non-absorbing states apart from $x$ have the same transition probabilities ($p$ downwards and $1 - p$ to the right), $X$ indeed performs a random walk on the graph in Figure 2 according to the transition probabilities of the MC. Note that there is just this one way $Y$ can end up in an absorbing state before it meets $X$, namely if it moves downwards only. Then, however, $X$ copies this behavior and ends up in the state $>$ as well, so the coupling guarantees $\tau \le m + 2$ with probability 1. This trivially implies the almost sure finiteness of the first meeting time $\tau$ and in conclusion the claim that the MC segregates the two states $x$ and $y$.
To see how faithfulness plays a crucial role in this context, it is worth noting that the coupling in our example, like the one in [3], is in fact Markovian, but not faithful. Such couplings have to be non-faithful, as faithfulness would imply that the total variation distance necessarily tends to 0, as already mentioned at the end of Section 3.

Closing the gap
In this last section, we want to further improve the lower bound $\frac1e$, established by the example from the previous section, in order to determine the true value of the constant $\kappa$, defined in (1). These efforts amount to the following:
Theorem 6.1 The value of $\kappa$, denoting the supremum of $\lim_{n\to\infty} \|P^n(x,\cdot) - P^n(y,\cdot)\|_{TV}$ taken over all segregating Markov chains and segregated states $x$ and $y$, is $\frac12$.
In view of Theorem 4.1, the final step to derive this result lies in proving that for any $\varepsilon > 0$, a value of at least $\frac12 - \varepsilon$ can actually be attained. In order to do so, we will focus on (reducible) Markov chains with a specific structure which allows us to consider even simpler chains in finite time instead: When talking about a Markov chain with finite time horizon $X = (X_t)_{t=0}^T$ on state space $S$ with transition probabilities $\{P(r,s);\ r, s \in S\}$, we are in fact thinking of a different chain, namely the reducible Markov chain $Y = (Y_n)_{n\in\mathbb{N}_0}$ on the state space $S \times \{0, \dots, T\}$, with transition probabilities $\mathbb{P}(Y_{n+1} = (s, n+1) \mid Y_n = (r, n)) = P(r,s)$ for all $r, s \in S$, $n \in \{0, \dots, T-1\}$ and $\mathbb{P}(Y_{n+1} = (s, T) \mid Y_n = (s, T)) = 1$.
In other words, we consider the evolution in time as new generations of states and stop the original chain at time T by making all states corresponding to time T absorbing. Incidentally, the example given in [3] is also of this kind; it corresponds to a two state Markov chain with T = 2 (cf. Figure 1).
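The construction of the time-extended chain can be sketched in a few lines. The function below takes a transition matrix on $S = \{0, \dots, |S|-1\}$ and a horizon $T$ and returns the transition probabilities on $S \times \{0, \dots, T\}$ as nested dictionaries; the representation and the function name are choices made for this illustration only, and the two-state example matrix is hypothetical.

```python
def time_extended_chain(P, T):
    """Transition probabilities of the chain on S x {0, ..., T}.

    States are pairs (s, n); for n < T the chain moves from (r, n) to
    (s, n + 1) with probability P[r][s], and every state (s, T) is absorbing.
    """
    states = range(len(P))
    Q = {}
    for n in range(T):
        for r in states:
            Q[(r, n)] = {(s, n + 1): P[r][s] for s in states if P[r][s] > 0}
    for s in states:
        Q[(s, T)] = {(s, T): 1.0}
    return Q

# Hypothetical two-state chain with horizon T = 2
Q = time_extended_chain([[0.8, 0.2], [0.2, 0.8]], 2)
```

For a two-state chain and $T = 2$ this yields six states, matching the size of the example from [3] mentioned above.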
So for the remainder of this article, we actually consider Markov chains in discrete, finite time and with finite state space only. With this simplification in mind, we want to prove the following:
Theorem 6.2 For any $\varepsilon > 0$ there exists a Markov chain $X$, two states $x$ and $y$ and a positive integer $T$ such that
$$\|P^T(x,\cdot) - P^T(y,\cdot)\|_{TV} \ge \tfrac12 - \varepsilon$$
and such that there is a coupling of two copies of the chain, $((X_t, Y_t))_{t=0}^T$, with initial states $x$ and $y$ respectively satisfying $\mathbb{P}(X_t = Y_t \text{ for some } 0 \le t \le T) = 1$.
Before proceeding with a proof of this theorem, we want to introduce an alternative way to view segregating couplings. Let X be any discrete-time Markov chain with a finite state space S, and fix two states x, y ∈ S and a positive integer T .
For any coupling of two copies of this chain, $((X_t, Y_t))_{t=0}^T$, started in $X_0 = x$ and $Y_0 = y$ respectively, one can consider the corresponding meeting probability
$$\mathbb{P}(X_t = Y_t \text{ for some } 0 \le t \le T). \qquad (6)$$
A natural question to ask in this setting is how large one can make this probability for a given (finite) Markov chain and given $x$, $y$ and $T$ by maximizing over all such couplings. Let $\mathcal{X}$ denote the subset of $\{x\} \times S^T$ consisting of all possible trajectories of $X = (X_t)_{t=0}^T$, and respectively $\mathcal{Y} \subseteq \{y\} \times S^T$ for $Y = (Y_t)_{t=0}^T$. Observe that any coupling of $X$ and $Y$ is determined by the values $(p_{\mathbf{x}\mathbf{y}})_{\mathbf{x}\in\mathcal{X},\,\mathbf{y}\in\mathcal{Y}}$, where $p_{\mathbf{x}\mathbf{y}}$ denotes the probability of the event that both $X = \mathbf{x} = (x_t)_{t=0}^T$ and $Y = \mathbf{y}$. We will denote the maximal value of (6) by $C_T(x, y)$ and call it the optimal meeting probability.
While finding explicit couplings that maximize the meeting probability can quickly become cumbersome as the number of possible trajectories grows, it turns out that the problem of optimizing the meeting probability has a useful dual, which allows us to determine C T (x, y) without having to deal with the couplings directly. This duality corresponds to the idea of max-flow/min-cut and König's theorem in combinatorial optimization.
Let $A = (A_t)_{t=0}^T$ be a sequence of subsets of $S$, the (finite) state space of the considered Markov chain $X$. We will refer to any such sequence as a separating sequence. We define the separation of any separating sequence as
$$S^A_T(x, y) := \mathbb{P}(X_t \in A_t \text{ for all } 0 \le t \le T \mid X_0 = x) + \mathbb{P}(X_t \notin A_t \text{ for all } 0 \le t \le T \mid X_0 = y). \qquad (7)$$
We say that the separating sequence is non-trivial if both summands on the right-hand side in (7) are non-zero. We further define the optimal separation $S_T(x, y)$ as the maximum separation over all possible separating sequences.
It is not too hard to see that the optimal meeting probability and optimal separation are related. Specifically, for any coupling $((X_t, Y_t))_{t=0}^T$ such that $(X_0, Y_0) = (x, y)$ and any separating sequence $A = (A_t)_{t=0}^T$ we have
$$\mathbb{P}(X_t = Y_t \text{ for some } 0 \le t \le T) \le 2 - S^A_T(x, y),$$
since the two copies cannot meet as long as $X_t \in A_t$ and $Y_t \notin A_t$ for all $0 \le t \le T$. Maximizing the meeting probability over all possible couplings of two copies, started in $x$, $y$ respectively, and minimizing the upper bound by maximizing the separation $S^A_T(x, y)$ over all separating sequences yields
$$C_T(x, y) \le 2 - S_T(x, y). \qquad (8)$$
However, for our purposes (namely to guarantee $C_T(x, y) = 1$), we rather need to bound $C_T(x, y)$ from below. In this respect it is quite convenient that the inequality (8) actually holds as an equality, as the following theorem shows.

Theorem 6.3
Given an arbitrary but fixed finite Markov chain $X$, two states $x, y \in S$ and time horizon $T$, we have
$$C_T(x, y) = 2 - S_T(x, y). \qquad (9)$$
Proof: A simple way to prove the reverse inequality is to employ the max-flow min-cut theorem, in the same way it can be used to prove Strassen's monotone coupling theorem. Starting from the sets $\mathcal{X}$ and $\mathcal{Y}$ as above, we build the following directed graph, which will be denoted by $G = (V, E)$: First we let each $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{Y}$ be represented by a node. Then we add two further nodes: a source $s$ and a sink $t$. When it comes to the directed edges, there will be an arrow $(s, \mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$ and $(\mathbf{y}, t)$ for all $\mathbf{y} \in \mathcal{Y}$. Additionally, we include the edge $(\mathbf{x}, \mathbf{y})$ if the two trajectories $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{Y}$ share at least one state, i.e. $x_t = y_t$ for some $0 \le t \le T$; in the sequel, we will write this as $\mathbf{x} \sim \mathbf{y}$.
Finally, we have to assign capacities to these directed edges: The edges $(s, \mathbf{x})$ and $(\mathbf{y}, t)$ will get capacities $\mathbb{P}(X = \mathbf{x})$ and $\mathbb{P}(Y = \mathbf{y})$ respectively. All edges going in between $\mathcal{X}$ and $\mathcal{Y}$ get capacity 1, see Figure 3 for an illustration.
Figure 3: The auxiliary graph G to which we apply the max-flow min-cut theorem.
Let us now consider the minimum cut problem on $G$. From the fact that the cut $(\{s\}, V \setminus \{s\})$ has value 1, we know that we can focus on cuts that are not cutting any edges going in between $\mathcal{X}$ and $\mathcal{Y}$ when trying to find a minimal one. For a cut $(B, B^c)$ of this kind, with $s \in B$ and $t \in B^c$ say, we note that $\mathbf{x} \sim \mathbf{y}$ cannot occur for $\mathbf{x} \in \mathcal{X} \cap B$ and $\mathbf{y} \in \mathcal{Y} \cap B^c$, due to the assumption that no such edges are cut. Furthermore, the edges cut incident to $s$ have at least the value $\mathbb{P}(X \in \mathcal{X} \cap B^c)$, the ones incident to $t$ at least $\mathbb{P}(Y \in \mathcal{Y} \cap B)$.
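Let us define the set sequence $A = (A_t)_{t=0}^T$ by
$$A_t := \{x_t;\ \mathbf{x} \in \mathcal{X} \cap B\}$$
in order to bound the value of the given cut from below by
$$\mathbb{P}(X \in \mathcal{X} \cap B^c) + \mathbb{P}(Y \in \mathcal{Y} \cap B) \ge 2 - S^A_T(x, y) \ge 2 - S_T(x, y).$$
Consequently, as this bound applies to any minimal cut, using the max-flow min-cut theorem (see for example Thm. 1, Chap. III in [2]) we are guaranteed the existence of a maximal flow of value at least $2 - S_T(x, y)$. Let us denote the respective flow through the edge corresponding to $\mathbf{x} \sim \mathbf{y}$ by $q_{\mathbf{x}\mathbf{y}}$.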
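We can use this maximal flow to establish a coupling of $X$ and $Y$ in the same vein as in Doeblin's coupling lemma (see for instance Prop. 4.7 in [6]): First we let $X = \mathbf{x}$ and $Y = \mathbf{y}$ simultaneously with probability $q_{\mathbf{x}\mathbf{y}}$ for all $\mathbf{x} \sim \mathbf{y}$. Then we define $X$ to follow the trajectory $\mathbf{x}$ with the remaining probability $\mathbb{P}(X = \mathbf{x}) - \sum_{\mathbf{y}:\ \mathbf{x}\sim\mathbf{y}} q_{\mathbf{x}\mathbf{y}}$ and similarly $Y$ to follow the trajectory $\mathbf{y}$ with probability $\mathbb{P}(Y = \mathbf{y}) - \sum_{\mathbf{x}:\ \mathbf{x}\sim\mathbf{y}} q_{\mathbf{x}\mathbf{y}}$ independently, for all $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{Y}$.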
From the flow constraints, we know that all these probabilities are in $[0, 1]$ and the resulting coupling satisfies $\mathbb{P}(X \sim Y) \ge 2 - S_T(x, y)$. The theorem then follows by combining this inequality with (8).
One can observe that, as the left-hand side of (9) is a probability and hence at most 1, we must always have optimal separation at least 1. Indeed, we can obtain separation equal to 1, for instance by taking $A_t = S$ for all $0 \le t \le T$. Recall that a separating sequence is called non-trivial if the probabilities that $X_t \in A_t$ for all $0 \le t \le T$ given $X_0 = x$ and $X_t \notin A_t$ for all $0 \le t \le T$ given $X_0 = y$ are both non-zero. Clearly, any trivial separating sequence has separation at most 1, so it follows from Theorem 6.3 that for any finite state Markov chain in discrete time, any $x$, $y$ and $T$ as above, the following two statements are equivalent:
(a) The meeting probability under optimal coupling of two copies, started in $x$, $y$ respectively, is 1.
(b) For all non-trivial separating sequences $A = (A_t)_{t=0}^T$, we get $S^A_T(x, y) \le 1$.
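For very small chains, the optimal separation can be found by brute force over all separating sequences, which then gives the optimal meeting probability via Theorem 6.3. The sketch below assumes that the separation of a sequence is the probability that the copy started in $x$ stays inside the sets plus the probability that the copy started in $y$ stays outside them, consistent with the description of non-trivial sequences above. Function names are hypothetical and the enumeration is exponential in both $|S|$ and $T$, so it is only meant for toy examples.

```python
from itertools import product

def stay_prob(P, start, sets):
    """P(X_t in sets[t] for all t | X_0 = start)."""
    if start not in sets[0]:
        return 0.0
    weights = {start: 1.0}
    for allowed in sets[1:]:
        nxt = {}
        for r, mass in weights.items():
            for s in allowed:
                nxt[s] = nxt.get(s, 0.0) + mass * P[r][s]
        weights = nxt
    return sum(weights.values())

def optimal_separation(P, x, y, T):
    """Maximum separation over all sequences (A_0, ..., A_T) of subsets of S."""
    states = range(len(P))
    full = frozenset(states)
    subsets = [frozenset(s for s in states if (mask >> s) & 1)
               for mask in range(2 ** len(P))]
    best = 0.0
    for seq in product(subsets, repeat=T + 1):
        complements = [full - A for A in seq]
        best = max(best, stay_prob(P, x, seq) + stay_prob(P, y, complements))
    return best

def optimal_meeting_probability(P, x, y, T):
    """C_T(x, y) = 2 - S_T(x, y), using the duality of Theorem 6.3."""
    return 2 - optimal_separation(P, x, y, T)
```

As a usage example, for the two-state chain that switches state with probability 0.2 and $T = 2$, the constant sequence $A_t = \{0\}$ gives separation $2 \cdot 0.8^2 = 1.28$, so the best coupling meets with probability 0.72.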
In order to get acquainted with the idea behind the concept of separation, let us take a look at the simplest non-trivial example:
Example 6.1 Consider the Markov chain $X$ with state space $\{0, 1\}$ and transition probabilities $P(0, 1) = P(1, 0) = \alpha$ as well as $P(0, 0) = P(1, 1) = 1 - \alpha$, and take $x = 0$, $y = 1$. As mentioned above, the case where $T = 2$ is Häggström's [3] example of a segregating Markov chain.
Since this chain only has two states, any non-trivial separating sequence must have $A_t = \{0\}$ or $A_t = \{1\}$ for each $0 \le t \le T$. For $\alpha \le \frac12$, the non-trivial separating sequence given by $A_t = \{0\}$ for all $t$ is obviously best possible. It is immediate to check that its separation equals $2(1-\alpha)^T$. Hence, the optimal meeting probability of two copies, started in states 0 and 1 respectively, is 1 if and only if $2(1-\alpha)^T \le 1$. Using induction, one can easily check that for this chain we have
$$\|P^T(0,\cdot) - P^T(1,\cdot)\|_{TV} = (1 - 2\alpha)^T.$$
So by choosing $\alpha = \alpha(T)$ such that $2(1-\alpha)^T = 1$, we obtain a Markov chain that segregates the states 0 and 1 with total variation distance $(2^{1-1/T} - 1)^T$, which tends to $\frac14$ as $T \to \infty$.
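The limit $\frac14$ can be observed numerically. The sketch below solves $2(1-\alpha)^T = 1$ for $\alpha$ and evaluates the resulting total variation distance $(2^{1-1/T} - 1)^T$ given in the text; the function names are hypothetical.

```python
import math

def critical_alpha(T):
    """The alpha solving 2 * (1 - alpha)^T = 1."""
    return 1 - 2 ** (-1 / T)

def segregation_distance(T):
    """Total variation distance (2^(1 - 1/T) - 1)^T at the critical alpha."""
    return (2 ** (1 - 1 / T) - 1) ** T

# A few values along the way to the limit 1/4
table = {T: segregation_distance(T) for T in (2, 10, 100, 10_000)}
```

For $T = 2$ this reproduces the value $3 - 2\sqrt2$ from Section 3, and the entries of `table` increase towards $\frac14$, which shows that the two-state chain alone cannot reach the bound $\frac12$.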
The next example is supposed to illustrate that reducible Markov chains obtained from irreducible and aperiodic finite chains in finite time, in the way described before Theorem 6.2, usually do segregate any two states:
Example 6.2 Let $X$ be any irreducible aperiodic Markov chain with a finite state space $S$, and let $x$ and $y$ be any two states. Pick $\varepsilon > 0$ and $n \in \mathbb{N}$ such that $P^n(x', y') \ge \varepsilon$ for all $x', y' \in S$. Then, for any non-trivial separating sequence $(A_t)_{t=0}^{nk}$, we have
$$S^A_{nk}(x, y) \le 2\,(1 - \varepsilon)^k$$
for any $k \in \mathbb{N}$. Hence, by taking $T = nk$ for a sufficiently large $k$ it follows that the optimal meeting probability during $[0, T]$ is 1. This shows that, unless $\|P^T(x,\cdot) - P^T(y,\cdot)\|_{TV} = 0$, the Markov chain segregates the two states $x$ and $y$, choosing $T$ sufficiently large.
We now turn to the proof of Theorem 6.2. Let $X$ be the Markov chain with state space $\{0, 1, \dots, L\}$ for some positive integer $L$ and transition probabilities given by $P(0, 1) = P(L, L-1) = 1 - P(0, 0) = 1 - P(L, L) = \alpha$ as well as $P(i, i+1) = P(i, i-1) = \frac12$ for all $0 < i < L$, see Figure 4. Such chains, with $S = \{0, 1, \dots, L\}$ and the additional property that $|X_{t+1} - X_t| \le 1$ a.s. for all $t$, are commonly called finite birth-and-death chains, cf. Section 2.5 in [6]. To begin with, our main interest lies in the optimal meeting probability of this chain given the starting states $x = 0$ and $y = L$. We will then show that we can obtain segregation between 0 and $L$ with total variation distance arbitrarily close to $\frac12$ by choosing $L$, $T$ and $\alpha$ appropriately.
The qualitative behavior of this chain for small $\alpha$ is easy to describe: Most of the time, the process is either at 0 or $L$. Occasionally, that is at rate $\alpha$, the process takes one step inwards and typically moves around for order $L$ steps before hitting one of the marginal states 0 or $L$; more precisely, the expected time to reach $\{0, L\}$ when starting in state 1 equals $L - 1$. With probability $1 - \frac1L$ it will return to the same side it started at, and with probability $\frac1L$ it will cross over to the opposite side (see the analysis of the so-called gambler's ruin problem in Section 2.1 in [6] for the explicit calculations).
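The two gambler's ruin quantities quoted above can be checked exactly with a few lines of code. The sketch below solves the harmonic equations of the simple random walk on the interior states $\{1, \dots, L-1\}$ by a shooting argument (the solution is affine in its value at state 1) using exact rational arithmetic. The helper names `solve_walk`, `crossing_probability` and `expected_exit_time` are hypothetical, and $L \ge 2$ is assumed so that interior states exist.

```python
from fractions import Fraction


def solve_walk(L, source, left, right):
    """Solve u(i) = source + (u(i-1) + u(i+1)) / 2 for 0 < i < L
    with boundary values u(0) = left and u(L) = right (requires L >= 2)."""
    def shoot(u1):
        # Propagate the recursion u(i+1) = 2 u(i) - u(i-1) - 2 * source.
        u = [Fraction(left), Fraction(u1)]
        for i in range(1, L):
            u.append(2 * u[i] - u[i - 1] - 2 * source)
        return u

    a, b = shoot(0), shoot(1)
    # u(L) depends affinely on u(1); pick u(1) so that u(L) = right.
    u1 = (Fraction(right) - a[L]) / (b[L] - a[L])
    return shoot(u1)


def crossing_probability(L):
    """Probability to hit L before 0 when starting in state 1."""
    return solve_walk(L, source=0, left=0, right=1)[1]


def expected_exit_time(L):
    """Expected number of steps to reach {0, L} when starting in state 1."""
    return solve_walk(L, source=1, left=0, right=0)[1]


if __name__ == "__main__":
    for L in (2, 5, 20):
        print(L, crossing_probability(L), expected_exit_time(L))
```

For each `L` the script should report the crossing probability $\frac1L$ and the expected exit time $L - 1$.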
In preparation for the in-depth analysis of the separation of states 0 and $L$, let us collect a few general estimates for this chain in the following lemma, which will prove useful later on. For the sake of clarity we will use the standard big O notation to represent error terms, i.e. for any non-negative function $f$ of $k$ and $\alpha$, the expression $O(f(k, \alpha))$ denotes a quantity that is bounded in absolute value by $c \cdot f(k, \alpha)$, where the constant $c > 0$ does not depend on $k$, $\alpha$ or $t$.
(b) To show the second claim, consider the sequence $a_t = E[X_t \mid X_0 = 0]$. Using part (a), we know that where the error terms are bounded by $2L^2\alpha$ in absolute value. Furthermore, since $E[X_{t+1} - X_t \mid X_t]$ equals $\alpha$ if $X_t = 0$, $-\alpha$ if $X_t = L$ and 0 otherwise, we can infer from (13) where the error term is bounded by $4L\alpha^2$ in absolute value irrespective of $t$. Solving this recursion, using $a_0 = 0$ and t−1 The estimate (10) immediately follows from (13), which together with part (a) implies (11).
(c) The case $k = 0$ is obvious, so we may assume $k > 0$. Let $\tau$ denote the first time $t \ge 0$ for which $X_t = k + 1$. Consider the sequence $b_t = E[X_{t \wedge \tau} \mid X_0 = 0]$, where $t \wedge \tau$ denotes the minimum of $t$ and $\tau$. Note that part (a) implies $P(X_{t \wedge \tau} = i \mid X_0 = 0) \le P(X_t = i \mid X_0 = 0) \le 2\alpha$ for any $i \in \{1, \dots, k\}$ and further With the same reasoning as in part (b), we end up in a similar situation with $b_0 = 0$ and $b_t$ satisfying the recursive formula Solving it gives $b_t = (k+1) - (k+1)\left(1 - \frac{\alpha}{k+1}\right)^t + O(k^2\alpha)$ and plugging this into (14) completes the proof of part (c), using and noting that $X_{t \wedge \tau} \le k$ if and only if $X_{t'} \le k$ for all $0 \le t' \le t$.
Let us now take a closer look at the optimal separation of states 0 and $L$ in this chain. To make our lives easier, we establish three auxiliary results showing that among the non-trivial separating sequences, there are very simple ones which are essentially best possible as $T$ grows large.

Proposition 6.5
Let $L$ be fixed, and let $\alpha = \alpha(T) = \Theta\left(\frac1T\right)$. Then, for sufficiently large $T$, any non-trivial separating sequence $A = (A_t)_{t=0}^T$ such that $S^A_T(0, L) > 1$ (if any exists) must satisfy $0 \in A_t$ and $L \notin A_t$ for all $0 \le t \le T$.
Proof: By (10), we know that $P^t(0, 0) = P^t(L, L) > \frac12$ for any $0 \le t \le T$ given $T$ sufficiently large. Hence if $0 \notin A_{t_1}$ and $L \in A_{t_2}$ for some $0 \le t_1, t_2 \le T$, then By symmetry, we can assume without loss of generality that $0 \in A_t$ for all $0 \le t \le T$. Note that if there is some $t_1$ such that $L \in A_{t_1}$, then by part (a) of Lemma 6.4, we get So the proposition follows if we can show that there exists a constant $\varepsilon > 0$ such that for $T$ sufficiently large any non-trivial separating sequence satisfies First note that since $A$ is non-trivial, there exists a trajectory $y \in \{L\} \times S^T$ such that $y_t \notin A_t$ for all $0 \le t \le T$ and $P(X = y \mid X_0 = L) > 0$. Further, recall that the chain can only attain $y$ with positive probability if $|y_{t+1} - y_t| \le 1$ for all $0 \le t \le T - 1$.
Next, let $X = (X_t)_{t=0}^{T+1}$ be a copy of the Markov chain started in $X_0 = 0$. We define the process $X' = (X'_t)_{t=0}^T$ as $X'_t = X_{t+1}$ for all $0 \le t \le T$ if $X_1 = 0$, and otherwise put $X'_0 = 0$ and let this process evolve independently of $X$. Clearly, this implies that $(X_t)_{t=0}^T$ and $X'$ have the same distribution. Now, if $X_t = L$ for some $0 \le t \le T$, then the trajectories of $X$ and $y$ necessarily either intersect or cross. Consequently, we either have $X_1 \ne 0$ (which occurs with probability $\alpha$) or at least one of $X$ and $X'$ meets $y$, and is thus outside $A_t$ for some $t$. By the union bound, we find From (12) we know that $P(X_t \le L - 1 \text{ for all } 0 \le t \le T \mid X_0 = 0)$ is bounded away from 1 as $T$ tends to infinity with our choice of $\alpha = \Theta\left(\frac1T\right)$. So choosing $\varepsilon > 0$ small, $T$ large enough such that will do the job.

Proposition 6.6
Let $A = (A_t)_{t=0}^T$ be any separating sequence such that $0 \in A_t$ and $L \notin A_t$ for all $0 \le t \le T$. For any $0 \le a \le T$, we define the separating sequence $A^a = (A^a_t)_{t=0}^T$ by $A^a_t := A_{t+a \ (\mathrm{mod}\ T+1)}$. Then
Proof: The case where $a = 0$ is obvious, so we may assume $a > 0$. By part (a) of Lemma 6.4, for any fixed $0 \le t \le T$, the probability that $X_t \in A_t \setminus \{0\}$ given $X_0 = 0$ is at most $2L\alpha$. From this we can infer where the first and last equality follow from the Markov property and the central one from time homogeneity of the chain. By symmetry, the same argument works for the chain started at $L$ and the sequence of complementary sets $(S \setminus A_t)_{t=0}^T$.

Proposition 6.7
For any separating sequence $A = (A_t)_{t=0}^T$ such that $0 \in A_t$ and $L \notin A_t$ for all $0 \le t \le T$, there exists a $k \in \{0, \dots, L\}$ such that where k denotes the constant separating sequence whose elements are all equal to the set $\{0, 1, \dots, k\}$.
Proof: By Proposition 6.6, we have For any $0 \le k \le L$, let us define Note that the function $f$ is decreasing, $f(0) = 1$ and $f(L) = 0$. To simplify the notation, we additionally set $f(L+1) := 0$ and write $M := \max_{0 \le t \le T} X_t$. Considering only the first summands coming from each of the separation terms $S^{A^a}_T(0, L)$, $0 \le a \le T$, we find Fix $k \in \{0, \dots, L\}$, pick $k' \in \{0, \dots, k\}$ to minimize $|\{t : k' \in A_t\}|$ and note that this implies Let $\tau \ge 0$ be the first time when the Markov chain visits state $k'$. Given $M = k$, we know $\tau \le T$, hence We conclude that Arguing analogously in the case of $X_0 = L$, with the modifications that we consider $\min_{0 \le t \le T} X_t$ instead of $\max_{0 \le t \le T} X_t$ and $\tau$ now denotes the first time when the chain is in state $k$, yields where we used $1 - f(k) \ge 1 - \frac{1}{T+1} \cdot |\{t : k \in A_t\}|$ to derive the inequality and the last equality follows from $\sum_{k=0}^{L} \big(f(k) - f(k+1)\big) = 1$. By combining these two estimates, it follows that From the fact that the coefficients $f(k) - f(k+1)$, $0 \le k \le L$, sum up to 1, plus $f(L) = f(L+1)$, we can conclude that there exists some $k \in \{0, \dots, L-1\}$ such that which completes the proof.
Combining Propositions 6.5 and 6.7, it follows that for any fixed $L \ge 1$, any $\alpha = \alpha(T) = \Theta\left(\frac1T\right)$ and $T$ sufficiently large, the optimal separation $S_T(0, L)$ is the maximum of 1 and for $0 \le k < L$, where the inequality follows from (12).
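To get a feeling for the constant separating sequences, their separation can be evaluated numerically. The sketch below rests on an assumption that should be checked against the definition in (9), which is not reproduced in this section: it takes the separation of $A$ to be $P(X_t \in A_t \text{ for all } t \mid X_0 = x) + P(X_t \notin A_t \text{ for all } t \mid X_0 = y)$, a reading that is at least consistent with the value $2(1-\alpha)^T$ quoted for the two-state example. All function names are illustrative.

```python
def birth_death_matrix(L, alpha):
    """Transition matrix of the chain on {0, ..., L} described above."""
    P = [[0.0] * (L + 1) for _ in range(L + 1)]
    P[0][0], P[0][1] = 1 - alpha, alpha
    P[L][L], P[L][L - 1] = 1 - alpha, alpha
    for i in range(1, L):
        P[i][i - 1] = P[i][i + 1] = 0.5
    return P


def stay_probability(P, start, allowed, T):
    """P(X_t in allowed for all 0 <= t <= T | X_0 = start)."""
    if start not in allowed:
        return 0.0
    mass = [0.0] * len(P)
    mass[start] = 1.0
    for _ in range(T):
        new = [0.0] * len(P)
        # Only keep probability mass that never leaves the allowed set.
        for i in allowed:
            for j in allowed:
                new[j] += mass[i] * P[i][j]
        mass = new
    return sum(mass)


def constant_separation(L, alpha, k, T):
    """Separation (under the assumed definition) of the constant
    sequence A_t = {0, ..., k}, for starting states 0 and L."""
    P = birth_death_matrix(L, alpha)
    low = set(range(k + 1))
    high = set(range(k + 1, L + 1))
    return stay_probability(P, 0, low, T) + stay_probability(P, L, high, T)


if __name__ == "__main__":
    L, T = 4, 200
    for k in range(L):
        print(k, constant_separation(L, 1.0 / T, k, T))
```

For $L = 1$ the chain reduces to the two-state example, and `constant_separation(1, alpha, 0, T)` should agree with $2(1-\alpha)^T$.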
To finish the proof of Theorem 6.2, we need one more elementary estimate: Then it holds sup Proof: First note that the function $f_A$ lies in $C^\infty((0, 1))$ and is symmetric around $x = \frac12$. Calculating its first two derivatives gives .
and as a consequence the sign of f A (x) is given by the sign of Due to $A > 0$, the function $g_A$ is strictly convex. Assuming the existence of two local maxima of $f_A$ on $(0, 1)$ (at points $x_1 < x_2$, say) forces the existence of a local minimum at which contradicts the strict convexity of $g_A$. Consequently, $f_A$ can have at most one local maximum in $(0, 1)$, which then lies at $x = \frac12$ for symmetry reasons.
In conclusion, $f_A$ either attains its maximum on $(0, 1)$ at $x = \frac12$ or converges to its supremum on the boundary.
Given $\varepsilon > 0$, we can choose $\delta > 0$ small, $L$ large enough and then pick $T$ sufficiently large to make the right-hand side of (17) larger than $\frac12 - \varepsilon$. This completes the proof.

Remark 3
One downside of the implicit construction proving Theorem 6.2 is the fact that it does not give much information about the coupling involved. As the coupling will have to take into account the whole trajectories of the two individual copies, it is highly unlikely that the coupled process will have the Markov property. In this respect, it is still an open problem whether the value of $\frac1e$, established in Section 5, can be pushed further (as supremum of achievable total variation distances that can be retained in segregating Markov chains) if we restrict ourselves to Markovian couplings.
We can, however, rule out that there exists a single chain in discrete time with two segregated states $x$ and $y$ such that $\lim_{n \to \infty} \|P^n(x, .) - P^n(y, .)\|_{TV} = \frac12$, i.e. for which the value $\frac12$ actually is attained (cf. the following proposition, which slightly improves the result from Theorem 4.1).

Proposition 6.9
Consider a Markov chain in discrete time with countable state space $S$ and two states $x, y \in S$. If two copies, $X = (X_t)_{t \in \mathbb{N}_0}$ and $Y = (Y_t)_{t \in \mathbb{N}_0}$, started in $x$ and $y$ respectively, can be coupled to meet almost surely in finite time, it follows that $\lim_{n \to \infty} \|P^n(x, .) - P^n(y, .)\|_{TV} < \frac12$.
Proof: As a matter of fact, we can alter the proof of Proposition 4.2 to derive the above statement: For the reasoning there to work, we need a function $f : \mathbb{N}_0 \times S \to [0, 1]$, replacing the martingales $M_t$ and $N_t$, with the following two properties:
(i) $(f(t, X_t))_{t \in \mathbb{N}_0}$ is a martingale with respect to the natural filtration of $X$, and likewise for $Y$.
In order to construct such a function, let us define the sets $A_n \subseteq S$, $n \in \mathbb{N}_0$, via $A_n := \{s \in S;\ P^n(x, s) > P^n(y, s)\}$, which implies $\|P^n(x, .) - P^n(y, .)\|_{TV} = P^n(x, A_n) - P^n(y, A_n)$, and further $f_n(t, s) := P^{n-t}(s, A_n)$ for all $t \le n$. Finally, choose $f$ to be the limit of a pointwise converging subsequence of the uniformly bounded sequence of functions $(f_n)_{n \in \mathbb{N}_0}$. Then (ii) is immediate and since for all $n \in \mathbb{N}_0$, $f_n(t, X_t)$ is a martingale for $0 \le t \le n$, bounded by 0 and 1 from below and above respectively, the conditional dominated convergence theorem ensures that $f(t, X_t)$ inherits these properties.
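The two facts used in this construction, namely that $A_n$ realizes the total variation distance and that $t \mapsto f_n(t, X_t)$ is a martingale up to time $n$, can be verified exactly on a small example. The following sketch does so with rational arithmetic; the three-state matrix `P` is a hypothetical example chain and the function names are illustrative.

```python
from fractions import Fraction


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def witness_set(P, x, y, n):
    """A_n = {s : P^n(x, s) > P^n(y, s)}."""
    Pn = mat_pow(P, n)
    return {s for s in range(len(P)) if Pn[x][s] > Pn[y][s]}


def tv_distance(P, x, y, n):
    """Half the l1-distance between the rows P^n(x, .) and P^n(y, .)."""
    Pn = mat_pow(P, n)
    return sum(abs(Pn[x][s] - Pn[y][s]) for s in range(len(P))) / 2


def f_n(P, n, t, s, A_n):
    """f_n(t, s) = P^{n-t}(s, A_n) for t <= n."""
    Pk = mat_pow(P, n - t)
    return sum(Pk[s][r] for r in A_n)


# Hypothetical three-state example chain (each row sums to 1).
P = [[Fraction(1, 2), Fraction(1, 2), Fraction(0)],
     [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
     [Fraction(0), Fraction(1, 3), Fraction(2, 3)]]

if __name__ == "__main__":
    n, x, y = 4, 0, 2
    A = witness_set(P, x, y, n)
    print(A, tv_distance(P, x, y, n), f_n(P, n, 0, x, A) - f_n(P, n, 0, y, A))
```

The martingale property corresponds to the one-step identity $f_n(t, s) = \sum_r P(s, r) f_n(t+1, r)$, which is the Chapman-Kolmogorov equation in disguise.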
Note that the almost sure limit of $f(t, X_t)$ as $t \to \infty$ exists, according to Doob's martingale convergence theorem, which implies that $f(\tau_x, X_{\tau_x})$ is well defined even on $B_x = \{\tau_x = \infty\}$. Further note that $B_x^c \cup B_y^c$ is an almost sure event, due to the fact that $X$ and $Y$ meet in finite time with probability 1.
Hence by symmetry it is safe to assume $P(B_x^c) > 0$, which verifies the claim.