Topics: Sample Space, Trajectories; Laws of Large Numbers: WLLN, SLLN; Proof of Big Theorem.

Background:

  • Borel–Cantelli (B.1.2);

  • monotonicity of expectation (B.2);

  • convergence of expectation (B.8)–(B.9);

  • properties of variance: (B.3) and Theorem B.4.

2.1 Sample Space

Let us connect the definition of a Markov chain X = {X n, n ≥ 0} with the general framework of Sect. B.1. (We write X n or X(n).) In that section, we explained that a random experiment is described by a sample space. The elements of the sample space are the possible outcomes of the experiment. A probability is defined on subsets, called events, of that sample space. Random variables are real-valued functions of the outcome of the experiment.

To clarify these concepts, consider the case where the X n are i.i.d. Bernoulli random variables with P(X n = 1) = P(X n = 0) = 0.5. These random variables describe flips of a fair coin. The random experiment is to flip the coin repeatedly, forever. Thus, one possible outcome of this experiment is an infinite sequence of 0’s and 1’s. Note that an outcome is not 0 or 1: it is an infinite sequence since the outcome specifies what happens when we flip the coin forever. Thus, the set Ω of outcomes is the set \(\{0, 1\}^\infty \) of infinite sequences of 0’s and 1’s. If ω is one such sequence, we have ω = (ω 0, ω 1, …) where ω n ∈ {0, 1}. It is then natural to define X n(ω) = ω n, which simply says that X n is the outcome of flip n, for n ≥ 0. Hence \(X_n(\omega ) \in \Re \) for all ω ∈ Ω and we see that each X n is a real-valued function defined on Ω. For instance, X 0(1101001…) = 1 since ω 0 = 1 when ω = 1101001… . Similarly, X 1(1101001…) = 1 and X 2(1101001…) = 0. To specify the random experiment, it remains to define the probability on Ω. The simplest way is to say that

$$\displaystyle \begin{aligned} & P(\{ \omega | \omega_0 = a, \omega_1 = b, \ldots, \omega_n = z\}) \\ &~~~~~~ = P(X_0 = a, \ldots , X_n = z) = 1/2^{n+1} \end{aligned} $$

for all n ≥ 0 and a, b, …, z ∈{0, 1}. For instance,

$$\displaystyle \begin{aligned} P(\{\omega | \omega_0 = 1\}) = P(X_0 = 1) = 1/2. \end{aligned}$$

Similarly,

$$\displaystyle \begin{aligned} P(\{\omega | \omega_0 = 1, \omega_1 = 0\}) = P(X_0 = 1, X_1 = 0) = 1/4. \end{aligned}$$

Observe that we define the probability of a set of outcomes, or event, {ω|ω 0 = a, ω 1 = b, …, ω n = z} instead of specifying the probability of each outcome ω. The reason is that the probability that we observe a specific infinite sequence of 0’s and 1’s is zero. That is, P({ω}) = 0 for all ω ∈ Ω. Such a description does not tell us much about the coin flips! For instance, it does not specify the bias of the coin, or the fact that successive flips are independent. Hence, the correct way to proceed is to specify the probability of events, which are sets of outcomes, instead of the probability of individual outcomes.
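For concreteness, here is a small Python sketch of these definitions (the code and its truncation level K are ours, not part of the text): we cannot store an infinite sequence ω, so we keep only its first K coordinates, and the function X implements X n(ω) = ω n.

```python
import random

# A finite stand-in for an outcome omega in {0,1}^infinity: we can only
# store its first K coordinates (K is an assumed truncation level).
K = 20
omega = tuple(random.randint(0, 1) for _ in range(K))  # (omega_0, ..., omega_{K-1})

def X(n, omega):
    """X_n(omega) = omega_n: the result of coin flip n."""
    return omega[n]

print("omega =", "".join(map(str, omega)))
print("X_0 =", X(0, omega), " X_1 =", X(1, omega), " X_2 =", X(2, omega))
```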

For a Markov chain, there is some sample space Ω and each X n is a function X n(ω) of the outcome ω that takes values in \({\mathcal {X}}\). A probability is defined on subsets of Ω.

One can choose Ω to be the set of possible infinite sequences of symbols in \({\mathcal {X}}\). That is, \(\varOmega = {\mathcal {X}}^\infty \) and an element ω ∈ Ω is ω = (ω 0, ω 1, …) with \(\omega _n \in {\mathcal {X}}\) for n ≥ 0. With this choice, one has X n(ω) = ω n for n ≥ 0 and ω ∈ Ω, as shown in Fig. 2.1. This choice of Ω, similar to what we did for the coin flips, is called the canonical sample space. Thus, an outcome is the actual sequence of values of the Markov chain, called the trajectory, or realization, of the Markov chain. It remains to specify the probability of events in Ω. The point here is that the probability that the Markov chain follows a specific infinite sequence is 0, just as the probability that the coin flips follow a specific infinite sequence, such as all heads, is 0. Thus, one should specify the probability of subsets of Ω, not of individual outcomes. One specifies that

Fig. 2.1
figure 1

In the canonical sample space, the outcome ω is the trajectory of the Markov chain

$$\displaystyle \begin{aligned} &P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) \\ &\quad = \pi_0(i_0)P(i_0, i_1) \times \cdots \times P(i_{n-1}, i_n), {} \end{aligned} $$
(2.1)

for all n ≥ 0 and i 0, i 1, …, i n in \({\mathcal {X}}\). Here, π 0(i 0) is the probability that the Markov chain starts in state i 0.

This identity is equivalent to (1.3). Indeed, if we let

$$\displaystyle \begin{aligned} A_n = \{X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n\} \end{aligned}$$

and

$$\displaystyle \begin{aligned} A_{n-1} = \{X_0 = i_0, X_1 = i_1, \ldots, X_{n-1} = i_{n-1}\}, \end{aligned}$$

then

$$\displaystyle \begin{aligned} P(A_n) = P[A_n | A_{n-1}]P(A_{n-1}) = P(A_{n-1})P(i_{n-1}, i_n), \end{aligned}$$

by (1.3), so that (2.1) holds by induction on n.

Thus, one has defined the probability of events characterized by the first n + 1 values of the Markov chain. It turns out that there is exactly one probability on Ω that is consistent with these values.
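As an illustration, the following Python sketch computes the probability (2.1) of a finite trajectory; the two-state transition matrix, the initial distribution, and the trajectory are assumed examples, not from the text.

```python
import numpy as np

# A sketch of (2.1): the probability that the chain follows a finite
# trajectory (i_0, ..., i_n). The chain below is an assumed example.
pi0 = np.array([0.5, 0.5])          # assumed initial distribution
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # assumed transition matrix

def trajectory_probability(states, pi0, P):
    """P(X_0 = i_0, ..., X_n = i_n) = pi0(i_0) P(i_0, i_1) ... P(i_{n-1}, i_n)."""
    p = pi0[states[0]]
    for i, j in zip(states[:-1], states[1:]):
        p *= P[i, j]
    return p

print(trajectory_probability([0, 1, 1, 0], pi0, P))   # 0.5 * 0.3 * 0.6 * 0.4
```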

2.2 Laws of Large Numbers for Coin Flips

Before we discuss the case of Markov chains, let us consider the simpler example of coin flips. Let then {X n, n ≥ 0} be i.i.d. Bernoulli random variables with P(X n = 0) = P(X n = 1) = 0.5, as in the previous section. We think of X n = 1 if flip n yields heads and X n = 0 if it yields tails. We want to show that, as we keep flipping the coin, the fraction of heads approaches 50%. There are two statements that make this idea precise.

2.2.1 Convergence in Probability

The first statement, called the Weak Law of Large Numbers (WLLN), says that it is very unlikely that the fraction of heads in n coin flips differs from 50% by even a small amount, say 1%, if n is large. For instance, let n = 10^5. We want to show that the likelihood that the fraction of heads among 10^5 flips is more than 51% or less than 49% is small. Moreover, this likelihood can be made as small as we wish if we flip the coin more times.

To show this, let

$$\displaystyle \begin{aligned} Y_n = \frac{X_0 + \cdots + X_{n-1}}{n} \end{aligned}$$

be the fraction of heads in the first n flips. We claim that

$$\displaystyle \begin{aligned} P(|Y_n - E(Y_n)| \geq \epsilon) \leq \frac{\mbox{var}(Y_n)}{\epsilon^2}. \end{aligned} $$
(2.2)

This result is called Chebyshev’s inequality (Fig. 2.2).

Fig. 2.2
figure 2

Pafnuty Chebyshev. 1821–1884

To see (2.2), first recall that 1{A} denotes the indicator of the event A: it equals one if A occurs and zero otherwise. Then observe that

$$\displaystyle \begin{aligned} 1\{|Y_n - E(Y_n)| \geq \epsilon\} \leq \frac{(Y_n - E(Y_n))^2}{\epsilon^2}. \end{aligned} $$
(2.3)

Indeed, if |Y n − E(Y n)| ≥ 𝜖, then \((Y_n - E(Y_n))^2 \geq \epsilon ^2\), so that if the left-hand side of inequality (2.3) is one, the right-hand side is at least equal to one. Also, if the left-hand side is zero, it is certainly less than or equal to the right-hand side. Thus, (2.3) holds, and (2.2) follows by taking expected values in (2.3), since E(1{A}) = P(A) and \(E((Y_n - E(Y_n))^2) = \mbox{var}(Y_n)\), and since expectation is monotone (B.2).

Now, E(Y n) = 0.5 and

$$\displaystyle \begin{aligned} \mbox{var}(Y_n) = \frac{\mbox{var}(X_0 + \cdots + X_{n-1})}{n^2} = \frac{n \mbox{var}(X_0)}{n^2}. \end{aligned}$$

To see this, recall that if one multiplies a random variable by a, its variance is multiplied by a^2 (see (B.3)). Also, the variance of a sum of independent random variables is the sum of their variances (see Theorem B.4). Hence,

$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \leq \frac{\mbox{var}(X_0)}{n \epsilon^2}. \end{aligned}$$

Since X 0 is a Bernoulli(0.5) random variable (so that \(X_0^2 = X_0\)), we find that

$$\displaystyle \begin{aligned} \mbox{var}(X_0) &= E\big(X_0^2\big) - (E(X_0))^2 \\ &= E(X_0) - (E(X_0))^2 = 0.5 - 0.25 = 0.25. \end{aligned} $$

Thus,

$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \leq \frac{1}{ 4n \epsilon^2}. \end{aligned}$$

In particular, if we choose 𝜖 = 1% = 0.01, we find

$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq 1\%) \leq \frac{2,500}{n} = 0.025 \mbox{ with } n = 10^5. \end{aligned}$$

More generally, we have shown that

$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \to 0 \mbox{ as } n \to \infty, \forall \epsilon > 0. \end{aligned}$$

This is the WLLN.
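The following Python experiment (ours; the sample sizes, number of trials, and seed are arbitrary choices) compares the empirical value of P(|Y n − 0.5| ≥ 𝜖) with the Chebyshev bound 1∕(4n𝜖^2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of the WLLN against the Chebyshev bound 1/(4 n eps^2).
# The number of heads in n fair flips is Binomial(n, 0.5), so we sample
# Y_n directly instead of storing all the individual flips.
eps, trials = 0.01, 100_000
for n in (10**3, 10**4, 10**5):
    Yn = rng.binomial(n, 0.5, size=trials) / n        # fraction of heads
    empirical = np.mean(np.abs(Yn - 0.5) >= eps)
    bound = min(1 / (4 * n * eps**2), 1)              # bound capped at 1
    print(f"n={n}: empirical {empirical:.4f}, Chebyshev bound {bound:.4f}")
```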

2.2.2 Almost Sure Convergence

The second statement is the Strong Law of Large Numbers (SLLN). It says that, for all the sequences of coin flips we will ever observe, the fraction Y n actually converges to 50% as we keep on flipping the coin.

There are many sequences of coin flips for which the fraction of heads does not approach 50%. For instance, the sequence that yields heads for every flip is such that Y n = 1 for all n and thus Y n does not converge to 50%. Similarly, the sequence 001001001001001… is such that Y n approaches 1∕3 and not 50%. What the SLLN implies is that all those sequences such that Y n does not converge to 50% have probability 0: they will never be observed.

Thus, this statement is very deep because there are so many sequences to rule out. Keeping track of all of them seems rather formidable. Indeed, the proof of this statement is quite clever. Here is how it proceeds. Note that

$$\displaystyle \begin{aligned} P( |Y_n - 0.5| \geq \epsilon) \leq E\left( \frac{|Y_n - 0.5|^4}{\epsilon^4}\right), \forall n, \epsilon > 0. \end{aligned}$$

Indeed,

$$\displaystyle \begin{aligned} 1\{ |Y_n - 0.5| \geq \epsilon\} \leq \frac{|Y_n - 0.5|^4}{\epsilon^4} \end{aligned}$$

and the previous inequality follows by taking expectations. Now,

$$\displaystyle \begin{aligned} E\big(|Y_n - 0.5|^4\big) = E\left( \frac{((X_0 - 0.5) + \cdots + (X_{n-1} - 0.5))^4}{n^4}\right). \end{aligned}$$

Also, with Z m = X m − 0.5, one has

$$\displaystyle \begin{aligned} E\big(((X_0 - 0.5) + \cdots + (X_{n-1} - 0.5))^4\big) &= E\left(\left(\sum_{m=0}^{n-1} Z_m\right)^4\right) \\ &= E\left( \sum_{a, b, c, d} Z_aZ_bZ_cZ_d\right), \end{aligned} $$

where the sum is over all a, b, c, d ∈{0, 1, …, n − 1}. This sum consists of n terms \(Z_a^4\), 3n(n − 1) terms \(Z_a^2 Z_b^2\) with a ≠ b (the factor 3 counts the ways of pairing the four indices into two pairs), and other terms where at least one factor Z a is not repeated. The latter terms have zero mean since, e.g., E(Z a Z b Z c Z d) = E(Z a)E(Z b Z c Z d) = 0, by independence, whenever b, c, and d are all different from a. Consequently,

$$\displaystyle \begin{aligned} E\left( \sum_{a, b, c, d} Z_aZ_bZ_cZ_d\right) = n E\big(Z_0^4\big) + 3n(n-1) E\big(Z_0^2Z_1^2\big) = n \alpha + 3n(n-1) \beta \end{aligned}$$

with \(\alpha = E(Z_0^4)\) and \(\beta = E(Z_0^2 Z_1^2)\). Hence, substituting the result of this calculation in the previous expressions, we find that

$$\displaystyle \begin{aligned} P( |Y_n - 0.5| \geq \epsilon) \leq \frac{ n \alpha + 3n(n-1) \beta}{ n^4 \epsilon^4} \leq \frac{n^2 (\alpha + 3\beta)}{n^4 \epsilon^4} = \frac{\alpha + 3\beta}{n^2 \epsilon^4}. \end{aligned}$$

Since \(\sum _{n \geq 1} n^{-2} < \infty \), this inequality implies that

$$\displaystyle \begin{aligned} \sum_{n \geq 1} P( |Y_n - 0.5| \geq \epsilon) < \infty. \end{aligned}$$

This expression shows that the events A n := {|Y n − 0.5|≥ 𝜖} have probabilities that add up to a finite number. From the Borel–Cantelli Theorem B.1, we conclude that

$$\displaystyle \begin{aligned} P(A_n, \mbox{ i.o.}) = 0. \end{aligned}$$

This result says that, with probability one, ω belongs to only finitely many A n’s. Hence, with probability one, there is some n(ω) so that ω ∉ A n for all n ≥ n(ω). That is,

$$\displaystyle \begin{aligned} |Y_n(\omega) - 0.5| < \epsilon, \forall n \geq n(\omega). \end{aligned}$$

Since this property holds for an arbitrary 𝜖 > 0, we conclude that, with probability one,

$$\displaystyle \begin{aligned} Y_n(\omega) \to 0.5 \mbox{ as } n \to \infty. \end{aligned}$$

Indeed, if Y n(ω) does not converge to 50%, there must be some 𝜖 > 0 so that |Y n − 0.5| > 𝜖 for infinitely many n’s and we have seen that this is not the case.
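Two short Python experiments (ours; the sample sizes and seed are arbitrary) illustrate the argument: the first tracks the running average Y n(ω) along a few simulated outcomes ω, and the second checks the fourth-moment identity nα + 3n(n − 1)β by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1) SLLN along sample paths: the running fraction of heads Y_n(omega)
#    settles near 0.5 for each simulated outcome omega.
n = 10**5
for path in range(3):
    flips = rng.integers(0, 2, size=n)
    Yn = np.cumsum(flips) / np.arange(1, n + 1)          # Y_1, ..., Y_n
    print(f"path {path}: Y_100={Yn[99]:.3f}, Y_10000={Yn[9999]:.3f}, Y_{n}={Yn[-1]:.4f}")

# 2) Monte Carlo check of E[(Z_0 + ... + Z_{m-1})^4] = m*alpha + 3m(m-1)*beta,
#    where Z_k = X_k - 0.5, so alpha = E(Z^4) = 1/16 and beta = E(Z^2)^2 = 1/16.
m, trials = 50, 100_000
Z = rng.integers(0, 2, size=(trials, m)) - 0.5
alpha = beta = 1 / 16
print(np.mean(Z.sum(axis=1) ** 4), "vs", m * alpha + 3 * m * (m - 1) * beta)
```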

2.3 Laws of Large Numbers for i.i.d. RVs

The results that we proved for coin flips extend to i.i.d. random variables {X n, n ≥ 0} to show that

$$\displaystyle \begin{aligned} Y_n := \frac{X_0 + \cdots + X_{n-1}}{n} \end{aligned}$$

approaches E(X 0) as n → ∞. As for coin flips, there are two ways of making that statement precise.

2.3.1 Weak Law of Large Numbers

We need a definition.

Definition 2.1 (Convergence in Probability)

Let X n, n ≥ 0 and X be random variables defined on a common probability space. One says that X n converges in probability to X, and one writes \(X_n \overset {p}{\rightarrow } X\) if, for all 𝜖 > 0,

$$\displaystyle \begin{aligned} P(|X_n - X| \geq \epsilon) \rightarrow 0 \mbox{ as } n \rightarrow \infty.\end{aligned} $$

The Weak Law of Large Numbers (WLLN) is the following result.

Theorem 2.1 (Weak Law of Large Numbers)

Let {X n, n ≥ 0} be a sequence of i.i.d. random variables with mean μ. Then

$$\displaystyle \begin{aligned} Y_n = \frac{X_0 + \cdots + X_{n-1}}{n} \overset{p}{\rightarrow} \mu. \end{aligned} $$
(2.4)

\({\blacksquare }\)

Proof

Assume that \(E(X_n^2) < \infty \). The proof is then the same as for coin flips and is left as an exercise. For the general case, see Theorem 15.14. □

The first result of this type was proved by Jacob Bernoulli (Fig. 2.3).

Fig. 2.3
figure 3

Jacob Bernoulli. 1655–1705

2.3.2 Strong Law of Large Numbers

We again need a definition.

Definition 2.2 (Almost Sure Convergence)

Let X n, n ≥ 0 and X be random variables defined on a common probability space. One says that X n converges almost surely to X as n → ∞, and one writes X n → X, a.s., if

$$\displaystyle \begin{aligned} P\left( \lim_{n \to \infty} X_n(\omega) = X(\omega) \right) = 1.\end{aligned} $$

Thus, this convergence means that the sequence of real numbers X n(ω) converges to the real number X(ω) as n → ∞, with probability one.

Let {X n, n ≥ 0} be as in the statement of Theorem 2.1. We have the following result.

Theorem 2.2 (Strong Law of Large Numbers)

Let {X n, n ≥ 0} be a sequence of i.i.d. random variables with mean μ. Then

$$\displaystyle \begin{aligned} \frac{X_0 + \cdots + X_{n-1}}{n} \rightarrow \mu \mathit{\mbox{ as }} n \rightarrow \infty, \mathit{\mbox{ with probability }} 1.\end{aligned} $$

\({\blacksquare }\)

Thus, the sample mean values Y n := (X 0 + ⋯ + X n−1)∕n converge to the expected value, with probability 1. (See Fig. 2.4.)

Fig. 2.4
figure 4

When rolling a balanced die, the sample mean converges to 3.5

Proof

Assume that

$$\displaystyle \begin{aligned} E\big(X_n^4\big) < \infty. \end{aligned}$$

The proof is then the same as for coin flips and is left as an exercise. The proof of the SLLN in the general case is given in Theorem 15.14. □

Figure 2.5 illustrates the SLLN and WLLN. The SLLN states that the sample means of i.i.d. random variables converge to the mean, with probability one. The WLLN says that, as the number of samples increases, the probability that the sample mean differs from the mean by more than any fixed amount becomes small.

Fig. 2.5
figure 5

SLLN and WLLN for i.i.d. U[0, 1] random variables

2.4 Law of Large Numbers for Markov Chains

The long-term fraction of time that a finite irreducible Markov chain spends in a given state is the invariant probability of that state. For instance, a Markov chain X(n) on {0, 1} with P(0, 1) = P(1, 0) = a for some a ∈ (0, 1] spends half of the time in state 0, in the long term. The Markov chain in Fig. 1.2 spends a fraction 12∕39 of the time in state A, in the long term.

To understand this property, one should look at the returns to state i, as shown in Fig. 2.6. The figure shows a particular sequence of values of X(n) and decomposes this sequence into cycles between successive returns to a given state i. A new cycle starts whenever the Markov chain comes back to i. The durations of these successive cycles, T 1, T 2, T 3, …, are independent and identically distributed, because the Markov chain starts afresh from state i at the beginning of each cycle, independently of the previous states. This is a consequence of the Markov property, applied at any given return time k, and of the fact that the distribution of the evolution starting from state i at time k does not depend on k.

Fig. 2.6
figure 6

The cycles between returns to state i are i.i.d. The law of large numbers explains the convergence of the long-term fraction of time to a constant

It is easy to see that these random times have a finite mean. Indeed, fix one state i. Then, starting from any given state j, there is some minimum number M j of steps required to go to state i. Also, there is some probability p j that the Markov chain goes from j to i in M j steps. Let then M =maxj M j and p =minj p j. We can then argue that, starting from any state at time 0, there is at least a probability p that the Markov chain visits state i within M steps. If it does not, we repeat the argument starting at time M. We conclude that T i ≤ Mτ, where τ is a geometric random variable with parameter p. Hence E(T i) ≤ ME(τ) = M∕p < ∞, as claimed. Note also that \(E(T_i^4) \leq M^4 E(\tau ^4) < \infty \).

The Strong Law of Large Numbers states that

$$\displaystyle \begin{aligned} \frac{T_1 + T_2 + \cdots + T_k}{k} \rightarrow E(T_1), \mbox{ as } k \rightarrow \infty, \mbox{ with probability } 1. \end{aligned} $$
(2.5)

Thus, the long-term fraction of time that the Markov chain spends in state i is given by

$$\displaystyle \begin{aligned} \lim_{k \rightarrow \infty} \frac{k}{T_1 + T_2 + \cdots + T_k} = \frac{1}{E(T_1)}, \mbox{ with probability } 1. \end{aligned} $$
(2.6)

Let us clarify why (2.6) implies that the fraction of time in state i converges to 1∕E(T 1). Let A(n) be the number of visits to state i by time n. We want to show that A(n)∕n converges to 1∕E(T 1). Observe that

$$\displaystyle \begin{aligned} \frac{k}{T_1 + \cdots + T_{k+1}} < \frac{A(n)}{n} = \frac{k}{n} \leq \frac{k}{T_1 + \cdots + T_k} \end{aligned}$$

whenever T 1 + ⋯ + T k ≤ n < T 1 + ⋯ + T k+1. If we believe that T k+1∕k → 0 as k → ∞, the inequality above shows that

$$\displaystyle \begin{aligned} \frac{A(n)}{n} \rightarrow \frac{1}{E(T_1)}, \end{aligned}$$

as claimed. To see why T k+1∕k goes to zero, note that

$$\displaystyle \begin{aligned} P\left( \frac{T_{k+1}}{k} > \epsilon \right) \leq P\left( \frac{M \tau}{k} > \epsilon\right) \leq P(\tau > \alpha k) \leq (1 - p)^{\alpha k} \end{aligned}$$

with α = 𝜖∕M.

Thus, by the Borel–Cantelli Theorem B.1, with probability one the event T k+1∕k > 𝜖 occurs only for finitely many values of k, which proves the convergence to zero.
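The following Python sketch (ours; the value a = 0.3, the horizon, and the seed are arbitrary) simulates the two-state chain with P(0, 1) = P(1, 0) = a, records the cycle lengths between returns to state 0, and compares the long-term fraction of time in state 0 with 1∕(mean cycle length), as in (2.6).

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-state chain with P(0,1) = P(1,0) = a, started in state 0. We record
# the durations of the cycles between successive returns to 0 and compare
# the fraction of time in 0 with 1/(mean cycle length), as in (2.6).
a, steps = 0.3, 10**5
P = np.array([[1 - a, a],
              [a, 1 - a]])

x, visits, cycles, last_return = 0, 0, [], 0
for n in range(1, steps + 1):
    x = rng.choice(2, p=P[x])
    if x == 0:                      # a cycle ends at each return to state 0
        visits += 1
        cycles.append(n - last_return)
        last_return = n

print("fraction of time in state 0:", visits / steps)       # about 0.5
print("1 / (mean cycle length)    :", 1 / np.mean(cycles))  # about 0.5
```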

2.5 Proof of Big Theorem

This section presents the proof of the main result about Markov chains.

2.5.1 Proof of Theorem 1.1 (a)

Let m j be the expected return time to state j. That is,

$$\displaystyle \begin{aligned} m_j = E[T_j | X(0) = j] \mbox{ with } T_j = \min\{n > 0 | X(n) = j \}. \end{aligned}$$

We show that π(j) = 1∕m j, j = 1, …, N is the unique invariant distribution if the Markov chain is irreducible.

During times n = 1, …, N, where N ≫ 1, the Markov chain visits state j a fraction 1∕m j of the time. A fraction P(j, i) of those visits is immediately followed by a visit to state i. Thus, a fraction (1∕m j)P(j, i) of the time, the Markov chain visits j and then i in successive steps. Summing over j, we find the fraction of the time that the Markov chain visits i. Thus,

$$\displaystyle \begin{aligned} \sum_j \frac{1}{m_j} P(j, i) = \frac{1}{m_i}. \end{aligned}$$

Hence, there is an invariant distribution π and it is given by π(i) = 1∕m i, which is the long-term fraction of time that the Markov chain spends in state i.

To show that the invariant distribution is unique, assume that there is another one, say ϕ(i). Start the Markov chain with that distribution. Then, by the law of large numbers for Markov chains of Sect. 2.4, with probability one,

$$\displaystyle \begin{aligned} \frac{1}{N} \sum_{n=0}^{N-1} 1\{X(n) = i\} \rightarrow \pi(i). \end{aligned}$$

However, the left-hand side is bounded by one and, when the Markov chain starts with the invariant distribution ϕ, its expectation equals ϕ(i) for every N, since P(X(n) = i) = ϕ(i) for all n. By convergence of expectation ((B.8)–(B.9)), the expectation of the left-hand side also converges to π(i). Thus, ϕ(i) = π(i) and the invariant distribution is unique.
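As a quick numerical check of π(i) = 1∕m i, the Python sketch below (ours; the 3-state transition matrix, run count, and seed are assumptions) computes π as the left eigenvector of P for eigenvalue 1 and estimates the mean return times m j by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

# An assumed 3-state irreducible chain, used only for this check.
P = np.array([[0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7],
              [0.6, 0.4, 0.0]])

# Invariant distribution: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

def mean_return_time(j, runs=5000):
    """Estimate m_j = E[T_j | X(0) = j] by simulation."""
    total = 0
    for _ in range(runs):
        x, t = j, 0
        while True:
            x = rng.choice(3, p=P[x])
            t += 1
            if x == j:
                break
        total += t
    return total / runs

print("pi          :", np.round(pi, 3))
print("1/m_j (sim.):", [round(1 / mean_return_time(j), 3) for j in range(3)])
```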

2.5.2 Proof of Theorem 1.1 (b)

If the Markov chain is irreducible but not aperiodic, then π n may not converge to the invariant distribution π. For instance, if the Markov chain alternates between 0 and 1 and starts from 0, then π n = [1, 0] for n even and π n = [0, 1] for n odd, so that π n does not converge to π = [0.5, 0.5].

If the Markov chain is aperiodic, π n → π. Moreover, the convergence is geometric. We first illustrate the argument on a simple example shown in Fig. 2.7. Consider the number of steps to go from 1 to 1. Note that

$$\displaystyle \begin{aligned} \{n > 0 | P^n(1,1) > 0\} = \{3, 4, 6, 7, 8, 9, 10, \ldots \}. \end{aligned}$$
Fig. 2.7
figure 7

An aperiodic Markov chain

Thus, P n(1, 1) > 0 if n ≥ 6. Now, P[X(2) = 1|X(0) = 2] > 0, so that P[X(n) = 1|X(0) = 2] > 0 for n ≥ 8. Indeed, if n ≥ 8, then X can go from 2 to 1 in two steps and then from 1 to 1 in n − 2 steps. The argument is similar for the other states and we find that there is some M > 0 and some p > 0 such that

$$\displaystyle \begin{aligned} P[X(M) = 1 | X(0) = i] \geq p, i = 1, 2, 3, 4. \end{aligned}$$

Now, consider two copies of the Markov chain: {X(n), n ≥ 0} and {Y (n), n ≥ 0}. One chooses X(0) with distribution π 0 and Y (0) with the invariant distribution π. The two Markov chains evolve independently initially. We define

$$\displaystyle \begin{aligned} \tau = \min\{n > 0 | X(n) = Y(n)\}. \end{aligned}$$

In view of the observation above,

$$\displaystyle \begin{aligned} P(X(M) = 1 \mbox{ and } Y(M) = 1) \geq p^2. \end{aligned}$$

Thus, P(τ > M) ≤ 1 − p 2. If τ > M, then the two Markov chains have not met yet by time M. Using the same argument as before, we see that they have a probability at least p 2 of meeting in the next M steps. Thus,

$$\displaystyle \begin{aligned} P( \tau > kM ) \leq \big(1 - p^2\big)^k. \end{aligned}$$

Now, modify X(n) by gluing it to Y (n) after time τ. This coupling operation does not change the fact that X(n) still evolves according to the transition matrix P, so that P(X(n) = i) = π n(i) where π n = π 0 P n.

Now,

$$\displaystyle \begin{aligned} \sum_i |P(X(n) = i) - P(Y(n) = i)| \leq 2 P(X(n) \neq Y(n)) \leq 2 P(\tau > n). \end{aligned}$$

Hence,

$$\displaystyle \begin{aligned} \sum_i | \pi_n(i) - \pi(i)| \leq 2 P(\tau > n), \end{aligned}$$

and this implies that

$$\displaystyle \begin{aligned} \sum_i | \pi_n(i) - \pi(i)| \leq 2\big(1 - p^2\big)^k \mbox{ if } n > kM. \end{aligned}$$

To extend this argument to a general aperiodic Markov chain, we need the fact that for each state i there is some integer n i such that P n(i, i) > 0 for all n ≥ n i. We prove that fact as Lemma 2.3 in the following section.
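Here is a Python sketch of the coupling construction (ours; the 3-state aperiodic chain and all parameters are assumptions): X(0) = 0, Y (0) is drawn from π, the two chains move independently until the coupling time τ, and the tail P(τ > n), which bounds \(\sum _i |\pi _n(i) - \pi (i)|/2\), is estimated by simulation; it indeed decays geometrically.

```python
import numpy as np

rng = np.random.default_rng(4)

# An assumed irreducible, aperiodic 3-state chain for this experiment.
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.5, 0.5],
              [0.6, 0.0, 0.4]])
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()                       # invariant distribution of P

def coupling_time():
    """X starts in state 0, Y starts from pi; both move independently
    until they first agree, which defines the coupling time tau."""
    x, y, tau = 0, rng.choice(3, p=pi), 0
    while x != y:
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
        tau += 1
    return tau

taus = np.array([coupling_time() for _ in range(5000)])
for n in (1, 5, 10, 20):
    print(f"P(tau > {n}) ~ {np.mean(taus > n):.4f}")   # geometric decay
```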

2.5.3 Periodicity

We start with a property of the set of return times of an irreducible Markov chain.

Lemma 2.1

Fix a state i and let S := {n > 0|P n(i, i) > 0} and d = g.c.d.(S). There must be two integers n and n + d in the set S.

Proof

The trick is clever. We first illustrate it on an example. Assume S = {9, 15, 21, …} with d = g.c.d.(S) = 3. There must be a, b ∈ S with g.c.d.{a, b} = 3. Otherwise, the gcd of S would not be 3. Here, we can choose a = 15 and b = 21. Now, consider the following operations:

$$\displaystyle \begin{aligned} (a, b) = (15, 21) \rightarrow (6, 15) \rightarrow (6, 9) \rightarrow (3, 6) \rightarrow (3, 3). \end{aligned}$$

At each step, we go from (x, y) with x ≤ y to the ordered pair of {x, y − x}. Note that at each step, each term in the pair is an integer linear combination of a and b. For instance, (6, 15) = (b − a, a). Then, (6, 9) = (b − a, a − (b − a)) = (b − a, 2a − b), and so on. Also, the terms keep decreasing, so the procedure must eventually reach some pair (x, x). We claim that x = 3. Indeed, working backwards, both terms of every pair along the way are multiples of x (if u and v − u are multiples of x, so is v); in particular, a and b are multiples of x, so that x divides g.c.d.{a, b} = 3. On the other hand, x is an integer linear combination of a and b, so 3 = g.c.d.{a, b} divides x. Hence x = 3.

From this construction, since at each step the terms are integer linear combinations of a and b, we see that

$$\displaystyle \begin{aligned} 3 = ma + nb \end{aligned}$$

for some integers m and n. Thus,

$$\displaystyle \begin{aligned} 3 = m^+a + n^+b - m^-a - n^-b, \end{aligned}$$

where \(m^+ = \max \{m, 0\}\) and \(m^- = m^+ - m\), and similarly for n + and n −. Now we can choose

$$\displaystyle \begin{aligned} N = m^-a + n^-b \mbox{ and } N + 3 = m^+a + n^+b. \end{aligned}$$

The last step of the argument is to notice that if a, b ∈ S, then αa + βb ∈ S for any nonnegative integers α and β that are not both zero. This fact follows from the definition of S as the set of return times from i to i: one can go from i to i in αa + βb steps by concatenating α loops of length a and β loops of length b. Hence, both N and N + 3 are in S.

The proof for a general set S with gcd equal to d is identical. □
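The construction can be checked in a few lines of Python (ours): the extended Euclid algorithm produces integers m and n with d = ma + nb, and splitting them into positive and negative parts exhibits N and N + d as nonnegative combinations of a and b.

```python
# A sketch of the construction in the proof of Lemma 2.1 (ours): extended
# Euclid gives integers m, n with d = m*a + n*b; splitting m and n into
# positive and negative parts exhibits N and N + d in S.
def extended_gcd(a, b):
    """Return (d, m, n) with d = gcd(a, b) = m*a + n*b."""
    if b == 0:
        return a, 1, 0
    d, m, n = extended_gcd(b, a % b)
    return d, n, m - (a // b) * n

a, b = 15, 21                         # the example from the text
d, m, n = extended_gcd(a, b)
mpos, mneg = max(m, 0), max(-m, 0)    # m = mpos - mneg
npos, nneg = max(n, 0), max(-n, 0)    # n = npos - nneg
N = mneg * a + nneg * b
assert N + d == mpos * a + npos * b
print(d, N, N + d)                    # prints: 3 42 45
```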

This result enables us to show that the period of a Markov chain is well-defined.

Lemma 2.2

For an irreducible Markov chain, d(i) defined in (1.6) has the same value for all states.

Proof

Pick j ≠ i. We show that d(j) ≤ d(i). This suffices to prove the lemma, since by symmetry one also has d(i) ≤ d(j).

By irreducibility, P m(j, i) > 0 for some m and P n(i, j) > 0 for some n. Now, by definition of d(i) and by the previous lemma, there is some integer N such that P N(i, i) > 0 and P N+d(i)(i, i) > 0. But then,

$$\displaystyle \begin{aligned} P^{m + N + n}(j, j) > 0 \mbox{ and } P^{m + N + d(i) + n}(j, j) > 0. \end{aligned}$$

This implies that the integers K := m + N + n and K + d(i) are both in S := {n > 0|P n(j, j) > 0}. Since d(j) divides every element of S, it divides both K and K + d(i), and hence their difference d(i). This shows that

$$\displaystyle \begin{aligned} d(j) := g.c.d.(S) \leq d(i).\end{aligned} $$

The following fact then suffices for our proof of convergence, as we explained in the example.

Lemma 2.3

Let X be an irreducible aperiodic Markov chain. Fix a state i and let S = {n > 0|P n(i, i) > 0}. Then, there is some n i such that n ∈ S for all n ≥ n i.

Proof

We know from Lemma 2.1, applied with d = 1 since the Markov chain is aperiodic, that there is some integer N such that N, N + 1 ∈ S. We claim that

$$\displaystyle \begin{aligned} n \in S, \forall n > N^2. \end{aligned}$$

To see this, first note that for m > N − 1 one has

$$\displaystyle \begin{aligned} & mN + 0 = mN, \\ & mN +1 = (m - 1)N + (N+1), \\ & mN +2 = (m-2)N + 2(N +1), \\ & \ldots, \\ & mN + N - 1 = (m-N+1)N + (N - 1)(N + 1). \end{aligned} $$

Now, for n > N 2 one can write

$$\displaystyle \begin{aligned} n = mN + k \end{aligned}$$

for some k ∈{0, 1, …, N − 1} and m > N − 1. Thus, n is a nonnegative integer combination of N and N + 1, which are both in S, so that n ∈ S. □
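A short Python check of this claim (ours; the value N = 7 is arbitrary) verifies that every n > N^2 is a nonnegative integer combination of N and N + 1.

```python
# A quick check of the claim in Lemma 2.3: every n > N^2 is a nonnegative
# integer combination of N and N + 1. The value N = 7 is an arbitrary choice.
def representable(n, N):
    """True if n = x*N + y*(N+1) for some integers x, y >= 0."""
    return any((n - y * (N + 1)) % N == 0 for y in range(n // (N + 1) + 1))

N = 7
assert all(representable(n, N) for n in range(N * N + 1, N * N + 500))
print("every n in (", N * N, ",", N * N + 500, ") is representable")
```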

2.6 Summary

  • Sample Space;

  • Laws of Large Numbers: SLLN and WLLN;

  • WLLN from Chebyshev’s Inequality;

  • SLLN from Borel–Cantelli and fourth moment bound;

  • SLLN for Markov chains using the i.i.d. return times to a state;

  • Proof of Big Theorem.

2.6.1 Key Equations and Formulas

Table 1 Key equations and formulas of this chapter:

  • Chebyshev’s inequality: P(|Y n − E(Y n)| ≥ 𝜖) ≤ var(Y n)∕𝜖^2 — (2.2);

  • WLLN: \(Y_n \overset {p}{\rightarrow } \mu \) — (2.4);

  • SLLN: Y n → μ with probability 1 — Theorem 2.2;

  • LLN for Markov chains: the long-term fraction of time in state i is 1∕E(T i) — (2.6).

2.7 References

An excellent text on Markov Chains is Chung (1967). A more advanced text on probability theory is Billingsley (2012).

2.8 Problems

Problem 2.1

Consider a Markov chain X n that takes values in {0, 1}. Explain why {0, 1} is not its sample space.

Problem 2.2

Consider again a Markov chain that takes values in {0, 1} with P(0, 1) = a and P(1, 0) = b. Exhibit two different sample spaces for that Markov chain, and specify the probability on each.

Problem 2.3

Draw the smallest periodic Markov chain. Show that the fraction of time in the states converges but the probability of being in a state at time n does not converge.

Problem 2.4

For the Markov chain in Problem 2.2, calculate the eigenvalues and use them to get a bound on the distance between the distribution at time n and the invariant distribution.

Problem 2.5

Why does the strong law imply the weak law? More concretely, let X n, X be random variables such that X n → X almost surely. Show that X n → X in probability.

Hint

Fix 𝜖 > 0 and define Z n = 1{|X n − X|≥ 𝜖}. Use the Dominated Convergence Theorem (DCT) to show that E(Z n) → 0 as n → ∞ if X n → X almost surely.

Problem 2.6

Draw a Markov chain with four states that is irreducible and aperiodic. Consider two independent versions of the Markov chain: one that starts in state 1, the other in state 2. Explain why they will meet after a finite time.

Problem 2.7

Consider the Markov chain of Fig. 1.2. Use Python to calculate the eigenvalues of P. Let λ be the largest absolute value of the eigenvalues other than 1. Use Python to calculate

$$\displaystyle \begin{aligned} d(n) := \sum_i |\pi(i) - \pi_n(i)|, \end{aligned}$$

where π 0(A) = 1. Plot d(n) and λ n as functions of n.

Problem 2.8

You flip a fair coin. If the outcome is “head,” you get a random amount of money equal to X, and if it is “tail,” you get a random amount Y. Prove formally that, on average, you get

$$\displaystyle \begin{aligned} \frac{1}{2} E(X) + \frac{1}{2} E(Y). \end{aligned}$$

Problem 2.9

Can you find random variables that converge to 0 almost surely, but not in probability?

Problem 2.10

Let {X n, n ≥ 1} be i.i.d. zero-mean random variables with variance σ 2. Show that X n∕n → 0 with probability one as n → ∞.

Hint

Borel–Cantelli.

Problem 2.11

Let X n be a finite irreducible Markov chain on \({\mathcal {X}}\) with invariant distribution π and \(f: {\mathcal {X}} \to \Re \) some function. Show that

$$\displaystyle \begin{aligned} \frac{1}{N} \sum_{n=0}^{N-1} f(X_n) \to \sum_{i \in {\mathcal{X}}} \pi(i) f(i) \mbox{ w.p. } 1, \mbox{ as } N \to \infty. \end{aligned}$$