Abstract
In this chapter, we take a second look at the concepts that we introduced in PageRank-A. The main ideas concern the long-term behavior of Markov chains and, in particular, the significance of the invariant distribution. As explained in the Introduction, part B of each chapter can be omitted in an introductory course.
Section 2.1 discusses the sample space of a Markov chain: the outcomes and their probability. Section 2.2 explains the meaning of the weak and strong laws of large numbers. Section 2.3 proves these results for independent and identically distributed random variables. Section 2.4 proves the strong law of large numbers for Markov chains. Section 2.5 concludes the chapter by proving the Big Theorem for Markov chains.
Topics: Sample Space, Trajectories; Laws of Large Numbers: WLLN, SLLN; Proof of Big Theorem.
2.1 Sample Space
Let us connect the definition of a Markov chain X = {X n, n ≥ 0} with the general framework of Sect. B.1. (We write X n or X(n).) In that section, we explained that a random experiment is described by a sample space. The elements of the sample space are the possible outcomes of the experiment. A probability is defined on subsets, called events, of that sample space. Random variables are real-valued functions of the outcome of the experiment.
To clarify these concepts, consider the case where the X n are i.i.d. Bernoulli random variables with P(X n = 1) = P(X n = 0) = 0.5. These random variables describe flips of a fair coin. The random experiment is to flip the coin repeatedly, forever. Thus, one possible outcome of this experiment is an infinite sequence of 0’s and 1’s. Note that an outcome is not 0 or 1: it is an infinite sequence since the outcome specifies what happens when we flip the coin forever. Thus, the set Ω of outcomes is the set {0, 1}∞ of infinite sequences of 0’s and 1’s. If ω is one such sequence, we have ω = (ω 0, ω 1, …) where ω n ∈{0, 1}. It is then natural to define X n(ω) = ω n, which simply says that X n is the outcome of flip n, for n ≥ 0. Hence \(X_n(\omega ) \in \Re \) for all ω ∈ Ω and we see that each X n is a real-valued function defined on Ω. For instance, X 0(1101001…) = 1 since ω 0 = 1 when ω = 1101001… . Similarly, X 1(1101001…) = 1 and X 2(1101001…) = 0. To specify the random experiment, it remains to define the probability on Ω. The simplest way is to say that
$$\displaystyle \begin{aligned} P(X_0 = a, X_1 = b, \ldots, X_n = z) = 0.5^{\,n+1} \end{aligned}$$
for all n ≥ 0 and a, b, …, z ∈{0, 1}. For instance,
$$\displaystyle \begin{aligned} P(X_0 = 1, X_1 = 1, X_2 = 0) = 0.5^3 = 0.125. \end{aligned}$$
Similarly,
$$\displaystyle \begin{aligned} P(X_0 = 0, X_1 = 1, X_2 = 1, X_3 = 0) = 0.5^4 = 0.0625. \end{aligned}$$
Observe that we define the probability of a set of outcomes, or event, {ω|ω 0 = a, ω 1 = b, …, ω n = z} instead of specifying the probability of each outcome ω. The reason is that the probability that we observe a specific infinite sequence of 0’s and 1’s is zero. That is, P({ω}) = 0 for all ω ∈ Ω. Such a description does not tell us much about the coin flips! For instance, it does not specify the bias of the coin, or the fact that successive flips are independent. Hence, the correct way to proceed is to specify the probability of events, which are sets of outcomes, instead of the probability of individual outcomes.
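To make the distinction between outcomes and random variables concrete, here is a small Python sketch (the names are ours, and ω is truncated to a finite prefix since a computer cannot store an infinite sequence):

```python
# An outcome omega is (a prefix of) an infinite sequence of 0's and 1's;
# the random variable X_n simply reads coordinate n of the outcome.
def X(n, omega):
    """X_n(omega) = omega_n, the result of flip n."""
    return omega[n]

omega = (1, 1, 0, 1, 0, 0, 1)  # a prefix of the outcome 1101001...
print(X(0, omega), X(1, omega), X(2, omega))  # -> 1 1 0
```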
For a Markov chain, there is some sample space Ω and each X n is a function X n(ω) of the outcome ω that takes values in \({\mathcal {X}}\). A probability is defined on subsets of Ω.
In this example, one can choose Ω to be the set of possible infinite sequences of symbols in \({\mathcal {X}}\). That is, \(\varOmega = {\mathcal {X}}^\infty \) and an element ω ∈ Ω is ω = (ω 0, ω 1, …) with \(\omega _n \in {\mathcal {X}}\) for n ≥ 0. With this choice, one has X n(ω) = ω n for n ≥ 0 and ω ∈ Ω, as shown in Fig. 2.1. This choice of Ω, similar to what we did for the coin flips, is called the canonical sample space. Thus, an outcome is the actual sequence of values of the Markov chain, called the trajectory, or realization of the Markov chain. It remains to specify the probability of events in Ω. The trick here is that the probability that the Markov chain follows a specific infinite sequence is 0, similarly to the probability that coin flips follow a specific infinite sequence such as all heads. Thus, one should specify the probability of subsets of Ω, not of individual outcomes. One specifies that
$$\displaystyle \begin{aligned} P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \pi_0(i_0) P(i_0, i_1) \cdots P(i_{n-1}, i_n) \qquad (2.1) \end{aligned}$$
for all n ≥ 0 and \(i_0, i_1, \ldots, i_n\) in \({\mathcal {X}}\). Here, \(\pi_0(i_0)\) is the probability that the Markov chain starts in state \(i_0\).
This identity is equivalent to (1.3). Indeed, if we let
$$\displaystyle \begin{aligned} A_n = \{X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n\} \end{aligned}$$
and
$$\displaystyle \begin{aligned} P(A_n) = \pi_0(i_0) P(i_0, i_1) \cdots P(i_{n-1}, i_n), \end{aligned}$$
then
$$\displaystyle \begin{aligned} P(A_{n+1}) = P(A_n) P[X_{n+1} = i_{n+1} \mid A_n] = P(A_n) P(i_n, i_{n+1}) \end{aligned}$$
by (1.3), so that (2.1) holds by induction on n.
Thus, one has defined the probability of events characterized by the first n + 1 values of the Markov chain. It turns out that there is one probability on Ω that is consistent with these values.
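The product formula (2.1) translates directly into code. The sketch below computes the probability of a finite trajectory prefix from the initial distribution and the transition matrix; the two-state chain is our own illustrative example, not one from the text:

```python
def prefix_prob(pi0, P, states):
    """P(X_0 = i_0, ..., X_n = i_n) = pi0(i_0) P(i_0,i_1) ... P(i_{n-1},i_n)."""
    p = pi0[states[0]]
    for i, j in zip(states, states[1:]):
        p *= P[i][j]
    return p

pi0 = [0.5, 0.5]       # initial distribution (illustrative)
P = [[0.7, 0.3],       # transition matrix (illustrative)
     [0.4, 0.6]]
print(prefix_prob(pi0, P, [0, 0, 1]))  # 0.5 * 0.7 * 0.3 = 0.105
```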
2.2 Laws of Large Numbers for Coin Flips
Before we discuss the case of Markov chains, let us consider the simpler example of coin flips. Let then {X n, n ≥ 0} be i.i.d. Bernoulli random variables with P(X n = 0) = P(X n = 1) = 0.5, as in the previous section. We think of X n = 1 if flip n yields heads and X n = 0 if it yields tails. We want to show that, as we keep flipping the coin, the fraction of heads approaches 50%. There are two statements that make this idea precise.
2.2.1 Convergence in Probability
The first statement, called the Weak Law of Large Numbers (WLLN), says that it is very unlikely that the fraction of heads in n coin flips differs from 50% by even a small amount, say 1%, if n is large. For instance, let \(n = 10^5\). We want to show that the likelihood that the fraction of heads among \(10^5\) flips is more than 51% or less than 49% is small. Moreover, this likelihood can be made as small as we wish if we flip the coin more times.
To show this, let
$$\displaystyle \begin{aligned} Y_n = \frac{X_0 + X_1 + \cdots + X_{n-1}}{n} \end{aligned}$$
be the fraction of heads in the first n flips. We claim that
$$\displaystyle \begin{aligned} P(|Y_n - E(Y_n)| \geq \epsilon) \leq \frac{\operatorname{var}(Y_n)}{\epsilon^2}. \qquad (2.2) \end{aligned}$$
This result is called Chebyshev’s inequality (Fig. 2.2).
To see (2.2), observe that (see footnote 1)
$$\displaystyle \begin{aligned} 1\{|Y_n - E(Y_n)| \geq \epsilon\} \leq \frac{(Y_n - E(Y_n))^2}{\epsilon^2}. \qquad (2.3) \end{aligned}$$
Indeed, if |Y n − E(Y n)|≥ 𝜖, then (Y n − E(Y n))2 ≥ 𝜖 2, so that if the left-hand side of inequality (2.3) is one, the right-hand side is at least equal to one. Also, if the left-hand side is zero, it is less than or equal to the right-hand side. Thus, (2.3) holds and (2.2) follows by taking the expected values in (2.3), since E(1A) = P(A) and E((Y n − E(Y n))2) = var(Y n) and since expectation is monotone (B.2).
Now, E(Y n) = 0.5 and
$$\displaystyle \begin{aligned} \operatorname{var}(Y_n) = \frac{1}{4n}. \end{aligned}$$
To see this, recall that if one multiplies a random variable by a, its variance is multiplied by a 2 (see (B.3)). Also, the variance of a sum of independent random variables is the sum of their variances (see Theorem B.4). Hence,
$$\displaystyle \begin{aligned} \operatorname{var}(Y_n) = \frac{1}{n^2} \left( \operatorname{var}(X_0) + \cdots + \operatorname{var}(X_{n-1}) \right) = \frac{\operatorname{var}(X_0)}{n}. \end{aligned}$$
Since \(X_0 =_D B(0.5)\), we find that
$$\displaystyle \begin{aligned} \operatorname{var}(X_0) = E(X_0^2) - E(X_0)^2 = 0.5 - (0.5)^2 = \frac{1}{4}. \end{aligned}$$
Thus,
$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \leq \frac{1}{4 n \epsilon^2}. \end{aligned}$$
In particular, if we choose 𝜖 = 1% = 0.01, we find
$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq 0.01) \leq \frac{2500}{n}, \end{aligned}$$
which is at most 2.5% for \(n = 10^5\). More generally, we have shown that
$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \to 0, \text{ as } n \to \infty, \text{ for every } \epsilon > 0. \end{aligned}$$
This is the WLLN.
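The Chebyshev bound can be checked by simulation. The following Monte Carlo sketch (with sample sizes chosen by us for speed) estimates P(|Y n − 0.5| ≥ 𝜖) and compares it with the bound 1∕(4n𝜖 2):

```python
import random

random.seed(0)
n, eps, trials = 10_000, 0.01, 200
bad = 0
for _ in range(trials):
    # Y_n: fraction of heads among n fair flips
    heads = sum(random.getrandbits(1) for _ in range(n))
    if abs(heads / n - 0.5) >= eps:
        bad += 1

empirical = bad / trials
chebyshev = 1 / (4 * n * eps**2)   # = 0.25 for these values
print(empirical, chebyshev)        # empirical frequency vs. the bound
assert empirical <= chebyshev
```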
2.2.2 Almost Sure Convergence
The second statement is the Strong Law of Large Numbers (SLLN). It says that, for all the sequences of coin flips we will ever observe, the fraction Y n actually converges to 50% as we keep on flipping the coin.
There are many sequences of coin flips for which the fraction of heads does not approach 50%. For instance, the sequence that yields heads for every flip is such that Y n = 1 for all n and thus Y n does not converge to 50%. Similarly, the sequence 001001001001001… is such that Y n approaches 1∕3 and not 50%. What the SLLN implies is that all those sequences such that Y n does not converge to 50% have probability 0: they will never be observed.
Thus, this statement is very deep because there are so many sequences to rule out. Keeping track of all of them seems rather formidable. Indeed, the proof of this statement is quite clever. Here is how it proceeds. Note that
$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \leq \frac{E((Y_n - 0.5)^4)}{\epsilon^4}. \end{aligned}$$
Indeed,
$$\displaystyle \begin{aligned} 1\{|Y_n - 0.5| \geq \epsilon\} \leq \frac{(Y_n - 0.5)^4}{\epsilon^4} \end{aligned}$$
and the previous inequality follows by taking expectations. Now,
$$\displaystyle \begin{aligned} E((Y_n - 0.5)^4) = \frac{1}{n^4}\, E\big( (X_0 + \cdots + X_{n-1} - 0.5\,n)^4 \big). \end{aligned}$$
Also, with \(Z_m = X_m - 0.5\), one has
$$\displaystyle \begin{aligned} E\big( (X_0 + \cdots + X_{n-1} - 0.5\,n)^4 \big) = E\big( (Z_0 + \cdots + Z_{n-1})^4 \big) = \sum E(Z_a Z_b Z_c Z_d), \end{aligned}$$
where the sum is over all a, b, c, d ∈{0, 1, …, n − 1}. This sum consists of n terms \(Z_a^4\), 3n(n − 1) terms \(Z_a^2 Z_b^2\) with a ≠ b, and other terms where at least one factor Z a is not repeated. The latter terms have zero mean since E(Z a Z b Z c Z d) = E(Z a)E(Z b Z c Z d) = 0, by independence, whenever b, c, and d are all different from a. Consequently,
$$\displaystyle \begin{aligned} E\big( (Z_0 + \cdots + Z_{n-1})^4 \big) = n \alpha + 3 n (n-1) \beta \end{aligned}$$
with \(\alpha = E(Z_0^4)\) and \(\beta = E(Z_0^2 Z_1^2)\). Hence, substituting the result of this calculation in the previous expressions, we find that
$$\displaystyle \begin{aligned} P(|Y_n - 0.5| \geq \epsilon) \leq \frac{n \alpha + 3n(n-1)\beta}{n^4 \epsilon^4} \leq \frac{C}{n^2 \epsilon^4} \ \text{ for some finite constant } C. \end{aligned}$$
This inequality implies that (see footnote 2)
$$\displaystyle \begin{aligned} \sum_{n=1}^\infty P(|Y_n - 0.5| \geq \epsilon) < \infty. \end{aligned}$$
This expression shows that the events A n := {|Y n − 0.5|≥ 𝜖} have probabilities that add up to a finite number. From the Borel–Cantelli Theorem B.1, we conclude that
$$\displaystyle \begin{aligned} P(A_n \text{ occurs for infinitely many } n) = 0. \end{aligned}$$
This result says that, with probability one, ω belongs only to finitely many A n’s. Hence (see footnote 3), with probability one, there is some n(ω) so that ω∉A n for n ≥ n(ω). That is,
$$\displaystyle \begin{aligned} |Y_n(\omega) - 0.5| < \epsilon, \quad \text{for all } n \geq n(\omega). \end{aligned}$$
Since this property holds for an arbitrary 𝜖 > 0, we conclude that, with probability one,
$$\displaystyle \begin{aligned} Y_n \to 0.5, \text{ as } n \to \infty. \end{aligned}$$
Indeed, if Y n(ω) does not converge to 50%, there must be some 𝜖 > 0 so that |Y n − 0.5| > 𝜖 for infinitely many n’s and we have seen that this is not the case.
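A short simulation illustrates the SLLN: along a single simulated outcome ω, the running fraction of heads approaches 0.5 (the horizon, seed, and checkpoints are our choices):

```python
import random

random.seed(1)
heads, checkpoints, observed = 0, {100, 10_000, 200_000}, {}
for n in range(1, 200_001):
    heads += random.getrandbits(1)   # flip n
    if n in checkpoints:
        observed[n] = heads / n      # running fraction Y_n

print(observed)  # Y_n along this one outcome, at three values of n
assert abs(observed[200_000] - 0.5) < 0.01   # typical for this n
```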
2.3 Laws of Large Numbers for i.i.d. RVs
The results that we proved for coin flips extend to i.i.d. random variables {X n, n ≥ 0} to show that
$$\displaystyle \begin{aligned} Y_n := \frac{X_0 + X_1 + \cdots + X_{n-1}}{n} \end{aligned}$$
approaches \(E(X_0)\) as n →∞. As for coin flips, there are two ways of making that statement precise.
2.3.1 Weak Law of Large Numbers
We need a definition.
Definition 2.1 (Convergence in Probability)
Let X n, n ≥ 0 and X be random variables defined on a common probability space. One says that X n converges in probability to X, and one writes \(X_n \overset {p}{\rightarrow } X\) if, for all 𝜖 > 0,
$$\displaystyle \begin{aligned} P(|X_n - X| \geq \epsilon) \to 0, \text{ as } n \to \infty. \end{aligned}$$
◇
The Weak Law of Large Numbers (WLLN) is the following result.
Theorem 2.1 (Weak Law of Large Numbers)
Let {X n, n ≥ 0} be a sequence of i.i.d. random variables with mean μ. Then
$$\displaystyle \begin{aligned} \frac{X_0 + X_1 + \cdots + X_{n-1}}{n} \overset{p}{\rightarrow} \mu, \text{ as } n \to \infty. \end{aligned}$$
\({\blacksquare }\)
Proof
Assume that \(E(X_n^2) < \infty \). The proof is then the same as for coin flips and is left as an exercise. For the general case, see Theorem 15.14. □
The first result of this type was proved by Jacob Bernoulli (Fig. 2.3).
2.3.2 Strong Law of Large Numbers
We again need a definition.
Definition 2.2 (Almost Sure Convergence)
Let X n, n ≥ 0 and X be random variables defined on a common probability space. One says that X n converges almost surely to X as n →∞, and one writes X n → X, a.s., if
$$\displaystyle \begin{aligned} P(\{\omega \mid X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = 1. \end{aligned}$$
◇
Thus, this convergence means that the sequence of real numbers X n(ω) converges to the real number X(ω) as n →∞, with probability one.
Let {X n, n ≥ 0} be as in the statement of Theorem 2.1. We have the following result (see footnote 4).
Theorem 2.2 (Strong Law of Large Numbers)
Let {X n, n ≥ 0} be a sequence of i.i.d. random variables with mean μ. Then
$$\displaystyle \begin{aligned} \frac{X_0 + X_1 + \cdots + X_{n-1}}{n} \to \mu \text{ as } n \to \infty, \text{ a.s.} \end{aligned}$$
\({\blacksquare }\)
Thus, the sample mean values Y n := (X 0 + ⋯ + X n−1)∕n converge to the expected value, with probability 1. (See Fig. 2.4.)
Proof
Assume that
$$\displaystyle \begin{aligned} E(X_n^4) < \infty. \end{aligned}$$
The proof is then the same as for coin flips and is left as an exercise. The proof of the SLLN in the general case is given in Theorem 15.14. □
Figure 2.5 illustrates the SLLN and WLLN. The SLLN states that the sample means of i.i.d. random variables converge to the mean, with probability one. The WLLN says that as the number of samples increases, the fraction of realizations where the sample mean differs from the mean by some amount gets small.
2.4 Law of Large Numbers for Markov Chains
The long-term fraction of time that a finite irreducible Markov chain spends in a given state is the invariant probability of that state. For instance, a Markov chain X(n) on {0, 1} with P(0, 1) = a = P(1, 0) with a ∈ (0, 1] spends half of the time in state 0, in the long term. The Markov chain in Fig. 1.2 spends a fraction 12∕39 of the time in state A, in the long term.
To understand this property, one should look at the returns to state i, as shown in Fig. 2.6. The figure shows a particular sequence of values of X(n) and it decomposes this sequence into cycles between successive returns to a given state i. A new cycle starts when the Markov chain comes back to i. The durations of these successive cycles, T 1, T 2, T 3, …, are independent and identically distributed, because the Markov chain starts afresh from state i at each return time, independently of the previous states. This is a consequence of the Markov property for any given value k of T n and of the fact that the distribution of the evolution starting from state i at time k does not depend on k.
It is easy to see that these random times have a finite mean. Indeed, fix one state i. Then, starting from any given state j, there is some minimum number M j of steps required to go to state i. Also, there is some probability p j that the Markov chain will go from j to i in M j steps. Let then M =maxj M j and p =minj p j. We can then argue that, starting from any state at time 0, there is at least a probability p that the Markov chain visits state i after at most M steps. If it does not, we repeat the argument starting at time M. We conclude that T i ≤ Mτ where τ is a geometric random variable with parameter p. Hence E(T i) ≤ ME(τ) = M∕p < ∞, as claimed. Note also that \(E(T_i^4) \leq M^4 E(\tau ^4) < \infty \).
The Strong Law of Large Numbers states that
$$\displaystyle \begin{aligned} \frac{T_1 + T_2 + \cdots + T_k}{k} \to E(T_1) \text{ as } k \to \infty, \text{ a.s.} \qquad (2.6) \end{aligned}$$
Thus, the long-term fraction of time that the Markov chain spends in state i is given by
$$\displaystyle \begin{aligned} \frac{1}{E(T_1)}. \end{aligned}$$
Let us clarify why (2.6) implies that the fraction of time in state i converges to 1∕E(T 1). Let A(n) be the number of visits to state i by time n. We want to show that A(n)∕n converges to 1∕E(T 1). Then, A(n) = k and
$$\displaystyle \begin{aligned} \frac{k}{T_1 + \cdots + T_k + T_{k+1}} \leq \frac{A(n)}{n} \leq \frac{k}{T_1 + \cdots + T_k} \end{aligned}$$
whenever T 1 + ⋯ + T k ≤ n < T 1 + ⋯ + T k+1. If we believe that T k+1∕k → 0 as k →∞, the inequality above shows that
$$\displaystyle \begin{aligned} \frac{A(n)}{n} \to \frac{1}{E(T_1)}, \text{ as } n \to \infty, \end{aligned}$$
as claimed. To see why T k+1∕k goes to zero, note that
$$\displaystyle \begin{aligned} P\left( \frac{T_{k+1}}{k} > \epsilon \right) \leq P(M \tau > k \epsilon) = P(\tau > \alpha k) \leq (1 - p)^{\alpha k - 1}, \end{aligned}$$
with α = 𝜖∕M.
Thus, by Borel–Cantelli Theorem B.1, the event T k+1∕k > 𝜖 occurs only for finitely many values of k, which proves the convergence to zero.
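The cycle argument can be tested numerically. This sketch simulates the two-state chain with P(0, 1) = P(1, 0) = a (a value we pick arbitrarily), records the return times to state 0, and checks that the long-run fraction of time in state 0 agrees with 1∕E(T 1):

```python
import random

random.seed(2)
a, steps = 0.3, 200_000          # P(0,1) = P(1,0) = a (our choice)
x, visits, last, cycles = 0, 0, 0, []
for n in range(1, steps + 1):
    if x == 0:
        x = 1 if random.random() < a else 0
    else:
        x = 0 if random.random() < a else 1
    if x == 0:
        visits += 1
        cycles.append(n - last)  # duration of the cycle just completed
        last = n

fraction = visits / steps        # long-run fraction of time in state 0
mean_T = sum(cycles) / len(cycles)
print(fraction, 1 / mean_T)      # both should be close to 0.5
assert abs(fraction - 1 / mean_T) < 0.01
```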
2.5 Proof of Big Theorem
This section presents the proof of the main result about Markov chains.
2.5.1 Proof of Theorem 1.1 (a)
Let m j be the expected return time to state j. That is,
$$\displaystyle \begin{aligned} m_j = E[ T_j \mid X(0) = j ], \quad \text{where } T_j = \min\{n \geq 1 \mid X(n) = j\}. \end{aligned}$$
We show that π(j) = 1∕m j, j = 1, …, N is the unique invariant distribution if the Markov chain is irreducible.
During steps 1, …, n with n ≫ 1, the Markov chain visits state j a fraction 1∕m j of the times. A fraction P(j, i) of those times, it visits state i just after visiting state j. Thus, a fraction (1∕m j)P(j, i) of the times, the Markov chain visits j then i in successive steps. By summing over j, we find the fraction of the times that the Markov chain visits i. Thus,
$$\displaystyle \begin{aligned} \frac{1}{m_i} = \sum_j \frac{1}{m_j} P(j, i), \quad \text{i.e., } \pi(i) = \sum_j \pi(j) P(j, i) \ \text{ with } \pi(j) = \frac{1}{m_j}. \end{aligned}$$
Hence, there is an invariant distribution π and it is given by π(i) = 1∕m i, which is the fraction of time that the Markov chain spends in state i.
To show that the invariant distribution is unique, assume that there is another one, say ϕ(i). Start the Markov chain with that distribution. Then
$$\displaystyle \begin{aligned} \frac{1}{n} \sum_{m=0}^{n-1} 1\{X(m) = i\} \to \pi(i) \text{ as } n \to \infty, \text{ with probability one.} \end{aligned}$$
However, taking expectations, we find that the left-hand side is equal to ϕ(i) (see footnote 5). Thus, ϕ = π and the invariant distribution is unique.
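The identity π(i) = 1∕m i can be verified numerically. In the sketch below (the 3-state transition matrix is our own example), π is computed as the left eigenvector of P for eigenvalue 1, and the mean return time m j is obtained by first-step analysis on the hitting times:

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])

# invariant distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# mean return time m_j = 1 + sum_{k != j} P(j,k) h_k, where the hitting
# times h (to state j, from the other states) solve (I - Q) h = 1
for j in range(3):
    others = [k for k in range(3) if k != j]
    Q = P[np.ix_(others, others)]
    h = np.linalg.solve(np.eye(2) - Q, np.ones(2))
    m_j = 1 + P[j, others] @ h
    assert abs(pi[j] - 1 / m_j) < 1e-9

print("pi(i) = 1/m_i verified:", np.round(pi, 4))
```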
2.5.2 Proof of Theorem 1.1 (b)
If the Markov chain is irreducible but not aperiodic, then π n may not converge to the invariant distribution π. For instance, if the Markov chain alternates between 0 and 1 and starts from 0, then π n = [1, 0] for n even and π n = [0, 1] for n odd, so that π n does not converge to π = [0.5, 0.5].
If the Markov chain is aperiodic, π n → π. Moreover, the convergence is geometric. We first illustrate the argument on a simple example shown in Fig. 2.7. Consider the number of steps to go from 1 to 1. Note that
Thus, \(P^n(1, 1) > 0\) if n ≥ 6. Now, P[X(2) = 1|X(0) = 2] > 0, so that P[X(n) = 1|X(0) = 2] > 0 for n ≥ 8. Indeed, if n ≥ 8, then X can go from 2 to 1 in two steps and then from 1 to 1 in n − 2 steps. The argument is similar for the other states and we find that there is some M > 0 and some p > 0 such that
$$\displaystyle \begin{aligned} P[X(M) = 1 \mid X(0) = j] \geq p, \quad \text{for all } j. \end{aligned}$$
Now, consider two copies of the Markov chain: {X(n), n ≥ 0} and {Y (n), n ≥ 0}. One chooses X(0) with distribution π 0 and Y (0) with the invariant distribution π. The two Markov chains evolve independently initially. We define
$$\displaystyle \begin{aligned} \tau = \min\{n \geq 0 \mid X(n) = Y(n)\}. \end{aligned}$$
In view of the observation above,
$$\displaystyle \begin{aligned} P(X(M) = 1, Y(M) = 1) \geq p^2, \end{aligned}$$
by the independence of the two chains. Thus, P(τ > M) ≤ 1 − p 2. If τ > M, then the two Markov chains have not met yet by time M. Using the same argument as before, we see that they have a probability at least p 2 of meeting in the next M steps. Thus,
$$\displaystyle \begin{aligned} P(\tau > kM) \leq (1 - p^2)^k, \quad k \geq 1. \end{aligned}$$
Now, modify X(n) by gluing it to Y (n) after time τ. This coupling operation does not change the fact that X(n) still evolves according to the transition matrix P, so that P(X(n) = i) = π n(i) where π n = π 0 P n.
Now,
$$\displaystyle \begin{aligned} |\pi_n(i) - \pi(i)| = |P(X(n) = i) - P(Y(n) = i)| \leq P(X(n) \neq Y(n)) \leq P(\tau > n). \end{aligned}$$
Hence,
$$\displaystyle \begin{aligned} |\pi_n(i) - \pi(i)| \leq (1 - p^2)^{\lfloor n/M \rfloor} \end{aligned}$$
and this implies that
$$\displaystyle \begin{aligned} \pi_n(i) \to \pi(i) \text{ as } n \to \infty, \text{ geometrically fast.} \end{aligned}$$
To extend this argument to a general aperiodic Markov chain, we need the fact that for each state i there is some integer n i such that P n(i, i) > 0 for all n ≥ n i. We prove that fact as Lemma 2.3 in the following section.
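The geometric convergence can be observed numerically: for an illustrative aperiodic chain (our own 3-state example), the distance between π n and π shrinks by a roughly constant factor per step:

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])

# invariant distribution via the left eigenvector for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

pi_n = np.array([1.0, 0.0, 0.0])   # start in state 0 (pi_0 is our choice)
dists = []
for _ in range(30):
    dists.append(float(np.abs(pi_n - pi).sum()))
    pi_n = pi_n @ P

print(dists[0], dists[10], dists[20])   # geometric decay of the distance
assert dists[20] < dists[10] < dists[0]
```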
2.5.3 Periodicity
We start with a property of the set of return times of an irreducible Markov chain.
Lemma 2.1
Fix a state i and let S := {n > 0|P n(i, i) > 0} and d = g.c.d.(S). There must be two integers n and n + d in the set S.
Proof
The trick is clever. We first illustrate it on an example. Assume S = {9, 15, 21, …} with d = g.c.d.(S) = 3. There must be a, b ∈ S with g.c.d.{a, b} = 3. Otherwise, the gcd of S would not be 3. Here, we can choose a = 15 and b = 21. Now, consider the following operations:
$$\displaystyle \begin{aligned} (15, 21) \to (6, 15) \to (6, 9) \to (3, 6) \to (3, 3). \end{aligned}$$
At each step, we go from (x, y) with x ≤ y to the ordered pair of {x, y − x}. Note that at each step, each term in the pair (x, y) is an integer linear combination of a and b. For instance, (6, 15) = (b − a, a). Then, (6, 9) = (b − a, a − (b − a)) = (b − a, 2a − b), and so on. Eventually, we must get to (3, 3). Indeed, the terms are always decreasing until we get to zero. Assume we get to (x, x) with x ≠ 3. At the previous step, we had (x, 2x). The step before must have been (x, 3x), and so on. Going back all the way to (a, b), we see that a and b are both multiples of x. But then, g.c.d.{a, b} = x, a contradiction.
From this construction, since at each step the terms are integer linear combinations of a and b, we see that
$$\displaystyle \begin{aligned} 3 = ma + nb \end{aligned}$$
for some integers m and n. Thus,
$$\displaystyle \begin{aligned} 3 + m^- a + n^- b = m^+ a + n^+ b, \end{aligned}$$
where \(m^+ = \max \{m, 0\}\) and \(m^- = \max \{-m, 0\}\), so that m = m + − m −, and similarly for n + and n −. Now we can choose
$$\displaystyle \begin{aligned} N = m^- a + n^- b, \ \text{ so that } \ N + 3 = m^+ a + n^+ b. \end{aligned}$$
The last step of the argument is to notice that if a, b ∈ S, then αa + βb ∈ S for any nonnegative integers α and β that are not both zero. This fact follows from the definition of S as the set of return times from i to i: one can concatenate α loops of length a and β loops of length b. Hence, both N and N + 3 are in S.
The proof for a general set S with gcd equal to d is identical. □
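The subtraction procedure used in the proof is easy to automate; it is the classical subtractive form of Euclid's algorithm (the function name is ours):

```python
from math import gcd

def subtract_down(a, b):
    """Repeatedly replace (x, y), x <= y, by the ordered pair of {x, y - x}."""
    x, y = sorted((a, b))
    while x != y:
        x, y = sorted((x, y - x))
    return x

print(subtract_down(15, 21))           # the example from the proof
assert subtract_down(15, 21) == gcd(15, 21) == 3
```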
This result enables us to show that the period of a Markov chain is well-defined.
Lemma 2.2
For an irreducible Markov chain, d(i) defined in (1.6) has the same value for all states.
Proof
Pick j ≠ i. We show that d(j) ≤ d(i). This suffices to prove the lemma, since by symmetry one also has d(i) ≤ d(j).
By irreducibility, \(P^m(j, i) > 0\) for some m and \(P^n(i, j) > 0\) for some n. Now, by definition of d(i) and by the previous lemma, there is some integer N such that \(P^N(i, i) > 0\) and \(P^{N + d(i)}(i, i) > 0\). But then,
$$\displaystyle \begin{aligned} P^{m + N + n}(j, j) \geq P^m(j, i)\, P^N(i, i)\, P^n(i, j) > 0, \end{aligned}$$
and similarly \(P^{m + N + d(i) + n}(j, j) > 0\).
This implies that the integers K := n + N + m and K + d(i) are both in \(S := \{n > 0 \mid P^n(j, j) > 0\}\). Clearly, this shows that
$$\displaystyle \begin{aligned} d(j) \text{ divides both } K \text{ and } K + d(i), \text{ hence } d(j) \text{ divides } d(i), \text{ so that } d(j) \leq d(i). \end{aligned}$$
□
The following fact then suffices for our proof of convergence, as we explained in the example.
Lemma 2.3
Let X be an irreducible aperiodic Markov chain. Let S = {n > 0|P n(i, i) > 0}. Then, there is some n i such that n ∈ S, for all n ≥ n i.
Proof
We know from Lemma 2.1 that there is some integer N such that N, N + 1 ∈ S. We claim that
$$\displaystyle \begin{aligned} n \in S, \ \text{ for all } n > N^2. \end{aligned}$$
To see this, first note that for m > N − 1 one has
$$\displaystyle \begin{aligned} mN + k = (m - k)N + k(N + 1), \ \text{ with } m - k > 0, \ \text{ for } k \in \{0, 1, \ldots, N - 1\}. \end{aligned}$$
Now, for n > N 2 one can write
$$\displaystyle \begin{aligned} n = mN + k \end{aligned}$$
for some k ∈{0, 1, …, N − 1} and m > N − 1. Thus, n is a nonnegative integer combination of N and N + 1, which are both in S, so that n ∈ S. □
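The claim of Lemma 2.3 — every n > N 2 is a nonnegative integer combination of N and N + 1 — can be checked by brute force (N = 7 is an arbitrary choice):

```python
def is_combination(n, N):
    """Is n = a*N + b*(N + 1) for some integers a, b >= 0?"""
    for b in range(n // (N + 1) + 1):
        if (n - b * (N + 1)) % N == 0:
            return True
    return False

N = 7
# every n > N^2 is representable...
assert all(is_combination(n, N) for n in range(N * N + 1, 1000))
# ...while N^2 - N - 1 (= 41 here) is the largest non-representable integer
assert not is_combination(N * N - N - 1, N)
print("every n >", N * N, "is a combination of", N, "and", N + 1)
```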
2.6 Summary
- Sample Space;
- Laws of Large Numbers: SLLN and WLLN;
- WLLN from Chebyshev’s Inequality;
- SLLN from Borel–Cantelli and fourth moment bound;
- SLLN for Markov chains using the i.i.d. return times to a state;
- Proof of Big Theorem.
2.6.1 Key Equations and Formulas
2.8 Problems
Problem 2.1
Consider a Markov chain X n that takes values in {0, 1}. Explain why {0, 1} is not its sample space.
Problem 2.2
Consider again a Markov chain that takes values in {0, 1} with P(0, 1) = a and P(1, 0) = b. Exhibit two different sample spaces and the probability on them for that Markov chain.
Problem 2.3
Draw the smallest periodic Markov chain. Show that the fraction of time in the states converges but the probability of being in a state at time n does not converge.
Problem 2.4
For the Markov chain in Problem 2.2, calculate the eigenvalues and use them to get a bound on the distance between the distribution at time n and the invariant distribution.
Problem 2.5
Why does the strong law imply the weak law? More concretely, let X n, X be random variables such that X n → X almost surely. Show that X n → X in probability.
Hint
Fix 𝜖 > 0 and define Z n = 1{|X n − X|≥ 𝜖}. Use the Dominated Convergence Theorem (DCT) to show that E(Z n) → 0 as n →∞ if X n → X almost surely.
Problem 2.6
Draw a Markov chain with four states that is irreducible and aperiodic. Consider two independent versions of the Markov chain: one that starts in state 1, the other in state 2. Explain why they will meet after a finite time.
Problem 2.7
Consider the Markov chain of Fig. 1.2. Use Python to calculate the eigenvalues of P. Let λ be the largest absolute value of the eigenvalues other than 1. Use Python to calculate the distance
$$\displaystyle \begin{aligned} d(n) = \sum_i |\pi_n(i) - \pi(i)|, \quad \pi_n = \pi_0 P^n, \end{aligned}$$
where π 0(A) = 1. Plot d(n) and λ n as functions of n.
Problem 2.8
You flip a fair coin. If the outcome is “head,” you get a random amount of money equal to X and if it is “tail,” you get a random amount Y. Prove formally that, on average, you get
$$\displaystyle \begin{aligned} \frac{1}{2} E(X) + \frac{1}{2} E(Y). \end{aligned}$$
Problem 2.9
Can you find random variables that converge to 0 almost surely, but not in probability?
Problem 2.10
Let {X n, n ≥ 1} be i.i.d. zero-mean random variables with variance σ 2. Show that X n∕n → 0 with probability one as n →∞.
Hint
Borel–Cantelli.
Problem 2.11
Let X n be a finite irreducible Markov chain on \({\mathcal {X}}\) with invariant distribution π and \(f: {\mathcal {X}} \to \Re \) some function. Show that
$$\displaystyle \begin{aligned} \frac{1}{n} \sum_{m=0}^{n-1} f(X_m) \to \sum_{i \in {\mathcal {X}}} \pi(i) f(i), \text{ as } n \to \infty, \text{ a.s.} \end{aligned}$$
Notes
1. By definition, 1{C} takes the value 1 if the condition C holds and the value 0 otherwise.
2. Recall that
$$\displaystyle \begin{aligned} \sum_n \frac{1}{n^2} < \infty. \end{aligned}$$
3. Let n(ω) − 1 be the largest n such that ω ∈ A n.
4. Almost sure convergence implies convergence in probability, so the SLLN is stronger than the WLLN. See Problem 2.5.
5. Indeed,
$$\displaystyle \begin{aligned} E(1\{X(n)=i\}) = P(X(n) = i) = \phi(i). \end{aligned}$$
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2021 The Author(s)
Cite this chapter
Walrand, J. (2021). PageRank: B. In: Probability in Electrical Engineering and Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-49995-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49994-5
Online ISBN: 978-3-030-49995-2