Topics: Inference, Sufficient Statistic, Infinite Markov Chains, Poisson, Boosting, Multi-Armed Bandits, Capacity, Bounds, Martingales, SLLN

15.1 Inference

One key concept that we explored is that of inference. The general problem of inference can be formulated as follows. There is a pair of random quantities (X, Y ). One observes Y  and one wants to guess X (Fig. 15.1).

Fig. 15.1 The inference problem is to guess the value of X from that of Y

Thus, the goal is to find a function g(⋅) such that \(\hat X := g(Y)\) is close to X, in a sense to be made precise. Here are a few sample problems:

  • X is the weight of a person and Y  is her height;

  • X = 1 if a house is on fire, X = 0 otherwise, and Y is a measurement of the CO density at a sensor;

  • X ∈ {0, 1}^N is a bit string that a transmitter sends and \(Y \in \Re ^{[0, T]}\) is a signal that the receiver receives;

  • Y  is one woman’s genome and X = 1 if she develops a specific form of breast cancer and X = 0 otherwise;

  • Y  is a vector of characteristics of a movie and of one person and X is the number of stars that the person gives to the movie;

  • Y  is the photograph of a person’s face and X = 1 if it is that of a man and X = 0 otherwise;

  • X is a sentence and Y  is the signal that a microphone picks up.

We explained a few different formulations of this problem in Chaps. 7 and 9:

  • Known Distribution: We know the joint distribution of (X, Y );

  • Off-Line: We observe a set of sample values of (X, Y );

  • On-Line: We observe successive values of samples of (X, Y );

  • Maximum Likelihood Estimate: We do not want to assume a distribution for X, only the conditional distribution of Y  given X; the goal is to find the value of X that makes the observed Y  most likely;

  • Maximum A Posteriori Estimate: We know a prior distribution for X and the conditional distribution of Y  given X; the goal is to find the value of X that is most likely given Y ;

  • Hypothesis Test: We do not want to assume a distribution for X ∈{0, 1}, only a conditional distribution of Y  given X; the goal is to maximize the probability of correctly deciding that X = 1 while keeping the probability that we decide that X = 1 when in fact X = 0 below some given β.

  • MMSE: Given the joint distribution of X and Y, we want to find the function g(Y) that minimizes \(E((X - g(Y))^2)\).

  • LLSE: Given the joint distribution of X and Y, we want to find the linear function a + bY that minimizes \(E((X - a - bY)^2)\); a small numerical sketch follows this list.
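
As a quick illustration of the last two formulations, here is a minimal sketch (using NumPy and a made-up linear model for (X, Y), not an example from this chapter) that estimates the LLSE coefficients from samples. The coefficients are b = cov(X, Y)/var(Y) and a = E(X) − bE(Y).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linear model: X is a hidden quantity, Y a noisy observation of it.
n = 100_000
X = rng.normal(70.0, 10.0, n)
Y = 0.5 * X + rng.normal(0.0, 5.0, n)

# LLSE coefficients: b = cov(X, Y) / var(Y), a = E(X) - b E(Y).
b = np.cov(X, Y, bias=True)[0, 1] / np.var(Y)
a = X.mean() - b * Y.mean()
X_hat = a + b * Y                      # the LLSE estimate of X given Y

print("a =", round(a, 2), " b =", round(b, 2))
print("mean squared error:", round(np.mean((X - X_hat) ** 2), 2))
```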

15.2 Sufficient Statistic

A useful notion for inference problems is that of a sufficient statistic. We have not discussed this notion so far. It is time to do it.

Definition 15.1 (Sufficient Statistic)

We say that h(Y ) is a sufficient statistic for X if

$$\displaystyle \begin{aligned} f_{Y|X}[y|x] = f(h(y), x)g(y), \end{aligned}$$

or, equivalently, if

$$\displaystyle \begin{aligned} f_{Y|h(Y), X}[y|s, x] = f_{Y|h(Y)}[y|s]. \end{aligned}$$

We leave the verification of this equivalence to the reader.

Before we discuss the meaning of this definition, let us explore some implications. First note that if we have a prior \(f_X(x)\) and we want to calculate MAP[X|Y = y], we have

$$\displaystyle \begin{aligned} MAP[X|Y=y] &= \arg \max_x f_X(x)f_{Y|X}[y|x] \\ & = \arg \max_x f_X(x)f(h(y), x)g(y) \\ & = \arg \max_x f_X(x)f(h(y), x). \end{aligned} $$

Consequently, the maximizer is some function of h(y). Hence,

$$\displaystyle \begin{aligned} MAP[X|Y] = g(h(Y)), \end{aligned}$$

for some function g(⋅). In words, the information in Y  that is useful to calculate MAP[X|Y ] is contained in h(Y ).

In the same way, we see that MLE[X|Y ] is also a function of h(Y ).

Observe also that

$$\displaystyle \begin{aligned} f_{X|Y}[x|y] = \frac{f_X(x) f_{Y|X}[y|x]}{f_Y(y)} = \frac{f_X(x) f(h(y), x)g(y)}{f_Y(y)}. \end{aligned}$$

Now,

$$\displaystyle \begin{aligned} f_Y(y) &= \int_{- \infty}^\infty f_X(x) f(h(y), x)g(y) dx = g(y) \int_{- \infty}^\infty f_X(x) f(h(y), x)dx \\ &= g(y) \phi(h(y)), \end{aligned} $$

where

$$\displaystyle \begin{aligned} \phi(h(y)) = \int_{- \infty}^\infty f_X(x) f(h(y), x)dx. \end{aligned}$$

Hence,

$$\displaystyle \begin{aligned} f_{X|Y}[x|y] = \frac{f_X(x) f(h(y), x)}{\phi(h(y))}. \end{aligned}$$

Thus, the conditional density of X given Y  depends only on h(Y ). Consequently,

$$\displaystyle \begin{aligned} E[X|Y] = \psi(h(Y)). \end{aligned}$$

Now, consider the hypothesis testing problem when X ∈{0, 1}. Note that

$$\displaystyle \begin{aligned} L(y) = \frac{f_{Y|X}[y|1]}{f_{Y|X}[y|0]} = \frac{f(h(y), 1)g(y)}{f(h(y), 0)g(y)} = \psi(h(y)). \end{aligned}$$

Thus, the likelihood ratio depends only on h(y) and it follows that the solution of the hypothesis testing problem is also a function of h(Y ).

15.2.1 Interpretation

The definition of sufficient statistic is quite abstract. The intuitive meaning is that if h(Y ) is sufficient for X, then Y  is some function of h(Y ) and a random variable Z that is independent of X and Y . That is,

$$\displaystyle \begin{aligned} Y = g(h(Y), Z). \end{aligned} $$
(15.1)

For instance, say that Y = (Y 1, …, Y n) where the Y m are i.i.d. and Bernoulli with parameter X ∈ [0, 1]. Let h(Y ) = Y 1 + ⋯ + Y n. Then we can think of Y  as being constructed from h(Y ) by selecting randomly which h(Y ) random variables among (Y 1, …, Y n) are equal to one. This random choice is some independent random variable Z. In such a case, we see that Y  does not contain any information about X that is not already in h(Y ).
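
Here is a small simulation of that example (a sketch, assuming a uniform prior on X and n = 5 observations, values chosen only for illustration). It rebuilds Y from h(Y) and fresh randomness only, and checks that a statistic involving X is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n = 200_000, 5

# Assumed prior: X ~ Uniform[0, 1]; given X, the Y_m are i.i.d. Bernoulli(X).
X = rng.uniform(0.0, 1.0, trials)
Y = (rng.uniform(size=(trials, n)) < X[:, None]).astype(int)

# Rebuild Y from h(Y) = Y_1 + ... + Y_n and fresh randomness only:
# put the h(Y) ones at uniformly random positions (a random permutation of each row).
Y_rebuilt = np.array([rng.permutation(row) for row in Y])

# Any statistic that involves X should be unaffected, e.g. corr(X, Y_1).
print(round(np.corrcoef(X, Y[:, 0])[0, 1], 3),
      round(np.corrcoef(X, Y_rebuilt[:, 0])[0, 1], 3))
```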

To see the equivalence between this interpretation and the definition, first assume that (15.1) holds. Then

$$\displaystyle \begin{aligned} P[Y \approx y | X = x] &= P[h(Y) \approx h(y) | X = x] P(g(h(y), Z) \approx y) \\ & = f(h(y), x)g(y), \end{aligned} $$

so that h(Y ) is sufficient for X. Conversely, if h(Y ) is sufficient for X, then we can find some Z such that g(h(y), Z) has the density f Y |h(Y )[y|h(y)].

15.3 Infinite Markov Chains

We studied Markov chains on a finite state space \(\mathcal {X} = \{1, 2, \ldots , N\}\). Let us explore the countably infinite case where \(\mathcal {X} = \{0, 1, \ldots \}\).

One is given an initial distribution \(\pi = \{\pi (x), x \in \mathcal {X}\}\), where π(x) ≥ 0 and \(\sum _{x \in \mathcal {X}} \pi (x) = 1\). Also, one is given a set of nonnegative numbers \(\{P(x, y), x, y \in \mathcal {X}\}\) such that

$$\displaystyle \begin{aligned} \sum_{y \in \mathcal{X}} P(x, y) = 1, \forall x \in \mathcal{X}. \end{aligned}$$

The sequence {X(n), n ≥ 0} is then a Markov chain with initial distribution π and probability transition matrix P if

$$\displaystyle \begin{aligned} P(X(0) &= x_0, X(1) = x_1, \ldots, X(n) = x_n) \\ & = \pi(x_0)P(x_0, x_1) \times \cdots \times P(x_{n-1}, x_n), \end{aligned} $$

for all n ≥ 0 and all x 0, …, x n in \(\mathcal {X}\).

One defines irreducible and aperiodic as in the case of a finite Markov chain. Recall that if a finite Markov chain is irreducible, then it visits all its states infinitely often and it spends a positive fraction of time in each state.

That may not happen when the Markov chain is infinite. To see this, consider the following example (see Fig. 15.2). One has π(0) = 1 and P(i, i + 1) = p for i ≥ 0 and

$$\displaystyle \begin{aligned} P(i+1, i) = 1 - p =: q = P(0, 0), \forall i. \end{aligned}$$

Fig. 15.2 An infinite Markov chain

Assume that p ∈ (0, 1). Then the Markov chain is irreducible. However, it is intuitively clear that X(n) → ∞ as n → ∞ if p > 0.5. To see that this is indeed the case, let Z(n) be i.i.d. random variables with P(Z(n) = 1) = p and P(Z(n) = −1) = q. Then note that

$$\displaystyle \begin{aligned} X(n) = \max\{X(n-1) + Z(n), 0\}, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} X(n) \geq X(0) + Z(1) + \cdots + Z(n), n \geq 1. \end{aligned}$$

Also,

$$\displaystyle \begin{aligned} \frac{X(n)}{n} \geq \frac{X(0) + Z(1) + \cdots + Z(n)}{n} \rightarrow E(Z(1)) = p - q > 0, \end{aligned}$$

where the convergence follows by the SLLN. This implies that X(n) → ∞, as claimed.

Thus, X(n) eventually is larger than any given N and remains larger. This shows that X(n) visits every state only finitely many times. We say that the states are transient because they are visited only finitely often.

We say that a state is recurrent if it is not transient. In that case, the state is called positive recurrent if the average time between successive visits is finite; otherwise it is called null recurrent.

Here is the result that corresponds to Theorem 1.1.

Theorem 15.1 (Big Theorem for Infinite Markov Chains)

Consider an infinite Markov chain.

  1. (a)

    If the Markov chain is irreducible, the states are either all transient, all positive recurrent, or all null recurrent. We then say that the Markov chain is transient, positive recurrent, or null recurrent, respectively.

  2. (b)

    If the Markov chain is positive recurrent, it has a unique invariant distribution π and π(i) is the long-term fraction of time that X(n) is equal to i.

  3. (c)

    If the Markov chain is positive recurrent and also aperiodic, then the distribution π n of X(n) converges to π.

  4. (d)

    If the Markov chain is not positive recurrent, it does not have an invariant distribution and the fraction of time that it spends in any state goes to zero.

\({\blacksquare }\)

It turns out that the Markov chain in Fig. 15.2 is null recurrent for p = 0.5 and positive recurrent for p < 0.5. In the latter case, its invariant distribution is

$$\displaystyle \begin{aligned} \pi(i) = (1 - \rho) \rho^i, i\geq 0, \mbox{ where } \rho := \frac{p}{q}. \end{aligned}$$
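
This can be checked numerically. The following sketch (assuming p = 0.3, so ρ = 3/7, values chosen only for illustration) simulates the chain of Fig. 15.2 and compares the empirical fraction of time in each state with (1 − ρ)ρ^i.

```python
import numpy as np

rng = np.random.default_rng(0)
p, steps = 0.3, 500_000                 # p < 0.5: positive recurrent
q, rho = 1 - p, p / (1 - p)

x, counts = 0, np.zeros(10, dtype=int)
for _ in range(steps):
    x = max(x + (1 if rng.random() < p else -1), 0)   # reflected random walk
    if x < len(counts):
        counts[x] += 1

for i in range(5):
    print(i, round(counts[i] / steps, 4), round((1 - rho) * rho**i, 4))
```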

15.3.1 Lyapunov–Foster Criterion

Here is a useful sufficient condition for positive recurrence.

Theorem 15.2 (Lyapunov–Foster)

Let X(n) be an irreducible Markov chain on an infinite state space \(\mathcal {X}\) . Assume there exists some function \(V: \mathcal {X} \rightarrow [0, \infty )\) such that

$$\displaystyle \begin{aligned} E[V(X(n+1)) - V(X(n)) | X(n) = x] \leq - \alpha + \beta 1\{x \in A\}, \end{aligned}$$

where A is a finite set, α > 0 and β > 0.

Then the Markov chain is positive recurrent.

Such a function V  is said to be a Lyapunov function for the Markov chain. \({\blacksquare }\)

The condition means that the Lyapunov function decreases by at least α on average when X(n) is outside some finite set A. The intuitive reason why this makes the Markov chain positive recurrent is that, since the Lyapunov function is nonnegative, it cannot decrease forever. Thus, it must spend a positive fraction of time inside the finite set A. By the big theorem, this implies that it is positive recurrent.
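
For example, for the chain of Fig. 15.2 with p < 0.5, one simple choice is V(x) = x and A = {0}. Then

$$\displaystyle \begin{aligned} E[V(X(n+1)) - V(X(n)) | X(n) = x] = \begin{cases} p - q, & x \geq 1,\\ p, & x = 0, \end{cases} \end{aligned}$$

so the condition of Theorem 15.2 holds with α = q − p > 0 and β = 1 (indeed, p ≤ −(q − p) + 1 = 2p), and the theorem confirms that this chain is positive recurrent when p < 0.5.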

15.4 Poisson Process

The Poisson process is an important model in applied probability. It is a good approximation of the arrivals of packets at a router, of telephone calls, of new TCP connections, of customers at a cashier.

15.4.1 Definition

We start with a definition of the Poisson process. (See Fig. 15.3.)

Fig. 15.3 Poisson process: the times S_n between jumps are i.i.d. and exponentially distributed with rate λ

Definition 15.2 (Poisson Process)

Let λ > 0 and {S 1, S 2, …} be i.i.d. Exp(λ) random variables. Let also T n = S 1 + ⋯ + S n for n ≥ 1. Define

$$\displaystyle \begin{aligned} N_t = \max\{n \geq 1 | T_n \leq t\}, t \geq 0, \end{aligned}$$

with N t = 0 if t < T 1. Then, N := {N t, t ≥ 0} is a Poisson process with rate λ. Note that T n is the n-th jump time of N. ◇

15.4.2 Independent Increments

Before exploring the properties of the Poisson process, we recall two properties of the exponential distribution.

Theorem 15.3 (Properties of Exponential Distribution)

Let τ be exponentially distributed with rate λ > 0. That is,

$$\displaystyle \begin{aligned} F_\tau(t) = P( \tau \leq t) = 1 - \exp\{- \lambda t\}, t \geq 0. \end{aligned}$$

In particular, the pdf of τ is \(f_\tau (t) = \lambda \exp \{- \lambda t\}\) for t ≥ 0. Also, \(E(\tau) = \lambda^{-1}\) and \(\mbox{var}(\tau) = \lambda^{-2}\).

Then,

$$\displaystyle \begin{aligned} P[ \tau > t + s | \tau > s] = P(\tau > t). \end{aligned}$$

This is the memoryless property of the exponential distribution.

Also,

$$\displaystyle \begin{aligned} P[ \tau \leq t + \epsilon | \tau > t] = \lambda \epsilon + o(\epsilon). \end{aligned}$$

\({\blacksquare }\)

Proof

$$\displaystyle \begin{aligned} & P[ \tau > t + s | \tau > s] = \frac{P(\tau > t + s)}{P(\tau > s)} \\ &~~~~ = \frac{\exp\{- \lambda (t + s)\}}{\exp\{- \lambda s\}} = \exp\{- \lambda t\} \\ &~~~~ = P(\tau > t). \end{aligned} $$

For the second claim, \(P[\tau \leq t + \epsilon \mid \tau > t] = 1 - P[\tau > t + \epsilon \mid \tau > t] = 1 - \exp\{-\lambda \epsilon\} = \lambda \epsilon + o(\epsilon)\). □

The interpretation of this property is that if a lightbulb has an exponentially distributed lifetime, then an old bulb is exactly as good as a new one (as long as it is still burning).

We use this property to show that the Poisson process is also memoryless, in a precise sense.

Theorem 15.4 (Poisson Process Is Memoryless)

Let \(N := \{N_t, t \geq 0\}\) be a Poisson process with rate λ. Fix t > 0. Given \(\{N_s, s \leq t\}\), the process \(\{N_{s+t} - N_t, s \geq 0\}\) is a Poisson process with rate λ.

As a consequence, the process has stationary and independent increments. That is, for any 0 ≤ t 1 < t 2 < ⋯, the increments \(\{N_{t_{n+1}} - N_{t_n}, n \geq 1\}\) of the Poisson process are independent and the distribution of \(N_{t_{n+1}} - N_{t_n}\) depends only on t n+1 − t n. \({\blacksquare }\)

Proof

Figure 15.4 illustrates that result. Given {N s, s ≤ t}, the first jump time of {N s+t − N t, s ≥ 0} is Exp(λ), by the memoryless property of the exponential distribution. The subsequent inter-jump times are i.i.d. and Exp(λ). This proves the theorem. □

Fig. 15.4 Given the past of the process up to time t, the future jump times are those of a Poisson process

15.4.3 Number of Jumps

One has the following result.

Theorem 15.5 (The Number of Jumps Is Poisson)

Let \(N := \{N_t, t \geq 0\}\) be a Poisson process with rate λ. Then N_t has a Poisson distribution with mean λt. \({\blacksquare }\)

Proof

There are a number of ways of showing this result. The standard way is as follows. Note that

$$\displaystyle \begin{aligned} P(N_{t + \epsilon} = n) = P(N_t = n)(1 - \lambda \epsilon) + P(N_t = n - 1) \lambda \epsilon + o(\epsilon). \end{aligned}$$

Hence,

$$\displaystyle \begin{aligned} \frac{d}{dt} P(N_t = n) = \lambda P(N_t = n - 1) - \lambda P(N_t = n). \end{aligned}$$

Thus,

$$\displaystyle \begin{aligned} \frac{d}{dt} P(N_t = 0) = - \lambda P(N_t = 0). \end{aligned}$$

Since P(N 0 = 0) = 1, this shows that \(P(N_t = 0) = \exp \{- \lambda t\}\) for t ≥ 0. Now, assume that

$$\displaystyle \begin{aligned} P(N_t = n) = g(n, t)\exp\{- \lambda t\}, n \geq 0. \end{aligned}$$

Then, the differential equation above shows that

$$\displaystyle \begin{aligned} \frac{d}{dt} [g(n, t)\exp\{- \lambda t\}] = \lambda [g(n-1, t) - g(n, t)] \exp\{- \lambda t\}, \end{aligned}$$

i.e.,

$$\displaystyle \begin{aligned} \frac{d}{dt} g(n, t) = \lambda g(n-1, t). \end{aligned}$$

This expression shows by induction that \(g(n, t) = \frac {(\lambda t)^n}{n!}\).

A different proof makes use of the density of the jumps. Let T n be the n-th jump of the process and S n = T n − T n−1, as before. Then

$$\displaystyle \begin{aligned} & P(T_1 \in (t_1, t_1 + dt_1), \ldots , T_n \in (t_n, t_n + dt_n), T_{n+1} > t) \\ & \quad = P(S_1 \in (t_1, t_1 + dt_1), \ldots , S _n \in (t_n - t_{n-1}, t_n \\ & \qquad - t_{n-1}+ dt_n), S_{n+1} > t - t_n) \\ & \quad = \lambda \exp\{- \lambda t_1\} dt_1 \lambda \exp\{- \lambda (t_2 - t_1)\} dt_2 \cdots \exp\{- \lambda (t - t_n)\} \\ &~~ = \lambda^n dt_1 \cdots dt_n \exp\{- \lambda t\}. \end{aligned} $$

To derive this expression, we used the fact that the S n are i.i.d. Exp(λ). The expression above shows that, given that there are n jumps in [0, t], they are equally likely to be anywhere in the interval. Also,

$$\displaystyle \begin{aligned} P(N_t = n) = \int_S \lambda^n dt_1 \cdots dt_n \exp\{- \lambda t\}, \end{aligned}$$

where S = {t_1, …, t_n | 0 < t_1 < ⋯ < t_n < t}. Now, observe that S is a fraction of [0, t]^n that corresponds to the times t_i being in a particular order. There are n! such orders and, by symmetry, each order corresponds to a subset of [0, t]^n of the same size. Thus, the volume of S is \(t^n/n!\). We conclude that

$$\displaystyle \begin{aligned} P(N_t = n) = \frac{t^n}{n!} \lambda^n \exp\{- \lambda t\}, \end{aligned}$$

which proves the result. □
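
The following sketch (assuming λ = 2 and t = 3, values chosen only for illustration) builds N_t from i.i.d. Exp(λ) inter-jump times as in Definition 15.2 and compares its empirical distribution with the Poisson(λt) pmf of Theorem 15.5.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lam, t, trials = 2.0, 3.0, 50_000

counts = np.empty(trials, dtype=int)
for k in range(trials):
    total, n = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)   # inter-jump time S_{n+1} ~ Exp(lam)
        if total > t:
            break
        n += 1
    counts[k] = n                              # N_t = number of jumps in [0, t]

for n in range(6):
    pmf = exp(-lam * t) * (lam * t) ** n / factorial(n)
    print(n, round(float((counts == n).mean()), 4), round(pmf, 4))
```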

15.5 Boosting

You follow the advice of some investment experts when you buy stocks. Their recommendations are often contradictory. How do you make your decisions so that, in retrospect, you are not doing too bad compared to the best of the experts? The intuition is that you should try to follow the leader, but randomly. To make the situation concrete, Fig. 15.5 shows three experts (B, I, T) and the profits one would make by following their advice on the successive days.

Fig. 15.5 The three experts and the profits of their recommended stocks

On a given day, you choose which expert to follow the next day. Figure 15.6 shows your profit if you make the sequence of selections indicated by the red circles. In these selections, you choose to follow B the first 2 days, then I the next 2 days, then T the last day. Of course, you have to choose the day before, and the actual profit is only known the next day. The figure also shows the regrets that you accumulate when comparing your profit to that of the three experts. Your total profit is − 5 and the profit you would have made if you had followed B all the time would have been − 2, so your regret compared to B is − 2 − (−5) = 3, and similarly for the other two experts.

Fig. 15.6 A specific sequence of choices and the resulting profit and regrets

The problem is to make the expert selection every day so as to minimize the worst regret, i.e., the regret with respect to the most successful expert. More precisely, the goal is to minimize the rate of growth of the worst regret. Here is the result.

Theorem 15.6 (Minimum Regret Algorithm)

Generally, the worst regret grows like \(O(\sqrt {n})\) with the number n of steps. One algorithm that achieves this rate of regret is to choose expert E at step n + 1 with probability π n+1(E) given by

$$\displaystyle \begin{aligned} \pi_{n+1}(E) = A_n \exp\{\eta P_n(E)/\sqrt{n}\}, \mathit{\mbox{ for }} E \in \{B, I, T\}, \end{aligned}$$

where η > 0 is a constant, A n is such that these probabilities add up to one, and P n(E) is the profit that expert E makes in the first n days. \({\blacksquare }\)

Thus, the algorithm favors successful experts. However, the algorithm makes random selections. It is easy to construct examples where a deterministic algorithm accumulates a regret that grows like n.

Figure 15.7 shows a simulation of three experts and of the selection algorithm in the theorem. The experts are random walks with drift 0.1. The simulation shows that the selection algorithm tends to fall behind the best expert by \(O(\sqrt {n})\).
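
Here is a minimal sketch of such a simulation (assuming ±1 daily profits with drift 0.1 and η = 1, choices made only for illustration); it implements the randomized selection rule of Theorem 15.6 and reports the final regret with respect to the best expert.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, eta = 10_000, 1.0

# Daily profits of the three experts: +/-1 steps with drift 0.1, as in Fig. 15.7.
gains = rng.choice([1, -1], size=(n_steps, 3), p=[0.55, 0.45])

P = np.zeros(3)            # cumulative profits P_n(E) of the experts
my_profit = 0.0
for n in range(1, n_steps + 1):
    w = np.exp(eta * P / np.sqrt(n))     # unnormalized weights
    pi = w / w.sum()                     # pi_{n+1}(E) from Theorem 15.6
    E = rng.choice(3, p=pi)              # expert followed on day n
    my_profit += gains[n - 1, E]
    P += gains[n - 1]                    # the experts' profits are then revealed

print("best expert:", P.max(), " my profit:", my_profit,
      " regret:", P.max() - my_profit)
```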

Fig. 15.7 A simulation of the experts and the selection algorithm

The proof of the theorem can be found in Cesa-Bianchi and Lugosi (2006).

15.6 Multi-Armed Bandits

Here is a classical problem. You are given two coins, both with an unknown bias (the probability of heads). At each step k = 1, 2, … you choose a coin to flip. Your goal is to accumulate heads as fast as possible. Let X k be the number of heads you accumulate after k steps. Let also \(X_k^*\) be the number of heads that you would accumulate if you always flipped the coin with the largest bias. The regret of your strategy after n steps is defined as

$$\displaystyle \begin{aligned} R_k = E(X_k^* - X_k). \end{aligned}$$

Let θ 1 and θ 2 be the bias of coins 1 and 2, respectively. Then \(E(X_k^*) = k \max \{\theta _1, \theta _2\}\) and the best strategy is to flip the coin with the largest bias at each step. However, since the two biases are unknown, you cannot use that strategy. We explain below that there is a strategy such that the regret grows like \(\log (k)\) with the number of steps.

Any good strategy keeps on estimating the biases. Indeed, any strategy that stops estimating and then forever flips the coin that is believed to be best has a positive probability of getting stuck with the worst coin, thus accumulating a regret that grows linearly over time. Thus, a good strategy must constantly explore, i.e., flip both coins to learn their bias.

However, a good strategy should exploit the estimates by flipping the coin that is believed to be better more frequently than the other. Indeed, if you were to flip the two coins the same fraction of time, the regret would also grow linearly. Hence, a good strategy must exploit the accumulated knowledge about the biases.

The key question is how to balance exploration and exploitation. The strategy called Thompson Sampling does this optimally. Assume that the biases θ 1 and θ 2 of the two coins are independent and uniformly distributed in [0, 1]. Say that you have flipped the coins a number of times. Given the outcomes of these coin flips, one can in principle compute the conditional distributions of θ 1 and θ 2. Given these conditional distributions, one can calculate the probability that θ 1 > θ 2. The Thompson Sampling strategy is to choose coin 1 with that probability and coin 2 otherwise for the next flip. Here is the key result.

Theorem 15.7 (Minimum Regret of Thompson Sampling)

If the coins have different biases, then any strategy is such that

$$\displaystyle \begin{aligned} R_k \geq O(\log{k}). \end{aligned}$$

Moreover, Thompson Sampling achieves this lower bound. \({\blacksquare }\)

The notation \(O(\log {k})\) indicates a function g(k) that grows like \(\log {k}\), i.e., such that \(g(k)/\log {k}\) converges to a positive constant as k → ∞.

Thus this strategy does not necessarily choose the coin with the largest expected bias. It is the case that the strategy favors the coin that has been more successful so far, thus exploiting the information. But the selection is random, which contributes to the exploration.

One can show that if flips of coin 1 have produced h heads and t tails, then the conditional density of θ 1 is g(θ;h, t), where

$$\displaystyle \begin{aligned} g(\theta; h, t) = \frac{(h + t + 1)!}{h!t!} \theta^h (1 - \theta)^t, \theta \in [0, 1]. \end{aligned}$$

The same result holds for coin 2. Thus, Thompson Sampling generates \(\hat \theta _1\) and \(\hat \theta _2\) according to these densities.
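
Here is a minimal sketch of Thompson Sampling for two coins (assuming hypothetical biases 0.6 and 0.5). Sampling \(\hat \theta _i\) from the Beta(h+1, t+1) posterior above and flipping the coin with the larger sample selects coin 1 exactly with the probability that θ_1 > θ_2 given the observations.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = [0.6, 0.5]                  # true (unknown) biases; hypothetical values
heads, tails = [0, 0], [0, 0]
regret = 0.0

for k in range(10_000):
    # Posterior of theta_i is Beta(heads_i + 1, tails_i + 1), i.e., g(theta; h, t).
    samples = [rng.beta(heads[i] + 1, tails[i] + 1) for i in range(2)]
    i = int(np.argmax(samples))     # flips coin 1 w.p. P[theta_1 > theta_2 | data]
    flip = rng.random() < theta[i]
    heads[i] += int(flip)
    tails[i] += 1 - int(flip)
    regret += max(theta) - theta[i]

print("expected regret after 10,000 flips:", round(regret, 1))
```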

For a proof of this result, see Agrawal and Goyal (2012). See also Russo et al. (2018) for applications of multi-armed bandits.

A rough justification of the result goes as follows. Say that θ_1 > θ_2. One can show that after flipping coin 2 a number n of times, it takes about n steps until you flip it again when using Thompson Sampling. Your regret then grows by one at times 1, 1 + 1, 2 + 2, 4 + 4, …, 2^n, 2^{n+1}, …. Thus, the regret is of order n after O(2^n) steps. Equivalently, after N = 2^n steps, the regret is of order \(n = \log {N}\).

15.7 Capacity of BSC

Consider a binary symmetric channel with error probability p ∈ (0, 0.5). Every bit that the transmitter sends has a chance of being corrupted. Thus, it is impossible to transmit any bit string fully reliably across this channel. No matter what the transmitter sends, the receiver can never be sure that it got the message right.

However, one might be able to achieve a very small probability of error. For instance, say that p = 0.1 and that one transmits a bit by repeating it N times, where N ≫ 1. As the receiver gets the N bits, it uses majority decoding. That is, if it gets more zeros than ones, it decides that the transmitter sent a zero, and conversely for a one. The probability of error can be made arbitrarily small by choosing N very large. However, this scheme gets to transmit only one bit every N steps. We say that the rate of this scheme is 1∕N and it seems that, to achieve a very small error probability, the rate has to become negligible.
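
A quick computation (a sketch, for odd N) shows how the error probability of this repetition scheme decays while its rate 1/N vanishes.

```python
from math import comb

def majority_error(p: float, N: int) -> float:
    """P(decoding error) for an N-fold repetition over a BSC(p), N odd:
    the majority decoder errs when at least (N+1)/2 bits are flipped."""
    return sum(comb(N, k) * p**k * (1 - p)**(N - k)
               for k in range((N + 1) // 2, N + 1))

for N in (1, 5, 11, 21, 51):
    print(N, majority_error(0.1, N))    # error probability vs. rate 1/N
```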

It turns out that our pessimistic conclusion is wrong. Claude Shannon (Fig. 15.8), in the late 1940s, explained that the channel can transmit at any rate less than C(p), where (see Fig. 15.9)

$$\displaystyle \begin{aligned} C(p) = 1 - H(p) \mbox{ with } H(p) = - p \log_2 p - (1 - p) \log_2(1 - p), \end{aligned} $$
(15.2)

with a probability of error less than 𝜖, for any 𝜖 > 0.

Fig. 15.8 Claude Shannon, 1916–2001

Fig. 15.9 The capacity C(p) of the BSC with error probability p

For instance, C(0.1) ≈ 0.53. Fix a rate less than C(0.1), say R = 0.5. Pick any 𝜖 > 0, say 𝜖 = 10−8. Then, it is possible to transmit bits across this channel at rate R = 0.5, with a probability of error per bit less than 10−8. The same is true if we choose 𝜖 = 10−12: it is possible to transmit at the same rate R with a probability of error less than 10−12. The actual scheme that we use depends on 𝜖, and it becomes more complex when 𝜖 is smaller; however, the rate R does not depend on 𝜖. Quite a remarkable result! Needless to say, it baffled all the engineers who had been busily designing various ad hoc transmission schemes.

Shannon’s key insight is that long sequences are typical. There is a statistical regularity in random sequences such as Markov chains or i.i.d. random variables and this regularity manifests itself in a characteristic of long sequences. For instance, flip many times a biased coin with P(head) = 0.1. The sequence that you will observe is likely to have about 10% of heads. Many other sequences are so unlikely that you will not see them. Thus, there are relatively few long sequences that are possible. In this example, although there are M = 2N possible sequences of N coin flips, only about \(\sqrt {M}\) are typical when P(head) = 0.1. Moreover, by symmetry, these typical sequences are all equally likely. For that reason, the errors of the BSC must correspond to relatively few patterns. Say that there are only A possible patterns of errors for N transmissions. Then, any bit string of length N that the sender transmits will correspond to A possible received “output” strings: one for every typical error sequence. Thus, it might be possible to choose B different “input” strings of length N for the transmitter so that the A received “output” strings for each one of these B input strings are all distinct. However, one might worry that choosing the B input strings would be rather complex if we want their sets of output strings to be distinct.

Shannon noticed that if we pick the input strings completely randomly, this will work. Thus, Shannon's scheme is as follows. Pick a large N. Choose B strings of N bits randomly, each time by flipping a fair coin N times. Call these input strings X_1, …, X_B. These are the codewords. Let S_1 be the set of A typical outputs that correspond to X_1. Let Y_j be the output that corresponds to input X_j. Note that the Y_j are sequences of fair coin flips, by symmetry of the channel. Thus, each Y_j is equally likely to be any one of the 2^N possible output strings. In particular, the probability that Y_j falls in S_1 is A∕2^N (Fig. 15.10).

Fig. 15.10 Because of the random choice of the codewords, the likelihood that one codeword produces an output that is typical for another codeword is \(A \cdot 2^{-N}\)

In fact,

$$\displaystyle \begin{aligned} P({\mathbf{Y}}_2 \in S_1 \mbox{ or } {\mathbf{Y}}_3 \in S_1 \ldots \mbox{ or } {\mathbf{Y}}_B \in S_1) \leq B\times A2^{-N}. \end{aligned}$$

Indeed, the probability of a union of events is not larger than the sum of their probabilities. We explain below that \(A = 2^{NH(p)}\). Thus, if we choose \(B = 2^{NR}\), we see that the expression above is less than or equal to

$$\displaystyle \begin{aligned} 2^{NR} \times 2^{NH(p)} \times 2^{-N} \end{aligned}$$

and this expression goes to zero as N increases, provided that

$$\displaystyle \begin{aligned} R + H(p) < 1, \mbox{ i.e., } R < C(p) := 1 - H(p). \end{aligned}$$

Thus, the receiver makes an error with a negligible probability if one does not choose too many codewords. Note that \(B = 2^{NR}\) corresponds to transmitting NR different bits in N steps, thus transmitting at rate R.

How does the receiver recognize the bit string that the transmitter sent? The idea is to give the list of the B input strings, i.e., codewords, to the receiver. When it receives a string, the receiver looks in the list to find the codeword that is the closest to the string it received. With a very high probability, it is the string that the transmitter sent.

It remains to show that \(A = 2^{NH(p)}\). Fortunately, this calculation is a simple consequence of the SLLN. Let X := {X(n), n = 1, …, N} be i.i.d. random variables with P(X(n) = 1) = p and P(X(n) = 0) = 1 − p. For a given sequence x = (x(1), …, x(N)) ∈ {0, 1}^N, let

$$\displaystyle \begin{aligned} \psi(\mathbf{x}) := \frac{1}{N} \log_2 (P(\mathbf{X} = \mathbf{x})). \end{aligned} $$
(15.3)

Note that, with \(|\mathbf {x}| := \sum _{n=1}^N x(n)\),

$$\displaystyle \begin{aligned} \psi(\mathbf{x}) &= \frac{1}{N} \log_2 (p^{|\mathbf{x}|} (1 - p)^{N - |\mathbf{x}|}) \\ & = \frac{|\mathbf{x}|}{N} \log_2(p) + \frac{N - |\mathbf{x}|}{N} \log_2(1 - p). \end{aligned} $$

Thus, the random string X of N bits is such that

$$\displaystyle \begin{aligned} \psi(\mathbf{X}) = \frac{|\mathbf{X}|}{N} \log_2(p) + \frac{N - |\mathbf{X}|}{N} \log_2(1 - p). \end{aligned}$$

But we know from the SLLN that |X|∕N → p as N → ∞. Thus, for N ≫ 1,

$$\displaystyle \begin{aligned} \psi(\mathbf{X}) \approx p \log_2(p) + (1 - p) \log_2(1 - p) =: - H(p). \end{aligned}$$

This calculation shows that any sequence x of values that X takes has approximately the same value of ψ(x). But, by (15.3), this implies that all the sequences x that occur have approximately the same probability

$$\displaystyle \begin{aligned} 2^{- N H(p)}. \end{aligned}$$

We conclude that there are \(2^{NH(p)}\) typical sequences and that they are all essentially equally likely. Thus, \(A = 2^{NH(p)}\).
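
This concentration is easy to see numerically. The sketch below (assuming p = 0.1 and N = 1000, values chosen only for illustration) samples random strings and checks that ψ(X) of (15.3) clusters around −H(p).

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, trials = 0.1, 1000, 10_000

X = rng.random((trials, N)) < p                      # i.i.d. Bernoulli(p) strings
ones = X.sum(axis=1)
psi = (ones * np.log2(p) + (N - ones) * np.log2(1 - p)) / N   # psi(X) as in (15.3)

H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print("-H(p)  =", round(-H, 4))
print("psi(X) =", round(psi.mean(), 4), "+/-", round(psi.std(), 4))
```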

Recall that for the Gaussian channel with the MLE detection rule, the channel becomes a BSC with

$$\displaystyle \begin{aligned} p = p(\sigma^2) := P(\mathcal{N}(0, \sigma^2) > 0.5). \end{aligned}$$

Accordingly, we can calculate the capacity C(p(σ 2)) as a function of the noise standard deviation σ. Figure 15.11 shows the result.

Fig. 15.11 The capacity of the BSC that corresponds to a \(\mathcal {N}(0, \sigma ^2)\) additive noise. The detector uses the MLE

These results of Shannon on the capacity, or achievable rates, of channels have had a profound impact on the design of communication systems. Suddenly, engineers had a target and they knew how far or how close their systems were to the feasible rate. Moreover, the coding scheme of Shannon, although not really practical, provided a valuable insight into the design of codes for specific channels. Shannon’s theory, called Information Theory, is an inspiring example of how a profound conceptual insight can revolutionize an engineering field.

Another important part of Shannon’s work concerns the coding of random objects. For instance, how many bits does it take to encode a 500-page book? Once again, the relevant notion is that of typicality. As an example, we know that to encode a string of N flips of a biased coin with P(head) = p, we need only NH(p) bits, because this is the number of typical sequences. Here, H(p) is called the entropy of the coin flip. Similarly, if {X(n), n ≥ 1} is an irreducible, finite, and aperiodic Markov chain with invariant distribution π and transition probabilities P(i, j), then one can show that to encode {X(1), …, X(N)} one needs approximately NH(P) bits, where

$$\displaystyle \begin{aligned} H(P) = - \sum_i \pi(i) \sum_j P(i, j) \log_2 P(i, j) \end{aligned}$$

is called the entropy rate of the Markov chain. A practical scheme, called Lempel–Ziv compression, essentially achieves this limit. It is the basis for most file compression algorithms (e.g., ZIP).

Shannon put these two ideas together: channel capacity and source coding. Here is an example of his source–channel coding result. How fast can one send the symbols X(n) produced by the Markov chain through a BSC channel? The answer is C(p)∕H(P). Intuitively, it takes H(P) bits per symbol X(n) and the BSC can send C(p) bits per unit time. Moreover, to accomplish this rate, one first encodes the source and one separately chooses the codewords for the BSC, and one then uses them together. Thus, the channel coding is independent of the source coding and vice versa. This is called the separation theorem of Claude Shannon.

15.8 Bounds on Probabilities

We explain how to derive estimates of probabilities using Chebyshev and Chernoff’s inequalities and also using the Gaussian approximation. These methods also provide a useful insight into the likelihood of events. The power of these methods is that they can be used in very complex situations.

Theorem 15.8 (Markov, Chernoff, and Jensen Inequalities)

Let X be a random variable. Then one has

  1. (a)

    Markov’s Inequality:

    $$\displaystyle \begin{aligned} P( X \geq a) \leq \frac{E(f(X))}{f(a)}, \end{aligned} $$
    (15.4)

    for all f(⋅) that is nondecreasing and positive.

  2. (b)

    Chernoff’s Inequality (Fig. 15.12):

    $$\displaystyle \begin{aligned} P( X \geq a) \leq E( \exp\{\theta (X - a)\} ), \end{aligned} $$
    (15.5)
    Fig. 15.12 Herman Chernoff, b. 1923

    for all θ > 0.

  3. (c)

    Jensen’s Inequality (Fig. 15.13):

    $$\displaystyle \begin{aligned} f(E(X)) \leq E(f(X)), \end{aligned} $$
    (15.6)
    Fig. 15.13 Johan Jensen, 1859–1925

    for all f(⋅) that is convex.

\({\blacksquare }\)

These results are easy to show, so here is a proof.

Proof

  1. (a)

    Since f(⋅) is nondecreasing and positive, we have

    $$\displaystyle \begin{aligned} 1\{X \geq a\} \leq \frac{f(X)}{f(a)}, \end{aligned}$$

    so that (15.4) follows by taking expectations.

  2. (b)

    The inequality (15.5) is a particular case of Markov’s inequality (15.4) for \(f(X) = \exp \{\theta X\}\) with θ > 0.

  3. (c)

    Let f(⋅) be a convex function. This means that it lies above any tangent. In particular,

    $$\displaystyle \begin{aligned} f(X) \geq f(E(X)) + f'(E(X))(X - E(X)), \end{aligned}$$

    as shown in Fig. 15.14. The inequality (15.6) then follows by taking expectations.

    Fig. 15.14 A convex function f(⋅) lies above its tangents. In particular, it lies above the tangent at E(X), which implies Jensen's inequality.

15.8.1 Applying the Bounds to Multiplexing

Recall the multiplexing problem. There are N users who are independently active with probability p, so the number of active users Z is B(N, p). In the running example, N = 100 and p = 0.2, and we write ν for this B(100, 0.2) random variable. We want to find m so that P(ν ≥ m) ≈ 5%.

As a first estimate of m, we use Chebyshev’s inequality (2.2) which says that

$$\displaystyle \begin{aligned} P(|\nu - E(\nu)| > \epsilon) \leq \frac{\mbox{var}(\nu)}{\epsilon^2}. \end{aligned}$$

Now, if Z = B(N, p), one has E(Z) = Np and var(Z) = Np(1 − p). Hence, since ν = B(100, 0.2), one has E(ν) = 20 and var(ν) = 16. Chebyshev’s inequality gives

$$\displaystyle \begin{aligned} P(|\nu - 20| > \epsilon) \leq \frac{16}{\epsilon^2}. \end{aligned}$$

Thus, we expect that

$$\displaystyle \begin{aligned} P(\nu - 20 > \epsilon) \leq \frac{8}{\epsilon^2}, \end{aligned}$$

because it is reasonable to think that the distribution of ν is almost symmetric around its mean, as we see in Fig. 3.4. We want to choose m = 20 + 𝜖 so that P(ν > m) ≤ 5%. This means that we should choose 𝜖 so that 8∕𝜖 2 = 5%. This gives 𝜖 = 13, so that m = 33. Thus, according to Chebyshev’s inequality, it is safe to assume that no more than 33 users are active and we can choose C so that C∕33 is a satisfactory rate for users.

As a second approach, we use Chernoff’s inequality (15.5) which states that

$$\displaystyle \begin{aligned} P(\nu \geq Na) \leq E(\exp\{\theta(\nu - Na)\}), \forall \theta > 0. \end{aligned}$$

To calculate the right-hand side, we note that if Z = B(N, p), then we can write Z = X(1) + ⋯ + X(N), where the X(n) are i.i.d. random variables with P(X(n) = 1) = p and P(X(n) = 0) = 1 − p. Then,

$$\displaystyle \begin{aligned} E(\exp\{\theta Z\}) & = E(\exp\{\theta X(1) + \cdots + \theta X(N)\}) \\ & = E(\exp\{\theta X(1)\} \times \cdots \times \exp\{\theta X(N)\} ). \end{aligned} $$

To continue the calculation, we note that, since the X(n) are independent, so are the random variables \(\exp \{\theta X(n)\}\). Also, the expected value of a product of independent random variables is the product of their expected values (see Appendix A). Hence,

$$\displaystyle \begin{aligned} E(\exp\{\theta Z\}) & = E(\exp\{\theta X(1)\}) \times \cdots \times E(\exp\{\theta X(N)\}) \\ & = E( \exp\{ \theta X(1) \})^N = \exp\{N \varLambda (\theta ) \} \end{aligned} $$

where we define

$$\displaystyle \begin{aligned} \varLambda (\theta) = \log(E(\exp\{\theta X(1)\})). \end{aligned}$$

Thus, Chernoff’s inequality says that

$$\displaystyle \begin{aligned} & P(Z \geq Na) \leq \exp\{N \varLambda (\theta ) \} \exp\{- \theta Na \} \\ & \quad = \exp\{ N( \varLambda (\theta) - \theta a)\} \end{aligned} $$

Since this inequality holds for every θ > 0, let us minimize the right-hand side with respect to θ. That is, let us define

$$\displaystyle \begin{aligned} \varLambda^*(a) = \max_{\theta > 0} \{ \theta a - \varLambda (\theta ) \}. \end{aligned}$$

Then, we see that

$$\displaystyle \begin{aligned} P(Z \geq Na) \leq \exp\{- N \varLambda^*(a) \}. \end{aligned} $$
(15.7)

Figure 15.15 shows this function when p = 0.2.

Fig. 15.15 The logarithm divided by N of the probability of too many active users

We now evaluate \(\varLambda (\theta)\) and \(\varLambda^*(a)\). We find

$$\displaystyle \begin{aligned} E(\exp\{\theta X(1)\}) = 1 - p + pe^\theta, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} \varLambda(\theta) = \log(1 - p + pe^\theta) \end{aligned}$$

and

$$\displaystyle \begin{aligned} \varLambda^*(a) = \max_{\theta > 0} \{\theta a - \log(1 - p + pe^\theta)\}. \end{aligned}$$

Setting to zero the derivative with respect to θ of the term between brackets, we find

$$\displaystyle \begin{aligned} a = \frac{1}{1 - p + pe^\theta} (p e^\theta), \end{aligned}$$

which gives, for a > p,

$$\displaystyle \begin{aligned} e^\theta = \frac{a(1 - p)}{(1 - a)p}. \end{aligned}$$

Substituting back in \(\varLambda^*(a)\), we get

$$\displaystyle \begin{aligned} \varLambda^*(a) = a \log\Big(\frac{a}{p}\Big) + (1 - a) \log\Big(\frac{1 - a}{1-p}\Big), \forall a > p. \end{aligned}$$

Going back to our example, we want to find m = Na so that

$$\displaystyle \begin{aligned} P(\nu \geq Na) \approx 0.05. \end{aligned}$$

Using (15.7), we need to find Na so that

$$\displaystyle \begin{aligned} \exp\{- N \varLambda^*(a) \} \approx 0.05 = \exp\{ \log(0.05) \}, \end{aligned}$$

i.e.,

$$\displaystyle \begin{aligned} \varLambda^*(a) = - \frac{\log(0.05)}{N} \approx 0.03. \end{aligned}$$

Looking at Fig. 15.15, we find a = 0.30. This corresponds to m = 30. Thus, Chernoff’s estimate says that P(ν > 30) ≈ 5% and that we can size the network assuming that only 30 users are active at any one time.
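
For comparison, the sketch below (using natural logarithms, as in the computation above) evaluates the exact binomial tail, the Chernoff bound exp{−NΛ*(a)}, and the two-sided Chebyshev bound at m = 30 and m = 33.

```python
from math import comb, log, exp

N, p = 100, 0.2

def exact_tail(m):                       # P(B(N, p) >= m)
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(m, N + 1))

def chernoff(m):                         # exp{-N Lambda*(a)} with a = m/N
    a = m / N
    Lstar = a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))
    return exp(-N * Lstar)

def chebyshev(m):                        # two-sided bound var(Z)/(m - Np)^2
    return N * p * (1 - p) / (m - N * p) ** 2

for m in (30, 33):
    print(m, round(exact_tail(m), 4), round(chernoff(m), 4), round(chebyshev(m), 4))
```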

By the way, the calculations we have performed above show that Chernoff’s bound can be written as

$$\displaystyle \begin{aligned} P(Z \geq Na) \leq \frac{P(B(N, p) = Na)}{P(B(N, a) = Na)}. \end{aligned}$$

15.9 Martingales

A martingale represents the sequence of fortunes of someone playing a fair game of chance. In such a game, the expected gain is always zero. A simple example is a random walk with zero-mean step size. Martingales are good models of noise and of processes discounted based on their expected value (e.g., the stock market). This theory is due to Doob (1953).

Martingales have an important property that generalizes the strong law of large numbers. It says that a martingale whose expected absolute value remains bounded converges almost surely. This result is used to show that fluctuations vanish and that a process converges to its mean value. The convergence of stochastic gradient algorithms and approximations of random processes by differential equations follow from that property.

15.9.1 Definitions

Let X n be the fortune at time n ≥ 0 when one plays a game of chance. The game is fair if

$$\displaystyle \begin{aligned} E[X_{n+1} | X^n] = X_n, \forall n \geq 0. \end{aligned} $$
(15.8)

In this expression, X n := {X m, m ≤ n}. Thus, in a fair game, one cannot expect to improve one’s fortune. A sequence {X n, n ≥ 0} of random variables with that property is a martingale.

This basic definition generalizes to the case where one has access to additional information and is still unable to improve one’s fortune. For instance, say that the additional information is the value of other random variables Y n. One then has the following definitions.

Definition 15.3 (Martingale, Supermartingale, Submartingale)

The sequence of random variables {X n, n ≥ 0} is a martingale with respect to {X n, Y n, n ≥ 0} if

$$\displaystyle \begin{aligned} E[X_{n+1} | X^n, Y^n ] = X_n, \forall n \geq 0 \end{aligned} $$
(15.9)

with X n = {X m, m ≤ n} and Y n = {Y m, m ≤ n}.

If (15.9) holds with =  replaced by ≤, then X n is a supermartingale; if it holds with ≥, then X n is a submartingale. ◇

In many cases, we do not specify the random variables Y n and we simply say that X n is a martingale, or a submartingale, or a supermartingale.

Note that if X n is a martingale, then

$$\displaystyle \begin{aligned} E(X_n) = E(X_0), \forall n \geq 0. \end{aligned}$$

Indeed, E(X n) = E(E[X n|X 0, Y 0]) by the smoothing property of conditional expectation (see Theorem 9.5).

15.9.2 Examples

A few examples illustrate the definition.

Random Walk

Let {Z n, n ≥ 0} be independent and zero-mean random variables. Then X n := Z 0 + ⋯ + Z n for n ≥ 0 is a martingale. Indeed,

$$\displaystyle \begin{aligned} E[X_{n+1} | X^n] = E[ Z_0 + \cdots + Z_n + Z_{n+1} | Z_0, \ldots , Z_n ] = Z_0 + \cdots + Z_n = X_n. \end{aligned}$$

Note that if E(Z n) ≤ 0, then X n is a supermartingale; if E(Z n) ≥ 0, then X n is a submartingale.

Product

Let {Z n, n ≥ 0} be independent random variables with mean 1. Then X n := Z 0 ×⋯ × Z n for n ≥ 0 is a martingale. Indeed,

$$\displaystyle \begin{aligned} E[X_{n+1} | X^n] = E[ Z_0 \times \cdots \times Z_n \times Z_{n+1} | Z_0, \ldots , Z_n ] = Z_0 \times \cdots \times Z_n = X_n. \end{aligned}$$

Note that if Z n ≥ 0 and E(Z n) ≤ 1 for all n, then X n is a supermartingale. Similarly, if Z n ≥ 0 and E(Z n) ≥ 1 for all n, then X n is a submartingale.

Branching Process

For m ≥ 1 and n ≥ 0, let \(X_m^n\) be i.i.d. random variables distributed like X that take values in \(\mathbb {Z}_+ := \{0, 1, 2, \ldots \}\) and have mean μ. The branching process is defined by Y 0 = 1 and

$$\displaystyle \begin{aligned} Y_{n+1} = \sum_{m = 1}^{Y_n} X_m^n, n \geq 0. \end{aligned}$$

The interpretation is that there are Y n individuals in a population at the n-th generation. Individual m in that population has \(X_m^n\) children.

One can see that

$$\displaystyle \begin{aligned} Z_n = \mu^{-n} Y_n, n \geq 0 \end{aligned}$$

is a martingale. Indeed,

$$\displaystyle \begin{aligned} E[Y_{n+1} | Y_0, \ldots , Y_n] = Y_n \mu, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} E[Z_{n+1} | Z_0, \ldots , Z_n] = E[ \mu^{- (n + 1)} Y_{n+1} | Y_0, \ldots , Y_n] = \mu^{-n} Y_n = Z_n. \end{aligned}$$

Let \(f(s) = E(s^X)\) be the generating function of X and let q be the smallest nonnegative solution of q = f(q). One can then show that

$$\displaystyle \begin{aligned} W_n = q^{Y_n}, n \geq 0 \end{aligned}$$

is a martingale.

Proof

Exercise. □

Doob Martingale

Let {X n, n = 1, …, N} be random variables and Y = f(X 1, …, X N), where f is some bounded measurable real-valued function. Then

$$\displaystyle \begin{aligned} Z_n := E[Y \mid X^n], n = 0, \ldots , N \end{aligned}$$

is a martingale (by the smoothing property of conditional expectation, see Theorem 9.5) called a Doob martingale. Here are two examples.

  1. 1.

    Throw N balls into M bins, and let Y  be some function of the throws: the number of empty bins, the max load, the second-highly loaded bin, or some similar function. Let X n be the index of the bin into which ball n lands. Then Z n = E[Y ∣X n] is a martingale.

  2. 2.

    Suppose we have r red and b blue balls in a bin. We draw balls without replacement from this bin: what is the number of red balls drawn? Let X n be the indicator for whether ball n is red, and let Y = X 1 + ⋯ + X n be the number of red balls. Then Z n is a martingale.

You Cannot Beat the House

To study convergence, we start by explaining a key property of martingales that says there is no winning recipe to play a fair game of chance.

Theorem 15.9 (You Cannot Win)

Let X n be a martingale with respect to {X n, Z n, n ≥ 0} and V n some bounded function of (X n, Z n). Then

$$\displaystyle \begin{aligned} Y_n = \sum_{m=1}^n V_{m-1} (X_m - X_{m-1}), n \geq 1, \end{aligned} $$
(15.10)

with Y 0 := 0 is a martingale. \({\blacksquare }\)

Proof

One has

$$\displaystyle \begin{aligned} & E[Y_n - Y_{n-1} \mid X^{n-1}, Z^{n-1} ] \\ & \quad = E[ V_{n-1}(X_n - X_{n-1}) \mid X^{n-1}, Z^{n-1}] \\ & \quad = V_{n-1} E[X_n - X_{n-1} \mid X^{n-1}, Z^{n-1}] = 0. \end{aligned} $$

The meaning of Y n is the fortune that you would get by betting V m−1 at time m − 1 on the gain X m − X m−1 of the next round of the game. This bet must be based on the information (X m−1, Z m−1) that you have when placing the bet, not on the outcome of the next round, obviously. The theorem says that your fortune remains a martingale even after adjusting your bets in real time.
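
A quick simulation illustrates this (a sketch, assuming a fair ±1 game and the arbitrary rule "bet 1 only after a losing step"): whatever the non-anticipative betting rule, the average final fortune stays near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n = 100_000, 50

# Fair game: each increment X_m - X_{m-1} is +/-1 with probability 1/2.
steps = rng.choice([1, -1], size=(trials, n))

# An arbitrary non-anticipative strategy: bet 1 only after a losing step.
bets = np.ones((trials, n))
bets[:, 1:] = (steps[:, :-1] == -1)      # V_{m-1} is a function of the past only

Y_n = (bets * steps).sum(axis=1)         # fortune from (15.10)
print("average final fortune:", round(float(Y_n.mean()), 4))   # close to 0
```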

Stopping Times

When playing a game of chance, one may decide to stop after observing a particular sequence of gains and losses. The decision to stop is non-anticipative. That is, one cannot say “never mind, I did not mean to play the last three rounds.” Thus, the random stopping time τ must have the property that the event {τ ≤ n} must be a function of the information available at time n, for all n ≥ 0. Such a random time is a stopping time.

Definition 15.4 (Stopping Time)

A random variable τ is a stopping time for the sequence {X n, Y n, n ≥ 0} if τ takes values in {0, 1, 2, …} and

$$\displaystyle \begin{aligned} P[\tau \leq n | X_m, Y_m, m \geq 0] = \phi_n(X^n, Y^n) , \forall n \geq 0 \end{aligned}$$

for some functions ϕ n. ◇

For instance,

$$\displaystyle \begin{aligned} \tau = \min \{n \geq 0 \mid (X_n, Y_n) \in \mathcal{A}\}, \end{aligned}$$

where \(\mathcal {A}\) is a set in \(\Re ^2\), is a stopping time for the sequence {X n, Y n, n ≥ 0}. Thus, you may want to stop the first time that either you go broke or your fortune exceeds $1000.00.

One might hope that a smart choice of when to stop playing a fair game could improve one’s expected fortune. However, that is not the case, as the following fact shows.

Theorem 15.10 (Optional Stopping)

Let {X n, n ≥ 0} be a martingale and τ a stopping time with respect to {X n, Y n, n ≥ 0}. Then

$$\displaystyle \begin{aligned} E[X_{\tau \wedge n} | X_0, Y_0] = X_0. \end{aligned}$$

\({\blacksquare }\)

In the statement of the theorem, for a random time σ one defines X σ := X n when σ = n.

Proof

Note that \(X_{\tau \wedge n}\) is the fortune Y_n that one accumulates by betting V_m = 1{τ ∧ n > m} at time m in (15.10), i.e., by betting 1 until one stops at time τ ∧ n. Since 1{τ ∧ n > m} = 1 − 1{τ ∧ n ≤ m} = ϕ(X^m, Y^m), the resulting fortune is a martingale. □

You will note that bounding τ ∧ n in the theorem above is essential. For instance, let X n correspond to the random walk described above with P(Z n = 1) = P(Z n = −1) = 0.5. If we define \(\tau = \min \{n \geq 0 \mid X_n = 10\}\), one knows that τ is finite. (See the comments below Theorem 15.1.) Hence, X τ = 10, so that

$$\displaystyle \begin{aligned} E[X_\tau | X_0 = 0] = 10 \neq X_0.\end{aligned} $$

However, if we bound the stopping time, the theorem says that

$$\displaystyle \begin{aligned} E[X_{\tau \wedge n} | X_0 = 0] = 0.\end{aligned} $$
(15.11)

This result deserves some thought.

One might be tempted to take the limit of the left-hand side of (15.11) as n → ∞ and note that

$$\displaystyle \begin{aligned} \lim_{n \rightarrow \infty} X_{\tau \wedge n} = X_\tau = 10,\end{aligned} $$

because τ is finite. One then might conclude that the left-hand side of (15.11) goes to 10, which would contradict (15.11). However, the limit and the expectation do not interchange because the random variables \(X_{\tau \wedge n}\) are not bounded. If they were, one would get E[X_τ|X_0] = X_0 by the dominated convergence theorem. We record this observation as the next result.

Theorem 15.11 (Optional Stopping—2)

Let {X n, n ≥ 0} be a martingale and τ a stopping time with respect to {X n, Y n, n ≥ 0}. Assume that |X n|≤ V  for some random variable V  such that E(V ) < ∞. Then

$$\displaystyle \begin{aligned} E[X_\tau | X_0, Y_0] = X_0. \end{aligned}$$

\({\blacksquare }\)

L 1-Bounded Martingales

An L 1-bounded martingale cannot bounce up and down infinitely often across an interval [a, b]. For if it did, you could increase your fortune without bound by betting 1 on the way up across the interval and betting 0 on the way down. We will see shortly that this cannot happen. As a result, the martingale must converge. (Note that this is not true if the martingale is not L 1-bounded, as the random walk example shows.)

Theorem 15.12 (L 1-Bounded Martingales Convergence)

Let {X n, n ≥ 0} be a martingale such that E(|X n|) ≤ K for all n. Then X n converges almost surely to a finite random variable \(X_\infty\). \({\blacksquare }\)

Proof

Consider an interval [a, b]. We show that X n cannot up-cross this interval infinitely often. (See Fig. 15.16.) Let us bet 1 on the way up and 0 on the way down. That is, wait until X n gets first below a, then bet 1 at every step until X n > b, then stop betting until X n gets below a, and continue in this way.

Fig. 15.16 If X_n does not converge, there are some rational numbers a < b such that X_n crosses the interval [a, b] infinitely often

If X m crossed the interval U n times by time n, your fortune Y n is now at least (b − a)U n + (X n − a). Indeed, your gain was at least b − a for every upcrossing and, in the last steps of your playing, you lose at most X n − a if X n never crosses above b after you last resumed betting. But, since Y n is a martingale, we have

$$\displaystyle \begin{aligned} E(Y_n) = Y_0 \geq (b - a) E(U_n) + E(X_n - a) \geq (b - a)E(U_n) - K - a. \end{aligned}$$

(We used the fact that X_n ≥ −|X_n|, so that E(X_n) ≥ −E(|X_n|) ≥ −K.) This shows that \(E(U_n) \leq B := (K + Y_0 + a)/(b - a) < \infty\). Letting n → ∞, since U_n ↑ U, where U is the total number of upcrossings of the interval [a, b], it follows by the monotone convergence theorem that E(U) ≤ B. Consequently, U is finite. Thus, X_n cannot up-cross any given interval [a, b] infinitely often.

Consequently, the probability that it up-crosses infinitely often any interval with rational limits is zero (since there are countably many such intervals).

This implies that X_n must converge, either to + ∞, −∞, or to a finite value. Since E(|X_n|) ≤ K, the probability that X_n converges to + ∞ or −∞ is zero. □

The following is a direct but useful consequence. We used this result in the proof of the convergence of the stochastic gradient projection algorithm (Theorem 12.2).

Theorem 15.13 (L 2-Bounded Martingales Convergence)

Let X_n be an L 2 -bounded martingale, i.e., such that \(E(X_n^2) \leq K^2, \forall n \geq 0\). Then \(X_n \rightarrow X_\infty\) almost surely, for some finite random variable \(X_\infty\). \({\blacksquare }\)

Proof

We have

$$\displaystyle \begin{aligned} E(|X_n|)^2 \leq E(X_n^2) \leq K^2, \end{aligned}$$

by Jensen’s inequality. Thus, it follows that E(|X n|) ≤ K for all n, so that the result of the theorem applies to this martingale. □

One can also show that \(E(|X_n - X_\infty|^2) \rightarrow 0\).

15.9.3 Law of Large Numbers

The SLLN can be proved as an application of the convergence of martingales, as Doob (1953) showed.

Theorem 15.14 (SLLN)

Let {X n, n ≥ 1} be i.i.d. random variables with E(|X n|) = K < ∞ and E(X n) = μ. Then

$$\displaystyle \begin{aligned} \frac{X_1+ \cdots + X_n}{n} \rightarrow \mu, \mathit{\mbox{ almost surely as }} n \rightarrow \infty. \end{aligned}$$

\({\blacksquare }\)

Proof

Let

$$\displaystyle \begin{aligned} S_n = X_1 + \cdots + X_n, n \geq 1. \end{aligned}$$

Note that

$$\displaystyle \begin{aligned} E[X_1 | S_n, S_{n+1}, \ldots ] = \frac{1}{n} S_n =: Y_{-n}, \end{aligned} $$
(15.12)

by symmetry. Thus,

$$\displaystyle \begin{aligned} E[Y_{-n} \mid S_{n+1}, \ldots ] &= E[ E[X_1 \mid S_n, S_{n+1}, \ldots ] \mid S_{n+1}, \ldots ] \\ &= E[X_1 \mid S_{n+1}, \ldots ] = Y_{-n - 1}. \end{aligned} $$

Thus, \(\{\ldots, Y_{-n-1}, Y_{-n}, \ldots, Y_{-1}\}\) is a martingale. (It is a Doob martingale.) This implies as before that the number U_n of upcrossings of an interval [a, b] is such that E(U_n) ≤ B < ∞. As before, we conclude that \(U := \lim U_n < \infty \), almost surely. Hence, Y_{−n} converges almost surely to a random variable Y_{−∞}.

Now, since

$$\displaystyle \begin{aligned} Y_{- \infty} = \lim_{n \rightarrow \infty} \frac{X_1 + \cdots + X_n}{n}, \end{aligned}$$

we see that Y_{−∞} is independent of (X_1, …, X_n) for any finite n. Indeed, the limit does not depend on the values of the first n random variables. However, since Y_{−∞} is a function of {X_n, n ≥ 1}, it must be independent of itself, i.e., be a constant.

Since \(E(Y_{-\infty}) = E(Y_{-1}) = \mu\), we see that Y_{−∞} = μ. □

15.9.4 Wald’s Equality

A useful application of martingales is the following. Let {X_n, n ≥ 1} be i.i.d. random variables. Let τ be a random variable independent of the X_n’s that takes values in {1, 2, …} with E(τ) < ∞. Then

$$\displaystyle \begin{aligned} E(X_1 + \cdots + X_\tau ) = E(\tau) E(X_1). \end{aligned} $$
(15.13)

This expression is known as Wald’s Equality.

To see this, note that Y n = X 1 + ⋯ + X n − nE(X 1) is a martingale. Also, τ is a stopping time. Thus,

$$\displaystyle \begin{aligned} E(Y_{\tau \wedge n}) = E(Y_1) = 0, \end{aligned}$$

which gives the identity with τ replaced by τ ∧ n. If E(τ) < ∞, one can let n go to infinity and get the result. (For instance, replace X_i by \(X_i^+\) and use MCT, similarly for \(X_i^-\), then subtract.)
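
A small simulation illustrates the identity (a sketch, assuming X_n i.i.d. Exp(1) and τ geometric with mean 5, independent of the X_n; the distributions are chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 50_000

tau = rng.geometric(0.2, size=trials)           # E(tau) = 5, independent of the X_n
totals = np.array([rng.exponential(1.0, size=t).sum() for t in tau])  # X_n ~ Exp(1)

print("E(X_1 + ... + X_tau) ~", round(float(totals.mean()), 3))
print("E(tau) E(X_1)        ~", round(float(tau.mean()) * 1.0, 3))
```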

15.10 Summary

  • General inference problems: guessing X given Y , Bayesian or not;

  • Sufficient statistic: h(Y ) is sufficient for X;

  • Infinite Markov Chains: PR, NR, T;

  • Lyapunov–Foster Criterion;

  • Poisson Process: independent stationary increments;

  • Continuous-Time Markov Chain: rate matrix;

  • Shannon Capacity of BSC: typical sequences and random codes;

  • Bounds: Chernoff and Jensen;

  • Martingales and Convergence;

  • Strong Law of Large Numbers.

15.10.1 Key Equations and Formulas

  • Inference Problem: guess X given Y via MAP, MLE, or HT (S.15.1)

  • Sufficient Statistic: \(f_{Y|X}[y|x] = f(h(y), x)g(y)\) (D.15.1)

  • Infinite MC: irreducible ⇒ transient, null recurrent, or positive recurrent (T.15.1)

  • Poisson Process: jumps w.p. λ𝜖 in the next 𝜖 seconds (D.15.2)

  • Continuous-Time MC: jumps from i to j with rate Q(i, j) (D.6.1)

  • Shannon Capacity C: one can transmit reliably at any rate R < C (S.15.7)

  • Capacity of BSC(p): \(C = 1 + p\log_2(p) + (1 - p)\log_2(1 - p)\) (15.2)

  • Chernoff: \(P(X > a) \leq E(\exp \{ \theta (X - a)\}), \forall \theta \geq 0\) (15.5)

  • Jensen: h convex ⇒ E(h(X)) ≥ h(E(X)) (15.6)

  • Martingales: zero expected increase (D.15.3)

  • MG Convergence: a.s. to a finite RV if L 1- or L 2-bounded (T.15.12)

  • Wald: E(X_1 + ⋯ + X_τ) = E(τ)E(X_1) (15.13)

15.11 References

For the theory of Markov chains, see Chung (1967). The text Harchol-Balter (2013) explains basic queueing theory and many applications to computer systems and operations research.

The book Bremaud (1998) is also highly recommended for its clarity and the breadth of applications. Information Theory is explained in the textbook Cover and Thomas (1991). I learned the theory of martingales mostly from Neveu (1975). The theory of multi-armed bandits is explained in Cesa-Bianchi and Lugosi (2006). The text Hastie et al. (2009) is an introduction to applications of statistics in data science (Fig. 15.17).

Fig. 15.17 CTMC

15.12 Problems

Problem 15.1

Suppose that y 1, …, y n are i.i.d. samples of N(μ, σ 2). What is a sufficient statistic for estimating μ given σ = 1? What is a sufficient statistic for estimating σ given μ = 1?

Problem 15.2

Customers arrive to a store according to a Poisson process with rate 4 (per hour).

  1. (a)

    What is the probability that exactly 3 customers arrive during 1 h?

  2. (b)

    What is the probability that more than 40 min is required before the first customer arrives?

Problem 15.3

Consider two independent Poisson processes with rates λ 1 and λ 2. Those processes measure the number of customers arriving in stores 1 and 2.

  1. (a)

    What is the probability that a customer arrives in store 1 before any arrives in store 2?

  2. (b)

    What is the probability that in the first hour exactly 6 customers arrive at the two stores? (The total for both is 6)

  3. (c)

    Given that exactly 6 customers have arrived at the two stores, what is the probability that all 6 went to store 1?

Problem 15.4

Consider the continuous-time Markov chain in Fig. 15.17.

  1. (a)

    Find the invariant distribution.

  2. (b)

    Simulate the MC and check that the fraction of time spent in state 1 converges to π(1) (a minimal simulation sketch follows this problem).
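A minimal simulation sketch for part (b): since the rates of Fig. 15.17 are not reproduced here, the 3-state rate matrix Q below is a hypothetical placeholder to be replaced by the rates from the figure (states are indexed 0, 1, 2).

```python
import numpy as np

# Hypothetical 3-state rate matrix; replace with the rates of Fig. 15.17.
Q = np.array([[-2.0,  1.0,  1.0],
              [ 3.0, -4.0,  1.0],
              [ 2.0,  2.0, -4.0]])

rng = np.random.default_rng(0)
T_total, t, state = 10_000.0, 0.0, 0
time_in_state = np.zeros(len(Q))

while t < T_total:
    rate = -Q[state, state]                  # total rate out of the current state
    hold = rng.exponential(1.0 / rate)       # exponential holding time
    time_in_state[state] += hold
    t += hold
    p = np.maximum(Q[state], 0.0)            # jump proportionally to the off-diagonal rates
    state = rng.choice(len(Q), p=p / p.sum())

print("empirical fraction of time in each state:", time_in_state / time_in_state.sum())

# Invariant distribution pi: solve pi Q = 0 with the entries of pi summing to 1.
A = np.vstack([Q.T, np.ones(len(Q))])
b = np.concatenate([np.zeros(len(Q)), [1.0]])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print("invariant distribution pi:", pi)
```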

Problem 15.5

Consider a first-come-first-served discrete-time queuing system with a single server. The arrivals are Bernoulli with rate λ. The service times are i.i.d. and independent of the arrival times. Each service time Z takes values in {1, 2, …, K} such that E(Z) = 1∕μ and λ < μ.

  1. (a)

    Construct the Markov chain that models the queue. What are the states and transition probabilities? [Hint: Suppose the head of the line task of the queue still requires z units of service. Include z in the state description of the MC.]

  2. (b)

    Use a Lyapunov–Foster argument to show that the queue is stable, i.e., that the MC is positive recurrent.

Problem 15.6

Suppose that the random variable X takes values in the set {1, 2, …, K} with \(\Pr (X=k) = p_k > 0\) and \(\sum _{k=1}^K p_k =1\). Let X 1, X 2, …, X n be a sequence of n i.i.d. samples of X.

  1. (a)

    How many possible sequences exist?

  2. (b)

    How many typical sequences exist when n is large?

  3. (c)

    Find a condition under which the answers to parts (a) and (b) are the same.

Problem 15.7

Let {N t, t ≥ 0} be a Poisson process with rate λ. Let S n denote the time of the n-th event. Find

  1. (a)

    the pdf of S n.

  2. (b)

    E[S 5].

  3. (c)

    E[S 4|N(1) = 2].

  4. (d)

    E[N(4) − N(2)|N(1) = 3].

Problem 15.8

A queue has Poisson arrivals with rate λ. It has two servers that work in parallel. When there are at least two customers in the queue, two are being served. When there is only one customer, only one server is active. The service times are i.i.d. Exp(μ).

  1. (a)

    Argue that the queue length is a Markov Chain.

  2. (b)

    Draw the state transition diagram.

  3. (c)

    Find the minimum value of μ so that the queue is positive recurrent and solve the balance equations.

Problem 15.9

Let {X t, t ≥ 0} be a continuous-time Markov chain with rate matrix Q = {q(i, j)}. Define q(i) =∑ji q(i, j). Let also \(T_i = \inf \{t > 0 | X_t = i\}\) and \(S_i = \inf \{t > 0 | X_t \neq i\}\). Then (select the correct answers)

  • E[S i|X 0 = i] = q(i);

  • P[T i < T j|X 0 = k] = q(k, i)∕(q(k, i) + q(k, j)) for i, j, k distinct;

  • If α(k) = P[T i < T j|X 0 = k], then \(\alpha (k) = \sum _s \frac {q(k, s)}{q(k)} \alpha (s)\) for k∉{i, j}.

Problem 15.10

A continuous-time queue has Poisson arrivals with rate λ, and it is equipped with infinitely many servers. The servers can work in parallel on multiple customers, but they are non-cooperative in the sense that a single customer can only be served by one server. Thus, when there are k customers in the queue, k servers are active. Suppose that the service time of each customer is exponentially distributed with rate μ and they are i.i.d.

  1. (a)

    Argue that the queue length is a Markov chain. Draw the transition diagram of the Markov chain.

  2. (b)

    Prove that for all finite values of λ and μ the Markov chain is positive recurrent and find the invariant distribution.

Problem 15.11

Consider a Poisson process {N t, t ≥ 0} with rate λ = 1. Let the random variable S i denote the time of the i-th arrival. [Hint: Recall that \(f_{S_i}(x) = \frac {x^{i-1}e^{-x}}{(i-1)!}1\{x \geq 0\}\).]

  1. (a)

    Given S 3 = s, find the joint distribution of S 1 and S 2. Show your work.

  2. (b)

    Find E[S 2|S 3 = s].

  3. (c)

    Find E[S 3|N 1 = 2].

Problem 15.12

Let \(S = \sum _{i=1}^N X_i\) denote the total amount of money withdrawn from an ATM in 8 h, where:

  1. (a)

    X i are i.i.d. random variables denoting the amount withdrawn by each customer with E[X i] = 30 and Var[X i] = 400.

  2. (b)

    N is a Poisson random variable denoting the total number of customers with E[N] = 80.

Find E[S] and Var[S].

Problem 15.13

One is given two independent Poisson processes M t and N t with respective rates λ and μ, where λ > μ. Find E(τ), where

$$\displaystyle \begin{aligned} \tau = \max\{t \geq 0 \mid M_t \leq N_t + 5\}. \end{aligned}$$

(Note that this is a max, not a min.)

Problem 15.14

Consider a queue with Poisson arrivals with rate λ. The service times are all equal to one unit of time. Let X t be the queue length at time t (t ≥ 0).

  1. (a)

    Is X t a Markov chain? Prove or disprove.

  2. (b)

    Let Y n be the queue length just after the n-th departure from the queue (n ≥ 1). Prove that Y n is a Markov chain. Draw a state diagram.

  3. (c)

    Prove that Y n is positive recurrent when λ < 1.

Problem 15.15

Consider a queue with Poisson arrivals with rate λ. The queue can hold N customers. The service times are i.i.d. Exp(μ). When a customer arrives, you can choose to pay him c so that he does not join the queue. You also pay c when a customer arrives at a full queue. You want to decide when to accept customers to minimize the cost of rejecting them, plus the cost of the average waiting time they spend in the queue.

  1. (a)

    Formulate the problem as a Markov decision problem. For simplicity, consider a total discounted cost. That is, if x t customers are in the system at time t, then the waiting cost during [t, t + 𝜖] is e −βt x t 𝜖. Similarly, if you reject a customer at time t, then the cost is ce −βt.

  2. (b)

    Write the dynamic programming equations.

  3. (c)

    Use Python to solve the equations (a minimal value-iteration sketch follows this problem).
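A minimal value-iteration sketch for part (c). It assumes a single server and the illustrative parameter values below (neither is specified in the problem), and it uses uniformization with rate λ + μ, so the fixed point of the recursion is the discounted value function and the admission rule is read off from it.

```python
import numpy as np

# Value-iteration sketch for the admission-control problem, via uniformization.
# Assumptions (not from the problem statement): a single server, and the
# illustrative parameter values below.
lam, mu, beta, c, N = 0.8, 1.0, 0.1, 5.0, 10
Lam = lam + mu                          # uniformization rate

V = np.zeros(N + 1)                     # V[x] = discounted cost-to-go with x customers
for _ in range(100_000):
    Vnew = np.empty_like(V)
    for x in range(N + 1):
        admit = V[x + 1] if x < N else np.inf      # cannot admit into a full queue
        reject = c + V[x]                          # lump-sum rejection cost
        depart = V[max(x - 1, 0)]                  # dummy transition when the queue is empty
        # holding cost rate x, arrival rate lam, (potential) departure rate mu
        Vnew[x] = (x + lam * min(admit, reject) + mu * depart) / (beta + Lam)
    if np.max(np.abs(Vnew - V)) < 1e-9:
        V = Vnew
        break
    V = Vnew

# Admit while admitting is no more costly than paying the rejection cost.
policy = ["admit" if x < N and V[x + 1] <= c + V[x] else "reject" for x in range(N + 1)]
print("value function:", np.round(V, 3))
print("policy by queue length:", policy)
```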

Problem 15.16

The counting process N := {N t, 0 ≤ t ≤ T} is defined as follows:

Given τ, {N t, 0 ≤ t ≤ τ} and {N t − N τ, τ ≤ t ≤ T} are independent Poisson processes with respective rates λ 0 and λ 1.

Here, λ 0 and λ 1 are known and such that 0 < λ 0 < λ 1. Also, τ is exponentially distributed with known rate μ > 0.

  1. 1.

    Find the MLE of τ given N.

  2. 2.

    Find the MAP of τ given N.

Problem 15.17

Figure 15.18 shows a system where a source alternates between the ON and OFF states according to a continuous-time Markov chain with the transition rates indicated. When the source is ON, it sends fluid into the queue at rate 2. When the source is OFF, it does not send any fluid. The queue is drained at constant rate 1 whenever it contains some fluid. Let X t be the amount of fluid in the queue at time t ≥ 0.

  1. (a)

    Plot a typical trajectory of the random process {X t, t ≥ 0}.

    Fig. 15.18  The system

  2. (b)

    Intuitively, what are conditions on λ and μ that should guarantee the “stability” of the queue?

  3. (c)

    Is the process {X t, t ≥ 0} Markov?

Problem 15.18

Let {N t, t ≥ 0} be a Poisson process whose rate λ is a random variable that is exponentially distributed with rate μ > 0.

  1. (a)

    Find MLE[λ|N s, 0 ≤ s ≤ t];

  2. (b)

    Find MAP[λ|N s, 0 ≤ s ≤ t];

  3. (c)

    What is a sufficient statistic for λ given {N s, 0 ≤ s ≤ t}?

  4. (d)

    Instead of λ being exponentially distributed, assume that λ is known to take values in [5, 10]. Give an estimate of the time t required to estimate λ within 5% with probability 95%.

Problem 15.19

Consider two queues in parallel in discrete time, with Bernoulli arrival processes of rates λ 1 and λ 2 and geometric service times with rates μ 1 and μ 2, respectively. There is a single server that can serve either queue 1 or queue 2 at each time. Consider the scheduling policy that serves queue 1 at time n if μ 1 Q 1(n) > μ 2 Q 2(n) and serves queue 2 otherwise, where Q 1(n) and Q 2(n) are the lengths of the queues at time n. Use the Lyapunov function \(V(Q_1(n),Q_2(n)) = Q^2_1(n) + Q^2_2(n)\) to show that the queues are stable if λ 1∕μ 1 + λ 2∕μ 2 < 1. This scheduling policy is known as the Max-Weight or Back-Pressure policy.