Topics: Optimality of Huffman Codes, LDPC Codes, Proof of Neyman–Pearson Theorem, Jointly Gaussian RVs, Statistical Tests, ANOVA

8.1 Proof of Optimality of the Huffman Code

We stated the following result in Chap. 7. Here, we provide a proof.

Theorem 8.1 (Optimality of Huffman Code)

The Huffman code has the smallest average number of bits per symbol among all prefix-free codes (Fig. 8.1). \({\blacksquare }\)

Fig. 8.1 Huffman code

Proof

The argument in Huffman (1952) is by induction on the number of symbols. Assume that the Huffman code has the minimum average path length L(n) for n symbols and that, for n + 1 symbols, there is some other tree T with an average path length A(n + 1) smaller than that of the Huffman code. Let X and Y  be the two least frequent symbols and x ≥ y their frequencies. We can pick these symbols in T so that their path lengths are maximum and such that Y  has the largest path length in T; otherwise, we could swap Y  in T with a more frequent symbol and reduce the average path length. Accept for now the claim that we can also pick X and Y  so that they are siblings in T. By merging X and Y  into their parent Z with frequency z = x + y, we construct a code for n symbols with average path length A(n + 1) − z. Hence, L(n) ≤ A(n + 1) − z. Now, the Huffman code for n + 1 symbols would also merge X and Y, so that its average path length is L(n + 1) := L(n) + z. Thus, L(n + 1) ≤ A(n + 1), which contradicts the assumption that the Huffman code is not optimal for n + 1 symbols. It remains to prove the claim that X and Y  can be chosen to be siblings. First note that, since Y  has the maximum path length, it cannot be an only child; otherwise, we could replace its parent by Y  and reduce the average path length. Say that Y  has a sibling V  other than X. By swapping V  and X, one does not increase the average path length, since the frequency of V  is not smaller than that of X. This concludes the proof. □
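To make the construction concrete, here is a minimal Python sketch of the Huffman procedure (repeatedly merging the two least frequent subtrees); the symbol frequencies are made up for illustration.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Return a prefix-free code (symbol -> bit string) built by
    repeatedly merging the two least frequent subtrees."""
    tiebreak = count()  # avoids comparing dicts when frequencies are equal
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, code1 = heapq.heappop(heap)  # least frequent subtree
        f2, _, code2 = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freq = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # hypothetical frequencies
code = huffman_code(freq)
avg_len = sum(freq[s] * len(code[s]) for s in freq)
print(code, avg_len)
```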

8.2 Proof of Neyman–Pearson Theorem 7.4

The idea of the proof is to consider any other decision rule that produces an estimate \(\tilde X\) with \(P[\tilde X = 1 | X = 0] \leq \beta \) and to show that

$$\displaystyle \begin{aligned} P[\tilde X = 1 | X = 1] \leq P[\hat X = 1 | X = 1], \end{aligned} $$
(8.1)

where \(\hat X\) is specified by the theorem. To show this, we note that

$$\displaystyle \begin{aligned} (\hat X - \tilde X)(L(Y) - \lambda) \geq 0. \end{aligned}$$

Indeed, when L(Y ) − λ > 0, one has \(\hat X = 1 \geq \tilde X\), so that the expression above is indeed nonnegative. Similarly, when L(Y ) − λ < 0, one has \(\hat X = 0 \leq \tilde X\), so that the expression is again nonnegative.

Taking the expected value of this expression given X = 0, we find

$$\displaystyle \begin{aligned} & E[ \hat X L(Y) | X = 0] - E[ \tilde X L(Y) | X = 0] \\ &~~~~~~~~~~~~~~~ \geq \lambda (E[\hat X | X = 0] - E[\tilde X | X = 0 ]). {} \end{aligned} $$
(8.2)

Now,

$$\displaystyle \begin{aligned} E[\hat X | X = 0] = P[\hat X = 1|X = 0] = \beta \geq P[ \tilde X = 1 | X = 0] = E[\tilde X | X = 0]. \end{aligned}$$

Hence, (8.2) implies that

$$\displaystyle \begin{aligned} E[ \hat X L(Y) | X = 0] \geq E[ \tilde X L(Y) | X = 0]. \end{aligned} $$
(8.3)

Observe that, for any function g(Y ), one has

$$\displaystyle \begin{aligned} E[g(Y)L(Y) | X = 0] &= \int g(y) L(y) f_{Y|X}[y|0] dy \\ &= \int g(y) \frac{f_{Y|X}[y|1] }{f_{Y|X}[y|0] } f_{Y|X}[y|0] dy \\ &= \int g(y) f_{Y|X}[y|1] dy \\ &= E[g(Y) | X = 1]. \end{aligned} $$

Note that this result continues to hold even for a function g(Y, Z) where Z is a random variable that is independent of X and Y . In particular,

$$\displaystyle \begin{aligned} E[\hat X L(Y) | X = 0] = E[\hat X | X = 1] = P[\hat X = 1 | X = 1]. \end{aligned}$$

Similarly,

$$\displaystyle \begin{aligned} E[\tilde X L(Y) | X = 0 ] = P[\tilde X = 1 | X = 1]. \end{aligned}$$

Combining these results with (8.3) gives (8.1).

\({\square }\)
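As a numerical illustration of the theorem (not part of the proof), consider the hypothetical case where Y is N(0, 1) under X = 0 and N(1, 1) under X = 1. The likelihood ratio is then increasing in y, so the Neyman–Pearson test is a threshold test on Y; a quick Monte Carlo sketch confirms that another rule with the same false alarm probability detects no better.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta = 0.05
n = 200_000

# Hypothetical model: Y ~ N(0,1) under X = 0 and Y ~ N(1,1) under X = 1.
y0 = rng.normal(0.0, 1.0, n)   # samples under X = 0
y1 = rng.normal(1.0, 1.0, n)   # samples under X = 1

# Neyman-Pearson test: reject when Y exceeds the (1 - beta) quantile under X = 0.
t = norm.ppf(1 - beta)
print("NP false alarm :", np.mean(y0 > t))      # about 0.05
print("NP detection   :", np.mean(y1 > t))      # about 0.26

# Another rule with the same false alarm probability: reject when |Y| is large.
t2 = norm.ppf(1 - beta / 2)
print("alt false alarm:", np.mean(np.abs(y0) > t2))  # about 0.05
print("alt detection  :", np.mean(np.abs(y1) > t2))  # smaller than the NP detection
```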

8.3 Jointly Gaussian Random Variables

In many systems, the errors in the different components of the measured vector Y are not independent. A suitable model for this situation is that

$$\displaystyle \begin{aligned} \mathbf{Y} = \mathbf{X} + A \mathbf{Z}, \end{aligned}$$

where Z = (Z 1, Z 2) is a pair of i.i.d. N(0, 1) random variables and A is some 2 × 2 matrix. The key idea here is that the components of the noise vector A Z will not be independent in general. For instance, if the two rows of A are identical, so are the two components of A Z. Thus, this model allows us to capture a dependency between the errors in the two components. The model also suggests that the dependency comes from the fact that the errors are different linear combinations of the same fundamental sources of noise.

For such a model, how does one compute MLE[X|Y]? We explain in the next section that

$$\displaystyle \begin{aligned} f_{\mathbf{Y} | \mathbf{X}}[\mathbf{y} | \mathbf{x}] = \frac{1}{2 \pi |A|} \exp\left\{- \frac{1}{2} (\mathbf{y} - \mathbf{x})' (AA')^{-1} (\mathbf{y} - \mathbf{x})\right\}, \end{aligned} $$
(8.4)

where A′ is the transpose of the matrix A, i.e., A′(i, j) = A(j, i) for i, j ∈{1, 2}.

Consequently, the MLE is the value x k of x that minimizes

$$\displaystyle \begin{aligned} (\mathbf{y} - \mathbf{x})' (AA')^{-1} (\mathbf{y} - \mathbf{x})= || A^{-1} \mathbf{y} - A^{-1} \mathbf{x} ||{}^2. \end{aligned}$$

(For simplicity, we assume that A is invertible.)

That is, we want to find the vector x k such that A −1 x k is the closest to A −1 y.

One way to understand this result is to note that

$$\displaystyle \begin{aligned} \mathbf{W} := A^{-1} \mathbf{Y} = A^{-1} \mathbf{X} + \mathbf{Z} =: \mathbf{V} + \mathbf{Z}. \end{aligned}$$

Thus, if we calculate W = A −1 Y from the measured vector Y, we find that, given X, the noise W −V = Z has i.i.d. N(0, 1) components. Hence, it is easy to calculate MLE[V|W = w]: it is the closest value to w in the set {A −1 x 1, …, A −1 x 16} of possible values of V. It is then reasonable to expect that we can recover the MLE of X by multiplying the MLE of V = A −1 X by A, i.e., that

$$\displaystyle \begin{aligned} MLE[\mathbf{X}|\mathbf{Y} = \mathbf{y}] = A \times MLE[\mathbf{V} | \mathbf{W} = A^{-1} \mathbf{y}]. \end{aligned}$$
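Here is a minimal numerical sketch of this whitening idea, with a made-up matrix A and a hypothetical 16-point constellation playing the role of {x 1, …, x 16}.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[1.0, 0.5],
              [0.2, 1.0]])                 # hypothetical mixing matrix
constellation = np.array([[a, b] for a in (-3., -1., 1., 3.)
                                 for b in (-3., -1., 1., 3.)])  # 16 candidate vectors

x_true = constellation[rng.integers(16)]
Z = rng.normal(size=2)                     # i.i.d. N(0,1) noise
Y = x_true + A @ Z                         # observed vector

# Whiten: W = A^{-1} Y = A^{-1} x + Z has i.i.d. N(0,1) noise.
A_inv = np.linalg.inv(A)
W = A_inv @ Y
V_points = constellation @ A_inv.T         # the possible values A^{-1} x_k

# MLE: the x_k whose whitened image A^{-1} x_k is closest to W.
k_hat = np.argmin(np.sum((V_points - W) ** 2, axis=1))
print("sent:", x_true, "MLE:", constellation[k_hat])
```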

8.3.1 Density of Jointly Gaussian Random Variables

Our goal in this section is to explain (8.4) and more general versions of this result.

We start by stating the main definition and a result that we prove later.

Definition 8.1 (Jointly Gaussian N(μ Y, Σ Y) Random Variables)

The random variables Y = (Y 1, …, Y n) are jointly Gaussian with mean μ Y and covariance Σ Y, which we write as Y =D N(μ Y, Σ Y), if

$$\displaystyle \begin{aligned} \mathbf{Y} = A \mathbf{X} + \mu_{\mathbf{Y}} \mbox{ with } \varSigma_{\mathbf{Y}} = AA', \end{aligned}$$

where X is a vector of independent N(0, 1) random variables. ◇

Here is the main result.

Theorem 8.2 (Density of N(μ Y, Σ Y) Random Variables)

Let Y =D N(μ Y, Σ Y). Then

$$\displaystyle \begin{aligned} f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{\sqrt{|\varSigma_{\mathbf{Y}}|} (2 \pi )^{n/2}} \exp\left\{- \frac{1}{2} (\mathbf{y} - \mu_{\mathbf{Y}})' \varSigma^{-1}_{\mathbf{Y}} (\mathbf{y} - \mu_{\mathbf{Y}}) \right\}. \end{aligned} $$
(8.5)

\({\blacksquare }\)

The level curves of this jpdf are ellipses, as sketched in Fig. 8.2.

Fig. 8.2 The level curves of f Y

Note that this joint distribution is determined by the mean and the covariance matrix. In particular, if Y  = (V , W ) are jointly Gaussian, then the joint distribution is characterized by the mean and Σ V, Σ W and cov(V, W). We know that if V and W are independent, then they are uncorrelated, i.e., cov(V, W) = 0. Since the joint distribution is characterized by the mean and covariance, we conclude that if they are uncorrelated, they are independent. We note this fact as a theorem.

Theorem 8.3 (Jointly Gaussian RVs Are Independent Iff Uncorrelated)

Let V and W be jointly Gaussian random variables. Then, they are independent if and only if they are uncorrelated. \({\blacksquare }\)

We will use the following result.

Theorem 8.4 (Linear Combinations of JG Are JG)

Let V and W be jointly Gaussian. Then A V + a and B W + b are jointly Gaussian. \({\blacksquare }\)

Proof

By definition, V and W are jointly Gaussian if they are linear functions of i.i.d. N(0, 1) random variables. But then A V + a and B W + b are linear functions of the same i.i.d. N(0, 1) random variables, so that they are jointly Gaussian. More explicitly, there are some i.i.d. \(\mathcal {N}(0, 1)\) random variables X so that

$$\displaystyle \begin{aligned} \left[ \begin{array}{c} \mathbf{V} \\ \mathbf{W} \end{array} \right] = \left[ \begin{array}{c} \mathbf{c} \\ \mathbf{d} \end{array} \right] + \left[ \begin{array}{c} C \\ D \end{array} \right] \mathbf{X}, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} \left[ \begin{array}{c} A \mathbf{V} + \mathbf{a} \\ B \mathbf{W} + \mathbf{b} \end{array} \right] = \left[ \begin{array}{c} \mathbf{a} + A \mathbf{c} \\ \mathbf{b} + B \mathbf{d} \end{array} \right] + \left[ \begin{array}{c} AC \\ BD \end{array} \right] \mathbf{X}. \end{aligned}$$

As an example, let X, Y  be independent N(0, 1) random variables. Then,

$$\displaystyle \begin{aligned} X + Y \mbox{ and } X - Y \mbox{ are independent}. \end{aligned}$$

Indeed, these random variables are jointly Gaussian by Theorem 8.4. Also, they are uncorrelated since

$$\displaystyle \begin{aligned} E((X + Y)(X - Y)) - E(X+Y)E(X - Y) = E(X^2 - Y^2) = 0. \end{aligned}$$

Hence, they are independent by Theorem 8.3.
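A quick simulation sketch of this example: the sample correlation of X + Y and X − Y is close to zero and, since the pair is jointly Gaussian, this is consistent with independence.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=100_000)
Y = rng.normal(size=100_000)
print(np.corrcoef(X + Y, X - Y)[0, 1])   # close to 0
```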

We devote the remainder of this section to the derivation of (8.5). We explain in Theorem B.13 how to calculate the p.d.f. of A X + b from the density of X. We recall the result here for convenience:

$$\displaystyle \begin{aligned} f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{|A|} f_{\mathbf{X}}(\mathbf{x}) \mbox{ where } A \mathbf{x} + \mathbf{b} = \mathbf{y}. \end{aligned} $$
(8.6)

Let us apply (8.6) to the case where X is a vector of n i.i.d. N(0, 1) random variables. In this case,

$$\displaystyle \begin{aligned} & f_{\mathbf{X}}(\mathbf{x}) = \varPi_{i = 1}^n f_{X_i}(x_i) = \varPi_{i=1}^n \frac{1}{\sqrt{2 \pi}} \exp\left\{- \frac{x_i^2}{2}\right\} \\ &~~~~~~~~~ = \frac{1}{( 2 \pi)^{n/2}} \exp\left\{- \frac{||\mathbf{x}||{}^2}{2}\right\}. \end{aligned} $$

Then, (8.6) gives

$$\displaystyle \begin{aligned} f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{|A|} \frac{1}{( 2 \pi)^{n/2}} \exp\left\{- \frac{||\mathbf{x}||{}^2}{2}\right\}, \end{aligned}$$

where A x + μ Y = y. Thus,

$$\displaystyle \begin{aligned} \mathbf{x} = A^{-1}(\mathbf{y} - \mu_{\mathbf{Y}}) \end{aligned}$$

and

$$\displaystyle \begin{aligned} ||\mathbf{x}||{}^2 = ||A^{-1}(\mathbf{y} - \mu_{\mathbf{Y}})||{}^2 = (\mathbf{y} - \mu_{\mathbf{Y}})' (A^{-1})' A^{-1}(\mathbf{y} - \mu_{\mathbf{Y}}), \end{aligned}$$

where we used the facts that ||z||2 = z′z and (M v)′ = v′M′.

Recall the definition of the covariance matrix:

$$\displaystyle \begin{aligned} \varSigma_{\mathbf{Y}} = E((\mathbf{Y} - E(\mathbf{Y}))(\mathbf{Y} - E(\mathbf{Y}))'). \end{aligned}$$

Since Y = A X + μ Y and Σ X = I, the identity matrix, we see that

$$\displaystyle \begin{aligned} \varSigma_{\mathbf{Y}} = A \varSigma_{\mathbf{X}} A' = AA'. \end{aligned}$$

In particular,

$$\displaystyle \begin{aligned} |\varSigma_{\mathbf{Y}}| = |A|{}^2. \end{aligned}$$

Hence, we find that

$$\displaystyle \begin{aligned} f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{\sqrt{|\varSigma_{\mathbf{Y}}|} (2 \pi )^{n/2}} \exp\left\{- \frac{1}{2} (\mathbf{y} - \mu_{\mathbf{Y}})' \varSigma^{-1}_{\mathbf{Y}} (\mathbf{y} - \mu_{\mathbf{Y}}) \right\}. \end{aligned}$$

This is precisely (8.5).

8.4 Elementary Statistics

This section explains some basic statistical tests that are at the core of “data science.”

8.4.1 Zero-Mean?

Consider the following hypothesis testing problem. The random variable Y  is \(\mathcal {N}(\mu , 1)\). We want to decide between two hypotheses:

$$\displaystyle \begin{aligned} & H_0: \mu = 0. {} \end{aligned} $$
(8.7)
$$\displaystyle \begin{aligned} & H_1: \mu \neq 0. {} \end{aligned} $$
(8.8)

We know that P[|Y | > 2∣H 0] ≈ 5%. That is, if we reject H 0 when |Y | > 2, the probability of “false alarm,” i.e., of rejecting the hypothesis when it is correct, is 5%. This is what all the tests that we will discuss in this chapter do. However, there are many tests that achieve the same false alarm probability. For instance, we could reject H 0 when Y > 1.64, and the probability of false alarm would also be 5%. Or, we could reject H 0 when Y  is in the interval [1, 1.23]. The probability of that event under H 0 is also about 5%.
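These tail probabilities are easy to check numerically, for instance with scipy:

```python
from scipy.stats import norm

print(2 * (1 - norm.cdf(2)))          # P(|Y| > 2)        ~ 0.046
print(1 - norm.cdf(1.64))             # P(Y > 1.64)       ~ 0.051
print(norm.cdf(1.23) - norm.cdf(1))   # P(1 <= Y <= 1.23) ~ 0.049
```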

Thus, there are many tests that reject H 0 with a probability of false alarm equal to 5%. Intuitively, we feel that the first one—rejecting H 0 when |Y | > 2—is more sensible than the others. This intuition probably comes from the idea that the alternative hypothesis H 1 : μ ≠ 0 appears to be a symmetric assumption about the likely values of μ. That is, we do not have a reason to believe that under H 1 the mean μ is more likely to be positive than negative. We just know that it is nonzero. Given this symmetry, it is intuitively reasonable that the test should be symmetric. However, there are many symmetric tests! So, we need a more careful justification.

To justify the test |Y | > 2, we note the following simple result.

Theorem 8.5

Consider the following hypothesis testing problem: Y  is \(\mathcal {N}(\mu , 1)\) and

$$\displaystyle \begin{aligned} & H_0: \mu = 0\\ & H_1: \mu \mathit{\mbox{ has a symmetric distribution about }} 0. \end{aligned} $$

Then, the Neyman–Pearson test with probability of false alarm 5% is to reject H 0 when |Y | > 2. \({\blacksquare }\)

Proof

We know that the Neyman–Pearson test is a likelihood ratio test. Thus, it suffices to show that the likelihood ratio is increasing in |Y |. Assume that the density of μ under H 1 is h(x). (The same argument goes through if μ is a mixed random variable.) Then the pdf f 1(y) of Y  under H 1 is as follows:

$$\displaystyle \begin{aligned} f_1(y) = \int h(x)f(y - x)dx, \end{aligned}$$

where \(f(x) = (1/\sqrt {2 \pi }) \exp \{ - 0.5 x^2\}\) is the pdf of a \(\mathcal {N}(0,1)\) random variable. Consequently, the likelihood ratio L(y) of Y  is given by

$$\displaystyle \begin{aligned} L(y) &= \frac{f_1(y)}{f(y)} = \int h(x) \frac{f(y-x)}{f(y)} dx = \int h(x) \exp\{xy\} \exp\left\{- \frac{x^2}{2} \right\} dx \\ &= 0.5 \int [h(x) + h(-x)] \exp\{xy\} \exp\left\{- \frac{x^2}{2} \right\} dx \\ &= 0.5 \int h(x) [\exp\{xy\} + \exp\{-xy\}] \exp\left\{- \frac{x^2}{2} \right\} dx, \end{aligned} $$

where the fourth identity comes from h(x) = 0.5h(x) + 0.5h(−x), since h(x) = h(−x), and the last identity follows from the change of variables x →−x in the term involving h(−x). This expression shows that L(y) = L(−y). Also,

$$\displaystyle \begin{aligned} L'(y) = 0.5 \int h(x) x [\exp\{xy\} - \exp\{-xy\}] \exp\left\{- \frac{x^2}{2} \right\} dx = \int_0^\infty h(x) x [\exp\{xy\} - \exp\{-xy\}] \exp\left\{- \frac{x^2}{2} \right\} dx, \end{aligned}$$

by symmetry of the integrand. For y > 0 and x > 0, we see that the last integrand is positive, so that L′(y) > 0 for y > 0.

Hence, L(y) is symmetric and increasing in y > 0, so that it is an increasing function of |y|, which completes the proof. □

As a simple application, say that you buy 100 light bulbs from brand A and 100 from brand B. You want to test whether they have the same mean lifetime. You measure the lifetimes \(\{X^A_1, \ldots , X^A_{100}\}\) and \(\{X^B_1, \ldots , X^B_{100}\}\) of the bulbs of the two batches and you calculate

$$\displaystyle \begin{aligned} Y = \frac{(X^A_1 + \cdots + X^A_{100}) - (X^B_1 + \cdots + X^B_{100})}{\sigma\sqrt{N}}, \end{aligned}$$

where N = 100 and σ is the standard deviation of \(X^A_n + X^B_n\), which we assume to be known.

By the CLT, it is reasonable to approximate Y  by a \(\mathcal {N}(0,1)\) random variable. Thus, we reject the hypothesis that the bulbs of the two brands have the same average lifetime if |Y | > 2.
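Here is a sketch of this comparison with simulated (hypothetical) lifetimes; as in the text, the standard deviation σ of \(X^A_n + X^B_n\) is assumed known.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
sigma_A, sigma_B = 200.0, 150.0            # hypothetical lifetime std devs (hours)
XA = rng.normal(1000.0, sigma_A, N)        # brand A lifetimes, mean 1000 h
XB = rng.normal(1000.0, sigma_B, N)        # brand B lifetimes, same mean

sigma = np.sqrt(sigma_A**2 + sigma_B**2)   # std of X^A_n + X^B_n (assumed known)
Y = (XA.sum() - XB.sum()) / (sigma * np.sqrt(N))
print("Y =", Y, "reject same-mean hypothesis:", abs(Y) > 2)
```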

Of course, assuming that σ is known is not realistic. The next test is then more practical.

8.4.2 Unknown Variance

A practically important variation of the previous example is when the variance σ 2 is not known. In that case, the Neyman–Pearson test is to decide H 1 when

$$\displaystyle \begin{aligned} \frac{|\hat \mu|}{\hat \sigma} > \lambda, \end{aligned}$$

where \(\hat \mu \) is the sample mean of the Y m, as before,

$$\displaystyle \begin{aligned} \hat \sigma^2 = \frac{1}{n-1} \sum_{m=1}^n (Y_m - \hat \mu)^2 \end{aligned}$$

is the sample variance, and λ is such that \(P( |t_{n-1}| / \sqrt {n} > \lambda ) = \beta \).

Here, t n−1 is a random variable with a t distribution with n − 1 degrees of freedom. By definition, this means that

$$\displaystyle \begin{aligned} t_{n-1} = \frac{\mathcal{N}(0, 1)}{\sqrt{\chi^2_{n-1}/(n-1)}}, \end{aligned}$$

where \(\chi ^2_{n-1}\) is the sum of the squares of n − 1 i.i.d. \(\mathcal {N}(0,1)\) random variables.

Thus, this t-test is very similar to the previous one, except that one replaces the standard deviation σ by its estimate \(\hat \sigma \) and the threshold λ is adjusted (increased) to reflect the uncertainty in σ. Statistical packages provide routines to calculate the appropriate value of λ. (See scipy.stats.t for Python.)
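Here is a sketch of this test on hypothetical data, using scipy.stats.t to compute the threshold λ for a 5% false alarm probability; the normalization follows from the sample-variance definition above.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
beta = 0.05
n = 25
Y = rng.normal(0.3, 2.0, n)               # hypothetical data with unknown variance

mu_hat = Y.mean()
sigma_hat = Y.std(ddof=1)                 # sample standard deviation (1/(n-1) version)
lam = t.ppf(1 - beta / 2, df=n - 1) / np.sqrt(n)   # so that P(|t_{n-1}|/sqrt(n) > lam) = beta
print("reject H0:", abs(mu_hat) / sigma_hat > lam)
```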

Figure 8.3 explains the result. The rotation symmetry of Z implies that we can assume that V = Z 1 and that W = (0, Z 2, …, Z n). As in the previous examples, one uses the symmetry assumption under H 1 to prove that the likelihood ratio is monotone in \(\hat \mu / \hat \sigma \).

Fig. 8.3 The projection error

Coming back to our light bulb example, what should we do if we have different numbers of bulbs of the two brands? The next test covers that situation.

8.4.3 Difference of Means

You observe {X n, n = 1, …, n 1} and {Y n, n = 1, …, n 2}. Assume that these random variables are all independent and that the X n are \(\mathcal {N}(\mu _1, 1)\) and the Y n are \(\mathcal {N}(\mu _2, 1)\). We want to test whether μ 1 = μ 2.

Define

$$\displaystyle \begin{aligned} Z = \frac{1}{\sqrt{n_1^{-1} + n_2^{-1}}} \left( \frac{X_1 + \cdots + X_{n_1}}{n_1} - \frac{Y_1 + \cdots + Y_{n_2}}{n_2} \right). \end{aligned}$$

Then \(Z = \mathcal {N}(\mu , 1)\) where \(\mu = (\mu _1 - \mu _2)/\sqrt {n_1^{-1} + n_2^{-1}}\). Testing μ 1 = μ 2 is then equivalent to testing μ = 0. A sensible decision is then to reject the hypothesis that μ 1 = μ 2 if |Z| > 2.

In practice, if n 1 and n 2 are not too small, one can invoke the Central Limit Theorem to justify the same test even when the random variables are not Gaussian. That is typically how this test is used. Also, when the random variables have nonzero means and unknown variances, one then renormalizes them by subtracting their sample mean and dividing by the sample standard deviation.
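A sketch of this test with unequal (hypothetical) sample sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2 = 100, 150
X = rng.normal(10.0, 1.0, n1)             # N(mu_1, 1) samples
Y = rng.normal(10.2, 1.0, n2)             # N(mu_2, 1) samples

Z = (X.mean() - Y.mean()) / np.sqrt(1 / n1 + 1 / n2)
print("Z =", Z, "reject mu_1 = mu_2:", abs(Z) > 2)
```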

Needless to say, some care must be taken. It is not difficult to find distributions for which this test does not perform well. This fact helps explain why many poorly conducted statistical studies regularly contradict one another. Many publications decry this fallacy of the p-value. The p-value is the probability, under the null hypothesis, of observing a value of the test statistic at least as extreme as the one actually measured; rejecting when the p-value is below 5% amounts to a test with a 5% probability of false alarm.

8.4.4 Mean in Hyperplane?

A generalization of the previous example is as follows:

$$\displaystyle \begin{aligned} & H_0: \mathbf{Y} = \mathcal{N}(\mu, \sigma^2 \mathbf{I}), \mu \in \mathcal{L}\\ & H_1: \mathbf{Y} = \mathcal{N}(\mu, \sigma^2 \mathbf{I}), \mu \in \Re^n. \end{aligned} $$

Here, \(\mathcal {L}\) is an m-dimensional subspace in \(\Re ^n\).

Here is the test that has a probability of false alarm (deciding H 1 when H 0 is true) less than β: Decide

$$\displaystyle \begin{aligned} H = H_1 \mbox{ if and only if } \frac{1}{\sigma^2} \|\mathbf{Y} - \hat \mu \|{}^2 > \beta_{n-m}, \end{aligned}$$

where

$$\displaystyle \begin{aligned} & \hat \mu = \arg \min \{ \|\mathbf{Y} - \mathbf{x}\|{}^2 : \mathbf{x} \in \mathcal{L}\} \\ & P(\chi^2_{n-m} > \beta_{n - m}) = \beta. \end{aligned} $$

In this expression, \(\chi ^2_{n-m}\) represents a random variable that has a chi-square distribution with n − m degrees of freedom. This means that it is distributed like the sum of the squares of n − m i.i.d. \(\mathcal {N}(0,1)\) random variables.

Figure 8.4 shows the geometry under H 0, when

$$\displaystyle \begin{aligned} \mathbf{Y} = \mu + \sigma \mathbf{Z} \end{aligned}$$

with Z a vector of i.i.d. \(\mathcal {N}(0,1)\) random variables and \(\mu \in \mathcal {L}\). Now, the distribution of Z is invariant under rotation. Consequently, we can rotate the axes around μ so that the first m axes span \(\mathcal {L}\); in these coordinates, \(\hat \mu = \mu + \sigma (Z_1, \ldots , Z_m, 0, \ldots , 0)\). Thus,

$$\displaystyle \begin{aligned} \mathbf{Y} - \hat \mu = \sigma (0, \ldots, 0, Z_{m+1}, \ldots, Z_n), \end{aligned}$$

so that \(\|\mathbf {Y} - \hat \mu \|{ }^2 = \sigma ^2 (Z_{m+1}^2 + \cdots + Z_n^2)\), which proves the result.

Fig. 8.4 The projection error

As in our simple example, this test has a probability of false alarm equal to β. Here also, one can show that the test maximizes the probability of correct detection subject to that bound on the probability of false alarm if under H 1 one knows that μ has a symmetric pmf around \(\mathcal {L}\). This means that μ = γ i + v i with probability p i∕2 and γ i − v i with probability p i∕2 where \(\gamma _i \in \mathcal {L}\) and v i is orthogonal to \(\mathcal {L}\), for i = 1, …, K. The continuous version of this symmetry should be clear. The verification of this fact is similar to the simple case we discussed above.
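Here is a sketch of this subspace test: the subspace \(\mathcal {L}\) is spanned by the columns of a made-up n × m matrix B, \(\hat \mu \) is the least-squares projection of Y onto \(\mathcal {L}\), and the threshold β n−m comes from the chi-square distribution with n − m degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, m, sigma, beta = 20, 3, 1.5, 0.05

B = rng.normal(size=(n, m))               # columns span the subspace L (hypothetical)
mu = B @ np.array([1.0, -2.0, 0.5])       # H0 true: the mean lies in L
Y = mu + sigma * rng.normal(size=n)

coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
mu_hat = B @ coef                          # projection of Y onto L
stat = np.sum((Y - mu_hat) ** 2) / sigma**2
threshold = chi2.ppf(1 - beta, df=n - m)   # beta_{n-m}
print("statistic:", stat, "reject H0:", stat > threshold)
```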

8.4.5 ANOVA

Our next model is more general and is widely used. In this model, \(\mathbf {Y} = \mathcal {N}(A \gamma , \sigma ^2 \mathbf {I})\). We would like to test whether Mγ = 0, which is the H 0 hypothesis. Here, A is an n × k matrix, with k < n. Also, M is a q × k matrix with q < k.

The decision is to reject H 0 if F > F 0 where

$$\displaystyle \begin{aligned} & F = \frac{\|\mathbf{Y} - \mu_0\|{}^2 - \|\mathbf{Y} - \mu_1\|{}^2}{\|\mathbf{Y} - \mu_1\|{}^2} \times \frac{n-k}{q} \\ & \mu_0 = \arg \min_\mu \{ \| \mathbf{Y} - \mu\|{}^2 : \mu = A \gamma, M \gamma = 0\} \\ & \mu_1 = \arg \min_\mu \{ \|\mathbf{Y} - \mu\|{}^2 : \mu = A \gamma \} \\ & \beta = P\left(\frac{\chi^2_q /q}{\chi^2_{n-k} /(n - k)} > F_0\right). \end{aligned} $$

In the last expression, the ratio of two independent χ 2 random variables, each divided by its number of degrees of freedom, is said to have an F distribution, in honor of Sir Ronald A. Fisher, who introduced this F-test in 1920.

This test has a probability of false alarm equal to β, as Fig. 8.5 shows. This figure represents the situation under H 0, when Y = μ 0 + σ Z and shows that F is the ratio of two χ 2 random variables, so that it has an F distribution.

Fig. 8.5 The F-test. The figure shows that F is the ratio of two independent chi-square random variables

As in the previous examples, the optimality of the test in terms of probability of correct detection requires some symmetry assumptions of μ under H 1.
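Here is a sketch of this F-test on made-up A, M, and data generated under H 0; the constrained minimization over {μ = Aγ : Mγ = 0} is handled by parametrizing that set with a null-space basis C of M obtained from an SVD.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(7)
n, k, q, sigma, beta = 30, 4, 2, 1.0, 0.05

A = rng.normal(size=(n, k))                # design matrix (hypothetical)
M = rng.normal(size=(q, k))                # constraint matrix (hypothetical)
C = np.linalg.svd(M)[2][q:].T              # columns span {gamma : M gamma = 0}

gamma = C @ rng.normal(size=k - q)         # true gamma satisfies M gamma = 0 (H0 holds)
Y = A @ gamma + sigma * rng.normal(size=n)

def residual(Y, B):
    """Squared distance from Y to the column space of B (least squares)."""
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
    return np.sum((Y - B @ coef) ** 2)

r1 = residual(Y, A)        # unconstrained fit:  mu = A gamma
r0 = residual(Y, A @ C)    # constrained fit:    mu = A gamma with M gamma = 0

F = (r0 - r1) / r1 * (n - k) / q
F0 = f.ppf(1 - beta, dfn=q, dfd=n - k)
print("F =", F, "reject H0 (M gamma = 0):", F > F0)
```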

8.5 LDPC Codes

Low Density Parity Check (LDPC) codes are among the most efficient codes used in practice. Gallager invented these codes in his 1960 thesis (Gallager 1963, Fig. 8.6). These codes are used extensively today, for instance, in satellite video transmissions. They are almost optimal for BSC channels and also for many other channels.

Fig. 8.6 Robert G. Gallager, b. 1931

The LDPC codes are as follows. Let x ∈ {0, 1}n be an n-bit string to be transmitted. One augments this string with the m-bit string y where

$$\displaystyle \begin{aligned} \mathbf{y} = H \mathbf{x}. \end{aligned} $$
(8.9)

Here, H is an m × n matrix with entries in {0, 1}, one views x and y as column vectors and the operations are addition modulo 2. For instance, if

$$\displaystyle \begin{aligned} H = \left[ \begin{array}{c c c c c c c c} 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1 \end{array} \right] \end{aligned}$$

and x = [01001010], then y = [1110]. This calculation of the parity check bits y from x is illustrated by the graph, called a Tanner graph, shown in Fig. 8.7.

Fig. 8.7 Tanner graph representation of the LDPC code. The graph shows the nonzero entries of H, so that y = H x. The receiver gets \(\tilde {\mathbf {x}}\) and \(\tilde {\mathbf {y}}\) instead of x and y. The nodes x j are called message nodes and the nodes y i are called check nodes
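A short sketch of this parity-check computation with numpy, reproducing the example above:

```python
import numpy as np

H = np.array([[1, 0, 1, 1, 1, 0, 0, 0],
              [0, 1, 0, 1, 1, 0, 1, 0],
              [1, 1, 0, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1, 1]])
x = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y = H @ x % 2          # parity check bits, addition modulo 2
print(y)               # [1 1 1 0]
```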

Thus, instead of simply sending the bit string x, one sends both x and y. The bits in y are parity check bits. Because of possible transmission errors, the receiver may get \(\tilde {\mathbf {x}}\) and \(\tilde {\mathbf {y}}\) instead of x and y. The receiver computes \(H \tilde {\mathbf {x}}\) and compares the result with \(\tilde {\mathbf {y}}\). The idea is that if \(\tilde {\mathbf {y}} = H \tilde {\mathbf {x}}\), then it is likely that \(\tilde {\mathbf {x}} = \mathbf {x}\) and \(\tilde {\mathbf {y}} = \mathbf {y}\). In other words, it is unlikely that errors would have corrupted x and y in a way that these vectors would still satisfy the relation \(\tilde {\mathbf {y}} = H \tilde {\mathbf {x}}\). Thus, one expects the scheme to be good at detecting errors, at least if the matrix H is well chosen.

In addition to detecting errors, the LDPC code is used for error correction. If \(\tilde {\mathbf {y}} \neq H \tilde {\mathbf {x}}\), one tries to find the least number of components of \(\tilde {\mathbf {x}}\) and \(\tilde {\mathbf {y}}\) that can be changed to satisfy the equations. These would be the most likely transmission errors, if we assume that bit errors are i.i.d. and have a very small probability. However, searching over the possible combinations of components to change is exponentially hard. Instead, one uses iterative algorithms that approximate the solution.

We illustrate a commonly used decoding algorithm, called belief propagation (BP). We assume that each received bit is erroneous with probability 𝜖 ≪ 1 and correct with probability \(\bar \epsilon = 1 - \epsilon \), independently of the other bits. We also assume that the transmitted bits x j are equally likely to be 0 or 1. This implies that the parity check bits y i are also equally likely to be 0 or 1, by symmetry. In this algorithm, the message nodes x j and the check nodes y i exchange beliefs along the links of the graph of Fig. 8.7 about the probability that the x j are equal to 1.

In steps 1, 3, 5, … of the algorithm, each node x j sends to each node y i to which it is attached an estimate of P(x j = 1). Each node y i then combines these estimates to send back new estimates to each x j about P(x j = 1). Here is the calculation that the y nodes perform. Consider a situation shown in Fig. 8.8 where node y 1 gets the estimates a = P(x 1 = 1), b = P(x 2 = 1), c = P(x 3 = 1). Assume also that \(\tilde y_1 = 1\), from which node y 1 calculates \(P[y_1 = 1 | \tilde y_1 ] = 1 - \epsilon = \bar \epsilon \), by Bayes’ rule. Since the graph shows that x 1 + x 2 + x 3 = y 1, node y 1 estimates the probability that x 1 = 1 as the probability that an odd number of bits among {x 2, x 3, y 1} are equal to one (Fig. 8.9).

Fig. 8.8 Node y 1 gets estimates from x nodes and calculates new estimates

Fig. 8.9 Each node j is equal to one w.p. p j and to zero otherwise, independently of the other nodes. The probability that an odd number of nodes are one is given in the figure

To see how to do the calculation, assume that x 1, …, x n are independent {0, 1}-random variables with p i = P(x i = 1). Note that

$$\displaystyle \begin{aligned} 1 - (1 - 2x_1) \times \cdots \times (1 - 2 x_n) \end{aligned}$$

is equal to zero if the number of variables that are equal to one among {x 1, …, x n} is even and is equal to two if it is odd. Thus, taking expectation,

$$\displaystyle \begin{aligned} 2P(\mbox{odd}) = 1 - \varPi_{i=1}^n (1 - 2p_i), \end{aligned}$$

so that

$$\displaystyle \begin{aligned} P(\mbox{odd}) = \frac{1}{2} - \frac{1}{2}\varPi_{i=1}^n (1 - 2p_i). \end{aligned} $$
(8.10)

Thus, in Fig. 8.8, one finds that

$$\displaystyle \begin{aligned} & P(x_1 = 1) = P(\mbox{odd among } x_2, x_3, y_1 ) \\ &~~~ = \frac{1}{2} - \frac{1}{2} (1 - 2b)(1 - 2c)(1 -2 \bar \epsilon). {} \end{aligned} $$
(8.11)

The y-nodes in Fig. 8.7 use that procedure to calculate new estimates and send them to the x-nodes.
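As a quick sanity check of (8.10), here is a sketch that compares the formula with brute-force enumeration for some made-up probabilities:

```python
import numpy as np
from itertools import product

p = [0.1, 0.3, 0.25]                       # hypothetical values of P(x_i = 1)

# Formula (8.10).
formula = 0.5 - 0.5 * np.prod([1 - 2 * pi for pi in p])

# Brute force: sum the probabilities of all outcomes with an odd number of ones.
brute = sum(np.prod([pi if b else 1 - pi for pi, b in zip(p, bits)])
            for bits in product([0, 1], repeat=len(p)) if sum(bits) % 2 == 1)
print(formula, brute)                      # both equal 0.42
```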

In steps 2, 4, 6, … of the algorithm, each node x j combines the estimates of P(x j = 1) it got from \(\tilde x_j\) and from the y-nodes in the previous steps to calculate new estimates. Each node x j assumes that the different estimates it got are derived from independent observations. That is, node x j gets opinions about P(x j = 1) from independent experts, namely \(\tilde x_j\) and the y i to which it is attached in the graph. Node x j merges the opinions of these experts to calculate new estimates.

How should one merge the opinions of independent experts? Say that N experts make independent observations Y 1, …, Y N and provide estimates p i = P[X = 1|Y i]. Assume that the prior probability is that P(X = 1) = P(X = 0) = 1∕2. How should one estimate P[X = 1|p 1, …, p N]? Here is the calculation.

One has

$$\displaystyle \begin{aligned} & P[X = 1 | Y_1, \ldots , Y_N] = \frac{P(X=1, Y_1, \ldots, Y_N)}{P(Y_1, \ldots, Y_N)} \\ &~~~~~~~~ = \frac{P[Y_1, \ldots, Y_N | X = 1] P(X = 1)}{\sum_{x = 0, 1} P[Y_1, \ldots, Y_N | X = x] P(X = x) } \\ &~~~~~~~~ = \frac{P[Y_1 | X=1] \times \cdots \times P[Y_N | X=1]}{\sum_{x = 0, 1}P[Y_1 | X=x] \times \cdots \times P[Y_N | X=x]}. \\ \end{aligned} $$
(8.12)

Now,

$$\displaystyle \begin{aligned} P[Y_n | X=x] = \frac{P(X = x, Y_n)}{P(X=x)} = \frac{P[X = x|Y_n]P(Y_n)}{1/2}. \end{aligned}$$

Thus,

$$\displaystyle \begin{aligned} P[Y_n|X = 0] = 2(1 - p_n)P(Y_n) \mbox{ and } P[Y_n|X = 1] = 2p_nP(Y_n). \end{aligned}$$

Substituting these expressions in (8.12), one finds that

$$\displaystyle \begin{aligned} P[X=1|Y_1, \ldots, Y_N] = \frac{ p_1\cdots p_N}{p_1p_2 \cdots p_N + (1 - p_1) \cdots (1 - p_N)}, \end{aligned} $$
(8.13)

as shown in Fig. 8.10.

Fig. 8.10 Merging the opinion of independent experts about P(X = 1) when the prior is 1∕2
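Formula (8.13) as a small function; the prior P(X = 1) = P(X = 0) = 1∕2 is assumed, as in the derivation above.

```python
import numpy as np

def merge_experts(p):
    """Combine independent experts' estimates p_i = P[X = 1 | Y_i],
    assuming the prior P(X = 1) = P(X = 0) = 1/2 (formula (8.13))."""
    p = np.asarray(p, dtype=float)
    num = np.prod(p)
    return num / (num + np.prod(1 - p))

print(merge_experts([0.8, 0.7]))    # ~ 0.903: two mildly confident experts reinforce each other
```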

Let us apply this rule to the situation shown in Fig. 8.11. In the figure, node x 1 gets an estimate 𝜖 of P(x 1 = 1) from observing \(\tilde x_1 = 0\). It also gets estimates a, b, c from the nodes y 1, y 2, y 3 and node x 1 assumes that these estimates were based on independent observations.

Fig. 8.11 Node x 1 gets estimates of P(x 1 = 1) from y nodes and calculates new estimates

To calculate a new estimate that it will send to node y 1, node x 1 combines the estimates from \(\tilde x_1, y_2\) and y 3. This estimate is

$$\displaystyle \begin{aligned} \frac{\epsilon b c}{\epsilon b c + \bar \epsilon \bar b \bar c}, \end{aligned} $$
(8.14)

where \(\bar b = 1 - b\) and \(\bar c = 1 - c\). In the next step, node x 1 will send that estimate to node y 1. It also calculates estimates for nodes y 2 and y 3.

Summing up, the algorithm is as follows. At each odd step, node x j sends X(i, j) to each node y i. At each even step, node y i sends Y (i, j) to each node x j. One has

$$\displaystyle \begin{aligned} Y(i, j) = \frac{1}{2} - \frac{1}{2} (1 - 2 \epsilon) (1 -2 \tilde y_i)\varPi_{s \in A(i,j) } (1 - 2 X(i, s)), \end{aligned} $$
(8.15)

where A(i, j) = {s ≠ j : H(i, s) = 1} and

$$\displaystyle \begin{aligned} X(i, j) = \frac{N(i, j)}{N(i, j) + D(i, j)}, \end{aligned} $$
(8.16)

where

$$\displaystyle \begin{aligned} N(i, j) = P[x_j = 1 | \tilde x_j ] \varPi_{\{v \neq i \mid H(v, j) = 1\}} Y(v, j) \end{aligned}$$

and

$$\displaystyle \begin{aligned} D(i, j) = P[x_j = 0 | \tilde x_j ] \varPi_{\{v \neq i \mid H(v, j) = 1\}} (1 - Y(v, j) )\end{aligned} $$

with

$$\displaystyle \begin{aligned} P[x_j = 1 | \tilde x_j ] = \epsilon + (1 - 2 \epsilon) \tilde x_j. \end{aligned}$$

Also, node x j can update its probability of being 1 by merging the opinions of the experts as

$$\displaystyle \begin{aligned} X(j) = \frac{ N(j)}{N(j) + D(j)}, \end{aligned} $$
(8.17)

where

$$\displaystyle \begin{aligned} N(j) = P[x_j = 1 | \tilde x_j ] \varPi_{\{v \mid H(v, j) = 1\}} Y(v, j) \end{aligned}$$

and

$$\displaystyle \begin{aligned} D(j) = P[x_j = 0 | \tilde x_j ] \varPi_{\{v \mid H(v, j) = 1\}} (1 - Y(v, j)). \end{aligned}$$

After enough iterations, one makes the detection decisions x j = 1{X(j) ≥ 0.5}.

Figure 8.12 shows the evolution over time of the estimated probabilities that the x j are equal to one. Our code is a direct implementation of the formulas in this section. More sophisticated implementations use sums of logarithms instead of products.
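Here is a minimal sketch of such a direct implementation of (8.15)–(8.17) for the H of the example; the received words below are hypothetical (the example’s x with its fourth bit flipped, and error-free parity bits), not necessarily those used to produce Fig. 8.12.

```python
import numpy as np

H = np.array([[1, 0, 1, 1, 1, 0, 0, 0],
              [0, 1, 0, 1, 1, 0, 1, 0],
              [1, 1, 0, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1, 1]])
eps = 0.1
x_tilde = np.array([0, 1, 0, 1, 1, 0, 1, 0])   # hypothetical received message bits (bit 4 flipped)
y_tilde = np.array([1, 1, 1, 0])               # hypothetical received parity bits (no errors)

m, n = H.shape
p1 = eps + (1 - 2 * eps) * x_tilde             # P[x_j = 1 | x_tilde_j]
p0 = 1 - p1                                    # P[x_j = 0 | x_tilde_j]

# X[i, j]: belief about P(x_j = 1) sent from message node x_j to check node y_i.
X = np.tile(p1, (m, 1)).astype(float)
Y = np.zeros((m, n))

for step in range(10):
    # Check-node update (8.15).
    for i in range(m):
        for j in range(n):
            if H[i, j]:
                others = [s for s in range(n) if s != j and H[i, s]]
                prod = np.prod([1 - 2 * X[i, s] for s in others])
                Y[i, j] = 0.5 - 0.5 * (1 - 2 * eps) * (1 - 2 * y_tilde[i]) * prod
    # Message-node update (8.16).
    for j in range(n):
        for i in range(m):
            if H[i, j]:
                others = [v for v in range(m) if v != i and H[v, j]]
                N = p1[j] * np.prod([Y[v, j] for v in others])
                D = p0[j] * np.prod([1 - Y[v, j] for v in others])
                X[i, j] = N / (N + D)

# Final beliefs (8.17) and decisions 1{X(j) >= 0.5}.
belief = np.zeros(n)
for j in range(n):
    checks = [v for v in range(m) if H[v, j]]
    N = p1[j] * np.prod([Y[v, j] for v in checks])
    D = p0[j] * np.prod([1 - Y[v, j] for v in checks])
    belief[j] = N / (N + D)
print(np.round(belief, 3), (belief >= 0.5).astype(int))
```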

Fig. 8.12 Belief propagation applied to the example of Fig. 8.7. The horizontal axis is the step of the algorithm. The vertical axis is the best guess for each x(i) at that step. For clarity, we separated the guesses by 0.1. The final detection is [0, 1, 0, 0, 1, 0, 1, 0], which is intuitively the best guess

Simulations, and a deep theory, show that this algorithm performs well if the graph does not have small cycles. In such a case, the assumption that the estimates are obtained from independent observations is almost correct.

8.6 Summary

  • LDPC Codes;

  • Jointly Gaussian Random Variables, independent if uncorrelated;

  • Proof of Neyman–Pearson Theorem;

  • Testing properties of the mean.

8.6.1 Key Equations and Formulas

  • LDPC: y = H x (8.9)

  • P(odd): P(∑ j X j is odd) = 0.5 − 0.5Π j(1 − 2p j) (8.10)

  • Fusion of experts: \(P[X=1|Y_1, \ldots , Y_n] = \varPi _j p_j /( \varPi _j p_j + \varPi _j \bar p_j)\) (8.13)

  • Jointly Gaussian: N(μ, Σ) ⇔ f X = … (8.4)

  • If X, Y  are jointly Gaussian and uncorrelated, then X, Y  are independent (Theorem 8.3)

8.7 References

The book (Richardson and Urbanke 2008) is a comprehensive reference on LDPC codes and iterative decoding techniques.

8.8 Problems

Problem 8.1

Construct two Gaussian random variables that are not jointly Gaussian. Hint: Let \(X =_D \mathcal {N}(0,1)\) and Z be independent random variables with P(Z = 1) = P(Z = −1) = 1∕2. Define Y = XZ. Show that X and Y  meet the requirements of the problem.

Problem 8.2

Assume that \(X =_D (Y + Z)/\sqrt {2}\) where Y  and Z are independent and distributed like X. Show that \(X =_D \mathcal {N}(0, \sigma ^2)\) for some σ 2 ≥ 0. Hint: First show that E(X) = 0. Second, show by induction that \(X =_D (V_1 + \cdots + V_{m})/\sqrt {m}\) for \(m = 2^n\), where the V i are i.i.d. and distributed like X. Conclude using the CLT.

Problem 8.3

Consider Problem 7.8 but assume now that \(\mathbf {Z} =_D \mathcal {N}(\mathbf {0}, \varSigma )\) where

$$\displaystyle \begin{aligned} \varSigma = \left[ \begin{array}{c c} 0.2 & 0.1 \\ 0.1 & 0.3 \end{array} \right]. \end{aligned}$$

The symbols are equally likely and the receiver uses the MLE. Simulate the system using Python to estimate the fraction of errors.