Abstract
Chapter 7 explained the detection and hypothesis testing problems, Huffman codes, and the situation where errors are independent and Gaussian. In this chapter, we prove the optimality of the Huffman code in Sect. 8.1 and the Neyman–Pearson Theorem in Sect. 8.2. Section 8.3 discusses the theory of jointly Gaussian random variables that is used to analyze the modulation schemes of Sect. 7.5. Section 8.4 uses the results on jointly Gaussian random variables to explain hypothesis tests that arise when analyzing data. That section discusses the chi-squared test and the F-test. Section 8.5 is devoted to the LDPC codes that are widely used in high-speed communication links. These codes augment a group of bits to be transmitted over a noisy channel with additional bits computed from those in the group. When the received bits are not consistent with the added parity check bits, the receiver attempts to determine which bits are most likely to have been corrupted by noise.
Topics: Optimality of Huffman Codes, LDPC Codes, Proof of Neyman–Pearson Theorem, Jointly Gaussian RVs, Statistical Tests, ANOVA
8.1 Proof of Optimality of the Huffman Code
We stated the following result in Chap. 7. Here, we provide a proof.
Theorem 8.1 (Optimality of Huffman Code)
The Huffman code has the smallest average number of bits per symbol among all prefix-free codes (Fig. 8.1). \({\blacksquare }\)
Proof
The argument in Huffman (1952) is by induction on the number of symbols. Assume that the Huffman code has an average path length L(n) that is minimum for n symbols and that there is some other tree T with a smaller average path length A(n + 1) than the Huffman code for n + 1 symbols. Let X and Y be the two least frequent symbols and x ≥ y their frequencies. We can pick these symbols in T so that their path lengths are maximum and such that Y has the largest path length in T. Otherwise, we could swap Y in T with a more frequent symbol and reduce the average path length. Accept for now the claim that we can also pick X and Y so that they are siblings in T. By merging X and Y into their parent Z with frequency z = x + y, we have constructed a code for n symbols with average path length A(n + 1) − z. Hence, L(n) ≤ A(n + 1) − z. Now, the Huffman code for n + 1 symbols would merge X and Y also, so that its average path length is L(n + 1) := L(n) + z. Thus, L(n + 1) ≤ A(n + 1), which contradicts the assumption that the Huffman code is not optimal for n + 1 symbols. It remains to prove the claim about X and Y being siblings. First note that, since Y has the maximum path length, it cannot be an only child; otherwise, we could replace its parent by Y and reduce the average path length. Say that Y has a sibling V other than X. By swapping V and X, one does not increase the average path length, since the frequency of V is not smaller than that of X. This concludes the proof. □
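To make the merging step concrete, here is a minimal Python sketch (with illustrative frequencies, not taken from the text) that builds a Huffman code by repeatedly merging the two least frequent subtrees, exactly as in the proof, and reports the average number of bits per symbol.

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code for a dict {symbol: frequency}.

    Repeatedly merges the two least frequent subtrees and returns
    a dict {symbol: codeword}.
    """
    # Each heap entry is (frequency, tie_breaker, {symbol: partial codeword}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # least frequent subtree
        f2, _, c2 = heapq.heappop(heap)  # second least frequent subtree
        # Prepend one bit to distinguish the two merged subtrees.
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# Illustrative frequencies.
freqs = {"A": 0.45, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.05}
code = huffman_code(freqs)
avg_len = sum(freqs[s] * len(w) for s, w in code.items())
print(code, "average bits/symbol =", avg_len)
```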
8.2 Proof of Neyman–Pearson Theorem 7.4
The idea of the proof is to consider any other decision rule that produces an estimate \(\tilde X\) with \(P[\tilde X = 1 | X = 0] \leq \beta \) and to show that
where \(\hat X\) is specified by the theorem. To show this, we note that
Indeed, when L(Y ) − λ > 0, one has \(\hat X = 1 \geq \tilde X\), so that the expression above is indeed nonnegative. Similarly, when L(Y ) − λ < 0, one has \(\hat X = 0 \leq \tilde X\), so that the expression is again nonnegative.
Taking the expected value of this expression given X = 0, we find
Now,
Hence, (8.2) implies that
Observe that, for any function g(Y ), one has
Note that this result continues to hold even for a function g(Y, Z) where Z is a random variable that is independent of X and Y . In particular,
Similarly,
Combining these results with (8.3) gives (8.1).
\({\square }\)
8.3 Jointly Gaussian Random Variables
In many systems, the errors in the different components of the measured vector Y are not independent. A suitable model for this situation is that
where Z = (Z 1, Z 2) is a pair of i.i.d. N(0, 1) random variables and A is some 2 × 2 matrix. The key idea here is that the components of the noise vector A Z will not be independent in general. For instance, if the two rows of A are identical, so are the two components of A Z. Thus, this model allows us to capture a dependency between the errors in the two components. The model also suggests that the dependency comes from the fact that the errors are different linear combinations of the same fundamental sources of noise.
For such a model, how does one compute MLE[X|Y]? We explain in the next section that
where A′ is the transpose of the matrix A, i.e., A′(i, j) = A(j, i) for i, j ∈{1, 2}.
Consequently, the MLE is the value x k of x that minimizes
(For simplicity, we assume that A is invertible.)
That is, we want to find the vector x k such that A −1 x k is the closest to A −1 y.
One way to understand this result is to note that
Thus, if we calculate A −1 Y from the measured vector Y, we find that its components are i.i.d. N(0, 1) for a given value of X. Hence, it is easy to calculate MLE[V|W = w]: it is the closest value to w in the set {A −1 x 1, …, A −1 x 16} of possible values of V. It is then reasonable to expect that we can recover the MLE of X by multiplying the MLE of V = A −1 X by A, i.e., that
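As a concrete illustration of this whitening idea, here is a short Python sketch. It assumes the model Y = x + A Z with a hypothetical matrix A and a small illustrative symbol set standing in for the constellation of Sect. 7.5; the MLE picks the symbol x k whose image under A −1 is closest to A −1 y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixing matrix and a small illustrative symbol set.
A = np.array([[1.0, 0.5],
              [0.2, 1.0]])
symbols = [np.array([a, b]) for a in (-1.0, 1.0) for b in (-1.0, 1.0)]

def mle(y, A, symbols):
    """Return the symbol x_k minimizing ||A^{-1}(y - x_k)||."""
    A_inv = np.linalg.inv(A)
    w = A_inv @ y                                   # whitened observation
    dists = [np.linalg.norm(w - A_inv @ x) for x in symbols]
    return symbols[int(np.argmin(dists))]

# Simulate one transmission: Y = x + A Z with Z i.i.d. N(0, 1).
x = symbols[2]
Y = x + A @ rng.standard_normal(2)
print("sent:", x, "decoded:", mle(Y, A, symbols))
```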
8.3.1 Density of Jointly Gaussian Random Variables
Our goal in this section is to explain (8.4) and more general versions of this result.
We start by stating the main definition and a result that we prove later.
Definition 8.1 (Jointly Gaussian N(μ Y, Σ Y) Random Variables)
The random variables Y = (Y 1, …, Y n)′ are jointly Gaussian with mean μ Y and covariance Σ Y, which we write as Y =D N(μ Y, Σ Y), if
where X is a vector of independent N(0, 1) random variables. ◇
Here is the main result.
Theorem 8.2 (Density of N(μ Y, Σ Y) Random Variables)
Let Y =D N(μ Y, Σ Y). Then
\({\blacksquare }\)
The level curves of this jpdf are ellipses, as sketched in Fig. 8.2.
Note that this joint distribution is determined by the mean and the covariance matrix. In particular, if Y ′ = (V ′, W ′) are jointly Gaussian, then the joint distribution is characterized by the mean and Σ V, Σ W and cov(V, W). We know that if V and W are independent, then they are uncorrelated, i.e., cov(V, W) = 0. Since the joint distribution is characterized by the mean and covariance, we conclude that if they are uncorrelated, they are independent. We note this fact as a theorem.
Theorem 8.3 (Jointly Gaussian RVs Are Independent Iff Uncorrelated)
Let V and W be jointly Gaussian random variables. Then, they are independent if and only if they are uncorrelated. \({\blacksquare }\)
We will use the following result.
Theorem 8.4 (Linear Combinations of JG Are JG)
Let V and W be jointly Gaussian. Then A V + a and B W + b are jointly Gaussian. \({\blacksquare }\)
Proof
By definition, V and W are jointly Gaussian if they are linear functions of i.i.d. N(0, 1) random variables. But then A V + a and B W + b are linear functions of the same i.i.d. N(0, 1) random variables, so that they are jointly Gaussian. More explicitly, there are some i.i.d. \(\mathcal {N}(0, 1)\) random variables X so that
so that
□
As an example, let X, Y be independent N(0, 1) random variables. Then,
Indeed, these random variables are jointly Gaussian by Theorem 8.4. Also, they are uncorrelated since
Hence, they are independent by Theorem 8.3.
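A quick numerical check of this example (a sketch using simulated samples): the empirical covariance of X + Y and X − Y is close to zero, and joint probabilities factor approximately, consistent with Theorem 8.3.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)
Y = rng.standard_normal(100_000)

U, V = X + Y, X - Y
print("empirical cov(X+Y, X-Y):", np.cov(U, V)[0, 1])          # close to 0
# A crude independence check: P(U > 0, V > 0) should be close to P(U > 0) P(V > 0).
print(np.mean((U > 0) & (V > 0)), np.mean(U > 0) * np.mean(V > 0))
```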
We devote the remainder of this section to the derivation of (8.5). We explain in Theorem B.13 how to calculate the p.d.f. of A X + b from the density of X. We recall the result here for convenience:
Let us apply (8.6) to the case where X is a vector of n i.i.d. N(0, 1) random variables. In this case,
Then, (8.6) gives
where A x + μ Y = y. Thus,
and
where we used the facts that \(\|z\|^2 = z'z\) and \((Mv)' = v'M'\).
Recall the definition of the covariance matrix:
Since Y = A X + μ Y and Σ X = I, the identity matrix, we see that
In particular,
Hence, we find that
This is precisely (8.5).
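The identity Σ Y = A A′ used in this derivation is easy to verify numerically. The following sketch, with an arbitrary illustrative matrix A, compares the empirical covariance of samples of Y = A X + μ Y with A A′.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])
mu = np.array([1.0, -2.0])

X = rng.standard_normal((2, 200_000))      # i.i.d. N(0, 1) components
Y = A @ X + mu[:, None]                    # Y = A X + mu_Y

print("empirical covariance of Y:\n", np.cov(Y))
print("A A':\n", A @ A.T)
```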
8.4 Elementary Statistics
This section explains some basic statistical tests that are at the core of “data science.”
8.4.1 Zero-Mean?
Consider the following hypothesis testing problem. The random variable Y is \(\mathcal {N}(\mu , 1)\). We want to decide between two hypotheses:
We know that P[|Y | > 2∣H 0] ≈ 5%. That is, if we reject H 0 when |Y | > 2, the probability of "false alarm," i.e., of rejecting the hypothesis when it is correct, is 5%. This is what all the tests that we will discuss in this chapter do. However, there are many tests that achieve the same false alarm probability. For instance, we could reject H 0 when Y > 1.64 and the probability of false alarm would also be 5%. Or, we could reject H 0 when Y is in the interval [1, 1.23]. The probability of that event under H 0 is also about 5%.
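These probabilities are easy to verify numerically; the following sketch uses scipy.stats.norm to evaluate the three rejection regions under H 0.

```python
from scipy.stats import norm

# P(|Y| > 2) under H0, where Y ~ N(0, 1)
print(2 * norm.sf(2))                    # ~ 0.0455, i.e., about 5%
# P(Y > 1.64) under H0
print(norm.sf(1.64))                     # ~ 0.05
# P(1 <= Y <= 1.23) under H0
print(norm.cdf(1.23) - norm.cdf(1.0))    # ~ 0.049
```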
Thus, there are many tests that reject H 0 with a probability of false alarm equal to 5%. Intuitively, we feel that the first one—rejecting H 0 when |Y | > 2—is more sensible than the others. This intuition probably comes from the idea that the alternative hypothesis H 1 : μ ≠ 0 appears to be a symmetric assumption about the likely values of μ. That is, we do not have a reason to believe that under H 1 the mean μ is more likely to be positive than negative. We just know that it is nonzero. Given this symmetry, it is intuitively reasonable that the test should be symmetric. However, there are many symmetric tests! So, we need a more careful justification.
To justify the test |Y | > 2, we note the following simple result.
Theorem 8.5
Consider the following hypothesis testing problem: Y is \(\mathcal {N}(\mu , 1)\) and
Then, the Neyman–Pearson test with probability of false alarm 5% is to reject H 0 when |Y | > 2. \({\blacksquare }\)
Proof
We know that the Neyman–Pearson test is a likelihood ratio test. Thus, it suffices to show that the likelihood ratio is increasing in |Y |. Assume that the density of μ under H 1 is h(x). (The same argument goes through if μ is a mixed random variable.) Then the pdf f 1(y) of Y under H 1 is as follows:
where \(f(y) = (1/\sqrt {2 \pi }) \exp \{ - 0.5 y^2\}\) is the pdf of a \(\mathcal {N}(0,1)\) random variable. Consequently, the likelihood ratio L(y) of Y is given by
where the fourth identity comes from h(x) = 0.5h(x) + 0.5h(−x), since h(x) = h(−x). This expression shows that L(y) = L(−y). Also,
by symmetry of the integrand. For y > 0 and x > 0, we see that the last integrand is positive, so that L′(y) > 0 for y > 0.
Hence, L(y) is symmetric and increasing in y > 0, so that it is an increasing function of |y|, which completes the proof. □
As a simple application, say that you buy 100 light bulbs from brand A and 100 from brand B. You want to test whether they have the same mean lifetime. You measure the lifetimes \(\{X^A_1, \ldots , X^A_{100}\}\) and \(\{X^B_1, \ldots , X^B_{100}\}\) of the bulbs of the two batches and you calculate
where σ is the standard deviation of \(X^A_n + X^B_n\) that we assume to be known.
By the CLT, it is reasonable to approximate Y by a \(\mathcal {N}(0,1)\) random variable. Thus, we reject the hypothesis that the bulbs of the two brands have the same average lifetime if |Y | > 2.
Of course, assuming that σ is known is not realistic. The next test is then more practical.
8.4.2 Unknown Variance
A practically important variation of the previous example is when the variance σ 2 is not known. In that case, the Neyman–Pearson test is to decide H 1 when
where \(\hat \mu \) is the sample mean of the Y m, as before,
is the sample variance, and λ is such that \(P\left(\frac {|t_{n-1}|}{\sqrt {n-1}} > \lambda \right) = \beta \).
Here, t n−1 is a random variable with a t distribution with n − 1 degrees of freedom. By definition, this means that
where \(\chi ^2_{n-1}\) is the sum of the squares of n − 1 i.i.d. \(\mathcal {N}(0,1)\) random variables.
Thus, this t-test is very similar to the previous one, except that one replaces the standard deviation σ by its estimate \(\hat \sigma \) and the threshold λ is adjusted (increased) to reflect the uncertainty in σ. Statistical packages provide routines to calculate the appropriate value of λ. (See scipy.stats.t for Python.)
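As an illustration, the following sketch computes the threshold λ with scipy.stats.t and applies the test. It assumes that the sample variance \(\hat \sigma ^2\) is defined with the 1∕n normalization, so that \(|\hat \mu |/\hat \sigma \) is distributed like \(|t_{n-1}|/\sqrt {n-1}\) under H 0; the data are simulated and purely illustrative.

```python
import numpy as np
from scipy.stats import t

def zero_mean_test(y, beta=0.05):
    """Reject H0: mu = 0 when |mu_hat| / sigma_hat > lambda."""
    n = len(y)
    mu_hat = np.mean(y)
    sigma_hat = np.std(y)                  # 1/n normalization (ddof=0)
    # lambda such that P(|t_{n-1}| / sqrt(n-1) > lambda) = beta
    lam = t.ppf(1 - beta / 2, df=n - 1) / np.sqrt(n - 1)
    return abs(mu_hat) / sigma_hat > lam

rng = np.random.default_rng(0)
print(zero_mean_test(rng.normal(0.0, 2.0, size=30)))   # typically False
print(zero_mean_test(rng.normal(1.5, 2.0, size=30)))   # typically True
```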
Figure 8.3 explains the result. The rotation symmetry of Z implies that we can assume that V = Z 1 and that W = (0, Z 2, …, Z n). As in the previous examples, one uses the symmetry assumption under H 1 to prove that the likelihood ratio is monotone in \(\hat \mu / \hat \sigma \).
Coming back to our lightbulbs example, what should we do if we have different numbers of bulbs of the two brands? The next test covers that situation.
8.4.3 Difference of Means
You observe {X n, n = 1, …, n 1} and {Y n, n = 1, …, n 2}. Assume that these random variables are all independent and that the X n are \(\mathcal {N}(\mu _1, 1)\) and the Y n are \(\mathcal {N}(\mu _2, 1)\). We want to test whether μ 1 = μ 2.
Define
Then \(Z = \mathcal {N}(\mu , 1)\) where μ = μ 1 − μ 2. Testing μ 1 = μ 2 is then equivalent to testing μ = 0. A sensible decision is then to reject the hypothesis that μ 1 = μ 2 if |Z| > 2.
In practice, if n 1 and n 2 are not too small, one can invoke the Central Limit Theorem to justify the same test even when the random variables are not Gaussian. That is typically how this test is used. Also, when the random variables have nonzero means and unknown variances, one then renormalizes them by subtracting their sample mean and dividing by the sample standard deviation.
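Here is a minimal sketch of this two-sample test. It assumes unit-variance observations, as in the model above, and uses one natural normalization, \(Z = (\bar X - \bar Y)/\sqrt {1/n_1 + 1/n_2}\) (which may differ slightly from the definition of Z above), so that Z is N(0, 1) under the hypothesis μ 1 = μ 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def same_mean_test(x, y):
    """Reject the hypothesis mu_1 = mu_2 when |Z| > 2.

    Assumes, as in the model above, that the observations have variance 1,
    so that with this normalization Z is N(0, 1) under the hypothesis.
    """
    n1, n2 = len(x), len(y)
    z = (np.mean(x) - np.mean(y)) / np.sqrt(1.0 / n1 + 1.0 / n2)
    return abs(z) > 2

print(same_mean_test(rng.normal(0.0, 1.0, 80), rng.normal(0.0, 1.0, 120)))  # typically False
print(same_mean_test(rng.normal(0.0, 1.0, 80), rng.normal(0.5, 1.0, 120)))  # typically True
```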
Needless to say, some care must be taken. It is not difficult to find distributions for which this test does not perform well. This fact helps explain why many poorly conducted statistical studies regularly contradict one another. Many publications decry this fallacy of the p-value. The p-value is the name given to the probability of false alarm.
8.4.4 Mean in Hyperplane?
A generalization of the previous example is as follows:
Here, \(\mathcal {L}\) is an m-dimensional subspace in \(\Re ^n\).
Here is the test that has a probability of false alarm (deciding H 1 when H 0 is true) less than β: Decide
where
In this expression, \(\chi ^2_{n-m}\) represents a random variable that has a chi-square distribution with n − m degrees of freedom. This means that it is distributed like the sum of the squares of n − m random variables that are i.i.d. \(\mathcal {N}(0,1)\).
Figure 8.4 shows that
Now, the distribution of Z is invariant under rotation. Consequently, we can rotate the axes around μ so that Z = σ(0, …, 0, Z m+1, …, Z n). Thus,
so that \(\|\mathbf {Y} - \hat \mu \|{ }^2 = \sigma ^2 (Z_{m+1}^2 + \cdots + Z_n^2)\), which proves the result.
As in our simple example, this test has a probability of false alarm equal to β. Here also, one can show that the test maximizes the probability of correct detection subject to that bound on the probability of false alarm if under H 1 one knows that μ has a symmetric pmf around \(\mathcal {L}\). This means that μ = γ i + v i with probability p i∕2 and γ i − v i with probability p i∕2 where \(\gamma _i \in \mathcal {L}\) and v i is orthogonal to \(\mathcal {L}\), for i = 1, …, K. The continuous version of this symmetry should be clear. The verification of this fact is similar to the simple case we discussed above.
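The following sketch illustrates this test. It assumes that σ is known and that the subspace \(\mathcal {L}\) is given as the column span of a matrix B; the estimate \(\hat \mu \) is the orthogonal projection of Y onto \(\mathcal {L}\), computed by least squares. The matrix B and the data are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def mean_in_subspace_test(y, B, sigma, beta=0.05):
    """Reject H0: mu in span(B) when ||y - mu_hat||^2 / sigma^2 exceeds the
    (1 - beta) quantile of chi-squared with n - m degrees of freedom.
    Assumes sigma is known and B has full column rank."""
    n, m = B.shape
    # Orthogonal projection of y onto the column span of B (least squares).
    coeffs, *_ = np.linalg.lstsq(B, y, rcond=None)
    mu_hat = B @ coeffs
    stat = np.sum((y - mu_hat) ** 2) / sigma**2
    return stat > chi2.ppf(1 - beta, df=n - m)

rng = np.random.default_rng(0)
n, m, sigma = 20, 3, 1.5
B = rng.standard_normal((n, m))              # hypothetical subspace basis
mu0 = B @ np.array([1.0, -2.0, 0.5])         # a mean inside the subspace
print(mean_in_subspace_test(mu0 + sigma * rng.standard_normal(n), B, sigma))  # typically False
mu1 = mu0 + 4.0 * rng.standard_normal(n)     # a mean (almost surely) off the subspace
print(mean_in_subspace_test(mu1 + sigma * rng.standard_normal(n), B, sigma))  # typically True
```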
8.4.5 ANOVA
Our next model is more general and is widely used. In this model, \(\mathbf {Y} = \mathcal {N}(A \gamma , \sigma ^2 \mathbf {I})\). We would like to test whether Mγ = 0, which is the H 0 hypothesis. Here, A is an n × k matrix, with k < n. Also, M is a q × k matrix with q < k.
The decision is to reject H 0 if F > F 0 where
In the last expression, the ratio of two χ 2 random variables is said to have an F distribution, named in honor of Sir Ronald A. Fisher, who introduced this F-test in the 1920s.
This test has a probability of false alarm equal to β, as Fig. 8.5 shows. This figure represents the situation under H 0, when Y = μ 0 + σ Z and shows that F is the ratio of two χ 2 random variables, so that it has an F distribution.
As in the previous examples, the optimality of the test in terms of probability of correct detection requires some symmetry assumptions of μ under H 1.
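A common special case of this framework is one-way ANOVA, which tests whether several groups share the same mean. The sketch below uses scipy.stats.f_oneway, which returns the F statistic and the corresponding p-value; one rejects H 0 when the p-value is below β. The group data are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# One-way ANOVA on three illustrative groups; the third has a shifted mean.
group_a = rng.normal(0.0, 1.0, size=40)
group_b = rng.normal(0.0, 1.0, size=40)
group_c = rng.normal(0.8, 1.0, size=40)

F, p_value = f_oneway(group_a, group_b, group_c)
beta = 0.05
print(f"F = {F:.2f}, p-value = {p_value:.4f}, reject H0: {p_value < beta}")
```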
8.5 LDPC Codes
Low Density Parity Check (LDPC) codes are among the most efficient codes used in practice. Gallager invented these codes in his 1960 thesis (Gallager 1963, Fig. 8.6). These codes are used extensively today, for instance, in satellite video transmissions. They are almost optimal for BSC channels and also for many other channels.
The LDPC codes are as follows. Let x ∈ {0, 1}n be an n-bit string to be transmitted. One augments this string with the m-bit string y where
Here, H is an m × n matrix with entries in {0, 1}; one views x and y as column vectors, and the additions are modulo 2. For instance, if
and x = [01001010], then y = [1110]. This calculation of the parity check bits y from x is illustrated by the graph, called a Tanner graph, shown in Fig. 8.7.
Thus, instead of simply sending the bit string x, one sends both x and y. The bits in y are parity check bits. Because of possible transmission errors, the receiver may get \(\tilde {\mathbf {x}}\) and \(\tilde {\mathbf {y}}\) instead of x and y. The receiver computes \(H \tilde {\mathbf {x}}\) and compares the result with \(\tilde {\mathbf {y}}\). The idea is that if \(\tilde {\mathbf {y}} = H \tilde {\mathbf {x}}\), then it is likely that \(\tilde {\mathbf {x}} = \mathbf {x}\) and \(\tilde {\mathbf {y}} = \mathbf {y}\). In other words, it is unlikely that errors would have corrupted x and y in a way that these vectors would still satisfy the relation \(\tilde {\mathbf {y}} = H \tilde {\mathbf {x}}\). Thus, one expects the scheme to be good at detecting errors, at least if the matrix H is well chosen.
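The following sketch illustrates the encoding and error detection steps. The matrix H used here is illustrative (the matrix of Fig. 8.7 is not reproduced), so the parity bits differ from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 4 x 8 parity check matrix (not the one of Fig. 8.7).
H = np.array([[1, 1, 0, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1, 1, 0, 1]])

x = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y = H @ x % 2                         # parity check bits
print("parity bits y:", y)

# Channel: each transmitted bit is flipped independently with probability eps.
eps = 0.05
x_r = (x + (rng.random(x.size) < eps)) % 2
y_r = (y + (rng.random(y.size) < eps)) % 2

# The receiver detects an error whenever H x_r != y_r (mod 2).
print("syndrome:", (H @ x_r + y_r) % 2)   # all zeros <=> checks satisfied
```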
In addition to detecting errors, the LDPC code is used for error correction. If \(\tilde {\mathbf {y}} \neq H \tilde {\mathbf {x}}\), one tries to find the least number of components of \(\tilde {\mathbf {x}}\) and \(\tilde {\mathbf {y}}\) that can be changed to satisfy the equations. These would be the most likely transmission errors, if we assume that bit errors are i.i.d. and have a very small probability. However, searching over the possible combinations of components to change is exponentially hard. Instead, one uses iterative algorithms that approximate the solution.
We illustrate a commonly used decoding algorithm, called belief propagation (BP). We assume that each received bit is erroneous with probability 𝜖 ≪ 1 and correct with probability \(\bar \epsilon = 1 - \epsilon \), independently of the other bits. We also assume that the transmitted bits x j are equally likely to be 0 or 1. This implies that the parity check bits y i are also equally likely to be 0 or 1, by symmetry. In this algorithm, the message nodes x j and the check nodes y i exchange beliefs along the links of the graph of Fig. 8.7 about the probability that the x j are equal to 1.
In steps 1, 3, 5, … of the algorithm, each node x j sends to each node y i to which it is attached an estimate of P(x j = 1). Each node y i then combines these estimates to send back new estimates to each x j about P(x j = 1). Here is the calculation that the y nodes perform. Consider the situation shown in Fig. 8.8, where node y 1 gets the estimates a = P(x 1 = 1), b = P(x 2 = 1), c = P(x 3 = 1). Assume also that \(\tilde y_1 = 1\), from which node y 1 calculates \(P[y_1 = 1 | \tilde y_1 ] = 1 - \epsilon = \bar \epsilon \), by Bayes’ rule. Since the graph shows that x 1 + x 2 + x 3 = y 1, node y 1 estimates the probability that x 1 = 1 as the probability that an odd number of bits among {x 2, x 3, y 1} are equal to one (Fig. 8.9).
To see how to do the calculation, assume that x 1, …, x n are independent {0, 1}-random variables with p i = P(x i = 1). Note that
is equal to zero if the number of variables that are equal to one among {x 1, …, x n} is even and is equal to two if it is odd. Thus, taking expectation,
so that
Thus, in Fig. 8.8, one finds that
The y-nodes in Fig. 8.7 use that procedure to calculate new estimates and send them to the x-nodes.
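In code, the check-node calculation is a direct application of this odd-parity formula; the sketch below uses illustrative values for the incoming estimates.

```python
def prob_odd(probs):
    """P(an odd number of independent bits are 1), given each P(bit_i = 1)."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - 2.0 * p
    return (1.0 - prod) / 2.0

# Check node y_1 of Fig. 8.8: incoming estimates b = P(x_2 = 1), c = P(x_3 = 1)
# and P(y_1 = 1 | observed y_1 = 1) = 1 - eps.  (Illustrative values.)
eps = 0.05
b, c = 0.1, 0.2
print("estimate of P(x_1 = 1):", prob_odd([b, c, 1 - eps]))
```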
In steps 2, 4, 6, … of the algorithm, each node x j combines the estimates of P(x j = 1) that it got from \(\tilde x_j\) and from the y-nodes in the previous steps to calculate new estimates. Each node x j assumes that the different estimates it got are derived from independent observations. That is, node x j gets opinions about P(x j = 1) from independent experts, namely \(\tilde x_j\) and the y i to which it is attached in the graph. Node x j will merge the opinions of these experts to calculate new estimates.
How should one merge the opinions of independent experts? Say that N experts make independent observations Y 1, …, Y N and provide estimates p i = P[X = 1|Y i]. Assume that the prior probabilities are P(X = 1) = P(X = 0) = 1∕2. How should one estimate P[X = 1|p 1, …, p N]? Here is the calculation.
One has
Now,
Thus,
Substituting these expressions in (8.12), one finds that
as shown in Fig. 8.10.
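In code, this merging rule reads as follows; the sketch assumes the uniform prior P(X = 1) = P(X = 0) = 1∕2 used in the derivation.

```python
import math

def merge(probs):
    """Combine opinions p_i = P[X = 1 | Y_i] of independent experts,
    assuming a uniform prior P(X = 1) = P(X = 0) = 1/2."""
    num = math.prod(probs)
    den = num + math.prod(1.0 - p for p in probs)
    return num / den

# Three experts mildly favor X = 1 (illustrative values).
print(merge([0.6, 0.7, 0.55]))   # the combined belief is stronger than any single one
```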
Let us apply this rule to the situation shown in Fig. 8.11. In the figure, node x 1 gets an estimate 𝜖 of P(x 1 = 1) from observing \(\tilde x_1 = 0\). It also gets estimates a, b, c from the nodes y 1, y 2, y 3 and node x 1 assumes that these estimates were based on independent observations.
To calculate a new estimate that it will send to node y 1, node x 1 combines the estimates from \(\tilde x_1, y_2\) and y 3. This estimate is
where \(\bar b = 1 - b\) and \(\bar c = 1 - c\). In the next step, node x 1 will send that estimate to node y 1. It also calculates estimates for nodes y 2 and y 3.
Summing up, the algorithm is as follows. At each odd step, node x j sends X(i, j) to each node y i. At each even step, node y i sends Y (i, j) to each node x j. One has
where A(i, j) = {s ≠ j∣H(i, s) = 1} and
where
and
with
Also, node x j can update its probability of being 1 by merging the opinions of the experts as
where
and
After enough iterations, one makes the detection decisions x j = 1{X(j) ≥ 0.5}.
Figure 8.12 shows the evolution over time of the estimated probabilities that the x j are equal to one. Our code is a direct implementation of the formulas in this section. More sophisticated implementations use sums of logarithms instead of products.
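For concreteness, here is a minimal Python sketch of such a direct implementation (not the code used to produce Fig. 8.12). It combines the two update rules above, reuses the illustrative H of the earlier sketch, and works with products rather than sums of logarithms; all names are illustrative.

```python
import numpy as np

def bp_decode(H, x_r, y_r, eps, iters=20):
    """Minimal belief-propagation sketch for the parity check code y = Hx.

    x_r, y_r: received (possibly corrupted) bits; eps: bit flip probability.
    Returns the estimated probabilities that each x_j equals 1.
    """
    m, n = H.shape
    # Channel beliefs P(bit = 1 | received bit).
    px = np.where(x_r == 1, 1 - eps, eps).astype(float)
    py = np.where(y_r == 1, 1 - eps, eps).astype(float)

    def prob_odd(ps):
        return (1.0 - np.prod(1.0 - 2.0 * np.asarray(ps))) / 2.0

    def merge(ps):
        num = np.prod(ps)
        return num / (num + np.prod(1.0 - np.asarray(ps)))

    # X[i, j]: message from variable x_j to check y_i; start from channel belief.
    X = np.where(H == 1, px[None, :], 0.0)
    Y = np.zeros((m, n))
    for _ in range(iters):
        # Check-to-variable messages: odd-parity probability over the other
        # variables attached to check i, plus the belief about y_i.
        for i in range(m):
            for j in np.flatnonzero(H[i]):
                others = [X[i, s] for s in np.flatnonzero(H[i]) if s != j]
                Y[i, j] = prob_odd(others + [py[i]])
        # Variable-to-check messages: merge the channel belief with the
        # messages from the other checks attached to variable j.
        for j in range(n):
            for i in np.flatnonzero(H[:, j]):
                others = [Y[k, j] for k in np.flatnonzero(H[:, j]) if k != i]
                X[i, j] = merge([px[j]] + others)
    # Final beliefs: merge the channel belief with all check messages.
    return np.array([merge([px[j]] + [Y[i, j] for i in np.flatnonzero(H[:, j])])
                     for j in range(n)])

# Illustrative use with the H, x_r, y_r, eps of the earlier sketch:
# probs = bp_decode(H, x_r, y_r, eps)
# x_hat = (probs >= 0.5).astype(int)
```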
Simulations, and a deep theory, show that this algorithm performs well if the graph does not have small cycles. In such a case, the assumption that the estimates are obtained from independent observations is almost correct.
8.6 Summary
-
LDPC Codes;
-
Jointly Gaussian Random Variables, independent if uncorrelated;
-
Proof of Neyman–Pearson Theorem;
-
Testing properties of the mean.
8.6.1 Key Equations and Formulas
8.7 References
The book (Richardson and Urbanke 2008) is a comprehensive reference on LDPC codes and iterative decoding techniques.
8.8 Problems
Problem 8.1
Construct two Gaussian random variables that are not jointly Gaussian. Hint: Let \(X =_D \mathcal {N}(0,1)\) and Z be independent random variables with P(Z = 1) = P(Z = −1) = 1∕2. Define Y = XZ. Show that X and Y meet the requirements of the problem.
Problem 8.2
Assume that \(X =_D (Y + Z)/\sqrt {2}\) where Y and Z are independent and distributed like X. Show that \(X =_D \mathcal {N}(0, \sigma ^2)\) for some σ 2 ≥ 0. Hint: First show that E(X) = 0. Second, show by induction that \(X =_D (V_1 + \cdots + V_{m})/\sqrt {m}\) for m = 2^n, where the V i are i.i.d. and distributed like X. Conclude using the CLT.
Problem 8.3
Consider Problem 7.8 but assume now that \(\mathbf {Z} =_D \mathcal {N}(\mathbf {0}, \varSigma )\) where
The symbols are equally likely and the receiver uses the MLE. Simulate the system using Python to estimate the fraction of errors.
References
R.G. Gallager, Low Density Parity Check Codes (M.I.T. Press, Cambridge, 1963)
D.A. Huffman, A method for the construction of minimum-redundancy codes. Proceedings of the IRE, pp. 1098–1101 (1952)
T. Richardson, R. Urbanke, Modern Coding Theory (Cambridge University Press, Cambridge, 2008)