Application: Transmitting bits across a physical medium
Topics: MAP, MLE, Hypothesis Testing

7.1 Digital Link

A digital link consists of a transmitter and a receiver. It transmits bits over some physical medium that can be a cable, a phone line, a laser beam, an optical fiber, an electromagnetic wave, or even a sound wave. This contrasts with an analog system that transmits signals without converting them into bits, as in Fig. 7.1.

Fig. 7.1: An analog communication system

An elementary such system consists of a phone line. To send a bit 0, the transmitter applies a voltage of − 1 Volt across its end of the line for T seconds; to send a bit 1, it applies the voltage + 1 Volt for T seconds. The receiver measures the voltage across its end of the line. If the voltage that the receiver measures is negative, it decides that the transmitter must have sent a 0; if it is positive, it decides that the transmitter sent a 1. This system is not error-free. The receiver gets a noisy and attenuated version of what the transmitter sent. Thus, there is a chance that a 0 is mistaken for a 1, and vice versa. Various coding techniques are used to reduce the chances of such errors. Figure 7.2 shows the general structure of a digital link.

Fig. 7.2: Components of a digital link

In this chapter, we explore the operating principles of digital links and their characteristics. We start with a discussion of Bayes’ rule and of detection theory. We apply these ideas to a simple model of a communication link. We then explore a coding scheme that makes the transmissions faster. We conclude the chapter with a discussion of modulation and detection schemes that actual transmission systems, such as ADSL and cable modems, use.

7.2 Detection and Bayes’ Rule

The receiver gets some signal S and tries to guess what the transmitter sent. We explore a general model of this problem and we then apply it to concrete situations.

7.2.1 Bayes’ Rule

The basic formulation is that there are N possible exclusive circumstances C 1, …, C N under which a particular symptom S can occur. By exclusive, we mean that exactly one circumstance occurs (Fig. 7.3). Each circumstance C i has some prior probability p i and q i is the probability that S occurs under circumstance C i. Thus,

$$\displaystyle \begin{aligned} p_i = P(C_i) \mbox{ and } q_i = P[S \mid C_i], \mbox{ for } i = 1, \ldots, N, \end{aligned}$$

where

$$\displaystyle \begin{aligned} p_i \geq 0, q_i \in [0, 1] \mbox{ for } i = 1, \ldots, N \mbox{ and } \sum_{i=1}^N p_i = 1. \end{aligned}$$
Fig. 7.3: The symptom and its possible circumstances. Here, p i = P(C i) and q i = P[S | C i]

The posterior probability π i that circumstance C i is in effect given that S is observed can be computed by using Bayes’ rule as we explain next. One has

$$\displaystyle \begin{aligned} & \pi_i = P[C_i | S] = \frac{P(C_i \mbox{ and } S)}{P(S)} \\ &~~~~ = \frac{P(C_i \mbox{ and } S)}{\sum_{j=1}^N P(C_j \mbox{ and } S)} = \frac{P[S|C_i]P(C_i)}{\sum_{j=1}^N P[S|C_j]P(C_j)} \\ &~~~~ = \frac{p_iq_i}{\sum_{j = 1}^N p_jq_j}. \end{aligned} $$

Given the importance of this result, we state it as a theorem.

Theorem 7.1 (Bayes’ Rule)

One has

$$\displaystyle \begin{aligned} \pi_i = \frac{p_iq_i}{\sum_{j = 1}^N p_jq_j}, i = 1, \ldots, N. \end{aligned} $$
(7.1)

\({\blacksquare }\)

This rule is very simple but is a canonical example of how observations affect our beliefs. It is due to Thomas Bayes (Fig. 7.4).

Fig. 7.4: Thomas Bayes, 1701–1761

7.2.2 Circumstances vs. Causes

In the previous section we were careful to qualify the C i as possible circumstances, not as causes. The distinction is important. Say that you go to a beach, eat an ice cream, and leave with a sunburn. Later, you meet a friend who did not go to the beach, did not eat an ice cream, and did not get sunburned. More generally, the probability that someone got sunburned is larger if that person ate an ice cream. However, it would be silly to qualify the ice cream as the cause of the sunburn.

Unfortunately, confusing correlation and causation is a prevalent mistake.

7.2.3 MAP and MLE

Given the previous model, we see that the most likely circumstance under which the symptom occurs, which we call the Maximum A Posteriori (MAP) estimate of the circumstance given the symptom, is

$$\displaystyle \begin{aligned} MAP = \arg \max_i \pi_i = \arg \max_i p_i q_i. \end{aligned}$$

The notation is that if h(⋅) is a function, then \(\arg \max _x h(x)\) is any value of x that achieves the maximum of h(⋅). Thus, if \(x^* = \arg \max _x h(x)\), then h(x^*) ≥ h(x) for all x.

Thus, the MAP is the most likely circumstance, a posteriori, that is, after having observed the symptom.

Note that if all the prior probabilities are equal, i.e., if p i = 1∕N for all i, then the MAP maximizes q i. In general, the estimate that maximizes q i is called the Maximum Likelihood Estimate (MLE) of the circumstance given the symptom. That is,

$$\displaystyle \begin{aligned} MLE = \arg \max_i q_i. \end{aligned}$$

That is, the MLE is the circumstance that makes the symptom most likely.

More generally, one has the following definitions.

Definition 7.1 (MAP and MLE)

Let (X, Y ) be discrete random variables. Then

$$\displaystyle \begin{aligned} MAP[X|Y=y] = \arg \max_x P(X = x\ \mbox{and}\ Y = y) \end{aligned}$$

and

$$\displaystyle \begin{aligned} MLE[X|Y=y] = \arg \max_x P[Y = y | X = x]. \end{aligned}$$

These definitions extend in the natural way to the continuous case, as we will see later.

7.2.3.1 Example: Ice Cream and Sunburn

As an example, say that on a particular summer day in Berkeley, 500 out of 100,000 people eat ice cream, that 50 of them get sunburned, and that 600 of the 99,500 who do not eat ice cream get sunburned. Then, the MAP of eating ice cream given sunburn is No but the MLE is Yes. Indeed, we see that

$$\displaystyle \begin{aligned} P(sunburn~and~ice~cream) = 50 < P(sunburn~and~no~ice~cream) = 600, \end{aligned}$$

so that among those who have a sunburn, a minority ate ice cream; it is therefore more likely that a sunburned person did not eat ice cream. Hence, the MAP is No. However, the fraction of people who get a sunburn is larger among those who eat ice cream (10%) than among those who do not (about 0.6%). Hence, the MLE is Yes.
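To make the numbers concrete, here is a small Python sanity check of this example; the counts below are those implied by the text (the remaining two cells are 450 and 98,900 people).

```python
# Sanity check of the ice-cream/sunburn example using the counts in the text.
counts = {
    ("ice cream", "sunburn"): 50,
    ("ice cream", "no sunburn"): 450,
    ("no ice cream", "sunburn"): 600,
    ("no ice cream", "no sunburn"): 98900,
}

# MAP[ice cream | sunburn]: compare the joint counts, i.e., P(C_i and S).
joint = {c: counts[(c, "sunburn")] for c in ("ice cream", "no ice cream")}
map_guess = max(joint, key=joint.get)

# MLE[ice cream | sunburn]: compare the conditional frequencies P[S | C_i].
totals = {c: counts[(c, "sunburn")] + counts[(c, "no sunburn")] for c in joint}
likelihood = {c: counts[(c, "sunburn")] / totals[c] for c in joint}
mle_guess = max(likelihood, key=likelihood.get)

print("MAP:", map_guess)   # "no ice cream" (600 > 50)
print("MLE:", mle_guess)   # "ice cream" (10% > 0.6%)
```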

7.2.4 Binary Symmetric Channel

We apply the concepts of MLE and MAP to a simplified model of a communication link. Figure 7.5 illustrates the model, called a binary symmetric channel (BSC).

Fig. 7.5: The binary symmetric channel

In this model, the transmitter sends a 0 or a 1 and the receiver gets the transmitted bit with probability 1 − p, otherwise it gets the opposite bit. Thus, the channel makes an error with probability p. We assume that if the transmitter sends successive bits, the errors are i.i.d.

Note that if p = 0 or p = 1, then one can recover exactly every bit that is sent. Also, if p = 0.5, then the output is independent of the input and no useful information goes through the channel. What happens in the other cases?

Call X ∈{0, 1} the input of the channel and Y ∈{0, 1} its output. Assume that you observe Y = 1 and that P(X = 1) = α, so that P(X = 0) = 1 − α. We have the following result illustrated in Fig. 7.6.

Fig. 7.6: MAP for BSC. Here, α = P(X = 1) and p is the probability of a channel error

Theorem 7.2 (MAP and MLE for BSC)

For the BSC with p < 0.5,

$$\displaystyle \begin{aligned} MAP[X|Y = 0] = 1\{\alpha > 1 - p\}, MAP[X|Y = 1] = 1\{\alpha > p\} \end{aligned}$$

and

$$\displaystyle \begin{aligned} MLE[X|Y] = Y. \end{aligned}$$

\({\blacksquare }\)

To understand the MAP results, consider the case Y = 1. Since p < 0.5, we are inclined to think that X = 1. However, if α is small, this is unlikely. The result is that X = 1 is more likely than X = 0 if α > p, i.e., if the prior is “stronger” than the noise. The case Y = 0 is similar.

Proof

In the terminology of Bayes’ rule, the event Y = 1 is the symptom. Also, the prior probabilities are

$$\displaystyle \begin{aligned} p_0 = 1 - \alpha \mbox{ and } p_1 = \alpha, \end{aligned}$$

and the conditional probabilities are

$$\displaystyle \begin{aligned} q_0 = P[Y = 1 | X = 0] = p \mbox{ and } q_1 = P[Y = 1 | X = 1] = 1 - p. \end{aligned}$$

Hence,

$$\displaystyle \begin{aligned} MAP[X|Y = 1] = \arg \max_{i \in \{0, 1\}} p_i q_i. \end{aligned}$$

Thus,

$$\displaystyle \begin{aligned} MAP[X|Y = 1] = \left\{ \begin{array}{l l} 1, & \mbox{ if } p_1 q_1 = \alpha (1 - p) > p_0 q_0 = (1 - \alpha) p \\ 0, & \mbox{ otherwise}. \end{array} \right. \end{aligned}$$

Hence, MAP[X|Y = 1] = 1{α > p}. That is, when Y = 1, your guess is that X = 1 if the prior that X = 1 is larger than the probability that the channel makes an error.

Also,

$$\displaystyle \begin{aligned} MLE[X|Y =1] = \arg \max_{i \in \{0, 1\}} q_i. \end{aligned}$$

In this case, since p < 0.5, we see that MLE[X|Y = 1] = 1, because Y = 1 is more likely when X = 1 than when X = 0. Thus, the MLE ignores the prior and always guesses that X = 1 when Y = 1, even though the prior probability P(X = 1) = α may be very small.

Similarly, we see that

$$\displaystyle \begin{aligned} MAP[X|Y = 0] = \arg \max_{i \in \{0, 1\}} p_i (1 - q_i). \end{aligned}$$

Thus,

$$\displaystyle \begin{aligned} MAP[X|Y = 0] = \left\{ \begin{array}{l l} 1, & \mbox{ if } p_1 (1 - q_1) = \alpha p > p_0 (1 - q_0) = (1 - \alpha) (1 - p) \\ 0, & \mbox{ otherwise}. \end{array} \right. \end{aligned}$$

Hence, MAP[X|Y = 0] = 1{α > 1 − p}. Thus, when Y = 0, you guess that X = 1 if X = 1 is more likely a priori than the channel being correct.

Also, MLE[X|Y = 0] = 0 because p < 0.5, irrespective of α. □
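As a quick check of Theorem 7.2, here is a minimal simulation sketch in Python (assuming NumPy is available); the values α = 0.1 and p = 0.2 are chosen only for illustration. Since α < p, the MAP rule always guesses 0 and its error rate is α, whereas the MLE error rate is p.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, p, n = 0.1, 0.2, 100_000   # prior P(X = 1), channel error probability, number of bits

X = (rng.random(n) < alpha).astype(int)
Y = X ^ (rng.random(n) < p).astype(int)      # flip each transmitted bit with probability p

# MAP rule from Theorem 7.2: guess 1{alpha > p} when Y = 1, and 1{alpha > 1 - p} when Y = 0.
map_hat = np.where(Y == 1, int(alpha > p), int(alpha > 1 - p))
mle_hat = Y                                   # MLE rule for p < 0.5 is simply Y

print("MAP error rate:", np.mean(map_hat != X))   # ~ alpha = 0.1
print("MLE error rate:", np.mean(mle_hat != X))   # ~ p = 0.2
```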

7.3 Huffman Codes

Coding can improve the characteristics of a digital link. We explore Huffman codes in this section.

Say that you want to transmit strings of symbols A, B, C, D across a digital link. The simplest method is to encode these symbols as 00, 01, 10, and 11, respectively. In so doing, each symbol requires transmitting two bits. Assuming that there is no error, if the receiver gets the bits 0100110001, it recovers the string BADAB.

Now assume that the strings are such that the symbols occur with the following frequencies: (A, 55%), (B, 30%), (C, 10%), (D, 5%). Thus, A occurs 55% of the time, and similarly for the other symbols. In this situation, one may design a code where A requires fewer bits than D.

The Huffman code (Huffman 1952, Fig. 7.7) for this example is as follows:

$$\displaystyle \begin{aligned} A = 0, B = 10, C = 110, D = 111. \end{aligned}$$

The average number of bits required per symbol is

$$\displaystyle \begin{aligned} 1 \times 55\% + 2 \times 30\% + 3 \times 10\% + 3 \times 5\% = 1.6. \end{aligned}$$

Thus, one saves 20% of the transmissions and the resulting system is 25% faster (ah! arithmetic). Note that the code is such that, when there is no error, the receiver can recover the symbols uniquely from the bits it gets. For instance, if the receiver gets 110100111, the symbols are CBAD, without ambiguity.

Fig. 7.7: David Huffman, 1925–1999

The reason why there is no possible ambiguity is that one can picture the bits as indicating the path in a tree that ends with a leaf of the tree, as shown in Fig. 7.8. Thus, starting with the first bit received, one walks down the tree until one reaches a leaf. One then repeats for the subsequent bits. In our example, when the bits are 110100111, one starts at the top of the tree, then one follows the branches 110 and reaches leaf C, then one restarts from the top and follows the branches 10 and gets to the leaf B, and so on. Codes that have this property of being uniquely decodable in one pass are called prefix-free codes.

Fig. 7.8: Huffman code

The construction of the code is simple. As shown in Fig. 7.8, one joins the two symbols with the smallest frequency of occurrence, here C and D, with branches 0 and 1 and assigns the group CD the sum of the symbol frequencies, here 0.15. One then continues in the same way, joining CD and B and assigning the group BCD the frequency 0.3 + 0.15 = 0.45. Finally, one joins A and BCD. The resulting tree specifies the code.
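The following Python sketch implements this greedy construction (a minimal version using the standard heapq module). The 0/1 labels of the branches may come out differently from the code listed above, but the codeword lengths, and hence the average length of 1.6 bits per symbol, are the same.

```python
import heapq
from itertools import count

def huffman(freqs):
    """Greedy Huffman construction: repeatedly merge the two least frequent groups."""
    tie = count()  # tie-breaker so that heapq never compares the code dictionaries
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, code0 = heapq.heappop(heap)   # smallest remaining frequency -> branch 0
        f1, _, code1 = heapq.heappop(heap)   # next smallest frequency      -> branch 1
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]

freqs = {"A": 0.55, "B": 0.30, "C": 0.10, "D": 0.05}
code = huffman(freqs)
print(code)                                          # codeword lengths 1, 2, 3, 3
print(sum(freqs[s] * len(code[s]) for s in freqs))   # average length: 1.6 bits/symbol
```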

The following property is worth noting.

Theorem 7.3 (Optimality of Huffman Code)

The Huffman code has the smallest average number of bits per symbol among all prefix-free codes. \({\blacksquare }\)

Proof

See Chap. 8. □

It should be noted that other codes have a smaller average length, but they are not symbol-by-symbol codes and are more complex. One such code is based on the observation that there are only about \(2^{nH}\) likely strings of n ≫ 1 symbols, where

$$\displaystyle \begin{aligned} H = - \sum_X x \log_2(x). \end{aligned}$$

In this expression, x is the frequency of symbol X and the sum is over all the symbols. This expression H is the entropy of the distribution of the symbols. Thus, by listing all these strings and assigning nH bits to identify them, one requires only nH bits for n symbols, or H bits per symbol (See Sect. 15.7.).

In our example, one has

$$\displaystyle \begin{aligned} & H = - 0.55 \log_2(0.55) - 0.3 \log_2(0.3) \\ &~~~~ - 0.1 \log_2(0.1) - 0.05 \log_2(0.05) = 1.54. \end{aligned} $$

Thus, for this example, the savings over the Huffman code are not spectacular, but it is easy to find examples for which they are. For instance, assume that there are only two symbols A and B with frequencies p and 1 − p, for some p ∈ (0, 1). The Huffman code requires one bit per symbol, but codes based on long strings require only − plog2(p) − (1 − p)log2(1 − p) bits per symbol. For p = 0.1, this is 0.47, which is less than half the number of bits of the Huffman code.
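A short Python check of these entropy values (a sketch; the probabilities are those of the two examples above):

```python
from math import log2

def entropy(ps):
    # H = - sum of p * log2(p) over the symbol frequencies p
    return -sum(p * log2(p) for p in ps)

print(entropy([0.55, 0.30, 0.10, 0.05]))  # ~1.54 bits/symbol, vs. 1.6 for the Huffman code
print(entropy([0.1, 0.9]))                # ~0.47 bits/symbol, vs. 1 for the Huffman code
```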

Codes based on long strings of symbols are discussed in Sect. 15.7.

7.4 Gaussian Channel

In the previous sections, we had a simplified model of a channel as a BSC. In this section, we examine a more realistic model of the channel that captures the physical characteristic of the noise. In this model, the transmitter sends a bit X ∈{0, 1} and the receiver gets Y  where

$$\displaystyle \begin{aligned} Y = X + Z. \end{aligned}$$

In this identity, \(Z =_D \mathcal {N}(0, \sigma ^2)\) and is independent of X. We say that this is an additive Gaussian noise channel.

Figure 7.9 shows the densities of Y  when X = 0 and when X = 1. Indeed, when X = x, we see that \(Y =_D \mathcal {N}(x, \sigma ^2)\).

Fig. 7.9: The pdf of Y is f 0 when X = 0 and f 1 when X = 1

Assume that the receiver observes Y . How should it decide whether X = 0 or X = 1? Assume again that P(X = 1) = p 1 = α and P(X = 0) = p 0 = 1 − α.

In this example, P[Y = y|X = 0] = 0 for all values of y, since Y is a continuous random variable. So, we must adapt our discussion of Bayes’ rule slightly. Here is how. Pretend that we do not measure Y with infinite precision but instead observe that Y ∈ (y, y + 𝜖), where 0 < 𝜖 ≪ 1. Thus, the symptom is Y ∈ (y, y + 𝜖), and it now has a positive probability. In fact,

$$\displaystyle \begin{aligned} q_0 = P[Y \in (y, y + \epsilon) | X = 0] \approx f_0(y) \epsilon, \end{aligned}$$

by definition of the density f 0(y) of Y  when X = 0. Similarly,

$$\displaystyle \begin{aligned} q_1 = P[Y \in (y, y + \epsilon) | X = 1] \approx f_1(y) \epsilon. \end{aligned}$$

Hence,

$$\displaystyle \begin{aligned} MAP[X| Y \in (y, y + \epsilon)] = \arg \max_{i \in \{0, 1\}} p_i f_i(y) \epsilon. \end{aligned}$$

Since the result does not depend on 𝜖, we write

$$\displaystyle \begin{aligned} MAP[X|Y = y] = \arg \max_{i \in \{0, 1\}} p_i f_i(y). \end{aligned}$$

Similarly,

$$\displaystyle \begin{aligned} MLE[X|Y = y] = \arg \max_{i \in \{0, 1\}} f_i(y). \end{aligned}$$

We can verify that

$$\displaystyle \begin{aligned} MAP[X|Y=y] = 1\left\{y \geq \frac{1}{2} + \sigma^2 \log\left(\frac{p_0}{p_1}\right)\right\}. \end{aligned} $$
(7.2)

Also, the resulting probability of error is

$$\displaystyle \begin{aligned} & P\left(\mathcal{N}(0, \sigma^2) \geq \frac{1}{2} + \sigma^2 \log\left(\frac{p_0}{p_1}\right)\right) p_0 \\ &~~~~ + P\left(\mathcal{N}(1, \sigma^2) \leq \frac{1}{2} + \sigma^2 \log\left(\frac{p_0}{p_1}\right)\right) p_1. \end{aligned} $$

Also,

$$\displaystyle \begin{aligned} MLE[X|Y=y] = 1\{y \geq 0.5\}. \end{aligned}$$

If we choose the MLE detection rule, the system has the same probability of error as a BSC channel with

$$\displaystyle \begin{aligned} p = p(\sigma^2) := P(\mathcal{N}(0, \sigma^2) > 0.5) = P\left(\mathcal{N}(0, 1) > \frac{0.5}{\sigma}\right). \end{aligned}$$

Simulation

Figure 7.10 shows the simulation results when α = 0.5 and σ = 1. The code is in the Jupyter notebook for this chapter.
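For readers without access to the notebook, here is a minimal sketch in the same spirit (not the notebook's code; it assumes NumPy and SciPy). It applies the MAP threshold (7.2), which reduces to the MLE rule y ≥ 0.5 when α = 0.5, and compares the empirical error rate with p(σ²).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, sigma, n = 0.5, 1.0, 100_000

X = (rng.random(n) < alpha).astype(int)
Y = X + sigma * rng.standard_normal(n)          # additive Gaussian noise channel

# MAP threshold from (7.2); with alpha = 0.5 it reduces to the MLE rule y >= 0.5.
p0, p1 = 1 - alpha, alpha
threshold = 0.5 + sigma**2 * np.log(p0 / p1)
X_hat = (Y >= threshold).astype(int)

print("empirical error rate:", np.mean(X_hat != X))
print("theoretical p(sigma^2):", norm.sf(0.5 / sigma))   # P(N(0,1) > 0.5/sigma)
```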

Fig. 7.10: Simulation of the AGN channel with α = 0.5 and σ = 1

BPSK

The system in the previous section was very simple and corresponds to a practical transmission scheme called Binary Phase Shift Keying (BPSK). In this system, instead of sending a constant voltage for T seconds to represent either a bit 0 or a bit 1, the transmitter sends a sine wave for T seconds and the phase of that sine wave depends on whether the transmitter sends a 0 or a 1 (Fig. 7.11).

Fig. 7.11: The signal that the transmitter sends when using BPSK

Specifically, to send bit 0, the transmitter sends the signal

$$\displaystyle \begin{aligned} {\mathbf{s}}_0 = \{s_0(t) = A \sin{}( 2 \pi ft), t \in [0, T]\}. \end{aligned}$$

Here, T is a multiple of the period, so that fT = k for some integer k. To send a bit 1, the transmitter sends the signal s 1 = −s 0. Why all this complication? The signal is a sine wave around frequency f and the designer can choose a frequency that the transmission medium transports well. For instance, if the transmission is wireless, the frequency f is chosen so that the antennas radiate and receive that frequency well. The wavelength of the transmitted electromagnetic wave is the speed of light divided by f and it should be of the same order as the physical length of the antenna. For instance, 1GHz corresponds to a wavelength of one foot and it can be transmitted and received by suitably shaped cell phone antennas.

In any case, the transmitter sends the signal s i to send a bit i, for i = 0, 1. The receiver attempts to detect whether s 0 or s 1 = −s 0 was sent. To do this, it multiplies the received signal by a sine wave at the frequency f, then computes the average value of the product. That is, if the receiver gets the signal r = {r t, 0 ≤ t ≤ T}, it computes

$$\displaystyle \begin{aligned} \frac{1}{T} \int_0^T r_t \sin{}(2 \pi f t ) dt. \end{aligned}$$

You can verify that if r = s 0, then the result is A∕2 and if r = s 1, then the result is − A∕2. Thus, the receiver guesses that bit 0 was transmitted if this average value is positive and that bit 1 was transmitted otherwise.
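This average value is easy to check numerically. The sketch below (assuming NumPy) uses the illustrative values A = 2, f = 10 Hz, and T = 1 s, so that fT is an integer.

```python
import numpy as np

A, f, T = 2.0, 10.0, 1.0                 # illustrative values with fT = 10, an integer
t = np.linspace(0.0, T, 100_001)
s0 = A * np.sin(2 * np.pi * f * t)       # signal for bit 0; s1 = -s0 for bit 1

# Time average of r_t * sin(2*pi*f*t), i.e., (1/T) * integral over [0, T].
corr = np.mean(s0 * np.sin(2 * np.pi * f * t))
print(corr)                              # ~ A/2 = 1.0; for r = s1 the result is ~ -A/2
```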

The signal that the receiver gets is not s i when the transmitter sends s i. Instead, the receiver gets an attenuated and noisy version of that signal. As a result, after doing its calculation, the receiver gets B + Z or − B + Z where B is some constant that depends on the attenuation, Z is a \(\mathcal {N}(0, \sigma ^2)\) random variable and σ 2 reflects the power of the noise.

Accordingly, the detection problem amounts to detecting the mean value of a Gaussian random variable, which is the problem that we discussed earlier.

7.5 Multidimensional Gaussian Channel

When using BPSK, the transmitter has a choice between two signals: s 0 and s 1. Thus, in T seconds, the transmitter sends one bit. To increase the transmission rate, communication engineers devised a more efficient scheme called Quadrature Amplitude Modulation (QAM). When using this scheme, a transmitter can send a number k of bits every T seconds. The scheme can be designed for different values of k. When k = 1, the scheme is identical to BPSK. For k > 1, there are 2k different signals and each one is of the form

$$\displaystyle \begin{aligned} a \cos{}(2 \pi ft ) + b \sin{}(2 \pi ft), \end{aligned}$$

where the coefficients (a, b) characterize the signal and correspond to a given string of k-bits. These coefficients form a constellation as shown in Fig. 7.12 in the case of QAM-16, which corresponds to k = 4.

Fig. 7.12: A QAM-16 constellation

When the receiver gets the signal, it multiplies it by \(2\cos {}(2\pi ft)\) and computes the average over T seconds. This average value would be the coefficient a if there were no attenuation and no noise. The receiver also multiplies the signal by \(2\sin {}(2 \pi ft)\) and computes the average over T seconds. The result should be the coefficient b. From the value of (a, b), the receiver can tell the four bits that the transmitter sent.

Because of the noise (we can correct for the attenuation), the receiver gets a pair of values Y = (Y 1, Y 2), as shown in the figure. The receiver essentially finds the constellation point closest to the measured point Y and reads off the corresponding bits.

The values of |a| and |b| are bounded, because of a power constraint on the transmitter. Accordingly, a constellation with more points (i.e., a larger value of k) has points that are closer together. This proximity increases the likelihood that the noise misleads the receiver. Thus, the size of the constellation should be adapted to the power of the noise. This is in fact what actual systems do. For instance, a cable modem and an ADSL modem divide the frequency band into small channels and they measure the noise power in each channel and choose the appropriate constellation for each. WiFi, LTE, and 5G systems use a similar scheme.

7.5.1 MLE in Multidimensional Case

We can summarize the effect of modulation, demodulation, amplification to compensate for the attenuation and the noise as follows. The transmitter sends one of the sixteen vectors x k = (a k, b k) shown in Fig. 7.12. Let us call the transmitted vector X. The vector that the receiver computes is Y.

Assume first that

$$\displaystyle \begin{aligned} \mathbf{Y} = \mathbf{X} + \mathbf{Z} \end{aligned}$$

where Z = (Z 1, Z 2) and Z 1, Z 2 are i.i.d. N(0, σ 2) random variables. That is, we assume that the errors in Y 1 and Y 2 are independent and Gaussian. In this case, we can calculate the conditional density f Y|X[y|x] as follows. Given X = x, we see that Y 1 = x 1 + Z 1 and Y 2 = x 2 + Z 2. Since Z 1 and Z 2 are independent, it follows that Y 1 and Y 2 are independent as well. Moreover, Y 1 = N(x 1, σ 2) and Y 2 = N(x 2, σ 2). Hence,

$$\displaystyle \begin{aligned} f_{\mathbf{Y}|\mathbf{X}}[\mathbf{y} | \mathbf{x}] = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{- \frac{(y_1 - x_1)^2}{2 \sigma^2}\right\} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{- \frac{(y_2 - x_2)^2}{2 \sigma^2}\right\}. \end{aligned}$$

Recall that MLE[X|Y = y] is the value of x ∈{x 1, …, x 16} that maximizes this expression. Accordingly, it is the value x k that minimizes

$$\displaystyle \begin{aligned} || {\mathbf{x}}_k - \mathbf{y}||{}^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2. \end{aligned}$$

Thus, MLE[X|Y] is indeed the constellation point that is the closest to the measured value Y.
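Here is a minimal sketch of this nearest-point (minimum-distance) detector in Python (assuming NumPy). The constellation coordinates {±1, ±3} are an assumption for illustration, consistent with the point x 1 = (1, −1) used in Problem 7.3; Fig. 7.12 may use a different scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.5, 50_000

# Assumed QAM-16 constellation: coordinates in {-3, -1, 1, 3} on each axis.
levels = np.array([-3.0, -1.0, 1.0, 3.0])
points = np.array([(a, b) for a in levels for b in levels])   # the 16 points x_k

idx = rng.integers(0, 16, size=n)                  # transmitted symbols (uniform prior)
X = points[idx]
Y = X + sigma * rng.standard_normal((n, 2))        # independent Gaussian errors on each axis

# MLE detector: pick the constellation point closest to Y, i.e., minimize ||x_k - y||^2.
d2 = ((Y[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)   # n x 16 squared distances
idx_hat = d2.argmin(axis=1)

print("symbol error rate:", np.mean(idx_hat != idx))
```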

7.6 Hypothesis Testing

There are many situations where the MAP and MLE are not satisfactory guesses. This is the case for designing alarms, medical tests, failure detection algorithms, and many other applications. We describe an important formulation, called the hypothesis testing problem.

7.6.1 Formulation

We consider the case where X ∈{0, 1} and where one assumes a distribution of Y  given X. The goal will be to solve the following problem:

$$\displaystyle \begin{aligned} & \mbox{Maximize } PCD := P[\hat X = 1 | X = 1] \\ & \mbox{ subject to } PFA := P[\hat X = 1 | X = 0 ] \leq \beta. \end{aligned} $$

Here, PCD is the probability of correct detection, i.e., of detecting that X = 1 when it is actually equal to 1. Also, PFA is the probability of false alarm, i.e., of declaring that X = 1 when it is in fact equal to zero. The constant β is a given bound on the probability of false alarm.

To make sense of the terminology, think of X = 1 as meaning that your house is on fire. It is not reasonable to assume a prior probability that X = 1, so the MAP formulation is not appropriate. Also, the MLE amounts to assuming that P(X = 1) = 1∕2, which is not suitable here. In the hypothesis testing formulation, the goal is to detect a fire with the largest possible probability, subject to a bound on the probability of false alarm. That is, one wishes to make the fire detector as sensitive as possible, but not so sensitive that it produces frequent false alarms.

One has the following useful concept.

Definition 7.2 (Receiver Operating Characteristic (ROC))

If the solution of the problem is PCD = R(β), the function R(β) is called the Receiver Operating Characteristic (ROC). ◇

A typical ROC is shown in Fig. 7.13. The terminology comes from the fact that this function depends on the conditional distributions of Y  given X = 0 and given X = 1, i.e., of the signal that is received about X.

Fig. 7.13: The Receiver Operating Characteristic is the maximum probability of correct detection R(β) as a function of the bound β on the probability of false alarm

Note the following features of that curve. First, R(1) = 1 because if one is allowed to have PFA = 1, then one can choose \(\hat X = 1\) for all observations; in that case PCD = 1.

Second, the function R(β) is concave. To see this, let 0 ≤ β 1 < β 2 ≤ 1 and assume that g i(Y ) achieves P[g i(Y ) = 1|X = 1] = R(β i) and P[g i(Y ) = 1|X = 0] = β i for i = 1, 2. Choose 𝜖 ∈ (0, 1) and define X′ = g 1(Y ) with probability 𝜖 and X′ = g 2(Y ) otherwise. Then,

$$\displaystyle \begin{aligned} P[X' = 1 | X = 0 ] & = \epsilon P[g_1(Y) = 1 | X = 0] + (1 - \epsilon)P[g_2(Y) = 1 | X = 0] \\ & = \epsilon \beta _1 + (1 - \epsilon) \beta_2. \end{aligned} $$

Also,

$$\displaystyle \begin{aligned} P[X' = 1 | X = 1 ] & = \epsilon P[g_1(Y) = 1 | X = 1] + (1 - \epsilon)P[g_2(Y) = 1 | X = 1] \\ & = \epsilon R(\beta_1) + (1 - \epsilon) R(\beta_2). \end{aligned} $$

Now, the decision rule \(\hat X\) that maximizes \(P[\hat X = 1 | X = 1]\) subject to \(P[\hat X = 1 | X = 0] = \epsilon \beta _1 + (1 - \epsilon ) \beta _2\) must be at least as good as X′. Hence,

$$\displaystyle \begin{aligned} R(\epsilon \beta _1 + (1 - \epsilon) \beta_2) \geq \epsilon R(\beta _1) + (1 - \epsilon) R(\beta_2). \end{aligned}$$

This inequality proves the concavity of R(β).

Third, the function R(β) is nondecreasing. Intuitively, if one can make a larger PFA, one can decide \(\hat X = 1\) with a larger probability, which increases PCD. To show this formally, let β 2 = 1 in the previous derivation.

Fourth, note that it may not be the case that R(0) = 0. For instance, assume that Y = X. In this case, one chooses \(\hat X = Y = X\), so that PCD = 1 and PFA = 0.

7.6.2 Solution

The solution of the hypothesis testing problem is stated in the following theorem.

Theorem 7.4 (Neyman–Pearson (1933))

The decision \(\hat X\) that maximizes PCD subject to PFA ≤ β is given by

$$\displaystyle \begin{aligned} \hat X = \left\{ \begin{array}{l l} 1, & \mathit{\mbox{ if }} L(Y) > \lambda \\ 1 \mathit{\mbox{ w.p. }} \gamma, & \mathit{\mbox{ if }} L(Y) = \lambda \\ 0, & \mathit{\mbox{ if }} L(Y) < \lambda. \end{array} \right. \end{aligned} $$
(7.3)

In these expressions,

$$\displaystyle \begin{aligned} L(y) = \frac{f_{Y|X}[y|1]}{f_{Y|X}[y|0]} \end{aligned}$$

is the likelihood ratio, i.e., the likelihood of y when X = 1 divided by its likelihood when X = 0. Also, λ > 0 and γ ∈ [0, 1] are chosen so that the resulting \(\hat X\) satisfies

$$\displaystyle \begin{aligned} P[\hat X = 1|X = 0] = \beta. \end{aligned}$$

\({\blacksquare }\)

Thus, if L(Y ) is large, \(\hat X = 1\). The fact that L(Y ) is large means that the observed value Y is much more likely when X = 1 than when X = 0. One is then inclined to decide that X = 1, i.e., to guess \(\hat X = 1\). The situation is similar when L(Y ) is small. By adjusting λ, one controls the sensitivity of the detector. If λ is small, one tends to choose \(\hat X = 1\) more frequently, which increases PCD but also PFA. One then chooses λ so that the detector is just sensitive enough that PFA = β. In some problems, one has to randomize the decision when L(Y ) equals the critical value λ, as we explain in the examples (Fig. 7.14).

Fig. 7.14: Jerzy Neyman, 1894–1981

We prove this theorem in the next chapter. Let us consider a number of examples.

7.6.3 Examples

7.6.3.1 Gaussian Channel

Recall our model of the scalar Gaussian channel:

$$\displaystyle \begin{aligned} Y = X + Z, \end{aligned}$$

where Z = N(0, σ 2) and is independent of X. In this model, X ∈{0, 1} and the receiver tries to guess X from the received signal Y .

We looked at two formulations: MLE and MAP. In the MLE, we want to find the value of X that makes Y  most likely. That is,

$$\displaystyle \begin{aligned} MLE[X|Y=y] = \arg \max_x f_{Y|X}[y|x]. \end{aligned}$$

The answer is MLE[X|Y ] = 0 if Y < 0.5 and MLE[X|Y ] = 1, otherwise.

The MAP is the most likely value of X in {0, 1} given Y . That is,

$$\displaystyle \begin{aligned} MAP[X|Y=y] = \arg \max_x P[X = x | Y = y]. \end{aligned}$$

To calculate the MAP, one needs to know the prior probability p 0 that X = 0. We found out that MAP[X|Y = y] = 1 if \(y \geq 0.5 + \sigma ^2 \log (p_0/p_1)\) and MAP[X|Y = y] = 0 otherwise.

In the hypothesis testing formulation, we choose a bound β on \(PFA = P[ \hat X = 1 | X = 0].\) According to Theorem 7.4, we should calculate the likelihood ratio L(Y ). We find that

$$\displaystyle \begin{aligned} L(y) = \frac{\exp\left\{-\frac{(y - 1)^2}{2 \sigma^2}\right\}}{\exp\left\{-\frac{y^2}{2 \sigma^2}\right\}} = \exp\left\{\frac{2y -1}{2 \sigma^2}\right\}. \end{aligned}$$

Note that, for any given λ, P(L(Y ) = λ) = 0. Moreover, L(y) is strictly increasing in y. Hence, (7.3) simplifies to

$$\displaystyle \begin{aligned} \hat X = \left\{ \begin{array}{c c} 1, & \mbox{ if } y \geq y_0 \\ 0, & \mbox{ otherwise}. \end{array} \right. \end{aligned}$$

We choose y 0 so that PFA = β, i.e., so that

$$\displaystyle \begin{aligned} P[\hat X = 1 | X = 0] = P[Y \geq y_0 | X = 0] = \beta. \end{aligned}$$

Now, given X = 0, Y = N(0, σ 2). Hence, y 0 is such that

$$\displaystyle \begin{aligned} P(N(0, \sigma^2) \geq y_0) = \beta, \end{aligned}$$

i.e., such that

$$\displaystyle \begin{aligned} P\left(N(0, 1) \geq \frac{y_0}{\sigma}\right) = \beta. \end{aligned}$$

For instance, Fig. 3.7 shows that if β = 5%, then y 0σ = 1.65. Figure 7.15 illustrates the solution.

Fig. 7.15: The solution of the hypothesis testing problem for a Gaussian channel

Let us calculate the ROC for the Gaussian channel. Let y(β) be such that P(N(0, 1) ≥ y(β)) = β, so that y 0 = y(β)σ. The probability of correct detection is then

$$\displaystyle \begin{aligned} & PCD = P[\hat X = 1|X = 1] = P[Y \geq y_0 | X = 1] = P(N(1, \sigma^2) \geq y_0) \\ &~~~ = P(N(0, \sigma^2) \geq y_0 - 1) = P(N(0, 1) \geq \sigma^{-1} y_0 - \sigma^{-1}) \\ & ~~~ = P(N(0, 1) \geq y(\beta) - \sigma^{-1}). \end{aligned} $$

Figure 7.16 shows the ROC for different values of σ, obtained using Python. Not surprisingly, the performance of the system degrades when the channel is noisier.
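A sketch of the Python computation behind such curves (assuming NumPy, SciPy, and Matplotlib; the values of σ are illustrative and need not match those of Fig. 7.16):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

beta = np.linspace(1e-4, 1.0, 500)
for sigma in (0.5, 1.0, 2.0):          # assumed noise levels for illustration
    y_beta = norm.isf(beta)            # y(beta): P(N(0,1) >= y(beta)) = beta
    pcd = norm.sf(y_beta - 1 / sigma)  # R(beta) = P(N(0,1) >= y(beta) - 1/sigma)
    plt.plot(beta, pcd, label=f"sigma = {sigma}")

plt.xlabel("PFA bound beta")
plt.ylabel("PCD R(beta)")
plt.legend()
plt.show()
```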

Fig. 7.16: The ROC of a Gaussian channel Y = X + Z, where X ∈{0, 1} and Z = N(0, σ 2)

7.6.3.2 Mean of Exponential RVs

In this second example, we are testing the mean of exponential random variables. The story is that a machine produces lightbulbs that have an exponentially distributed lifespan with mean 1∕λ x when X = x ∈{0, 1}. Assume that λ 0 < λ 1. The interpretation is that the machine is defective when X = 1 and produces lightbulbs that have a shorter lifespan.

Let Y = (Y 1, …, Y n) be the observed lifespans of n bulbs. We want to detect that X = 1 with PFA ≤ β = 5%.

We find

$$\displaystyle \begin{aligned} & L(y) = \frac{f_{Y|X}[y|1]}{f_{Y|X}[y|0]} = \frac{\varPi_{i = 1}^n \lambda_1 \exp\{- \lambda_1 y_i\}} {\varPi_{i = 1}^n \lambda_0 \exp\{- \lambda_0 y_i\}} \\ & ~~~~ = \left(\frac{\lambda_1}{\lambda_0} \right)^n \exp\left\{-(\lambda_1 - \lambda_0) \sum_{i=1}^n y_i\right\}. \end{aligned} $$

Since λ 1 > λ 0, we find that L(y) is strictly decreasing in ∑i y i and also that P(L(Y ) = λ) = 0 for all λ. Thus, (7.3) simplifies to

$$\displaystyle \begin{aligned} \hat X = \left\{ \begin{array}{c c} 1, & \mbox{ if } \sum_{i=1}^n Y_i \leq a \\ 0, & \mbox{ otherwise}, \end{array} \right. \end{aligned}$$

where a is chosen so that

$$\displaystyle \begin{aligned} P\left[\sum_{i = 1}^n Y_i \leq a | X = 0\right] = \beta = 5\%. \end{aligned}$$

Now, when X = 0, the Y i are i.i.d. random variables that are exponentially distributed with mean 1∕λ 0. The distribution of their sum is rather complicated. We approximate it using the Central Limit Theorem.

We have

$$\displaystyle \begin{aligned} \frac{Y_1 + \cdots + Y_n - n \lambda_0^{-1}}{\sqrt{n}} \approx N(0, \lambda_0^{-2}). \end{aligned}$$

Now,

$$\displaystyle \begin{aligned} \sum_{i = 1}^n Y_i \leq a \Leftrightarrow \frac{Y_1 + \cdots + Y_n - n \lambda_0^{-1}}{\sqrt{n}} \leq \frac{a - n \lambda_0^{-1}}{\sqrt{n}}. \end{aligned} $$

Hence,

$$\displaystyle \begin{aligned} P\left[\sum_{i = 1}^n Y_i \leq a | X = 0\right] &\approx P\left(N(0, \lambda_0^{-2}) \leq \frac{a - n \lambda_0^{-1}}{\sqrt{n}}\right) \\ &= P\left(N(0, 1) \leq \lambda_0 \frac{a - n \lambda_0^{-1}}{\sqrt{n}}\right). \end{aligned} $$

Hence, if we want this probability to be equal to 5%, by (3.2) and the symmetry of the Gaussian distribution, we must choose a so that

$$\displaystyle \begin{aligned} \lambda_0 \frac{a - n \lambda_0^{-1}}{\sqrt{n}} = - 1.65,\end{aligned} $$

i.e.,

$$\displaystyle \begin{aligned} a = (n - 1.65 \sqrt{n}) \lambda_0^{-1} .\end{aligned} $$

One point is worth noting for this example. We see that the calculation of \(\hat X\) is based on Y 1 + ⋯ + Y n. Thus, although one has measured the individual lifespans of the n bulbs, the decision is based only on their sum, or equivalently on their average.
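A short simulation sketch (assuming NumPy) that checks this threshold; the values λ 0 = 1, λ 1 = 2, and n = 100 are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
lam0, lam1, n, trials = 1.0, 2.0, 100, 20_000    # assumed illustrative values

a = (n - 1.65 * np.sqrt(n)) / lam0               # CLT threshold for beta = 5%

# Total lifespans under X = 0 (healthy machine) and X = 1 (defective machine).
S0 = rng.exponential(1 / lam0, size=(trials, n)).sum(axis=1)
S1 = rng.exponential(1 / lam1, size=(trials, n)).sum(axis=1)

print("PFA ~", np.mean(S0 <= a))   # should be close to 0.05 (CLT approximation)
print("PCD ~", np.mean(S1 <= a))   # probability of catching the defect
```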

7.6.3.3 Bias of a Coin

In this example, we observe n coin flips. Given X = x ∈{0, 1}, the flips are i.i.d. B(p x). That is, given X = x, the outcomes Y 1, …, Y n of the coin flips are i.i.d. and equal to 1 with probability p x and to zero otherwise. We assume that p 1 > p 0 = 0.5. That is, we want to test whether the coin is fair or biased.

Here, the random variables Y i are discrete. We see that

$$\displaystyle \begin{aligned} & P[Y_i = y_i, i = 1, \ldots, n|X = x] = \varPi_{i=1}^n p_x^{Y_i} (1 - p_x)^{1 - Y_i} \\ &~~~~~~ = p_x^S (1 - p_x)^{n - S} \mbox{ where } S = Y_1 + \cdots + Y_n.\end{aligned} $$

Hence,

$$\displaystyle \begin{aligned} & L(Y_1, \ldots, Y_n) = \frac{P[Y_i = y_i, i = 1, \ldots, n|X = 1] }{P[Y_i = y_i, i = 1, \ldots, n|X = 0] }\\ &~~~~~~ = \left( \frac{p_1}{p_0} \right)^S \left( \frac{1 - p_1}{1 - p_0} \right)^{n - S} = \left( \frac{1 - p_1}{1 - p_0} \right)^n \left( \frac{p_1(1 - p_0)}{p_0(1 - p_1)} \right)^S.\end{aligned} $$

Since p 1 > p 0, we see that the likelihood ratio is increasing in S. Thus, the solution of the hypothesis testing problem is

$$\displaystyle \begin{aligned} \hat X = 1\{S \geq n_0\},\end{aligned} $$

where n 0 is such that P[S ≥ n 0|X = 0] ≈ β. To calculate n 0, we approximate S, when X = 0, by using the Central Limit Theorem. We have

$$\displaystyle \begin{aligned} & P[S \geq n_0 | X = 0] = P\left[\frac{S - np_0}{\sqrt{n}} \geq \frac{n_0 - np_0}{\sqrt{n}}| X = 0\right] \\ &~~~ \approx P\left(N(0, p_0(1 - p_0)) \geq \frac{n_0 - np_0}{\sqrt{n}}\right) \\ &~~~ = P\left(N(0, 0.25) \geq \frac{n_0 - np_0}{\sqrt{n}}\right) = P\left(N(0, 1) \geq \frac{2n_0 - n}{\sqrt{n}}\right). \end{aligned} $$

Say that β = 5%, then we need

$$\displaystyle \begin{aligned} \frac{2n_0 - n}{\sqrt{n}} = 1.65, \end{aligned}$$

by (3.2). Hence,

$$\displaystyle \begin{aligned} n_0 = 0.5n + 0.83 \sqrt{n}. \end{aligned}$$
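A quick check with the exact Binomial distribution (assuming SciPy; n = 400 and p 1 = 0.6 are illustrative values):

```python
import numpy as np
from scipy.stats import binom

n = 400                                          # assumed number of flips for illustration
n0 = int(np.ceil(0.5 * n + 0.83 * np.sqrt(n)))   # CLT threshold for beta = 5%

# Exact PFA under the fair coin (p0 = 0.5): P[S >= n0 | X = 0].
print("n0 =", n0, " PFA ~", binom.sf(n0 - 1, n, 0.5))   # close to 0.05

# Probability of correct detection against an assumed bias p1 = 0.6: P[S >= n0 | X = 1].
print("PCD ~", binom.sf(n0 - 1, n, 0.6))
```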

7.6.3.4 Discrete Observations

In the examples that we considered so far, the random variable L(Y ) is continuous. In such cases, the probability that L(Y ) = λ is always zero, and there is no need to randomize the choice of \(\hat X\) for specific values of Y . In our next examples, that need arises.

First consider, as usual, the problem of choosing \(\hat X \in \{0, 1\}\) to maximize the probability of correct detection \(P[\hat X = 1 | X = 1]\) subject to a bound \(P[\hat X = 1 | X = 0] \leq \beta \) on the probability of false alarm. However, assume that we make no observation. In this case, the solution is to choose \(\hat X = 1\) with probability β. This choice meets the bound on the probability of false alarm and achieves a probability of correct detection equal to β. This randomized choice is better than always deciding \(\hat X = 0\).

Now consider a more complex example where Y ∈{A, B, C} and

$$\displaystyle \begin{aligned} & P[Y = A | X = 1] = 0.2, P[Y = B|X = 1] = 0.2, P[Y = C|X = 1] = 0.6 \\ & P[Y = A | X = 0] = 0.2, P[Y = B|X = 0] = 0.5, P[Y = C|X = 0] = 0.3. \end{aligned} $$

Accordingly, the values of the likelihood ratio L(y) = P[Y = y|X = 1]∕P[Y = y|X = 0] are as follows:

$$\displaystyle \begin{aligned} L(A) = 1, L(B) = 0.4 \mbox{ and } L(C) = 2. \end{aligned}$$

We rank the observations in increasing order of the values of L, as shown in Fig. 7.17.

Fig. 7.17: The three possible observations

The solution of the hypothesis testing problem amounts to choosing a threshold λ and a randomization γ so that

$$\displaystyle \begin{aligned} P[ \hat X = 1 | Y ] = 1\{ L(Y) > \lambda \} + \gamma 1\{L(Y) = \lambda \}.\end{aligned} $$

Also, we choose λ and γ so that \(P[\hat X = 1 | X = 0] = \beta \).

Figure 7.17 shows that if we choose λ = 2.1, then L(Y ) < λ, for all values of Y , so that we always decide \(\hat X = 0\). Accordingly, PCD = 0 and PFA = 0.

The figure also shows that if we choose λ = 2 and a parameter γ, then we decide \(\hat X = 1\) when L(Y ) = 2 with probability γ. Thus, if X = 0, we decide \(\hat X = 1\) with probability 0.3γ, because Y = C with probability 0.3 when X = 0 and this is precisely when L(Y ) = 2 and we randomize with probability γ. The figure shows other examples.

It should be clear that as we reduce λ from 2.1 to 0.39, the probability that we decide \(\hat X = 1\) when X = 0 increases from 0 to 1. Also, by choosing the parameter γ suitably when λ is set to a possible value of L(Y ), we can adjust PFA to any value in [0, 1].

For instance, we can have PFA = 0.05 if we choose λ = 2 and γ = 0.05∕0.3. Similarly, we can have PFA = 0.4 by choosing λ = 1 and γ = 0.5. Indeed, in this case, we decide \(\hat X = 1\) when Y = C and also with probability 0.5 when Y = A, so that this occurs with probability 0.3 + 0.2 × 0.5 = 0.4 when X = 0. The corresponding PCD is then 0.6 + 0.2 × 0.5 = 0.7.
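These numbers are easy to reproduce; the following Python sketch evaluates (PFA, PCD) for the randomized likelihood-ratio test of this example.

```python
# Likelihood-ratio test with randomization for the three-valued observation.
P1 = {"A": 0.2, "B": 0.2, "C": 0.6}   # P[Y = y | X = 1]
P0 = {"A": 0.2, "B": 0.5, "C": 0.3}   # P[Y = y | X = 0]
L = {y: P1[y] / P0[y] for y in P0}    # L(A) = 1, L(B) = 0.4, L(C) = 2

def test(lam, gamma):
    """Return (PFA, PCD) of the rule: decide 1 if L(Y) > lam, w.p. gamma if L(Y) = lam."""
    decide = lambda y: 1.0 if L[y] > lam else gamma if L[y] == lam else 0.0
    pfa = sum(P0[y] * decide(y) for y in P0)
    pcd = sum(P1[y] * decide(y) for y in P1)
    return pfa, pcd

print(test(2, 0.05 / 0.3))   # PFA = 0.05, PCD = 0.1
print(test(1, 0.5))          # PFA = 0.4,  PCD = 0.7, as computed above
```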

Figure 7.18 shows PCD as a function of the bound on PFA.

Fig. 7.18: The ROC for the discrete observation example

7.7 Summary

  • MAP and MLE;

  • BPSK;

  • Huffman Codes;

  • Independent Gaussian Errors;

  • Hypothesis Testing: Neyman–Pearson Theorem.

7.7.1 Key Equations and Formulas

  • Bayes’ Rule: π i = p i q i∕(∑j p j q j) (Theorem 7.1)

  • MAP[X|Y = y] = \(\arg \max _x P[X=x|Y=y]\) (Definition 7.1)

  • MLE[X|Y = y] = \(\arg \max _x P[Y=y|X=x]\) (Definition 7.1)

  • Likelihood Ratio: L(y) = f Y |X[y|1]∕f Y |X[y|0] (Theorem 7.4)

  • Gaussian Channel: \(MAP[X|Y=y] = 1\{y \geq \frac {1}{2} + \sigma ^2 \log (\frac {p_0}{p_1})\}\) (Eq. (7.2))

  • Neyman–Pearson Theorem: \(P[\hat X = 1 | Y] = 1\{L(Y) > \lambda \} + \gamma 1\{L(Y) = \lambda \}\) (Theorem 7.4)

  • ROC: ROC(β) = max. PCD s.t. PFA ≤ β (Definition 7.2)

7.8 References

Detection theory is obviously a classical topic. It is at the core of digital communication (see e.g., Proakis (2000)). The Neyman–Pearson Theorem is introduced in Neyman and Pearson (1933). For a discussion of hypothesis testing, see Lehmann (2010). For more details on digital communication and, in particular, on wireless communication, see the excellent presentation in Tse and Viswanath (2005).

7.9 Problems

Problem 7.1

Assume that when \(X = 0, Y = \mathcal {N}(0, 1)\) and when \(X = 1, Y = \mathcal {N}(0, \sigma ^2)\) with σ 2 > 1. Calculate MLE[X|Y ].

Problem 7.2

Let X, Y  be i.i.d. U[0, 1] random variables. Define V = X + Y  and W = X − Y .

  1. (a)

    Show that V  and W are uncorrelated;

  2. (b)

    Are V  and W independent? Prove or disprove.

Problem 7.3

A digital link uses the QAM-16 constellation shown in Fig. 7.12 with x 1 = (1, −1). The received signal is Y = X + Z where \(\mathbf {Z} =_D \mathcal {N}(\mathbf {0}, \sigma ^2 \mathbf {I})\). The receiver uses the MAP. Simulate the system using Python to estimate the fraction of errors for σ = 0.2, 0.3.

Problem 7.4

Use Python to verify the CLT with i.i.d. U[0, 1] random variables X n. That is, generate the random variables {X 1, …, X N} for N = 10000. Calculate

$$\displaystyle \begin{aligned} Y_n = \frac{X_{100n + 1} + \cdots + X_{(n+1)100} - 50}{10}, n = 0, 1, \ldots, 99. \end{aligned}$$

Plot the empirical cdf of {Y 0, …, Y 99} and compare with the cdf of a \(\mathcal {N}(0, 1/12)\) random variable.

Problem 7.5

You are testing a digital link that corresponds to a BSC with some error probability 𝜖 ∈ [0, 0.5).

  1. (a)

    Assume you observe the input and the output of the link. How do you find the MLE of 𝜖?

  2. (b)

    You are told that the inputs are i.i.d. bits that are equal to 1 with probability 0.6 and to 0 with probability 0.4. You observe n outputs. How do you calculate the MLE of 𝜖?

  3. (c)

    The situation is as in the previous case, but you are told that 𝜖 has pdf 4 − 8x on [0, 0.5). How do you calculate the MAP of 𝜖 given n outputs?

Problem 7.6

The situation is the same as in the previous problem. You observe n inputs and outputs of the BSC. You want to solve a hypothesis testing problem to detect that 𝜖 > 0.1 with a probability of false alarm at most equal to 5%. Assume that n is very large and use the CLT.

Problem 7.7

The random variable X is such that P(X = 1) = 2∕3 and P(X = 0) = 1∕3. When X = 1, the random variable Y  is exponentially distributed with rate 1. When X = 0, the random variable Y  is uniformly distributed in [0, 2]. (Hint: Be careful about the case Y > 2.)

  1. (a)

    Find MLE[X|Y ];

  2. (b)

    Find MAP[X|Y ];

  3. (c)

    Solve the following hypothesis testing problem:

    $$\displaystyle \begin{aligned} & \mbox{Maximize } P[\hat X= 1 | X = 1] \\ & \mbox{ subject to } P[\hat X = 1 | X = 0] \leq 5\%. \end{aligned} $$

Problem 7.8

Simulate the following communication channel. There is an i.i.d. source that generates symbols {1, 2, 3, 4} according to a prior distribution π = [p 1, p 2, p 3, p 4]. The symbols are modulated by QPSK scheme, i.e. they are mapped to constellation points (±1, ±1). The communication is on a baseband Gaussian channel, i.e. if the sent signal is (x 1, x 2), the received signal is

$$\displaystyle \begin{aligned} y_1 = x_1 + Z_1, \end{aligned}$$
$$\displaystyle \begin{aligned} y_2 = x_2 + Z_2, \end{aligned}$$

where Z 1 and Z 2 are independent N(0, σ 2) random variables. Find the MAP detector and ML detector analytically.

Simulate the channel using Python for π = [0.1, 0.2, 0.3, 0.4], and σ = 0.1 and σ = 0.5. Evaluate the probability of correct detection.

Problem 7.9

Let X be equally likely to take any of the values {1, 2, 3}. Given X, the random variable Y  is \(\mathcal {N}(X, 1)\).

  1. (a)

    Find MAP[X|Y ];

  2. (b)

    Calculate MLE[X|Y ];

  3. (c)

    Calculate E((XY )2).

Problem 7.10

The random variable X is such that P(X = 0) = P(X = 1) = 0.5. Given X, the random variables Y n are i.i.d. U[0, 1.1 − 0.1X]. The goal is to guess \(\hat X\) from the observations Y n. Each observation has a cost β > 0. To get nice numerical solutions, we assume that

$$\displaystyle \begin{aligned} \beta = 0.018 \approx 0.5 (1.1)^{-10} \log(1.1). \end{aligned}$$
  1. (a)

    Assume that you have observed Y n = (Y 1, …, Y n). What is the guess \(\hat X_n\) based on these observations that maximizes the probability that \(\hat X_n = X\)?

  2. (b)

    What is the corresponding value of \(P(\hat X_n = X)\)?

  3. (c)

    Choose n to maximize \(P(X = \hat X_n) - \beta n\) where \(\hat X_n\) is chosen on the basis of Y 1, …, Y n. Hint: You will recall that

    $$\displaystyle \begin{aligned} \frac{d}{dx} (a^x) = a^x \log(a). \end{aligned}$$

Problem 7.11

The random variable X is exponentially distributed with mean 1. Given X, the random variable Y  is exponentially distributed with rate X.

  1. (a)

    Find MLE[X|Y ];

  2. (b)

    Find MAP[X|Y ];

  3. (c)

    Solve the following hypothesis testing problem:

    $$\displaystyle \begin{aligned} & \mbox{Maximize } P[\hat X = 1 | X = a] \\ & \mbox{ subject to } P[\hat X = 1 | X = 1] \leq 5\%, \end{aligned} $$

    where a > 1 is given.

Problem 7.12

Consider a random variable Y  that is exponentially distributed with parameter θ. You observe n i.i.d. samples Y 1, …, Y n of this random variable. Calculate \(\hat \theta = MLE[\theta | Y_1, \ldots , Y_n]\). What is the bias of this estimator, i.e., \(E[\hat \theta -\theta |\theta ]\)? Does the bias converge to 0 as n goes to infinity?

Problem 7.13

Assume that Y =D U[a, b]. You observe n i.i.d. samples Y 1, …, Y n of this random variable. Calculate the maximum likelihood estimator \(\hat a\) of a and \(\hat b\) of b. What is the bias of \(\hat a\) and \(\hat b\)?

Problem 7.14

We are looking at a hypothesis testing problem where \(X, \hat X\) take values in {0, 1}. The value of \(\hat X\) is decided based on the observed value of the random vector Y. We assume that Y has a density f i(y) given that X = i, for i = 0, 1, and we define L(y) := f 1(y)∕f 0(y).

Define g(β) to be the maximum value of \(P[\hat X = 1 | X = 1]\) subject to \(P[\hat X = 1 | X = 0] \leq \beta \) for β ∈ [0, 1]. Then (choose the correct answers, if any)

  • g(β) ≥ 1 − β;

  • g(β) ≥ β;

  • The optimal decision is described by a function \(h(\mathbf {y}) = P[\hat X = 1 | \mathbf {Y} = \mathbf {y}]\) and this function is nondecreasing in f 1(y)∕f 0(y).

Problem 7.15

Given θ ∈{0, 1}, X = θ(1, 1) + V where V 1 and V 2 are independent and uniformly distributed in [−2, 2]. Solve the hypothesis testing problem:

$$\displaystyle \begin{aligned} & \mbox{Maximize } P[\hat \theta = 1 | \theta = 1] \\ & \mbox{ s.t. } P[\hat \theta = 1 | \theta = 0] \leq 5\%. \end{aligned} $$

Problem 7.16

Given θ = 1, X =D Exp(1) and, given θ = 0, X =D U[0, 2].

  1. (a)

    Find \({\hat \theta } = HT[\theta | X, \beta ]\), defined as the random variable \({\hat \theta }\) determined from X that maximizes \(P[\hat {\theta } = 1 | \theta = 1]\) subject to \(P[\hat {\theta } = 1 | \theta = 0] \leq \beta \);

  2. (b)

    Compute the resulting value of \(\alpha ( \beta ) = P[\hat {\theta } = 1 | \theta = 1]\);

  3. (c)

    Sketch the ROC curve α(β) for β ∈ [0, 1].

Problem 7.17

You observe a random sequence {X n, n = 0, 1, 2, …}. With probability p, θ = 0 and this sequence is i.i.d. Bernoulli with P(X n = 0) = P(X n = 1) = 0.5. With probability 1 − p, θ = 1 and the sequence is a stationary Markov chain on {0, 1} with transition probabilities P(0, 1) = P(1, 0) = α. The parameter α is given in (0, 1).

  1. (1)

    Find MAP[θ|X 0, …, X n];

  2. (2)

    Discuss the convergence of \({\hat \theta }_n\);

  3. (3)

    Discuss the composite hypothesis testing problem where α < 0.5 when θ = 1 and α = 0.5 when θ = 0.

Problem 7.18

If θ = 0, the sequence {X n, n ≥ 0} is a Markov chain on a finite set \(\mathcal {X}\) with transition matrix P 0. If θ = 1, the transition matrix is P 1. In both cases, X 0 = x 0 is known. Find MLE[θ|X 0, …, X n].