Application: Estimation, Tracking

Topics: LLSE, MMSE, Kalman Filter

9.1 Examples

A GPS receiver uses the signals it gets from satellites to estimate its location (Fig. 9.1). Temperature and pressure sensors provide signals that a computer uses to estimate the state of a chemical reactor.

Fig. 9.1 Estimating the location of a device from satellite signals

A radar measures electromagnetic waves that an object reflects and uses the measurements to estimate the position of that object (Fig. 9.2).

Fig. 9.2 Estimating the position of an object from radar signals

Similarly, your car’s control computer estimates the state of the car from measurements it gets from various sensors (Fig. 9.3).

Fig. 9.3 Estimating the state of a vehicle from sensor signals

9.2 Estimation Problem

The basic estimation problem can be formulated as follows. There is a pair of continuous random variables (X, Y ). The problem is to estimate X from the observed value of Y .

This problem admits a few different formulations:

  • Known Distribution: We know the joint distribution of (X, Y );

  • Off-Line: We observe a set of sample values of (X, Y );

  • On-Line: We observe successive samples of (X, Y ), one at a time.

The objective is to choose the inference function g(⋅) to minimize the expected error C(g) where

$$\displaystyle \begin{aligned} C(g) = E(c(X, g(Y))). \end{aligned}$$

In this expression, \(c(X, \hat X)\) is the cost of guessing \(\hat X\) when the actual value is X. A standard example is

$$\displaystyle \begin{aligned} c(X, \hat X) = |X - \hat X|{}^2. \end{aligned}$$

We will also study the case when \(X \in \Re ^d\) for d > 1. In such a situation, one uses \(c(X, \hat X) = ||X - \hat X||{ }^2\). If the function g(⋅) can be arbitrary, the function that minimizes C(g) is the Minimum Mean Squares Estimate (MMSE) of X given Y . If the function g(⋅) is restricted to be linear, i.e., of the form a + BY , the linear function that minimizes C(g) is the Linear Least Squares Estimate (LLSE) of X given Y . One may also restrict g(⋅) to be a polynomial of a given degree. For instance, one may define the Quadratic Least Squares Estimate QLSE of X given Y . See Fig. 9.4.

Fig. 9.4 Least squares estimates of X given Y : LLSE is linear, QLSE is quadratic, and MMSE can be an arbitrary function

As we will see, a general method for the off-line inference problem is to choose a parametric class of functions \(\{g_w, w \in \Re ^d\}\) and to then minimize the empirical error

$$\displaystyle \begin{aligned} \sum_{k = 1}^K c(X_k, g_w(Y_k)) \end{aligned}$$

over the parameters w. Here, the (X k, Y k) are the observed samples. The parametric function could be linear, polynomial, or a neural network.

For the on-line problem, one also chooses a similar parametric family of functions and one uses a stochastic gradient descent algorithm of the form

$$\displaystyle \begin{aligned} w(k+1) = w(k) - \gamma \nabla_w c(X_{k+1}, g_w(Y_{k+1})),\end{aligned} $$

where ∇ is the gradient with respect to w and γ > 0 is a small step size. The justification for this approach is that, since γ is small, by the SLLN, the update tends to be in the direction of

$$\displaystyle \begin{aligned} - \sum_{i = k}^{k+K-1} \nabla_w c(X_{i+1}, g_w(Y_{i+1})) \approx - K \nabla_w E(c(X_k, g_w(Y_k))) = - K \nabla_w C(g_w),\end{aligned} $$

which would correspond to a gradient algorithm to minimize C(g w).
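To make this concrete, here is a minimal sketch of the on-line approach for a linear parametric family g_w(Y ) = w_0 + w_1 Y with the squared-error cost. The data model, the step size γ, and the number of iterations are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.01                        # small step size
w = np.zeros(2)                     # parameters: g_w(y) = w[0] + w[1] * y

for _ in range(10_000):
    # one new sample (X, Y); this data model is an arbitrary illustration
    y = rng.uniform(0, 1)
    x = 2 * y + 0.1 * rng.normal()
    err = x - (w[0] + w[1] * y)     # X - g_w(Y)
    # gradient of |X - g_w(Y)|^2 with respect to w is -2 * err * (1, y)
    w += gamma * 2 * err * np.array([1.0, y])

print(w)    # should be roughly (0, 2) for this data model
```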

9.3 Linear Least Squares Estimates

In this section, we study the linear least squares estimates. Recall the setup that we explained in the previous section. There is a pair (X, Y ) of random variables with some joint distribution and the problem is to find the function g(Y ) = a + bY  that minimizes

$$\displaystyle \begin{aligned} C(g) = E(|X - g(Y)|{}^2). \end{aligned}$$

One considers the cases where the distribution is known, where a set of samples has been observed, or where one observes one sample at a time.

Assume that the joint distribution of (X, Y ) is known. This means that we know the joint cumulative distribution function (j.c.d.f.) F X,Y(x, y).

We are looking for the function g(Y ) = a + bY  that minimizes

$$\displaystyle \begin{aligned} C(g) = E(|X - g(Y)|{}^2) = E(|X - a - bY|{}^2). \end{aligned}$$

We denote this function by L[X|Y ]. Thus, we have the following definition.

Definition 9.1 (Linear Least Squares Estimate (LLSE))

The LLSE of X given Y , denoted by L[X|Y ], is the linear function a + bY  that minimizes

$$\displaystyle \begin{aligned} E(|X - a - bY|{}^2). \end{aligned}$$

Note that

$$\displaystyle \begin{aligned} & C(g) = E(X^2 + a^2 + b^2 Y^2 - 2aX - 2bXY + 2abY) \\ &~~~~ = E(X^2) + a^2 + b^2 E(Y^2) - 2a E(X) - 2b E(XY) + 2ab E(Y). \end{aligned} $$

To find the values of a and b that minimize that expression, we set to zero the partial derivatives with respect to a and b. This gives the following two equations:

$$\displaystyle \begin{aligned} & 0 = 2a - 2E(X) + 2bE(Y) {} \end{aligned} $$
(9.1)
$$\displaystyle \begin{aligned} & 0 = 2bE(Y^2) - 2E(XY) + 2aE(Y). {} \end{aligned} $$
(9.2)

Solving these equations for a and b, we find that

$$\displaystyle \begin{aligned} L[X|Y] = a + bY = E(X) + \frac{\mbox{cov}(X, Y)}{\mbox{var}(Y)}(Y - E(Y)), \end{aligned}$$

where we used the identities

$$\displaystyle \begin{aligned} \mbox{cov}(X, Y) = E(XY) - E(X)E(Y) \mbox{ and } \mbox{var}(Y) = E(Y^2) - E(Y)^2. \end{aligned}$$

We summarize this result as a theorem.

Theorem 9.1 (Linear Least Squares Estimate)

One has

$$\displaystyle \begin{aligned} L[X|Y] = E(X) + \frac{\mathit{\mbox{cov}}(X, Y)}{\mathit{\mbox{var}}(Y)}(Y - E(Y)). \end{aligned} $$
(9.3)

\({\blacksquare }\)
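As a quick sanity check of (9.3), one can estimate the right-hand side from simulated samples, with sample moments standing in for the exact ones. The joint distribution below is an arbitrary choice used only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y = rng.normal(size=n)
x = 3 * y + rng.normal(size=n)      # arbitrary joint distribution for the check

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
b = cov_xy / np.var(y)              # cov(X, Y) / var(Y)
a = x.mean() - b * y.mean()         # so that a + b E(Y) = E(X)

print(a, b)    # L[X|Y] = a + b Y; for this model, roughly 0 + 3 Y
```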

As a first example, assume that

$$\displaystyle \begin{aligned} Y = \alpha X + Z, \end{aligned} $$
(9.4)

where X and Z are zero-mean and independent. In this case, we find

$$\displaystyle \begin{aligned} & \mbox{cov}(X, Y) = E(XY) - E(X)E(Y) \\ &~~~~~~~~~~~~~~~ = E(X(\alpha X + Z)) = \alpha E(X^2) \\ & \mbox{var}(Y) = \alpha^2 \mbox{var}(X) + \mbox{var}(Z) = \alpha^2 E(X^2) + E(Z^2). \end{aligned} $$

Hence,

$$\displaystyle \begin{aligned} L[X | Y] = \frac{\alpha E(X^2)}{\alpha^2 E(X^2) + E(Z^2)} Y = \frac{\alpha^{-1}Y}{1 + SNR^{-1}}, \end{aligned}$$

where

$$\displaystyle \begin{aligned} SNR := \frac{\alpha^2 E(X^2)}{E(Z^2)} \end{aligned}$$

is the signal-to-noise ratio, i.e., the ratio of the power E(α 2 X 2) of the signal in Y  divided by the power E(Z 2) of the noise. Note that if SNR is small, then L[X|Y ] is close to zero, which is the best guess about X if one does not make any observation. Also, if SNR is very large, then L[X|Y ] ≈ α −1 Y , which is the correct guess if Z = 0.

As a second example, assume that

$$\displaystyle \begin{aligned} X = \alpha Y + \beta Y^2, \end{aligned} $$
(9.5)

where Y =D U[0, 1]. Then,

$$\displaystyle \begin{aligned} & E(X) = \alpha E(Y) + \beta E(Y^2) = \alpha/2 + \beta/3; \\ & \mbox{cov}(X, Y) = E(XY) - E(X)E(Y) \\ &~~~~~~~~~~~~~~~ = E(\alpha Y^2 + \beta Y^3) - (\alpha/2 + \beta/3)(1/2) \\ &~~~~~~~~~~~~~~~ = \alpha/3 + \beta/4 - \alpha/4 - \beta/6 \\ &~~~~~~~~~~~~~~~ = (\alpha + \beta)/12 \\ & \mbox{var}(Y) = E(Y^2) - E(Y)^2 = 1/3 - (1/2)^2 = 1/12. \end{aligned} $$

Hence,

$$\displaystyle \begin{aligned} L[X|Y] = \alpha/2 + \beta/3 + (\alpha + \beta)(Y - 1/2) = - \beta/6 + (\alpha + \beta)Y. \end{aligned}$$

This estimate is sketched in Fig. 9.5. Obviously, if one observes Y , one can compute X. However, recall that L[X|Y ] is restricted to being a linear function of Y .

Fig. 9.5 The figure shows L[αY + βY 2|Y ] when Y =D U[0, 1]
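This example can also be checked by simulation: the sample-based slope and intercept should be close to α + β and −β∕6. The values of α and β below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 1.0, 2.0                    # arbitrary values for the check
y = rng.uniform(0, 1, size=1_000_000)
x = alpha * y + beta * y ** 2

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
b = cov_xy / np.var(y)                    # should be close to alpha + beta = 3
a = x.mean() - b * y.mean()               # should be close to -beta / 6
print(a, b)
```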

9.3.1 Projection

There is an insightful interpretation of L[X|Y ] as a projection that also helps understand more complex estimates. This interpretation is that L[X|Y ] is the projection of X onto the set \(\mathcal {L}(Y)\) of linear functions of Y .

This interpretation is sketched in Fig. 9.6. In that figure, random variables are represented by points and \(\mathcal {L}(Y)\) is shown as a plane since the linear combination of points in that set is again in the set. In the figure, the square of the length of a vector from a random variable V  to another random variable W is E(|V − W|2). Also, we say that two vectors V  and W are orthogonal if E(VW) = 0. Thus, L[X|Y ] = a + bY  is the projection of X onto \(\mathcal {L}(Y)\) if X − L[X|Y ] is orthogonal to every linear function of Y , i.e., if

$$\displaystyle \begin{aligned} E((X - a - bY)(c + dY)) = 0, \forall c, d \in \Re. \end{aligned}$$

Equivalently,

$$\displaystyle \begin{aligned} E(X) = a + bE(Y) \mbox{ and } E((X - a - bY)Y) = 0. \end{aligned} $$
(9.6)

These two equations are the same as (9.1)–(9.2). We call the identities (9.6) the projection property.

Fig. 9.6 L[X|Y ] is the projection of X onto \(\mathcal {L}(Y)\)

Figure 9.7 illustrates the projection when

$$\displaystyle \begin{aligned} X = \mathcal{N}(0, 1) \mbox{ and } Y = X + Z \mbox{ where } Z = \mathcal{N}(0, \sigma^2). \end{aligned}$$

In this figure, the length of Z is equal to \(\sqrt {E(Z^2)} = \sigma \), the length of X is \(\sqrt {E(X^2)} = 1\) and the vectors X and Z are orthogonal because E(XZ) = 0.

Fig. 9.7 Example of projection

We see that the triangles \(0 \hat X X\) and 0XY  are similar. Hence,

$$\displaystyle \begin{aligned} \frac{||\hat X||}{||X||} = \frac{||X||}{||Y||}, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} \frac{||\hat X||}{1} = \frac{1}{\sqrt{1 + \sigma^2}} = \frac{||Y||}{1 + \sigma^2}, \end{aligned}$$

since \(||Y|| = \sqrt {1 + \sigma ^2}.\) This shows that

$$\displaystyle \begin{aligned} \hat X = \frac{1}{1 + \sigma^2} Y. \end{aligned}$$

To see why the projection property implies that L[X|Y ] is the closest point to X in \(\mathcal {L}(Y)\), as suggested by Fig. 9.6, we verify that

$$\displaystyle \begin{aligned} E(|X - L[X|Y]|{}^2) \leq E(|X - h(Y)|{}^2), \end{aligned}$$

for any given h(Y ) = c + dY . The idea of the proof is to verify Pythagoras’ identity on the right triangle with vertices X, L[X|Y ] and h(Y ). We have

$$\displaystyle \begin{aligned} & E(|X - h(Y)|{}^2) = E(|X - L[X|Y] + L[X|Y] - h(Y)|{}^2) \\ &~~~~ = E(|X - L[X|Y]|{}^2) + E(|L[X|Y] - h(Y)|{}^2) \\ &~~~~~~~~~~~ + 2E((X - L[X|Y])(L[X|Y] - h(Y))). \end{aligned} $$

Now, the projection property (9.6) implies that the last term in the above expression is equal to zero. Indeed, L[X|Y ] − h(Y ) is a linear function of Y . It follows that

$$\displaystyle \begin{aligned} E(|X - h(Y)|{}^2) &= E(|X - L[X|Y]|{}^2) + E(|L[X|Y] - h(Y)|{}^2) \\ &\geq E(|X - L[X|Y]|{}^2), \end{aligned} $$

as was to be proved.

9.4 Linear Regression

Assume now that, instead of knowing the joint distribution of (X, Y ), we observe K i.i.d. samples (X 1, Y 1), …, (X K, Y K) of these random variables. Our goal is still to construct a function g(Y ) = a + bY  so that

$$\displaystyle \begin{aligned} E(|X - a - bY|{}^2) \end{aligned}$$

is minimized. We do this by choosing a and b to minimize the sum of the squares of the errors based on the samples. That is, we choose a and b to minimize

$$\displaystyle \begin{aligned} \sum_{k=1}^K |X_k - a - bY_k|{}^2. \end{aligned}$$

To do this, we set to zero the derivatives of this sum with respect to a and b. Algebra shows that the resulting values of a and b are such that

$$\displaystyle \begin{aligned} a + bY = E_K(X) + \frac{\mbox{cov}_K(X, Y)}{\mbox{var}_K(Y)} (Y - E_K(Y)), \end{aligned} $$
(9.7)

where we defined

$$\displaystyle \begin{aligned} & E_K(X) = \frac{1}{K} \sum_{k=1}^K X_k, E_K(Y) = \frac{1}{K} \sum_{k=1}^K Y_k,\\ & \mbox{cov}_K(X, Y) = \frac{1}{K} \sum_{k=1}^K X_k Y_k - E_K(X)E_K(Y), \\ & \mbox{var}_K(Y) = \frac{1}{K} \sum_{k=1}^K Y_k^2 - E_K(Y)^2. \end{aligned} $$

That is, the expression (9.7) is the same as (9.3), except that the expectation is replaced by the sample mean. The expression (9.7) is called the linear regression of X over Y . It is shown in Fig. 9.8.

Fig. 9.8 The linear regression of X over Y
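The sample-based formulas above translate directly into a few lines of code. Here is a minimal sketch of (9.7); the synthetic data set is an arbitrary stand-in for observed samples.

```python
import numpy as np

def linear_regression(x, y):
    """Return (a, b) such that a + b*Y is the linear regression of X over Y,
    computed from the sample mean, covariance, and variance as in (9.7)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov_k = np.mean(x * y) - x.mean() * y.mean()
    var_k = np.mean(y ** 2) - y.mean() ** 2
    b = cov_k / var_k
    a = x.mean() - b * y.mean()
    return a, b

# example with synthetic samples (arbitrary model)
rng = np.random.default_rng(3)
y = rng.uniform(0, 1, size=1000)
x = 1.0 + 2.0 * y + 0.3 * rng.normal(size=1000)
print(linear_regression(x, y))    # roughly (1.0, 2.0)
```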

One has the following result.

Theorem 9.2 (Linear Regression Converges to LLSE)

As the number of samples increases, the linear regression approaches the LLSE. \({\blacksquare }\)

Proof

As K →∞, one has, by the Strong Law of Large Numbers,

$$\displaystyle \begin{aligned} & E_K(X) \rightarrow E(X), E_K(Y) \rightarrow E(Y), \\ & \mbox{cov}_K(X, Y) \rightarrow \mbox{cov}(X, Y), \mbox{var}_K(Y) \rightarrow \mbox{var}(Y). \end{aligned} $$

Combined with the expressions for the linear regression and the LLSE, these properties imply the result. □

Formula (9.3) and the linear regression provide an intuitive meaning of the covariance cov(X, Y ). If this covariance is zero, then L[X|Y ] does not depend on Y . If it is positive (negative), it increases (decreases, respectively) with Y . Thus, cov(X, Y ) measures a form of dependency in terms of linear regression. For instance, the random variables in Fig. 9.9 are uncorrelated since L[X|Y ] does not depend on Y .

Fig. 9.9 The random variables X and Y  are uncorrelated. Note that they are not independent

9.5 A Note on Overfitting

In the previous section, we examined the problem of finding the linear function a + bY  that best approximates X, in the mean squared error sense. We could develop the corresponding theory for quadratic approximations a + bY + cY 2, or for polynomial approximations of a given degree. The ideas would be the same and one would have a similar projection interpretation.

In principle, a higher degree polynomial approximates X better than a lower degree one since there are more such polynomials. The question of fitting the parameters with a given number of observations is more complex.

Assume you observe N data points {(X n, Y n), n = 1, …, N}. If the values Y n are all distinct, one can define the function g(⋅) by g(Y n) = X n for n = 1, …, N. This function achieves zero mean squared error on the observed data points. What, then, is the point of looking for a linear function, or a quadratic, or some polynomial of a given degree? Why not simply define g(Y n) = X n?

Remember that the goal of the estimation is to discover a function g(⋅) that is likely to work well for data points we have not yet observed. For instance, we hope that E(c(X N+1, g(Y N+1))) is small, where (X N+1, Y N+1) has the same distribution as the samples (X n, Y n) we have observed for n = 1, …, N.

If we define g(Y n) = X n, this does not tell us how to calculate g(Y N+1) for a value Y N+1 we have not observed. However, if we construct a polynomial g(⋅) of a given degree based on the N samples, then we can calculate g(Y N+1). The key observation is that a higher degree polynomial may not be a better estimate because it tends to fit noise instead of important statistics.

As a simple illustration of overfitting, say that we observe (X 1, Y 1) and Y 2. We want to guess X 2. Assume that the samples X n, Y n are all independent and U[−1, 1]. If we guess \(\hat X_2 = 0\), the mean squared error is \(E((X_2 - \hat X_2)^2) = E(X_2^2) = 1/3\). If we use the guess \(\hat X_2 = X_1\) based on the observations, then \(E((X_2 - \hat X_2)^2) = E((X_2 - X_1)^2) = 2/3\). Hence, ignoring the observation is better than taking it into account.

The practical question is how to detect overfitting. For instance, how does one determine whether a linear regression is better than a quadratic regression? A simple test is as follows. Say you observed N samples {(X n, Y n), n = 1, …, N}. You remove sample n and compute a linear regression using the N − 1 other samples. You use that regression to calculate the estimate \(\hat X_n\) of X n based on Y n. You then compute the squared error \((X_n - \hat X_n)^2\). You repeat that procedure for n = 1, …, N and add up the squared errors. You then use the same procedure for a quadratic regression and you compare.
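One possible implementation of this leave-one-out comparison is sketched below, using numpy's polynomial fitting; the data-generating model is an arbitrary example, not part of the procedure.

```python
import numpy as np

def loo_error(x, y, degree):
    """Leave-one-out sum of squared errors for a polynomial fit of the given degree."""
    n = len(x)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i                   # drop sample i
        coeffs = np.polyfit(y[mask], x[mask], degree)
        x_hat = np.polyval(coeffs, y[i])           # estimate X_i from Y_i
        total += (x[i] - x_hat) ** 2
    return total

rng = np.random.default_rng(4)
y = rng.uniform(0, 1, size=50)
x = 1 + 2 * y + 0.2 * rng.normal(size=50)          # data are close to linear

print(loo_error(x, y, degree=1))                   # linear regression
print(loo_error(x, y, degree=2))                   # quadratic regression
```

For data generated as above, the linear fit typically wins this comparison, even though the quadratic fit has a smaller error on the training samples.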

9.6 MMSE

For now, assume that we know the joint distribution of (X, Y ) and consider the problem of finding the function g(Y ) that minimizes

$$\displaystyle \begin{aligned} E(|X - g(Y)|{}^2), \end{aligned}$$

over all possible functions g(⋅). The minimizing function is called the MMSE of X given Y . We have the following theorem:

Theorem 9.3 (The MMSE Is the Conditional Expectation)

The MMSE of X given Y  is given by

$$\displaystyle \begin{aligned} g(Y) = E[X|Y], \end{aligned}$$

where E[X|Y ] is the conditional expectation of X given Y . \({\blacksquare }\)

Before proving this result, we need to define the conditional expectation.

Definition 9.2 (Conditional Expectation)

The conditional expectation of X given Y  is defined by

$$\displaystyle \begin{aligned} E[X|Y=y] = \int_{- \infty}^\infty x f_{X|Y}[x|y]dx, \end{aligned}$$

where

$$\displaystyle \begin{aligned} f_{X|Y}[x|y] := \frac{f_{X, Y}(x, y)}{f_Y(y)} \end{aligned}$$

is the conditional density of X given Y . ◇

Figure 9.10 illustrates the conditional expectation. That figure assumes that the pair (X, Y ) is picked uniformly in the shaded area. Thus, if one observes that Y ∈ (y, y + dy), the point X is uniformly distributed along the segment that cuts the shaded area at Y = y. Accordingly, the average value of X is the mid-point of that segment, as indicated in the figure. The dashed red line shows how that mean value depends on Y  and it defines E[X|Y ].

Fig. 9.10 The conditional expectation E[X|Y ] when the pair (X, Y ) is picked uniformly in the shaded area

The following result is a direct consequence of the definition.

Lemma 9.4 (Orthogonality Property of MMSE)

  1. (a)

    For any function ϕ(⋅), one has

    $$\displaystyle \begin{aligned} E((X - E[X|Y])\phi(Y)) = 0.\end{aligned} $$
    (9.8)
  2. (b)

    Moreover, if the function g(Y ) is such that

    $$\displaystyle \begin{aligned} E((X - g(Y))\phi(Y)) = 0, \forall \phi(\cdot),\end{aligned} $$
    (9.9)

    then g(Y ) = E[X|Y ].

Proof

  1. (a)

    To verify (9.8) note that

    $$\displaystyle \begin{aligned} & E(E[X|Y]\phi(Y)) = \int_{- \infty}^\infty E[X|Y=y]\phi(y)f_Y(y)dy \\ &~~~~ = \int_{- \infty}^\infty \int_{- \infty}^\infty x \frac{f_{X, Y}(x, y)}{f_Y(y)} dx \phi(y)f_Y(y)dy \\ &~~~~ =\int_{- \infty}^\infty \int_{- \infty}^\infty x \phi(y) f_{X, Y}(x, y) dx dy \\ &~~~~ = E(X\phi(Y)), \end{aligned} $$

    which proves (9.8).

  2. (b)

    To prove the second part of the lemma, note that

    $$\displaystyle \begin{aligned} & E(|g(Y) - E[X|Y]|{}^2) \\ &~~~ = E((g(Y) - E[X|Y])\{(g(Y) - X) - (E[X|Y] - X)\}) = 0, \end{aligned} $$

    because of (9.8) and (9.9) with ϕ(Y ) = g(Y ) − E[X|Y ].

    Note that the second part of the lemma simply says that the projection property characterizes uniquely the conditional expectation. In other words, there is only one projection of X onto \(\mathcal {G}(Y)\).

We can now prove the theorem.

Proof of Theorem 9.3

The identity (9.8) is the projection property. It states that X − E[X|Y ] is orthogonal to the set \(\mathcal {G}(Y)\) of functions of Y , as shown in Fig. 9.11.

Fig. 9.11 The conditional expectation E[X|Y ] as the projection of X on the set \(\mathcal {G}(Y)\) of functions of Y

In particular, it is orthogonal to h(Y ) − E[X|Y ]. As in the case of the LLSE, this projection property implies that

$$\displaystyle \begin{aligned} E(|X - h(Y)|{}^2) \geq E(|X - E[X|Y]|{}^2), \end{aligned}$$

for any function h(⋅). This implies that E[X|Y ] is indeed the MMSE of X given Y . □

From the definition, we see how to calculate E[X|Y ] from the conditional density of X given Y . However, in many cases one can calculate E[X|Y ] more simply. One approach is to use the following properties of conditional expectation.

Theorem 9.5 (Properties of Conditional Expectation)

  1. (a)

    Linearity:

    $$\displaystyle \begin{aligned} E[a_1 X_1 + a_2 X_2 | Y] = a_1 E[X_1 |Y] + a_2 E[X_2 | Y]; \end{aligned}$$
  2. (b)

    Factoring Known Values:

    $$\displaystyle \begin{aligned} E[ h(Y)X | Y] = h(Y) E[X|Y]; \end{aligned}$$
  3. (c)

    Independence: If X and Y  are independent, then

    $$\displaystyle \begin{aligned} E[X|Y] = E(X). \end{aligned}$$
  4. (d)

    Smoothing:

    $$\displaystyle \begin{aligned} E(E[X | Y]) = E(X); \end{aligned}$$
  5. (e)

    Tower:

    $$\displaystyle \begin{aligned} E[E[X | Y, Z] | Y] = E[X|Y]. \end{aligned}$$

\({\blacksquare }\)

Proof

  1. (a)

    By Lemma 9.4(b), it suffices to show that

    $$\displaystyle \begin{aligned} a_1 X_1 + a_2 X_2 - (a_1 E[X_1 | Y] + a_2 E[X_2|Y]) \end{aligned}$$

    is orthogonal to \(\mathcal {G}(Y)\). But this is immediate since it is the sum of two terms

    $$\displaystyle \begin{aligned} a_i (X_i - E[X_i|Y]) \end{aligned}$$

    for i = 1, 2 that are orthogonal to \(\mathcal {G}(Y)\).

  2. (b)

    By Lemma 9.4(b), it suffices to show that

    $$\displaystyle \begin{aligned} h(Y) X - h(Y) E[X|Y] \end{aligned}$$

    is orthogonal to \(\mathcal {G}(Y)\), i.e., that

    $$\displaystyle \begin{aligned} E((h(Y) X - h(Y) E[X|Y])\phi(Y)) = 0, \forall \phi(\cdot). \end{aligned}$$

    Now,

    $$\displaystyle \begin{aligned} E((h(Y) X - h(Y) E[X|Y])\phi(Y)) = E((X - E[X|Y]) h(Y) \phi(Y)) = 0, \end{aligned}$$

    because X − E[X|Y ] is orthogonal to \(\mathcal {G}(Y)\) and therefore to h(Y )ϕ(Y ).

  3. (c)

    By Lemma 9.4(b), it suffices to show that

    $$\displaystyle \begin{aligned} X - E(X) \end{aligned}$$

    is orthogonal to \(\mathcal {G}(Y)\). Now,

    $$\displaystyle \begin{aligned} E((X - E(X))\phi(Y)) = E(X - E(X))E(\phi(Y)) = 0. \end{aligned}$$

    The first equality follows from the fact that X − E(X) and ϕ(Y ) are independent since they are functions of independent random variables.Footnote 4

  4. (d)

    Letting ϕ(Y ) = 1 in (9.8), we find

    $$\displaystyle \begin{aligned} E(X - E[X|Y]) = 0, \end{aligned}$$

    which is the identity we wanted to prove.

  5. (e)

    The projection property states that E[W|Y ] = V  if V  is a function of Y  and if W − V  is orthogonal to \(\mathcal {G}(Y)\). Applying this characterization to W = E[X|Y, Z] and V = E[X|Y ], we find that to show that E[E[X|Y, Z]|Y ] = E[X|Y ], it suffices to show that E[X|Y, Z] − E[X|Y ] is orthogonal to \(\mathcal {G}(Y)\). That is, we should show that

    $$\displaystyle \begin{aligned} E(h(Y)(E[X|Y,Z] - E[X|Y])) = 0 \end{aligned}$$

    for any function h(Y ). But E(h(Y )(X − E[X|Y, Z])) = 0 by the projection property, because h(Y ) is some function of (Y, Z). Also, E(h(Y )(X − E[X|Y ])) = 0, also by the projection property. Hence,

    $$\displaystyle \begin{aligned} E(h(Y)(E[X|Y,Z] - E[X|Y])) = E(h(Y)(X - E[X|Y])) - E(h(Y)(X - E[X|Y,Z])) = 0. \end{aligned} $$

As an example, assume that X, Y, Z are i.i.d. U[0, 1]. We want to calculate

$$\displaystyle \begin{aligned} E[ (X + 2Y)^2 | Y]. \end{aligned}$$

We find

$$\displaystyle \begin{aligned} & E[ (X + 2Y)^2 | Y] = E[ X^2 + 4Y^2 + 4XY | Y ] \\ &~~~ = E[X^2 | Y] + 4E[Y^2 |Y] + 4E[XY|Y] , \mbox{ by linearity} \\ &~~~ = E(X^2) + 4E[Y^2 |Y] + 4E[XY|Y] , \mbox{ by independence} \\ &~~~ = E(X^2) + 4Y^2 + 4YE[X|Y] , \mbox{ by factoring known values} \\ &~~~ = E(X^2) + 4Y^2 + 4YE(X), \mbox{ by independence} \\ &~~~ = \frac{1}{3} + 4Y^2 + 2Y, \mbox{ since } X =_D U[0, 1]. \end{aligned} $$

Note that calculating the conditional density of (X + 2Y )2 given Y  would have been quite a bit more tedious.
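A quick simulation check of this calculation: fix a value y of Y, sample many independent copies of X, and compare the empirical average of (X + 2y)2 with 1∕3 + 4y 2 + 2y. The particular value of y below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
y = 0.7                                   # an arbitrary observed value of Y
x = rng.uniform(0, 1, size=1_000_000)     # X is U[0, 1], independent of Y

empirical = np.mean((x + 2 * y) ** 2)
formula = 1 / 3 + 4 * y ** 2 + 2 * y
print(empirical, formula)                 # the two values should be close
```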

In some situations, one may be able to exploit symmetry to evaluate the conditional expectation. Here is one representative example. Assume that X, Y, Z are i.i.d. Then, we claim that

$$\displaystyle \begin{aligned} E[X | X + Y + Z] = \frac{1}{3}(X + Y + Z). \end{aligned} $$
(9.10)

To see this, note that, by symmetry,

$$\displaystyle \begin{aligned} E[X | X + Y + Z] = E[Y | X + Y + Z] = E[Z | X + Y + Z]. \end{aligned}$$

Denote by V  the common value of these random variables. Note that their sum is

$$\displaystyle \begin{aligned} 3V = E[X + Y + Z | X + Y + Z], \end{aligned}$$

by linearity. Thus, 3V = X + Y + Z, which proves our claim.

\({\square }\)

9.6.1 MMSE for Jointly Gaussian

In general, L[X|Y ] ≠ E[X|Y ]. As a trivial example, let Y =D U[−1, 1] and X = Y 2. Then E[X|Y ] = Y 2 and L[X|Y ] = E(X) = 1∕3, since cov(X, Y ) = E(XY ) − E(X)E(Y ) = 0.

Figure 9.12 recalls that E[X|Y ] is the projection of X onto \(\mathcal {G}(Y)\), whereas L[X|Y ] is the projection of X onto \(\mathcal {L}(Y)\). Since \(\mathcal {L}(Y)\) is a subspace of \(\mathcal {G}(Y)\), one expects the two projections to be different, in general.

Fig. 9.12 The MMSE and LLSE are generally different

However, there are examples where E[X|Y ] happens to be linear. We saw one such example in (9.10) and it is not difficult to construct many other examples.

There is an important class of problems where this occurs. It is when X and Y  are jointly Gaussian. We state that result as a theorem.

Theorem 9.6 (MMSE for Jointly Gaussian RVs)

Let X, Y  be jointly Gaussian random variables. Then

$$\displaystyle \begin{aligned} E[X|Y] = L[X|Y] = E(X) + \frac{\mathit{\mbox{cov}}(X, Y)}{\mathit{\mbox{var}}(Y)}(Y - E(Y)). \end{aligned}$$

\({\blacksquare }\)

Proof

Note that

$$\displaystyle \begin{aligned} X - L[X|Y] \mbox{ and } Y \mbox{ are uncorrelated}.\end{aligned} $$

Also, X − L[X|Y ] and Y  are two linear functions of the jointly Gaussian random variables X and Y . Consequently, they are jointly Gaussian by Theorem 8.4 and they are independent by Theorem 8.3.

Consequently,

$$\displaystyle \begin{aligned} X - L[X|Y] \mbox{ and } \phi(Y) \mbox{ are independent},\end{aligned} $$

for any ϕ(⋅), because functions of independent random variables are independent by Theorem B.11 in Appendix B. Hence,

$$\displaystyle \begin{aligned} X - L[X|Y] \mbox{ and } \phi(Y) \mbox{ are uncorrelated},\end{aligned} $$

for any ϕ(⋅) by Theorem B.4 of Appendix B.

This shows that

$$\displaystyle \begin{aligned} X - L[X|Y] \mbox{ is orthogonal to } \mathcal{G}(Y), \end{aligned}$$

and, consequently, that L[X|Y ] = E[X|Y ]. □

9.7 Vector Case

So far, to keep notation at a minimum, we have considered L[X|Y ] and E[X|Y ] when X and Y  are single random variables. In this section, we discuss the vector case, i.e., L[X|Y] and E[X|Y] when X and Y are random vectors. The only difficulty is one of notation. Conceptually, there is nothing new.

Definition 9.3 (LLSE of Random Vectors)

Let X and Y be random vectors of dimensions m and n, respectively. Then

$$\displaystyle \begin{aligned} L[\mathbf{X} | \mathbf{Y}] = A \mathbf{Y} + \mathbf{b} \end{aligned}$$

where A is the m × n matrix and b the vector in \(\Re ^m\) that minimize

$$\displaystyle \begin{aligned} E(|| \mathbf{X} - A \mathbf{Y} - \mathbf{b}||{}^2). \end{aligned}$$

Thus, as in the scalar case, the LLSE is the linear function of the observations that best approximates X, in the mean squared error sense.

Before proceeding, review the notation of Sect. B.6 for Σ Y and cov(X, Y).

Theorem 9.7 (LLSE of Vectors)

Let X and Y be random vectors such that Σ Y is nonsingular.

  1. (a)

    Then

    $$\displaystyle \begin{aligned} L[\mathbf{X} | \mathbf{Y}] = E(\mathbf{X}) + \mathit{\mbox{cov}}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1}(\mathbf{Y} - E(\mathbf{Y})). \end{aligned} $$
    (9.11)
  2. (b)

    Moreover,

    $$\displaystyle \begin{aligned} E(||\mathbf{X} - L[\mathbf{X} | \mathbf{Y}] ||{}^2) = \mathit{\mbox{tr}}(\varSigma_{\mathbf{X}} - \mathit{\mbox{cov}}(\mathbf{X},\mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \mathit{\mbox{cov}}(\mathbf{Y},\mathbf{X})). \end{aligned} $$
    (9.12)

    In this expression, for a square matrix M, tr(M) :=∑i M i,i is the trace of the matrix.

\({\blacksquare }\)

Proof

  1. (a)

    The proof is similar to the scalar case. Let Z be the right-hand side of (9.11). One shows that the error X −Z is orthogonal to all the linear functions of Y. One then uses that fact to show that X is closer to Z than to any other linear function h(Y) of Y.

    First we show the orthogonality. Since E(X −Z) = 0, we have

    $$\displaystyle \begin{aligned} E((\mathbf{X} - \mathbf{Z})(B \mathbf{Y} + \mathbf{b})') = E((\mathbf{X} - \mathbf{Z})(B \mathbf{Y})') = E((\mathbf{X} - \mathbf{Z})\mathbf{Y}')B' . \end{aligned}$$

    Next, we show that E((X −Z)Y′) = 0. To see this, note that

    $$\displaystyle \begin{aligned} & E((\mathbf{X} - \mathbf{Z})\mathbf{Y}') = E((\mathbf{X} - \mathbf{Z})(\mathbf{Y} - E(\mathbf{Y}))')\\ &~~~ = E((\mathbf{X} - E(\mathbf{X}))(\mathbf{Y} - E(\mathbf{Y}))') \\ &~~~~~~~~~ - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} E((\mathbf{Y} - E(\mathbf{Y}))(\mathbf{Y} - E(\mathbf{Y}))')\\ &~~~ = \mbox{cov}(\mathbf{X}, \mathbf{Y}) - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \varSigma_{\mathbf{Y}} = 0. \end{aligned} $$

    Second, we show that Z is closer to X than any linear h(Y). We have

    $$\displaystyle \begin{aligned} & E(||\mathbf{X} - h(\mathbf{Y})||{}^2) = E((\mathbf{X} - h(\mathbf{Y}))'(\mathbf{X} - h(\mathbf{Y}))) \\ &~~~ = E((\mathbf{X} - \mathbf{Z} + \mathbf{Z} - h(\mathbf{Y}))'(\mathbf{X} - \mathbf{Z} + \mathbf{Z} - h(\mathbf{Y}))) \\ &~~~ = E(||\mathbf{X} - \mathbf{Z}||{}^2) + E(||\mathbf{Z} - h(\mathbf{Y})||{}^2) + 2 E((\mathbf{X} - \mathbf{Z})'(\mathbf{Z} - h(\mathbf{Y}))). \end{aligned} $$

    We claim that the last term is equal to zero. To see this, note that

    $$\displaystyle \begin{aligned} E((\mathbf{X} - \mathbf{Z})'(\mathbf{Z} - h(\mathbf{Y}))) = \sum_{i=1}^m E((X_i - Z_i)(Z_i - h_i(\mathbf{Y}))). \end{aligned}$$

    Also,

    $$\displaystyle \begin{aligned} E((X_i - Z_i)(Z_i - h_i(\mathbf{Y}))) = E((\mathbf{X} - \mathbf{Z})(\mathbf{Z} - h(\mathbf{Y}))')_{i, i} \end{aligned}$$

    and the matrix E((X −Z)(Z − h(Y))′) is equal to zero since X −Z is orthogonal to any linear function of Y and, in particular, to Z − h(Y).

    (Note: an alternative way of showing that the last term is equal to zero is to write

    $$\displaystyle \begin{aligned} E((\mathbf{X} - \mathbf{Z})'(\mathbf{Z} - h(\mathbf{Y}))) = \mbox{tr}\, E((\mathbf{X} - \mathbf{Z})(\mathbf{Z} - h(\mathbf{Y}))') = 0, \end{aligned}$$

    where the first equality comes from the fact that tr(AB) = tr(BA) for matrices of compatible dimensions.)

  2. (b)

    Let \(\tilde {\mathbf {X}} := \mathbf {X} - L[\mathbf {X} | \mathbf {Y}]\) be the estimation error. Thus,

    $$\displaystyle \begin{aligned} \tilde{\mathbf{X}} = \mathbf{X} - E(\mathbf{X}) - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1}(\mathbf{Y} - E(\mathbf{Y})). \end{aligned}$$

    Now, if V and W are two zero-mean random vectors and M a matrix,

    $$\displaystyle \begin{aligned} & \mbox{cov}(\mathbf{V} - M \mathbf{W}) = E((\mathbf{V} - M \mathbf{W})(\mathbf{V} - M \mathbf{W})') \\ &~~~~ = E(\mathbf{V} \mathbf{V}' - 2 M \mathbf{W} \mathbf{V}' + M \mathbf{W} \mathbf{W}' M') \\ &~~~~ = \mbox{cov}(\mathbf{V}) - 2M \mbox{cov}( \mathbf{W}, \mathbf{V}) + M \mbox{cov}(\mathbf{W})M'. \end{aligned} $$

    Hence,

    $$\displaystyle \begin{aligned} &\mbox{cov}(\tilde{\mathbf{X}}) = \varSigma_{\mathbf{X}} - 2 \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \mbox{cov}(\mathbf{Y}, \mathbf{X}) \\ &~~~~~~~~~~~~~~~~ + \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \varSigma_{\mathbf{Y}} \varSigma_{\mathbf{Y}}^{-1} \mbox{cov}(\mathbf{Y}, \mathbf{X}) \\ &~~~~~~~~~~~ = \varSigma_{\mathbf{X}} - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \mbox{cov}(\mathbf{Y}, \mathbf{X}). \end{aligned} $$

    To conclude the proof, note that, for a zero-mean random vector V,

    $$\displaystyle \begin{aligned} E(||\mathbf{V}||{}^2) = E( \mbox{tr}(\mathbf{V} \mathbf{V}')) = \mbox{tr}(E(\mathbf{V} \mathbf{V}')) = \mbox{tr}(\varSigma_{\mathbf{V}}). \end{aligned}$$
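Formula (9.11) is easy to implement. The sketch below estimates the required moments from samples, so it is really the vector version of linear regression; the test model X = MY + noise is an arbitrary choice used only to check the output.

```python
import numpy as np

def vector_llse(x_samples, y_samples):
    """Return (b, A) with L[X|Y] = b + A @ Y, estimated from samples.
    x_samples has shape (K, m), y_samples has shape (K, n)."""
    mx, my = x_samples.mean(axis=0), y_samples.mean(axis=0)
    xc, yc = x_samples - mx, y_samples - my
    cov_xy = xc.T @ yc / len(xc)             # m x n matrix cov(X, Y)
    sigma_y = yc.T @ yc / len(yc)            # n x n matrix, assumed nonsingular
    A = cov_xy @ np.linalg.inv(sigma_y)
    b = mx - A @ my
    return b, A

# test on an arbitrary model X = M Y + noise
rng = np.random.default_rng(6)
K, m, n = 100_000, 2, 3
M = rng.normal(size=(m, n))
Y = rng.normal(size=(K, n))
X = Y @ M.T + 0.1 * rng.normal(size=(K, m))
b, A = vector_llse(X, Y)
print(np.round(A - M, 2))                    # should be close to the zero matrix
```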

9.8 Kalman Filter

The Kalman Filter is an algorithm to update the estimate of the state of a system using its output, as sketched in Fig. 9.13. The system has a state X(n) and an output Y (n) at time n = 0, 1, …. These variables are defined through a system of linear equations:

Fig. 9.13 The Kalman Filter computes the LLSE of the state of a system given the past of its output

$$\displaystyle \begin{aligned} & X(n+1) = AX(n) + V(n), n \geq 0; {} \end{aligned} $$
(9.13)
$$\displaystyle \begin{aligned} & Y(n) = CX(n) + W(n), n \geq 0. {} \end{aligned} $$
(9.14)

In these equations, the random variables {X(0), V (n), W(n), n ≥ 0} are all orthogonal and zero-mean. The covariance of V (n) is Σ V and that of W(n) is Σ W. The filter is developed when the variables are random vectors and A, C are matrices of compatible dimensions.

The objective is to derive recursive equations to calculate

$$\displaystyle \begin{aligned} \hat X(n) = L[X(n) | Y(0), \ldots, Y(n)], n \geq 0. \end{aligned}$$

9.8.1 The Filter

Here is the result, due to Rudolf Kalman (Fig. 9.14), which we prove in the next chapter. Do not panic when you see the equations!

Fig. 9.14 Rudolf Kalman, 1930–2016

Theorem 9.8 (Kalman Filter)

One has

$$\displaystyle \begin{aligned} & \hat X(n) = A \hat X(n-1) + K_n [Y(n) - CA \hat X(n-1)] {} \end{aligned} $$
(9.15)
$$\displaystyle \begin{aligned} & K_n = S_n C'[CS_nC' + \varSigma_W]^{-1} {} \end{aligned} $$
(9.16)
$$\displaystyle \begin{aligned} & S_n = A \varSigma_{n-1}A' + \varSigma_V {} \end{aligned} $$
(9.17)
$$\displaystyle \begin{aligned} & \varSigma_n = (I - K_nC)S_n. {} \end{aligned} $$
(9.18)

Moreover,

$$\displaystyle \begin{aligned} S_n = \mathit{\mbox{cov}}(X(n) - A \hat X(n-1)) \mathit{\mbox{ and }} \varSigma_n = \mathit{\mbox{cov}}(X(n) - \hat X(n)). \end{aligned} $$
(9.19)

\({\blacksquare }\)

We will give a number of examples of this result. But first, let us make a few comments.

  • Equations (9.15)–(9.18) are recursive: the estimate at time n is a simple linear function of the estimate at time n − 1 and of the new observation Y (n).

  • The matrix K n is the filter gain. It can be precomputed at time 0.

  • The covariance of the error \(X(n) - \hat X(n)\), Σ n, can also be precomputed at time 0: it does not depend on the observations {Y (0), …, Y (n)}. The estimate \(\hat X(n)\) depends on these observations but the mean squared error does not.

  • If X(0) and the noise random variables are Gaussian, then the Kalman filter computes the MMSE.

  • Finally, observe that these equations, even though they look a bit complicated, can be programmed in a few lines. This filter is elementary to implement and this explains its popularity.
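Indeed, here is a minimal sketch of one update of (9.15)–(9.18) in Python; the function name and the convention of passing \(\hat X(n-1)\) and Σ n−1 explicitly are ours, and the initial values are left to the user.

```python
import numpy as np

def kalman_step(x_hat, sigma, y, A, C, sigma_v, sigma_w):
    """One Kalman filter update, following (9.15)-(9.18).

    x_hat, sigma : previous estimate hat X(n-1) and error covariance Sigma_{n-1}
    y            : new observation Y(n)
    Returns the pair (hat X(n), Sigma_n).
    """
    S = A @ sigma @ A.T + sigma_v                         # (9.17)
    K = S @ C.T @ np.linalg.inv(C @ S @ C.T + sigma_w)    # (9.16)
    x_pred = A @ x_hat
    x_new = x_pred + K @ (y - C @ x_pred)                 # (9.15)
    sigma_new = (np.eye(len(x_hat)) - K @ C) @ S          # (9.18)
    return x_new, sigma_new
```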

9.8.2 Examples

In this section, we examine a few examples of the Kalman filter.

9.8.2.1 Random Walk

The first example is a filter to track a “random walk” by making noisy observations.

Let

$$\displaystyle \begin{aligned} & X(n+1) = X(n) + V(n) {} \end{aligned} $$
(9.20)
$$\displaystyle \begin{aligned} & Y(n) = X(n) + W(n) {} \end{aligned} $$
(9.21)
$$\displaystyle \begin{aligned} & \mbox{var}(V(n)) = 0.04, \mbox{var}(W(n)) = 0.09. {} \end{aligned} $$
(9.22)

That is, X(n) has orthogonal increments and it is observed with orthogonal noise. Figure 9.15 shows a simulation of the filter. The left-hand part of the figure shows that the estimate tracks the state with a bounded error. The middle part of the figure shows the variance of the error, which can be precomputed. The right-hand part of the figure shows the filter with the time-varying gain (in blue) and the filter with the limiting gain (in green). In the limit, the filter with the constant gain performs as well as the one with the time-varying gain.

Fig. 9.15 The Kalman Filter for (9.20)–(9.22)
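For instance, this random walk example can be reproduced with the kalman_step sketch from Sect. 9.8.1. All matrices are 1 × 1 here; Gaussian noise is used only to generate sample paths, since the filter itself needs only the variances, and the initial values of \(\hat X\) and Σ are arbitrary.

```python
import numpy as np
# kalman_step is the sketch from Sect. 9.8.1

rng = np.random.default_rng(7)
A = C = np.array([[1.0]])
sigma_v, sigma_w = np.array([[0.04]]), np.array([[0.09]])

x = np.array([0.0])                                   # true state X(n)
x_hat, sigma = np.array([0.0]), np.array([[1.0]])     # arbitrary initial values
for n in range(100):
    x = A @ x + np.sqrt(0.04) * rng.normal(size=1)    # state equation (9.20)
    y = C @ x + np.sqrt(0.09) * rng.normal(size=1)    # observation (9.21)
    x_hat, sigma = kalman_step(x_hat, sigma, y, A, C, sigma_v, sigma_w)

print(x, x_hat, sigma)    # the error covariance sigma settles quickly
```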

9.8.2.2 Random Walk with Unknown Drift

In the second example, one tracks a random walk that has an unknown drift. This system is modeled by the following equations:

$$\displaystyle \begin{aligned} & X_1(n+1) = X_1(n) + X_2(n) + V(n) {} \end{aligned} $$
(9.23)
$$\displaystyle \begin{aligned} & X_2(n+1) = X_2(n) {} \end{aligned} $$
(9.24)
$$\displaystyle \begin{aligned} & Y(n) = X_1(n) + W(n) {} \end{aligned} $$
(9.25)
$$\displaystyle \begin{aligned} & \mbox{var}(V(n)) = 1, \mbox{var}(W(n)) = 0.25. {} \end{aligned} $$
(9.26)

In this model, X 2(n) is the constant but unknown drift and X 1(n) is the value of the “random walk.” Figure 9.16 shows a simulation of the filter. It shows that the filter eventually estimates the drift and that the estimate of the position of the walk is quite accurate.

Fig. 9.16 The Kalman Filter for (9.23)–(9.26)
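The same kalman_step sketch handles this example with a two-dimensional state X(n) = (X 1(n), X 2(n)); only the matrices change. The zero entries in Σ V encode the fact that noise enters only the first state equation.

```python
import numpy as np

A = np.array([[1.0, 1.0],            # X1(n+1) = X1(n) + X2(n) + V(n)
              [0.0, 1.0]])           # X2(n+1) = X2(n)
C = np.array([[1.0, 0.0]])           # Y(n) = X1(n) + W(n)
sigma_v = np.array([[1.0, 0.0],
                    [0.0, 0.0]])     # var(V(n)) = 1, no noise on the drift
sigma_w = np.array([[0.25]])         # var(W(n)) = 0.25
# these matrices plug directly into kalman_step, with a 2-dimensional x_hat
```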

9.8.2.3 Random Walk with Changing Drift

In the third example, one tracks a random walk that has changing drift. This system is modeled by the following equations:

$$\displaystyle \begin{aligned} & X_1(n+1) = X_1(n) + X_2(n) + V_1(n) {} \end{aligned} $$
(9.27)
$$\displaystyle \begin{aligned} & X_2(n+1) = X_2(n) + V_2(n) {} \end{aligned} $$
(9.28)
$$\displaystyle \begin{aligned} & Y(n) = X_1(n) + W(n) {} \end{aligned} $$
(9.29)
$$\displaystyle \begin{aligned} & \mbox{var}(V_1(n)) = 1, \mbox{var}(V_2(n)) = 0.01, {} \end{aligned} $$
(9.30)
$$\displaystyle \begin{aligned} & \mbox{var}(W(n)) = 0.25. {} \end{aligned} $$
(9.31)

In this model, X 2(n) is the varying drift and X 1(n) is the value of the “random walk.” Figure 9.17 shows a simulation of the filter. It shows that the filter tries to track the drift and that the estimate of the position of the walk is quite accurate.

Fig. 9.17 The Kalman Filter for (9.27)–(9.31)

9.8.2.4 Falling Object

In the fourth example, one tracks a falling object. The elevation Z(n) of that falling object follows the equation

$$\displaystyle \begin{aligned} Z(n) = Z(0) + S(0)n - gn^2/2 + V(n), n \geq 0, \end{aligned}$$

where S(0) is the initial vertical velocity of the object and g is the gravitational constant at the surface of the earth. In this expression, V (n) is some noise that perturbs the motion. We observe η(n) = Z(n) + W(n), where W(n) is some noise.

Since the term − gn 2∕2 is known, we consider

$$\displaystyle \begin{aligned} X_1(n) = Z(n) + gn^2/2 \mbox{ and } Y(n) = \eta(n) + gn^2/2. \end{aligned}$$

With this change of variables, the system is described by the following equations:

$$\displaystyle \begin{aligned} & X_1(n+1) = X_1(n) + X_2(n) + V(n) {} \end{aligned} $$
(9.32)
$$\displaystyle \begin{aligned} & X_2(n+1) = X_2(n) {} \end{aligned} $$
(9.33)
$$\displaystyle \begin{aligned} & Y(n) = X_1(n) + W(n) {} \end{aligned} $$
(9.34)
$$\displaystyle \begin{aligned} & \mbox{var}(V(n)) = 100 \mbox{ and } \mbox{var}(W(n)) = 1600. {} \end{aligned} $$
(9.35)

Figure 9.18 shows a simulation of the filter that computes \(\hat X_1(n)\), from which we subtract gn 2∕2 to get an estimate of the actual altitude Z(n) of the object.

Fig. 9.18 The Kalman Filter for (9.32)–(9.35)

9.9 Summary

  • LLSE, linear regression, and MMSE;

  • Projection characterization;

  • MMSE of jointly Gaussian is linear;

  • Kalman Filter.

9.9.1 Key Equations and Formulas

  • LLSE: L[X|Y ] = E(X) + cov(X, Y )var(Y )−1(Y − E(Y )) (Theorem 9.1)

  • Orthogonality (LLSE): X − L[X|Y ] ⊥ a + bY  (9.6)

  • Linear Regression: converges to L[X|Y ] (Theorem 9.2)

  • Conditional Expectation: definition of E[X|Y ] (Definition 9.2)

  • Orthogonality (MMSE): X − E[X|Y ] ⊥ g(Y ) (Lemma 9.4)

  • MMSE = CE: MMSE[X|Y ] = E[X|Y ] (Theorem 9.3)

  • Properties of CE: linearity, smoothing, etc. (Theorem 9.5)

  • CE for J.G.: if X, Y  are jointly Gaussian, then E[X|Y ] = L[X|Y ] (Theorem 9.6)

  • LLSE of vectors: \(L[\mathbf {X} | \mathbf {Y}] = E(\mathbf {X}) + \varSigma _{\mathbf {X}, \mathbf {Y}} \varSigma _{\mathbf {Y}}^{-1} (\mathbf {Y} - E(\mathbf {Y}))\) (Theorem 9.7)

  • Kalman Filter: \(\hat X(n) = A \hat X(n-1)+ K_n [Y(n) - CA \hat X(n-1)]\) (Theorem 9.8)

9.10 References

LLSE, MMSE, and linear regression are covered in Chapter 4 of Bertsekas and Tsitsiklis (2008). The Kalman filter was introduced in Kalman (1960). The text (Brown and Hwang 1996) is an easy introduction to Kalman filters with many examples.

9.11 Problems

Problem 9.1

Assume that \(X_n = Y_n + 2Y_n^2 + Z_n\) where the Y n and Z n are i.i.d. U[0, 1]. Let also X = X 1 and Y = Y 1.

  1. (a)

    Calculate L[X|Y ] and E((X − L[X|Y ])2);

  2. (b)

    Calculate Q[X|Y ] and E((X − Q[X|Y ])2), where Q[X|Y ] is the quadratic least squares estimate of X given Y .

  3. (c)

    Design a stochastic gradient algorithm to compute Q[X|Y ] and implement it in Python.

Problem 9.2

We want to compare the off-line and on-line methods for computing L[X|Y ]. Use the setup of the previous problem.

  1. (a)

    Generate N = 1, 000 samples and compute the linear regression of X given Y . Say that this is X = aY + b.

  2. (b)

    Using the same samples, compute the linear fit recursively using the stochastic gradient algorithm. Say that you obtain X = cY + d.

  3. (c)

    Evaluate the quality of the two estimates you obtained by computing E((X − aY − b)2) and E((X − cY − d)2).

Problem 9.3

The random variables X, Y, Z are jointly Gaussian,

$$\displaystyle \begin{aligned}(X,Y,Z)^T \sim N\left((0,0,0)^T, \left[ \begin{array}{ccc} 2 & 2 & 1 \\ 2 &4 & 2 \\ 1 & 2 & 1 \end{array} \right]\right).\end{aligned}$$
  1. (a)

    Find E[X|Y, Z];

  2. (b)

    Find the variance of the error.

Problem 9.4

You observe three i.i.d. samples X 1, X 2, X 3 from the distribution \(f_{X|\theta }(x) = \frac 12 e^{-|x-\theta |}\), where \(\theta \in \mathbb {R}\) is the parameter to estimate. Find MLE[θ|X 1, X 2, X 3].

Problem 9.5

  1. (a)

    Given three independent N(0, 1) random variables X, Y , and Z, find the following minimum mean square estimator:

    $$\displaystyle \begin{aligned} E[X+3Y | 2Y + 5Z]. \end{aligned}$$
  2. (b)

    For the above, compute the mean squared error of the estimator.

Problem 9.6

Given two independent N(0, 1) random variables X and Y, find the following linear least square estimator:

$$\displaystyle \begin{aligned} L[X | X^2 + Y]. \end{aligned}$$

Hint: The characteristic function of a N(0, 1) random variable X is as follows:

$$\displaystyle \begin{aligned} E(e^{isX}) = e^{-\frac{1}{2}s^2}. \end{aligned}$$

Problem 9.7

Consider a sensor network with n sensors that are making observations Y n = (Y 1, …, Y n) of a signal X where

$$\displaystyle \begin{aligned} Y_i = a X + Z_i, i = 1, \ldots, n. \end{aligned}$$

In this expression, X =D N(0, 1), Z i =D N(0, σ 2), for i = 1, …, n and these random variables are mutually independent.

  1. (a)

    Compute the MMSE estimator of X given Y n.

  2. (b)

    Compute the mean squared error \(\sigma _n^2\) of the estimator.

  3. (c)

    Assume each measurement has a cost C and that we want to minimize

    $$\displaystyle \begin{aligned} nC + \sigma_n^2. \end{aligned}$$

    Find the best value of n.

  4. (d)

    Assume that we can decide at each step whether to make another measurement or to stop. Our goal is to minimize the expected value of

    $$\displaystyle \begin{aligned} \nu C + \sigma_\nu^2, \end{aligned}$$

    where ν is the random number of measurements. Do you think there is a decision rule that will do better than the deterministic value n derived in (c)? Explain.

Problem 9.8

We want to use a Kalman filter to detect a change in the popularity of a word in twitter messages. To do this, we create a model of the number Y n of times that particular word appears in twitter messages on day n. The model is as follows:

$$\displaystyle \begin{aligned} & X(n+1) = X(n) \\ & Y(n) = X(n) + W(n), \end{aligned} $$

where the W(n) are zero-mean and uncorrelated. This model means that we are observing numbers of occurrences with an unknown mean X(n) that is supposed to be constant. The idea is that if the mean actually changes, we should be able to detect it by noticing that the errors between \(\hat Y(n)\) and Y (n) are large. Propose an algorithm for detecting that change and implement it in Python.

Problem 9.9

The random variable X is exponentially distributed with mean 1. Given X, the random variable Y  is exponentially distributed with rate X.

  1. (a)

    Calculate E[Y |X].

  2. (b)

    Calculate E[X|Y ].

Problem 9.10

The random variables X, Y, Z are i.i.d. \(\mathcal {N}(0, 1)\).

  1. (a)

    Find L[X 2 + Y 2|X + Y ];

  2. (b)

    Find E[X + 2Y |X + 3Y + 4Z];

  3. (c)

    Find E[(X + Y )2|X − Y ].

Problem 9.11

Let (V n, n ≥ 0) be i.i.d. N(0, σ 2) and independent of X 0 = N(0, u 2). Define

$$\displaystyle \begin{aligned} X_{n+1} = aX_n + V_n, ~n \geq 0. \end{aligned}$$
  1. 1.

    What is the distribution of X n for n ≥ 1?

  2. 2.

    Find E[X n+m|X n] for 0 ≤ n < n + m.

  3. 3.

    Find u so that the distribution of X n is the same for all n ≥ 0.

Problem 9.12

Let θ =D U[0, 1], and given θ, the random variable X is uniformly distributed in [0, θ]. Find E[θ|X].

Problem 9.13

Let (X, Y )T ∼ N([0;0], [3, 1;1, 1]). Find E[X 2|Y ].

Problem 9.14

Let (X, Y, Z)T ∼ N([0;0;0], [5, 3, 1;3, 9, 3;1, 3, 1]). Find E[X|Y, Z].

Problem 9.15

Consider arbitrary random variables X and Y . Prove the following property:

$$\displaystyle \begin{aligned} \mbox{var}(Y) = E(\mbox{var}[Y | X]) + \mbox{var}(E[Y|X]). \end{aligned}$$

Problem 9.16

Let the joint p.d.f. of two random variables X and Y  be

$$\displaystyle \begin{aligned} f_{X,Y}(x,y)=\frac 14(2x+y)1\{0 \leq x \leq 1\}1\{0 \leq y \leq 2\}. \end{aligned}$$

First show that this is a valid joint p.d.f. Suppose you observe Y  drawn from this joint density. Find MMSE[X|Y ].

Problem 9.17

Given four independent N(0, 1) random variables X, Y , Z, and V , find the following minimum mean square estimate:

$$\displaystyle \begin{aligned}E[X + 2Y + 3Z|Y + 5Z + 4V ].\end{aligned}$$

Find the mean squared error of the estimate.

Problem 9.18

Assume that X, Y  are two random variables that are such that E[X|Y ] = L[X|Y ]. Then, it must be that (choose the correct answers, if any)

  • X and Y  are jointly Gaussian;

  • X can be written as X = aY + Z where Z is a random variable that is independent of Y ;

  • E((X − L[X|Y ])Y k) = 0 for all k ≥ 0;

  • \(E((X - L[X|Y]) \sin {}(3Y + 5)) = 0\).

Problem 9.19

In a linear system with independent Gaussian noise, with state X n and observation Y n, the Kalman filter computes (choose the correct answers, if any)

  • MLE[Y n|X n];

  • MLE[X n|Y n];

  • MAP[Y n|X n];

  • MAP[X n|Y n];

  • E[X n|Y n];

  • E[Y n|X n];

  • E[X n|Y n];

  • E[Y n|X n].

Problem 9.20

Let (X, Y) where Y  = [Y 1, Y 2, Y 3, Y 4] be N(μ, Σ) with μ′ = [2, 1, 3, 4, 5] and

$$\displaystyle \begin{aligned} \varSigma = \left[ \begin{array}{c c c c c} 3 & 4 & 6 & 12 & 8 \\ 4 & 6 & 9 & 18 & 12 \\ 6 & 9 & 14 & 28 & 18 \\ 12 & 18 & 28 & 56 & 36 \\ 8 & 12 & 18 & 36 & 24 \end{array} \right]. \end{aligned}$$

Find E[X|Y].

Problem 9.21

Let X = A V and Y = C V where V = N(0, I).

Find E[X|Y].

Problem 9.22

Given θ ∈{0, 1}, X = N(0, Σ θ) where

$$\displaystyle \begin{aligned} \varSigma_0 = \left[ \begin{array}{c c} 1 & 0 \\ 0 & 1 \end{array} \right] \mbox{ and } \varSigma_ 1 = \left[ \begin{array}{c c} 1 & \rho \\ \rho & 1 \end{array} \right], \end{aligned}$$

where ρ > 0 is given.

Find MLE[θ|X].

Problem 9.23

Given two independent N(0, 1) random variables X and Y, find the following linear least square estimator:

$$\displaystyle \begin{aligned} L[X | X^3 + Y]. \end{aligned}$$

Hint: The characteristic function of a N(0, 1) random variable X is as follows:

$$\displaystyle \begin{aligned} E(e^{isX}) = e^{-\frac{1}{2}s^2}. \end{aligned}$$

Problem 9.24

Let X, Y, Z be i.i.d. \(\mathcal {N}(0, 1)\). Find

$$\displaystyle \begin{aligned} E[X | X + Y, X + Z, Y - Z]. \end{aligned}$$

Hint: Argue that the observation Y − Z is redundant.

Problem 9.25

Let X, Y 1, Y 2, Y 3 be zero-mean with covariance matrix

$$\displaystyle \begin{aligned} \varSigma = \left[ \begin{array}{c c c c} 10 & 6 & 5 & 16 \\ 6 & 9 & 6 & 21 \\ 5 & 6 & 6 & 18 \\ 16 & 21 & 18 & 57 \end{array} \right]. \end{aligned}$$

Find L[X|Y 1, Y 2, Y 3]. Hint: You will observe that Σ Y is singular. This means that at least one of the observations Y 1, Y 2, or Y 3 is redundant, i.e., is a linear combination of the others. This implies that L[X|Y 1, Y 2, Y 3] = L[X|Y 1, Y 2].