Abstract
This chapter explains how to estimate an unobserved random variable or vector from available observations. This problem arises in many examples, as illustrated in Sect. 9.1. The basic problem is defined in Sect. 9.2. One commonly used approach is the linear least squares estimate explained in Sect. 9.3. A related notion is linear regression, covered in Sect. 9.4. Section 9.5 comments on the problem of overfitting. Sections 9.6 and 9.7 explain the minimum mean squares estimate, which may be a nonlinear function of the observations, and the remarkable fact that it is linear for jointly Gaussian random variables. Section 9.8 is devoted to the Kalman filter, which is a recursive algorithm for calculating the linear least squares estimate of the state of a system given previous observations.
Application: Estimation, Tracking
Topics: LLSE, MMSE, Kalman Filter
9.1 Examples
A GPS receiver uses the signals it gets from satellites to estimate its location (Fig. 9.1). Temperature and pressure sensors provide signals that a computer uses to estimate the state of a chemical reactor.
A radar measures electromagnetic waves that an object reflects and uses the measurements to estimate the position of that object (Fig. 9.2).
Similarly, your car’s control computer estimates the state of the car from measurements it gets from various sensors (Fig. 9.3).
9.2 Estimation Problem
The basic estimation problem can be formulated as follows. There is a pair of continuous random variables (X, Y ). The problem is to estimate X from the observed value of Y .
This problem admits a few different formulations:
- Known Distribution: We know the joint distribution of (X, Y );
- Off-Line: We observe a set of sample values of (X, Y );
- On-Line: We observe successive values of samples of (X, Y ).
The objective is to choose the inference function g(⋅) to minimize the expected error C(g) where
In this expression, \(c(X, \hat X)\) is the cost of guessing \(\hat X\) when the actual value is X. A standard example is
We will also study the case when \(X \in \Re ^d\) for d > 1. In such a situation, one uses \(c(X, \hat X) = \|X - \hat X\|^2\). If the function g(⋅) can be arbitrary, the function that minimizes C(g) is the Minimum Mean Squares Estimate (MMSE) of X given Y . If the function g(⋅) is restricted to be linear, i.e., of the form a + BY , the linear function that minimizes C(g) is the Linear Least Squares Estimate (LLSE) of X given Y . One may also restrict g(⋅) to be a polynomial of a given degree. For instance, one may define the Quadratic Least Squares Estimate (QLSE) of X given Y . See Fig. 9.4.
As we will see, a general method for the off-line inference problem is to choose a parametric class of functions \(\{g_w, w \in \Re ^d\}\) and to then minimize the empirical error
over the parameters w. Here, the (X k, Y k) are the observed samples. The parametric function could be linear, polynomial, or a neural network.
For the on-line problem, one also chooses a similar parametric family of functions and one uses a stochastic gradient descent algorithm of the form
where ∇ is the gradient with respect to w and γ > 0 is a small step size. The justification for this approach is that, since γ is small, by the SLLN, the update tends to be in the direction of
which would correspond to a gradient algorithm to minimize C(g w).
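To make the on-line update concrete, here is a minimal sketch for a linear family \(g_w(y) = a + by\). The data model (X = 2 + 3Y plus small Gaussian noise) is an invented example, not from the text; the loop is the stochastic gradient update described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data model, for illustration only: X = 2 + 3Y + noise.
def sample():
    y = rng.uniform(0.0, 1.0)
    return 2.0 + 3.0 * y + rng.normal(0.0, 0.1), y

a, b = 0.0, 0.0   # parameters w = (a, b) of g_w(y) = a + b*y
gamma = 0.05      # small step size

for _ in range(20000):
    x, y = sample()
    err = (a + b * y) - x        # g_w(Y_k) - X_k
    # stochastic gradient step on the squared error err**2
    a -= gamma * 2.0 * err
    b -= gamma * 2.0 * err * y

print(a, b)  # drifts toward the minimizer (2, 3)
```

Because the step size is small and constant, the parameters hover around the minimizer instead of converging exactly, as the SLLN argument above suggests.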
9.3 Linear Least Squares Estimates
In this section, we study the linear least squares estimates. Recall the setup that we explained in the previous section. There is a pair (X, Y ) of random variables with some joint distribution and the problem is to find the function g(Y ) = a + bY that minimizes
We consider the cases where the distribution is known, where a set of samples has been observed, and where one observes one sample at a time.
Assume that the joint distribution of (X, Y ) is known. This means that we know the joint cumulative distribution function (j.c.d.f.) \(F_{X, Y}(x, y)\).
We are looking for the function g(Y ) = a + bY that minimizes
We denote this function by L[X|Y ]. Thus, we have the following definition.
Definition 9.1 (Linear Least Squares Estimate (LLSE))
The LLSE of X given Y , denoted by L[X|Y ], is the linear function a + bY that minimizes
◇
Note that
To find the values of a and b that minimize that expression, we set to zero the partial derivatives with respect to a and b. This gives the following two equations:
Solving these equations for a and b, we find that
where we used the identities
We summarize this result as a theorem.
Theorem 9.1 (Linear Least Squares Estimate)
One has
$$\displaystyle L[X|Y] = E(X) + \frac{\mbox{cov}(X, Y)}{\mbox{var}(Y)}(Y - E(Y)).$$
\({\blacksquare }\)
As a first example, assume that
$$\displaystyle Y = \alpha X + Z,$$
where X and Z are zero-mean and independent. In this case, we find
$$\displaystyle \mbox{cov}(X, Y) = \alpha E(X^2) \mbox{ and } \mbox{var}(Y) = \alpha^2 E(X^2) + E(Z^2).$$
Hence,
$$\displaystyle L[X|Y] = \frac{\alpha E(X^2)}{\alpha^2 E(X^2) + E(Z^2)}\, Y = \frac{1}{\alpha} \, \frac{\mbox{SNR}}{1 + \mbox{SNR}}\, Y,$$
where
$$\displaystyle \mbox{SNR} := \frac{E(\alpha^2 X^2)}{E(Z^2)}$$
is the signal-to-noise ratio, i.e., the power \(E(\alpha^2 X^2)\) of the signal in Y divided by the power \(E(Z^2)\) of the noise. Note that if the SNR is small, then L[X|Y ] is close to zero, which is the best guess about X if one does not make any observation. Also, if the SNR is very large, then \(L[X|Y] \approx \alpha^{-1} Y\), which is the correct guess if Z = 0.
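A quick numerical check of this example (a sketch: the Gaussian distributions and the values of α and the noise variance are arbitrary choices): the LLSE slope cov(X, Y )∕var(Y ) agrees with (1∕α) SNR∕(1 + SNR), consistent with the small- and large-SNR limits noted above.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 2.0, 200_000

# Zero-mean independent signal and noise; the model Y = alpha*X + Z is assumed.
x = rng.normal(0.0, 1.0, n)
z = rng.normal(0.0, 0.5, n)
y = alpha * x + z

b = np.cov(x, y)[0, 1] / np.var(y)            # LLSE slope cov(X, Y)/var(Y)
snr = np.mean((alpha * x) ** 2) / np.mean(z ** 2)
b_theory = (1.0 / alpha) * snr / (1.0 + snr)  # (1/alpha) * SNR/(1 + SNR)

print(b, b_theory)  # the two slopes agree closely
```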
As a second example, assume that
$$\displaystyle X = Y^2,$$
where Y =D U[0, 1]. Then,
$$\displaystyle \mbox{cov}(X, Y) = E(Y^3) - E(Y^2)E(Y) = \frac{1}{4} - \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{12} \mbox{ and } \mbox{var}(Y) = \frac{1}{12}.$$
Hence,
$$\displaystyle L[X|Y] = E(X) + \frac{\mbox{cov}(X, Y)}{\mbox{var}(Y)}(Y - E(Y)) = \frac{1}{3} + \left( Y - \frac{1}{2} \right) = Y - \frac{1}{6}.$$
This estimate is sketched in Fig. 9.5. Obviously, if one observes Y , one can compute X. However, recall that L[X|Y ] is restricted to being a linear function of Y .
9.3.1 Projection
There is an insightful interpretation of L[X|Y ] as a projection that also helps understand more complex estimates. This interpretation is that L[X|Y ] is the projection of X onto the set \(\mathcal {L}(Y)\) of linear functions of Y .
This interpretation is sketched in Fig. 9.6. In that figure, random variables are represented by points and \(\mathcal {L}(Y)\) is shown as a plane since the linear combination of points in that set is again in the set. In the figure, the square of the length of a vector from a random variable V to another random variable W is E(|V − W|2). Also, we say that two vectors V and W are orthogonal if E(V W) = 0. Thus, L[X|Y ] = a + bY is the projection of X onto \(\mathcal {L}(Y)\) if X − L[X|Y ] is orthogonal to every linear function of Y , i.e., if
Equivalently,
These two equations are the same as (9.1)–(9.2). We call the identities (9.6) the projection property.
Figure 9.7 illustrates the projection when
In this figure, the length of Z is equal to \(\sqrt {E(Z^2)} = \sigma \), the length of X is \(\sqrt {E(X^2)} = 1\) and the vectors X and Z are orthogonal because E(XZ) = 0.
We see that the triangles \(0 \hat X X\) and 0XY are similar. Hence,
so that
since \(||Y|| = \sqrt {1 + \sigma ^2}.\) This shows that
To see why the projection property implies that L[X|Y ] is the closest point to X in \(\mathcal {L}(Y)\), as suggested by Fig. 9.6, we verify that
for any given h(Y ) = c + dY . The idea of the proof is to verify Pythagoras’ identity on the right triangle with vertices X, L[X|Y ] and h(Y ). We have
Now, the projection property (9.6) implies that the last term in the above expression is equal to zero. Indeed, L[X|Y ] − h(Y ) is a linear function of Y . It follows that
as was to be proved.
9.4 Linear Regression
Assume now that, instead of knowing the joint distribution of (X, Y ), we observe K i.i.d. samples (X 1, Y 1), …, (X K, Y K) of these random variables. Our goal is still to construct a function g(Y ) = a + bY so that
is minimized. We do this by choosing a and b to minimize the sum of the squares of the errors based on the samples. That is, we choose a and b to minimize
To do this, we set to zero the derivatives of this sum with respect to a and b. Algebra shows that the resulting values of a and b are such that
where we defined
That is, the expression (9.7) is the same as (9.3), except that the expectation is replaced by the sample mean. The expression (9.7) is called the linear regression of X over Y . It is shown in Fig. 9.8.
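As a sketch, the sample-mean version of the LLSE coefficients can be computed directly (the function name is ours, not the book's):

```python
import numpy as np

def linear_regression(x, y):
    """Compute a and b of the linear regression a + b*y by replacing
    the expectations in the LLSE formula with sample means."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((y - y.mean()) ** 2).sum()
    a = x.mean() - b * y.mean()
    return a, b

a, b = linear_regression([1.0, 2.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
print(a, b)  # a ≈ 1.1, b = 0.6 for this small made-up sample
```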
One has the following result.
Theorem 9.2 (Linear Regression Converges to LLSE)
As the number of samples increases, the linear regression approaches the LLSE. \({\blacksquare }\)
Proof
As K →∞, one has, by the Strong Law of Large Numbers,
Combined with the expressions for the linear regression and the LLSE, these properties imply the result. □
Formula (9.3) and the linear regression provide an intuitive meaning of the covariance cov(X, Y ). If this covariance is zero, then L[X|Y ] does not depend on Y . If it is positive (negative), it increases (decreases, respectively) with Y . Thus, cov(X, Y ) measures a form of dependency in terms of linear regression. For instance, the random variables in Fig. 9.9 are uncorrelated since L[X|Y ] does not depend on Y .
9.5 A Note on Overfitting
In the previous section, we examined the problem of finding the linear function a + bY that best approximates X, in the mean squared error sense. We could develop the corresponding theory for quadratic approximations a + bY + cY 2, or for polynomial approximations of a given degree. The ideas would be the same and one would have a similar projection interpretation.
In principle, a higher-degree polynomial approximates X at least as well as a lower-degree one, since the class of approximating functions is larger. The question of fitting the parameters with a given number of observations is more complex.
Assume you observe N data points {(X n, Y n), n = 1, …, N}. If the values Y n are distinct, one can define the function g(⋅) by g(Y n) = X n for n = 1, …, N. This function achieves zero mean squared error on the observed samples. What is then the point of looking for a linear function, or a quadratic, or some polynomial of a given degree? Why not simply define g(Y n) = X n?
Remember that the goal of the estimation is to discover a function g(⋅) that is likely to work well for data points we have not yet observed. For instance, we hope that E(c(X N+1, g(Y N+1))) is small, where (X N+1, Y N+1) has the same distribution as the samples (X n, Y n) we have observed for n = 1, …, N.
If we define g(Y n) = X n, this does not tell us how to calculate g(Y N+1) for a value Y N+1 we have not observed. However, if we construct a polynomial g(⋅) of a given degree based on the N samples, then we can calculate g(Y N+1). The key observation is that a higher degree polynomial may not be a better estimate because it tends to fit noise instead of important statistics.
As a simple illustration of overfitting, say that we observe (X 1, Y 1) and Y 2. We want to guess X 2. Assume that the samples X n, Y n are all independent and U[−1, 1]. If we guess \(\hat X_2 = 0\), the mean squared error is \(E((X_2 - \hat X_2)^2) = E(X_2^2) = 1/3\). If we use the guess \(\hat X_2 = X_1\) based on the observations, then \(E((X_2 - \hat X_2)^2) = E((X_2 - X_1)^2) = 2/3\). Hence, ignoring the observation is better than taking it into account.
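A simulation confirms these two mean squared errors (a sketch; the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x1, x2 = rng.uniform(-1.0, 1.0, (2, n))

mse_zero = np.mean(x2 ** 2)          # guess 0:  E(X2^2)        = 1/3
mse_copy = np.mean((x2 - x1) ** 2)   # guess X1: E((X2 - X1)^2) = 2/3
print(mse_zero, mse_copy)  # ignoring the observation wins
```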
The practical question is how to detect overfitting. For instance, how does one determine whether a linear regression is better than a quadratic regression? A simple test is as follows. Say you observed N samples {(X n, Y n), n = 1, …, N}. You remove sample n and compute a linear regression using the N − 1 other samples. You use that regression to calculate the estimate \(\hat X_n\) of X n based on Y n. You then compute the squared error \((X_n - \hat X_n)^2\). You repeat that procedure for n = 1, …, N and add up the squared errors. You then use the same procedure for a quadratic regression and you compare.
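The leave-one-out procedure just described can be sketched as follows. The data model (X linear in Y plus noise) and the two degrees compared are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def loo_error(x, y_obs, deg):
    """Sum of leave-one-out squared errors for a degree-`deg`
    polynomial regression of X on Y, as described in the text."""
    n = len(x)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        p = np.polyfit(y_obs[mask], x[mask], deg)   # fit without sample i
        total += (x[i] - np.polyval(p, y_obs[i])) ** 2
    return total

# Invented data: X is linear in Y plus noise, so a low degree should win.
y_obs = rng.uniform(-1.0, 1.0, 40)
x = 1.0 + 2.0 * y_obs + rng.normal(0.0, 0.3, 40)

e1 = loo_error(x, y_obs, 1)
e6 = loo_error(x, y_obs, 6)
print(e1, e6)  # the degree-6 fit typically has the larger error
```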
9.6 MMSE
For now, assume that we know the joint distribution of (X, Y ) and consider the problem of finding the function g(Y ) that minimizes
over all possible functions g(⋅). The best function is called the MMSE of X given Y . We have the following theorem:
Theorem 9.3 (The MMSE Is the Conditional Expectation)
The MMSE of X given Y is given by
where E[X|Y ] is the conditional expectation of X given Y . \({\blacksquare }\)
Before proving this result, we need to define the conditional expectation.
Definition 9.2 (Conditional Expectation)
The conditional expectation of X given Y is defined by
where
is the conditional density of X given Y . ◇
Figure 9.10 illustrates the conditional expectation. That figure assumes that the pair (X, Y ) is picked uniformly in the shaded area. Thus, if one observes that Y ∈ (y, y + dy), the point X is uniformly distributed along the segment that cuts the shaded area at Y = y. Accordingly, the average value of X is the mid-point of that segment, as indicated in the figure. The dashed red line shows how that mean value depends on Y and it defines E[X|Y ].
The following result is a direct consequence of the definition.
Lemma 9.4 (Orthogonality Property of MMSE)
-
(a)
For any function ϕ(⋅), one has
$$\displaystyle \begin{aligned} E((X - E[X|Y])\phi(Y)) = 0.\end{aligned} $$(9.8) -
(b)
Moreover, if the function g(Y ) is such that
$$\displaystyle \begin{aligned} E((X - g(Y))\phi(Y)) = 0, \forall \phi(\cdot),\end{aligned} $$(9.9)then g(Y ) = E[X|Y ].
Proof
-
(a)
To verify (9.8) note that
$$\displaystyle \begin{aligned} & E(E[X|Y]\phi(Y)) = \int_{- \infty}^\infty E[X|Y=y]\phi(y)f_Y(y)dy \\ &~~~~ = \int_{- \infty}^\infty \int_{- \infty}^\infty x \frac{f_{X, Y}(x, y)}{f_Y(y)} dx \phi(y)f_Y(y)dy \\ &~~~~ =\int_{- \infty}^\infty \int_{- \infty}^\infty x \phi(y) f_{X, Y}(x, y) dx dy \\ &~~~~ = E(X\phi(Y)), \end{aligned} $$which proves (9.8).
-
(b)
To prove the second part of the lemma, note that
$$\displaystyle \begin{aligned} & E(|g(Y) - E[X|Y]|{}^2) \\ &~~~ = E((g(Y) - E[X|Y])\{(g(Y) - X) - (E[X|Y] - X)\}) = 0, \end{aligned} $$because of (9.8) and (9.9) with ϕ(Y ) = g(Y ) − E[X|Y ].
Note that the second part of the lemma simply says that the projection property uniquely characterizes the conditional expectation. In other words, there is only one projection of X onto \(\mathcal {G}(Y)\).
□
We can now prove the theorem.
Proof of Theorem 9.3
The identity (9.8) is the projection property. It states that X − E[X|Y ] is orthogonal to the set \(\mathcal {G}(Y)\) of functions of Y , as shown in Fig. 9.11.
In particular, it is orthogonal to h(Y ) − E[X|Y ]. As in the case of the LLSE, this projection property implies that
for any function h(⋅). This implies that E[X|Y ] is indeed the MMSE of X given Y . □
From the definition, we see how to calculate E[X|Y ] from the conditional density of X given Y . However, in many cases one can calculate E[X|Y ] more simply. One approach is to use the following properties of conditional expectation.
Theorem 9.5 (Properties of Conditional Expectation)
-
(a)
Linearity:
$$\displaystyle \begin{aligned} E[a_1 X_1 + a_2 X_2 | Y] = a_1 E[X_1 |Y] + a_2 E[X_2 | Y]; \end{aligned}$$ -
(b)
Factoring Known Values:
$$\displaystyle \begin{aligned} E[ h(Y)X | Y] = h(Y) E[X|Y]; \end{aligned}$$ -
(c)
Independence: If X and Y are independent, then
$$\displaystyle \begin{aligned} E[X|Y] = E(X). \end{aligned}$$ -
(d)
Smoothing:
$$\displaystyle \begin{aligned} E(E[X | Y]) = E(X); \end{aligned}$$ -
(e)
Tower:
$$\displaystyle \begin{aligned} E[E[X | Y, Z] | Y] = E[X|Y]. \end{aligned}$$
\({\blacksquare }\)
Proof
-
(a)
By Lemma 9.4(b), it suffices to show that
$$\displaystyle \begin{aligned} a_1 X_1 + a_2 X_2 - (a_1 E[X_1 | Y] + a_2 E[X_2|Y]) \end{aligned}$$is orthogonal to \(\mathcal {G}(Y)\). But this is immediate since it is the sum of two terms
$$\displaystyle \begin{aligned} a_i (X_i - E[X_i|Y]) \end{aligned}$$for i = 1, 2 that are orthogonal to \(\mathcal {G}(Y)\).
-
(b)
By Lemma 9.4(b), it suffices to show that
$$\displaystyle \begin{aligned} h(Y) X - h(Y) E[X|Y] \end{aligned}$$is orthogonal to \(\mathcal {G}(Y)\), i.e., that
$$\displaystyle \begin{aligned} E((h(Y) X - h(Y) E[X|Y])\phi(Y)) = 0, \forall \phi(\cdot). \end{aligned}$$Now,
$$\displaystyle \begin{aligned} E((h(Y) X - h(Y) E[X|Y])\phi(Y)) = E((X - E[X|Y]) h(Y) \phi(Y)) = 0, \end{aligned}$$because X − E[X|Y ] is orthogonal to \(\mathcal {G}(Y)\) and therefore to h(Y )ϕ(Y ).
-
(c)
By Lemma 9.4(b), it suffices to show that
$$\displaystyle \begin{aligned} X - E(X) \end{aligned}$$is orthogonal to \(\mathcal {G}(Y)\). Now,
$$\displaystyle \begin{aligned} E((X - E(X))\phi(Y)) = E(X - E(X))E(\phi(Y)) = 0. \end{aligned}$$The first equality follows from the fact that X − E(X) and ϕ(Y ) are independent since they are functions of independent random variables.
-
(d)
Letting ϕ(Y ) = 1 in (9.8), we find
$$\displaystyle \begin{aligned} E(X - E[X|Y]) = 0, \end{aligned}$$which is the identity we wanted to prove.
-
(e)
The projection property states that E[W|Y ] = V if V is a function of Y and if W − V is orthogonal to \(\mathcal {G}(Y)\). Applying this characterization to W = E[X|Y, Z] and V = E[X|Y ], we find that to show that E[E[X|Y, Z]|Y ] = E[X|Y ], it suffices to show that E[X|Y, Z] − E[X|Y ] is orthogonal to \(\mathcal {G}(Y)\). That is, we should show that
$$\displaystyle \begin{aligned} E(h(Y)(E[X|Y,Z] - E[X|Y])) = 0 \end{aligned}$$for any function h(Y ). But E(h(Y )(X − E[X|Y, Z])) = 0 by the projection property, because h(Y ) is some function of (Y, Z). Also, E(h(Y )(X − E[X|Y ])) = 0, also by the projection property. Hence,
$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle E(h(Y)(E[X|Y,Z] {-} E[X|Y])) {=} E(h(Y)(X {-} E[X|Y]))\\ & &\displaystyle \qquad {-} E(h(Y)(X {-} E[X|Y,Z])) {=} 0. \end{array} \end{aligned} $$
□
As an example, assume that X, Y, Z are i.i.d. U[0, 1]. We want to calculate
We find
Note that calculating the conditional density of (X + 2Y )2 given Y would have been quite a bit more tedious.
In some situations, one may be able to exploit symmetry to evaluate the conditional expectation. Here is one representative example. Assume that X, Y, Z are i.i.d. Then, we claim that
To see this, note that, by symmetry,
Denote by V the common value of these random variables. Note that their sum is
by linearity. Thus, 3V = X + Y + Z, which proves our claim.
\({\square }\)
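One can check this claim numerically by conditioning on S = X + Y + Z falling in small bins (a sketch; the bin width and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
x, y, z = rng.uniform(0.0, 1.0, (3, n))
s = x + y + z

# Compare the empirical mean of X, given that S falls in a small bin,
# with the claimed value E[X | S] = S / 3.
devs = []
for lo in np.arange(0.5, 2.5, 0.25):
    sel = (s >= lo) & (s < lo + 0.25)
    devs.append(abs(x[sel].mean() - s[sel].mean() / 3.0))

max_dev = max(devs)
print(max_dev)  # small: the conditional mean of X tracks S/3
```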
9.6.1 MMSE for Jointly Gaussian
In general, L[X|Y ] ≠ E[X|Y ]. As a trivial example, let Y =D U[−1, 1] and X = Y 2. Then E[X|Y ] = Y 2 and L[X|Y ] = E(X) = 1∕3, since cov(X, Y ) = E(XY ) − E(X)E(Y ) = 0.
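This example is easy to verify numerically (the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.uniform(-1.0, 1.0, 1_000_000)
x = y ** 2

cxy = np.cov(x, y)[0, 1]   # approximately 0: X and Y are uncorrelated
m = x.mean()               # approximately 1/3 = E(X) = L[X|Y]
print(cxy, m)
```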
Figure 9.12 recalls that E[X|Y ] is the projection of X onto \(\mathcal {G}(Y)\), whereas L[X|Y ] is the projection of X onto \(\mathcal {L}(Y)\). Since \(\mathcal {L}(Y)\) is a subspace of \(\mathcal {G}(Y)\), one expects the two projections to be different, in general.
However, there are examples where E[X|Y ] happens to be linear. We saw one such example in (9.10) and it is not difficult to construct many other examples.
There is an important class of problems where this occurs. It is when X and Y are jointly Gaussian. We state that result as a theorem.
Theorem 9.6 (MMSE for Jointly Gaussian RVs)
Let X, Y be jointly Gaussian random variables. Then
\({\blacksquare }\)
Proof
Note that
Also, X − L[X|Y ] and Y are two linear functions of the jointly Gaussian random variables X and Y . Consequently, they are jointly Gaussian by Theorem 8.4 and they are independent by Theorem 8.3.
Consequently,
for any ϕ(⋅), because functions of independent random variables are independent by Theorem B.11 in Appendix B. Hence,
for any ϕ(⋅) by Theorem B.4 of Appendix B.
This shows that
and, consequently, that L[X|Y ] = E[X|Y ]. □
9.7 Vector Case
So far, to keep notation at a minimum, we have considered L[X|Y ] and E[X|Y ] when X and Y are single random variables. In this section, we discuss the vector case, i.e., L[X|Y] and E[X|Y] when X and Y are random vectors. The only difficulty is one of notation. Conceptually, there is nothing new.
Definition 9.3 (LLSE of Random Vectors)
Let X and Y be random vectors of dimensions m and n, respectively. Then
where A is the m × n matrix and b the vector in \(\Re ^m\) that minimize
◇
Thus, as in the scalar case, the LLSE is the linear function of the observations that best approximates X, in the mean squared error sense.
Before proceeding, review the notation of Sect. B.6 for Σ Y and cov(X, Y).
Theorem 9.7 (LLSE of Vectors)
Let X and Y be random vectors such that Σ Y is nonsingular.
-
(a)
Then
$$\displaystyle \begin{aligned} L[\mathbf{X} | \mathbf{Y}] = E(\mathbf{X}) + \mathit{\mbox{cov}}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1}(\mathbf{Y} - E(\mathbf{Y})). \end{aligned} $$(9.11) -
(b)
Moreover,
$$\displaystyle \begin{aligned} E(||\mathbf{X} - L[\mathbf{X} | \mathbf{Y}] ||{}^2) = \mathit{\mbox{tr}}(\varSigma_{\mathbf{X}} - \mathit{\mbox{cov}}(\mathbf{X},\mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \mathit{\mbox{cov}}(\mathbf{Y},\mathbf{X})). \end{aligned} $$(9.12)In this expression, for a square matrix M, tr(M) :=∑i M i,i is the trace of the matrix.
\({\blacksquare }\)
Proof
-
(a)
The proof is similar to the scalar case. Let Z be the right-hand side of (9.11). One shows that the error X −Z is orthogonal to all the linear functions of Y. One then uses that fact to show that X is closer to Z than to any other linear function h(Y) of Y.
First we show the orthogonality. Since E(X −Z) = 0, we have
$$\displaystyle \begin{aligned} E((\mathbf{X} - \mathbf{Z})(B \mathbf{Y} + \mathbf{b})') = E((\mathbf{X} - \mathbf{Z})(B \mathbf{Y})') = E((\mathbf{X} - \mathbf{Z})\mathbf{Y}')B' . \end{aligned}$$Next, we show that E((X −Z)Y ′) = 0. To see this, note that
$$\displaystyle \begin{aligned} & E((\mathbf{X} - \mathbf{Z})\mathbf{Y}') = E((\mathbf{X} - \mathbf{Z})(\mathbf{Y} - E(\mathbf{Y}))')\\ &~~~ = E((\mathbf{X} - E(\mathbf{X}))(\mathbf{Y} - E(\mathbf{Y}))') \\ &~~~~~~~~~ - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} E((\mathbf{Y} - E(\mathbf{Y}))(\mathbf{Y} - E(\mathbf{Y}))')\\ &~~~ = \mbox{cov}(\mathbf{X}, \mathbf{Y}) - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \varSigma_{\mathbf{Y}} = 0. \end{aligned} $$Second, we show that Z is closer to X than any linear h(Y). We have
$$\displaystyle \begin{aligned} & E(||\mathbf{X} - h(\mathbf{Y})||{}^2) = E((\mathbf{X} - h(\mathbf{Y}))'(\mathbf{X} - h(\mathbf{Y}))) \\ &~~~ = E((\mathbf{X} - \mathbf{Z} + \mathbf{Z} - h(\mathbf{Y}))'(\mathbf{X} - \mathbf{Z} + \mathbf{Z} - h(\mathbf{Y}))) \\ &~~~ = E(||\mathbf{X} - \mathbf{Z}||{}^2) + E(||\mathbf{Z} - h(\mathbf{Y})||{}^2) + 2 E((\mathbf{X} - \mathbf{Z})'(\mathbf{Z} - h(\mathbf{Y}))). \end{aligned} $$We claim that the last term is equal to zero. To see this, note that
$$\displaystyle \begin{aligned} E((\mathbf{X} - \mathbf{Z})'(\mathbf{Z} - h(\mathbf{Y})) = \sum_{i=1}^n E((X_i - Z_i)(Z_i - h_i(\mathbf{Y}))). \end{aligned}$$Also,
$$\displaystyle \begin{aligned} E((X_i - Z_i)(Z_i - h_i(\mathbf{Y}))) = E((\mathbf{X} - \mathbf{Z})(\mathbf{Z} - h(\mathbf{Y}))')_{i, i} \end{aligned}$$and the matrix E((X −Z)(Z − h(Y))′) is equal to zero since X −Y is orthogonal to any linear function of Y and, in particular, to Z − h(Y).
(Note: an alternative way of showing that the last term is equal to zero is to write
$$\displaystyle \begin{aligned} E((\mathbf{X} - \mathbf{Z})'(\mathbf{Z} - h(\mathbf{Y})) = \mbox{tr}E((\mathbf{X} - \mathbf{Z})(\mathbf{Z} - h(\mathbf{Y}))') = 0, \end{aligned}$$where the first equality comes from the fact that tr(AB) = tr(BA) for matrices of compatible dimensions.)
-
(b)
Let \(\tilde {\mathbf {X}} := \mathbf {X} - E[\mathbf {X} | \mathbf {Y}]\) be the estimation error. Thus,
$$\displaystyle \begin{aligned} \tilde{\mathbf{X}} = \mathbf{X} - E(\mathbf{X}) - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1}(\mathbf{Y} - E(\mathbf{Y})). \end{aligned}$$Now, if V and W are two zero-mean random vectors and M a matrix,
$$\displaystyle \begin{aligned} & \mbox{cov}(\mathbf{V} - M \mathbf{W}) = E((\mathbf{V} - M \mathbf{W})(\mathbf{V} - M \mathbf{W})') \\ &~~~~ = E(\mathbf{V} \mathbf{V}' - 2 M \mathbf{W} \mathbf{V}' + M \mathbf{W} \mathbf{W}' M') \\ &~~~~ = \mbox{cov}(\mathbf{V}) - 2M \mbox{cov}( \mathbf{W}, \mathbf{V}) + M \mbox{cov}(\mathbf{W})M'. \end{aligned} $$Hence,
$$\displaystyle \begin{aligned} &\mbox{cov}(\tilde{\mathbf{X}}) = \varSigma_{\mathbf{X}} - 2 \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \mbox{cov}(\mathbf{Y}, \mathbf{X}) \\ &~~~~~~~~~~~~~~~~ + \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \varSigma_{\mathbf{Y}} \varSigma_{\mathbf{Y}}^{-1} \mbox{cov}(\mathbf{Y}, \mathbf{X}) \\ &~~~~~~~~~~~ = \varSigma_{\mathbf{X}} - \mbox{cov}(\mathbf{X}, \mathbf{Y}) \varSigma_{\mathbf{Y}}^{-1} \mbox{cov}(\mathbf{Y}, \mathbf{X}). \end{aligned} $$To conclude the proof, note that, for a zero-mean random vector V,
$$\displaystyle \begin{aligned} E(||\mathbf{V}||{}^2) = E( \mbox{tr}(\mathbf{V} \mathbf{V}')) = \mbox{tr}(E(\mathbf{V} \mathbf{V}')) = \mbox{tr}(\varSigma_{\mathbf{V}}). \end{aligned}$$
□
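Formula (9.11) translates directly into code. The sketch below replaces the expectations with sample means (so it is really the vector version of linear regression); the function name and the data model are ours, not the book's.

```python
import numpy as np

def llse(X, Y):
    """Return y -> E(X) + cov(X, Y) Sigma_Y^{-1} (y - E(Y)), with all
    moments estimated from the sample rows of X (n x m) and Y (n x d)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    cov_xy = Xc.T @ Yc / len(X)     # cov(X, Y), m x d
    sigma_y = Yc.T @ Yc / len(Y)    # Sigma_Y, d x d (assumed nonsingular)
    A = cov_xy @ np.linalg.inv(sigma_y)
    return lambda y: mx + A @ (np.asarray(y, float) - my)

# Hypothetical linear model X = B Y + small noise; llse should recover B.
rng = np.random.default_rng(4)
Y = rng.normal(size=(100_000, 2))
B = np.array([[1.0, 2.0], [0.0, 1.0]])
X = Y @ B.T + 0.01 * rng.normal(size=(100_000, 2))
est = llse(X, Y)
print(est([1.0, 1.0]))  # close to B @ [1, 1] = [3, 1]
```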
9.8 Kalman Filter
The Kalman Filter is an algorithm to update the estimate of the state of a system using its output, as sketched in Fig. 9.13. The system has a state X(n) and an output Y (n) at time n = 0, 1, …. These variables are defined through a system of linear equations:
In these equations, the random variables {X(0), V (n), W(n), n ≥ 0} are all orthogonal and zero-mean. The covariance of V (n) is Σ V and that of W(n) is Σ W. The filter is developed when the variables are random vectors and A, C are matrices of compatible dimensions.
The objective is to derive recursive equations to calculate
9.8.1 The Filter
Here is the result, due to Rudolf Kalman (Fig. 9.14), which we prove in the next chapter. Do not panic when you see the equations!
Theorem 9.8 (Kalman Filter)
One has
$$\displaystyle \hat X(n) = A \hat X(n-1) + K_n [Y(n) - C A \hat X(n-1)].$$
Moreover, the gain K n and the error covariance \(\varSigma _n\) are computed recursively by
$$\displaystyle S_n = A \varSigma_{n-1} A' + \varSigma_V, \quad K_n = S_n C'(C S_n C' + \varSigma_W)^{-1}, \quad \varSigma_n = (I - K_n C) S_n.$$
\({\blacksquare }\)
We will give a number of examples of this result. But first, let us make a few comments.
-
Equations (9.15)–(9.18) are recursive: the estimate at time n is a simple linear function of the estimate at time n − 1 and of the new observation Y (n).
-
The matrix K n is the filter gain. It can be precomputed at time 0.
-
The covariance of the error \(X(n) - \hat X(n)\), Σ n, can also be precomputed at time 0: it does not depend on the observations {Y (0), …, Y (n)}. The estimate \(\hat X(n)\) depends on these observations but the mean squared error does not.
-
If X(0) and the noise random variables are Gaussian, then the Kalman filter computes the MMSE.
-
Finally, observe that these equations, even though they look a bit complicated, can be programmed in a few lines. This filter is elementary to implement, which explains its popularity.
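As a sketch of how few lines the filter takes, here is a scalar version for a random walk observed in noise, using the standard Kalman recursions (predict, gain, update); the noise variances are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Scalar model: X(n) = a X(n-1) + V(n), Y(n) = c X(n) + W(n).
a, c = 1.0, 1.0            # random-walk example: a = c = 1
sig_v, sig_w = 1.0, 2.0    # variances of V(n) and W(n)

def kalman_step(x_hat, sigma, y):
    """One step of the standard scalar Kalman recursions; the vector
    form of Theorem 9.8 replaces the division by a matrix inverse."""
    x_pred = a * x_hat                   # predicted state
    s = a * a * sigma + sig_v            # predicted error variance
    k = s * c / (c * c * s + sig_w)      # gain K_n
    x_new = x_pred + k * (y - c * x_pred)
    sigma_new = (1.0 - k * c) * s        # error variance Sigma_n
    return x_new, sigma_new

# Track a simulated random walk.
x, x_hat, sigma = 0.0, 0.0, 0.0
errs = []
for _ in range(5000):
    x = a * x + rng.normal(0.0, np.sqrt(sig_v))
    y = c * x + rng.normal(0.0, np.sqrt(sig_w))
    x_hat, sigma = kalman_step(x_hat, sigma, y)
    errs.append((x - x_hat) ** 2)

print(np.mean(errs[100:]), sigma)  # empirical MSE close to the precomputed sigma
```

Note that sigma converges to a limit, which is why the constant-gain filter mentioned in Sect. 9.8.2.1 performs as well in the limit.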
9.8.2 Examples
In this section, we examine a few examples of the Kalman filter.
9.8.2.1 Random Walk
The first example is a filter to track a “random walk” by making noisy observations.
Let
That is, X(n) has orthogonal increments and it is observed with orthogonal noise. Figure 9.15 shows a simulation of the filter. The left-hand part of the figure shows that the estimate tracks the state with a bounded error. The middle part of the figure shows the variance of the error, which can be precomputed. The right-hand part of the figure shows the filter with the time-varying gain (in blue) and the filter with the limiting gain (in green). The filter with the constant gain performs as well as the one with the time-varying gain, in the limit, as justified by part (c) of the theorem.
9.8.2.2 Random Walk with Unknown Drift
In the second example, one tracks a random walk that has an unknown drift. This system is modeled by the following equations:
In this model, X 2(n) is the constant but unknown drift and X 1(n) is the value of the “random walk.” Figure 9.16 shows a simulation of the filter. It shows that the filter eventually estimates the drift and that the estimate of the position of the walk is quite accurate.
9.8.2.3 Random Walk with Changing Drift
In the third example, one tracks a random walk that has changing drift. This system is modeled by the following equations:
In this model, X 2(n) is the varying drift and X 1(n) is the value of the “random walk.” Figure 9.17 shows a simulation of the filter. It shows that the filter tries to track the drift and that the estimate of the position of the walk is quite accurate.
9.8.2.4 Falling Object
In the fourth example, one tracks a falling object. The elevation Z(n) of that falling object follows the equation
where S(0) is the initial vertical velocity of the object and g is the gravitational constant at the surface of the earth. In this expression, V (n) is some noise that perturbs the motion. We observe η(n) = Z(n) + W(n), where W(n) is some noise.
Since the term − gn 2∕2 is known, we consider
With this change of variables, the system is described by the following equations:
Figure 9.18 shows a simulation of the filter that computes \(\hat X_1(n)\), from which we subtract gn 2∕2 to get an estimate of the actual altitude Z(n) of the object.
9.9 Summary
-
LLSE, linear regression, and MMSE;
-
Projection characterization;
-
MMSE of jointly Gaussian is linear;
-
Kalman Filter.
9.9.1 Key Equations and Formulas
LLSE | L[X|Y ] = E(X) + cov(X, Y )var(Y )−1(Y − E(Y )) | Theorem 9.1 |
Orthogonality | X − L[X|Y ] ⊥ a + bY | (9.6) |
Linear Regression | converges to L[X|Y ] | Theorem 9.2 |
Conditional Expectation | E[X|Y ] = … | Definition 9.2 |
Orthogonality | X − E[X|Y ] ⊥ g(Y ) | Lemma 9.4 |
MMSE = CE | MMSE[X|Y ] = E[X|Y ] | Theorem 9.3 |
Properties of CE | Linearity, smoothing, etc… | Theorem 9.5 |
CE for J.G. | If X, Y J.G., then E[X|Y ] = L[X|Y ] = ⋯ | Theorem 9.6 |
LLSE vectors | \(L[\mathbf {X} | \mathbf {Y}] = E(\mathbf {X}) + \varSigma _{\mathbf {X}, \mathbf {Y}} \varSigma _{\mathbf {Y}}^{-1} (\mathbf {Y} - E(\mathbf {Y}))\) | Theorem 9.7 |
Kalman Filter | \(\hat X(n) = A \hat X(n-1)+ K_n [Y(n) - CA \hat X(n-1)]\) | Theorem 9.8 |
9.11 Problems
Problem 9.1
Assume that \(X_n = Y_n + 2Y_n^2 + Z_n\) where the Y n and Z n are i.i.d. U[0, 1]. Let also X = X 1 and Y = Y 1.
-
(a)
Calculate L[X|Y ] and E((X − L[X|Y ])2);
-
(b)
Calculate Q[X|Y ] and E((X − Q[X|Y ])2) where Q[X|Y ] is the quadratic least squares estimate of X given Y .
-
(c)
Design a stochastic gradient algorithm to compute Q[X|Y ] and implement it in Python.
Problem 9.2
We want to compare the off-line and on-line methods for computing L[X|Y ]. Use the setup of the previous problem.
-
(a)
Generate N = 1, 000 samples and compute the linear regression of X given Y . Say that this is X = aY + b
-
(b)
Using the same samples, compute the linear fit recursively using the stochastic gradient algorithm. Say that you obtain X = cY + d
-
(c)
Evaluate the quality of the two estimates you obtained by computing E((X − aY − b)2) and E((X − cY − d)2).
Problem 9.3
The random variables X, Y, Z are jointly Gaussian,
-
(a)
Find E[X|Y, Z];
-
(b)
Find the variance of the error.
Problem 9.4
You observe three i.i.d. samples X 1, X 2, X 3 from the distribution \(f_{X|\theta }(x) = \frac 12 e^{-|x-\theta |}\), where \(\theta \in \mathbb {R}\) is the parameter to estimate. Find MLE[θ|X 1, X 2, X 3].
Problem 9.5
-
(a)
Given three independent N(0, 1) random variables X, Y , and Z, find the following minimum mean square estimator:
$$\displaystyle \begin{aligned} E[X+3Y | 2Y + 5Z]. \end{aligned}$$ -
(b)
For the above, compute the mean squared error of the estimator.
Problem 9.6
Given two independent N(0, 1) random variables X and Y, find the following linear least square estimator:
Hint: The characteristic function of a N(0, 1) random variable X is as follows:
Problem 9.7
Consider a sensor network with n sensors that are making observations Y n = (Y 1, …, Y n) of a signal X where
In this expression, X =D N(0, 1), Z i =D N(0, σ 2), for i = 1, …, n and these random variables are mutually independent.
-
(a)
Compute the MMSE estimator of X given Y n.
-
(b)
Compute the mean squared error \(\sigma _n^2\) of the estimator.
-
(c)
Assume each measurement has a cost C and that we want to minimize
$$\displaystyle \begin{aligned} nC + \sigma_n^2. \end{aligned}$$Find the best value of n.
-
(d)
Assume that we can decide at each step whether to make another measurement or to stop. Our goal is to minimize the expected value of
$$\displaystyle \begin{aligned} \nu C + \sigma_\nu^2, \end{aligned}$$where ν is the random number of measurements. Do you think there is a decision rule that will do better than the deterministic value n derived in (c)? Explain.
Problem 9.8
We want to use a Kalman filter to detect a change in the popularity of a word in Twitter messages. To do this, we create a model of the number Y n of times that particular word appears in Twitter messages on day n. The model is as follows:
where the W(n) are zero-mean and uncorrelated. This model means that we are observing numbers of occurrences with an unknown mean X(n) that is supposed to be constant. The idea is that if the mean actually changes, we should be able to detect it by noticing that the errors between \(\hat Y(n)\) and Y (n) are large. Propose an algorithm for detecting that change and implement it in Python.
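One possible approach, sketched below, assumes the constant-mean model X(n+1) = X(n), Y(n) = X(n) + W(n) with Var(W) = σ², and declares a change when the normalized innovation exceeds a threshold. The threshold value 5 and the simulated data are illustrative choices, not part of the problem:

```python
import numpy as np

def detect_change(y, sigma2=1.0, threshold=5.0):
    """Kalman filter for X(n+1) = X(n), Y(n) = X(n) + W(n), Var(W) = sigma2.
    For this model the filter reduces to a precision-weighted running average.
    Flags a change at the first n where the normalized innovation
    |Y(n) - xhat| / sqrt(var + sigma2) exceeds `threshold`.
    Returns (alarm index or None, list of estimates)."""
    xhat, var = y[0], sigma2  # initialize from the first observation
    estimates = [xhat]
    for n in range(1, len(y)):
        innov = y[n] - xhat
        s = var + sigma2                # innovation variance
        if abs(innov) / np.sqrt(s) > threshold:
            return n, estimates        # change detected
        k = var / s                    # Kalman gain
        xhat += k * innov
        var *= 1.0 - k                 # posterior variance update
        estimates.append(xhat)
    return None, estimates

# Synthetic data: the mean jumps from 10 to 20 at n = 200.
rng = np.random.default_rng(1)
y = np.concatenate([10 + rng.standard_normal(200),
                    20 + rng.standard_normal(50)])
n_change, _ = detect_change(y)
print(n_change)  # flags the change shortly after n = 200
```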
Problem 9.9
The random variable X is exponentially distributed with mean 1. Given X, the random variable Y is exponentially distributed with rate X.
(a) Calculate E[Y|X].
(b) Calculate E[X|Y].
Problem 9.10
The random variables X, Y, Z are i.i.d. \(\mathcal {N}(0, 1)\).
(a) Find L[X^2 + Y^2|X + Y];
(b) Find E[X + 2Y|X + 3Y + 4Z];
(c) Find E[(X + Y)^2|X − Y].
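Parts like (a) can be sanity-checked numerically: since all odd moments of a N(0, 1) random variable vanish, X^2 + Y^2 is uncorrelated with X + Y, so the LLSE reduces to the constant E(X^2 + Y^2) = 2. A Monte Carlo sketch of that claim:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X, Y = rng.standard_normal((2, n))

t = X**2 + Y**2  # quantity to estimate
v = X + Y        # observation

# cov(X^2 + Y^2, X + Y) involves only odd moments of N(0, 1), so it is 0,
# and the LLSE is the constant E(X^2 + Y^2) = 2.
slope = np.cov(t, v)[0, 1] / np.var(v)
print(slope, t.mean())  # slope near 0, mean near 2
```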
Problem 9.11
Let (V_n, n ≥ 0) be i.i.d. N(0, σ^2) and independent of X_0 ∼ N(0, u^2). Define
1. What is the distribution of X_n for n ≥ 1?
2. Find E[X_{n+m}|X_n] for 0 ≤ n < n + m.
3. Find u so that the distribution of X_n is the same for all n ≥ 0.
Problem 9.12
Let θ ∼ U[0, 1], and given θ, the random variable X is uniformly distributed in [0, θ]. Find E[θ|X].
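A conditional-density calculation gives f(θ|x) ∝ 1/θ on [x, 1], hence E[θ|X = x] = (1 − x)/(−ln x). A quick empirical check of that formula (the slice point x0 = 0.3 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000
theta = rng.uniform(0.0, 1.0, n)
x = rng.uniform(0.0, theta)  # X | theta ~ U[0, theta] (high broadcasts)

# Candidate answer from Bayes' rule: f(theta | x) is proportional to
# 1/theta on [x, 1], so E[theta | X = x] = (1 - x) / (-ln x).
x0, eps = 0.3, 0.005
mask = np.abs(x - x0) < eps       # thin slice around x0
empirical = theta[mask].mean()
formula = (1 - x0) / (-np.log(x0))
print(empirical, formula)         # both near 0.58
```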
Problem 9.13
Let \((X, Y)^T \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}\right)\). Find E[X^2|Y].
Problem 9.14
Let \((X, Y, Z)^T \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 5 & 3 & 1 \\ 3 & 9 & 3 \\ 1 & 3 & 1 \end{bmatrix}\right)\). Find E[X|Y, Z].
Problem 9.15
Consider arbitrary random variables X and Y . Prove the following property:
Problem 9.16
Let the joint p.d.f. of two random variables X and Y be
First show that this is a valid joint p.d.f. Suppose you observe Y drawn from this joint density. Find MMSE[X|Y ].
Problem 9.17
Given four independent N(0, 1) random variables X, Y , Z, and V , find the following minimum mean square estimate:
Find the mean squared error of the estimate.
Problem 9.18
Assume that X and Y are two random variables such that E[X|Y] = L[X|Y]. Then, it must be that (choose the correct answers, if any):
(a) X and Y are jointly Gaussian;
(b) X can be written as X = aY + Z, where Z is a random variable that is independent of Y;
(c) E((X − L[X|Y])Y^k) = 0 for all k ≥ 0;
(d) \(E((X - L[X|Y]) \sin {}(3Y + 5)) = 0\).
Problem 9.19
In a linear system with independent Gaussian noise, with state X_n and observation Y_n, the Kalman filter computes (choose the correct answers, if any):
(a) MLE[Y_n|X_n];
(b) MLE[X_n|Y_n];
(c) MAP[Y_n|X_n];
(d) MAP[X_n|Y_n];
(e) E[X_n|Y_n];
(f) E[Y_n|X_n];
(g) L[X_n|Y_n];
(h) L[Y_n|X_n].
Problem 9.20
Let (X, Y), where Y′ = [Y_1, Y_2, Y_3, Y_4], be N(μ, Σ) with μ′ = [2, 1, 3, 4, 5] and
Find E[X|Y].
Problem 9.21
Let X = AV and Y = CV, where V ∼ N(0, I).
Find E[X|Y].
Problem 9.22
Given θ ∈ {0, 1}, X ∼ N(0, Σ_θ), where ρ > 0 is given.
Find MLE[θ|X].
Problem 9.23
Given two independent N(0, 1) random variables X and Y, find the following linear least square estimator:
Hint: The characteristic function of a N(0, 1) random variable X is \(E(e^{iuX}) = e^{-u^2/2}\).
Problem 9.24
Let X, Y, Z be i.i.d. \(\mathcal {N}(0, 1)\). Find
Hint: Argue that the observation Y − Z is redundant.
Problem 9.25
Let X, Y_1, Y_2, Y_3 be zero-mean with covariance matrix
Find L[X|Y_1, Y_2, Y_3]. Hint: You will observe that Σ_Y is singular. This means that at least one of the observations Y_1, Y_2, or Y_3 is redundant, i.e., it is a linear combination of the others. This implies that L[X|Y_1, Y_2, Y_3] = L[X|Y_1, Y_2].
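When Σ_Y is singular, the normal equations a Σ_Y = Σ_{XY} still have solutions, and the Moore–Penrose pseudo-inverse picks one of them. A sketch with made-up numbers (NOT the covariance matrix of the problem statement): here Y3 = Y1 + Y2, so Σ_Y is singular and the resulting estimator coincides with L[X|Y1, Y2]:

```python
import numpy as np

def llse_weights(cov_xy, cov_y):
    """Weights a with L[X | Y] = a @ Y for zero-mean variables.
    np.linalg.pinv handles a singular cov_y (redundant observations)."""
    return cov_xy @ np.linalg.pinv(cov_y)

# Illustrative numbers: Y1, Y2 independent with variances 2 and 1,
# Y3 = Y1 + Y2 (so cov_y is singular), and cov(X, Yi) = (1, 0.5, 1.5).
cov_y = np.array([[2.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [2.0, 1.0, 3.0]])
cov_xy = np.array([1.0, 0.5, 1.5])

a = llse_weights(cov_xy, cov_y)
# Since Y3 = Y1 + Y2, the effective weights on Y1 and Y2 are
# a[0] + a[2] and a[1] + a[2]; both equal 0.5, matching
# L[X|Y1, Y2] = 0.5 Y1 + 0.5 Y2.
print(a[0] + a[2], a[1] + a[2])
```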
Notes
- 1.
See Appendix B.
- 2.
Indeed, E(XZ) = E(X)E(Z) = 0, by independence.
- 3.
Thus,
$$\displaystyle \begin{aligned} E(Y^k) = (1 + k)^{-1}. \end{aligned}$$
- 4.
See Appendix B.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Copyright information
© 2021 The Author(s)
Walrand, J. (2021). Tracking—A. In: Probability in Electrical Engineering and Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-49995-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49994-5
Online ISBN: 978-3-030-49995-2