Topics: LQG Control, incomplete observations

14.1 LQG Control

The ideas of dynamic programming that we explained for a controlled Markov chain apply to other controlled systems. We discuss the case of a linear system with quadratic cost and Gaussian noise, which is called the LQG problem. For simplicity, we consider only the scalar case.

The system is

$$\displaystyle \begin{aligned} X(n+1) = aX(n) + U(n) + V(n), n \geq 0. \end{aligned} $$
(14.1)

Here, X(n) is the state, U(n) is a control value, and V(n) is the noise. We assume that the random variables V(n) are i.i.d. and \(\mathcal{N}(0, \sigma^2)\).

The problem is to choose, at each time n, the control value U(n) in \(\Re \) based on the observed state values up to time n to minimize the expected cost

$$\displaystyle \begin{aligned} E \left [\sum_{n=0}^N \left( X(n)^2 + \beta U(n)^2 \right) | X(0) = x \right ]. \end{aligned} $$
(14.2)

Thus, the goal of the control is to keep the state value close to zero, and one pays a cost for the control.

The problem is then to trade off the cost of a large state value against the cost of the control needed to bring the state back close to zero. To get some intuition for the solution, consider a simple form of this trade-off: minimizing

$$\displaystyle \begin{aligned} (ax + u)^2 + \beta u^2. \end{aligned}$$

In this simple version of the problem, there is no noise and we apply the control only once. To minimize this expression over u, we set the derivative with respect to u equal to zero and we find

$$\displaystyle \begin{aligned} 2(ax + u) + 2 \beta u = 0, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} u = - \frac{a}{1 + \beta} x. \end{aligned}$$

Thus, the value of the control that minimizes the cost is linear in the state. We should use a large control value when the state is far from the desired value 0. The following result shows that the same conclusion holds for our problem (Fig. 14.1).
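As a quick sanity check, here is a small Python sketch (with illustrative values a = 0.8, β = 2, x = 3, which are not from the text) that minimizes (ax + u)² + βu² over a grid of control values and compares the minimizer with −ax∕(1 + β).

```python
import numpy as np

# Illustrative values (not from the text).
a, beta, x = 0.8, 2.0, 3.0

u = np.linspace(-10, 10, 200001)        # grid of candidate controls
cost = (a * x + u)**2 + beta * u**2     # one-step trade-off
u_best = u[np.argmin(cost)]

print(u_best, -a * x / (1 + beta))      # both are approximately -0.8
```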

Fig. 14.1 The optimal control is linear in the state

Theorem 14.1

Optimal LQG Control. The control values U(n) that minimize (14.2) for the system (14.1) are

$$\displaystyle \begin{aligned} U(n) = g(N - n) X(n), \end{aligned}$$

where

$$\displaystyle \begin{aligned} & g(m) = - \frac{ad(m-1)}{\beta + d(m-1)}, m \geq 0; {} \end{aligned} $$
(14.3)
$$\displaystyle \begin{aligned} & d(m) = 1 + \frac{a^2 \beta d(m-1)}{\beta + d(m-1)}, m \geq 0 {} \end{aligned} $$
(14.4)

with d(−1) = 0.

That is, the optimal control is linear in the state and the coefficient depends on the time-to-go. These coefficients can be pre-computed at time 0 and they do not depend on the noise variance. Thus, the control values would be calculated in the same way if V (n) = 0 for all n. \({\blacksquare }\)
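As an illustration, here is a minimal Python sketch of the recursions (14.3) and (14.4); the function name lqg_gains and the parameter values in the example are illustrative choices, not from the text.

```python
import numpy as np

def lqg_gains(a, beta, N):
    """Compute the gains g(m) and coefficients d(m) of (14.3)-(14.4) for m = 0, ..., N."""
    g = np.zeros(N + 1)
    d = np.zeros(N + 1)
    d_prev = 0.0                                            # d(-1) = 0
    for m in range(N + 1):
        g[m] = -a * d_prev / (beta + d_prev)                # (14.3)
        d[m] = 1 + a**2 * beta * d_prev / (beta + d_prev)   # (14.4)
        d_prev = d[m]
    return g, d

# The optimal control at time n is then U(n) = g[N - n] * X(n).
g, d = lqg_gains(a=0.8, beta=1.0, N=100)
```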

Proof

Let \(V_m(x)\) be the minimum value of (14.2) when N is replaced by m. The stochastic dynamic programming equations are

$$\displaystyle \begin{aligned} V_m(x) = \min_u \left \{x^2 + \beta u^2 + E(V_{m - 1}(ax + u + V)) \right \}, m \geq 0, \end{aligned} $$
(14.5)

where V is \(\mathcal{N}(0, \sigma^2)\). Also, \(V_{-1}(x) := 0\).

We claim that the solution of these equations is

$$\displaystyle \begin{aligned} V_m(x) = c(m) + d(m) x^2 \end{aligned}$$

for some constants c(m) and d(m) where d(m) satisfies (14.4).

That is, we claim that

$$\displaystyle \begin{aligned} \min_u \{x^2 + \beta u^2 + E[ c(m-1) + d(m-1) (ax + u + V)^2 ] \} = c(m) + d(m) x^2, \end{aligned} $$
(14.6)

where d(m) is given by (14.4) and the minimizer is u = g(m)x where g(m) is given by (14.3).

The verification is a simple algebraic exercise that we leave to the reader. □
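For readers who want to check the algebra, here is a sketch using sympy (the symbol names are ours): it minimizes the bracketed expression in (14.6) over u and verifies that the minimizer and the coefficient of x² agree with (14.3) and (14.4).

```python
import sympy as sp

x, u, a, beta, c1, d1, sigma = sp.symbols('x u a beta c1 d1 sigma', positive=True)

# Bracketed expression in (14.6), using E[(a x + u + V)^2] = (a x + u)^2 + sigma^2.
J = x**2 + beta*u**2 + c1 + d1*((a*x + u)**2 + sigma**2)

u_star = sp.solve(sp.diff(J, u), u)[0]     # minimizer over u
V_min = sp.expand(J.subs(u, u_star))       # minimum value
d_next = V_min.coeff(x, 2)                 # coefficient of x^2

print(sp.simplify(u_star - (-a*d1/(beta + d1))*x))           # 0, i.e., u = g(m) x as in (14.3)
print(sp.simplify(d_next - (1 + a**2*beta*d1/(beta + d1))))  # 0, i.e., (14.4)
```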

14.1.1 Letting N → ∞

What happens if N becomes very large in (14.2)? Proceeding formally, we examine (14.4) and observe that if |a| < 1, then d(m) → d as m → ∞, where d is the solution of the fixed-point equation

$$\displaystyle \begin{aligned} d = f(d) := 1 + \frac{a^2 \beta d}{\beta + d}. \end{aligned}$$

To see why this is the case, note that

$$\displaystyle \begin{aligned} f'(d) = \frac{a^2 \beta^2}{(\beta + d)^2}, \end{aligned}$$

so that 0 < f′(d) ≤ a² for d ≥ 0. Also, f(d) > 0 for d ≥ 0. Hence, since |a| < 1, f(d) is a contraction. That is,

$$\displaystyle \begin{aligned} |f(d_1) - f(d_2)| \leq \alpha |d_1 - d_2|, \forall d_1, d_2 \geq 0 \end{aligned}$$

for some α ∈ (0, 1). (Here, α = a².) In particular, choosing d₁ = d and d₂ = d(m), we find that

$$\displaystyle \begin{aligned} |d - d(m+1)| \leq \alpha |d - d(m)|, \forall m \geq 0. \end{aligned}$$

Thus,

$$\displaystyle \begin{aligned} |d - d(m)| \leq \alpha^m |d - d(0)|, \end{aligned}$$

which shows that d(m) → d, as claimed. Consequently, (14.3) shows that g(m) → g as m → ∞, where

$$\displaystyle \begin{aligned} g = - \frac{ad}{\beta + d}. \end{aligned}$$

Thus, when the time-to-go m is very large, the optimal control approaches U(N − m) = gX(N − m). This suggests that this control may minimize the cost (14.2) when N tends to infinity (Fig. 14.2).

Fig. 14.2 The optimal control for the average cost

The formal way to study this problem is to consider the long-term average cost defined by

$$\displaystyle \begin{aligned} \lim_{N \rightarrow \infty} \frac{1}{N} E \left[\sum_{n=0}^N \left( X(n)^2 + \beta U(n)^2 \right) | X(0) = x \right ]. \end{aligned}$$

This expression is the average cost per unit time. One can show that if |a| < 1, then the control U(n) = gX(n) with g defined as before indeed minimizes that average cost.
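A minimal sketch of this stationary controller, assuming the illustrative values a = 0.8, β = 1, and Var V(n) = 0.2 (these choices are ours): it computes d and g by iterating the fixed-point equation and estimates the average cost per unit time by simulation.

```python
import numpy as np

a, beta, sigma2 = 0.8, 1.0, 0.2      # illustrative parameters

# Fixed point of d = 1 + a^2*beta*d/(beta + d) and the limiting gain g.
d = 0.0
for _ in range(200):                 # a^2 < 1, so the iteration converges geometrically
    d = 1 + a**2 * beta * d / (beta + d)
g = -a * d / (beta + d)

# Simulate U(n) = g X(n) and estimate the average cost per unit time.
rng = np.random.default_rng(0)
T = 10_000
X, cost = 0.0, 0.0
for n in range(T):
    U = g * X
    cost += X**2 + beta * U**2
    X = a * X + U + rng.normal(0.0, np.sqrt(sigma2))
print(g, cost / T)
```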

14.2 LQG with Noisy Observations

In the previous section, we controlled a linear system with Gaussian noise assuming that we observed the state. We now consider the case of noisy observations.

The system is

$$\displaystyle \begin{aligned} &X(n+1) = aX(n) + U(n) + V(n), n \geq 0; {} \end{aligned} $$
(14.7)
$$\displaystyle \begin{aligned} &Y(n) = X(n) + W(n), {} \end{aligned} $$
(14.8)

where the random variables W(n) are i.i.d. \(\mathcal {N}(0, w^2)\) and are independent of the V (n).

The problem is to find, for each n, the value of U(n), based on the values of \(Y^n := \{Y(0), \ldots, Y(n)\}\), that minimizes the expected total cost (14.2).

The following result gives the solution of the problem (Fig. 14.3).

Fig. 14.3 The optimal control is linear in the estimate of the state

Theorem 14.2

Optimal LQG Control with Noisy Observations. The solution of the problem is

$$\displaystyle \begin{aligned} U(n) = g(N-n) \hat X(n), \end{aligned}$$

where

$$\displaystyle \begin{aligned} \hat X(n) = E[X(n) | Y(0), \ldots, Y(n), U(0), \ldots, U(n-1)] \end{aligned}$$

can be computed by using the Kalman filter, and the constants g(m) are given by (14.3)–(14.4).

Thus, the control values are the same as when X(n) is observed exactly, except that X(n) is replaced by \(\hat X(n)\) . This feature is called certainty equivalence. \({\blacksquare }\)

Proof

The fact that the gains g(m) do not depend on the noise V(n) gives us some inkling as to why the result in the theorem can be expected: given \(Y^n\), the state X(n) is \(\mathcal {N}( \hat X(n), v^2)\) for some variance v². Thus, we can view the noisy observation as increasing the variance of the state, as if the variance of V(n) were increased.

Instead of providing the complete algebra, let us sketch why the result holds. Assume that the minimum expected cost-to-go at time N − m + 1 given \(Y^{N-m+1}\) is

$$\displaystyle \begin{aligned} c(m-1) + d(m-1) \hat X(N - m + 1)^2. \end{aligned}$$

Then, at time N − m, the expected cost-to-go given \(Y^{N-m}\) and U(N − m) = u is the expected value of

$$\displaystyle \begin{aligned} X(N-m )^2 + \beta u^2 + c(m-1) + d(m-1) \hat X(N - m+1)^2 \end{aligned}$$

given \(Y^{N-m}\) and U(N − m) = u. Now,

$$\displaystyle \begin{aligned} X(N-m) = \hat X(N-m) + \eta, \end{aligned}$$

where η is a zero-mean Gaussian random variable independent of \(Y^{N-m}\). Also, as we saw when we discussed the Kalman filter,

$$\displaystyle \begin{aligned} \hat X(N-m+1) &= a \hat X(N-m) + u\\ & \quad + K(N-m+1)\{ Y(N-m+1) - E[Y(N-m+1) | Y^{N-m}] \}. \end{aligned} $$

Moreover, we know from our study of the conditional expectation of jointly Gaussian random variables that \(Y(N-m+1) - E[Y(N-m+1) \mid Y^{N-m}]\) is a Gaussian random variable that has mean zero and is independent of \(Y^{N-m}\). Hence,

$$\displaystyle \begin{aligned} \hat X(N-m+1) = a \hat X(N-m) + u + Z \end{aligned}$$

for some independent zero-mean Gaussian random variable Z.

Thus, the expected cost-to-go at time N − m is the expected value of

$$\displaystyle \begin{aligned} & (\hat X(N-m) + \eta)^2 + \beta u^2 + c(m-1) \\ & \quad + d(m-1) (a \hat X(N-m) + u + Z)^2, \end{aligned} $$

i.e., up to a constant that does not depend on u, of

$$\displaystyle \begin{aligned} \hat X(N-m)^2 + \beta u^2 + c(m-1) + d(m-1)( a \hat X(N-m) + u + Z)^2. \end{aligned}$$

This expression is identical to (14.6), except that x is replaced by \(\hat X(N-m)\) and V  is replaced by Z. Since the variance of V  does not affect the calculations of c(m) and d(m), this concludes the proof. □

14.2.1 Letting N → ∞

As when X(n) is observed exactly, one can show that, if |a| < 1, the control

$$\displaystyle \begin{aligned} U(n) = g \hat X(n) \end{aligned}$$

minimizes the average cost per unit time. Also, in this case, we know that the Kalman filter becomes stationary and has the form (Fig. 14.4)

$$\displaystyle \begin{aligned} \hat X(n+1) = a \hat X(n) + U(n) + K [Y(n+1) - a \hat X(n) - U(n)]. \end{aligned}$$
Fig. 14.4 The optimal control for the average cost with noisy observations. Here, the Kalman filter is stationary
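Here is a hedged sketch of this stationary scheme for (14.7)–(14.8), with illustrative parameters (a = 0.8, β = 1, Var V(n) = 0.2, w² = 0.1) that are ours. It computes the stationary Kalman gain K by iterating the scalar Riccati recursion for the one-step prediction variance, computes the limiting LQG gain g as in Sect. 14.1.1, and then simulates the certainty-equivalence control.

```python
import numpy as np

a, beta = 0.8, 1.0                   # illustrative parameters
sigma2, w2 = 0.2, 0.1                # Var V(n) and Var W(n)

# Stationary prediction variance S = a^2 * S * w2 / (S + w2) + sigma2 and gain K.
S = sigma2
for _ in range(200):
    S = a**2 * S * w2 / (S + w2) + sigma2
K = S / (S + w2)

# Limiting LQG gain, as in Sect. 14.1.1.
d = 0.0
for _ in range(200):
    d = 1 + a**2 * beta * d / (beta + d)
g = -a * d / (beta + d)

# Certainty-equivalence control with the stationary Kalman filter.
rng = np.random.default_rng(0)
T = 10_000
X, Xhat, cost = 0.0, 0.0, 0.0
for n in range(T):
    U = g * Xhat                                        # control uses the estimate
    cost += X**2 + beta * U**2
    X = a * X + U + rng.normal(0.0, np.sqrt(sigma2))    # state update (14.7)
    Y = X + rng.normal(0.0, np.sqrt(w2))                # noisy observation (14.8)
    Xhat = a * Xhat + U + K * (Y - a * Xhat - U)        # stationary Kalman filter
print(cost / T)
```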

14.3 Partially Observed MDP

In the previous chapter, we considered a controlled Markov chain where the actions are based on knowledge of the state. In this section, we look at problems where the state of the Markov chain is not observed exactly. In other words, we look at a controlled hidden Markov chain. These problems are called partially observed Markov decision problems (POMDPs).

Instead of discussing the general version of this problem, we look at one concrete example to convey the basic ideas.

14.3.1 Example: Searching for Your Keys

The example is illustrated in Fig. 14.5. You have misplaced your keys but you know that they are either in bag A, with probability p, or in bag B, otherwise. Unfortunately, your bags are cluttered and if you spend one unit of time (say 10 s) looking in bag A, you find your keys with probability α if they are there. Similarly, the probability for bag B is β. Every time unit, you choose which bag to explore. Your objective is to minimize the expected time until you find your keys.

Fig. 14.5 Where to look for your keys?

The state of the system is the location A or B of your keys. However, you do not observe that state. The key idea (excuse the pun) is to consider the conditional probability \(p_n\) that the keys are in bag A given all your observations up to time n. It turns out that \(p_n\) is a controlled Markov chain, as we explain shortly. Unfortunately, the set of possible values of \(p_n\) is [0, 1], which is not finite, nor even countable. Let us not get discouraged by this technical issue.

Assume that at time n, when the keys are in bag A with probability \(p_n\), you look in bag A for one unit of time and you do not see the keys. What then is \(p_{n+1}\)? We claim that

$$\displaystyle \begin{aligned} p_{n+1} = \frac{p_n(1 - \alpha)}{p_n(1 - \alpha) + (1 - p_n)} =: f(A, p_n).\end{aligned} $$

Indeed, this is the probability that the keys are in bag A and we do not see them, divided by the probability that we do not see the keys (either when they are there or when they are not). Of course, if we see the keys, the problem stops.

Similarly, say that we look in bag B and we do not see the keys. Then

$$\displaystyle \begin{aligned} p_{n+1} = \frac{p_n}{p_n + (1 - p_n)(1 - \beta)} =: f(B, p_n). \end{aligned}$$

Thus, we control \(p_n\) with our actions. Let V(p) be the minimum expected time until we find the keys, given that they are in bag A with probability p. Then, the DPE are

$$\displaystyle \begin{aligned} V(p) = 1 + \min \{ (1 - p \alpha)V(f(A, p)), (1 - (1 - p)\beta)V(f(B, p))\}. \end{aligned} $$
(14.9)

The constant 1 is the duration of the first step. The first term in the minimum is what happens when you look in bag A. With probability 1 − pα, you do not find your keys and you will then have to wait a minimum expected time equal to V(f(A, p)) to find your keys, because the probability that they are in bag A is now f(A, p). The other term corresponds to first looking in bag B.

These equations look hopeless. However, they are easy to solve in Python. One discretizes [0, 1] into K intervals and one rounds off the updates f(A, p) and f(B, p).

Thus, the updates are for a finite vector V = (V (1∕K), V (2∕K), …, V (1)). With this discretization, the equations (14.9) look like

$$\displaystyle \begin{aligned} \mathbf{V} = \phi( \mathbf{V}), \end{aligned}$$

where ϕ(⋅) is the right-hand side of (14.9). These are fixed-point equations. To solve them, we initialize \({\mathbf{V}}_0 = \mathbf{0}\) and we iterate

$$\displaystyle \begin{aligned} {\mathbf{V}}_{t + 1} = \phi({\mathbf{V}}_t), t \geq 0. \end{aligned}$$

With a bit of luck that can be justified mathematically, this algorithm converges to \(\mathbf{V}\), the solution of the DPE. The solution is shown in Fig. 14.6 for different values of α and β. The figure also shows the optimum action as a function of p. The discretization uses K = 1000 values in [0, 1] and the iteration is performed 100 times.

Fig. 14.6 Numerical solution of (14.9)
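Here is a minimal Python sketch of the discretized value iteration described above. The function names (f_A, f_B, idx, phi) and the values α = β = 0.5 are illustrative choices, not from the text.

```python
import numpy as np

alpha, beta, K, n_iter = 0.5, 0.5, 1000, 100     # illustrative choices
p = np.arange(1, K + 1) / K                      # grid p = 1/K, 2/K, ..., 1

def f_A(p):   # belief after looking in bag A and not finding the keys
    return p * (1 - alpha) / (p * (1 - alpha) + (1 - p))

def f_B(p):   # belief after looking in bag B and not finding the keys
    return p / (p + (1 - p) * (1 - beta))

def idx(q):   # round a belief off to the nearest grid point
    return np.clip(np.round(q * K).astype(int) - 1, 0, K - 1)

def phi(V):   # right-hand side of (14.9), evaluated on the grid
    look_A = (1 - p * alpha) * V[idx(f_A(p))]
    look_B = (1 - (1 - p) * beta) * V[idx(f_B(p))]
    return 1 + np.minimum(look_A, look_B)

V = np.zeros(K)                                  # V_0 = 0
for _ in range(n_iter):
    V = phi(V)                                   # V_{t+1} = phi(V_t)

# Greedy (optimal) first action at each grid point, under the computed V.
look_A = (1 - p * alpha) * V[idx(f_A(p))]
look_B = (1 - (1 - p) * beta) * V[idx(f_B(p))]
action = np.where(look_A <= look_B, 'A', 'B')
```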

14.4 Summary

  • LQG Control Problem with State Observations;

  • LQG Control Problem with Noisy Observations;

  • Partially Observed MDP.

14.4.1 Key Equations and Formulas

Table 1

14.5 References

The texts Bertsekas (2005), Kumar and Varaiya (1986) and Goodwin and Sin (2009) cover LQG control. The first two texts discuss POMDP.

14.6 Problems

Problem 14.1

Consider the system

$$\displaystyle \begin{aligned} X(n+1) = 0.8 X(n) + U(n) + V(n), n \geq 0, \end{aligned}$$

where X(0) = 0 and the random variables V (n) are i.i.d. and \(\mathcal {N}(0, 0.2)\). The U(n) are control values.

  1. (a)

    Simulate the system when U(n) = 0 for all n ≥ 0.

  2. (b)

    Implement the control given in Theorem 14.1 with N = 100 and simulate the controlled system.

  3. (c)

Implement the control with the constant gain \(g = \lim_{n \to \infty} g(n)\) and simulate the system.

Problem 14.2

Consider the system

$$\displaystyle \begin{aligned} X(n+1) & = 0.8 X(n) + U(n) + V(n), n \geq 0 \\ Y(n) & = X(n) + W(n), n \geq 0, \end{aligned} $$

where X(0) = 0 and the random variables V (n), W(n) are independent with \(V(n) =_D \mathcal {N}(0, 0.2)\) and \(W(n) =_D \mathcal {N}(0, \sigma ^2)\).

  1. (a)

    Implement the control described in Theorem 14.2 for σ 2 = 0.1 and σ 2 = 0.4 and simulate the controlled system.

  2. (b)

    Implement the limiting control with the limiting gain and the stationary Kalman filter for σ 2 = 0.1 and σ 2 = 0.4. Simulate the system.

  3. (c)

    Compare the systems with the time-varying and the limiting controls.

Problem 14.3

There are two coins. One is fair and the other one has a probability of “head” equal to 0.6. You cannot tell which is which by looking at the coins. At each step n ≥ 1, you must choose which coin to flip. The goal is to maximize the expected number of “heads.”

  1. (a)

    Formulate the problem as a POMDP.

  2. (b)

    Discretize the state of the system as we did in the “searching for your keys” example and write the SDPEs.

  3. (c)

    Implement the SDPEs in Python and simulate the resulting system.