Abstract
There is a class of control problems that admit a particularly elegant solution: the linear quadratic Gaussian (LQG) problems. In these problems, the state dynamics and observations are linear, the cost is quadratic, and the noise is Gaussian. Section 14.1 explains the theory of LQG problems when one observes the state. Section 14.2 discusses the situation when the observations are noisy and shows the remarkable certainty equivalence property of the solution. Section 14.3 explains how noisy observations affect Markov decision problems.
Topics: LQG Control, incomplete observations
14.1 LQG Control
The ideas of dynamic programming that we explained for a controlled Markov chain apply to other controlled systems. We discuss the case of a linear system with quadratic cost and Gaussian noise, which is called the LQG problem. For simplicity, we consider only the scalar case.
The system is
$$X(n+1) = a X(n) + U(n) + V(n), \quad n = 0, 1, \ldots \tag{14.1}$$
Here, X(n) is the state, U(n) is a control value, and V(n) is the noise. We assume that the random variables V(n) are i.i.d. and \(\mathcal{N}(0, \sigma^2)\).
The problem is to choose, at each time n, the control value U(n) in \(\Re\), based on the observed state values up to time n, to minimize the expected cost
$$E\left[\sum_{n=0}^{N} \left( X(n)^2 + \beta U(n)^2 \right)\right], \tag{14.2}$$
where β > 0 is a given weight on the cost of control.
Thus, the goal of the control is to keep the state value close to zero, and one pays a cost for the control.
The problem is then to trade off the cost of a large value of the state against the cost of the control that can bring the state back close to zero. To get some intuition for the solution, consider a simple form of this trade-off: minimizing
$$(ax + u)^2 + \beta u^2$$
over u. In this simple version of the problem, there is no noise and we apply the control only once: starting from the state x, the control u brings the state to ax + u. To minimize this expression over u, we set the derivative with respect to u equal to zero and we find
$$2(ax + u) + 2\beta u = 0,$$
so that
$$u = -\frac{a}{1 + \beta}\, x.$$
Thus, the value of the control that minimizes the cost is linear in the state. We should use a large control value when the state is far from the desired value 0. The following result shows that the same conclusion holds for our problem (Fig. 14.1).
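This one-step calculation can be checked numerically. The sketch below assumes the trade-off has the form (ax + u)² + βu² with a control-cost weight β; the values of a, x, and β are illustrative, not from the text.

```python
# Check the one-step trade-off numerically: minimize (a*x + u)**2 + beta*u**2 over u.
# a, x, beta are illustrative values; beta weighs the cost of control.
import numpy as np

a, x, beta = 0.9, 2.0, 0.5
u_grid = np.linspace(-10, 10, 200001)          # fine grid of candidate controls
cost = (a * x + u_grid) ** 2 + beta * u_grid ** 2
u_best = u_grid[np.argmin(cost)]               # grid minimizer
u_formula = -a * x / (1 + beta)                # minimizer from setting the derivative to zero
print(u_best, u_formula)
```

The grid minimizer agrees with the closed-form value up to the grid resolution, and the optimal control is linear in x: doubling x doubles the control.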
Theorem 14.1
Optimal LQG Control The control values U(n) that minimize (14.2) for the system (14.1) are
$$U(n) = g(N - n) X(n), \quad n = 0, 1, \ldots, N,$$
where
$$g(m) = -\frac{a\, d(m-1)}{\beta + d(m-1)} \tag{14.3}$$
and
$$d(m) = 1 + \frac{a^2 \beta\, d(m-1)}{\beta + d(m-1)}, \quad m \geq 0, \tag{14.4}$$
with d(−1) = 0.
That is, the optimal control is linear in the state and the coefficient depends on the time-to-go. These coefficients can be pre-computed at time 0 and they do not depend on the noise variance. Thus, the control values would be calculated in the same way if V (n) = 0 for all n. \({\blacksquare }\)
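As a sketch, the gains can be precomputed with a short backward recursion on the time-to-go. The code below assumes the Riccati-type form d(m) = 1 + a²β d(m−1)/(β + d(m−1)) for (14.4) and g(m) = −a d(m−1)/(β + d(m−1)) for (14.3), with β the control-cost weight; the numerical values are illustrative.

```python
# Precompute the feedback gains g(0), ..., g(N) by backward recursion on the
# time-to-go m.  Assumed scalar recursion: d(-1) = 0,
#   d(m) = 1 + a**2 * beta * d(m-1) / (beta + d(m-1)),
#   g(m) = -a * d(m-1) / (beta + d(m-1)).
def lqg_gains(a, beta, N):
    d = 0.0                                       # d(-1) = 0
    ds, gs = [], []
    for m in range(N + 1):
        gs.append(-a * d / (beta + d))            # g(m) uses d(m-1)
        d = 1 + a**2 * beta * d / (beta + d)      # update to d(m)
        ds.append(d)
    return ds, gs

ds, gs = lqg_gains(a=0.8, beta=1.0, N=50)
# Note: neither d(m) nor g(m) involves the noise variance sigma**2.
```

Consistent with the theorem, the recursion never uses the noise variance, and for |a| < 1 the gains settle quickly to a constant value.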
Proof
Let \(V_m(x)\) be the minimum value of (14.2) when N is replaced by m. The stochastic dynamic programming equations are
$$V_m(x) = \min_u E\left[ x^2 + \beta u^2 + V_{m-1}(ax + u + V) \right], \quad m \geq 0,$$
where \(V =_D \mathcal{N}(0, \sigma^2)\). Also, \(V_{-1}(x) := 0\).
We claim that the solution of these equations is
$$V_m(x) = d(m) x^2 + c(m) \tag{14.5}$$
for some constants c(m) and d(m), where d(m) satisfies (14.4).
That is, we claim that
$$\min_u \left\{ x^2 + \beta u^2 + d(m-1) E\left[ (ax + u + V)^2 \right] + c(m-1) \right\} = d(m) x^2 + c(m), \tag{14.6}$$
where d(m) is given by (14.4) and the minimizer is u = g(m)x, where g(m) is given by (14.3).
The verification is a simple algebraic exercise that we leave to the reader. □
14.1.1 Letting N →∞
What happens if N becomes very large in (14.2)? Proceeding formally, we examine (14.4) and observe that if |a| < 1, then d(m) → d as m → ∞, where d is the solution of the fixed-point equation
$$d = f(d) := 1 + \frac{a^2 \beta d}{\beta + d}.$$
To see why this is the case, note that
$$f'(d) = \frac{a^2 \beta^2}{(\beta + d)^2},$$
so that 0 < f′(d) ≤ a² for d ≥ 0. Also, f(d) > 0 for d ≥ 0. Hence, f(d) is a contraction. That is,
$$|f(d_1) - f(d_2)| \leq \alpha |d_1 - d_2|, \quad d_1, d_2 \geq 0,$$
for some α ∈ (0, 1). (Here, α = a².) In particular, choosing \(d_1 = d\) and \(d_2 = d(m)\), we find that
$$|d - d(m+1)| = |f(d) - f(d(m))| \leq \alpha |d - d(m)|.$$
Thus,
$$|d - d(m)| \leq \alpha^m |d - d(0)|,$$
which shows that d(m) → d, as claimed. Consequently, (14.3) shows that g(m) → g as m → ∞, where
$$g = -\frac{a d}{\beta + d}.$$
Thus, when the time-to-go m is very large, the optimal control approaches U(N − m) = gX(N − m). This suggests that this control may minimize the cost (14.2) when N tends to infinity (Fig. 14.2).
The formal way to study this problem is to consider the long-term average cost defined by
$$\lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{n=0}^{N-1} \left( X(n)^2 + \beta U(n)^2 \right) \right].$$
This expression is the average cost per unit time. One can show that if |a| < 1, then the control U(n) = gX(n) with g defined as before indeed minimizes that average cost.
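This claim can be explored by simulation. The sketch below assumes the scalar model X(n+1) = aX(n) + U(n) + V(n) with per-step cost X(n)² + βU(n)², computes the limiting gain g from the fixed point of f, and estimates the average cost per unit time; all parameter values are illustrative.

```python
# Estimate the long-term average cost under the stationary control U(n) = g*X(n).
import numpy as np

rng = np.random.default_rng(0)
a, beta, sigma = 0.8, 1.0, 1.0          # illustrative parameters, |a| < 1

# Compute the limiting d by iterating the (assumed) map f(d) = 1 + a^2*beta*d/(beta+d).
d = 0.0
for _ in range(200):
    d = 1 + a**2 * beta * d / (beta + d)
g = -a * d / (beta + d)                 # limiting gain

N = 200_000
x, total = 0.0, 0.0
for _ in range(N):
    u = g * x
    total += x**2 + beta * u**2
    x = a * x + u + sigma * rng.standard_normal()
avg_cost = total / N
print(g, avg_cost)
```

With these parameters the closed-loop coefficient a + g is well inside (−1, 1), so the state stays bounded and the running average settles.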
14.2 LQG with Noisy Observations
In the previous section, we controlled a linear system with Gaussian noise assuming that we observed the state. We now consider the case of noisy observations.
The system is
$$X(n+1) = a X(n) + U(n) + V(n)$$
$$Y(n) = X(n) + W(n),$$
where the random variables W(n) are i.i.d. \(\mathcal {N}(0, w^2)\) and are independent of the V (n).
The problem is to find, for each n, the value of U(n) based on the values of \(Y^n := (Y(0), \ldots, Y(n))\) that minimize the expected total cost (14.2).
The following result gives the solution of the problem (Fig. 14.3).
Theorem 14.2
Optimal LQG Control with Noisy Observations The solution of the problem is
$$U(n) = g(N - n) \hat X(n),$$
where
$$\hat X(n) := E\left[ X(n) \mid Y^n \right]$$
can be computed by using the Kalman filter and the constants g(m) are given by (14.3)–(14.4).
Thus, the control values are the same as when X(n) is observed exactly, except that X(n) is replaced by \(\hat X(n)\) . This feature is called certainty equivalence. \({\blacksquare }\)
Proof
The fact that the values of g(n) do not depend on the noise V(n) gives us some inkling as to why the result in the theorem can be expected: given \(Y^n\), the state X(n) is \(\mathcal{N}(\hat X(n), v^2)\) for some variance \(v^2\). Thus, we can view the noisy observation as increasing the variance of the state, as if the variance of V(n) were increased.
Instead of providing the complete algebra, let us sketch why the result holds. Assume that the minimum expected cost-to-go at time N − m + 1 given \(Y^{N-m+1}\) is
$$d(m-1) \hat X(N-m+1)^2 + c(m-1).$$
Then, at time N − m, the expected cost-to-go given \(Y^{N-m}\) and U(N − m) = u is the expected value of
$$X(N-m)^2 + \beta u^2 + d(m-1) \hat X(N-m+1)^2 + c(m-1)$$
given \(Y^{N-m}\) and U(N − m) = u. Now,
$$X(N-m) = \hat X(N-m) + \eta,$$
where η is a Gaussian random variable independent of \(Y^{N-m}\). Also, as we saw when we discussed the Kalman filter,
$$\hat X(N-m+1) = a \hat X(N-m) + u + K\left( Y(N-m+1) - E\left[ Y(N-m+1) \mid Y^{N-m} \right] \right).$$
Moreover, we know from our study of the conditional expectation of jointly Gaussian random variables that \(Y(N-m+1) - E[Y(N-m+1) \mid Y^{N-m}]\) is a Gaussian random variable that has mean zero and is independent of \(Y^{N-m}\). Hence,
$$\hat X(N-m+1) = a \hat X(N-m) + u + Z$$
for some independent zero-mean Gaussian random variable Z.
Thus, the expected cost-to-go at time N − m is the expected value of
$$X(N-m)^2 + \beta u^2 + d(m-1) \left( a \hat X(N-m) + u + Z \right)^2 + c(m-1),$$
i.e., of
$$\hat X(N-m)^2 + v^2 + \beta u^2 + d(m-1) E\left[ \left( a \hat X(N-m) + u + Z \right)^2 \right] + c(m-1).$$
This expression is identical to (14.6), except that x is replaced by \(\hat X(N-m)\) and V is replaced by Z. Since the variance of the noise does not affect the calculation of d(m), and hence of g(m), the minimizing control is again u = g(m)\(\hat X(N-m)\), which concludes the proof. □
14.2.1 Letting N →∞
As when X(n) is observed exactly, one can show that, if |a| < 1, the control
$$U(n) = g \hat X(n)$$
minimizes the average cost per unit time. Also, in this case, we know that the Kalman filter becomes stationary and has the form (Fig. 14.4)
$$\hat X(n+1) = a \hat X(n) + U(n) + K\left( Y(n+1) - a \hat X(n) - U(n) \right)$$
for a constant gain K.
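A minimal simulation sketch of this stationary certainty-equivalent controller, assuming the model X(n+1) = aX(n) + U(n) + V(n), Y(n) = X(n) + W(n) and a standard scalar Kalman recursion; all parameter values are illustrative.

```python
# Certainty-equivalent control with a scalar Kalman filter:
# the controller applies U(n) = g * xhat(n), where xhat is the filter estimate.
import numpy as np

rng = np.random.default_rng(1)
a, beta = 0.8, 1.0
sigma2, w2 = 0.2, 0.1                  # variances of V(n) and W(n)

# Limiting control gain g (fixed point of the assumed Riccati-type recursion).
d = 0.0
for _ in range(200):
    d = 1 + a**2 * beta * d / (beta + d)
g = -a * d / (beta + d)

x, xhat, p = 0.0, 0.0, 0.0             # p = conditional variance of X(n) given Y^n
for n in range(2000):
    u = g * xhat                       # certainty equivalence: xhat replaces x
    x = a * x + u + rng.normal(0.0, np.sqrt(sigma2))
    y = x + rng.normal(0.0, np.sqrt(w2))
    pred = a * xhat + u                # predicted state (and predicted observation)
    pvar = a**2 * p + sigma2           # variance of the prediction error
    k = pvar / (pvar + w2)             # Kalman gain
    xhat = pred + k * (y - pred)       # correct with the innovation y - pred
    p = (1 - k) * pvar                 # updated conditional variance
print(xhat, p)
```

After a few steps the filter variance p and the gain k reach their stationary values, which is the regime the limiting control operates in.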
14.3 Partially Observed MDP
In the previous chapter, we considered a controlled Markov chain where the action is based on knowledge of the state. In this section, we look at problems where the state of the Markov chain is not observed exactly. In other words, we look at a controlled hidden Markov chain. These problems are called partially observed Markov decision problems (POMDPs).
Instead of discussing the general version of this problem, we look at one concrete example to convey the basic ideas.
14.3.1 Example: Searching for Your Keys
The example is illustrated in Fig. 14.5. You have misplaced your keys but you know that they are either in bag A, with probability p, or in bag B, otherwise. Unfortunately, your bags are cluttered and if you spend one unit of time (say 10 s) looking in bag A, you find your keys with probability α if they are there. Similarly, the probability for bag B is β. Every time unit, you choose which bag to explore. Your objective is to minimize the expected time until you find your keys.
The state of the system is the location, A or B, of your keys. However, you do not observe that state. The key idea (excuse the pun) is to consider the conditional probability \(p_n\) that the keys are in bag A given all your observations up to time n. It turns out that \(p_n\) is a controlled Markov chain, as we explain shortly. Unfortunately, the set of possible values of \(p_n\) is [0, 1], which is neither finite nor countable. Let us not get discouraged by this technical issue.
Assume that at time n, when the keys are in bag A with probability \(p_n\), you look in bag A for one unit of time and you do not see the keys. What is then \(p_{n+1}\)? We claim that
$$p_{n+1} = f(A, p_n) := \frac{p_n (1 - \alpha)}{1 - p_n \alpha}.$$
Indeed, this is the probability that the keys are in bag A and we do not see them, divided by the probability that we do not see the keys (either when they are there or when they are not). Of course, if we see the keys, the problem stops.
Similarly, say that we look in bag B and we do not see the keys. Then
$$p_{n+1} = f(B, p_n) := \frac{p_n}{1 - (1 - p_n)\beta}.$$
Thus, we control \(p_n\) with our actions. Let V(p) be the minimum expected time until we find the keys, given that they are in bag A with probability p. Then, the DPE are
$$V(p) = 1 + \min\left\{ (1 - p\alpha) V(f(A, p)),\; (1 - (1 - p)\beta) V(f(B, p)) \right\}. \tag{14.9}$$
The constant 1 is the duration of the first step. The first term in the minimum is what happens when you look in bag A. With probability 1 − pα, you do not find your keys and you will then have to wait a minimum expected time equal to V (f(A, p)) to find your keys, because the probability that they are in bag A is now f(A, p). The other term corresponds to first looking in bag B.
These equations look hopeless. However, they are easy to solve in Python. One discretizes [0, 1] into K intervals and one rounds off the updates f(A, p) and f(B, p).
Thus, the updates are for a finite vector V = (V(1∕K), V(2∕K), …, V(1)). With this discretization, the equations (14.9) take the form
$$V = \phi(V),$$
where ϕ(⋅) is the right-hand side of (14.9) applied at each grid point. These are fixed-point equations. To solve them, we initialize \(V_0 = 0\) and we iterate
$$V_{k+1} = \phi(V_k), \quad k \geq 0.$$
With a bit of luck (and this can be justified mathematically), this algorithm converges to V, the solution of the DPE. The solution is shown in Fig. 14.6 for different values of α and β. The figure also shows the optimal action as a function of p. The discretization uses K = 1000 values in [0, 1] and the iteration is performed 100 times.
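A minimal Python sketch of this procedure follows; the values of α, β, and K are illustrative, and the rounding rule is one possible choice.

```python
# Value iteration for the discretized key-search DPE
#   V(p) = 1 + min{(1 - p*alpha) V(f(A,p)), (1 - (1-p)*beta) V(f(B,p))}.
import numpy as np

alpha, beta, K = 0.3, 0.4, 1000
p = np.arange(1, K + 1) / K            # grid p = 1/K, 2/K, ..., 1

def f_A(p):   # posterior that the keys are in A after an unsuccessful look in A
    return p * (1 - alpha) / (1 - p * alpha)

def f_B(p):   # posterior after an unsuccessful look in B
    return p / (1 - (1 - p) * beta)

def idx(q):   # round a posterior back onto the grid
    return np.clip(np.round(q * K).astype(int) - 1, 0, K - 1)

V = np.zeros(K)
for _ in range(100):                   # fixed-point iteration V <- phi(V)
    look_A = 1 + (1 - p * alpha) * V[idx(f_A(p))]
    look_B = 1 + (1 - (1 - p) * beta) * V[idx(f_B(p))]
    V = np.minimum(look_A, look_B)

# Optimal action on the grid: look in A where look_A <= look_B.
look_A = 1 + (1 - p * alpha) * V[idx(f_A(p))]
look_B = 1 + (1 - (1 - p) * beta) * V[idx(f_B(p))]
action = np.where(look_A <= look_B, "A", "B")
```

As a sanity check, when p = 1 the keys are surely in bag A, and repeatedly looking in A gives an expected search time of 1∕α, which the iteration recovers.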
14.4 Summary
- LQG Control Problem with State Observations;
- LQG Control Problem with Noisy Observations;
- Partially Observed MDP.
14.4.1 Key Equations and Formulas
14.6 Problems
Problem 14.1
Consider the system
where X(0) = 0 and the random variables V (n) are i.i.d. and \(\mathcal {N}(0, 0.2)\). The U(n) are control values.
(a) Simulate the system when U(n) = 0 for all n ≥ 0.
(b) Implement the control given in Theorem 14.1 with N = 100 and simulate the controlled system.
(c) Implement the control with the constant gain \(g = \lim_{n \to \infty} g(n)\) and simulate the system.
Problem 14.2
Consider the system
where X(0) = 0 and the random variables V (n), W(n) are independent with \(V(n) =_D \mathcal {N}(0, 0.2)\) and \(W(n) =_D \mathcal {N}(0, \sigma ^2)\).
(a) Implement the control described in Theorem 14.2 for \(\sigma^2 = 0.1\) and \(\sigma^2 = 0.4\) and simulate the controlled system.
(b) Implement the limiting control with the limiting gain and the stationary Kalman filter for \(\sigma^2 = 0.1\) and \(\sigma^2 = 0.4\). Simulate the system.
(c) Compare the systems with the time-varying and the limiting controls.
Problem 14.3
There are two coins. One is fair and the other one has a probability of “head” equal to 0.6. You cannot tell which is which by looking at the coins. At each step n ≥ 1, you must choose which coin to flip. The goal is to maximize the expected number of “heads.”
(a) Formulate the problem as a POMDP.
(b) Discretize the state of the system as we did in the "searching for your keys" example and write the SDPEs.
(c) Implement the SDPEs in Python and simulate the resulting system.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2021 The Author(s)
Cite this chapter
Walrand, J. (2021). Route Planning: B. In: Probability in Electrical Engineering and Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-49995-2_14
Print ISBN: 978-3-030-49994-5
Online ISBN: 978-3-030-49995-2