Encyclopedia of Systems and Control

Living Edition
| Editors: John Baillieul, Tariq Samad

Approximate Dynamic Programming (ADP)

  • Paul J. Werbos
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4471-5102-9_100096-1

Abstract

Approximate dynamic programming (ADP or RLADP) includes a wide variety of general methods to solve for optimal decision and control in the face of complexity, nonlinearity, stochasticity, and/or partial observability. This entry first reviews methods and a few key applications across decision and control engineering (e.g., vehicle and logistics control), computer science (e.g., AlphaGo), operations research, and connections to economics, neuropsychology, and animal behavior. Then it summarizes a sixfold mathematical taxonomy of methods in use today, with pointers to the future.

Keywords

Bellman · Pontryagin · Adaptive critic · HJB · Nonlinear optimal control · HDP · DHP · Neurocontrol · TD(λ) · Adaptive dynamic programming · RLADP · Reinforcement learning · Neurodynamic programming · Missile interception · Nonlinear robust control · Deep learning

Introduction

What Is ADP: A General Definition of ADP and Notation

Logically, ADP is that branch of applied mathematics which develops and studies general purpose methods for optimal decision or control over dynamical systems which may or may not be subject to stochastic disturbance, partial observability, nonlinearity, and complexity, for situations where exact dynamic programming (DP) is too expensive to perform directly. Many ADP methods assume the case of discrete time t, but others address the case of continuous time t. ADP tries to learn or converge to the exact DP solution as accurately and as reliably as possible, but some optimization problems are more difficult than others. In some applications, it is useful enough to start from the best available controller based on more classical methods and then use ADP iteration to improve that controller. In other applications, users of ADP have obtained results so accurate that they can claim to have designed fully robust nonlinear controllers, by solving (accurately enough) the Hamilton-Jacobi-Bellman (HJB) equation for the control task under study.

Let us write the state of a dynamical system at time t as the vector \( \underline {\mathbf {x}}(\mathrm {t})\), and the vector of observations as \( \underline {\mathbf {y}}(\mathrm {t})\). (Full observability is just the special case where \( \underline {\mathbf {y}}(\mathrm {t}) = \underline {\mathbf {x}}(\mathrm {t})\).) Let us define a policy \( \underline {\boldsymbol {\pi }}\) as a control rule, function, or procedure used to calculate the vector \( \underline {\mathbf {u}}(\mathrm {t})\) of controls (physical actions, decisions) by:
$$\displaystyle \begin{aligned} \underline{\mathbf{u}}(\mathrm{t}) = \underline{\boldsymbol{\pi}}\left(\{\underline{\mathbf{y}}(\uptau),\ \uptau \le \mathrm{t}\}\right) \end{aligned} $$
(1)

The earliest useful ADP methods (White and Sofge 1992) asked the user to supply a control rule \( \underline {\boldsymbol {\pi }}( \underline {\mathbf {y}}(\mathrm {t}), \underline {\boldsymbol {\alpha }})\) or \( \underline {\boldsymbol {\pi }}( \underline {\mathbf {R}}(\mathrm {t}), \underline {\boldsymbol {\alpha }})\), where \( \underline {\boldsymbol {\alpha }}\) is the set of parameters or weights in the control rule, and where \( \underline {\mathbf {R}}(\mathrm {t})\) may either be an estimate of the state \( \underline {\mathbf {x}}(\mathrm {t})\), or, more generally, a representation of the “belief state” which is defined as the probability density function \(\mathrm {Pr}( \underline {\mathbf {x}}(\mathrm {t})\vert {\{} \underline {\mathbf {y}}(\uptau ), \uptau \le \mathrm {t}{\}})\). These types of methods may logically be called “adaptive policy methods.” Of course, the user is free to try several candidate control rules, just as he or she might try different stochastic models when identifying an unknown plant, and to compare them to see which policy works best in simulation. Long before the term “neurodynamic programming” was coined, the earliest ADP designs included the use of universal function approximators such as neural networks to try to approximate the best possible function \( \underline {\boldsymbol {\pi }}\).
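
As a rough illustration of an adaptive policy method, the sketch below (in Python) implements a control rule \( \underline {\boldsymbol {\pi }}( \underline {\mathbf {y}}(\mathrm {t}), \underline {\boldsymbol {\alpha }})\) as a small neural network whose weight set α is what an ADP method would then tune; the function names, layer sizes, and random weights are purely hypothetical and not taken from the cited sources.

```python
import numpy as np

def policy(y, alpha):
    """A hypothetical parameterized control rule u = pi(y, alpha).

    pi is a one-hidden-layer neural network acting on the current
    observation y; alpha packs the weights (W1, b1, W2, b2).
    """
    W1, b1, W2, b2 = alpha
    h = np.tanh(W1 @ y + b1)   # hidden features
    return W2 @ h + b2         # control vector u(t)

# Example with 3 observations, 5 hidden units, 2 controls:
rng = np.random.default_rng(0)
alpha = (rng.normal(size=(5, 3)), np.zeros(5),
         rng.normal(size=(2, 5)), np.zeros(2))
u_t = policy(np.array([0.1, -0.2, 0.05]), alpha)
```

The ADP machinery then adjusts the weights in alpha (for example, by gradient ascent on an estimate of the value function) rather than recomputing the action from scratch at each time step.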

Later work on ADP developed applications in areas such as missile interception (like the work of Balakrishnan included in Lewis and Liu (2013)) and in complex logistics (like the paper by L. Werbos, Kozma, et al. on metamodeling), where it works better simply to calculate the optimal action u(t) at each time t, exploiting other aspects of the standard DP tools. In the future, hybrid methods may be developed which modify a global control law on the fly, similar to the moving-intercept trick used by McAvoy with great success in model predictive control (MPC) (White and Sofge 1992). These may be called hybrid policy methods.

The earliest ADP methods (White and Sofge 1992) focused entirely on the case of discrete time. In that case, the task of ADP is to find the policy π which maximizes (or minimizes) the expected value of:
$$\displaystyle \begin{aligned} \mathrm{J}_0\left(\underline{\mathbf{x}}(\mathrm{t})\right)=\sum_{\uptau =\mathrm{t}}^{\mathrm{T}} \frac{\mathrm{U}\left(\underline{\mathbf{x}}(\uptau),\underline{\mathbf{u}}(\uptau)\right)}{(1+\mathrm{r})^{\uptau -\mathrm{t}}}, \end{aligned} $$
(2)
where U is a utility function (or cost function), where T is a termination time (which may be infinity), where r is an interest rate (discount rate), and where future values of x and u are stochastic functions of the policy π.

As with any optimization problem or control design, it is essential for the user to think hard about what he or she really wants and what the ultimate metric of success or performance really is. The interest rate r is just as important as the utility function in expressing the ultimate goal of the exercise. It often makes sense to define U as a measure of economic value added, plus a penalty function to keep the system within a broad range of controllability. With certain plants, like the dynamically unstable SR-71 aircraft, optimal performance will sometimes take advantage of giving the system “permission” to deviate from the nominal control trajectory to some degree. To solve a very tricky nonlinear control problem, it often helps to solve it first with a larger interest rate, in order to find a good starting point, and then to “tighten the screws” or “extend the foresight horizon” by dialing r down to zero. Many ADP computer programs use a discount factor γ = 1∕(1 + r) to simplify programming, but there is a huge literature in decision analysis and economics on how to understand the interest rate r.
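
As a small worked example of Eq. 2 and of the substitution γ = 1∕(1 + r) mentioned above, the hypothetical sketch below evaluates the discounted sum along one recorded trajectory of utility values; it is an illustration only, not code from any cited toolbox.

```python
def discounted_return(utilities, r):
    """Evaluate the sum in Eq. 2 along one recorded trajectory.

    utilities[k] holds U(x(t+k), u(t+k)); r is the interest rate.
    The weight 1/(1+r)**k is exactly gamma**k with gamma = 1/(1+r).
    """
    gamma = 1.0 / (1.0 + r)
    return sum((gamma ** k) * u_k for k, u_k in enumerate(utilities))

# Three equal utilities of 1.0 with r = 0.1 (gamma ~ 0.909):
print(discounted_return([1.0, 1.0, 1.0], r=0.1))  # ~ 2.7355
```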

Continuous time versions of ADP have been pioneered by Frank Lewis (Lewis and Liu 2013), with extensions to differential games (such as two-player “adversarial” ADP offering strategies for worst case survival).

Varieties of ADP from Reinforcement Learning to Operations Research

In practice, there probably does not exist an integrated mathematical textbook on ADP which gives both full explanation and access to the full range of practical tools for the field as a whole. The reason for this is that nonlinear dynamic optimization problems abound in a wide variety of fields, from economics and psychology to control engineering and even to physics. The underlying mathematics and tools are universal, but even today it is important to reach across disciplines and across different approaches and terminologies to know what is available.

Prior to the late 1980s, it was well understood that “exact” DP and the obvious approximations to DP suffer from a severe “curse of dimensionality.” In the general case, finding an exact solution to a DP task requires calculations which grow exponentially with the number of state variables, even for seemingly small tasks. In the special case of a linear plant and a quadratic utility function, the calculations are much easier; that gave rise to the vast field of linear-quadratic optimal control, discussed in other articles of this encyclopedia. Ron Howard developed iterative dynamic programming for systems with a discrete number of states; this was rigorous and universal enough that it was disseminated from operations research to mathematical economics, but it did not help with the curse of dimensionality as such. Research in (mainly discrete) Markov decision processes (MDP) and partially observed MDP (POMDP) has also provided useful mathematical foundations for further work.

All of this work was well grounded in the concept of a cardinal utility function, defined in the classic book Theory of Games and Economic Behavior by von Neumann and Morgenstern. It started from the classic work of Richard Bellman, most notably the Bellman equation, to be presented in the next section. von Neumann’s concept had deep roots in utilitarian economics and in Aristotle’s concept of “telos”; in fact, there are striking parallels between modern discussions of how to formulate a utility function and Aristotle’s discussion of “happiness” in his Nicomachean Ethics.

In the same early period, there was substantial independent interest in a special case of ADP called “reinforcement learning” in psychology and in artificial intelligence in computer science. Unfortunately, the same term, “reinforcement learning,” was applied to many other things as well. For example, it was sometimes used as an umbrella term for all types of optimization method, including static optimization, performed in a heuristic manner.

The original concept of “reinforcement learning” came from B.F. Skinner and his many followers, who worked to develop mathematical models to describe animal learning. Instead of a utility function, “U,” they developed models of animals as systems which learn to maximize a reinforcement signal, just “r” or “r(t),” representing the reward or punishment (pain or pleasure) experienced by the animal. Even to this day, followers of that tradition sometimes use that notation. The work of Klopf in the tradition of Skinner was a major driver of the classic 1983 paper by Barto, Sutton, and Anderson in the IEEE Transactions on Systems, Man, and Cybernetics (SMC). This work implicitly assumed that the utility function U is not known as a function and that we only have access to its value at any time.

Many years ago, a chemical engineer said in a workshop: “I really don’t want to study all these complex mathematical details. What I really want is a black box, which I can hook up to my sensors and my actuators, and to a real time measure of performance which I can provide. I really want a black box which could do all the work of figuring out the relations between sensors and actuators, and how to maximize performance over time.” The task of reinforcement learning was seen as the task of how to design that kind of black box.

The early pioneers of artificial intelligence (AI) were well aware of reinforcement learning. For example, Marvin Minsky (in the book Computers and Thought) argued that new methods for reinforcement learning would be the path to true brain-like general artificial intelligence. But no such methods were available. Minsky recognized that all of the early approaches to reinforcement learning suffered from a severe curse of dimensionality and could not begin to handle the kind of complexity which brains learn to handle. AI and computer science mostly gave up on this approach at that time, just as it gave up on neural networks.

In a way, the paralysis ended in 1987, when Sutton read my paper in SMC describing how the curse of dimensionality could be overcome by treating the reinforcement learning problem as a special case of ADP and by using a combination of neural networks, backpropagation, and new approximation methods. That led to many discussions and to an NSF workshop in 1988, which for the first time brought together many of the relevant disciplines and tried to provide an integration of tools and applications. The book from that workshop, Neural Networks for Control, edited by Miller, Sutton, and Werbos, is still useful today. However, later workshops went further and deeper; references White and Sofge (1992) and Lewis and Liu (2013), led by the control community, provide the best comprehensive starting point for what is available from the various disciplines.

Within the relevant disciplines, there has been some important fundamental extension since the publication of Lewis and Liu (2013). The most famous extension by far was the triumph of the AlphaGo systems which learned to defeat the best human champion in Go. At the heart of that success was the use of an ADP method in the heuristic dynamic programming (HDP) group of methods, to be discussed in the next section, as tuned and applied by Sutton. However, success in a game of that complexity also depended on combining HDP with complementary techniques, which are actually being assimilated in engineering applications such as self-driving cars, pioneered at the Chinese Academy of Sciences Institute of Automation (CASIA), which is the center of China’s major new push in “the new AI.”

There has also been remarkable progress in ADP within the control field proper, especially in applications and in stability theory. Building on Lewis and Liu (2013), CASIA and Derong Liu have spawned a wide range of new applications. The new book by Liu is probably a good source, but I am not aware of a single good review of all the more recent stability work across the world. Some commentators on “the new AI” have said that China is so far ahead in engineering kinds of applications relevant to the Internet of Things that Western reviews often fail to appreciate just how much of the new technology they are missing. Off-the-shelf software for ADP tends to reflect the limited set of methods best known in computer science, simply because computer scientists tend to have more interest than engineers do in open-source software development (and in dissemination which may benefit competitors).

Within the field of operations research, Warren Powell has developed a comprehensive framework and textbook well-informed by discussions across the disciplines and by a wide range of applications from logistics to battery management.

Back in psychology, the original home of reinforcement learning, progress in developing mathematical models powerful enough to be useful in engineering has been limited because of cultural problems between disciplines, similar to what held back reinforcement learning in general before 1987. However, a new path has opened up to connect neuropsychology with ADP which brings us closer to being able to truly understand the wiring and dynamics of the learning systems of the brain (Werbos and Davis 2016).

Mathematical Taxonomy of ADP Methods in General

There are many ways to classify the vast range of ADP tools and applications developed since 1990. Nevertheless, there are few if any ADP systems in use which do not fall into the sixfold classification given in White and Sofge (1992). Here, I will summarize these six types of ADP method and then mention extensions.

Heuristic Dynamic Programming (HDP)

HDP is the most direct method for approximating the Bellman equation, the foundation of exact dynamic programming. There are many legitimate ways to write that equation; one is:
$$\displaystyle \begin{aligned} \mathrm{J}\left(\underline{\mathbf{x}}(\mathrm{t})\right)&=\max_{\underline{\mathbf{u}}(\mathrm{t})} \Big(\mathrm{U}\left(\underline{\mathbf{x}}(\mathrm{t}),\underline{\mathbf{u}}(\mathrm{t})\right)\\ &\quad +\left\langle \mathrm{J}\left(\underline{\mathbf{x}}(\mathrm{t}+1)\right)/(1+\mathrm{r})\right\rangle \Big) \end{aligned} $$
(3)
where angle brackets denote the expectation value. When \( \underline {\mathbf {x}}\) is governed by a fully observable stochastic process (MDP), and when r > 0 or T is finite, it is well known that a function J which solves this equation will exist. (See White and Sofge (1992) for the simple additional term needed to guarantee the existence of J for r = 0 and T = ∞, the case which some of us try to do justice to in our own decision-making.) To calculate the optimal policy of action, one tries to find an exact solution for this function J and then chooses actions according to the (static) optimization problem over \( \underline {\mathbf {u}}\) shown in Eq. 3. Because J may in principle be any nonlinear function, early users would often try to develop a lookup-table approximation to J, whose size grows exponentially with the number of state variables.
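
To make the recursion in Eq. 3 concrete, here is a minimal lookup-table sketch for a small, fully observed MDP, the brute-force approach whose cost explodes as the number of state variables grows; the arrays P and U and all names below are hypothetical placeholders, not code from the cited handbooks.

```python
import numpy as np

def value_iteration(P, U, r, n_iter=200):
    """Solve Eq. 3 by repeated backups on a lookup-table J.

    P[a] is an (n_states x n_states) transition matrix for action a,
    U[s, a] is the utility of action a in state s, and r > 0 is the
    interest rate. Returns J and the greedy policy it implies.
    """
    n_states, n_actions = U.shape
    J = np.zeros(n_states)
    for _ in range(n_iter):
        # Q[s, a] = U(s, a) + <J(x(t+1))> / (1 + r)
        Q = np.stack([U[:, a] + P[a] @ J / (1.0 + r)
                      for a in range(n_actions)], axis=1)
        J = Q.max(axis=1)
    return J, Q.argmax(axis=1)
```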

The key idea in HDP is to model the function J in a parameterized way, just as one models stochastic processes when identifying their dynamics. That idea was developed in some detail in my 1987 paper in SMC, but was also described in many papers before that. The treatment in White and Sofge (1992) describes in general terms how to train or estimate the parameters or weights \( \underline {\mathbf {W}}\) in any model of J, \(\hat {\mathrm {J}}( \underline {\mathbf {R}}(\mathrm {t}), \underline {\mathbf {W}})\), supplied by a user. A user who actually understands the dynamics of the plant well enough may choose to try different models of J and see how they perform. But the vast majority of applications today model J by use of a universal approximation function, such as a simple feedforward neural network. Andrew Barron has proven that the required complexity of such an approximation grows far more slowly, as the number of variables grows, than it would when more traditional linear approximation schemes such as lookup tables or Taylor series are used. Ilin, Kozma, and Werbos have shown that approximation is still better when an upgraded type of recurrent neural network is used instead; in the example of generalized maze navigation, a feedforward approximator such as a convolutional neural network simply does not work. In general, unless the model of J is a simple lookup table or linear model, the training requires the use of generalized backpropagation, which provides the required derivatives in exact closed form at minimum computational cost (White and Sofge 1992; Werbos 2005).
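
A minimal sketch of one HDP training step for a parameterized critic follows; purely for brevity it uses a critic that is linear in user-supplied features (a neural network trained by backpropagation is the usual choice), and the simple semi-gradient update and all names are illustrative assumptions rather than the exact procedure of the cited handbooks.

```python
import numpy as np

def hdp_critic_step(W, phi, R_t, R_next, U_t, r, lr=0.01):
    """One HDP update of critic weights W, with J_hat(R) = W . phi(R).

    phi     : feature map (plays the role of a network's hidden layer)
    R_t     : current state estimate, R_next : successor state estimate
    U_t     : observed utility U(x(t), u(t))
    The target U_t + J_hat(R_next)/(1+r) is held fixed for this step.
    """
    target = U_t + W @ phi(R_next) / (1.0 + r)
    error = W @ phi(R_t) - target
    return W - lr * error * phi(R_t)   # gradient step on 0.5*error**2

# Example with trivial identity features on a 2-dimensional state:
W = np.zeros(2)
W = hdp_critic_step(W, phi=lambda R: np.asarray(R),
                    R_t=[0.0, 1.0], R_next=[1.0, 0.0], U_t=0.5, r=0.1)
```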

In his famous paper on TD(λ) in 1987, Sutton proposed a generalization of HDP to include a kind of approximation quality factor λ which could be used to get convergence albeit to a less exact solution. I am not aware of cases where that particular approximation was useful or important, but with very tricky nonlinear decision problems, it can be helpful to have a graded series of problems to allow convergence to an optimal solution of the problem of interest; see the discussion of “shaping” in White and Sofge (1992). There is a growing literature on “transfer learning” which provides ever more rigorous methods for transferring solutions from simpler problems to more difficult ones, but even a naive reuse of \(\hat {\mathrm {J}}( \underline {\mathbf {R}}(\mathrm {t}), \underline {\mathbf {W}})\) and of π from one plant model to another is often a very effective strategy, as Jay Farrell showed long ago in aerospace control examples.

In getting full value from ADP, it helps to have a deep understanding of what the function \(\hat {\mathrm {J}}( \underline {\mathbf {R}}(\mathrm {t}), \underline {\mathbf {W}})\) represents, across many disciplines (White and Sofge 1992; Lewis and Liu 2013). More and more, the world has come to call this function the “value function.” (This is something of a misnomer, however, as you will see in the discussion of Dual Heuristic Programming (DHP) below.) It is often denoted as “V” in computer science. It is often called a “Critic” in the engineering literature, following an early seminal paper by Widrow.

In control theory, Eq. 3 is often called the “Hamilton-Jacobi-Bellman” equation, because it is the stochastic generalization of the older Hamilton-Jacobi equation of enduring importance in fundamental physics. When the control design assumes no stochastic disturbance, it really is the older equation that one is solving. However, even in the deterministic cases, theorems like those of Baras require solution of the full Bellman equation (or at least the limit of it as noise goes to zero) in order to design a truly rigorous robust controller for the general nonlinear case. From that viewpoint, ADP is simply a useful toolbox to perform the calculations needed to implement nonlinear robust control in the general case.
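
For reference, in the deterministic continuous-time case with dynamics \(\dot { \underline {\mathbf {x}}} = \underline {\mathbf {f}}( \underline {\mathbf {x}}, \underline {\mathbf {u}})\) and continuous discount rate r, the equation being approximated takes the familiar HJB form below, written here in standard textbook style rather than quoted from the cited sources:
$$\displaystyle \begin{aligned} \mathrm{r}\,\mathrm{J}(\underline{\mathbf{x}}) = \max_{\underline{\mathbf{u}}} \left[ \mathrm{U}(\underline{\mathbf{x}}, \underline{\mathbf{u}}) + \nabla_{\underline{\mathbf{x}}}\mathrm{J}(\underline{\mathbf{x}}) \cdot \underline{\mathbf{f}}(\underline{\mathbf{x}}, \underline{\mathbf{u}}) \right] \end{aligned} $$
Setting r = 0 recovers the older Hamilton-Jacobi form mentioned above.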

Action-Dependent HDP (ADHDP), aka “Q Learning”

This family of ADP methods results from modeling a new value function J’ (or “Q”) defined by:
$$\displaystyle \begin{aligned} \mathrm{J}^{\prime}\left(\underline{\mathbf{x}}(\mathrm{t}), \underline{\mathbf{u}}(\mathrm{t})\right) &= \mathrm{U}\left(\underline{\mathbf{x}}(\mathrm{t}), \underline{\mathbf{u}}(\mathrm{t})\right)\\ &\quad + \left\langle \mathrm{J}\left(\underline{\mathbf{x}}(\mathrm{t}+1)\right)/(1+\mathrm{r})\right\rangle \end{aligned} $$
(4)

One may develop a recurrence relation similar to (3) simply by substituting this into (3); the general method for estimating \(\hat {\mathrm {J}}^{\prime }( \underline {\mathbf {R}}(\mathrm {t}), \underline {\mathrm {u}}(\mathrm {t}), \underline {\mathbf {W}})\) is given in White and Sofge (1992), along with concrete practical applications from McDonnell Douglas in reconfigurable flight control and in low-cost mass production of carbon-carbon composite parts (a breakthrough important to the later development of the Dreamliner airplane after Boeing bought McDonnell Douglas and assimilated the technology). This type of value function is more complex than the HDP value function, because it takes the action as an additional argument, but the method it supports is simpler in practice, because actions can be chosen without a model of the plant; that simplicity has value in some applications. One might argue that a truly general system like a brain would have an optimal hybrid of these two methods and more.
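
A minimal sketch of the value-learning step implied by Eq. 4 is shown below in its best-known tabular “Q learning” form; the parameterized ADHDP case replaces the table with a model \(\hat {\mathrm {J}}^{\prime }( \underline {\mathbf {R}}(\mathrm {t}), \underline {\mathrm {u}}(\mathrm {t}), \underline {\mathbf {W}})\) trained toward the same kind of target. The names and the toy example are hypothetical.

```python
def q_update(Q, s, u, U_su, s_next, r, lr=0.1):
    """One tabular update toward the Eq. 4 recurrence.

    Q[s][u] approximates J'(s, u); U_su is the observed utility U(s, u).
    The target uses the greedy action at the successor state s_next.
    """
    target = U_su + max(Q[s_next].values()) / (1.0 + r)
    Q[s][u] += lr * (target - Q[s][u])
    return Q

# Example on a two-state, two-action toy problem:
Q = {s: {u: 0.0 for u in ("a", "b")} for s in (0, 1)}
Q = q_update(Q, s=0, u="a", U_su=1.0, s_next=1, r=0.1)
```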

Dual Heuristic Programming (DHP, Dual HDP)

DHP uses yet another type of value function \( \underline {\boldsymbol {\lambda }}\) defined by:
$$\displaystyle \begin{aligned} \underline{\boldsymbol{\lambda}}\left(\underline{\mathbf{x}}(\mathrm{t})\right) = \nabla_{\underline{\mathbf{x}}}\,\mathrm{J}\left(\underline{\mathbf{x}}(\mathrm{t})\right) \end{aligned} $$
(5)

A convergent, consistent method for training this kind of value function is given in White and Sofge (1992), but many readers would find it useful to study the details and explanation by Balakrishnan and by Wunsch et al., reviewed in Lewis and Liu (2013). \( \underline {\boldsymbol {\lambda }}\) is a generalization of the dual variables well-known in economics, in operations research, and in control. The method given in White and Sofge (1992) is essentially a stochastic generalization of the Pontryagin equation. Because DHP entails a rich and detailed flow of feedback (a vector of derivative values rather than a single scalar) from one iteration to the next, it performs better in the face of more and more complexity, as discussed in theoretical terms in White and Sofge (1992) and shown concretely in simulation studies by Wunsch and Prokhorov reviewed in Lewis and Liu (2013). DHP value learning was the basis for Balakrishnan’s breakthrough in performance in hit-to-kill missile interception (Lewis and Liu 2013), which is possibly better known and understood in China than in the USA for political reasons.
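
A minimal sketch of the DHP target construction in the deterministic, fully observed case is given below, assuming a differentiable plant model x(t + 1) = f(x(t), u(t)); for brevity it omits the terms that propagate through the policy’s dependence on the state (which vanish at an optimal interior action), and all names are hypothetical rather than taken from the cited sources.

```python
import numpy as np

def dhp_target(lambda_hat, x_t, u_t, x_next, dU_dx, df_dx, r):
    """Pontryagin-style target for the DHP critic lambda_hat(x) ~ grad_x J.

    dU_dx(x, u) : gradient of the utility with respect to the state
    df_dx(x, u) : Jacobian of the plant model f with respect to the state
    Target: dU/dx + (df/dx)^T lambda_hat(x(t+1)) / (1 + r).
    The critic weights are then nudged so lambda_hat(x_t) moves toward it.
    """
    return dU_dx(x_t, u_t) + df_dx(x_t, u_t).T @ lambda_hat(x_next) / (1.0 + r)
```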

Action-Dependent DHP, Globalized HDP (GDHP), and Action-Dependent GDHP

For reasons of length, I will not describe these three other groups discussed in White and Sofge (1992) in detail. The action-dependent version of DHP may have uses in training time-lagged recurrent networks in real time, and GDHP offers the highest performance and generality at the cost of some complexity. (In applications so far, DHP has performed as well as GDHP. In the absence of noise, neural model predictive control (White and Sofge 1992) has also performed well and is sometimes considered a (direct) type of reinforcement learning.)

Future Directions

Extensions to Spatial Complexity, Temporal Complexity, and Creativity

In the late 1990s, major extensions of ADP were designed and even patented to cope with much greater degrees of complexity, by adaptively making use of multiple time intervals and spatial structure (“chunking”) and by addressing the inescapable problem of local minima more effectively. See Werbos (2014) for a review of research steps needed to create systems which can cope with complexity as well as a mouse brain, for example. Realistically, however, the main challenge to ADP at present is to prove optimality under rigorous prior assumptions about “vector” plants (as in “vector intelligence”), considering systems which learn plant dynamics and optimal policy together in real time. It is possible that the more advanced systems may actually be dangerous if widely deployed before they are fully understood. It is amusing how movies like “Terminator 2” and “Terminator 3” may be understood as a (realistic) presentation of possible modes of instability in more complex nonlinear adaptive systems. In July 2014, the National Science Foundation of China and the relevant Dean at Tsinghua University asked me to propose a major new joint research initiative based on Werbos (2014), with self-driving cars as a major testbed; however, the NSF of the USA responded at that time by terminating that activity.

Bibliography

  1. Lewis FL, Liu D (eds) (2013) Reinforcement learning and approximate dynamic programming for feedback control, vol 17. Wiley (IEEE Series), New York
  2. Werbos PJ (2005) Backwards differentiation in AD and neural nets: past links and new opportunities. In: Bucker M, Corliss G, Hovland P, Naumann, Norris B (eds) Automatic differentiation: applications, theory and implementations. Springer, New York
  3. Werbos PJ (2014) From ADP to the brain: foundations, roadmap, challenges and research priorities. In: Proceedings of the international joint conference on neural networks 2014. IEEE, New Jersey. https://arxiv.org/abs/1404.0554
  4. Werbos PJ, Davis JJ (2016) Regular cycles of forward and backward signal propagation in prefrontal cortex and in consciousness. Front Syst Neurosci 10:97. https://doi.org/10.3389/fnsys.2016.00097
  5. White DA, Sofge DA (eds) (1992) Handbook of intelligent control: neural, fuzzy, and adaptive approaches. Van Nostrand Reinhold, New York

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2020

Authors and Affiliations

  1. National Science Foundation, Arlington, USA

Section editors and affiliations

  • Thomas Parisini
  1. Electrical and Electronic Engineering, Imperial College, South Kensington Campus, London, United Kingdom