
From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions

Handbook of Reinforcement Learning and Control

Part of the book series: Studies in Systems, Decision and Control ((SSDC,volume 325))

Abstract

There are over 15 distinct communities that work in the general area of sequential decisions and information, often referred to as decisions under uncertainty or stochastic optimization. We focus on two of the most important fields: stochastic optimal control, with its roots in deterministic optimal control, and reinforcement learning, with its roots in Markov decision processes. Building on prior work, we describe a unified framework that covers all of these communities and note the strong parallels with the modeling framework of stochastic optimal control. By contrast, we make the case that the modeling framework of reinforcement learning, inherited from discrete Markov decision processes, is quite limited. Our framework (and that of stochastic control) is based on the core problem of optimizing over policies. We describe four classes of policies that we claim are universal and show that each of these two fields has, in its own way, evolved to include examples of each of these four classes.
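
The idea of optimizing directly over policies can be made concrete with a small simulation. The sketch below is a minimal illustration and not taken from the chapter: it assumes a simple order-up-to inventory policy whose two parameters are tuned by simulating the objective, i.e., it searches the policy space for the best value of max over theta of E F(theta, W). All names (simulate_policy, the parameter grid, prices, and costs) are illustrative assumptions, not the authors' notation.

    # A minimal sketch (assumptions mine) of "optimizing over policies":
    # a parameterized order-up-to inventory rule X^pi(S_t | theta) whose
    # parameters theta are tuned by simulating the expected objective.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_policy(theta, T=100, n_reps=20, price=4.0, cost=2.0, hold=0.1):
        """Estimate the expected total profit of an order-up-to policy."""
        low, up = theta
        total = 0.0
        for _ in range(n_reps):
            inventory, profit = 0.0, 0.0
            for _ in range(T):
                if inventory < low:              # decision rule X^pi(S_t)
                    order = up - inventory
                    profit -= cost * order
                    inventory = up
                demand = rng.poisson(5)          # exogenous information W_{t+1}
                sales = min(inventory, demand)
                profit += price * sales - hold * (inventory - sales)
                inventory -= sales               # transition to S_{t+1}
            total += profit
        return total / n_reps

    # Direct search over theta = (reorder point, order-up-to level)
    candidates = [(low, up) for low in range(0, 10)
                  for up in range(5, 20) if up > low]
    best = max(candidates, key=simulate_policy)
    print("best (reorder point, order-up-to level):", best)

Policy function approximations of this type are one of the four classes of policies discussed in the chapter; the other classes differ in how the decision rule is constructed, not in the underlying problem of searching over policies.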


Notes

  1.

    It is not unusual for people to overlook the need to include beliefs in the state variable. The RL tutorial [23] does this when it presents the multi-armed bandit problem, insisting that it does not have a state variable (see slide 49). In fact, any bandit problem is a sequential decision problem where the state variable is the belief (which can be Bayesian or frequentist). This has long been recognized by the probability community that has worked on bandit problems since the 1950s (see the seminal text [14]). Bellman’s equation (using belief states) was fundamental to the development of Gittins indices in [16] (see [17] for a nice introduction to this rich area of research). It was the concept of Gittins indices that laid the foundation for upper-confidence bounding, which is just a different form of index policy.
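
    To make the point concrete, the following sketch (a minimal illustration, not taken from the tutorial or the chapter; the problem data and variable names are assumptions) maintains a frequentist belief state for a multi-armed bandit and applies a UCB-style index policy. The state variable is exactly the pair of arrays holding the counts and sample means, updated after every observation, which is what makes the bandit a sequential decision problem.

        # Minimal sketch (assumptions mine): the belief B_t = (sample means, counts)
        # is the state variable of the bandit problem, and UCB is an index policy
        # that maps that belief state to a decision.
        import numpy as np

        rng = np.random.default_rng(1)
        true_means = np.array([0.2, 0.5, 0.65])   # unknown to the agent
        K, T = len(true_means), 2000

        counts = np.zeros(K)                      # N_k: observations per arm
        means = np.zeros(K)                       # mu_bar_k: sample means
        # Together, counts and means form the state S_t = B_t.

        for t in range(T):
            if t < K:
                arm = t                           # play each arm once to initialize
            else:
                ucb = means + np.sqrt(2.0 * np.log(t) / counts)  # index from the belief
                arm = int(np.argmax(ucb))
            reward = rng.normal(true_means[arm], 0.1)            # exogenous information W_{t+1}
            counts[arm] += 1                      # belief transition S_{t+1} = S^M(S_t, x_t, W_{t+1})
            means[arm] += (reward - means[arm]) / counts[arm]

        print("pulls per arm:", counts, "belief:", np.round(means, 3))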

References

  1. Astrom, K.J.: Introduction to Stochastic Control Theory. Dover Publications, Mineola (1970)

  2. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957)

  3. Bellman, R.E., Glicksberg, I., Gross, O.: On the optimal inventory equation. Manage. Sci. 1, 83–104 (1955)

  4. Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: The Discrete Time Case. Academic, New York (1978)

  5. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)

  6. Bertsekas, D.P., Tsitsiklis, J.N., Wu, C.: Rollout algorithms for combinatorial optimization. J. Heuristics 3(3), 245–262 (1997)

  7. Bouzaiene-Ayari, B., Cheng, C., Das, S., Fiorillo, R., Powell, W.B.: From single commodity to multiattribute models for locomotive optimization: a comparison of optimal integer programming and approximate dynamic programming. Transp. Sci. 50(2), 1–24 (2016)

  8. Browne, C.B., Powley, E.J., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–49 (2012)

  9. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)

  10. Camacho, E., Bordons, C.: Model Predictive Control. Springer, London (2003)

  11. Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1), 126–139 (2005)

  12. Cinlar, E.: Probability and Stochastics. Springer, New York (2011)

  13. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: Computers and Games, pp. 72–83. Springer, Berlin (2007)

  14. DeGroot, M.H.: Optimal Statistical Decisions. Wiley, Hoboken (1970)

  15. Fu, M.C.: Markov decision processes, AlphaGo, and Monte Carlo tree search: back to the future. TutORials Oper. Res., pp. 68–88 (2017)

  16. Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Gani, J. (ed.) Progress in Statistics, pp. 241–266. North Holland, Amsterdam (1974)

  17. Gittins, J., Glazebrook, K.D., Weber, R.R.: Multi-Armed Bandit Allocation Indices. Wiley, New York (2011)

  18. Rossiter, J.A.: Model-Based Predictive Control. CRC Press, Boca Raton (2004)

  19. Jiang, D.R., Pham, T.V., Powell, W.B., Salas, D.F., Scott, W.R.: A comparison of approximate dynamic programming techniques on benchmark energy storage problems: does anything work? In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 1–8. IEEE, Orlando, FL (2014)

  20. Kirk, D.E.: Optimal Control Theory: An Introduction. Dover, New York (2004)

  21. Kothare, M.V., Balakrishnan, V., Morari, M.: Robust constrained model predictive control using linear matrix inequalities. Automatica 32(10), 1361–1379 (1996)

  22. Kushner, H.: Introduction to Stochastic Control. Holt, Rinehart and Winston, New York (1971)

  23. Lazaric, A.: Introduction to Reinforcement Learning (2019). https://www.irit.fr/cimi-machine-learning/sites/irit.fr.CMS-DRUPAL7.cimi_ml/files/pictures/lecture1_bandits.pdf

  24. Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)

  25. Lewis, F.L., Vrabie, D., Syrmos, V.L.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)

  26. Maxwell, M.S., Henderson, S.G., Topaloglu, H.: Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stoch. Syst. 3(2), 322–361 (2013)

  27. Murray, J.J., Cox, C.J., Lendaris, G.G., Saeks, R.: Adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 32(2), 140–153 (2002)

  28. Nisio, M.: Stochastic Control Theory: Dynamic Programming Principle. Springer, New York (2014)

  29. Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd edn. Wiley, Hoboken (2011)

  30. Powell, W.B.: Clearing the jungle of stochastic optimization. In: Bridging Data and Decisions, pp. 109–137 (2014)

  31. Powell, W.B.: A unified framework for stochastic optimization. Eur. J. Oper. Res. 275(3), 795–821 (2019)

  32. Powell, W.B.: Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions, Princeton

  33. Powell, W.B., Frazier, P.I.: Optimal learning. TutORials Oper. Res., pp. 213–246 (2008)

  34. Powell, W.B., Meisel, S.: Tutorial on stochastic optimization in energy - part II: an energy storage illustration. IEEE Trans. Power Syst. (2016)

  35. Powell, W.B., Ryzhov, I.O.: Optimal Learning (2012)

  36. Puterman, M.L.: Markov Decision Processes, 2nd edn. Wiley, Hoboken (2005)

  37. Rakovic, S.V., Kouvaritakis, B., Cannon, M., Panos, C., Findeisen, R.: Parameterized tube model predictive control. IEEE Trans. Autom. Control 57(11), 2746–2761 (2012)

  38. Recht, B.: A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control Robot. Auton. Syst. 2(1), 253–279 (2019)

  39. Sethi, S.P.: Optimal Control Theory: Applications to Management Science and Economics, 3rd edn. Springer, Boston (2019)

  40. Si, J., Barto, A.G., Powell, W.B., Wunsch, D.: Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press (2004)

  41. Simão, H., Day, J., George, A.P., Gifford, T., Nienow, J., Powell, W.B.: An approximate dynamic programming algorithm for large-scale fleet management: a case application. Transp. Sci. (2009)

  42. Sontag, E.: Mathematical Control Theory, 2nd edn. Springer, Berlin (1998)

  43. Spall, J.C.: Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Wiley, Hoboken (2003)

  44. Stengel, R.F.: Stochastic Optimal Control: Theory and Application. Wiley, Hoboken (1986)

  45. Stengel, R.F.: Optimal Control and Estimation. Dover Publications, New York (1994)

  46. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

  47. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)

  48. Vrabie, D., Lewis, F.: Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Netw. 22(3), 237–246 (2009)

  49. Werbos, P.J.: Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University (1974)

  50. Yong, J., Zhou, X.Y.: Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York (1999)


Author information

Corresponding author: Warren B. Powell.


Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Powell, W.B. (2021). From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_3
