Abstract
There are over 15 distinct communities that work in the general area of sequential decisions and information, often referred to as decisions under uncertainty or stochastic optimization. We focus on two of the most important fields: stochastic optimal control, with its roots in deterministic optimal control, and reinforcement learning, with its roots in Markov decision processes. Building on prior work, we describe a unified framework that covers all of these communities and note the strong parallels with the modeling framework of stochastic optimal control. By contrast, we make the case that the modeling framework of reinforcement learning, inherited from discrete Markov decision processes, is quite limited. Our framework (and that of stochastic control) is based on the core problem of optimizing over policies. We describe four classes of policies that we claim are universal, and show that each of these two fields has, in its own way, evolved to include examples of each of the four classes.
Notes
1. It is not unusual for people to overlook the need to include beliefs in the state variable. The RL tutorial [23] does this when it presents the multi-armed bandit problem, insisting that it does not have a state variable (see slide 49). In fact, any bandit problem is a sequential decision problem whose state variable is the belief (which may be Bayesian or frequentist). This has long been recognized by the probability community, which has worked on bandit problems since the 1950s (see the seminal text [14]). Bellman's equation (using belief states) was fundamental to the development of Gittins indices in [16] (see [17] for a nice introduction to this rich area of research). It was the concept of Gittins indices that laid the foundation for upper-confidence bounding, which is just a different form of index policy.
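The point above — that a bandit's state is the belief, and that upper-confidence bounding is an index policy computed from that belief — can be made concrete with a minimal sketch. This is not code from the chapter: the arm count, the frequentist belief (sample means and counts), and the UCB1-style bonus constant are all illustrative choices.

```python
import math
import random

def ucb_index(mean, n_pulls, t, c=2.0):
    """Index = estimated mean + exploration bonus (a UCB1-style rule)."""
    if n_pulls == 0:
        return float("inf")  # force at least one pull of each arm
    return mean + math.sqrt(c * math.log(t) / n_pulls)

def run_bandit(true_means, horizon, seed=0):
    """Play a Gaussian bandit with an index policy on a frequentist belief state."""
    rng = random.Random(seed)
    k = len(true_means)
    means = [0.0] * k   # belief state: sample mean per arm
    counts = [0] * k    # belief state: number of observations per arm
    total = 0.0
    for t in range(1, horizon + 1):
        # The policy is a function of the belief state (means, counts),
        # not of any physical system state.
        arm = max(range(k), key=lambda a: ucb_index(means[a], counts[a], t))
        reward = true_means[arm] + rng.gauss(0.0, 1.0)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # recursive mean update
        total += reward
    return means, counts, total

means, counts, _ = run_bandit([0.2, 0.8], horizon=2000)
print(counts)  # the better arm is pulled far more often
```

The recursive update of `(means, counts)` is exactly a state-transition function for the belief; a Bayesian version would carry posterior parameters instead, and a Gittins-index policy would replace `ucb_index` with a different index computed from the same belief state.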
References
Astrom, K.J.: Introduction to Stochastic Control Theory. Dover Publications, Mineola (1970)
Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957)
Bellman, R.E., Glicksberg, I., Gross, O.: On the optimal inventory equation. Manage. Sci. 1, 83–104 (1955)
Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: The Discrete Time Case. Academic, New York (1978)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
Bertsekas, D.P., Tsitsiklis, J.N., Wu, C.: Rollout algorithms for combinatorial optimization. J. Heuristics 3(3), 245–262 (1997)
Bouzaiene-Ayari, B., Cheng, C., Das, S., Fiorillo, R., Powell, W.B.: From single commodity to multiattribute models for locomotive optimization: a comparison of optimal integer programming and approximate dynamic programming. Transp. Sci. 50(2), 1–24 (2016)
Browne, C.B., Powley, E.J., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–49 (2012)
Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)
Camacho, E., Bordons, C.: Model Predictive Control. Springer, London (2003)
Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1), 126–139 (2005)
Cinlar, E.: Probability and Stochastics. Springer, New York (2011)
Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, pp. 72–83. Springer, Berlin (2007)
DeGroot, M.H.: Optimal Statistical Decisions. Wiley, Hoboken (1970)
Fu, M.C.: Markov decision processes, AlphaGo, and Monte Carlo tree search: back to the future. TutORials Oper. Res., pp. 68–88 (2017)
Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Gani, J. (ed.) Progress in Statistics, pp. 241–266. North Holland, Amsterdam (1974)
Gittins, J., Glazebrook, K.D., Weber, R.R.: Multi-Armed Bandit Allocation Indices. Wiley, New York (2011)
Rossiter, J.A.: Model-Based Predictive Control. CRC Press, Boca Raton (2004)
Jiang, D.R., Pham, T.V., Powell, W.B., Salas, D.F., Scott, W.R.: A comparison of approximate dynamic programming techniques on benchmark energy storage problems: does anything work? In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 1–8. IEEE, Orlando, FL (2014)
Kirk, D.E.: Optimal Control Theory: An Introduction. Dover, New York (2004)
Kothare, M.V., Balakrishnan, V., Morari, M.: Robust constrained model predictive control using linear matrix inequalities. Automatica 32(10), 1361–1379 (1996)
Kushner, H.: Introduction to Stochastic Control. Holt, Rinehart and Winston, New York (1971)
Lazaric, A.: Introduction to Reinforcement Learning (2019). https://www.irit.fr/cimi-machine-learning/sites/irit.fr.CMS-DRUPAL7.cimi_ml/files/pictures/lecture1_bandits.pdf
Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
Lewis, F.L., Vrabie, D., Syrmos, V.L.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
Maxwell, M.S., Henderson, S.G., Topaloglu, H.: Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stoch. Syst. 3(2), 322–361 (2013)
Murray, J.J., Member, S., Cox, C.J., Lendaris, G.G., Fellow, L., Saeks, R.: Adaptive dynamic programming. IEEE Trans. Syst., Man, Cybern.-Part C Appl. Rev. 32(2), 140–153 (2002)
Nisio, M.: Stochastic Control Theory: Dynamic Programming Principle. Springer, New York (2014)
Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd edn. Wiley, Hoboken (2011)
Powell, W.B.: Clearing the jungle of stochastic optimization. In: Bridging Data and Decisions, TutORials in Operations Research, pp. 109–137 (2014)
Powell, W.B.: A unified framework for stochastic optimization. Eur. J. Oper. Res. 275(3), 795–821 (2019)
Powell, W.B.: Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions. Wiley, Hoboken (2022)
Powell, W.B., Frazier, P.I.: Optimal learning. TutORials Oper. Res., pp. 213–246 (2008)
Powell, W.B., Meisel, S.: Tutorial on stochastic optimization in energy - part II: an energy storage illustration. IEEE Trans. Power Syst. (2016)
Powell, W.B., Ryzhov, I.O.: Optimal Learning. Wiley, Hoboken (2012)
Puterman, M.L.: Markov Decision Processes, 2nd edn. Wiley, Hoboken (2005)
Rakovic, S.V., Kouvaritakis, B., Cannon, M., Panos, C., Findeisen, R.: Parameterized tube model predictive control. IEEE Trans. Autom. Control 57(11), 2746–2761 (2012)
Recht, B.: A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control., Robot., Auton. Syst. 2(1), 253–279 (2019)
Sethi, S.P.: Optimal Control Theory: Applications to Management Science and Economics, 3rd edn. Springer, Boston (2019)
Si, J., Barto, A.G., Powell, W.B., Wunsch, D.: Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press (2004)
Simão, H., Day, J., George, A.P., Gifford, T., Nienow, J., Powell, W.B.: An approximate dynamic programming algorithm for large-scale fleet management: a case application. Transp. Sci. (2009)
Sontag, E.: Mathematical Control Theory, 2nd edn., pp. 1–544. Springer, Berlin (1998)
Spall, J.C.: Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Wiley, Hoboken (2003)
Stengel, R.F.: Stochastic Optimal Control: Theory and Application. Wiley, Hoboken (1986)
Stengel, R.F.: Optimal Control and Estimation. Dover Publications, New York (1994)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT, Cambridge (1998)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)
Vrabie, D., Lewis, F.: Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Netw. 22(3), 237–246 (2009)
Werbos, P.J.: Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University (1974)
Yong, J., Zhou, X.Y.: Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York (1999)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Powell, W.B. (2021). From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60989-4
Online ISBN: 978-3-030-60990-0