
From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions

Handbook of Reinforcement Learning and Control

Part of the book series: Studies in Systems, Decision and Control ((SSDC,volume 325))

Abstract

There are over 15 distinct communities that work in the general area of sequential decisions and information, often referred to as decisions under uncertainty or stochastic optimization. We focus on two of the most important fields: stochastic optimal control, with its roots in deterministic optimal control, and reinforcement learning, with its roots in Markov decision processes. Building on prior work, we describe a unified framework that covers all of these communities and note the strong parallels with the modeling framework of stochastic optimal control. By contrast, we make the case that the modeling framework of reinforcement learning, inherited from discrete Markov decision processes, is quite limited. Our framework (and that of stochastic control) is based on the core problem of optimizing over policies. We describe four classes of policies that we claim are universal and show that each of these two fields has, in its own way, evolved to include examples of each of these four classes.
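
The idea of optimizing directly over policies can be made concrete with a small simulation. The sketch below is a minimal illustration and not taken from the chapter: it assumes a simple order-up-to inventory policy whose two parameters are tuned by simulating the objective, i.e., it searches the policy space for the best value of max over theta of E F(theta, W). All names (simulate_policy, the parameter grid, prices, and costs) are illustrative assumptions, not the authors' notation.

    # A minimal sketch (assumptions mine) of "optimizing over policies":
    # a parameterized order-up-to inventory rule X^pi(S_t | theta) whose
    # parameters theta are tuned by simulating the expected objective.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_policy(theta, T=100, n_reps=20, price=4.0, cost=2.0, hold=0.1):
        """Estimate the expected total profit of an order-up-to policy."""
        low, up = theta
        total = 0.0
        for _ in range(n_reps):
            inventory, profit = 0.0, 0.0
            for _ in range(T):
                if inventory < low:              # decision rule X^pi(S_t)
                    order = up - inventory
                    profit -= cost * order
                    inventory = up
                demand = rng.poisson(5)          # exogenous information W_{t+1}
                sales = min(inventory, demand)
                profit += price * sales - hold * (inventory - sales)
                inventory -= sales               # transition to S_{t+1}
            total += profit
        return total / n_reps

    # Direct search over theta = (reorder point, order-up-to level)
    candidates = [(low, up) for low in range(0, 10)
                  for up in range(5, 20) if up > low]
    best = max(candidates, key=simulate_policy)
    print("best (reorder point, order-up-to level):", best)

Policy function approximations of this type are one of the four classes of policies discussed in the chapter; the other classes differ in how the decision rule is constructed, not in the underlying problem of searching over policies.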


Notes

  1.

    It is not unusual for people to overlook the need to include beliefs in the state variable. The RL tutorial [23] does this when it presents the multi-armed bandit problem, insisting that it does not have a state variable (see slide 49). In fact, any bandit problem is a sequential decision problem where the state variable is the belief (which can be Bayesian or frequentist). This has long been recognized by the probability community that has worked on bandit problems since the 1950s (see the seminal text [14]). Bellman’s equation (using belief states) was fundamental to the development of Gittins indices in [16] (see [17] for a nice introduction to this rich area of research). It was the concept of Gittins indices that laid the foundation for upper-confidence bounding, which is just a different form of index policy.
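
    To make the point concrete, the following sketch (a minimal illustration, not taken from the tutorial or the chapter; the problem data and variable names are assumptions) maintains a frequentist belief state for a multi-armed bandit and applies a UCB-style index policy. The state variable is exactly the pair of arrays holding the counts and sample means, updated after every observation, which is what makes the bandit a sequential decision problem.

        # Minimal sketch (assumptions mine): the belief B_t = (sample means, counts)
        # is the state variable of the bandit problem, and UCB is an index policy
        # that maps that belief state to a decision.
        import numpy as np

        rng = np.random.default_rng(1)
        true_means = np.array([0.2, 0.5, 0.65])   # unknown to the agent
        K, T = len(true_means), 2000

        counts = np.zeros(K)                      # N_k: observations per arm
        means = np.zeros(K)                       # mu_bar_k: sample means
        # Together, counts and means form the state S_t = B_t.

        for t in range(T):
            if t < K:
                arm = t                           # play each arm once to initialize
            else:
                ucb = means + np.sqrt(2.0 * np.log(t) / counts)  # index from the belief
                arm = int(np.argmax(ucb))
            reward = rng.normal(true_means[arm], 0.1)            # exogenous information W_{t+1}
            counts[arm] += 1                      # belief transition S_{t+1} = S^M(S_t, x_t, W_{t+1})
            means[arm] += (reward - means[arm]) / counts[arm]

        print("pulls per arm:", counts, "belief:", np.round(means, 3))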

References

  1. Astrom, K.J.: Introduction to Stochastic Control Theory. Dover Publications, Mineola (1970)

  2. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957)

  3. Bellman, R.E., Glicksberg, I., Gross, O.: On the optimal inventory equation. Manage. Sci. 1, 83–104 (1955)

  4. Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: The Discrete Time Case. Academic, New York (1978)

  5. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)

  6. Bertsekas, D.P., Tsitsiklis, J.N., Wu, C.: Rollout algorithms for combinatorial optimization. J. Heuristics 3(3), 245–262 (1997)

  7. Bouzaiene-Ayari, B., Cheng, C., Das, S., Fiorillo, R., Powell, W.B.: From single commodity to multiattribute models for locomotive optimization: a comparison of optimal integer programming and approximate dynamic programming. Transp. Sci. 50(2), 1–24 (2016)

  8. Browne, C.B., Powley, E.J., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–49 (2012)

  9. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)

  10. Camacho, E., Bordons, C.: Model Predictive Control. Springer, London (2003)

  11. Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1), 126–139 (2005)

  12. Cinlar, E.: Probability and Stochastics. Springer, New York (2011)

  13. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: Computers and Games, pp. 72–83. Springer, Berlin (2007)

  14. DeGroot, M.H.: Optimal Statistical Decisions. Wiley, Hoboken (1970)

  15. Fu, M.C.: Markov decision processes, AlphaGo, and Monte Carlo tree search: back to the future. TutORials Oper. Res., pp. 68–88 (2017)

  16. Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Gani, J. (ed.) Progress in Statistics, pp. 241–266. North Holland, Amsterdam (1974)

  17. Gittins, J., Glazebrook, K.D., Weber, R.R.: Multi-Armed Bandit Allocation Indices. Wiley, New York (2011)

  18. Rossiter, J.A.: Model-Based Predictive Control. CRC Press, Boca Raton (2004)

  19. Jiang, D.R., Pham, T.V., Powell, W.B., Salas, D.F., Scott, W.R.: A comparison of approximate dynamic programming techniques on benchmark energy storage problems: does anything work? In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 1–8. IEEE, Orlando, FL (2014)

  20. Kirk, D.E.: Optimal Control Theory: An Introduction. Dover, New York (2004)

  21. Kothare, M.V., Balakrishnan, V., Morari, M.: Robust constrained model predictive control using linear matrix inequalities. Automatica 32(10), 1361–1379 (1996)

  22. Kushner, H.: Introduction to Stochastic Control. Holt, Rinehart and Winston, New York (1971)

  23. Lazaric, A.: Introduction to Reinforcement Learning (2019). https://www.irit.fr/cimi-machine-learning/sites/irit.fr.CMS-DRUPAL7.cimi_ml/files/pictures/lecture1_bandits.pdf

  24. Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)

  25. Lewis, F.L., Vrabie, D., Syrmos, V.L.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)

  26. Maxwell, M.S., Henderson, S.G., Topaloglu, H.: Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stoch. Syst. 3(2), 322–361 (2013)

  27. Murray, J.J., Cox, C.J., Lendaris, G.G., Saeks, R.: Adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 32(2), 140–153 (2002)

  28. Nisio, M.: Stochastic Control Theory: Dynamic Programming Principle. Springer, New York (2014)

  29. Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd edn. Wiley, Hoboken (2011)

  30. Powell, W.B.: Clearing the jungle of stochastic optimization. In: Bridging Data and Decisions, pp. 109–137 (2014)

  31. Powell, W.B.: A unified framework for stochastic optimization. Eur. J. Oper. Res. 275(3), 795–821 (2019)

  32. Powell, W.B.: Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions, Princeton

  33. Powell, W.B., Frazier, P.I.: Optimal learning. TutORials Oper. Res., pp. 213–246 (2008)

  34. Powell, W.B., Meisel, S.: Tutorial on stochastic optimization in energy - part II: an energy storage illustration. IEEE Trans. Power Syst. (2016)

  35. Powell, W.B., Ryzhov, I.O.: Optimal Learning (2012)

  36. Puterman, M.L.: Markov Decision Processes, 2nd edn. Wiley, Hoboken (2005)

  37. Rakovic, S.V., Kouvaritakis, B., Cannon, M., Panos, C., Findeisen, R.: Parameterized tube model predictive control. IEEE Trans. Autom. Control 57(11), 2746–2761 (2012)

  38. Recht, B.: A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control Robot. Auton. Syst. 2(1), 253–279 (2019)

  39. Sethi, S.P.: Optimal Control Theory: Applications to Management Science and Economics, 3rd edn. Springer, Boston (2019)

  40. Si, J., Barto, A.G., Powell, W.B., Wunsch, D.: Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press (2004)

  41. Simão, H., Day, J., George, A.P., Gifford, T., Nienow, J., Powell, W.B.: An approximate dynamic programming algorithm for large-scale fleet management: a case application. Transp. Sci. (2009)

  42. Sontag, E.: Mathematical Control Theory, 2nd edn. Springer, Berlin (1998)

  43. Spall, J.C.: Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Wiley, Hoboken (2003)

  44. Stengel, R.F.: Stochastic Optimal Control: Theory and Application. Wiley, Hoboken (1986)

  45. Stengel, R.F.: Optimal Control and Estimation. Dover Publications, New York (1994)

  46. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

  47. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)

  48. Vrabie, D., Lewis, F.: Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Netw. 22(3), 237–246 (2009)

  49. Werbos, P.J.: Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University (1974)

  50. Yong, J., Zhou, X.Y.: Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York (1999)


Author information

Corresponding author: Warren B. Powell.


Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Powell, W.B. (2021). From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_3
