
Control Optimization with Reinforcement Learning

A chapter in Simulation-Based Optimization

Part of the book series: Operations Research/Computer Science Interfaces Series (ORCS, volume 55)

Abstract

This chapter focuses on a relatively new methodology called reinforcement learning (RL). RL is presented here as a form of simulation-based dynamic programming, used primarily for solving Markov and semi-Markov decision problems. Pioneering work on RL was performed within the artificial intelligence community, which views it as a "machine learning" method; this perhaps explains the origin of the word "learning" in the name reinforcement learning. Within the artificial intelligence community, "learning" is also sometimes used to describe function approximation, e.g., regression, and some form of function approximation, as we will see below, usually accompanies RL. The word "reinforcement" reflects the fact that RL algorithms can be viewed as agents that learn through trial and error (feedback).
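To make the simulation-based flavor of RL concrete, the following is a minimal sketch of tabular Q-learning on a small, hypothetical two-state, two-action Markov decision problem, written in C (the language of the book's code supplement). The transition probabilities, rewards, discount factor, exploration rate, and step-size rule below are illustrative assumptions, not values taken from the chapter: the point is only that the learner updates its Q-factors from simulated transitions and rewards, by trial and error, without ever solving the dynamic programming equations directly.

```c
/*
 * Minimal sketch of tabular Q-learning on a small Markov decision problem.
 * The two-state, two-action MDP, its transition probabilities, rewards, and
 * all learning parameters are illustrative assumptions, not taken from the
 * chapter or from the book's code supplement.
 */
#include <stdio.h>
#include <stdlib.h>

#define NS 2          /* number of states  */
#define NA 2          /* number of actions */

/* Hypothetical transition probabilities P[s][a][s'] and rewards R[s][a][s'].
 * Only the simulator below uses these; the learner sees sampled outcomes. */
static const double P[NS][NA][NS] = {
    { {0.7, 0.3}, {0.9, 0.1} },
    { {0.4, 0.6}, {0.2, 0.8} }
};
static const double R[NS][NA][NS] = {
    { { 6.0,  -5.0}, { 10.0, 17.0} },
    { { 7.0,  12.0}, {-14.0, 13.0} }
};

/* Sample the next state by inverting the cumulative transition distribution. */
static int sample_next_state(int s, int a)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    double cum = 0.0;
    for (int sp = 0; sp < NS; sp++) {
        cum += P[s][a][sp];
        if (u < cum) return sp;
    }
    return NS - 1;
}

int main(void)
{
    double Q[NS][NA] = {{0.0}};    /* Q-factors, initialized to zero */
    const double gamma   = 0.8;    /* discount factor                */
    const double epsilon = 0.1;    /* exploration probability        */
    const long   steps   = 200000; /* simulated transitions          */
    int s = 0;

    srand(12345);
    for (long k = 1; k <= steps; k++) {
        /* Epsilon-greedy action selection (trial-and-error exploration);
         * the greedy choice is written for the two-action case. */
        int a;
        if ((double)rand() / ((double)RAND_MAX + 1.0) < epsilon)
            a = rand() % NA;
        else
            a = (Q[s][0] >= Q[s][1]) ? 0 : 1;

        /* Simulate one transition: sample the next state, observe the reward. */
        int sp = sample_next_state(s, a);
        double r = R[s][a][sp];

        /* Q-learning update with a diminishing step size. */
        double alpha = 150.0 / (300.0 + (double)k);
        double maxq  = (Q[sp][0] >= Q[sp][1]) ? Q[sp][0] : Q[sp][1];
        Q[s][a] += alpha * (r + gamma * maxq - Q[s][a]);

        s = sp;
    }

    for (int i = 0; i < NS; i++)
        printf("Q(%d,0) = %8.3f   Q(%d,1) = %8.3f\n", i, Q[i][0], i, Q[i][1]);
    return 0;
}
```

The greedy action recommended in each state after the run is the one with the larger Q-factor; with a diminishing step size and persistent exploration, the Q-factors settle near the solution of the underlying Bellman optimality equations for this toy problem.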




Copyright information

© 2015 Springer Science+Business Media New York


Cite this chapter

Gosavi, A. (2015). Control Optimization with Reinforcement Learning. In: Simulation-Based Optimization. Operations Research/Computer Science Interfaces Series, vol 55. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7491-4_7
