
Control Optimization with Reinforcement Learning

A chapter in Simulation-Based Optimization

Part of the book series: Operations Research/Computer Science Interfaces Series (ORCS, volume 55)

Abstract

This chapter focuses on a relatively new methodology called reinforcement learning (RL). RL is presented here as a form of simulation-based dynamic programming, used primarily for solving Markov and semi-Markov decision problems. Pioneering work on RL was performed within the artificial intelligence community, which views it as a "machine learning" method; this perhaps explains the origin of the word "learning" in the name reinforcement learning. Within the artificial intelligence community, "learning" is also sometimes used to describe function approximation, e.g., regression, and some form of function approximation, as we will see below, usually accompanies RL. The word "reinforcement" reflects the fact that RL algorithms can be viewed as agents that learn through trial and error (feedback).
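To make the simulation-based flavor of RL concrete, the following is a minimal sketch of tabular Q-learning on a small, hypothetical two-state, two-action Markov decision problem, written in C (the language of the book's code supplement). The transition probabilities, rewards, discount factor, exploration rate, and step-size rule below are illustrative assumptions, not values taken from the chapter: the point is only that the learner updates its Q-factors from simulated transitions and rewards, by trial and error, without ever solving the dynamic programming equations directly.

```c
/*
 * Minimal sketch of tabular Q-learning on a small Markov decision problem.
 * The two-state, two-action MDP, its transition probabilities, rewards, and
 * all learning parameters are illustrative assumptions, not taken from the
 * chapter or from the book's code supplement.
 */
#include <stdio.h>
#include <stdlib.h>

#define NS 2          /* number of states  */
#define NA 2          /* number of actions */

/* Hypothetical transition probabilities P[s][a][s'] and rewards R[s][a][s'].
 * Only the simulator below uses these; the learner sees sampled outcomes. */
static const double P[NS][NA][NS] = {
    { {0.7, 0.3}, {0.9, 0.1} },
    { {0.4, 0.6}, {0.2, 0.8} }
};
static const double R[NS][NA][NS] = {
    { { 6.0,  -5.0}, { 10.0, 17.0} },
    { { 7.0,  12.0}, {-14.0, 13.0} }
};

/* Sample the next state by inverting the cumulative transition distribution. */
static int sample_next_state(int s, int a)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    double cum = 0.0;
    for (int sp = 0; sp < NS; sp++) {
        cum += P[s][a][sp];
        if (u < cum) return sp;
    }
    return NS - 1;
}

int main(void)
{
    double Q[NS][NA] = {{0.0}};    /* Q-factors, initialized to zero */
    const double gamma   = 0.8;    /* discount factor                */
    const double epsilon = 0.1;    /* exploration probability        */
    const long   steps   = 200000; /* simulated transitions          */
    int s = 0;

    srand(12345);
    for (long k = 1; k <= steps; k++) {
        /* Epsilon-greedy action selection (trial-and-error exploration);
         * the greedy choice is written for the two-action case. */
        int a;
        if ((double)rand() / ((double)RAND_MAX + 1.0) < epsilon)
            a = rand() % NA;
        else
            a = (Q[s][0] >= Q[s][1]) ? 0 : 1;

        /* Simulate one transition: sample the next state, observe the reward. */
        int sp = sample_next_state(s, a);
        double r = R[s][a][sp];

        /* Q-learning update with a diminishing step size. */
        double alpha = 150.0 / (300.0 + (double)k);
        double maxq  = (Q[sp][0] >= Q[sp][1]) ? Q[sp][0] : Q[sp][1];
        Q[s][a] += alpha * (r + gamma * maxq - Q[s][a]);

        s = sp;
    }

    for (int i = 0; i < NS; i++)
        printf("Q(%d,0) = %8.3f   Q(%d,1) = %8.3f\n", i, Q[i][0], i, Q[i][1]);
    return 0;
}
```

The greedy action recommended in each state after the run is the one with the larger Q-factor; with a diminishing step size and persistent exploration, the Q-factors settle near the solution of the underlying Bellman optimality equations for this toy problem.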




Copyright information

© 2015 Springer Science+Business Media New York


Cite this chapter

Gosavi, A. (2015). Control Optimization with Reinforcement Learning. In: Simulation-Based Optimization. Operations Research/Computer Science Interfaces Series, vol 55. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7491-4_7
