Reinforcement Learning in Continuous State and Action Spaces

Part of the Adaptation, Learning, and Optimization book series (ALO, volume 12)

Abstract

Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can already be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, we mainly focus on methods that explicitly update a representation of a value function, a policy, or both. We discuss considerations in choosing an appropriate representation for these functions and describe gradient-based and gradient-free ways to update their parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and (natural) actor-critic methods. We discuss the advantages of different approaches and empirically compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy.
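To make the flavor of these methods concrete, below is a minimal sketch, not code from the chapter, of one common combination for continuous states and actions: a linear critic trained with semi-gradient temporal-difference learning and a Gaussian policy whose mean is adjusted along the policy gradient, using the TD error as the advantage estimate. The feature mapping, step sizes, and environment interface are illustrative assumptions.

```python
# A minimal actor-critic sketch for continuous states and actions (assumed
# hyperparameters and a hypothetical feature vector phi of the state).
import numpy as np

class GaussianActorCritic:
    def __init__(self, n_features, alpha_v=0.1, alpha_p=0.01, gamma=0.99, sigma=0.5):
        self.w = np.zeros(n_features)      # critic weights: V(s) ~ w . phi(s)
        self.theta = np.zeros(n_features)  # actor weights: mean action = theta . phi(s)
        self.alpha_v, self.alpha_p = alpha_v, alpha_p
        self.gamma, self.sigma = gamma, sigma

    def act(self, phi):
        # Sample a continuous action from a Gaussian centred on the current mean.
        return np.random.normal(self.theta @ phi, self.sigma)

    def update(self, phi, action, reward, phi_next, done):
        # One-step TD error: delta = r + gamma * V(s') - V(s).
        v_next = 0.0 if done else self.w @ phi_next
        delta = reward + self.gamma * v_next - self.w @ phi
        # Semi-gradient TD(0) update for the critic.
        self.w += self.alpha_v * delta * phi
        # Policy-gradient update for the actor: for a Gaussian policy,
        # grad_theta log pi(a|s) = (a - mu) / sigma^2 * phi(s),
        # and the TD error delta serves as the advantage estimate.
        mu = self.theta @ phi
        self.theta += self.alpha_p * delta * (action - mu) / self.sigma**2 * phi
```

The same structure accommodates many of the variants discussed in the chapter: richer (e.g. tile-coding or neural-network) representations replace the linear feature products, eligibility traces extend the one-step TD update, and natural-gradient or evolutionary updates replace the vanilla policy-gradient step.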

Keywords

Reinforcement learning · Action space · Stochastic gradient descent · Adaptive dynamic programming · Eligibility trace

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Center for Mathematics and Computer Science, Centrum Wiskunde en Informatica (CWI), Amsterdam, The Netherlands
