Conclusions, Future Directions and Outlook

  • Marco Wiering
  • Martijn van Otterlo
Part of the Adaptation, Learning, and Optimization book series (ALO, volume 12)

Looking Back

This book has provided the reader with a thorough description of the field of reinforcement learning (RL). In this last chapter we first discuss what has been accomplished with this book, followed by a description of those topics that were left out, mainly because they lie outside the main field of RL or are small (possibly novel and emerging) subfields within it. After looking back at what has been done in RL and in this book, we take a step toward the future development of the field, and we end with the opinions of some of the authors on what they think will become the most important areas of research in RL.







Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Department of Artificial Intelligence, University of Groningen, Groningen, The Netherlands
  2. Radboud University, Nijmegen, The Netherlands
