Reinforcement Learning

Part of the Springer Theses book series.

Abstract

This chapter provides an overview of the field of reinforcement learning and of the concepts that are relevant to the proposed work. Reinforcement learning is not widely known, and although the learning paradigm is easy to understand, some of its more detailed concepts can be difficult to grasp. Accordingly, the chapter begins with a review of the fundamental concepts and methods of reinforcement learning. This introduction is followed by a review of the three major components of the reinforcement learning method: the environment, the learning algorithm, and the representation of the learned knowledge.

Portions of this chapter previously appeared as: Gatti & Embrechts (2012).
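
The three components named in the abstract can be illustrated with a small, self-contained sketch. The following Python example is not from the chapter; the chain task, the Q-table, and the one-step Q-learning update (a temporal-difference method) are illustrative stand-ins for the environment, the representation of learned knowledge, and the learning algorithm, respectively, and all names are assumptions made for this sketch.

```python
# Minimal sketch (illustrative only): the three components of a reinforcement
# learning method, shown with tabular Q-learning on a hypothetical chain task.
import random
from collections import defaultdict

class ChainEnvironment:
    """Environment: five states in a row; reaching the right end pays +1."""
    def __init__(self, n_states=5):
        self.n_states = n_states
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        # action 0 moves left (bounded at 0), action 1 moves right
        self.state = max(0, self.state - 1) if action == 0 else self.state + 1
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# Representation of learned knowledge: a table of Q(s, a) values.
q_table = defaultdict(float)

def epsilon_greedy(state, epsilon=0.1):
    """Exploration policy used by the learning algorithm."""
    if random.random() < epsilon:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: q_table[(state, a)])

# Learning algorithm: one-step Q-learning (a temporal-difference update).
def train(env, episodes=200, alpha=0.1, gamma=0.9):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            best_next = max(q_table[(next_state, a)] for a in [0, 1])
            td_target = reward + (0.0 if done else gamma * best_next)
            q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
            state = next_state

if __name__ == "__main__":
    train(ChainEnvironment())
    print({k: round(v, 2) for k, v in sorted(q_table.items())})
```

Swapping the table for a neural network, or the Q-learning update for another temporal-difference rule, changes one component while leaving the others intact, which is the separation the chapter reviews.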


References

  • Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement, and Control, 97(3), 220–227.


  • Aldous, D. (1983). Random walks on finite groups and rapidly mixing Markov chains. In Seminar on Probability XVII, Lecture Notes in Mathematics Volume 986 (pp. 243–297). Berlin: Springer.


  • Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. In Langley, P. (Ed.), Proceedings of the 4th International Workshop on Machine Learning, Irvine, CA, 22–25 June (pp. 103–114). San Mateo, CA: Morgan Kaufmann.


  • Atkeson, C. G. & Santamaría, J. C. (1997). A comparison of direct and model-based reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Albuquerque, NM, 20–25 April (Vol. 4, pp. 3557–3564). doi: 10.1109/ROBOT.1997.606886


  • Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5), 11–73.


  • Archibald, T. W., McKinnon, K. I. M., & Thomas, L. C. (1995). On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3), 354–361.


  • Awate, Y. P. (2009). Policy-gradient based actor-critic algorithms. In Proceedings of the Global Congress on Intelligent Systems (GCIS), Xiamen, China, 19–21 May (pp. 505–509). doi: 10.1109/GCIS.2009.372


  • Bagnell, J. A. & Schneider, J. G. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the International Conference on Robotics and Automation, Seoul, Korea, 21–26 May (Vol. 2, pp. 1615–1620). doi: 10.1109/ROBOT.2001.932842


  • Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S. (Eds.) Proceedings of the 12th International Conference on Machine Learning (ICML), Tahoe City, CA, 9–12 July (pp. 30–37). San Francisco, CA: Morgan Kaufmann.


  • Baird, L. C. (1999). Reinforcement learning through gradient descent. Unpublished PhD dissertation, Carnegie Mellon University, Pittsburgh, PA.


  • Bakker, B. (2001). Reinforcement learning with LSTM in non-Markovian tasks with long-term dependencies (Technical Report, Department of Psychology, Leiden University). Retrieved from http://staff.science.uva.nl/~bram/RLLSTM_TR.pdf.

  • Bakker, B. (2007). Reinforcement learning by backpropagation through an LSTM model/critic. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), Honolulu, HI, 1–5 April (pp. 127–134). doi: 10.1109/ADPRL.2007.368179


  • Bakker, B. & Schmidhuber, J. (2004). Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Groen, F., Amato, N., Bonarini, A., Yoshida, E., & Kröse, B. (Eds.), Proceedings of the 8th Conference on Intelligent Autonomous Systems (IAS-8), Amsterdam, The Netherlands, 10–13 March (pp. 438–445). Amsterdam, Netherlands: IOS Press.


  • Bakker, B., Linaker, F., & Schmidhuber, J. (2002). Reinforcement learning in partially observable mobile robot domains using unsupervised event extraction. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), EPFL, Switzerland, 30 September–4 October (Vol. 1, pp. 938–943). doi: 10.1109/IRDS.2002.1041511


  • Barto, A. G. (1990). Connectionist learning for control: An overview. In Miller, W. T., Sutton, R. S., and Werbos, P. J. (Eds.), Neural Networks for Control (pp. 5–58). Cambridge, MA: MIT Press.


  • Barto, A. G., Sutton, R. S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics (SMC), 13(5), 834–846.


  • Baxter, J. & Bartlett, P. L. (2000). Reinforcement learning in POMDP’s via direct gradient ascent. In Proceedings of the 17th International Conference on Machine Learning (ICML), Stanford University, Stanford, CA, 29 June–2 July (pp. 41–48). San Francisco, CA: Morgan Kaufmann.


  • Baxter, J., Tridgell, A., & Weaver, L. (1998a). KnightCap: A chess program that learns by combining TD(λ) with minimax search. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, 24–27 July (pp. 28–36). San Francisco, CA: Morgan Kaufmann.


  • Baxter, J., Tridgell, A., & Weaver, L. (1998b). TDLeaf(λ): Combining temporal difference learning with game-tree search. Australian Journal of Intelligent Information Processing Systems, 5(1), 39–43.


  • Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice-Hall.


  • Bertsekas, D. P. & Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Belmont, MA: Athena Scientific.


  • Bhatnagar, S., Sutton, R., Ghavamzadeh, M., & Lee, M. (2009). Natural actor critic algorithms. Automatica, 45(11), 2471–2482.


  • Binkley, K. J., Seehart, K., & Hagiwara, M. (2007). A study of artificial neural network architectures for Othello evaluation functions. Information and Media Technologies, 2(4), 1129–1139.


  • Bonarini, A., Lazaric, A., & Restelli, M. (2007). Reinforcement learning in complex environments through multiple adaptive partitions. In AI*IA 2007: Artificial Intelligence and Human-Oriented Computing, Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence, Rome, Italy, 10–13 September (pp. 531–542). doi: 10.1007/978-3-540-74782-6_46


  • Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2–3), 233–246.


  • Boyan, J. A. & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, (pp. 369–376). Cambridge, MA: MIT Press.


  • Bradtke, S. J. & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3), 33–57.


  • Castro, D. D. & Mannor, S. (2010). Adaptive bases for reinforcement learning. In Proceedings of the 2010 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Barcelona, Spain, 20–24 September (pp. 312–327). doi: 10.1007/978-3-642-15880-3_26


  • Chapman, D. & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of 12th International Joint Conference on Artificial Intelligence (IJCAI), Sydney, Australia, 24–30 August (Vol. 2, pp. 726–731). San Francisco, CA: Morgan Kaufmann.


  • Coulom, R. (2002a). Feedforward neural networks in reinforcement learning applied to high-dimensional motor control. In Proceedings of the 13th International Conference on Algorithmic Learning Theory (ALT 2002), Lübeck, Germany, 24–26 November (pp. 402–413). doi: 10.1007/3-540-36169-3_32


  • Coulom, R. (2002b). Reinforcement learning using neural networks, with applications to motor control. Unpublished PhD dissertation, National Polytechnic Institute of Grenoble, Grenoble, France.


  • Dann, C., Neumann, G., & Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(1), 809–883.


  • Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4), 613–624.


  • Dayan, P. & Niv, Y. (2008). Reinforcement learning: The good, the bad and the ugly. Current Opinion in Neuroscience, 18(2), 185–196.


  • Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of the 1st International Workshop on Multiple Classifier Systems (MCS), Cagliari, Italy, 21–23 June (pp. 1–15). doi: 10.1007/3-540-45014-9_1


  • Doya, K. (1996). Temporal difference learning in continuous time and space. In Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems 8 (pp. 1073–1079). Cambridge, MA: MIT Press.


  • Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.


  • Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.


  • Fairbank, M. & Alonso, E. (2012). The divergence of reinforcement learning algorithms with value-iteration and function approximation. In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Queensland, Australia, 10–15 June (pp. 1–8). doi: 10.1109/IJCNN.2012.6252792


  • Främling, K. (2008). Light-weight reinforcement learning with function approximation for real-life control tasks. In Filipe, J., Andrade-Cetto, J., & Ferrier, J.-L. (Eds.), Proceedings of the 5th International Conference on Informatics in Control, Automation and Robotics, Intelligent Control Systems and Optimization (ICINCO-ICSO), Funchal, Madeira, Portugal, 11–15 May (pp. 127–134). INSTICC Press.


  • Gabel, T. & Riedmiller, M. (2007). On a successful application of multi-agent reinforcement learning to operations research benchmarks. In Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), Honolulu, HI, 1–5 April (pp. 69–75). doi: 10.1109/ADPRL.2007.368171


  • Gabel, T., Lutz, C., & Riedmiller, M. (2011). Improved neural fitted Q iteration applied to a novel computer gaming and learning benchmark. In Proceedings of the 2011 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2011), Paris, France, 11–15 April (pp. 279–286). doi: 10.1109/ADPRL.2011.5967361


  • Galichet, N., Sebag, M., & Teytaud, O. (2013). Exploration vs. exploitation vs safety: Risk-aware multi-armed bandits. In Proceedings of the Asian Conference on Machine Learning (ACML 2013), Canberra, ACT, Australia, 13–15 November (pp. 245–260). Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings.


  • Gatti, C. J. & Embrechts, M. J. (2012). Reinforcement learning with neural networks: Tricks of the trade. In Georgieva, P., Mihayolva, L., & Jain, L. (Eds.), Advances in Intelligent Signal Processing and Data Mining (pp. 275–310). New York, NY: Springer-Verlag.


  • Gatti, C. J., Embrechts, M. J., & Linton, J. D. (2011a). Parameter settings of reinforcement learning for the game of Chung Toi. In Proceedings of the 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2011), Anchorage, AK, 9–12 October (pp. 3530–3535). doi: 10.1109/ICSMC.2011.6084216


  • Gatti, C. J., Linton, J. D., & Embrechts, M. J. (2011b). A brief tutorial on reinforcement learning: The game of Chung Toi. In Proceedings of the 19th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 27–29 April (pp. 129–134). Bruges, Belgium: ESANN.


  • Gatti, C. J., Embrechts, M. J., & Linton, J. D. (2013). An empirical analysis of reinforcement learning using design of experiments. In Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 24–26 April (pp. 221–226). Bruges, Belgium: ESANN.


  • Gers, F. (2001). Long short-term memory in recurrent neural networks. Unpublished PhD dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.


  • Ghory, I. (2004). Reinforcement learning in board games (Technical Report CSTR-04-004, Department of Computer Science, University of Bristol). Retrieved from http://www.cs.bris.ac.uk/Publications/Papers/2000100.pdf.

  • Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Proceedings of the 12th International Conference on Machine Learning (ICML), Tahoe City, CA, 9–12 July (pp. 261–268). San Francisco, CA: Morgan Kaufmann.


  • Gordon, G. J. (2001). Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems 13 (pp. 1040–1046). Cambridge, MA: MIT Press.


  • Gorse, D. (2011). Application of stochastic recurrent reinforcement learning to index trading. In European Symposium on Artificial Neural Networks, Computational Intelligence, and Machine Learning (ESANN), Bruges, Belgium, 27–29 April (pp. 123–128). Bruges, Belgium: ESANN.


  • Gosavi, A., Bandla, N., & Das, T. K. (2002). A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34(9), 729–742.


  • Grüning, A. (2007). Elman backpropagation as reinforcement for simple recurrent networks. Neural Computation, 19(11), 3108–3131.


  • Günther, M. (2008). Automatic feature construction for general game playing. Unpublished master's thesis, Dresden University of Technology, Dresden, Germany.


  • Hafner, R. & Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning, 84(1–2), 137–169.


  • Hans, A. & Udluft, S. (2010). Ensembles of neural networks for robust reinforcement learning. In Proceedings of the 9th International Conference on Machine Learning and Applications (ICMLA), Washington D.C., 12–14 December (pp. 401–406). doi: 10.1109/ICMLA.2010.66


  • Hans, A. & Udluft, S. (2011). Ensemble usage for more reliable policy identification in reinforcement learning. In European Symposium on Artificial Neural Networks, Computational Intelligence, and Machine Learning (ESANN), Bruges, Belgium, 27–29 April (pp. 165–170). Bruges, Belgium: ESANN.


  • Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.


  • Hoffmann, A. & Freier, B. (1996). On integrating domain knowledge into reinforcement learning. In International Conference on Neural Information Processing (ICONIP), Hong Kong, China, 24–27 September (pp. 954–959). Singapore: Springer-Verlag.


  • Igel, C. (2003). Neuroevolution for reinforcement learning using evolution strategies. In Proceedings from the 2003 Conference on Evolutionary Computing (CEC), Canberra, Australia, 8–12 December (Vol. 4, pp. 2588–2595). doi: 10.1109/CEC.2003.1299414


  • Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problem. In Advances in Neural Information Processing Systems 7 (pp. 345–352). Cambridge, MA: MIT Press.


  • Jaakkola, T., Jordan, M. I., & Singh, S. P. (2003). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1185–1201.


  • Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.


  • Kalyanakrishnan, S. & Stone, P. (2007). Batch reinforcement learning in a complex domain. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS07), Honolulu, HI, 14–18 May (pp. 650–657). doi: 10.1145/1329125.1329241


  • Kalyanakrishnan, S. & Stone, P. (2009). An empirical analysis of value function-based and policy search reinforcement learning. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS '09), Budapest, Hungary, 10–15 May (Vol. 2, pp. 749–756). Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.


  • Kalyanakrishnan, S. & Stone, P. (2011). Characterizing reinforcement learning methods through parameterized learning problems. Machine Learning, 84(1–2), 205–247.


  • Kappen, H. J. (2007). An introduction to stochastic control theory, path integrals and reinforcement learning. In Marro, J., Garrido, P. L., & Torres, J. J. (Eds.), Cooperative Behavior in Neural Systems, American Institute of Physics Conference Series, Granada, Spain, 11–15 September (Vol. 887, pp. 149–181). American Institute of Physics.


  • Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, 16–21 June (Vol. 28, pp. 1238–1246). JMLR Proceedings.


  • Kohl, N. & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), New Orleans, LA, 26 April–1 May (pp. 2619–2624). doi: 10.1109/ROBOT.2004.1307456


  • Konen, W. & Beielstein, T. B. (2008). Reinforcement learning: Insights from interesting failures in parameter selection. In Parallel Problem Solving from Nature—PPSN X, Proceedings of the 10th International Conference on Parallel Problem Solving from Nature, Dortmund, Germany, 13–17 September (pp. 478–487). doi: 10.1007/978-3-540-87700-4_48


  • Konen, W. & Beielstein, T. B. (2009). Reinforcement learning for games: Failures and successes. In Proceedings of the 11th Genetic and Evolutionary Computation Conference (GECCO), Montreal, Canada, 8–12 July (pp. 2641–2648). doi: 10.1145/1570256.1570375


  • Konidaris, G., Osentoski, S., & Thomas, P. S. (2011). Value function approximation in reinforcement learning using the Fourier basis. In Burgard, W. & Roth, D. (Eds.), Proceedings of the 25th Conference on Artificial Intelligence (AAAI 2011), San Francisco, CA, 7–11 August (pp. 380–385). AAAI.


  • Konidaris, G. D., Scheidwasser, I., & Barto, A. G. (2012). Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13(May), 1333–1371.


  • Kretchmar, R. M. & Anderson, C. W. (1997). Comparison of CMACs and radial basis functions for local function approximation in reinforcement learning. In International Conference on Neural Networks, Houston, TX, 9–12 June (Vol. 2, pp. 834–837). doi: 10.1109/ICNN.1997.616132


  • Kwok, C. & Fox, D. (2004). Reinforcement learning for sensing strategies. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS 2004), Sendai, Japan, 28 September–2 October (Vol. 4, pp. 3158–3163). doi: 10.1109/IROS.2004.1389903


  • Lange, S., Gabel, T., & Riedmiller, M. (2012). Batch reinforcement learning. In Wiering, M. & van Otterlo, M. (Eds.), Reinforcement Learning: State-of-the-Art (pp. 45–73). New York, NY: Springer.


  • Langley, P. (1988). Machine learning as an experimental science. Machine Learning, 3(1), 5–8.


  • Lazaric, A. (2008). Knowledge transfer in reinforcement learning. Unpublished PhD dissertation, Politecnico di Milano, Milano, Italy.


  • Lee, J. W. (2001). Stock price prediction using reinforcement learning. In Proceedings of the IEEE International Symposium on Industrial Electronics, Pusan, South Korea, 12–16 June (Vol. 1, pp. 690–695). doi: 10.1109/ISIE.2001.931880


  • O, J., Lee, J., Lee, J. W., & Zhang, B.-T. (2006). Adaptive stock trading and dynamic asset allocation using reinforcement learning. Information Sciences, 176(15), 2121–2147.


  • Li, Y. & Schuurmans, D. (2008). Policy iteration for learning an exercise policy for American options. In Girgin, S., Loth, M., Munos, R., Preux, P., & Ryabko, D., editors, Recent Advances in Reinforcement Learning, Proceedings of the 8th European Workshop on Recent Advances in Reinforcement Learning (EWRL 2008), Villeneuve d’Ascq, France, June 30–July 3 (pp. 165–178). doi: 10.1007/978-3-540-89722-4_13


  • Li, Y., Szepesvari, C., & Schuurmans, D. (2009). Learning exercise policies for American options. In Dyk, D. V. & Welling, M. (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS-09), Clearwater Beach, FL, 16–18 April (Vol. 5, pp. 352–359). JMLR: Workshop and Conference Proceedings.


  • Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4), 293–321.


  • Littman, M. L. (2001). Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research, 2(1), 55–66.


  • Loone, S. M. & Irwin, G. (2001). Improving neural network training solutions using regularisation. Neurocomputing, 37(1–4), 71–90.


  • Mahadevan, S. & Maggioni, M. (2005). Value function approximation with diffusion wavelets and Laplacian eigenfunctions. In Advances in Neural Information Processing Systems 18. Cambridge, MA: MIT Press.


  • Mahadevan, S. & Maggioni, M. (2007). Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8, 2169–2231.


  • Mahadevan, S. & Theocharous, G. (1998). Optimizing production manufacturing using reinforcement learning. In Cook, D. J. (Ed.) Proceedings of the 11th International Florida Artificial Intelligence Research Society Conference, Sanibel Island, Florida, 18–20 May (pp. 372–377). AAAI Press.


  • Maia, T. V. (2009). Reinforcement learning, conditioning, and the brain: Successes and challenges. Cognitive, Affective, & Behavioral Neuroscience, 9(4), 343–364.


  • Makino, T. (2009). Proto-predictive representation of states with simple recurrent temporal-difference networks. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), Montreal, Canada, 14–18 June (pp. 697–704). doi: 10.1145/1553374.1553464


  • Mannen, H. & Wiering, M. (2004). Learning to play chess using TD(λ)-learning with database games. In Nowe, A., Lenaerts, T., & Steenhout, K. (Eds.), Proceedings of the 13th Belgian-Dutch Conference on Machine Learning, Brussels, Belgium, 8–9 January (pp. 72–79). Retrieved from http://www.ai.rug.nl/~mwiering/group/articles/learning-chess.pdf


  • Menache, I., Mannor, S., & Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1), 215–238.


  • Michalski, R. S. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20(2), 111–161.


  • Michie, D. & Chambers, R. A. (1968). BOXES: An experiment in adaptive control. In Dale, E. & Michie, D. (Eds.), Machine Intelligence (pp. 137–152). Edinburgh, Scotland: Oliver and Boyd.


  • Mitchell, T. M. & Thrun, S. B. (1992). Explanation-based neural network learning for robot control. In Advances in Neural Information Processing Systems 5 (pp. 287–294). San Francisco, CA: Morgan Kaufmann.


  • Montazeri, H., Moradi, S., & Safabakhsh, R. (2011). Continuous state/action reinforcement learning: A growing self-organizing map approach. Neurocomputing, 74(7), 1069–1082.


  • Moody, J. & Saffell, M. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4), 875–889.


  • Moody, J. & Tresp, V. (1994). A trivial but fast reinforcement controller. Neural Computation, 6.


  • Moody, J., Wu, L., Liao, Y., & Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5–6), 441–470.


  • Moore, A. W. (1990). Efficient memory-based learning for robot control. Unpublished PhD dissertation, University of Cambridge, Cambridge, United Kingdom.


  • Moore, B. L., Pyeatt, L. D., Kulkarni, V., Panousis, P., Padrez, K., & Doufas, A. G. (2014). Reinforcement learning for closed-loop Propofol anesthesia: A study in human volunteers. Journal of Machine Learning Research, 15(Feb), 655–696.


  • Nevmyvaka, Y., Feng, Y., & Kearns, M. (2006). Reinforcement learning for optimized trade execution. In Cohen, W. W. and Moore, A. (Eds.), Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, 25–29 June (pp. 673–680). New York, NY: ACM.


  • Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E. & Liang, E. (2004). Autonomous inverted helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics (ISER-2004), Singapore, 18–21 June (pp. 363–372). Cambridge, MA: MIT Press.


  • Nissen, S. (2007). Large scale reinforcement learning using Q-Sarsa(λ) and cascading neural networks. Unpublished master's thesis, Department of Computer Science, University of Copenhagen, København, Denmark.


  • Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139–154.


  • Ollington, R. B., Vamplew, P. H., & Swanson, J. (2009). Incorporating expert advice into reinforcement learning using constructive neural networks. In Franco, L., Elizondo, D. A., & Jerez, J. M. (Eds.), Constructive Neural Networks (pp. 207–224). Berlin: Springer.


  • Orr, M. J. L. (1996). Introduction to radial basis function networks (Technical report, Centre For Cognitive Science, University of Edinburgh). Retrieved from http://www.cc.gatech.edu/~isbell/tutorials/rbf-intro.pdf.

  • Osana, Y. (2011). Reinforcement learning using Kohonen feature map probabilistic associative memory based on weights distribution. In Mellouk, A. (Ed.), Advances in Reinforcement Learning (pp. 121–136). InTech.


  • Osentoski, S. (2009). Action-based representation discovery in Markov decision processes. Unpublished PhD dissertation, University of Massachusetts, Amherst, MA.


  • Papahristou, N. & Refanidis, I. (2011). Training neural networks to play backgammon variants using reinforcement learning. In Applications of Evolutionary Computation, Proceedings of the 11th International Conference on Applications of Evolutionary Computation, Torino, Italy, 27–29 April (pp. 113–122). Berlin: Springer-Verlag.


  • Papavassiliou, V. A. & Russell, S. (1999). Convergence of reinforcement learning with general function approximators. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 31 July–6 August (Vol. 2, pp. 748–755). San Francisco, CA: Morgan Kaufmann.


  • Papierok, S., Noglik, A., & Pauli, J. (2008). Application of reinforcement learning in a real environment using an RBF network. In 1st International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS), Patras, Greece, 22 July (pp. 17–22). Retrieved from http://www.is.uni-due.de/fileadmin/literatur/publikation/papierok08erlars.pdf

  • Patist, J. P. & Wiering, M. (2004). Learning to play draughts using temporal difference learning with neural networks and databases. In Proceedings of the 13th Belgian-Dutch Conference on Machine Learning, Brussels, Belgium, 8–9 January (pp. 87–94). doi: 10.1007/978-3-540-88190-2_13


  • Peters, J. & Schaal, S. (2006). Policy gradient methods for robotics. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Beijing, China, 9–15 October (pp. 2219–2225). doi: 10.1109/IROS.2006.282564


  • Peters, J. & Schaal, S. (2009). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.


  • Pollack, J. B. & Blair, A. D. (1996). Why did TD-Gammon work? In Mozer, M. C., Jordan, M. I., & Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press.


  • Pontrandolfo, P., Gosavi, A., Okogbaa, O. G., & Das, T. K. (2002). Global supply chain management: A reinforcement learning approach. International Journal of Production Research, 40(6), 1299–1317.


  • Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curse of Dimensionality. New York, NY: John Wiley & Sons.


  • Powell, W. B. (2008). What you should know about approximate dynamic programming. Naval Research Logistics, 56(3), 239–249.


  • Powell, W. B. & Ma, J. (2011). A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9(3), 336–352.


  • Proper, S. & Tadepalli, P. (2006). Scaling model-based average-reward reinforcement learning for product delivery. In Machine Learning: European Conference on Machine Learning (ECML 2006), Berlin, Germany, 18–22 September (pp. 735–742). doi: 10.1007/11871842_74


  • Rescorla, R. A. & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In Black, A. H. & Prokasy, W. F. (Eds.), Classical Conditioning II: Current research and theory (pp. 64–99). New York, NY: Appleton-Century-Crofts.


  • Riedmiller, M. (2005). Neural fitted Q iteration—First experiences with a data efficient neural reinforcement learning method. In Gama, J., Camacho, R., Brazdil, P. B., Jorge, A. M., & Torgo, L. (Eds.), Proceedings of the 16th European Conference on Machine Learning (ECML 2005), Porto, Portugal, 3–7 October (pp. 317–328). doi: 10.1007/11564096_32


  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representation by error propagation. In Rumelhart, D. E. & McClelland, J. L. (Eds.), Parallel Distributed Processing: Exploration in the Microstructure of Cognition. Cambridge, MA: MIT Press.


  • Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Technical Report CUED/F-INFENG/TR 166, Engineering Department, Cambridge University). Retrieved from http://mi.eng.cam.ac.uk/reports/svr-ftp/auto-pdf/rummery_tr166.pdf

  • Runarsson, T. P. & Lucas, S. M. (2005). Co-evolution versus self-play temporal difference learning for acquiring position evaluation in small-board Go. IEEE Transactions on Evolutionary Computation, 9(6), 628–640.


  • Schaeffer, J., Hlynka, M., & Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), Seattle, WA, 4–10 August (Vol. 1, pp. 529–534). San Francisco, CA: Morgan Kaufmann.


  • Schmidhuber, J. (2005). Completely self-referential optimal reinforcement learners. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Warsaw, Poland, 11–15 September, volume 3697 of Lecture Notes in Computer Science (pp. 223–233). Berlin: Springer.


  • Schmidhuber, J. (2006). Gödel machines: Fully self-referential optimal universal self-improvers. In Goertzel, B. & Pennachin, C. (Eds.), Artificial General Intelligence (pp. 199–226). doi: 10.1007/11550907_36


  • Schraudolph, N. N., Dayan, P., & Sejnowski, T. J. (1994). Temporal difference learning of position evaluation in the game of Go. In Cowan, J. D. & Alspector, G. T. J. (Eds.), Advances in Neural Information Processing Systems 6. San Francisco, CA: Morgan Kaufmann.


  • Silver, D., Sutton, R. S., & Müller, M. (2012). Temporal-difference search in computer Go. Machine Learning, 87(2), 183–219.


  • Şimşek, O. & Barto, A. G. (2004). Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, 4–8 July (pp. 751–758). doi: 10.1145/1015330.1015353


  • Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the 11th International Conference on Machine Learning (ICML), New Brunswick, NJ, 10–13 July (pp. 284–292). San Francisco, CA: Morgan Kauffman.


  • Singh, S. P., Jaakkola, T., & Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems 7 (pp. 361–368). Cambridge, MA: MIT Press.


  • Singh, S. P. & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1–3), 123–158.


  • Skelly, M. M. (2004). Hierarchical reinforcement learning with function approximation for adaptive control. Unpublished PhD dissertation, Case Western Reserve University, Cleveland, OH.


  • Skoulakis, I. & Lagoudakis, M. (2012). Efficient reinforcement learning in adversarial games. In Proceedings of the 24th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Athens, Greece, 7–9 November (pp. 704–711). doi: 10.1109/ICTAI.2012.100


  • Smart, W. D. (2002). Making reinforcement learning work on real robots. Unpublished PhD dissertation, Brown University, Providence, RI.


  • Smart, W. D. & Kaelbling, L. P. (2002). Effective reinforcement learning for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Washington, D.C., 11–15 May (Vol. 4, pp. 3404–3410). doi: 10.1109/ROBOT.2002.1014237


  • Smith, A. J. (2002). Applications of the self-organising map to reinforcement learning. Neural Networks, 15(8–9), 1107–1124.


  • Stanley, K. O. & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127.


  • Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished PhD dissertation, University of Massachusetts, Amherst, MA.


  • Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8 (pp. 1038–1044). Cambridge, MA: MIT Press.


  • Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.


  • Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (pp. 1057–1063). Cambridge, MA: MIT Press.


  • Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., & Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Quebec, 14–18 June (pp. 993–1000). doi: 10.1145/1553374.1553501


  • Sutton, R. S., Szepesvári, C., & Maei, H. R. (2009b). A convergent o(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21 (pp. 1609–1616). Cambridge, MA: MIT Press.


  • Szepesvári, C. (2010). Algorithms for Reinforcement Learning. San Rafael, CA: Morgan & Claypool.


  • Tan, A.-H., Lu, N., & Xiao, D. (2008). Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback. IEEE Transactions on Neural Networks, 19(2), 230–244.


  • Taylor, M. E. & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1), 1633–1685.


  • Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3–4), 257–277.


  • Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58–68.


  • Tesauro, G., Jong, N. K., Das, R., & Bennani, M. N. (2007). On the use of hybrid reinforcement learning for autonomic resource allocation. Cluster Computing, 10(3), 287–299.


  • Thrun, S. (1995). Learning to play the game of Chess. In Advances in Neural Information Processing Systems 7 (pp. 1069–1076). Cambridge, MA: MIT Press.


  • Thrun, S. & Schwartz, A. (1993). Issues in using function approximation for reinforcement learning. In Mozer, M., Smokensky, P., Touretzky, D., Elman, J., & Weigand, A. (Eds.), Proceedings of the 4th Connectionist Models Summer School, Pittsburgh, PA, 2–5 August (pp. 255–263). Hillsdale, NJ: Lawrence Erlbaum.


  • Torrey, L. (2009). Relational transfer in reinforcement learning. Unpublished PhD dissertation, University of Wisconsin, Madison, WI.


  • Touzet, C. F. (1997). Neural reinforcement learning for behaviour synthesis. Robotics and Autonomous Systems, 22(3–4), 251–281.


  • Tsitsiklis, J. N. & Roy, B. V. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1–3), 59–94.


  • Tsitsiklis, J. N. & Roy, B. V. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.


  • van Eck, N. J. & van Wezel, M. (2008). Application of reinforcement learning to the game of Othello. Computers & Operations Research, 35(6), 1999–2017.


  • van Hasselt, H. & Wiering, M. A. (2007). Reinforcement learning in continuous action spaces. In Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), Honolulu, HI, 1–5 April (pp. 272–279). Retrieved from http://webdocs.cs.ualberta.ca/~vanhasse/papers/Reinforcement_Learning_in_Continuous_Action_Spaces.pdf

  • van Seijen, H., Whiteson, S., van Hasselt, H., & Wiering, M. (2011). Exploiting best-match equations for efficient reinforcement learning. Journal of Machine Learning Research, 12(Jun), 2045–2094.


  • Veness, J., Silver, D., Uther, W., & Blair, A. (2009). Bootstrapping from game tree search. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., & Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22 (pp. 1937–1945). Red Hook, NY: Curran Associates, Inc.


  • Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished PhD dissertation, King’s College, Cambridge, England.


  • Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.


  • Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioural sciences. Unpublished PhD dissertation, Harvard University, Cambridge, MA.


  • Werbos, P. J. (1989). Backpropagation and neurocontrol: A review and prospectus. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Washington, D.C., 18–22 June (pp. 209–216). doi: 10.1109/IJCNN.1989.118583


  • Whiteson, S. & Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7, 877–917.


  • Whiteson, S., Tanner, B., Taylor, M. E., & Stone, P. (2009). Generalized domains for empirical evaluations in reinforcement learning. In Proceedings of the 26th International Conference on Machine Learning: Workshop on Evaluation Methods for Machine Learning, Montreal, Canada, 14–18 June. Retrieved from http://www.site.uottawa.ca/ICML09WS/papers/w8.pdf

  • Whiteson, S., Taylor, M. E., & Stone, P. (2010). Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning. Journal of Autonomous Agents and Multi-Agent Systems, 21(1), 1–35.


  • Whiteson, S., Tanner, B., Taylor, M. E., & Stone, P. (2011). Protecting against evaluation overfitting in empirical reinforcement learning. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Paris, France, 11–15 April (pp. 120–127). doi: 10.1109/ADPRL.2011.5967363


  • Wiering, M. A. (1995). TD learning of game evaluation functions with hierarchical neural architectures. Unpublished master's thesis, Department of Computer Science, University of Amsterdam, Amsterdam, Netherlands.


  • Wiering, M. A. (2010). Self-play and using an expert to learn to play backgammon with temporal difference learning. Journal of Intelligent Learning Systems & Applications, 2(2), 57–68.


  • Wiering, M. A. & van Hasselt, H. (2007). Two novel on-policy reinforcement learning algorithms based on TD(λ)-methods. In Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Honolulu, HI, 1–5 April (pp. 280–287). doi: 10.1109/ADPRL.2007.368200


  • Wiering, M. A. & van Hasselt, H. (2008). Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, 38(4), 930–936.


  • Wiering, M. A., Patist, J. P., & Mannen, H. (2007). Learning to play board games using temporal difference methods (Technical Report UU–CS–2005-048, Institute of Information and Computing Sciences, Utrecht University). Retrieved from http://www.ai.rug.nl/~mwiering/GROUP/ARTICLES/learning_games_TR.pdf.


  • Wierstra, D., Foerster, A., Peters, J., & Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients. In Proceedings of the 17th International Conference on Artificial Neural Networks (ICANN), Paris, France, 9–13 September volume 4668 of Lecture Notes in Computer Science (pp. 697–706). doi: 10.1007/978-3-540-74690-4_71


  • Wierstra, D., Förster, A., Peters, J., & Schmidhuber, J. (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620–634.


  • Yamada, K. (2011). Network parameter setting for reinforcement learning approaches using neural networks. Journal of Advanced Computational Intelligence and Intelligent Informatics, 15(7), 822–830.


  • Yan, X., Diaconis, P., Rusmevichientong, P., & Roy, B. V. (2004). Solitaire: Man versus machine. In Advances in Neural Information Processing Systems 17 (pp. 1553–1560). Cambridge, MA: MIT Press.


  • Yoshioka, T., Ishii, S., and Ito, M. (1999). Strategy acquisition for the game 'Othello' based on reinforcement learning. IEICE Transactions on Information and Systems, E82-D(12), 1618–1626.


Author information

Correspondence to Christopher Gatti.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Gatti, C. (2015). Reinforcement Learning. In: Design of Experiments for Reinforcement Learning. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-12197-0_2

  • DOI: https://doi.org/10.1007/978-3-319-12197-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12196-3

  • Online ISBN: 978-3-319-12197-0
