Abstract
One of the primary goals of AI is to produce fully autonomous agents that learn optimal behaviors through trial and error while interacting with their environments. The reinforcement learning paradigm is, in essence, learning through interaction; it has its roots in behaviorist psychology. Reinforcement learning is also influenced by optimal control, which is underpinned by the mathematical formalism of dynamic programming. This chapter introduces reinforcement learning.
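As a concrete illustration of this dynamic-programming underpinning (a standard textbook formulation in conventional notation, not necessarily the chapter's own), the optimal action-value function \(Q^*\) is the fixed point of the Bellman optimality equation
\[ Q^*(s,a) = \mathbb{E}\big[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1},a') \,\big|\, s_t = s,\ a_t = a \,\big], \]
and Q-learning approximates this fixed point by trial and error, updating after each interaction with the environment:
\[ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \big[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \,\big], \]
where \(\alpha\) is the learning rate and \(\gamma \in [0,1)\) the discount factor.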
© 2019 Springer-Verlag London Ltd., part of Springer Nature
Du, K.-L., & Swamy, M. N. S. (2019). Reinforcement learning. In Neural networks and statistical learning. London: Springer. https://doi.org/10.1007/978-1-4471-7452-3_17