Abstract
One of the primary goals of AI is to produce fully autonomous agents that learn optimal behaviors through trial and error while interacting with their environments. The reinforcement learning paradigm is, in essence, learning through interaction; it has its roots in behaviorist psychology. Reinforcement learning is also influenced by optimal control, which is underpinned by the mathematical formalism of dynamic programming. This chapter introduces reinforcement learning.
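As a concrete illustration of this dynamic-programming underpinning (a standard textbook formulation in conventional notation, not necessarily the chapter's own), the optimal action-value function \(Q^*\) is the fixed point of the Bellman optimality equation
\[ Q^*(s,a) = \mathbb{E}\big[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1},a') \,\big|\, s_t = s,\ a_t = a \,\big], \]
and Q-learning approximates this fixed point by trial and error, updating after each interaction with the environment:
\[ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \big[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \,\big], \]
where \(\alpha\) is the learning rate and \(\gamma \in [0,1)\) the discount factor.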
© 2019 Springer-Verlag London Ltd., part of Springer Nature
Du, K.-L., & Swamy, M. N. S. (2019). Reinforcement learning. In Neural networks and statistical learning. London: Springer. https://doi.org/10.1007/978-1-4471-7452-3_17