Abstract
This article provides a brief overview of reinforcement learning, from its origins to current research trends, including deep reinforcement learning, with an emphasis on first principles.
G. Hocquet—Work performed while visiting the University of California, Irvine.
Acknowledgment
This research was in part supported by National Science Foundation grant IIS-1550705 and a Google Faculty Research Award to PB.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this chapter
Agostinelli, F., Hocquet, G., Singh, S., Baldi, P. (2018). From Reinforcement Learning to Deep Reinforcement Learning: An Overview. In: Rozonoer, L., Mirkin, B., Muchnik, I. (eds) Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Lecture Notes in Computer Science(), vol 11100. Springer, Cham. https://doi.org/10.1007/978-3-319-99492-5_13
Print ISBN: 978-3-319-99491-8
Online ISBN: 978-3-319-99492-5