Simple statistical gradient-following algorithms for connectionist reinforcement learning
 Ronald J. Williams
Abstract
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
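As a rough illustration of the kind of update the abstract describes, the sketch below trains a single stochastic unit on an immediate-reinforcement task. It is only a minimal example under stated assumptions, not the article's own code: the unit is taken to be Bernoulli-logistic with one bias weight, and the function name `train_bernoulli_unit`, the reward probabilities, the learning rate `alpha`, and the running-average baseline are all hypothetical choices made for this example. Each step adjusts the weight by (reward minus baseline) times the derivative of the log-probability of the sampled output, so no gradient estimate is accumulated or stored.

```python
import math
import random


def sigmoid(s):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-s))


def train_bernoulli_unit(reward_prob, steps=5000, alpha=0.1, seed=0):
    """Train one Bernoulli-logistic unit with a REINFORCE-style update.

    reward_prob[y] is the (hypothetical) probability that the environment
    returns reward 1 when the unit emits output y (0 or 1).
    Returns the final probability that the unit emits 1.
    """
    rng = random.Random(seed)
    w = 0.0         # single bias weight; the input is the constant x = 1
    baseline = 0.0  # running average of past rewards (an assumed choice)
    for _ in range(steps):
        p = sigmoid(w)
        y = 1 if rng.random() < p else 0                  # stochastic output
        r = 1.0 if rng.random() < reward_prob[y] else 0.0  # scalar reward
        # d/dw of ln Pr(y | w) for a Bernoulli-logistic unit is (y - p) * x,
        # with x = 1 here; scale it by the baseline-corrected reward.
        w += alpha * (r - baseline) * (y - p)
        # Update the baseline afterwards so it depends only on past rewards.
        baseline += 0.05 * (r - baseline)
    return sigmoid(w)


if __name__ == "__main__":
    # Output 1 is rewarded more often, so Pr(y = 1) should drift upward.
    print(train_bernoulli_unit({0: 0.2, 1: 0.8}))
```

The baseline is computed from earlier rewards only, so it does not depend on the current output; this keeps the expected update aligned with the gradient of expected reward while reducing its variance.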
 Title
 Simple statistical gradient-following algorithms for connectionist reinforcement learning
 Journal

Machine Learning
Volume 8, Issue 3–4, pp 229–256
 Cover Date
 1992-05-01
 DOI
 10.1007/BF00992696
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers
 Keywords

 Reinforcement learning
 connectionist networks
 gradient descent
 mathematical analysis
 Authors

 Ronald J. Williams (1)
 Author Affiliations

 1. College of Computer Science, 161 CN, Northeastern University, 360 Huntington Ave., Boston, MA 02115