## Abstract

This paper presents a compact, self-contained tutorial survey of reinforcement learning, a tool that is finding increasing application in the development of intelligent dynamic systems. Research on reinforcement learning over the past decade has produced a variety of useful algorithms; this paper surveys that literature and presents the algorithms within a cohesive framework.

## Keywords

Reinforcement learning; dynamic programming; optimal control; neural networks

## Copyright information

© Indian Academy of Sciences 1994