
A tutorial survey of reinforcement learning

Published in: Sadhana
Abstract

This paper gives a compact, self-contained tutorial survey of reinforcement learning, a technique that is finding increasing application in the development of intelligent dynamic systems. Research on reinforcement learning over the past decade has produced a variety of useful algorithms. This paper surveys that literature and presents the algorithms in a cohesive framework.
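To make the survey's subject concrete: one of the best-known algorithms in this family is tabular Q-learning, which learns action values from sampled transitions without a model of the environment. The sketch below is an illustrative example only, not taken from the paper; the chain environment, learning rate, and exploration schedule are assumptions chosen for a minimal demonstration.

```python
import random

random.seed(0)

N_STATES = 4          # states 0..3 on a chain; state 3 is the terminal goal
ACTIONS = [-1, +1]    # action 0: move left, action 1: move right
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.2

# Q-table initialised to zero
Q = {(s, a): 0.0 for s in range(N_STATES) for a in range(len(ACTIONS))}

def step(s, a):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

for _ in range(500):                              # training episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda b: Q[(s, b)])
        s2, r = step(s, a)
        # Q-learning update: bootstrap on the best next-state value
        best_next = max(Q[(s2, b)] for b in range(len(ACTIONS)))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# Optimal values for "move right" are gamma^(distance to goal - 1)
print(round(Q[(2, 1)], 2), round(Q[(1, 1)], 2), round(Q[(0, 1)], 2))
```

With enough episodes the learned values approach the discounted optimal returns (1.0, 0.9, 0.81 for the three non-terminal states), illustrating the kind of convergence behaviour the survey's framework is built to analyse.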




Cite this article

Sathiya Keerthi, S., Ravindran, B. A tutorial survey of reinforcement learning. Sadhana 19, 851–889 (1994). https://doi.org/10.1007/BF02743935

