
Sadhana, Volume 19, Issue 6, pp 851–889

A tutorial survey of reinforcement learning

  • S Sathiya Keerthi
  • B Ravindran

Abstract

This paper gives a compact, self-contained tutorial survey of reinforcement learning, a tool that is increasingly finding application in the development of intelligent dynamic systems. Research on reinforcement learning during the past decade has led to the development of a variety of useful algorithms. This paper surveys the literature and presents the algorithms in a cohesive framework.
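The full text is not reproduced here, but to give a flavour of the kind of algorithm the survey covers, the sketch below illustrates one of the best known of them, Watkins' tabular Q-learning, on a toy problem. The five-state chain environment, the parameter values, and all names in the code are invented for illustration and are not taken from the paper.

```python
import random

random.seed(0)

# Minimal tabular Q-learning sketch on a hypothetical 5-state chain:
# action 0 moves left, action 1 moves right; reaching the right end
# (state 4) is terminal and yields reward 1, all other rewards are 0.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # step size, discount, exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Toy dynamics: deterministic moves, terminal reward at the right end."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    return (s2, 1.0, True) if s2 == N_STATES - 1 else (s2, 0.0, False)

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection (ties broken at random).
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: (Q[(s, a)], random.random()))
        s2, r, done = step(s, a)
        # One-step Q-learning update:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r if done else r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print({s: round(max(Q[(s, a)] for a in ACTIONS), 3) for s in range(N_STATES)})
```

After training, the greedy policy moves right from every state, and the learned value of a state k steps from the goal approaches the discounted return GAMMA**(k-1), i.e. roughly 1.0, 0.9, 0.81, 0.729 moving away from the terminal state.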

Keywords

Reinforcement learning; dynamic programming; optimal control; neural networks



Copyright information

© Indian Academy of Sciences 1994

Authors and Affiliations

  • S Sathiya Keerthi (1)
  • B Ravindran (1)

  1. Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
