Machine Learning

, Volume 8, Issue 3–4, pp 293–321 | Cite as

Self-improving reactive agents based on reinforcement learning, planning and teaching

  • Long-Ji Lin


To date, reinforcement learning has mostly been studied solving simple learning tasks. Reinforcement learning methods that have been studied so far typically converge slowly. The purpose of this work is thus two-fold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that will speed up reinforcement learning.

This paper compares eight reinforcement learning frameworks:adaptive heuristic critic (AHC) learning due to Sutton,Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionism as an approach to generalization. To evaluate the performance of different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents empirical evaluation of the frameworks.


Reinforcement learning planning teaching connectionist networks 


  1. Anderson, C.W. (1987). Strategy learning with multilayer connectionist representations.Proceedings of the Fourth International Workshop on Machine Learning (pp. 103–114).Google Scholar
  2. Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In: M. Gabriel & J.W. Moore (Eds.),Learning and computational neuroscience. MIT Press.Google Scholar
  3. Barto, A.G., Bradtke, S.J., & Singh, S.P. (1991).Real-time learning and control using asynchronous dynamic programming. (Technical Report 91–57). University of Massachusetts, Computer Science Department.Google Scholar
  4. Chapman, D. & Kaelbling, L.P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons.Proceedings of IJCAI-91.Google Scholar
  5. Dayan, P. (1992). The convergence of TD(λ) for general λ.Machine Learning, 8, 341–362.Google Scholar
  6. Grefenstette, J.J., Ramsey, C.L., & Schultz, A.C. (1990). Learning sequential decision rules using simulation models and competition.Machine Learning, 5, 355–382.Google Scholar
  7. Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations.Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1, Bradford Books/MIT Press.Google Scholar
  8. Howard, R.A. (1960).Dynamic programming and Markov processes. Wiley, New York.Google Scholar
  9. Kaelbling, L.P. (1990).Learning in embedded systems. Ph.D. Thesis, Department of Computer Science, Stanford University.Google Scholar
  10. Lang, K.J. (1989).A time-delay neural network architecture for speech recognition. Ph.D. Thesis, School of Computer Science, Carnegie Mellon University.Google Scholar
  11. Lin, Long-Ji. (1991a). Self-improving reactive agents: Case studies of reinforcement learning frameworks.Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats (pp. 297–305). Also Technical Report CMU-CS-90-109, Carnegie Mellon University.Google Scholar
  12. Lin, Long-Ji. (1991b). Self-improvement based on reinforcement learning, planning and teaching.Proceedings of the Eighth International Workshop on Machine Learning (pp. 323–327).Google Scholar
  13. Lin, Long-Ji. (1991c). Programming robots using reinforcement learning and teaching.Proceedings of AAAI-91 (pp. 781–786).Google Scholar
  14. Mahadevan, S. & Connell, J. (1991). Scaling reinforcement learning to robotics by exploiting the subsumption architecture.Proceedings of the Eighth International Workshop on Machine Learning (pp. 328–332).Google Scholar
  15. Mitchell, T.M. (1982). Generalization as search.Articial Intelligence, 18, 203–226.Google Scholar
  16. Moore, A.W. (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces.Proceedings of the Eighth International Workshop on Machine Learning (pp. 333–337).Google Scholar
  17. Mozer, M.C. (1986).RAMBOT: A connectionist expert system that learns by example. (Institute for Cognitive Science Report 8610). University of California at San Diego.Google Scholar
  18. Pomerleau, D.A. (1989).ALVINN: An autonomous land vehicle in a neural network (Technical Report CMU-CS-89-107). Carnegie Mellon University.Google Scholar
  19. Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation.Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1. Bradford Books/MIT Press.Google Scholar
  20. Sutton, R.S. (1984).Temporal credit assignment in reinforcement learning. Ph.D. Thesis, Dept. of Computer and Information Science, University of Massachusetts.Google Scholar
  21. Sutton, R.S. (1988). Learning to predict by the methods of temporal differences.Machine Learning, 3, 9–44.Google Scholar
  22. Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.Proceedings of the Seventh International Workshop on Machine Learning (pp. 216–224).Google Scholar
  23. Tan, Ming. (1991). Learning a cost-sensitive internal representation for reinforcement learning.Proceedings of the Eighth International Workshop on Machine Learning (pp. 358–362).Google Scholar
  24. Thrun, S.B., Möller, K., & Linden, A. (1991). Planning with an adaptive world model. In D.S. Touretzky (Ed.),Advances in neural information processing systems 3, Morgan Kaufmann.Google Scholar
  25. Thrun, S.B. & Möller, K. (1992). Active exploration in dynamic environments. To appear in D.S. Touretzky (Ed.),Advances in neural information processing systems 4, Morgan Kaufmann.Google Scholar
  26. Watkins, C.J.C.H. (1989).Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge.Google Scholar
  27. Williams, R.J. & Zipser, D. (1988).A learning algorithm for continually running fully recurrent neural networks (Institute for Cognitive Science Report 8805). University of California at San Diego.Google Scholar
  28. Whitehead, S.D. & Ballard, D.H. (1989). A role for anticipation in reactive systems that learn.Proceedings of the Sixth International Workshop on Machine Learning (pp. 354–357).Google Scholar
  29. Whitehead, S.D. & Ballard, D.H. (1991a). Learning to perceive and act by trial and error.Machine Learning, 7 45–83.Google Scholar
  30. Whitehead, S.D. (1991b). Complexity and cooperation in Q-learning.Proceedings of the Eighth International Workshop on Machine Learning (pp. 363–367).Google Scholar

Copyright information

© Kluwer Academic Publishers 1992

Authors and Affiliations

  • Long-Ji Lin
    • 1
  1. 1.School of Computer ScienceCarnegie Mellon UniversityPittsburgh

Personalised recommendations