Self-improving reactive agents based on reinforcement learning, planning and teaching


To date, reinforcement learning has mostly been studied in the context of solving simple learning tasks. Moreover, the reinforcement learning methods studied so far typically converge slowly. The purpose of this work is thus two-fold: (1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and (2) to investigate methods that will speed up reinforcement learning.

This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to each of these basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionist networks as an approach to generalization. To evaluate the performance of the different frameworks, a moderately complex, nondeterministic dynamic environment was used as a testbed. This paper describes the frameworks and algorithms in detail and presents an empirical evaluation of the frameworks.
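Of the three extensions, experience replay is the simplest to illustrate in isolation. The sketch below is a minimal tabular version on a toy corridor environment — the environment and all names are illustrative assumptions, not the paper's connectionist setup. It collects transitions under a random behavior policy and then replays them repeatedly with the standard Q-learning update:

```python
import random
from collections import defaultdict

# Toy corridor: states 0..5; action 1 moves right, action 0 moves left;
# reaching state 5 yields reward 1.0 and ends the episode.
N_STATES = 6
ACTIONS = (0, 1)
GOAL = N_STATES - 1

def step(state, action):
    nxt = min(GOAL, state + 1) if action == 1 else max(0, state - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def collect_experience(n_steps=5000, seed=0):
    """Gather transitions with a random behavior policy."""
    rng = random.Random(seed)
    buffer, s = [], 0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)
        s2, r, done = step(s, a)
        buffer.append((s, a, r, s2, done))
        s = 0 if done else s2
    return buffer

def q_learn_with_replay(buffer, sweeps=50, alpha=0.5, gamma=0.9, seed=1):
    """Replay stored experiences repeatedly, applying the Q-learning update."""
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(state, action)] -> estimated return
    for _ in range(sweeps):
        for s, a, r, s2, done in rng.sample(buffer, len(buffer)):
            target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
    return q

q = q_learn_with_replay(collect_experience())
# Greedy policy: should move right (action 1) from every non-terminal state.
policy = [max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(GOAL)]
print(policy)
```

Because Q-learning is off-policy, learning from stored experiences generated by an old (here, random) policy is sound; each experience can safely be reused many times, which is exactly what makes replay a speed-up.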


  1. Anderson, C.W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning (pp. 103–114).

  2. Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel & J.W. Moore (Eds.), Learning and computational neuroscience. MIT Press.

  3. Barto, A.G., Bradtke, S.J., & Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic programming (Technical Report 91–57). University of Massachusetts, Computer Science Department.

  4. Chapman, D. & Kaelbling, L.P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. Proceedings of IJCAI-91.

  5. Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341–362.

  6. Grefenstette, J.J., Ramsey, C.L., & Schultz, A.C. (1990). Learning sequential decision rules using simulation models and competition. Machine Learning, 5, 355–382.

  7. Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Bradford Books/MIT Press.

  8. Howard, R.A. (1960). Dynamic programming and Markov processes. Wiley, New York.

  9. Kaelbling, L.P. (1990). Learning in embedded systems. Ph.D. Thesis, Department of Computer Science, Stanford University.

  10. Lang, K.J. (1989). A time-delay neural network architecture for speech recognition. Ph.D. Thesis, School of Computer Science, Carnegie Mellon University.

  11. Lin, Long-Ji. (1991a). Self-improving reactive agents: Case studies of reinforcement learning frameworks. Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats (pp. 297–305). Also Technical Report CMU-CS-90-109, Carnegie Mellon University.

  12. Lin, Long-Ji. (1991b). Self-improvement based on reinforcement learning, planning and teaching. Proceedings of the Eighth International Workshop on Machine Learning (pp. 323–327).

  13. Lin, Long-Ji. (1991c). Programming robots using reinforcement learning and teaching. Proceedings of AAAI-91 (pp. 781–786).

  14. Mahadevan, S. & Connell, J. (1991). Scaling reinforcement learning to robotics by exploiting the subsumption architecture. Proceedings of the Eighth International Workshop on Machine Learning (pp. 328–332).

  15. Mitchell, T.M. (1982). Generalization as search. Artificial Intelligence, 18, 203–226.

  16. Moore, A.W. (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. Proceedings of the Eighth International Workshop on Machine Learning (pp. 333–337).

  17. Mozer, M.C. (1986). RAMBOT: A connectionist expert system that learns by example (Institute for Cognitive Science Report 8610). University of California at San Diego.

  18. Pomerleau, D.A. (1989). ALVINN: An autonomous land vehicle in a neural network (Technical Report CMU-CS-89-107). Carnegie Mellon University.

  19. Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Bradford Books/MIT Press.

  20. Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Thesis, Dept. of Computer and Information Science, University of Massachusetts.

  21. Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

  22. Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Workshop on Machine Learning (pp. 216–224).

  23. Tan, Ming. (1991). Learning a cost-sensitive internal representation for reinforcement learning. Proceedings of the Eighth International Workshop on Machine Learning (pp. 358–362).

  24. Thrun, S.B., Möller, K., & Linden, A. (1991). Planning with an adaptive world model. In D.S. Touretzky (Ed.), Advances in neural information processing systems 3. Morgan Kaufmann.

  25. Thrun, S.B. & Möller, K. (1992). Active exploration in dynamic environments. To appear in D.S. Touretzky (Ed.), Advances in neural information processing systems 4. Morgan Kaufmann.

  26. Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge.

  27. Williams, R.J. & Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks (Institute for Cognitive Science Report 8805). University of California at San Diego.

  28. Whitehead, S.D. & Ballard, D.H. (1989). A role for anticipation in reactive systems that learn. Proceedings of the Sixth International Workshop on Machine Learning (pp. 354–357).

  29. Whitehead, S.D. & Ballard, D.H. (1991a). Learning to perceive and act by trial and error. Machine Learning, 7, 45–83.

  30. Whitehead, S.D. (1991b). Complexity and cooperation in Q-learning. Proceedings of the Eighth International Workshop on Machine Learning (pp. 363–367).

Cite this article

Lin, LJ. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8, 293–321 (1992).

Keywords: reinforcement learning, planning, teaching, connectionist networks