Machine Learning, Volume 8, Issue 3–4, pp 323–339

Transfer of learning by composing solutions of elemental sequential tasks

  • Satinder Pal Singh


Although building sophisticated learning agents that operate in complex environments will require learning to perform multiple tasks, most applications of reinforcement learning have focused on single tasks. In this paper I consider a class of sequential decision tasks (SDTs), called composite sequential decision tasks, formed by temporally concatenating a number of elemental sequential decision tasks. Elemental SDTs cannot be decomposed into simpler SDTs. I consider a learning agent that has to learn to solve a set of elemental and composite SDTs. I assume that the structure of the composite tasks is unknown to the learning agent. The straightforward application of reinforcement learning to multiple tasks requires learning the tasks separately, which can waste computational resources, both memory and time. I present a new learning algorithm and a modular architecture that learns the decomposition of composite SDTs, and achieves transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs. The solution of a composite SDT is constructed by computationally inexpensive modifications of the solutions of its constituent elemental SDTs. I provide a proof of one aspect of the learning algorithm.
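The compositional idea described above can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's actual architecture: it uses plain tabular Q-learning on a one-dimensional corridor, trains one Q-table per elemental task (reach a fixed subgoal), and solves a composite task by executing the elemental greedy policies in sequence. Unlike the algorithm in the paper, the decomposition of the composite task is given here rather than learned; all names (`q_learn`, `run_policy`, the subgoal states) are illustrative.

```python
import random

N = 8                 # corridor states 0 .. 7
ACTIONS = [-1, +1]    # move left, move right

def step(s, a):
    """Deterministic corridor dynamics, clipped at the walls."""
    return min(max(s + a, 0), N - 1)

def q_learn(goal, episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning for one elemental task: reach `goal`."""
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s = random.randrange(N)
        for _ in range(50):
            if s == goal:
                break
            if random.random() < eps:
                ai = random.randrange(2)            # explore
            else:
                ai = max((0, 1), key=lambda i: Q[s][i])  # exploit
            s2 = step(s, ACTIONS[ai])
            r = 1.0 if s2 == goal else -0.1
            Q[s][ai] += alpha * (r + gamma * max(Q[s2]) - Q[s][ai])
            s = s2
    return Q

def run_policy(Q, s, goal, limit=50):
    """Follow the greedy policy of one elemental module to its subgoal."""
    traj = [s]
    while s != goal and len(traj) < limit:
        ai = max((0, 1), key=lambda i: Q[s][i])
        s = step(s, ACTIONS[ai])
        traj.append(s)
    return traj

random.seed(0)
Q_a = q_learn(goal=3)   # elemental task A: reach state 3
Q_b = q_learn(goal=7)   # elemental task B: reach state 7

# Composite task "A then B": reuse the elemental Q-tables by switching
# modules when each subgoal is reached -- no relearning of the composite.
traj = run_policy(Q_a, 0, 3)
traj += run_policy(Q_b, traj[-1], 7)[1:]
print(traj)
```

The point of the sketch is the last four lines: the composite solution is assembled from the stored elemental solutions at negligible cost, which is the transfer the abstract refers to.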


Keywords: reinforcement learning; compositional learning; modular architecture; transfer of learning



Copyright information

© Kluwer Academic Publishers 1992

Authors and Affiliations

  • Satinder Pal Singh
    1. Department of Computer Science, University of Massachusetts, Amherst
