Machine Learning, Volume 8, Issue 3–4, pp 323–339

Transfer of learning by composing solutions of elemental sequential tasks

  • Satinder Pal Singh


Although building sophisticated learning agents that operate in complex environments will require learning to perform multiple tasks, most applications of reinforcement learning have focused on single tasks. In this paper I consider a class of sequential decision tasks (SDTs), called composite sequential decision tasks, formed by temporally concatenating a number of elemental sequential decision tasks. Elemental SDTs cannot be decomposed into simpler SDTs. I consider a learning agent that has to learn to solve a set of elemental and composite SDTs. I assume that the structure of the composite tasks is unknown to the learning agent. The straightforward application of reinforcement learning to multiple tasks requires learning the tasks separately, which can waste computational resources, both memory and time. I present a new learning algorithm and a modular architecture that learns the decomposition of composite SDTs, and achieves transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs. The solution of a composite SDT is constructed by computationally inexpensive modifications of the solutions of its constituent elemental SDTs. I provide a proof of one aspect of the learning algorithm.
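The compositional idea described above can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's actual architecture: it uses plain tabular Q-learning on a one-dimensional corridor, trains one Q-table per elemental task (reach a fixed subgoal), and solves a composite task by executing the elemental greedy policies in sequence. Unlike the algorithm in the paper, the decomposition of the composite task is given here rather than learned; all names (`q_learn`, `run_policy`, the subgoal states) are illustrative.

```python
import random

N = 8                 # corridor states 0 .. 7
ACTIONS = [-1, +1]    # move left, move right

def step(s, a):
    """Deterministic corridor dynamics, clipped at the walls."""
    return min(max(s + a, 0), N - 1)

def q_learn(goal, episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning for one elemental task: reach `goal`."""
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s = random.randrange(N)
        for _ in range(50):
            if s == goal:
                break
            if random.random() < eps:
                ai = random.randrange(2)            # explore
            else:
                ai = max((0, 1), key=lambda i: Q[s][i])  # exploit
            s2 = step(s, ACTIONS[ai])
            r = 1.0 if s2 == goal else -0.1
            Q[s][ai] += alpha * (r + gamma * max(Q[s2]) - Q[s][ai])
            s = s2
    return Q

def run_policy(Q, s, goal, limit=50):
    """Follow the greedy policy of one elemental module to its subgoal."""
    traj = [s]
    while s != goal and len(traj) < limit:
        ai = max((0, 1), key=lambda i: Q[s][i])
        s = step(s, ACTIONS[ai])
        traj.append(s)
    return traj

random.seed(0)
Q_a = q_learn(goal=3)   # elemental task A: reach state 3
Q_b = q_learn(goal=7)   # elemental task B: reach state 7

# Composite task "A then B": reuse the elemental Q-tables by switching
# modules when each subgoal is reached -- no relearning of the composite.
traj = run_policy(Q_a, 0, 3)
traj += run_policy(Q_b, traj[-1], 7)[1:]
print(traj)
```

The point of the sketch is the last four lines: the composite solution is assembled from the stored elemental solutions at negligible cost, which is the transfer the abstract refers to.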


Keywords: reinforcement learning; compositional learning; modular architecture; transfer of learning



Copyright information

© Kluwer Academic Publishers 1992

Authors and Affiliations

  • Satinder Pal Singh
    1. Department of Computer Science, University of Massachusetts, Amherst
