Reinforcement Learning, pp. 99-115
Transfer of Learning by Composing Solutions of Elemental Sequential Tasks
Abstract
Although building sophisticated learning agents that operate in complex environments will require learning to perform multiple tasks, most applications of reinforcement learning have focused on single tasks. In this paper I consider a class of sequential decision tasks (SDTs), called composite sequential decision tasks, formed by temporally concatenating a number of elemental sequential decision tasks. Elemental SDTs cannot be decomposed into simpler SDTs. I consider a learning agent that has to learn to solve a set of elemental and composite SDTs. I assume that the structure of the composite tasks is unknown to the learning agent. The straightforward application of reinforcement learning to multiple tasks requires learning the tasks separately, which can waste computational resources, both memory and time. I present a new learning algorithm and a modular architecture that learns the decomposition of composite SDTs, and achieves transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs. The solution of a composite SDT is constructed by computationally inexpensive modifications of the solutions of its constituent elemental SDTs. I provide a proof of one aspect of the learning algorithm.
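As a rough illustration of the sharing idea only, the sketch below solves two elemental "reach a goal cell" tasks once and reuses those solutions in two different composite tasks. It is not the learning algorithm or architecture of the paper: the corridor world, the per-step cost, the function names, and the rule of switching to the next elemental solution when a subgoal is reached are all assumptions made for this example, and the decomposition is given by hand here, whereas the paper assumes it is unknown to the agent.

```python
from typing import Dict, List

N_STATES = 6          # hypothetical 1-D corridor with cells 0..5
ACTIONS = (-1, +1)    # step left or step right

def step(state: int, action: int) -> int:
    """Deterministic move, clipped to the ends of the corridor."""
    return min(max(state + action, 0), N_STATES - 1)

def solve_elemental(goal: int, gamma: float = 0.9, sweeps: int = 50) -> List[List[float]]:
    """Q-values for the elemental task 'reach goal', via repeated Bellman backups."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s == goal:
                continue  # the goal is absorbing; its values stay at 0
            for i, a in enumerate(ACTIONS):
                nxt = step(s, a)
                # assumed cost of -1 per step until the goal is reached
                q[s][i] = -1.0 + gamma * max(q[nxt])
    return q

def run_composite(start: int, subgoals: List[int],
                  solutions: Dict[int, List[List[float]]]) -> List[int]:
    """Follow each elemental greedy policy in turn, switching at each subgoal."""
    state, path = start, [start]
    for g in subgoals:
        q = solutions[g]  # shared elemental solution, not re-learned per composite
        while state != g:
            best = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            state = step(state, ACTIONS[best])
            path.append(state)
    return path

# Each elemental task is solved once ...
solutions = {g: solve_elemental(g) for g in (1, 4)}
# ... and reused by two composite tasks that concatenate them in different orders.
path_a = run_composite(0, [4, 1], solutions)
path_b = run_composite(5, [1, 4], solutions)
```

Both composite tasks index into the same `solutions` dictionary, so adding a composite task costs no extra solving of elemental tasks. That is the memory and time saving the abstract refers to, shown here in a much simplified form.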
Keywords
Reinforcement learning, compositional learning, modular architecture, transfer of learning