Recent Advances in Reinforcement Learning, pp. 227–250

# The Effect of Representation and Knowledge on Goal-Directed Exploration with Reinforcement-Learning Algorithms


## Abstract

We analyze the complexity of on-line reinforcement-learning algorithms applied to goal-directed exploration tasks. Previous work had concluded that, even in deterministic state spaces, initially uninformed reinforcement learning required at least exponential time for such problems, or achieved polynomial worst-case time complexity only if the learning methods were augmented. We prove that, to the contrary, the algorithms are tractable with only a simple change in the reward structure (“penalizing the agent for action executions”) or in the initialization of the values that they maintain. In particular, we provide tight complexity bounds for both Watkins’ Q-learning and Heger’s Q-hat-learning, and show how their complexity depends on properties of the state spaces. We also demonstrate how the complexity can be decreased even further by either learning action models or utilizing prior knowledge of the topology of the state spaces. Our results provide guidance for empirical reinforcement-learning researchers on how to distinguish hard reinforcement-learning problems from easy ones and how to represent problems in a way that allows them to be solved efficiently.
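To make the contrast between the two reward structures concrete, here is a minimal sketch (not the authors’ implementation) of zero-initialized, greedy Q-learning on a small deterministic chain of states. The domain, its size `N`, and the step cap `MAX_STEPS` are illustrative assumptions; only the one-step update, Q(s,a) ← Q(s,a) + α(r + γ max_b Q(s′,b) − Q(s,a)), is standard Watkins’ Q-learning.

```python
# Sketch: zero-initialized Watkins' Q-learning on a deterministic chain
# of states 0..N-1, with the goal at state N-1. N, MAX_STEPS, and the
# chain domain itself are illustrative assumptions, not from the chapter.

N = 10
ACTIONS = (-1, +1)        # deterministic moves left/right, clipped to the chain
MAX_STEPS = 10_000        # safety cap for the uninformative reward structure

def step(s, a):
    return max(0, min(N - 1, s + a))

def q_learning(action_penalty, gamma=1.0, alpha=1.0):
    # Uninformed initialization: all Q-values start at zero.
    q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    s, steps = 0, 0
    while s != N - 1 and steps < MAX_STEPS:
        a = max(ACTIONS, key=lambda b: q[(s, b)])   # greedy; ties -> first action
        s2 = step(s, a)
        # Action-penalty representation: reward -1 for every action execution.
        # Goal-reward representation: reward 0 everywhere except at the goal.
        r = -1.0 if action_penalty else (1.0 if s2 == N - 1 else 0.0)
        best_next = 0.0 if s2 == N - 1 else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # Watkins' update
        s, steps = s2, steps + 1
    return steps if s == N - 1 else None

print("action-penalty:", q_learning(True))    # reaches the goal quickly
print("goal-reward:   ", q_learning(False))   # None: greedy gets no guidance
```

Under the action-penalty representation the −1 rewards propagate distance-to-goal information through the Q-values, so the greedy agent reaches the goal after a small polynomial number of steps. Under the goal-reward representation all values stay zero until the goal is first reached, so greedy action selection (here with deterministic tie-breaking) receives no guidance at all, which mirrors the intractability result the chapter starts from.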

## Keywords

action models · admissible and consistent heuristics · action-penalty representation · complexity · goal-directed exploration · goal-reward representation · on-line reinforcement learning · prior knowledge · reward structure · Q-hat-learning · Q-learning


## References

- Barto, A.G., S.J. Bradtke, and S.P. Singh. (1995). Learning to act using real-time dynamic programming. *Artificial Intelligence*, 73(1):81–138.
- Barto, A.G., R.S. Sutton, and C.J. Watkins. (1989). Learning and sequential decision making. Technical Report 89-95, Department of Computer Science, University of Massachusetts at Amherst.
- Bellman, R. (1957). *Dynamic Programming*. Princeton University Press, Princeton (New Jersey).
- Boddy, M. and T. Dean. (1989). Solving time-dependent planning problems. In *Proceedings of the IJCAI*, pages 979–984.
- Goodwin, R. (1994). Reasoning about when to start acting. In *Proceedings of the International Conference on Artificial Intelligence Planning Systems*, pages 86–91.
- Heger, M. (1996). The loss from imperfect value functions in expectation-based and minimax-based tasks. *Machine Learning*, pages 197–225.
- Heger, M. (1994). Consideration of risk in reinforcement learning. In *Proceedings of the International Conference on Machine Learning*, pages 105–111.
- Ishida, T. (1992). Moving target search with intelligence. In *Proceedings of the AAAI*, pages 525–532.
- Ishida, T. and R.E. Korf. (1991). Moving target search. In *Proceedings of the IJCAI*, pages 204–210.
- Kaelbling, L.P. (1990). *Learning in Embedded Systems*. MIT Press, Cambridge (Massachusetts).
- Koenig, S. (1991). Optimal probabilistic and decision-theoretic planning using Markovian decision theory. Master’s thesis, Computer Science Department, University of California at Berkeley. (Available as Technical Report UCB/CSD 92/685.)
- Koenig, S. and R.G. Simmons. (1992). Complexity analysis of real-time reinforcement learning applied to finding shortest paths in deterministic domains. Technical Report CMU-CS-93-106, School of Computer Science, Carnegie Mellon University.
- Koenig, S. and R.G. Simmons. (1993). Complexity analysis of real-time reinforcement learning. In *Proceedings of the AAAI*, pages 99–105.
- Koenig, S. and R.G. Simmons. (1994). How to make reactive planners risk-sensitive. In *Proceedings of the International Conference on Artificial Intelligence Planning Systems*, pages 293–298.
- Koenig, S. and R.G. Simmons. (1995a). The effect of representation and knowledge on goal-directed exploration with reinforcement learning algorithms: The proofs. Technical Report CMU-CS-95-177, School of Computer Science, Carnegie Mellon University.
- Koenig, S. and R.G. Simmons. (1995b). Real-time search in non-deterministic domains. In *Proceedings of the IJCAI*, pages 1660–1667.
- Korf, R.E. (1990). Real-time heuristic search. *Artificial Intelligence*, 42(2–3):189–211.
- Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning, and teaching. *Machine Learning*, 8:293–321.
- Matarić, M. (1994). *Interaction and Intelligent Behavior*. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
- Moore, A.W. and C.G. Atkeson. (1993a). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In *Proceedings of the NIPS*.
- Moore, A.W. and C.G. Atkeson. (1993b). Prioritized sweeping: Reinforcement learning with less data and less time. *Machine Learning*, 13:103–130.
- Nilsson, N.J. (1971). *Problem-Solving Methods in Artificial Intelligence*. McGraw-Hill, New York (New York).
- Pearl, J. (1984). *Heuristics: Intelligent Search Strategies for Computer Problem Solving*. Addison-Wesley, Menlo Park (California).
- Peng, J. and R.J. Williams. (1992). Efficient learning and planning within the DYNA framework. In *Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats*, pages 281–290.
- Russell, S. and E. Wefald. (1991). *Do the Right Thing: Studies in Limited Rationality*. MIT Press, Cambridge (Massachusetts).
- Singh, S.P. (1992). Reinforcement learning with a hierarchy of abstract models. In *Proceedings of the AAAI*, pages 202–207.
- Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In *Proceedings of the International Conference on Machine Learning*, pages 216–224.
- Sutton, R.S. (1991). DYNA, an integrated architecture for learning, planning, and reacting. *SIGART Bulletin*, 2(4):160–163.
- Thrun, S.B. (1992a). Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University.
- Thrun, S.B. (1992b). The role of exploration in learning control with neural networks. In D.A. White and D.A. Sofge, editors, *Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches*, pages 527–559. Van Nostrand Reinhold, New York (New York).
- Watkins, C.J. and P. Dayan. (1992). Q-learning. *Machine Learning*, 8(3–4):279–292.
- Whitehead, S.D. (1991a). A complexity analysis of cooperative mechanisms in reinforcement learning. In *Proceedings of the AAAI*, pages 607–613.
- Whitehead, S.D. (1991b). A study of cooperative mechanisms for faster reinforcement learning. Technical Report 365, Computer Science Department, University of Rochester.
- Whitehead, S.D. (1992). *Reinforcement Learning for the Adaptive Control of Perception and Action*. PhD thesis, Computer Science Department, University of Rochester.
- Yee, R. (1992). Abstraction in control learning. Technical Report 92-16, Department of Computer Science, University of Massachusetts at Amherst.
- Zilberstein, S. (1993). *Operational Rationality through Compilation of Anytime Algorithms*. PhD thesis, Computer Science Department, University of California at Berkeley.