The Effect of Representation and Knowledge on Goal-Directed Exploration with Reinforcement-Learning Algorithms

  • Sven Koenig
  • Reid G. Simmons

Abstract

We analyze the complexity of on-line reinforcement-learning algorithms applied to goal-directed exploration tasks. Previous work had concluded that, even in deterministic state spaces, initially uninformed reinforcement learning was at least exponential for such problems, or that its worst-case time complexity was polynomial only if the learning methods were augmented. We prove that, to the contrary, the algorithms are tractable with only a simple change in the reward structure (“penalizing the agent for action executions”) or in the initialization of the values that they maintain. In particular, we provide tight complexity bounds for both Watkins’ Q-learning and Heger’s Q-hat-learning and show how their complexity depends on properties of the state spaces. We also demonstrate how one can decrease the complexity even further by either learning action models or utilizing prior knowledge of the topology of the state spaces. Our results provide guidance for empirical reinforcement-learning researchers on how to distinguish hard reinforcement-learning problems from easy ones and how to represent them in a way that allows them to be solved efficiently.
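
To make the two reward structures concrete, the following is a minimal sketch (illustrative, not the authors' implementation) of 1-step Q-learning for goal-directed exploration on a toy deterministic graph. The graph, the helper q_learn, and the specific reward functions are assumptions introduced here for illustration; the sketch only shows why zero-initialized Q-values are optimistic, and therefore informative, under the action-penalty representation, whereas under the goal-reward representation they provide no guidance until the goal has been found for the first time.

```python
# Minimal sketch (illustrative, not the authors' code): 1-step Q-learning for
# goal-directed exploration on a toy deterministic graph.
import random

# Deterministic state space: state -> {action: successor state}.
GRAPH = {
    "A": {"right": "B"},
    "B": {"left": "A", "right": "C"},
    "C": {"left": "B", "right": "GOAL"},
    "GOAL": {},
}
GOAL = "GOAL"

def q_learn(reward_fn, q_init, start="A", gamma=1.0, alpha=1.0, max_steps=10_000):
    """Run greedy 1-step Q-learning from `start` until the goal is reached.

    reward_fn(s, a, s2) defines the reward structure; q_init initializes
    every Q-value.  Returns the number of action executions needed.
    """
    q = {(s, a): q_init for s in GRAPH for a in GRAPH[s]}
    s, steps = start, 0
    while s != GOAL and steps < max_steps:
        # Greedy action selection, ties broken randomly.
        best = max(q[(s, a)] for a in GRAPH[s])
        a = random.choice([a for a in GRAPH[s] if q[(s, a)] == best])
        s2 = GRAPH[s][a]  # deterministic transition
        v2 = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in GRAPH[s2])
        # Standard 1-step Q-learning update; alpha = 1 suffices here
        # because the transitions are deterministic.
        q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (reward_fn(s, a, s2) + gamma * v2)
        s, steps = s2, steps + 1
    return steps

# Action-penalty representation: every action costs -1, so zero-initialized
# Q-values are optimistic and greedy action selection explores systematically.
action_penalty = lambda s, a, s2: -1.0
# Goal-reward representation: reward only on reaching the goal, so all
# Q-values stay at zero until the goal is first found (a random walk).
goal_reward = lambda s, a, s2: 1.0 if s2 == GOAL else 0.0

print("action-penalty, zero init:", q_learn(action_penalty, q_init=0.0))
print("goal-reward,    zero init:", q_learn(goal_reward, q_init=0.0))
```

In the paper's terms, either changing the reward structure as above or initializing the values differently (for example, with an admissible heuristic) is what makes the exploration problem tractable.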

Keywords

action models; admissible and consistent heuristics; action-penalty representation; complexity; goal-directed exploration; goal-reward representation; on-line reinforcement learning; prior knowledge; reward structure; Q-hat-learning; Q-learning

References

  1. Barto, A.G., S.J. Bradtke, and S.P. Singh. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 73(1):81–138.
  2. Barto, A.G., R.S. Sutton, and C.J. Watkins. (1989). Learning and sequential decision making. Technical Report 89-95, Department of Computer Science, University of Massachusetts at Amherst.
  3. Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton (New Jersey).
  4. Boddy, M. and T. Dean. (1989). Solving time-dependent planning problems. In Proceedings of the IJCAI, pages 979–984.
  5. Goodwin, R. (1994). Reasoning about when to start acting. In Proceedings of the International Conference on Artificial Intelligence Planning Systems, pages 86–91.
  6. Heger, M. (1996). The loss from imperfect value functions in expectation-based and minimax-based tasks. Machine Learning, pages 197–225.
  7. Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 105–111.
  8. Ishida, T. (1992). Moving target search with intelligence. In Proceedings of the AAAI, pages 525–532.
  9. Ishida, T. and R.E. Korf. (1991). Moving target search. In Proceedings of the IJCAI, pages 204–210.
  10. Kaelbling, L.P. (1990). Learning in Embedded Systems. MIT Press, Cambridge (Massachusetts).
  11. Koenig, S. (1991). Optimal probabilistic and decision-theoretic planning using Markovian decision theory. Master’s thesis, Computer Science Department, University of California at Berkeley. (Available as Technical Report UCB/CSD 92/685).
  12. Koenig, S. and R.G. Simmons. (1992). Complexity analysis of real-time reinforcement learning applied to finding shortest paths in deterministic domains. Technical Report CMU-CS-93-106, School of Computer Science, Carnegie Mellon University.
  13. Koenig, S. and R.G. Simmons. (1993). Complexity analysis of real-time reinforcement learning. In Proceedings of the AAAI, pages 99–105.
  14. Koenig, S. and R.G. Simmons. (1994). How to make reactive planners risk-sensitive. In Proceedings of the International Conference on Artificial Intelligence Planning Systems, pages 293–298.
  15. Koenig, S. and R.G. Simmons. (1995a). The effect of representation and knowledge on goal-directed exploration with reinforcement learning algorithms: The proofs. Technical Report CMU-CS-95-177, School of Computer Science, Carnegie Mellon University.
  16. Koenig, S. and R.G. Simmons. (1995b). Real-time search in non-deterministic domains. In Proceedings of the IJCAI, pages 1660–1667.
  17. Korf, R.E. (1990). Real-time heuristic search. Artificial Intelligence, 42(2–3):189–211.
  18. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8:293–321.
  19. Matarić, M. (1994). Interaction and Intelligent Behavior. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
  20. Moore, A.W. and C.G. Atkeson. (1993a). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In Proceedings of the NIPS.
  21. Moore, A.W. and C.G. Atkeson. (1993b). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130.
  22. Nilsson, N.J. (1971). Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, New York (New York).
  23. Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Menlo Park (California).
  24. Peng, J. and R.J. Williams. (1992). Efficient learning and planning within the DYNA framework. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 281–290.
  25. Russell, S. and E. Wefald. (1991). Do the Right Thing: Studies in Limited Rationality. MIT Press, Cambridge (Massachusetts).
  26. Singh, S.P. (1992). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the AAAI, pages 202–207.
  27. Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the International Conference on Machine Learning, pages 216–224.
  28. Sutton, R.S. (1991). DYNA, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163.
  29. Thrun, S.B. (1992a). Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University.
  30. Thrun, S.B. (1992b). The role of exploration in learning control with neural networks. In D.A. White and D.A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, pages 527–559. Van Nostrand Reinhold, New York (New York).
  31. Watkins, C.J. and P. Dayan. (1992). Q-learning. Machine Learning, 8(3–4):279–292.
  32. Whitehead, S.D. (1991a). A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the AAAI, pages 607–613.
  33. Whitehead, S.D. (1991b). A study of cooperative mechanisms for faster reinforcement learning. Technical Report 365, Computer Science Department, University of Rochester.
  34. Whitehead, S.D. (1992). Reinforcement Learning for the Adaptive Control of Perception and Action. PhD thesis, Computer Science Department, University of Rochester.
  35. Yee, R. (1992). Abstraction in control learning. Technical Report 92-16, Department of Computer Science, University of Massachusetts at Amherst.
  36. Zilberstein, S. (1993). Operational Rationality through Compilation of Anytime Algorithms. PhD thesis, Computer Science Department, University of California at Berkeley.

Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Sven Koenig (1)
  • Reid G. Simmons (1)
  1. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
