Learning Options in Reinforcement Learning

  • Martin Stolle
  • Doina Precup
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2371)

Abstract

Temporally extended actions (e.g., macro actions) have proven very useful for speeding up learning, ensuring robustness and building prior knowledge into AI systems. The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such actions into reinforcement learning systems, but leaves open the issue of how good options might be identified. In this paper, we empirically explore a simple approach to creating options. The underlying assumption is that the agent will be asked to perform different goal-achievement tasks in an environment that is othertherwise the same over time. Our approach is based on the intuition that states that are frequently visited on system trajectories, could prove to be useful subgoals (e.g., McGovern & Barto, 2001; Iba, 1989).

We propose a greedy algorithm for identifying subgoals based on state visitation counts. We present empirical studies of this approach in two gridworld navigation tasks. One of the environments we explored contains bottleneck states, and the algorithm indeed finds these states, as expected. The second environment is an empty gridworld with no obstacles. Although the environment does not contain any obvious subgoals, our approach still finds useful options, which essentially allow the agent to explore the environment more quickly.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. [Bradtke & Duff, 1995]Bradtke and Duff]
    [1995]Bradtke:SMDPQ Bradtke, S. J.,& Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov Decision Problems. Advances in Neural Information Processing Systems 7 (pp. 393–400). MIT Press.Google Scholar
  2. [Dietterich, 1998]Dietterich]
    [1998]Dietterich:MAXQ Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann.Google Scholar
  3. [Fikesetal., 1972] Fikes et al.]
    [1972]Fikes:RobotPlan Fikes, R., P.E. Hart, & Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251–288.CrossRefGoogle Scholar
  4. [Iba, 1989] Iba]
    [1989]Iba:Macro Iba, G. A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285–317.Google Scholar
  5. [Korf, 1985] Korf]
    [1985]Korf:Macro Korf, R. E. (1985). Learning to solve problems by searching for macro-operators. Pitman Publishing Ltd.Google Scholar
  6. [Laird et al., 1986] Laird et al.]
    [1986]Laird:ChunkSOAR Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in SOAR: The anatomy of a general learning mechanism. Machine Learning, 1, 11–46.Google Scholar
  7. [Mahadevan et al., 1997] Mahadevan et al.]
    [1997]Mahadevan:SMDP Mahadevan, S., Mar-challek, N., Das, T. K., & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 202–210). Morgan Kaufmann.Google Scholar
  8. [McGovern et al., 1997] McGovern et al.]
    [1997]McGovern:MacroRL McGovern, A., Sutton, R.S., & Fagg, A. H. (1997). Roles of macro-actions in accelerating reinforcement learning. Grace Hopper Celebration of Women in Computing (pp. 13–17).Google Scholar
  9. [McGovern, 2002] McGovern]
    [2002]McGovern:Thesis McGovern, E. A. (2002). Autonomous discovery of temporal abstractions from interaction with an environment. Doctoral dissertation, University of Massachusetts, Amherst.Google Scholar
  10. [McGovern & Barto, 2001] McGovern and Barto]
    [2001]McGovern:ICML McGovern, E. A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 361–368). Morgan Kaufman.Google Scholar
  11. [Minton, 1988] Minton]
    [1988]Minton:BookMinton, S. (1988). Learning search control knowledge. An explanation-based approach. Kluwer Academic Publishers.Google Scholar
  12. [Newell & Simon, 1972] Newell and Simon]
    [1972]Newell:Simon Newell, A., & Simon, H. A. (1972). Human problem solving. Prentice-Hall.Google Scholar
  13. [Parr, 1998] Parr]
    [1998]Parr:Thesis Parr, R. (1998). Hierarchical control and learning for Markov Decision Processes. Doctoral dissertation, Computer Science Division, University of California, Berkeley, USA.Google Scholar
  14. [Parr & Russell, 1998] Parr and Russell]
    [1998]Parr:HAMs Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10. MIT Press.Google Scholar
  15. [Precup, 2000] Precup]
    [2000]Precup:Thesis Precup, D. (2000)._Temporal abstraction in reinforcement learning. Doctoral dissertation, Department of Computer Science, University of Massachusetts, Amherst, USA.Google Scholar
  16. [Puterman, 1994] Puterman]
    [1994]Puterman:Book Puterman, M. L. (1994). Markov Decision Processes: Discrete stochastic dynamic programming. Wiley.Google Scholar
  17. [Sacerdoti, 1974] Sacerdoti]
    [1974]Sacerdoti:PlanArt Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5, 115–135.Google Scholar
  18. [Singh, 1992] Singh]
    [1992]Singh:HDynaAAAI Singh, S. P. (1992). Reinforcement learning with a hierarchy of abstract models. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 202–207). MIT/AAAI Press.Google Scholar
  19. [Sutton &Barto, 1998] SuttonandBarto]
    [1998]Sutton:BookSutton,R.S.,&Barto, A.G. (1998). Reinforcement learning: An introduction. MIT Press.Google Scholar
  20. [Sutton et al., 1999] Sutton et al.]
    [1999]Precup:Options Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.Google Scholar
  21. [Watkins, 1989] Watkins]
    [1989]Watkins:Qlearn Watkins, C. J. C. H. (1989). Learning with delayed rewards. Doctoral dissertation, Psychology Department, Cambridge University, Cambridge, UK.Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Martin Stolle
    • 1
  • Doina Precup
    • 1
  1. 1.School of Computer ScienceMcGill UniversityMontreal

Personalised recommendations