
Learning Options in Reinforcement Learning

  • Conference paper
Abstraction, Reformulation, and Approximation (SARA 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2371)

Abstract

Temporally extended actions (e.g., macro actions) have proven very useful for speeding up learning, ensuring robustness and building prior knowledge into AI systems. The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such actions into reinforcement learning systems, but leaves open the issue of how good options might be identified. In this paper, we empirically explore a simple approach to creating options. The underlying assumption is that the agent will be asked to perform different goal-achievement tasks in an environment that is otherwise the same over time. Our approach is based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals (e.g., McGovern & Barto, 2001; Iba, 1989).
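
As background for the approach, recall that an option in the sense of Sutton, Precup & Singh (1999) is a triple: an initiation set, an internal policy, and a termination condition. The following is a minimal Python sketch of that triple; the class and attribute names are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable


@dataclass
class Option:
    """An option as a triple (I, pi, beta), following Sutton, Precup & Singh (1999).

    The names and representation here are illustrative assumptions.
    """
    initiation_set: Set[State]              # I: states in which the option may be invoked
    policy: Callable[[State], Action]       # pi: action selection while the option executes
    termination: Callable[[State], float]   # beta(s): probability of terminating in state s

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set
```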

We propose a greedy algorithm for identifying subgoals based on state visitation counts. We present empirical studies of this approach in two gridworld navigation tasks. One of the environments we explored contains bottleneck states, and the algorithm indeed finds these states, as expected. The second environment is an empty gridworld with no obstacles. Although the environment does not contain any obvious subgoals, our approach still finds useful options, which essentially allow the agent to explore the environment more quickly.
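
The sketch below illustrates the kind of visitation-count heuristic described above: count how often each state appears on recorded trajectories, then greedily pick the most frequently visited states as candidate subgoals. The trajectory format, the exclusion set, and the number of subgoals are assumptions made for this example; it is not the authors' exact procedure.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence, Set

State = Hashable


def greedy_subgoals(trajectories: Iterable[Sequence[State]],
                    exclude: Set[State],
                    n_subgoals: int = 3) -> List[State]:
    """Greedily select the most frequently visited states as candidate subgoals.

    `trajectories` holds state sequences gathered while the agent solves its
    tasks; `exclude` would typically contain start and goal states so that
    trivially frequent states are not proposed. Illustrative sketch only.
    """
    counts: Counter = Counter()
    for trajectory in trajectories:
        counts.update(trajectory)          # accumulate per-state visit counts
    for state in exclude:
        counts.pop(state, None)            # never propose excluded states
    # Greedy step: take the n most-visited remaining states.
    return [state for state, _ in counts.most_common(n_subgoals)]
```

An option whose internal policy drives the agent toward each selected subgoal could then be learned with a standard method such as Q-learning on a subgoal-reaching pseudo-reward; this learning step is likewise only sketched here, not necessarily the paper's exact construction.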

References

  1. Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov Decision Problems. Advances in Neural Information Processing Systems 7 (pp. 393–400). MIT Press.

  2. Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann.

  3. Fikes, R., Hart, P. E., & Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251–288.

  4. Iba, G. A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285–317.

  5. Korf, R. E. (1985). Learning to solve problems by searching for macro-operators. Pitman Publishing Ltd.

  6. Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in SOAR: The anatomy of a general learning mechanism. Machine Learning, 1, 11–46.

  7. Mahadevan, S., Marchalleck, N., Das, T. K., & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 202–210). Morgan Kaufmann.

  8. McGovern, A., Sutton, R. S., & Fagg, A. H. (1997). Roles of macro-actions in accelerating reinforcement learning. Grace Hopper Celebration of Women in Computing (pp. 13–17).

  9. McGovern, E. A. (2002). Autonomous discovery of temporal abstractions from interaction with an environment. Doctoral dissertation, University of Massachusetts, Amherst.

  10. McGovern, E. A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 361–368). Morgan Kaufmann.

  11. Minton, S. (1988). Learning search control knowledge: An explanation-based approach. Kluwer Academic Publishers.

  12. Newell, A., & Simon, H. A. (1972). Human problem solving. Prentice-Hall.

  13. Parr, R. (1998). Hierarchical control and learning for Markov Decision Processes. Doctoral dissertation, Computer Science Division, University of California, Berkeley, USA.

  14. Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10. MIT Press.

  15. Precup, D. (2000). Temporal abstraction in reinforcement learning. Doctoral dissertation, Department of Computer Science, University of Massachusetts, Amherst, USA.

  16. Puterman, M. L. (1994). Markov Decision Processes: Discrete stochastic dynamic programming. Wiley.

  17. Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5, 115–135.

  18. Singh, S. P. (1992). Reinforcement learning with a hierarchy of abstract models. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 202–207). MIT/AAAI Press.

  19. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

  20. Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.

  21. Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral dissertation, Psychology Department, Cambridge University, Cambridge, UK.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stolle, M., Precup, D. (2002). Learning Options in Reinforcement Learning. In: Koenig, S., Holte, R.C. (eds) Abstraction, Reformulation, and Approximation. SARA 2002. Lecture Notes in Computer Science (LNAI), vol 2371. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45622-8_16

  • DOI: https://doi.org/10.1007/3-540-45622-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43941-7

  • Online ISBN: 978-3-540-45622-3

