Joint European Conference on Machine Learning and Knowledge Discovery in Databases

ECML PKDD 2012: Machine Learning and Knowledge Discovery in Databases, pp 211–226

Policy Iteration Based on a Learned Transition Model

  • Vivek Ramavajjala
  • Charles Elkan

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7524)

Abstract

This paper investigates a reinforcement learning method that combines learning a model of the environment with least-squares policy iteration (LSPI). The LSPI algorithm learns a linear approximation of the optimal state-action value function; the idea studied here is to let this value function depend on a learned estimate of the expected next state instead of directly on the current state and action. This approach makes it easier to define useful basis functions, and hence to learn a useful linear approximation of the value function. Experiments show that the new algorithm, called NSPI for next-state policy iteration, performs well on two standard benchmarks, the well-known mountain car and inverted pendulum swing-up tasks. More importantly, the NSPI algorithm performs well, and better than a specialized recent method, on a resource management task known as the day-ahead wind commitment problem. This latter task has action and state spaces that are high-dimensional and continuous.
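The core idea in the abstract can be illustrated with a short sketch (all names and details below are illustrative assumptions, not code from the paper): standard LSPI fits weights w so that Q(s, a) ≈ w · φ(s, a), whereas the NSPI variant evaluates a state-only basis at a learned prediction of the expected next state, Q(s, a) ≈ w · φ(f(s, a)). A minimal LSTD-Q policy-evaluation step under that feature construction might look like:

```python
import numpy as np

def lstdq(samples, phi, next_action, gamma=0.99, ridge=1e-6):
    """One LSTD-Q policy-evaluation step over (s, a, r, s') samples.

    phi(s, a) is the feature map; next_action(s') returns the current
    policy's action in the next state.
    """
    k = len(phi(*samples[0][:2]))
    A = ridge * np.eye(k)          # small ridge term keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, next_action(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)   # weights of the linear Q approximation

def make_nspi_phi(transition_model, state_basis):
    """NSPI-style features: apply a state-only basis to the predicted
    next state, where transition_model(s, a) estimates E[s' | s, a]."""
    def phi(s, a):
        return state_basis(transition_model(s, a))
    return phi
```

Because φ depends on (s, a) only through the predicted next state, basis functions need only be defined over states rather than state-action pairs, which is the ease-of-design benefit the abstract highlights.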

Keywords

  • Action Space
  • Wind Farm
  • Transition Model
  • Markov Decision Process
  • Inverted Pendulum

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Author information

Authors and Affiliations

  1. Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, 92092, USA

    Vivek Ramavajjala & Charles Elkan


Editor information

Editors and Affiliations

  1. Intelligent Systems Laboratory, University of Bristol, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK

    Peter A. Flach

  2. Intelligent Systems Laboratory, University of Bristol, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK

Tijl De Bie & Nello Cristianini

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ramavajjala, V., Elkan, C. (2012). Policy Iteration Based on a Learned Transition Model. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science(), vol 7524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33486-3_14

  • DOI: https://doi.org/10.1007/978-3-642-33486-3_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33485-6

  • Online ISBN: 978-3-642-33486-3

  • eBook Packages: Computer Science (R0)

