Joint European Conference on Machine Learning and Knowledge Discovery in Databases

ECML PKDD 2012: Machine Learning and Knowledge Discovery in Databases pp 211-226

Policy Iteration Based on a Learned Transition Model

  • Vivek Ramavajjala
  • Charles Elkan
Conference paper

DOI: 10.1007/978-3-642-33486-3_14

Volume 7524 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Ramavajjala V., Elkan C. (2012) Policy Iteration Based on a Learned Transition Model. In: Flach P.A., De Bie T., Cristianini N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer, Berlin, Heidelberg

Abstract

This paper investigates a reinforcement learning method that combines learning a model of the environment with least-squares policy iteration (LSPI). The LSPI algorithm learns a linear approximation of the optimal state-action value function; the idea studied here is to let this value function depend on a learned estimate of the expected next state instead of directly on the current state and action. This approach makes it easier to define useful basis functions, and hence to learn a useful linear approximation of the value function. Experiments show that the new algorithm, called NSPI for next-state policy iteration, performs well on two standard benchmarks, the well-known mountain car and inverted pendulum swing-up tasks. More importantly, the NSPI algorithm performs well, and better than a specialized recent method, on a resource management task known as the day-ahead wind commitment problem. This latter task has action and state spaces that are high-dimensional and continuous.
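The core idea in the abstract, evaluating the value function on a learned estimate of the next state rather than on the current state and action, can be illustrated with a small sketch. The abstract does not give the transition model, the basis functions, or the benchmark details, so everything below is an assumption chosen for illustration: a hypothetical one-dimensional task with three discrete actions, a one-parameter linear transition model, a quadratic basis, and a plain LSTD-Q policy-evaluation step. It is not the authors' NSPI implementation.

```python
import random

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set


def make_samples(n=500, seed=0):
    """Toy data (assumed dynamics): s' = s + 0.1 * a, reward = -(s')**2."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.choice(ACTIONS)
        s2 = s + 0.1 * a
        samples.append((s, a, -s2 * s2, s2))
    return samples


def fit_transition_model(samples):
    """Least-squares fit of the one-parameter model s' ~ s + c * a."""
    num = sum(a * (s2 - s) for s, a, _, s2 in samples)
    den = sum(a * a for _, a, _, _ in samples)
    return num / den


def features(x):
    """Quadratic basis evaluated on a (predicted) next state x."""
    return [1.0, x, x * x]


def q_value(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))


def greedy_action(w, c, s):
    """Pick the action whose predicted next state has the highest value."""
    return max(ACTIONS, key=lambda a: q_value(w, s + c * a))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def next_state_policy_iteration(samples, gamma=0.5, iterations=5):
    """LSTD-Q policy iteration with features of the predicted next state."""
    c = fit_transition_model(samples)
    w = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        A = [[0.0] * 3 for _ in range(3)]
        b = [0.0] * 3
        for s, a, r, s2 in samples:
            phi = features(s + c * a)        # features of predicted s'
            a2 = greedy_action(w, c, s2)     # current policy's action at s'
            phi2 = features(s2 + c * a2)     # features of predicted s''
            for i in range(3):
                b[i] += phi[i] * r
                for j in range(3):
                    A[i][j] += phi[i] * (phi[j] - gamma * phi2[j])
        w = solve(A, b)
    return c, w


c, w = next_state_policy_iteration(make_samples())
print("learned step size:", c)
print("greedy action at s = 0.5:", greedy_action(w, c, 0.5))
```

In this toy setting the reward depends only on the next state, so a value function over the predicted next state needs only a simple basis; a basis defined over state-action pairs would have to encode the effect of each action itself.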


Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Vivek Ramavajjala (1)
  • Charles Elkan (1)
  1. Department of Computer Science & Engineering, University of California, San Diego, USA