Policy Iteration Based on a Learned Transition Model

  • Vivek Ramavajjala
  • Charles Elkan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7524)


This paper investigates a reinforcement learning method that combines learning a model of the environment with least-squares policy iteration (LSPI). The LSPI algorithm learns a linear approximation of the optimal state-action value function; the idea studied here is to let this value function depend on a learned estimate of the expected next state instead of directly on the current state and action. This approach makes it easier to define useful basis functions, and hence to learn a useful linear approximation of the value function. Experiments show that the new algorithm, called NSPI for next-state policy iteration, performs well on two standard benchmarks, the well-known mountain car and inverted pendulum swing-up tasks. More importantly, the NSPI algorithm performs well, and better than a specialized recent method, on a resource management task known as the day-ahead wind commitment problem. This latter task has action and state spaces that are high-dimensional and continuous.
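The core idea of the abstract, a value function that is linear in features of a predicted next state, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional toy dynamics, the linear transition model, the polynomial basis, the hand-set weight vector (which LSPI would learn in the full method), and all function names.

```python
# Illustrative sketch only: Q(s, a) = w . phi(f_hat(s, a)), where f_hat is a
# learned estimate of the expected next state. All names are hypothetical.

def fit_linear_model(transitions):
    """Least-squares fit of s' ~ alpha*s + beta*a from (s, a, s') samples."""
    sss = sum(s * s for s, a, n in transitions)
    ssa = sum(s * a for s, a, n in transitions)
    saa = sum(a * a for s, a, n in transitions)
    ssn = sum(s * n for s, a, n in transitions)
    san = sum(a * n for s, a, n in transitions)
    det = sss * saa - ssa * ssa          # determinant of the 2x2 normal equations
    alpha = (ssn * saa - ssa * san) / det
    beta = (sss * san - ssa * ssn) / det
    return alpha, beta

def phi(next_state):
    """Basis functions defined on the predicted next state (assumed polynomial)."""
    return [1.0, next_state, next_state * next_state]

def q_value(s, a, model, w):
    """Linear value estimate that depends on (s, a) only through the predicted next state."""
    alpha, beta = model
    predicted = alpha * s + beta * a
    return sum(wi * fi for wi, fi in zip(w, phi(predicted)))

def greedy_action(s, actions, model, w):
    """Pick the action whose predicted next state has the highest value."""
    return max(actions, key=lambda a: q_value(s, a, model, w))

# Toy data generated by s' = s + 0.5*a (an assumed environment).
samples = [(0.0, 1.0, 0.5), (1.0, 0.0, 1.0), (1.0, 1.0, 1.5),
           (2.0, -1.0, 1.5), (-1.0, 1.0, -0.5)]
model = fit_linear_model(samples)

# Hand-set weights standing in for what LSPI would learn: prefer next states near 0.
w = [0.0, 0.0, -1.0]
actions = [-1.0, 0.0, 1.0]
best = greedy_action(1.0, actions, model, w)
```

Because the basis functions take a single argument, the predicted next state, there is no need to design features over the joint state-action space, which is the convenience the abstract refers to.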





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Vivek Ramavajjala (1)
  • Charles Elkan (1)
  1. Department of Computer Science & Engineering, University of California, San Diego, USA