
Reinforcement Learning with a Bilinear Q Function

  • Conference paper
Recent Advances in Reinforcement Learning (EWRL 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7188)


Abstract

Many reinforcement learning methods are based on a function Q(s,a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as Q(s,a) = sᵀWa, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression within the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.
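The key property of the bilinear form is that Q(s,a) = sᵀWa is linear in a for a fixed state s (so greedy action selection is a linear program) and linear in the entries of W for a fixed pair (s,a) (so fitting W is ordinary linear regression). The sketch below illustrates how these two steps could be combined in fitted Q iteration; it is not the paper's own implementation. It assumes a box-constrained action set (so the action-selection linear program reduces to simple bounds, handled here with scipy.optimize.linprog), transition data already collected as arrays, and plain least squares for the regression; all names and the array layout are illustrative.

    # Minimal sketch: fitted Q iteration with a bilinear Q function, Q(s, a) = s^T W a.
    # Assumptions beyond the abstract: box-constrained actions, batch data as arrays,
    # ordinary least squares for the regression step.
    import numpy as np
    from scipy.optimize import linprog

    def greedy_action(W, s, a_lo, a_hi):
        """Choose argmax_a s^T W a over the box a_lo <= a <= a_hi via a linear program."""
        c = -(W.T @ s)  # maximizing s^T W a equals minimizing -(W^T s)^T a
        res = linprog(c, bounds=list(zip(a_lo, a_hi)), method="highs")
        return res.x

    def fitted_q_iteration(S, A, R, S_next, a_lo, a_hi, gamma=0.95, n_iters=50):
        """Learn W from transitions (s, a, r, s') by repeated linear regression.

        S: (n, d_s) states, A: (n, d_a) actions, R: (n,) rewards, S_next: (n, d_s).
        """
        n, d_s = S.shape
        d_a = A.shape[1]
        # Feature map: Q(s, a) = vec(W)^T (s kron a), so each regression input is kron(s_i, a_i).
        X = np.stack([np.kron(S[i], A[i]) for i in range(n)])
        W = np.zeros((d_s, d_a))
        for _ in range(n_iters):
            # Regression targets: r + gamma * max_a' Q(s', a'), with the greedy a' from the LP.
            y = np.array([
                R[i] + gamma * S_next[i] @ W @ greedy_action(W, S_next[i], a_lo, a_hi)
                for i in range(n)
            ])
            w_vec, *_ = np.linalg.lstsq(X, y, rcond=None)  # standard linear regression
            W = w_vec.reshape(d_s, d_a)                    # row-major vec matches np.kron ordering
        return W

In the paper the action set is handled by standard linear programming in general; restricting it to a box here only simplifies the constraint specification passed to the solver.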





Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Elkan, C. (2012). Reinforcement Learning with a Bilinear Q Function. In: Sanner, S., Hutter, M. (eds) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science (LNAI), vol. 7188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29946-9_11


  • DOI: https://doi.org/10.1007/978-3-642-29946-9_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29945-2

  • Online ISBN: 978-3-642-29946-9

  • eBook Packages: Computer Science (R0)
