Abstract
Many reinforcement learning methods are based on a function Q(s,a) whose value is the expected discounted total reward after performing action a in state s. This paper explores the implications of representing the Q function bilinearly as Q(s,a) = sᵀWa, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression inside the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.
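To make the two computations named in the abstract concrete, the sketch below instantiates both in Python: for a fixed state s, Q(s,a) = sᵀWa is linear in a, so the greedy action over a set of linearly constrained actions is a linear program; and one sweep of fitted Q iteration regresses the targets r + γ·max_a' Q(s',a') onto the bilinear features vec(saᵀ) by ordinary least squares. This is a minimal sketch under assumed details: the dimensions, the box bounds on actions, the discount factor, and the random transitions are illustrative, not taken from the paper.

```python
# Minimal sketch of a bilinear Q function with LP action selection and
# least-squares fitted Q iteration. Dimensions, action bounds, gamma, and
# the random data are assumptions for illustration only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d_s, d_a = 4, 3      # assumed state and action dimensions
gamma = 0.9          # assumed discount factor
W = rng.normal(size=(d_s, d_a))

def greedy_action(W, s, lo=-1.0, hi=1.0):
    # For fixed s, Q(s,a) = (s^T W) a is linear in a, so maximizing it over
    # a box of feasible actions is a linear program (linprog minimizes,
    # hence the negated objective).
    res = linprog(c=-(s @ W), bounds=[(lo, hi)] * W.shape[1])
    return res.x

def fitted_q_step(S, A, R, S_next, W, gamma):
    # One fitted Q iteration sweep: targets r + gamma * max_a' Q(s', a'),
    # features vec(s a^T), so vec(W) is found by ordinary least squares.
    y = np.array([r + gamma * (s2 @ W) @ greedy_action(W, s2)
                  for r, s2 in zip(R, S_next)])
    X = np.array([np.outer(s, a).ravel() for s, a in zip(S, A)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w.reshape(W.shape)

# Illustrative batch of random transitions, iterated a few times.
n = 200
S, A = rng.normal(size=(n, d_s)), rng.uniform(-1, 1, size=(n, d_a))
R, S2 = rng.normal(size=n), rng.normal(size=(n, d_s))
for _ in range(5):
    W = fitted_q_step(S, A, R, S2, W, gamma)
print(greedy_action(W, S[0]))
```

The LP formulation is also where further linear constraints on actions (for example, capacity limits in an inventory problem) would enter, which is why linear programming rather than enumeration matters when the action space is continuous and high-dimensional.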
Cite this paper
Elkan, C. (2012). Reinforcement Learning with a Bilinear Q Function. In: Sanner, S., Hutter, M. (eds.) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science, vol. 7188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29946-9_11