Abstract
Many reinforcement learning methods are based on a function Q(s,a) whose value is the expected discounted total reward after performing action a in state s. This paper explores the implications of representing the Q function bilinearly as Q(s,a) = sᵀWa, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression inside the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.
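To make the two computations named in the abstract concrete, the sketch below instantiates both in Python: for a fixed state s, Q(s,a) = sᵀWa is linear in a, so the greedy action over a set of linearly constrained actions is a linear program; and one sweep of fitted Q iteration regresses the targets r + γ·max_a' Q(s',a') onto the bilinear features vec(saᵀ) by ordinary least squares. This is a minimal sketch under assumed details: the dimensions, the box bounds on actions, the discount factor, and the random transitions are illustrative, not taken from the paper.

```python
# Minimal sketch of a bilinear Q function with LP action selection and
# least-squares fitted Q iteration. Dimensions, action bounds, gamma, and
# the random data are assumptions for illustration only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d_s, d_a = 4, 3      # assumed state and action dimensions
gamma = 0.9          # assumed discount factor
W = rng.normal(size=(d_s, d_a))

def greedy_action(W, s, lo=-1.0, hi=1.0):
    # For fixed s, Q(s,a) = (s^T W) a is linear in a, so maximizing it over
    # a box of feasible actions is a linear program (linprog minimizes,
    # hence the negated objective).
    res = linprog(c=-(s @ W), bounds=[(lo, hi)] * W.shape[1])
    return res.x

def fitted_q_step(S, A, R, S_next, W, gamma):
    # One fitted Q iteration sweep: targets r + gamma * max_a' Q(s', a'),
    # features vec(s a^T), so vec(W) is found by ordinary least squares.
    y = np.array([r + gamma * (s2 @ W) @ greedy_action(W, s2)
                  for r, s2 in zip(R, S_next)])
    X = np.array([np.outer(s, a).ravel() for s, a in zip(S, A)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w.reshape(W.shape)

# Illustrative batch of random transitions, iterated a few times.
n = 200
S, A = rng.normal(size=(n, d_s)), rng.uniform(-1, 1, size=(n, d_a))
R, S2 = rng.normal(size=n), rng.normal(size=(n, d_s))
for _ in range(5):
    W = fitted_q_step(S, A, R, S2, W, gamma)
print(greedy_action(W, S[0]))
```

The LP formulation is also where further linear constraints on actions (for example, capacity limits in an inventory problem) would enter, which is why linear programming rather than enumeration matters when the action space is continuous and high-dimensional.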
Cite this paper
Elkan, C. (2012). Reinforcement Learning with a Bilinear Q Function. In: Sanner, S., Hutter, M. (eds.) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science, vol. 7188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29946-9_11