Abstract
In this study, we extend the framework of semiparametric statistical inference, recently introduced to reinforcement learning [1], to online learning procedures for policy evaluation. This generalization enables us to investigate the statistical properties of value function estimators obtained by both batch and online procedures in a unified way, in terms of estimating functions. Furthermore, we propose a novel online learning algorithm with optimal estimating functions, which achieves the minimum estimation error. Our theoretical developments are confirmed on a simple chain walk problem.
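As context for the experimental setting, the abstract mentions online policy evaluation on a simple chain walk. The following is a minimal sketch of that setup, assuming a standard 5-state random-walk chain with a terminal reward at the right end; it uses plain online TD(0), a baseline estimating-function-style update, not the paper's optimal estimating functions.

```python
import numpy as np

def chain_walk_td0(n_states=5, gamma=0.9, alpha=0.1, episodes=500, seed=0):
    """Online TD(0) policy evaluation on a hypothetical chain walk.

    The agent starts in the middle, moves left/right with equal
    probability, and receives reward 1 only when it exits to the right.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)  # tabular value estimates
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next < 0:                 # fell off the left end
                r, done = 0.0, True
            elif s_next >= n_states:       # exited to the right: reward
                r, done = 1.0, True
            else:
                r, done = 0.0, False
            # Online update: move V[s] toward the bootstrapped TD target.
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            if done:
                break
            s = s_next
    return V

V = chain_walk_td0()
print(V)  # values grow toward the rewarding right end of the chain
```

The per-step update `V[s] += alpha * (target - V[s])` is the simplest instance of an online estimating-function procedure; the paper's contribution concerns choosing such functions optimally to minimize estimation error.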
References
Ueno, T., Kawanabe, M., Mori, T., Maeda, S., Ishii, S.: A semiparametric statistical approach to model-free policy evaluation. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1072–1079 (2008)
Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Bradtke, S., Barto, A.: Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1), 33–57 (1996)
Boyan, J.: Technical update: Least-squares temporal difference learning. Machine Learning 49(2), 233–246 (2002)
Godambe, V. (ed.): Estimating Functions. Oxford Science, Oxford (1991)
Bickel, P., Ritov, Y., Klaassen, C., Wellner, J.: Efficient and Adaptive Estimation for Semiparametric Models. Springer, Heidelberg (1998)
Amari, S., Kawanabe, M.: Information geometry of estimating functions in semi-parametric statistical models. Bernoulli 3(1), 29–54 (1997)
van der Vaart, A.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Bottou, L., LeCun, Y.: On-line learning for very large datasets. Applied Stochastic Models in Business and Industry 21(2), 137–151 (2005)
Bertsekas, D., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
Nedić, A., Bertsekas, D.: Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems 13(1), 79–110 (2003)
Mannor, S., Simester, D., Sun, P., Tsitsiklis, J.: Bias and variance in value function estimation. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York (2004)
Godambe, V.: The foundations of finite sample estimation in stochastic processes. Biometrika 72(2), 419–428 (1985)
Sørensen, M.: On asymptotics of estimating functions. Brazilian Journal of Probability and Statistics 13(2), 419–428 (1999)
Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
Mahadevan, S., Maggioni, M.: Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes. The Journal of Machine Learning Research 8, 2169–2231 (2007)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Ueno, T., Maeda, Si., Kawanabe, M., Ishii, S. (2009). Optimal Online Learning Procedures for Model-Free Policy Evaluation. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science(), vol 5782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04174-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04173-0
Online ISBN: 978-3-642-04174-7