Abstract
In this study, we extend the framework of semiparametric statistical inference, recently introduced to reinforcement learning [1], to online learning procedures for policy evaluation. This generalization enables us to investigate the statistical properties of value function estimators obtained by both batch and online procedures in a unified way, in terms of estimating functions. Furthermore, we propose a novel online learning algorithm with optimal estimating functions, which achieves the minimum estimation error. Our theoretical developments are confirmed on a simple chain walk problem.
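As context for the experimental setting, the abstract mentions online policy evaluation on a simple chain walk. The following is a minimal sketch of that setup, assuming a standard 5-state random-walk chain with a terminal reward at the right end; it uses plain online TD(0), a baseline estimating-function-style update, not the paper's optimal estimating functions.

```python
import numpy as np

def chain_walk_td0(n_states=5, gamma=0.9, alpha=0.1, episodes=500, seed=0):
    """Online TD(0) policy evaluation on a hypothetical chain walk.

    The agent starts in the middle, moves left/right with equal
    probability, and receives reward 1 only when it exits to the right.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)  # tabular value estimates
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next < 0:                 # fell off the left end
                r, done = 0.0, True
            elif s_next >= n_states:       # exited to the right: reward
                r, done = 1.0, True
            else:
                r, done = 0.0, False
            # Online update: move V[s] toward the bootstrapped TD target.
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            if done:
                break
            s = s_next
    return V

V = chain_walk_td0()
print(V)  # values grow toward the rewarding right end of the chain
```

The per-step update `V[s] += alpha * (target - V[s])` is the simplest instance of an online estimating-function procedure; the paper's contribution concerns choosing such functions optimally to minimize estimation error.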
References
Ueno, T., Kawanabe, M., Mori, T., Maeda, S., Ishii, S.: A semiparametric statistical approach to model-free policy evaluation. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1072–1079 (2008)
Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Bradtke, S., Barto, A.: Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1), 33–57 (1996)
Boyan, J.: Technical update: Least-squares temporal difference learning. Machine Learning 49(2), 233–246 (2002)
Godambe, V. (ed.): Estimating Functions. Oxford Science, Oxford (1991)
Bickel, P., Ritov, Y., Klaassen, C., Wellner, J.: Efficient and Adaptive Estimation for Semiparametric Models. Springer, Heidelberg (1998)
Amari, S., Kawanabe, M.: Information geometry of estimating functions in semi-parametric statistical models. Bernoulli 3(1), 29–54 (1997)
van der Vaart, A.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Bottou, L., LeCun, Y.: On-line learning for very large datasets. Applied Stochastic Models in Business and Industry 21(2), 137–151 (2005)
Bertsekas, D., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
Nedić, A., Bertsekas, D.: Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems 13(1), 79–110 (2003)
Mannor, S., Simester, D., Sun, P., Tsitsiklis, J.: Bias and variance in value function estimation. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York (2004)
Godambe, V.: The foundations of finite sample estimation in stochastic processes. Biometrika 72(2), 419–428 (1985)
Sørensen, M.: On asymptotics of estimating functions. Brazilian Journal of Probability and Statistics 13(2), 419–428 (1999)
Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
Mahadevan, S., Maggioni, M.: Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes. The Journal of Machine Learning Research 8, 2169–2231 (2007)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Ueno, T., Maeda, Si., Kawanabe, M., Ishii, S. (2009). Optimal Online Learning Procedures for Model-Free Policy Evaluation. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science(), vol 5782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04174-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04173-0
Online ISBN: 978-3-642-04174-7