Abstract
We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting, where the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions. We propose a complexity-regularization-based model selection algorithm, \(\ensuremath{\mbox{\textsc {BErMin}}}\), and prove that it enjoys an oracle-like property: the estimator’s error differs from that of an oracle, who selects the candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric rate as the number of samples increases. As an application, we consider the problem in which the true action-value function belongs to an unknown member of a nested sequence of function spaces. We show that, under some additional technical conditions, \(\ensuremath{\mbox{\textsc {BErMin}}}\) leads to a procedure whose rate of convergence matches, up to a constant factor, that of an oracle who knows which of the nested function spaces the true action-value function belongs to, i.e., the procedure achieves adaptivity.
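To make the selection rule concrete, below is a minimal sketch of the complexity-regularization idea: each candidate is scored by an estimate of its Bellman error plus a complexity penalty, and the candidate with the smallest penalized score is selected. All names, the penalty shape, and the numbers are illustrative assumptions, not the paper's actual construction; BErMin builds its Bellman-error estimates and penalties from data in a more involved way.

```python
import numpy as np

def select_candidate(bellman_error_estimates, penalties):
    """Return the index of the candidate minimizing
    (estimated Bellman error + complexity penalty)."""
    scores = np.asarray(bellman_error_estimates) + np.asarray(penalties)
    return int(np.argmin(scores))

# Hypothetical usage: three candidate action-value functions. The penalty
# grows with candidate index (a proxy for complexity) and shrinks with the
# sample size n, so the remainder term vanishes as n grows.
n = 10_000
errors = [0.42, 0.31, 0.30]                        # estimated Bellman errors
penalties = [np.log(k + 2) / n for k in range(3)]  # illustrative penalties
print("selected candidate:", select_candidate(errors, penalties))
```

Note the trade-off the penalized score encodes: a more complex candidate is chosen only when its estimated Bellman error is smaller by more than the extra penalty, which is what yields the oracle-like guarantee up to a constant factor and a vanishing remainder.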
Cite this article
Farahmand, A.-m., & Szepesvári, Cs. (2011). Model selection in reinforcement learning. Machine Learning, 85, 299–332. https://doi.org/10.1007/s10994-011-5254-7