Abstract
We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting, where the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions. We propose a complexity-regularization-based model selection algorithm, \(\ensuremath{\mbox{\textsc {BErMin}}}\), and prove that it enjoys an oracle-like property: the estimator’s error differs from that of an oracle, who selects the candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric rate as the number of samples increases. As an application, we consider the problem in which the true action-value function belongs to an unknown member of a nested sequence of function spaces. We show that, under some additional technical conditions, \(\ensuremath{\mbox{\textsc {BErMin}}}\) leads to a procedure whose rate of convergence matches, up to a constant factor, that of an oracle who knows which of the nested function spaces the true action-value function belongs to, i.e., the procedure achieves adaptivity.
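To make the selection rule concrete, below is a minimal sketch of the complexity-regularization idea: each candidate is scored by an estimate of its Bellman error plus a complexity penalty, and the candidate with the smallest penalized score is selected. All names, the penalty shape, and the numbers are illustrative assumptions, not the paper's actual construction; BErMin builds its Bellman-error estimates and penalties from data in a more involved way.

```python
import numpy as np

def select_candidate(bellman_error_estimates, penalties):
    """Return the index of the candidate minimizing
    (estimated Bellman error + complexity penalty)."""
    scores = np.asarray(bellman_error_estimates) + np.asarray(penalties)
    return int(np.argmin(scores))

# Hypothetical usage: three candidate action-value functions. The penalty
# grows with candidate index (a proxy for complexity) and shrinks with the
# sample size n, so the remainder term vanishes as n grows.
n = 10_000
errors = [0.42, 0.31, 0.30]                        # estimated Bellman errors
penalties = [np.log(k + 2) / n for k in range(3)]  # illustrative penalties
print("selected candidate:", select_candidate(errors, penalties))
```

Note the trade-off the penalized score encodes: a more complex candidate is chosen only when its estimated Bellman error is smaller by more than the extra penalty, which is what yields the oracle-like guarantee up to a constant factor and a vanishing remainder.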
Cite this article
Farahmand, A.-m., & Szepesvári, Cs. (2011). Model selection in reinforcement learning. Machine Learning, 85, 299–332. https://doi.org/10.1007/s10994-011-5254-7