Machine Learning, Volume 85, Issue 3, pp 299–332

Model selection in reinforcement learning

Abstract

We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting, where the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions. We propose a complexity regularization-based model selection algorithm, BErMin, and prove that it enjoys an oracle-like property: the estimator's error differs from that of an oracle, which selects the candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric rate as the number of samples increases. As an application, we consider the case in which the true action-value function belongs to an unknown member of a nested sequence of function spaces. We show that, under some additional technical conditions, BErMin leads to a procedure whose rate of convergence matches, up to a constant factor, that of an oracle that knows which of the nested function spaces the true action-value function belongs to; that is, the procedure achieves adaptivity.
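The selection principle described in the abstract can be read as: among the candidates, pick the one minimizing an estimated Bellman error plus a complexity penalty, so that the selected candidate's error is bounded by a constant times the best candidate's error plus a remainder that shrinks with the sample size. The Python sketch below is only an illustration of that generic "estimated error + penalty" recipe under strong simplifying assumptions; it is not the BErMin algorithm from the paper. In particular, the single-transition squared Bellman residual used here is a biased estimate of the Bellman error in stochastic environments, and the function names, penalty values, and finite action set are placeholders.

```python
import numpy as np

def select_candidate(candidates, penalties, batch, actions, gamma=0.99):
    """Complexity-regularized selection among candidate action-value functions.

    candidates : list of callables q(s, a) -> float (the candidate Q-functions)
    penalties  : one complexity penalty per candidate (e.g. proportional to log(k)/n)
    batch      : list of off-policy transitions (s, a, r, s_next)
    actions    : finite action set used for the max in the Bellman residual
    gamma      : discount factor
    """
    scores = []
    for q, pen in zip(candidates, penalties):
        # Naive empirical squared Bellman residual averaged over the batch.
        # This single-sample estimate is biased for stochastic transitions;
        # it only serves to illustrate "estimated error + penalty" selection.
        residuals = [
            (q(s, a) - (r + gamma * max(q(s_next, b) for b in actions))) ** 2
            for (s, a, r, s_next) in batch
        ]
        scores.append(float(np.mean(residuals)) + pen)
    # An oracle would instead minimize the true (unknown) Bellman error.
    return int(np.argmin(scores))
```

Selecting by penalized estimated error rather than by the raw estimate alone is what allows the selected candidate to compete with the best one up to a constant factor plus a small remainder term.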

Keywords

Reinforcement learning · Model selection · Complexity regularization · Adaptivity · Offline learning · Off-policy learning · Finite-sample bounds

Copyright information

© The Author(s) 2011

Authors and Affiliations

1. Department of Computing Science, University of Alberta, Edmonton, Canada