Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

  • Conference paper
Learning Theory (COLT 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4005)

Abstract

We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems. In contrast to previous theoretical work, we consider the case in which the training data consist of a single sample path (trajectory) generated by some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration, where in each iteration the Q-function of the intermediate policy is obtained by minimizing a novel Bellman-residual-type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance; the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
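To make the setting concrete, the sketch below illustrates the overall structure of fitted policy iteration driven by Bellman-residual minimization over transitions taken from one trajectory of a behaviour policy. It is a minimal, illustrative sketch only: the toy MDP, the one-hot feature map, and the plain squared Bellman-residual objective are assumptions made for brevity, and that plain residual is biased under stochastic transitions, which is precisely the issue the modified criterion studied in the paper is designed to address.

```python
# Illustrative sketch (not the paper's algorithm): fitted policy iteration
# where each policy-evaluation step minimizes a naive empirical Bellman
# residual over transitions from a single trajectory of a behaviour policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 10, 2, 0.95

# Hypothetical toy MDP: random transition kernel P and rewards R.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

def features(s, a):
    """One-hot (state, action) features; a stand-in for a generic basis."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def collect_trajectory(behaviour, s0=0, T=1000):
    """Single sample path of (s_t, a_t, r_t, s_{t+1}) under a behaviour policy."""
    s, data = s0, []
    for _ in range(T):
        a = behaviour(s)
        s_next = rng.choice(n_states, p=P[s, a])
        data.append((s, a, R[s, a], s_next))
        s = s_next
    return data

def fit_q(data, policy, n_iter=50, lr=0.5):
    """Approximate Q^pi by gradient descent on the naive squared Bellman residual.
    The paper replaces this naive objective with a modified Bellman-residual
    criterion that removes the bias caused by stochastic transitions."""
    w = np.zeros(n_states * n_actions)
    for _ in range(n_iter):
        grad = np.zeros_like(w)
        for s, a, r, s_next in data:
            phi, phi_next = features(s, a), features(s_next, policy(s_next))
            residual = phi @ w - (r + gamma * (phi_next @ w))
            grad += residual * (phi - gamma * phi_next)
        w -= lr * grad / len(data)
    return w.reshape(n_states, n_actions)

data = collect_trajectory(lambda s: int(rng.integers(n_actions)))  # exploratory behaviour
policy = lambda s: 0                                               # initial greedy policy
for _ in range(5):                                                 # approximate policy iteration
    q = fit_q(data, policy)
    policy = lambda s, q=q: int(np.argmax(q[s]))                   # greedy improvement
print(np.round(q, 2))
```

Note that, as in the single-sample-path setting of the paper, the same fixed set of transitions is reused in every policy-evaluation step; no generative model is queried when the policy changes.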

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Antos, A., Szepesvári, C., Munos, R. (2006). Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path. In: Lugosi, G., Simon, H.U. (eds) Learning Theory. COLT 2006. Lecture Notes in Computer Science, vol 4005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11776420_42

  • DOI: https://doi.org/10.1007/11776420_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35294-5

  • Online ISBN: 978-3-540-35296-9
