Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

  • András Antos
  • Csaba Szepesvári
  • Rémi Munos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4005)


We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration where, in successive iterations, the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual-type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
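To make the setting concrete, here is a minimal, hypothetical sketch of fitted policy iteration from a single sample path with a linear Q-function, where each policy-evaluation step solves a plain empirical Bellman-residual least-squares problem. This is an illustration only, not the paper's algorithm: the paper minimizes a modified Bellman-residual criterion precisely because the naive residual below is a biased estimate on stochastic transitions. The feature map `phi` and the trajectory format are assumptions for the sketch.

```python
import numpy as np

def fitted_policy_iteration(path, n_actions, phi, gamma=0.9, n_iter=10):
    """Illustrative sketch (not the paper's exact method).

    path: list of (state, action, reward, next_state) tuples from a
          single trajectory of some behaviour policy.
    phi:  feature map; Q is modelled linearly as Q(s, a) = phi(s, a) @ w.

    Each iteration evaluates the greedy policy of the previous iterate
    by least-squares minimization of the empirical Bellman residual
        sum_t ( phi(s_t, a_t) @ w - r_t
                - gamma * phi(s'_t, pi(s'_t)) @ w )^2 .
    """
    d = phi(path[0][0], path[0][1]).shape[0]
    w = np.zeros(d)
    for _ in range(n_iter):
        # Greedy policy w.r.t. the current Q-estimate.
        greedy = lambda s: max(range(n_actions),
                               key=lambda a: float(phi(s, a) @ w))
        # Residual design matrix: rows phi(s,a) - gamma * phi(s', pi(s')).
        A = np.array([phi(s, a) - gamma * phi(s2, greedy(s2))
                      for (s, a, r, s2) in path])
        b = np.array([r for (_, _, r, _) in path])
        # New Q-weights: least-squares Bellman-residual minimizer.
        w = np.linalg.lstsq(A, b, rcond=None)[0]
    return w
```

On a deterministic MDP this naive residual coincides with the true Bellman error, so the sketch recovers the correct Q-values; the paper's contribution is a criterion and analysis that also work with noisy transitions and mixing (non-i.i.d.) trajectory data.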


Keywords: Sample Path · Policy Iteration · Markovian Decision Problem · Greedy Policy · Behaviour Policy




References

  1. Lagoudakis, M., Parr, R.: Least-squares policy iteration. Journal of Machine Learning Research 4, 1107–1149 (2003)
  2. Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with fitted policy iteration and a single sample path: approximate iterative policy evaluation. In: ICML 2006 (submitted, 2006)
  3. Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York (1978)
  4. Sutton, R.S., Barto, A.G.: Toward a modern theory of adaptive networks: Expectation and prediction. In: Proc. of the Ninth Annual Conference of the Cognitive Science Society. Erlbaum, Hillsdale (1987)
  5. Munos, R.: Error bounds for approximate policy iteration. In: 19th International Conference on Machine Learning, pp. 560–567 (2003)
  6. Meyn, S.P., Tweedie, R.: Markov Chains and Stochastic Stability. Springer, New York (1993)
  7. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999)
  8. Yu, B.: Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability 22(1), 94–116 (1994)
  9. Nobel, A.: Histogram regression estimation using data-dependent partitions. Annals of Statistics 24(3), 1084–1105 (1996)
  10. Haussler, D.: Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory Series A 69, 217–232 (1995)
  11. Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, 210–229 (1963); reprinted in: Feigenbaum, E.A., Feldman, J. (eds.) Computers and Thought. McGraw-Hill, New York (1963)
  12. Bellman, R.E., Dreyfus, S.E.: Functional approximation and dynamic programming. Math. Tables and Other Aids Comp. 13, 247–251 (1959)
  13. Bertsekas, D.P., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific (1996)
  14. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Bradford Book (1998)
  15. Gordon, G.J.: Stable function approximation in dynamic programming. In: Prieditis, A., Russell, S. (eds.) Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco (1995)
  16. Tsitsiklis, J.N., Van Roy, B.: Feature-based methods for large scale dynamic programming. Machine Learning 22, 59–94 (1996)
  17. Guestrin, C., Koller, D., Parr, R.: Max-norm projections for factored MDPs. In: Proceedings of the International Joint Conference on Artificial Intelligence (2001)
  18. Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, 503–556 (2005)
  19. Wang, X., Dietterich, T.G.: Efficient value function approximation using regression trees. In: Proceedings of the IJCAI Workshop on Statistical Machine Learning for Large-Scale Optimization, Stockholm, Sweden (1999)
  20. Dietterich, T.G., Wang, X.: Batch value function approximation via support vectors. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14. MIT Press, Cambridge (2002)
  21. Szepesvári, C., Munos, R.: Finite time bounds for sampling based fitted value iteration. In: ICML 2005 (2005)
  22. Meir, R.: Nonparametric time series prediction through adaptive model selection. Machine Learning 39(1), 5–34 (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • András Antos (1)
  • Csaba Szepesvári (1)
  • Rémi Munos (2)

  1. Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary
  2. Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France
