Machine Learning, Volume 72, Issue 3, pp 157–171

Rollout sampling approximate policy iteration

  • Christos Dimitrakakis
  • Michail G. Lagoudakis
Open Access


Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which represent policies with classifiers and treat policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme that treats the core sampling problem in evaluating a policy through simulation as a multi-armed bandit problem. The resulting algorithm performs comparably to the previous algorithm, but with significantly less computational effort. An order-of-magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
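The idea of treating rollout allocation as a bandit problem can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the toy chain domain, the UCB-style score, the `min_gap` labelling threshold, and all names (`step`, `rollout`, `allocate_rollouts`, `slip`) are assumptions made for illustration. Each state in a pool plays the role of an arm, rollouts are spent on the state with the highest score instead of uniformly, and states with a clear empirical winner become training examples for a classifier.

```python
import math
import random

GAMMA = 0.9          # discount factor (assumed)
HORIZON = 20         # rollout length (assumed)
ACTIONS = (-1, +1)   # toy domain: move left or right on a line


def step(state, action, rng, slip):
    """Toy chain on 0..10 with reward 1 at state 10 (hypothetical domain)."""
    if rng.random() < slip:          # with probability `slip` the move is reversed
        action = -action
    nxt = max(0, min(10, state + action))
    return nxt, (1.0 if nxt == 10 else 0.0)


def rollout(state, action, policy, rng, slip):
    """Discounted return of taking `action` once, then following `policy`."""
    total, discount = 0.0, 1.0
    for _ in range(HORIZON):
        state, reward = step(state, action, rng, slip)
        total += discount * reward
        discount *= GAMMA
        action = policy(state)
    return total


def gap(sums, count):
    """Difference between the best and second-best empirical action values."""
    means = sorted(total / count for total in sums.values())
    return means[-1] - means[-2]


def allocate_rollouts(states, policy, budget, rng, slip=0.2, c=1.0, min_gap=0.1):
    """Spend `budget` sampling rounds on states chosen by a UCB-style score."""
    counts = {s: 0 for s in states}
    sums = {s: {a: 0.0 for a in ACTIONS} for s in states}

    for t in range(1, budget + 1):
        def score(s):
            if counts[s] == 0:
                return math.inf      # sample every state at least once
            bonus = c * math.sqrt(math.log(t) / counts[s])
            return gap(sums[s], counts[s]) + bonus

        s = max(states, key=score)
        for a in ACTIONS:            # one rollout per action in the chosen state
            sums[s][a] += rollout(s, a, policy, rng, slip)
        counts[s] += 1

    # States with a clear empirical winner become (state, action) examples.
    labels = {}
    for s in states:
        if counts[s] > 0 and gap(sums[s], counts[s]) >= min_gap:
            labels[s] = max(ACTIONS, key=lambda a: sums[s][a])
    return labels, counts


if __name__ == "__main__":
    rng = random.Random(0)
    labels, counts = allocate_rollouts(list(range(11)), lambda s: +1, 200, rng)
    print(labels)
    print(counts)
```

The returned `labels` dictionary would serve as the supervised training set for the next policy, while `counts` shows how unevenly the sampling budget was spent. A principled stopping rule for individual states is deliberately left out of this sketch.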


Reinforcement learning · Approximate policy iteration · Rollouts · Bandit problems · Classification · Sample complexity



Copyright information

© The Author(s) 2008

Open Access. This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  1. Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
  2. Department of Electronic and Computer Engineering, Technical University of Crete, Chania, Crete, Greece
