Abstract
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which addresses the core sampling problem of evaluating a policy through simulation as a multi-armed bandit problem. The resulting algorithm offers performance comparable to that of the previous algorithm, but with significantly less computational effort. An order of magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
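The bandit view of rollout allocation mentioned in the abstract can be illustrated with a UCB1-style rule (Auer et al., 2002): rather than giving every candidate action the same number of rollouts, sampling effort is concentrated on the actions whose estimated return is still plausibly the highest. The sketch below is a minimal illustration under that assumption; the function names, parameters, and the simulated `reward_fn` are illustrative, not the paper's actual algorithm.

```python
import math
import random

def ucb1_allocate(num_actions, total_rollouts, reward_fn):
    """Allocate a budget of rollouts among candidate actions at a state
    using the UCB1 index. reward_fn(a) returns one sampled rollout return
    for action a (a stand-in for simulating the policy after taking a)."""
    counts = [0] * num_actions
    sums = [0.0] * num_actions
    # Sample each action once to initialise the empirical estimates.
    for a in range(num_actions):
        sums[a] += reward_fn(a)
        counts[a] = 1
    for t in range(num_actions, total_rollouts):
        # Pick the action maximising the upper confidence bound
        # (empirical mean plus an exploration bonus).
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
               for a in range(num_actions)]
        a = max(range(num_actions), key=lambda i: ucb[i])
        sums[a] += reward_fn(a)
        counts[a] += 1
    # Return the empirically best action and how rollouts were spent.
    best = max(range(num_actions), key=lambda a: sums[a] / counts[a])
    return best, counts

# Toy example: three candidate actions with mean rollout returns
# 0.2, 0.5, 0.8 and a little noise; most rollouts should go to action 2.
random.seed(0)
means = [0.2, 0.5, 0.8]
best, counts = ucb1_allocate(3, 300, lambda a: random.gauss(means[a], 0.1))
```

The point of the bandit formulation is that clearly dominated actions receive few rollouts, so a comparable action ranking is obtained with far fewer simulation calls than uniform sampling.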
References
Antos, A., Szepesvári, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129. doi:10.1007/s10994-007-5038-2.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
Dimitrakakis, C., & Lagoudakis, M. (2008). Algorithms and bounds for sampling-based approximate policy iteration. (To be presented at the 8th European Workshop on Reinforcement Learning).
Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7, 1079–1105. ISSN 1533-7928.
Fern, A., Yoon, S., & Givan, R. (2004). Approximate policy iteration with a policy language bias. Advances in Neural Information Processing Systems, 16(3).
Fern, A., Yoon, S., & Givan, R. (2006). Approximate policy iteration with a policy language bias: Solving relational Markov decision processes. Journal of Artificial Intelligence Research, 25, 75–118.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.
Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the European conference on machine learning.
Lagoudakis, M. G. (2003). Efficient approximate policy iteration methods for sequential decision making in reinforcement learning. PhD thesis, Department of Computer Science, Duke University.
Lagoudakis, M. G., & Parr, R. (2003a). Least-squares policy iteration. Journal of Machine Learning Research, 4(6), 1107–1149.
Lagoudakis, M. G. & Parr, R. (2003b). Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th international conference on machine learning (ICML) (pp. 424–431). Washington, DC, USA.
Langford, J., & Zadrozny, B. (2005). Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd international conference on machine learning (ICML) (pp. 473–480). Bonn, Germany. ISBN 1-59593-180-5. doi:10.1145/1102351.1102411.
Rexakis, I., & Lagoudakis, M. (2008). Classifier-based policy representation. (To be presented at the 8th European Workshop on Reinforcement Learning).
Riedmiller, M. (2005). Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In 16th European conference on machine learning (pp. 317–328).
Sutton, R., & Barto, A. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.
Wang, H. O., Tanaka, K., & Griffin, M. F. (1996). An approach to fuzzy control of nonlinear systems: Stability and design issues. IEEE Transactions on Fuzzy Systems, 4(1), 14–23.
Additional information
Editors: Walter Daelemans, Bart Goethals, Katharina Morik.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Dimitrakakis, C., Lagoudakis, M.G. Rollout sampling approximate policy iteration. Mach Learn 72, 157–171 (2008). https://doi.org/10.1007/s10994-008-5069-3