Abstract
The use of Reinforcement Learning in real-world scenarios is strongly limited by issues of scale. Most RL algorithms are unable to deal with problems composed of hundreds or sometimes even dozens of possible actions, and therefore cannot be applied to many real-world problems. We consider the RL problem in the supervised classification framework where the optimal policy is obtained through a multiclass classifier, the set of classes being the set of actions of the problem. We introduce error-correcting output codes (ECOCs) in this setting and propose two new methods for reducing complexity when using rollout-based approaches. The first method consists in using an ECOC-based classifier as the multiclass classifier, reducing the learning complexity from \(\mathcal{O}(A^2)\) to \(\mathcal{O}(A \log(A))\). We then propose a novel method that profits from the ECOC’s coding dictionary to split the initial MDP into \(\mathcal{O}(\log(A))\) separate two-action MDPs. This second method reduces learning complexity even further, from \(\mathcal{O}(A^2)\) to \(\mathcal{O}(\log(A))\), thus rendering problems with large action sets tractable. We finish by experimentally demonstrating the advantages of our approach on a set of benchmark problems, both in speed and performance.
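The core ECOC mechanism the abstract relies on can be illustrated compactly. The sketch below is not the authors' implementation; it only shows the generic ECOC idea under illustrative assumptions: each of the \(A\) actions is assigned a binary codeword of length \(\mathcal{O}(\log A)\) (here, a random code), each bit is predicted by an independent binary classifier, and an action is recovered by nearest-codeword (minimum Hamming distance) decoding. The function names and the choice of a random code are assumptions for illustration.

```python
import numpy as np

def random_ecoc_dictionary(n_actions, code_len, seed=0):
    """Assign each action a random binary codeword of length code_len.

    A random code is one common ECOC construction; code_len is typically
    O(log n_actions) plus some redundancy for error correction.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_actions, code_len))

def decode(bit_predictions, codebook):
    """Return the action whose codeword is closest in Hamming distance
    to the vector of predicted bits (one bit per binary classifier)."""
    dists = np.abs(codebook - np.asarray(bit_predictions)).sum(axis=1)
    return int(np.argmin(dists))
```

With such a dictionary, learning reduces to one binary problem per code bit rather than one problem per action pair, which is the source of the \(\mathcal{O}(A^2) \to \mathcal{O}(A \log A)\) reduction the abstract describes; the paper's second method goes further by treating each bit as its own two-action MDP.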
References
Lazaric, A., Restelli, M., Bonarini, A.: Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods. In: Proc. of NIPS 2007 (2007)
Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C., et al.: X-armed bandits. Journal of Machine Learning Research 12, 1655–1695 (2011)
Negoescu, D., Frazier, P., Powell, W.: The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. on Computing 23(3), 346–363 (2011)
Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Jo. of Art. Int. Research 2, 263–286 (1995)
Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: Leveraging modern classifiers. In: Proc. of ICML 2003 (2003)
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517 (1975)
Lazaric, A., Ghavamzadeh, M., Munos, R.: Analysis of a classification-based policy iteration algorithm. In: Proc. of ICML 2010, pp. 607–614 (2010)
Sutton, R.: Generalization in reinforcement learning: Successful examples using sparse coarse coding. In: Proc. of NIPS 1996, pp. 1038–1044 (1996)
Berger, A.: Error-correcting output coding for text classification. In: Workshop on Machine Learning for Information Filtering, IJCAI 1999 (1999)
Dimitrakakis, C., Lagoudakis, M.G.: Rollout sampling approximate policy iteration. Machine Learning 72(3), 157–171 (2008)
Tham, C.: Modular on-line function approximation for scaling up reinforcement learning. PhD thesis, University of Cambridge (1994)
Tesauro, G.: Practical issues in temporal difference learning. Machine Learning 8, 257–277 (1992)
Tesauro, G., Galperin, G.R.: On-Line Policy Improvement Using Monte-Carlo Search. In: Proc. of NIPS 1997, pp. 1068–1074 (1997)
Pazis, J., Lagoudakis, M.G.: Reinforcement Learning in Multidimensional Continuous Action Spaces. In: Proc. of Adaptive Dynamic Programming and Reinf. Learn., pp. 97–104 (2011)
Pazis, J., Parr, R.: Generalized Value Functions for Large Action Sets. In: Proc. of ICML 2011, pp. 1185–1192 (2011)
Beygelzimer, A., Langford, J., Zadrozny, B.: Machine learning techniques reductions between prediction quality metrics. In: Performance Modeling and Engineering, pp. 3–28 (2008)
Crammer, K., Singer, Y.: On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning 47(2), 201–233 (2002)
Cissé, M., Artieres, T., Gallinari, P.: Learning efficient error correcting output codes for large hierarchical multi-class problems. In: Workshop on Large-Scale Hierarchical Classification ECML/PKDD 2011, pp. 37–49 (2011)
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Dulac-Arnold, G., Denoyer, L., Preux, P., Gallinari, P. (2012). Fast Reinforcement Learning with Large Action Sets Using Error-Correcting Output Codes for MDP Factorization. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science(), vol 7524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33486-3_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33485-6
Online ISBN: 978-3-642-33486-3