Fast Reinforcement Learning with Large Action Sets Using Error-Correcting Output Codes for MDP Factorization

  • Gabriel Dulac-Arnold
  • Ludovic Denoyer
  • Philippe Preux
  • Patrick Gallinari
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7524)


The use of Reinforcement Learning in real-world scenarios is strongly limited by issues of scale. Most RL learning algorithms are unable to deal with problems composed of dozens, let alone hundreds, of possible actions, and therefore cannot be applied to many real-world problems. We consider the RL problem in the supervised classification framework, where the optimal policy is obtained through a multiclass classifier whose set of classes is the set of actions of the problem. We introduce error-correcting output codes (ECOCs) in this setting and propose two new methods for reducing complexity when using rollout-based approaches. The first method consists in using an ECOC-based classifier as the multiclass classifier, reducing the learning complexity from \(\mathcal{O}(A^2)\) to \(\mathcal{O}(A \log(A))\). We then propose a novel method that exploits the ECOC’s coding dictionary to split the initial MDP into \(\mathcal{O}(\log(A))\) separate two-action MDPs. This second method reduces learning complexity even further, from \(\mathcal{O}(A^2)\) to \(\mathcal{O}(\log(A))\), thus rendering problems with large action sets tractable. We finish by experimentally demonstrating the advantages of our approach on a set of benchmark problems, both in speed and performance.
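The core ECOC idea referenced in the abstract can be illustrated with a minimal sketch: each of the \(A\) actions is assigned a binary codeword of length \(\mathcal{O}(\log(A))\), one binary predictor (or, in the paper's second method, one two-action MDP) is responsible for each bit, and the selected action is the one whose codeword is nearest in Hamming distance to the predicted bit vector. The coding matrix below (plain binary representations) and the helper names are illustrative assumptions, not the dictionary used in the paper.

```python
import numpy as np

def make_codes(num_actions):
    """Assign each action a binary codeword (here: its binary
    representation, least-significant bit first). A real ECOC
    dictionary would use codewords with larger pairwise Hamming
    distance for error correction."""
    bits = max(1, int(np.ceil(np.log2(num_actions))))
    return np.array([[(a >> b) & 1 for b in range(bits)]
                     for a in range(num_actions)])

def decode(predicted_bits, codes):
    """Map a predicted bit vector (one bit per binary classifier /
    two-action MDP) to the action whose codeword is closest in
    Hamming distance."""
    dists = np.abs(codes - predicted_bits).sum(axis=1)
    return int(np.argmin(dists))

codes = make_codes(8)                        # 8 actions -> 3-bit codewords
action = decode(np.array([1, 0, 1]), codes)  # bits of action 5, LSB first
```

Nearest-codeword decoding is what gives ECOCs their robustness: with a well-separated dictionary, a few erroneous bit predictors can be outvoted by the remaining bits.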


Keywords: Optimal Policy · Reinforcement Learning · Action Space · Markov Decision Process · Policy Iteration



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Gabriel Dulac-Arnold (1)
  • Ludovic Denoyer (1)
  • Philippe Preux (2)
  • Patrick Gallinari (1)
  1. LIP6, UPMC, Paris, France
  2. LIFL (UMR CNRS) & INRIA Lille Nord-Europe, Université de Lille, Villeneuve d’Ascq, France
