Selecting Near-Optimal Approximate State Representations in Reinforcement Learning

  • Ronald Ortner
  • Odalric-Ambrym Maillard
  • Daniil Ryabko
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8776)


We consider a reinforcement learning setting introduced in [5] where the learner does not have explicit access to the states of the underlying Markov decision process (MDP). Instead, she has access to several models that map histories of past interactions to states. Here we improve over known regret bounds in this setting, and more importantly generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation but only approximations of it. We also give improved error bounds for state aggregation.
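To make the setting concrete, the following is a minimal sketch of what "models that map histories of past interactions to states" could look like. It is an illustration only, not the algorithm of the paper: the names `last_observation`, `last_two_observations`, and `induced_states`, as well as the toy history, are hypothetical and chosen purely for this example. Whether the state sequence induced by a given model actually forms an MDP depends on the environment, which is the question the learner faces.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# One interaction step: (observation, action, reward). Hypothetical encoding.
Step = Tuple[str, int, float]
History = Sequence[Step]
# A state-representation model maps a history to a state label.
Model = Callable[[History], Hashable]


def last_observation(history: History) -> Hashable:
    """Hypothetical model: the state is the most recent observation."""
    return history[-1][0] if history else None


def last_two_observations(history: History) -> Hashable:
    """Hypothetical model: the state is the tuple of the last two observations."""
    return tuple(step[0] for step in history[-2:])


def induced_states(model: Model, history: History) -> List[Hashable]:
    """Apply the model to every non-empty prefix of the history."""
    return [model(history[:t]) for t in range(1, len(history) + 1)]


# Toy history, made up for illustration.
history = [("a", 0, 0.0), ("b", 1, 1.0), ("a", 0, 0.0), ("a", 1, 0.5)]

states_1 = induced_states(last_observation, history)
states_2 = induced_states(last_two_observations, history)
```

Here the two candidate models see the same history but induce different state sequences: the coarser one distinguishes fewer states, the finer one more. Selecting among such candidates, none of which need be exactly Markov, is the kind of problem the abstract describes.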


Keywords: Markov Model, Reinforcement Learning, Markov Decision Process, Approximate Model, Average Reward




  1. Bartlett, P.L., Tewari, A.: REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In: Proc. 25th Conf. on Uncertainty in Artificial Intelligence, UAI 2009, pp. 25–42. AUAI Press (2009)
  2. Hallak, A., Castro, D.D., Mannor, S.: Model selection in Markovian processes. In: 19th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, KDD 2013, pp. 374–382. ACM (2013)
  3. Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11, 1563–1600 (2010)
  4. Littman, M., Sutton, R., Singh, S.: Predictive representations of state. Adv. Neural Inf. Process. Syst. 15, 1555–1561 (2002)
  5. Hutter, M.: Feature Reinforcement Learning: Part I: Unstructured MDPs. J. Artificial General Intelligence 1, 3–24 (2009)
  6. Maillard, O.A., Nguyen, P., Ortner, R., Ryabko, D.: Optimal regret bounds for selecting the state representation in reinforcement learning. In: Proc. 30th Int’l Conf. on Machine Learning, ICML 2013. JMLR Proc., vol. 28, pp. 543–551 (2013)
  7. Nguyen, P., Maillard, O.A., Ryabko, D., Ortner, R.: Competing with an infinite set of models in reinforcement learning. In: Proc. 16th Int’l Conf. on Artificial Intelligence and Statistics, AISTATS 2013. JMLR Proc., vol. 31, pp. 463–471 (2013)
  8. Ortner, R.: Pseudometrics for state aggregation in average reward Markov decision processes. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 373–387. Springer, Heidelberg (2007)
  9. Ortner, R., Maillard, O.A., Ryabko, D.: Selecting Near-Optimal Approximate State Representations in Reinforcement Learning. Extended version,
  10. Ortner, R., Ryabko, D.: Online Regret Bounds for Undiscounted Continuous Reinforcement Learning. Adv. Neural Inf. Process. Syst. 25, 1772–1780 (2012)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Ronald Ortner (1)
  • Odalric-Ambrym Maillard (2)
  • Daniil Ryabko (3, 4)

  1. Montanuniversitaet Leoben, Austria
  2. The Technion, Israel
  3. Inria Lille-Nord Europe, équipe SequeL, France
  4. Inria, Chile
