Annals of Operations Research, Volume 208, Issue 1, pp 321–336

Adaptive aggregation for reinforcement learning in average reward Markov decision processes



We present an algorithm that aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals to aggregate the state space. We derive bounds on the regret that our algorithm suffers with respect to an optimal policy. These bounds are only slightly worse than the original bounds for UCRL.
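
For readers unfamiliar with the term, regret in average reward settings of this kind is commonly defined as below. This is the usual definition from the UCRL literature, given here as an assumption because the abstract does not spell it out; the symbols T (number of steps), \rho^* (optimal average reward), r_t (reward received at step t), and \Delta(T) are not taken from the text above.

```latex
\Delta(T) \;=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t
```

Under this reading, a regret bound that grows sublinearly in T means the learner's average reward approaches \rho^* over time.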


Keywords: Reinforcement learning; Markov decision process; Bounded parameter MDP; Regret

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. INRIA Lille-Nord Europe, équipe SequeL, Villeneuve d’Ascq, France
