Annals of Operations Research, Volume 208, Issue 1, pp. 321–336

Adaptive aggregation for reinforcement learning in average reward Markov decision processes

Abstract

We present an algorithm that aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy; these bounds are only slightly worse than the original bounds for UCRL.
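
To make the idea in the abstract concrete, here is a minimal, hypothetical sketch of confidence-interval-based state aggregation: states whose empirical estimates have overlapping confidence intervals for every action are treated as statistically indistinguishable and merged. This is only an illustration under simplifying assumptions, not the paper's algorithm: it compares reward estimates only, whereas the actual method also handles transition probabilities and is interleaved with UCRL's optimistic planning, and all function names, interval widths, and parameters below are invented for the example.

```python
# Minimal sketch (not the paper's algorithm): group states whose reward
# confidence intervals overlap for every action, in the spirit of
# confidence-interval-based aggregation on top of UCRL. Names and the
# concrete interval width are illustrative assumptions.
import math


def reward_confidence_interval(r_hat, n, t, delta=0.05):
    """Hoeffding-style interval for an empirical mean reward in [0, 1].

    r_hat: empirical mean, n: number of observations of this state-action
    pair, t: current time step (enters the log term, as in UCRL-type bounds).
    """
    width = math.sqrt(math.log(2 * max(t, 2) / delta) / (2 * max(n, 1)))
    return (r_hat - width, r_hat + width)


def _overlap(i, j):
    # Two closed intervals overlap iff each starts before the other ends.
    return i[0] <= j[1] and j[0] <= i[1]


def aggregate_states(estimates, counts, t):
    """Greedily merge states that are indistinguishable given the data so far.

    estimates[s][a]: empirical mean reward, counts[s][a]: sample count.
    Returns a partition of the states into aggregated classes.
    """
    classes = []
    for s in estimates:
        for cls in classes:
            rep = cls[0]  # compare against one representative per class
            if all(_overlap(
                    reward_confidence_interval(estimates[s][a], counts[s][a], t),
                    reward_confidence_interval(estimates[rep][a], counts[rep][a], t))
                   for a in estimates[s]):
                cls.append(s)
                break
        else:
            classes.append([s])
    return classes


# States 0 and 1 have similar observed rewards and are merged; state 2 stays apart.
estimates = {0: {"a": 0.50}, 1: {"a": 0.52}, 2: {"a": 0.95}}
counts = {0: {"a": 400}, 1: {"a": 350}, 2: {"a": 500}}
print(aggregate_states(estimates, counts, t=1250))  # -> [[0, 1], [2]]
```

With fewer observations the intervals widen and more states are merged; as data accrue the intervals shrink and the aggregation is refined, which is the adaptive behaviour the title refers to.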

Keywords

Reinforcement learning · Markov decision process · Bounded parameter MDP · Regret

Acknowledgements

The author would like to thank Peter Auer for discussion of some preliminary ideas, Shiau Hong Lim for sharing his power plug adapter at Cumberland Lodge, and the anonymous reviewers for their comments, which helped to improve the paper. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 216886 (PASCAL2 Network of Excellence) and no. 231495 (project CompLACS). The final version of this paper was prepared while the author was supported by the Austrian Science Fund (FWF): J 3259-N13.

References

  1. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47, 235–256.
  2. Bartlett, P. L., & Tewari, A. (2009). REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI 2009, Proc. 25th annual conference on uncertainty in artificial intelligence (pp. 35–42).
  3. Bertsekas, D. P., & Castañon, D. A. (1989). Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6), 589–598.
  4. Buşoniu, L., De Schutter, B., & Babuška, R. (2010). Approximate dynamic programming and reinforcement learning. In Interactive collaborative information systems (pp. 3–44).
  5. Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 122–142.
  6. Burnetas, A. N., & Katehakis, M. N. (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1), 222–255.
  7. Chang, H. S., Fu, M. C., Hu, J., & Marcus, S. I. (2005). An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1), 126–139.
  8. Chang, H. S., Fu, M. C., Hu, J., & Marcus, S. I. (2007). Simulation-based algorithms for Markov decision processes. London: Springer.
  9. Diuk, C., Li, L., & Leffler, B. R. (2009). The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proc. 26th annual international conference on machine learning (p. 32).
  10. Even-Dar, E., & Mansour, Y. (2003). Approximate equivalence of Markov decision processes. In Computational learning theory and kernel machines, 16th annual conference on computational learning theory and 7th kernel workshop (pp. 581–594).
  11. Ferns, N., Panangaden, P., & Precup, D. (2004). Metrics for finite Markov decision processes. In UAI’04, proc. 20th conference in uncertainty in artificial intelligence (pp. 162–169).
  12. Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1–2), 163–223.
  13. Givan, R., Leach, S. M., & Dean, T. (2000). Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1–2), 71–109.
  14. Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11, 1563–1600.
  15. Katehakis, M. N., & Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19), 8584–8585.
  16. Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
  17. Leffler, B. R., Littman, M. L., & Edmunds, T. (2007). Efficient reinforcement learning with relocatable action models. In Proc. 22nd AAAI conference on artificial intelligence (pp. 572–577).
  18. Li, L. (2009). A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers University.
  19. Li, L., Littman, M. L., Walsh, T. J., & Strehl, A. L. (2011). Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3), 399–443.
  20. Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In Proc. 9th international symposium on artificial intelligence and mathematics (pp. 531–539).
  21. Mannor, S., Menache, I., Hoze, A., & Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. In Machine learning, proc. 21st international conference.
  22. Munos, R. (2010). Approximate dynamic programming. In O. Sigaud & O. Buffet (Eds.), Markov decision processes in artificial intelligence (pp. 67–98), Chap. 3.
  23. Ortner, R. (2007). Pseudometrics for state aggregation in average reward Markov decision processes. In Algorithmic learning theory, 18th international conference, ALT 2007 (pp. 373–387).
  24. Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. New York: Wiley.
  25. Ravindran, B., & Barto, A. G. (2003). SMDP homomorphisms: an algebraic approach to abstraction in semi-Markov decision processes. In IJCAI-03, proc. 18th international joint conference on artificial intelligence (pp. 1011–1018).
  26. Van Roy, B. (2006). Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2), 234–244.
  27. Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Machine learning, proc. 11th international conference (pp. 284–292).
  28. Strehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. In Machine learning, proc. 22nd international conference (pp. 857–864).
  29. Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), 1309–1331.
  30. Strehl, A. L., Diuk, C., & Littman, M. L. (2007). Efficient structure learning in factored-state MDPs. In Proc. 22nd AAAI conference on artificial intelligence (pp. 645–650).
  31. Tewari, A., & Bartlett, P. L. (2007). Bounded parameter Markov decision processes with average reward criterion. In Learning theory, 20th annual conference on learning theory, COLT 2007 (pp. 263–277).
  32. Tewari, A., & Bartlett, P. L. (2008). Optimistic linear programming gives logarithmic regret for irreducible MDPs. Advances in Neural Information Processing Systems, 20, 1505–1512.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. INRIA Lille-Nord Europe, équipe SequeL, Villeneuve d’Ascq, France
