Adaptive aggregation for reinforcement learning in average reward Markov decision processes

Annals of Operations Research

Abstract

We present an algorithm that aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals to aggregate the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy; these bounds are only slightly worse than the original bounds for UCRL.
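
The abstract only states that the aggregation is driven by UCRL-style confidence intervals. As a rough illustration of that idea (not the paper's actual algorithm), the following Python sketch maintains empirical reward estimates with UCRL-style confidence radii and treats two states as candidates for aggregation when their intervals overlap for every action. The class name, the merge criterion, and the exact radius formula are assumptions made for illustration only.

```python
# Hypothetical sketch of confidence-interval-based state aggregation.
# This is NOT the paper's algorithm; it only illustrates the idea from the
# abstract: two states may be aggregated while learning if, for every action,
# their empirical reward estimates are indistinguishable up to UCRL-style
# confidence radii.

import math
from collections import defaultdict


class AggregatingLearner:
    def __init__(self, n_states, n_actions, delta=0.05):
        self.nS, self.nA, self.delta = n_states, n_actions, delta
        self.counts = defaultdict(int)        # visit counts N(s, a)
        self.reward_sum = defaultdict(float)  # accumulated rewards for (s, a)
        self.t = 1                            # global time step

    def observe(self, s, a, r):
        """Record one observed reward for state-action pair (s, a)."""
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.t += 1

    def _mean_and_radius(self, s, a):
        """Empirical mean reward and a UCRL-style confidence radius."""
        n = max(1, self.counts[(s, a)])
        mean = self.reward_sum[(s, a)] / n
        radius = math.sqrt(
            math.log(2 * self.nS * self.nA * self.t / self.delta) / (2 * n)
        )
        return mean, radius

    def can_aggregate(self, s1, s2):
        """Candidate merge rule (illustrative only): aggregate s1 and s2 if
        their reward confidence intervals overlap for every action."""
        for a in range(self.nA):
            m1, c1 = self._mean_and_radius(s1, a)
            m2, c2 = self._mean_and_radius(s2, a)
            if abs(m1 - m2) > c1 + c2:
                return False
        return True
```

A full UCRL-based method additionally maintains confidence intervals over the transition probabilities and periodically re-solves an optimistic model; the sketch omits both for brevity.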



References

  • Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47, 235–256.

  • Bartlett, P. L., & Tewari, A. (2009). REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI 2009, Proc. 25th annual conference on uncertainty in artificial intelligence (pp. 35–42).

  • Bertsekas, D. P., & Castañon, D. A. (1989). Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6), 589–598.

  • Buşoniu, L., De Schutter, B., & Babuška, R. (2010). Approximate dynamic programming and reinforcement learning. In Interactive collaborative information systems (pp. 3–44).

  • Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 122–142.

  • Burnetas, A. N., & Katehakis, M. N. (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1), 222–255.

  • Chang, H. S., Fu, M. C., Hu, J., & Marcus, S. I. (2005). An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1), 126–139.

  • Chang, H. S., Fu, M. C., Hu, J., & Marcus, S. I. (2007). Simulation-based algorithms for Markov decision processes. London: Springer.

  • Diuk, C., Li, L., & Leffler, B. R. (2009). The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proc. 26th annual international conference on machine learning (p. 32).

  • Even-Dar, E., & Mansour, Y. (2003). Approximate equivalence of Markov decision processes. In Computational learning theory and kernel machines, 16th annual conference on computational learning theory and 7th kernel workshop (pp. 581–594).

  • Ferns, N., Panangaden, P., & Precup, D. (2004). Metrics for finite Markov decision processes. In UAI’04, proc. 20th conference in uncertainty in artificial intelligence (pp. 162–169).

  • Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1–2), 163–223.

  • Givan, R., Leach, S. M., & Dean, T. (2000). Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1–2), 71–109.

  • Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11, 1563–1600.

  • Katehakis, M. N., & Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19), 8584–8585.

  • Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

  • Leffler, B. R., Littman, M. L., & Edmunds, T. (2007). Efficient reinforcement learning with relocatable action models. In Proc. 22nd AAAI conference on artificial intelligence (pp. 572–577).

  • Li, L. (2009). A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers University.

  • Li, L., Littman, M. L., Walsh, T. J., & Strehl, A. L. (2011). Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3), 399–443.

  • Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In Proc. 9th international symposium on artificial intelligence and mathematics (pp. 531–539).

  • Mannor, S., Menache, I., Hoze, A., & Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. In Machine learning, proc. 21st international conference.

  • Munos, R. (2010). Approximate dynamic programming. In O. Sigaud & O. Buffet (Eds.), Markov decision processes in artificial intelligence (pp. 67–98), Chap. 3.

  • Ortner, R. (2007). Pseudometrics for state aggregation in average reward Markov decision processes. In Algorithmic learning theory, 18th international conference, ALT 2007 (pp. 373–387).

  • Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. New York: Wiley.

  • Ravindran, B., & Barto, A. G. (2003). SMDP homomorphisms: An algebraic approach to abstraction in semi-Markov decision processes. In IJCAI-03, proc. 18th international joint conference on artificial intelligence (pp. 1011–1018).

  • Van Roy, B. (2006). Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2), 234–244.

  • Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Machine learning, proc. 11th international conference (pp. 284–292).

  • Strehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. In Machine learning, proc. 22nd international conference (pp. 857–864).

  • Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), 1309–1331.

  • Strehl, A. L., Diuk, C., & Littman, M. L. (2007). Efficient structure learning in factored-state MDPs. In Proc. 22nd AAAI conference on artificial intelligence (pp. 645–650).

  • Tewari, A., & Bartlett, P. L. (2007). Bounded parameter Markov decision processes with average reward criterion. In Learning theory, 20th annual conference on learning theory, COLT 2007 (pp. 263–277).

  • Tewari, A., & Bartlett, P. (2008). Optimistic linear programming gives logarithmic regret for irreducible MDPs. Advances in Neural Information Processing Systems, 20, 1505–1512.


Acknowledgements

The author would like to thank Peter Auer for discussion of some preliminary ideas, Shiau Hong Lim for sharing his power plug adapter at Cumberland Lodge, and the anonymous reviewers for their comments, which helped to improve the paper. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 216886 (PASCAL2 Network of Excellence) and no. 231495 (project CompLACS). The final version of this paper was prepared while the author was supported by the Austrian Science Fund (FWF): J 3259-N13.

Author information

Correspondence to Ronald Ortner.

Cite this article

Ortner, R. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Ann Oper Res 208, 321–336 (2013). https://doi.org/10.1007/s10479-012-1064-y
