Abstract
We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, we consider multi-armed bandits with switching cost.
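To make the setting concrete: in an MDP with deterministic transitions, every stationary policy eventually settles into a cycle of states, so the optimal average reward equals the maximum mean weight of a cycle in the underlying graph. The Python sketch below is a hypothetical illustration, not the paper's actual algorithm: it inflates each arc's empirical mean reward by a UCB1-style bonus (the paper's confidence terms may differ) and evaluates the resulting optimistic gain with Karp's maximum mean cycle recurrence. Rewards are assumed to lie in [0, 1], and the function names are illustrative.

```python
import math

def max_mean_cycle(n, edges):
    """Karp's recurrence for the maximum mean cycle of a digraph.

    n: number of states (labelled 0..n-1).
    edges: list of arcs (u, v, w) with weight w.
    Returns the maximum average arc weight over all cycles,
    or None if the graph is acyclic.
    """
    NEG = float("-inf")
    # F[k][v] = maximum total weight of a walk of length k ending in v
    F = [[NEG] * n for _ in range(n + 1)]
    for v in range(n):
        F[0][v] = 0.0
    for k in range(1, n + 1):
        for u, v, w in edges:
            if F[k - 1][u] > NEG and F[k - 1][u] + w > F[k][v]:
                F[k][v] = F[k - 1][u] + w
    best = None
    for v in range(n):
        if F[n][v] == NEG:
            continue  # no walk of length n ends here
        # Karp: max mean cycle = max_v min_k (F[n][v] - F[k][v]) / (n - k)
        val = min((F[n][v] - F[k][v]) / (n - k)
                  for k in range(n) if F[k][v] > NEG)
        best = val if best is None else max(best, val)
    return best

def optimistic_gain(n, arcs, counts, reward_sums, t):
    """Optimistic average reward after t steps (hypothetical sketch).

    Each arc (a state-action pair) gets its empirical mean reward plus
    a UCB1-style exploration bonus; the optimistically optimal policy
    then corresponds to a maximum mean cycle under the inflated weights.
    """
    edges = []
    for u, v in arcs:
        c = counts.get((u, v), 0)
        if c == 0:
            index = 1.0  # unvisited arc: most optimistic value in [0, 1]
        else:
            mean = reward_sums[(u, v)] / c
            index = mean + math.sqrt(2.0 * math.log(t) / c)
        edges.append((u, v, index))
    return max_mean_cycle(n, edges)
```

A learner of this kind would typically recompute the optimistic maximum mean cycle from time to time and steer onto that cycle; logarithmic regret bounds of the kind stated in the abstract quantify how quickly the shrinking bonuses close the gap to an (ε-)optimal policy.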
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ortner, R. (2008). Online Regret Bounds for Markov Decision Processes with Deterministic Transitions. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) Algorithmic Learning Theory. ALT 2008. Lecture Notes in Computer Science, vol. 5254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87987-9_14
DOI: https://doi.org/10.1007/978-3-540-87987-9_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87986-2
Online ISBN: 978-3-540-87987-9