
Online Regret Bounds for Markov Decision Processes with Deterministic Transitions

  • Conference paper
Algorithmic Learning Theory (ALT 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5254)

Abstract

We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, we consider multi-armed bandits with switching cost.
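The abstract does not spell out the algorithm, but the following is a minimal sketch of the kind of optimism-based learner it describes, assuming a finite MDP with deterministic transitions and stochastic rewards. Everything here (the toy MDP, the episode schedule, the exact confidence bonus, and all names such as next_state and sample_reward) is illustrative rather than the paper's actual method. On a deterministic MDP the optimal average reward is attained on a maximum mean-weight cycle; for the tiny example below the optimistic policy is found by brute-force enumeration, whereas an efficient implementation would use a mean-cycle algorithm such as Karp's.

```python
import itertools
import math
import random

# Hypothetical toy deterministic MDP: 3 states, 2 actions per state.
n_states, n_actions = 3, 2
next_state = [[1, 2], [2, 0], [0, 1]]              # deterministic transitions
true_mean  = [[0.2, 0.5], [0.9, 0.1], [0.4, 0.7]]  # reward means, hidden from the learner

def sample_reward(s, a):
    # Bernoulli reward with the hidden mean of (s, a).
    return 1.0 if random.random() < true_mean[s][a] else 0.0

def average_reward(policy, reward, start):
    # On a deterministic MDP a stationary policy's trajectory enters a
    # cycle; its long-run average reward is the mean reward on that cycle.
    s, seen, path = start, {}, []
    while s not in seen:
        seen[s] = len(path)
        path.append((s, policy[s]))
        s = next_state[s][policy[s]]
    cycle = path[seen[s]:]
    return sum(reward[u][a] for u, a in cycle) / len(cycle)

counts = [[0] * n_actions for _ in range(n_states)]
means  = [[0.0] * n_actions for _ in range(n_states)]
s, t = 0, 1

for episode in range(200):
    # Optimistic reward for each (state, action): empirical mean plus a
    # UCB-style confidence bonus that shrinks as the pair is visited.
    opt = [[means[u][a] + math.sqrt(2.0 * math.log(t) / max(1, counts[u][a]))
            for a in range(n_actions)] for u in range(n_states)]
    # Optimistic policy: brute force over all stationary policies
    # (fine for a toy MDP; a mean-cycle algorithm scales better).
    policy = max(itertools.product(range(n_actions), repeat=n_states),
                 key=lambda p: average_reward(p, opt, s))
    # Follow the optimistic policy for a short episode, updating statistics.
    for _ in range(n_states + 1):
        a = policy[s]
        r = sample_reward(s, a)
        counts[s][a] += 1
        means[s][a] += (r - means[s][a]) / counts[s][a]
        s = next_state[s][a]
        t += 1

print("estimated means:", [[round(m, 2) for m in row] for row in means])
```

As the confidence bonuses decay, the optimistic policy concentrates on the true best cycle, so suboptimal state-action pairs are visited only rarely; this is the intuition behind regret that grows logarithmically in the number of steps, though the paper's actual bounds and constants come from its own analysis, not this sketch.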

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ortner, R. (2008). Online Regret Bounds for Markov Decision Processes with Deterministic Transitions. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 2008. Lecture Notes in Computer Science (LNAI), vol. 5254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87987-9_14

  • DOI: https://doi.org/10.1007/978-3-540-87987-9_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87986-2

  • Online ISBN: 978-3-540-87987-9

  • eBook Packages: Computer Science, Computer Science (R0)
