Online Regret Bounds for Markov Decision Processes with Deterministic Transitions

  • Ronald Ortner
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5254)


Abstract

We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, we consider multi-armed bandits with switching cost.
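The bandit application rests on upper confidence bound index policies: each arm is scored by its empirical mean plus an exploration bonus, and the highest-scoring arm is pulled. As an illustrative sketch only (this is the classical UCB1 rule of Auer, Cesa-Bianchi, and Fischer (2002), not the paper's deterministic-MDP algorithm; the function name and interface are invented for the example):

```python
import math

def ucb1(arms, horizon):
    """Run UCB1 for `horizon` steps.

    `arms` is a list of zero-argument callables, each returning a
    reward in [0, 1].  At step t, pull the arm i maximizing
        empirical_mean(i) + sqrt(2 * ln(t) / n_i),
    where n_i is the number of times arm i has been pulled so far.
    Returns the total reward collected and the pull counts.
    """
    counts = [0] * len(arms)       # n_i: pulls of each arm
    sums = [0.0] * len(arms)       # cumulative reward of each arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1              # initialization: pull each arm once
        else:
            i = max(range(len(arms)),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        total += r
    return total, counts
```

Because the exploration bonus shrinks like sqrt(ln t / n_i), a suboptimal arm with reward gap Δ is pulled only O(ln t / Δ²) times, which is the source of the logarithmic regret bounds the abstract refers to; adding a switching cost penalizes every change of the pulled arm, which is where the deterministic-MDP view enters.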


Keywords: Switching Cost · Markov Decision Process · Transition Graph · Average Reward · Bandit Problem




References

  1. Puterman, M.L.: Markov Decision Processes. Wiley, New York (1994)
  2. Karp, R.M.: A characterization of the minimum cycle mean in a digraph. Discrete Math. 23(3), 309–311 (1978)
  3. Dasdan, A., Gupta, R.: Faster maximum and minimum mean cycle algorithms for system performance analysis. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 17(10), 889–899 (1998)
  4. Dasdan, A., Irani, S.S., Gupta, R.K.: Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems. In: Proc. 36th DAC, pp. 37–42. ACM, New York (1999)
  5. Hartmann, M., Orlin, J.B.: Finding minimum cost to time ratio cycles with small integral transit times. Networks 23(6), 567–574 (1993)
  6. Young, N.E., Tarjan, R.E., Orlin, J.B.: Faster parametric shortest path and minimum-balance algorithms. Networks 21(2), 205–221 (1991)
  7. Madani, O.: Polynomial value iteration algorithms for deterministic MDPs. In: Proc. 18th UAI, pp. 311–318. Morgan Kaufmann, San Francisco (2002)
  8. Kearns, M.J., Singh, S.P.: Near-optimal reinforcement learning in polynomial time. Mach. Learn. 49, 209–232 (2002)
  9. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Mach. Learn. 47, 235–256 (2002)
  10. Burnetas, A.N., Katehakis, M.N.: Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22(1), 222–255 (1997)
  11. Tewari, A., Bartlett, P.L.: Optimistic linear programming gives logarithmic regret for irreducible MDPs. In: Proc. 20th NIPS (to appear)
  12. Auer, P., Ortner, R.: Logarithmic online regret bounds for undiscounted reinforcement learning. In: Proc. 19th NIPS, pp. 49–56. MIT Press, Cambridge (2006)
  13. Hunter, J.J.: Mixing times with applications to perturbed Markov chains. Linear Algebra Appl. 417, 108–123 (2006)
  14. Ortner, R.: Pseudometrics for state aggregation in average reward Markov decision processes. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 373–387. Springer, Heidelberg (2007)
  15. Cho, G.E., Meyer, C.D.: Markov chain sensitivity measured by mean first passage times. Linear Algebra Appl. 316, 21–28 (2000)
  16. Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 5, 623–648 (2004)
  17. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32, 48–77 (2002)
  18. Jun, T.: A survey on the bandit problem with switching costs. De Economist 152, 513–541 (2004)
  19. Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math. 6, 4–22 (1985)
  20. Agrawal, R., Hegde, M.V., Teneketzis, D.: Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Trans. Automat. Control 33(10), 899–906 (1988)
  21. Brezzi, M., Lai, T.L.: Optimal learning and experimentation in bandit problems. J. Econom. Dynam. Control 27, 87–108 (2002)
  22. Kleinberg, R.D.: Nearly tight bounds for the continuum-armed bandit problem. In: Proc. 17th NIPS, pp. 697–704. MIT Press, Cambridge (2004)
  23. Auer, P., Ortner, R., Szepesvári, C.: Improved rates for the stochastic continuum-armed bandit problem. In: Bshouty, N.H., Gentile, C. (eds.) COLT 2007. LNCS (LNAI), vol. 4539, pp. 454–468. Springer, Heidelberg (2007)
  24. Even-Dar, E., Kakade, S.M., Mansour, Y.: Experts in a Markov decision process. In: Proc. 17th NIPS, pp. 401–408. MIT Press, Cambridge (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ronald Ortner
    1. University of Leoben, Leoben, Austria
