Abstract
We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, we consider multi-armed bandits with switching cost.
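To make the setting concrete: in an MDP with deterministic transitions, every stationary policy eventually settles into a cycle of states, so the optimal average reward equals the maximum mean weight of a cycle in the underlying graph. The Python sketch below is a hypothetical illustration, not the paper's actual algorithm: it inflates each arc's empirical mean reward by a UCB1-style bonus (the paper's confidence terms may differ) and evaluates the resulting optimistic gain with Karp's maximum mean cycle recurrence. Rewards are assumed to lie in [0, 1], and the function names are illustrative.

```python
import math

def max_mean_cycle(n, edges):
    """Karp's recurrence for the maximum mean cycle of a digraph.

    n: number of states (labelled 0..n-1).
    edges: list of arcs (u, v, w) with weight w.
    Returns the maximum average arc weight over all cycles,
    or None if the graph is acyclic.
    """
    NEG = float("-inf")
    # F[k][v] = maximum total weight of a walk of length k ending in v
    F = [[NEG] * n for _ in range(n + 1)]
    for v in range(n):
        F[0][v] = 0.0
    for k in range(1, n + 1):
        for u, v, w in edges:
            if F[k - 1][u] > NEG and F[k - 1][u] + w > F[k][v]:
                F[k][v] = F[k - 1][u] + w
    best = None
    for v in range(n):
        if F[n][v] == NEG:
            continue  # no walk of length n ends here
        # Karp: max mean cycle = max_v min_k (F[n][v] - F[k][v]) / (n - k)
        val = min((F[n][v] - F[k][v]) / (n - k)
                  for k in range(n) if F[k][v] > NEG)
        best = val if best is None else max(best, val)
    return best

def optimistic_gain(n, arcs, counts, reward_sums, t):
    """Optimistic average reward after t steps (hypothetical sketch).

    Each arc (a state-action pair) gets its empirical mean reward plus
    a UCB1-style exploration bonus; the optimistically optimal policy
    then corresponds to a maximum mean cycle under the inflated weights.
    """
    edges = []
    for u, v in arcs:
        c = counts.get((u, v), 0)
        if c == 0:
            index = 1.0  # unvisited arc: most optimistic value in [0, 1]
        else:
            mean = reward_sums[(u, v)] / c
            index = mean + math.sqrt(2.0 * math.log(t) / c)
        edges.append((u, v, index))
    return max_mean_cycle(n, edges)
```

A learner of this kind would typically recompute the optimistic maximum mean cycle from time to time and steer onto that cycle; logarithmic regret bounds of the kind stated in the abstract quantify how quickly the shrinking bonuses close the gap to an (ε-)optimal policy.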
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ortner, R. (2008). Online Regret Bounds for Markov Decision Processes with Deterministic Transitions. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) Algorithmic Learning Theory. ALT 2008. Lecture Notes in Computer Science, vol. 5254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87987-9_14
DOI: https://doi.org/10.1007/978-3-540-87987-9_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87986-2
Online ISBN: 978-3-540-87987-9