Regret Bounds for Restless Markov Bandits

  • Ronald Ortner
  • Daniil Ryabko
  • Peter Auer
  • Rémi Munos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7568)


Abstract

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner’s actions. We suggest an algorithm that after T steps achieves \(\tilde{O}(\sqrt{T})\) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
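To make the problem setting concrete, the following is a minimal illustrative sketch (not the paper's algorithm) of a restless bandit environment: every arm is an irreducible Markov chain whose state advances at every time step, whether or not the learner pulls that arm, and the learner only observes the reward of the arm it chooses. The class name, transition matrices, and reward values below are hypothetical examples.

```python
import random

class RestlessBandit:
    """Illustrative restless Markov bandit environment.
    All arms' states evolve at every step, independently of the
    learner's actions; only the pulled arm's reward is observed."""

    def __init__(self, transitions, rewards, seed=0):
        # transitions[a][s] = next-state distribution of arm a in state s
        self.transitions = transitions
        # rewards[a][s] = reward of arm a when pulled in state s
        self.rewards = rewards
        self.states = [0] * len(transitions)
        self.rng = random.Random(seed)

    def step(self, arm):
        # Observe the reward of the chosen arm in its current state
        r = self.rewards[arm][self.states[arm]]
        # Every chain advances, regardless of which arm was pulled
        for a, p in enumerate(self.transitions):
            probs = p[self.states[a]]
            self.states[a] = self.rng.choices(range(len(probs)), weights=probs)[0]
        return r

# Two two-state chains; irreducibility is the paper's only assumption
bandit = RestlessBandit(
    transitions=[[[0.9, 0.1], [0.1, 0.9]],
                 [[0.5, 0.5], [0.5, 0.5]]],
    rewards=[[0.0, 1.0], [0.2, 0.8]],
)
total = sum(bandit.step(arm=0) for _ in range(1000))
```

Note that always pulling arm 0 still advances arm 1's chain, which is exactly what distinguishes the restless setting from rested Markov bandits, where unplayed arms stay frozen.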





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Ronald Ortner (1, 2)
  • Daniil Ryabko (2)
  • Peter Auer (1)
  • Rémi Munos (2)

  1. Montanuniversitaet Leoben, Austria
  2. INRIA Lille-Nord Europe, équipe SequeL, France
