Regret Bounds for Restless Markov Bandits

  • Ronald Ortner
  • Daniil Ryabko
  • Peter Auer
  • Rémi Munos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7568)


We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner’s actions. We suggest an algorithm that after T steps achieves \(\tilde{O}(\sqrt{T})\) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
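The setting described above can be illustrated with a small simulation. The sketch below is illustrative only and is not the paper's algorithm: each arm is a two-state Markov chain whose state advances at every time step whether or not the arm is pulled (the "restless" property), and the learner observes only the pulled arm's reward. All class names and parameters are assumptions made for the example.

```python
import random

class RestlessArm:
    """Two-state Markov chain. The state evolves at every time step,
    independently of the learner's actions (the 'restless' property)."""
    def __init__(self, p_switch, rewards, seed=None):
        self.p_switch = p_switch   # probability of switching state each step
        self.rewards = rewards     # reward paid in state 0 and state 1
        self.state = 0
        self.rng = random.Random(seed)

    def step(self):
        # The chain moves regardless of whether this arm was pulled.
        if self.rng.random() < self.p_switch:
            self.state = 1 - self.state

    def reward(self):
        return self.rewards[self.state]

def play(arms, policy, T):
    """Run T steps: the learner collects the pulled arm's reward,
    then every chain (played or not) advances one step."""
    total = 0.0
    for t in range(T):
        a = policy(t)
        total += arms[a].reward()
        for arm in arms:
            arm.step()
    return total

# Two arms with different switching probabilities and reward profiles.
arms = [RestlessArm(0.1, (0.0, 1.0), seed=0),
        RestlessArm(0.5, (0.2, 0.8), seed=1)]
total = play(arms, lambda t: t % 2, 1000)  # naive round-robin policy
print(total)
```

Because unplayed chains keep evolving, the learner's observations of an arm are separated by random gaps, which is why standard bandit index policies need not apply here.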


Keywords: Markov Chain, Cognitive Radio, Optimal Policy, Partially Observable Markov Decision Process, Average Reward





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Ronald Ortner (1, 2)
  • Daniil Ryabko (2)
  • Peter Auer (1)
  • Rémi Munos (2)
  1. Montanuniversitaet Leoben, Austria
  2. INRIA Lille-Nord Europe, équipe SequeL, France
