Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-stationary Settings

  • Rémi Bonnefoi
  • Lilian Besson
  • Christophe Moy
  • Emilie Kaufmann
  • Jacques Palicot
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 228)


Setting up future Internet of Things (IoT) networks will require supporting more and more communicating devices. We prove that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, \(\mathrm {UCB}_1\) and Thompson Sampling, for the decentralized decision-making of Spectrum Access in IoT networks, and we study how learning performance changes as the number of intelligent end-devices grows. We show that learning algorithms do help to fit more devices in such networks, even when all end-devices are intelligent and dynamically change channels. In the studied scenario, stochastic MAB learning provides a gain of up to \(16\%\) in terms of successful transmission probabilities, and has near-optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.
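As an illustration of the two algorithms named in the abstract, the following is a minimal Python sketch of \(\mathrm {UCB}_1\) and Thompson Sampling used by a single device to pick a channel. It is not the authors' simulation: the class names, the `simulate` helper, the Bernoulli reward model (1 if a transmission is acknowledged, 0 otherwise) and the channel success probabilities are all assumptions made for this example, and the environment here is stationary, whereas the paper studies the harder case where other learning devices make it non-stationary.

```python
import math
import random


class UCB1:
    """UCB1 index policy over a fixed set of channels (illustrative sketch)."""

    def __init__(self, n_channels):
        self.pulls = [0] * n_channels      # transmissions attempted per channel
        self.successes = [0] * n_channels  # acknowledged transmissions per channel
        self.t = 0                         # total transmissions so far

    def choose(self):
        # Try every channel once before relying on the index.
        for k, n in enumerate(self.pulls):
            if n == 0:
                return k
        # Empirical mean plus exploration bonus sqrt(2 ln t / n_k).
        return max(
            range(len(self.pulls)),
            key=lambda k: self.successes[k] / self.pulls[k]
            + math.sqrt(2.0 * math.log(self.t) / self.pulls[k]),
        )

    def update(self, channel, success):
        self.t += 1
        self.pulls[channel] += 1
        self.successes[channel] += int(success)


class ThompsonSampling:
    """Thompson Sampling with a uniform Beta(1, 1) prior on each channel."""

    def __init__(self, n_channels, rng):
        self.successes = [0] * n_channels
        self.failures = [0] * n_channels
        self.rng = rng

    def choose(self):
        # Sample one plausible success rate per channel from its posterior.
        samples = [
            self.rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, channel, success):
        if success:
            self.successes[channel] += 1
        else:
            self.failures[channel] += 1


def simulate(policy, success_probs, horizon, rng):
    """Return the fraction of successful transmissions over `horizon` slots."""
    wins = 0
    for _ in range(horizon):
        k = policy.choose()
        ok = rng.random() < success_probs[k]  # hypothetical Bernoulli channel
        policy.update(k, ok)
        wins += int(ok)
    return wins / horizon


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.2, 0.5, 0.9]  # hypothetical per-channel success probabilities
    print("UCB1:", simulate(UCB1(len(probs)), probs, 5000, rng))
    print("Thompson:", simulate(ThompsonSampling(len(probs), rng), probs, 5000, rng))
```

Both policies only need the acknowledgement outcome of the device's own transmissions, which is why they suit decentralized channel selection: no coordination with the gateway or with other devices is assumed in this sketch.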


Internet of Things, Multi-Armed Bandits, Reinforcement learning, Cognitive Radio, Non-stationary bandits



This work is supported by the French National Research Agency (ANR), under the projects SOGREEN (grant code: N° ANR-14-CE28-0025-02) and BADASS (N° ANR-16-CE40-0002), by Région Bretagne, France, by the French Ministry of Higher Education and Research (MENESR), and by ENS Paris-Saclay.



Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2018

Authors and Affiliations

  1. CentraleSupélec (campus of Rennes), IETR, SCEE Team, Cesson-Sévigné, France
  2. Univ. Lille 1, CNRS, Inria, SequeL Team, UMR 9189 - CRIStAL, Lille, France
