Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-stationary Settings

  • Conference paper
Cognitive Radio Oriented Wireless Networks (CrownCom 2017)


Setting up future Internet of Things (IoT) networks will require supporting more and more communicating devices. We prove that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, \(\mathrm {UCB}_1\) and Thompson Sampling, in handling the decentralized decision-making of Spectrum Access in IoT networks, and we study how learning performance evolves with a growing number of intelligent end-devices. We show that using learning algorithms does help to fit more devices in such networks, even when all end-devices are intelligent and dynamically change channels. In the studied scenario, stochastic MAB learning provides up to a \(16\%\) gain in terms of successful transmission probabilities, and has near-optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.
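To make the two algorithms named in the abstract concrete, here is a minimal sketch of how a single device could pick a channel with \(\mathrm {UCB}_1\) or Thompson Sampling. It is not the paper's simulator. It assumes a toy stationary setting where each channel yields an acknowledged transmission with a fixed probability. The class names, the `simulate` helper and the `success_probs` values are hypothetical and chosen for illustration only.

```python
import math
import random


class UCB1:
    """UCB1 index policy (Auer et al., 2002) for 0/1 rewards."""

    def __init__(self, n_channels):
        self.pulls = [0] * n_channels      # times each channel was tried
        self.successes = [0] * n_channels  # acknowledged transmissions
        self.t = 0                         # number of decisions so far

    def choose(self):
        self.t += 1
        # Try every channel once before relying on the index.
        for k, n in enumerate(self.pulls):
            if n == 0:
                return k

        def index(k):
            mean = self.successes[k] / self.pulls[k]
            bonus = math.sqrt(2 * math.log(self.t) / self.pulls[k])
            return mean + bonus

        return max(range(len(self.pulls)), key=index)

    def update(self, channel, reward):
        self.pulls[channel] += 1
        self.successes[channel] += reward


class ThompsonSampling:
    """Thompson Sampling with a uniform Beta(1, 1) prior per channel."""

    def __init__(self, n_channels, rng):
        self.a = [1] * n_channels  # 1 + observed successes
        self.b = [1] * n_channels  # 1 + observed failures
        self.rng = rng

    def choose(self):
        # Sample one plausible success rate per channel, play the best one.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, channel, reward):
        self.a[channel] += reward
        self.b[channel] += 1 - reward


def simulate(policy, success_probs, horizon, rng):
    """Return the fraction of successful transmissions over `horizon` slots."""
    wins = 0
    for _ in range(horizon):
        k = policy.choose()
        reward = 1 if rng.random() < success_probs[k] else 0
        policy.update(k, reward)
        wins += reward
    return wins / horizon


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.2, 0.5, 0.9]  # hypothetical per-channel success rates
    print("UCB1:", simulate(UCB1(len(probs)), probs, 5000, rng))
    print("Thompson:", simulate(ThompsonSampling(len(probs), rng), probs, 5000, rng))
```

Both policies only use the device's own acknowledgements, which is what makes them suitable for decentralized decisions. The non-stationary, multi-device behaviour studied in the paper is not reproduced by this single-device sketch.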


  1. In the experiments below, p is about \(10^{-3}\), because in a crowded network p should be smaller than \(N_c / (S + D)\) for all devices to communicate successfully (on average).

  2. This optimal policy needs an oracle that sees the entire system and assigns all the dynamic devices once and for all, in order to avoid any signaling overhead.

  3. We tried similar experiments with other values for \(N_c\) and for this repartition vector, and results were similar for non-homogeneous repartitions. Clearly, the problem is less interesting for a homogeneous repartition, as all channels appear the same to dynamic devices, and so even with D small in comparison to S, the system behaves as in Fig. 2d, where the performances of the five approaches are very close.
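The bound in the first note can be checked with a short calculation. Reading p as the per-slot emission probability of each device (an assumption, since its definition lies outside this excerpt), the expected number of transmissions per slot is \(p (S + D)\), and requiring it to stay below the \(N_c\) available channels gives \(p < N_c / (S + D)\). The sizes below are hypothetical and only illustrate the order of magnitude.

```python
def max_emission_probability(n_channels, n_static, n_dynamic):
    """Upper bound N_c / (S + D) on the emission probability p."""
    return n_channels / (n_static + n_dynamic)


# Hypothetical network sizes, not taken from the paper.
bound = max_emission_probability(n_channels=10, n_static=1000, n_dynamic=1000)
p = 1e-3  # value of p quoted in the first note
print(bound, p < bound)
```

With these assumed sizes, a p of about \(10^{-3}\) sits below the bound, which is consistent with the note.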




This work is supported by the French National Research Agency (ANR), under the projects SOGREEN (grant code: N° ANR-14-CE28-0025-02) and BADASS (N° ANR-16-CE40-0002), by Région Bretagne, France, by the French Ministry of Higher Education and Research (MENESR) and ENS Paris-Saclay.

Corresponding author

Correspondence to Rémi Bonnefoi.

Copyright information

© 2018 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

Cite this paper

Bonnefoi, R., Besson, L., Moy, C., Kaufmann, E., Palicot, J. (2018). Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-stationary Settings. In: Marques, P., Radwan, A., Mumtaz, S., Noguet, D., Rodriguez, J., Gundlach, M. (eds) Cognitive Radio Oriented Wireless Networks. CrownCom 2017. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 228. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76206-7

  • Online ISBN: 978-3-319-76207-4

  • eBook Packages: Computer Science, Computer Science (R0)
