Exploiting History Data for Nonstationary Multi-armed Bandit

Part of the Lecture Notes in Computer Science book series (LNAI, volume 12975)

Abstract

The Multi-armed Bandit (MAB) framework has been applied successfully in many application fields. In recent years, active approaches to the nonstationary MAB setting, i.e., algorithms capable of detecting changes in the environment and automatically reconfiguring in response, have widened the areas of application of MAB techniques. However, such approaches have the drawback of not reusing information in settings where the same environment conditions recur over time. This paper presents a framework to integrate past information in the abruptly changing nonstationary setting, which allows active MAB approaches to recover quickly from changes. The proposed framework relies on well-known break-point prediction methods to identify the instant at which the environment changed in the past, and on a definition of recurring concepts specific to the MAB setting to reuse information from recurring MAB states when necessary. We show that this framework does not change the order of the regret suffered by the active approaches commonly used in the bandit field. Finally, we provide an extensive experimental analysis on both synthetic and real-world data, showing the improvement provided by our framework.
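To make the idea of locating a past break-point concrete, the following is a minimal sketch of a maximum-likelihood estimator for a single change in a sequence of Bernoulli rewards. It is a generic textbook technique given purely as an illustration, not necessarily the break-point prediction method used in the paper; the function names `bernoulli_loglik` and `estimate_break_point` are hypothetical.

```python
import math


def bernoulli_loglik(successes, n):
    """Maximised Bernoulli log-likelihood of a segment with `successes` ones out of `n`."""
    # Degenerate segments (empty, all zeros, all ones) have log-likelihood 0.
    if n == 0 or successes == 0 or successes == n:
        return 0.0
    p = successes / n
    return successes * math.log(p) + (n - successes) * math.log(1 - p)


def estimate_break_point(rewards):
    """Return the split index k (1 <= k < len(rewards)) maximising the
    two-segment likelihood of rewards[:k] and rewards[k:].

    Assumption (illustrative only): exactly one abrupt change occurred.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(rewards)
    best_k, best_ll = 1, -math.inf
    left = 0  # running count of ones in rewards[:k]
    for k in range(1, n):
        left += rewards[k - 1]
        ll = bernoulli_loglik(left, k) + bernoulli_loglik(total - left, n - k)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k
```

For example, `estimate_break_point([0] * 50 + [1] * 50)` places the change at index 50. Once the change instant is located, only the samples collected after it would be treated as belonging to the current phase.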

Keywords

  • Multi-armed bandit
  • Non-stationary MAB
  • Break-point prediction
  • Recurring concepts

Notes

  1. The extension to other finite-support distributions is straightforward, and the theoretical results provided here remain valid.

  2. Since we consider Bernoulli rewards, having the same expected value also implies having the same distribution. This definition can easily be generalized to handle other distributions by requiring that the distribution repeats over different phases.
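As an illustration of note 2, the sketch below checks whether the empirical means of one arm in two phases are statistically compatible, using Hoeffding confidence intervals. This is an assumption made for illustration, not the test used in the paper; the names `hoeffding_radius` and `phases_compatible` and the default `delta` are hypothetical. Note that the check only fails to reject equality of the means; it is not a formal equivalence test.

```python
import math


def hoeffding_radius(n, delta):
    """Half-width of a two-sided Hoeffding interval for a mean of n samples in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def phases_compatible(successes_a, n_a, successes_b, n_b, delta=0.05):
    """Return True if the two empirical Bernoulli means could share one expected value.

    Illustrative rule: the gap between the means must not exceed the sum of
    the two confidence radii, i.e. the two intervals overlap.
    """
    mean_a = successes_a / n_a
    mean_b = successes_b / n_b
    gap = abs(mean_a - mean_b)
    return gap <= hoeffding_radius(n_a, delta) + hoeffding_radius(n_b, delta)
```

With 100 pulls per phase, 50 versus 52 successes are judged compatible, while 10 versus 90 are not.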

Author information

Corresponding author

Correspondence to Francesco Trovò.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Re, G., Chiusano, F., Trovò, F., Carrera, D., Boracchi, G., Restelli, M. (2021). Exploiting History Data for Nonstationary Multi-armed Bandit. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12975. Springer, Cham. https://doi.org/10.1007/978-3-030-86486-6_4

  • DOI: https://doi.org/10.1007/978-3-030-86486-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86485-9

  • Online ISBN: 978-3-030-86486-6

  • eBook Packages: Computer Science (R0)