Abstract
The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm’s regret is low (high) when the prior is good (bad), little is known about the exact dependence. This paper is a first step towards answering this important question: focusing on a special yet representative case, we fully characterize the algorithm’s worst-case dependence of regret on the choice of prior. As a corollary, these results also provide useful insights into the general sensitivity of the algorithm to the choice of priors, when no structural assumptions are made. In particular, with p being the prior probability mass of the true reward-generating model, we prove \(O(\sqrt{T/p})\) and \(O(\sqrt{(1-p)T})\) regret upper bounds for the poor- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the Thompson-Sampling literature and may be useful for studying other behavior of the algorithm.
Most of this work was done when C.Y. Liu was an intern at Microsoft.
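To make the role of the prior mass p concrete, the following is a minimal sketch of Thompson Sampling over a finite set of candidate models, with prior mass p placed on the true reward-generating model. It is an illustration under stated assumptions, not the construction analysed in the paper: the Bernoulli rewards, the two-model example, and all function names (`thompson_sampling_regret`, `poor_prior_rate`, `good_prior_rate`) are hypothetical. The two rate helpers evaluate only the \(\sqrt{T/p}\) and \(\sqrt{(1-p)T}\) expressions from the abstract, with the unspecified constants omitted.

```python
import math
import random


def thompson_sampling_regret(models, prior, true_index, horizon, seed=0):
    """Run Thompson Sampling over a finite model class.

    models: list of tuples; models[i][a] is the Bernoulli mean of arm a
            under model i (assumed to lie strictly between 0 and 1).
    prior:  list of strictly positive prior masses, one per model.
    Returns the cumulative pseudo-regret against the true model.
    """
    rng = random.Random(seed)
    true_means = models[true_index]
    best_mean = max(true_means)
    # Keep the posterior in log space to avoid underflow over long horizons.
    log_post = [math.log(w) for w in prior]
    regret = 0.0
    for _ in range(horizon):
        # Sample a model from the current posterior.
        shift = max(log_post)
        weights = [math.exp(lp - shift) for lp in log_post]
        sampled = rng.choices(range(len(models)), weights=weights)[0]
        # Play the arm that is optimal under the sampled model.
        means = models[sampled]
        arm = max(range(len(means)), key=means.__getitem__)
        # Observe a Bernoulli reward drawn from the true model.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Bayes update: multiply each model's weight by its likelihood.
        for i, model in enumerate(models):
            q = model[arm]
            log_post[i] += math.log(q if reward else 1.0 - q)
        regret += best_mean - true_means[arm]
    return regret


def poor_prior_rate(p, horizon):
    """sqrt(T / p): the poor-prior rate from the abstract, constants omitted."""
    return math.sqrt(horizon / p)


def good_prior_rate(p, horizon):
    """sqrt((1 - p) T): the good-prior rate from the abstract, constants omitted."""
    return math.sqrt((1.0 - p) * horizon)


if __name__ == "__main__":
    # Hypothetical two-model, two-arm instance; model 0 is the true one.
    models = [(0.6, 0.4), (0.4, 0.6)]
    horizon = 2000
    for p in (0.01, 0.5, 0.99):
        r = thompson_sampling_regret(models, [p, 1.0 - p], 0, horizon)
        print(p, r, poor_prior_rate(p, horizon), good_prior_rate(p, horizon))
```

A single seeded run is only one sample path, so the printed regret should not be read as an estimate of the expected regret; averaging over many seeds would be needed for that. The sketch also requires 0 < p < 1, since a prior mass of exactly zero cannot be represented in log space.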
Notes
1. Note that in this paper, we do not impose any continuity structure on the reward distributions \(\nu (\theta )\) with respect to \(\theta \in \varTheta \). Therefore, when \(\varTheta \) is uncountable, the worst-case (frequentist) regret of Thompson Sampling, as defined in Eq. 1, is linear in time under most underlying models \(\theta \in \varTheta \).
Acknowledgments
We thank Sébastien Bubeck and the anonymous reviewers for helpful advice that improved the presentation of the paper.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Liu, C.Y., Li, L. (2016). On the Prior Sensitivity of Thompson Sampling. In: Ortner, R., Simon, H., Zilles, S. (eds) Algorithmic Learning Theory. ALT 2016. Lecture Notes in Computer Science, vol 9925. Springer, Cham. https://doi.org/10.1007/978-3-319-46379-7_22
DOI: https://doi.org/10.1007/978-3-319-46379-7_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46378-0
Online ISBN: 978-3-319-46379-7
eBook Packages: Computer Science (R0)