Abstract
In this article, we consider a sequential sampling scheme for efficient estimation of the difference between the means of two independent treatments when the population variances are unequal across groups. The sampling scheme proposed is based on a solution to bandit problems called Thompson sampling. While this approach is most often used to maximize the cumulative payoff over competing treatments, we show that the same method can also be used to balance exploration and exploitation when the aim of the experimenter is to efficiently increase estimation precision. We introduce this novel design optimization method and, by simulation, show its effectiveness.
Notes
The uniform Beta(1, 1) prior is not the most uninformative prior; higher variance Beta priors contain less information (see Zhu & Lu, 2004). However, it is maximally uninformative given the assumption of binary outcomes.
The above method of addressing the explore–exploit trade-off in the two-arm binomial bandit case is easy to implement in currently standard analysis packages such as the [R] language for statistical computing. Appendix 1 contains an example implementation of the two-armed bandit problem described in this section.
In the unlikely event that \( X_A = X_B \), we randomly choose one treatment with equal probability and subsequently assign that treatment to the remaining \( M-N \) units.
Here, one could also fit a linear model with an intercept and one indicator variable. The method proceeds in the same way (and leads to the same results). However, we feel that this is less intuitive as an explanation.
Note that the above update cannot be computed from a single data point, since the sample variance is then undefined.
Note that the above update cannot be computed from a single data point, since the variance is then unknown.
References
Agrawal, S., & Goyal, N. (2011). Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797.
Agrawal, S., & Goyal, N. (2012). Further optimal regret bounds for Thompson sampling. arXiv preprint arXiv:1209.3353.
Allen, T. T., Yu, L., & Schmitz, J. (2003). An experimental design criterion for minimizing meta-model prediction errors applied to die casting process design. Journal of the Royal Statistical Society: Series C: Applied Statistics, 52(1), 103–117.
Antille, G., & Weinberg, A. (2000). A study of D-optimal designs efficiency for polynomial regression. Université de Genève, Faculté des sciences économiques et sociales, Département d’économétrie.
Atkinson, A. C., Donev, A. N., & Tobias, R. D. (2007). Optimum experimental designs, with SAS (Vol. 34). Oxford: Oxford University Press.
Audibert, J.-Y., Munos, R., & Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 1876–1902.
Auer, P., & Ortner, R. (2010). UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem. Periodica Mathematica Hungarica, 61(1), 1–11.
Bardsley, W. G., Wood, R. M. W., & Melikhova, E. M. (1996). Optimal design: A computer program to study the best possible spacing of design points for model discrimination. Computers & Chemistry, 20(2), 145–157.
Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments (Monographs on Statistics and Applied Probability). Springer.
Box, J. F. (1987). Guinness, Gosset, Fisher, and Small Samples. Statistical Science, 2(1), 45–52.
Box, G. E. P., & Hill, W. J. (1967). Discrimination among mechanistic models. Technometrics, 9(1), 57–71.
Brezzi, M., & Lai, T. L. (2000). Incomplete Learning from Endogenous Data in Dynamic Allocation. Econometrica, 68(6), 1511–1516.
Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 122–142.
Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3), 273–304.
Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (pp. 2249–2257).
Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876–879.
Dunlop, M. (2009). Paper Rejected (p > 0.05): An Introduction to the Debate on Appropriateness of Null-Hypothesis Testing. International Journal of Mobile Human Computer Interaction, 1(3), 1–8.
Garivier, A., & Cappé, O. (2011). The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. Bernoulli, 19(1), 13.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B Methodological, 41(2), 148–177.
Gittins, J., & Wang, Y. G. (1992). The learning component of dynamic allocation indices. Annals of Statistics, 20(3), 12.
Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005–1013.
Hauser, J. R., Urban, G. L., Liberali, G., & Braun, M. (2009). Website Morphing. Marketing Science, 28(2), 202–223.
Hutchinson, J. W., Kamakura, W. A., & Lynch, J. J. G. (2001). Unobserved Heterogeneity as an Alternative Explanation for “Reversal” Effects in Behavioral Research. Journal of Consumer Research, 27(3), 324–344.
Keller, G., & Rady, S. (2010). Strategic experimentation with Poisson bandits. Theoretical Economics, 5(2), 275–311.
Kruschke, J. K. (2012). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 31, 1–33.
Kuck, H., de Freitas, N., & Doucet, A. (2006). SMC samplers for Bayesian optimal nonlinear design. In Nonlinear Statistical Signal Processing Workshop, 2006 IEEE (pp. 99–102). IEEE.
Lai, T. L. (1987). Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. The Annals of Statistics, 15(3), 1091–1114.
Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.
McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2(1), 3–19.
McCloskey, D. N., & Ziliak, S. T. (2009). The Unreasonable Ineffectiveness of Fisherian ‘Tests’ in Biology, and Especially in Medicine. Biological Theory, 4(1), 44–53.
Meehl, P. E. (1967). Theory testing in psychology and physics: a methodological paradox. Philosophy of Science, 34(74), 103–115.
Myung, J., & Pitt, M. (2009a). Bayesian adaptive optimal design of psychology experiments. In Proceedings of the 2nd International Workshop in Sequential Methodologies (IWSM2009) (pp. 1–6).
Myung, J. I., & Pitt, M. A. (2009b). Optimal experimental design for model discrimination. Psychological Review, 116(3), 499.
Myung, J. I., Cavagnaro, D. R., & Pitt, M. A. (2013). A tutorial on adaptive design optimization. Journal of Mathematical Psychology, 57(3), 53–67.
Niño-Mora, J. (2007). A (2/3)n³ fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS Journal on Computing, 19(4), 596–606.
Nowak, M., & Sigmund, K. (1993). A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner’s Dilemma game. Nature, 364(6432), 56–58.
O’Brien, T. E., & Funk, G. M. (2003). A gentle introduction to optimal design for regression models. The American Statistician, 57(4), 265–267.
Ortega, P., & Braun, D. (2013). Generalized Thompson sampling for sequential decision-making and causal inference. arXiv preprint arXiv:1303.4431.
Press, W. H. (2009). Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research. Proceedings of the National Academy of Sciences of the United States of America, 106(52), 22387–22392.
Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers (pp. 169–177). Springer.
Rosenthal, R., Cooper, H., & Hedges, L. (1994). Parametric measures of effect size. In The Handbook of Research Synthesis (pp. 231–244).
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.
Scott, S. L. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6), 639–658.
Shieh, G. (2013). On using a pilot sample variance for sample size determination in the detection of differences between two means: Power consideration. Psicológica, 34, 125–143.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366.
Sonin, I. M. (2008). A generalized Gittins index for a Markov chain and its recursive calculation. Statistics & Probability Letters, 78(12), 1526–1533.
Steyvers, M., Lee, M. D., & Wagenmakers, E.-J. (2009). A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology, 53(3), 168–179.
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. MIT Press.
Thompson, W. R. (1933). On the Likelihood that one Unknown Probability Exceeds Another in view of the Evidence of two Samples. Biometrika, 25(3–4), 285–294.
Vermorel, J., & Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. In Machine Learning: ECML 2005 (pp. 437–448). Springer.
Wagenmakers, E., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432.
Welch, B. (1951). On the comparison of several mean values: an alternative approach. Biometrika, 38(3–4), 330–336.
Whittle, P. (1973). Some General Points in the Theory of Optimal Experimental Design. Journal of the Royal Statistical Society: Series B Methodological, 35(1), 123–130.
Whittle, P. (1980). Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society: Series B Methodological, 42(2), 143–149.
Wilcox, R. R. (1981). A Review of the Beta-Binomial Model and Its Extensions. Journal of Educational Statistics, 6, 3–32.
Zhang, S., & Lee, M. D. (2010). Optimal experimental design for a class of bandit problems. Journal of Mathematical Psychology, 54(6), 499–508.
Zhu, M., & Lu, A. Y. (2004). The Counter-intuitive Non-informative Prior for the Bernoulli Family. Journal of Statistics Education, 12(2), 1–10.
Appendices
Appendix 1 [R]-code for the binomial model
In this section we describe how one can use the [R] language for statistical computing to decide on the allocation of subjects using Thompson sampling in the two-armed binomial bandit case (i.e., a case in which two treatments with binary outcomes are compared).
To set up the problem, we need to specify our prior beliefs regarding the effectiveness of each treatment A and B. For the prior, we use a Beta(1, 1) distribution, independently for each treatment. The density of this prior belief is displayed in the top row of Fig. 1 and places equal weight on all possible outcomes. We first set up two lists to store the initial parameters:
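A sketch of this initialization (the original listing is not reproduced here, so object names are illustrative) might look as follows; `alpha` counts successes and `beta` counts failures, each starting at the Beta(1, 1) prior value:

```r
# Illustrative initialization of the two treatments' Beta parameters.
# Beta(1, 1) is the uniform prior: one pseudo-success, one pseudo-failure.
betaA <- list(alpha = 1, beta = 1)
betaB <- list(alpha = 1, beta = 1)
```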
Next, to decide which treatment to select, we perform a random draw from each of the two Beta posteriors, \( Beta_A(\cdot) \) and \( Beta_B(\cdot) \):
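In [R], one way to implement these two posterior draws (assuming the illustrative parameter lists set up above) uses `rbeta`:

```r
betaA <- list(alpha = 1, beta = 1)  # illustrative prior parameters
betaB <- list(alpha = 1, beta = 1)
# Thompson sampling step: one random draw from each Beta posterior.
draw.A <- rbeta(1, betaA$alpha, betaA$beta)
draw.B <- rbeta(1, betaB$alpha, betaB$beta)
```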
Then, to select which treatment to allocate our next subject to (or “which arm to play” in the canonical version of the problem), we compare our two draws and select the highest:
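A minimal sketch of this comparison (with made-up example draws; `allocate.to` is the variable name used throughout the appendices):

```r
draw.A <- 0.42  # example posterior draws
draw.B <- 0.67
# which.max returns the index of the largest draw:
# 1 selects treatment A, 2 selects treatment B.
allocate.to <- which.max(c(draw.A, draw.B))
```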
If allocate.to = 1, we allocate the next subject to treatment A, and if allocate.to = 2, we allocate the next subject to treatment B. After assigning the subject to a treatment, the response is observed, and the prior belief is updated with the data (either 0 for a failure or 1 for a success). A simple [R] function suffices:
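Since the original listing is not shown here, the following is a sketch of such an update function (the name `update.beta` is illustrative); it implements the conjugate Beta-binomial update, in which a success increments alpha and a failure increments beta:

```r
# Conjugate Beta-binomial update: outcome is 1 (success) or 0 (failure).
update.beta <- function(params, outcome) {
  params$alpha <- params$alpha + outcome
  params$beta  <- params$beta + (1 - outcome)
  params
}
updated <- update.beta(list(alpha = 1, beta = 1), outcome = 1)
```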
Suppose that we allocated a subject to treatment B (allocate.to = 2) and we observed a success; then, we update our belief about treatment B:
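With the illustrative update function sketched above, this amounts to:

```r
update.beta <- function(params, outcome) {  # as sketched above
  params$alpha <- params$alpha + outcome
  params$beta  <- params$beta + (1 - outcome)
  params
}
betaB <- list(alpha = 1, beta = 1)
betaB <- update.beta(betaB, outcome = 1)  # a success observed on B
```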
After updating our belief, we decide on the allocation of subject 2:
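The decision repeats the sampling step, now with the updated parameters for treatment B (a sketch with illustrative names):

```r
betaA <- list(alpha = 1, beta = 1)  # still at its prior
betaB <- list(alpha = 2, beta = 1)  # updated after B's success
draw.A <- rbeta(1, betaA$alpha, betaA$beta)
draw.B <- rbeta(1, betaB$alpha, betaB$beta)
allocate.to <- which.max(c(draw.A, draw.B))
```

Because treatment B's posterior now has more mass on higher success probabilities, B is more likely (though not certain) to be selected for subject 2.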
Appendix 2 [R]-code for the normal optimization
Implementing the normal-inverse-\( \chi^2 \) model takes a bit more code than the binomial update described in Appendix 1. However, more and more packages are available that allow researchers to update prior beliefs “off-the-shelf” for all kinds of distributions (see, e.g., the LearnBayes package). For the reference prior used in the article, the update can be done using only two simple [R] functions:
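A sketch of two such functions (names are illustrative, not the article's listing): under the reference prior p(mu, sigma²) proportional to 1/sigma², the posterior of sigma² is scaled-inverse-chi-squared with n − 1 degrees of freedom, and mu given sigma² is normal around the sample mean:

```r
# Draw sigma^2 from its scaled-inverse-chi-squared posterior:
# (n - 1) * s^2 / chisq(n - 1). Undefined for n = 1, since var(y) is
# then NA (see Note 5).
draw.sigma2 <- function(y) {
  n <- length(y)
  (n - 1) * var(y) / rchisq(1, df = n - 1)
}
# Draw mu given sigma^2: normal around the sample mean.
draw.mu <- function(y, sigma2) {
  rnorm(1, mean = mean(y), sd = sqrt(sigma2 / length(y)))
}
```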
For each treatment A and B, one collects a vector of observations, which might look as follows (see Note 5):
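For instance (these values are purely illustrative; the article's observations are not shown here):

```r
# One vector of observed continuous outcomes per treatment.
y.A <- c(10.2, 11.5, 9.8, 10.9, 10.4)
y.B <- c(12.1, 11.8, 13.0, 12.4)
```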
Deciding which treatment the next subject should be allocated to, when optimizing the treatment effect, then proceeds as follows:
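Combining the illustrative helper functions and data from above, the Thompson sampling step draws one posterior mean per treatment and plays the arm with the higher draw:

```r
draw.sigma2 <- function(y) (length(y) - 1) * var(y) / rchisq(1, df = length(y) - 1)
draw.mu <- function(y, sigma2) rnorm(1, mean(y), sqrt(sigma2 / length(y)))
y.A <- c(10.2, 11.5, 9.8, 10.9, 10.4)  # illustrative data
y.B <- c(12.1, 11.8, 13.0, 12.4)
# One draw from each posterior of the mean; allocate to the highest.
mu.A <- draw.mu(y.A, draw.sigma2(y.A))
mu.B <- draw.mu(y.B, draw.sigma2(y.B))
allocate.to <- which.max(c(mu.A, mu.B))
```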
The process is repeated after the new data point is observed and added to the vector of observations for the treatment that was selected.
Appendix 3 [R]-code for optimal experiment
The step from optimization of the treatment effect as described in Appendix 2 to minimization of the estimation error is simple, and the same [R] functions can be used (see Appendix 2, code lines 15–24).
Again, for each treatment A and B, one collects a vector of observations, which might look as follows (see Note 6):
Deciding on the next treatment for minimization of the estimation error now entails first obtaining a draw from the posterior variances,
and next deciding which treatment should be selected to minimize the \( SE\left(\widehat{\delta}\right) \):
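One way to implement this comparison (a sketch under the illustrative names used above, not the article's listing): since \( SE{\left(\widehat{\delta}\right)}^2 = \sigma_A^2/n_A + \sigma_B^2/n_B \), allocating one more unit to treatment k reduces the corresponding term by \( \sigma_k^2/\left({n}_k\left({n}_k+1\right)\right) \), so we allocate to the treatment with the larger expected reduction under the posterior draws:

```r
draw.sigma2 <- function(y) (length(y) - 1) * var(y) / rchisq(1, df = length(y) - 1)
y.A <- c(10.2, 11.5, 9.8, 10.9, 10.4)  # illustrative data
y.B <- c(12.1, 11.8, 13.0, 12.4)
# Draws from the posterior variances of the two treatments.
s2.A <- draw.sigma2(y.A)
s2.B <- draw.sigma2(y.B)
n.A <- length(y.A); n.B <- length(y.B)
# Adding one unit to treatment k lowers SE(delta-hat)^2 by
# s2.k / (n.k * (n.k + 1)); allocate where the reduction is largest.
allocate.to <- which.max(c(s2.A / (n.A * (n.A + 1)),
                           s2.B / (n.B * (n.B + 1))))
```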
Here, the interpretation of the allocate.to variable is the same as in Appendices 1 and 2.
Kaptein, M. The use of Thompson sampling to increase estimation precision. Behav Res 47, 409–423 (2015). https://doi.org/10.3758/s13428-014-0480-0