Motivation and Background
Many good resources motivate and explain online controlled experiments (Kohavi et al. 2009a, 2020; Thomke 2020; Luca and Bazerman 2020; Georgiev 2018, 2019; Kohavi and Thomke 2017; Siroker and Koomen 2013; Goward 2012; Schrage 2014; King et al. 2017; McFarland 2012; Manzi 2012; Tang et al. 2010). For organizations running online controlled experiments at scale, Gupta et al. (2019) catalog a set of advanced challenges.
We provide a motivating visual example of a controlled experiment that ran at Microsoft’s Bing. The team wanted to add a feature allowing advertisers to provide links to the target site. The rationale was that this would improve ad quality by giving users more information about what the advertiser’s site provides and by letting users navigate directly to the sub-category matching their intent. Visuals of the existing ad layout (Control) and the new ad layout (Treatment) with site links added are shown in Fig. 1.
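An experiment like this is typically analyzed by comparing a success metric (e.g., click-through rate on the ad block) between Control and Treatment. A minimal sketch of such an analysis using a two-proportion z-test — the function name and all numbers here are hypothetical, not from the Bing experiment:

```python
# Hedged illustration: comparing a binary metric between the two variants
# of a controlled experiment with a pooled two-proportion z-test.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_c, n_c, success_t, n_t):
    """Return (absolute lift, z statistic, two-sided p-value) for Treatment vs Control."""
    p_c = success_c / n_c                # Control conversion rate
    p_t = success_t / n_t                # Treatment conversion rate
    p_pool = (success_c + success_t) / (n_c + n_t)   # pooled rate under H0: no effect
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided normal tail
    return p_t - p_c, z, p_value

# Hypothetical counts: 10,000 users per variant, clicks on the ad block.
lift, z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"lift={lift:.4f} z={z:.3f} p={p:.4f}")
```

With these made-up counts the observed lift is 1 percentage point and the p-value falls below the conventional 0.05 threshold, so the difference would be declared statistically significant.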
Recommended Reading
ACM (2018) ACM code of ethics and professional conduct. June 22. https://www.acm.org/code-of-ethics
Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. PNAS 113:7353–7360. https://doi.org/10.1073/pnas.1510489113
Benbunan-Fich R (2017) The ethics of online research with unsuspecting users: from A/B testing to C/D experimentation. Res Ethics 13(3–4):200–218. https://doi.org/10.1177/1747016116680664
Biau DJ, Jolles BM, Porcher R (2010) P value and the theory of hypothesis testing. Clin Orthop Relat Res 468(3):885–892
Bickel PJ, Doksum KA (1981) An analysis of transformations revisited. J Am Stat Assoc 76(374):296–311. https://doi.org/10.1080/01621459.1981.10477649
Blank SG (2005) The four steps to the epiphany: successful strategies for products that win. Cafepress.com
Box GEP, Stuart Hunter J, Hunter WG (2005) Statistics for experimenters: design, innovation, and discovery, 2nd edn. Wiley
Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, New York
Brooks Bell (2015) Click summit 2015 keynote presentation. Brooks Bell. www.brooksbell.com/wp-content/uploads/2015/05/BrooksBell_ClickSummit15_Keynote1.pdf
Casella G, Berger RL (2001) Statistical inference, 2nd edn. Cengage Learning
Chen N, Liu M, Xu Y (2019) How A/B tests could go wrong: automatic diagnosis of invalid online experiments. In: WSDM ‘19 proceedings of the twelfth ACM international conference on web search and data mining. ACM, Melbourne, pp 501–509. https://dl.acm.org/citation.cfm?id=3291000
Crook T, Frasca B, Kohavi R, Longbotham R (2009) Seven pitfalls to avoid when running controlled experiments on the web. In: KDD ‘09 proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1105–1114
Deng A, Hu V (2015) Diluted treatment effect estimation for trigger analysis in online controlled experiments. WSDM 2015
Deng S, Longbotham R, Walker T, Xu Y (2011) Choice of randomization unit in online controlled experiment. In: Joint statistical meetings proceedings, pp 4866–4877
Deng A, Xu Y, Kohavi R, Walker T (2013) Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. WSDM 2013
Deng A, Lu J, Litz J (2017) Trustworthy analysis of online A/B tests: pitfalls, challenges and solutions. In: WSDM the tenth international conference on web search and data mining. Cambridge
Deng A, Knoblich U, Jiannan L (2018) Applying the delta method in metric analytics: a practical guide with novel ideas. In: 24th ACM SIGKDD conference on knowledge discovery and data mining
Dmitriev P, Frasca B, Gupta S, Kohavi R, Vaz G (2016) Pitfalls of long-term online controlled experiments. In: IEEE international conference on big data. Washington, pp 1367–1376. https://doi.org/10.1109/BigData.2016.7840744
Dmitriev P, Gupta S, Kim DW, Vaz G (2017) A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2017). ACM, Halifax, pp 1427–1436. https://doi.org/10.1145/3097983.3098024
Eckles D, Karrer B, Ugander J (2017) Design and analysis of experiments in networks: reducing bias from interference. J Causal Inference 5(1):23. https://www.deaneckles.com/misc/Eckles_Karrer_Ugander_Reducing_Bias_from_Interference.pdf
EGAP (2018) 10 things to know about heterogeneous treatment effects. EGAP: Evidence in Government and Politics. egap.org/methods-guides/10-things-heterogeneous-treatment-effects
Fabijan A, Dmitriev P, Olsson HH, Bosch J (2017) The evolution of continuous experimentation in software product development: from data to a data-driven organization at scale. In: ICSE ′17 proceedings of the 39th international conference on software engineering. IEEE Press, Buenos Aires, pp 770–780. https://doi.org/10.1109/ICSE.2017.76
Fabijan A, Dmitriev P, McFarland C, Vermeer L, Olsson HH, Bosch J (2018) Experimentation growth: evolving trustworthy A/B testing capabilities in online software companies. J Softw: Evol Process 30(12):e2113. https://doi.org/10.1002/smr.2113
Fabijan A, Gupchup J, Gupta S, Omhover J, Qin W, Vermeer L, Dmitriev P (2019) Diagnosing sample ratio mismatch in online controlled experiments: a taxonomy and rules of thumb for practitioners. In: KDD ‘19 The 25th SIGKDD international conference on knowledge discovery and data mining. ACM, Anchorage
FAT/ML (2019) Fairness, accountability, and transparency in machine learning. http://www.fatml.org/
Fieller EC (1954) Some problems in interval estimation. J R Stat Soc Ser B 16(2):175–185. JSTOR 2984043 https://www.jstor.org/stable/2984043
Freedman B (1987) Equipoise and the ethics of clinical research. N Engl J Med 317(3):141–145. https://doi.org/10.1056/NEJM198707163170304
Georgiev G (2018) Analysis of 115 A/B tests: average lift is 4%, most lack statistical power. Analytics Toolkit, June 26. http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/
Georgiev G (2019) Statistical methods in online A/B testing: statistics for data-driven business decisions and risk management in e-commerce. Independently published. https://www.abtestingstats.com/
Good PI (2005) Permutation, parametric and bootstrap tests of hypotheses, 3rd edn. Springer
Goodman S (2008) A dirty dozen: twelve P-value misconceptions. Semin Hematol. https://doi.org/10.1053/j.seminhematol.2008.04.003
Goward C (2012) You should test that: conversion optimization for more leads, sales and profit or the art and science of optimized marketing. Sybex
Greenhalgh T (2014) How to read a paper: the basics of evidence-based medicine. BMJ Books. https://www.amazon.com/gp/product/B00IPG7GLC
Gupta S, Kohavi R, Tang D, Xu Y et al (2019) Top challenges from the first practical online controlled experiments summit. In: Dong XL, Teredesai A, Zafarani R (eds) SIGKDD explorations (ACM), vol 21 (1). https://bit.ly/OCESummit1
Guyatt GH, Sackett DL, Sinclair JC, Hayward R, Cook DJ, Cook RJ (1995) Users’ guides to the medical literature: IX. A method for grading health care recommendations. J Am Med Assoc 274(22):1800–1804. https://doi.org/10.1001/jama.1995.03530220066035
Hochberg Y, Benjamini Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57(1):289–300
Imbens GW, Rubin DB (2015) Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press, Cambridge
Johari R, Pekelis L, Koomen P, Walsh D (2017) Peeking at A/B Tests. In: KDD ‘17 proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Halifax, pp 1517–1525. https://doi.org/10.1145/3097983.3097992
Katzir L, Liberty E, Somekh O (2012) Framework and algorithms for network bucket testing. In: Proceedings of the 21st international conference on world wide web. pp 1029–1036
Kaushik A (2006) Experimentation and testing: a primer. Occam’s Razor. http://www.kaushik.net/avinash/2006/05/experimentation-and-testing-a-primer.html. Accessed 22 May 2008
King R, Churchill EF, Tan C (2017) Designing with data: improving the user experience with A/B testing. O’Reilly Media
Kohavi R, Longbotham R (2010) Unexpected results in online controlled experiments. SIGKDD explorations. http://bit.ly/expUnexpected
Kohavi R, Thomke S (2017) The surprising power of online experiments. Harv Bus Rev 95:74–82. http://exp-platform.com/hbr-the-surprising-power-of-online-experiments/
Kohavi R, Longbotham R, Sommerfield D, Henne RM (2009a) Controlled experiments on the web: survey and practical guide. Data Min Knowl Disc 18:140–181. http://bit.ly/expSurvey
Kohavi R, Crook T, Longbotham R (2009b) Online experimentation at Microsoft. Third workshop on data mining case studies and practice prize. http://bit.ly/expMicrosoft
Kohavi R, Longbotham R, Walker T (2010) Online experiments: practical lessons. IEEE Computer, pp 82–85. http://bit.ly/expPracticalLessons
Kohavi R, Deng A, Frasca B, Longbotham R, Walker T, Xu Y (2012) Trustworthy online controlled experiments: five puzzling outcomes explained. In: Proceedings of the 18th conference on knowledge discovery and data mining. http://bit.ly/expPuzzling
Kohavi R, Deng A, Frasca B, Walker T, Xu Y, Pohlmann N (2013) Online controlled experiments at large scale. In: KDD 2013 proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. http://bit.ly/ExPScale
Kohavi R, Deng A, Longbotham R, Xu Y (2014) Seven rules of thumb for web site experimenters. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ‘14). http://bit.ly/expRulesOfThumb
Kohavi R, Tang D, Xu Y (2020) Trustworthy online controlled experiments: a practical guide to A/B testing. Cambridge University Press, Cambridge. https://experimentguide.com/
Kohavi R, Deng A, Vermeer L (2022) A/B testing intuition busters: common misunderstandings in online controlled experiments. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’22), pp 3168–3177. https://doi.org/10.1145/3534678.3539160
Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci 111:8788–8790. https://www.pnas.org/content/111/24/8788
Loukides M, Mason H, Patil DJ (2018) Ethics and data science. O’Reilly Media
Luca M, Bazerman MH (2020) The power of experiments: decision making in a data-driven world. The MIT Press. https://www.amazon.com/Power-Experiments-Decision-Making-Data-Driven/dp/0262043874
Malinas G, Bigelow J (2009) Simpson’s paradox. Stanford encyclopedia of philosophy. http://plato.stanford.edu/entries/paradox-simpson/
Manzi J (2012) Uncontrolled: the surprising payoff of trial-and-error for business, politics, and society. Basic Books
Martin RC (2008) Clean code: a handbook of agile software craftsmanship. Prentice Hall
McFarland C (2012) Experiment!: website conversion rate optimization with A/B and multivariate testing. New Riders
McKinley D (2013) Testing to cull the living flower. http://mcfunley.com/testing-to-cull-the-living-flower
Meyer MN (2018) Ethical considerations when companies study – and fail to study – their customers. In: Selinger E, Polonetsky J, Tene O (eds) The Cambridge handbook of consumer privacy. Cambridge University Press, Cambridge
Moran M (2007) Do it wrong quickly: how the web changes the old marketing rules. IBM Press
Moran M (2008) Multivariate testing in action: Quicken Loans’ Regis Hadiaris on multivariate testing. Biznology, December. www.biznology.com/2008/12/multivariate_testing_in_action/
Office for Human Research Protections (1991) Federal policy for the protection of human subjects (‘Common Rule’). https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html
Optimizely (2018) Optimizely maturity model. https://www.optimizely.com/maturity-model/
Peterson ET (2004) Web analytics demystified: a Marketer’s guide to understanding how your web site affects your business. Celilo Group Media and CafePress
Ries E (2011) The lean startup: how today’s entrepreneurs use continuous innovation to create radically successful businesses. Crown Business
Rubin KS (2012) Essential scrum: a practical guide to the most popular agile process. Addison-Wesley Professional
Schrage M (2014) The innovator’s hypothesis: how cheap experiments are worth more than good ideas. MIT Press
Selterman D (2014) The ethics of OKCupid’s dating experiment. https://www.luvze.com/the-ethics-of-okcupids-dating-experiment/
Siroker D, Koomen P (2013) A/B testing: the most powerful way to turn clicks into customers. Wiley
Stone JV (2013) Bayes’ rule: a tutorial introduction to Bayesian analysis. Sebtel Press
Tang D, Agarwal A, O’Brien D, Meyer M (2010) Overlapping experiment infrastructure: more, better, faster experimentation. In: KDD ‘10 the 16th ACM SIGKDD international conference on knowledge discovery and data mining
The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (1979) The Belmont report April 18. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html
Thomke SH (2020) Experimentation works: the surprising power of business experimentation. Harvard Business Review Press. https://www.amazon.com/Experimentation-Works-Surprising-Business-Experiments/dp/163369710X/
Ugander J, Karrer B, Backstrom L, Kleinberg J (2013) Graph cluster randomization: network exposure to multiple universes. In: KDD ‘13 proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 329–337
Vickers AJ (2009) What is a p-value anyway? 34 stories to help you actually understand statistics. Pearson. https://www.amazon.com/p-value-Stories-Actually-Understand-Statistics/dp/0321629302
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 113(523):1228–1242. https://doi.org/10.1080/01621459.2017.1319839
Wider Funnel (2018) The state of experimentation maturity 2018. Wider Funnel. https://www.widerfunnel.com/wp-content/uploads/2018/04/State-of-Experimentation-2018-Original-Research-Report.pdf
Xie H, Aurisset J (2016) Improving the sensitivity of online controlled experiments: case studies at Netflix. In: KDD ‘16 proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. Association for Computing Machinery, pp 645–654
Xu Y, Chen N, Fernandez A, Sinno O, Bhasin A (2015) From infrastructure to culture: A/B testing challenges in large scale social networks. In: KDD ‘15 proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Sydney, pp 2227–2236. https://doi.org/10.1145/2783258.2788602
Zhao Z, Chen M, Matheson D, Stone M (2016) Online experimentation diagnosis and troubleshooting beyond AA validation. In: DSAA 2016 IEEE international conference on data science and advanced analytics. IEEE, pp 498–507. https://ieeexplore.ieee.org/document/7796936
Copyright information
© 2023 Springer Science+Business Media, LLC, part of Springer Nature
Cite this entry
Kohavi, R., Longbotham, R. (2023). Online Controlled Experiments and A/B Tests. In: Phung, D., Webb, G.I., Sammut, C. (eds) Encyclopedia of Machine Learning and Data Science. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7502-7_891-2
Print ISBN: 978-1-4899-7502-7
Online ISBN: 978-1-4899-7502-7
Chapter history
Latest: Online Controlled Experiments and A/B Tests. Published 08 March 2023. DOI: https://doi.org/10.1007/978-1-4899-7502-7_891-2
Original: Online Controlled Experiments and A/B Testing. Published 13 May 2016. DOI: https://doi.org/10.1007/978-1-4899-7502-7_891-1