Motivation and Background
Many good resources motivate and explain online controlled experiments (Kohavi et al. 2009a, 2020; Thomke 2020; Luca and Bazerman 2020; Georgiev 2018, 2019; Kohavi and Thomke 2017; Siroker and Koomen 2013; Goward 2012; Schrage 2014; King et al. 2017; McFarland 2012; Manzi 2012; Tang et al. 2010). For organizations running online controlled experiments at scale, Gupta et al. (2019) catalog a set of advanced challenges.
We provide a motivating visual example of a controlled experiment that ran at Microsoft’s Bing. The team wanted to add a feature allowing advertisers to provide links to the target site. The rationale was that this would improve ad quality by giving users more information about what the advertiser’s site provides and by letting users navigate directly to the sub-category matching their intent. Visuals of the existing ad layout (Control) and the new ad layout (Treatment) with site links added are shown in Fig. 1.
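An experiment like this is typically analyzed by comparing a success metric (e.g., click-through rate on the ad block) between Control and Treatment. A minimal sketch of such an analysis using a two-proportion z-test — the function name and all numbers here are hypothetical, not from the Bing experiment:

```python
# Hedged illustration: comparing a binary metric between the two variants
# of a controlled experiment with a pooled two-proportion z-test.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_c, n_c, success_t, n_t):
    """Return (absolute lift, z statistic, two-sided p-value) for Treatment vs Control."""
    p_c = success_c / n_c                # Control conversion rate
    p_t = success_t / n_t                # Treatment conversion rate
    p_pool = (success_c + success_t) / (n_c + n_t)   # pooled rate under H0: no effect
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided normal tail
    return p_t - p_c, z, p_value

# Hypothetical counts: 10,000 users per variant, clicks on the ad block.
lift, z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"lift={lift:.4f} z={z:.3f} p={p:.4f}")
```

With these made-up counts the observed lift is 1 percentage point and the p-value falls below the conventional 0.05 threshold, so the difference would be declared statistically significant.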
Recommended Reading
ACM (2018) ACM code of ethics and professional conduct. June 22. https://www.acm.org/code-of-ethics
Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. PNAS 113:7353–7360. https://doi.org/10.1073/pnas.1510489113
Benbunan-Fich R (2017) The ethics of online research with unsuspecting users: from A/B testing to C/D experimentation. Res Ethics 13(3–4):200–218. https://doi.org/10.1177/1747016116680664
Biau DJ, Jolles BM, Porcher R (2010) P value and the theory of hypothesis testing. Clin Orthop Relat Res 468(3):885–892
Bickel PJ, Doksum KA (1981) An analysis of transformations revisited. J Am Stat Assoc 76(374):296–311. https://doi.org/10.1080/01621459.1981.10477649
Blank SG (2005) The four steps to the epiphany: successful strategies for products that win. Cafepress.com
Box GEP, Stuart Hunter J, Hunter WG (2005) Statistics for experimenters: design, innovation, and discovery, 2nd edn. Wiley
Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, New York
Brooks Bell (2015) Click summit 2015 keynote presentation. Brooks Bell. www.brooksbell.com/wp-content/uploads/2015/05/BrooksBell_ClickSummit15_Keynote1.pdf
Casella G, Berger RL (2001) Statistical inference, 2nd edn. Cengage Learning
Chen N, Liu M, Xu Y (2019) How A/B tests could go wrong: automatic diagnosis of invalid online experiments. In: WSDM ‘19 proceedings of the twelfth ACM international conference on web search and data mining. ACM, Melbourne, pp 501–509. https://dl.acm.org/citation.cfm?id=3291000
Crook T, Frasca B, Kohavi R, Longbotham R (2009) Seven pitfalls to avoid when running controlled experiments on the web. In: KDD ‘09 proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1105–1114
Deng A, Hu V (2015) Diluted treatment effect estimation for trigger analysis in online controlled experiments. WSDM 2015
Deng S, Longbotham R, Walker T, Xu Y (2011) Choice of randomization unit in online controlled experiment. In: Joint statistical meetings proceedings, pp 4866–4877
Deng A, Xu Y, Kohavi R, Walker T (2013) Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. WSDM 2013
Deng A, Lu J, Litz J (2017) Trustworthy analysis of online A/B tests: pitfalls, challenges and solutions. In: WSDM the tenth international conference on web search and data mining. Cambridge
Deng A, Knoblich U, Jiannan L (2018) Applying the delta method in metric analytics: a practical guide with novel ideas. In: 24th ACM SIGKDD conference on knowledge discovery and data mining
Dmitriev P, Frasca B, Gupta S, Kohavi R, Vaz G (2016) Pitfalls of long-term online controlled experiments. In: IEEE international conference on big data. Washington, pp 1367–1376. https://doi.org/10.1109/BigData.2016.7840744
Dmitriev P, Gupta S, Kim DW, Vaz G (2017) A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2017). ACM, Halifax, pp 1427–1436. https://doi.org/10.1145/3097983.3098024
Eckles D, Karrer B, Ugander J (2017) Design and analysis of experiments in networks: reducing bias from interference. J Causal Inference 5(1):23. https://www.deaneckles.com/misc/Eckles_Karrer_Ugander_Reducing_Bias_from_Interference.pdf
EGAP (2018) 10 things to know about heterogeneous treatment effects. EGAP: Evidence in Government and Politics. egap.org/methods-guides/10-things-heterogeneous-treatment-effects
Fabijan A, Dmitriev P, Olsson HH, Bosch J (2017) The evolution of continuous experimentation in software product development: from data to a data-driven organization at scale. In: ICSE ′17 proceedings of the 39th international conference on software engineering. IEEE Press, Buenos Aires, pp 770–780. https://doi.org/10.1109/ICSE.2017.76
Fabijan A, Dmitriev P, McFarland C, Vermeer L, Olsson HH, Bosch J (2018) Experimentation growth: evolving trustworthy A/B testing capabilities in online software companies. J Softw: Evol Process 30(12):e2113. https://doi.org/10.1002/smr.2113
Fabijan A, Gupchup J, Gupta S, Omhover J, Qin W, Vermeer L, Dmitriev P (2019) Diagnosing sample ratio mismatch in online controlled experiments: a taxonomy and rules of thumb for practitioners. In: KDD ‘19 The 25th SIGKDD international conference on knowledge discovery and data mining. ACM, Anchorage
FAT/ML (2019) Fairness, accountability, and transparency in machine learning. http://www.fatml.org/
Fieller EC (1954) Some problems in interval estimation. J R Stat Soc Ser B 16(2):175–185. JSTOR 2984043 https://www.jstor.org/stable/2984043
Freedman B (1987) Equipoise and the ethics of clinical research. N Engl J Med 317(3):141–145. https://doi.org/10.1056/NEJM198707163170304
Georgiev G (2018) Analysis of 115 A/B tests: average lift is 4%, most lack statistical power. Analytics Toolkit, June 26. http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/
Georgiev G (2019) Statistical methods in online A/B testing: statistics for data-driven business decisions and risk management in e-commerce. Independently published. https://www.abtestingstats.com/
Good PI (2005) Permutation, parametric and bootstrap tests of hypotheses, 3rd edn. Springer
Goodman S (2008) A dirty dozen: twelve P-value misconceptions. Semin Hematol. https://doi.org/10.1053/j.seminhematol.2008.04.003
Goward C (2012) You should test that: conversion optimization for more leads, sales and profit or the art and science of optimized marketing. Sybex
Greenhalgh T (2014) How to read a paper: the basics of evidence-based medicine. BMJ Books. https://www.amazon.com/gp/product/B00IPG7GLC
Gupta S, Kohavi R, Tang D, Xu Y et al (2019) Top challenges from the first practical online controlled experiments summit. In: Dong XL, Teredesai A, Zafarani R (eds) SIGKDD explorations (ACM), vol 21 (1). https://bit.ly/OCESummit1
Guyatt GH, Sackett DL, Sinclair JC, Hayward R, Cook DJ, Cook RJ (1995) Users’ guides to the medical literature: IX. A method for grading health care recommendations. J Am Med Assoc 274(22):1800–1804. https://doi.org/10.1001/jama.1995.03530220066035
Hochberg Y, Benjamini Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57(1):289–300
Imbens GW, Rubin DB (2015) Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press, Cambridge
Johari R, Pekelis L, Koomen P, Walsh D (2017) Peeking at A/B Tests. In: KDD ‘17 proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Halifax, pp 1517–1525. https://doi.org/10.1145/3097983.3097992
Katzir L, Liberty E, Somekh O (2012) Framework and algorithms for network bucket testing. In: Proceedings of the 21st international conference on world wide web. pp 1029–1036
Kaushik A (2006) Experimentation and testing: a primer. Occam’s Razor. http://www.kaushik.net/avinash/2006/05/experimentation-and-testing-a-primer.html. Accessed 22 May 2008
King R, Churchill EF, Tan C (2017) Designing with data: improving the user experience with A/B testing. O’Reilly Media
Kohavi R, Longbotham R (2010) Unexpected results in online controlled experiments. SIGKDD explorations. http://bit.ly/expUnexpected
Kohavi R, Thomke S (2017) The surprising power of online experiments. Harv Bus Rev 95:74–82. http://exp-platform.com/hbr-the-surprising-power-of-online-experiments/
Kohavi R, Longbotham R, Sommerfield D, Henne RM (2009a) Controlled experiments on the web: survey and practical guide. Data Min Knowl Disc 18:140–181. http://bit.ly/expSurvey
Kohavi R, Crook T, Longbotham R (2009b) Online experimentation at Microsoft. Third workshop on data mining case studies and practice prize. http://bit.ly/expMicrosoft
Kohavi R, Longbotham R, Walker T (2010) Online experiments: practical lessons. IEEE Computer, pp 82–85. http://bit.ly/expPracticalLessons
Kohavi R, Deng A, Frasca B, Longbotham R, Walker T, Xu Y (2012) Trustworthy online controlled experiments: five puzzling outcomes explained. In: Proceedings of the 18th conference on knowledge discovery and data mining. http://bit.ly/expPuzzling
Kohavi R, Deng A, Frasca B, Walker T, Xu Y, Pohlmann N (2013) Online controlled experiments at large scale. In: KDD 2013 proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. http://bit.ly/ExPScale
Kohavi R, Deng A, Longbotham R, Xu Y (2014) Seven rules of thumb for web site experimenters. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ‘14). http://bit.ly/expRulesOfThumb
Kohavi R, Tang D, Xu Y (2020) Trustworthy online controlled experiments: a practical guide to A/B testing. Cambridge University Press, Cambridge. https://experimentguide.com/
Kohavi R, Deng A, Vermeer L (2022) A/B testing intuition busters: common misunderstandings in online controlled experiments. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’22), pp 3168–3177. https://doi.org/10.1145/3534678.3539160
Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci 111:8788–8790. https://www.pnas.org/content/111/24/8788
Loukides M, Mason H, Patil DJ (2018) Ethics and data science. O’Reilly Media
Luca M, Bazerman MH (2020) The power of experiments: decision making in a data-driven world. The MIT Press. https://www.amazon.com/Power-Experiments-Decision-Making-Data-Driven/dp/0262043874
Malinas G, Bigelow J (2009) Simpson’s paradox. Stanford encyclopedia of philosophy. http://plato.stanford.edu/entries/paradox-simpson/
Manzi J (2012) Uncontrolled: the surprising payoff of trial-and-error for business, politics, and society. Basic Books
Martin RC (2008) Clean code: a handbook of agile software craftsmanship. Prentice Hall
McFarland C (2012) Experiment!: website conversion rate optimization with A/B and multivariate testing. New Riders
McKinley D (2013) Testing to cull the living flower. http://mcfunley.com/testing-to-cull-the-living-flower
Meyer MN (2018) Ethical considerations when companies study – and fail to study – their customers. In: Selinger E, Polonetsky J, Tene O (eds) The Cambridge handbook of consumer privacy. Cambridge University Press, Cambridge
Moran M (2007) Do it wrong quickly: how the web changes the old marketing rules. IBM Press
Moran M (2008) Multivariate testing in action: Quicken Loans’ Regis Hadiaris on multivariate testing. Biznology, December. www.biznology.com/2008/12/multivariate_testing_in_action/
Office for Human Research Protections (1991) Federal policy for the protection of human subjects (‘Common Rule’). https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html
Optimizely (2018) Optimizely maturity model. https://www.optimizely.com/maturity-model/
Peterson ET (2004) Web analytics demystified: a Marketer’s guide to understanding how your web site affects your business. Celilo Group Media and CafePress
Ries E (2011) The lean startup: how today’s entrepreneurs use continuous innovation to create radically successful businesses. Crown Business
Rubin KS (2012) Essential scrum: a practical guide to the most popular agile process. Addison-Wesley Professional
Schrage M (2014) The innovator’s hypothesis: how cheap experiments are worth more than good ideas. MIT Press
Selterman D (2014) The ethics of OKCupid’s dating experiment. https://www.luvze.com/the-ethics-of-okcupids-dating-experiment/
Siroker D, Koomen P (2013) A/B testing: the most powerful way to turn clicks into customers. Wiley
Stone JV (2013) Bayes’ rule: a tutorial introduction to Bayesian analysis. Sebtel Press
Tang D, Agarwal A, O’Brien D, Meyer M (2010) Overlapping experiment infrastructure: more, better, faster experimentation. In: KDD ‘10 the 16th ACM SIGKDD international conference on knowledge discovery and data mining
The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (1979) The Belmont report April 18. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html
Thomke SH (2020) Experimentation works: the surprising power of business experimentation. Harvard Business Review Press. https://www.amazon.com/Experimentation-Works-Surprising-Business-Experiments/dp/163369710X/
Ugander J, Karrer B, Backstrom L, Kleinberg J (2013) Graph cluster randomization: network exposure to multiple universes. In: KDD ‘13 proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 329–337
Vickers AJ (2009) What is a p-value anyway? 34 stories to help you actually understand statistics. Pearson. https://www.amazon.com/p-value-Stories-Actually-Understand-Statistics/dp/0321629302
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 113(523):1228–1242. https://doi.org/10.1080/01621459.2017.1319839
Wider Funnel (2018) The state of experimentation maturity 2018. Wider Funnel. https://www.widerfunnel.com/wp-content/uploads/2018/04/State-of-Experimentation-2018-Original-Research-Report.pdf
Xie H, Aurisset J (2016) Improving the sensitivity of online controlled experiments: case studies at Netflix. In: KDD ‘16 proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. Association for Computing Machinery, pp 645–654
Xu Y, Chen N, Fernandez A, Sinno O, Bhasin A (2015) From infrastructure to culture: A/B testing challenges in large scale social networks. In: KDD ‘15 proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Sydney, pp 2227–2236. https://doi.org/10.1145/2783258.2788602
Zhao Z, Chen M, Matheson D, Stone M (2016) Online experimentation diagnosis and troubleshooting beyond AA validation. In: DSAA 2016 IEEE international conference on data science and advanced analytics. IEEE, pp 498–507. https://ieeexplore.ieee.org/document/7796936
Copyright information
© 2023 Springer Science+Business Media, LLC, part of Springer Nature
Cite this entry
Kohavi, R., Longbotham, R. (2023). Online Controlled Experiments and A/B Tests. In: Phung, D., Webb, G.I., Sammut, C. (eds) Encyclopedia of Machine Learning and Data Science. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7502-7_891-2
Print ISBN: 978-1-4899-7502-7
Online ISBN: 978-1-4899-7502-7
Chapter history
Latest: Online Controlled Experiments and A/B Tests. Published 08 March 2023. DOI: https://doi.org/10.1007/978-1-4899-7502-7_891-2
Original: Online Controlled Experiments and A/B Testing. Published 13 May 2016. DOI: https://doi.org/10.1007/978-1-4899-7502-7_891-1