Controlled experiments on the web: survey and practical guide

Abstract

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.

Author information

Corresponding author

Correspondence to Ron Kohavi.

Additional information

Responsible editor: R. Bayardo.

Rights and permissions

Open Access: This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Kohavi, R., Longbotham, R., Sommerfield, D. et al. Controlled experiments on the web: survey and practical guide. Data Min Knowl Disc 18, 140–181 (2009). https://doi.org/10.1007/s10618-008-0114-1

Keywords

  • Controlled experiments
  • A/B testing
  • e-commerce
  • Website optimization
  • MultiVariable Testing
  • MVT