Data Mining and Knowledge Discovery, Volume 18, Issue 1, pp 140–181

Controlled experiments on the web: survey and practical guide

  • Ron Kohavi
  • Roger Longbotham
  • Dan Sommerfield
  • Randal M. Henne
Open Access
Article

Abstract

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
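As a rough illustration of two ingredients the abstract names, hash-based randomization and sample-size estimation, the following minimal Python sketch may help. It is an assumption-laden example, not the authors' implementation: the function names `assign_variant` and `sample_size_per_variant` are hypothetical, MD5 is used only as one possible hash, and the sample-size formula is the common rule of thumb n = 16σ²/Δ², which corresponds to roughly 80% power at a 5% significance level.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, n_variants: int = 2) -> int:
    """Deterministically map a user to a variant index in [0, n_variants).

    Hypothetical helper: hashing the experiment id together with the user id
    is intended to keep assignments in different experiments uncorrelated.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % n_variants


def sample_size_per_variant(sigma: float, delta: float) -> float:
    """Rule-of-thumb users needed per variant: n = 16 * sigma^2 / delta^2.

    sigma is the standard deviation of the metric; delta is the smallest
    absolute difference between variants that should be detectable.
    Assumes roughly 80% power at a 5% significance level.
    """
    return 16.0 * sigma ** 2 / delta ** 2


if __name__ == "__main__":
    # The same user always lands in the same variant of a given experiment.
    print(assign_variant("user-123", "checkout-test"))

    # Illustrative numbers: a 5% conversion rate (Bernoulli metric, so
    # sigma^2 = p * (1 - p)) and a 5% relative change (delta = 0.0025).
    p = 0.05
    sigma = (p * (1 - p)) ** 0.5
    print(round(sample_size_per_variant(sigma, delta=0.05 * p)))
```

Because the assignment depends only on the user and experiment identifiers, no per-user state needs to be stored to keep a returning user in the same variant. Whether a particular hash function yields sufficiently independent assignments across experiments is an empirical question that should be checked rather than assumed.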

Keywords

Controlled experiments, A/B testing, e-commerce, Website optimization, MultiVariable Testing, MVT

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Ron Kohavi (1)
  • Roger Longbotham (1)
  • Dan Sommerfield (1)
  • Randal M. Henne (1)
  1. Microsoft, One Microsoft Way, Redmond, USA
