Online ads and offline sales: measuring the effect of retail advertising via a controlled experiment on Yahoo!

Abstract

A randomized experiment with 1.6 million customers measures positive causal effects of online advertising for a major retailer. The advertising profitably increases purchases by 5%. 93% of the increase occurs in brick-and-mortar stores; 78% of the increase derives from consumers who never click the ads. Our large sample reaches the statistical frontier for measuring economically relevant effects. We improve econometric efficiency by supplementing our experimental variation with non-experimental variation caused by consumer browsing behavior. Our experiment provides a specification check for observational difference-in-differences and cross-sectional estimators; the latter exhibits a large negative bias three times the estimated experimental effect.

Notes

  1.

    Online advertising accounts for $36 billion of approximately $176 billion in total advertising spending, or roughly 21% (IAB, “2012 Annual Report,” http://www.iab.net/about_the_iab/annual_report); 5.5% of all retail purchases are made online (US Census Bureau, “Quarterly Retail e-Commerce Sales: 1st Quarter 2013,” http://www.census.gov/retail/mrts/www/data/pdf/ec_current.pdf).

  2.

    $176 billion is spent on advertising according to the 2012 IAB report; 2012 US GDP was $15.7 trillion according to the US Census Bureau. See footnote 1 for more details.

  3.

    By contrast, experiments are more common in direct-response advertising. Direct-mail advertisers have a culture of randomizing aspects of the mailings—even minute details such as ink color and envelope size. We were fortunate to have a partner at our retailer who previously worked in retail catalog mailings and was therefore familiar with the benefits of experimentation.

  4.

    Allaire (1975) pointed out that the authors had failed to quantify the uncertainty in their estimates, and their interesting effects turned out to be statistically insignificant.

  5.

    By contrast, direct-response advertising may produce more statistical power in experiments than brand advertising, because the ads are more salient (higher signal) and produce more immediate responses (less noise). This may explain why direct-mail marketers are more likely to engage in experimentation than other advertisers (see footnote 3). Recent examples in the academic literature include Simester et al. (2009), who experimentally vary the frequency of catalog mailings; Bertrand et al. (2010), who vary the ad creative and the interest rate in loan offers by mail; and Ghose and Yang (2009), who measure the impact of sponsored search advertisements on total clicks for the advertiser on the search-results page.

  6.

    For a survey of empirical and theoretical work on the economics of advertising, see Bagwell (2008). DellaVigna and Gentzkow (2010) review empirical work on persuasion more generally, including advertising and other communication to charitable donors, voters, investors, and retail consumers.

  7.

    Other panel models serve a similar role; we use DID for its conceptual simplicity.

  8.

    The retailer selected for the match a subset of their customers to whom they wished to advertise. We do not have precise information about their selection rule.

  9.

    Early drafts of this paper also examined a third campaign, whose analysis required an imperfect data merge. For improved data reliability and simplicity of exposition we now choose to omit all references to the third campaign.

  10.

    The industry uses “impressions” as the standard accounting unit to refer to “online ads that loaded in a web page on the recipient’s computer.” While there is no guarantee that a given impression rendered on a user’s screen or received any visual attention by the internet user, for simplicity of exposition and in accordance with industry practice we use the words “impression,” “exposure,” and “view” synonymously.

  11.

    Dell was not the retailer in this experiment—the retailer prefers anonymity.

  12.

    Only one difference between treatment groups in this table is statistically significant. The mean number of Yahoo! page views was 363 for the treatment group versus 358 for the control, a statistically significant difference (p=0.0016). This difference is rather small in magnitude and largely driven by outliers: almost all of the top 30 page viewers ended up being assigned to the treatment group. If we trim the top 250 out of 1.6 million individuals from the dataset (that is, remove all bot-like users with 12,000 or more page views in two weeks), the difference is no longer significant at the 5% level. The lack of significance remains true whether we trim the top 500, 1000, or 5000 observations.

  13.

    Although the data suggests extreme numbers of ads, Yahoo! engages in extensive anti-fraud efforts to ensure fair pricing of its products and services. In particular, not all ad impressions in our dataset were charged to the retailer as valid impressions.

  14.

    If these customers make purchases that cannot be tracked by the retailer, our estimate will underestimate the total effect of advertising on sales. However, the retailer claims to attribute 90% of purchases to the correct customer account via several methods, such as matching the name on a customer’s credit card at checkout.

  15.

    In Section 5.3, we will decompose the treatment effect of advertising into its effect on the number of purchasers versus its effect on the purchase amounts conditional on purchase.

  16.

    Because of the custom targeting to the selected database of known retailer customers, Yahoo! charged the retailer an appropriately higher rate, roughly five times the price of an equivalent untargeted campaign. In our return-on-investment calculations, we use the actual price charged to the retailer for the custom targeting.

  17.

    This power calculation helps us understand why Lodish et al. (1995a) used a 20% one-sided test as their threshold for statistical significance, a level that at first seemed surprisingly high. Note that their sample sizes were closer to 3,000 than to our 1.6 million.

  18.

    We now realize the importance of doing such power calculations before running an experiment, even one with over a million subjects. Lewis and Rao (2013) detail the statistical imprecision to be expected even in well-designed advertising experiments.

  19.

    Out of 75,000 observations with nonzero purchase amounts, we trim about 400 observations from the left and 400 from the right in the histograms. However, we leave all outliers in our analysis, despite their large variance, because customers who account for a large share of sales may also account for a large share of the ad effect. Further, because all data were recorded electronically, we do not suspect typos in the data. Trimming the outliers from the analysis does not appreciably change our results.

  20.

    However, it does easily exceed the 20% one-sided significance threshold used to declare a campaign successful in Lodish et al. (1995a).

  21.

    We recorded zero ad views for every member of the control group and, hence, cannot identify the control group members who would have seen ads. The Yahoo! ad server uses a complicated set of targeting rules and inventory constraints to determine which ad to show to a given individual on a given page. For example, one advertiser’s ad might be shown more often on Yahoo! Mail than on Yahoo! Finance. If some other advertiser targeted females under 30 during the same time period, then our ad campaign might have been relatively more likely to be seen by other demographic groups. Our treatment-control assignment represented an additional constraint. We eschew modeling the counterfactual distribution of ad delivery to the control group because we know such modeling would be imperfect and thereby risk biasing our results.

  22.

    Though we have three weeks of pre-period data available, we use only two weeks here for symmetry and simplicity of exposition (two weeks are intuitively comparable to two weeks). Weekly results using a three-week baseline can be found in Table 6.

  23.

    These ads were more expensive than a regular run-of-network campaign. The customer targeting commanded a large premium. In our cost estimates, we report the dollar amounts (scaled by the retailer’s “exchange rate”) paid by the retailer to Yahoo!

  24.

    Lewis, Rao, and Reiley (2011) use experiments to show the existence of “activity bias” in observational studies of online-advertising effectiveness: various online activities show high variance across days as well as high correlation across activities, and therefore viewing an ad is positively (but non-causally) correlated with visiting an advertiser’s website on a given day. A DID estimator would not correct for such activity bias: with individual-specific shocks, yesterday’s incarnation of a person is not a good control for today’s. The DID assumption seems a better bet in the present context because we are using lower-frequency data and examining effects on offline (rather than just online) behavior. Further, our specification checks comparing DID to unbiased experimental estimates show that any bias in this setting is considerably smaller than the large effects documented for online behavior by Lewis, Rao, and Reiley (2011).

  25.

    While this specification test is somewhat low-powered, it still tells us that our statistical assumption is plausible. By contrast, we saw above that using endogenous cross-sectional variation fails miserably in a comparison to the clean experimental estimator.

  26.

    If two estimators are based on valid assumptions, researchers should prefer the estimator with the smallest variance. Selecting the estimator on the variance (second moment) of an adaptive estimator like OLS should not bias the conditional mean.

  27.

    Though we have modeled the treatment effect as additive, we could instead model it as a constant percentage effect using a difference-in-log-differences estimate at the level of group averages. This could produce a slightly different estimate given that the unexposed group purchases 14% more than the exposed group, on average, during the baseline pre-period (R$2.06 versus R$1.81). Formally, we write:

    $$ E[\Delta] = (\log(E[y_{E,t}]) - \log(E[y_{E,t-1}])) - (\log(E[y_{U,t}]) - \log(E[y_{U,t-1}])). $$
    (8)

    We estimate \(E[\Delta]=5.0\%\) (2.3%), corresponding to a treatment effect of R$0.091 (0.041) (column 7 in Table 5) computed by multiplying \(E[\Delta]\times E[y_{E,t-1}]\). This estimate lies midway between the R$0.083 experimental and the R$0.102 DID estimates.

  28.

    The campaigns did not start and end on the same day of the week, so we end up with a three-day overlap between the third week after the start of the campaign and the third week prior to the start of the follow-up campaign. That is, those three days of sales are counted twice. We correct for this double-counting by scaling the estimates by the appropriate ratio. In the cumulative estimates over the entire period, this is the ratio of 8 weeks to 8 weeks and 3 days, due to the 3-day double-counting.

  29.

    We adapt the model slightly to accommodate varying post-campaign time windows by rescaling the pre-campaign period to be proportional in units of time:

    $$ \Delta y=y_{t}-\frac{w_{t}}{w_{t-1}}y_{t-1} $$
    (9)

    where \(w_{t}\) equals the number of units of time (e.g., weeks) in time window t. For example, if we are comparing a 3-week pre-campaign period to an 8-week post-campaign period, we would use \(\Delta y=y_{post}^{8wk}-\frac{8}{3}y_{pre}^{3wk}\). We use this to estimate both the total effects over a multi-week period and separate effects for each week.

  30.

    Because the follow-up campaign lasted ten days rather than an even number of weeks, the second “week” of the campaign consists of only three days instead of seven. In this case of a 3-day “week,” we scale up the sales data that week by 7/3 to keep consistent units of sales per week. This implicitly assumes that purchasing behavior and treatment effects are the same across days of the week, which is an imperfect, but reasonable approximation, especially considering that the three-day “week” represents such a minor fraction of the long-run period of study.

  31.

    To avoid overstating the significance of this observation, we note that the weekly estimates are not mutually independent. Each week’s DID estimator uses the same three weeks of pre-campaign data, and sales are also modestly correlated from week to week.

  32.

    Indeed, Table 6 shows that the difference in treatment effect from before to after the start of the follow-up campaign is positive but not statistically significant.

  33.

    The first of three weeks prior to the start of the follow-up campaign overlaps with the week following the campaign for three days (see footnote 28). In addition, the follow-up campaign’s second “week” is actually only three days, since the campaign ran for only ten days (see footnote 30).

  34.

    See Meland (1999), Holahan and Hof (2007), and Shein (2012) for data on the historical decline in CTR. At 0.3%, our highly targeted advertising campaign had a rather high CTR, three times that of the average display campaign in 2007.

  35.

    Because we cannot observe counterfactual ad views for the control group, we must rely on DID, pooling control-group members with untreated treatment-group members.

  36.

    We include negative purchase amounts (net returns) as transactions in this analysis. Since we previously found that advertising decreases the probability of a negative purchase amount, this effect would likely be larger if we restricted our analysis to positive purchases.

  37.

    We present a simple DID in sample proportions: our results are comparable to an OLS linear probability model rather than to a nonlinear model like a probit.

  38.

    When comparing mean time-series differences between treated individuals and untreated individuals, those two means are independent, so standard errors are straightforward. But when computing DID for four group means, pre- and post-campaign basket-size estimates are correlated because some customers purchase in both periods.

  39.

    Because the advertising increases the number of purchasers, the change in average basket size conflates two effects: the change in inframarginal customers’ purchase amounts and any difference in marginal and inframarginal customers’ average purchase amounts.
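
The arithmetic behind notes 27 through 30 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, and the only figures taken from the notes are the 5.0% estimate, the exposed group's R$1.81 baseline mean, the 8-week window with a 3-day overlap, and the 3-day "week." The sales figures passed to the last two calls are made up.

```python
import math


def log_did(y_e_pre, y_e_post, y_u_pre, y_u_post):
    """Difference-in-log-differences of group means, as in Eq. (8).

    E = exposed group, U = unexposed group; pre = t-1, post = t.
    """
    exposed_change = math.log(y_e_post) - math.log(y_e_pre)
    unexposed_change = math.log(y_u_post) - math.log(y_u_pre)
    return exposed_change - unexposed_change


def level_effect(delta, y_e_pre):
    """Convert a proportional effect to currency units: E[Delta] * E[y_{E,t-1}]."""
    return delta * y_e_pre


def rescaled_difference(y_post, y_pre, w_post, w_pre):
    """Eq. (9): y_t - (w_t / w_{t-1}) * y_{t-1}, with w measured in weeks."""
    return y_post - (w_post / w_pre) * y_pre


def overlap_correction(weeks, overlap_days):
    """Ratio of calendar days to counted days when overlap_days are counted twice."""
    calendar_days = 7 * weeks
    return calendar_days / (calendar_days + overlap_days)


def to_weekly_units(sales, days):
    """Scale sales observed over a short 'week' of `days` days to a 7-day week."""
    return sales * 7 / days


if __name__ == "__main__":
    # Note 27: 5.0% of the exposed group's R$1.81 baseline mean.
    print(level_effect(0.050, 1.81))             # about 0.0905
    # Note 28: 8 weeks relative to 8 weeks and 3 days.
    print(overlap_correction(8, 3))              # 56 / 59
    # Note 29: 3-week baseline against an 8-week post period
    # (10.0 and 3.0 are made-up sales figures).
    print(rescaled_difference(10.0, 3.0, 8, 3))
    # Note 30: a 3-day "week" scaled by 7/3 (3.0 is a made-up sales figure).
    print(to_weekly_units(3.0, 3))
```

The first call gives roughly R$0.0905, in line with the reported R$0.091; the small gap presumably reflects rounding of the published inputs. The overlap correction of 56/59 shrinks cumulative estimates by about 5%.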

References

  1. Aaker, D. A., & Carman, J. M. (1982). Are you overadvertising? a review of advertising-sales studies. Journal of Advertising Research, 22(4), 57–70.

  2. Abraham, M. M. (2008). The off-line impact of online ads. Harvard Business Review, 86(4), 28.

  3. Abraham, M. M., & Lodish, L. (1990). Getting the most out of advertising and promotion. Harvard Business Review, 68(3), 50.

  4. Ackerberg, D. (2003). Advertising, learning, and consumer choice in experience good markets: an empirical examination. International Economic Review, 44(3), 1007–1040.

  5. Ackerberg, D. A. (2001). Empirically distinguishing informative and prestige effects of advertising. RAND Journal of Economics, 316–333.

  6. Ackoff, R. L., & Emshoff, J. R. (1975). Advertising research at Anheuser-Busch, Inc. (1963-68). Sloan Management Review (pre-1986), 16(2), 1–1. http://search.proquest.com/docview/206793115?accountid=12861.

  7. Allaire, Y. (1975). A multivariate puzzle: A comment on advertising research at Anheuser-Busch, Inc. (1963-68). Sloan Management Review, (Spring), 91, 94.

  8. Angrist, J. D., Imbens, G. W., Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444–455.

  9. Bagwell, K. (2008). The economic analysis of advertising. Handbook of industrial organization, 3, 1701–1844.

  10. Berndt, E. R. (1991). The Practice of Econometrics: Classic and Contemporary. Reading, MA: Addison-Wesley.

  11. Bertrand, M., Karlan, D., Mullainathan, S., Shafir, E., Zinman, J. (2010). What’s advertising content worth? evidence from a consumer credit marketing field experiment. The Quarterly Journal of Economics, 125(1), 263–306.

  12. Blake, T., Nosko, C., Tadelis, S. (2013). Consumer heterogeneity and paid search effectiveness: A large scale field experiment. NBER Working Paper, 1–26.

  13. DellaVigna, S., & Gentzkow, M. (2010). Persuasion: Empirical evidence. Annual Review of Economics, 2(1), 643–669.

  14. Eastlack, J., & Rao, A. (1989). Advertising experiments at the Campbell Soup Company. Marketing Science, 57–71.

  15. Ghose, A., & Yang, S. (2009). An empirical analysis of search engine advertising: Sponsored search in electronic markets. Management Science, 55(10), 1605–1622.

  16. Holahan, C., & Hof, R. D. (2007). So many ads, so few clicks. Bloomberg Businessweek. http://www.businessweek.com/stories/2007-11-11/so-many-ads-so-few-clicks.

  17. Hu, Y., Lodish, L. M., Krieger, A. M. (2007). An analysis of real world TV advertising tests: A 15-year update. Journal of Advertising Research, 47(3), 341.

  18. Levitt, S. D., & List, J. A. (2009). Field experiments in economics: the past, the present, and the future. European Economic Review, 53(1), 1–18.

  19. Lewis, R. A., & Rao, J. M. (2013). On the near impossibility of measuring the returns to advertising. Working paper.

  20. Lewis, R. A., & Reiley, D. H. (2014). Advertising effectively influences older users: How field experiments can improve measurement and targeting. Review of Industrial Organization, forthcoming. http://rd.springer.com/article/10.1007/s11151-013-9403-y.

  21. Lewis, R. A., Rao, J. M., Reiley, D. H. (2011). Here, there, and everywhere: correlated online behaviors can lead to overestimates of the effects of advertising. Proceedings of the 20th International Conference on World Wide Web, pp. 157–166.

  22. Lewis, R. A., & Reiley, D. H. (2012). Ad attributes and attribution: Large-scale field experiments measure online customer acquisition. Working Paper.

  23. Lodish, L. M., Abraham, M., Kalmenson, S., Livelsberger, J., Lubetkin, B., Richardson, B., Stevens, M. E. (1995a). How TV advertising works: A meta-analysis of 389 real world split cable TV advertising experiments. Journal of Marketing Research, 125–139.

  24. Lodish, L. M., Abraham, M. M., Livelsberger, J., Lubetkin, B., Richardson, B., Stevens, M. E. (1995b). A summary of fifty-five in-market experimental estimates of the long-term effect of TV advertising. Marketing Science, 14(3 supplement), G133–G140.

  25. Meland, M. (1999). Banner click-throughs continue to fall. Forbes. http://www.forbes.com/1999/05/11/mu8.html.

  26. Shein, E. (2012). Banner ads: Past, present, and... future? CMOcom. http://www.cmo.com/content/cmo-com/home/articles/2012/4/24/banner-ads-past-present-and--future.html.

  27. Simester, D., Hu, J., Brynjolfsson, E., Anderson, E. (2009). Dynamics of retail advertising: Evidence from a field experiment. Economic Inquiry, 47(3), 482–499.

Acknowledgments

We thank Meredith Gordon, Sergiy Matusevych, and especially Taylor Schreiner for their work on the experiment and data. Yahoo! Inc. provided financial and data assistance and guaranteed academic independence prior to our analysis so that the results could be published no matter how they turned out. We acknowledge the helpful comments of Manuela Angelucci, David Broockman, JP Dubé, Liran Einav, Glenn Ellison, Matt Gentzkow, Jerry Hausman, Kei Hirano, Garrett Johnson, Larry Katz, John List, Preston McAfee, Sendhil Mullainathan, Justin Rao, Paul Ruud, Michael Schwarz, Pai-Ling Yin, and many others, including attendees at many conferences and seminars.

Author information

Corresponding author

Correspondence to Randall A. Lewis.

Additional information

This work was completed while both authors were employees at Yahoo! Research. Previously circulated versions were titled “Does Retail Advertising Work?” and “Retail Advertising Works!”

About this article

Cite this article

Lewis, R.A., Reiley, D.H. Online ads and offline sales: measuring the effect of retail advertising via a controlled experiment on Yahoo!. Quant Mark Econ 12, 235–266 (2014). https://doi.org/10.1007/s11129-014-9146-6

Keywords

  • Online advertising
  • Display advertising
  • Advertising effectiveness
  • Field experiment
  • Difference in differences

JEL Classification

  • Codes: C93 - Field Experiments
  • M37 - Advertising
  • D12 - Consumer Economics: Empirical Analysis