
Counterfactual inference for consumer choice across many product categories

A Correction to this article was published on 27 December 2021



This paper proposes a method for estimating consumer preferences among discrete choices, where the consumer chooses at most one product in a category, but selects from multiple categories in parallel. The consumer’s utility is additive in the different categories. Her preferences about product attributes as well as her price sensitivity vary across products and may be correlated across products. We build on techniques from the machine learning literature on probabilistic models of matrix factorization, extending the methods to account for time-varying product attributes and products going out-of-stock. We evaluate the performance of the model using held-out data from weeks with price changes or out of stock products. We show that our model improves over traditional modeling approaches that consider each category in isolation. One source of the improvement is the ability of the model to accurately estimate heterogeneity in preferences (by pooling information across categories); another source of improvement is its ability to estimate the preferences of consumers who have rarely or never made a purchase in a given category in the training data. Using held-out data, we show that our model can accurately distinguish which consumers are most price sensitive to a given product. We consider counterfactuals such as personally targeted price discounts, showing that using a richer model such as the one we propose substantially increases the benefits of personalization in discounts.





  1. This literature grapples with the challenge that to the extent prices vary across markets, the prices are often set in response to the market conditions in those markets. In addition, to the extent that products have quality characteristics that are unobserved to the econometrician, these unobserved quality characteristics may be correlated with the price.

  2. Throughout the paper we use category to refer to disjoint sets of products, such that products that are within the same category are partial substitutes.

  3. In the sample we use in our empirical exercise, among the top 123 categories the average category is only purchased on 3.7% of shopping trips. Only milk, lunch bread, and tomatoes are purchased on more than 15% of trips. Things are even sparser at the individual UPC level. The average purchase rate is 0.36% and only one, avocados, is purchased in more than 2% of the trips in our Tuesday-Wednesday sample.

  4. We also ran alternative specifications with the per-consumer price coefficients restricted to be the same across all products; however, this led to a substantial reduction in both the predictive and counterfactual fit measures of performance.

  5. In data from a single cross-section of consumers, Athey and Imbens (2007) show that only a single latent variable can be identified (or two if utility is restricted to be monotone in each) without functional form restrictions, arguing that panel data is critical to uncover common latent characteristics of products.

  6. Elrod and Keane use a factor analytic probit model with normally distributed preferences, whereas Chintagunta uses a logit model with discrete segments of consumer types. Elrod (1988a, 1988b) use logit models to estimate up to two latent attributes and the distribution of consumer preferences. The former study uses a linear utility specification, and the latter uses an ideal-point model.

  7. This approach has similarities to the econometrics literature on “interactive fixed effects models,” although that literature has focused primarily on decomposing common trends across individuals over time rather than identifying common preferences for products across individuals (Moon and Weidner, 2015; Moon et al., 2014; Bai, 2009).

  8. As is standard in the machine learning literature, the accuracy is estimated on a “held out” or “test” data set that is not used during the training of the model. This is a more accurate way to evaluate how well a model will be able to make predictions on new data that has not yet been observed.

  9. For example, when trained on Netflix data with 480,000 users, 17,700 movies, and 100 million observations, they report the model took 13 hours to converge on a single CPU.

  10. The matrix has one row for each user and one column for each item. The i,j entry corresponds to how much user i “likes” item j. We often only observe some of the entries of this large matrix, and would like to make predictions for the unobserved entries, e.g., predicting how much a user will like a movie that they haven’t watched or rated yet.
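To make the footnote concrete, here is a minimal sketch of matrix factorization on a toy ratings matrix. The data, dimensions, and least-squares objective are all illustrative; the paper's models use probabilistic (Poisson) factorizations rather than this generic squared-error fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings matrix: 4 users x 5 items, NaN marks unobserved entries.
R = np.array([
    [5.0, 3.0, np.nan, 1.0, np.nan],
    [4.0, np.nan, np.nan, 1.0, 2.0],
    [1.0, 1.0, np.nan, 5.0, 4.0],
    [np.nan, 1.0, 5.0, 4.0, np.nan],
])
observed = ~np.isnan(R)

K = 2                                           # number of latent factors
U = 0.1 * rng.standard_normal((R.shape[0], K))  # user factors
V = 0.1 * rng.standard_normal((R.shape[1], K))  # item factors

# Gradient descent on squared error over the observed entries only,
# with a small L2 penalty on the factors.
lr, lam = 0.01, 0.01
for _ in range(5000):
    err = np.where(observed, R - U @ V.T, 0.0)
    U += lr * (err @ V - lam * U)
    V += lr * (err.T @ U - lam * V)

# The fitted low-rank matrix fills in predictions for unobserved entries.
R_hat = U @ V.T
```

The same low-rank structure underlies the factorization models in the paper; they replace the squared-error objective with a likelihood and priors over the latent factors.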

  11. See Ruiz et al. (2020) for a related paper that uses a heuristic approach to identify potential substitutes and complements automatically based on patterns of co-purchase.

  12. Also known as the Gumbel distribution.

  13. McFadden (1974) provides a well known thought experiment illustrating this effect in the context of commuters choosing between driving a car, riding a red bus, or riding a blue bus. Steenburgh and Ainslie (2013) provide further details on the degree to which allowing heterogeneity in preferences reduces (but does not eliminate) the problems of the homogeneous logit model.

  14. In the context of the model, this would manifest as a correlation in the error terms 𝜖ijt across products, which the model assumes to be independent and identically distributed.

  15. However these predictions are still conditioned on the shopper’s decision to visit the store, which we treat as exogenous.

  16. i.e. we add an additional Ui0t = 𝜖i0t to each category, representing the decision to buy nothing from the category. Now the set of options for the consumer are mutually exclusive, and collectively exhaustive. On each shopping trip, for every category, a shopper either chooses something from the category or they choose the outside good.

  17. Train (2009) Section 4.2.4 provides a nice overview of the sequential estimation approach in the context of the traditional nested logit model.

  18. If the 𝜖 follow a standard Extreme Value Type 1 distribution, then there will be an extra term γ = mean(𝜖) ≈ 0.577 added, which in practice does not matter since it can be absorbed into the constant term. Alternatively we can define 𝜖 ∼ EV1(−γ, 1) to get rid of the extra term without affecting any of the choice probabilities.
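A quick numerical check of this footnote (toy simulation; the constant γ is the Euler–Mascheroni constant):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draws from the standard Extreme Value Type 1 (Gumbel) distribution.
draws = rng.gumbel(loc=0.0, scale=1.0, size=1_000_000)
print(draws.mean())  # approximately 0.577 (the Euler-Mascheroni constant)

# Shifting the location by -gamma, as the footnote suggests, centers the
# errors at zero; utility differences (and hence choice probabilities)
# are unaffected by a shift common to all options.
gamma = 0.5772156649
centered = rng.gumbel(loc=-gamma, scale=1.0, size=1_000_000)
print(centered.mean())  # approximately 0.0
```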

  19. KL divergence is similar to a distance function in that it is non-negative and KL(P∣∣Q) = 0 iff P = Q almost everywhere. It is not a true distance function, however, because it is not symmetric (KL(Q∣∣P) ≠ KL(P∣∣Q)) and does not satisfy the triangle inequality.
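A small numerical illustration of these properties, using two hypothetical discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.5, 0.4, 0.1]
Q = [0.3, 0.3, 0.4]

print(kl(P, Q))  # non-negative
print(kl(Q, P))  # differs from kl(P, Q): KL is not symmetric
print(kl(P, P))  # 0.0: KL is zero iff the distributions coincide
```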

  20. See Appendix B for more details and Athey et al. (2018), Blei et al. (2017), Ruiz et al. (2020) for additional exposition.

  21. A downside of this two-step approach is that we are not able to take into account the full estimated distributions of the latent variables from the product-level model when estimating the second category-level model.

  22. We define a shopping trip as a set of all purchases a household makes on a calendar day.

  23. We exclude data from the week prior to Halloween, Thanksgiving, Christmas, 4th of July, and Labor Day.

  24. At higher levels of aggregation, it was much more common to see multiple purchases in the same grouping on a single trip. At lower levels of aggregation, many categories were split into classes that contained products that seemed likely to be substitutes. For example, the category Apples is split into classes such as Fuji and Gala apples. Sharp Cheddar is in a separate class (but same category) as Mild Cheddar.

  25. We divide age into buckets {Under 45, 45–55, Over 55}. We split income at $100k, which is roughly the median for this store.

  26. In this model and all of the subsequent variations, the “outside good” (the choice to not buy anything in the category) is assumed to take the form Ui0t = 𝜖i0t.

  27. For a closer match to the Nested Factorization model, one could use a random coefficients nested logit model with product-specific random coefficients on the price in addition to the product intercepts.

  28. We extend the model proposed by Gopalan et al. (2013) to allow each customer to face multiple independent choice occasions, one for each trip they make to the store. This extension also allows for time-varying characteristics such as changes to prices and product availability. However, for the application presented here, we do not include prices within the HPF model, since these outputs are used as inputs to models that separately control for prices.

  29. Our extension of the original HPF model allows for observed user and item characteristics (including time varying characteristics), however on this dataset we found little or no improvement for out of sample predictive fit relative to a purely latent factorization.

  30. With the appropriate choice of utility for the outside good \(u_{i0t} = \log {\left (1 - {\sum }_{k \in J_{c}} \mu _{ik} \right )}\), so that \(P(y_{ijt} = 1) = \frac {\exp (u_{ijt})}{\exp (u_{i0t}) + {\sum }_{k \in J_{c}} \exp (u_{ikt})}\approx \mu _{ijt}\). This approximation works best when \({\sum }_{k \in J_{c}} \mu _{ik} \ll 1\), which is generally true in our supermarket application, but may be less appropriate in other contexts where purchase probabilities are larger.
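A numerical check of this construction, with hypothetical purchase rates μ. When the product utilities are set exactly to log μ, the softmax denominator sums to one and the rates are reproduced exactly; in the model the utilities only approximate these values, hence the footnote's caveat.

```python
import numpy as np

# Hypothetical per-item purchase rates for one consumer in one category;
# their sum is well below 1, as in the supermarket application.
mu = np.array([0.02, 0.05, 0.01])

# Product utilities u_ij = log(mu_ij), plus the outside-good utility
# u_i0 = log(1 - sum(mu)) from the footnote.
u = np.concatenate(([np.log(1.0 - mu.sum())], np.log(mu)))

# Multinomial (softmax) choice probabilities over {outside good} + items.
expu = np.exp(u)
probs = expu / expu.sum()

print(probs[1:])  # equals mu: the denominator sums to exactly 1
print(probs[0])   # equals 1 - sum(mu)
```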

  31. Thus in the shifted data all price changes occur on weeks without price changes in the real data and all weeks with price changes in the real data have no price changes in the shifted data. A naive shift of all prices by exactly 1 week fails to break the correlation between the price changes in the shifted and real price data, due to the frequency of week-long temporary price changes.

  32. We focus on the basic multinomial logit specification due to its computational speed and relative simplicity. Running similar tests for the other specifications, including the Nested Factorization, is possible in theory, but requires a larger computational cost.

  33. For example the regularization coefficient λ in a LASSO regression, or in the case of Nested Factorization, the number of latent factors of each type to include.

  34. As discussed in Rossi (2014), with consumer-level data, our biggest concern for the identification of price effects is that the store may be setting prices in response to variations in expected demand caused by seasonal trends or advertising. For example, there is more demand for fresh berries when they are in season or for turkeys immediately before Thanksgiving. It is not always clear which direction such price endogeneity will bias our estimates. The retailer may decide to take advantage of high demand by raising prices, but in other cases we see prices reduced during high demand periods, e.g., bags of candy going on sale before Halloween.

  35. i.e. that controlling for week and day of week effects at the category level is sufficient to make potential demand orthogonal to price level and product availability.

  36. Mean Log Likelihood and Mean Squared Error are calculated by dividing by the total number of purchases in order to make the values comparable between the test and training sets.

  37. i.e. user-item specific covariates that are estimated from the HPF model run on all categories simultaneously as described in Section 4.2.5

  38. These trends also hold in additional specifications of the alternative logit models that included controls for shopping frequency and previous purchase behavior.

  39. In all cases we exclude weeks in which the focal product is out-of-stock on either day. For the cross-price and out-of-stock counterfactuals, we exclude weeks in which the focal product has a price change. For the price change counterfactuals, we exclude weeks in which the magnitude of the price change is less than $0.10.

  40. If the individual purchasing decisions are distributed as independent Bernoulli variables, then their sum, the aggregate demand, has a Poisson distribution. The Tuesday-Wednesday change in aggregate demand then has a Skellam distribution, the distribution of the difference between two independent Poisson random variables.
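A simulation sketch of this footnote, with hypothetical Poisson rates for the two days:

```python
import numpy as np

rng = np.random.default_rng(0)

# Aggregate demand on each day: Poisson, as the sum of many independent
# Bernoulli purchase decisions (rates here are hypothetical).
lam_tue, lam_wed = 30.0, 24.0
tue = rng.poisson(lam_tue, size=200_000)
wed = rng.poisson(lam_wed, size=200_000)

# The day-to-day change follows a Skellam distribution with
# mean lam_tue - lam_wed and variance lam_tue + lam_wed.
diff = tue - wed
print(diff.mean())  # ~6.0
print(diff.var())   # ~54.0
```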

  41. The coefficient of variation is defined as \(\frac {sd}{mean}\).

  42. The coefficients are from a regression of actual purchase rate on the predicted purchase rate (both calculated on the test set) with item/category specific fixed effects to absorb heterogeneity in the mean purchase rates across items/categories.

  43. Profits are calculated as price minus marginal cost. Marginal costs come from the retailer’s records, which are available for most products. For items with no marginal cost data, we treat the minimum retail price in the data as the marginal cost.

    Table 6 Gains from targeted discounts
  44. We define demographic groups in terms of marital status, income level, age, and number of children.

  45. The validity of this analysis requires that customers are not strategically choosing which days to shop in response to prices. In our context with prices frequently changing across many categories, we believe that this effect is small. However, it is possible that some customers might be able to time their shopping trips in response to the prices of a few products that they consider particularly important.

  46. We exclude all prices that are less than the item’s marginal cost, since those prices would lead to negative profits, and thus would never be chosen as the more profitable price for any consumer.

  47. i.e. the households who are predicted to have higher profits under price 1 and the households who are predicted to have higher profits under price 2.

  48. We restrict ourselves to the two most common prices in order to increase the frequency with which we observe shopping trips with the selected prices in the test sample.


Download references


We are grateful to Tilman Drerup and Ayush Kanodia for exceptional research assistance. We thank the seminar participants at Harvard Business School, Stanford, the Microsoft Digital Economy Conference, and the Munich Lectures.


We acknowledge generous financial support from Microsoft Corporation, the Sloan Foundation, the Cyber Initiative at Stanford, and the Office of Naval Research grant N00014-17-1-2131. Robert Donnelly is currently employed at Instacart (San Francisco, USA) but contributed to this research while a graduate student at Stanford Graduate School of Business. Ruiz is currently affiliated with DeepMind (London, UK) but contributed to this research while at Columbia University (New York, USA) and the University of Cambridge (Cambridge, UK), supported by the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760). The views expressed herein do not necessarily represent the views of Instacart or DeepMind.

Author information




We follow the machine learning tradition in author ordering, with Robert Donnelly having the greatest contribution.

Corresponding author

Correspondence to Susan Athey.

Ethics declarations

Code Availability Additional implementations of related variational inference models are available at and

Additional information

Availability of Data and Materials

The dataset is available to researchers at Stanford and Berkeley by application; it has been used previously in other research papers (see

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised because the affiliations of Dr Francisco J.R. Ruiz were captured incorrectly.


Appendix A: Data construction and sample selection

The filters we use to select categories for study are outlined as follows:

  1. For many of the mixed and nested logit specifications, we encountered difficulty with convergence in some of the product categories. To reduce these issues we ran all of the logit specifications using the top 10 items in each category along with an eleventh “pooled” option that combined all of the less popular items in the category. The NF and HPF models were run without any pooling of items. To make for a fair comparison, we evaluate model fit using only the top 10 items in each category. The relative performance of the NF model improves further if we compare the sum of the predicted purchase probabilities for the pooled items to the pooled item prediction from the logit models.

  2. We eliminate categories in which more than 15% of shopping trips contain multiple items from the category or more than 10% of trips contain multiple top 10 items, since for these categories the assumption of unit demand was substantially violated. For any remaining shopping trips in which multiple items from the same category were purchased, we selected one item at random from among the purchased items (and treated the remaining items as unpurchased).

  3. We eliminate categories where the average absolute within-category correlation of the top 10 items’ prices is greater than 0.75. This addresses the challenge of identifying cross-price elasticities in a handful of categories in which virtually all prices move in parallel.

  4. We only include categories where at least 2 of the top 10 items have price variation from Tuesday to Wednesday in one of the sample weeks and at least 1 of the top 10 UPCs has price changes of at least 10 cents in at least 10% of the sample weeks.

  5. We eliminate the top 15% of categories with the strongest demand seasonality. For each UPC, we first calculate seasonality as the Herfindahl index of daily demands over the sample period. We then calculate the percentile of each UPC’s Herfindahl index over all UPCs and define a category’s seasonality as the average of the category’s top 10 items’ percentiles. While our approach of category-level time controls at the week level should be able to control for any category-level seasonality, it is not able to control for seasonal trends that affect individual UPCs.

The pricing information in our data comes from the transactions, which means we need to infer the prices a customer would have paid for any items they did not purchase. In addition, we need to account for coupons and deals (e.g. buy 2 get 1 free) that may cause different customers shopping on the same day to pay different prices per unit. To resolve this, we use the daily median transacted price per unit. In the event of a day with zero purchases, we carry forward the price data from the previous day.
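A sketch of this imputation in pandas, with hypothetical column names and toy transactions (daily median per product, forward-filled across zero-purchase days):

```python
import pandas as pd

# Hypothetical transaction records: one row per unit purchased.
tx = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-01",
                            "2015-01-03", "2015-01-04"]),
    "upc": ["A"] * 5,
    "paid_per_unit": [2.99, 2.99, 2.50, 3.19, 3.19],  # coupons/deals vary
})

# Daily median transacted price per unit, per product.
daily = (tx.groupby(["upc", "date"])["paid_per_unit"]
           .median()
           .rename("price"))

# Reindex to every day in the sample window and carry the last observed
# price forward on days with zero purchases (here, January 2).
all_days = pd.date_range("2015-01-01", "2015-01-04")
price = daily.loc["A"].reindex(all_days).ffill()
print(price)
```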

Our data on product availability (i.e. out-of-stock items) is at a granular level, as described in Che et al. (2012), based on the times employees scan items as out of stock and when they are restocked. However, for simplicity, all of the models are run at a daily level. We consider an item unavailable to all shoppers on any day in which it is listed as out-of-stock during more than 75% of shopping trips on that day.

Appendix B: Variational inference algorithm

In this section, we provide additional details on the implementation of the variational inference model that is used to estimate the Nested Factorization model.

Recall from Section 3.2.4 that for the product choice stage of the model, we would like to approximate the posterior distribution of the latent parameters {𝜃ij, ρj, σi, γi, λj}. For notational convenience we will rewrite this as ℓ = {ℓ1,…,ℓK}, where the number of latent parameters is K = 3N + 3J, N is the number of households, and J is the number of products. We will approximate the posterior using a multivariate Gaussian distribution with a diagonal covariance matrix. In general, imposing this “mean-field” assumption may limit the ability of our variational distribution to approximate the exact posterior; however, this structure often works quite well in practice and still allows substantial flexibility in the resulting posterior approximation.

$$ \begin{array}{@{}rcl@{}} q\left( \ell ; \nu \right) = \mathcal{N}(\ell; \mu, {\Sigma}) = {\prod}_{k=1}^{K} \mathcal{N}(\ell_{k}; \mu_{k}, {\sigma^{2}_{k}}) \end{array} $$

We want to find the variational parameters \(\nu = \{\mu _{1}, \ldots , \mu _{K}, {\sigma ^{2}_{1}}, \ldots , {\sigma ^{2}_{K}}\}\) that minimize the KL divergence between the variational distribution \(q\left (\ell ; \nu \right )\) and the exact posterior \(p(\ell \mid \boldsymbol{y}, \boldsymbol{x})\).

$$ q^{*}(\ell; \nu) = \arg\min_{\nu} KL\left( q(\ell; \nu) \mid \mid p(\ell \mid \boldsymbol{y}, \boldsymbol{x}) \right) $$

This can be rearranged to show that minimizing the KL divergence is equivalent to maximizing an expression known as the evidence lower bound (ELBO) (Blei et al., 2017).

$$ \begin{array}{@{}rcl@{}} \mathscr{L}(\nu) &=& E_{q(\ell;\nu)} \left[ \log p(\boldsymbol{y}, \ell \mid \boldsymbol{x}) - \log q(\ell; \nu) \right] \\ &=& E_{q(\ell;\nu)} \left[ {\sum}_{t} \log p(\boldsymbol{y}_{t} \mid x_{t}, \ell ) + \log p(\ell) - \log q(\ell; \nu) \right] \end{array} $$

Unfortunately, the expectation in Eq. 13 is analytically intractable. However, we can still seek the value of ν that maximizes \({\mathscr{L}}(\nu )\) with stochastic gradient descent if we are able to find a tractable expression for an unbiased estimate of the gradient \(\nabla _{\nu } {\mathscr{L}}(\nu )\). We can do this by applying an approach that is known as the reparametrization trick (Kingma and Welling, 2014; Titsias & Lázaro-Gredilla, 2014; Rezende et al., 2014).

To do this, we introduce a transformation of the latent variables, so that rather than directly drawing from the distribution \( \ell \sim q\left (\ell ; \nu \right )\), we instead draw a new auxiliary random variable \(\varepsilon \sim \mathcal {N}(0, I_{K})\) from a standard multivariate Gaussian distribution. By applying the transformation \(\mathcal {T}(\varepsilon ; \nu ) = \mu + {\Sigma }^{\frac {1}{2}} \varepsilon \) we can generate draws \(\ell = \mathcal {T}(\varepsilon ; \nu )\) such that \(\ell \sim q(\ell ; \nu )\).
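The transformation is straightforward to sketch numerically (toy variational parameters; because Σ is diagonal, Σ^(1/2) is just the vector of standard deviations):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
mu = np.array([0.5, -1.0, 2.0])    # variational means
sigma = np.array([0.3, 1.0, 0.5])  # sqrt of the diagonal of Sigma

# Reparametrization: draw eps ~ N(0, I_K), then apply
# T(eps; nu) = mu + Sigma^(1/2) eps, so that ell ~ q(ell; nu).
eps = rng.standard_normal((100_000, K))
ell = mu + sigma * eps

print(ell.mean(axis=0))  # ~mu
print(ell.std(axis=0))   # ~sigma
```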

For notational ease, we will denote the expression inside the expectation in Eq. 13 as f(ℓ; ν). We can now rewrite the expectation in the gradient of the ELBO as

$$ \nabla_{\nu} \mathscr{L}(\nu) = \nabla_{\nu} E_{q(\ell;\nu)} \left[ f(\ell; \nu) \right] = \nabla_{\nu} E_{\varepsilon} \left[ f(\mathcal{T}(\varepsilon; \nu); \nu) \right] $$

Now, bringing the gradient inside of the expectation and applying the chain rule gives us

$$ \nabla_{\nu} \mathscr{L}(\nu) = E_{\varepsilon} \left[ \nabla_{\ell} f(\ell; \nu)\mid_{\ell = \mathcal{T}(\varepsilon; \nu)} \nabla_{\nu} \mathcal{T}(\varepsilon; \nu) \right] $$

To obtain this expression, we used the fact that \(E_{\varepsilon } \left [ \nabla _{\nu } f(\ell , \nu )\mid _{\ell = \mathcal {T}(\varepsilon ; \nu )} \right ] = 0\), since the only dependence of f(ℓ; ν) on ν is through the term \(\log q(\ell ; \nu )\) and the expected value of the score function is 0.

Now we can obtain a Monte Carlo estimate of the gradient of the ELBO by sampling values of ε from the standard multivariate Gaussian in order to approximate this expectation. In addition to making draws of ε to evaluate the expectation, we also subsample customer shopping trips in each iteration. This allows the estimation to more easily scale to large datasets. By scaling the gradient estimates to account for this sampling, we can maintain the unbiasedness of the gradient estimator.
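A toy check of why the reparametrized gradient estimator is unbiased, using f(ℓ) = ℓ² with a scalar Gaussian so the exact gradient is available in closed form (an illustration of the trick, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.8

# For f(ell) = ell^2 with ell ~ N(mu, sigma^2):
#   E[f] = mu^2 + sigma^2,  so  d/dmu E[f] = 2*mu.
# Reparametrize ell = T(eps; nu) = mu + sigma*eps with eps ~ N(0, 1):
#   d/dmu f(T(eps; nu)) = f'(ell) * dT/dmu = 2*ell * 1.
# Averaging over draws of eps gives an unbiased Monte Carlo gradient.
eps = rng.standard_normal(1_000_000)
ell = mu + sigma * eps
grad_mu_estimate = (2.0 * ell).mean()

print(grad_mu_estimate)  # close to the exact gradient 2*mu = 3.0
```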

B.1 Hyperparameters

For each of the variational parameters we use a prior mean of 0 and a prior variance of 1.0, and we use a batch size of 5000 (the number of customer trips sampled in each step of the stochastic gradient descent). We do a hyperparameter grid search across:

  • {20,40,80} for the dimension of the latent factorization 𝜃iβj

  • {20,40,80} for the dimension of the price term γiλj

  • With and without user demographics Wi

  • With and without product characteristics Xj

  • {0.001,0.005,0.01} learning rates (i.e. step size for gradient descent)

  • Product level model only: whether price should enter linearly or in logs

  • Category level model only: {10,20} for the dimension of the time factorization μcδt

We selected models based on counterfactual price performance on the validation set. The selected product level model has hyperparameters {80, 20, yes, no, 0.005, linear price}. The selected category level model has hyperparameters {40, 40, no, no, 0.01, 10}.

Appendix C: Estimated elasticities and purchase probabilities

In this section we compare the predicted own price elasticities across models and how those predicted elasticities vary within and between products. Table 7 shows the median own price elasticity estimated by each model. SD(Mean) is the standard deviation across the mean product level own-price elasticities. This captures how much variability there is in elasticities across products. Mean(SD) is the mean of the standard deviation of elasticities across consumers within a specific product. This captures the amount of variability in elasticities across consumers for the same product.

Table 7 Comparison of own-price elasticities

Figures 8 and 9 show the predicted elasticities and purchase rates for a sample of products and categories (entries are hidden when their text box would overlap with another entry).

Fig. 8

Category level elasticities and predicted purchase probabilities

Fig. 9

UPC level elasticities and predicted purchase probabilities


About this article


Cite this article

Donnelly, R., Ruiz, F.J., Blei, D. et al. Counterfactual inference for consumer choice across many product categories. Quant Mark Econ 19, 369–407 (2021).



Keywords

  • Consumer demand
  • Machine learning
  • Variational inference
  • Causal inference
  • Grocery
  • Purchase history data

JEL Classification

  • C52
  • C55
  • D12
  • L81
  • M31