Abstract
Marketers increasingly face modeling situations in which the number of independent variables is large, possibly approaching or exceeding the number of observations. In this setting, covariate selection and model estimation present significant challenges to usual methods of inference. These challenges are exacerbated when covariate interactions are of interest. Most extant regularization methods make no distinction between main and interaction terms in estimation. An exception is the linear VANISH model, a regularization method for models with interaction terms that ensures proper model hierarchy by enforcing the heredity principle. We derive the generalized VANISH model for nonlinear responses, including the duration, discrete choice, and count models widely used in marketing applications. In addition, we propose a VANISH model that allows us to account for unobserved consumer heterogeneity via a mixture approach. In three empirical applications we demonstrate that our proposed model outperforms main-effects models as well as other methods that include interaction terms.
Notes
For example, say a model has 100 parameters. When adding only first-order interactions, one needs to estimate 100 main effects and 100 × 99/2 = 4,950 interaction effects. Even if enough observations are available to estimate the main effects, adding the interaction effects will almost certainly result in a “large p, small n” problem.
VANISH refers to Variable Selection using Adaptive Nonlinear Interaction Structures in High Dimensions (Radchenko and James 2010).
LASSO refers to Least Absolute Shrinkage and Selection Operator.
Details on the sampler can be found in the Appendix. More details on the derivations of the full conditional distributions of the VANISH parameters can be found in the Web Appendix.
The LASSO prior is \( \pi \left(\beta |\sigma \right)=\prod \limits_{j=1}^p\frac{\lambda }{2\sigma}\exp \left(-\frac{\lambda \mid {\beta}_j\mid }{\sigma}\right) \). Note that the LASSO model only requires one tuning parameter, λ, and one set of latent parameters, τ. Estimation proceeds similarly to estimation using a VANISH prior as detailed in the Appendix. Also see Park and Casella (2008) for a full Bayesian treatment of the linear LASSO.
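The Laplace form of this prior can be illustrated through its scale-mixture-of-normals representation (Park & Casella 2008): drawing \( \tau_j^2 \) from an exponential distribution and \( \beta_j \mid \tau_j^2 \) from a normal distribution reproduces the LASSO prior above. A minimal sketch (hypothetical Python, not the authors' code; the function name is ours):

```python
import numpy as np

def sample_lasso_prior(p, lam, sigma, rng):
    # Scale-mixture representation of the Laplace (LASSO) prior:
    # tau_j^2 ~ Exponential(rate = lam^2 / 2), so scale = 2 / lam^2,
    # beta_j | tau_j^2 ~ N(0, sigma^2 * tau_j^2).
    tau2 = rng.exponential(scale=2.0 / lam**2, size=p)
    return rng.normal(0.0, sigma * np.sqrt(tau2))

draws = sample_lasso_prior(5, lam=1.0, sigma=1.0, rng=np.random.default_rng(0))
```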
We estimate the MART version of each of our models (hazard, choice and count) using the gbm package in R. The gbm algorithm is a boosting algorithm which creates a sequence of simple trees where each successive tree is built for predicting the residuals of the preceding tree. Thus, at each step of the algorithm a simple partitioning of the data is determined and the deviations of the observed values from the respective means (residuals for each partition) are computed. The next tree will then be fitted to those residuals to find another partition that will further reduce the residual (error) variance given the preceding sequence of trees.
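The residual-fitting logic of this boosting sequence can be sketched with a toy example (hypothetical Python using one-split "stumps" under squared-error loss; the gbm package itself is far more general):

```python
import numpy as np

def fit_stump(x, r):
    """Find the single split on x that best predicts residuals r (a one-split tree)."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, ml, mr = best
    return lambda z: np.where(z <= s, ml, mr)

def boost(x, y, n_trees=50, lr=0.1):
    """Each successive stump is fitted to the residuals of the preceding ensemble."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_trees):
        r = y - pred              # residuals of the current sequence of trees
        tree = fit_stump(x, r)    # next tree fitted to those residuals
        pred += lr * tree(x)      # shrunken update reduces residual variance
    return pred

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
fit = boost(x, y)
```

Each added stump strictly reduces the training residual variance as long as the learning rate lies in (0, 2), which mirrors the gbm behavior described above.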
We thank Maytal Saar-Tsechansky of UT Austin for providing access to the data and Samuel Blazek for providing research assistance in processing the data.
Gilbride et al. (2006) suggest a heterogeneous variable selection approach applied to conjoint data that shrinks individual-level response coefficients either towards zero with a very small prior variance or towards a normal prior distribution, dependent upon individual-level discrete variable selection parameters. Their model performs variable selection at the attribute level rather than the partworth level (i.e., the brand attribute is either selected or not, versus each level of the brand attribute). While this approach allows the researcher to model attribute attendance heterogeneously at the individual level, it is not, per se, an approach well suited for “large p, small n” problems. Their approach does not include a way to ensure that the number of selected parameters (i.e., the parameters not shrunk towards zero) does not exceed the number of observations. It is also silent on whether and how to handle interaction terms.
We thank the editor for bringing this point to our attention.
References
Agarwal, A., Hosanagar, K., & Smith, M. D. (2011). Location, location, location: An analysis of profitability of position in online advertising markets. Journal of Marketing Research, 48(6), 1057–1073.
Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 35(3), 384–389.
Bakshy, E., Hofman, J.M., Mason, W.A., & Watts, D.J. (2011). Everyone’s an influencer: Quantifying influence on Twitter. Fourth ACM International Conference on Web Search and Data Mining.
Bumbaca, F., Misra, S., & Rossi, P. (2017). Distributed Markov chain Monte Carlo for Bayesian hierarchical models. University of California, Irvine, Working Paper.
Cheng, J., Adamic, L., Dow, A., Kleinberg, J., & Leskovec, J. (2014). Can cascades be predicted? Proc. 23rd International World Wide Web Conference.
Ebbes, P., Papies, D., & Van Heerde, H. J. (2011). The sense and non-sense of holdout sample validation in the presence of endogeneity. Marketing Science, 30(6), 1115–1122.
Ghose, A., & Yang, S. (2009). An empirical analysis of sponsored search in online advertising. Management Science, 55(10), 1605–1622.
Ghose, A., Ipeirotis, P. G., & Li, B. (2014). Examining the impact of ranking on consumer behavior and search engine revenue. Management Science, 60(7), 1632–1654.
Gilbride, T. J., Allenby, G. M., & Brazell, J. D. (2006). Models for heterogeneous variable selection. Journal of Marketing Research, 43(3), 420–430.
Hong, L., Dan, O., & Davison, B. D. (2011). Predicting popular messages in Twitter. WWW 2011. Hyderabad, India.
Naik, P., Wedel, M., Bacon, L., Bodapati, A., Bradlow, E., Kamakura, W., Kreulen, J., Lenk, P., Madigan, D., & Montgomery, A. (2008). Challenges and opportunities in high-dimensional choice data analyses. Marketing Letters, 19(3), 201–213.
Nelder, J. A. (1998). The selection of terms in response-surface models—how strong is the weak-heredity principle? The American Statistician, 52(4), 315–318.
Park, T., & Casella, G. (2008). The Bayesian LASSO. Journal of the American Statistical Association, 103, 681–686.
Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. The American Statistician, 44(1), 26–30.
Pennebaker, J. W. (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Press.
Petrović, S., Osborne, M., & Lavrenko, V. (2011). RT to win! Predicting message propagation in Twitter. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
Radchenko, P., & James, G. M. (2010). Variable selection using adaptive non-linear interaction structures in high dimensions. Journal of the American Statistical Association, 105, 1541–1553.
Rutz, O. J., Bucklin, R. E., & Sonnier, G. P. (2012). A latent instrumental variables approach to modeling keyword conversion in paid search advertising. Journal of Marketing Research, 49(3), 306–319.
Rutz, O. J., Sonnier, G. P., & Trusov, M. (2017). A new method to aid copy testing of paid search text advertisements. Journal of Marketing Research, 54(6), 885–900.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64, 583–639.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62(4), 795–809.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(2), 267–288.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of ill-posed problems. Washington: Winston & Sons.
Yoganarasimhan, H. (2018). Search Personalization using Machine Learning. Forthcoming at Management Science.
Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583–1611.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
Appendix
We detail the steps in our VANISH regularization approach for the VANISH Hazard, VANISH Choice and VANISH Poisson models. For more information on the derivation of the full conditional distributions for the VANISH parameters (τ, ω, λ1, λ2) please see the Web Appendix.
1.1 Hazard model
1) Generate α and γ using a random-walk Metropolis-Hastings (MH) sampler based on the likelihood given by:
where \( d_{it} \) is 1 if the life event occurs and zero otherwise.
2) Generate β using a random-walk MH sampler based on the likelihood given by:
and the VANISH prior given by:
3) Generate \( \frac{1}{\tau^2} \)
where I(⋅) is the indicator function.
4) Generate \( \frac{1}{\omega^2} \)
where I(⋅) is the indicator function.
5) Generate \( \lambda_1^2 \) and \( \lambda_2^2 \)
where r = 1 and s = 0.1.
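Steps 1) and 2) rely on random-walk MH updates. A generic sketch of one such update, applied here to a toy target rather than the hazard likelihood (hypothetical Python, not the authors' code; names are ours):

```python
import numpy as np

def rw_mh_step(beta, log_post, step, rng):
    """One random-walk Metropolis-Hastings update for a parameter block."""
    prop = beta + rng.normal(0.0, step, size=beta.shape)  # symmetric Gaussian proposal
    # With a symmetric proposal, the acceptance ratio reduces to the posterior ratio.
    if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
        return prop
    return beta

# Toy target: a standard normal log-density stands in for likelihood x VANISH prior.
log_post = lambda b: -0.5 * np.sum(b**2)
rng = np.random.default_rng(0)
beta = np.zeros(2)
draws = []
for _ in range(2000):
    beta = rw_mh_step(beta, log_post, step=1.0, rng=rng)
    draws.append(beta)
draws = np.array(draws)
```

In the actual sampler, `log_post` would evaluate the hazard likelihood above plus the log of the VANISH prior for the block being updated.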
1.2 Choice model
1) Generate \( \beta^s \) using a random-walk MH sampler based on the likelihood given by (11) and (12) and the VANISH prior given by:
2) Generate \( \frac{1}{{\left({\tau}^s\right)}^2} \)
where I(⋅) is the indicator function.
3) Generate \( \frac{1}{{\left({\omega}^s\right)}^2} \)
where I(⋅) is the indicator function.
4) Generate \( {\left({\lambda}_1^s\right)}^2 \) and \( {\left({\lambda}_2^s\right)}^2 \)
where \( r_{\lambda} = 1 \) and \( s_{\lambda} = 0.1 \).
5) Generate π
\( \pi \mid Z,\rho \sim Dir\left[\left({\tilde{\rho}}_1...{\tilde{\rho}}_S\right)\right] \), where \( {\tilde{\rho}}_s={\rho}_s+\sum \limits_{i=1}^nI\left({Z}_i=s\right) \) with prior ρ = (1, ..., 1).
6) Generate \( Z_i \)
\( Z_i \mid \theta, \mu, V, \pi \sim \mathrm{multinomial}\left(1,\left[L{R}_1\left({\beta}^i\right),...,L{R}_S\left({\beta}^i\right)\right]\right) \),
where \( L{R}_l\left({\beta}^i\right)=\frac{\pi_l L_l\left({\beta}^i\right)}{\sum \limits_{s=1}^S{\pi}_s L_s\left({\beta}^i\right)} \) and \( L_s \) is the likelihood given by (11) and (12) evaluated at the segment-s parameters.
7) Choose a permutation \( {\xi}_t^{+} \) and relabel the draws according to \( {\xi}_t^{+} \)
8) Set
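Steps 5) and 6) are standard finite-mixture full-conditional updates. A minimal sketch (hypothetical Python, not the authors' code; the segment log-likelihoods are evaluated on the log scale for numerical stability):

```python
import numpy as np

def update_memberships(loglik, pi, rng):
    """Draw each Z_i from its multinomial full conditional,
    P(Z_i = s) proportional to pi_s * L_s(beta^i)."""
    # loglik: (n, S) array with loglik[i, s] = log L_s(beta^i)
    w = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * pi
    w /= w.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(pi), p=wi) for wi in w])

def update_weights(Z, S, rho, rng):
    """Draw pi from its Dirichlet full conditional: prior rho plus segment counts."""
    counts = np.bincount(Z, minlength=S)
    return rng.dirichlet(rho + counts)

rng = np.random.default_rng(0)
loglik = np.array([[0.0, -50.0]] * 10)  # toy data: every unit strongly favors segment 1
Z = update_memberships(loglik, pi=np.array([0.5, 0.5]), rng=rng)
pi = update_weights(Z, S=2, rho=np.ones(2), rng=rng)
```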
1.3 Poisson model
1) Generate \( \left[\beta^{sea}\ \beta^{imp}\ \beta^{pos}\ \beta^{cpc}\right] \) using a random-walk Metropolis-Hastings (MH) sampler based on the likelihood given by:
2) Generate \( \beta^{txt} \) using a random-walk MH sampler based on the likelihood given by:
and the VANISH prior given by:
3) Generate \( \frac{1}{\tau^2} \)
where I(⋅) is the indicator function.
4) Generate \( \frac{1}{\omega^2} \)
where I(⋅) is the indicator function.
5) Generate \( \lambda_1^2 \) and \( \lambda_2^2 \)
where r = 1 and s = 0.1.
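If, as in Park and Casella (2008), r and s are the shape and rate of a Gamma prior on \( \lambda^2 \), the update in step 5) is conjugate: \( \lambda^2 \mid \tau \sim \mathrm{Gamma}\left(r + p,\ s + \sum_j \tau_j^2/2\right) \). A sketch under that assumption (hypothetical Python, not the authors' code):

```python
import numpy as np

def update_lambda2(tau2, r=1.0, s=0.1, rng=None):
    """Gamma full conditional for lambda^2 under a Gamma(r, s) prior,
    as in the Bayesian LASSO: Gamma(r + p, s + sum(tau_j^2) / 2)."""
    if rng is None:
        rng = np.random.default_rng()
    shape = r + len(tau2)
    rate = s + tau2.sum() / 2.0
    # numpy's gamma is parameterized by shape and scale (= 1 / rate)
    return rng.gamma(shape, 1.0 / rate)

rng = np.random.default_rng(0)
lam2 = update_lambda2(np.array([0.5, 1.0, 2.0]), rng=rng)
```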
Cite this article
Rutz, O.J., Sonnier, G.P. VANISH regularization for generalized linear models. Quant Mark Econ 17, 415–437 (2019). https://doi.org/10.1007/s11129-019-09216-4