
VANISH regularization for generalized linear models

Published in Quantitative Marketing and Economics (2019)

Abstract

Marketers increasingly face modeling situations where the number of independent variables is large and possibly approaching or exceeding the number of observations. In this setting, covariate selection and model estimation present significant challenges to standard methods of inference. These challenges are exacerbated when covariate interactions are of interest. Most extant regularization methods make no distinction between main and interaction terms in estimation. The linear VANISH model is an exception: it is a regularization method for models with interaction terms that ensures proper model hierarchy by enforcing the heredity principle. We derive the generalized VANISH model for nonlinear responses, including the duration, discrete choice, and count models widely used in marketing applications. In addition, we propose a VANISH model that accounts for unobserved consumer heterogeneity via a mixture approach. In three empirical applications we demonstrate that our proposed model outperforms main-effects models as well as other methods that include interaction terms.

Fig. 1


Notes

  1. For example, say a model has 100 parameters. When adding only first-level interactions, one needs to estimate 100 main effects and 4,950 interaction effects. Even if enough observations are available to estimate the main effects, adding the interaction effects will almost certainly result in a “large p, small n” problem.
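The count in this note is just the number of unordered covariate pairs, p(p − 1)/2; a quick sketch (function name is ours):

```python
from math import comb

def interaction_count(p):
    """Number of first-level (pairwise) interaction terms among p main effects."""
    return comb(p, 2)  # p * (p - 1) / 2 unordered pairs

# A model with 100 main effects gains 4,950 interaction terms,
# taking the total parameter count from 100 to 5,050.
n_interactions = interaction_count(100)
```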

  2. VANISH refers to Variable Selection using Adaptive Nonlinear Interaction Structures in High Dimensions (Radchenko and James 2010).

  3. LASSO refers to Least Absolute Shrinkage and Selection Operator.

  4. Details on the sampler can be found in the Appendix. More details on the derivations of the full conditional distributions of the VANISH parameters can be found in the Web Appendix.

  5. The LASSO prior is \( \pi \left(\beta |\sigma \right)=\prod \limits_{j=1}^p\frac{\lambda }{2\sigma}\exp \left(-\frac{\lambda \mid {\beta}_j\mid }{\sigma}\right) \). Note that the LASSO model only requires one tuning parameter, λ, and one set of latent parameters, τ. Estimation proceeds similarly to estimation using a VANISH prior as detailed in the Appendix. Also see Park and Casella (2008) for a full Bayesian treatment of the linear LASSO.
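On the log scale this prior is linear in the |β_j|, which is what produces the LASSO's shrinkage toward zero; a minimal sketch for evaluating it (the function and argument names are ours):

```python
import math

def lasso_log_prior(beta, sigma, lam):
    """Log of the conditional Laplace (LASSO) prior
    pi(beta | sigma) = prod_j  lam/(2*sigma) * exp(-lam * |beta_j| / sigma)."""
    p = len(beta)
    return p * math.log(lam / (2.0 * sigma)) - (lam / sigma) * sum(abs(b) for b in beta)

# The penalty grows linearly in |beta_j|: larger coefficients are
# always less probable a priori, for any value of the single tuning
# parameter lam.
```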

  6. We estimate the MART version of each of our models (hazard, choice, and count) using the gbm package in R. The gbm algorithm is a boosting algorithm that creates a sequence of simple trees, where each successive tree is built to predict the residuals of the preceding tree. At each step of the algorithm a simple partitioning of the data is determined and the deviations of the observed values from the respective means (the residuals for each partition) are computed. The next tree is then fitted to those residuals to find another partition that further reduces the residual (error) variance given the preceding sequence of trees.
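The residual-fitting loop described above can be illustrated in miniature with depth-one trees (stumps). This is a from-scratch sketch of the boosting idea only, not the gbm implementation; all names and tuning values are ours:

```python
def fit_stump(x, r):
    """Best single-split regression stump for 1-D inputs x and targets r."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for cut in range(1, len(x)):
        thresh = x[order[cut]]
        left = [r[i] for i in order[:cut]]
        right = [r[i] for i in order[cut:]]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - m_left) ** 2 for v in left) + \
              sum((v - m_right) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, thresh, m_left, m_right)
    return best[1:]

def boost(x, y, n_trees=20, shrink=0.5):
    """Each successive stump is fitted to the residuals of the current
    ensemble; its shrunken prediction is then added to the fit."""
    pred = [0.0] * len(x)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thresh, m_left, m_right = fit_stump(x, residuals)
        pred = [p + shrink * (m_left if xi < thresh else m_right)
                for p, xi in zip(pred, x)]
    return pred
```

Each round deliberately explains only part of the data (the shrinkage factor), so the ensemble reduces the residual variance gradually across the sequence of trees.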

  7. We thank Maytal Saar-Tsechansky of UT Austin for providing access to the data and Samuel Blazek for providing research assistance in processing the data.

  8. Gilbride et al. (2006) suggest a heterogeneous variable selection approach, applied to conjoint data, that shrinks individual-level response coefficients either towards zero (via a very small prior variance) or towards a normal prior distribution, depending on individual-level discrete variable selection parameters. Their model performs variable selection at the attribute level rather than the partworth level (i.e., the brand attribute is either selected or not, versus each level of the brand attribute). While this approach allows the researcher to model attribute attendance heterogeneously at the individual level, it is not, per se, an approach well suited for “large p, small n” problems. Their approach does not include a way to ensure that the number of selected parameters (i.e., the parameters not shrunk towards zero) does not exceed the number of observations. It is also silent on whether and how to handle interaction terms.

  9. We thank the editor for bringing this point to our attention.

References

  • Agarwal, A., Hosanagar, K., & Smith, M. D. (2011). Location, location, location: An analysis of profitability of position in online advertising markets. Journal of Marketing Research, 48(6), 1057–1073.

  • Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 35, 384–389.

  • Bakshy, E., Hofman, J.M., Mason, W.A., & Watts, D.J. (2011). Everyone’s an influencer: Quantifying influence on Twitter. Fourth ACM International Conference on Web Search and Data Mining.

  • Bumbaca, F., Misra, S., & Rossi, P. (2017). Distributed Markov chain Monte Carlo for Bayesian hierarchical models. University of California, Irvine, Working Paper.

  • Cheng, J., Adamic, L., Dow, A., Kleinberg, J., & Leskovec, J. (2014). Can cascades be predicted? Proc. 23rd International World Wide Web Conference.

  • Ebbes, P., Papies, D., & Van Heerde, H. J. (2011). The sense and non-sense of holdout sample validation in the presence of endogeneity. Marketing Science, 30(6), 1115–1122.

  • Ghose, A., & Yang, S. (2009). An empirical analysis of sponsored search in online advertising. Management Science, 55(10), 1605–1622.


  • Ghose, A., Ipeirotis, P. G., & Li, B. (2014). Examining the impact of ranking on consumer behavior and search engine revenue. Management Science, 60(7), 1632–1654.


  • Gilbride, T. J., Allenby, G. M., & Brazell, J. D. (2006). Models for heterogeneous variable selection. Journal of Marketing Research, 43(3), 420–430.


  • Hong, L., Dan, O., & Davison, B. D. (2011). Predicting popular messages in Twitter. WWW 2011. Hyderabad, India.

  • Naik, P., Wedel, M., Bacon, L., Bodapati, A., Bradlow, E., Kamakura, W., Kreulen, J., Lenk, P., Madigan, D., & Montgomery, A. (2008). Challenges and opportunities in high-dimensional choice data analyses. Marketing Letters, 19(3), 201–213.


  • Nelder, J. A. (1998). The selection of terms in response-surface models—how strong is the weak-heredity principle? The American Statistician, 52(4), 315–318.


  • Park, T., & Casella, G. (2008). The Bayesian LASSO. Journal of the American Statistical Association, 103, 681–686.


  • Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. The American Statistician, 44(1), 26–30.


  • Pennebaker, J. W. (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Press.


  • Petrović, S., Osborne, M., & Lavrenko, V. (2011). RT to win! Predicting message propagation in Twitter. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.

  • Radchenko, P., & James, G. M. (2010). Variable selection using adaptive non-linear interaction structures in high dimensions. Journal of the American Statistical Association, 105, 1541–1553.


  • Rutz, O. J., Bucklin, R. E., & Sonnier, G. P. (2012). A latent instrumental variables approach to modeling keyword conversion in paid search advertising. Journal of Marketing Research, 49(3), 306–319.

  • Rutz, O. J., Sonnier, G. P., & Trusov, M. (2017). A new method to aid copy testing of paid search text advertisements. Journal of Marketing Research, 54(6), 885–900.


  • Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.


  • Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64, 583–639.


  • Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62(4), 795–809.


  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(2), 267–288.


  • Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of ill-posed problems. Washington: Winston & Sons.


  • Yoganarasimhan, H. (2018). Search Personalization using Machine Learning. Forthcoming at Management Science.

  • Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583–1611.


  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.



Author information

Correspondence to Oliver J. Rutz.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1

(PDF 403 kb)

Appendix

We detail the steps in our VANISH regularization approach for the VANISH Hazard, VANISH Choice, and VANISH Poisson models. For more information on the derivation of the full conditional distributions for the VANISH parameters (τ, ω, λ1, λ2), please see the Web Appendix.

1.1 Hazard model

1) Generate α and γ using a random-walk Metropolis-Hastings (MH) sampler based on the likelihood given by:

$$ L=\prod \limits_{i=1}^n\prod \limits_{t=1}^{t_i}{\Pr}_i{\left(t,{x}_{it}\right)}^{d_{it}}{\left(1-{\Pr}_i\left(t,{x}_{it}\right)\right)}^{1-{d}_{it}}, $$

where

$$ {\Pr}_i\left(t,{x}_{it}\right)=1-\frac{S_i\left(t,{x}_{it}\right)}{S_i\left(t-1,{x}_{it}\right)}=1-\exp \left(-\exp \left({x}_{it}\beta \right)\int_{t-1}^{t}{h}_i(u)\, du\right) $$

where \( d_{it} \) equals 1 if the life event occurs in period t and zero otherwise.
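The event probability and Bernoulli likelihood above can be evaluated numerically. A minimal sketch, assuming for illustration only a constant baseline hazard so the integral over one period reduces to a constant h0 (names are ours):

```python
import math

def event_prob(x_beta, h0=1.0):
    """Pr_i(t, x_it) = 1 - exp(-exp(x'beta) * integral of the baseline hazard);
    here the baseline hazard is constant, so the one-period integral is h0."""
    return 1.0 - math.exp(-math.exp(x_beta) * h0)

def log_likelihood(x_betas, events, h0=1.0):
    """Discrete-time hazard log-likelihood over one spell: events holds d_it,
    which is 1 in the period the event occurs and 0 before."""
    ll = 0.0
    for xb, d in zip(x_betas, events):
        p = event_prob(xb, h0)
        ll += d * math.log(p) + (1 - d) * math.log(1.0 - p)
    return ll
```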

2) Generate β using a random-walk MH sampler based on the likelihood given by:

$$ L=\prod \limits_{i=1}^n\prod \limits_{t=1}^{t_i}{\Pr}_i{\left(t,{x}_{it}\right)}^{d_{it}}{\left(1-{\Pr}_i\left(t,{x}_{it}\right)\right)}^{1-{d}_{it}} $$

and the VANISH prior given by:

$$ {\beta}_j\mid ...\propto \exp \left[-\frac{{\left(\sum \limits_{k=j+1}^p|{\beta}_{jk}|\right)}^2}{2{\tau}_j^2}-\frac{{\beta}_j^2+\sum \limits_{k:k\ne j}{\beta}_{jk}^2}{2{\omega}_j^2}\right]. $$
3) Generate \( \frac{1}{\tau_j^2} \)

$$ \frac{1}{\tau_j^2}={\gamma}_j\mid ...\sim InverseGaussian\left(\sqrt{\frac{\lambda_1^2}{{\left(\sum \limits_{k=j+1}^p|{\beta}_{jk}|\right)}^2}},{\lambda}_1^2\right)I\left({\gamma}_j>0\right), $$

where I(⋅) is the indicator function.

4) Generate \( \frac{1}{\omega_j^2} \)

$$ \frac{1}{\omega_j^2}={\varphi}_j\mid ...\sim InverseGaussian\left(\sqrt{\frac{\lambda_2^2}{{\left({\beta}_j\right)}^2+\sum \limits_{k:k\ne j}{\left({\beta}_{jk}\right)}^2}},{\lambda}_2^2\right)I\left({\varphi}_j>0\right), $$

where I(⋅) is the indicator function.

5) Generate \( \lambda_1^2 \) and \( \lambda_2^2 \)

$$ {\displaystyle \begin{array}{l}{\lambda}_1^2\mid ...\sim gamma\left(\frac{K}{2}+r,\frac{1}{2}\sum \limits_{j=1}^p{\tau}_j^2+s\right),\mathrm{K}=\#\mathrm{main}\ \mathrm{effects}+\#\mathrm{interaction}\ \mathrm{effects}\\ {}{\lambda}_2^2\mid ...\sim gamma\left(\frac{H}{2}+r,\frac{1}{2}\sum \limits_{j=1}^p{\omega}_j^2+s\right),\mathrm{H}=2\ast \#\mathrm{main}\ \mathrm{effects}+\#\mathrm{interaction}\ \mathrm{effects}\end{array}} $$

where r = 1 and s = 0.1.
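Steps 3 through 5 can be sketched as one Gibbs sweep over the VANISH hyperparameters. The inverse-Gaussian draw below uses the standard Michael-Schucany-Haas transform; the data structures and guard constants are our illustrative choices, not the paper's code:

```python
import math
import random

def rinvgauss(mu, lam, rng=random):
    """Inverse-Gaussian(mu, lam) draw via the Michael-Schucany-Haas transform."""
    v = rng.gauss(0.0, 1.0) ** 2
    x = mu + (mu * mu * v) / (2.0 * lam) \
        - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * v + (mu * v) ** 2)
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

def gibbs_sweep(beta_main, beta_int, lam1_sq, lam2_sq, r=1.0, s=0.1, rng=random):
    """One sweep of steps 3-5: draw 1/tau_j^2 and 1/omega_j^2 given beta,
    then lam1^2 and lam2^2. beta_int[j] holds the coefficients beta_jk, k > j."""
    p = len(beta_main)
    inv_tau_sq, inv_omega_sq = [], []
    for j in range(p):
        row_l1 = sum(abs(b) for b in beta_int[j]) or 1e-3   # guard: empty row
        row_sq = (beta_main[j] ** 2 + sum(b * b for b in beta_int[j])) or 1e-6
        inv_tau_sq.append(rinvgauss(math.sqrt(lam1_sq) / row_l1, lam1_sq, rng))
        inv_omega_sq.append(rinvgauss(math.sqrt(lam2_sq / row_sq), lam2_sq, rng))
    n_int = sum(len(row) for row in beta_int)
    K, H = p + n_int, 2 * p + n_int
    tau_sq = [1.0 / g for g in inv_tau_sq]
    omega_sq = [1.0 / g for g in inv_omega_sq]
    # gammavariate takes (shape, scale); the full conditionals are stated
    # with a rate parameter, hence the reciprocal second argument.
    new_lam1_sq = rng.gammavariate(K / 2.0 + r, 1.0 / (0.5 * sum(tau_sq) + s))
    new_lam2_sq = rng.gammavariate(H / 2.0 + r, 1.0 / (0.5 * sum(omega_sq) + s))
    return inv_tau_sq, inv_omega_sq, new_lam1_sq, new_lam2_sq
```

The same sweep applies, per segment, in the choice model and verbatim in the Poisson model below.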

1.2 Choice model

1) Generate \( \beta^s \) using a random-walk MH sampler based on the likelihood given by (11) and (12) and the VANISH prior given by:

$$ {\beta}_j^s\mid ...\propto \exp \left[-\frac{{\left(\sum \limits_{k=j+1}^p|{\beta}_{jk}^s|\right)}^2}{2{\left({\tau}_j^s\right)}^2}-\frac{{\left({\beta}_j^s\right)}^2+\sum \limits_{k:k\ne j}{\left({\beta}_{jk}^s\right)}^2}{2{\left({\omega}_j^s\right)}^2}\right]. $$
2) Generate \( \frac{1}{{\left({\tau}_j^s\right)}^2} \)

$$ \frac{1}{{\left({\tau}_j^s\right)}^2}={\gamma}_j^s\mid ...\sim InverseGaussian\left(\sqrt{\frac{{\left({\lambda}_1^s\right)}^2}{{\left(\sum \limits_{k=j+1}^p|{\beta}_{jk}^s|\right)}^2}},{\left({\lambda}_1^s\right)}^2\right)I\left({\gamma}_j^s>0\right), $$

where I(⋅) is the indicator function.

3) Generate \( \frac{1}{{\left({\omega}_j^s\right)}^2} \)

$$ \frac{1}{{\left({\omega}_j^s\right)}^2}={\varphi}_j^s\mid ...\sim InverseGaussian\left(\sqrt{\frac{{\left({\lambda}_2^s\right)}^2}{{\left({\beta}_j^s\right)}^2+\sum \limits_{k:k\ne j}{\left({\beta}_{jk}^s\right)}^2}},{\left({\lambda}_2^s\right)}^2\right)I\left({\varphi}_j^s>0\right), $$

where I(⋅) is the indicator function.

4) Generate \( {\left({\lambda}_1^s\right)}^2 \) and \( {\left({\lambda}_2^s\right)}^2 \)

$$ {\displaystyle \begin{array}{l}{\left({\lambda}_1^s\right)}^2\mid ...\sim gamma\left(\frac{K}{2}+{r}^{lam},\frac{1}{2}\sum \limits_{j=1}^p{\left({\tau}_j^s\right)}^2+{s}^{lam}\right),\mathrm{K}=\#\mathrm{main}\ \mathrm{effects}+\#\mathrm{interaction}\ \mathrm{effects}\\ {}{\left({\lambda}_2^s\right)}^2\mid ...\sim gamma\left(\frac{H}{2}+{r}^{lam},\frac{1}{2}\sum \limits_{j=1}^p{\left({\omega}_j^s\right)}^2+{s}^{lam}\right),\mathrm{H}=2\ast \#\mathrm{main}\ \mathrm{effects}+\#\mathrm{interaction}\ \mathrm{effects}\end{array}} $$

where \( r^{lam}=1 \) and \( s^{lam}=0.1 \).

5) Generate π

  • \( \pi \mid Z,\rho \sim Dir\left[\left({\tilde{\rho}}_1...{\tilde{\rho}}_S\right)\right] \), where \( {\tilde{\rho}}_s={\rho}_s+\sum \limits_{i=1}^nI\left({Z}_i=s\right) \) with prior ρ = (1, ..., 1).

6) Generate \( Z_i \)

  • \( Z_i\mid \theta ,\mu, V,\pi \sim \mathrm{multinomial}\left(1,\left[L{R}_1\left({\beta}^i\right),...,L{R}_S\left({\beta}^i\right)\right]\right) \),

where \( L{R}_l\left({\beta}^i\right)=\frac{\pi_l{L}_l\left({\beta}^i\right)}{\sum \limits_{s=1}^S{\pi}_s{L}_s\left({\beta}^i\right)} \) and \( L_s \) is the likelihood given by (11) and (12) evaluated under the segment-s parameters.
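Steps 5 and 6 are standard finite-mixture updates. A stdlib-only sketch, where the per-segment likelihood values are taken as given and all names are ours:

```python
import random

def draw_pi(z, S, rho=None, rng=random):
    """pi | Z, rho ~ Dirichlet(rho_s + n_s), sampled via normalized gamma draws."""
    rho = rho or [1.0] * S          # flat Dirichlet prior, as in step 5
    counts = [sum(1 for zi in z if zi == s) for s in range(S)]
    g = [rng.gammavariate(rho[s] + counts[s], 1.0) for s in range(S)]
    total = sum(g)
    return [gi / total for gi in g]

def draw_z(lik, pi, rng=random):
    """Z_i ~ multinomial with Pr(Z_i = s) proportional to pi_s * L_s;
    lik[i][s] is unit i's likelihood under segment s's parameters."""
    z = []
    for li in lik:
        w = [p * l for p, l in zip(pi, li)]
        tot = sum(w)
        z.append(rng.choices(range(len(pi)), weights=[wi / tot for wi in w])[0])
    return z
```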

7) Choose a permutation \( {\xi}_t^{+} \) and relabel draws according to \( {\xi}_t^{+} \)

$$ {\xi}_t^{+}=\underset{\xi_t}{\mathrm{argmin}}\sum \limits_{i=1}^n\sum \limits_{s=1}^S{p}_{is}\left({\xi}_t\left({\mu}^t,{V}^t\right)\right)\log \left[\frac{p_{is}\left({\xi}_t\left({\mu}^t,{V}^t\right)\right)}{{\hat{q}}_{is}^{t-1}}\right]. $$
8) Set

$$ {\hat{Q}}^t=\frac{t{\hat{Q}}^{t-1}+P\left({\xi}_t^{+}\left({\mu}^t,{V}^t\right)\right)}{t+1}. $$

1.3 Poisson model

$$ {\theta}_t={x}_t^{sea}{\beta}^{sea}+{x}_t^{imp}{\beta}^{imp}+{x}_t^{pos}{\beta}^{pos}+{x}_t^{cpc}{\beta}^{cpc}+\sum \limits_j{x}_{jt}^{txt}{\beta}^{txt}+\sum \limits_{j<k}{x}_{jt}^{txt}{x}_{kt}^{txt}{\beta}_{jk}^{txt} $$
1) Generate \( \left[{\beta}^{sea},{\beta}^{imp},{\beta}^{pos},{\beta}^{cpc}\right] \) using a random-walk Metropolis-Hastings (MH) sampler based on the likelihood given by:

$$ L\propto \prod \limits_{t=1}^n\frac{e^{-{\lambda}_t}{\lambda}_t^{y_t}}{y_t!},\quad {\lambda}_t=\exp \left({\theta}_t\right). $$
2) Generate \( {\beta}^{txt} \) using a random-walk MH sampler based on the likelihood given by:

$$ L\propto \prod \limits_{t=1}^n\frac{e^{-{\lambda}_t}{\lambda}_t^{y_t}}{y_t!} $$

and the VANISH prior given by:

$$ {\beta}_j\mid ...\propto \exp \left[-\frac{{\left(\sum \limits_{k=j+1}^p|{\beta}_{jk}|\right)}^2}{2{\tau}_j^2}-\frac{{\beta}_j^2+\sum \limits_{k:k\ne j}{\beta}_{jk}^2}{2{\omega}_j^2}\right]. $$
3) Generate \( \frac{1}{\tau_j^2} \)

$$ \frac{1}{\tau_j^2}={\gamma}_j\mid ...\sim InverseGaussian\left(\sqrt{\frac{\lambda_1^2}{{\left(\sum \limits_{k=j+1}^p|{\beta}_{jk}|\right)}^2}},{\lambda}_1^2\right)I\left({\gamma}_j>0\right), $$

where I(⋅) is the indicator function.

4) Generate \( \frac{1}{\omega_j^2} \)

$$ \frac{1}{\omega_j^2}={\varphi}_j\mid ...\sim InverseGaussian\left(\sqrt{\frac{\lambda_2^2}{{\left({\beta}_j\right)}^2+\sum \limits_{k:k\ne j}{\left({\beta}_{jk}\right)}^2}},{\lambda}_2^2\right)I\left({\varphi}_j>0\right), $$

where I(⋅) is the indicator function.

5) Generate \( \lambda_1^2 \) and \( \lambda_2^2 \)

$$ {\displaystyle \begin{array}{l}{\lambda}_1^2\mid ...\sim gamma\left(\frac{K}{2}+r,\frac{1}{2}\sum \limits_{j=1}^p{\tau}_j^2+s\right),\mathrm{K}=\#\mathrm{main}\ \mathrm{effects}+\#\mathrm{interaction}\ \mathrm{effects}\\ {}{\lambda}_2^2\mid ...\sim gamma\left(\frac{H}{2}+r,\frac{1}{2}\sum \limits_{j=1}^p{\omega}_j^2+s\right),\mathrm{H}=2\ast \#\mathrm{main}\ \mathrm{effects}+\#\mathrm{interaction}\ \mathrm{effects}\end{array}} $$

where r = 1 and s = 0.1.
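Steps 1 and 2 rely on a random-walk MH kernel over the Poisson log-likelihood. A single-parameter sketch with a flat prior (the paper's sampler additionally includes the VANISH prior term in the acceptance ratio; the data and tuning values here are illustrative):

```python
import math
import random

def poisson_loglik(beta, x, y):
    """log L = sum_t [ y_t * theta_t - exp(theta_t) - log(y_t!) ],
    with theta_t = x_t * beta and lambda_t = exp(theta_t)."""
    ll = 0.0
    for xt, yt in zip(x, y):
        theta = xt * beta
        ll += yt * theta - math.exp(theta) - math.lgamma(yt + 1)
    return ll

def rw_mh(x, y, n_iter=500, step=0.2, rng=random):
    """Random-walk Metropolis-Hastings for a scalar beta: propose a Gaussian
    step and accept with probability min(1, L(proposal) / L(current))."""
    beta, ll = 0.0, poisson_loglik(0.0, x, y)
    draws = []
    for _ in range(n_iter):
        prop = beta + rng.gauss(0.0, step)
        ll_prop = poisson_loglik(prop, x, y)
        if math.log(rng.random()) < ll_prop - ll:
            beta, ll = prop, ll_prop
        draws.append(beta)
    return draws
```

With a single intercept-only covariate, the chain concentrates around log of the mean count, the Poisson maximum-likelihood value.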


About this article


Cite this article

Rutz, O.J., Sonnier, G.P. VANISH regularization for generalized linear models. Quant Mark Econ 17, 415–437 (2019). https://doi.org/10.1007/s11129-019-09216-4
