Abstract
Marketers increasingly face modeling situations in which the number of independent variables is large, possibly approaching or exceeding the number of observations. In this setting, covariate selection and model estimation present significant challenges to usual methods of inference. These challenges are exacerbated when covariate interactions are of interest. Most extant regularization methods make no distinction between main and interaction terms in estimation. An exception is the linear VANISH model, a regularization method for models with interaction terms that ensures proper model hierarchy by enforcing the heredity principle. We derive the generalized VANISH model for nonlinear responses, including the duration, discrete choice, and count models widely used in marketing applications. In addition, we propose a VANISH model that allows us to account for unobserved consumer heterogeneity via a mixture approach. In three empirical applications we demonstrate that our proposed model outperforms main-effects models as well as other methods that include interaction terms.
Notes
For example, say a model has 100 parameters. When adding only first-order interactions, one needs to estimate 100 main effects and 100 × 99/2 = 4,950 interaction effects. Even if enough observations are available to estimate the main effects, adding the interaction effects will almost certainly result in a “large p, small n” problem.
VANISH refers to Variable Selection using Adaptive Nonlinear Interaction Structures in High Dimensions (Radchenko and James 2010).
LASSO refers to Least Absolute Shrinkage and Selection Operator.
Details on the sampler can be found in the Appendix. More details on the derivations of the full conditional distributions of the VANISH parameters can be found in the Web Appendix.
The LASSO prior is \( \pi \left(\beta |\sigma \right)=\prod \limits_{j=1}^p\frac{\lambda }{2\sigma}\exp \left(-\frac{\lambda \mid {\beta}_j\mid }{\sigma}\right) \). Note that the LASSO model only requires one tuning parameter, λ, and one set of latent parameters, τ. Estimation proceeds similarly to estimation using a VANISH prior as detailed in the Appendix. Also see Park and Casella (2008) for a full Bayesian treatment of the linear LASSO.
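The Laplace form of this prior can be illustrated through its scale-mixture-of-normals representation (Park & Casella 2008): drawing \( \tau_j^2 \) from an exponential distribution and \( \beta_j \mid \tau_j^2 \) from a normal distribution reproduces the LASSO prior above. A minimal sketch (hypothetical Python, not the authors' code; the function name is ours):

```python
import numpy as np

def sample_lasso_prior(p, lam, sigma, rng):
    # Scale-mixture representation of the Laplace (LASSO) prior:
    # tau_j^2 ~ Exponential(rate = lam^2 / 2), so scale = 2 / lam^2,
    # beta_j | tau_j^2 ~ N(0, sigma^2 * tau_j^2).
    tau2 = rng.exponential(scale=2.0 / lam**2, size=p)
    return rng.normal(0.0, sigma * np.sqrt(tau2))

draws = sample_lasso_prior(5, lam=1.0, sigma=1.0, rng=np.random.default_rng(0))
```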
We estimate the MART version of each of our models (hazard, choice and count) using the gbm package in R. The gbm algorithm is a boosting algorithm which creates a sequence of simple trees where each successive tree is built for predicting the residuals of the preceding tree. Thus, at each step of the algorithm a simple partitioning of the data is determined and the deviations of the observed values from the respective means (residuals for each partition) are computed. The next tree will then be fitted to those residuals to find another partition that will further reduce the residual (error) variance given the preceding sequence of trees.
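The residual-fitting logic of this boosting sequence can be sketched with a toy example (hypothetical Python using one-split "stumps" under squared-error loss; the gbm package itself is far more general):

```python
import numpy as np

def fit_stump(x, r):
    """Find the single split on x that best predicts residuals r (a one-split tree)."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, ml, mr = best
    return lambda z: np.where(z <= s, ml, mr)

def boost(x, y, n_trees=50, lr=0.1):
    """Each successive stump is fitted to the residuals of the preceding ensemble."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_trees):
        r = y - pred              # residuals of the current sequence of trees
        tree = fit_stump(x, r)    # next tree fitted to those residuals
        pred += lr * tree(x)      # shrunken update reduces residual variance
    return pred

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
fit = boost(x, y)
```

Each added stump strictly reduces the training residual variance as long as the learning rate lies in (0, 2), which mirrors the gbm behavior described above.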
We thank Maytal Saar-Tsechansky of UT Austin for providing access to the data and Samuel Blazek for providing research assistance in processing the data.
Gilbride et al. (2006) suggest a heterogeneous variable selection approach applied to conjoint data that shrinks individual-level response coefficients either towards zero with a very small prior variance or towards a normal prior distribution, dependent upon individual-level discrete variable selection parameters. Their model performs variable selection at the attribute level rather than the partworth level (i.e., the brand attribute is either selected or not, versus each level of the brand attribute). While this approach allows the researcher to model attribute attendance heterogeneously at the individual level, it is not, per se, an approach well suited for “large p, small n” problems. Their approach does not include a way to ensure that the number of selected parameters (i.e., the parameters not shrunk towards zero) does not exceed the number of observations. It is also silent on whether and how to handle interaction terms.
We thank the editor for bringing this point to our attention.
References
Agarwal, A., Hosanagar, K., & Smith, M. D. (2011). Location, location, location: An analysis of profitability of position in online advertising markets. Journal of Marketing Research, 48(6), 1057–1073.
Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 35(3), 384–389.
Bakshy, E., Hofman, J.M., Mason, W.A., & Watts, D.J. (2011). Everyone’s an influencer: Quantifying influence on Twitter. Fourth ACM International Conference on Web Search and Data Mining.
Bumbaca, F., Misra, S., & Rossi, P. (2017). Distributed Markov chain Monte Carlo for Bayesian hierarchical models. University of California, Irvine, Working Paper.
Cheng, J., Adamic, L., Dow, A., Kleinberg, J., & Leskovec, J. (2014). Can cascades be predicted? Proc. 23rd International World Wide Web Conference.
Ebbes, P., Papies, D., & Van Heerde, H. J. (2011). The sense and non-sense of holdout sample validation in the presence of endogeneity. Marketing Science, 30(6), 1115–1122.
Ghose, A., & Yang, S. (2009). An empirical analysis of sponsored search in online advertising. Management Science, 55(10), 1605–1622.
Ghose, A., Ipeirotis, P. G., & Li, B. (2014). Examining the impact of ranking on consumer behavior and search engine revenue. Management Science, 60(7), 1632–1654.
Gilbride, T. J., Allenby, G. M., & Brazell, J. D. (2006). Models for heterogeneous variable selection. Journal of Marketing Research, 43(3), 420–430.
Hong, L., Dan, O., & Davison, B. D. (2011). Predicting popular messages in Twitter. WWW 2011. Hyderabad, India.
Naik, P., Wedel, M., Bacon, L., Bodapati, A., Bradlow, E., Kamakura, W., Kreulen, J., Lenk, P., Madigan, D., & Montgomery, A. (2008). Challenges and opportunities in high-dimensional choice data analyses. Marketing Letters, 19(3), 201–213.
Nelder, J. A. (1998). The selection of terms in response-surface models—how strong is the weak-heredity principle? The American Statistician, 52(4), 315–318.
Park, T., & Casella, G. (2008). The Bayesian LASSO. Journal of the American Statistical Association, 103, 681–686.
Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. The American Statistician, 44(1), 26–30.
Pennebaker, J. W. (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Press.
Petrović, S., Osborne, M., & Lavrenko, V. (2011). RT to win! Predicting message propagation in Twitter. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
Radchenko, P., & James, G. M. (2010). Variable selection using adaptive non-linear interaction structures in high dimensions. Journal of the American Statistical Association, 105, 1541–1553.
Rutz, O. J., Bucklin, R. E., & Sonnier, G. P. (2012). A latent instrumental variables approach to modeling keyword conversion in paid search advertising. Journal of Marketing Research, 49(3), 306–319.
Rutz, O. J., Sonnier, G. P., & Trusov, M. (2017). A new method to aid copy testing of paid search text advertisements. Journal of Marketing Research, 54(6), 885–900.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64, 583–639.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62(4), 795–809.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(2), 267–288.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of ill-posed problems. Washington: Winston & Sons.
Yoganarasimhan, H. (2018). Search Personalization using Machine Learning. Forthcoming at Management Science.
Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583–1611.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
Appendix
We detail the steps in our VANISH regularization approach for the VANISH Hazard, VANISH Choice and VANISH Poisson models. For more information on the derivation of the full conditional distributions for the VANISH parameters (τ, ω, λ1, λ2) please see the Web Appendix.
1.1 Hazard model
1) Generate α and γ using a random-walk Metropolis-Hastings (MH) sampler based on the likelihood given by:
where \( d_{it} \) is 1 if the life event occurs and zero otherwise.
2) Generate β using a random-walk MH sampler based on the likelihood given by:
and the VANISH prior given by:
3) Generate \( \frac{1}{\tau^2} \)
where I(⋅) is the indicator function.
4) Generate \( \frac{1}{\omega^2} \)
where I(⋅) is the indicator function.
5) Generate \( \lambda_1^2 \) and \( \lambda_2^2 \)
where r = 1 and s = 0.1.
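Steps 1) and 2) rely on random-walk MH updates. A generic sketch of one such update, applied here to a toy target rather than the hazard likelihood (hypothetical Python, not the authors' code; names are ours):

```python
import numpy as np

def rw_mh_step(beta, log_post, step, rng):
    """One random-walk Metropolis-Hastings update for a parameter block."""
    prop = beta + rng.normal(0.0, step, size=beta.shape)  # symmetric Gaussian proposal
    # With a symmetric proposal, the acceptance ratio reduces to the posterior ratio.
    if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
        return prop
    return beta

# Toy target: a standard normal log-density stands in for likelihood x VANISH prior.
log_post = lambda b: -0.5 * np.sum(b**2)
rng = np.random.default_rng(0)
beta = np.zeros(2)
draws = []
for _ in range(2000):
    beta = rw_mh_step(beta, log_post, step=1.0, rng=rng)
    draws.append(beta)
draws = np.array(draws)
```

In the actual sampler, `log_post` would evaluate the hazard likelihood above plus the log of the VANISH prior for the block being updated.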
1.2 Choice model
1) Generate \( \beta^s \) using a random-walk MH sampler based on the likelihood given by (11) and (12) and the VANISH prior given by:
2) Generate \( \frac{1}{{\left({\tau}^s\right)}^2} \)
where I(⋅) is the indicator function.
3) Generate \( \frac{1}{{\left({\omega}^s\right)}^2} \)
where I(⋅) is the indicator function.
4) Generate \( {\left({\lambda}_1^s\right)}^2 \) and \( {\left({\lambda}_2^s\right)}^2 \)
where \( r_{\lambda} = 1 \) and \( s_{\lambda} = 0.1 \).
5) Generate π
\( \pi \mid Z,\rho \sim Dir\left[\left({\tilde{\rho}}_1...{\tilde{\rho}}_S\right)\right] \), where \( {\tilde{\rho}}_s={\rho}_s+\sum \limits_{i=1}^nI\left({Z}_i=s\right) \) with prior ρ = (1, ..., 1).
6) Generate \( Z_i \)
\( Z_i \mid \theta, \mu, V, \pi \sim \mathrm{multinomial}\left(1,\left[L{R}_1\left({\beta}^i\right),...,L{R}_S\left({\beta}^i\right)\right]\right) \),
where \( L{R}_l\left({\beta}^i\right)=\frac{\pi_l L_l\left({\beta}^i\right)}{\sum \limits_{s=1}^S{\pi}_s L_s\left({\beta}^i\right)} \) and \( L_s \) is the likelihood given by (11) and (12) evaluated at the segment-s parameters.
7) Choose a permutation \( {\xi}_t^{+} \) and relabel the draws according to \( {\xi}_t^{+} \)
8) Set
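Steps 5) and 6) are standard finite-mixture full-conditional updates. A minimal sketch (hypothetical Python, not the authors' code; the segment log-likelihoods are evaluated on the log scale for numerical stability):

```python
import numpy as np

def update_memberships(loglik, pi, rng):
    """Draw each Z_i from its multinomial full conditional,
    P(Z_i = s) proportional to pi_s * L_s(beta^i)."""
    # loglik: (n, S) array with loglik[i, s] = log L_s(beta^i)
    w = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * pi
    w /= w.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(pi), p=wi) for wi in w])

def update_weights(Z, S, rho, rng):
    """Draw pi from its Dirichlet full conditional: prior rho plus segment counts."""
    counts = np.bincount(Z, minlength=S)
    return rng.dirichlet(rho + counts)

rng = np.random.default_rng(0)
loglik = np.array([[0.0, -50.0]] * 10)  # toy data: every unit strongly favors segment 1
Z = update_memberships(loglik, pi=np.array([0.5, 0.5]), rng=rng)
pi = update_weights(Z, S=2, rho=np.ones(2), rng=rng)
```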
1.3 Poisson model
1) Generate \( \left[\beta^{sea}\ \beta^{imp}\ \beta^{pos}\ \beta^{cpc}\right] \) using a random-walk Metropolis-Hastings (MH) sampler based on the likelihood given by:
2) Generate \( \beta^{txt} \) using a random-walk MH sampler based on the likelihood given by:
and the VANISH prior given by:
3) Generate \( \frac{1}{\tau^2} \)
where I(⋅) is the indicator function.
4) Generate \( \frac{1}{\omega^2} \)
where I(⋅) is the indicator function.
5) Generate \( \lambda_1^2 \) and \( \lambda_2^2 \)
where r = 1 and s = 0.1.
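If, as in Park and Casella (2008), r and s are the shape and rate of a Gamma prior on \( \lambda^2 \), the update in step 5) is conjugate: \( \lambda^2 \mid \tau \sim \mathrm{Gamma}\left(r + p,\ s + \sum_j \tau_j^2/2\right) \). A sketch under that assumption (hypothetical Python, not the authors' code):

```python
import numpy as np

def update_lambda2(tau2, r=1.0, s=0.1, rng=None):
    """Gamma full conditional for lambda^2 under a Gamma(r, s) prior,
    as in the Bayesian LASSO: Gamma(r + p, s + sum(tau_j^2) / 2)."""
    if rng is None:
        rng = np.random.default_rng()
    shape = r + len(tau2)
    rate = s + tau2.sum() / 2.0
    # numpy's gamma is parameterized by shape and scale (= 1 / rate)
    return rng.gamma(shape, 1.0 / rate)

rng = np.random.default_rng(0)
lam2 = update_lambda2(np.array([0.5, 1.0, 2.0]), rng=rng)
```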
Cite this article
Rutz, O.J., Sonnier, G.P. VANISH regularization for generalized linear models. Quant Mark Econ 17, 415–437 (2019). https://doi.org/10.1007/s11129-019-09216-4