Hyper Nonlocal Priors for Variable Selection in Generalized Linear Models

Abstract

We propose two novel hyper nonlocal priors for variable selection in generalized linear models. To obtain these priors, we first derive two new priors for generalized linear models that combine the Fisher information matrix with the Johnson-Rossell moment and inverse moment priors. We then obtain our hyper nonlocal priors from our nonlocal Fisher information priors by assigning hyperpriors to their scale parameters. As a consequence, the hyper nonlocal priors carry less information on the effect sizes than the Fisher information priors, and thus are particularly useful in practice when prior knowledge of the effect sizes is lacking. We develop a Laplace integration procedure to compute posterior model probabilities, and we show that under certain regularity conditions the proposed methods are variable selection consistent. We also show that, when compared to local priors, our hyper nonlocal priors lead to faster accumulation of evidence in favor of a true null hypothesis. Simulation studies that consider binomial, Poisson, and negative binomial regression models indicate that our methods select true models with higher success rates than other existing Bayesian methods. Furthermore, the simulation studies show that our methods yield mean posterior probabilities for the true models that are closer to their empirical success rates. Finally, we illustrate the application of our methods with an analysis of the Pima Indians diabetes dataset.

References

  1. Altomare, D., Consonni, G. and La Rocca, L. (2013). Objective Bayesian search of Gaussian directed acyclic graphical models for ordered variables with non-local priors. Biometrics 69, 2, 478–487.

  2. Alves, M.B., Gamerman, D. and Ferreira, M.A.R. (2010). Transfer functions in dynamic generalized linear models. Stat. Model. 10, 3–40.

  3. Barbieri, M.M. and Berger, J.O. (2004). Optimal predictive model selection. Ann. Statist. 32, 3, 870–897.

  4. Chen, J. and Chen, Z. (2012). Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 22, 555–574.

  5. Chen, K., Hu, I., Ying, Z. et al. (1999a). Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27, 1155–1163.

  6. Chen, M.-H. and Ibrahim, J.G. (2003). Conjugate priors for generalized linear models. Statistica Sinica 13, 461–476.

  7. Chen, M.-H., Ibrahim, J.G. and Kim, S. (2008). Properties and implementation of Jeffreys’s prior in binomial regression models. J. Amer. Statist. Assoc. 103, 1659–1664.

  8. Chen, M.-H., Ibrahim, J.G. and Yiannoutsos, C. (1999b). Prior elicitation, variable selection and Bayesian computation for logistic regression models. J. R. Stat. Soc. Ser. B Stat. Methodol. 61, 223–242.

  9. Chopin, N. and Ridgway, J. (2017). Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science 32, 1, 64–87.

  10. Consonni, G., Forster, J.J. and La Rocca, L. (2013). The whetstone and the alum block: Balanced objective Bayesian comparison of nested models for discrete data. Statistical Science, pp. 398–423.

  11. Dey, D.K., Ghosh, S.K. and Mallick, B.K. (2000). Generalized Linear Models: A Bayesian Perspective. Marcel Dekker, New York.

  12. Fahrmeir, L. and Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Statist. 13, 342–368.

  13. Fahrmeir, L. and Tutz, G. (2013). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer, New York.

  14. Fox, J. and Monette, G. (1992). Generalized collinearity diagnostics. J. Amer. Statist. Assoc. 87, 417, 178–183.

  15. Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1, 1–22.

  16. Hoegh, A., Ferreira, M.A.R. and Leman, S. (2016). Spatiotemporal model fusion: multiscale modelling of civil unrest. J. R. Stat. Soc. Ser. C Appl. Stat. 65, 529–545.

  17. Ibrahim, J.G. and Laud, P.W. (1991). On Bayesian analysis of generalized linear models using Jeffreys’s prior. J. Amer. Statist. Assoc. 86, 981–986.

  18. Johnson, V.E. and Rossell, D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 143–170.

  19. Johnson, V.E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. J. Amer. Statist. Assoc. 107, 649–660.

  20. Kass, R., Tierney, L. and Kadane, J. (1990). The validity of posterior expansions based on Laplace’s method. In Geisser, S., Hodges, J. S., Press, S. J. and Zellner, A. (eds.), Essays in Honor of George A. Barnard, pp. 473–488.

  21. Kass, R.E. and Raftery, A.E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90, 773–795.

  22. Kass, R.E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc. 90, 928–934.

  23. Liang, F., Song, Q. and Yu, K. (2013). Bayesian subset modeling for high-dimensional generalized linear models. J. Amer. Statist. Assoc. 108, 589–606.

  24. Lichman, M. (2013). UCI Machine Learning Repository.

  25. McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd edn. Chapman & Hall/CRC, London.

  26. Nikooienejad, A., Wang, W. and Johnson, V.E. (2016). Bayesian variable selection for binary outcomes in high dimensional genomic studies using non-local priors. Bioinformatics 32, 9, 1338–1345.

  27. Ntzoufras, I., Dellaportas, P. and Forster, J.J. (2003). Bayesian variable and link determination for generalised linear models. J. Statist. Plann. Inference 111, 165–180.

  28. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

  29. Raftery, A.E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83, 251–266.

  30. Raudenbush, S.W., Yang, M.-L. and Yosef, M. (2000). Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation. J. Comput. Graph. Statist. 9, 141–157.

  31. Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.

  32. Rossell, D. and Telesca, D. (2017). Nonlocal priors for high-dimensional estimation. J. Amer. Statist. Assoc. 112, 517, 254–265.

  33. Rossell, D., Telesca, D. and Johnson, V.E. (2013). High-dimensional Bayesian classifiers using non-local priors.

  34. Sabanés Bové, D. and Held, L. (2011). Hyper-g priors for generalized linear models. Bayesian Analysis 6, 387–410.

  35. Sanyal, N. and Ferreira, M.A. (2017). Bayesian wavelet analysis using nonlocal priors with an application to fMRI analysis. Sankhya B 79, 2, 361–388.

  36. Scott, J.G. and Berger, J.O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem. Ann. Statist. 38, 5, 2587–2619.

  37. Scrucca, L. (2013). GA: a package for genetic algorithms in R. J. Stat. Softw. 53, 4, 1–37.

  38. Shin, M., Bhattacharya, A. and Johnson, V.E. (2018). Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Stat. Sin. 28, 2, 1053.

  39. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 1, 267–288.

  40. Tierney, L. and Kadane, J.B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81, 82–86.

  41. Wang, X. and George, E.I. (2007). Adaptive Bayesian criteria in variable selection for generalized linear models. Statistica Sinica 17, 667.

  42. West, M. (1985). Generalized linear models: scale parameters, outlier accommodation and prior distributions. In Bernardo, J., DeGroot, M., Lindley, D. and Smith, A. (eds.), pp. 531–558.

  43. Wu, H.-H., Ferreira, M.A. and Gompper, M.E. (2016). Consistency of hyper-g-prior-based Bayesian variable selection for generalized linear models. Braz. J. Probab. Stat. 30, 4, 691–709.

  44. Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Bernardo, J. M., DeGroot, M. H., Lindley, D. V. and Smith, A. F. M. (eds.), vol 1, pp. 585–603. Valencia University Press, Valencia.

Acknowledgements

The authors thank two anonymous reviewers, the Associate Editor, and the Editor for their insightful comments and suggestions. Tieming Ji was supported by National Science Foundation Award No. 1615789.

Author information

Correspondence to Ho-Hsiang Wu.

Appendices

Appendix

In this appendix, we provide proofs for Proposition 2 (ii) and for the theoretical results in Section 4, namely Theorems 1 and 2 and Corollaries 1, 2, and 3. Throughout the proofs, we use C to denote distinct positive constants.

10.1 Proof of Proposition 2 (ii)

Proof

Without loss of generality, assume that the first \(p_{\textbf{j}}-q\) elements of \(\boldsymbol{\beta}_{\textbf{j}}\) are distinct from each other and from the last q elements. Further, assume that \(\beta _{\textbf {j}, p_{\textbf {j}}-q + 1}=...=\beta _{\textbf {j} p_{\textbf {j}}}\) and \(\beta _{\textbf {j} p_{\textbf {j}}}\rightarrow 0\). Then we have

$$\begin{array}{@{}rcl@{}} \pi(\boldsymbol{\beta}_{\textbf{j}})&=&C\prod\limits_{i = 1}^{p_{\textbf{j}}}|\beta_{\textbf{j} i}|{~}^{-(r + 1)}\left\{v_{4}+\phi\nu n\textbf{tr}\left( \left[\text{diag}\left( \boldsymbol{\beta}_{\textbf{j}}\boldsymbol{\beta}_{\textbf{j}}^{T}\right)\text{diag}\left( \textbf{X}_{\textbf{j}}^{T}\textbf{W}\textbf{X}_{\textbf{j}}\right)\right]^{-1}\right) \right\}^{-\frac{rp_{\textbf{j}}}{2}-v_{3}}\\ &=&C\prod\limits_{i = 1}^{p_{\textbf{j}}}|\beta_{\textbf{j} i}|{~}^{-(r + 1)}\left\{C+Cq\beta_{\textbf{j} p_{\textbf{j}}}^{-2}\right\}^{-\frac{rp_{\textbf{j}}}{2}-v_{3}}\\ &=&C|\beta_{\textbf{j} p_{\textbf{j}}}|{~}^{-q(r + 1)}|\beta_{\textbf{j} p_{\textbf{j}}}|{~}^{rp_{\textbf{j}}+ 2v_{3}}\\ &=&C|\beta_{\textbf{j} p_{\textbf{j}}}|{~}^{rp_{\textbf{j}}+ 2v_{3}-(r + 1)q}. \end{array} $$
(12)

Finally, letting q = 1 completes the proof. □

10.2 Proof of Theorems and Corollaries

To prove Theorem 1, we first establish Lemmas 1, 2, and 3.

Lemma 1.

Let \(Z_1,\ldots,Z_n\) denote non-negative random variables defined with respect to an underlying probability space, and let \(z_1,\ldots,z_n\) denote arbitrary constants. Then \(P\left [{\prod }_{i = 1}^nZ_i> {\prod }_{i = 1}^nz_i\right ]\le {\sum }_{i = 1}^nP\left [Z_i>z_i\right ]\).

Proof

The proof can be found in (19) of Johnson and Rossell (2012). □
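For completeness, the following is a minimal sketch of the underlying union-bound argument, assuming the \(z_i\) are positive (the sketch is ours, not part of the cited proof). If \(Z_i\le z_i\) for every i, then \({\prod }_{i = 1}^nZ_i\le {\prod }_{i = 1}^nz_i\), because all factors are non-negative; hence

$$ \left\{\prod\limits_{i = 1}^{n}Z_{i}>\prod\limits_{i = 1}^{n}z_{i}\right\}\subseteq\bigcup\limits_{i = 1}^{n}\left\{Z_{i}>z_{i}\right\}, $$

and the union bound yields the stated inequality.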

Lemma 2.

Suppose the conditions of Theorem 1 hold and \(\textbf{t}\subset\textbf{j}\). Then for any 𝜖 > 0, \(||(\hat {\alpha }_{\textbf {j}},\hat {\boldsymbol {\beta }}_{\textbf {j} 0})-(\alpha _{\textbf {t}},\boldsymbol {\beta }_{\textbf {t} 0})||=o(\left \{\left [\log (n)\right ]^{1+\epsilon }/n\right \}^{1/2})\), where \((\hat {\alpha }, \hat {\boldsymbol {\beta }})\) denote the maximum quasi-likelihood estimates.

Proof

The proof is analogous to the proof of Theorem 1 in Chen et al. (1999a). □

Lemma 3.

Suppose the conditions of Theorem 1 hold and \(\textbf{t}\subset\textbf{j}\). Then \(\ell (\hat {\alpha _{\textbf {j}}},\hat {\boldsymbol {\beta }}_{\textbf {j} 0})-\ell (\alpha _{\textbf {t}},\boldsymbol {\beta }_{\textbf {t} 0})=o_p\left (\log (n)\right )\).

Proof

From Lemma 2 we have, for an arbitrary unit vector u,

$$\begin{array}{@{}rcl@{}} \ell(\hat{\alpha_{\textbf{j}}},\hat{\boldsymbol{\beta}}_{\textbf{j} 0})-\ell(\alpha_{\textbf{t}},\boldsymbol{\beta}_{\textbf{t} 0}) &\leq& C n^{-1/2} \textbf{u}^{T}\sum\limits_{i = 1}^{n}(1,{\textbf{x}^{T}_{i}})^{T}\left[y_{i}-{\mathrm{d}b(\eta_{\textbf{t} i})}/{\mathrm{d}\eta_{\textbf{t} i}}\right]\\ &&-\frac{1}{2}n^{-1}\textbf{u}^{T}\textbf{F}(\tilde{\alpha},\tilde{\boldsymbol{\beta}})\textbf{u}\\ &\leq& Cn^{-1/2}\textbf{u}^{T}\sum\limits_{i = 1}^{n}(1,{\textbf{x}^{T}_{i}})^{T}\left[y_{i}-{\mathrm{d}b(\eta_{\textbf{t} i})}/{\mathrm{d}\eta_{\textbf{t} i}}\right]. \end{array} $$
(13)

Consequently we have

$$\begin{array}{@{}rcl@{}} &&P\left[\ell(\hat{\alpha_{\textbf{j}}},\hat{\boldsymbol{\beta}}_{\textbf{j} 0})-\ell(\alpha_{\textbf{t}},\boldsymbol{\beta}_{\textbf{t} 0})>C\log(n)\right] \end{array} $$
(14)
$$\begin{array}{@{}rcl@{}} &&\le P\left[\textbf{u}^{T}\sum\limits_{i = 1}^{n}(1,{\textbf{x}^{T}_{i}})^{T}\left[y_{i}-{\mathrm{d}b(\eta_{\textbf{t} i})}/{\mathrm{d}\eta_{\textbf{t} i}}\right]\ge Cn^{1/2}\log(n)\right] \end{array} $$
(15)
$$\begin{array}{@{}rcl@{}} &&\le \sum\limits_{k\le (p + 1)}P\left[s_{k}\ge Cn^{1/2}\log(n)\right]+\sum\limits_{k\le (p + 1)}P\left[-s_{k}\ge Cn^{1/2}\log(n)\right], \end{array} $$
(16)

where \(s_k\) denotes the k-th component of \({\sum }_{i = 1}^n(1,\textbf {x}^T_i)^T\left [y_i-{\mathrm {d}b(\eta _{\textbf {t} i})}/{\mathrm {d}\eta _{\textbf {t} i}}\right ]\).

We first show that the first summation in (16) decreases to zero as n increases. Let \(x_{ik}\) denote the k-th component of the vector \((1,\textbf {x}_i^T)^T\) and let \(\omega_{k,i}\) denote \(f_{kk}^{1/2}x_{ik}\). Then we have

$$\begin{array}{@{}rcl@{}} &&P\left[s_{k}\ge Cn^{1/2}\log(n)\right] \end{array} $$
(17)
$$\begin{array}{@{}rcl@{}} &&\quad \le P\left[\sum\limits_{i = 1}^{n}\omega_{k,i}(y_{i}-{\mathrm{d}b(\eta_{\textbf{t} i})}/{\mathrm{d}\eta_{\textbf{t} i}})\ge C\log(n)\right] \end{array} $$
(18)
$$\begin{array}{@{}rcl@{}} &&\quad \le E \exp\left\{t\left[\sum\limits_{i = 1}^{n}\omega_{k,i}(y_{i}-{\mathrm{d}b(\eta_{\textbf{t} i})}/{\mathrm{d}\eta_{\textbf{t} i}})-C\log(n)\right]\right\} \end{array} $$
(19)
$$\begin{array}{@{}rcl@{}} &&\quad = \exp\left\{\sum\limits_{i = 1}^{n}\left[\frac{b(\eta_{\textbf{t} i} + \omega_{k,i}\phi t)-b(\eta_{\textbf{t} i})}{\phi} -\omega_{k,i}\frac{\mathrm{d}b(\eta_{\textbf{t} i})}{\mathrm{d}\eta_{\textbf{t} i}}t\right]-C\log(n)t\right\} \end{array} $$
(20)
$$\begin{array}{@{}rcl@{}} &&\quad = \exp\left\{\frac{t^{2}}{2}\sum\limits_{i = 1}^{n}\omega^{2}_{k,i}\phi\left( {\mathrm{d}^{2}b(\eta_{\textbf{t} i}+\omega_{k,i}\phi\tilde{t})}/{\mathrm{d}{\eta^{2}_{i}}}\right)-C\log(n)t \right\} \end{array} $$
(21)

for some \(0<\tilde {t}<t\), \(t\in \mathbb {R}\). Let \(t_n = \log(n)\). Then, from conditions C2 and C4, we have \(\max_i|\omega_{k,i}t_n| = o(1)\) and \(\frac {t_n^2}{2}{\sum }_{i = 1}^n\omega ^2_{k,i}\phi \left ({\mathrm {d}^2b(\eta _{\textbf {t} i}+\omega _{k,i}\phi \tilde {t_n})}/{\mathrm {d}\eta ^2_i}\right )=\frac {Ct_n^2}{2}(1+o(1))\). Consequently, we have \(P\left [s_k\ge Cn^{1/2}\log (n)\right ]\le \exp \left [-\frac {Ct_n^2}{2}(1+o(1))\right ]\).
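Since p is fixed under the stated conditions, this bound implies (as a summary of the step just shown) that the first summation in (16) vanishes,

$$ \sum\limits_{k\le (p + 1)}P\left[s_{k}\ge Cn^{1/2}\log(n)\right]\le (p + 1)\exp\left[-\frac{C\left\{\log(n)\right\}^{2}}{2}(1+o(1))\right]\rightarrow 0, \text{ as } n\rightarrow\infty. $$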

To show that the second summation in (16) also decreases to zero as n increases, we can replace \(\omega_{k,i}\) with \(-\omega_{k,i}\) in the above argument, and thus complete the proof. □

Proof of Theorem 1.

Using a second-order Taylor expansion of the log-likelihood about the MLE, we obtain

$$ \ell(\alpha,\boldsymbol{\beta})=\ell(\hat{\alpha},\hat{\boldsymbol{\beta}})-\frac{1}{2}\left[(\alpha,\boldsymbol{\beta})-(\hat{\alpha},\hat{\boldsymbol{\beta}})\right]^{T} \textbf{F}(\hat{\alpha},\hat{\boldsymbol{\beta}})\left[(\alpha,\boldsymbol{\beta})-(\hat{\alpha},\hat{\boldsymbol{\beta}})\right]+O(n^{-1}). $$
(22)

Thus, under the second-order Taylor expansion, we obtain the approximate marginal likelihood function for a generic model \(\textbf{j}\), \(\textbf{j}\supset\textbf{t}\),

$$\begin{array}{@{}rcl@{}} \tilde{m}_{\textbf{j}}(\textbf{y}_{n})&=&Cn^{-\frac{1}{2}}\left|\tau\textbf{F}_{\textbf{j}0}\right|^{\frac{2r + 1}{2}}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})\right\} \int Q^{r}_{\textbf{j}}\exp\\ &&\left( -\frac{\boldsymbol{\beta}_{\textbf{j}}^{T}\textbf{F}_{\textbf{j}0}\boldsymbol{\beta}_{\textbf{j}}}{2\tau}-\frac{(\boldsymbol{\beta}_{\textbf{j}}-\hat{\boldsymbol{\beta}}_{\textbf{j}})^{T} \textbf{F}(\hat{\boldsymbol{\beta}}_{\textbf{j}})(\boldsymbol{\beta}_{\textbf{j}}-\hat{\boldsymbol{\beta}}_{\textbf{j}})}{2}\right)d\boldsymbol{\beta}_{\textbf{j}} \\ &=&Cn^{-\frac{1}{2}}\left|\tau\textbf{F}_{\textbf{j}0}\right|^{\frac{2r + 1}{2}}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})\right\}\left|\tau^{-1}\textbf{F}_{\textbf{j}0}+\textbf{F}(\hat{\boldsymbol{\beta}}_{\textbf{j}})\right|^{-\frac{1}{2}} \\ &&\times E_{1}(Q^{r}_{\textbf{j}})\exp\left\{-\frac{1}{2}\hat{\boldsymbol{\beta}}_{\textbf{j}}^{T} \textbf{F}(\hat{\boldsymbol{\beta}}_{\textbf{j}})\left[\textbf{I}-\left(\textbf{I}+\tau^{-1}\textbf{F}_{\textbf{j}0}\textbf{F}^{-1}(\hat{\boldsymbol{\beta}}_{\textbf{j}})\right)^{-1}\right]\hat{\boldsymbol{\beta}}_{\textbf{j}}\right\}, \end{array} $$
(23)

where \( Q^r_{\textbf {j}}={\prod }_{i = 1}^{p_{\textbf {j}}}(\beta _{\textbf {j} i})^{2r}\) and \(E_{1}(Q^r_{\textbf {j}})\) denotes the expectation of \( Q^r_{\textbf {j}}\) taken with respect to \(\boldsymbol{\beta}_{\textbf{j}}\) following a multivariate normal distribution with mean \(\tilde {\boldsymbol {\beta }}_{\textbf {j}}=\left [\textbf {I}+\tau^{-1}\textbf {F}_{\textbf {j}0}\textbf {F}^{-1}(\hat {\boldsymbol {\beta }}_{\textbf {j}})\right ]^{-1}\hat {\boldsymbol {\beta }}_{\textbf {j}}\) and precision matrix \(\textbf {V}_{\textbf {j}}=\tau ^{-1}\textbf {F}_{\textbf {j}0}+\textbf {F}(\hat {\boldsymbol {\beta }}_{\textbf {j}})\).
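The following single-coordinate Gaussian moment bound, stated here for clarity (it is a standard inequality, not part of the original derivation), is the type of bound applied coordinate-wise in (26) and (38) below: if \(\beta\sim N(\mu,\sigma^2)\), then

$$ E\left[\beta^{2r}\right]=E\left[\left\{\mu-(\mu-\beta)\right\}^{2r}\right]\le 4^{r}\mu^{2r}+ 4^{r}E\left[(\mu-\beta)^{2r}\right]=(2\mu)^{2r}+C\sigma^{2r}. $$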

From conditions C1, C2, and C3, the corresponding precision matrices are all suitably bounded, and thus the Bayes factor between model j and model t is bounded above by

$$ \text{BF}_{\textbf{j}:\textbf{t}}< Cn^{\frac{p_{\textbf{t}}-p_{\textbf{j}}}{2}}\frac{E_{1}(Q^{r}_{\textbf{j}})}{E_{1}(Q^{r}_{\textbf{t}})} \exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right\}. $$
(24)

Next, let \(d = p_{\textbf{j}}-p_{\textbf{t}}\). Then from Lemma 1 we have

$$\begin{array}{@{}rcl@{}} &&P\left( Cn^{-\frac{d}{2}}{E_{1}(Q^{r}_{\textbf{j}})} \exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right\}>Cn^{-d(r+\frac{1}{2})(1-2\epsilon)}\right)\\ &&\le P\left( E_{1}(Q^{r}_{\textbf{j}})>n^{-rd(1-2\epsilon)}\right) + P\left( C\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right\}>n^{d\epsilon}\right)\\ &&\equiv P_{1}+P_{2} \end{array} $$
(25)

for 0 < 𝜖 < 1/2. Note that since \(E_{1}(Q^r_{\textbf {t}})\) converges almost surely to some positive constant, to prove Theorem 1 it suffices to show that both P1 and P2 decrease to zero for sufficiently large n.
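Heuristically, the almost sure convergence of \(E_{1}(Q^{r}_{\textbf{t}})\) can be seen as follows (a sketch of ours, not the formal argument, which assumes \(\tau^{-1}\textbf{F}_{\textbf{t}0}\textbf{F}^{-1}(\hat{\boldsymbol{\beta}}_{\textbf{t}})\rightarrow\textbf{0}\)): the mean \(\tilde{\boldsymbol{\beta}}_{\textbf{t}}\) converges to \(\hat{\boldsymbol{\beta}}_{\textbf{t}}\), and hence to \(\boldsymbol{\beta}_{\textbf{t} 0}\), while the precision \(\textbf{V}_{\textbf{t}}\) grows with n, so the normal distribution concentrates at \(\boldsymbol{\beta}_{\textbf{t} 0}\), whose components are all nonzero, and therefore

$$ E_{1}(Q^{r}_{\textbf{t}})\rightarrow\prod\limits_{i = 1}^{p_{\textbf{t}}}\beta_{\textbf{t} 0, i}^{2r}>0. $$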

We first show that P1 decreases to zero for sufficiently large n. Using the inequality \((a-b)^2<(2a)^2+(2b)^2, \forall a, b \in \mathbb {R}\), we obtain

$$\begin{array}{@{}rcl@{}} E_{1}(Q^{r}_{\textbf{j}})&=&\int\prod\left[\tilde\beta_{\textbf{j} i}-(\tilde\beta_{\textbf{j} i}-\beta_{\textbf{j} i})\right]^{2r}\times \textbf{N}(\boldsymbol{\beta}_{\textbf{j}}|\tilde{\boldsymbol{\beta}}_{\textbf{j}},\textbf{V}_{\textbf{j}}^{-1})d\boldsymbol{\beta}_{\textbf{j}} \\ &<&\prod(2\tilde\beta_{\textbf{j} i})^{2r}+\int\prod4^{r}(\tilde\beta_{\textbf{j} i}-\beta_{\textbf{j} i})^{2r}\times \textbf{N}(\boldsymbol{\beta}_{\textbf{j}}|\tilde{\boldsymbol{\beta}}_{\textbf{j}},\textbf{V}_{\textbf{j}}^{-1})d\boldsymbol{\beta}_{\textbf{j}} \\ &\le&\prod\left\{(2\tilde\beta_{\textbf{j} i})^{2r}+Cv^{-r}_{\textbf{j}}\right\}\\ &=&\prod\limits_{\textbf{j} i\in\textbf{t}}\left\{(2\tilde\beta_{\textbf{j} i})^{2r}+Cv^{-r}_{\textbf{j}}\right\}\times\prod\limits_{\textbf{j} i\notin\textbf{t}}\left\{(2\tilde\beta_{\textbf{j} i})^{2r}+Cv^{-r}_{\textbf{j}}\right\}\\ &\equiv& h_{1}\times g_{1} \end{array} $$
(26)

where \(v_{\textbf {j}}\) denotes the minimal eigenvalue of \(\textbf{V}_{\textbf{j}}\).

Define \(h = (4\triangle)^{2r} + Cn^{-r}\), where \(\triangle\) denotes \(\max_i\{|\beta_{0i}|\}\). Then we have

$$\begin{array}{@{}rcl@{}} P\left( h_{1}>h^{p_{\textbf{t}}}\right)&\le&\sum\limits_{\textbf{j} i\in\textbf{t}}P\left( |\tilde{\beta}_{\textbf{j} i}|>2\bigtriangleup\right)\\ &=&\sum\limits_{\textbf{j} i\in\textbf{t}}P\left( |\tilde{\beta}_{\textbf{j} i}|-|\beta_{0i}|>2\bigtriangleup-|\beta_{0i}|\right)\\ &\le&\sum\limits_{\textbf{j} i\in\textbf{t}}P\left( \vert\tilde{\beta}_{\textbf{j} i}-\beta_{0i}\vert>2\bigtriangleup-|\beta_{0i}|\right). \end{array} $$
(27)

Define \(g=(4n^{-\frac {1}{2}+\epsilon })^{2r}+Cn^{-r}\). Following Lemma 2, the sampling mean of \(\tilde \beta _{\textbf {j} i}\), \(\textbf {j} i\notin \textbf {t}\), is bounded in magnitude by \(n^{-1/2+\epsilon}\). Thus, by replacing \(\triangle\) with \(n^{-1/2+\epsilon}\) in (27), we obtain

$$ P\left( g_{1}>g^{p_{\textbf{j}}-p_{\textbf{t}}}\right)\le\sum\limits_{\textbf{j} i\notin\textbf{t}}P\left( |\tilde{\beta}_{\textbf{j} i}|>2n^{-1/2+\epsilon}\right). $$
(28)

Note that \(h^{p_{\textbf {t}}}g^{p_{\textbf {j}}-p_{\textbf {t}}}=O(n^{-rd(1-2\epsilon )})\). By Lemma 1, Lemma 2, and inequalities (27) and (28), we obtain, as \(n\rightarrow\infty\),

$$ P_{1}\le P\left( h_{1}g_{1}>h^{p_{\textbf{t}}}g^{p_{\textbf{j}}-p_{\textbf{t}}}\right) \le P\left( h_{1}>h^{p_{\textbf{t}}}\right)+P\left( g_{1}>g^{p_{\textbf{j}}-p_{\textbf{t}}}\right) \rightarrow 0. $$
(29)
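As a worked check of the order claim used above (our own arithmetic, based on the definitions of h and g and on 0 < 𝜖 < 1/2),

$$ h^{p_{\textbf{t}}}=\left\{(4\triangle)^{2r}+Cn^{-r}\right\}^{p_{\textbf{t}}}=O(1), \qquad g^{p_{\textbf{j}}-p_{\textbf{t}}}=\left\{4^{2r}n^{-r(1-2\epsilon)}+Cn^{-r}\right\}^{d}=O\left(n^{-rd(1-2\epsilon)}\right), $$

so that \(h^{p_{\textbf{t}}}g^{p_{\textbf{j}}-p_{\textbf{t}}}=O(n^{-rd(1-2\epsilon )})\), as claimed.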

Next, we show that P2 decreases to zero for sufficiently large n. First, we have

$$ P_{2}=P\left( \left[\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right]>Cd\epsilon\log(n)\right). $$
(30)

Then, by Lemma 3, P2 decreases to zero for sufficiently large n and arbitrarily small 𝜖, and thus we arrive at the conclusion. □

Proof of Corollary 1.

Given a generic model \(\textbf{j}\) such that \(\textbf{j}\supset\textbf{t}\), using a second-order Taylor expansion of the log-likelihood about the MLE and differentiating the unnormalized posterior density of \(\boldsymbol{\beta}_{\textbf{j}}\), we obtain the score function for finding \(\hat {\boldsymbol {\beta }}_{\textbf {j},{\text {PM}}}\). Setting the score function to zero gives that \(\hat \beta _{\textbf {j} i,{\text {PM}}}\) must satisfy \( C\hat \beta _{\textbf {j} i,{\text {PM}}}^{-3}-C\hat \beta _{\textbf {j} i,{\text {PM}}}^{-1}- {\sum }_{j = 1}^{p_{\textbf {j}}}F_{ij}(\hat {\boldsymbol {\beta }}_{\textbf {j}})(\hat \beta _{\textbf {j} j,{\text {PM}}}-\hat \beta _{\textbf {j} j})= 0 \) for all i, where \(F_{ij}(\hat {\boldsymbol {\beta }}_{\textbf {j}})\) is the (i,j) element of the Fisher information evaluated at \(\boldsymbol {\beta }=\hat {\boldsymbol {\beta }}_{\textbf {j}}\). Rearranging terms leads to

$$ \sum\limits_{j = 1}^{p_{\textbf{j}}}\frac{F_{ij}(\hat{\boldsymbol{\beta}}_{\textbf{j}})}{n}n\hat\beta_{\textbf{j} i,{\text{PM}}}^{3}(\hat\beta_{\textbf{j} j,{\text{PM}}}-\hat\beta_{\textbf{j} j})=C\hat\beta_{\textbf{j} i,{\text{PM}}}^{2}+C. $$
(31)

Note that from the conditions C2 and C3, we have \(F_{ij}(\hat {\boldsymbol {\beta }}_{\textbf {j}})/n=O(1)\), for all i, j. Following Lemma 2, we have \(\hat \beta _{\textbf {j} j,{\text {PM}}}\xrightarrow {P}\hat \beta _{\textbf {j} j}\). Incorporating all these facts into (31) gives that \(\hat \beta _{\textbf {j} i,{\text {PM}}}\) must satisfy

$$ n(\hat\beta_{\textbf{j} i,{\text{PM}}}-\hat\beta_{\textbf{j} i})\hat\beta_{\textbf{j} i,{\text{PM}}}^{3}\xrightarrow{p}C, $$
(32)

for all i. Using techniques identical to the proof of Equation (36) in the supplemental material of Rossell and Telesca (2017), one can show that the convergence rate of \(\hat \beta _{\textbf {j} i,{\text {PM}}}\) is \(O_p(n^{-1/4})\) from zero or \(O_p(n^{-1})\) from its MLE. Here we directly cite these results for simplicity.
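A short heuristic calculation (ours, not the cited proof) illustrates how these two rates follow from (32). If \(\hat\beta_{\textbf{j} i}\) converges to a nonzero constant, then \(\hat\beta_{\textbf{j} i,{\text{PM}}}^{3}\) is bounded away from zero and

$$ \hat\beta_{\textbf{j} i,{\text{PM}}}-\hat\beta_{\textbf{j} i}\approx\frac{C}{n\hat\beta_{\textbf{j} i,{\text{PM}}}^{3}}=O_{p}(n^{-1}), $$

whereas if \(\hat\beta_{\textbf{j} i}=O_p(n^{-1/2})\), then \(\hat\beta_{\textbf{j} i}\) is negligible relative to \(\hat\beta_{\textbf{j} i,{\text{PM}}}\), so \(n\hat\beta_{\textbf{j} i,{\text{PM}}}^{4}\approx C\) and \(\hat\beta_{\textbf{j} i,{\text{PM}}}=O_p(n^{-1/4})\).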

Next, we investigate the convergence rate of the Bayes factor. First note that the use of the Laplace method yields an asymptotic approximation to the marginal likelihood of the data, denoted by \(\tilde {m}_{\textbf {j}}(\textbf {y}_n)\), such that

$$ \tilde{m}_{\textbf{j}}(\textbf{y}_{n})=C n^{-\frac{p_{\textbf{j}}+ 1}{2}}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j},{\text{PM}}})\right\}\prod\limits_{i = 1}^{p_{\textbf{j}}}\vert\hat\beta_{\textbf{j} i,{\text{PM}}}\vert^{-(r + 1)}\exp\left\{-\sum\limits_{i = 1}^{p_{\textbf{j}}}{\hat\beta_{\textbf{j} i,{\text{PM}}}^{-2}}\right\}. $$
(33)

Finally, incorporating the convergence rate of \(\hat \beta _{\textbf {j} i,{\text {PM}}}\) into (33) leads to the fact that, with probability approaching one,

$$ \text{BF}_{\textbf{j}:\textbf{t}}<C \exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j},{\text{PM}}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t},{\text{PM}}})\right\}\exp\left\{-n^{1/2}\right\}, $$
(34)

which completes the proof. □

Proof of Corollary 2.

Given a model \(\textbf{j}\) such that \(\textbf{j}\supset\textbf{t}\), the approximate marginal likelihood function is

$$ \tilde{m}_{\textbf{j}}(\textbf{y}_{n})=Cn^{-\frac{1}{2}}\left|\textbf{F}_{\textbf{j}0}\right|{~}^{\frac{2r + 1}{2}}\left| \textbf{F}(\hat{\boldsymbol{\beta}}_{\textbf{j}})\right|{~}^{-\frac{1}{2}}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})\right\} E_{2}(Q^{r}_{\textbf{j}}), $$
(35)

where the quantity

$$ E_{2}(Q^{r}_{\textbf{j}})=E\left\{ Q^{r}_{\textbf{j}}\times\left( 1+\frac{\boldsymbol{\beta}_{\textbf{j}}^{T}\textbf{F}_{\textbf{j}0}\boldsymbol{\beta}_{\textbf{j}}}{v_{2}}\right)^{-\frac{(2r + 1)p_{\textbf{j}}+v_{1}}{2}}\right\}, $$
(36)

denotes the expectation taken with respect to \(\boldsymbol{\beta}_{\textbf{j}}\) following a multivariate normal distribution with mean \(\hat {\boldsymbol {\beta }}_{\textbf {j}}\) and precision matrix \(\boldsymbol {\Psi }_{\textbf {j}}=\textbf {F}(\hat {\boldsymbol {\beta }}_{\textbf {j}})\).

We have the Bayes factor between model j and model t bounded by

$$ \frac{\tilde{m}_{\textbf{j}}(\textbf{y}_{n})}{\tilde{m}_{\textbf{t}}(\textbf{y}_{n})}< Cn^{\frac{p_{\textbf{t}}-p_{\textbf{j}}}{2}}\frac{E_{2}(Q^{r}_{\textbf{j}})}{E_{2}(Q^{r}_{\textbf{t}})} \exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right\}. $$
(37)

Now, using an argument similar to that used to derive (26), we obtain

$$\begin{array}{@{}rcl@{}} E_{2}(Q^{r}_{\textbf{j}})&\le&\int\prod\left[\hat\beta_{\textbf{j} i}-(\hat\beta_{\textbf{j} i}-\beta_{\textbf{j} i})\right]^{2r}\times \textbf{N}(\boldsymbol{\beta}_{\textbf{j}}|\hat{\boldsymbol{\beta}}_{\textbf{j}},{\boldsymbol{\Psi}}_{\textbf{j}}^{-1})d\boldsymbol{\beta}_{\textbf{j}} \\ &<&\prod(2\hat\beta_{\textbf{j} i})^{2r}+\int\prod4^{r}(\hat\beta_{\textbf{j} i}-\beta_{\textbf{j} i})^{2r}\times \textbf{N}(\boldsymbol{\beta}_{\textbf{j}}|\hat{\boldsymbol{\beta}}_{\textbf{j}},{\boldsymbol{\Psi}}_{\textbf{j}}^{-1})d\boldsymbol{\beta}_{\textbf{j}} \\ &\le&\prod\left\{(2\hat\beta_{\textbf{j} i})^{2r}+C\psi^{-r}_{\textbf{j} i}\right\}\\ &=&\prod\limits_{\textbf{j} i\in\textbf{t}}\left\{(2\hat\beta_{\textbf{j} i})^{2r}+C\psi^{-r}_{\textbf{j} i}\right\}\times\prod\limits_{\textbf{j} i\notin\textbf{t}}\left\{(2\hat\beta_{\textbf{j} i})^{2r}+C\psi^{-r}_{\textbf{j} i}\right\}\\ &\equiv& h_{2}\times g_{2}, \end{array} $$
(38)

Then, by arguments similar to those used to derive (27) and (28), we obtain

$$ P\left( h_{2}>h^{p_{\textbf{t}}}\right)\rightarrow 0, \text{ as } n\rightarrow\infty, \text{ and } P\left( g_{2}>g^{p_{\textbf{j}}-p_{\textbf{t}}}\right)\rightarrow 0, \text{ as } n\rightarrow\infty. $$
(39)

With inequalities (39), we have

$$\begin{array}{@{}rcl@{}} &&P\left( Cn^{-\frac{d}{2}}{E_{2}(Q^{r}_{\textbf{j}})} \exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right\}>Cn^{-rd(1+\epsilon)}\right)\\ &&\quad\le P\left( E_{2}(Q^{r}_{\textbf{j}})>n^{-d(r+\frac{1}{2})(1-2\epsilon)}\right) + P\left( C\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\right\}>n^{d\epsilon}\right)\\ &&\quad\rightarrow 0, \text{ as } n\rightarrow\infty, \end{array} $$
(40)

where \(d = p_{\textbf{j}}-p_{\textbf{t}}\), which completes the proof. □

Proof of Corollary 3.

The proof is analogous to the proof of Corollary 1. Given a model \(\textbf{j}\) such that \(\textbf{j}\supset\textbf{t}\), expanding \((\alpha_{\textbf{j}}, \boldsymbol{\beta}_{\textbf{j}})\) around the maximum likelihood estimates \((\hat {\alpha }_{\textbf {j}},\hat {\boldsymbol {\beta }}_{\textbf {j}})\) and differentiating the unnormalized posterior density of \(\boldsymbol{\beta}_{\textbf{j}}\) yield the score function and the fact that \(\hat \beta _{\textbf {j} i,{\text {PM}}}\) must satisfy

$$ C\left[1+C\sum\limits_{j = 1}^{p_{\textbf{j}}}\hat\beta^{-2}_{\textbf{j} j,{\text{PM}}}\right]^{-1}\hat\beta_{\textbf{j} i,{\text{PM}}}^{-3}-C\hat\beta_{\textbf{j} i,{\text{PM}}}^{-1}- \sum\limits_{j = 1}^{p_{\textbf{j}}}F_{ij}(\hat{\boldsymbol{\beta}}_{\textbf{j}})(\hat\beta_{\textbf{j} j,{\text{PM}}}-\hat\beta_{\textbf{j} j})= 0, $$
(41)

for all i, where \(F_{ij}(\hat {\boldsymbol {\beta }}_{\textbf {j}})\) is the (i,j) element of the Fisher information evaluated at \(\boldsymbol {\beta }=\hat {\boldsymbol {\beta }}_{\textbf {j}}\). By rearranging terms, and using the facts that \(F_{ij}(\hat {\boldsymbol {\beta }}_{\textbf {j}})/n=O(1)\) for all i, j, and that \(\hat \beta _{\textbf {j} j,{\text {PM}}}\xrightarrow {P}\hat \beta _{\textbf {j} j}\), we can further simplify (41) to

$$ \left[1+C\sum\limits_{j = 1}^{p_{\textbf{j}}}\hat\beta^{-2}_{\textbf{j} j,{\text{PM}}}\right]^{-1}\hat\beta_{\textbf{j} i,{\text{PM}}}^{-2}- n\hat\beta_{\textbf{j} i,{\text{PM}}}(\hat\beta_{\textbf{j} i,{\text{PM}}}-\hat\beta_{\textbf{j} i})\xrightarrow{p}C, $$
(42)

for all i. Note that the behavior of \(\left [1+C{\sum }_{j = 1}^{p_{\textbf {j}}}\hat \beta ^{-2}_{\textbf {j} j,{\text {PM}}}\right ]\) in (42) is dominated by \(\underset {i}{\min }\left \{|\hat \beta _{\textbf {j} i,{\text {PM}}}|\right \}\). Without loss of generality, one can assume \(\underset {i}{\min }\left \{|\hat \beta _{\textbf {j} i,{\text {PM}}}|\right \}=O(n^{-C})\), and thus \(\left [1+C{\sum }_{i = 1}^{p_{\textbf {j}}}\hat \beta ^{-2}_{\textbf {j} i,{\text {PM}}}\right ]^{-1}=O(n^{-2C})\). Now if \(\hat \beta _{\textbf {j} i,{\text {PM}}}=O(n^{-C})\), then \(\left [1+C{\sum }_{j = 1}^{p_{\textbf {j}}}\hat \beta ^{-2}_{\textbf {j} j,{\text {PM}}}\right ]^{-1}\hat \beta _{\textbf {j} i,{\text {PM}}}^{-2}=C\). Otherwise, if \(\hat \beta _{\textbf {j} i,{\text {PM}}}=O(n^{-C+\epsilon })\) for some 𝜖 > 0, then \(\left [1+C{\sum }_{j = 1}^{p_{\textbf {j}}}\hat \beta ^{-2}_{\textbf {j} j,{\text {PM}}}\right ]^{-1}\hat \beta _{\textbf {j} i,{\text {PM}}}^{-2}=O(n^{-2\epsilon })\). Finally, combining all the facts discussed above, we can simplify (42) to

$$ n(\hat\beta_{\textbf{j} i,{\text{PM}}}-\hat\beta_{\textbf{j} i})\hat\beta_{\textbf{j} i,{\text{PM}}}\xrightarrow{p}C. $$
(43)

Note that the two roots of the quadratic equation (43) are given by \(-\frac {1}{2}\left \{\sqrt {\hat \beta ^2_{\textbf {j} i}+ 4c/n}-\sqrt {\hat \beta ^2_{\textbf {j} i}}\right \}\) and \(\hat \beta _{\textbf {j} i}+\frac {1}{2}\left \{\sqrt {\hat \beta ^2_{\textbf {j} i}+ 4c/n}-\sqrt {\hat \beta ^2_{\textbf {j} i}}\right \}\). Using a first-order Taylor expansion of the square root function, we obtain that the first root is of order \(O_p(c/n){\hat \beta ^{-1}_{\textbf {j} i}}\) and the second root is of order \(\hat \beta _{\textbf {j} i}+O_p(c/n){\hat \beta ^{-1}_{\textbf {j} i}}\), which implies that the convergence rate of \(\hat \beta _{\textbf {j} i,{\text {PM}}}\) is \(O_p(n^{-1/2})\) from zero or \(O_p(n^{-1})\) from its MLE.
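A brief check of the Taylor step above (our own arithmetic): when \(\hat\beta_{\textbf{j} i}\) is bounded away from zero,

$$ \sqrt{\hat\beta^{2}_{\textbf{j} i}+ 4c/n}-\sqrt{\hat\beta^{2}_{\textbf{j} i}}=\frac{2c/n}{|\hat\beta_{\textbf{j} i}|}+o_{p}(n^{-1}), $$

so the two roots lie within \(O_p(n^{-1})\) of zero and of \(\hat\beta_{\textbf{j} i}\), respectively; when instead \(\hat\beta_{\textbf{j} i}=O_p(n^{-1/2})\), both square roots are \(O_p(n^{-1/2})\), which gives the \(O_p(n^{-1/2})\) rate from zero stated above.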

Next, we investigate the convergence rate of the Bayes factor. First note that the use of the Laplace method yields an asymptotic approximation to the marginal likelihood of the data, denoted by \(\tilde {m}_{\textbf {j}}(\textbf {y}_n)\), such that

$$ \tilde{m}_{\textbf{j}}(\textbf{y}_{n})=C n^{-\frac{p_{\textbf{j}}+ 1}{2}}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j},{\text{PM}}})\right\}\pi(\hat{\boldsymbol{\beta}}_{\textbf{j},{\text{PM}}}). $$
(44)

Second, from Proposition 2, we learn that the behavior of the hpimGLM prior is dominated by the minimal component of \(\boldsymbol{\beta}_{\textbf{j}}\), that is, \(\pi (\hat {\boldsymbol {\beta }}_{\textbf {j},{\text {PM}}})=\)\(O\left (\left [\underset {i}{\min }|\hat \beta _{\textbf {j} i,{\text {PM}}}|\right ]^{r(p_{\textbf {j}}-1)+ 2v_3-1}\right )\). Then with \(\underset {i}{\min }|\hat \beta _{\textbf {j} i,{\text {PM}}}|=O(n^{-1/2})\), we have \(\pi (\hat {\boldsymbol {\beta }}_{\textbf {j},{\text {PM}}})=O(n^{-\left \{r(p_{\textbf {j}}-1)+ 2v_3-1\right \}/2})\). Similarly, we have \(\pi (\hat {\boldsymbol {\beta }}_{\textbf {t},{\text {PM}}})=O(1)\), as \(\hat {\boldsymbol {\beta }}_{\textbf {t},{\text {PM}}}\rightarrow C\).

Finally, combining all the information into (44), we arrive at

$$\begin{array}{@{}rcl@{}} \text{BF}_{\textbf{j}:\textbf{t}}&<&C n^{-d/2}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j},{\text{PM}}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t},{\text{PM}}})\right\}n^{-\left\{r(p_{\textbf{j}}-1)+ 2v_{3}-1\right\}/2}\\ &=&Cn^{-\left\{(r + 1)(p_{\textbf{j}}-1)+ 2v_{3}-p_{\textbf{t}}\right\}/2}\exp\left\{\ell(\hat{\alpha}_{\textbf{j}},\hat{\boldsymbol{\beta}}_{\textbf{j},{\text{PM}}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t},{\text{PM}}})\right\}, \end{array} $$
(45)

which completes the proof. □

Proof of Theorem 2.

The proofs for the pmGLM, pimGLM, hpmGLM, and hpimGLM priors are all similar; thus, we only show the proof for the pmGLM prior. Without loss of generality, we assume that all models have equal prior probability. Thus, we have

$$ P(\textbf{t}|\textbf{y}_{n})= \left[ 1 +{\sum}_{\textbf{t}\subset\textbf{k}}\frac{m_{\textbf{k}}(\textbf{y}_{n})}{m_{\textbf{t}}(\textbf{y}_{n})} +{\sum}_{\textbf{t}\not\subset\textbf{k}}\frac{m_{\textbf{k}}(\textbf{y}_{n})}{m_{\textbf{t}}(\textbf{y}_{n})} \right]^{-1}. $$
(46)

First, from Theorem 1, we have, for sufficiently large n, w.p.a.1, \( {\sum }_{\textbf {t}\subset \textbf {k}}\frac {\tilde {m}_{\textbf {k}}(\textbf {y}_n)}{\tilde {m}_{\textbf {t}}(\textbf {y}_n)} \le {\sum }_{\textbf {t}\subset \textbf {k}}(Cn)^{-r(p_{\textbf {k}}-p_{\textbf {t}})} \le (e^{\frac {p_{\textbf {k}}-p_{\textbf {t}}}{Cn}}-1). \) Next, under conditions C1, C2, and C3, standard asymptotic theory implies the following result,

$$ P(\ell(\hat{\alpha}_{\textbf{k}},\hat{\boldsymbol{\beta}}_{\textbf{k}})-\ell(\hat{\alpha}_{\textbf{t}},\hat{\boldsymbol{\beta}}_{\textbf{t}})\le-Cn)\xrightarrow{p}1, \quad \forall \textbf{k}\not\supset\textbf{t}. $$
(47)

Consequently, we have, for sufficiently large n, w.p.a.1, \( {\sum }_{\textbf {t}\not \subset \textbf {k}}\frac {\tilde {m}_{\textbf {k}}(\textbf {y}_n)}{\tilde {m}_{\textbf {t}}(\textbf {y}_n)} \le {\sum }_{\textbf {t}\not \subset \textbf {k}}C\exp \left (-Cn\right ). \) Finally, putting these two summations together in (46), the use of Slutsky's theorem leads to \(\tilde {p}(\textbf {t}|\textbf {y}_n)\xrightarrow {p} 1\). The claimed result follows because the Laplace method is valid with error rate \(O(n^{-1})\). □

Cite this article

Wu, H., Ferreira, M.A.R., Elkhouly, M. et al. Hyper Nonlocal Priors for Variable Selection in Generalized Linear Models. Sankhya A 82, 147–185 (2020). https://doi.org/10.1007/s13171-018-0151-9

Keywords and phrases.

  • Bayesian variable selection
  • Generalized linear model
  • Nonlocal prior
  • Scale mixtures
  • Variable selection consistency.

AMS (2000) subject classification.

  • 62F15