Variable selection and estimation using a continuous approximation to the $L_0$ penalty

Wang, Yanxin; Fan, Qibin; Zhu, Li

doi:10.1007/s10463-016-0588-3

Published: 19 October 2016

Volume 70, pages 191–214, (2018)
Annals of the Institute of Statistical Mathematics

Abstract

Variable selection problems are typically addressed under the regularization framework. In this paper, we propose an exponential-type penalty, called the EXP penalty, that very closely resembles the $L_0$ penalty. The EXP penalized least squares estimator is shown to consistently select the correct model and to be asymptotically normal, provided the number of variables grows more slowly than the number of observations. EXP is efficiently implemented using a coordinate descent algorithm. Furthermore, we propose a modified BIC tuning parameter selection method for EXP and show that it consistently identifies the correct model, while allowing the number of variables to diverge. Simulation results and a data example show that the EXP procedure performs very well in a variety of settings.
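To make the comparison with the $L_0$ penalty concrete, the following minimal sketch evaluates an exponential penalty of the form $p_{\lambda ,a}(\theta )=\lambda (1-\mathrm{e}^{-\theta /a})$ for $\theta \ge 0$. This closed form is an assumption: it is inferred from the derivative $p'_{\lambda ,a}(\theta )=(\lambda /a)\mathrm{e}^{-\theta /a}$ used in the Appendix together with $p_{\lambda ,a}(0)=0$. The function names and the numerical values are illustrative only.

```python
import numpy as np

def exp_penalty(beta, lam, a):
    """Assumed EXP penalty: lam * (1 - exp(-|beta| / a))."""
    return lam * (1.0 - np.exp(-np.abs(beta) / a))

def exp_penalty_deriv(theta, lam, a):
    """Derivative in theta = |beta| >= 0, matching the form used in the Appendix."""
    return (lam / a) * np.exp(-np.asarray(theta, dtype=float) / a)

def l0_penalty(beta, lam):
    """L0 penalty: lam times the indicator that a coefficient is nonzero."""
    return lam * (np.asarray(beta) != 0).astype(float)

beta = np.array([0.0, 0.05, 0.5, 2.0])

# A small a makes the exponential penalty nearly indistinguishable from L0;
# a large a gives a much smoother, less L0-like penalty.
gap_small_a = np.max(np.abs(exp_penalty(beta, 1.0, 0.001) - l0_penalty(beta, 1.0)))
gap_large_a = np.max(np.abs(exp_penalty(beta, 1.0, 1.0) - l0_penalty(beta, 1.0)))

# Central finite difference as a sanity check on the derivative.
h = 1e-6
fd = (exp_penalty(0.5 + h, 1.0, 0.3) - exp_penalty(0.5 - h, 1.0, 0.3)) / (2 * h)
```

Under this assumed form, $a$ controls how closely the penalty tracks the $L_0$ indicator, while $\lambda $ sets its height.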

Author information

Authors and Affiliations

School of Science, Ningbo University of Technology, Ningbo, 315211, China
Yanxin Wang
Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK
Yanxin Wang
School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
Qibin Fan
School of Applied Mathematics, Xiamen University of Technology, Xiamen, 361024, China
Li Zhu

Corresponding author

Correspondence to Yanxin Wang.

Additional information

This work was supported by the K. C. Wong Education Foundation, Hong Kong, the Program of the National Natural Science Foundation of China (No. 61179039), and the Project of Education of Zhejiang Province (No. Y201533324). The authors gratefully acknowledge the support of the K. C. Wong Education Foundation, Hong Kong. The authors are also grateful to the editor, the associate editor, and the anonymous referees for their constructive and helpful comments.

Appendix

Proof of Theorem 1. Let $\alpha _n=\sqrt{p\sigma ^2/n}$ and fix $r \in (0,1)$. To prove the theorem, it suffices to show that if $C > 0$ is large enough, then

$$\begin{aligned} Q_n(\beta ^*)< \mathop {\mathrm{inf}}_{\Vert \mu \Vert =C}Q_n(\beta ^*+\alpha _n\mu ) \end{aligned}$$

holds for all n sufficiently large, with probability at least $1- r$. Define $D_n(\mu )= Q_n(\beta ^*+\alpha _n\mu )-Q_n(\beta ^*)$ and note that

$$\begin{aligned} D_n(\mu )= & {} \frac{1}{2n}(\alpha _n^2{\Vert \mathbf{X }\mu \Vert }^2-2\alpha _n\varepsilon ^\mathrm{T}\mathbf{X }\mu ) +\sum _{j=1}^{p}\{p_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)-p_{\lambda ,a}(|\beta ^*_j|)\}\\~~~\ge & {} \frac{1}{2n}(\alpha _n^2{\Vert \mathbf{X }\mu \Vert }^2-2\alpha _n\varepsilon ^\mathrm{T}\mathbf{X }\mu ) +\sum _{j\in K(\mu )}\{p_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)-p_{\lambda ,a}(|\beta ^*_j|)\}, \end{aligned}$$

where $K(\mu )=\{j; p_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)-p_{\lambda ,a}(|\beta ^*_j|)<0\}$. The fact that $p_{\lambda ,a}$ is concave on $[0,\infty )$ implies that

$$\begin{aligned}&p_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)-p_{\lambda ,a}(|\beta ^*_j|) \ge p'_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)(|\beta ^*_j+\alpha _n\mu _j|-|\beta ^*_j|) \\&\quad \ge p'_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)(-\alpha _n|\mu _j|) =-\frac{\lambda \alpha _n|\mu _j|}{a}\mathrm{e}^{-\frac{|\beta ^*_j+\alpha _n\mu _j|}{a}} \end{aligned}$$

when $n$ is sufficiently large.

Condition (B) implies that

$$\begin{aligned} \mathrm{e}^{-\frac{|\beta ^*_j+\alpha _n\mu _j|}{a}}\le \mathrm{e}^{-\frac{\rho }{a}}. \end{aligned}$$

Thus, for n big enough,

$$\begin{aligned} D_n(\mu )\ge & {} \frac{1}{2n}(\alpha _n^2{\Vert \mathbf{X }\mu \Vert }^2-2\alpha _n\varepsilon ^\mathrm{T}\mathbf{X }\mu )-\frac{Cp\lambda \alpha _n}{a}\mathrm{e}^{-\frac{\rho }{a}}. \end{aligned}$$

(15)

By (D),

$$\begin{aligned} \frac{1}{2n}\alpha _n^2{\Vert \mathbf{X }\mu \Vert }^2\ge \frac{\lambda _\mathrm{min}}{2}C^2\alpha _n^2. \end{aligned}$$

(16)

On the other hand, (D) implies

$$\begin{aligned} \frac{1}{n}\alpha _n{|\varepsilon ^\mathrm{T}\mathbf{X }\mu |} \le \frac{C\alpha _n}{\sqrt{n}}\Vert \frac{1}{\sqrt{n}}\mathbf{X }^\mathrm{T}\varepsilon \Vert =O_P(C\alpha _n^2). \end{aligned}$$

(17)

Furthermore, (C) and (B) imply

$$\begin{aligned} \frac{Cp\lambda \alpha _n}{a}\mathrm{e}^{-\frac{\rho }{a}}=o(C\alpha _n^2). \end{aligned}$$

(18)

From (15)–(18), we conclude that if $C > 0$ is large enough, then $\inf _{\Vert \mu \Vert =C}D_n(\mu )>0$ holds for all $n$ sufficiently large, with probability at least $1-r$. This proves Theorem 1. $\square $
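The concavity step in the proof above can be checked numerically. The sketch below assumes the closed form $p_{\lambda ,a}(\theta )=\lambda (1-\mathrm{e}^{-\theta /a})$, inferred from the derivative appearing in the proof, and verifies on random draws that $p_{\lambda ,a}(|\beta ^*_j+\alpha _n\mu _j|)-p_{\lambda ,a}(|\beta ^*_j|)\ge -\frac{\lambda \alpha _n|\mu _j|}{a}\mathrm{e}^{-|\beta ^*_j+\alpha _n\mu _j|/a}$. The parameter values are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def p(theta, lam, a):
    # Assumed closed form of the EXP penalty for theta >= 0.
    return lam * (1.0 - np.exp(-theta / a))

rng = np.random.default_rng(0)
lam, a, alpha_n = 0.3, 0.1, 0.05      # illustrative values only
beta_star = rng.normal(size=1000)     # stand-ins for the true coefficients
mu = rng.normal(size=1000)            # arbitrary perturbation directions

shifted = np.abs(beta_star + alpha_n * mu)

# Left-hand side: change in the penalty caused by the perturbation.
lhs = p(shifted, lam, a) - p(np.abs(beta_star), lam, a)

# Right-hand side: tangent-line bound from concavity, combined with
# |b + alpha*mu| - |b| >= -alpha*|mu| and a nonnegative derivative.
rhs = -(lam * alpha_n * np.abs(mu) / a) * np.exp(-shifted / a)

bound_holds = bool(np.all(lhs >= rhs - 1e-12))
```

For this assumed penalty the bound holds for every draw, since it only uses concavity on $[0,\infty )$ and the reverse triangle inequality.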

To prove Theorem 2, we first show, via the following lemma, that the EXP penalized estimator possesses the sparsity property.

Lemma 1

Assume that (A)–(D) hold, and fix $C > 0$. Then

$$\begin{aligned} \lim _{n\rightarrow \infty }P\left[ \mathop {\arg \min }\limits _{\Vert \beta -\beta ^*\Vert \le C \sqrt{{p\sigma ^2}/n}}Q_n(\beta )\subseteq \left\{ \beta \in R^p; \beta _{A^c}=0 \right\} \right] =1, \end{aligned}$$

where $A^c = \{1, \ldots , p\} {\setminus } A$ is the complement of A in $\{1, \ldots , p\}$.

Proof

Suppose that $\beta \in R^p$ and that $\Vert \beta -\beta ^*\Vert \le C\sqrt{{p\sigma ^2}/n}$. Define $\tilde{\beta }\in R^p$ by $\tilde{\beta }_{A^c}=0$ and $\tilde{\beta }_{A}=\beta _{A}$. Similar to the proof of Theorem 1, let

$$\begin{aligned} D_n(\beta ,\tilde{\beta })=Q_n(\beta )-Q_n(\tilde{\beta }), \end{aligned}$$

where $Q_n(\beta )$ is defined in (7). Then

$$\begin{aligned}&D_n(\beta ,\tilde{\beta }) \nonumber \\&\quad = \frac{1}{2n}{\Vert \mathbf y -\mathbf{X }\beta \Vert }^2-\frac{1}{2n}{\Vert \mathbf y -\mathbf{X }\tilde{\beta }\Vert }^2+\sum _{j\in A^c} p_{\lambda ,a}(|\beta _j|) \nonumber \\&\quad = \frac{1}{2n}{\Vert \mathbf y -\mathbf{X }\tilde{\beta }-\mathbf{X }(\beta -\tilde{\beta })\Vert }^2-\frac{1}{2n}{\Vert \mathbf y -\mathbf{X }\tilde{\beta }\Vert }^2 +\sum _{j\in A^c} p_{\lambda ,a}(|\beta _j|) \nonumber \\&\quad = \frac{1}{2n}{(\beta -\tilde{\beta })}^\mathrm{T}\mathbf{X }^\mathrm{T}\mathbf{X }(\beta -\tilde{\beta })-\frac{1}{n}{(\beta -\tilde{\beta })}^\mathrm{T}\mathbf{X }^\mathrm{T}(\mathbf y -\mathbf{X }\tilde{\beta }) +\sum _{j\in A^c} p_{\lambda ,a}(|\beta _j|) \nonumber \\&\quad =O_p(\Vert \beta -\tilde{\beta }\Vert \sqrt{{p\sigma ^2}/n})+\sum _{j\in A^c} p_{\lambda ,a}(|\beta _j|). \end{aligned}$$

(19)

On the other hand, since the EXP penalty is concave on $[0,\infty )$,

$$\begin{aligned} p_{\lambda ,a}(|\beta _j|)\ge & {} p'_{\lambda ,a}(|\beta _j|)|\beta _j| = \frac{\lambda }{a}\mathrm{e}^{-\frac{|\beta _j|}{a}}|\beta _j| \ge \frac{\lambda }{a}\mathrm{e}^{-\frac{C \sqrt{{p\sigma ^2}/n}}{a}}|\beta _j|. \end{aligned}$$

Thus,

$$\begin{aligned} \sum _{j\in A^c}p_{\lambda ,a}(|\beta _j|)\ge \frac{\lambda }{a}\mathrm{e}^{-\frac{C \sqrt{{p\sigma ^2}/n}}{a}}\Vert \beta -\tilde{\beta }\Vert . \end{aligned}$$

(20)

By (C), it is clear that

$$\begin{aligned} \mathop {\lim \inf }\limits _{n\rightarrow \infty }\left( \frac{\lambda }{a}\mathrm{e}^{-\frac{C \sqrt{{p\sigma ^2}/n}}{a}}\right) >0 \end{aligned}$$

and $\lambda \sqrt{n/{(p\sigma ^2)}}\rightarrow \infty $. Combining these observations with (19) and (20) gives $D_n(\beta ,\tilde{\beta })>0$ with probability tending to 1, as $n\rightarrow \infty $. The result follows. $\square $
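Inequality (20) combines the concavity bound $p_{\lambda ,a}(\theta )\ge p'_{\lambda ,a}(\theta )\theta $ with the fact that the $\ell _1$ norm dominates the Euclidean norm. A minimal numerical sketch, assuming the closed form $p_{\lambda ,a}(\theta )=\lambda (1-\mathrm{e}^{-\theta /a})$ and using a hypothetical `radius` in place of $C\sqrt{p\sigma ^2/n}$:

```python
import numpy as np

def p(theta, lam, a):
    # Assumed closed form of the EXP penalty for theta >= 0.
    return lam * (1.0 - np.exp(-theta / a))

rng = np.random.default_rng(3)
lam, a, radius = 0.2, 0.5, 0.1   # radius stands in for C * sqrt(p * sigma^2 / n)

inequality_20_holds = True
for _ in range(200):
    # Coefficients on A^c, rescaled so that their Euclidean norm is at most radius.
    b = rng.normal(size=20)
    b *= radius * rng.uniform(0.01, 1.0) / np.linalg.norm(b)

    penalty_sum = p(np.abs(b), lam, a).sum()
    lower = (lam / a) * np.exp(-radius / a) * np.linalg.norm(b)
    inequality_20_holds = inequality_20_holds and bool(penalty_sum >= lower - 1e-12)
```

Here `np.linalg.norm(b)` plays the role of $\Vert \beta -\tilde{\beta }\Vert $, because $\beta $ and $\tilde{\beta }$ differ only on $A^c$.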

Proof of Theorem 2. Taken together, Theorem 1 and Lemma 1 imply that there exists a sequence of local minima ${\hat{\beta }}$ of (7) such that $\Vert {\hat{\beta }}-\beta ^*\Vert =O_P(\sqrt{{p\sigma ^2}/n})$ and ${\hat{\beta }}_{A^c}=0$. Part (i) of the theorem follows immediately.

To prove part (ii), observe that on the event $\{j; {\hat{\beta }}_j\ne 0\}=A$, we must have

$$\begin{aligned} {\hat{\beta }}_A=\beta ^*_A+{(\mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1}\mathbf{X }_A^\mathrm{T}\varepsilon -{(n^{-1}\mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1}p'_A, \end{aligned}$$

where $p'_A={(p'_{\lambda ,a}({\hat{\beta }}_j))}_{j\in A}$. It follows that

$$\begin{aligned}&\sqrt{n}B_n{(n^{-1}\mathbf{X }_A^\mathrm{T}\mathbf{X }_A/\sigma ^2)}^{1/2}({\hat{\beta }}_A-\beta ^*_A)\\&\quad = B_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}\mathbf{X }_A^\mathrm{T}\varepsilon -nB_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}p'_A, \end{aligned}$$

whenever $\{j; {\hat{\beta }}_j\ne 0\}=A$. Now note that conditions (A)–(D) imply

$$\begin{aligned} \Vert nB_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}p'_A\Vert =O_P(\sqrt{np/{\sigma ^2}}\frac{\lambda }{a}\mathrm{e}^{-\frac{\rho }{a}})=o_P(1). \end{aligned}$$

Thus,

$$\begin{aligned} \sqrt{n}B_n{(n^{-1}\mathbf{X }_A^\mathrm{T}\mathbf{X }_A/\sigma ^2)}^{1/2}({\hat{\beta }}_A-\beta ^*_A)=B_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}\mathbf{X }_A^\mathrm{T}\varepsilon +o_P(1). \end{aligned}$$

To complete the proof of (ii), we use the Lindeberg–Feller central limit theorem to show that

$$\begin{aligned} B_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}\mathbf{X }_A^\mathrm{T}\varepsilon \rightarrow N(0,G), \end{aligned}$$

(21)

in distribution. Observe that

$$\begin{aligned} B_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}\mathbf{X }_A^\mathrm{T}\varepsilon =\sum _{i=1}^n\omega _{i,n}, \end{aligned}$$

where $\omega _{i,n}= B_n{(\sigma ^2 \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}x_{i,A}\varepsilon _i$.

Fix $\delta _0 > 0$ and let $\eta _{i,n}=x_{i,A}^\mathrm{T}{(\mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}B_n^\mathrm{T}B_n{( \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1/2}x_{i,A}$. Then

$$\begin{aligned}&E[{\Vert \omega _{i,n}\Vert }^2; {\Vert \omega _{i,n}\Vert }^2>\delta _0] \\&\quad = \eta _{i,n}E[\varepsilon ^2_i/{\sigma ^2}; \eta _{i,n}\varepsilon ^2_i/{\sigma ^2}>\delta _0] \\&\quad \le \eta _{i,n}E{({|\varepsilon _i/\sigma |}^{2+\delta })}^{2/{(2+\delta )}}P{(\eta _{i,n}\varepsilon ^2_i/{\sigma ^2}>\delta _0)}^{\delta /{(2+\delta )}}\\&\quad \le \eta _{i,n}^{1+\delta /{(2+\delta )}}\delta _0^{-1}E{({|\varepsilon _i/\sigma |}^{2+\delta })}^{2/{(2+\delta )}}. \end{aligned}$$

Since $\sum _{i=1}^n\eta _{i,n}=tr(B_n^\mathrm{T}B_n)\rightarrow tr(G)< \infty $ and since (E) implies

$$\begin{aligned} \max _{1\le i\le n}\eta _{i,n}\le {\lambda _\mathrm{min}(n^{-1}\mathbf{X }^\mathrm{T}\mathbf{X })}^{-1}\lambda _\mathrm{max}(B_n^\mathrm{T}B_n)\max _{1\le i\le n}\frac{1}{n}\sum _{j=1}^px_{ij}^2\rightarrow 0, \end{aligned}$$

we must have

$$\begin{aligned}&\sum _{i=1}^nE[{\Vert \omega _{i,n}\Vert }^2; {\Vert \omega _{i,n}\Vert }^2>\delta _0] \nonumber \\&\quad \le \delta _0^{-1}E{({|\varepsilon _i/\sigma |}^{2+\delta })}^{2/{(2+\delta )}}\sum _{i=1}^n\eta _{i,n}^{1+\delta /{(2+\delta )}} \nonumber \\&\quad \le \delta _0^{-1}E{({|\varepsilon _i/\sigma |}^{2+\delta })}^{2/{(2+\delta )}}tr(B_n^\mathrm{T}B_n)\max _{1\le i\le n}\eta _{i,n}^{\delta /{(2+\delta )}} \nonumber \\&\quad \rightarrow 0. \end{aligned}$$

Thus, the Lindeberg condition is satisfied and (21) holds. $\square $
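The two facts about $\eta _{i,n}$ used in the Lindeberg argument, namely $\sum _{i=1}^n\eta _{i,n}=tr(B_n^\mathrm{T}B_n)$ and the eigenvalue bound on $\max _i\eta _{i,n}$, can be illustrated numerically. The sketch below uses a simulated Gaussian design and an arbitrary matrix $B_n$; both are assumptions made purely for illustration, and the design is restricted to the columns in $A$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p0, q = 500, 4, 2                # illustrative sizes; q is the number of rows of B_n
X_A = rng.normal(size=(n, p0))      # simulated design restricted to the active set A
B_n = rng.normal(size=(q, p0))      # arbitrary stand-in for the matrix B_n

# Symmetric inverse square root of X_A^T X_A via its eigendecomposition.
gram = X_A.T @ X_A
evals, evecs = np.linalg.eigh(gram)
gram_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

# eta_i = x_i^T (X_A^T X_A)^{-1/2} B_n^T B_n (X_A^T X_A)^{-1/2} x_i for each row x_i.
M = gram_inv_sqrt @ B_n.T @ B_n @ gram_inv_sqrt
eta = np.einsum("ij,jk,ik->i", X_A, M, X_A)

trace_BtB = np.trace(B_n.T @ B_n)

# Upper bound on max_i eta_i: lambda_max(B^T B) / lambda_min(gram / n) * max_i ||x_i||^2 / n.
upper = (np.linalg.eigvalsh(B_n.T @ B_n).max()
         / np.linalg.eigvalsh(gram / n).min()
         * (X_A ** 2).sum(axis=1).max() / n)
```

In this simulated setting each $\eta _{i,n}$ is a small share of the total $tr(B_n^\mathrm{T}B_n)$, which is the behaviour the Lindeberg condition requires.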

Proof of Theorem 3. Suppose we are on the event $\{j; {\hat{\beta }}^*_j\ne 0\}=A$. The first order optimality conditions for (7) imply that

$$\begin{aligned} {\hat{\beta }}^*_A={( \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1} \mathbf{X }_A^\mathrm{T}\mathbf y -n{( \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1}p'_A({\hat{\beta }}^*), \end{aligned}$$

where $p'_A(\beta )={(p'_{\lambda ,a}(\beta _j))}_{j\in A}$. Thus,

$$\begin{aligned}&{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}^*\Vert }^2 \\&\quad =\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_A{( \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1} \mathbf{X }_A^\mathrm{T}\}\varepsilon +n^2{p'_A({\hat{\beta }}^*)}^\mathrm{T}{( \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1}p'_A({\hat{\beta }}^*) \\&\quad = \varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_A{( \mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1} \mathbf{X }_A^\mathrm{T}\}\varepsilon +o_P(\sigma ^2). \end{aligned}$$
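The first equality in the display above is an exact algebraic identity: the cross term vanishes because $\mathbf{I }-\mathbf{X }_A{(\mathbf{X }_A^\mathrm{T}\mathbf{X }_A)}^{-1}\mathbf{X }_A^\mathrm{T}$ annihilates the columns of $\mathbf{X }_A$. The following sketch checks it on simulated data, with an arbitrary small vector `v` standing in for $p'_A({\hat{\beta }}^*)$; the data-generating values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p0 = 200, 3
X_A = rng.normal(size=(n, p0))             # simulated active-set design
beta_star_A = np.array([1.5, -2.0, 0.8])   # illustrative true coefficients
eps = rng.normal(size=n)
y = X_A @ beta_star_A + eps

v = rng.uniform(0.0, 0.01, size=p0)        # stand-in for the penalty derivative vector
gram_inv = np.linalg.inv(X_A.T @ X_A)

# Estimator implied by the first-order conditions on the active set.
beta_hat_A = gram_inv @ X_A.T @ y - n * gram_inv @ v

# Residual sum of squares computed directly and via the decomposition.
proj = X_A @ gram_inv @ X_A.T
rss = np.sum((y - X_A @ beta_hat_A) ** 2)
decomposed = eps @ (np.eye(n) - proj) @ eps + n ** 2 * v @ gram_inv @ v
```

The second equality in the display then only requires the quadratic form in `v` to be $o_P(\sigma ^2)$, which is where the conditions on $\lambda $ and $a$ enter.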

Now let ${\hat{\beta }}={\hat{\beta }}(\lambda ,a)$ be a local minimizer of (7) with $(\lambda ,a)\in \Omega $ and let ${\hat{A}}=\{j; {\hat{\beta }}_j\ne 0\}$. Note that

$$\begin{aligned}&{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2 \\&=\mathbf{y }^\mathrm{T} \{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}{} \mathbf y +n^2{p'_{{\hat{A}}}({{\hat{\beta }}})}^\mathrm{T}{( \mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1}p '_{{\hat{A}}}({{\hat{\beta }}}) \\&= {(\mathbf{X }_{A{\setminus }{\hat{A}}}\beta ^*_{A{\setminus }{\hat{A}}}+\varepsilon )}^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}(\mathbf{X }_{A{\setminus }{\hat{A}}}\beta ^*_{A{\setminus }{\hat{A}}}+\varepsilon ) \\&\quad + \ n^2{p'_{{\hat{A}}}({{\hat{\beta }}})}^\mathrm{T}{( \mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1}p'_{{\hat{A}}} ({{\hat{\beta }}}) \\&= {(\beta ^*_{A{\setminus }{\hat{A}}})}^\mathrm{T}{\mathbf{X }_{A{\setminus }{\hat{A}}}}^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\mathbf{X }_{A{\setminus }{\hat{A}}}\beta ^*_{A{\setminus }{\hat{A}}} \\&\quad + \ 2\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\mathbf{X }_{A{\setminus }{\hat{A}}}\beta ^*_{A{\setminus }{\hat{A}}} \\&\quad + \ \varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\varepsilon \\&\quad + \ n^2{p'_{{\hat{A}}}({{\hat{\beta }}})}^\mathrm{T}{( \mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1}p'_{{\hat{A}}} ({{\hat{\beta }}}). \end{aligned}$$

Thus, if $A{\setminus }{\hat{A}}\ne \Phi $, then

$$\begin{aligned}&{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2 \\&\quad \ge {\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}^*\Vert }^2+{(\beta ^*_{A{\setminus }{\hat{A}}})}^\mathrm{T}{\mathbf{X }_{A{\setminus }{\hat{A}}}}^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\mathbf{X }_{A{\setminus }{\hat{A}}}\\&\qquad \beta ^*_{A{\setminus }{\hat{A}}}+2\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\mathbf{X }_{A{\setminus }{\hat{A}}}\beta ^*_{A{\setminus }{\hat{A}}}+O_P(p\sigma ^2) \\&\quad \ge {\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}^*\Vert }^2+nr\rho ^2+O_P(\sigma \rho \sqrt{n})+O_P(p\sigma ^2) \\&\quad = {\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}^*\Vert }^2+nr\rho ^2(1+o_P(1)) \end{aligned}$$

where $0< r < \lambda _\mathrm{min}(n^{-1}\mathbf{X }^\mathrm{T}\mathbf{X })$ is a positive constant. Furthermore, whenever ${\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2-{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}^*\Vert }^2>0$, we have

$$\begin{aligned}&\mathrm {MBIC}({\hat{\beta }})-\mathrm {MBIC}({{\hat{\beta }}}^*)\\&\quad =\log \left( \frac{{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2}{{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}^*\Vert }^2}\right) +\log \left( \frac{n-p_0}{n-\hat{p}_0}\right) +\frac{C_n\log (n)}{n}(\hat{p}_0-p_0) \\&\quad \ge 1-\frac{{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}^*\Vert }^2}{{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2}+\log \left( \frac{n-p_0}{n-\hat{p}_0}\right) +\frac{C_n\log (n)}{n}(\hat{p}_0-p_0) \\&\quad \ge \frac{1}{{\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2}\left[ \left( 1-\frac{2 C_np\log (n)}{n}\right) {\Vert \mathbf y -\mathbf{X }{{\hat{\beta }}}\Vert }^2-{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}^*\Vert }^2 \right] \\&\quad \ge \frac{1}{{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}\Vert }^2}[O_P(\sigma ^2C_np\log (n))+nr\rho ^2(1+o_P(1))], \end{aligned}$$

where $\hat{p}_0 = |{\hat{A}}|$. By Condition (B’), it follows that

$$\begin{aligned} \lim _{n\rightarrow \infty }P\left\{ \inf \{\mathrm {MBIC}({\hat{\beta }}); A{\setminus } {\hat{A}}\ne \Phi \}> \mathrm {MBIC}({\hat{\beta }}^*) \right\} = 1. \end{aligned}$$

(22)

It remains to consider ${\hat{\beta }}$ for which $A$ is a proper subset of ${\hat{A}}$. Suppose that $A \subset {\hat{A}}$. Then

$$\begin{aligned} {\Vert \mathbf y -\mathbf{X }{\hat{\beta }}\Vert }^2 =\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\varepsilon +n^2{p'_{{\hat{A}}}({\hat{\beta }})}^\mathrm{T}{( \mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1}p '_{{\hat{A}}}({\hat{\beta }}) \end{aligned}$$

and

$$\begin{aligned} \log \left( \frac{{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}\Vert }^2}{{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}^*\Vert }^2}\right) \ge \log \left( \frac{\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\varepsilon }{{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}^*\Vert }^2}\right) . \end{aligned}$$

Since

$$\begin{aligned}&\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\varepsilon -{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}^*\Vert }^2 \\&\quad = \varepsilon ^\mathrm{T}\mathbf{X }_{A}{(\mathbf{X }_{A}^\mathrm{T}\mathbf{X }_{A})}^{-1} \mathbf{X }_{A}^\mathrm{T}\varepsilon -\varepsilon ^\mathrm{T}\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1}\mathbf{X }_{{\hat{A}}}^\mathrm{T}\varepsilon +o_P(\sigma ^2), \end{aligned}$$

it follows that

$$\begin{aligned} \log \left( \frac{\varepsilon ^\mathrm{T}\{\mathbf{I }-\mathbf{X }_{{\hat{A}}}{(\mathbf{X }_{{\hat{A}}}^\mathrm{T}\mathbf{X }_{{\hat{A}}})}^{-1} \mathbf{X }_{{\hat{A}}}^\mathrm{T}\}\varepsilon }{{\Vert \mathbf y -\mathbf{X }{\hat{\beta }}^*\Vert }^2}\right) =O_P((\hat{p}_0-p_0)/n). \end{aligned}$$

Thus,

$$\begin{aligned} \mathrm {MBIC}({\hat{\beta }})-\mathrm {MBIC}({\hat{\beta }}^*) \ge (\hat{p}_0-p_0)(C_n\log (n)/n-O_P(1/n)). \end{aligned}$$

We conclude that

$$\begin{aligned} \lim _{n\rightarrow \infty }P\left\{ \inf \{\mathrm {MBIC}({\hat{\beta }}); A \subset {\hat{A}}\}> \mathrm {MBIC}({\hat{\beta }}^*) \right\} = 1. \end{aligned}$$

(23)

Combining this with (22) proves Theorem 3. $\square $
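The criterion compared in the proof can be sketched in code. The form $\mathrm {MBIC}(\beta )=\log \{{\Vert \mathbf y -\mathbf{X }\beta \Vert }^2/(n-\hat{p}_0)\}+C_n\log (n)\hat{p}_0/n$ used below is an assumption, inferred from the expression for the difference of MBIC values in the proof; the definition itself is not reproduced in this Appendix. For simplicity the sketch refits each candidate support by ordinary least squares rather than by the EXP penalized estimator, and the value of `c_n` is a placeholder.

```python
import numpy as np

def mbic(y, X, support, c_n):
    """Assumed MBIC for a candidate support, using an OLS refit on that support."""
    n = len(y)
    k = len(support)
    if k:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        resid = y - X[:, support] @ coef
    else:
        resid = y
    rss = float(resid @ resid)
    return np.log(rss / (n - k)) + c_n * np.log(n) / n * k, rss

rng = np.random.default_rng(4)
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_star = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0])   # illustrative sparse truth
y = X @ beta_star + rng.normal(size=n)

c_n = 1.0                                   # placeholder for the constant C_n
true_support, under_support = [0, 1, 2], [0, 1]

mbic_true, rss_true = mbic(y, X, true_support, c_n)
mbic_under, rss_under = mbic(y, X, under_support, c_n)

# The difference splits into the three terms that appear in the proof.
diff_terms = (np.log(rss_under / rss_true)
              + np.log((n - len(true_support)) / (n - len(under_support)))
              + c_n * np.log(n) / n * (len(under_support) - len(true_support)))
```

In this simulated example the underfitted support omits a large coefficient, so its residual sum of squares term dominates and its criterion value exceeds that of the true support, in line with (22).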

About this article

Cite this article

Wang, Y., Fan, Q. & Zhu, L. Variable selection and estimation using a continuous approximation to the $L_0$ penalty. Ann Inst Stat Math 70, 191–214 (2018). https://doi.org/10.1007/s10463-016-0588-3

Received: 11 May 2015
Revised: 18 September 2016
Published: 19 October 2016
Issue Date: February 2018
DOI: https://doi.org/10.1007/s10463-016-0588-3
