Abstract
We propose a new penalty, called the doubly sparse (DS) penalty, for variable selection in high-dimensional linear regression models when the covariates are naturally grouped. An advantage of the DS penalty over other penalties is that it provides a clear way of controlling between-group and within-group sparsity separately. We prove that there exists a unique global minimizer of the DS penalized sum of squared residuals, and we show how the DS penalty selects groups, and variables within the selected groups, even when the number of groups exceeds the sample size. An efficient optimization algorithm is also introduced. Results from simulation studies and real data analysis show that the DS penalty outperforms other existing penalties in finite samples.
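As an illustration of the bi-level structure described in the abstract, here is a minimal Python sketch of a doubly sparse-type penalty. It assumes a SCAD-type concave penalty applied at the group level to the group \(\ell_1\)-average \(\Vert \varvec{{\beta }}_k\Vert _1/p_k\) with level \(\lambda -\gamma \) (mirroring the thresholds \(a(\lambda -\gamma )\) that appear in the proofs), plus a within-group \(\ell_1\) term at level \(\gamma \). This is only a structural sketch; the exact form of the DS penalty is given in the main text.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty (Fan and Li, 2001) evaluated at |t|."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,
        ),
    )

def ds_penalty(beta, groups, lam, gamma, a=3.7):
    """Illustrative doubly-sparse-type penalty (a sketch, not the paper's
    exact definition): a concave SCAD-type penalty on each group's
    l1-average controls between-group sparsity, while a plain l1 term
    controls within-group sparsity."""
    total = 0.0
    for idx in groups:
        bk = beta[idx]
        pk = len(idx)
        # between-group part: concave penalty on ||beta_k||_1 / p_k
        total += pk * float(scad(np.abs(bk).sum() / pk, lam - gamma, a))
        # within-group part: lasso penalty at level gamma
        total += gamma * np.abs(bk).sum()
    return total
```

Because the group-level SCAD component flattens out for large \(\Vert \varvec{{\beta }}_k\Vert _1/p_k\), large groups are not over-penalized, while the \(\ell_1\) term still zeroes out individual coefficients within selected groups.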
Change history
An erratum to this article was published on 25 July 2017.
References
An, L. T. H., Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization, 11, 253–285.
Bertsekas, D. P. (1999). Nonlinear Programming (2nd ed.). Belmont: Athena Scientific.
Bickel, P. J., Ritov, Y. A., Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37, 1705–1732.
Breheny, P. (2015). The group exponential lasso for bi-level variable selection. Biometrics, 71, 731–740.
Breheny, P., Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2, 369–380.
Breiman, L., Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32, 407–499.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928–961.
Friedman, J., Hastie, T., Hofling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302–332.
Huang, J., Zhang, T. (2010). The benefit of group sparsity. The Annals of Statistics, 38, 1978–2004.
Huang, J., Horowitz, J. L., Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36, 587–613.
Huang, J., Ma, S., Xie, H., Zhang, C.-H. (2009). A group bridge approach for variable selection. Biometrika, 96, 339–355.
Huang, J., Breheny, P., Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical Science, 27, 481–499.
Jiang, D., Huang, J. (2015). Concave 1-norm group selection. Biostatistics, 16, 252–267.
Kim, Y., Kwon, S. (2012). Global optimality of nonconvex penalized estimators. Biometrika, 99, 315–325.
Kim, Y., Choi, H., Oh, H. (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103, 1656–1673.
Kwon, S., Lee, S., Kim, Y. (2015). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92, 53–67.
Lin, Y., Zhang, H. H. (2006). Component selection and smoothing in smoothing spline analysis of variance models. The Annals of Statistics, 34, 2272–2297.
Meinshausen, N., Yu, B. (2009). Lasso-type recovery of sparse representation for high-dimensional data. The Annals of Statistics, 37, 246–270.
Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35, 1012–1030.
Sardy, S., Tseng, P. (2004). Amlet, ramlet, and gamlet: Automatic nonlinear fitting of additive models, robust and generalized, with wavelets. Journal of Computational and Graphical Statistics, 13, 283–309.
Scheetz, T. E., Kim, K. Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L., Sheffield, V. C., Stone, E. M. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103, 14429–14434.
Simon, N., Friedman, J., Hastie, T., Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22, 231–245.
Sriperumbudur, B. K., Lanckriet, G. R. (2009). On the convergence of the concave-convex procedure. Advances in Neural Information Processing Systems, 9, 1759–1767.
Tibshirani, R. J. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society Series B, 58, 267–288.
Wang, H., Li, R., Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
Wang, H., Li, B., Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society Series B, 71, 671–683.
Wang, L., Kim, Y., Li, R. (2013). Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics, 41, 2505–2536.
Wei, F., Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16, 1369–1384.
Ye, F., Zhang, C.-H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the \(\ell _q\) loss in \(\ell _r\) balls. Journal of Machine Learning Research, 11, 3519–3540.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68, 49–67.
Yuille, A., Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894–942.
Zhang, C.-H., Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27, 576–593.
Zhao, P., Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhou, N., Zhu, J. (2010). Group variable selection via a hierarchical lasso and its oracle property. Statistics and Its Interface, 3, 557–574.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Acknowledgments
We are grateful to the anonymous referees, the associate editor, and the editor for their helpful comments. The research of Sunghoon Kwon was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. 2014R1A1A1002995). The research of Woncheol Jang was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea grant funded by the Ministry of Education (No. 2013R1A1A2010065) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2014R1A4A1007895). The research of Yongdai Kim was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea grant funded by the Korea government (MSIP) (No. 2014R1A4A1007895).
Additional information
An erratum to this article is available at https://doi.org/10.1007/s10463-017-0612-2.
Appendix
Without loss of generality, we assume that the covariates are standardized so that \(\mathbf{X}_{kj}^\mathrm{T}\mathbf{X}_{kj}/n=1\) for all \(k\le K\) and \(j\le p_k\). Further, we use \({\varvec{\hat{{\beta }}}}^o\) instead of \({\varvec{\hat{{\beta }}}}^o(\gamma )\) for simplicity.
Proof of Lemma 1
From the first order optimality conditions (Bertsekas 1999), the necessary conditions follow directly. \(\square \)
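For concreteness, the first-order conditions invoked here take the generic subgradient form for penalized least squares (the exact conditions depend on the DS penalty's definition in the main text, and the \(1/2n\) normalization below is an assumption):

```latex
Q_{\lambda,\gamma}(\boldsymbol{\beta})
  = \frac{1}{2n}\,\bigl\Vert \mathbf{y}-\mathbf{X}\boldsymbol{\beta}\bigr\Vert _2^2
    + P_{\lambda,\gamma}(\boldsymbol{\beta}),
\qquad
\mathbf{0} \in -\,\frac{1}{n}\,\mathbf{X}^{\mathrm{T}}
  \bigl(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}\bigr)
  + \partial P_{\lambda,\gamma}\bigl(\hat{\boldsymbol{\beta}}\bigr),
```

where \(\partial P_{\lambda,\gamma}\) denotes the (Clarke) subdifferential of the penalty at \(\hat{\boldsymbol{\beta}}\).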
Proof of Lemma 2
It suffices to show that there exists a \(\delta >0\) such that \(Q_{\lambda ,\gamma }(\varvec{{\beta }}) \ge Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}})\) for all \(\varvec{{\beta }} \in B({\varvec{\hat{{\beta }}}},\delta ),\) where \(B({\varvec{\hat{{\beta }}}},\delta )=\{\varvec{{\beta }}:\Vert \varvec{{\beta }}-{\varvec{\hat{{\beta }}}}\Vert _1\le \delta \}.\) From the convexity of the sum of squared residuals, \(Q_{\lambda ,\gamma }(\varvec{{\beta }})-Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}}) \ge \sum _{k=1}^K \chi _k\), where
First, consider cases where \(k\in \mathcal{L}({\varvec{\hat{{\beta }}}})\). Let \(\delta _k=\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-ap_k(\lambda -\gamma )\); then \(\Vert \varvec{{\beta }}_k\Vert _1/p_k >a(\lambda -\gamma )\) for all \(\varvec{{\beta }}_k \in B({\varvec{\hat{{\beta }}}}_k,\delta _k)\). Hence, the first and second conditions in Lemma 1 imply
Next, consider cases where \(k\in \mathcal{N}({\varvec{\hat{{\beta }}}})\). Let \(\delta _k = \min \{\omega _k,a(\lambda -\gamma )\}\), where \(\omega _k = 2a(\lambda -\Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty )\). Then \(\Vert \varvec{{\beta }}_k\Vert _1 < \omega _k\) for all \(\varvec{{\beta }}_k \in B({\varvec{\hat{{\beta }}}}_k,\delta _k)\), which implies
Hence \(Q_{\lambda ,\gamma }(\varvec{{\beta }}) \ge Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}})\) for all \(\varvec{{\beta }} \in B({\varvec{\hat{{\beta }}}},\min _{k \le K} \delta _k),\) which completes the proof. \(\square \)
Proof of Lemma 3
Assume that there exists another local minimizer \(\varvec{\tilde{{\beta }}} \in \varvec{{\Omega }}_{\lambda ,\gamma }\) with \(\varvec{\tilde{{\beta }}} \ne {\varvec{\hat{{\beta }}}}\). Let \(\varvec{{\beta }}^h=\varvec{\tilde{{\beta }}}+h({\varvec{\hat{{\beta }}}}-\varvec{\tilde{{\beta }}})=h{\varvec{\hat{{\beta }}}} +(1-h)\varvec{\tilde{{\beta }}}\) for \(0<h<1\); then we have
by using the equality,
for any \(\varvec{{\beta }}\in \mathbb {R}^p.\) Hence, it follows that
where
First, consider cases where \(k\in \mathcal{L}({\varvec{\hat{{\beta }}}})\). If \(k \in \mathcal{N}(\varvec{\tilde{{\beta }}})\) then,
from the condition \(\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1/p_k > (\lambda -\gamma )/\rho _{\min }\). If \(k \in \mathcal{L}(\varvec{\tilde{{\beta }}})\) then,
unless \(\varvec{\tilde{{\beta }}}_k = {\varvec{\hat{{\beta }}}}_k\). If \(k \in \mathcal{S}(\varvec{\tilde{{\beta }}})\), we have
which implies
unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1 = \Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\). Second, consider cases where \(k\in \mathcal{N}({\varvec{\hat{{\beta }}}})\). It is easy to see that
unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1 =0\), since
Hence, we finally have \(\sum _{k=1}^K\chi _k(h) < 0\), unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1=\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\) for all \(k \le K\). This implies that there exists a \(\delta >0\) sufficiently small such that \(Q_{\lambda ,\gamma }(\varvec{{\beta }}^h)-Q_{\lambda ,\gamma }(\varvec{\tilde{{\beta }}})<0\) for all \(h\in (0,\delta )\) unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1=\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\) for all \(k\le K\). Hence, \({\varvec{\hat{{\beta }}}}\) is the unique local minimizer. \(\square \)
Proof of Theorem 1
Let \(A^o= \{(k,j):\hat{\beta }_{kj}^o\ne 0\}\). From Lemma 2, it suffices to show that \(\mathbf{P}(E_1 \cap E_2 \cap E_3) \ge 1-\mathbf{P}_1 - \mathbf{P}_2 - \mathbf{P}_3\), where
First consider the event \(E_1\). From Corollary 2 of Zhang and Zhang (2012), we have \(F \subset E_1\) provided that \(\phi _{\max }(\alpha _0|A_*|)/\alpha _0 \le \eta _{\min }/36\), where \(F =\big \{\max _{k \in \mathcal{A}({\varvec{\beta }}^*)}\Vert \mathbf{X}_k^\mathrm{T}\varvec{{\varepsilon }}/n\Vert _\infty \le \gamma /2\big \}\) and
is the cone invertible factor in Ye and Zhang (2010). On the other hand, inequality (7) of Zhang and Zhang (2012) proves \(\eta _{\min } \ge \delta _{\min }^2/16\), where
is the restricted eigenvalue in Bickel et al. (2009) that satisfies \(\delta _{\min } \ge \sqrt{\kappa _{\min }}(1-3\sqrt{\phi _{\max }(\alpha _0|A_*|)/\alpha _0 \kappa _{\min }})\). Hence, (C2) implies that \(F \subset E_1\) and
Second, consider the event \(E_2\). From the first order optimality conditions (Rosset and Zhu 2007), \({\varvec{\hat{{\beta }}}}^o\) satisfies
for all \((k,j) \in G_*\). Let \(S=A^o \cup A_*\) and \({\varvec{\hat{{\beta }}}}_S^o\) be the vector that consists of elements \(\hat{\beta }_{kj}^o\) for \((k,j) \in S\). On the event \(E_1\), (C2) implies
where \(\varvec{{\Sigma }}_{S}=\mathbf{X}_{S}^\mathrm{T}\mathbf{X}_{S}/n\). Let \(\mathbf{u}_{kj}\) be a vector of length \(|S|\le (\alpha _0+1)|A_*|\) whose unique nonzero element that corresponds to \(\beta _{kj}^*\) is 1 and the others are 0. Then, from (18), we can write
where \(\eta _{kj}=-\mathbf{u}_{kj}^\mathrm{T}\varvec{{\Sigma }}_{S}^{-1}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n\) and \(\mathbf{v}_{kj}=\mathbf{X}_{S}\varvec{{\Sigma }}_{S}^{-1}\mathbf{u}_{kj}/n.\) Note that
and
From (C1), it is easy to see that
where \(\mathbf{P}_{E_1}(A)=\mathbf{P}( E_1\cap A)\). Hence, by using the triangle inequality \(\Vert {\varvec{\hat{{\beta }}}}_k^o\Vert _1 \ge \Vert \varvec{{\beta }}_k^*\Vert _1 -\Vert {\varvec{\hat{{\beta }}}}_k^o-\varvec{{\beta }}_k^*\Vert _1 \), we have
Third, consider the event \(E_3\). From (18), we can write
for all \((k,j) \in S\), where \(\zeta _{kj}=\mathbf{X}_{kj}^\mathrm{T}\mathbf{X}_{S}\varvec{{\Sigma }}_{S}^{-1}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n^2\), \(\mathbf{w}_{kj}=(\mathbf{I}-\varvec{{\Pi }}_{S})\mathbf{X}_{kj}/n\) and \(\varvec{{\Pi }}_{S}=\mathbf{X}_{S}(\mathbf{X}_{S}^\mathrm{T}\mathbf{X}_{S})^{-1}\mathbf{X}_{S}^\mathrm{T}.\) Note that from (17),
and \(\Vert \mathbf{w}_{kj}\Vert _2^2 =\mathbf{X}_{kj}^\mathrm{T}(\mathbf{I}-\varvec{{\Pi }}_{S})\mathbf{X}_{kj}/n^2 \le \Vert \mathbf{X}_{kj}\Vert _2^2/n^2 = 1/n\). Hence, from (C1),
Hence, using \(\mathbf{P}(E_1 \cap E_2 \cap E_3) \ge 1-\mathbf{P}(E_1^c) - \mathbf{P}(E_1 \cap E_2^c) - \mathbf{P}(E_1 \cap E_3^c)\), we complete the proof. \(\square \)
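The bound \(\Vert \mathbf{w}_{kj}\Vert _2^2 \le 1/n\) used above rests on the fact that \(\mathbf{I}-\varvec{{\Pi }}_{S}\) is an orthogonal projection and hence a contraction. A quick numerical check of this step, with simulated data and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 50, 5
X_S = rng.standard_normal((n, s))      # columns spanning the fitted subspace
x = rng.standard_normal(n)             # one covariate column X_kj
x *= np.sqrt(n) / np.linalg.norm(x)    # standardize so that x'x / n = 1

# orthogonal projection onto the column space of X_S
Pi = X_S @ np.linalg.solve(X_S.T @ X_S, X_S.T)
w = (np.eye(n) - Pi) @ x / n

# since I - Pi is a projection, ||w||^2 = x'(I - Pi)x / n^2 <= x'x / n^2 = 1/n
assert w @ w <= 1.0 / n + 1e-12
```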
Proof of Theorem 2
Suppose that there is another local minimizer \(\varvec{\tilde{{\beta }}}\in \varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)\) such that \({\varvec{\hat{{\beta }}}}^o\ne \varvec{\tilde{{\beta }}}\). Let \(S=\{(k,j):\tilde{\beta }_{kj}\ne 0\} \cup A^o \cup A_*\). By replacing \(\mathbf{X}\) with \(\mathbf{X}_S\) in the proof of Lemma 3, we can see that if \({\varvec{\hat{{\beta }}}}^o\) satisfies the conditions of Lemma 3, then \({\varvec{\hat{{\beta }}}}^o=\varvec{\tilde{{\beta }}}\). Since \(|S|\le 2(\alpha _0+1)|A_*|\), we have \(\lambda _{\min }(\mathbf{X}_S^\mathrm{T}\mathbf{X}_S/n) \ge \kappa _{\min }\) from (C2). Hence it suffices to show that
which is similar to the proofs of (19) and (20) in the proof of Theorem 1. \(\square \)
Kwon, S., Ahn, J., Jang, W. et al. A doubly sparse approach for group variable selection. Ann Inst Stat Math 69, 997–1025 (2017). https://doi.org/10.1007/s10463-016-0571-z