High-dimensional sign-constrained feature selection and grouping

Abstract

In this paper, we propose a non-negative feature selection/feature grouping (nnFSG) method for general sign-constrained high-dimensional regression problems that allows regression coefficients to be disjointly homogeneous, with sparsity as a special case. To solve the resulting non-convex optimization problem, we provide an algorithm that combines difference-of-convex programming, the augmented Lagrangian method and coordinate descent. Furthermore, we show that the nnFSG method recovers the oracle estimate consistently, and that the mean-squared errors are bounded. Additionally, we examine the performance of our method in finite-sample simulations and on a real protein mass spectrum dataset.

References

  • Arnold, T. B., Tibshirani, R. J. (2016). Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics, 25(1), 1–27.

  • Esser, E., Lou, Y. F., Xin, J. (2013). A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM Journal on Imaging Sciences, 6(4), 2010–2046.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

  • Frank, L. E., Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135.

  • Friedman, J., Hastie, T., Simon, N., Tibshirani, R. (2016). Lasso and elastic-net regularized generalized linear models. R-Package Version, 2(0–5), 2016.

  • Fu, A., Narasimhan, B., Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv:1711.07582.

  • Goeman, J. J. (2010). \(L_1\) penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1), 70–84.

  • Hu, Z., Follmann, D. A., Miura, K. (2015). Vaccine design via nonnegative lasso-based variable selection. Statistics in Medicine, 34(10), 1791–1798.

  • Huang, J., Ma, S., Xie, H., Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355.

  • Itoh, Y., Duarte, M. F., Parente, M. (2016). Perfect recovery conditions for non-negative sparse modeling. IEEE Transactions on Signal Processing, 65(1), 69–80.

  • Jang, W., Lim, J., Lazar, N., Loh, J. M., McDowell, J., Yu, D. (2011). Regression shrinkage and equality selection for highly correlated predictors with HORSES. Biometrics, 64, 1–23.

  • Koike, Y., Tanoue, Y. (2019). Oracle inequalities for sign constrained generalized linear models. Econometrics and Statistics, 11, 145–157.

  • Luenberger, D. G., Ye, Y. (2015). Linear and nonlinear programming, Vol. 228. New York: Springer.

  • Mandal, B. N., Ma, J. (2016). \(l_1\) regularized multiplicative iterative path algorithm for non-negative generalized linear models. Computational Statistics and Data Analysis, 101, 289–299.

  • Meinshausen, N. (2013). Sign-constrained least squares estimation for high-dimensional regression. Electronic Journal of Statistics, 7, 1607–1631.

  • Mullen, K. M., van Stokkum, I. H. (2012). The Lawson–Hanson algorithm for nonnegative least squares (NNLS). CRAN: R package. https://cran.r-project.org/web/packages/nnls/nnls.pdf.

  • Rekabdarkolaee, H. M., Boone, E., Wang, Q. (2017). Robust estimation and variable selection in sufficient dimension reduction. Computational Statistics and Data Analysis, 108, 146–157.

  • Renard, B. Y., Kirchner, M., Steen, H., Steen, J. A., Hamprecht, F. A. (2008). NITPICK: Peak identification for mass spectrometry data. BMC Bioinformatics, 9(1), 355.

  • Shadmi, Y., Jung, P., Caire, G. (2019). Sparse non-negative recovery from biased sub-Gaussian measurements using NNLS. arXiv:1901.05727.

  • She, Y. (2010). Sparse regression with exact clustering. Electronic Journal of Statistics, 4, 1055–1096.

  • Shen, X., Huang, H. C., Pan, W. (2012a). Simultaneous supervised clustering and feature selection over a graph. Biometrika, 99(4), 899–914.

  • Shen, X., Pan, W., Zhu, Y. (2012b). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107(497), 223–232.

  • Shen, X., Pan, W., Zhu, Y., Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807–832.

  • Slawski, M., Hein, M. (2010). Sparse recovery for protein mass spectrometry data. In NIPS workshop on practical applications of sparse modelling.

  • Slawski, M., Hein, M. (2013). Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization. Electronic Journal of Statistics, 7, 3004–3056.

  • Slawski, M., Hussong, R., Tholey, A., Jakoby, T., Gregorius, B., Hildebrandt, A., Hein, M. (2012). Isotope pattern deconvolution for peptide mass spectrometry by non-negative least squares/least absolute deviation template matching. BMC Bioinformatics, 13(1), 291.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

  • Tibshirani, R., Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1), 18–29.

  • Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(1), 91–108.

  • Tibshirani, R. J., Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.

  • Wen, Y. W., Wang, M., Cao, Z., Cheng, X., Ching, W. K., Vassiliadis, V. S. (2015). Sparse solution of nonnegative least squares problems with applications in the construction of probabilistic Boolean networks. Numerical Linear Algebra with Applications, 22(5), 883–899.

  • Wu, L., Yang, Y. (2014). Nonnegative elastic net and application in index tracking. Applied Mathematics and Computation, 227, 541–552.

  • Wu, L., Yang, Y., Liu, H. (2014). Nonnegative-lasso and application in index tracking. Computational Statistics and Data Analysis, 70, 116–126.

  • Xiang, S., Shen, X., Ye, J. (2015). Efficient nonconvex sparse group feature selection via continuous and discrete optimization. Artificial Intelligence, 224, 28–50.

  • Yang, S., Yuan, L., Lai, Y. C., Shen, X., Wonka, P., Ye, J. (2012). Feature grouping and selection over an undirected graph. Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 922–930). ACM. New York.

  • Yang, Y., Wu, L. (2016). Nonnegative adaptive lasso for ultra-high dimensional regression models and a two-stage method applied in financial modeling. Journal of Statistical Planning and Inference, 174, 52–67.

  • Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

  • Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.

  • Zhu, Y., Shen, X., Pan, W. (2013). Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108(502), 713–725.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

  • Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.

Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2017-05720). Qin also gratefully acknowledges financial support from the China Scholarship Council (Grant No. 201506180073).

Author information

Corresponding author

Correspondence to Hao Ding.

Appendix

Proof of Lemma 1

Since \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}=({\hat{\alpha }}_1^{{\mathrm{ols}}},\ldots ,{\hat{\alpha }}_{K^0}^{{\mathrm{ols}}})^{\top }= (Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}Z_{{{\mathcal {G}}_0^0}^c}^{\top }{{\varvec{{y}}}} ={{\varvec{{\alpha }}}}^{0}+(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}Z_{{{\mathcal {G}}_0^0}^c}^{\top }{{\varvec{{\epsilon }}}}\), \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}\sim N\left( {{\varvec{{\alpha }}}}^{0}, \sigma ^2(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}\right) ,\) namely,

$$\begin{aligned} {\hat{\alpha }}_{k}^{{\mathrm{ols}}}-\alpha _k^0\sim N\left( 0,\sigma ^2(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})_{kk}^{-1}\right) , k=1,\ldots ,K^0, \end{aligned}$$

where \((Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})_{kk}^{-1}\) denotes the k-th diagonal element of matrix \((Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}\).

Assumption (A2) implies that the variance of \({\hat{\alpha }}_{k}^{{\mathrm{ols}}}\) is bounded above by \(\sigma ^2/(nc_0)\) for all \(k=1,\ldots ,K^0\). By assumption (A3), \(\min _{1\le k\le K^0}{\alpha }_{k}^{0}=\min _{j\in {{\mathcal {G}}_0^0}^c}{\beta }_{j}^{0}> c_n\), where \(c_n=[2\sigma ^2\log \{2nK^0/(2\pi )^{1/2}\}/(nc_0)]^{1/2}\). As in Meinshausen (2013), Bonferroni’s inequality then gives

$$\begin{aligned} \Vert \hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}-{{\varvec{{\alpha }}}}^{0}\Vert _{\infty }\le c_n, \end{aligned}$$

with probability at least

$$\begin{aligned} 1-2K^0\left\{ 1-\varPhi \left( c_n{(nc_0)^{1/2}}/{\sigma }\right) \right\} =1-2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} . \end{aligned}$$

Because \(\min _{1\le k\le K^0}{\alpha }_{k}^{0}> c_n\), this implies that, with probability at least \(1-2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} \), \(\min _{1\le k\le K^0}{\hat{\alpha }}_{k}^{{\mathrm{ols}}}> 0\), and thus \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}\), \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). That is,

$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \le 2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} . \end{aligned}$$

Since \(1-\varPhi (x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\) for any \(x>0\), it follows that

$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \le \frac{1}{n}\frac{1}{[2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2}}=O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
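The last two displays can be checked numerically. The sketch below is an illustration only: the function names and the sample values of \(n\) and \(K^0\) are assumptions introduced here, and the standard normal tail is computed from the complementary error function in Python's standard library.

```python
import math


def normal_upper_tail(x):
    """Return 1 - Phi(x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def lemma1_bounds(n, k0):
    """Return (union bound, closed-form bound) for sample size n and K^0 = k0.

    union  = 2 K^0 {1 - Phi(x)},  x = [2 log{2 n K^0 / (2 pi)^{1/2}}]^{1/2}
    closed = 1 / (n x), obtained from 1 - Phi(x) <= (2 pi)^{-1/2} x^{-1} exp(-x^2 / 2)
    """
    x = math.sqrt(2.0 * math.log(2.0 * n * k0 / math.sqrt(2.0 * math.pi)))
    union = 2.0 * k0 * normal_upper_tail(x)
    closed = 1.0 / (n * x)
    return union, closed


if __name__ == "__main__":
    # Hypothetical (n, K^0) pairs chosen only to illustrate the decay rate.
    for n, k0 in [(50, 2), (200, 5), (1000, 10)]:
        union, closed = lemma1_bounds(n, k0)
        print(f"n={n:5d}  K0={k0:3d}  union={union:.3e}  closed={closed:.3e}")
```

In each case the union bound should sit below the closed-form bound \(1/(nx)\), which is the quantity that yields the \(O\{n^{-1}(\log n)^{-1/2}\}\) rate.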

Proof of Theorem 1

Let \({\mathcal {G}} =({\mathcal {G}}_0,{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_K)\) be a grouping of the constrained problem in Sect. 2, satisfying that \(0\le {\hat{\beta }}_j^{{\mathrm{cons}}}\le \tau \) if \(j\in {\mathcal {G}}_0\), \(|{\hat{\beta }}_j^{{\mathrm{cons}}}-{\hat{\beta }}_{j'}^{{\mathrm{cons}}}|> \tau \) if \(j\in {{\mathcal {G}}}_k\), \(j'\in {{\mathcal {G}}}_{k'}\), \(j = 1, \ldots , p; 1\le k\ne k'\le K\).

If \({{\mathcal {G}}}={\mathcal {G}}^0\), then \(|{{\mathcal {G}}}_0^c|=s_1^0\). By the first constraint \(\sum _{j=1}^p \min \left\{ \frac{|\beta _j|}{\tau }, 1\right\} \le s_1\), \(\sum _{j\in {{\mathcal {G}}}_0}{\hat{\beta }}_j^{{\mathrm{cons}}}/\tau +s_1^0\le s_1^0\), which implies that \({\hat{\beta }}_j^{{\mathrm{cons}}}=0\), \(j\in {{\mathcal {G}}}_0\). By the second constraint \(\sum _{(j, j') \in \varepsilon } \min \left\{ \frac{|\beta _j - \beta _{j'}|}{\tau }, 1\right\} \le s_2\), similarly, we obtain that \({\hat{\beta }}_j^{{\mathrm{cons}}}={\hat{\beta }}_{j'}^{{\mathrm{cons}}}\), \(j, j'\in {{\mathcal {G}}}_k={\mathcal {G}}_k^0\), \((j, j')\in \varepsilon \), \(k=1,\ldots ,K\). Thus, \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\) if \({{\mathcal {G}}}={\mathcal {G}}^0\), which, together with the fact that \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}})={{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0) + {{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}={\mathcal {G}}^0)\), yields that

$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) = {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0\right) . \end{aligned}$$
(20)
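To make the role of the two truncated-\(L_1\) constraints concrete, the following sketch evaluates their left-hand sides for a small coefficient vector. The function names, the toy coefficients and the use of a complete graph for \(\varepsilon \) are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from itertools import combinations


def truncated_l1(values, tau):
    """Sum of min(|v| / tau, 1) over the given values."""
    return sum(min(abs(v) / tau, 1.0) for v in values)


def constraint_values(beta, edges, tau):
    """Left-hand sides of the sparsity and grouping constraints."""
    sparsity = truncated_l1(beta, tau)
    grouping = truncated_l1([beta[j] - beta[k] for j, k in edges], tau)
    return sparsity, grouping


if __name__ == "__main__":
    tau = 0.5
    beta = [0.0, 0.0, 1.5, 1.5, 3.0]               # toy coefficients
    edges = list(combinations(range(len(beta)), 2))  # assumed complete graph

    print(constraint_values(beta, edges, tau))

    # A small non-zero value in the zero group raises both sums by 0.5,
    # so it is infeasible once the budgets equal the oracle counts.
    perturbed = [0.25] + beta[1:]
    print(constraint_values(perturbed, edges, tau))
```

With three coefficients above \(\tau \), the sparsity sum already equals 3, so a budget \(s_1=3\) leaves no room for a non-zero coefficient in \({\mathcal {G}}_0\). This mirrors the step in the proof that forces \({\hat{\beta }}_j^{{\mathrm{cons}}}=0\) for \(j\in {\mathcal {G}}_0\).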

Denote \({{\bar{S}}}({{\varvec{{\beta }}}}) = 2^{-1}\Vert Y-X {{\varvec{{\beta }}}}\Vert ^2 \). In view that \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0) ={{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0)+{{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0)\), (20) thus becomes

$$\begin{aligned}&{{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) \nonumber \\&\le {\mathrm{pr}}\left( {{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\right) -{{\bar{S}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}})\le 0, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0\right) + {\mathrm{pr}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) . \end{aligned}$$
(21)

The second term in (21) has already been bounded in Lemma 1. Next, we bound the first term in (21), which we denote by \(\varGamma \).

Consider the case where \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) and \({{\mathcal {G}}}\ne {\mathcal {G}}^0\). Define \(\bar{{{\varvec{{\beta }}}}}= ({\bar{\beta }}_1,\ldots ,{\bar{\beta }}_{p})^{\top }\), satisfying

$$\begin{aligned} {\bar{\beta }}_j=\left\{ \begin{array}{l@{\quad }l} \frac{\sum _{j'\in {\mathcal {G}}_k}{\hat{\beta }}_{j'}^{{\mathrm{cons}}}}{|{\mathcal {G}}_k|}, &{}{{\mathrm{if}}} ~j\in {\mathcal {G}}_k, k=1,\ldots ,K,\\ 0, &{}{{\mathrm{if}}}~ j\in {\mathcal {G}}_0. \end{array} \right. \end{aligned}$$

It follows that \(|{\bar{\beta }}_j-{\hat{\beta }}_{j}^{{\mathrm{cons}}}|\le \tau \), \(\Vert \bar{{{\varvec{{\beta }}}}}-\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\Vert ^2\le \tau ^2p\), and thus

$$\begin{aligned} \Vert X(\bar{{{\varvec{{\beta }}}}}-\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}})\Vert ^2\le \lambda _{{\mathrm {max}}}(X^{\top }X)\tau ^2p. \end{aligned}$$
(22)

Note that

$$\begin{aligned} \Vert Y-X\bar{{{\varvec{{\beta }}}}}\Vert ^2\ge \Vert Y-P_{Z_{{\mathcal {G}}_0^c}}Y\Vert ^2=\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}+(I-P_{Z_{{\mathcal {G}}_0^c}}){{\varvec{{\epsilon }}}}\Vert ^2. \end{aligned}$$
(23)

For any vectors \({{\varvec{{u}}}}\) and \({{\varvec{{v}}}}\) of the same dimension and any \(a>0\), it holds that \(\Vert {{\varvec{{u}}}}+{{\varvec{{v}}}}\Vert ^2\ge a^{-1}(a-1)\Vert {{\varvec{{u}}}}\Vert ^2-(a-1)\Vert {{\varvec{{v}}}}\Vert ^2\) (Shen et al. 2012a). We thus have

$$\begin{aligned} {{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\right) = \frac{1}{2}\left\| Y-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\right\| ^2\ge \frac{a-1}{2a}\left\| Y-X\bar{{{\varvec{{\beta }}}}}\right\| ^2-\frac{a-1}{2}\left\| X(\bar{{{\varvec{{\beta }}}}}-\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}})\right\| ^2. \end{aligned}$$
(24)
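The inequality quoted from Shen et al. (2012a) can be recovered from Young's inequality, \(2|{{\varvec{{u}}}}^{\top }{{\varvec{{v}}}}|\le a^{-1}\Vert {{\varvec{{u}}}}\Vert ^2+a\Vert {{\varvec{{v}}}}\Vert ^2\) for \(a>0\). The following lines are a reconstruction offered as a sketch, not the cited authors' own argument:

$$\begin{aligned} \Vert {{\varvec{{u}}}}+{{\varvec{{v}}}}\Vert ^2&=\Vert {{\varvec{{u}}}}\Vert ^2+2{{\varvec{{u}}}}^{\top }{{\varvec{{v}}}}+\Vert {{\varvec{{v}}}}\Vert ^2 \ge \Vert {{\varvec{{u}}}}\Vert ^2-a^{-1}\Vert {{\varvec{{u}}}}\Vert ^2-a\Vert {{\varvec{{v}}}}\Vert ^2+\Vert {{\varvec{{v}}}}\Vert ^2 \\&=\frac{a-1}{a}\Vert {{\varvec{{u}}}}\Vert ^2-(a-1)\Vert {{\varvec{{v}}}}\Vert ^2. \end{aligned}$$

Taking \({{\varvec{{u}}}}=Y-X\bar{{{\varvec{{\beta }}}}}\) and \({{\varvec{{v}}}}=X(\bar{{{\varvec{{\beta }}}}}-\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}})\), so that \({{\varvec{{u}}}}+{{\varvec{{v}}}}=Y-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\), and multiplying by \(1/2\) gives (24).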

By substituting (22)–(23) into (24) and together with \({{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =2^{-1}\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\Vert ^2\le 2^{-1}{{\varvec{{\epsilon }}}}^{\top }{{\varvec{{\epsilon }}}},\) we obtain that, for any \(a>1\),

$$\begin{aligned} 2a\left\{ {{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\right) -{{\bar{S}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}})\right\} = 2a\left\{ {{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\right) -{{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \right\} \ge -L_1-L_2+L_3, \end{aligned}$$

where \(L_1=\{{{\varvec{{\epsilon }}}}-(a-1)(I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\}^{\top }(I-P_{Z_{{\mathcal {G}}_0^c}})\{{{\varvec{{\epsilon }}}}-(a-1)(I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\}\), and \(L_1\sigma ^{-2}\) follows a noncentral chi-squared distribution \(\chi _{k,\varLambda }^2\) with degrees of freedom \(k=\max \{n-K({\mathcal {G}}_0^c),0\}\) and noncentrality parameter \(\varLambda =(a-1)^2\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2/\sigma ^2\); \(L_2=a{{\varvec{{\epsilon }}}}^{\top }P_{Z_{{\mathcal {G}}_0^c}}{{\varvec{{\epsilon }}}}\) is independent of \(L_1\), and \(a^{-1}\sigma ^{-2}L_2\) follows a chi-squared distribution \(\chi _{\kappa }^2\) with degrees of freedom \(\kappa =K({\mathcal {G}}_0^c)\); \(L_3=a(a-1)\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2-a(a-1) \lambda _{{\mathrm {max}}}(X^{\top }X)\tau ^2p\). Note that, by the definition of \(C_{{\mathrm {min}}}\), \(\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge nC_{{\mathrm {min}}}\).

For \(\varGamma \), by Markov's inequality and the moment-generating function of the chi-squared distribution, it holds for any \(0<t<1/(2a)\), so that \(1-2at<1-2t<1 ~ (a>1)\), that (Shen et al. 2012a)

$$\begin{aligned} \varGamma&\le \sum _{i=1}^{s_1^0}\sum _{j=0}^i\genfrac(){0.0pt}0{p-s_1^0}{j}\genfrac(){0.0pt}0{s_1^0}{s_1^0-i}T_il_1^*{l_2^*}^{K_i^*/2}\frac{1}{(1-2t)^{n/2}}, \end{aligned}$$

where \(l_1^*=\exp \left\{ \frac{(a-1)\log p}{4n}-n\frac{t(a-1)iC_{{\mathrm {min}}}}{\sigma ^2}\frac{1-2at}{1-2t}\right\} \), \(l_2^*=(1-2t)/(1-2at)\), \(K_i^*=\max _{\{{\mathcal {G}}\in {\mathcal {T}}, |{\mathcal {G}}_0\backslash {\mathcal {G}}_0^0|=i\}}K({\mathcal {G}}_0^c)\). Note that the last inequality holds true because

$$\begin{aligned} \frac{t}{\sigma ^2}a(a-1)\lambda _{{\mathrm {max}}}(X^{\top }X)p\tau ^2\le \frac{2ta(a-1)\log p}{4n}\le \frac{(a-1)\log p}{4n} \end{aligned}$$

for any \(\tau \le \sigma [\log p/\{2np\lambda _{{\mathrm {max}}}(X^{\top }X)\}]^{1/2}\). We choose \(a=4+n/4\), \(t={4^{-1}(a-1)}^{-1}\), and define \(b={(1-2t)}/{(1-2at)}\). Then \(b={(2a-3)}/{(a-2)}<5/2\), and \((a-1)/(4n)\le 1\). Since \(-\log (1-x)\le x(1-x)^{-1}\) for \(0<x<1\), and \(0<2t=2^{-1}(a-1)^{-1}<1\), it follows that

$$\begin{aligned} -\frac{n}{2}\log (1-2t)\le \frac{n}{2}\frac{1/{\{2(a-1)\}}}{1-1/{\{2(a-1)\}}}\le \frac{n}{2}\frac{1}{2(4+n/4)-3}\le 1, \end{aligned}$$
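The algebra behind the choices \(a=4+n/4\) and \(t=\{4(a-1)\}^{-1}\) is easy to verify numerically. The sketch below is a check under those stated choices; the function name is hypothetical and the sample sizes are arbitrary.

```python
import math


def theorem1_constants(n):
    """Return (a, t, b, log_term) for a = 4 + n/4 and t = 1 / {4 (a - 1)}.

    b        = (1 - 2t) / (1 - 2at), claimed equal to (2a - 3) / (a - 2) < 5/2
    log_term = -(n / 2) log(1 - 2t), claimed to be at most 1
    """
    a = 4.0 + n / 4.0
    t = 1.0 / (4.0 * (a - 1.0))
    b = (1.0 - 2.0 * t) / (1.0 - 2.0 * a * t)
    log_term = -(n / 2.0) * math.log1p(-2.0 * t)
    return a, t, b, log_term


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        a, t, b, log_term = theorem1_constants(n)
        print(f"n={n:6d}  a={a:9.2f}  t={t:.3e}  b={b:.4f}  log_term={log_term:.4f}")
```

Since \(2(a-1)-1=2a-3=5+n/2\), the bound on the logarithmic term is \((n/2)/(5+n/2)\), which stays below 1 for every \(n\).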

which jointly with the facts

$$\begin{aligned} \genfrac(){0.0pt}0{s_1^0}{s_1^0-i}\le (s_1^0)^i, \sum _{j=0}^i\genfrac(){0.0pt}0{p-s_1^0}{j}\le (p-s_1^0)^i \quad {\text {and}} \quad (p-s_1^0)s_1^0\le p^2/4 \end{aligned}$$

yields that

$$\begin{aligned} \varGamma&\le \sum _{i=1}^{s_1^0}\left( \frac{p^2}{4}\right) ^iT_i\exp \left\{ \frac{(a-1)\log p}{4n}-n\frac{iC_{{\mathrm {min}}}}{4b\sigma ^2}\right\} b^{K_i^*/2}\frac{1}{(1-2t)^{n/2}}\nonumber \\&\le \exp (1)\sum _{i=1}^{s_1^0}\exp \left\{ -i\frac{n}{10\sigma ^2}\left( C_{{\mathrm {min}}} -\frac{10\sigma ^2}{n}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\right) \right\} . \end{aligned}$$
(25)

Since \((1-z)^{-1}=\sum _{i=0}^{\infty }z^i\) for \(|z|<1\), we thus obtain that, for \(x<0\),

$$\begin{aligned} \sum _{i=1}^{s_1^0}\exp (ix)\le -1+\frac{1}{1-\exp (x)}=\frac{\exp (x)}{1-\exp (x)}. \end{aligned}$$

We take \(x =- {10^{-1}\sigma ^{-2}}n\{C_{{\mathrm {min}}}-{10\sigma ^2}{n}^{-1}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\}\) if \(C_{{\mathrm {min}}}>{10\sigma ^2}{n}^{-1}(3\log p +{\bar{T}}+{{\bar{K}}}/{2})\). Together with \(\varGamma \le 1\), (25) becomes

$$\begin{aligned} \varGamma \le \{\exp (1)+1\}\exp \left[ -\frac{n}{10\sigma ^2}\left\{ C_{{\mathrm {min}}}-\frac{10\sigma ^2}{n}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\right\} \right] . \end{aligned}$$
(26)
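The constant \(\exp (1)+1\) in (26) comes from combining the geometric-series bound with \(\varGamma \le 1\). As a short sketch, write \(y=\exp (x)\in (0,1)\), a shorthand introduced here only for this step:

$$\begin{aligned} \varGamma \le \min \left\{ 1,\frac{\exp (1)\,y}{1-y}\right\} \le \{\exp (1)+1\}\,y. \end{aligned}$$

If \(y\ge \{\exp (1)+1\}^{-1}\), the right-hand side is at least 1. Otherwise \(1-y>\exp (1)/\{\exp (1)+1\}\), so \(\exp (1)\,y/(1-y)<\{\exp (1)+1\}\,y\).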

Similarly, we can show that (26) still holds for \(C_{{\mathrm {min}}}\le {10\sigma ^2}{n}^{-1}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\). By Lemma 1 and (26), (21) becomes

$$\begin{aligned}&{{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) \nonumber \\&\le \{\exp (1)+1\}\exp \left[ -\frac{n}{10\sigma ^2}\left\{ C_{{\mathrm {min}}}-\frac{10\sigma ^2}{n}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\right\} \right] +\frac{c}{n(\log n)^{1/2}}. \end{aligned}$$
(27)
  1. If \(C_{{\mathrm {min}}}\ge {10\sigma ^2}{n}^{-1}\left( \log n+2^{-1}\log \log n +3\log p+{\bar{T}}+{{\bar{K}}}/{2}\right) \), by (27),

    $$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) =O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
  2. We denote \(T_1=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G\}})\), and \(T_2=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G^c\}})\), where \(G=\{n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2\}\). It is easy to see that

    $$\begin{aligned} \frac{1}{n}E\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2=T_1+T_2. \end{aligned}$$

    Now, we work on \(T_1\). By the definition, \(T_1 = \int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x){\mathrm{d}}x + 25\sigma ^2{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2)\). For the first term of \(T_1\),

    $$\begin{aligned}&\int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( 4n^{-1}\Vert {{\varvec{{\epsilon }}}}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{25\sigma ^2}^{\infty }E\left\{ \exp \left( \frac{\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{3\sigma ^2}\right) \right\} \exp \left( -\frac{nx}{12\sigma ^2}\right) {\mathrm{d}}x \nonumber \\&\quad = \int \nolimits _{25\sigma ^2}^{\infty }\exp \left[ -\frac{n}{12\sigma ^2}\{x-6(\log 3) \sigma ^2\}\right] {\mathrm{d}}x \nonumber \\&\quad < \int \nolimits _{25\sigma ^2}^{\infty }\exp \left\{ -\frac{n}{12\sigma ^2}(x-24 \sigma ^2)\right\} {\mathrm{d}}x \nonumber \\&\quad =\frac{12\sigma ^2}{n}\exp \left( -\frac{n}{12}\right) =o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$
    (28)

    Since \(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\le 2(\Vert Y-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\Vert ^2+\Vert Y-X{{\varvec{{\beta }}}}^{0}\Vert ^2)\le 4\Vert Y-X{{\varvec{{\beta }}}}^{0}\Vert ^2=4\Vert {{\varvec{{\epsilon }}}}\Vert ^2\), the first ‘\(\le \)’ follows. The second ‘\(\le \)’ is obtained by the Markov inequality. In view of the moment generating function for Chi-squared distribution, the first ‘\(=\)’ holds. For the second term of \(T_1\),

    $$\begin{aligned} 25\sigma ^2{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2)\le 25\sigma ^2\exp (-{n}/{12})=o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$
    (29)

    By (28) and (29), we thus have \(T_1=o({K^0\sigma ^2}/{n})\).
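
    The first equality in (28) can be spelled out as follows. Assuming, as in the proof of Lemma 1, that \({{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2I_n)\), we have \(\Vert {{\varvec{{\epsilon }}}}\Vert ^2/\sigma ^2\sim \chi _n^2\), whose moment-generating function is \((1-2s)^{-n/2}\) for \(s<1/2\). With \(s=1/3\),

    $$\begin{aligned} E\left\{ \exp \left( \frac{\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{3\sigma ^2}\right) \right\} =\left( 1-\frac{2}{3}\right) ^{-n/2}=3^{n/2}=\exp \left\{ \frac{n}{12\sigma ^2}\cdot 6(\log 3)\sigma ^2\right\} , \end{aligned}$$

    and \(6\log 3\approx 6.59<24\) gives the strict inequality that follows it.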

On the other hand,

$$\begin{aligned} T_2&= E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G^c\}}I_{\{\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) \nonumber \\&\quad +E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2\left( 1-I_{\{G\}}\right) I_{\{\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}= \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) . \end{aligned}$$
(30)

For the first term in (30), it follows that

$$\begin{aligned}&E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G^c\}}I_{\{\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) \nonumber \\&\quad \le 25\sigma ^2{{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \nonumber \\&\quad \le 25\sigma ^2 {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) + 25\sigma ^2{\mathrm{pr}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \nonumber \\&\quad \le \frac{100\sigma ^2}{n(\log n)^{1/2}}+\frac{50\sigma ^2c}{n(\log n)^{1/2}}=o\left( \frac{K^0 \sigma ^2}{n}\right) . \end{aligned}$$
(31)

For the second term in (30),

$$\begin{aligned}&E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G\}}I_{\{\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}} \right) \nonumber \\&\quad \le E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G\}}\right) =o\left( \frac{K^0\sigma ^2}{n}\right) , \end{aligned}$$
(32)

and

$$\begin{aligned}&E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2 I_{\{\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}= \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) \nonumber \\&\quad =\frac{1}{n}E\left( \left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2\right) =\frac{1}{n}E\left( \left\| P_{Z_{{\mathcal {G}}_0^{0c}}}{{\varvec{{\epsilon }}}}\right\| ^2\right) =\frac{K^0\sigma ^2}{n}. \end{aligned}$$
(33)

By (30)–(33), \(T_2 = n^{-1}{K^0\sigma ^2}(1+o(1))\). Therefore,

$$\begin{aligned} \frac{1}{n}E\left( \left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2\right) =T_1+T_2=\frac{K^0\sigma ^2}{n}(1+o(1)). \end{aligned}$$

Proof of Theorem 3

This proof mimics the proof of Theorem 1 in Shen et al. (2012a); we therefore omit the details. □

Proof of Theorem 4

By Sect. 3, there exists a finite \(m^*\) such that \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{(m^*)}\). Denote the grouping of \(\hat{{{\varvec{{\beta }}}}}\) by \({\mathcal {G}}=({\mathcal {G}}_0,{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_K)\) with \(K<K^*\). Then \(\hat{{{\varvec{{\beta }}}}}\) satisfies that, for grouping \({\mathcal {G}}\),

$$\begin{aligned} \left\{ \begin{array}{l@{\quad }l} -(X_{{\mathcal {G}}_k}{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X{{\varvec{{\beta }}}})+ n\sum \limits _{j\in {\mathcal {G}}_k}\varDelta _j({{\varvec{{\beta }}}})=0&{} ~k=1,\ldots ,K \\ |(X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X{{\varvec{{\beta }}}})-n\sum \limits _{j\in A}\varDelta _j({{\varvec{{\beta }}}}) |\le n\frac{\lambda _2}{\tau }|\varepsilon \cap \{A\times ({\mathcal {G}}_k{\setminus } A)\}| &{}~A\subset {\mathcal {G}}_k, |{\mathcal {G}}_k|>1, \\ |{{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X{{\varvec{{\beta }}}})-n\varDelta _j({{\varvec{{\beta }}}})|\le n\frac{\lambda _1}{\tau } &{} ~ j\in {\mathcal {G}}_0,\\ \end{array} \right. \end{aligned}$$
(34)

where

$$\begin{aligned} \varDelta _j({{\varvec{{\beta }}}})={\lambda _1}{\tau ^{-1}}{\text {sign}}(\beta _j)I_{\{|\beta _j|\le \tau \}} +{\lambda _2}{\tau ^{-1}}\sum \limits _{j': (j',j)\in \varepsilon }{\text {sign}}(\beta _j-\beta _{j'})I_{\{|\beta _j-\beta _{j'}|\le \tau \}} +2\lambda _3\beta _jI_{\{\beta _j<0\}}. \end{aligned}$$
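As a reading aid, \(\varDelta _j({{\varvec{{\beta }}}})\) can be evaluated directly from its three terms. The sketch below is a minimal transcription of the display above; the function names, the edge-list representation of \(\varepsilon \) and the toy inputs are assumptions for illustration, not the paper's implementation.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def delta_j(beta, j, edges, lam1, lam2, lam3, tau):
    """Evaluate Delta_j(beta) for an undirected edge list `edges`."""
    bj = beta[j]
    # Sparsity term: active only while |beta_j| <= tau.
    sparsity = (lam1 / tau) * sign(bj) * (abs(bj) <= tau)
    # Grouping term: sum over neighbours j' of j with |beta_j - beta_j'| <= tau.
    grouping = 0.0
    for u, v in edges:
        if j not in (u, v):
            continue
        other = v if u == j else u
        diff = bj - beta[other]
        grouping += sign(diff) * (abs(diff) <= tau)
    grouping *= lam2 / tau
    # Sign-constraint term: active only for negative coefficients.
    nonneg = 2.0 * lam3 * bj * (bj < 0)
    return sparsity + grouping + nonneg


if __name__ == "__main__":
    # Two equal coefficients and one well-separated coefficient, all above tau.
    beta = [2.0, 2.0, 5.0]
    edges = [(0, 1), (1, 2)]
    print([delta_j(beta, j, edges, 1.0, 1.0, 1.0, 0.5) for j in range(3)])
```

When every non-zero coefficient exceeds \(\tau \), coefficients within a group coincide and distinct groups are separated by more than \(\tau \), each term vanishes and \(\varDelta _j=0\).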

Denote \({\mathcal {J}}={\mathcal {J}}_{11}\cap {\mathcal {J}}_{12}\cap {\mathcal {J}}_{21}\cap {\mathcal {J}}_{22}\), where \({\mathcal {J}}_{11}=\{\min \nolimits _{j\notin {\mathcal {G}}_{0}^0}{\hat{\beta }}_j^{{\mathrm{ols}}}>2\tau \}\), \({\mathcal {J}}_{12}=\{\max \nolimits _{j\in {\mathcal {G}}_{0}^0}|{{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})|\le n{\lambda _1}{\tau ^{-1}}\}\), \({\mathcal {J}}_{21}=\{\min \nolimits _{1\le k<l\le K^0}|{\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}|>2\tau \}\), \({\mathcal {J}}_{22}=\cap _{k=1,\ldots ,K^0: |{\mathcal {G}}_k^0|>1}\{\max _{A\subset {\mathcal {G}}_k^0}|(X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})|\le n{\lambda _2}{\tau }^{-1}|\varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}|\}\). First, we show that \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) is a solution to (34) on \({\mathcal {J}}\). Note that, \(\sum _{j\in {\mathcal {G}}_k^0}\varDelta _j\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =0\) on the set \({\mathcal {J}}_{11}\cap {\mathcal {J}}_{21}\). By the definition of \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\), \((X_{{\mathcal {G}}_k^0}{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-{{\varvec{{X}}}}\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=0\). Thus, the first equation in (34) holds for \({{\varvec{{\beta }}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). Since \(\sum _{j\in {\mathcal {G}}_k^0}\varDelta _j\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =0\) on \({\mathcal {J}}\), one can easily see that the second and third inequalities also hold for \({{\varvec{{\beta }}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\).
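The uniqueness argument that follows combines two groupings with an operation written \(\vee \), which the proof illustrates with the sets \(A_1\) and \(A_2\). One reading consistent with that example is that two indices share a cell exactly when they share a block in both groupings, with indices missing from one grouping kept apart. The sketch below implements this reading; it is an assumption inferred from the single example, and the function name is hypothetical.

```python
def join_groupings(first, second):
    """Group indices by the pair (block in `first`, block in `second`).

    An index absent from one grouping gets None for that coordinate,
    so it only shares a cell with indices that are absent there too.
    """
    def block_index(grouping):
        return {j: k for k, block in enumerate(grouping) for j in block}

    in_first, in_second = block_index(first), block_index(second)
    cells = {}
    for j in sorted(set(in_first) | set(in_second)):
        key = (in_first.get(j), in_second.get(j))
        cells.setdefault(key, []).append(j)
    return sorted(cells.values())


if __name__ == "__main__":
    a1 = [[1, 2, 3, 4], [5, 6]]
    a2 = [[1, 2], [3, 4, 5, 6], [7]]
    print(join_groupings(a1, a2))
```

For the two sets used in the proof this returns the cells \(\{1,2\}\), \(\{3,4\}\), \(\{5,6\}\) and \(\{7\}\).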

Next, we show that (34) has a unique solution on \({\mathcal {J}}\), and thus \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). The proof is by contradiction. Assume that \(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). Let \({\mathcal {H}}=({\mathcal {H}}_1,\ldots ,{\mathcal {H}}_L)={\mathcal {G}}_0^c\vee {\mathcal {G}}_0^{0c}\). To illustrate the operator ‘\(\vee \)’, take \(A_1=\{\{1,2,3,4\}, \{5,6\}\}\) and \(A_2=\{\{1,2\}, \{3,4,5,6\},\{7\}\}\); then \(A_1\vee A_2=\{\{1,2\},\{3,4\},\{5,6\},\{7\}\}\). Denote by \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}} =({\hat{\alpha }}_{{\mathcal {H}}_1}^{{\mathrm{ols}}},\ldots ,{\hat{\alpha }}_{{\mathcal {H}}_L}^{{\mathrm{ols}}})^\top \) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}} =({\hat{\alpha }}_{{\mathcal {H}}_1},\ldots ,{\hat{\alpha }}_{{\mathcal {H}}_L})^\top \) the coefficients estimated by OLS and by Algorithm 1, respectively. Then \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})=(2n)^{-1}\Vert {{\varvec{{y}}}}- Z_{{\mathcal {H}}}{{\varvec{{\alpha }}}}_{{\mathcal {H}}}\Vert ^2 +J({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\), where

$$\begin{aligned} J({{\varvec{{\alpha }}}}_{{\mathcal {H}}})&=\lambda _1\sum \limits _{k=1}^L|{\mathcal {H}}_k| \min \left\{ \frac{|\alpha _{{\mathcal {H}}_k}|}{\tau },1 \right\} + \lambda _2\sum \limits _{1\le k<l\le L}|\varepsilon _{kl}|\min \left\{ \frac{|\alpha _{{\mathcal {H}}_k}-\alpha _{{\mathcal {H}}_l}|}{\tau },1\right\} \\&\quad +\lambda _3\sum \limits _{k=1}^L|{\mathcal {H}}_k| (\min \{\alpha _{{\mathcal {H}}_k},0\})^2 \end{aligned}$$

for \({{\varvec{{\alpha }}}}_{{\mathcal {H}}} =({\alpha }_{{\mathcal {H}}_1},\ldots ,{\alpha }_{{\mathcal {H}}_L})^\top \), where \(\varepsilon _{kl}\) is the set of undirected edges between \({\mathcal {H}}_k\) and \({\mathcal {H}}_l\). We thus have

$$\begin{aligned} \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}- \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}=\frac{1}{n}Z_{{\mathcal {H}}}^{\top }Z_{{\mathcal {H}}}(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})+{{\varvec{{\varphi }}}}, \end{aligned}$$

where \({{\varvec{{\varphi }}}}=(\varphi _1,\ldots ,\varphi _L)^{\top }={{\varvec{{\varphi }}}}_{1}+{{\varvec{{\varphi }}}}_{2}\), \({{\varvec{{\varphi }}}}_{1}=(\varphi _{11},\ldots ,\varphi _{L1})^{\top }\), \({{\varvec{{\varphi }}}}_{2}=(\varphi _{12},\ldots ,\varphi _{L2})^{\top }\), \( \varphi _{k1}={\lambda _1}{\tau ^{-1}}|{\mathcal {H}}_k|( a_kI_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}|\le \tau \}}- a_k^{{\mathrm{ols}}}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|\le \tau \}}) + {\lambda _2}{\tau ^{-1}}\sum \nolimits _{l\ne k}|\varepsilon _{kl}|(b_{kl}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\le \tau \}}-b_{kl}^{{\mathrm{ols}}}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\le \tau \}}), \varphi _{k2}=2\lambda _3(|{\mathcal {H}}_k|{\hat{\alpha }}_{{\mathcal {H}}_k}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}<0\}} -|{\mathcal {H}}_k|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}<0\}}),\) where \(k=1,\ldots ,L\), \(a_k={\text {sign}}({\hat{\alpha }}_{{\mathcal {H}}_k})\), if \({\hat{\alpha }}_{{\mathcal {H}}_k}\ne 0\), \(a_k\in [-1,1]\) otherwise; \(b_{kl}={\text {sign}}({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l})\) if \({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}\ne 0\), \(b_{kl}\in [-1,1]\) otherwise. Similarly, we have \(a_k^{{\mathrm{ols}}}\) and \(b_{kl}^{{\mathrm{ols}}}\). Note that \(\Vert {{\varvec{{\varphi }}}}_{1}\Vert ^2 \le 4\tau ^{-2}(\lambda _1s^*+\lambda _2|{\mathcal {N}}|)^2\).
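
The two ingredients introduced above, the partition operator ‘\(\vee \)’ and the penalty \(J({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `join` and `penalty`, the dictionary `edge_counts` holding \(|\varepsilon _{kl}|\), and the rule that an index missing from one collection forms its own cell (which reproduces the \(\{7\}\) block of the example) are hypothetical choices, not part of the original method.

```python
from itertools import combinations


def join(part_a, part_b):
    """Common refinement of two collections of disjoint blocks.

    Assumption: an index that appears in only one collection is treated as
    lying in an extra 'absent' block of the other, as in the text's example.
    """
    def block_of(part):
        return {j: i for i, block in enumerate(part) for j in block}

    in_a, in_b = block_of(part_a), block_of(part_b)
    cells = {}
    for j in sorted(set(in_a) | set(in_b)):
        # Two indices share a cell only if they share a block in both inputs.
        cells.setdefault((in_a.get(j), in_b.get(j)), set()).add(j)
    return sorted(cells.values(), key=min)


def penalty(alpha, sizes, edge_counts, lam1, lam2, lam3, tau):
    """Evaluate J(alpha_H) term by term.

    alpha[k]            common coefficient of block H_k
    sizes[k]            |H_k|
    edge_counts[(k,l)]  |eps_kl| for k < l (missing pairs count as 0)
    """
    L = len(alpha)
    sparsity = sum(sizes[k] * min(abs(alpha[k]) / tau, 1.0) for k in range(L))
    grouping = sum(
        edge_counts.get((k, l), 0) * min(abs(alpha[k] - alpha[l]) / tau, 1.0)
        for k, l in combinations(range(L), 2)
    )
    negativity = sum(sizes[k] * min(alpha[k], 0.0) ** 2 for k in range(L))
    return lam1 * sparsity + lam2 * grouping + lam3 * negativity


A1 = [{1, 2, 3, 4}, {5, 6}]
A2 = [{1, 2}, {3, 4, 5, 6}, {7}]
H = join(A1, A2)

# Toy values (hypothetical): three blocks, one of them negative.
J_value = penalty(
    alpha=[0.0, 2.0, -0.5],
    sizes=[2, 2, 1],
    edge_counts={(0, 1): 1, (1, 2): 2},
    lam1=1.0, lam2=1.0, lam3=1.0, tau=1.0,
)
```

In the toy call, the truncation at 1 caps the contribution of the large coefficient and of the large pairwise differences, while the quadratic term is active only for the negative block.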

Now, we consider two cases: (1) \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert <\tau /2\) and (2) \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert \ge \tau /2\). For each case, we show that both \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) are the local minimizers of \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}=\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\).

  1.

    \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert <\tau /2\). On the set \({\mathcal {J}}\), \({\hat{\alpha }}_{{\mathcal {H}}_k}\ge {\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|\ge 2\tau -\tau /2>\tau \) if \({\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}>2\tau \); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}|<|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|+|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|<\tau /2\) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|=0\); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\ge -|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|-|{\hat{\alpha }}_{{\mathcal {H}}_l}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}| +|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\ge \tau \) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\ge 2\tau \); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\le |{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|+|{\hat{\alpha }}_{{\mathcal {H}}_l}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}| +|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|<\tau \) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|=0\). 
This implies that both \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) are local minimizers of \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\) and that \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}=\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\).

  2.

    \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert \ge \tau /2\). By the Cauchy–Schwarz inequality,

    $$\begin{aligned} \left| {{\varvec{{\varphi }}}}_{1}^{\top }(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})\right| \le \frac{2}{\tau }\left( \lambda _1s^{*}+\lambda _2 |{\mathcal {N}}|\right) \Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert . \end{aligned}$$

    It is easy to verify that \( ({\hat{\alpha }}_{{\mathcal {H}}_k}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}<0\}} -{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}<0\}})({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}})\ge 0\), followed by

    $$\begin{aligned} {{\varvec{{\varphi }}}}_{2}^{\top }(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})\ge 0. \end{aligned}$$

    By the assumption (A4),

    $$\begin{aligned}&\left( \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}- \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}\right) ^{\top } \frac{\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}}{\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert }\nonumber \\&\ge \min _{K({\mathcal {H}})\le K^*}\frac{\tau }{2}\lambda _{{\mathrm {min}}}\left( \frac{1}{n} Z_{{\mathcal {H}}}^{\top }Z_{{\mathcal {H}}}\right) -\frac{2}{\tau }\left( \lambda _1s^{*}+\lambda _2 |{\mathcal {N}}|\right) >0. \end{aligned}$$
    (35)

On the other hand, \(\frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}} = 0\) and \(\frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}=0\) on \({\mathcal {J}}\), which contradicts (35) when \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\ne \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\). Therefore, the problem (34) has a unique solution on \({\mathcal {J}}\). That is, \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\), which yields

$$\begin{aligned} {{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})\le {{\mathrm{pr}}}({\mathcal {J}}^c) \le {{\mathrm{pr}}}({\mathcal {J}}_{11}^c)+{{\mathrm{pr}}}({\mathcal {J}}_{12}^c)+{{\mathrm{pr}}}({\mathcal {J}}_{21}^c)+{{\mathrm{pr}}}({\mathcal {J}}_{22}^c). \end{aligned}$$
(36)

Next, we bound \({{\mathrm{pr}}}({\mathcal {J}}_{11}^c), {{\mathrm{pr}}}({\mathcal {J}}_{12}^c), {{\mathrm{pr}}}({\mathcal {J}}_{21}^c), {{\mathrm{pr}}}({\mathcal {J}}_{22}^c)\).

Before proceeding, we recall the following Gaussian tail inequality: for \(x>0\), \(\varPhi (-x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\). Consequently, if \(x^2\ge 2\log \{{2na}/{(2\pi )^{1/2}}\}\), \(a \ge 1\), \(x>0\), then \(2a\varPhi (-x)\le cn^{-1}(\log n)^{-1/2}\) for some constant \(c>0\).
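
The tail inequality above can be checked numerically. The sketch below is not part of the proof; the helper names `Phi`, `mills_bound` and `threshold` are hypothetical. At the threshold value of \(x\), the bound \((2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\) equals \(1/(2nax)\), so \(2a\varPhi (-x)\le 1/(nx)\), and since \(x^2\ge \log n\) for \(n\ge 2\) the constant \(c=1\) appears to suffice in that range.

```python
import math


def Phi(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def mills_bound(x):
    """Upper bound (2*pi)^(-1/2) * exp(-x^2 / 2) / x on Phi(-x), for x > 0."""
    return math.exp(-0.5 * x * x) / (x * math.sqrt(2.0 * math.pi))


def threshold(n, a):
    """Positive x with x^2 = 2 * log(2 * n * a / sqrt(2 * pi))."""
    return math.sqrt(2.0 * math.log(2.0 * n * a / math.sqrt(2.0 * math.pi)))


checks = []
for n in (50, 1000, 10**6):
    for a in (1, 10):
        x = threshold(n, a)
        tail = 2 * a * Phi(-x)
        checks.append(
            Phi(-x) <= mills_bound(x)                   # Gaussian tail bound
            and tail <= 1.0 / (n * x) * (1 + 1e-9)      # value at the threshold
            and tail <= 1.0 / (n * math.sqrt(math.log(n)))  # c = 1 in these cases
        )
```

The same pattern is applied in (37)–(40), with \(a\) taken as the number of events in the corresponding union bound.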

For \({\mathcal {J}}_{11}^c\), by the assumptions (A1)–(A2), \({\hat{\beta }}_j^{{\mathrm{ols}}}\sim N(\beta _j^0,var({\hat{\beta }}_j^{{\mathrm{ols}}}))\), where \(var({\hat{\beta }}_j^{{\mathrm{ols}}})\le n^{-1}\sigma ^{2}\lambda _{{\mathrm {min}}}^{-1}(n^{-1}Z_{{\mathcal {G}}_0^{0c}}^{\top }Z_{{\mathcal {G}}_0^{0c}})\). If \(\gamma _{{\mathrm {min}}}>2\tau \), and \(\{(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\sigma ^{-1}\}^2\ge 2\log \{{2n(p-|{\mathcal {G}}_0^0|)}/{(2\pi )^{1/2}}\}\), then

$$\begin{aligned} {{\mathrm{pr}}}({\mathcal {J}}_{11}^c)&\le \sum _{j\in {\mathcal {G}}_0^{0c}}{{\mathrm{pr}}}\left( {\hat{\beta }}_j^{{\mathrm{ols}}}\le 2\tau \right) \le \sum _{j\in {\mathcal {G}}_0^{0c}}{{\mathrm{pr}}}({\beta }_j^{0}-|{\hat{\beta }}_j^{{\mathrm{ols}}}-{\beta }_j^{0}|\le 2\tau )\nonumber \\&\le 2\left( p-|{\mathcal {G}}_0^0|\right) \varPhi \left( -(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\sigma ^{-1}\right) \nonumber \\&=O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
(37)

For \({\mathcal {J}}_{12}^c\), by (A1)–(A2), \({{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})={{\varvec{{x}}}}_{(j)}^{\top }(I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{x}}}}_{(j)}\Vert ^2),\) and \(\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{x}}}}_{(j)}\Vert ^2\le \Vert {{\varvec{{x}}}}_{(j)}\Vert ^2\). If \(({n{\lambda _1\tau ^{-1}}\sigma ^{-1}}/{\max \nolimits _{1\le j\le p}\Vert {{\varvec{{x}}}}_{(j)}\Vert })^2\ge 2\log \{{2n|{\mathcal {G}}_0^0|}/{(2\pi )^{1/2}}\}\), then

$$\begin{aligned} {{\mathrm{pr}}}({\mathcal {J}}_{12}^c)&\le \sum _{j\in {\mathcal {G}}_0^{0}}{{\mathrm{pr}}}\left( \left| {{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})\right| > n\frac{\lambda _1}{\tau }\right) \nonumber \\&\le 2|{\mathcal {G}}_0^0|\varPhi \left( -\frac{n{\lambda _1}/{\tau }}{\sigma \max \limits _{1\le j\le p}\Vert {{\varvec{{x}}}}_{(j)}\Vert }\right) = O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
(38)

For \({\mathcal {J}}_{21}^c\), by (A1)–(A2), \({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}\sim N(\alpha _k^0-\alpha _l^0, var({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})),\) where \(var({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})\le 4n^{-1}\sigma ^{2}\lambda _{{\mathrm {min}}}^{-1}(n^{-1}Z_{{\mathcal {G}}_0^{0c}}^{\top }Z_{{\mathcal {G}}_0^{0c}})\). If \(\gamma _{{\mathrm {min}}}>2\tau \), and \(\{2^{-1}\sigma ^{-1}(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\}^2\ge 2\log \{{nK^0(K^0-1)}/{(2\pi )^{1/2}}\}\), then

$$\begin{aligned} {{\mathrm{pr}}}({\mathcal {J}}_{21}^c)&\le \sum _{1\le k<l\le K^0}{{\mathrm{pr}}}(|{\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}|\le 2\tau )\nonumber \\&\le \sum _{1\le k<l\le K^0}{{\mathrm{pr}}}(|{\alpha }_k^0-{\alpha }_l^0|-|({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})-({\alpha }_k^0-{\alpha }_l^0)|\le 2\tau )\nonumber \\&\le K^0(K^0-1)\varPhi \left( -2^{-1}\sigma ^{-1}(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\right) \nonumber \\&= O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
(39)

For \({\mathcal {J}}_{22}^c\), by (A1)–(A2), \((X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=(X_A{{\varvec{{1}}}})^{\top }(I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}})X_A{{\varvec{{1}}}}\Vert ^2),\) and \(\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}})X_A{{\varvec{{1}}}}\Vert ^2\le \Vert X_A{{\varvec{{1}}}}\Vert ^2\). Denote \({\mathcal {D}} = \max \nolimits _{k,A\subset {\mathcal {G}}_{k}^0} {\Vert X_A{{\varvec{{1}}}}\Vert }/{|\varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}|}\). If \(({2^{-1}n{\lambda _2}{\tau }^{-1}\sigma ^{-1}}/{\mathcal {D}})^2\ge 2\log \{{2n|{\mathcal {N}}|}/{(2\pi )^{1/2}}\}\), then

$$\begin{aligned} {{\mathrm{pr}}}({\mathcal {J}}_{22}^c)&\le \sum \limits _{k=1,\ldots ,K^0;A\subset {\mathcal {G}}_k^0}{{\mathrm{pr}}}\left( \left| (X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})\right| > n\frac{\lambda _2}{\tau }\left| \varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}\right| \right) \nonumber \\&\le 2|{\mathcal {N}}|\varPhi \left( -\frac{n{\lambda _2}/{\tau }}{2\sigma {\mathcal {D}} }\right) = O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
(40)

By (36)–(40), we thus have \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}) = O\left( \frac{1}{n(\log n)^{1/2}}\right) ,\) which, together with Lemma 2.1, yields that

$$\begin{aligned} {{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}})\le {{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})+{{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$

(2) Note that \(\hat{{{\varvec{{\alpha }}}}}\) satisfies \(-Z_{{\mathcal {G}}_0^c}^{\top }\left( {{\varvec{{y}}}}-Z_{{\mathcal {G}}_0^c}\hat{{{\varvec{{\alpha }}}}}\right) +2n\lambda _3M_0\hat{{{\varvec{{\alpha }}}}}+n\hat{{{\varvec{{\delta }}}}}=0,\) where \(M_0\) is a \(K\times K\) diagonal matrix with diagonal elements \(|{\mathcal {G}}_k|I_{\{{\hat{\alpha }}_{k}<0\}}\) for \(k = 1,\ldots , K\); \(\hat{{{\varvec{{\delta }}}}}=({\hat{\delta }}_1,\ldots ,{\hat{\delta }}_K)^{\top }\), \({\hat{\delta }}_k=\sum _{j\in {\mathcal {G}}_k}\varUpsilon _j(\hat{{{\varvec{{\beta }}}}})\), and \(\varUpsilon _j({{\varvec{{\beta }}}})={\lambda _1}{\tau ^{-1}}{\text {sign}}(\beta _j)I_{\{|\beta _j|\le \tau \}} + {\lambda _2}{\tau ^{-1}}\sum \nolimits _{j': (j',j)\in \varepsilon }{\text {sign}}(\beta _j-\beta _{j'})I_{\{|\beta _j-\beta _{j'}|\le \tau \}}\). Note that \(\Vert \hat{{{\varvec{{\delta }}}}}\Vert ^2\le \tau ^{-2}(\lambda _1s^*+\lambda _2|{\mathcal {N}}|)^2\). Solving for \(\hat{{{\varvec{{\alpha }}}}}\) gives \(\hat{{{\varvec{{\alpha }}}}}=(Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}(Z_{{\mathcal {G}}_0^c}^{\top }{{\varvec{{y}}}}-n\hat{{{\varvec{{\delta }}}}}),\) and hence

$$\begin{aligned}&\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\nonumber \\&=\Vert Z_{{\mathcal {G}}_0^c}(Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}(Z_{{\mathcal {G}}_0^c}^{\top }{{\varvec{{y}}}}-n\hat{{{\varvec{{\delta }}}}})-Z_{{\mathcal {G}}_0^{0c}}{{\varvec{{\alpha }}}}^{0}\Vert ^2\nonumber \\&=\Vert \{I-Z_{{\mathcal {G}}_0^c}(Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}Z_{{\mathcal {G}}_0^c}^{\top }\}Z_{{\mathcal {G}}_0^{0c}}{{\varvec{{\alpha }}}}^{0} -Z_{{\mathcal {G}}_0^c}(Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}Z_{{\mathcal {G}}_0^c}^{\top }{{\varvec{{\epsilon }}}}\nonumber \\&\quad +nZ_{{\mathcal {G}}_0^c} (Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}\hat{{{\varvec{{\delta }}}}}\Vert ^2\nonumber \\&\le 3\Vert X{{\varvec{{\beta }}}}^{0}\Vert ^2+3\Vert {{\varvec{{\epsilon }}}}\Vert ^2+\frac{3\tau ^2 n}{16}\min _{K({\mathcal {G}}_0^c)\le K^*}\lambda _{{\mathrm {min}}}\left( \frac{1}{n}Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}\right) . \end{aligned}$$
(41)

Denote \(T_1=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G\}})\) and \(T_2=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G^c\}})\), where \(G=\{n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge D\}\). By the definition, we have \(n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2)=T_1+T_2.\) Next, we work on \(T_1,T_2.\) Let

$$\begin{aligned} D=\frac{3}{n}\Vert X{{\varvec{{\beta }}}}^{0}\Vert ^2+10\sigma ^2+\frac{3\tau ^2 }{16}\min _{K({\mathcal {G}}_0^c)\le K^*}\lambda _{{\mathrm {min}}}\left( \frac{1}{n}Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}\right) . \end{aligned}$$
(42)

For \(T_1\), it follows that

$$\begin{aligned}&\int \nolimits _{D}^{\infty }{{\mathrm{pr}}}\left( n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{10\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( 3n^{-1}\Vert {{\varvec{{\epsilon }}}}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{10\sigma ^2}^{\infty }E\left\{ \exp \left( \frac{t\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{\sigma ^2}\right) \exp \left( -\frac{ntx}{3\sigma ^2} \right) \right\} {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{10\sigma ^2}^{\infty }\exp \left\{ -\frac{n}{9\sigma ^2}(x-9\sigma ^2)\right\} {\mathrm{d}}x\nonumber \\&\quad \le \frac{9\sigma ^2}{n}\exp \left( -\frac{n}{9}\right) =o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$
(43)
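
The exponential steps in (43) can be verified numerically. The sketch below assumes Gaussian errors, so that \(\Vert {{\varvec{{\epsilon }}}}\Vert ^2/\sigma ^2\) follows a chi-squared distribution with \(n\) degrees of freedom and has moment generating function \((1-2t)^{-n/2}\) for \(t<1/2\). The function names are hypothetical, and everything is computed on the log scale to avoid overflow.

```python
import math


def log_mgf_chi2(t, n):
    """log E exp(t * Q) for Q ~ chi-squared with n degrees of freedom, t < 1/2."""
    return -0.5 * n * math.log(1.0 - 2.0 * t)


def log_chernoff_term(x, n, sigma2, t=1.0 / 3.0):
    """log of E{exp(t ||eps||^2 / sigma^2)} * exp(-n t x / (3 sigma^2))."""
    return log_mgf_chi2(t, n) - n * t * x / (3.0 * sigma2)


def log_simplified_term(x, n, sigma2):
    """log of exp{-n (x - 9 sigma^2) / (9 sigma^2)}, the integrand after the third step."""
    return -n * (x - 9.0 * sigma2) / (9.0 * sigma2)


def tail_integral(n, sigma2):
    """Closed form of the integral of the simplified term over [10 sigma^2, infinity)."""
    return 9.0 * sigma2 / n * math.exp(-n / 9.0)


sigma2 = 2.0
third_step_ok = all(
    log_chernoff_term(x, n, sigma2) <= log_simplified_term(x, n, sigma2)
    for n in (10, 100, 1000)
    for x in (10 * sigma2, 15 * sigma2, 40 * sigma2)
)
```

With \(t=1/3\) the moment generating function equals \(3^{n/2}=\exp \{(n/2)\log 3\}\), which is at most \(\exp (n)\) because \((\log 3)/2<1\); this gap is what the comparison of the two log terms confirms.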

The first ‘\(\le \)’ follows from (41) and (42). The second follows from Markov's inequality applied to \(\exp (t\Vert {{\varvec{{\epsilon }}}}\Vert ^2/\sigma ^2)\), and the third from the moment generating function of the chi-squared distribution with \(t = 1/3\). For \(T_2\),

$$\begin{aligned} T_2= E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G^c\}}I_{\{\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) +E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2\left( 1-I_{\{G\}}\right) I_{\{\hat{{{\varvec{{\beta }}}}}= \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) . \end{aligned}$$
(44)

For the first term in (44), if \(D= o\{K^0(\log n)^{1/2}\},\) then

$$\begin{aligned}&E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G^c\}}I_{\{\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) \le D{{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) = o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$
(45)

For the second term in (44),

$$\begin{aligned} E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G\}}I_{\{\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) \le E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{G\}}\right) =o\left( \frac{K^0\sigma ^2}{n}\right) , \end{aligned}$$
(46)
$$\begin{aligned} E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2I_{\{\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\}}\right) \le E\left( \frac{1}{n}\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2\right) =\frac{K^0\sigma ^2}{n}. \end{aligned}$$
(47)

By (43), (44)–(47), \(n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2)=T_1+T_2=n^{-1}K^0\sigma ^2(1+o(1)).\)
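
The leading term \(n^{-1}K^0\sigma ^2\) comes from the OLS identity in (47), which can be checked exactly through the trace of the projection matrix. The sketch below assumes NumPy is available, that the errors are uncorrelated with mean zero and variance \(\sigma ^2\), and that the grouped design has columns \(X_{{\mathcal {G}}_k}{{\varvec{{1}}}}\); the function name, the toy dimensions and the group structure are hypothetical.

```python
import numpy as np


def ols_prediction_risk(Z, sigma2):
    """Exact n^{-1} E||Z a_hat - Z a0||^2 for OLS on a full-column-rank design Z.

    With y = Z a0 + eps, the fitted values are P y, so the prediction error is
    P eps and its expected squared norm is sigma2 * trace(P).
    """
    n = Z.shape[0]
    P = Z @ np.linalg.pinv(Z)  # projection onto the column space of Z
    return sigma2 * np.trace(P) / n


rng = np.random.default_rng(0)
n, p, sigma2 = 60, 6, 1.5
X = rng.normal(size=(n, p))

# Hypothetical true grouping: two nonzero groups; column 5 has coefficient zero.
groups = [[0, 1, 2], [3, 4]]
Z = np.column_stack([X[:, g].sum(axis=1) for g in groups])

risk = ols_prediction_risk(Z, sigma2)
K0 = len(groups)
```

Here `risk` agrees with `K0 * sigma2 / n` up to floating-point error, because the trace of a projection matrix equals its rank.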

About this article

Cite this article

Qin, S., Ding, H., Wu, Y. et al. High-dimensional sign-constrained feature selection and grouping. Ann Inst Stat Math 73, 787–819 (2021). https://doi.org/10.1007/s10463-020-00766-z

  • DOI: https://doi.org/10.1007/s10463-020-00766-z

Keywords

  • Difference convex programming
  • Feature grouping
  • Feature selection
  • High-dimensional
  • Non-negative