
Group sparse recovery via group square-root elastic net and the iterative multivariate thresholding-based algorithm

Abstract

In this work, we propose a novel group selection method called the Group Square-Root Elastic Net. It is based on square-root regularization with a group elastic net penalty, i.e., an \(\ell _{2,1}+\ell _2\) penalty. As a square-root-based procedure, one distinct feature is that the estimator is independent of the unknown noise level \(\sigma \), which is non-trivial to estimate in the high-dimensional setting, especially when \(p\gg n\). In many applications, the estimator is expected to be sparse, not in an irregular way, but rather in a structured manner. This makes the proposed method very attractive for tackling both high-dimensionality and structured sparsity. We study correct subset recovery under a Group Elastic Net Irrepresentable Condition. Both slow rate bounds and fast rate bounds are established, the latter under the Restricted Eigenvalue assumption and a Gaussian noise assumption. For implementation, a fast algorithm based on the scaled multivariate thresholding-based iterative selection idea is introduced, with convergence proved. A comparative study demonstrates the advantages of our approach over alternatives.

References

  • Ahsen, M.E., Vidyasagar, M.: Error bounds for compressed sensing algorithms with group sparsity: a unified approach. Appl Comput Harmon Anal 43, 212–232 (2017)

  • Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Mach Learn 73, 243–272 (2008)

  • Belloni, A., Wang, L.: Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98, 791–806 (2011)

  • Belloni, A., Chernozhukov, V., Wang, L.: Pivotal estimation via square-root lasso in nonparametric regression. Ann Stat 42, 757–788 (2014)

  • Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37, 1705–1732 (2009)

  • Buhlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

  • Bunea, F.: Honest variable selection in linear and logistic regression models via \(\ell _1\) and \(\ell _1+\ell _2\) penalization. Electron J Stat 2, 1153–1194 (2008)

  • Bunea, F., Lederer, J., She, Y.: The group square-root lasso: theoretical properties and fast algorithms. IEEE Trans Inf Theory 60, 1313–1325 (2014)

  • Cho, H., Fryzlewicz, P.: High dimensional variable selection via tilting. J R Stat Soc Series B Stat Methodol 74, 593–622 (2012)

  • Dalalyan, A.S., Hebiri, M., Lederer, J.: On the prediction performance of the Lasso. Bernoulli 23, 552–581 (2017)

  • Evgeniou, T., Pontil, M., Toubia, O.: A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. Market Sci 26, 805–818 (2007)

  • Friedman, J.H., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1), 1–22 (2010)

  • Hebiri, M., Lederer, J.: How correlations influence lasso prediction. IEEE Trans Inf Theory 59, 1846–1854 (2013)

  • Huang, J., Huang, X., Metaxas, D.: Learning with dynamic group sparsity. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 64–71 (2009)

  • Huang, J., Zhang, T.: The benefit of group sparsity. Ann Stat 38, 1978–2004 (2010)

  • Huang, J., Breheny, P., Ma, S.: A selective review of group selection in high-dimensional models. Stat Sci 27, 481–499 (2012)

  • Hu, J., Huang, J., Qiu, F.: A group adaptive elastic-net approach for variable selection in high-dimensional linear regression. Sci China Math 61(1), 173–188 (2018)

  • Jia, J., Yu, B.: On model selection consistency of the elastic net when \(p\gg \) n. Stat Sin 20, 595–611 (2010)

  • Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann Stat 28, 1302–1338 (2000)

  • Lederer, J., Yu, L., Gaynanova, I.: Oracle inequalities for high-dimensional prediction. Bernoulli 25, 1225–1255 (2019)

  • Lounici, K., Pontil, M., Tsybakov, A.B.: Oracle inequalities and optimal inference under group sparsity. Ann Stat 39, 2164–2204 (2011)

  • Meier, L., van de Geer, S., Buhlmann, P.: The group lasso for logistic regression. J R Stat Soc Series B Stat Methodol 70, 53–71 (2008)

  • Mishali, M., Eldar, Y.C.: Blind multiband signal reconstruction: compressed sensing for analog signals. IEEE Trans Signal Process 57, 993–1009 (2009)

  • Mishali, M., Eldar, Y.C.: From theory to practice: sub-Nyquist sampling of sparse wideband analog signals. IEEE J Sel Top Signal Process 4, 375–391 (2010)

  • Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings in Banach spaces. Bull Amer Math Soc 73, 591–597 (1967)

  • Peng, L., Wengu, C.: Signal recovery under cumulative coherence. J Comput Appl Math 346, 399–417 (2019)

  • Raninen, E., Ollila, E.: Scaled and square-root elastic net. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4336–4340 (2017)

  • She, Y.: Sparse regression with exact clustering. Electron J Stat 4, 1055–1096 (2010)

  • She, Y.: An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Comput Stat Data Anal 56, 2976–2990 (2012)

  • Stucky, B., van de Geer, S.: Sharp oracle inequalities for square root regularization. J Mach Learn Res 18, 1–29 (2017)

  • van de Geer, S.A., Buhlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron J Stat 3, 1360–1392 (2009)

  • Wainwright, M.J.: Structured regularizers for high-dimensional problems: statistical and computational issues. Annu Rev Stat Appl 1, 233–253 (2014)

  • Wu, L.: Analysis of longitudinal data. Technometrics 45 (2003)

  • Yi, C., Huang, J.: Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. J Comput Graph Stat 26, 547–557 (2017)

  • Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J R Stat Soc Series B Stat Methodol 68, 49–67 (2006)

  • Zhao, P., Yu, B.: On model selection consistency of lasso. J Mach Learn Res 7, 2541–2563 (2006)

  • Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 67, 301–320 (2005)

  • Zou, H.: The adaptive lasso and its oracle properties. J Am Stat Assoc 101, 1418–1429 (2006)

  • Zou, H., Zhang, H.H.: On the adaptive elastic-net with a diverging number of parameters. Ann Stat 37, 1733–1751 (2009)

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 11671059).

Author information

Corresponding author

Correspondence to Hu Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs of Theorems

Proof of Lemma 1

We prove Lemma 1 as stated in Section 3. By the KKT conditions, equation (3) can be rewritten in the equivalent form

$$\begin{aligned} {\hat{\beta }}=\arg \min \limits _{\beta \in {\mathbb {R}}^p}\frac{1}{n{\hat{\sigma }}}\Vert Y-X\beta \Vert _2^2+{\hat{\sigma }}+\frac{2\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \beta _j\Vert _2 +\frac{2\lambda _2}{n}\sum \limits _{j=1}^J\Vert \beta _j\Vert _2^2, \end{aligned}$$
(12)

where \({\hat{\sigma }}=\Vert {\hat{\epsilon }}\Vert _2/\sqrt{n}\) with \({\hat{\epsilon }}=Y-X{\hat{\beta }}\). Then, using the same augmented-data trick as in Zou and Hastie (2005), the above equation can be rewritten as

$$\begin{aligned} {\hat{\beta }}=\arg \min \limits _{\beta \in {\mathbb {R}}^p} \frac{1}{n{\hat{\sigma }}}\Vert Y^*-X^*\beta \Vert _2^2+\frac{2\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \beta _j\Vert _2, \end{aligned}$$

where \(Y^*=\big (Y^T,0^T\big )^T,X^*=\big (X^T,-\sqrt{2\lambda _2{\hat{\sigma }}}I^T\big )^T\).
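To see why this augmented formulation is equivalent, note that expanding the stacked residual gives

$$\begin{aligned} \Vert Y^*-X^*\beta \Vert _2^2=\Vert Y-X\beta \Vert _2^2+2\lambda _2{\hat{\sigma }}\sum \limits _{j=1}^J\Vert \beta _j\Vert _2^2, \end{aligned}$$

so dividing by \(n{\hat{\sigma }}\) recovers exactly the quadratic term and the ridge term of (12).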

Following Lemma 2.1 in Lederer et al. (2019), with probability one there exists a tuning parameter \(\lambda \) such that

$$\begin{aligned} \lambda =\frac{1}{{\hat{\sigma }}}\max _{j\in \{1,\dots ,J\}} \Vert X_j^{*T}\epsilon ^*\Vert _2/\sqrt{T_j}, \end{aligned}$$

where \(\epsilon ^*=Y^*-X^*\beta ^0\). Then after some computation, we complete the proof. \(\square \)
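In practice, the pivotal quantity appearing in Lemma 1 depends on the noise only through its direction, so it can be approximated by Monte Carlo, as is common for square-root-type estimators (cf. Belloni and Wang 2011). The following is a hypothetical Python sketch, not the authors' procedure; the function name, defaults, and the quantile rule are illustrative only:

    import numpy as np

    def pivotal_lambda(X, groups, alpha=0.05, n_rep=1000, seed=None):
        # Monte Carlo approximation of a pivotal tuning parameter:
        # the (1 - alpha)-quantile of max_j V_j / sqrt(T_j), where
        # V_j = sqrt(n) * ||X_j^T e||_2 / ||e||_2 for standard Gaussian e.
        # Hypothetical heuristic in the spirit of square-root lasso tuning.
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        stats = np.empty(n_rep)
        for i in range(n_rep):
            e = rng.standard_normal(n)
            stats[i] = max(
                np.sqrt(n) * np.linalg.norm(X[:, g].T @ e)
                / (np.sqrt(len(g)) * np.linalg.norm(e))
                for g in groups
            )
        return np.quantile(stats, 1 - alpha)

Here `groups` is a list of index arrays defining the group structure; the statistic simulated is exactly the quantity \(V_j\) bounded on the set \({\mathcal {A}}\) in the proof of Theorem 3 below.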

Proof of Theorem 1

We prove Theorem 1 as stated in Section 3. As in the proof of Lemma 1, equation (3) has the equivalent form

$$\begin{aligned} {\hat{\beta }}=\arg \min \limits _{\beta \in {\mathbb {R}}^p} \frac{1}{n{\hat{\sigma }}}\Vert Y^*-X^*\beta \Vert _2^2+\frac{2\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \beta _j\Vert _2, \end{aligned}$$

where \({\hat{\sigma }}=\Vert {\hat{\epsilon }}\Vert _2/\sqrt{n}\) with \({\hat{\epsilon }}=Y-X{\hat{\beta }}\), \(Y^*=\big (Y,0\big )^T\) and \(X^*=\big (X,-\sqrt{2\lambda _2{\hat{\sigma }}}I\big )^T\).

By the definition of \({\hat{\beta }}\), for any \(\alpha \in (0,1)\), we have

$$\begin{aligned} \frac{1}{n{\hat{\sigma }}}&\Vert Y^*-X^*{\hat{\beta }}\Vert _2^2+\frac{2\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \hat{\beta _j}\Vert _2\\&\le \frac{1}{n{\hat{\sigma }}}\Vert Y^*-X^*(\alpha {\hat{\beta }}+(1-\alpha )\beta ^0)\Vert _2^2+\frac{2\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \alpha {\hat{\beta }}_j+(1-\alpha )\beta ^0_j\Vert _2\\&=\frac{1}{n{\hat{\sigma }}}\Vert \alpha X^*(\beta ^0-{\hat{\beta }})+\epsilon ^*\Vert _2^2+\frac{2\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \alpha {\hat{\beta }}_j+(1-\alpha )\beta ^0_j\Vert _2, \end{aligned}$$

where \(\epsilon ^*=Y^*-X^*\beta ^0\). Then the triangle inequality leads to

$$\begin{aligned} \Vert \alpha {\hat{\beta }}_j+(1-\alpha )\beta ^0_j\Vert _2\le \alpha \Vert {\hat{\beta }}_j\Vert _2+(1-\alpha )\Vert \beta ^0_j\Vert _2. \end{aligned}$$

By the above, we get

$$\begin{aligned} (1+\alpha )\Vert X({\hat{\beta }}-\beta ^0)\Vert _2^2\le 2(\sigma \epsilon ^TX-2{\hat{\sigma }}\lambda _2\beta ^0)^T({\hat{\beta }}-\beta ^0) +2{\hat{\sigma }}\lambda _1\sum \limits _{j=1}^J\bigg \{\sqrt{T_j}(\Vert \beta _j^0\Vert _2-\Vert {\hat{\beta }}_j\Vert _2)\bigg \}. \end{aligned}$$

By the Cauchy-Schwarz inequality, we can obtain that

$$\begin{aligned} (\sigma \epsilon ^TX-2{\hat{\sigma }}\lambda _2\beta ^0)^T({\hat{\beta }}-\beta ^0)&=\sum \limits _{j=1}^J(\sigma X_j^T\epsilon -2{\hat{\sigma }}\lambda _2\beta ^0_j)^T({\hat{\beta }}_j-\beta ^0_j)\\&\le \sum \limits _{j=1}^J\Vert \sigma X_j^T\epsilon -2{\hat{\sigma }}\lambda _2\beta ^0_j\Vert _2\Vert {\hat{\beta }}_j-\beta ^0_j\Vert _2. \end{aligned}$$

Then, with the triangle inequality and \(\lambda _1\) as defined in Lemma 1, we have

$$\begin{aligned} (1+\alpha )\Vert X({\hat{\beta }}-\beta ^0)\Vert _2^2\le 4\max \limits _{j\in \{ 1,\dots ,J\}}\Vert \sigma X_j^T\epsilon -2{\hat{\sigma }}\lambda _2\beta ^0_j\Vert _2\sum \limits _{j=1}^J\Vert \beta ^0_j\Vert _2. \end{aligned}$$

Since \(\alpha \in (0,1)\) is arbitrary, taking the limit \(\alpha \rightarrow 1\) we finally obtain

$$\begin{aligned} \Vert X({\hat{\beta }}-\beta ^0)\Vert _2^2\le 2\max \limits _{j\in \{ 1,\dots ,J\}}\Vert \sigma X_j^T\epsilon -2{\hat{\sigma }}\lambda _2\beta ^0_j\Vert _2\sum \limits _{j=1}^J\Vert \beta ^0_j\Vert _2, \end{aligned}$$

as desired. \(\square \)

Lemma 3

For any \(j\in \{1,2,\dots ,J\}\), recall that \(\Vert \beta _j\Vert _2\le {\bar{m}}_j\); then we have the following inequality

$$\begin{aligned} \Vert \beta _j\Vert _2^2-\Vert {\hat{\beta }}_j\Vert _2^2\le \Vert \beta _j-{\hat{\beta }}_j\Vert _2^2+2{\bar{m}}_j\Vert \beta _j-{\hat{\beta }}_j\Vert _2. \end{aligned}$$
(13)

Proof

By the triangle inequality, it is easy to obtain that

$$\begin{aligned} \Vert \beta _j\Vert _2^2-\Vert {\hat{\beta }}_j\Vert _2^2&=\big (\Vert \beta _j\Vert _2-\Vert {\hat{\beta }}_j\Vert _2\big )\big (\Vert \beta _j\Vert _2+\Vert {\hat{\beta }}_j\Vert _2\big )\\&\le \Vert \beta _j-{\hat{\beta }}_j\Vert _2\big (\Vert \beta _j\Vert _2+\Vert {\hat{\beta }}_j-\beta _j+\beta _j\Vert _2\big )\\&\le \Vert \beta _j-{\hat{\beta }}_j\Vert _2\big (\Vert {\hat{\beta }}_j-\beta _j\Vert _2+2\Vert \beta _j\Vert _2\big )\\&\le \Vert \beta _j-{\hat{\beta }}_j\Vert _2^2+2{\bar{m}}_j\Vert \beta _j-{\hat{\beta }}_j\Vert _2. \end{aligned}$$

\(\square \)

Lemma 4

Suppose that \(\epsilon \sim N(0,I_n)\). For a given \(\alpha \in (0,1)\), let \(t=\sqrt{\frac{4\ln (1/\alpha )}{n}}+\frac{4\ln (1/\alpha )}{n}\) and define

$$\begin{aligned} {\mathcal {B}}:=\big \{\Vert \epsilon \Vert _2/\sqrt{n}\le \sqrt{1+t}\big \}. \end{aligned}$$
(14)

Then

$$\begin{aligned} {\mathbb {P}}({\mathcal {B}})\ge 1-\alpha . \end{aligned}$$

The proof of this result is a direct application of Lemma 8.1 in Buhlmann and van de Geer (2011).
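For completeness, a sketch of that application: by the standard \(\chi ^2\) upper-tail bound \({\mathbb {P}}\big (\chi _n^2\ge n+2\sqrt{nx}+2x\big )\le e^{-x}\) (Laurent and Massart 2000), taking \(x=\ln (1/\alpha )\) and noting that \(\Vert \epsilon \Vert _2^2\sim \chi ^2(n)\) gives

$$\begin{aligned} {\mathbb {P}}\bigg (\frac{\Vert \epsilon \Vert _2^2}{n}\ge 1+2\sqrt{\frac{\ln (1/\alpha )}{n}}+\frac{2\ln (1/\alpha )}{n}\bigg )\le \alpha , \end{aligned}$$

and since \(t=\sqrt{4\ln (1/\alpha )/n}+4\ln (1/\alpha )/n\) dominates this deviation, \({\mathbb {P}}({\mathcal {B}}^c)\le \alpha \).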

Proof of Theorem 3

We prove Theorem 3 as stated in Section 3. First, we show that \(\Delta :={\hat{\beta }}-\beta ^0\in \Delta _{\gamma }\), and then derive the bounds in a second step.

By the definition of \({\hat{\beta }}\), we obtain

$$\begin{aligned} \frac{1}{\sqrt{n}}&\Vert Y-X{\hat{\beta }}\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert {\hat{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert {\hat{\beta }}_j\Vert _2^2\\&\le \frac{1}{\sqrt{n}}\Vert Y-X\beta ^0\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \beta ^0_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert \beta ^0_j\Vert _2^2, \end{aligned}$$

then the triangle inequality and Lemma 3 yield

$$\begin{aligned} \frac{1}{\sqrt{n}}&\Vert Y-X{\hat{\beta }}\Vert _2-\frac{1}{\sqrt{n}}\Vert Y-X\beta ^0\Vert _2\nonumber \\&\le \frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\big (\Vert \beta ^0_j\Vert _2-\Vert {\hat{\beta }}_j\Vert _2\big )+\frac{\lambda _2}{n}\sum \limits _{j=1}^J\big (\Vert \beta ^0_j\Vert _2^2- \Vert {\hat{\beta }}_j\Vert _2^2\big )\nonumber \\&\le \frac{\lambda _1}{n}\bigg \{\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2-\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2\bigg \}\nonumber \\&\quad +\frac{\lambda _2}{n}\bigg \{\sum \limits _{j\in S}\big (\Vert \Delta _j\Vert _2^2+2{\bar{m}}_j\Vert \Delta _j\Vert _2\big )-\sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2^2\bigg \}\nonumber \\&=\frac{1}{n}\sum \limits _{j\in S}\big (\lambda _1\sqrt{T_j}+2{\bar{m}}_j\lambda _2\big )\Vert \Delta _j\Vert _2+\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2\nonumber \\&\quad -\frac{\lambda _1}{n}\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2-\frac{\lambda _2}{n}\sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2^2. \end{aligned}$$
(15)

Moreover, note that

$$\begin{aligned} \frac{\nabla \Vert Y-X\beta \Vert _2|_{\beta =\beta ^0}}{\sqrt{n}}=\frac{-X^T\epsilon }{\sqrt{n}\Vert \epsilon \Vert _2}. \end{aligned}$$

Then, by the convexity of the \(\ell _2\)-norm, we obtain

$$\begin{aligned} \frac{1}{\sqrt{n}}\Vert Y-X{\hat{\beta }}\Vert _2-\frac{1}{\sqrt{n}}\Vert Y-X\beta ^0\Vert _2\ge -\frac{|\epsilon ^TX\Delta |}{\sqrt{n}\Vert \epsilon \Vert _2}. \end{aligned}$$

Then, by the Cauchy-Schwarz inequality, we can verify that

$$\begin{aligned} |\epsilon ^TX\Delta |&=|\sum \limits _{j=1}^J\epsilon ^TX_j\Delta _j|\nonumber \\&\le \sum \limits _{j=1}^J\Vert \epsilon ^TX_j\Vert _2\Vert \Delta _j\Vert _2\nonumber \\&=\frac{\Vert \epsilon \Vert _2}{\sqrt{n}}\sum \limits _{j=1}^J\frac{\sqrt{n}\Vert \epsilon ^TX_j\Vert _2}{\Vert \epsilon \Vert _2}\Vert \Delta _j\Vert _2\nonumber \\&=\frac{\Vert \epsilon \Vert _2}{\sqrt{n}}\sum \limits _{j=1}^JV_j\Vert \Delta _j\Vert _2, \end{aligned}$$
(16)

where \(V_j=\frac{\sqrt{n}\Vert X_j^T\epsilon \Vert _2}{\Vert \epsilon \Vert _2}\). Since, on the set \({\mathcal {A}}\), we have \(V_j\le \lambda _1\sqrt{T_j}/{\bar{\gamma }}-2{\bar{m}}_j\lambda _2\) for any \(j\in \{1,2,\dots ,J\}\), the above two inequalities give

$$\begin{aligned} \frac{1}{\sqrt{n}}\Vert Y-X{\hat{\beta }}\Vert _2-\frac{1}{\sqrt{n}}\Vert Y-X\beta ^0\Vert _2\ge -\frac{1}{n}\sum \limits _{j=1}^J\bigg (\frac{\lambda _1\sqrt{T_j}}{{\bar{\gamma }}}-2{\bar{m}}_j\lambda _2\bigg )\Vert \Delta _j\Vert _2. \end{aligned}$$
(17)

Combining (15) and (17), we get

$$\begin{aligned} \frac{\lambda _1}{n}&\bigg (1-\frac{1}{{\bar{\gamma }}}\bigg )\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\lambda _2}{n}\sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2^2\\&\le \frac{\lambda _1}{n}\bigg (1+\frac{1}{{\bar{\gamma }}}\bigg )\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2. \end{aligned}$$

Noting that \(\gamma =\frac{{\bar{\gamma }}+1}{{\bar{\gamma }}-1}\), we find

$$\begin{aligned} \frac{\lambda _1}{n}&\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\gamma +1}{2}\frac{\lambda _2}{n}\sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2^2\\&\le \gamma \frac{\lambda _1}{n}\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\gamma +1}{2}\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2. \end{aligned}$$

Since \(\gamma >1\), we then have

$$\begin{aligned} \frac{\lambda _1}{n}&\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\lambda _2}{n}\sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2^2\\&\le \frac{\lambda _1}{n}\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\gamma +1}{2}\frac{\lambda _2}{n}\sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2^2\\&\le \gamma \bigg (\frac{\lambda _1}{n}\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\gamma +1}{2\gamma }\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2\bigg )\\&\le \gamma \bigg (\frac{\lambda _1}{n}\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2\bigg ), \end{aligned}$$

that is, \(\Delta \in \Delta _{\gamma }\), as desired.

Now we derive the bounds. By (15), we observe that

$$\begin{aligned} \frac{1}{\sqrt{n}}&\Vert Y-X{\hat{\beta }}\Vert _2-\frac{1}{\sqrt{n}}\Vert Y-X\beta ^0\Vert _2\\&\le \frac{1}{n}\sum \limits _{j\in S}\big (\lambda _1\sqrt{T_j}+2{\bar{m}}_j\lambda _2\big )\Vert \Delta _j\Vert _2+\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2. \end{aligned}$$

By the definition of \({\hat{\beta }}\), we obtain

$$\begin{aligned} \frac{1}{\sqrt{n}}\Vert Y-X{\hat{\beta }}\Vert _2\le \frac{1}{\sqrt{n}}\Vert Y\Vert _2. \end{aligned}$$

By Problem 6.1 in Buhlmann and van de Geer (2011), for any reasonable signal-to-noise ratio, we have \(\sigma \le \Vert Y\Vert _2/\sqrt{n}\le const.\sigma \), with the "const." well under control.

Then, from Lemma 1 in Laurent and Massart (2000), \({\mathbb {P}}(X\le n-nt)\le \exp {(-nt^2/4)}\) for \(X\sim \chi ^2(n)\). Thus, we have \(\Vert Y\Vert _2/\sqrt{n}\le \varrho \Vert \sigma \epsilon \Vert _2/\sqrt{n}\), with \(\varrho >1\).
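To spell out this step: setting \(x=nt^2/4\) in the lower-tail bound \({\mathbb {P}}(X\le n-2\sqrt{nx})\le e^{-x}\) shows that \(\Vert \epsilon \Vert _2^2\ge n(1-t)\) holds with probability at least \(1-\exp (-nt^2/4)\); on this event, combining with the previous display,

$$\begin{aligned} \frac{\Vert Y\Vert _2}{\sqrt{n}}\le const.\sigma \le \frac{const.}{\sqrt{1-t}}\cdot \frac{\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}=:\varrho \frac{\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}. \end{aligned}$$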

In addition,

$$\begin{aligned} \frac{1}{n}\Vert Y-X{\hat{\beta }}\Vert _2^2-\frac{1}{n}\Vert Y-X\beta ^0\Vert _2^2&=\frac{\Vert \sigma \epsilon -X\Delta \Vert _2^2}{n}-\frac{\Vert \sigma \epsilon \Vert _2^2}{n}\\&=\frac{\Vert X\Delta \Vert _2^2}{n}-\frac{2\sigma \epsilon ^TX\Delta }{n}. \end{aligned}$$

Thus, by (15), it is easy to obtain that

$$\begin{aligned}\frac{\Vert X\Delta \Vert _2^2}{n}&=\frac{1}{n}\Vert Y-X{\hat{\beta }}\Vert _2^2-\frac{1}{n}\Vert Y-X\beta ^0\Vert _2^2+\frac{2\sigma \epsilon ^TX\Delta }{n}\\&\le \frac{1}{n}\big (\Vert Y-X{\hat{\beta }}\Vert _2-\Vert Y-X\beta ^0\Vert _2\big )\big (\Vert Y-X{\hat{\beta }}\Vert _2+\Vert Y-X\beta ^0\Vert _2\big )+\frac{2\sigma \epsilon ^TX\Delta }{n}\\&\le \bigg \{\frac{\Vert Y\Vert _2+\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\bigg \}\frac{1}{n}\sum \limits _{j\in S}\bigg \{\big (\lambda _1\sqrt{T_j}+2{\bar{m}}_j\lambda _2\big )\Vert \Delta _j\Vert _2+\lambda _2\Vert \Delta _j\Vert _2^2\bigg \}+\frac{2\sigma |\epsilon ^TX\Delta |}{n}\\&\le \bigg \{\frac{\Vert Y\Vert _2+\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\bigg \}\frac{1}{n}\sum \limits _{j\in S}\bigg \{\big (\lambda _1\sqrt{T_j}+2{\bar{m}}_j\lambda _2\big )\Vert \Delta _j\Vert _2+\lambda _2\Vert \Delta _j\Vert _2^2\bigg \}\\& \quad +\frac{2\Vert \sigma \epsilon \Vert _2}{n^{3/2}}\sum \limits _{j=1}^J\bigg (\frac{\lambda _1\sqrt{T_j}}{{\bar{\gamma }}}-2{\bar{m}}_j\lambda _2\bigg )\Vert \Delta _j\Vert _2\\&\le \frac{{\bar{\varrho }}\gamma \Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\bigg (\frac{\lambda _1}{n}\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2\bigg ), \end{aligned}$$

where \({\bar{\varrho }}\ge 2\). Then we find

$$\begin{aligned} \frac{\Vert X\Delta \Vert _2^2}{n}\le \frac{{\bar{\varrho }}\gamma \Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\bigg (\frac{\lambda _1\sqrt{s^*}}{\sqrt{n}\kappa }\frac{\Vert X\Delta \Vert _2}{n}+ \frac{\lambda _2}{n\kappa ^2}\frac{\Vert X\Delta \Vert _2^2}{n}\bigg ), \end{aligned}$$

by the RE assumption. By the definition of \({\mathcal {B}}\), we have

$$\begin{aligned} \frac{\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\le \sigma \sqrt{1+t}. \end{aligned}$$

Then

$$\begin{aligned} \Vert X\Delta \Vert _2\le \frac{u\lambda _1\sqrt{s^*}}{\sqrt{n}\kappa }\frac{\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\le \sigma \sqrt{1+t}\frac{u\lambda _1\sqrt{s^*}}{\sqrt{n}\kappa }\lesssim \frac{\sigma \lambda _1\sqrt{s^*}}{\sqrt{n}\kappa }, \end{aligned}$$
(18)

where \(u=\frac{{\bar{\varrho }}\gamma }{1-\sigma \sqrt{1+t}\frac{{\bar{\varrho }}\gamma \lambda _2}{n\kappa ^2}}\in (0,\infty )\).

Next, using the fact that \(\Delta \in \Delta _{\gamma }\), by the RE assumption we can prove that

$$\begin{aligned} \lambda _1&\sum \limits _{j=1}^J\sqrt{T_j}\Vert \Delta _j\Vert _2+\lambda _2\sum \limits _{j=1}^J\Vert \Delta _j\Vert _2^2\\&\le (1+\gamma )\bigg (\lambda _1\sum \limits _{j\in S}\sqrt{T_j}\Vert \Delta _j\Vert _2+\lambda _2\sum \limits _{j\in S}\Vert \Delta _j\Vert _2^2\bigg )\\&\le (1+\gamma )\bigg (\lambda _1\sqrt{s^*}\frac{\Vert X\Delta \Vert _2}{\sqrt{n}\kappa }+\lambda _2\frac{\Vert X\Delta \Vert _2^2}{n\kappa ^2}\bigg )\\&\le \sigma \sqrt{1+t}\frac{(1+\gamma )s^*u\lambda _1}{n\kappa ^2}\bigg (\lambda _1+\lambda _2\frac{u\lambda _1\sigma \sqrt{1+t}}{n\kappa ^2}\bigg )\\&\lesssim \sigma (1+t)\frac{\lambda _1s^*}{n\kappa ^2}\bigg (\lambda _1+\sigma \lambda _1\frac{\lambda _2}{n\kappa ^2}\bigg )\\&\lesssim \frac{\sigma \lambda _1^2s^*}{n\kappa ^2}\bigg (1+\sigma \frac{\lambda _2}{n\kappa ^2}\bigg ), \end{aligned}$$

which concludes the proof of this theorem. \(\square \)

Lemma 5

Under the condition of Theorem 2, on the set \({\mathcal {A}}\cap {\mathcal {B}}\), we have

$$\begin{aligned} \bigg (1-\frac{\lambda _1\sqrt{s^*}u}{n\kappa }\bigg )\Vert \sigma \epsilon \Vert _2\le \Vert Y-X{\hat{\beta }}\Vert _2\le \bigg (1+\frac{\lambda _1\sqrt{s^*}u}{n\kappa }\bigg )\Vert \sigma \epsilon \Vert _2, \end{aligned}$$

for \(u=\frac{{\bar{\varrho }}\gamma }{1-\sigma \sqrt{1+t}\frac{{\bar{\varrho }}\gamma \lambda _2}{n\kappa ^2}}\).

Proof

By the triangle inequality and (18), it is easy to prove that

$$\begin{aligned} \Vert \sigma \epsilon \Vert _2-\Vert X\Delta \Vert _2\le \Vert Y-X{\hat{\beta }}\Vert _2\le \Vert \sigma \epsilon \Vert _2+\Vert X\Delta \Vert _2. \end{aligned}$$

Then we get the conclusion immediately. \(\square \)

Proof of Theorem 6

We prove Theorem 6 as stated in Section 3. The crucial step of this proof is to use the KKT conditions, i.e.,

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{X_j^T(Y-X{\hat{\beta }})}{\Vert Y-X{\hat{\beta }}\Vert _2}=\frac{\lambda _1}{\sqrt{n}}\sqrt{T_j}\frac{{\hat{\beta }}_j}{\Vert {\hat{\beta }}_j\Vert _2}+\frac{2\lambda _2}{\sqrt{n}}{\hat{\beta }}_j,~~~~{\hat{\beta }}_j\ne 0,\\ \frac{\Vert X_j^T(Y-X{\hat{\beta }})\Vert _2}{\Vert Y-X{\hat{\beta }}\Vert _2}\le \frac{\lambda _1}{\sqrt{n}}\sqrt{T_j},~~~~~~~~~~~~~~~~~~{\hat{\beta }}_j= 0. \end{array}\right. } \end{aligned}$$
(19)

Thus, there exists a vector \(\tau \in {\mathbb {R}}^p\) such that \(\Vert \tau _j\Vert _2\le \sqrt{T_j}\) for all \(j\in \{1,2,\dots ,J\}\) and, additionally, \(\tau _j=\frac{\sqrt{T_j}{\hat{\beta }}_j}{\Vert {\hat{\beta }}_j\Vert _2}\) for all \(j\in {\hat{S}}\), i.e., \({\hat{\beta }}_j\ne 0\). Then the above equation can be rewritten as

$$\begin{aligned} \frac{X^T(Y-X{\hat{\beta }})}{\Vert Y-X{\hat{\beta }}\Vert _2}=\frac{\lambda _1}{\sqrt{n}}\tau +\frac{2\lambda _2}{\sqrt{n}}{\hat{\beta }}. \end{aligned}$$
(20)

Denote \({\hat{\psi }}=\Vert Y-X{\hat{\beta }}\Vert _2\). Substituting \(Y=X\beta ^0+\sigma \epsilon \) and multiplying (20) by \({\hat{\psi }}\), we get

$$\begin{aligned} \sigma X^T\epsilon -X^TX\Delta =\frac{\lambda _1{\hat{\psi }}}{\sqrt{n}}\tau +\frac{2\lambda _2{\hat{\psi }}}{\sqrt{n}}{\hat{\beta }}. \end{aligned}$$

On the one hand,

$$\begin{aligned} -n^2C_{11}\Delta _S-n^2C_{12}\Delta _{S^c}=\sqrt{n}\lambda _1{\hat{\psi }}\tau _S+2\sqrt{n}\lambda _2{\hat{\psi }}{\hat{\beta }}_S-n\sigma (X^T\epsilon )_S. \end{aligned}$$

Since \(\Delta _S={\hat{\beta }}_S-\beta _S^0\), we have

$$\begin{aligned}&-n^2C_{11}\Delta _S-2n\lambda _2\Delta _S-n^2C_{12}\Delta _{S^c}\nonumber \\&=\sqrt{n}\lambda _1{\hat{\psi }}\tau _S+2\sqrt{n}\lambda _2({\hat{\psi }}-\sqrt{n})\Delta _S +2\sqrt{n}\lambda _2{\hat{\psi }}\beta _S^0-n\sigma (X^T\epsilon )_S, \end{aligned}$$
(21)

or, equivalently,

$$\begin{aligned}&-n^2\Delta _{S^c}^TC_{21}\Delta _S\nonumber \\&=n^2\Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}C_{12}\Delta _{S}+ \sqrt{n}\lambda _1{\hat{\psi }}\Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}\tau _S\nonumber \\&\qquad +2\sqrt{n}\lambda _2({\hat{\psi }}-\sqrt{n})\Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}\Delta _S\nonumber \\&\qquad +2\sqrt{n}\lambda _2{\hat{\psi }}\Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}\beta _S^0 -n\sigma \Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}(X^T\epsilon )_S\nonumber \\&\quad =\sqrt{n}\lambda _1{\hat{\psi }}\Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}\bigg ( \tau _S+\frac{2\lambda _2({\hat{\psi }}-\sqrt{n})}{\lambda _1{\hat{\psi }}}\Delta _S+2\frac{\lambda _2}{\lambda _1}\beta _S^0\bigg )\nonumber \\&\qquad -n\sigma \Delta _{S^c}^TC_{21}\bigg (C_{11}+\frac{2\lambda _2}{n}I\bigg )^{-1}(X^T\epsilon )_S. \end{aligned}$$
(22)

On the other hand,

$$\begin{aligned} -n^2C_{21}\Delta _S-n^2C_{22}\Delta _{S^c}=\sqrt{n}\lambda _1{\hat{\psi }}\tau _{S^c}+2\sqrt{n}\lambda _2{\hat{\psi }}{\hat{\beta }}_{S^c}-n\sigma (X^T\epsilon )_{S^c}. \end{aligned}$$

Since for all \(j\in S^c\)

$$\begin{aligned}&{\hat{\beta }}_j\ne 0~\Rightarrow ~ \Delta _j^T\tau _j=\sqrt{T_j}\Vert \Delta _j\Vert _2,~~\Delta _j^T{\hat{\beta }}_j=\Vert \Delta _j\Vert _2^2,\\&{\hat{\beta }}_j= 0~\Rightarrow ~ \Delta _j^T\tau _j=0=\sqrt{T_j}\Vert \Delta _j\Vert _2,~~\Delta _j^T{\hat{\beta }}_j=0=\Vert \Delta _j\Vert _2^2, \end{aligned}$$

this implies that

$$\begin{aligned}&-n^2\Delta _{S^c}^TC_{21}\Delta _S-n^2\Delta _{S^c}^TC_{22}\Delta _{S^c}\\& \quad =\sqrt{n}\lambda _1{\hat{\psi }}\Delta _{S^c}^T\tau _{S^c} +2\sqrt{n}\lambda _2{\hat{\psi }}\Delta _{S^c}^T{\hat{\beta }}_{S^c}-n\sigma \Delta _{S^c}^T(X^T\epsilon )_{S^c}\\&\quad\ge \sqrt{n}\lambda _1{\hat{\psi }}\sum \limits _{j\in S^c}\bigg (\sqrt{T_j}\Vert \Delta _j\Vert _2+\frac{2\lambda _2}{\lambda _1}\Vert \Delta _j\Vert _2^2 -\frac{\sqrt{n}\sigma \Vert (X^T\epsilon )_j\Vert _2}{\lambda _1{\hat{\psi }}}\Vert \Delta _j\Vert _2\bigg )\\&\quad \ge \sqrt{n}\lambda _1{\hat{\psi }}\sum \limits _{j\in S^c}\bigg (\sqrt{T_j}\Vert \Delta _j\Vert _2 -\frac{\sqrt{n}\sigma \Vert (X^T\epsilon )_j\Vert _2}{\lambda _1{\hat{\psi }}}\Vert \Delta _j\Vert _2\bigg ). \end{aligned}$$

Lemma 5 implies that \(\sqrt{T_j}\lambda _1/{\tilde{\eta }}-\big ((4{\bar{m}}_j)\vee (6/\kappa )\big )\lambda _2\ge {\hat{V}}_j\), where

$$\begin{aligned} {\hat{V}}_j:=\frac{\sigma \sqrt{n}\Vert X_j^T\epsilon \Vert _2}{{\hat{\psi }}}. \end{aligned}$$

Thus, we have

$$\begin{aligned} -n^2\Delta _{S^c}^TC_{21}\Delta _S-n^2\Delta _{S^c}^TC_{22}\Delta _{S^c}\ge \bigg (1-\frac{1}{{\tilde{\eta }}}\bigg )\sqrt{n}\lambda _1{\hat{\psi }}\sum \limits _{j\in S^c} \sqrt{T_j}\Vert \Delta _j\Vert _2. \end{aligned}$$
(23)

Combining (22) and (23) yields

$$\begin{aligned}&n^2\Delta _{S^c}^T\bigg (C_{22}-C_{21}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}C_{12}\bigg )\Delta _{S^c}\nonumber \\&\quad \le \sqrt{n}\lambda _1{\hat{\psi }}\Delta _{S^c}^TC_{21}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}\bigg ( \tau _S+\frac{2\lambda _2({\hat{\psi }}-\sqrt{n})}{\lambda _1{\hat{\psi }}}\Delta _S+2\frac{\lambda _2}{\lambda _1}\beta _S^0\bigg )\nonumber \\&\qquad -n\sigma \Delta _{S^c}^TC_{21}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}(X^T\epsilon )_S-\bigg (1-\frac{1}{{\tilde{\eta }}}\bigg )\sqrt{n}\lambda _1{\hat{\psi }}\sum \limits _{j\in S^c} \sqrt{T_j}\Vert \Delta _j\Vert _2. \end{aligned}$$
(24)

The first term of the right-hand side above can be bounded, via the Cauchy-Schwarz inequality, by

$$\begin{aligned}&\Delta _{S^c}^TC_{21}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}\bigg ( \tau _S+\frac{2\lambda _2({\hat{\psi }}-\sqrt{n})}{\lambda _1{\hat{\psi }}}\Delta _S+2\frac{\lambda _2}{\lambda _1}\beta _S^0 -\frac{\sqrt{n}\sigma }{\lambda _1{\hat{\psi }}}(X^T\epsilon )_S\bigg )\nonumber \\&\quad =\sum \limits _{j\in S^c}\Delta _j^T \bigg ({\tilde{C}}_{21}\big (C_{11}+\frac{2\lambda _2I}{n}\big )^{-1}\big ( \tau _S+\frac{2\lambda _2({\hat{\psi }}-\sqrt{n})}{\lambda _1{\hat{\psi }}}\Delta _S+\frac{2\lambda _2}{\lambda _1}\beta _S^0 -\frac{\sqrt{n}\sigma }{\lambda _1{\hat{\psi }}}(X^T\epsilon )_S\big )\bigg )_j\nonumber \\&\quad \le \sum \limits _{j\in S^c}\Vert \Delta _j\Vert _2 \bigg \Vert \bigg ({\tilde{C}}_{21}\big (C_{11}+\frac{2\lambda _2I}{n}\big )^{-1}\big ( \tau _S+\frac{2\lambda _2({\hat{\psi }}-\sqrt{n})}{\lambda _1{\hat{\psi }}}\Delta _S\nonumber \\&\qquad +\frac{2\lambda _2}{\lambda _1}\beta _S^0 -\frac{\sqrt{n}\sigma }{\lambda _1{\hat{\psi }}}(X^T\epsilon )_S\big )\bigg )_j\bigg \Vert _2, \end{aligned}$$
(25)

where \({\tilde{C}}_{21}=(0~C_{12})^T\). Note that \(\Vert \tau _j\Vert _2\le \sqrt{T_j}\); then we have \(\frac{\sqrt{n}\sigma }{\lambda _1{\hat{\psi }}}\Vert X_j^T\epsilon \Vert _2\le \sqrt{T_j}/{\tilde{\eta }}-\big ((4{\bar{m}}_j)\vee (6/\kappa )\big )\lambda _2/\lambda _1\). If \({\hat{\psi }}\ge \sqrt{n}\), by the RE Condition and Lemma 5 we have

$$\begin{aligned} \frac{2\lambda _2|{\hat{\psi }}-\sqrt{n}|}{\lambda _1{\hat{\psi }}}\Vert \Delta _j\Vert _2\le \frac{4\lambda _2}{\lambda _1}{\bar{m}}_j. \end{aligned}$$

Otherwise,

$$\begin{aligned} \frac{2\lambda _2|{\hat{\psi }}-\sqrt{n}|}{\lambda _1{\hat{\psi }}}\Vert \Delta _j\Vert _2&\le \frac{2\lambda _2(\sqrt{n}-{\hat{\psi }})}{\lambda _1\sqrt{n}} \frac{\Vert X\Delta \Vert _2}{{\hat{\psi }}\kappa } \le \frac{2\lambda _2}{\lambda _1}\frac{\Vert X\Delta \Vert _2}{{\hat{\psi }}\kappa }\\&\le \frac{2\lambda _2}{\lambda _1}\frac{{\hat{\psi }}+\Vert \sigma \epsilon \Vert _2}{{\hat{\psi }}\kappa }\le \frac{6\lambda _2}{\lambda _1\kappa }. \end{aligned}$$

Thus, the right-hand side of (24) can be bounded by

$$\begin{aligned}&\sqrt{n}\lambda _1{\hat{\psi }}\max \limits _{\nu :\Vert \nu _k\Vert _2\le \big (1+\frac{1}{{\tilde{\eta }}}\big )\sqrt{T_k} +\frac{2\lambda _2}{\lambda _1}\Vert \beta _k^0\Vert _2} \sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2\frac{\Vert ({\tilde{C}}_{21}(C_{11}+\frac{2\lambda _2}{n}I)^{-1}\nu )_j\Vert _2}{\sqrt{T_j}}\\&\le \bigg (1+\frac{1}{{\tilde{\eta }}}\bigg )\sqrt{n}\lambda _1{\hat{\psi }}\max \limits _{\nu :\Vert \nu _k\Vert _2\le \sqrt{T_k}+ \frac{2\lambda _2}{\lambda _1}\Vert \beta _k^0\Vert _2} \sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2\frac{\Vert ({\tilde{C}}_{21}(C_{11}+\frac{2\lambda _2}{n}I)^{-1}\nu )_j\Vert _2}{\sqrt{T_j}}. \end{aligned}$$

Since \({\tilde{\eta }}=\frac{1+\eta }{1-\eta }\), then by the Group Elastic Net Irrepresentable Condition, if \({\hat{\beta }}_{S^c}\ne 0\), the above expression is smaller than

$$\begin{aligned} \bigg (1+\frac{1}{{\tilde{\eta }}}\bigg )\eta \sqrt{n}\lambda _1{\hat{\psi }}\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2=\bigg (1-\frac{1}{{\tilde{\eta }}}\bigg ) \sqrt{n}\lambda _1{\hat{\psi }}\sum \limits _{j\in S^c}\sqrt{T_j}\Vert \Delta _j\Vert _2. \end{aligned}$$

Thus, together with (24), this yields

$$\begin{aligned} n^2\Delta _{S^c}^T\big (C_{22}-C_{21}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}C_{12}\big )\Delta _{S^c}<0. \end{aligned}$$
(26)

Yet, \(C_{22}-C_{21}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}C_{12}\ge 0\), which leads to a contradiction. Thus, we obtain \({\hat{\beta }}_{S^c}= 0\).

Now we prove the second claim. First, we substitute \({\hat{\beta }}_{S^c}=0\) into (20):

$$\begin{aligned}&-n^2C_{11}\Delta _S-2n\lambda _2\Delta _S\nonumber \\&\quad =\sqrt{n}\lambda _1{\hat{\psi }}\tau _S+2\sqrt{n}\lambda _2({\hat{\psi }}-\sqrt{n})\Delta _S +2\sqrt{n}\lambda _2{\hat{\psi }}\beta _S^0-n\sigma (X^T\epsilon )_S. \end{aligned}$$
(27)

Similarly to the above, this has the equivalent form

$$\begin{aligned} -n^2\Delta _S=\sqrt{n}\lambda _1{\hat{\psi }}\big (C_{11}+\frac{2\lambda _2}{n}I\big )^{-1}\big (\tau _S+\frac{2\lambda _2({\hat{\psi }}-\sqrt{n})}{\lambda _1{\hat{\psi }}}\Delta _S +\frac{2\lambda _2}{\lambda _1}\beta _S^0-\frac{\sqrt{n}\sigma }{\lambda _1{\hat{\psi }}}(X^T\epsilon )_S\big ). \end{aligned}$$

Then on the event \({\mathcal {B}}\), we can use \(\sqrt{T_j}\lambda _1/{\tilde{\eta }}-6\lambda _2/\kappa \ge {\hat{V}}_j\) and Lemma 5 to bound the above

$$\begin{aligned} \Vert \Delta _j\Vert _{\infty }&\le \max \limits _{\nu :\Vert \nu _k\Vert _2\le \sqrt{T_k}+\frac{2\lambda _2}{\lambda _1}\Vert \beta _k^0\Vert _2}\frac{\big (1+\frac{1}{{\tilde{\eta }}}\big )\lambda _1}{n}\frac{{\hat{\psi }}}{\sqrt{n}}\big \Vert \big ((C_{11}+\frac{2\lambda _2}{n}I)^{-1}\nu \big )_j\big \Vert _{\infty }\\&=\max \limits _{\nu :\Vert \nu _k\Vert _2\le 1}\frac{\big (1+\frac{1}{{\tilde{\eta }}}\big )\lambda _1}{n}\frac{{\hat{\psi }}}{\sqrt{n}}\big (\sqrt{T_j}+\frac{2\lambda _2}{\lambda _1}\Vert \beta _j^0\Vert _2\big )\big \Vert \big ((C_{11}+\frac{2\lambda _2}{n}I)^{-1}\nu \big )_j\big \Vert _{\infty }\\&\le \big (\sqrt{T_j}+\frac{2\lambda _2}{\lambda _1}\Vert \beta _j^0\Vert _2\big )\frac{\big (1+\frac{1}{{\tilde{\eta }}}\big )\lambda _1}{n}\frac{\big (1+\frac{\lambda _1\sqrt{s^*}u}{n\kappa }\big )\Vert \sigma \epsilon \Vert _2}{\sqrt{n}}\xi _{\Vert \cdot \Vert _{\infty }}\\&\le \frac{2\sigma \sqrt{1+t}}{1+\eta }\big (1+\frac{u}{2\varrho \gamma }\big )\frac{\lambda _1\big (\sqrt{T_j}+\frac{2\lambda _2}{\lambda _1}\Vert \beta _j^0\Vert _2\big )}{n}\xi _{\Vert \cdot \Vert _{\infty }}\\&\le \frac{D\big (\lambda _1\sqrt{T_j}+2\lambda _2\Vert \beta _j^0\Vert _2\big )}{n},\quad \text {for all}~1\le j \le J, \end{aligned}$$

where \(D=\frac{2\sigma \sqrt{1+t}}{1+\eta }\big (1+\frac{u}{2\varrho \gamma }\big )\xi _{\Vert \cdot \Vert _{\infty }}\). Hence, the proof of the second claim is completed. \(\square \)

For the third claim, by the triangle inequality we have

$$\begin{aligned} \Vert {\hat{\beta }}_j\Vert _{\infty }\ge \Vert \beta _j^0\Vert _{\infty }-\Vert \Delta _j\Vert _{\infty }\ge \frac{D\lambda _1\sqrt{T_j}}{n(1-\frac{2D\lambda _2\sqrt{T_j}}{n})} -\frac{D\big (\lambda _1\sqrt{T_j}+2\lambda _2\Vert \beta _j^0\Vert _2\big )}{n}> 0. \end{aligned}$$

Thus, we complete the proof.

Proof of Theorem 7

We prove Theorem 7 as stated in Section 4. By Lemma IV.4 in Bunea et al. (2014), the mapping is nonexpansive. The key of this proof is then to show that the mapping is asymptotically regular, which means that \(\Vert \beta (t+1)-\beta (t)\Vert _2\rightarrow 0\) as \(t\rightarrow \infty \) for any initial value \(\beta (0)\).

Since the scaling operations on the objective \(F(\beta )=\Vert Y-X\beta \Vert _2+\sum \nolimits _{j=1}^J\lambda _{1j}\Vert \beta _j\Vert _2+\frac{\lambda _2}{2}\sum \nolimits _{j=1}^J\Vert \beta _j\Vert _2^2\) have been performed beforehand, where \(\lambda _{1j}=\lambda _1\sqrt{T_j},~j=1,2,\dots ,J\), we can introduce a surrogate function

$$\begin{aligned} G(\beta ,\gamma )=&\Vert Y-X\beta \Vert _2+\frac{1}{\Vert Y-X\beta \Vert _2}(\gamma -\beta )^TX^T(X\beta -Y)\nonumber \\&+\frac{1}{2\Vert Y-X\beta \Vert _2}\Vert \beta -\gamma \Vert _2^2+\sum \limits _{j=1}^J\lambda _{1j}\Vert \gamma _j\Vert _2+\frac{\lambda _2}{2}\sum \limits _{j=1}^J\Vert \gamma _j\Vert _2^2. \end{aligned}$$
(28)

Given \(\beta \), minimizing the above equation w.r.t. \(\gamma \) is equivalent to

$$\begin{aligned}&\min \limits _{\gamma }\frac{1}{\Vert Y-X\beta \Vert _2}\bigg (\frac{1}{2}\Vert \beta -\gamma \Vert _2^2+(\gamma -\beta )^TX^T(X\beta -Y) +\Vert Y-X\beta \Vert _2\sum \limits _{j=1}^J\lambda _{1j}\Vert \gamma _j\Vert _2\\&\qquad +\frac{\lambda _2}{2}\Vert Y-X\beta \Vert _2\sum \limits _{j=1}^J\Vert \gamma _j\Vert _2^2\bigg )~\Leftrightarrow \\&\min \limits _{\gamma }\frac{1}{\Vert Y-X\beta \Vert _2}\bigg (\frac{1}{2}\Vert \gamma -\beta -X^TY+X^TX\beta \Vert _2^2+\Vert Y-X\beta \Vert _2\sum \limits _{j=1}^J\lambda _{1j}\Vert \gamma _j\Vert _2\\&\qquad +\frac{\lambda _2}{2}\Vert Y-X\beta \Vert _2\sum \limits _{j=1}^J\Vert \gamma _j\Vert _2^2\bigg )~\Leftrightarrow \\&\min \limits _{\gamma }\frac{1+\lambda _2\Vert Y-X\beta \Vert _2}{\Vert Y-X\beta \Vert _2}\bigg (\frac{1}{2}\Big \Vert \gamma -\frac{\beta +X^TY-X^TX\beta }{1+\lambda _2\Vert Y-X\beta \Vert _2}\Big \Vert _2^2 +\sum \limits _{j=1}^J\frac{\lambda _{1j}\Vert Y-X\beta \Vert _2}{1+\lambda _2\Vert Y-X\beta \Vert _2}\Vert \gamma _j\Vert _2\bigg ). \end{aligned}$$

Applying Lemmas 1 and 2 in She (2012) (the equivalences above hold up to additive terms not depending on \(\gamma \)), the minimizer can be computed as

$$\begin{aligned} {\hat{\gamma }}_j=\mathbf {\Theta }\bigg (\frac{\beta _j+X_j^T(Y-X\beta )}{1+\lambda _2\Vert Y-X\beta \Vert _2};\frac{\lambda _{1j}\Vert Y-X\beta \Vert _2}{1+\lambda _2\Vert Y-X\beta \Vert _2}\bigg ), \end{aligned}$$
(29)

and further obtain

$$\begin{aligned} G(\beta ,{\hat{\gamma }}+\delta )-G(\beta ,{\hat{\gamma }})\ge \frac{\Vert \delta \Vert _2^2}{2\Vert Y-X\beta \Vert _2}\big (1+\lambda _2\Vert Y-X\beta \Vert _2\big ). \end{aligned}$$
(30)

On the other hand, the Taylor expansion gives

$$\begin{aligned}&\Vert Y-X\beta \Vert _2+\frac{1}{\Vert Y-X\beta \Vert _2}(\gamma -\beta )^TX^T(X\beta -Y)-\Vert Y-X\gamma \Vert _2\nonumber \\& \quad =-\frac{1}{2}(\beta -\gamma )^T\bigg (\frac{X^TX}{\Vert Y-X\xi \Vert _2}-\frac{X^T(X\xi -Y)(X\xi -Y)^TX}{\Vert Y-X\xi \Vert _2^3}\bigg )(\beta -\gamma ), \end{aligned}$$
(31)

for some \(\xi =\theta \beta +(1-\theta )\gamma \) with \(\theta \in (0,1)\).

Since \((\beta -\gamma )^TX^T(X\xi -Y)(X\xi -Y)^TX(\beta -\gamma )\ge 0\), by the definition of \(G(\beta ,\gamma )\), (30) and (31), we obtain

$$ \begin{aligned}&F(\beta (t+1))\\& \qquad +\frac{1}{2}(\beta (t+1)-\beta (t))^T\big (\frac{I}{\Vert Y-X\beta (t)\Vert _2}-\frac{X^TX}{\Vert Y-X\xi (t)\Vert _2}\big )(\beta (t+1)-\beta (t))\\& \quad\le G(\beta (t),\beta (t+1))\\& \quad\le G(\beta (t),\beta (t))-\frac{1+\lambda _2\Vert Y-X\beta (t)\Vert _2}{2\Vert Y-X\beta (t)\Vert _2}\Vert \beta (t+1)-\beta (t)\Vert _2^2\\& \quad=F(\beta (t))-\frac{1+\lambda _2\Vert Y-X\beta (t)\Vert _2}{2\Vert Y-X\beta (t)\Vert _2}\Vert \beta (t+1)-\beta (t)\Vert _2^2, \end{aligned} $$

for some \(\xi (t)=\theta (t)\beta (t)+(1-\theta (t))\beta (t+1)\) with \(\theta (t)\in (0,1)\).

Thus, letting \(\Vert X\Vert _2\) be the operator norm of \(X\), the following inequality holds:

$$\begin{aligned}&F(\beta (t))-F(\beta (t+1))\nonumber \\&\quad \ge \frac{1}{2}\bigg (\frac{2+\lambda _2\Vert Y-X\beta (t)\Vert _2}{\Vert Y-X\beta (t)\Vert _2}-\frac{\Vert X\Vert _2^2}{\Vert Y-X\xi (t)\Vert _2}\bigg )\Vert \beta (t+1)-\beta (t)\Vert _2^2. \end{aligned}$$
(32)

Under the regularity condition \(\inf _{\xi \in A}\Vert X\xi -Y\Vert _2>0\), where \(A=\big \{\theta \beta (t)+(1-\theta )\beta (t+1):\theta \in [0,1],t=0,1,\dots \big \}\), \(F(\beta (t))\) is monotonically decreasing for large enough \(K\). Then we have

$$\begin{aligned}&0\le F(\beta (t+1))\le F(\beta (t))\le M,\\&F(\beta (t))-F(\beta (t+1))\rightarrow 0~\text {as}~t\rightarrow \infty . \end{aligned}$$

In fact, with \(\Vert Y-X\xi (t)\Vert _2\ge \epsilon \) and \(M\triangleq F(\beta (0))\), the condition \(\Vert X\Vert _2^2\le 2\epsilon /M\) suffices to guarantee \(\big (\frac{2+\lambda _2\Vert Y-X\beta (t)\Vert _2}{\Vert Y-X\beta (t)\Vert _2}-\frac{\Vert X\Vert _2^2}{\Vert Y-X\xi (t)\Vert _2}\big )>0\). It is then easy to prove that

$$\begin{aligned} \bigg (\frac{2+\lambda _2\Vert Y-X\beta (t)\Vert _2}{\Vert Y-X\beta (t)\Vert _2}-\frac{\Vert X\Vert _2^2}{\Vert Y-X\xi (t)\Vert _2}\bigg )\Vert \beta (t+1)-\beta (t)\Vert _2^2\rightarrow 0~\text {as}~t\rightarrow \infty . \end{aligned}$$

Thus, by Opial's condition in Opial (1967) and She (2010), \(\beta (t)\) has a unique limit point \(\beta ^*\). It is easy to verify that \(\beta ^*\), as a fixed point of (10), satisfies the KKT condition, which implies that \(\beta ^*\) is a global minimizer. \(\square \)
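To make the iteration analyzed above concrete, here is a minimal Python sketch of update (29); it is not the authors' implementation. It assumes that \(\mathbf {\Theta }\) is the multivariate soft-thresholding operator \(\mathbf {\Theta }(z;\lambda )=z\max (0,1-\lambda /\Vert z\Vert _2)\) from She (2012), that \(\lambda _{1j}=\lambda _1\sqrt{T_j}\), and that \(X\) has been scaled beforehand as required above; the function name gsr_enet and its defaults are illustrative only.

    import numpy as np

    def theta(z, lam):
        # Multivariate soft-thresholding: z * max(0, 1 - lam / ||z||_2).
        nz = np.linalg.norm(z)
        return z * max(0.0, 1.0 - lam / nz) if nz > 0 else z

    def gsr_enet(X, Y, groups, lam1, lam2, max_iter=1000, tol=1e-8):
        # Scaled multivariate thresholding iteration, a sketch of (29):
        # `groups` is a list of index arrays and lam_{1j} = lam1*sqrt(T_j).
        # Assumes the residual never hits zero, consistent with the
        # regularity condition inf ||X xi - Y||_2 > 0 used in the proof.
        beta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            r = Y - X @ beta                  # current residual
            c = np.linalg.norm(r)             # ||Y - X beta||_2
            z = (beta + X.T @ r) / (1.0 + lam2 * c)
            beta_new = np.zeros_like(beta)
            for g in groups:
                thr = lam1 * np.sqrt(len(g)) * c / (1.0 + lam2 * c)
                beta_new[g] = theta(z[g], thr)
            if np.linalg.norm(beta_new - beta) <= tol:
                return beta_new
            beta = beta_new
        return beta

On a toy problem, e.g. groups = [list(range(0, 5)), list(range(5, 10))] with a column-scaled design, the iterates should decrease the objective \(F\) monotonically, in line with (32).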

Appendix B: Properties of Solution of Group Square-Root Elastic Net

Now we discuss some properties of the solution of the Group Square-Root Elastic Net. With the square root of the residual sum of squares as the loss function and a group elastic net penalty, the solution exists and is \(\sigma \)-free. In the following, we show the uniqueness of our estimates.

Lemma 6

For any \(X\), \(Y\) and \(\lambda _1,\lambda _2>0\), let \({\hat{\beta }},{\tilde{\beta }}\in \arg \min _{\beta \in {\mathbb {R}}^p}\big \{\frac{1}{\sqrt{n}}\Vert Y-X\beta \Vert _2+\frac{\lambda _1}{n}\sum _{j=1}^J\sqrt{T_j}\Vert \beta _j\Vert _2 +\frac{\lambda _2}{n}\sum _{j=1}^J\Vert \beta _j\Vert _2^2\big \}\); then it holds that \(X{\hat{\beta }}=X{\tilde{\beta }}\). Moreover, the solution of the Group Square-Root Elastic Net is unique for any given \(\lambda _1,\lambda _2\).

Proof

The proof uses some basic properties of convex analysis. Firstly, due to the convexity of the objective function, the set of minima is also convex. Thus, if \({\hat{\beta }},{\tilde{\beta }}\) are two distinct solutions, so is \(\alpha {\hat{\beta }}+(1-\alpha ){\tilde{\beta }}\) for any \(0<\alpha <1\). Then the convexity implies

$$\begin{aligned}&\frac{1}{\sqrt{n}}\Vert Y-X(\alpha {\hat{\beta }}+(1-\alpha ){\tilde{\beta }})\Vert _2\\&\quad +\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \alpha {\hat{\beta }}_j+(1-\alpha ){\tilde{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert \alpha {\hat{\beta }}_j+(1-\alpha ){\tilde{\beta }}_j\Vert _2^2\\&\le \alpha \bigg (\frac{1}{\sqrt{n}}\Vert Y-X{\hat{\beta }}\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert {\hat{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert {\hat{\beta }}_j\Vert _2^2\bigg )\\&\quad +(1-\alpha )\bigg (\frac{1}{\sqrt{n}}\Vert Y-X{\tilde{\beta }}\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert {\tilde{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert {\tilde{\beta }}_j\Vert _2^2\bigg ), \end{aligned}$$

with strict inequality if \({\hat{\beta }}\ne {\tilde{\beta }}\), owing to the strict convexity of the squared group \(\ell _2\) penalty.

If \({\hat{\beta }},{\tilde{\beta }}\) are two distinct solutions, we have

$$\begin{aligned}&\frac{1}{\sqrt{n}}\Vert Y-X{\hat{\beta }}\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert {\hat{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert {\hat{\beta }}_j\Vert _2^2\\&=\frac{1}{\sqrt{n}}\Vert Y-X{\tilde{\beta }}\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert {\tilde{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert {\tilde{\beta }}_j\Vert _2^2. \end{aligned}$$

By the above argument, we obtain that

$$\begin{aligned}&\frac{1}{\sqrt{n}}\Vert Y-X(\alpha {\hat{\beta }}+(1-\alpha ){\tilde{\beta }})\Vert _2\nonumber \\&\qquad +\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert \alpha {\hat{\beta }}_j+(1-\alpha ){\tilde{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert \alpha {\hat{\beta }}_j+(1-\alpha ){\tilde{\beta }}_j\Vert _2^2\nonumber \\&\quad <\frac{1}{\sqrt{n}}\Vert Y-X{\hat{\beta }}\Vert _2+\frac{\lambda _1}{n}\sum \limits _{j=1}^J\sqrt{T_j}\Vert {\hat{\beta }}_j\Vert _2 +\frac{\lambda _2}{n}\sum \limits _{j=1}^J\Vert {\hat{\beta }}_j\Vert _2^2, \end{aligned}$$
(33)

which contradicts the optimality of \({\hat{\beta }}\). Thus, we obtain \(X{\hat{\beta }}=X{\tilde{\beta }}\) and \({\hat{\beta }}={\tilde{\beta }}\), as desired. \(\square \)

Cite this article

Xie, W., Yang, H. Group sparse recovery via group square-root elastic net and the iterative multivariate thresholding-based algorithm. AStA Adv Stat Anal 107, 469–507 (2023). https://doi.org/10.1007/s10182-022-00443-x
