
Regularization After Marginal Learning for Ultra-High Dimensional Regression Models

Big and Complex Data Analysis

Part of the book series: Contributions to Statistics

Abstract

Regularization is a popular variable selection technique for high dimensional regression models. However, under the ultra-high dimensional setting, a direct application of the regularization methods tends to fail in terms of model selection consistency due to the possible spurious correlations among predictors. Motivated by the ideas of screening (Fan and Lv, J R Stat Soc Ser B Stat Methodol 70:849–911, 2008) and retention (Weng et al, Manuscript, 2013), we propose a new two-step framework for variable selection, where in the first step, marginal learning techniques are utilized to partition variables into different categories, and the regularization methods can be applied afterwards. The technical conditions of model selection consistency for this broad framework relax those for the one-step regularization methods. Extensive simulations show the competitive performance of the new method.
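For readers who want a concrete picture of the two-step idea, the following Python sketch screens predictors by a marginal criterion and then applies a lasso to the retained set. It is an illustrative simplification under assumed choices (the screening size k=50, scikit-learn's univariate F-test as the marginal learner, and a cross-validated lasso), not the exact RAM procedure studied in this chapter.

```python
# Illustrative two-step fit: marginal screening, then regularization on the
# screened predictors. All tuning choices here are assumptions for the sketch.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                      # three true signals
y = X @ beta + rng.standard_normal(n)

# Step 1: marginal learning ranks each predictor one at a time.
screen = SelectKBest(score_func=f_regression, k=50).fit(X, y)
kept = screen.get_support(indices=True)

# Step 2: regularization applied only to the screened set.
lasso = LassoCV(cv=5).fit(X[:, kept], y)
beta_hat = np.zeros(p)
beta_hat[kept] = lasso.coef_
print("selected variables:", np.flatnonzero(beta_hat))
```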


References

1. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)
2. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. arXiv preprint arXiv:1108.0775 (2011)
3. Chen, J., Chen, Z.: Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008)
4. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32, 407–499 (2004)
5. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
6. Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70, 849–911 (2008)
7. Fan, J., Song, R.: Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604 (2010)
8. Fan, J., Feng, Y., Song, R.: Nonparametric independence screening in sparse ultra-high dimensional additive models. J. Am. Stat. Assoc. 106, 544–557 (2011)
9. Fan, J., Feng, Y., Tong, X.: A road to classification in high dimensional space: the regularized optimal affine discriminant. J. R. Stat. Soc. Ser. B 74, 745–771 (2012)
10. Fan, J., Feng, Y., Jiang, J., Tong, X.: Feature augmentation via nonparametrics and selection (FANS) in high dimensional classification. J. Am. Stat. Assoc. (2014, to appear)
11. Feng, Y., Li, T., Ying, Z.: Likelihood adaptively modified penalties. arXiv preprint arXiv:1308.5036 (2013)
12. Feng, Y., Yu, Y.: Consistent cross-validation for tuning parameter selection in high-dimensional variable selection. arXiv preprint arXiv:1308.5390 (2013)
13. Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–135 (1993)
14. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
15. Greenshtein, E., Ritov, Y.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971–988 (2004)
16. Huang, J., Ma, S., Zhang, C.-H.: Adaptive lasso for sparse high-dimensional regression models. Stat. Sin. 18, 1603 (2008)
17. Knight, K., Fu, W.: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356–1378 (2000)
18. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
19. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)
20. Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery. IEEE Trans. Inf. Theory 55, 2183–2202 (2009)
21. Weng, H., Feng, Y., Qiao, X.: Regularization after retention in ultrahigh dimensional linear regression models. Manuscript (2013). Preprint, arXiv:1311.5625
22. Yu, Y., Feng, Y.: APPLE: approximate path for penalized likelihood estimators. Stat. Comput. 24, 803–819 (2014)
23. Yu, Y., Feng, Y.: Modified cross-validation for lasso penalized high-dimensional linear models. J. Comput. Graph. Stat. 23, 1009–1027 (2014)
24. Zhao, P., Yu, B.: On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
25. Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)
26. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005)


Acknowledgements

This work was partially supported by NSF grant DMS-1308566. The authors would like to thank the Editor and the referee for constructive comments which greatly improved the paper. The majority of the work was done when Mengjia Yu was an M.A. student at Columbia University.

Author information

Corresponding author

Correspondence to Yang Feng.


Appendix

Proof of Theorem 1

Denote the design matrix by X, the response vector by Y, and the error vector by ɛ. The scaling condition is \(\log p_{n} = O(n^{a_{1}})\), \(s_{n} = O(n^{a_{2}})\), with \(a_{1} > 0\), \(a_{2} > 0\), and \(a_{1} + 2a_{2} < 1\).

Step I:

Recall the index of variables with large coefficients

$$\displaystyle{\mathcal{M}_{\tilde{\gamma }_{n}} =\{ 1 \leq j \leq p: \vert \hat{\beta }_{j}^{M}\vert \mbox{ is among the first }\lfloor \tilde{\gamma }_{n}\rfloor \mbox{ of all}\} = \mathcal{N}^{c}.}$$

Under Corollary 1,

$$\displaystyle{\mbox{ pr}(S \subset \mathcal{M}_{\tilde{\gamma }_{n}} = \mathcal{N}^{c} = \mathcal{R}\cup \mathcal{U}) \rightarrow 1\quad \mbox{ as }n \rightarrow \infty.}$$

Hence, with high probability, the set \(\hat{\mathcal{N}}\) contains only noise variables.
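As a side illustration (not part of the proof), the screening set \(\mathcal{M}_{\tilde{\gamma }_{n}}\) and its complement \(\hat{\mathcal{N}}\) can be formed directly from the magnitudes of the marginal coefficients; the array name beta_marginal standing in for \(\vert \hat{\beta }_{j}^{M}\vert\) and the value of \(\tilde{\gamma }_{n}\) below are assumptions for the example.

```python
# Sketch: keep the floor(gamma_n) largest |beta_j^M| as M_{gamma_n}, and let
# N_hat be the complement. Purely illustrative; names are assumptions.
import numpy as np

def screen_sets(beta_marginal, gamma_n):
    k = int(np.floor(gamma_n))
    order = np.argsort(-np.abs(beta_marginal))   # indices sorted by |beta_j^M|, descending
    M = np.sort(order[:k])                       # M_{gamma_n} = N_hat^c
    N_hat = np.sort(order[k:])                   # screened-out set N_hat
    return M, N_hat

M, N_hat = screen_sets(np.array([0.1, -2.3, 0.05, 1.7, -0.4]), gamma_n=2)
print(M, N_hat)                                  # -> [1 3] [0 2 4]
```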

Step II:

Next we will show that RAM-2 succeeds in detecting signals in \(\hat{\mathcal{N}}^{c}\). Let \(S =\{ 1 \leq j \leq p: \beta _{j}\neq 0\}\). Denote the decomposition \(S = \hat{\mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1}\) and define the set of noise variables left in \(\hat{\mathcal{N}}^{c}\) as \((\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}) \cup (\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1})\dot{ =} \hat{\mathcal{R}}_{2} \cup \hat{\mathcal{U}}_{2}\), where \(\hat{\mathcal{R}}_{1}\) and \(\hat{\mathcal{U}}_{1}\) are the signals from \(\hat{\mathcal{R}}\) and \(\hat{\mathcal{U}}\), respectively.

Firstly, we introduce an important technique used in RAR+. Define the set of true signals as S and, in an arbitrary regularization step, denote by H the set that is held without penalty and by C the set that is checked with a penalty. Let

$$\displaystyle\begin{array}{rcl} \check{\beta }& =& \mathop{\arg \min \nolimits }_{\beta }\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}\|\beta _{C}\|_{1}\Bigg\},{}\end{array}$$
(12)
$$\displaystyle\begin{array}{rcl} \bar{\beta }& =& \mathop{\arg \min \nolimits }_{\beta _{(S\cup H)^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}\|\beta _{C\cap S}\|_{1}\Bigg\}.{}\end{array}$$
(13)

Now define \(Q = S \cup H\), the set of variables we would like to retain; the variables that are supposed to be discarded are then \(Q^{c} = C\setminus S\).

By optimality conditions of convex problems [2], \(\check{\beta }\) is a solution to (12) if and only if

$$\displaystyle{ n^{-1}X^{T}(Y - X\check{\beta }) =\lambda _{ n}\partial \|\check{\beta }_{C}\|, }$$
(14)

where \(\partial \|\check{\beta }_{C}\|\) is the subgradient of \(\|\beta _{C}\|_{1}\) at \(\beta =\check{\beta }\). Namely, the ith (\(1 \leq i \leq p_{n}\)) element of \(\partial \|\check{\beta }_{C}\|\) is

$$\displaystyle{ (\partial \|\check{\beta }_{C}\|)_{i} = \left \{\begin{array}{@{}l@{\quad }l@{}} 0 \quad &\text{ if }i \in C^{c}; \\ \mbox{ sign}(\check{\beta }_{i})\quad &\text{ if }i \in C\mbox{ and }\check{\beta }_{i}\neq 0; \\ t \quad &\text{ otherwise, } \end{array} \right. }$$

where t can be any real number with \(\vert t\vert \leq 1\). Similarly, \(\bar{\beta }\) is the unique solution to (13) if and only if

$$\displaystyle\begin{array}{rcl} \bar{\beta }_{Q^{c}} = 0,\quad n^{-1}X_{ Q}^{T}(Y - X_{ Q}\bar{\beta }_{Q}) =\lambda _{n}\mbox{ sig}(\bar{\beta }_{Q}),& &{}\end{array}$$
(15)

where \(\mbox{ sig}(\bar{\beta }_{Q})\), a vector of length \(\mathrm{card}(Q)\), is the subgradient of the penalty term \(\|\beta _{C\cap S}\|_{1}\) at \(\beta _{Q} =\bar{\beta } _{Q}\). Then it is not hard to see that the unique solution \(\bar{\beta }\) of (13) is also a solution to (12) if

$$\displaystyle{ \|n^{-1}X_{ Q^{c}}^{T}(Y - X_{ Q}\bar{\beta }_{Q})\|_{\infty } <\lambda _{n}, }$$
(16)

simply because (15) and (16) imply \(\bar{\beta }\) satisfies (14). Solving the equation in (15) gives

$$\displaystyle{ \bar{\beta }_{Q} = (X_{Q}^{T}X_{ Q})^{-1}\left [X_{ Q}^{T}Y - n\lambda _{ n}\mbox{ sig}(\bar{\beta }_{Q})\right ]. }$$
(17)

Using (17) and \(Y = X_{S}\beta _{S}+\varepsilon\), (16) is equivalent to

$$\displaystyle\begin{array}{rcl} & & \|X_{Q^{c}}^{T}X_{ Q}(X_{Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) \\ & & +(n\lambda _{n})^{-1}X_{ Q^{c}}^{T}(I - X_{ Q}(X_{Q}^{T}X_{ Q})^{-1}X_{ Q}^{T})(X_{ S}\beta _{S}+\varepsilon )\|_{\infty } <1{}\end{array}$$
(18)

Since \((I - X_{Q}(X_{Q}^{T}X_{Q})^{-1}X_{Q}^{T})X_{Q} = 0\), (18) can be simplified as

$$\displaystyle{ \|X_{Q^{c}}^{T}X_{ Q}(X_{Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q})+(n\lambda _{n})^{-1}X_{ Q^{c}}^{T}(I-X_{ Q}(X_{Q}^{T}X_{ Q})^{-1}X_{ Q}^{T})\varepsilon \|_{ \infty } <1. }$$
(19)

Note that if there is a unique solution to (12), say \(\check{\beta }\), and \(\bar{\beta }\) satisfies (19), then \(\bar{\beta }\) is indeed the unique solution to (12). This is equivalent to \(\check{\beta }_{Q^{c}} = 0\). Furthermore, if \(\min _{j\in Q}\vert \beta _{j}\vert>\|\beta _{Q} -\bar{\beta }_{Q}\|_{\infty }\) also holds, we can conclude \(\check{\beta }_{Q}\neq 0\). Thus (12) achieves sign recovery. In the following, we will make use of this idea repeatedly.
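To make the partially penalized problem (12) and the strict dual feasibility check behind (16) concrete, here is a hedged coordinate-descent sketch. The functions partial_lasso and strict_dual_feasibility are illustrative constructions, not the chapter's implementation, and the index sets C and Q are supplied by the caller.

```python
# Toy solver for (2n)^{-1}||y - Xb||_2^2 + lam * ||b_C||_1 (coordinates outside C
# are left unpenalized, mimicking the "hold" set H), plus a numerical check of
# the strict dual feasibility condition in the spirit of (16). Illustrative only.
import numpy as np

def partial_lasso(X, y, C, lam, n_iter=500):
    """Coordinate descent; C is a boolean mask marking penalized coordinates."""
    n, p = X.shape
    b = np.zeros(p)
    curv = (X ** 2).sum(axis=0) / n              # per-coordinate curvature x_j'x_j/n
    r = y.copy()                                 # residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                  # partial residual without coordinate j
            rho = X[:, j] @ r / n
            if C[j]:                             # penalized coordinate: soft-threshold
                b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / curv[j]
            else:                                # held coordinate: plain least-squares step
                b[j] = rho / curv[j]
            r -= X[:, j] * b[j]
    return b

def strict_dual_feasibility(X, y, b, Q, lam):
    """Check ||n^{-1} X_{Q^c}'(y - X_Q b_Q)||_inf < lam, the analogue of (16)."""
    n, p = X.shape
    resid = y - X[:, Q] @ b[Q]
    Qc = np.setdiff1d(np.arange(p), Q)
    return np.max(np.abs(X[:, Qc].T @ resid / n)) < lam
```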

Secondly, consider Step 1 in (5),

$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{N}}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\bigg(Y _{ i} -\sum _{j\in \hat{\mathcal{U}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{R}}}X_{ik}\beta _{k}\bigg)^{2} +\lambda _{ n}\sum _{j\in \hat{\mathcal{U}}}\vert \beta _{j}\vert \Bigg\}, \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{N}}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}\|\beta _{\hat{\mathcal{U}}}\|_{1}\Bigg\}. {}\end{array}$$
(20)

Here, denote \(\check{\beta }=\hat{\beta } _{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}\). After this step, the ideal result is that with high probability,

$$\displaystyle{ \check{\beta }_{\hat{\mathcal{U}}_{1}}\neq 0\mbox{ and }\check{\beta }_{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1}} = 0. }$$
(21)

Therefore, define an oracle estimator of (20),

$$\displaystyle{ \bar{\beta }=\mathop{ \arg \min \nolimits }_{\beta _{(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{ 1})^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{\hat{ Q}}\beta _{\hat{Q}}\|_{2}^{2} +\lambda _{ n}\|\beta _{\hat{\mathcal{U}}_{1}}\|_{1}\Bigg\}, }$$
(22)

where \(\hat{Q} =\hat{ \mathcal{R}}\cup \hat{\mathcal{U}}_{1} = S \cup \hat{\mathcal{R}}_{2}\). Now, we plug \(\check{\beta }\), \(\bar{\beta }\), and \(\hat{Q}\) back into (12), (13), and (19); then it is sufficient to prove that (20) has a unique solution and achieves sign consistency with \(Q = \hat{Q}\).

Let

$$\displaystyle\begin{array}{rcl} F& =& X_{\hat{Q}^{c}}^{T} - \Sigma _{\hat{ Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}X_{\hat{ Q}}^{T}, {}\\ K_{1}& =& \Sigma _{\hat{Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}), {}\\ K_{2}& =& FX_{\hat{Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}) + (n\lambda _{n})^{-1}F\{I - X_{\hat{ Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\}\varepsilon. {}\\ \end{array}$$

Then, (19) is equivalent to

$$\displaystyle{ \|K_{1} + K_{2}\|_{\infty } <1. }$$

To be clear, since we have already screened \(\hat{\mathcal{N}}\) out, \(\hat{Q}^{c}\) is in fact the complement of \(\hat{Q}\) within the “universe” \(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}\). We write \(\hat{Q}^{c}\) instead of \((\hat{\mathcal{R}}\cup \hat{\mathcal{U}})\setminus \hat{Q} = \hat{\mathcal{U}}_{2}\) to show the close connection with the analysis in the first part above.

Now let

$$\displaystyle\begin{array}{rcl} A& =& \{R \subset \hat{\mathcal{R}}_{1} \subset S,S \subset \hat{Q} \subset S \cup Z\}, {}\\ B& =& \{S \subset \hat{Q} \subset S \cup Z\}, {}\\ \mathcal{T}_{A}& =& \{(\mathcal{R}_{1},Q)\vert R \subset \mathcal{R}_{1} \subset S,S \subset Q \subset S \cup Z\}. {}\\ \end{array}$$

From Conditions 1 and 4, \(P(A) \rightarrow 1\) as a direct result of Proposition 2 in Weng et al. [21]. Condition 8 implies

$$\displaystyle{ \mbox{ pr}(\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}) \geq \mbox{ pr}(\{\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}\} \cap A) = \mbox{ pr}(A) \rightarrow 1 }$$
(23)

since, given A,

$$\displaystyle{ \|K_{1}\|_{\infty } =\| \Sigma _{\hat{Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}})\|_{\infty }\leq \|\{ \Sigma _{\hat{Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}\}_{ \hat{\mathcal{U}_{1}}}\|_{\infty } }$$

is always less than \(1 -\gamma _{1}\).

Denote by \(K_{2}(\mathcal{R}_{1},Q)\) the analog of \(K_{2}\) and by \(\bar{\beta }_{Q}\) the analog of \(\bar{\beta }_{\hat{Q}}\), obtained by replacing \(\hat{\mathcal{R}}_{1}\) and \(\hat{Q}\) in (22) with \(\mathcal{R}_{1}\) and Q. Given \(X_{Q}\) and ɛ, the j-th element of \(K_{2}(\mathcal{R}_{1},Q)\), namely

$$\displaystyle{ F(\,j)X_{Q}(X_{Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-1}F(\,j)\{I - X_{ Q}(X_{Q}^{T}X_{ Q})^{-1}X_{ Q}^{T}\}\varepsilon, }$$
(24)

is normally distributed with mean 0 and variance \(V_{j}\), where

$$\displaystyle\begin{array}{rcl} V _{j}& \leq & (\Sigma _{Q^{c}\vert Q})_{jj}\Big[\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-2}\varepsilon ^{T}\{I - X_{ Q}(X_{Q}^{T}X_{ Q})^{-1}X_{ Q}^{T}\}\varepsilon \Big] {}\\ & \leq & \mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2}. {}\\ \end{array}$$

Hence, we let

$$\displaystyle\begin{array}{rcl} H& =& \mathop{\bigcup }_{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\Big\{\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2} {}\\ & & \quad> \frac{s_{n} + z_{n}} {nC_{\min }} (8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1) + (1 + s_{ n}^{1/2}n^{-1/2})/(n\lambda _{ n}^{2})\Big\}. {}\\ \end{array}$$

Next, we want to show

$$\displaystyle\begin{array}{rcl} \mbox{ pr}\Big(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\Big)& \leq & \mbox{ pr}\Big(\Big\{\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\Big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq & \mbox{ pr}\Big(\Big\{\mathop{\bigcup }_{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\|K_{2}(\mathcal{R}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\Big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq & \mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\|K_{2}(\mathcal{R}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}\Big) + \mbox{ pr}(H) + \mbox{ pr}(A^{c}) \\ & \longrightarrow & 0. {}\end{array}$$
(25)

By the tail probability inequality for the Gaussian distribution (inequality (48) in Wainwright [20]), it is not hard to see that

$$\displaystyle\begin{array}{rcl} & & \mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\|K_{2}(\mathcal{R}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}\Big) \\ & & \quad \leq \sum _{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\mbox{ pr}(\|K_{2}(\mathcal{R}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}) \\ & & \quad \leq 2^{s_{n}+z_{n} } \cdot \max _{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\mbox{ pr}(\|K_{2}(\mathcal{R}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}) \\ & & \quad \leq 2^{s_{n}+z_{n} } \cdot 2(\,p_{n} - s_{n})\exp (-\gamma _{1}^{2}/8V ), {}\end{array}$$
(26)

where \(V = (1 + s_{n}^{1/2}n^{-1/2})/(n\lambda _{n}^{2}) + \frac{s_{n}+z_{n}}{nC_{\min }}(8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1) \geq V_{j}\) on the event \(H^{c}\). Since \(\log [2^{s_{n}+z_{n}+1}(p_{n} - s_{n})] = o(\gamma _{1}^{2}/8V )\) under our scaling, the right-hand side of (26) tends to 0. To bound pr(H), note that

$$\displaystyle\begin{array}{rcl} \mbox{ pr}(H)& \leq & \mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{R}_{1 },Q)\subset \mathcal{T}_{A}\end{array}} \Big\{\mbox{ sig} (\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) \\ & &> \frac{s_{n} + z_{n}} {nC_{\min }} (8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1)\Big\}\Big) \\ & & +\mbox{ pr}\Big((n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2}> (1 + s_{ n}^{1/2}n^{-1/2})/(n\lambda _{ n}^{2})\Big){}\end{array}$$
(27)

Since \(\|\varepsilon \|_{2}^{2} \sim \chi ^{2}(n)\), using inequality (54a) in Wainwright [20], we get

$$\displaystyle\begin{array}{rcl} \mbox{ pr}\Big((n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2}> (1 + s_{ n}^{1/2}n^{-1/2})/(n\lambda _{ n}^{2})\Big)& \leq & \mbox{ pr}\Big(\|\varepsilon \|_{ 2}^{2} \geq (1 + s_{ n}^{1/2}n^{-1/2})n\Big) \\ & \leq & \exp (-\frac{3} {16}s_{n}), {}\end{array}$$
(28)

whenever \(s_{n}/n < 1/2\). For any given Q satisfying \(S \subset Q \subset S\cup Z\),

$$\displaystyle\begin{array}{rcl} \mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q})& \leq & (s_{n} + z_{n})\|(X_{Q}^{T}X_{ Q})^{-1}\|_{ 2} {}\\ & \leq & (s_{n} + z_{n})/n\Big(\|(X_{Q}^{T}X_{ Q}/n)^{-1} - \Sigma _{ QQ}^{-1}\|_{ 2} +\| \Sigma _{QQ}^{-1}\|_{ 2}\Big) {}\\ & \leq & (s_{n} + z_{n})/n\Big(\|(X_{Q}^{T}X_{ Q}/n)^{-1} - \Sigma _{ QQ}^{-1}\|_{ 2} + 1/C_{\min }\Big). {}\\ \end{array}$$

holds for any \(\mathcal{R}_{1}\) satisfying \(R \subset \mathcal{R}_{1} \subset S\). Therefore, by the concentration inequality (58b) in Wainwright [20],

$$\displaystyle\begin{array}{rcl} & & \mbox{ pr}\Bigg(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{R}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\Big\{\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q})> \frac{s_{n} + z_{n}} {nC_{\min }} \big(8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1\big)\Big\}\Bigg) \\ & & \quad \leq \sum _{\begin{array}{c}S\subset Q\subset S\cup Z\end{array}} \mbox{ pr }\Bigg(\mathop{\bigcup }_{R\subset \mathcal{R}_{1}\subset S} \Big\{\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q})> \frac{s_{n} + z_{n}} {nC_{\min }} \big(8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1\big)\Big\}\Bigg) \\ & & \quad \leq \sum _{\begin{array}{c}S\subset Q\subset S\cup Z\end{array}}\mbox{ pr}\Big(\|(X_{Q}^{T}X_{ Q}/n)^{-1} - \Sigma _{ QQ}^{-1}\|_{ 2} \geq \frac{8} {C_{\min }}(s_{n} + z_{n})^{1/2}n^{-1/2}\Big) \\ & & \quad \leq \sum _{\begin{array}{c}S\subset Q\subset S\cup Z\end{array}}\mbox{ pr}\Big(\|(X_{Q}^{T}X_{ Q}/n)^{-1} - \Sigma _{ QQ}^{-1}\|_{ 2} \geq \frac{8} {C_{\min }}(\mbox{ Card}(Q))^{1/2}n^{-1/2}\Big) \\ & & \quad \leq 2^{z_{n}+1}\exp \left (-\frac{s_{n}} {2} \right ). {}\end{array}$$
(29)

Hence, (28) and (29) imply \(\mbox{ pr}(H) \leq 2^{z_{n}+1}\exp (-\frac{s_{n}} {2} ) +\exp (-\frac{3} {16}s_{n}) \rightarrow 0\).
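As an informal numerical aside (not part of the argument), the two tail bounds borrowed from Wainwright [20] can be compared against Monte Carlo frequencies. The constants below are illustrative; the bounds checked are the standard Gaussian tail bound \(\mbox{pr}(\vert Z\vert > t) \leq 2\exp (-t^{2}/(2v))\) for \(Z \sim N(0,v)\), the form behind (26), and the chi-square deviation bound \(\mbox{pr}(\chi ^{2}(n) \geq (1+\delta )n) \leq \exp (-3n\delta ^{2}/16)\) for \(\delta \in (0,1/2)\), the form that yields (28) with \(\delta = s_{n}^{1/2}n^{-1/2}\).

```python
# Monte Carlo sanity check of the two tail bounds; illustrative constants only.
import numpy as np

rng = np.random.default_rng(1)
m = 200_000

# Gaussian tail: empirical frequency vs. the bound 2*exp(-t^2/(2v)).
v, t = 0.5, 2.0
z = rng.normal(0.0, np.sqrt(v), size=m)
print(np.mean(np.abs(z) > t), 2 * np.exp(-t**2 / (2 * v)))

# Chi-square deviation: empirical frequency vs. the bound exp(-3*n*delta^2/16).
n, delta = 400, 0.25
chi2 = rng.chisquare(df=n, size=m)
print(np.mean(chi2 >= (1 + delta) * n), np.exp(-3 * n * delta**2 / 16))
```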

Since \(P(A^{c}) = 1 - P(A) \rightarrow 0\), the inequalities (26)–(29) imply (25) under the scaling in Theorem 1. Thus \(\|K_{1} + K_{2}\|_{\infty } < 1\) holds with high probability, which also means \(\check{\beta }_{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1}} = 0\) holds asymptotically.

From our analysis in the first part, the next goal is the uniqueness of the solution to (20). Suppose there is another solution, say \(\check{\beta }^{{\prime}}\). For any t with 0 < t < 1, the convex combination \(\check{\beta }(t) = t\check{\beta } + (1 - t)\check{\beta }^{{\prime}}\) is also a solution to (20) by convexity. Note that the new solution \(\check{\beta }(t)\) satisfies (16) and \(\check{\beta }(t)_{Q^{c}} = 0\); hence it is a solution to (13). From the uniqueness of the solution to (13), we conclude that \(\check{\beta }=\check{\beta } ^{{\prime}}\).

The last part of this step is to prove \(\bar{\beta }_{\hat{\mathcal{U}}_{1}}\neq 0\) with high probability. By (17) and \(Y = X_{S}\beta _{S}+\varepsilon = X_{\hat{Q}}\beta _{\hat{Q}}+\varepsilon\), we have

$$\displaystyle\begin{array}{rcl} \|\beta _{\hat{Q}} -\bar{\beta }_{\hat{Q}}\|_{\infty }& =& \|\lambda _{n}(X_{\hat{Q}}^{T}X_{\hat{ Q}}/n)^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}) - (X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\varepsilon \|_{ \infty } \\ &\leq & \lambda _{n}\|(X_{\hat{Q}}^{T}X_{\hat{ Q}}/n)^{-1}\|_{ \infty } +\| (X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\varepsilon \|_{ \infty } \\ &\leq & \lambda _{n}(s_{n} + z_{n})^{1/2}\|(X_{\hat{ Q}}^{T}X_{\hat{ Q}}/n)^{-1}\|_{ 2} +\| (X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\varepsilon \|_{ \infty } \\ &\leq & \lambda _{n}(s_{n} + z_{n})^{1/2}(\|(X_{\hat{ Q}}^{T}X_{\hat{ Q}}/n)^{-1} - \Sigma _{\hat{ Q}\hat{Q}}^{-1}\|_{ 2} + 1/C_{\min }) \\ & & +\|(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\varepsilon \|_{ \infty } {}\end{array}$$
(30)

for any \(\hat{Q}\) satisfying \(S \subset \hat{Q} \subset S \cup Z\). In (29), we have already obtained

$$\displaystyle{ \mbox{ pr}\big(\|(X_{\hat{Q}}^{T}X_{\hat{ Q}}/n)^{-1} - \Sigma _{\hat{ Q}\hat{Q}}^{-1}\|_{ 2} \geq \frac{8} {C_{\min }}(s_{n} + z_{n})^{1/2}n^{-1/2}\big) \leq 2\exp (-\frac{s_{n}} {2} ) }$$
(31)

Let \(G =\Big\{\| (X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}\|_{2}> 9/(nC_{\min })\Big\}\). By inequality (60) in Wainwright [20],

$$\displaystyle{ \mbox{ pr}(G) \leq \mbox{ pr}(\|(X^{T}X)^{-1}\|_{ 2}> 9/(nC_{\min })) \leq 2\exp (-n/2). }$$

Since \((X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}X_{\hat{Q}}^{T}\varepsilon \mid X_{\hat{Q}} \sim N(0,(X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1})\), conditioning on \(G^{c}\) we obtain

$$\displaystyle\begin{array}{rcl} & & \mbox{ pr}\Big(\|(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\varepsilon \|_{ \infty }> \frac{(s_{n} + z_{n})^{1/2}} {n^{1/2}C_{\min }^{1/2}} \Big) \\ & & \quad \leq \mbox{ pr}\Big(\|(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\varepsilon \|_{ \infty }> \frac{(s_{n} + z_{n})^{1/2}} {n^{1/2}C_{\min }^{1/2}} \mid G^{c}\Big) + \mbox{ pr}(G) \\ & & \quad \leq 2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2}, {}\end{array}$$
(32)

since under \(G^{c}\), each component of \((X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}X_{\hat{Q}}^{T}\varepsilon \mid X_{\hat{Q}}\) is normally distributed with mean 0 and variance less than \(9/(nC_{\min })\).

Hence (30)–(32) together imply that,

$$\displaystyle{\|\bar{\beta }_{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\leq U_{n}\dot{ =}\lambda _{n}(s_{n} + z_{n})^{1/2}\Big( \frac{8} {C_{\min }}(s_{n} + z_{n})^{1/2}n^{-1/2} + 1/C_{\min }\Big) + \frac{(s_{n} + z_{n})^{1/2}} {n^{1/2}C_{\min }^{1/2}} }$$

holds with probability larger than \(1 -\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2e^{-s_{n}/2}\big)\). Therefore,

$$\displaystyle\begin{array}{rcl} & & \mbox{ pr}(\|\bar{\beta }_{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\geq U_{n}) \\ & & \quad \leq \mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}S\subset Q\subset S\cup Z\end{array}}\{\|\bar{\beta }_{Q} -\beta _{Q}\|_{\infty }\geq U_{n}\} \cap B\Big) + \mbox{ pr}(B^{c}) \\ & & \quad \leq 2^{z_{n} }\Big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2\exp ^{-s_{n}/2}\Big) + \mbox{ pr}(B^{c}){}\end{array}$$
(33)

Under the scaling of Theorem 1, we have \(\mbox{pr}(B) \geq \mbox{pr}(A) \rightarrow 1\) and \(2^{z_{n}}\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2e^{-s_{n}/2}\big) \rightarrow 0\). From Condition 5, it is easy to verify that

$$\displaystyle{ \min _{j\in S}\vert \beta _{j}\vert> U_{n}, }$$

for sufficiently large n. Thus with high probability \(\min _{j\in S}\vert \beta _{j}\vert>\|\bar{\beta } _{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\) as n increases, which also implies \(\check{\beta }_{\hat{Q}}\neq 0\) with high probability.

Finally, \(\hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}\) exactly recovers the signals with high probability as \(n \rightarrow \infty\).

Step III:

We need to prove that RAM-2 succeeds in detecting signals via Step 3. Similar to Step II, we need to define a proper \(\check{\beta }\) in (12) and \(\bar{\beta }\) in (13). Since the main idea is the same as in the procedure above, we only describe the key steps below. Recall the estimator (7),

$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}_{ 2}\cup \hat{\mathcal{N}}_{2}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\Big(Y _{ i} -\sum _{j\in \hat{\mathcal{R}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{U}}_{1}\cup \hat{\mathcal{N}}_{1}} X_{ik } \beta _{k}\Big)^{2} +\lambda _{ n}^{\star \star }\sum _{ j\in \hat{\mathcal{R}}}\vert \beta _{j}\vert \Bigg\} \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{N}}\cup \hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}^{\star \star }\|\beta _{ \hat{\mathcal{R}}}\|_{1}\Bigg\}. {}\end{array}$$
(34)

This is a new “\(\check{\beta }\)” in (12), and we denote it as \(\tilde{\beta }\). After this step, the ideal result is that with high probability,

$$\displaystyle{ \tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\mbox{ and }\tilde{\beta }_{\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}} = 0. }$$
(35)

Therefore, define an oracle estimator of (34),

$$\displaystyle{\mathring{\beta} =\mathop{ \arg \min \nolimits }_{\beta _{S^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{ S}\beta _{S}\|_{2}^{2} +\lambda _{ n}^{\star \star }\|\beta _{ \hat{\mathcal{R}}_{1}}\|_{1}\Bigg\}. }$$
(36)

Now, we plug \(\tilde{\beta }\) and \(\mathring{\beta}\) back into (12), (13), and (18); then it is sufficient to prove that (34) has a unique solution and achieves sign consistency with Q = S. Let

$$\displaystyle\begin{array}{rcl} F^{{\prime}}& =& X_{\hat{ \mathcal{R}}_{2}}^{T} - \Sigma _{\hat{ \mathcal{R}}_{2}S}\Sigma _{S}^{-1}X_{ S}^{T}, {}\\ K_{1}^{{\prime}}& =& \Sigma _{\hat{ \mathcal{R}}_{2}S}\Sigma _{SS}^{-1}\mbox{ sig}(\mathring{\beta} _{ S}), {}\\ K_{2}^{{\prime}}& =& F^{{\prime}}X_{ S}(X_{S}^{T}X_{ S})^{-1}\mbox{ sig}(\mathring{\beta} _{ S}) + (n\lambda _{n}^{\star \star })^{-1}F^{{\prime}}\{I - X_{ S}(X_{S}^{T}X_{ S})^{-1}X_{ S}^{T}\}\varepsilon.{}\\ \end{array}$$

Similarly,

$$\displaystyle{ \mbox{ pr}\Big(\|K_{1}^{{\prime}}\|_{ \infty }\leq 1-\alpha \Big) \geq \mbox{ pr}\Big(\big\{\|K_{1}^{{\prime}}\|_{ \infty }\leq 1 -\alpha \big\}\cap D\Big) \geq \mbox{ pr}(D) \geq \mbox{ pr}(A) \rightarrow 1, }$$
(37)

where \(D =\{\hat{ \mathcal{R}}_{2} \subset Z\}\), which implies \(\|K_{1}^{{\prime}}\|_{\infty } \leq 1 -\alpha\) under Condition 7. Let

$$\displaystyle\begin{array}{rcl} H^{{\prime}}& =& \mathop{\bigcup }_{ R\subset \mathcal{R}_{2}\subset S} \Big\{\mbox{ sig}(\mathring{\beta} _{S})^{T}(X_{ S}^{T}X_{ S})^{-1}\mbox{ sig}(\mathring{\beta} _{ S}) + (n\lambda _{n}^{\star \star })^{-2}\|\varepsilon \|_{ 2}^{2}> \frac{s_{n}} {nC_{\min }}\big(8s_{n}^{1/2}n^{-1/2} + 1\big) {}\\ & & +\big(1 + s_{n}^{1/2}n^{-1/2}\big)/\big(n(\lambda _{ n}^{\star \star })^{2}\big)\Big\}. {}\\ \end{array}$$

Then,

$$\displaystyle\begin{array}{rcl} \mbox{ pr}\Big(\|K_{2}^{{\prime}}\|_{ \infty }> \frac{\alpha } {2}\Big)& \leq &\mbox{ pr}\Big(\big\{\|K_{2}^{{\prime}}\|_{ \infty }> \frac{\alpha } {2}\big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{ \begin{array}{c}(\mathcal{R}_{2},\mathcal{R}_{1}) \\ \mathcal{R}_{2}\subset Z \\ R\subset \mathcal{R}_{1}\subset S\end{array}}\Big\{\|\tilde{K}_{2}(\mathcal{R}_{2},\mathcal{R}_{1})\|_{\infty }> \frac{\alpha } {2}\Big\}\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{ \begin{array}{c}(\mathcal{R}_{2},\mathcal{R}_{1}) \\ \mathcal{R}_{2}\subset Z \\ R\subset \mathcal{R}_{1}\subset S\end{array}} \Big\{\|\tilde{K}_{2 } (\mathcal{R}_{2},\mathcal{R}_{1})\|_{\infty }> \frac{\alpha } {2}\Big\}\mid \tilde{H}^{c}\Big) + \mbox{ pr}(\tilde{H}) + \mbox{ pr}(A^{c}) \\ & \leq & 2^{z_{n}+s_{n}+1}z_{ n}e^{-\alpha ^{2}/8V ^{{\prime}} } + 2e^{-\frac{s_{n}} {2} } + e^{-\frac{3} {16} s_{n}} + \mbox{ pr}(A^{c}) \\ & \longrightarrow & 0, {}\end{array}$$
(38)

where the last step of (38) follows from (26), (28), and (29) in the proof of Step II, and \(V ^{{\prime}} = \frac{s_{n}} {nC_{\min }}(8s_{n}^{1/2}n^{-1/2} + 1) + (1 + s_{ n}^{1/2}n^{-1/2})/(n(\lambda _{ n}^{\star \star })^{2})\).

Equations (37) and (38) indicate \(\tilde{\beta }_{\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}} = 0\). We skip the proof of uniqueness and move to the next step of proving \(\tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\).

$$\displaystyle\begin{array}{rcl} \|\mathring{\beta} _{S} -\beta _{S}\|_{\infty }& =& \|(X_{S}^{T}X_{ S})^{-1}(X_{ S}^{T}Y - n\lambda _{ n}^{\star \star }\mbox{ sig}(\mathring{\beta} _{ S})) -\beta _{S}\|_{\infty } {}\\ &\leq & \|(X_{S}^{T}X_{ S})^{-1}X_{ S}^{T}\varepsilon \|_{ \infty } +\|\lambda _{ n}^{\star \star }(X_{ S}^{T}X_{ S}/n)^{-1}\|_{ \infty }. {}\\ \end{array}$$

Let \(W_{n} =\lambda _{ n}^{\star \star }s_{n}^{1/2}\big( \frac{8} {C_{\min }}s_{n}^{1/2}n^{-\frac{1} {2} } + \frac{1} {C_{ \min }}\big) + \frac{s_{n}^{1/2}} {n^{1/2}C_{\min }^{1/2}} = o(n^{a_{2}/2-\delta })\). In the same way, we can show that as \(n \rightarrow \infty\),

$$\displaystyle{ \mbox{ pr}\big(\|\mathring{\beta} _{S} -\beta _{S}\|_{\infty }\leq W_{n}\big) \rightarrow 1. }$$

Hence, Condition 5 ensures that as \(n \rightarrow \infty\),

$$\displaystyle{ \mbox{ pr}\big(\min _{j\in S}\vert \beta _{j}\vert>\|\mathring{\beta} _{S} -\beta _{S}\|_{\infty }\big) \rightarrow 1, }$$
(39)

which is equivalent to \(\tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\).

Finally, combining Step I, Step II, and Step III, we conclude that

$$\displaystyle{ P\big(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\mbox{ is unique and }\mbox{ sign}(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}) = \mbox{ sign}(\beta )\big) \rightarrow 1,\quad \mbox{ as }n \rightarrow \infty. }$$

 □ 

Proof of Theorem 2

Denote the decomposition \(S = \hat{\mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1} \cup \hat{\mathcal{N}}_{1}\) and define the set of noise variables left in \(\hat{\mathcal{U}}^{c}\) as \((\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}) \cup (\hat{\mathcal{N}}\setminus \hat{\mathcal{N}}_{1})\dot{ =} \hat{\mathcal{R}}_{2} \cup \hat{\mathcal{N}}_{2}\), where \(\hat{\mathcal{R}}_{1}\), \(\hat{\mathcal{U}}_{1}\), and \(\hat{\mathcal{N}}_{1}\) are the signals from \(\hat{\mathcal{R}}\), \(\hat{\mathcal{U}}\), and \(\hat{\mathcal{N}}\), respectively.

Step I:

Consider Step 1 in (5), which is exactly the same as (20). Since there is no difference from Step II in the proof of Theorem 1, we skip the details here.

Step II:

Consider Step 2 in (6),

$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\Big(Y _{ i} -\sum _{j\in \hat{\mathcal{N}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{1}} X_{ik}\beta _{k}\Big)^{2} +\lambda _{ n}^{\star }\sum _{ j\in \hat{\mathcal{N}}}\vert \beta _{j}\vert \Bigg\} \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}^{\star }\|\beta _{ \hat{\mathcal{N}}}\|_{1}\Bigg\}. {}\end{array}$$
(40)

Here, denote \(\check{\beta }=\hat{\beta } _{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\). After this step, the ideal result is that with high probability,

$$\displaystyle{ \check{\beta }_{\hat{\mathcal{N}}_{1}}\neq 0\mbox{ and }\check{\beta }_{\hat{\mathcal{N}}\setminus \hat{\mathcal{N}}_{1}} = 0. }$$
(41)

Then, define an oracle estimator of (40),

$$\displaystyle{ \bar{\beta }=\mathop{ \arg \min \nolimits }_{\beta _{(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{ 1}\cup \hat{\mathcal{N}}_{1})^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{\hat{ Q}}\beta _{\hat{Q}}\|_{2}^{2} +\lambda _{ n}^{\star }\|\beta _{ \hat{\mathcal{N}}_{1}}\|_{1}\Bigg\}, }$$
(42)

where \(\hat{Q} = (\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{1}) \cup \hat{\mathcal{N}}_{1} = S \cup \hat{\mathcal{R}}_{2}\). Similar to Step II in the proof of Theorem 1, let

$$\displaystyle\begin{array}{rcl} F& =& X_{\hat{\mathcal{N}}_{2}}^{T} - \Sigma _{\hat{ Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}X_{\hat{ Q}}^{T}, {}\\ K_{1}& =& \Sigma _{\hat{\mathcal{N}}_{2}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}), {}\\ K_{2}& =& FX_{\hat{Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}) + (n\lambda _{n})^{-1}F\{I - X_{\hat{ Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\}\varepsilon, {}\\ \mbox{ and}& & {}\\ A& =& \{R \subset \hat{\mathcal{L}}_{1}\dot{ =}\hat{ \mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1} \subset S,S \subset \hat{Q} \subset S \cup Z\}, {}\\ B& =& \{S \subset \hat{Q} \subset S \cup Z\}, {}\\ \mathcal{T}_{A}& =& \{(\mathcal{L}_{1},Q)\vert R \subset \mathcal{L}_{1} \subset S,S \subset Q \subset S \cup Z\}. {}\\ \end{array}$$

Similarly, we get

$$\displaystyle{ \mbox{ pr}(\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}) \geq \mbox{ pr}(\{\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}\} \cap A) \geq \mbox{ pr}(A) \rightarrow 1. }$$
(43)

To obtain \(\mbox{ pr}(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}) \rightarrow 0\), we define the event H as

$$\displaystyle\begin{array}{rcl} H& =& \mathop{\bigcup }_{\begin{array}{c}(\mathcal{L}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\Big\{\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2} {}\\ & &> \frac{s_{n} + z_{n}} {nC_{\min }} \big(8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1\big) +\big (1 + s_{ n}^{1/2}n^{-1/2}\big)/(n\lambda _{ n}^{2})\Big\}. {}\\ \end{array}$$

Then, following (25)–(29),

$$\displaystyle\begin{array}{rcl} \mbox{ pr}\big(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\big)& \leq &\mbox{ pr}\Big(\big\{\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{L}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\|K_{2}(\mathcal{L}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}\Big) + \mbox{ pr}(H) + \mbox{ pr}(A^{c}) \\ & \leq & 2^{s_{n}+z_{n} } \cdot 2(\,p_{n} - s_{n})\exp (-\gamma _{1}^{2}/8V ) + 2^{z_{n}+1}\exp \big(-\frac{s_{n}} {2} \big) \\ & & +\exp \big(-\frac{3} {16}s_{n}\big) \\ & \longrightarrow & 0, {}\end{array}$$
(44)

where \(V = (1 + s_{n}^{1/2}n^{-1/2})/(n\lambda _{n}^{2}) + s_{n}n^{-1}C_{\min }^{-1}(8s_{n}^{1/2}n^{-1/2} + 1)\).

Again, we skip the proof of uniqueness of \(\check{\beta }\) and move on to bounding \(\|\bar{\beta }_{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\). By (30)–(32) in the proof of Theorem 1, we have

$$\displaystyle\begin{array}{rcl} & & \mbox{ pr}\big(\|\bar{\beta }_{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\geq U_{n}\big) {}\\ & & \quad \leq 2^{z_{n} }\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2\exp ^{-s_{n}/2}\big) + \mbox{ pr}(B^{c}) \rightarrow 0, {}\\ \end{array}$$

where \(U_{n} =\lambda _{n}(s_{n} + z_{n})^{1/2}\Big( \frac{8} {C_{\min }}(s_{n} + z_{n})^{1/2}n^{-1/2} + 1/C_{\min }\Big) + \frac{(s_{n}+z_{n})^{1/2}} {n^{1/2}C_{\min }^{1/2}}\). Since \(\min _{j\in S}\vert \beta _{j}\vert \gg U_{n}\) for sufficiently large n, we conclude that with high probability \(\min _{j\in S}\vert \beta _{j}\vert>\|\bar{\beta } _{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\) as n increases, which also implies \(\check{\beta }_{\hat{Q}}\neq 0\) with high probability.

Therefore, \(\hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\) successfully recovers the signals from \(\hat{\mathcal{N}}\) with high probability when n is large enough.

Step III:

Following the same steps as in Step III in the proof of Theorem 1, we have

$$\displaystyle{ P\big(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\mbox{ is unique and }\mbox{ sign}(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}) = \mbox{ sign}(\beta )\big) \rightarrow 1,\quad \mbox{ as }n \rightarrow \infty. }$$

 □ 


Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Feng, Y., Yu, M. (2017). Regularization After Marginal Learning for Ultra-High Dimensional Regression Models. In: Ahmed, S. (eds) Big and Complex Data Analysis. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-41573-4_1
