Abstract
Regularization is a popular variable selection technique for high dimensional regression models. However, under the ultra-high dimensional setting, a direct application of the regularization methods tends to fail in terms of model selection consistency due to the possible spurious correlations among predictors. Motivated by the ideas of screening (Fan and Lv, J R Stat Soc Ser B Stat Methodol 70:849–911, 2008) and retention (Weng et al, Manuscript, 2013), we propose a new two-step framework for variable selection, where in the first step, marginal learning techniques are utilized to partition variables into different categories, and the regularization methods can be applied afterwards. The technical conditions of model selection consistency for this broad framework relax those for the one-step regularization methods. Extensive simulations show the competitive performance of the new method.
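The two-step recipe described above (marginal learning to partition variables, then regularization) can be illustrated with a short simulation. This is a hedged sketch, not the authors' RAM-2 implementation: the SIS-style cutoff ⌊n/log n⌋, the lasso penalty 0.1, and all data-generating parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 1000, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0  # first s predictors are the true signals
y = X @ beta + rng.standard_normal(n)

# Step 1: marginal learning -- rank predictors by the magnitude of their
# marginal regression coefficient and keep the top floor(n / log n).
marginal = np.abs(X.T @ y) / n
keep = np.argsort(marginal)[::-1][: int(n / np.log(n))]

# Step 2: regularization (lasso) restricted to the screened set.
fit = Lasso(alpha=0.1).fit(X[:, keep], y)
selected = keep[np.abs(fit.coef_) > 1e-8]
print(sorted(int(j) for j in selected))  # with high probability contains {0,...,4}
```

Because the spurious marginal correlations among the 995 noise predictors stay well below the signal strength here, the screened set retains all true signals and the lasso then removes the surviving noise.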
References
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. arXiv preprint arXiv:1108.0775 (2011)
Chen, J., Chen, Z.: Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32, 407–499 (2004)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70, 849–911 (2008)
Fan, J., Song, R.: Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604 (2010)
Fan, J., Feng, Y., Song, R.: Nonparametric independence screening in sparse ultra-high dimensional additive models. J. Am. Stat. Assoc. 106, 544–557 (2011)
Fan, J., Feng, Y., Tong, X.: A road to classification in high dimensional space: the regularized optimal affine discriminant. J. R. Stat. Soc. Ser. B. 74, 745–771 (2012)
Fan, J., Feng, Y., Jiang, J., Tong, X.: Feature augmentation via nonparametrics and selection (FANS) in high dimensional classification. J. Am. Stat. Assoc. (2014, to appear)
Feng, Y., Li, T., Ying, Z.: Likelihood adaptively modified penalties. arXiv preprint arXiv:1308.5036 (2013)
Feng, Y., Yu, Y.: Consistent cross-validation for tuning parameter selection in high-dimensional variable selection. arXiv preprint arXiv:1308.5390 (2013)
Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–135 (1993)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
Greenshtein, E., Ritov, Y.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971–988 (2004)
Huang, J., Ma, S., Zhang, C.-H.: Adaptive lasso for sparse high-dimensional regression models. Stat. Sin. 18, 1603 (2008)
Knight, K., Fu, W.: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356–1378 (2000)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)
Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery. IEEE Trans. Inf. Theory 55, 2183–2202 (2009)
Weng, H., Feng, Y., Qiao, X.: Regularization after retention in ultrahigh dimensional linear regression models. Manuscript (2013). Preprint, arXiv:1311.5625
Yu, Y., Feng, Y.: APPLE: approximate path for penalized likelihood estimators. Stat. Comput. 24, 803–819 (2014)
Yu, Y., Feng, Y.: Modified cross-validation for lasso penalized high-dimensional linear models. J. Comput. Graph. Stat. 23, 1009–1027 (2014)
Zhao, P., Yu, B.: On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005)
Acknowledgements
This work was partially supported by NSF grant DMS-1308566. The authors would like to thank the Editor and the referee for constructive comments which greatly improved the paper. The majority of the work was done when Mengjia Yu was an M.A. student at Columbia University.
Appendix
Proof of Theorem 1
Denote the design matrix by X, the response vector by Y, and the error vector by \(\varepsilon\). The scaling condition is \(\log p_{n} = O(n^{a_{1}})\), \(s_{n} = O(n^{a_{2}})\), with \(a_{1}> 0\), \(a_{2}> 0\), \(a_{1} + 2a_{2} <1\).
Step I:
Recall the index of variables with large coefficients
$$\displaystyle{\mathcal{M}_{\tilde{\gamma }_{n}} =\{ 1 \leq j \leq p: \vert \hat{\beta }_{j}^{M}\vert \mbox{ is among the first }\lfloor \tilde{\gamma }_{n}\rfloor \mbox{ of all}\} = \mathcal{N}^{c}.}$$
Under Corollary 1,
$$\displaystyle{\mbox{ pr}(S \subset \mathcal{M}_{\tilde{\gamma }_{n}} = \mathcal{N}^{c} = \mathcal{R}\cup \mathcal{U}) \rightarrow 1\quad \mbox{ as }n \rightarrow \infty.}$$
Hence, with high probability, the set \(\hat{\mathcal{N}}\) contains only noise variables.
Step II:
Next we show that RAM-2 succeeds in detecting signals in \(\hat{\mathcal{N}}^{c}\). Let \(S =\{ 1 \leq j \leq p:\beta _{j}\neq 0\}\). Denote the decomposition \(S = \hat{\mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1}\) and define the set of noise variables left in \(\hat{\mathcal{N}}^{c}\) as \((\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}) \cup (\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1})\,\dot{=}\,\hat{\mathcal{R}}_{2} \cup \hat{\mathcal{U}}_{2}\), where \(\hat{\mathcal{R}}_{1}\) and \(\hat{\mathcal{U}}_{1}\) are the signals in \(\hat{\mathcal{R}}\) and \(\hat{\mathcal{U}}\), respectively.
Firstly, we introduce an important technique from RAR+. Define the set of true signals as S and, in an arbitrary regularization step, let H be the set held without penalty and C the set subject to the penalty. Let
Now we define \(Q = S \cup H\), the variables we would like to retain; the variables that are supposed to be discarded are then \(Q^{c} = C\setminus S\).
By optimality conditions of convex problems [2], \(\check{\beta }\) is a solution to (12) if and only if
where \(\partial \|\check{\beta }_{C}\|_{1}\) is the subgradient of \(\|\beta _{C}\|_{1}\) at \(\beta =\check{\beta }\). Namely, the ith (\(1 \leq i \leq p_{n}\)) element of \(\partial \|\check{\beta }_{C}\|_{1}\) is
where t can be any real number with \(\vert t\vert \leq 1\). Similarly, \(\bar{\beta }\) is the unique solution to (13) if and only if
where \(\mbox{ sig}(\bar{\beta }_{Q})\), a vector of length card(Q), is the subgradient of \(\|\beta _{Q}\|_{1}\) at \(\beta _{Q} =\bar{\beta } _{Q}\). Then it is not hard to see that the unique solution \(\bar{\beta }\) of (13) is also a solution to (12) if
simply because (15) and (16) imply \(\bar{\beta }\) satisfies (14). Solving the equation in (15) gives
Using (17) and \(Y = X_{S}\beta _{S}+\varepsilon\), (16) is equivalent to
Since \((I - X_{Q}(X_{Q}^{T}X_{Q})^{-1}X_{Q}^{T})X_{Q} = 0\), (18) can be simplified as
Note that, if (12) has a unique solution, say \(\check{\beta }\), and \(\bar{\beta }\) satisfies (19), then \(\bar{\beta }\) is indeed the unique solution to (12); this is equivalent to \(\check{\beta }_{Q^{c}} = 0\). Furthermore, if \(\min _{j\in Q}\vert \beta _{j}\vert>\|\beta _{Q} -\bar{\beta }_{Q}\|_{\infty }\) also holds, we can conclude \(\check{\beta }_{Q}\neq 0\). Thus (12) achieves sign recovery. We will make use of this idea repeatedly in what follows.
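The optimality-condition argument above can be checked numerically: fit a lasso, then verify that the active coordinates satisfy the stationarity equation exactly while the inactive ones satisfy the strict subgradient bound. The sketch below uses scikit-learn's lasso, whose objective \((2n)^{-1}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1\) matches the scaling used here; the data and penalty level are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, alpha = 200, 20, 0.2
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

# High-precision lasso fit so the KKT conditions hold to tight tolerance.
b = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y).coef_
grad = X.T @ (y - X @ b) / n  # gradient of the smooth part of the objective

active = np.abs(b) > 1e-10
# Active coordinates: gradient equals alpha * sign(b_j) (stationarity).
assert np.allclose(grad[active], alpha * np.sign(b[active]), atol=1e-4)
# Inactive coordinates: |grad_j| <= alpha (subgradient condition with |t| <= 1).
assert np.all(np.abs(grad[~active]) <= alpha + 1e-6)
print("KKT conditions verified")
```

This is exactly the pair of conditions that the proof's witness construction exploits: forcing the subgradient bound on \(Q^{c}\) certifies that the restricted solution is the global one.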
Secondly, consider Step 1 in (5),
Here, denote \(\check{\beta }=\hat{\beta } _{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}\). After this step, the ideal result is that with high probability,
Therefore, define an oracle estimator of (20),
where \(\hat{Q} =\hat{ \mathcal{R}}\cup \hat{\mathcal{U}}_{1} = S \cup \hat{\mathcal{R}}_{2}\). Now, we plug \(\check{\beta }\), \(\bar{\beta }\), and \(\hat{Q}\) back into (12), (13), and (19); then it suffices to prove that (20) has a unique solution and achieves sign consistency with \(Q = \hat{Q}\).
Let
Then, (19) is equivalent to
To be clear, since we have already screened \(\hat{\mathcal{N}}\) out, \(\hat{Q}^{c}\) is in fact the complement of \(\hat{Q}\) under the “universe” \(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}\). We write \(\hat{Q}^{c}\) instead of \((\hat{\mathcal{R}}\cup \hat{\mathcal{U}})\setminus \hat{Q} = \hat{\mathcal{U}}_{2}\) to highlight the close connection with the analysis in the first part above.
Now let
From Conditions 1 and 4, P(A) → 1 as a direct result of Proposition 2 in Weng et al. [21]. Since Condition 8 implies
then, given A,
is always less than \(1 -\gamma _{1}\).
Denote by \(K_{2}(\mathcal{R}_{1},Q)\) the analogue of \(K_{2}\) and by \(\bar{\beta }_{Q}\) the analogue of \(\bar{\beta }_{\hat{Q}}\), obtained by replacing \(\hat{\mathcal{R}}_{1}\) and \(\hat{Q}\) in (22) with \(\mathcal{R}_{1}\) and Q. Given \(X_{Q}\) and \(\varepsilon\), the j-th element of \(K_{2}(\mathcal{R}_{1},Q)\), namely
is normally distributed with mean 0 and variance \(V _{j}\), where
Hence, we let
Next, we want to show
By the tail probability inequality of Gaussian distribution (inequality (48) in Wainwright [20]), it is not hard to see that
where \(V = (1 + s_{n}^{1/2}n^{-1/2})/(n\lambda _{n}^{2}) + \frac{s_{n}+z_{n}} {nC_{\min }} (8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1) \geq V _{ j}\) under condition \(H^{c}\). Since \(\log [2^{s_{n}+z_{n}+1}(\,p_{n} - s_{n})] = o(\gamma _{1}^{2}/8V )\) under our scaling, (26) → 0. To bound pr(H), note that
Since \(\|\varepsilon \|_{2}^{2} \sim \chi ^{2}(n)\), using inequality (54a) in Wainwright [20], we get
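The chi-square tail bound invoked here can be sanity-checked by Monte Carlo. The exponent \(3n\delta^{2}/16\) below mirrors the constant appearing later in the proof, but the exact form of inequality (54a) is an assumption for illustration; the sketch only confirms that the empirical tail sits below a bound of this shape.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta, trials = 100, 0.5, 20000

# ||eps||_2^2 ~ chi-square(n): simulate its upper tail at n*(1 + delta)
# and compare with a sub-exponential bound exp(-3*n*delta^2/16)
# (constants assumed for illustration).
samples = rng.standard_normal((trials, n))
chi2 = (samples ** 2).sum(axis=1)
empirical = np.mean(chi2 >= n * (1 + delta))
bound = np.exp(-3 * n * delta ** 2 / 16)
print(empirical, bound)  # empirical tail should fall below the bound
```

With n = 100 the tail event \(\chi^{2}(n) \geq 1.5n\) is roughly a three-sigma deviation, so the empirical frequency is an order of magnitude below the analytic bound, consistent with how the proof uses this inequality.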
whenever \(s_{n}/n < 1/2\). For any given Q satisfying \(S \subset Q \subset S \cup Z\),
holds for any \(\mathcal{R}_{1}\) satisfying \(R \subset \mathcal{R}_{1} \subset S\). Therefore, by the concentration inequality (58b) in Wainwright [20],
Hence, (28) and (29) imply \(\mbox{ pr}(H) \leq 2^{z_{n}+1}\exp (-\frac{s_{n}} {2} ) +\exp (-\frac{3} {16}s_{n}) \rightarrow 0\).
Since \(P(A^{c}) = 1 - P(A) \rightarrow 0\), the inequalities (26)–(29) imply (25) under the scaling in Theorem 1. Thus \(\|K_{1} + K_{2}\|_{\infty } < 1\) holds with high probability, which also means \(\check{\beta }_{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1}} = 0\) holds asymptotically.
From the analysis in the first part, the next goal is the uniqueness of the solution to (20). Suppose there is another solution, say \(\check{\beta }^{{\prime}}\). For any t with 0 < t < 1, the convex combination \(\check{\beta }(t) = t\check{\beta } + (1 - t)\check{\beta }^{{\prime}}\) is also a solution to (20) by convexity. Note that the new solution \(\check{\beta }(t)\) satisfies (16) and \(\check{\beta }(t)_{Q^{c}} = 0\); hence it is a solution to (13). From the uniqueness of the solution to (13), we conclude \(\check{\beta }=\check{\beta } ^{{\prime}}\).
The last part of this step is to prove \(\bar{\beta }_{\hat{\mathcal{U}}_{1}}\neq 0\) with high probability. By (17) and \(Y = X_{S}\beta _{S}+\varepsilon = X_{\hat{Q}}\beta _{\hat{Q}}+\varepsilon\), we have
for any \(\hat{Q}\) satisfying \(S \subset \hat{Q} \subset S \cup Z\). In (29), we have already obtained
Let \(G =\Big\{\| (X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}\|_{2}> 9/(nC_{\min })\Big\}\), by the inequality (60) in Wainwright [20],
Since \((X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}X_{\hat{Q}}^{T}\varepsilon \mid X_{\hat{Q}} \sim N(0,(X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1})\), conditioning on \(G^{c}\) we achieve
since under \(G^{c}\), each component of \((X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}X_{\hat{Q}}^{T}\varepsilon \mid X_{\hat{Q}}\) is normally distributed with mean 0 and variance less than \(9/(nC_{\min })\).
Hence (30)–(32) together imply that,
holds with probability larger than \(1 -\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2e^{-s_{n}/2}\big)\). Therefore,
Under the scaling of Theorem 1, we have \(\mbox{ pr}(B) \geq \mbox{ pr}(A) \rightarrow 1\) and \(2^{z_{n}}\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2e^{-s_{n}/2}\big) \rightarrow 0\). From Condition 5, it is easy to verify that
for sufficiently large n. Thus with high probability \(\min _{j\in S}\vert \beta _{j}\vert>\|\bar{\beta } _{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\) as n increases, which also implies \(\check{\beta }_{\hat{Q}}\neq 0\) with high probability.
Finally, \(\hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}\) exactly recovers the signals with high probability as n → ∞.
Step III:
We now prove that RAM-2 succeeds in detecting signals in Step 3. As in Step II, we need to define an appropriate \(\check{\beta }\) in (12) and \(\bar{\beta }\) in (13). Since the main idea is the same as the procedure above, we only describe the key steps in the following proof. Recall the estimator (7),
$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}_{ 2}\cup \hat{\mathcal{N}}_{2}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\Big(Y _{ i} -\sum _{j\in \hat{\mathcal{R}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{U}}_{1}\cup \hat{\mathcal{N}}_{1}} X_{ik } \beta _{k}\Big)^{2} +\lambda _{ n}^{\star \star }\sum _{ j\in \hat{\mathcal{R}}}\vert \beta _{j}\vert \Bigg\} \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{N}}\cup \hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}^{\star \star }\|\beta _{ \hat{\mathcal{R}}}\|_{1}\Bigg\}. {}\end{array}$$(34)This is a new “\(\check{\beta }\)” in (12), and we denote it as \(\tilde{\beta }\). After this step, the ideal result is that with high probability,
$$\displaystyle{ \tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\mbox{ and }\tilde{\beta }_{\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}} = 0. }$$(35)Therefore, define an oracle estimator of (34),
$$\displaystyle{\mathring{\beta} =\mathop{ \arg \min \nolimits }_{\beta _{S^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{ S}\beta _{S}\|_{2}^{2} +\lambda _{ n}^{\star \star }\|\beta _{ \hat{\mathcal{R}}_{1}}\|_{1}\Bigg\}. }$$(36)
Now, we plug \(\tilde{\beta }\) and \(\mathring{\beta}\) back into (12), (13), and (18); then it suffices to prove that (34) has a unique solution and achieves sign consistency with Q = S. Let
$$\displaystyle\begin{array}{rcl} F^{{\prime}}& =& X_{\hat{ \mathcal{R}}_{2}}^{T} - \Sigma _{\hat{ \mathcal{R}}_{2}S}\Sigma _{S}^{-1}X_{ S}^{T}, {}\\ K_{1}^{{\prime}}& =& \Sigma _{\hat{ \mathcal{R}}_{2}S}\Sigma _{SS}^{-1}\mbox{ sig}(\mathring{\beta} _{ S}), {}\\ K_{2}^{{\prime}}& =& F^{{\prime}}X_{ S}(X_{S}^{T}X_{ S})^{-1}\mbox{ sig}(\mathring{\beta} _{ S}) + (n\lambda _{n}^{\star \star })^{-1}F^{{\prime}}\{I - X_{ S}(X_{S}^{T}X_{ S})^{-1}X_{ S}^{T}\}\varepsilon.{}\\ \end{array}$$Similarly,
$$\displaystyle{ \mbox{ pr}\Big(\|K_{1}^{{\prime}}\|_{ \infty }\leq 1-\alpha \Big) \geq \mbox{ pr}\Big(\big\{\|K_{1}^{{\prime}}\|_{ \infty }\leq 1 -\alpha \big\}\cap D\Big) \geq \mbox{ pr}(D) \geq \mbox{ pr}(A) \rightarrow 1, }$$(37)
where \(D =\{\hat{ \mathcal{R}}_{2} \subset Z\}\), which implies \(\|K_{1}^{{\prime}}\|_{\infty }\leq 1-\alpha\) under Condition 7. Let
$$\displaystyle\begin{array}{rcl} H^{{\prime}}& =& \mathop{\bigcup }_{ R\subset \mathcal{R}_{2}\subset S} \Big\{\mbox{ sig}(\mathring{\beta} _{S})^{T}(X_{ S}^{T}X_{ S})^{-1}\mbox{ sig}(\mathring{\beta} _{ S}) + (n\lambda _{n}^{\star \star })^{-2}\|\varepsilon \|_{ 2}^{2}> \frac{s_{n}} {nC_{\min }}\big(8s_{n}^{1/2}n^{-1/2} + 1\big) {}\\ & & +\big(1 + s_{n}^{1/2}n^{-1/2}\big)/\big(n(\lambda _{ n}^{\star \star })^{2}\big)\Big\}. {}\\ \end{array}$$Then,
$$\displaystyle\begin{array}{rcl} \mbox{ pr}\Big(\|K_{2}^{{\prime}}\|_{ \infty }> \frac{\alpha } {2}\Big)& \leq &\mbox{ pr}\Big(\big\{\|K_{2}^{{\prime}}\|_{ \infty }> \frac{\alpha } {2}\big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{ \begin{array}{c}(\mathcal{R}_{2},\mathcal{R}_{1}) \\ \mathcal{R}_{2}\subset Z \\ R\subset \mathcal{R}_{1}\subset S\end{array}}\Big\{\|\tilde{K}_{2}(\mathcal{R}_{2},\mathcal{R}_{1})\|_{\infty }> \frac{\alpha } {2}\Big\}\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{ \begin{array}{c}(\mathcal{R}_{2},\mathcal{R}_{1}) \\ \mathcal{R}_{2}\subset Z \\ R\subset \mathcal{R}_{1}\subset S\end{array}} \Big\{\|\tilde{K}_{2 } (\mathcal{R}_{2},\mathcal{R}_{1})\|_{\infty }> \frac{\alpha } {2}\Big\}\mid \tilde{H}^{c}\Big) + \mbox{ pr}(\tilde{H}) + \mbox{ pr}(A^{c}) \\ & \leq & 2^{z_{n}+s_{n}+1}z_{ n}e^{-\alpha ^{2}/8V ^{{\prime}} } + 2e^{-\frac{s_{n}} {2} } + e^{-\frac{3} {16} s_{n}} + \mbox{ pr}(A^{c}) \\ & \longrightarrow & 0, {}\end{array}$$(38)where the last step of (38) follows from (26), (28), and (29) in the proof of Step II, and \(V ^{{\prime}} = \frac{s_{n}} {nC_{\min }}(8s_{n}^{1/2}n^{-1/2} + 1) + (1 + s_{ n}^{1/2}n^{-1/2})/(n(\lambda _{ n}^{\star \star })^{2})\).
Equations (37) and (38) indicate \(\tilde{\beta }_{\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}} = 0\). We skip the proof of uniqueness and move to the next step of proving \(\tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\).
Let \(W_{n} =\lambda _{ n}^{\star \star }s_{n}^{1/2}\big( \frac{8} {C_{\min }}s_{n}^{1/2}n^{-\frac{1} {2} } + \frac{1} {C_{ \min }}\big) + \frac{s_{n}^{1/2}} {n^{1/2}C_{\min }^{1/2}} = o(n^{a_{2}/2-\delta })\). In the same way, we can show that as n → ∞
Hence, Condition 5 ensures that as n → ∞,
which is equivalent to \(\tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\).
Finally, combining Step I, Step II, and Step III, we conclude that
□
Proof of Theorem 2
Denote the decomposition \(S = \hat{\mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1} \cup \hat{\mathcal{N}}_{1}\) and define the set of noise variables left in \(\hat{\mathcal{U}}^{c}\) as \((\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}) \cup (\hat{\mathcal{N}}\setminus \hat{\mathcal{N}}_{1})\,\dot{=}\,\hat{\mathcal{R}}_{2} \cup \hat{\mathcal{N}}_{2}\), where \(\hat{\mathcal{R}}_{1}\), \(\hat{\mathcal{U}}_{1}\), and \(\hat{\mathcal{N}}_{1}\) are the signals in \(\hat{\mathcal{R}}\), \(\hat{\mathcal{U}}\), and \(\hat{\mathcal{N}}\), respectively.
Step I:
Consider Step 1 in (5), which is exactly the same as (20). Since the argument is identical to Step II in the proof of Theorem 1, we skip the details here.
Step II:
Consider Step 2 in (6).
$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\Big(Y _{ i} -\sum _{j\in \hat{\mathcal{N}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{1}} X_{ik}\beta _{k}\Big)^{2} +\lambda _{ n}^{\star }\sum _{ j\in \hat{\mathcal{N}}}\vert \beta _{j}\vert \Bigg\} \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}^{\star }\|\beta _{ \hat{\mathcal{N}}}\|_{1}\Bigg\}. {}\end{array}$$(40)Here, denote \(\check{\beta }=\hat{\beta } _{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\). After this step, the ideal result is that with high probability,
$$\displaystyle{ \check{\beta }_{\hat{\mathcal{N}}_{1}}\neq 0\mbox{ and }\check{\beta }_{\hat{\mathcal{N}}\setminus \hat{\mathcal{N}}_{1}} = 0. }$$(41)
Then, define an oracle estimator of (40),
$$\displaystyle{ \bar{\beta }=\mathop{ \arg \min \nolimits }_{\beta _{(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{ 1}\cup \hat{\mathcal{N}}_{1})^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{\hat{ Q}}\beta _{\hat{Q}}\|_{2}^{2} +\lambda _{ n}^{\star }\|\beta _{ \hat{\mathcal{N}}_{1}}\|_{1}\Bigg\}, }$$(42)
where \(\hat{Q} = (\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{1}) \cup \hat{\mathcal{N}}_{1} = S \cup \hat{\mathcal{R}}_{2}\). Similar to Step II in the proof of Theorem 1, let
$$\displaystyle\begin{array}{rcl} F& =& X_{\hat{\mathcal{N}}_{2}}^{T} - \Sigma _{\hat{ Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}X_{\hat{ Q}}^{T}, {}\\ K_{1}& =& \Sigma _{\hat{\mathcal{N}}_{2}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}), {}\\ K_{2}& =& FX_{\hat{Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}) + (n\lambda _{n})^{-1}F\{I - X_{\hat{ Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\}\varepsilon, {}\\ \mbox{ and}& & {}\\ A& =& \{R \subset \hat{\mathcal{L}}_{1}\dot{ =}\hat{ \mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1} \subset S,S \subset \hat{Q} \subset S \cup Z\}, {}\\ B& =& \{S \subset \hat{Q} \subset S \cup Z\}, {}\\ \mathcal{T}_{A}& =& \{(\mathcal{L}_{1},Q)\vert R \subset \mathcal{L}_{1} \subset S,S \subset Q \subset S \cup Z\}. {}\\ \end{array}$$Similarly, we get
$$\displaystyle{ \mbox{ pr}(\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}) \geq \mbox{ pr}(\{\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}\} \cap A) \geq \mbox{ pr}(A) \rightarrow 1. }$$(43)To obtain \(\mbox{ pr}(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}) \rightarrow 0\), we define event H as
$$\displaystyle\begin{array}{rcl} H& =& \mathop{\bigcup }_{\begin{array}{c}(\mathcal{L}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\Big\{\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2} {}\\ & &> \frac{s_{n} + z_{n}} {nC_{\min }} \big(8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1\big) +\big (1 + s_{ n}^{1/2}n^{-1/2}\big)/(n\lambda _{ n}^{2})\Big\}. {}\\ \end{array}$$$$\displaystyle\begin{array}{rcl} \mbox{ pr}\big(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\big)& \leq &\mbox{ pr}\Big(\big\{\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{L}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\|K_{2}(\mathcal{L}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}\Big) + \mbox{ pr}(H) + \mbox{ pr}(A^{c}) \\ & \leq & 2^{s_{n}+z_{n} } \cdot 2(\,p_{n} - s_{n})\exp (-\gamma _{1}^{2}/8V ) + 2^{z_{n}+1}\exp \big(-\frac{s_{n}} {2} \big) \\ & & +\exp \big(-\frac{3} {16}s_{n}\big) \\ & \longrightarrow & 0, {}\end{array}$$(44)where V = (1 + s n 1∕2 n −1∕2)∕(n(λ n ⋆)2) + s n n −1 C min −1(8s n 1∕2 n −1∕2 + 1).
Again, we skip the uniqueness of \(\check{\beta }\) and move on to bounding \(\|\bar{\beta }_{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\). By (30)–(32) in the proof of Theorem 1, we have
where \(U_{n} =\lambda _{n}(s_{n} + z_{n})^{1/2}\Big( \frac{8} {C_{\min }}(s_{n} + z_{n})^{1/2}n^{-1/2} + 1/C_{\min }\Big) + \frac{(s_{n}+z_{n})^{1/2}} {n^{1/2}C_{\min }^{1/2}}\). Since \(\min _{j\in S}\vert \beta _{j}\vert \gg U_{n}\) for sufficiently large n, we conclude that with high probability \(\min _{j\in S}\vert \beta _{j}\vert>\|\bar{\beta } _{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\) as n increases, which also implies \(\check{\beta }_{\hat{Q}}\neq 0\).
Therefore, \(\hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\) successfully recovers the signals in \(\hat{\mathcal{N}}\) with high probability when n is large enough.
Step III:
Following the same steps as in Step III in the proof of Theorem 1, we have
$$\displaystyle{ P\big(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\mbox{ is unique and }\mbox{ sign}(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}) = \mbox{ sign}(\beta )\big) \rightarrow 1,\quad \mbox{ as }n \rightarrow \infty. }$$□
Copyright information
© 2017 Springer International Publishing AG
Feng, Y., Yu, M. (2017). Regularization After Marginal Learning for Ultra-High Dimensional Regression Models. In: Ahmed, S. (eds) Big and Complex Data Analysis. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-41573-4_1
Print ISBN: 978-3-319-41572-7
Online ISBN: 978-3-319-41573-4