Abstract
Regularization is a popular variable selection technique for high dimensional regression models. However, under the ultra-high dimensional setting, a direct application of the regularization methods tends to fail in terms of model selection consistency due to the possible spurious correlations among predictors. Motivated by the ideas of screening (Fan and Lv, J R Stat Soc Ser B Stat Methodol 70:849–911, 2008) and retention (Weng et al, Manuscript, 2013), we propose a new two-step framework for variable selection, where in the first step, marginal learning techniques are utilized to partition variables into different categories, and the regularization methods can be applied afterwards. The technical conditions of model selection consistency for this broad framework relax those for the one-step regularization methods. Extensive simulations show the competitive performance of the new method.
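The two-step recipe described above (marginal learning to partition variables, then regularization) can be illustrated with a short simulation. This is a hedged sketch, not the authors' RAM-2 implementation: the SIS-style cutoff ⌊n/log n⌋, the lasso penalty 0.1, and all data-generating parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 1000, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0  # first s predictors are the true signals
y = X @ beta + rng.standard_normal(n)

# Step 1: marginal learning -- rank predictors by the magnitude of their
# marginal regression coefficient and keep the top floor(n / log n).
marginal = np.abs(X.T @ y) / n
keep = np.argsort(marginal)[::-1][: int(n / np.log(n))]

# Step 2: regularization (lasso) restricted to the screened set.
fit = Lasso(alpha=0.1).fit(X[:, keep], y)
selected = keep[np.abs(fit.coef_) > 1e-8]
print(sorted(int(j) for j in selected))  # with high probability contains {0,...,4}
```

Because the spurious marginal correlations among the 995 noise predictors stay well below the signal strength here, the screened set retains all true signals and the lasso then removes the surviving noise.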
References
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. arXiv preprint arXiv:1108.0775 (2011)
Chen, J., Chen, Z.: Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32, 407–499 (2004)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70, 849–911 (2008)
Fan, J., Song, R.: Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604 (2010)
Fan, J., Feng, Y., Song, R.: Nonparametric independence screening in sparse ultra-high dimensional additive models. J. Am. Stat. Assoc. 106, 544–557 (2011)
Fan, J., Feng, Y., Tong, X.: A road to classification in high dimensional space: the regularized optimal affine discriminant. J. R. Stat. Soc. Ser. B. 74, 745–771 (2012)
Fan, J., Feng, Y., Jiang, J., Tong, X.: Feature augmentation via nonparametrics and selection (FANS) in high dimensional classification. J. Am. Stat. Assoc. (2014, to appear)
Feng, Y., Li, T., Ying, Z.: Likelihood adaptively modified penalties. arXiv preprint arXiv:1308.5036 (2013)
Feng, Y., Yu, Y.: Consistent cross-validation for tuning parameter selection in high-dimensional variable selection. arXiv preprint arXiv:1308.5390 (2013)
Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–135 (1993)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
Greenshtein, E., Ritov, Y.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971–988 (2004)
Huang, J., Ma, S., Zhang, C.-H.: Adaptive lasso for sparse high-dimensional regression models. Stat. Sin. 18, 1603 (2008)
Knight, K., Fu, W.: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356–1378 (2000)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)
Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery. IEEE Trans. Inf. Theory 55, 2183–2202 (2009)
Weng, H., Feng, Y., Qiao, X.: Regularization after retention in ultrahigh dimensional linear regression models. Manuscript (2013). Preprint, arXiv:1311.5625
Yu, Y., Feng, Y.: APPLE: approximate path for penalized likelihood estimators. Stat. Comput. 24, 803–819 (2014)
Yu, Y., Feng, Y.: Modified cross-validation for lasso penalized high-dimensional linear models. J. Comput. Graph. Stat. 23, 1009–1027 (2014)
Zhao, P., Yu, B.: On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005)
Acknowledgements
This work was partially supported by NSF grant DMS-1308566. The authors would like to thank the Editor and the referee for constructive comments which greatly improved the paper. The majority of the work was done when Mengjia Yu was an M.A. student at Columbia University.
Appendix
Proof of Theorem 1
Denote the design matrix by X, the response vector by Y, and the error vector by \(\varepsilon\). The scaling condition is \(\log p_{n} = O(n^{a_{1}})\), \(s_{n} = O(n^{a_{2}})\), with \(a_{1}> 0\), \(a_{2}> 0\), \(a_{1} + 2a_{2} <1\).
Step I:
Recall the index of variables with large coefficients
$$\displaystyle{\mathcal{M}_{\tilde{\gamma }_{n}} =\{ 1 \leq j \leq p: \vert \hat{\beta }_{j}^{M}\vert \mbox{ is among the first }\lfloor \tilde{\gamma }_{n}\rfloor \mbox{ of all}\} = \mathcal{N}^{c}.}$$
Under Corollary 1,
$$\displaystyle{\mbox{ pr}(S \subset \mathcal{M}_{\tilde{\gamma }_{n}} = \mathcal{N}^{c} = \mathcal{R}\cup \mathcal{U}) \rightarrow 1\quad \mbox{ as }n \rightarrow \infty.}$$
Hence, with high probability, the set \(\hat{\mathcal{N}}\) contains only noise variables.
Step II:
Next we show that RAM-2 succeeds in detecting signals in \(\hat{\mathcal{N}}^{c}\). Let \(S =\{ 1 \leq j \leq p:\beta _{j}\neq 0\}\). Denote the decomposition \(S = \hat{\mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1}\) and define the set of noise variables left in \(\hat{\mathcal{N}}^{c}\) as \((\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}) \cup (\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1})\,\dot{=}\,\hat{\mathcal{R}}_{2} \cup \hat{\mathcal{U}}_{2}\), where \(\hat{\mathcal{R}}_{1}\) and \(\hat{\mathcal{U}}_{1}\) are the signals in \(\hat{\mathcal{R}}\) and \(\hat{\mathcal{U}}\), respectively.
Firstly, we introduce an important technique from RAR+. Define the set of true signals as S and, in an arbitrary regularization step, let H be the set held without penalty and C the set subject to the penalty. Let
Now we define \(Q = S \cup H\), the variables we would like to retain; the variables that are supposed to be discarded are then \(Q^{c} = C\setminus S\).
By optimality conditions of convex problems [2], \(\check{\beta }\) is a solution to (12) if and only if
where \(\partial \|\check{\beta }_{C}\|_{1}\) is the subgradient of \(\|\beta _{C}\|_{1}\) at \(\beta =\check{\beta }\). Namely, the ith (\(1 \leq i \leq p_{n}\)) element of \(\partial \|\check{\beta }_{C}\|_{1}\) is
where t can be any real number with \(\vert t\vert \leq 1\). Similarly, \(\bar{\beta }\) is the unique solution to (13) if and only if
where \(\mbox{ sig}(\bar{\beta }_{Q})\), a vector of length card(Q), is the subgradient of \(\|\beta _{Q}\|_{1}\) at \(\beta _{Q} =\bar{\beta } _{Q}\). Then it is not hard to see that the unique solution \(\bar{\beta }\) of (13) is also a solution to (12) if
simply because (15) and (16) imply \(\bar{\beta }\) satisfies (14). Solving the equation in (15) gives
Using (17) and \(Y = X_{S}\beta _{S}+\varepsilon\), (16) is equivalent to
Since \((I - X_{Q}(X_{Q}^{T}X_{Q})^{-1}X_{Q}^{T})X_{Q} = 0\), (18) can be simplified as
Note that, if (12) has a unique solution, say \(\check{\beta }\), and \(\bar{\beta }\) satisfies (19), then \(\bar{\beta }\) is indeed the unique solution to (12); this is equivalent to \(\check{\beta }_{Q^{c}} = 0\). Furthermore, if \(\min _{j\in Q}\vert \beta _{j}\vert>\|\beta _{Q} -\bar{\beta }_{Q}\|_{\infty }\) also holds, we can conclude \(\check{\beta }_{Q}\neq 0\). Thus (12) achieves sign recovery. We will make use of this idea repeatedly in what follows.
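The optimality-condition argument above can be checked numerically: fit a lasso, then verify that the active coordinates satisfy the stationarity equation exactly while the inactive ones satisfy the strict subgradient bound. The sketch below uses scikit-learn's lasso, whose objective \((2n)^{-1}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1\) matches the scaling used here; the data and penalty level are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, alpha = 200, 20, 0.2
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

# High-precision lasso fit so the KKT conditions hold to tight tolerance.
b = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y).coef_
grad = X.T @ (y - X @ b) / n  # gradient of the smooth part of the objective

active = np.abs(b) > 1e-10
# Active coordinates: gradient equals alpha * sign(b_j) (stationarity).
assert np.allclose(grad[active], alpha * np.sign(b[active]), atol=1e-4)
# Inactive coordinates: |grad_j| <= alpha (subgradient condition with |t| <= 1).
assert np.all(np.abs(grad[~active]) <= alpha + 1e-6)
print("KKT conditions verified")
```

This is exactly the pair of conditions that the proof's witness construction exploits: forcing the subgradient bound on \(Q^{c}\) certifies that the restricted solution is the global one.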
Secondly, consider Step 1 in (5),
Here, denote \(\check{\beta }=\hat{\beta } _{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}\). After this step, the ideal result is that with high probability,
Therefore, define an oracle estimator of (20),
where \(\hat{Q} =\hat{ \mathcal{R}}\cup \hat{\mathcal{U}}_{1} = S \cup \hat{\mathcal{R}}_{2}\). Now, we plug \(\check{\beta }\), \(\bar{\beta }\), and \(\hat{Q}\) back into (12), (13), and (19); then it suffices to prove that (20) has a unique solution and achieves sign consistency with \(Q = \hat{Q}\).
Let
Then, (19) is equivalent to
To be clear, since we have already screened \(\hat{\mathcal{N}}\) out, \(\hat{Q}^{c}\) is in fact the complement of \(\hat{Q}\) under the “universe” \(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}\). We write \(\hat{Q}^{c}\) instead of \((\hat{\mathcal{R}}\cup \hat{\mathcal{U}})\setminus \hat{Q} = \hat{\mathcal{U}}_{2}\) to highlight the close connection with the analysis in the first part above.
Now let
From Conditions 1 and 4, P(A) → 1 as a direct result of Proposition 2 in Weng et al. [21]. Since Condition 8 implies
then, given A,
is always less than \(1 -\gamma _{1}\).
Denote by \(K_{2}(\mathcal{R}_{1},Q)\) the analogue of \(K_{2}\) and by \(\bar{\beta }_{Q}\) the analogue of \(\bar{\beta }_{\hat{Q}}\), obtained by replacing \(\hat{\mathcal{R}}_{1}\) and \(\hat{Q}\) in (22) with \(\mathcal{R}_{1}\) and Q. Given \(X_{Q}\) and \(\varepsilon\), the j-th element of \(K_{2}(\mathcal{R}_{1},Q)\), namely
is normally distributed with mean 0 and variance \(V _{j}\), where
Hence, we let
Next, we want to show
By the tail probability inequality of Gaussian distribution (inequality (48) in Wainwright [20]), it is not hard to see that
where \(V = (1 + s_{n}^{1/2}n^{-1/2})/(n\lambda _{n}^{2}) + \frac{s_{n}+z_{n}} {nC_{\min }} (8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1) \geq V _{ j}\) under condition \(H^{c}\). Since \(\log [2^{s_{n}+z_{n}+1}(\,p_{n} - s_{n})] = o(\gamma _{1}^{2}/8V )\) under our scaling, (26) → 0. To bound pr(H), note that
Since \(\|\varepsilon \|_{2}^{2} \sim \chi ^{2}(n)\), using inequality (54a) in Wainwright [20], we get
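The chi-square tail bound invoked here can be sanity-checked by Monte Carlo. The exponent \(3n\delta^{2}/16\) below mirrors the constant appearing later in the proof, but the exact form of inequality (54a) is an assumption for illustration; the sketch only confirms that the empirical tail sits below a bound of this shape.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta, trials = 100, 0.5, 20000

# ||eps||_2^2 ~ chi-square(n): simulate its upper tail at n*(1 + delta)
# and compare with a sub-exponential bound exp(-3*n*delta^2/16)
# (constants assumed for illustration).
samples = rng.standard_normal((trials, n))
chi2 = (samples ** 2).sum(axis=1)
empirical = np.mean(chi2 >= n * (1 + delta))
bound = np.exp(-3 * n * delta ** 2 / 16)
print(empirical, bound)  # empirical tail should fall below the bound
```

With n = 100 the tail event \(\chi^{2}(n) \geq 1.5n\) is roughly a three-sigma deviation, so the empirical frequency is an order of magnitude below the analytic bound, consistent with how the proof uses this inequality.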
whenever \(s_{n}/n < 1/2\). For any given Q satisfying \(S \subset Q \subset S \cup Z\),
holds for any \(\mathcal{R}_{1}\) satisfying \(R \subset \mathcal{R}_{1} \subset S\). Therefore, by the concentration inequality (58b) in Wainwright [20],
Hence, (28) and (29) imply \(\mbox{ pr}(H) \leq 2^{z_{n}+1}\exp (-\frac{s_{n}} {2} ) +\exp (-\frac{3} {16}s_{n}) \rightarrow 0\).
Since \(P(A^{c}) = 1 - P(A) \rightarrow 0\), the inequalities (26)–(29) imply (25) under the scaling in Theorem 1. Thus \(\|K_{1} + K_{2}\|_{\infty } < 1\) holds with high probability, which also means \(\check{\beta }_{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{1}} = 0\) holds asymptotically.
From the analysis in the first part, the next goal is the uniqueness of the solution to (20). Suppose there is another solution, say \(\check{\beta }^{{\prime}}\). For any t with 0 < t < 1, the convex combination \(\check{\beta }(t) = t\check{\beta } + (1 - t)\check{\beta }^{{\prime}}\) is also a solution to (20) by convexity. Note that the new solution \(\check{\beta }(t)\) satisfies (16) and \(\check{\beta }(t)_{Q^{c}} = 0\); hence it is a solution to (13). From the uniqueness of the solution to (13), we conclude \(\check{\beta }=\check{\beta } ^{{\prime}}\).
The last part of this step is to prove \(\bar{\beta }_{\hat{\mathcal{U}}_{1}}\neq 0\) with high probability. By (17) and \(Y = X_{S}\beta _{S}+\varepsilon = X_{\hat{Q}}\beta _{\hat{Q}}+\varepsilon\), we have
for any \(\hat{Q}\) satisfying \(S \subset \hat{Q} \subset S \cup Z\). In (29), we have already obtained
Let \(G =\Big\{\| (X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}\|_{2}> 9/(nC_{\min })\Big\}\), by the inequality (60) in Wainwright [20],
Since \((X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}X_{\hat{Q}}^{T}\varepsilon \mid X_{\hat{Q}} \sim N(0,(X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1})\), conditioning on \(G^{c}\) we achieve
since under \(G^{c}\), each component of \((X_{\hat{Q}}^{T}X_{\hat{Q}})^{-1}X_{\hat{Q}}^{T}\varepsilon \mid X_{\hat{Q}}\) is normally distributed with mean 0 and variance less than \(9/(nC_{\min })\).
Hence (30)–(32) together imply that,
holds with probability larger than \(1 -\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2e^{-s_{n}/2}\big)\). Therefore,
Under the scaling of Theorem 1, we have \(\mbox{ pr}(B) \geq \mbox{ pr}(A) \rightarrow 1\) and \(2^{z_{n}}\big(2(s_{n} + z_{n})e^{-(s_{n}+z_{n})/18} + 2e^{-n/2} + 2e^{-s_{n}/2}\big) \rightarrow 0\). From Condition 5, it is easy to verify that
for sufficiently large n. Thus with high probability \(\min _{j\in S}\vert \beta _{j}\vert>\|\bar{\beta } _{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\) as n increases, which also implies \(\check{\beta }_{\hat{Q}}\neq 0\) with high probability.
Finally, \(\hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1}}\) exactly recovers the signals with high probability as n → ∞.
Step III:
We now prove that RAM-2 succeeds in detecting signals in Step 3. As in Step II, we need to define an appropriate \(\check{\beta }\) in (12) and \(\bar{\beta }\) in (13). Since the main idea is the same as the procedure above, we only describe the key steps in the following proof. Recall the estimator (7),
$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}_{ 2}\cup \hat{\mathcal{N}}_{2}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\Big(Y _{ i} -\sum _{j\in \hat{\mathcal{R}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{U}}_{1}\cup \hat{\mathcal{N}}_{1}} X_{ik } \beta _{k}\Big)^{2} +\lambda _{ n}^{\star \star }\sum _{ j\in \hat{\mathcal{R}}}\vert \beta _{j}\vert \Bigg\} \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{N}}\cup \hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}^{\star \star }\|\beta _{ \hat{\mathcal{R}}}\|_{1}\Bigg\}. {}\end{array}$$(34)This is a new “\(\check{\beta }\)” in (12), and we denote it as \(\tilde{\beta }\). After this step, the ideal result is that with high probability,
$$\displaystyle{ \tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\mbox{ and }\tilde{\beta }_{\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}} = 0. }$$(35)Therefore, define an oracle estimator of (34),
$$\displaystyle{\mathring{\beta} =\mathop{ \arg \min \nolimits }_{\beta _{S^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{ S}\beta _{S}\|_{2}^{2} +\lambda _{ n}^{\star \star }\|\beta _{ \hat{\mathcal{R}}_{1}}\|_{1}\Bigg\}. }$$(36)
Now, we plug \(\tilde{\beta }\) and \(\mathring{\beta}\) back into (12), (13), and (18); then it suffices to prove that (34) has a unique solution and achieves sign consistency with Q = S. Let
$$\displaystyle\begin{array}{rcl} F^{{\prime}}& =& X_{\hat{ \mathcal{R}}_{2}}^{T} - \Sigma _{\hat{ \mathcal{R}}_{2}S}\Sigma _{S}^{-1}X_{ S}^{T}, {}\\ K_{1}^{{\prime}}& =& \Sigma _{\hat{ \mathcal{R}}_{2}S}\Sigma _{SS}^{-1}\mbox{ sig}(\mathring{\beta} _{ S}), {}\\ K_{2}^{{\prime}}& =& F^{{\prime}}X_{ S}(X_{S}^{T}X_{ S})^{-1}\mbox{ sig}(\mathring{\beta} _{ S}) + (n\lambda _{n}^{\star \star })^{-1}F^{{\prime}}\{I - X_{ S}(X_{S}^{T}X_{ S})^{-1}X_{ S}^{T}\}\varepsilon.{}\\ \end{array}$$Similarly,
$$\displaystyle{ \mbox{ pr}\Big(\|K_{1}^{{\prime}}\|_{ \infty }\leq 1-\alpha \Big) \geq \mbox{ pr}\Big(\big\{\|K_{1}^{{\prime}}\|_{ \infty }\leq 1 -\alpha \big\}\cap D\Big) \geq \mbox{ pr}(D) \geq \mbox{ pr}(A) \rightarrow 1, }$$(37)
where \(D =\{\hat{ \mathcal{R}}_{2} \subset Z\}\), which implies \(\|K_{1}^{{\prime}}\|_{\infty }\leq 1-\alpha\) under Condition 7. Let
$$\displaystyle\begin{array}{rcl} H^{{\prime}}& =& \mathop{\bigcup }_{ R\subset \mathcal{R}_{2}\subset S} \Big\{\mbox{ sig}(\mathring{\beta} _{S})^{T}(X_{ S}^{T}X_{ S})^{-1}\mbox{ sig}(\mathring{\beta} _{ S}) + (n\lambda _{n}^{\star \star })^{-2}\|\varepsilon \|_{ 2}^{2}> \frac{s_{n}} {nC_{\min }}\big(8s_{n}^{1/2}n^{-1/2} + 1\big) {}\\ & & +\big(1 + s_{n}^{1/2}n^{-1/2}\big)/\big(n(\lambda _{ n}^{\star \star })^{2}\big)\Big\}. {}\\ \end{array}$$Then,
$$\displaystyle\begin{array}{rcl} \mbox{ pr}\Big(\|K_{2}^{{\prime}}\|_{ \infty }> \frac{\alpha } {2}\Big)& \leq &\mbox{ pr}\Big(\big\{\|K_{2}^{{\prime}}\|_{ \infty }> \frac{\alpha } {2}\big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{ \begin{array}{c}(\mathcal{R}_{2},\mathcal{R}_{1}) \\ \mathcal{R}_{2}\subset Z \\ R\subset \mathcal{R}_{1}\subset S\end{array}}\Big\{\|\tilde{K}_{2}(\mathcal{R}_{2},\mathcal{R}_{1})\|_{\infty }> \frac{\alpha } {2}\Big\}\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{ \begin{array}{c}(\mathcal{R}_{2},\mathcal{R}_{1}) \\ \mathcal{R}_{2}\subset Z \\ R\subset \mathcal{R}_{1}\subset S\end{array}} \Big\{\|\tilde{K}_{2 } (\mathcal{R}_{2},\mathcal{R}_{1})\|_{\infty }> \frac{\alpha } {2}\Big\}\mid \tilde{H}^{c}\Big) + \mbox{ pr}(\tilde{H}) + \mbox{ pr}(A^{c}) \\ & \leq & 2^{z_{n}+s_{n}+1}z_{ n}e^{-\alpha ^{2}/8V ^{{\prime}} } + 2e^{-\frac{s_{n}} {2} } + e^{-\frac{3} {16} s_{n}} + \mbox{ pr}(A^{c}) \\ & \longrightarrow & 0, {}\end{array}$$(38)where the last step of (38) follows from (26), (28), and (29) in the proof of Step II, and \(V ^{{\prime}} = \frac{s_{n}} {nC_{\min }}(8s_{n}^{1/2}n^{-1/2} + 1) + (1 + s_{ n}^{1/2}n^{-1/2})/(n(\lambda _{ n}^{\star \star })^{2})\).
Equations (37) and (38) indicate \(\tilde{\beta }_{\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}} = 0\). We skip the proof of uniqueness and move to the next step of proving \(\tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\).
Let \(W_{n} =\lambda _{ n}^{\star \star }s_{n}^{1/2}\big( \frac{8} {C_{\min }}s_{n}^{1/2}n^{-\frac{1} {2} } + \frac{1} {C_{ \min }}\big) + \frac{s_{n}^{1/2}} {n^{1/2}C_{\min }^{1/2}} = o(n^{a_{2}/2-\delta })\). In the same way, we can show that as n → ∞
Hence, Condition 5 ensures that as n → ∞,
which is equivalent to \(\tilde{\beta }_{\hat{\mathcal{R}}_{1}}\neq 0\).
Finally, combining Step I, Step II, and Step III, we conclude that
□
Proof of Theorem 2
Denote the decomposition \(S = \hat{\mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1} \cup \hat{\mathcal{N}}_{1}\) and define the set of noise variables left in \(\hat{\mathcal{U}}^{c}\) as \((\hat{\mathcal{R}}\setminus \hat{\mathcal{R}}_{1}) \cup (\hat{\mathcal{N}}\setminus \hat{\mathcal{N}}_{1})\,\dot{=}\,\hat{\mathcal{R}}_{2} \cup \hat{\mathcal{N}}_{2}\), where \(\hat{\mathcal{R}}_{1}\), \(\hat{\mathcal{U}}_{1}\), and \(\hat{\mathcal{N}}_{1}\) are the signals in \(\hat{\mathcal{R}}\), \(\hat{\mathcal{U}}\), and \(\hat{\mathcal{N}}\), respectively.
Step I:
Consider Step 1 in (5), which is exactly the same as (20). Since the argument is identical to Step II in the proof of Theorem 1, we skip the details here.
Step II:
Consider Step 2 in (6).
$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}& =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\ \Bigg\{(2n)^{-1}\sum _{ i=1}^{n}\Big(Y _{ i} -\sum _{j\in \hat{\mathcal{N}}}X_{ij}\beta _{j} -\sum _{k\in \hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{1}} X_{ik}\beta _{k}\Big)^{2} +\lambda _{ n}^{\star }\sum _{ j\in \hat{\mathcal{N}}}\vert \beta _{j}\vert \Bigg\} \\ & =& \mathop{\arg \min \nolimits }_{\beta _{\hat{\mathcal{U}}\setminus \hat{\mathcal{U}}_{ 1}}=0}\Bigg\{(2n)^{-1}\|Y - X\beta \|_{ 2}^{2} +\lambda _{ n}^{\star }\|\beta _{ \hat{\mathcal{N}}}\|_{1}\Bigg\}. {}\end{array}$$(40)Here, denote \(\check{\beta }=\hat{\beta } _{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\). After this step, the ideal result is that with high probability,
$$\displaystyle{ \check{\beta }_{\hat{\mathcal{N}}_{1}}\neq 0\mbox{ and }\check{\beta }_{\hat{\mathcal{N}}\setminus \hat{\mathcal{N}}_{1}} = 0. }$$(41)
Then, define an oracle estimator of (40),
$$\displaystyle{ \bar{\beta }=\mathop{ \arg \min \nolimits }_{\beta _{(\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{ 1}\cup \hat{\mathcal{N}}_{1})^{c}}=0}\Bigg\{(2n)^{-1}\|Y - X_{\hat{ Q}}\beta _{\hat{Q}}\|_{2}^{2} +\lambda _{ n}^{\star }\|\beta _{ \hat{\mathcal{N}}_{1}}\|_{1}\Bigg\}, }$$(42)
where \(\hat{Q} = (\hat{\mathcal{R}}\cup \hat{\mathcal{U}}_{1}) \cup \hat{\mathcal{N}}_{1} = S \cup \hat{\mathcal{R}}_{2}\). Similar to Step II in the proof of Theorem 1, let
$$\displaystyle\begin{array}{rcl} F& =& X_{\hat{\mathcal{N}}_{2}}^{T} - \Sigma _{\hat{ Q}^{c}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}X_{\hat{ Q}}^{T}, {}\\ K_{1}& =& \Sigma _{\hat{\mathcal{N}}_{2}\hat{Q}}\Sigma _{\hat{Q}\hat{Q}}^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}), {}\\ K_{2}& =& FX_{\hat{Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}\mbox{ sig}(\bar{\beta }_{\hat{ Q}}) + (n\lambda _{n})^{-1}F\{I - X_{\hat{ Q}}(X_{\hat{Q}}^{T}X_{\hat{ Q}})^{-1}X_{\hat{ Q}}^{T}\}\varepsilon, {}\\ \mbox{ and}& & {}\\ A& =& \{R \subset \hat{\mathcal{L}}_{1}\dot{ =}\hat{ \mathcal{R}}_{1} \cup \hat{\mathcal{U}}_{1} \subset S,S \subset \hat{Q} \subset S \cup Z\}, {}\\ B& =& \{S \subset \hat{Q} \subset S \cup Z\}, {}\\ \mathcal{T}_{A}& =& \{(\mathcal{L}_{1},Q)\vert R \subset \mathcal{L}_{1} \subset S,S \subset Q \subset S \cup Z\}. {}\\ \end{array}$$Similarly, we get
$$\displaystyle{ \mbox{ pr}(\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}) \geq \mbox{ pr}(\{\|K_{1}\|_{\infty }\leq 1 -\gamma _{1}\} \cap A) \geq \mbox{ pr}(A) \rightarrow 1. }$$(43)To obtain \(\mbox{ pr}(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}) \rightarrow 0\), we define event H as
$$\displaystyle\begin{array}{rcl} H& =& \mathop{\bigcup }_{\begin{array}{c}(\mathcal{L}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\Big\{\mbox{ sig}(\bar{\beta }_{Q})^{T}(X_{ Q}^{T}X_{ Q})^{-1}\mbox{ sig}(\bar{\beta }_{ Q}) + (n\lambda _{n})^{-2}\|\varepsilon \|_{ 2}^{2} {}\\ & &> \frac{s_{n} + z_{n}} {nC_{\min }} \big(8(s_{n} + z_{n})^{1/2}n^{-1/2} + 1\big) +\big (1 + s_{ n}^{1/2}n^{-1/2}\big)/(n\lambda _{ n}^{2})\Big\}. {}\\ \end{array}$$$$\displaystyle\begin{array}{rcl} \mbox{ pr}\big(\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\big)& \leq &\mbox{ pr}\Big(\big\{\|K_{2}\|_{\infty }> \frac{\gamma _{1}} {2}\big\} \cap A\Big) + \mbox{ pr}(A^{c}) \\ & \leq &\mbox{ pr}\Big(\mathop{\bigcup }_{\begin{array}{c}(\mathcal{L}_{1},Q)\subset \mathcal{T}_{A}\end{array}}\|K_{2}(\mathcal{L}_{1},Q)\|_{\infty }> \frac{\gamma _{1}} {2}\mid H^{c}\Big) + \mbox{ pr}(H) + \mbox{ pr}(A^{c}) \\ & \leq & 2^{s_{n}+z_{n} } \cdot 2(\,p_{n} - s_{n})\exp (-\gamma _{1}^{2}/8V ) + 2^{z_{n}+1}\exp \big(-\frac{s_{n}} {2} \big) \\ & & +\exp \big(-\frac{3} {16}s_{n}\big) \\ & \longrightarrow & 0, {}\end{array}$$(44)where V = (1 + s n 1∕2 n −1∕2)∕(n(λ n ⋆)2) + s n n −1 C min −1(8s n 1∕2 n −1∕2 + 1).
Again, we skip the uniqueness of \(\check{\beta }\) and move on to bounding \(\|\bar{\beta }_{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\). By (30)–(32) in the proof of Theorem 1, we have
where \(U_{n} =\lambda _{n}(s_{n} + z_{n})^{1/2}\Big( \frac{8} {C_{\min }}(s_{n} + z_{n})^{1/2}n^{-1/2} + 1/C_{\min }\Big) + \frac{(s_{n}+z_{n})^{1/2}} {n^{1/2}C_{\min }^{1/2}}\). Since \(\min _{j\in S}\vert \beta _{j}\vert \gg U_{n}\) for sufficiently large n, we conclude that with high probability \(\min _{j\in S}\vert \beta _{j}\vert>\|\bar{\beta } _{\hat{Q}} -\beta _{\hat{Q}}\|_{\infty }\) as n increases, which also implies \(\check{\beta }_{\hat{Q}}\neq 0\).
Therefore, \(\hat{\beta }_{\hat{\mathcal{R}},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\) successfully recovers the signals in \(\hat{\mathcal{N}}\) with high probability when n is large enough.
Step III:
Following the same steps as in Step III in the proof of Theorem 1, we have
$$\displaystyle{ P\big(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}\mbox{ is unique and }\mbox{ sign}(\hat{\beta }_{\hat{\mathcal{R}}_{1},\hat{\mathcal{U}}_{1},\hat{\mathcal{N}}_{1}}) = \mbox{ sign}(\beta )\big) \rightarrow 1,\quad \mbox{ as }n \rightarrow \infty. }$$□
Copyright information
© 2017 Springer International Publishing AG
Feng, Y., Yu, M. (2017). Regularization After Marginal Learning for Ultra-High Dimensional Regression Models. In: Ahmed, S. (eds) Big and Complex Data Analysis. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-41573-4_1
Print ISBN: 978-3-319-41572-7
Online ISBN: 978-3-319-41573-4