
Extensions of stability selection using subsamples of observations and covariates


Abstract

We introduce extensions of stability selection, a method introduced by Meinshausen and Bühlmann (J R Stat Soc 72:417–473, 2010) to stabilise variable selection methods. We propose to apply a base selection method repeatedly to random subsamples of observations and random subsets of the covariates under scrutiny, and to select covariates based on their selection frequency. We analyse the effects and benefits of these extensions. Our analysis generalises the theoretical results of Meinshausen and Bühlmann (2010) from the case of half-samples to subsamples of arbitrary size. We study, in a theoretical manner, the effect of taking random covariate subsets using a simplified score model. Finally, we validate these extensions through numerical experiments on both synthetic and real datasets, and compare the results in detail to those of the original stability selection method.
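To make the procedure concrete, the following is a minimal sketch (in Python, not the authors' code) of the extended scheme described in the abstract: the observations are split at random into L disjoint subsamples, a random covariate subset is drawn for each base run, the base selector is applied to each piece, and covariates are ranked by how often they are selected. The Lasso from scikit-learn stands in for the base selection method, and the function name, parameters and the simple frequency normalisation are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np
from sklearn.linear_model import Lasso

def extended_stability_selection(X, y, L=2, T=50, n_cov=None, alpha=0.1, seed=None):
    """Illustrative sketch: selection frequencies from L disjoint observation
    subsamples per repetition and a random covariate subset per base run."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    n_cov = D if n_cov is None else n_cov               # size of each covariate subset
    counts = np.zeros(D)
    for _ in range(T):                                  # T repetitions of the random split
        groups = np.array_split(rng.permutation(N), L)  # L disjoint observation subsamples
        for idx in groups:
            cov = rng.choice(D, size=n_cov, replace=False)       # random covariate subset
            coef = Lasso(alpha=alpha).fit(X[np.ix_(idx, cov)], y[idx]).coef_
            counts[cov[np.abs(coef) > 1e-10]] += 1      # covariates kept by the base method
    return counts / (T * L)                             # selection frequency per covariate

# Covariates whose frequency exceeds a threshold tau form the stable set, e.g.
# freqs = extended_stability_selection(X, y, L=2, T=100, n_cov=X.shape[1] // 2)
# selected = np.where(freqs >= 0.6)[0]
```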


Notes

  1. We used the R-package LARS (Hastie and Efron 2012) as the Lasso implementation.

References

  • Alexander, D.H., Lange, K.: Stability selection for genome-wide association. Genet. Epidemiol. 35(7), 722–728 (2011)

  • Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)

  • Bach, F.R.: Bolasso: model consistent Lasso estimation through the bootstrap. In: Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 33–40. ACM (2008)

  • Beinrucker, A., Dogan, U., Blanchard, G.: Early stopping for mutual information based feature selection. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pp. 975–978 (2012a)

  • Beinrucker, A., Dogan, U., Blanchard, G.: A simple extension of stability feature selection. In: Pattern Recognition, vol. 7476 of Lecture Notes in Computer Science, pp. 256–265. Springer, New York (2012b)

  • Bi, J., Bennett, K., Embrechts, M., Breneman, C., Song, M.: Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res. 3, 1229–1243 (2003)

  • Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  • Bühlmann, P., Yu, B.: Analyzing bagging. Ann. Stat. 30(4), 927–961 (2002)

  • Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.-H.: Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Inference 143(11), 1835–1858 (2013)

  • Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-Interscience, New York (2006)

  • Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979)

  • Embrechts, P., Klüppelberg, C., Mikosch, T.: Modelling Extremal Events: For Insurance and Finance, vol. 33 of Stochastic Modelling and Applied Probability. Springer, New York (1997)

  • Escudero, G., Marquez, L., Rigau, G.: Boosting applied to word sense disambiguation. In: Proceedings of the European Conference on Machine Learning (ECML), pp. 129–141 (2000)

  • Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004)

  • Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  • Guyon, I.: Feature Extraction: Foundations and Applications, vol. 207. Springer, New York (2006)

  • Hastie, T., Efron, B.: LARS: Least Angle Regression, Lasso and Forward Stagewise. R package version 1.1. http://CRAN.R-project.org/package=lars (2012)

  • Haury, A.-C., Mordelet, F., Vera-Licona, P., Vert, J.-P.: TIGRESS: trustful inference of gene regulation using stability selection. BMC Syst. Biol. 6(1), 145 (2012)

  • He, Q., Lin, D.-Y.: A variable selection method for genome-wide association studies. Bioinformatics 27(1), 1–8 (2011)

  • He, Z., Yu, W.: Stable feature selection for biomarker discovery. Comput. Biol. Chem. 34(4), 215–225 (2010)

  • Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)

  • Leadbetter, M.R., Lindgren, G., Rootzén, H.: Extremes and Related Properties of Random Sequences and Processes. Springer Series in Statistics. Springer, New York (1983)

  • LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  • Lounici, K.: Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2, 90–102 (2008)

  • MASH Consortium: The MASH project. http://www.mash-project.eu (2012). Accessed 19 Mar 2013

  • Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. 72(4), 417–473 (2010)

  • Meinshausen, N., Yu, B.: Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 37(1), 246–270 (2009)

  • Politis, D.N., Romano, J.P., Wolf, M.: Subsampling. Springer Series in Statistics. Springer, New York (1999)

  • Sauerbrei, W., Schumacher, M.: A bootstrap resampling procedure for model building: application to the Cox regression model. Stat. Med. 11(16), 2093–2109 (1992)

  • Schapire, R., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)

  • Shah, R.D., Samworth, R.J.: Variable selection with error control: another look at stability selection. J. R. Stat. Soc. 75(1), 55–80 (2013)

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288 (1996)

  • Wang, S., Nan, B., Rosset, S., Zhu, J.: Random Lasso. Ann. Appl. Stat. 5(1), 468–485 (2011)


Acknowledgments

We are extremely grateful to Nicolai Meinshausen and Peter Bühlmann for communicating to us the R-code used by Meinshausen and Bühlmann (2010) as well as for numerous discussions. We are indebted to Richard Samworth and Rajen Shah for numerous discussions and for hosting the first author during part of this work. We thank Maurilio Gutzeit for helping us with part of the numerical experiments.

Author information


Corresponding author

Correspondence to Andre Beinrucker.

Additional information

A preliminary version of this work was presented at the conference DAGM 2012 (Beinrucker et al. 2012b).

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 320 KB)

Appendix: Proofs of theoretical results

1.1 Proofs of Sect. 3.1

For notational convenience we use the shorthand \(S^{{\mathrm {base}}}(\ell ,t)\equiv S^{{\mathrm {base}}}(X^{(\mathcal {S}(\ell ,t))},Y^{(\mathcal {S}(\ell ,t))})\) . To prove Theorem 1 and Corollary 1 we need some notation and two lemmas. We define

$$\begin{aligned} {\Pi }^{{\mathrm {simult}}}_{L,\ell _0}(d) := \frac{1}{T} \sum _{t=1}^T \mathbf {1}\left\{ \sum _{\ell =1}^L \mathbf {1}\left\{ d \in S_L^{{\mathrm {base}}}(\ell ,t) \right\} \ge \ell _0 \right\} \end{aligned}$$

the fraction of the T repetitions in which covariate d has been selected in at least \(\ell _0\) of the L subsamples simultaneously.
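For intuition, both quantities can be computed directly from the array of selection indicators. The short snippet below (an illustration, not taken from the paper) assumes the indicators are stored in a boolean array sel of shape (T, L, D), filled here with arbitrary placeholder values.

```python
import numpy as np

# sel[t, l, d] = 1 if covariate d was selected on subsample l of repetition t (placeholder values)
rng = np.random.default_rng(0)
T, L, D, ell0 = 100, 4, 10, 3
sel = rng.random((T, L, D)) < 0.3

pi_sfs = sel.mean(axis=(0, 1))                        # \Pi^{SFS}_L(d): overall selection frequency
pi_simult = (sel.sum(axis=1) >= ell0).mean(axis=0)    # \Pi^{simult}_{L,ell0}(d) as defined above
```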

Lemma 1

(Relation of \(\Pi ^{{\mathrm {simult}}}\) and \(\Pi ^{SFS}\)) It holds for any \(d\in {\mathcal {F}}\):

$$\begin{aligned} \left( \frac{L-\ell _0+1}{L} \right) \Pi ^{{\mathrm {simult}}}_{L,\ell _0}(d) + \frac{\ell _0-1}{L} \ge {\Pi }^{SFS}_{L}(d)\,. \end{aligned}$$

Proof

We have for all repetitions of drawings of subsamples \(t=1,\ldots ,T\):

$$\begin{aligned}&\frac{1}{L} \sum _{\ell =1}^L {\mathbf {1}\{d \in S^{{\mathrm {base}}}(\ell ,t)\}}\\&\quad \le \left( \frac{\ell _{0}-1}{L}\right) \mathbf {1}\left\{ \sum _{\ell =1}^L \mathbf {1}\left\{ d \in S^{{\mathrm {base}}}(\ell ,t) \right\} \le \ell _0 -1 \right\} \\&\qquad + \mathbf {1}\left\{ \sum _{\ell =1}^L \mathbf {1}\left\{ d \in S^{{\mathrm {base}}}(\ell ,t) \right\} \ge \ell _0 \right\} \,. \end{aligned}$$

Averaging over the repetitions \(t=1,\ldots ,T\) , we obtain

$$\begin{aligned} \Pi ^{SFS}_{L}(d)&\le \frac{\ell _{0}-1}{L} \left( 1 - { {\Pi }^{{\mathrm {simult}}}_{L,\ell _0}}(d) \right) + {\Pi ^{{\mathrm {simult}}}_{L,\ell _0}}(d)\\&= \left( \frac{L-\ell _{0}+1}{L} \right) {\Pi ^{{\mathrm {simult}}}_{L,\ell _{0}}(d)} + \frac{\ell _{0}-1}{L}\,. \end{aligned}$$

\(\square \)
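As a quick numerical sanity check of Lemma 1 (illustrative only, with arbitrary selection probabilities), one can draw selection indicators at random and verify that the stated inequality holds for every covariate:

```python
import numpy as np

rng = np.random.default_rng(1)
T, L, D, ell0 = 200, 5, 20, 3
sel = rng.random((T, L, D)) < rng.random(D)           # covariate-specific selection probabilities

pi_sfs = sel.mean(axis=(0, 1))                        # \Pi^{SFS}_L(d)
pi_simult = (sel.sum(axis=1) >= ell0).mean(axis=0)    # \Pi^{simult}_{L,ell0}(d)
rhs = (L - ell0 + 1) / L * pi_simult + (ell0 - 1) / L
assert np.all(rhs >= pi_sfs - 1e-12)                  # Lemma 1 holds coordinate-wise
```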

Lemma 2

(Exponential inequality for \(\Pi ^{{\mathrm {simult}}}\)) The following inequality holds for any \(d\in {\mathcal {F}}\), \(\xi >0\), and \(\ell _0\in \left\{ 1,\ldots ,L\right\} \) such that \(p_0:= \frac{\ell _0}{L} \ge p_L(d)\):

$$\begin{aligned} \mathbb {P}\left[ \Pi ^{{\mathrm {simult}}}_{L,\ell _0}(d) \ge \xi \right] \le \frac{1}{\xi } \exp \left( -L D\left( p_0, p_L(d)\right) \right) \,. \end{aligned}$$
(13)

Proof

We have

$$\begin{aligned} \mathbb {E}\left[ {\Pi }^{{\mathrm {simult}}}_{L,\ell _0}(d)\right]&= \mathbb {P}\left[ \sum _{\ell =1}^L {\mathbf {1}\{d \in {S}^{{\mathrm {base}}}(\ell ,1) \}} \ge \ell _0 \right] \\&= \mathbb {P}\left[ {\mathrm {Bin}}\left( L,p_L(d)\right) \ge \ell _0\right] \\&\le \exp \left( -L D\left( p_0, p_L(d)\right) \right) \,. \end{aligned}$$

The first equality holds because the T repetitions are identically distributed. The second equality is valid because the L random observation subsamples within one repetition are disjoint: their joint distribution is therefore the same as that of L independent samples of size \(\left\lfloor \frac{N}{L} \right\rfloor \), so that \((S^{{\mathrm {base}}}(\ell ,1))_{1\le \ell \le L}\) has the same distribution as L independent copies of the variable \(S_L^{{\mathrm {base}}}\). The last inequality is the Chernoff bound for the binomial distribution. Combining with Markov’s inequality yields (13). \(\square \)
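The Chernoff step can be checked numerically. The sketch below is an illustration rather than part of the proof; it takes D(a, b) to be the relative entropy between Bernoulli(a) and Bernoulli(b), the form in which the Chernoff binomial bound is usually stated, and assumes scipy is available for the exact binomial tail.

```python
import numpy as np
from scipy.stats import binom

def kl_bernoulli(a, b):
    """Relative entropy D(a, b) between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

L, ell0, p = 10, 7, 0.4                      # example values with p0 = ell0 / L = 0.7 >= p_L(d) = 0.4
p0 = ell0 / L
exact_tail = binom.sf(ell0 - 1, L, p)        # P[Bin(L, p) >= ell0], about 0.055
chernoff = np.exp(-L * kl_bernoulli(p0, p))  # exp(-L D(p0, p)), about 0.16
assert exact_tail <= chernoff                # the exact tail never exceeds the Chernoff bound
```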

Proof of Theorem 1

We relate \(\Pi ^{SFS}\) to \(\Pi ^{{\mathrm {simult}}}\) and apply an exponential inequality on \(\Pi ^{{\mathrm {simult}}}\). For any \(d\in A_{\theta ,L}\), it holds by definition of \(A_{\theta ,L}\) and the assumptions on \(p_0\) that \(p_L(d)\le \theta \le p_0\), hence it holds by Lemma 1 and Lemma 2 that

$$\begin{aligned}&\mathbb {P}\left[ \Pi ^{SFS}_{L}(d) \ge \tau \right] \\&\quad \le \mathbb {P}\left[ \left( \frac{L-\ell _0+1}{L}\right) \Pi ^{{\mathrm {simult}}}_{L,\ell _0}(d) +\frac{\ell _0-1}{L} \ge \tau \right] \\&\quad = \mathbb {P}\left[ \Pi ^{{\mathrm {simult}}}_{L,\ell _0}(d) \ge \frac{L\tau -\ell _0+1}{L-\ell _0+1}\right] \\&\quad \le \frac{1-p_0+L^{-1}}{\tau -p_0+L^{-1}} \exp \left( -L D\left( p_0, p_L(d)\right) \right) \,, \end{aligned}$$

where we have used \(\xi :=\frac{L\tau -\ell _0+1}{L-\ell _0+1}\) . This result generalizes Shah and Samworth (2013, Lemma 5). Hence

$$\begin{aligned}&\frac{\mathbb {E}\left[ \left|S^{SFS}_{L,\tau } \cap A_{\theta ,L} \right|\right] }{\left|A_{\theta ,L} \right|}\\&\quad = \frac{1}{\left|A_{\theta ,L} \right|} \sum _{d\in A_{\theta ,L}} \mathbb {P}\left[ \Pi ^{SFS}_{L}(d) \ge \tau \right] \\&\quad \le \frac{1-p_0+L^{-1}}{\tau -p_0+L^{-1}} \frac{1}{\left|A_{\theta ,L} \right|} \sum _{d\in A_{\theta ,L}} \exp \left( -L D\left( p_0, p_L(d)\right) \right) \,. \end{aligned}$$

Since \(x\mapsto \exp ( -L D(p_0,x))\) is non-decreasing, we obtain the first part of the result by using the upper bound, valid for all \(d \in A_{\theta ,L}\):

$$\begin{aligned} \exp \left( -L D\left( p_0, p_L(d)\right) \right) \le \exp \left( -L D\left( p_0, \theta \right) \right) \,. \end{aligned}$$

For the second part, we use the upper bound

$$\begin{aligned} \exp \left( -L D\left( p_0, p_L(d)\right) \right)&= \frac{\exp \left( -L D\left( p_0, p_L(d)\right) \right) }{p_L(d)} p_L(d)\\&\le \frac{\exp \left( -L D\left( p_0, \theta \right) \right) }{\theta } p_L(d)\,, \end{aligned}$$

since the function \(x\mapsto \frac{\exp \left( -L D\left( p_0, x\right) \right) }{x}\) can be shown to be non-decreasing for \(x \le p_0 - L^{-1}\) . Finally, summing over \(d \in A_{\theta ,L}\), observe

$$\begin{aligned} \sum _{d\in A_{\theta ,L}} p_L(d)&= \mathbb {E}\left[ \sum _{d\in A_{\theta ,L}} {\mathbf {1}\{ d \in S^{{\mathrm {base}}}_L\}}\right] \\&= \mathbb {E}\left[ \left|A_{\theta ,L} \cap S^{{\mathrm {base}}}_L \right|\right] , \end{aligned}$$

leading to the desired conclusion. Equations (6) and (7) can be proved similarly. \(\square \)

Proof of Corollary 1

This follows the same argument as in Shah and Samworth (2013). If the variable selection were completely at random, the marginal selection probability of any given covariate would be \(\frac{q_L }{D}\), where we recall that \(q_L =\mathbb {E}\left[ \left|S^{{\mathrm {base}}}_L \right|\right] \) is the average number of covariates selected by the base method. As we assume that the selection probability of a signal covariate is better than random, it follows that \(p_L(d)> \frac{q_L }{D}\) for any \(d \in {\mathcal {N}}^C\). Conversely, as all noise covariates have the same probability of being selected by the base method, one has \(p_L(d)< \frac{q_L }{D}\) for any \(d \in {\mathcal {N}}\). Therefore, with \(\theta :=\frac{q_L }{D}\) we must have \(A_{\theta ,L}={\mathcal {N}}\) and \(A_{\theta ,L}^c={\mathcal {N}}^C\). Inequality (4) therefore implies (8), wherein we have taken a minimum over the range of \(\ell _0\) allowed in Theorem 1. \(\square \)

1.2 Proofs of Sect. 3.2

Proof of Theorem 2

We first bound the error probability from above by bounding \(Q_d\) from above by \(\theta \) on \(A_{D,\theta }\) and from below by \(\theta \) on \(A_{D,\theta }^c\), so that the scores \(Q_d\) effectively drop out:

$$\begin{aligned} {\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta } \right]&= {\mathbb {P}}\left[ \max _{d \in A_{D,\theta }} \hat{Q}_d > \max _{d \in A_{D,\theta }^c}\hat{Q}_d \right] \\&={\mathbb {P}}\left[ \max _{d \in A_{D,\theta }} \left( Q_d + \varepsilon _d\right) > \max _{d \in A_{D,\theta }^c}\left( Q_d + \varepsilon _d\right) \right] \\&\le {\mathbb {P}}\left[ \max _{d \in A_{D,\theta }} (\theta + \varepsilon _d) > \max _{d \in A_{D,\theta }^C} (\theta +\varepsilon _d) \right] \\&= \mathbb {P}\left[ \mathop {\hbox {Arg Max}}\limits _{d\in \left\{ 1,\ldots ,D\right\} } \varepsilon _d \in A_{D,\theta }\right] = \frac{\left|A_{D,\theta } \right|}{D} \rightarrow \eta , \end{aligned}$$

as \(D\rightarrow \infty \). If \(\eta =0\), the conclusion is therefore established; in the remainder of the proof we hence assume \(\eta >0\). We defer to the end of the proof the case \(\eta =1\) and assume for now that \(\eta \in (0,1)\). Then \(\frac{|A_{D,\theta }|}{D} \rightarrow \eta \in (0,1)\) implies both \(|A_{D,\theta }| \rightarrow \infty \) and \(|A^c_{D,\theta }|\rightarrow \infty \), as well as \(\frac{|A_{D,\theta }^c|}{|A_{D,\theta }|} \rightarrow \gamma := \frac{1-\eta }{\eta }\). We return to the error probability and bound it from below by using \(Q_d\ge 0\) for \(d \in A_{D,\theta }\) and \(Q_d\le M\) for \(d\in A_{D,\theta }^c\):

$$\begin{aligned} {\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta } \right] \ge {\mathbb {P}}\left[ \max _{d \in A_{D,\theta }} \varepsilon _d > M + \max _{d \in A_{D,\theta }^C}\varepsilon _d \right] . \end{aligned}$$
(14)

Since the distribution of \(\varepsilon _i\) belongs to MDA(Fréchet \((\alpha )\)), from classical results of extreme value theory (Embrechts 1997, Theorem 3.3.7) we know that there exists a slowly varying function L such that, if we denote \(G(x):=x^{1/\alpha }L(x)\), then

$$\begin{aligned} \frac{\max _{d \in A_{D,\theta }} \varepsilon _d}{G(\left|A_{D,\theta } \right|)} \rightarrow \text {Fréchet}(\alpha ) \quad \text {and} \quad \frac{\max _{d \in A_{D,\theta }^c} \varepsilon _d}{G(\left|A_{D,\theta }^c \right|)} \rightarrow \text {Fréchet}(\alpha )\,, \end{aligned}$$
(15)

in the sense of convergence in distribution, as \(D\rightarrow \infty \). Continuing from (14):

$$\begin{aligned} {\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta } \right]\ge & {} {\mathbb {P}}\left[ \frac{\max _{d \in A_{D,\theta }} \varepsilon _d}{G(|A_{D,\theta }|)} > \frac{M}{G(|A_{D,\theta }|)} \right. \\&\left. +\frac{G(|A^c_{D,\theta }|)}{G(|A_{D,\theta }|)} \frac{\max _{d \in A_{D,\theta }^c}\varepsilon _d}{G(|A^c_{D,\theta }|)} \right] . \end{aligned}$$

As L is slowly varying, we have \(\frac{L(a x)}{L(x)} \rightarrow 1\) as \(x \rightarrow \infty \), uniformly for a belonging to a bounded interval of the positive real axis (Embrechts 1997, Theorem A 3.2). We deduce

$$\begin{aligned} \frac{G(|A^c_{D,\theta }|)}{G(|A_{D,\theta }|)} = \left( \frac{|A_{D,\theta }^c|}{|A_{D,\theta }|}\right) ^{\frac{1}{\alpha }} \frac{L\left( |A_{D,\theta }|\left( \frac{|A^c_{D,\theta }|}{|A_{D,\theta }|}\right) \right) }{L(|A_{D,\theta }|)} \rightarrow \gamma ^{1/\alpha }, \end{aligned}$$
(16)

as \(D\rightarrow \infty \). We apply Slutsky’s theorem (Embrechts 1997, Example A2.7) to Eqs. (16) and (15) to obtain

$$\begin{aligned} \frac{G(\left|A^c_{D,\theta } \right|)}{G(\left|A_{D,\theta } \right|)} \, \frac{\max _{d \in A_{D,\theta }^c}\varepsilon _d}{G(\left|A^c_{D,\theta } \right|)} \rightarrow \text {Fréchet}(\alpha ,\gamma ^{1/\alpha }) \end{aligned}$$

in distribution, where Fréchet \((\alpha ,c)\) denotes the Fréchet \((\alpha )\) distribution rescaled by a factor \(c>0\).

Further, slow variation of L implies that \(x^{\varepsilon } L(x) \rightarrow \infty \) for any \(\varepsilon > 0\), so that

$$\begin{aligned} \frac{M}{G(|A_{D,\theta }|)} = \frac{M}{|A_{D,\theta }|^{\frac{1}{\alpha }}L(|A_{D,\theta }|)} \rightarrow 0, \;\; \text { as } \;\; D\rightarrow \infty . \end{aligned}$$
(17)

As the maxima in \(\max _{d \in A_{D,\theta }} \varepsilon _d\) and \(\max _{d \in A_{D,\theta }^C}\varepsilon _d\) are taken over disjoint sets of independent random variables, they are independent. Since they converge marginally in distribution, they also converge jointly and their difference converges due to the continuous mapping theorem (Embrechts 1997, Theorem A2.6). Combining with (17) and using Slutsky’s theorem again, we conclude that

$$\begin{aligned} \frac{\max _{d \in A_{D,\theta }} \varepsilon _d}{G(|A_{D,\theta }|)} - \frac{M}{G(|A_{D,\theta }|)} - \frac{G(|A^c_{D,\theta }|)}{G(|A_{D,\theta }|)} \frac{\max _{d \in A_{D,\theta }^c}\varepsilon _d}{G(|A^c_{D,\theta }|)} \end{aligned}$$

converges in distribution to the difference of two independent Fréchet distributed random variables. This convergence implies the convergence of the c.d.f. for all continuity points (Embrechts 1997, Eq. A.1). As the limiting distribution is continuous, we finally obtain

$$\begin{aligned} \liminf _{D\rightarrow \infty } {\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta } \right] \ge {\mathbb {P}}\left[ F - \gamma ^{\frac{1}{\alpha }} F' > 0 \right] , \end{aligned}$$

where \(F,F'\) are independent Fréchet \((\alpha )\) random variables. To identify the value of this lower bound, observe that it is also the limiting value of

$$\begin{aligned} \mathbb {P}\left[ \mathop {\hbox {Arg Max}}\limits _{d\in \left\{ 1,\ldots ,D\right\} } \varepsilon _d \in A_{D,\theta }\right] = \frac{|A_{D,\theta }|}{D}. \end{aligned}$$

Indeed, it suffices to repeat the above argument, except for skipping inequality (14). Hence this limiting value is exactly equal to \(\eta \).

Finally, for the case \(\eta =1\), observe that the above argument remains valid provided \(|A_{D,\theta }^c|\rightarrow \infty \). If this is not the case (i.e. \(|A_{D,\theta }^c|\) remains bounded), the conclusion holds a fortiori, since we could replace \(A_{D,\theta }^c\) by a slightly larger set of cardinality \(\ln (D)\) (say), which can only decrease the lower bound while still yielding the above limiting value. \(\square \)
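The conclusion of Theorem 2 can also be observed empirically. The Monte Carlo sketch below is purely illustrative: it uses numpy's Lomax ("pareto") noise as one convenient member of MDA(Fréchet \((\alpha )\)), and the constants D, M, α and η are arbitrary. The fraction of trials in which the maximiser of \(Q_d+\varepsilon _d\) falls in \(A_{D,\theta }\) should be close to η for large D.

```python
import numpy as np

rng = np.random.default_rng(2)
D, alpha, M, eta, n_trials = 10_000, 2.0, 5.0, 0.3, 2_000
n_bad = int(eta * D)                                          # |A_{D,theta}| is a fraction eta of D
Q = np.concatenate([np.zeros(n_bad), np.full(D - n_bad, M)])  # scores: 0 on A_{D,theta}, M outside

hits = 0
for _ in range(n_trials):
    eps = rng.pareto(alpha, size=D)        # heavy-tailed noise, lies in MDA(Frechet(alpha))
    hits += np.argmax(Q + eps) < n_bad     # error event: the best score is a below-threshold covariate
print(hits / n_trials)                     # close to eta = 0.3, as predicted by Theorem 2
```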

Proof of Theorem 3

To show the convergence of the error probability we use arguments similar to those in the proof of Theorem 2. From classical results of extreme value theory for independent standard normal random variables \((\zeta _k)_{k \in \mathbb {N}}\) (Embrechts 1997, Example 3.3.29) it holds that

$$\begin{aligned} \frac{\max _{i \le k} \zeta _i - b_k }{a_k} \rightarrow \mathrm{Gumbel}\,, \qquad \text { as } k \rightarrow \infty \,, \end{aligned}$$

in distribution, where \(a_k:= \frac{1}{\sqrt{2\ln (k)}}\) and \(b_k:=\sqrt{2\ln (k)} - \frac{\ln (4\pi )+\ln (\ln (k))}{2\sqrt{2\ln (k)}}\). Below, to clarify the argument, we introduce two independent sequences \((\zeta _k)_{k \in \mathbb {N}}\) and \((\zeta '_k)_{k \in \mathbb {N}}\) of independent standard normal variables.
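A brief simulation (again a sanity check, not part of the argument) illustrates these normalising constants: the maximum of k standard normal variables, centred by \(b_k\) and scaled by \(a_k\), is approximately Gumbel distributed, although the convergence in k is known to be slow.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n_rep = 100_000, 2_000
a_k = 1.0 / np.sqrt(2 * np.log(k))
b_k = np.sqrt(2 * np.log(k)) - (np.log(4 * np.pi) + np.log(np.log(k))) / (2 * np.sqrt(2 * np.log(k)))

# One normalised maximum of k standard normals per repetition
z = np.array([(rng.standard_normal(k).max() - b_k) / a_k for _ in range(n_rep)])
# Compare with the Gumbel median -ln(ln 2) ~ 0.37; agreement is rough because convergence is slow
print(np.median(z), -np.log(np.log(2)))
```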

Denote \(k_D:=|A_{D,\theta '}|\), \(\ell _D:= |A_{D,\theta }^c|\) and \(\varDelta :=\theta -\theta '>0\). Since \(A_{D,\theta }^c \subseteq A_{D,\theta '}^c\) we can bound the error probability from above as follows:

$$\begin{aligned} {\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta '} \right]&\le {\mathbb {P}}\left[ \max _{d \in A_{D,\theta '}} \left( Q_d + \varepsilon _d\right) > \max _{d \in A_{D,\theta }^c}\left( Q_d + \varepsilon _d\right) \right] \\&\le \mathbb {P}\left[ \max _{d \in A_{D,\theta '}} \varepsilon _d > \max _{d \in A_{D,\theta }^c} \varepsilon _d + \varDelta \right] \\&= \mathbb {P}\left[ \max _{d\le k_D} \zeta '_d > \max _{d \le \ell _D} \zeta _d + \varDelta \right] \,. \end{aligned}$$

The last equality holds since \(A_{D,\theta '}\) and \(A_{D,\theta }^c\) are disjoint; it is purely formal but notationally convenient for the sequel. Now denote \(k'_D := \max (k_D,\ell _D)\) . The above implies

$$\begin{aligned}&{\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta '} \right] \\&\quad \le \mathbb {P}\left[ \max _{d\le k'_D} \zeta '_d > \max _{d \le \ell _D} \zeta _d + \varDelta \right] \\&\quad = {\mathbb {P}}\left[ \frac{a_{k'_D}}{a_{\ell _D}}\left( \frac{\max _{d \le k'_D} \zeta '_d - b_{k'_D}}{a_{k'_D}}\right) - \frac{\max _{d \le \ell _D}\zeta _d - b_{\ell _D}}{a_{\ell _D}} \right. \\&\quad > \left. \frac{b_{\ell _D} - b_{k'_D} + \varDelta }{a_{\ell _D}} \right] \end{aligned}$$

We treat the different terms in the above upper bound. We have

$$\begin{aligned} \frac{a_{k'_D}}{a_{\ell _D}} = \frac{\sqrt{2\ln (\ell _D)}}{\sqrt{2\ln (k'_D)}} \le 1 . \end{aligned}$$

Noting that \(b_k = \sqrt{2 \ln k} + o(1)\), we have

$$\begin{aligned}&\frac{b_{\ell _D} - b_{k'_D} + \varDelta }{a_{\ell _D}} \nonumber \\&\qquad = \sqrt{2 \ln \ell _D} \left( \sqrt{2 \ln \ell _D} - \sqrt{2 \ln k'_D} + \varDelta + o(1)\right) \nonumber \\&\qquad = -2\left( \ln \frac{k'_D}{\ell _D}\right) \left( \frac{\sqrt{\ln \ell _D}}{\sqrt{\ln k'_D} + \sqrt{\ln \ell _D}}\right) \end{aligned}$$
(18)
$$\begin{aligned}&\quad \! \qquad \quad + \varDelta \sqrt{2\ln \ell _D} + o(\sqrt{\ln \ell _D}) \nonumber \\&\qquad \ge \varDelta \sqrt{2\ln \ell _D} + o(\sqrt{\ln \ell _D}). \end{aligned}$$
(19)

To check that the last inequality holds, note that Assumption (ii) of the theorem states that \(\liminf \frac{\ell _D}{D} =: \eta >0\); in particular \(\ell _D \rightarrow \infty \). On the other hand, since \(A_{D,\theta '} \subseteq A_{D,\theta }\), we have \(\limsup \frac{k_D}{D} \le 1-\eta \), therefore \(\limsup \frac{k_D}{\ell _D} \le \frac{1-\eta }{\eta }=:\gamma \) and finally \(\limsup \frac{k'_D}{\ell _D} \le \max (\gamma ,1)\). Since \(\ln \frac{k'_D}{\ell _D} \ge 0\), \(\limsup \ln \frac{k'_D}{\ell _D} \le \left( \ln \gamma \right) _+\), and the second factor in (18) is positive and upper bounded by 1, the whole term in (18) is O(1), so that Inequality (19) follows.

We deduce that, for any \(B>0\) and for D large enough, \(\frac{b_{\ell _D} - b_{k'_D}+\varDelta }{a_{\ell _D}} >B\) holds, and we have

$$\begin{aligned} {\mathbb {P}}\left[ \hat{d}_D \in A_{D,\theta '} \right] \le&{\mathbb {P}}\left[ \left( \frac{\max _{d \le k'_D} \zeta '_d - b_{k'_D}}{a_{k'_D}}\right) \right. \\&\left. - \frac{\max _{d \le \ell _D}\zeta _d - b_{\ell _D}}{a_{\ell _D}} > B \right] . \end{aligned}$$

By similar arguments as in the proof of Theorem 2, the latter upper bound converges to \({\mathbb {P}}[ G - G'>B ]\), where \(G,G'\) are two independent Gumbel random variables. As B is arbitrary we come to the announced conclusion. \(\square \)

About this article

Cite this article

Beinrucker, A., Dogan, Ü. & Blanchard, G. Extensions of stability selection using subsamples of observations and covariates. Stat Comput 26, 1059–1077 (2016). https://doi.org/10.1007/s11222-015-9589-y
