Abstract
We introduce extensions of stability selection, a method to stabilise variable selection methods introduced by Meinshausen and Bühlmann (J R Stat Soc 72:417–473, 2010). We propose to apply a base selection method repeatedly to random subsamples of the observations and random subsets of the covariates under scrutiny, and to select covariates based on their selection frequency. We analyse the effects and benefits of these extensions. Our analysis generalizes the theoretical results of Meinshausen and Bühlmann (J R Stat Soc 72:417–473, 2010) from the case of half-samples to subsamples of arbitrary size. We study, in a theoretical manner, the effect of taking random covariate subsets using a simplified score model. Finally, we validate these extensions through numerical experiments on both synthetic and real datasets, and compare the results in detail to those of the original stability selection method.
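The scheme summarised above can be sketched in a few lines. This is an illustrative sketch only: the base method below is a toy correlation-based selector standing in for the Lasso used in the paper, and all function and parameter names are our own.

```python
import numpy as np

def base_select(X, y, q):
    """Toy base method: return the indices (within X's columns) of the q
    covariates whose columns correlate most strongly with y.  The paper uses
    the Lasso as base method; this stand-in keeps the sketch dependency-free."""
    scores = np.abs(X.T @ (y - y.mean()))
    return np.argsort(scores)[-q:]

def extended_stability_selection(X, y, q=5, T=50, L=2, V=2, seed=None):
    """Sketch of the extended scheme: in each of T repetitions, split the N
    observations into L disjoint subsamples and the D covariates into V random
    groups, run the base method on every (subsample, covariate-group) pair,
    and return each covariate's selection frequency over its T*L runs."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    counts = np.zeros(D)
    for _ in range(T):
        obs = rng.permutation(N)   # random split of observations
        cov = rng.permutation(D)   # random split of covariates
        for l in range(L):
            rows = obs[l::L]
            for v in range(V):
                cols = cov[v::V]
                picked = base_select(X[np.ix_(rows, cols)], y[rows],
                                     min(q, len(cols)))
                counts[cols[picked]] += 1  # map back to original indices
    return counts / (T * L)
```

Covariates whose selection frequency exceeds a threshold are then retained, as in the original stability selection method.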
Notes
We used the R package LARS (Hastie and Efron 2012) as the Lasso implementation.
References
Alexander, D.H., Lange, K.: Stability selection for genome-wide association. Genet. Epidemiol. 35(7), 722–728 (2011)
Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
Bach, F.R.: Bolasso: model consistent Lasso estimation through the bootstrap. In: Proceedings of 25th International Conference on Machine Learning (ICML), pp. 33–40. ACM (2008)
Beinrucker, A., Dogan, U., Blanchard, G.: Early stopping for mutual information based feature selection. In: Proceedings of 21st International Conference on Pattern Recognition (ICPR), pp. 975–978 (2012a)
Beinrucker, A., Dogan, U., Blanchard, G.: A simple extension of stability feature selection. In: Pattern Recognition, vol. 7476 of Lecture Notes in Computer Science, pp. 256–265. Springer, New York (2012b)
Bi, J., Bennett, K., Embrechts, M., Breneman, C., Song, M.: Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res. 3, 1229–1243 (2003)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Bühlmann, P., Yu, B.: Analyzing bagging. Ann. Stat. 30(4), 927–961 (2002)
Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.-H.: Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Inference 143(11), 1835–1858 (2013)
Cover, T.M., Thomas, J.A.: Elements of Information Theory, second edn. Wiley-Interscience, New York (2006)
Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979)
Embrechts, P.: Modelling Extremal Events: For Insurance and Finance, volume 33 of Stochastic Modelling and Applied Probability. Springer, New York (1997)
Escudero, G., Marquez, L., Rigau, G.: Boosting applied to word sense disambiguation. In: Proceedings of European Conference on Machine Learning (ECML), pp. 129–141 (2000)
Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
Guyon, I.: Feature Extraction: Foundations and Applications, vol. 207. Springer, New York (2006)
Hastie, T., Efron, B.: LARS: Least Angle Regression, Lasso and Forward Stagewise (2012). URL http://CRAN.R-project.org/package=lars. R package version 1.1
Haury, A.-C., Mordelet, F., Vera-Licona, P., Vert, J.-P.: Tigress: trustful inference of gene regulation using stability selection. BMC Syst. Biol. 6(1), 145 (2012)
He, Q., Lin, D.-Y.: A variable selection method for genome-wide association studies. Bioinformatics 27(1), 1–8 (2011)
He, Z., Yu, W.: Stable feature selection for biomarker discovery. Comput. Biol. Chem. 34(4), 215–225 (2010)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
Leadbetter, M.R., Lindgren, G., Rootzén, H.: Extremes and Related Properties of Random Sequences and Processes. Springer Series in Statistics. Springer, New York (1983)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Lounici, K.: Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2, 90–102 (2008)
MASH Consortium: The MASH project. http://www.mash-project.eu (2012). [Online; Accessed 19 Mar 2013]
Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. 72(4), 417–473 (2010)
Meinshausen, N., Yu, B.: Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 37(1), 246–270 (2009)
Politis, D.N., Romano, J.P., Wolf, M.: Subsampling. Springer Series in Statistics. Springer, New York (1999)
Sauerbrei, W., Schumacher, M.: A bootstrap resampling procedure for model building: application to the Cox regression model. Stat. Med. 11(16), 2093–2109 (1992)
Schapire, R., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)
Shah, R.D., Samworth, R.J.: Variable selection with error control: another look at stability selection. J. R. Stat. Soc. 75(1), 55–80 (2013)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288 (1996)
Wang, S., Nan, B., Rosset, S., Zhu, J.: Random Lasso. Ann. Appl. Stat. 5(1), 468–485 (2011)
Acknowledgments
We are extremely grateful to Nicolai Meinshausen and Peter Bühlmann for communicating to us the R-code used by Meinshausen and Bühlmann (2010) as well as for numerous discussions. We are indebted to Richard Samworth and Rajen Shah for numerous discussions and for hosting the first author during part of this work. We thank Maurilio Gutzeit for helping us with part of the numerical experiments.
Additional information
A preliminary version of this work was presented at the conference DAGM 2012 (Beinrucker et al. 2012b).
Appendix: Proofs of theoretical results
1.1 Proofs of Sect. 3.1
For notational convenience we use the shorthand \(S^{{\mathrm {base}}}(\ell ,t)\equiv S^{{\mathrm {base}}}(X^{(\mathcal {S}(\ell ,t))},Y^{(\mathcal {S}(\ell ,t))})\). To prove Theorem 1 and Corollary 1 we need some notation and two lemmas. We define
\[ \Pi ^{{\mathrm {simult}}}_{\ell _0}(d) := \frac{1}{T}\sum _{t=1}^{T} {\mathbb {1}}\left\{ \sum _{\ell =1}^{L} {\mathbb {1}}\left\{ d\in S^{{\mathrm {base}}}(\ell ,t)\right\} \ge \ell _0\right\} , \]
the ratio of repetitions out of T in which covariate d has been selected in at least \(\ell _0\) subsamples simultaneously.
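As a concrete reading of these two quantities (the Boolean array layout is our assumption), the following sketch computes \(\Pi ^{SFS}\) and \(\Pi ^{{\mathrm {simult}}}_{\ell _0}\) for a single covariate:

```python
import numpy as np

def pi_sfs(sel):
    """Selection frequency of one covariate d: sel[t, l] is True when the
    base method selected d on subsample l of repetition t (layout is our
    assumption).  Averages over all T*L base-method runs."""
    return sel.mean()

def pi_simult(sel, l0):
    """Ratio of the T repetitions in which d was selected in at least l0
    of the L subsamples simultaneously."""
    return (sel.sum(axis=1) >= l0).mean()
```

Per repetition, the selected fraction is at most \(\frac{\ell _0-1}{L}\) when the simultaneous count stays below \(\ell _0\), and at most 1 otherwise; averaging over repetitions gives \(\Pi ^{SFS} \le \frac{\ell _0-1}{L} + \frac{L-\ell _0+1}{L}\,\Pi ^{{\mathrm {simult}}}_{\ell _0}\), which rearranges to the threshold \(\xi \) appearing in the proof of Theorem 1.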
Lemma 1
(Relation of \(\Pi ^{{\mathrm {simult}}}\) and \(\Pi ^{SFS}\)) It holds for any \(d\in {\mathcal {F}}\):
Proof
We have for all repetitions of drawings of subsamples \(t=1,\ldots ,T\):
Averaging over the repetitions \(t=1,\ldots ,T\) , we obtain
\(\square \)
Lemma 2
(Exponential inequality for \(\Pi ^{{\mathrm {simult}}}\)) The following inequality holds for any \(d\in {\mathcal {F}}\), \(\xi >0\), and \(\ell _0\in \left\{ 1,\ldots ,L\right\} \) such that \(p_0:= \frac{\ell _0}{L} \ge p_L(d)\):
Proof
We have
The first equality is valid because the L random observation subsamples are disjoint. Therefore, their joint distribution is the same as that of L independent samples of size \(\left\lfloor \frac{N}{L} \right\rfloor \); thus \((S^{{\mathrm {base}}}(\ell ,1))_{1\le \ell \le L}\) has the same distribution as L independent copies of the variable \(S_L^{{\mathrm {base}}}\). The last inequality is the Chernoff binomial bound. Using Markov’s inequality we get (13).
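For reference, the Chernoff binomial bound invoked here states that for \(B \sim \mathrm {Bin}(L, p_L(d))\) with \(p_L(d) \le p_0\),
\[ {\mathbb {P}}\left[ B \ge L p_0 \right] \le \exp \left( -L\, D\left( p_0, p_L(d)\right) \right) , \quad \text {where } D(a,b) := a \ln \frac{a}{b} + (1-a) \ln \frac{1-a}{1-b} \]
is the Kullback–Leibler divergence between Bernoulli distributions with parameters a and b, matching the quantity \(D(p_0,\cdot )\) used below.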
Proof of Theorem 1
We relate \(\Pi ^{SFS}\) to \(\Pi ^{{\mathrm {simult}}}\) and apply an exponential inequality on \(\Pi ^{{\mathrm {simult}}}\). For any \(d\in A_{\theta ,L}\), it holds by definition of \(A_{\theta ,L}\) and the assumptions on \(p_0\) that \(p_L(d)\le \theta \le p_0\), hence it holds by Lemma 1 and Lemma 2 that
where we have used \(\xi :=\frac{L\tau -\ell _0+1}{L-\ell _0+1}\) . This result generalizes Shah and Samworth (2013, Lemma 5). Hence
Since \(x\rightarrow \exp ( -L D(p_0,x))\) is non-decreasing, we obtain the first part of the result by upper bounding for all \(d \in A_{\theta ,L}\):
For the second part, we use the upper bound
since the function \(x\mapsto \frac{\exp \left( -L D\left( p_0, x\right) \right) }{x}\) can be shown to be non-decreasing for \(x \le p_0 - L^{-1}\) . Finally, summing over \(d \in A_{\theta ,L}\), observe
leading to the desired conclusion. Equations (6) and (7) are proved similarly. \(\square \)
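The monotonicity claim used above can be verified directly: taking logarithms and differentiating,
\[ \frac{d}{dx}\left[ -L\, D(p_0,x) - \ln x\right] = -L\,\frac{x-p_0}{x(1-x)} - \frac{1}{x} = \frac{1}{x}\left( \frac{L(p_0-x)}{1-x} - 1\right) , \]
which is non-negative as soon as \(p_0 - x \ge \frac{1-x}{L}\); since \(1-x \le 1\), this holds whenever \(x \le p_0 - L^{-1}\).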
Proof of Corollary 1
This follows the same argument as in Shah and Samworth (2013). If the variable selection were completely at random, the marginal selection probability of any given covariate would be \(\frac{q_L }{D}\), where we recall that \(q_L =\mathbb {E}\left[ \left|S^{{\mathrm {base}}}_L \right|\right] \) is the average number of covariates selected by the base method. As we assume that the selection probability of a signal covariate is better than random, it follows that for any \(d \in {\mathcal {N}}^C\) we must have \(p_L(d)> \frac{q_L }{D}\). Conversely, as all noise covariates have the same probability of being selected by the base method, one has \(p_L(d)< \frac{q_L }{D}\) for any \(d \in {\mathcal {N}}\). Therefore, with \(\theta :=\frac{q_L }{D}\) we must have \(A_{\theta ,L}={\mathcal {N}}\) and \(A_{\theta ,L}^c={\mathcal {N}}^C\). Inequality (4) therefore implies (8), wherein we have taken a minimum over the range of \(\ell _0\) allowed in Theorem 1. \(\square \)
1.2 Proofs of Sect. 3.2
Proof of Theorem 2
We can first bound the error probability from above by omitting \(Q_d\):
as \(D\rightarrow \infty \). If \(\eta =0\), the conclusion is therefore established; in the remainder of the proof we hence assume \(\eta >0\). We defer to the end of the proof the case \(\eta =1\) and assume for now that \(\eta \in (0,1)\). Then \(\frac{|A_{D,\theta }|}{D} \rightarrow \eta \in (0,1)\) implies both \(|A_{D,\theta }| \rightarrow \infty \) and \(|A^c_{D,\theta }|\rightarrow \infty \), as well as \(\frac{|A_{D,\theta }^c|}{|A_{D,\theta }|} \rightarrow \gamma := \frac{1-\eta }{\eta }\). We return to the error probability and bound it from below by using \(Q_d\ge 0\) for \(d \in A_{D,\theta }\) and \(Q_d\le M\) for \(d\in A_{D,\theta }^c\):
Since the distribution of \(\varepsilon _i\) belongs to MDA(Fréchet \((\alpha )\)), from classical results of extreme value theory (Embrechts 1997, Theorem 3.3.7) we know that there exists a slowly varying function L so that, if we denote \(G(x):=x^{1/\alpha }L(x)\), then
in the sense of convergence in distribution, as \(D\rightarrow \infty \). Following on (14):
As L is slowly varying, we have \(\frac{L(a x)}{L(x)} \rightarrow 1\) as \(x \rightarrow \infty \), uniformly for a belonging to a bounded interval of the positive real axis (Embrechts 1997, Theorem A 3.2). We deduce
as \(D\rightarrow \infty \). We apply Slutsky’s theorem (Embrechts 1997, Example A2.7) to Eqs. (16) and (15) to obtain
in distribution, where Fréchet \((\alpha ,c)\) denotes the Fréchet \((\alpha )\) distribution rescaled by a factor \(c>0\).
Further, slow variation of L implies that L(x) is asymptotically negligible with respect to any positive power of x, so that
As the maxima in \(\max _{d \in A_{D,\theta }} \varepsilon _d\) and \(\max _{d \in A_{D,\theta }^C}\varepsilon _d\) are taken over disjoint sets of independent random variables, they are independent. Since they converge marginally in distribution, they also converge jointly and their difference converges due to the continuous mapping theorem (Embrechts 1997, Theorem A2.6). Combining with (17) and using Slutsky’s theorem again, we conclude that
converges in distribution to the difference of two independent Fréchet distributed random variables. This convergence implies the convergence of the c.d.f. for all continuity points (Embrechts 1997, Eq. A.1). As the limiting distribution is continuous, we finally obtain
where \(F,F'\) are independent Fréchet \((\alpha )\) random variables. To identify the value of this lower bound, observe that it is also the limiting value of
Indeed, it suffices to repeat the above argument, except for skipping inequality (14). Hence this limiting value is exactly equal to \(\eta \).
Finally, for the case \(\eta =1\), observe that the above argument remains valid provided \(|A_{D,\theta }^c|\rightarrow \infty \). If this is not the case (i.e. \(|A_{D,\theta }^c|\) remains bounded), the conclusion holds a fortiori, since we could replace \(A_{D,\theta }^c\) by a slightly larger set of cardinality \(\ln (D)\) (say), which can only decrease the lower bound while still yielding the above limiting value. \(\square \)
Proof of Theorem 3
To show the convergence of the error probability we use similar arguments as in the proof of Theorem 2. From classical results of extreme value theory for independent standard normal random variables \((\zeta _k)_{k \in \mathbb {N}}\) (Embrechts 1997, Example 3.3.29) it holds that
in distribution, where \(a_k:= \frac{1}{\sqrt{2\ln (k)}}\) and \(b_k:=\sqrt{2\ln (k)} - \frac{\ln (4\pi )+\ln (\ln (k))}{2\sqrt{2\ln (k)}}\). Below, to clarify the argument, we introduce two independent sequences \((\zeta _k)_{k \in \mathbb {N}}\) and \((\zeta '_k)_{k \in \mathbb {N}}\) of independent standard normal variables.
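As a quick numerical sanity check of these normalising constants (the simulation parameters and function names are our own choices), one can compare the empirical law of the normalised maximum of k standard normals with its Gumbel limit:

```python
import numpy as np

def gumbel_constants(k):
    """Normalising constants for the maximum of k standard normals
    (Embrechts 1997, Example 3.3.29): (max - b_k) / a_k converges to a
    Gumbel distribution as k grows."""
    ln_k = np.log(k)
    a_k = 1.0 / np.sqrt(2.0 * ln_k)
    b_k = np.sqrt(2.0 * ln_k) \
        - (np.log(4.0 * np.pi) + np.log(ln_k)) / (2.0 * np.sqrt(2.0 * ln_k))
    return a_k, b_k

def simulate(k, reps, seed=0):
    """Draw `reps` independent maxima of k standard normals and return the
    normalised values (m - b_k) / a_k."""
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((reps, k)).max(axis=1)
    a_k, b_k = gumbel_constants(k)
    return (m - b_k) / a_k
```

For moderate k the empirical c.d.f. of the normalised maximum is already close to the Gumbel c.d.f. \(x \mapsto \exp (-e^{-x})\), although the convergence is slow (of order \(1/\ln k\)), which is why the second-order term in \(b_k\) matters.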
Denote \(k_D:=|A_{D,\theta '}|\), \(\ell _D:= |A_{D,\theta }^c|\) and \(\varDelta :=\theta -\theta '>0\). Since \(A_{D,\theta }^c \subseteq A_{D,\theta '}^c\) we can bound the error probability from above as follows:
The last equality holds since \(A_{D,\theta '}\) and \(A_{D,\theta }^c\) are disjoint; it is purely formal but notationally convenient for the sequel. Now denote \(k'_D := \max (k_D,\ell _D)\) . The above implies
We treat the different terms in the above upper bound. We have
Noting that \(b_k = \sqrt{2 \ln k} + o(1)\), we have
To check that the last inequality holds, note that Assumption (ii) of the Theorem states that \(\liminf \frac{\ell _D}{D} =: \eta >0\); in particular \(\ell _D \rightarrow \infty \). On the other hand, since \(A_{D,\theta '} \subseteq A_{D,\theta }\), we have \(\limsup \frac{k_D}{D} \le 1-\eta \), therefore \(\limsup \frac{k_D}{\ell _D} \le \frac{1-\eta }{\eta }=:\gamma \) and finally \(\limsup \frac{k'_D}{\ell _D} \le \max (\gamma ,1)\). Since \(\ln \frac{k'_D}{\ell _D} \ge 0\), \( \limsup \ln \frac{k'_D}{\ell _D} \le \left( \ln \gamma \right) _+\), and the second factor in (18) is positive and upper bounded by 1, the whole term in (18) is O(1), so that Inequality (19) follows.
We deduce that for any \(B>0\) and for D large enough \(\frac{b_{\ell _D} - b_{k'_D}+\varDelta }{a_{\ell _D}} >B\) holds, and we have
By similar arguments as in the proof of Theorem 2, the latter upper bound converges to \({\mathbb {P}}[ G - G'>B ]\), where \(G,G'\) are two independent Gumbel random variables. As B is arbitrary we come to the announced conclusion. \(\square \)
Beinrucker, A., Dogan, Ü. & Blanchard, G. Extensions of stability selection using subsamples of observations and covariates. Stat Comput 26, 1059–1077 (2016). https://doi.org/10.1007/s11222-015-9589-y