Abstract
We introduce extensions of stability selection, a method to stabilise variable selection methods introduced by Meinshausen and Bühlmann (J R Stat Soc 72:417–473, 2010). We propose to apply a base selection method repeatedly to random subsamples of the observations and random subsets of the covariates under scrutiny, and to select covariates based on their selection frequency. We analyse the effects and benefits of these extensions. Our analysis generalizes the theoretical results of Meinshausen and Bühlmann (J R Stat Soc 72:417–473, 2010) from the case of half-samples to subsamples of arbitrary size. We study, in a theoretical manner, the effect of taking random covariate subsets using a simplified score model. Finally, we validate these extensions through numerical experiments on both synthetic and real datasets, and compare the results in detail to those of the original stability selection method.
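The scheme summarised above can be sketched in a few lines. This is an illustrative sketch only: the base method below is a toy correlation-based selector standing in for the Lasso used in the paper, and all function and parameter names are our own.

```python
import numpy as np

def base_select(X, y, q):
    """Toy base method: return the indices (within X's columns) of the q
    covariates whose columns correlate most strongly with y.  The paper uses
    the Lasso as base method; this stand-in keeps the sketch dependency-free."""
    scores = np.abs(X.T @ (y - y.mean()))
    return np.argsort(scores)[-q:]

def extended_stability_selection(X, y, q=5, T=50, L=2, V=2, seed=None):
    """Sketch of the extended scheme: in each of T repetitions, split the N
    observations into L disjoint subsamples and the D covariates into V random
    groups, run the base method on every (subsample, covariate-group) pair,
    and return each covariate's selection frequency over its T*L runs."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    counts = np.zeros(D)
    for _ in range(T):
        obs = rng.permutation(N)   # random split of observations
        cov = rng.permutation(D)   # random split of covariates
        for l in range(L):
            rows = obs[l::L]
            for v in range(V):
                cols = cov[v::V]
                picked = base_select(X[np.ix_(rows, cols)], y[rows],
                                     min(q, len(cols)))
                counts[cols[picked]] += 1  # map back to original indices
    return counts / (T * L)
```

Covariates whose selection frequency exceeds a threshold are then retained, as in the original stability selection method.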
Notes
We used the R package LARS (Hastie and Efron 2012) as the Lasso implementation.
References
Alexander, D.H., Lange, K.: Stability selection for genome-wide association. Genet. Epidemiol. 35(7), 722–728 (2011)
Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
Bach, F.R.: Bolasso: model consistent Lasso estimation through the bootstrap. In: Proceedings of 25th International Conference on Machine Learning (ICML), pp. 33–40. ACM (2008)
Beinrucker, A., Dogan, U., Blanchard, G.: Early stopping for mutual information based feature selection. In: Proceedings of 21st International Conference on Pattern Recognition (ICPR), pp. 975–978 (2012a)
Beinrucker, A., Dogan, U., Blanchard, G.: A simple extension of stability feature selection. In: Pattern Recognition, vol. 7476 of Lecture Notes in Computer Science, pp. 256–265. Springer, New York (2012b)
Bi, J., Bennett, K., Embrechts, M., Breneman, C., Song, M.: Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res. 3, 1229–1243 (2003)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Bühlmann, P., Yu, B.: Analyzing bagging. Ann. Stat. 30(4), 927–961 (2002)
Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.-H.: Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Inference 143(11), 1835–1858 (2013)
Cover, T.M., Thomas, J.A.: Elements of Information Theory, second edn. Wiley-Interscience, New York (2006)
Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979)
Embrechts, P.: Modelling Extremal Events: For Insurance and Finance, volume 33 of Stochastic Modelling and Applied Probability. Springer, New York (1997)
Escudero, G., Marquez, L., Rigau, G.: Boosting applied to word sense disambiguation. In: Proceedings of European Conference on Machine Learning (ECML), pp. 129–141 (2000)
Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
Guyon, I.: Feature Extraction: Foundations and Applications, vol. 207. Springer, New York (2006)
Hastie, T., Efron, B.: LARS: Least Angle Regression, Lasso and Forward Stagewise (2012). URL http://CRAN.R-project.org/package=lars. R package version 1.1
Haury, A.-C., Mordelet, F., Vera-Licona, P., Vert, J.-P.: Tigress: trustful inference of gene regulation using stability selection. BMC Syst. Biol. 6(1), 145 (2012)
He, Q., Lin, D.-Y.: A variable selection method for genome-wide association studies. Bioinformatics 27(1), 1–8 (2011)
He, Z., Yu, W.: Stable feature selection for biomarker discovery. Comput. Biol. Chem. 34(4), 215–225 (2010)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
Leadbetter, M.R., Lindgren, G., Rootzén, H.: Extremes and Related Properties of Random Sequences and Processes. Springer Series in Statistics. Springer, New York (1983)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Lounici, K.: Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2, 90–102 (2008)
MASH Consortium: The MASH project. http://www.mash-project.eu (2012). [Online; Accessed 19 Mar 2013]
Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. 72(4), 417–473 (2010)
Meinshausen, N., Yu, B.: Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 37(1), 246–270 (2009)
Politis, D.N., Romano, J.P., Wolf, M.: Subsampling. Springer Series in Statistics. Springer, New York (1999)
Sauerbrei, W., Schumacher, M.: A bootstrap resampling procedure for model building: application to the Cox regression model. Stat. Med. 11(16), 2093–2109 (1992)
Schapire, R., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)
Shah, R.D., Samworth, R.J.: Variable selection with error control: another look at stability selection. J. R. Stat. Soc. 75(1), 55–80 (2013)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288 (1996)
Wang, S., Nan, B., Rosset, S., Zhu, J.: Random Lasso. Ann. Appl. Stat. 5(1), 468–485 (2011)
Acknowledgments
We are extremely grateful to Nicolai Meinshausen and Peter Bühlmann for communicating to us the R-code used by Meinshausen and Bühlmann (2010) as well as for numerous discussions. We are indebted to Richard Samworth and Rajen Shah for numerous discussions and for hosting the first author during part of this work. We thank Maurilio Gutzeit for helping us with part of the numerical experiments.
Additional information
A preliminary version of this work was presented at the conference DAGM 2012 (Beinrucker et al. 2012b).
Appendix: Proofs of theoretical results
1.1 Proofs of Sect. 3.1
For notational convenience we use the shorthand \(S^{{\mathrm {base}}}(\ell ,t)\equiv S^{{\mathrm {base}}}(X^{(\mathcal {S}(\ell ,t))},Y^{(\mathcal {S}(\ell ,t))})\). To prove Theorem 1 and Corollary 1 we need some notation and two lemmas. We define
\[ \Pi ^{{\mathrm {simult}}}_{\ell _0}(d) := \frac{1}{T}\sum _{t=1}^{T} {\mathbb {1}}\left\{ \sum _{\ell =1}^{L} {\mathbb {1}}\left\{ d\in S^{{\mathrm {base}}}(\ell ,t)\right\} \ge \ell _0\right\} , \]
the ratio of repetitions out of T in which covariate d has been selected in at least \(\ell _0\) subsamples simultaneously.
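As a concrete reading of these two quantities (the Boolean array layout is our assumption), the following sketch computes \(\Pi ^{SFS}\) and \(\Pi ^{{\mathrm {simult}}}_{\ell _0}\) for a single covariate:

```python
import numpy as np

def pi_sfs(sel):
    """Selection frequency of one covariate d: sel[t, l] is True when the
    base method selected d on subsample l of repetition t (layout is our
    assumption).  Averages over all T*L base-method runs."""
    return sel.mean()

def pi_simult(sel, l0):
    """Ratio of the T repetitions in which d was selected in at least l0
    of the L subsamples simultaneously."""
    return (sel.sum(axis=1) >= l0).mean()
```

Per repetition, the selected fraction is at most \(\frac{\ell _0-1}{L}\) when the simultaneous count stays below \(\ell _0\), and at most 1 otherwise; averaging over repetitions gives \(\Pi ^{SFS} \le \frac{\ell _0-1}{L} + \frac{L-\ell _0+1}{L}\,\Pi ^{{\mathrm {simult}}}_{\ell _0}\), which rearranges to the threshold \(\xi \) appearing in the proof of Theorem 1.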
Lemma 1
(Relation of \(\Pi ^{{\mathrm {simult}}}\) and \(\Pi ^{SFS}\)) It holds for any \(d\in {\mathcal {F}}\):
Proof
We have for all repetitions of drawings of subsamples \(t=1,\ldots ,T\):
Averaging over the repetitions \(t=1,\ldots ,T\) , we obtain
\(\square \)
Lemma 2
(Exponential inequality for \(\Pi ^{{\mathrm {simult}}}\)) The following inequality holds for any \(d\in {\mathcal {F}}\), \(\xi >0\), and \(\ell _0\in \left\{ 1,\ldots ,L\right\} \) such that \(p_0:= \frac{\ell _0}{L} \ge p_L(d)\):
Proof
We have
The first equality is valid because the L random observation subsamples are disjoint. Therefore, their joint distribution is the same as that of L independent samples of size \(\left\lfloor \frac{N}{L} \right\rfloor \); thus \((S^{{\mathrm {base}}}(\ell ,1))_{1\le \ell \le L}\) has the same distribution as L independent copies of the variable \(S_L^{{\mathrm {base}}}\). The last inequality is the Chernoff binomial bound. Using Markov’s inequality we get (13).
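For reference, the Chernoff binomial bound invoked here states that for \(B \sim \mathrm {Bin}(L, p_L(d))\) with \(p_L(d) \le p_0\),
\[ {\mathbb {P}}\left[ B \ge L p_0 \right] \le \exp \left( -L\, D\left( p_0, p_L(d)\right) \right) , \quad \text {where } D(a,b) := a \ln \frac{a}{b} + (1-a) \ln \frac{1-a}{1-b} \]
is the Kullback–Leibler divergence between Bernoulli distributions with parameters a and b, matching the quantity \(D(p_0,\cdot )\) used below.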
Proof of Theorem 1
We relate \(\Pi ^{SFS}\) to \(\Pi ^{{\mathrm {simult}}}\) and apply an exponential inequality on \(\Pi ^{{\mathrm {simult}}}\). For any \(d\in A_{\theta ,L}\), it holds by definition of \(A_{\theta ,L}\) and the assumptions on \(p_0\) that \(p_L(d)\le \theta \le p_0\), hence it holds by Lemma 1 and Lemma 2 that
where we have used \(\xi :=\frac{L\tau -\ell _0+1}{L-\ell _0+1}\) . This result generalizes Shah and Samworth (2013, Lemma 5). Hence
Since \(x\rightarrow \exp ( -L D(p_0,x))\) is non-decreasing, we obtain the first part of the result by upper bounding for all \(d \in A_{\theta ,L}\):
For the second part, we use the upper bound
since the function \(x\mapsto \frac{\exp \left( -L D\left( p_0, x\right) \right) }{x}\) can be shown to be non-decreasing for \(x \le p_0 - L^{-1}\) . Finally, summing over \(d \in A_{\theta ,L}\), observe
leading to the desired conclusion. Equations (6) and (7) are proved similarly. \(\square \)
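The monotonicity claim used above can be verified directly: taking logarithms and differentiating,
\[ \frac{d}{dx}\left[ -L\, D(p_0,x) - \ln x\right] = -L\,\frac{x-p_0}{x(1-x)} - \frac{1}{x} = \frac{1}{x}\left( \frac{L(p_0-x)}{1-x} - 1\right) , \]
which is non-negative as soon as \(p_0 - x \ge \frac{1-x}{L}\); since \(1-x \le 1\), this holds whenever \(x \le p_0 - L^{-1}\).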
Proof of Corollary 1
This follows the same argument as in Shah and Samworth (2013). If the variable selection were completely at random, the marginal selection probability of any given covariate would be \(\frac{q_L }{D}\), where we recall that \(q_L =\mathbb {E}\left[ \left|S^{{\mathrm {base}}}_L \right|\right] \) is the average number of covariates selected by the base method. As we assume that the selection probability of a signal covariate is better than random, it follows that for any \(d \in {\mathcal {N}}^C\) we must have \(p_L(d)> \frac{q_L }{D}\). Conversely, as all noise covariates have the same probability of being selected by the base method, one has \(p_L(d)< \frac{q_L }{D}\) for any \(d \in {\mathcal {N}}\). Therefore, with \(\theta :=\frac{q_L }{D}\) we must have \(A_{\theta ,L}={\mathcal {N}}\) and \(A_{\theta ,L}^c={\mathcal {N}}^C\). Inequality (4) therefore implies (8), wherein we have taken a minimum over the range of \(\ell _0\) allowed in Theorem 1. \(\square \)
1.2 Proofs of Sect. 3.2
Proof of Theorem 2
We can first bound the error probability from above by omitting \(Q_d\):
as \(D\rightarrow \infty \). If \(\eta =0\), the conclusion is therefore established; in the remainder of the proof we hence assume \(\eta >0\). We defer to the end of the proof the case \(\eta =1\) and assume for now that \(\eta \in (0,1)\). Then \(\frac{|A_{D,\theta }|}{D} \rightarrow \eta \in (0,1)\) implies both \(|A_{D,\theta }| \rightarrow \infty \) and \(|A^c_{D,\theta }|\rightarrow \infty \), as well as \(\frac{|A_{D,\theta }^c|}{|A_{D,\theta }|} \rightarrow \gamma := \frac{1-\eta }{\eta }\). We return to the error probability and bound it from below by using \(Q_d\ge 0\) for \(d \in A_{D,\theta }\) and \(Q_d\le M\) for \(d\in A_{D,\theta }^c\):
Since the distribution of \(\varepsilon _i\) belongs to MDA(Fréchet \((\alpha )\)), from classical results of extreme value theory (Embrechts 1997, Theorem 3.3.7) we know that there exists a slowly varying function L so that, if we denote \(G(x):=x^{1/\alpha }L(x)\), then
in the sense of convergence in distribution, as \(D\rightarrow \infty \). Following on (14):
As L is slowly varying, we have \(\frac{L(a x)}{L(x)} \rightarrow 1\) as \(x \rightarrow \infty \), uniformly for a belonging to a bounded interval of the positive real axis (Embrechts 1997, Theorem A 3.2). We deduce
as \(D\rightarrow \infty \). We apply Slutsky’s theorem (Embrechts 1997, Example A2.7) to Eqs. (16) and (15) to obtain
in distribution, where Fréchet \((\alpha ,c)\) denotes the Fréchet \((\alpha )\) distribution rescaled by a factor \(c>0\).
Further, slow variation of L implies that L(x) is asymptotically negligible with respect to any positive power of x, so that
As the maxima in \(\max _{d \in A_{D,\theta }} \varepsilon _d\) and \(\max _{d \in A_{D,\theta }^C}\varepsilon _d\) are taken over disjoint sets of independent random variables, they are independent. Since they converge marginally in distribution, they also converge jointly and their difference converges due to the continuous mapping theorem (Embrechts 1997, Theorem A2.6). Combining with (17) and using Slutsky’s theorem again, we conclude that
converges in distribution to the difference of two independent Fréchet distributed random variables. This convergence implies the convergence of the c.d.f. for all continuity points (Embrechts 1997, Eq. A.1). As the limiting distribution is continuous, we finally obtain
where \(F,F'\) are independent Fréchet \((\alpha )\) random variables. To identify the value of this lower bound, observe that it is also the limiting value of
Indeed, it suffices to repeat the above argument, except for skipping inequality (14). Hence this limiting value is exactly equal to \(\eta \).
Finally, for the case \(\eta =1\), observe that the above argument remains valid provided \(|A_{D,\theta }^c|\rightarrow \infty \). If this is not the case (i.e. \(|A_{D,\theta }^c|\) remains bounded), the conclusion holds a fortiori, since we could replace \(A_{D,\theta }^c\) by a slightly larger set of cardinality \(\ln (D)\) (say), which can only decrease the lower bound while still yielding the above limiting value. \(\square \)
Proof of Theorem 3
To show the convergence of the error probability we use similar arguments as in the proof of Theorem 2. From classical results of extreme value theory for independent standard normal random variables \((\zeta _k)_{k \in \mathbb {N}}\) (Embrechts 1997, Example 3.3.29) it holds that
in distribution, where \(a_k:= \frac{1}{\sqrt{2\ln (k)}}\) and \(b_k:=\sqrt{2\ln (k)} - \frac{\ln (4\pi )+\ln (\ln (k))}{2\sqrt{2\ln (k)}}\). Below, to clarify the argument, we introduce two independent sequences \((\zeta _k)_{k \in \mathbb {N}}\) and \((\zeta '_k)_{k \in \mathbb {N}}\) of independent standard normal variables.
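As a quick numerical sanity check of these normalising constants (the simulation parameters and function names are our own choices), one can compare the empirical law of the normalised maximum of k standard normals with its Gumbel limit:

```python
import numpy as np

def gumbel_constants(k):
    """Normalising constants for the maximum of k standard normals
    (Embrechts 1997, Example 3.3.29): (max - b_k) / a_k converges to a
    Gumbel distribution as k grows."""
    ln_k = np.log(k)
    a_k = 1.0 / np.sqrt(2.0 * ln_k)
    b_k = np.sqrt(2.0 * ln_k) \
        - (np.log(4.0 * np.pi) + np.log(ln_k)) / (2.0 * np.sqrt(2.0 * ln_k))
    return a_k, b_k

def simulate(k, reps, seed=0):
    """Draw `reps` independent maxima of k standard normals and return the
    normalised values (m - b_k) / a_k."""
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((reps, k)).max(axis=1)
    a_k, b_k = gumbel_constants(k)
    return (m - b_k) / a_k
```

For moderate k the empirical c.d.f. of the normalised maximum is already close to the Gumbel c.d.f. \(x \mapsto \exp (-e^{-x})\), although the convergence is slow (of order \(1/\ln k\)), which is why the second-order term in \(b_k\) matters.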
Denote \(k_D:=|A_{D,\theta '}|\), \(\ell _D:= |A_{D,\theta }^c|\) and \(\varDelta :=\theta -\theta '>0\). Since \(A_{D,\theta }^c \subseteq A_{D,\theta '}^c\) we can bound the error probability from above as follows:
The last equality holds since \(A_{D,\theta '}\) and \(A_{D,\theta }^c\) are disjoint; it is purely formal but notationally convenient for the sequel. Now denote \(k'_D := \max (k_D,\ell _D)\) . The above implies
We treat the different terms in the above upper bound. We have
Noting that \(b_k = \sqrt{2 \ln k} + o(1)\), we have
To check that the last inequality holds, note that Assumption (ii) of the Theorem states that \(\liminf \frac{\ell _D}{D} =: \eta >0\); in particular \(\ell _D \rightarrow \infty \). On the other hand, since \(A_{D,\theta '} \subseteq A_{D,\theta }\), we have \(\limsup \frac{k_D}{D} \le 1-\eta \), therefore \(\limsup \frac{k_D}{\ell _D} \le \frac{1-\eta }{\eta }=:\gamma \) and finally \(\limsup \frac{k'_D}{\ell _D} \le \max (\gamma ,1)\). Since \(\ln \frac{k'_D}{\ell _D} \ge 0\), \( \limsup \ln \frac{k'_D}{\ell _D} \le \left( \ln \gamma \right) _+\), and the second factor in (18) is positive and upper bounded by 1, the whole term in (18) is O(1), so that Inequality (19) follows.
We deduce that for any \(B>0\) and for D large enough \(\frac{b_{\ell _D} - b_{k'_D}+\varDelta }{a_{\ell _D}} >B\) holds, and we have
By similar arguments as in the proof of Theorem 2, the latter upper bound converges to \({\mathbb {P}}[ G - G'>B ]\), where \(G,G'\) are two independent Gumbel random variables. As B is arbitrary we come to the announced conclusion. \(\square \)
Beinrucker, A., Dogan, Ü. & Blanchard, G. Extensions of stability selection using subsamples of observations and covariates. Stat Comput 26, 1059–1077 (2016). https://doi.org/10.1007/s11222-015-9589-y