Loss-guided Stability Selection

In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models computed on subsamples of the training data and subsequently choosing a stable predictor set, which is usually much sparser than the predictor sets of the raw models. Standard Stability Selection is based on a global criterion, namely the per-family error rate, and additionally requires expert knowledge to configure the hyperparameters suitably. Since model selection depends on the loss function, i.e., predictor sets selected w.r.t. one loss function differ from those selected w.r.t. another, we propose a Stability Selection variant that respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, they avoid the severe underfitting that affects the original Stability Selection on noisy high-dimensional data: our priority is not to avoid false positives at all costs but to deliver a sparse stable model with which one can make predictions. Experiments covering both regression and binary classification, with Boosting as the model selection algorithm, reveal a significant precision improvement compared to raw Boosting models without suffering from any of the mentioned issues of the original Stability Selection.


Introduction
The first milestone in high-dimensional data analysis was achieved when the Lasso (Tibshirani [1994]) was introduced, performing automated sparse variable selection for the squared loss together with a shrinkage of the coefficients. Several years later, Bühlmann and Yu proposed L2-Boosting (Bühlmann and Yu [2003], see also Bühlmann [2006]), a sophisticated forward step-wise approach that combines simple linear regression models and turned out to be competitive with the Lasso; see for example Efron et al. [2004], Bühlmann and Hothorn [2007], Bühlmann [2006] for experiments and discussions of the differences between the Lasso and L2-Boosting. However, although both algorithms have attractive asymptotic properties (Bühlmann [2006], Bühlmann and Van De Geer [2011]), they still show overfitting behaviour in practice. This issue led to several modifications of the Lasso, for example the Adaptive Lasso, a two-stage procedure (Zou [2006]) whose second stage thins out the set of variables selected in the first stage, or the Multistep Adaptive Lasso, originally introduced in Bühlmann et al. [2008], which extends the Adaptive Lasso and asymptotically performs log-regularized least squares minimization.
As for Boosting, the variant Sparse Boosting (Bühlmann and Yu [2006]) has been proposed, which respects the model complexity, measured by the trace of the corresponding Boosting operator, leading to sparser models. Nevertheless, a standard approach to prevent Boosting models from overfitting is a suitable choice of the number of iterations. Since iterating the Boosting algorithm until convergence is not reasonable in terms of sparse variable selection, one tries to stop the algorithm before convergence. For this method, called "early stopping", a considerable amount of work already exists, see for example Bühlmann and Yu [2003], Zhang and Yu [2005], Bühlmann [2006] and Mayr et al. [2012]. Note that Hofner et al. [2015] point out that even Boosting with early stopping still tends to overfit.
The analogue, for Lasso-type models, of finding the optimal number of iterations in Boosting models is defining a suitably chosen regularization parameter. Although cross-validation is a standard approach, Meinshausen and Bühlmann [2010] point out that even for a sophisticatedly chosen grid Λ of regularization parameters, the optimal model may not be contained in the set { Ŝ(λ) | λ ∈ Λ}, where Ŝ(λ) is the set of variables chosen by the Lasso with regularization parameter λ. They suggest a very fruitful strategy called "Stability Selection" which essentially combines the variable selection procedure with bagging (Breiman [1996]) or, more precisely, subagging (Bühlmann and Yu [2002]), by aggregating models computed on subsamples. Only those variables that have been selected on sufficiently many subsamples enter the final stable model. The rationale behind this model aggregation procedure is to immunize the final model against peculiar sample configurations. Stability Selection is a very powerful method that can be applied to Lasso or Graphical Lasso models (Meinshausen and Bühlmann [2010]) and has also been extended to Boosting models in Hofner et al. [2015]. Stability Selection has been applied successfully in the context of gene expression (Meinshausen and Bühlmann [2010], Stekhoven et al. [2012], Shah and Samworth [2013], Hofner et al. [2015]), fMRI data (Ryali et al. [2012]) and voice activity detection (Hamaidi et al. [2017]), all of which are exemplary for settings with very few observations and a huge number of predictors.
Despite the magnificent success of Stability Selection, there are still open questions. First, the Lasso resp. Boosting models are computed on different (training) subsamples and these models are aggregated without ever performing an out-of-sample validation, which is known to be potentially misleading. Second, the choice of the tuning parameters of Stability Selection may require expert knowledge to be set appropriately. We show experimentally that, on noisy high-dimensional data, the error control in terms of the expected number of falsely selected variables to which both parameters are related ([Meinshausen and Bühlmann, 2010, Thm. 1]) seems to be too strict, possibly resulting in an empty model. Third, variable selection clearly depends on the loss function in the sense that model selection w.r.t. different loss functions ends up in different models, which is not reflected by the global per-family error rate criterion.
Therefore, we propose a loss-guided Stability Selection variant that includes a validation step after the subsample-specific models have been aggregated. In this validation step, we compare the performance w.r.t. the chosen loss function of different candidate stable models on an out-of-sample validation data set. The candidate stable models are computed using a pre-defined grid of either thresholds, i.e., the stable models include the variables whose aggregated selection frequencies exceed the respective threshold, or of cardinalities, i.e., the stable models consist of the respective number of variables with the highest aggregated selection frequencies. Then, the final stable model is the best-performing candidate stable model. Thanks to the sparseness of the candidate stable models, this validation step induces only very little computational overhead compared to the original Stability Selection. Motivated by the issue that, on very noisy data, even the relevant variables may be selected on a rather low fraction of the models, so that it may even happen that they attain lower selection frequencies than some noise variables, we propose another strategy where the grid search is replaced by an exhaustive search over the best variables, again with limited computational overhead. Summarizing, we focus on finding a stable model with which one can make predictions instead of concentrating on avoiding false positives.
This article is organized as follows. In Sec. 2, we recapitulate Gradient Boosting and strategies that have been proposed to improve the model selection quality on real data, and briefly summarize how Stability Selection works. Sec. 3 is devoted to our Stability Selection variants. In Sec. 4, we conduct experiments, both on simulated and real data, that show the potential of our Stability Selection variants. Sec. 5 concludes.

Boosting and variable selection
Let D = (X, Y) be a data set where X ∈ R^(n×p) is the predictor matrix and Y ∈ R^n is the corresponding response column. The rows X_i of the regressor matrix are contained in some regressor space X ⊂ R^p, the responses Y_i belong to some space Y ⊂ R. We assume that the instances (X_i, Y_i) are i.i.d. realizations of some underlying joint distribution. We denote the submatrix of X with all column indices in some index set J as X_(•,J). Let L be a loss function, which in this work is a map L : Y × Y → [0, ∞[.

The idea behind Boosting is to combine simple models ("weak learners", "base learners") that are easy to fit in order to generate a final "strong" learning model. The first algorithm of this kind was the AdaBoost algorithm (Freund and Schapire [1997]) for binary classification problems. It has been shown in Breiman [1999] that AdaBoost can be seen as a gradient descent algorithm for the exponential loss. Friedman et al. [2000] showed that AdaBoost can also be identified with a forward stage-wise procedure. We recapitulate the general functional gradient descent (FGD) algorithm in Alg. 1 (cf. Bühlmann and Hothorn [2007]).
In each iteration, Alg. 1 computes the negative gradients U_i = −∂L(Y_i, f)/∂f, evaluated at the current model, for all i = 1, ..., n; treats the vector U = (U_i)_i as response and fits a model with a pre-selected learning algorithm as base procedure; and updates the current model via f ← f + κ · ĝ for a step length κ and the fitted base learner ĝ. The weak learners in the "base procedure" can, for example, be trees, smoothing splines or simple least squares models. The special case where the loss function is the squared loss is referred to as "L2-Boosting" in this article. For an overview of Gradient Boosting algorithms and their paradigms, we refer to Bühlmann and Yu [2003] and Bühlmann and Hothorn [2007].
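To make the squared-loss special case concrete, the following is a minimal sketch of componentwise L2-Boosting in the spirit of Alg. 1. The function name and all implementation details are our own illustration, not the implementation used in the experiments (which relies on the R-package mboost).

```python
import numpy as np

def l2_boost(X, y, m_iter=100, kappa=0.1):
    """Componentwise L2-Boosting (illustrative sketch): in each iteration,
    fit simple least squares to the current residuals for every single
    predictor and update the best-fitting component by a damped step."""
    n, p = X.shape
    intercept = y.mean()              # offset model f^(0)
    f = np.full(n, intercept)
    beta = np.zeros(p)
    for _ in range(m_iter):
        u = y - f                     # negative gradient of the squared loss
        denom = (X ** 2).sum(axis=0)
        coefs = X.T @ u / denom       # simple least squares fit per column
        sse = ((u[:, None] - X * coefs) ** 2).sum(axis=0)
        j = int(np.argmin(sse))       # base procedure: best single predictor
        beta[j] += kappa * coefs[j]   # damped update with step length kappa
        f += kappa * coefs[j] * X[:, j]
    return intercept, beta
```

Variables that are never chosen keep a zero coefficient, so stopping after m_iter iterations directly performs variable selection.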
The number m_iter of Boosting iterations becomes very important if a pure Boosting algorithm is applied without further variable selection criteria, due to the general overfitting behaviour of Boosting (see e.g. Bühlmann and Hothorn [2007]). More precisely, a small m_iter leads to biased models with a low variance, whereas performing many iterations reduces the bias but increases the variance (Mayr et al. [2012]). While Bühlmann and Yu [2003] considered the relative difference of mean squared errors as a stopping criterion, Bühlmann [2006] invoked the corrected AIC, minimizing it over a suitable set of values for m_iter. Stopping before convergence is referred to as "early stopping" (see Bühlmann and Hothorn [2007] and Mayr et al. [2012]).

Stability Selection
We now briefly recapitulate the Stability Selection introduced in Meinshausen and Bühlmann [2010]. Let Λ ⊂ R≥0 be the set of regularization parameters λ for Lasso-type algorithms. The estimated set of variables corresponding to a specific λ is denoted by Ŝ(λ). When tuning the algorithm, usually by defining a grid of candidate values for λ and fitting a model for each element of the grid, one essentially picks the model that behaves best w.r.t. some quality criterion. Following Meinshausen and Bühlmann [2010], variable selection by just choosing one element of the set { Ŝ(λ) | λ ∈ Λ} does not generally suffice due to the overestimating behaviour of algorithms like the Lasso. Instead, one draws B subsamples of size around n_sub = n/2, and only the variables that have been selected on sufficiently many subsamples are finally selected.
The probabilities Π̂_j(λ) := P(j ∈ Ŝ(λ)) are, for each j, approximated by computing the relative fraction of subsamples whose corresponding model contains variable j. Then, by fixing a cutoff π_thr, the Stability Selection defines the set of stable variables as Ŝ_stable := { j | max_(λ∈Λ) Π̂_j(λ) ≥ π_thr }. An important issue when selecting variables are type I errors, i.e., falsely selected variables.
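The aggregation step can be sketched as follows. Here `select` is a hypothetical stand-in for the base selector (the Lasso at a fixed λ, a Boosting model, ...) returning a set of column indices; the function name and all details are our own illustration.

```python
import numpy as np

def stability_selection(X, y, select, B=100, pi_thr=0.6, seed=0):
    """Aggregate variable selections over B subsamples of size ~n/2
    and keep the variables whose selection frequency reaches pi_thr."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsampling, not bootstrap
        for j in select(X[idx], y[idx]):
            freq[j] += 1
    freq /= B                                # aggregated selection frequencies
    stable = np.flatnonzero(freq >= pi_thr)  # stable set for cutoff pi_thr
    return stable, freq
```

Only variables passing the cutoff on a sufficiently large fraction of subsamples enter the stable model, immunizing it against peculiar sample configurations.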
[Meinshausen and Bühlmann, 2010, Thm. 1] provides an upper bound for the expected number of false positives. They additionally show under which assumptions exact error control is possible with Stability Selection even in high-dimensional settings.
Since the original Stability Selection of Meinshausen and Bühlmann [2010] was basically tailored to Lasso-type models, Hofner et al. [2015] provided a Stability Selection for Boosting models.
Again, one generates B subsamples and performs Boosting on each subsample. Each Boosting model is iterated until a pre-defined number Q of variables has been selected, respectively for each subsample. The number Q, the threshold π_thr (as in Eq. 2.1) and the per-family error rate are related via [Meinshausen and Bühlmann, 2010, Thm. 1], so after fixing two of these quantities, the third can be set reasonably. This Stability Selection is implemented in the R-package stabs (Hofner and Hothorn [2017], Hofner et al. [2014], Thomas et al. [2018]).
An extension of Stability Selection is introduced in Shah and Samworth [2013], who provide bounds for the type I error that are free from the exchangeability assumption and from the assumption, needed in Meinshausen and Bühlmann [2010], that the selection procedure is better than random guessing. Another variant of the Stability Selection of Meinshausen and Bühlmann [2010] has been proposed in Zhou et al. [2013], who criticize that the selection of the stable features is done according to max_(λ∈Λ)(Π̂_j(λ)). They suggest taking the average of the Π̂_j(λ) over the best k parameters λ, therefore calling their method Top-k Stability Selection. A Stability Selection including a filtering step, in the sense that the grid Λ is condensed to a sub-grid of regularization parameters corresponding to the lowest aggregated out-of-sample loss, has been suggested in [Yu and Kumbier, 2020, Sec. 8]. Lim and Yu [2016] proposed a method based on the so-called estimation stability which, however, is restricted to the choice of the regularization parameter for the Lasso. Similarity metrics that quantify this stability are discussed and proposed in [2017b].
The Bolasso algorithm (Bach [2008]) can be interpreted as a version of Stability Selection which is based on bootstrapping instead of subsampling. The final set of selected variables is the intersection of all selected sets. Therefore, one may identify the Bolasso with the threshold π_thr = 1.

A modified, loss-guided Stability Selection
Although the original Stability Selection has already led to magnificent results in the literature, there are still open questions. First, the original Stability Selection, including its variants in the literature, solely considers a training data set, so the stable model is completely based on in-sample losses.
Second, Bühlmann and Van De Geer [2011] and Hofner et al. [2015] have made recommendations for the choice of π_thr and advise not to give too much attention to it as long as it lies in a reasonable interval. On the other hand, issues with Stability Selection as well as a considerable hyperparameter sensitivity have already been reported in the literature (Li et al. [2013], Wang et al. [2020]). We believe that this parameter should be chosen in a data-driven way by respecting the out-of-sample performance of the resulting models, analogously to cross-validation techniques that find the optimal hyperparameter from a grid of hyperparameters. Furthermore, for a user it is much more intuitive to define the number of variables in the final model than to define a threshold, since there is no intuition about how many variables a particular threshold corresponds to. This has already been suggested in Zhou et al. [2013], but we additionally allow for a grid of such numbers so that the optimal number of stable variables is derived from the out-of-sample performance of the corresponding candidate stable models. The main feature of our aggregation and selection procedure is that it is not based on a global criterion like the PFER as in the original Stability Selection but adapted to the actual problem, in the sense that the loss function directly guides the selection of the stable model.
A further problem is that the otherwise excellent implementation of the Boosting-based Stability Selection, where the Boosting models are iterated until a given number of variables has been selected, backfires if the hyperparameters have not been chosen appropriately, since increasing this number may require recomputing all Boosting models, which is rather expensive on high-dimensional data. Another problem is that, on noisy data, even the relevant variables may be selected on only a rather low fraction of the resamples. Therefore, it can happen that noise variables are selected more frequently than the actually relevant variables which, if one concentrates solely on keeping the number of false positives low, can lead to overly sparse or even empty models.

A grid search for the optimal stable model
First, we partition the data D into a training set D_train ∈ R^(n_train×(p+1)) and a validation set D_val ∈ R^(n_val×(p+1)). The most generic procedure, which we also apply in this paper, is to draw B subsamples from the training data, to run the model selection algorithm on each of them, and to aggregate the selection frequencies

Π̂_j = B^(−1) Σ_(b=1)^B I(j ∈ Ŝ^(b))

for j = 1, ..., p, and either only take all variables j for which Π̂_j ≥ π_thr, or rank the components in descending order and take the first q ones. Thus, we produce one of the final sets of selected predictors

Ŝ_stab(q) = { j | Π̂_j ≥ Π̂_((p−q+1):p) },   (3.2)
Ŝ_stab(π_thr) = { j | Π̂_j ≥ π_thr },   (3.3)

where we denote the largest element of a vector x of length p by x_(p:p) and so forth.
We first decide whether to define the stable model according to q or to π_thr. In the case of adjusting q, we need a reasonable grid q_grid ⊂ N which clearly has to satisfy q_grid ⊆ {1, ..., p}. In the case of adjusting π_thr, we discretize a reasonable interval, w.l.o.g. ]0, 1], according to some mesh size ∆ > 0, so we get the grid

π_grid = {∆, 2∆, ..., 1}.   (3.4)

Then, we search for the optimal element of the grid by first computing the candidate stable model corresponding to each grid element according to Eq. 3.2 resp. Eq. 3.3. Then, using the validation data set D_val, we compute the final coefficient set (w.l.o.g. least squares coefficients, see Subsec. 3.3) for each candidate stable model and compute the loss on D_val. This guarantees that not only the predictor sets Ŝ^(b) are adapted to the loss function but that also the stable predictor set respects the loss appropriately. See Alg. 2 for an overview of our loss-guided Stability Selection.

Of course, [Meinshausen and Bühlmann, 2010, Thm. 1], which bounds the expected number of falsely selected variables in terms of q, π_thr and the number p of variables, may be applied ex post, where q = mean_b(|Ŝ^(b)|) is the average number of selected variables over the Boosting models. If q_opt is the optimal number q selected from the grid q_grid, the corresponding threshold lies in the interval ]Π̂_((p−q_opt):p), Π̂_((p−q_opt+1):p)], so if a grid as in Eq. 3.4 has been used, the error bound can be evaluated ex post under the assumptions in Bühlmann and Van De Geer [2011].

Remark 3.2. We recommend using q-grids instead of π-grids since they are more intuitive and since they guarantee that the Stability Selection does not end up in an empty model. It is important to note that the aggregated selection frequencies can of course still be inspected, so if we force the Stability Selection to output q stable variables, the analyst can investigate whether they attain acceptable aggregated selection frequencies.
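A minimal sketch of the q-grid variant, assuming the squared loss and least squares refits; the function name and details are our own illustration (the actual procedure also covers π-grids and other loss functions):

```python
import numpy as np

def loss_guided_selection(freq, X_tr, y_tr, X_val, y_val, q_grid):
    """For each q in the grid, take the q variables with the highest
    aggregated selection frequencies, refit least squares coefficients
    on the training data and evaluate the squared loss on D_val; the
    best-performing candidate becomes the final stable model."""
    order = np.argsort(freq)[::-1]          # descending selection frequency
    best_q, best_loss, best_set = None, np.inf, None
    for q in q_grid:
        S = np.sort(order[:q])              # candidate stable set of size q
        coef, *_ = np.linalg.lstsq(X_tr[:, S], y_tr, rcond=None)
        loss = np.mean((y_val - X_val[:, S] @ coef) ** 2)
        if loss < best_loss:
            best_q, best_loss, best_set = q, loss, S
    return best_set, best_q, best_loss
```

Since each candidate model is sparse, the refits and validation losses are cheap compared with computing the B subsample models.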

Post Stability Selection model selection
Consider, for simplicity, the situation that there are k_r relevant variables, indexed by i_(k_1), ..., i_(k_r), and k_n non-relevant variables, indexed by j_(k_1), ..., j_(k_n), and suppose that the noise variable j_(k_1) attains a higher aggregated selection frequency than some of the relevant variables. If the number q of stable variables is high, one then inevitably selects noise variables. These noise variables may decrease the out-of-sample performance, so our loss-guided Stability Selection may select only very few variables, namely those whose aggregated selection frequencies are larger than Π̂_(j_(k_1)). Of course, the same argument holds for more flexible settings, for example when a sequence of true variables is disturbed by a single noise variable in terms of aggregated selection frequencies. Due to strictly respecting the ordering of the variables in terms of their aggregated selection frequencies, there is no chance to select the true best model if at least one relevant variable has a lower aggregated selection frequency than at least one noise variable.
We suggest exploiting the advantage that stable variable sets (due to the rather low threshold that we use at this stage, we term them "meta-stable" variable sets) are usually rather sparse. Therefore, one can search over subsets of the meta-stable variable set directly, no longer strictly respecting the frequency ordering. Since the computation of the coefficients on the reduced data set is cheap, this strategy only induces an insignificant computational overhead, provided that it is based on a sufficiently low number q_0 of variables (see Alg. 3).
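Assuming the squared loss and least squares refits, the exhaustive search over a sparse meta-stable set can be sketched as follows (at most 2^(q_0) − 1 non-empty subsets are scored; names and details are our own illustration):

```python
from itertools import combinations
import numpy as np

def exhaustive_post_selection(meta_stable, X_tr, y_tr, X_val, y_val):
    """Exhaustive search over all non-empty subsets of the (sparse)
    meta-stable variable set; each subset is refitted by least squares
    and scored by its out-of-sample squared loss on D_val."""
    best_S, best_loss = None, np.inf
    for k in range(1, len(meta_stable) + 1):
        for S in combinations(meta_stable, k):
            S = list(S)
            coef, *_ = np.linalg.lstsq(X_tr[:, S], y_tr, rcond=None)
            loss = np.mean((y_val - X_val[:, S] @ coef) ** 2)
            if loss < best_loss:
                best_S, best_loss = S, loss
    return best_S, best_loss
```

Because the search ignores the frequency ordering inside the meta-stable set, a relevant variable with a lower aggregated selection frequency than some noise variable can still enter the final model.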

Final coefficients
We want to point out that Stability Selection aggregates models, but not coefficients. Therefore, the coefficients computed on the subsamples are not considered further. We suggest simply computing the coefficients that minimize the selected loss function on the stable model, e.g., least squares coefficients for the quadratic loss function. In the presumably very rare, but not impossible, case that the stable model still has more than n predictors, we would apply the underlying model selection algorithm again on the reduced data set in which only the stable predictors remain.
Note that Chernozhukov et al. [2018] propose a much more sophisticated strategy for computing the final coefficients by averaging coefficients computed from solving moment equations in a cross-validation scheme, which, of course, could also be applied here.
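A sketch of the final refit on the stable predictor set: the closed-form least squares branch matches the quadratic loss, while the plain gradient descent on the logistic deviance is our own simplification of a standard logit fit, not the implementation used in the paper.

```python
import numpy as np

def final_coefficients(X, y, stable, loss="squared", n_steps=5000, lr=0.1):
    """Refit coefficients on the stable predictor set only, minimizing
    the chosen loss: closed-form least squares for the squared loss,
    plain gradient descent on the logistic deviance for binary y."""
    Xs = np.column_stack([np.ones(len(X)), X[:, stable]])  # intercept + stable vars
    if loss == "squared":
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        return coef
    coef = np.zeros(Xs.shape[1])                 # logistic loss, y in {0, 1}
    for _ in range(n_steps):
        prob = 1.0 / (1.0 + np.exp(-Xs @ coef))
        coef -= lr * Xs.T @ (prob - y) / len(y)  # gradient of the mean deviance
    return coef
```

The subsample-specific coefficients are discarded; only the stable predictor set and the full training data enter this refit.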

Computational aspects
As for the computational complexity of both the original and our loss-guided Stability Selection, we can state the following lemma. i) The dominating cost is the computation of the B models on the subsamples, which is captured by O(B c_base(n, p)) according to the assumptions. ii) After having computed the meta-stable model, one additionally has to compute the coefficients and the performance measure for each considered predictor set, whose number is at most 2^(q_0) for the exhaustive search and definitely smaller for forward or backward strategies. This again leads to an additional complexity of O(np). Note that this argumentation only holds if the number q_0 does not grow with n or p, which is true if the assumption is valid.
iii) The only difference is that the selection of the (candidate) stable models cannot be done in O(p) steps due to the ordering procedure which requires O(p ln(p)) steps if quick sorting algorithms are applied.However, this quantity is captured by O(np) for p being of order exp(n).

Example 3.1. i) A variant of Lem. 3.1 clearly holds in general if the complexity of the model selection procedure dominates the coefficient computation and the loss evaluation. However, the O(np) complexity spelled out in the lemma is the most natural one for many coefficient estimation costs and holds for the concrete ones in this paper.
ii) It may sound counter-intuitive to require the complexity of the loss evaluation to be contained in O(np), since the loss only requires Y and Ŷ as input, so p does not appear. We formulated it in this way due to loss functions like ranking losses (see Werner [2021a] for an overview) which require O(n ln(n)) steps to be evaluated. Then, assuming that p is of order exp(n) maintains the assumption that the loss evaluation complexity is O(np). iii) For L2-Boosting models with m_iter Boosting iterations each, the complexity of the Stability Selection is O(B n_train p m_iter).

Remark 3.3. As for the storage complexity, the Stability Selection essentially only requires memorizing the data set and, for each b = 1, ..., B, the coefficient vector (or its logical counterpart with a TRUE wherever the coefficient is non-zero), leading to storage costs of order O(np + Bp). Additionally, one has to store at least the current subsample, which however can be deleted afterwards. In principle, the aggregated selection frequencies can be updated iteratively so that one only has to memorize the current coefficient vector or its logical counterpart. During the grid search, we only memorize the validation loss w.r.t. each element of the grid and finally report the coefficients of the optimal model and the aggregated selection frequencies.

Other potential applications
We want to emphasize that our Stability Selection is not limited to the Boosting models on which we focus in this paper but is clearly also applicable to model selection procedures like the Lasso and its variants, as done in Meinshausen and Bühlmann [2010], where the threshold in Eq. 2.1 resp. a corresponding number q would be selected in a data-driven way. As in Meinshausen and Bühlmann [2010], it can also be applied to sparse covariance estimation, for example with the Graphical Lasso (Banerjee et al. [2008], Friedman et al. [2008]). The stable model is in this case a set of edges, i.e., a subset of {(i, j) | i, j = 1, ..., p}. After having computed candidate stable models, one can proceed by computing the classical covariances based on the candidate edge sets and by evaluating the non-penalized log-likelihood, so that the stable model with the highest value is chosen as the final model.
A completely different setting is given by variable length Markov chains (VLMCs, see e.g. Bühlmann and Wyner [1999]), where the memory length for each transition probability depends on the realizations in the previous time steps. Model selection in VLMCs consists of finding the optimal variable lengths, for which Rissanen [1983] proposed the context tree algorithm, where first an overfitted maximal context tree is grown which is pruned afterwards. In this sense, the final context tree can be identified with a sparse model, i.e., a sparse VLMC representation of a full MC model of the respective order. A Stability Selection would aggregate the context trees fitted on suitable subsamples of the data to get a stable VLMC representation.

Data generation
We generate data by drawing the n rows X i of the regressor matrix independently from a multivariate normal distribution N p (M x , I p ) where I p denotes the identity matrix of dimension p×p and where M x = (µ x , µ x , ..., µ x ) ∈ R p for some µ x ∈ R. We fix the number s 0 of true relevant variables and randomly select s 0 variable indices from the p column indices without replacement.
The s_0 coefficients β_j corresponding to the relevant variables are drawn independently at random. As for the signal-to-noise ratio (SNR), we can easily generate regression data with a prescribed SNR by computing Var(Xβ) and by adjusting the scaling of the noise term in the linear model Y = Xβ + ε. For classification data, however, we generate Xβ and treat η_i := exp(X̃_iβ)/(1 + exp(X̃_iβ)) with X̃_iβ := X_iβ − mean(Xβ) (to avoid the issue of highly imbalanced data if (nearly) all components of Xβ are positive resp. negative) as P(Y = 1), i.e., we draw the responses according to Y_i ∼ Bin(1, η_i). [Friedman et al., 2001, Sec. 16] propose a noise-to-signal ratio NSR = Var(Y | Xβ)/Var(Xβ). We are not aware of any method that allows for generating classification data with a prescribed SNR (or NSR), but for the sake of transparency, we compute the inverse of the NSR on our data sets as a replacement for the SNR.
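The generation scheme can be sketched as follows. Since the text leaves the coefficient distribution open, the uniform law on [1, 2] below is purely our assumption, as are the function name and defaults.

```python
import numpy as np

def make_data(n, p, s0, snr, mu_x=0.0, classification=False, seed=0):
    """Rows of X are i.i.d. N_p(mu_x * 1, I_p); s0 randomly chosen
    coefficients are nonzero. For regression, the noise is scaled so
    that Var(X beta) / Var(eps) equals the prescribed SNR; for
    classification, the centered linear predictor is passed through
    the logistic function to obtain P(Y = 1)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(loc=mu_x, size=(n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=s0, replace=False)
    beta[support] = rng.uniform(1.0, 2.0, size=s0)  # assumed coefficient law
    lin = X @ beta
    if classification:
        eta = 1.0 / (1.0 + np.exp(-(lin - lin.mean())))  # centering balances classes
        y = rng.binomial(1, eta)
    else:
        sigma = np.sqrt(np.var(lin) / snr)               # prescribed SNR
        y = lin + rng.normal(scale=sigma, size=n)
    return X, y, beta, support
```

For classification, the prescribed `snr` argument is ignored; as stated above, only the empirical inverse NSR can be reported in that case.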

Training
We compare standard Boosting (L2-Boosting for regression and LogitBoost for classification) with our Stability Selection variants. Note that after having selected the final stable model, the final coefficients are computed on the whole set D using standard least squares regression resp. logit regression. The Boosting parameters are per default m_iter = 100 and κ = 0.1. The whole procedure is repeated V times for each scenario, i.e., we generate V independent data sets.
As for PSS-ES, we suggest a hybrid approach that considers both in-sample and out-of-sample losses. More precisely, since the number of observations is often low, the number of validation samples may not be sufficiently representative, so that focusing only on out-of-sample data when selecting the best candidate stable model could be misleading. Our strategy is to perform the usual exhaustive search based on the in-sample losses as implemented in the R-packages leaps (based on Fortran code by Alan Miller [2020]) and bestglm (McLeod et al. [2020]), which compute the best model for each cardinality up to at most q_0 (see Alg. 3). Normally, one would select the best of these competing models by some criterion like the AIC or the BIC. In our case, we compare these models again by computing the out-of-sample performance of the least squares resp. logit coefficients computed only on these variables. Finally, the best (measured by the out-of-sample performance) of these best (measured by the in-sample performance) models is chosen as the final stable model. We nevertheless also implemented a forward and a backward search purely based on the out-of-sample performance and refer to them as PSS-FW resp. PSS-BW.

Simulated data
As already elucidated in Wang et al. [2020], Stability Selection is a model selection strategy. The final performance therefore depends on how the analyst uses the resulting stable model. Hence, if the true model is known, computing test losses would not be a fair way to compare the performance of Stability Selection with that of the underlying model selection algorithm. Instead, we compute the average number of variables that have been selected by each model selection strategy. Furthermore, we compute the average number of true positives. All averages are first taken over the 10 cross-validation loops and then averaged over the V repetitions. Of course, one can expect raw Boosting models to achieve a higher TPR than stable models, so just reporting the number of true positives would be somewhat unfair. We therefore also compute the precision, defined as pr = TP/PP for the number TP of true positives and the number PP of predicted positives, i.e., the relative fraction of relevant variables among the selected variables, and again report the average over the cross-validation loops and the V data sets.
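For completeness, the computation of the two reported quantities is simply (function name is our own):

```python
def precision_and_tp(selected, true_support):
    """Number of true positives and precision TP / PP, i.e. the
    fraction of truly relevant variables among the selected ones."""
    sel, true = set(selected), set(true_support)
    tp = len(sel & true)
    precision = (tp / len(sel)) if sel else 0.0  # empty selection: precision 0
    return tp, precision
```

A raw Boosting model that selects many variables can reach a high TP count yet a low precision, which is exactly why both quantities are reported.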

Real data
On real data, there is no known ground truth model, so the number of true positives cannot be computed; moreover, a criterion that merely rewards selecting many variables would obviously bias the evaluation towards overfitting models.
Therefore, we compute the losses on the test data and report the average test loss over all cross-validation samples of the given data set. The intention behind this strategy is that if the assumption of a linear model is suitable, the prediction based on the truly relevant variables should be better than any prediction based on a different set of variables. Note that we compute the least squares coefficients on the selected models, including those for L2-Boosting, since directly using the L2-Boosting output would incorporate the additional shrinkage effect of that algorithm, which would make the results incomparable.

Comparison on simulated data
We first compare our loss-guided Stability Selection and PSS-ES with the Stability Selection from Hofner et al. [2015] on three scenarios in order to highlight some issues with the latter. Afterwards, we compare our Stability Selection variants with the corresponding Boosting algorithm on 16 regression scenarios and 8 classification scenarios.

Issues with the original Stability Selection
We start by showing that the Stability Selection proposed by Hofner et al. [2015] can fail on high-dimensional noisy data. The reason is that in such cases, there are seldom clearly dominating variables in terms of selection frequencies; in other words, the Boosting algorithm gets irritated by the noise variables, so that no variable at all passes the threshold in the Stability Selection. As for Hofner's Stability Selection, we use its implementation as stabsel in the R-packages mboost resp. stabs.
One can argue that one just has to modify either Q, the PFER or the threshold when applying stabsel, but as we show in scenario A, even that does not necessarily result in good models, apart from the immense computational overhead, since increasing Q necessitates re-computing all Boosting models. We do not intend to detect the best configuration here (besides, different configurations may lead to the same stable model) but to approximate the optimal model that the original Stability Selection can select.
We then only compute the number of selected variables and true positives corresponding to the best model in each cross-validation loop and the average over all 10 loops. We report the average of these cross-validated quantities over all V replications. We also count the number of cases where no variable has been selected. By default, the sampling type is the Shah-Samworth sampling with B complementary pairs.
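The bookkeeping just described can be sketched in a few lines. A hedged illustration in Python; the function name and dictionary keys are our own, not from the paper's code:

```python
import numpy as np

def tp_stats(selected_per_loop, true_set):
    """For each cross-validation loop, count the number of selected
    variables and the number of true positives, then average over
    the loops; also count loops that returned an empty model."""
    true_set = set(true_set)
    n_sel = [len(s) for s in selected_per_loop]
    n_tp = [len(set(s) & true_set) for s in selected_per_loop]
    n_empty = sum(1 for s in selected_per_loop if len(s) == 0)
    return {
        "avg_selected": float(np.mean(n_sel)),
        "avg_true_positives": float(np.mean(n_tp)),
        "empty_models": n_empty,
    }
```

Averaging these per-loop counts over the V replications then yields the quantities reported in the tables.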
(Would one apply the Stability Selection again with a higher false-positive tolerance?) Our Stability Selection, in contrast, still includes false positives, for the benefit that it is very easy to use without requiring expert knowledge at any stage while, when using a q-grid, being guaranteed to output a model to work with.

Comparison of our Stability Selection with Boosting for regression
We study the performance of our Stability Selection variants on the 16 regression scenarios specified in Tab. 3. For the loss-guided Stability Selection, we always use a q-grid q_grid = {1, 2, ..., 10}. We set π_thr = 0.25 and q_0 = 20 for PSS-ES resp. q_0 = 50 for PSS-FW and PSS-BW.
In Fig. 1, we can observe that our loss-guided Stability Selection and PSS-ES lead to the highest precision. More precisely, they achieve from at least double to more than five times the precision of L2-Boosting in scenarios I-VIII, while achieving from at least 2.8 times up to nearly 9 times the precision of L2-Boosting in scenarios IX-XVI. The forward and backward strategies, although never significantly worse than Boosting, achieve much lower precision than the other variants. We believe that this results from the rather reduced data availability. In scenarios IX and X, there is a clear joint impact, as expected.
It is noteworthy that the precision in scenarios IX-XVI is comparable to that in scenarios I-VIII, although the coefficients tend to be weak, while the number of true positives is lower in nearly all scenarios. Moreover, the regressors are centered in scenarios IX-XVI, while having a mean of -2 in scenarios I-VIII. Mathematically speaking, the effect of the weak coefficients is just that the norms of the response and error vectors are smaller in scenarios IX-XVI than in I-VIII (in expectation); in this context, Meinshausen and Bühlmann [2010] distinguish between "active" variables, which are those with a corresponding non-zero coefficient, and "relevant" variables, whose coefficient exceeds some threshold in absolute value. However, oddly, the precision even tends to increase on the scenarios with a low SNR, especially for scenarios VIII and XVI, which are the most challenging ones.
Summarizing, our loss-guided Stability Selection and PSS-ES perform well on all scenarios by achieving a drastically higher precision than standalone Boosting, without suffering from any issue like resulting in an empty model.

Comparison of our Stability Selection with Boosting for classification
We now consider 8 classification scenarios specified in Tab. 4. For the loss-guided Stability Selection, we always use a q-grid q_grid = {1, 2, ..., 10}. We set π_0 = 0.25 and q_0 = 15, since the Stability Selection outputs rather rich models which, however, only achieve a slightly better test performance. We interpret this result as follows: there are relevant variables that achieve a very high aggregated selection frequency and that enter all stable sets, and there are many non-relevant variables which are nevertheless persistent and are selected on many subsamples. These persistent variables are discarded by the original Stability Selection, which then also cannot select the relevant variables that obtain even lower aggregated selection frequencies; therefore, this method leads to very sparse models.
Our loss-guided Stability Selection includes more variables, including these persistent noise variables, until the inclusion of further noise variables becomes unfavorable concerning the validation loss. Therefore, it is capable of selecting much richer models than the original Stability Selection, to a certain extent, but it remains bound to the ordering of the variables. PSS-ES, in contrast, reliably picks variables that improve the test loss, disregarding the concrete ordering in terms of aggregated selection frequencies as long as the particular frequency is at least π_0.
In this experiment, all stable models lead to a worse performance than the model selected by L2-Boosting, which is not that surprising on such a high-dimensional data set, since one can expect the number of relevant variables to be rather high, so that stable models are doomed to underfit. In our interpretation, the false positives selected by L2-Boosting have a rather low effect, as the corresponding coefficients can be expected to be small in absolute value. Therefore, the rich L2-Boosting model in total works better than the stable models, as the benefit of including further true variables outweighs the performance decrease due to overfitting.
We want to emphasize again that an important advantage of sparse stable models is their better interpretability and their guidance for future experiments, since one can focus on far fewer variables there.
We have also run the simulations again with m_iter = 300, but this leads to such rich L2-Boosting models that the number of variables is higher than the number of observations. The stable models, in contrast, were not affected, which is not surprising since increasing the iteration number only includes some more variables while maintaining the variables selected during the first 100 iterations.
The ordering of the variables would only be altered if some of these late variables obtained higher aggregated selection frequencies than the early variables, which is not the case and which clearly is not to be expected. As for PSS-ES, the number q_0 of meta-stable variables is quickly attained in most cases, even for m_iter = 100, so additional variables from a larger number of iterations would just be discarded here.

Conclusion and outlook
Essentially, we do not pay too much attention to the error control but rather concentrate on finding a well-performing model instead of an overly parsimonious or even empty model, which can be interpreted as a relaxed perspective in contrast to that of Meinshausen and Bühlmann [2010] or Hofner et al. [2015]. Although we lose the error control that the original Stability Selection provided, our Stability Selection variants are user-friendly, as they do not require expert knowledge; intuitive, as they allow for fixing a range of numbers of variables in the stable model instead of some threshold; and, at least when using a grid over these numbers of stable variables, guaranteed to result in a non-empty model.
Our Stability Selection variants prove to reliably achieve higher precision, up to a factor of 10, compared to pure Boosting, in a plethora of simulated scenarios for regression and classification, where the data configurations range from easy to very challenging (large number of predictors, low number of observations, weak coefficients, very large noise variance). We also applied our Stability Selection variants to a real high-dimensional data set, where they prove to lead to reasonable models and where PSS-ES proves capable of ignoring the variable ordering in terms of selection frequencies in favor of better stable models.
subsamples D_(b;train) of size n_sub of D_train and to compute predictor sets Ŝ^(b), b = 1, ..., B, by applying some model selection procedure (without any restrictions like stopping it once Q variables are selected as in Hofner et al. [2015]) to subsample D_(b;train), respectively. Then, as in the literature, we compute the aggregated selection frequencies Π̂_j, compute for each grid element k the loss L^(val,k) on the validation data D_val, choose the model corresponding to k_opt = argmin_k(L^(val,k)), and compute the final coefficients w.r.t. the stable model on the whole data D (Algorithm 2: Loss-guided Stability Selection).

Remark 3.1. The aggregation procedure may suffer from several issues, for example, contaminated instances/cells in the data set. Of course, one can apply robust model selection methods on the subsamples. Furthermore, several methods, like computing the performance of the individual models on the subsamples and downweighting or trimming them when computing the Π̂_j, have been considered in Werner [2019], also including an additional outer cross-validation scheme where different partitions of the data into training and validation data are drawn. A trimming procedure based on the in-sample losses was suggested in Werner [2021b].
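A trimmed variant of the aggregation step mentioned in the remark can be sketched as follows. This is our own minimal Python illustration under simplifying assumptions; the trimming fraction and all names are hypothetical, not the cited implementations:

```python
import numpy as np

def aggregated_frequencies(models, p, losses=None, trim_frac=0.0):
    """Compute aggregated selection frequencies Pi_j over B models,
    each given as a set of selected column indices. If in-sample
    losses are supplied, the worst trim_frac of the models is
    discarded before aggregating (trimming in the spirit of the
    remark above)."""
    if losses is not None and trim_frac > 0:
        n_keep = int(np.ceil((1 - trim_frac) * len(models)))
        keep = np.argsort(losses)[:n_keep]       # indices of best models
        models = [models[b] for b in keep]
    freq = np.zeros(p)
    for sel in models:
        freq[list(sel)] += 1.0                   # count each selected variable
    return freq / len(models)
```

With trim_frac = 0 this reduces to the plain aggregation of selection frequencies used throughout the paper.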
We propose a "Post Stability Selection model selection" approach which combines the resampling strategy of Stability Selection with well-established classical model selection and a validation step corresponding to the loss function. We perform an exhaustive search in order to overcome the problem of inappropriate variable orderings and call the strategy "Post Stability Selection Exhaustive Search" (PSS-ES). See Alg. 3 for an overview, where P(A) denotes the power set of a discrete set A. Note that we left the definition of the "performance" in the algorithm open, since there are several possible alternatives, like standard exhaustive search w.r.t. the validation loss, additional cross-validation, or a hybrid approach between the in-sample and out-of-sample performances that we apply in Sec. 4.3. Alternatively, one can execute the corresponding forward or backward selection strategies.

Initialization: Training data D_train, validation data D_val, threshold π_thr, maximum number q_0 of candidate variables, aggregated selection frequencies Π̂_j;
Ŝ_meta-stab = {j | Π̂_j ≥ π_thr};
if |Ŝ_meta-stab| > q_0 then Ŝ_meta-stab = {j | Π̂_j ≥ Π̂_((p-q_0+1):p)} end
for S ∈ P(Ŝ_meta-stab) do
    Compute β̂_S on (X_train_(•,S), Y_train);
    Compute the performance of the model;
end
Select the model Ŝ_stab with the best performance;
Compute final coefficients w.r.t. the stable model on D
Algorithm 3: Post Stability Selection Exhaustive Search (PSS-ES)

Lemma 3.1. i) The computational complexity of our loss-guided Stability Selection is contained in the same complexity class as the original Stability Selection, provided that the complexity O(c_base(n, p)) of the model selection algorithm applied on training data with n observations and p predictors is at least O(np), that the loss function can be evaluated in at most O(np) steps and that the final coefficients can be computed in at most O(np) steps. ii) Statement i) remains valid for the Post Stability Selection model selection provided that the true number s_0 of relevant variables depends neither on n nor on p. iii) Selecting the (candidate) stable model according to the rank-based strategy does not cause a computational overhead compared to the threshold-based strategy provided that p is of order exp(n).

Proof. The original Stability Selection requires drawing B subsamples of size n_sub, leading to a complexity of O(B n_sub), the application of the base procedure on each subsample with a total complexity of O(B c_base(n_sub, p)), the computation of the Π̂_j, which is done in O(Bp) steps, and the selection of the variables according to Eq. 3.3, requiring O(p) steps. Summarizing, the computational complexity is dominated by the base procedure according to the assumption, letting the total complexity be O(B c_base(n, p)). i) The loss-guided Stability Selection with a grid of thresholds additionally requires |π_grid| checks of which variables exceed the threshold, followed by the computation of the coefficients and the loss evaluation, leading to additional costs of order O(|π_grid|(p + n_train p + n_val p)) = O(np). Denoting our grid by *_grid for * ∈ {π, q}, the storage costs for the whole Stability Selection are given by O(np + Bp + |*_grid| + p). A Post Stability Selection model selection strategy would require the same storage capacities, where |*_grid| is replaced by 2^(q_0). Evidently, both Stability Selection strategies do not require any interaction between the different bootstrap replications at any stage, except for the final aggregation of the selection frequencies. Therefore, one can easily parallelize them, similarly to the original Stability Selection, by running the base procedure on the subsamples on different nodes.
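The exhaustive search of Alg. 3 can be sketched compactly. A hedged Python/NumPy illustration using least squares as the refit and the validation loss as the "performance"; all names are our own assumptions:

```python
import numpy as np
from itertools import chain, combinations

def pss_es(freqs, pi_thr, q0, X_tr, y_tr, X_val, y_val):
    """Post Stability Selection Exhaustive Search (sketch).
    Build the meta-stable set from the aggregated selection
    frequencies, cap it at the q0 most frequent variables, then
    refit least squares on every non-empty subset and keep the
    subset with the best validation loss."""
    meta = [j for j in range(len(freqs)) if freqs[j] >= pi_thr]
    if len(meta) > q0:                       # keep only the q0 largest frequencies
        meta = sorted(meta, key=lambda j: freqs[j], reverse=True)[:q0]
    subsets = chain.from_iterable(
        combinations(meta, r) for r in range(1, len(meta) + 1))
    best, best_loss = None, np.inf
    for S in subsets:                        # exhaustive search over the power set
        S = list(S)
        coef, *_ = np.linalg.lstsq(X_tr[:, S], y_tr, rcond=None)
        loss = np.mean((y_val - X_val[:, S] @ coef) ** 2)
        if loss < best_loss:
            best, best_loss = S, loss
    return best, best_loss
```

Capping the meta-stable set at q0 variables keeps the 2^q0 subsets of the power-set search tractable, matching the storage bound 2^(q_0) in the proof above.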
We evaluate the loss-guided Stability Selection and PSS-ES as follows. As for Boosting, the whole data set D is used as training set. As for our Stability Selection variants, we partition D into a training set D_train with n_train instances and a validation set D_val with n_val instances. In Secs. 4.4 and 4.5, we generate an independent test data set with n_test instances that only appears during evaluation. In all cases, in the spirit of randomized cross-validation, we draw 10 such partitions into (test,) training and validation data. The subsamples in our Stability Selection are of fixed size n_sub.
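The randomized partitioning scheme just described can be sketched as follows (an illustrative Python helper of our own; the paper's experiments use their own R code):

```python
import numpy as np

def random_partitions(n, n_train, n_val, n_test=0, reps=10, seed=0):
    """Draw `reps` random partitions of the row indices {0,...,n-1}
    into training, validation and (optionally) test indices, as in
    randomized cross-validation."""
    rng = np.random.default_rng(seed)
    parts = []
    for _ in range(reps):
        idx = rng.permutation(n)             # fresh random split each repetition
        parts.append((idx[:n_train],
                      idx[n_train:n_train + n_val],
                      idx[n_train + n_val:n_train + n_val + n_test]))
    return parts
```

Each repetition yields disjoint index sets, so the validation (and test) data never overlap with the training data within a partition.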
This may be due to these strategies being solely based on D_val, while PSS-ES takes both D_train and D_val into account when finding the stable model. However, the clear dominance of the forward over the backward search can hardly be explained. One can also observe a tendency of PSS-ES being slightly inferior to the loss-guided Stability Selection.

Figure 2: Upper row: average number of retrieved true positives for L2-Boosting (L2), our loss-guided Stability Selection (LSS), PSS-ES (ES), PSS-FW (FW) and PSS-BW (BW); bottom row: average precision.

We have shown that the standard Stability Selection, where the hyperparameters are chosen manually, may fail on high-dimensional noisy data due to severe underfitting, potentially caused by both inappropriate hyperparameter settings by the analyst and the very strict false-positive-avoiding paradigm. We presented a loss-guided Stability Selection variant that chooses the hyperparameters in a data-driven way according to a grid search w.r.t. the out-of-sample loss. Motivated by the issue that true variables may get lower aggregated selection frequencies than some noise variables, so that even our loss-guided Stability Selection cannot include these true variables without including the respective noise variables, we proposed another variant, called Post Stability Selection Exhaustive Search (PSS-ES), which selects a meta-stable predictor set and performs an exhaustive search on this set. The computational overhead of our Stability Selection variants is limited, since the additional operations are executed on already sparse predictor sets. One can show that the complexity is captured by the same O-class as the complexity of the original Stability Selection. Honestly speaking, the original Stability Selection may be better in the case of strong signals, i.e., high signal-to-noise ratios, since the relevant variables generally enter the models quickly.
provided that Π̂_((p-q_opt):p) > 0.5. If we use a threshold grid as in Eq. 3.5 for our Stability Selection, one simply replaces Π̂_((p-q_opt):p) by π_opt, for π_opt being the optimal element in π_grid, provided that π_opt > 0.5. It is important to note that we rather risk including wrong variables in our final stable model than obtaining an empty or heavily underfitted model, so we do not consider the inability to compute the error control probability ex ante to be a weakness of our Stability Selection.
Initialization: Data D_train, D_val, size n_sub of subsamples, binary variable gridtype, grid, hyperparameters for the underlying model selection procedure;
for b = 1, ..., B do
    Draw a subsample D_(b;train) ∈ R^(n_sub×(p+1)) from D_train;
    Apply the model selection algorithm and get a model Ŝ^(b);
end
Compute the Π̂_j as in Eq. 3.1;
if gridtype == 'qgrid' then
    for k = 1, ..., |q_grid| do
        Get the stable model Ŝ_stab((q_grid)_k) according to Eq. 3.2;
        Compute the coefficients on the reduced data (X_train_(•,Ŝ_stab((q_grid)_k)), Y_train) and get the loss L^(val,k) on the validation data D_val;
    end
else
    for k = 1, ..., |π_grid| do
        Get the stable model Ŝ_stab((π_grid)_k) according to Eq. 3.3;
        Compute the coefficients on the reduced data (X_train_(•,Ŝ_stab((π_grid)_k)), Y_train) and get the loss L^(val,k) on the validation data D_val;
    end
end
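The q-grid branch of this algorithm can be sketched in a few lines of Python/NumPy. A hedged illustration under simplifying assumptions: least squares as refit, squared loss, and a plain correlation screener standing in for the Boosting model selection step; all names are our own:

```python
import numpy as np

def corr_select(X, y, k=2):
    """Toy stand-in for the base model selection procedure: pick the
    k columns with the largest absolute (unnormalized) correlation."""
    c = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.argsort(c)[::-1][:k]

def loss_guided_stabsel(X_tr, y_tr, X_val, y_val, select,
                        B, n_sub, q_grid, seed=0):
    """Loss-guided Stability Selection with a q-grid (sketch).
    `select` maps a subsample (X, y) to selected column indices."""
    rng = np.random.default_rng(seed)
    p = X_tr.shape[1]
    freq = np.zeros(p)
    for _ in range(B):                       # aggregate over B subsamples
        rows = rng.choice(len(y_tr), size=n_sub, replace=False)
        freq[list(select(X_tr[rows], y_tr[rows]))] += 1.0
    freq /= B
    order = np.argsort(freq)[::-1]           # most frequent variables first
    best, best_loss = None, np.inf
    for q in q_grid:                         # stable model = top-q variables
        S = list(order[:q])
        coef, *_ = np.linalg.lstsq(X_tr[:, S], y_tr, rcond=None)
        loss = np.mean((y_val - X_val[:, S] @ coef) ** 2)
        if loss < best_loss:
            best, best_loss = S, loss
    return best, best_loss
```

The validation loss thus replaces the manually chosen threshold of the original Stability Selection as the criterion that picks the grid element.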

Table 1 :
Scenario specification.

In all scenarios A-C, we apply the Stability Selection 30 times for each combination of Q ∈ {5, 6, ..., 10} and PFER ∈ {1, 2, 3, 4, 5}. The reason is that using one single configuration would potentially be highly misleading and discredit the original Stability Selection. Therefore, we use the test losses in order to determine the best stable model from the original Stability Selection.
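The sweep over all (Q, PFER) combinations can be sketched generically. A hedged Python illustration where `run_stabsel` is a hypothetical wrapper around the actual Stability Selection call (the paper uses stabsel in R) and `eval_loss` the test-loss evaluation:

```python
from itertools import product

def best_configuration(run_stabsel, Q_grid, pfer_grid, eval_loss):
    """Try every (Q, PFER) combination, keep the stable model with
    the best test loss, and count configurations that returned an
    empty model."""
    best, best_loss, n_empty = None, float("inf"), 0
    for Q, pfer in product(Q_grid, pfer_grid):
        model = run_stabsel(Q=Q, PFER=pfer)   # hypothetical wrapper
        if not model:                         # empty stable model
            n_empty += 1
            continue
        loss = eval_loss(model)
        if loss < best_loss:
            best, best_loss = model, loss
    return best, best_loss, n_empty
```

Counting the empty models per sweep gives exactly the quantities reported in Tab. 2.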

Table 2 :
Results for Scenarios A-C.

We see in Tab. 2 that there do not seem to be severe issues with the original Stability Selection in scenario A, where it achieves a significantly higher precision than our variants. Among the different hyperparameter configurations, less than 1% of them (0.158 out of 30) led to an empty model. However, the results have to be interpreted with caution since they correspond to the best of the 30 different configurations. Regarding scenarios B and C, where weak coefficients appear in contrast to the strong signals in scenario A, the original Stability Selection reveals that, although the best model again achieves a somewhat higher precision than our variants, it tends to severe underfitting; more precisely, around half of the configurations (14.012 resp. 15.233 out of 30, on average) result in an empty model. Even worse, in 9 resp. 11 out of the 100 repetitions in scenario B resp. C, all configurations end up in an empty model, hence all the computational effort was in vain, since on these training data sets, no variable ever entered the stable model. From the perspective of avoiding false positives at all costs, the original Stability Selection clearly beats our variants. From the perspective of finding a predictive model, the original Stability Selection backfires if weak coefficients appear, since a considerable computational effort is lost and there is no obvious way how to proceed in this situation (would one apply the Stability Selection again with a higher false-positive tolerance?).

Table 3 :
Scenario specification for regression: the 16 scenarios on which we compare our Stability Selection variants.

Table 4 :
Scenario specification for classification. SNR refers to the average of the SNRs of the individual data sets.