Trimming stability selection increases variable selection robustness

Contamination can severely distort an estimator unless the estimation procedure is suitably robust. This is a well-known issue and has been addressed in Robust Statistics, however, the relation of contamination and distorted variable selection has been rarely considered in the literature. As for variable selection, many methods for sparse model selection have been proposed, including the Stability Selection which is a meta-algorithm based on some variable selection algorithm in order to immunize against particular data configurations. We introduce the variable selection breakdown point that quantifies the number of cases resp. cells that have to be contaminated in order to let no relevant variable be detected. We show that particular outlier configurations can completely mislead model selection. We combine the variable selection breakdown point with resampling, resulting in the Stability Selection breakdown point that quantifies the robustness of Stability Selection. We propose a trimmed Stability Selection which only aggregates the models with the best performance so that, heuristically, models computed on heavily contaminated resamples should be trimmed away. An extensive simulation study with non-robust regression and classification algorithms as well as with two robust regression algorithms reveals both the potential of our approach to boost the model selection robustness as well as the fragility of variable selection using non-robust algorithms, even for an extremely small cell-wise contamination rate.

It has been observed that single regularized models often tend to overfit which was the starting point for a more sophisticated concept that combines ensemble models with feature selection, namely Stability Selection (Meinshausen and Bühlmann [2010]).Roughly speaking, in Stability Selection one considers subsamples of the training data, performs feature selection on each subsample and aggregates the models to get a final stable model at the end.Stability Selection and several regression approaches (Bottmer et al. [2021], Leung et al. [2016], Filzmoser et al. [2020a]).There are also notable algorithms for detecting and imputing cell-wise outliers like the DDC from Rousseeuw and Van Den Bossche [2018] or the cellHandler from Raymaekers and Rousseeuw [2019].
A common robustness measure is given by the breakdown point.In contrast to the prominent influence curve which quantifies the local robustness of an estimator, i.e., only allowing for an infinitesimal fraction of the data to be contaminated, the breakdown point (BDP), introduced in [Hampel, 1971, Sec. 6] in a functional version and in Donoho and Huber [1983] in a finite-sample version, studies the global robustness of an estimator.The finite-sample BDP from Donoho and Huber [1983] quantifies the minimum fraction of instances in a data set that guarantees the estimator to "break down" when being allowed to be contaminated arbitrarily while the functional BDP essentially quantifies the minimum Prokhorov distance of the ideal and contaminated distribution that leads to such a breakdown.There has already been a lot of work on BDPs, see for example Rousseeuw [1984], Rousseeuw [1985], Davies [1993] and Hubert [1997], Genton [1998], Becker and Gather [1999], Gather and Hilker [1997] or Hubert et al. [2008] which cover location, scale, regression, spatial and multivariate estimators.Recently, BDP concepts for classification (Zhao et al. [2018]), multiclass-classification (Qian et al. [2019]) and ranking (Werner [2021]) have been proposed.Another type of breakdown point which relates the quotient of the number of variables and observations to the sparsity of the true model was studied in Donoho and Stod-den [2006], illuminating that in high-dimensional settings, classical model selection procedures like Lasso can only find a reliable model provided that the true model is sufficiently sparse.
Despite the recent successes of robust model selection methods and path-breaking theoretical results concerning model selection consistency (see, e.g., Bühlmann and Van De Geer [2011]), the question how contamination affects variable selection is seldomly addressed, leaving the connection of the paradigms of sparse model selection (select a small fraction of columns), stability (select variables that are appropriate for the majority of the data, i.e., a majority of the rows) and robustness (focus on a certain "clean" majority of the rows) still opaque.Although the frequency of contamination in real data is high and although the issue that non-aggregated feature selection models are usually unstable and tend to overfit is well-known, combining both stability and robustness seems to have rarely been considered in literature so far.A robust Stability Selection based on cleaned data has been introduced in Uraibi et al. [2015].Uraibi [2019] apply a weighted LAD-Lasso (Arslan [2012]) as base algorithm for the Stability Selection where the weights are computed according to a robust distance in order to downweigh leverage points.Park et al. [2019] proposed a robust Stability Selection where the term "robust" however refers to an immunization against a specific regularization parameter of the underlying Lasso models, therefore considers a different goal than we do.Notable work on aggregating robust estimators has been done in Salibián-Barrera and Zamar [2002], Salibián-Barrera et al. [2008] who propose a linear update rule for robust MM-estimators in order to avoid computing it on each of the drawn Bootstrap samples, see also Salibián-Barrera [2006] for fixed-design regressor matrices and Salibián- Barrera et al. [2006] for robust PCA estimators.While these techniques do not consider variable selection, Salibián- Barrera and Van Aelst [2008] extend their method to variable selection where a backward selection strategy according to a minimization of the expected prediction error is applied.
We aim at making a tiny step towards the connection of robustness, stability and sparsity by introducing the variable selection breakdown point, which describes the number of outliers that can make variable selection completely unreliable, and the Stability Selection breakdown point, which corresponds to a sufficiently high probability that a stable model becomes completely unreliable.We study the relative robustness improvement that a Stability Selection grants, compared to a single model selection algorithm.It turns out that a rank-based Stability Selection where a fixed number of best variables enters the stable model is generally more robust than a threshold-based Stability Selection where all variables whose aggregated selection frequency exceeds the threshold enters the stable model.We propose a trimmed Stability Selection (TrimStabSel) and investigate its performance on a large variety of simulated data in comparison with a non-robust Stability Selection, where L 2 -Boosting, LogitBoost and SLTS are used model selection algorithms.The numerical results show that even an extremely small cell-wise contamination rate can already have a severe impact on variable selection.Our TrimStabSel can in particular be recommended in settings where the contamination rate is expected to be low but non-zero.Our contribution is threefold: i) We propose BDP definitions for (stable) variable selection; ii) We propose (oracle) outlier schemes that can completely distort model selection with usually very few outliers; iii) We lift the popular concept of trimming from single instances to whole models that allows for contamination rates exceeding 50% while maintaining the 50%-bound for the standard BDPs of the underlying model selection algorithms.

Preliminaries
Let D ∈ R n× (p+k) be the data matrix consisting of a regressor matrix X ∈ R n×p and a response matrix Y ∈ R n×k .If univariate responses are considered, Y ∈ R n is a response column.We denote the i-th row of X by X i and the j-th column of X by X •,j .The i-th row of Y is denoted by Y i and for k = 1, Y i denotes the i-th component of Y .Let n sub < n always be the number of instances in subsamples resp.Bootstrap samples.
We start by recapitulating contamination models and the breakdown point concept.The last subsection is devoted to Stability Selection.
Definition 2.1.Let (Ω, A) be a measurable space.Let P := {P θ | θ ∈ Θ} be a parametric model where each P θ is a distribution on (Ω, A) and where Θ ⊂ R p is some parameter space.Let P θ 0 be the ideal distribution ("model distribution", [Rieder, 1994, Sec. 4.2]).Then, a contamination model is the set of all distributions given by the system U * (θ 0 ) : The radius r is also called "contamination radius".
To emphasize different standard types of contamination balls, a subscript (here represented by the " * ") is usually added to concretize the contamination model.

Example 2.1. A convex contamination model
Many other metrics have entered the theory of Robust Statistics, see [Rieder, 1994, Sec. 4.2] for an overview.While in standard estimation problems like estimating a location or scale parameter one just has a sample of observations, there is far more flexibility to define outliers in for example regression settings.See Kohl [2005] and Rieder [1994] for more details.

Remark 2.1 (Regression outliers). If we only allow for contaminating the responses while
maintaining the regressors, we speak of Y -outliers while contamination in the regressors is referred to as X-outliers.
In the contamination model in Def.2.1, we considered case-wise (row-wise/instance-wise) outliers, i.e., either a whole row in the regressor matrix or in the response is contaminated or not.
A more realistic scenario where the entries (cells) of the regressor matrix are allowed to be perturbed independently (cell-wise outliers) has been introduced in Alqallaf et al. [2009].See also Agostinelli et al. [2015] for the notation.Definition 2.2.Let X i ∼ P θ 0 for i = 1, ..., n where P θ 0 is a distribution on some measurable space (Ω, A) and let X be the n×p-matrix consisting of the X i as rows.Let U 1 , ..., U p ∼ Bin(1, r) i.i.d.. Then the cell-wise convex contamination model considers all sets where X ∼ Q for any distribution Q on (Ω, A) and the matrix U with diagonal entries U i .
Remark 2.2.Alqallaf et al. [2009] pointed out that if all U j are perfectly dependent, one either gets the original row or a fully contaminated row which is just the classical convex contamination model in Def.2.1.
Remark 2.3.Note that for response matrices, one can similarly construct cell-wise outliers.
In the cell-wise contamination model, the curse of dimensionality becomes drastically more apparent since the probability that at least one case is contaminated grows with the dimension p.Note that a single contaminated cell already makes an observation an outlier.This has been pointed out in Öllerer and Croux [2015], Croux and Öllerer [2016] who concentrated on precision matrix estimation, see also Loh and Tan [2018].

The breakdown point concept
The goal of Robust Statistics is to provide robust estimators, i.e., estimators that tolerate a certain amount of contaminated data without being significantly distorted.As for the term "robustness", in this work, we use the global robustness concept that allows for a large fraction of the data points being contaminated arbitrarily.The minimum fraction of outliers that can lead to a breakdown of the estimator is called the breakdown point (BDP) of this particular estimator.The finite-sample BDP of Donoho and Huber [1983] is defined as follows.
Definition 2.3.Let Z n be a sample consisting of instances (X 1 , Y 1 ), ..., (X n , Y n ).The finitesample breakdown point of the estimator β is defined by where Z m n denotes any sample with (n−m) instances in common with the original sample Z n , so one can arbitrarily contaminate m instances of Z n , and where In this definition, one implicitly assumes that β ∈ R p .If w.l.o.g. a compact set Θ ⊂ R p is considered, one may define a breakdown in the sense that β is located at the boundary of Θ.It has been discussed in Davies and Gather [2005] whether attaining a boundary value should be considered as a breakdown.See the rejoinder of Davies and Gather [2005] for some discussions on that topic where the authors suggest a metric which becomes infinity once one of the boundary values is attained.
There are already a lot of BDP concepts in literature, for example for regression (Stromberg and Ruppert [1992] and Sakata and White [1995]), for location-scale estimators (Sakata and White [1998]), for time series (Genton and Lucas [2003], Genton [2003]) and for variogram estimators (Genton [1998]).Donoho and Stodden [2006] propose a BDP for model selection while Kanamori et al. [2004] studied the BDP for SVMs.A BDP variant for clustering was introduced in Hennig [2008] who called their concept the "dissolution point".The situation of heavy-tailed original data has been addressed in Ruckdeschel and Horbenko [2012] with the expected BDP that takes the ideal distribution of the original data into account.

Stability Selection
The  Meinshausen and Bühlmann [2010], one defines a threshold π thr based on an inequality derived in [Meinshausen and Bühlmann, 2010, Thm. 1] so that the stable set then consists of all variables j for which πj ≥ π thr .There are some variants of this Stability Selection, most notably the one in Hofner et al. [2015] that makes it applicable for Boosting while the original one is tailored to algorithms that invoke a regularization term like the Lasso or the Graphical Lasso.An excellent implementation of the Stability Selection can be found in the R-packages mboost (Hothorn et al. [2017], Hofner et al. [2014], Hothorn et al. [2010], Bühlmann and Hothorn [2007], Hofner et al. [2015], Hothorn and Bühlmann [2006]) and stabs (Hofner and Hothorn [2017], Hofner et al. [2015], Thomas et al. [2018]).As for the selection of the stable set according to the aggregated selection frequencies, another paradigm that defines a number q of variables that have to enter the stable model so that the q variables with the highest selection frequencies are chosen has been suggested in literature (e.g., Zhou et al. [2013], Werner [2022]) since the threshold-based approach is less intuitive for the user due to the number of stable variables not being predictable in the first place.

Breakdown of variable selection
In this section, we first define a BDP for variable selection.Based on this definition, we discuss why this BDP may be very small and outline the path from the robustness of single algorithms concerning variable selection to ensembles of such algorithms.Donoho and Stodden [2006] already provided a very insightful work where the notion of a "breakdown of model selection" has been introduced.They computed phase diagrams that show under which configurations of the dimensionality of the data and the sparsity level of the true underlying model a successful model selection is possible.More precisely, they derive that the underlying model has to be sufficiently sparse, expressed in the fraction of the true dimensionality q of the model and the number n of observations, the question whether model selection is possible depends on the fraction n/p for p being the number of predictors.

Variable selection breakdown point
Our idea also considers to compute a breakdown for model selection, but we restrict ourselves to the standard setting of Robust Statistics where we want to examine how many outliers can be tolerated for model selection.In order not to confuse our BDP concept with the one in Donoho and Stodden [2006], we call our concept the variable selection breakdown point (VSBDP).
where Z m n again denotes any sample that has (n − m) instances in common with Z n .

b) The cell-wise variable selection breakdown point (cell-VSBDP) is given by
where Zm cell denotes the data set where m cells can be modified arbitrarily and where all other cell values remain as in the original data.
In other words, the VSBDP quantifies the relative number of rows resp.cells that have to be contaminated in order to guarantee that none of the relevant variables are selected.The fraction of outlying cells as a breakdown measure has for example already been considered in Velasco et al. [2020].Let us now formulate a very simple but important result.
Theorem 3.1.Let q ≤ p be the true dimension of the underlying model and let again k be the response dimension and n be the number of instances in the data set.a) Then the cell-VSBDP is at most min(q,k)  p+k .b) Let p = p(n) so that it grows when n grows.If q or k stays constant, the asymptotic cell-VSBDP is zero.c) Let p = p(n) and q = q(n) so that both quantities grows when n grows.Let contamination only be allowed on the predictor matrix.Then the asymptotic cell-VSBDP is given by lim n→∞ q(n) p(n) .
Proof.a) If q < k, just replace the entries X ij for all i and for all j corresponding to the relevant variables by zeroes, so that one has to modify qn cells of the data set, more precisely only of the predictor matrix.Then the originally relevant columns remain without any predictive power and therefore will not be selected.If q ≥ k, replace all entries of the response matrix with zeroes which would result in an empty model since no predictor column remains correlated with the response, requiring kn outlying cells.

b)+c) Directly follows from a).
2 This is a universal result, regardless of the data structure or the applied algorithms.We want to point out that the classical understanding of robustness would only consider the estimated coefficients themselves and call an estimator robust if the coefficients stay bounded.However, if the non-zero coefficients correspond to non-relevant variables, the learning procedure results in a robust fit on noisy variables which will definitely have poor prediction quality for out-of-sample data.Our analysis is based on an interplay between model selection and coefficient estimation so that the ultimate goal is to achieve both sub-goals in order to get a reliable model.Let us therefore pose the following provoking statement: From the perspective of retrieving the correct model, all robust regression and classification models are doomed to have a breakdown point less than q/p.
There are methods that solve regression problems by first identifying cell-wise outliers in the data matrix corresponding to a regression problem (Agostinelli et al. [2015], Leung et al. [2016]) and by down-weighting them.In general, the column structure of the contamination scheme corresponds to a perfect correlation of the binomial variables in the cell-wise convex contamination model in Def.2.2, i.e., all U ij , i = 1, ..., n, are perfectly correlated, in contrast to case-wise contamination where the U ij , j = 1, ..., p (or j = 1, ..., p + k if the response matrix is also considered), are perfectly correlated.1 Note that a zero column would clearly be detected even by hand (although it would not help to reconstruct the original values so that the corresponding variable would not be selected anyway), and other outlier structures like replacing cells by a fixed value could be detected by the outlier detection and imputation procedure (DDC) from Rousseeuw and Van Den Bossche [2018] or the cellHandler from Raymaekers and Rousseeuw [2019].However, one could try to hide the perturbation by first replacing selected cells of a given column by the mean of the remaining column entries and by adding some random noise to these imputations.
When having simulated data and a random cell-wise outlier scheme, it is extremely unlikely that such column-wise outliers that make the true model irretrievable would appear, so this situation should at most very rarely occur in a random simulation as for example done in Filzmoser et al. [2020a] where first a set of instances is selected and for each of these instances, a random fraction of cells are contaminated.Filzmoser et al. [2020a] identify the column values with measurements made by a specific sensor, so if this sensor is corrupted across a whole study, one indeed would have column-wise outliers.In the simulations in Filzmoser et al. [2020a], they restricted themselves to cell-wise contamination in the regressor matrix while Leung et al. [2016] and Velasco et al. [2020] also allowed for (a small amount of) contamination in the response vector.From a practical perspective, assuming that contamination cannot appear in the responses is a very heavy assumption since if one allows for all p measurement vectors of the variables to be contaminated, it is difficult to argue why of all things the response measurements should be clean.
From a practical perspective, it is important to make an assumption how much information an attacker has.Since Robust Statistics mainly focuses on deriving robust algorithms that can cope with a certain fraction of contamination but not on attacking algorithms that unveil weaknesses of the algorithms by targetedly producing dangerous contamination, this topic has gained far less attraction than in the Deep Learning community.There, one considers adversarial attacks (Goodfellow et al. [2014]), which minimally perturb instances so that an already trained model makes wrong predictions, and poisoning attacks (Jagielski et al. [2018]), which are perturbations that are injected into the data before the model is trained and which therefore are stronger related to Robust Statistics.However, these perturbations are usually bounded by a matrix norm and not by a contamination radius as in convex contamination schemes which nevertheless has also been considered in Robust Statistics (e.g.Burgard et al. [2021]).The success of such attacks depends on the amount of knowledge that the attacker has both on the victim model as well as on the data and the underlying model.Coming back to our column-wise outliers, an attacker that randomly would inject column-wise outliers to the predictor matrix would hardly contaminate the relevant variables by chance unless it is an oracle attacker that is aware of the relevant variables, maybe due to intercepting and analyzing the data first.

Resampling and robustness
For Bootstrapping, the probability that in a particular resample there is at least one contaminated instance is given by where ε = m/n indicates the empirical rate of outlying rows in the data set.Similarly, for subsamples, it is given by where Hyp(n, n − m, n sub ) is the hypergeometrical distribution describing the number of clean instances (successes) for (n − m) = (1 − ε)n clean instances in the "urn".This observation has already been made in literature, more precisely, in Berrendero [2007] for Bootstrap samples and in Camponovo et al. [2012] for subsampling.[Filzmoser et al., 2020b, Sec. 3.5] also derive the necessary number of Bootstrap samples so that the probability of having at least one clean resample is sufficiently large.This paradigm has been extended for the bagged median in Berrendero [2007].A general condition so that a bagged robust estimator where the estimator on each resample has the BDP c does break down is given by for Bootstrapping and by for subsampling.This also consistently covers the case of c = 1/n since n sub /n < 1.
The last two formulae correspond to non-robust resampling aggregation with robust estimators.Berrendero [2007] also considered a robust aggregation procedure called Bragging which was introduced in an earlier version of Städler and Bühlmann [2012] which just replaces the mean aggregation by a median aggregation of the individual estimators.The condition for a breakdown of the bragged estimator changes to at least (B+1)/2 resamples being sufficiently contaminated to let these resample-individual estimators break down, see also Berrendero [2007].Similar results are clearly available if one considers trimmed bagging (Croux et al. [2007]).See also Salibián-Barrera and Zamar [2002] for such a resampling BDP for Bootstrapped robust quantile estimators.
We now recapitulate the resampling breakdown point introduced in Berrendero [2007] and use a slightly modified definition to make it more consistent with other BDP concepts.
This resampling BDP defines the maximum fraction of contaminated instances in the data so that the probability that a mean aggregation breaks down exceeds the tolerance level.This definition is important since it lifts the worst-case BDP concept to a probabilistic concept that respects that the worst case is often very unlikely and that solely reporting it would be too pessimistic.
Now, after this exposition, let us first emphasize that the idea from the rejoinder of Davies and Gather [2005] to map the boundary values of a bounded image set of an estimator to infinite values is extremely important when assessing the effects of resampling on the robustness.
Grandvalet [2000] claim that Bagging never improves the BDP.This is not true if the value set is bounded.We first formally spell out why Bagging is usually not robust, although this fact has already been observed in literature.

b) Note that if the fraction of outlying instances/cells is smaller than the case-BDP c resp.
the cell-BDP c of the applied estimator, not letting it break down, resampling induces a certain probability, namely a ⊥ c,n sub ,B (ε), that the bagged estimator indeed suffers a breakdown.One has to keep in mind that this probability may become rather high on specific configurations.Consider Bagging with B = 100 Bootstrap samples, n = 200, n sub = 100 and m = 90 outlying instances and let c = 0.5.Then, the probability that a single Bootstrap sample is sufficiently clean is P (Bin(100, 0.45) < 50) ≈ 0.817, but the probability that the bagged estimator breaks down is a Boot 0.5,100,100 (0.45) ≈ 1.In this case, Bagging would completely de-robustify the estimator.This fact is well-known (see e.g.Salibián-Barrera et al. [2008]), but we need it for comparison with Stability Selection, i.e., a bagged variable selection which operates on a finite space.
As for bounded domains however, the situation becomes dramatically different.
for Bootstrapping resp.

The Stability Selection BDP
Having the VSBDP and the resampling BDP defined, we are ready to define a BDP for Stability Selection itself.
with the cell-BDP c of the underlying model selection algorithm.
Intuitively, the Stab-BDP denotes the minimum fraction of outliers required so that the probability that the reported stable model does not contain any relevant variable is sufficiently large.

Can (non-relevant) variables be targetedly promoted in practice?
Research in this direction has recently been done in the context of poisoning attacks, i.e., injecting contamination into the training data so that one can targetedly manipulate the trained model.In contrast to the BDP concept where one can modify a certain fraction of the data arbitrarily, poisoning attacks generally allow to contaminate all data instances but with a bound on some || • || s -norm for the contamination vector resp.matrix.Li et al. [2020], Li et al. [2021] prove that poisoning attacks can indeed targetedly affect model selection in the sense that variables selected by the attacker can be suppressed resp.promoted, i.e., the attacker can inject variables to the selected model resp.discard variables.According to their attack scheme, there is essentially no limitation in the number of promoted resp.suppressed variables, but note that they propose l s -attacks for s ≥ 1, i.e., one does not change a single instance but in principle each instance/cell.Nevertheless, applying their method with an l 0 -constraint as already done for crafting sparse adversarial attacks (e.g., Su et al. [2019], Carlini and Wagner [2017]) would indeed exactly represent our setting.
To the best of our knowledge, a concise statement on the impact of such a fraction of outliers to model selection itself like how many relevant variables can be suppressed or how many nonrelevant variables can be promoted has not yet been proposed in literature.Therefore, we propose an optimistic and a pessimistic scenario concerning this impact.

Case-wise scenarios:
In this scenario, we assume that for a variable selection method with instance-BDP c, a number of cn outlying rows in the regressor matrix, the response matrix or the regressor matrix reduced to the relevant columns • is able to promote any subset of non-relevant variables resp.suppress any subset of relevant variables which means that we assume that this outlier fraction can indeed, at least theoretically, cause all relevant variables to be ignored (pessimictic case-wise scenario); • is able to targetedly suppress all relevant but promote only one single variable (optimistic case-wise scenario).

Cell-wise scenarios:
We assume that for a cell-BDP of c, a fraction of at least c of outlying cells in the regressor matrix, the response matrix or the regressor matrix reduced to the relevant columns • can promote min(p−s 0 , c) non-relevant variables where s 0 = |S 0 | resp.suppress any subset of relevant variables (pessimictic cell-wise scenario).This scenario is indeed extremely pessimistic since even in Li et al. [2020], Li et al. [2021], it seems that one at least has to manipulate the regressor matrix and the response vector which would at least require two outlying cells for an estimator with c = 0; • is able to targetedly suppress all relevant variables but to promote only one single variable (optimistic cell-wise scenario).

Threshold-based Stability Selection
We now quantify the robustness of threshold-based Stability Selection.We assume that there is a fixed selection of instances resp.cells in the data matrix which are contaminated, so we quantify the probability that the Stability Selection breaks down.
if the resamples are drawn by subsampling; ii) given by if the resamples are drawn by Bootstrapping.
Proof.First, note that the selection frequencies of the non-relevant variables are not important here, so one does not distinguish between the pessimistic and the optimistic scenario.According to our assumption, there is no immediate breakdown on the original data.A breakdown is achieved once each relevant variable no longer appears in the stable set, i.e., if all aggregated selection frequencies are below the threshold π.Regarding the variable j * = argmax j (π + j ), there are more than B(π + j * −π) sufficiently contaminated resamples required since each such resample can decrease the aggregated selection frequency by at most 1/B.Sufficiently contaminated means that at least cn sub instances in the resample have to be contaminated.Putting everything together, we get the stated probabilities.
2 Note that the column-wise outliers do not have to be considered in the theorem above since it is impossible to have a column-wise contamination rate higher than c without having a casewise contamination rate higher than c which already is a breakdown, so one does not have to distinguish between contamination in the relevant or non-relevant or response columns here.
The situation becomes somewhat different when considering cell-wise contamination as shown in the following theorem where we assume univariate responses.

Theorem 4.2. Let π be the threshold for the Stability Selection based on B resamples of size
n sub from a data set with n instances and let c be the cell-BDP of the applied model selection algorithm.Let, for i = 1, ..., n, a fixed selection of c i cells in instance i be contaminated, let Z l , l = 0, 1, ..., p + 1, be the number of instances for with c i = l and let m := i c i .Let further Z rel l , l = 1, ..., s 0 , denote the number of instances with l outlying cells in the relevant columns.Let m be the number of outliers in the response column.Let π+ j be the aggregated selection frequencies on the original data where π+ j ≥ π ∃j for j ∈ S 0 .Then, in both the optimistic and the pessimistic cell-wise scenario, the probability that the Stability Selection breaks down is i) 1 if the fraction of cell-wise outliers exceeds c in the relevant columns or in the response column iii) given by min(P 1 , P 2 , P 3 ) for where is the probability that when sampling n sub instances without replacement, one gets z 0 out of the Z 0 instances without cell-wise outliers, z 1 out of the Z 1 instances with one cell-wise outlier and so forth; for where p 2 := z0 ,...,zs 0 : l lz l ≥ cn sub s 0 f Z rel ,n sub (z 0 , ..., zs 0 ) for Z rel = (Z rel 0 , ..., Z rel s 0 ); and for P 3 being the quantity in Eq. 4.3, if the resamples are drawn by subsampling and if none the conditions in i) or ii) are satisfied; iv) given by min( P1 , P2 , P3 ) for where f M ult represents the density of a multinomial distribution with parameters n sub and for p2 := z0 ,...,zs 0 : l lz l ≥ cn sub s 0 f M ult (z 0 , ..., zs 0 ); and for P3 as in Eq. 4.4, if the resamples are drawn by Bootstrapping and if none the conditions in i) or ii) are satisfied.

Proof. i)+ii) Obvious and already partially discussed earlier.
iii) There are three ways how to achieve a breakdown as already mentioned when defining the cell-wise scenarios.P 1 quantifies the probability that due to subsampling, the fraction of cell-wise outliers in the whole data matrix becomes at least c while P 2 quantifies the analogous probability for the set of relevant columns.The stated probabilities p 1 and p 2 quantify that this happens for one resample, so P 1 and P 2 quantify that this happens in sufficiently many resamples.P 3 quantifies the probability that a fraction of at least c of the responses are contaminated which also can cause the wrong variables to be selected according to the assumption in the cell-wise scenarios.Assuming that m outliers are apparent in the response column, we get the same formula as in Thm.4.1.iv) As iii).with k response columns, one has to distinguish between the seemingly unrelated regression case (Zellner [1962]) where one fits a model for each response column separately and the general case that the response columns are correlated so that the entire response matrix enters as input for a unified model.In the second case, the probability of a breakdown would be 1 if the relative part of contaminated cells in the response matrix is at least c, in the first case however, the relative part of outliers has to be larger than c for one resp.for each response column if a breakdown of the set of the resulting k models is defined in the sense that at least one resp.each of the individual models break down.

Rank-based Stability Selection
The robustness of the rank-based Stability Selection additionally depends on the aggregated selection frequencies of the non-relevant variables.
Theorem 4.3.Let q be the pre-scribed number of stable variables for the Stability Selection based on B resamples of size n sub from a data set with n instances, m of them contaminated, and let c be the BDP of the applied model selection algorithm.Let π+ j for j ∈ S 0 resp.π− k for k ∈ {1, ..., p} \ S 0 be the aggregated selection frequencies on the original data where we assume that k I(π − k ≥ π+ j ) < q − 1 ∃j ∈ S 0 .a) In the pessimistic scenario, let s be the number of relevant variables in the top-q variables.Then, w.l.o.g., let π+ 1 , ..., π+ s be the corresponding aggregated selection frequencies of these variables on the original data, similarly, let π− 1 , ..., π− q−s be the aggregated selection frequencies of the (q − s) non-relevant variables among the top−q variables.Let π− q−s+1 , ..., π− q be the aggregated if the resamples are drawn by Bootstrapping.
Proof.On the original data, there is no immediate breakdown according to the assumption.A breakdown is achieved once each relevant instance has an aggregated selection frequency smaller than the aggregated selection frequencies of q non-relevant variables.a) Therefore, it suffices to have more than 0.5B(max j=1,...,s (π + j ) − min k=q−s+1,...,q (π − k )) ) contaminated resamples since in each one, both the non-relevant variables corresponding to π− k , k = 1, ..., q, can be promoted and at the same time, the relevant variables can be suppressed, so the quantities max j=1,...,s (π + j ) and min k=q−s+1,...,q (π − k ) move towards each other by 2/B steps per contaminated resample.Hence, after 0.5B(max j=1,...,s (π + j ) − min k=q−s+1,...,q (π − k )) such steps, they are equal or have already crossed, so having one more contaminated resample, the formerly best relevant variable definitely has a lower aggregated selection frequency than the formerly q-th best non-relevant variable.The probabilities are then computed as in Thm. 4.1. b) In the optimistic scenario, we can only targetedly promote one non-relevant variable.Therefore, it depends on the terms s, q, max j (π + j ) and all π− k for k = 1, ..., q, so one cannot make a universal precise statement.However, there are two extreme cases.If for max j=1,...,s the difference between max j (π + j ) and π− k for (q − 1) indices k from {1, ..., q} is exactly ∆/2, except for k * := argmin k=q−s+1,...,q (π − k ) for which it is ∆, then more than 0.5 B∆ contaminated samples suffice for a breakdown if the variable corresponding to π− k * is promoted in each of these resamples since the same reasoning as in a) applies, i.e., the selection probabilities for the worst non-relevant and best relevant variable move towards each other with a step size of 2/B.The other extreme case is that all π− k are equal.Then, if s > B∆ , even after promoting each of these non-relevant variables in one single contaminated resample does not suffice for a breakdown since there will be at least one remaining one whose aggregated selection frequency was still not promoted.In that case, we can treat π− k (in general, min k=q−s+1,...,q (π − k )) as threshold so that the results from Thm. 4.1 are applicable, so the relevant variables have to be suppressed in so many resamples such that the selection frequency of the best of them finally crosses the threshold.
2 Corollary 4.1.Let q be the pre-scribed number of stable variables for the Stability Selection based on B resamples of size n sub from a data set with n instances and let c be the cell-BDP of the applied model selection algorithm.Let, for i = 1, ..., n, c i cells in instance i be contaminated, let Z l , l = 0, 1, ..., p+1, be the number of instances for with c i = l and let m := i c i .Let further Z rel l , l = 1, ..., s 0 , denote the number of instances with l outlying cells in the relevant columns.Let π+ j for j ∈ S 0 resp.π− k for k ∈ {1, ..., p} \ S 0 be the aggregated selection frequencies on the original data where we assume that k I(π − k ≥ π+ j ) < q − 1 ∃j ∈ S 0 .Then, the probability that the Stability Selection breaks down is a) 1 if the fraction of cell-wise outliers exceeds c in the relevant columns or in the response column c) Let none of the conditions in a) or b) hold.In the pessimistic scenario, let the notation and assumptions be as in part a) of Thm.4.3.Then, the probability that the Stability Selection breaks down is i) given by min(P 1 , P 2 , P 3 ) for

Implications of the theorems
We learned from the results in the previous subsections that Stability Selection does not suffer from the numerical instabilities (cf.Salibián-Barrera et al. [2008] for this notion) of resampling that can lead to some resamples having a larger fraction of outlying instances/cells than the whole training data to that extent as standard bagged estimators do.
Apart from the simple observation that robustness is increased by Stability Selection, we can extract one interesting recommendation for the Stability Selection variant.For the thresholdbased Stability Selection, the robustness depends on the difference of the aggregated selection frequency of the best relevant variable and the threshold.It is however not evident that there even exists such a variable.This problem has been studied for example in Werner [2022] for noisy data.Due to the combination of a high noise level, a large number p of candidate variables and a rather sparse true model, one often faces the situation that no variable (including non-relevant variables) can pass the threshold, leading to an empty model, i.e., an immediate breakdown.The rank-based variant circumvents this issue.Moreover, when considering contamination, achieving a breakdown does not only require to let the aggregated selection frequencies drop below the threshold, but their ranks have to drop below q if the best q variables enter the stable model, which is more difficult.Therefore, we recommend to prefer the rank-based Stability Selection over the threshold-based Stability Selection in regard of variable selection robustness.Note that this does not contradict [Meinshausen and Bühlmann, 2010, Thm. 1] as the number q can be fixed empirically so that an appropriate bound is achieved, where the aggregated selection probability of the q-th best variable replaces the universal threshold.
In contrast to a simple bagged estimator, Stability Selection allows for more than the half of the instances/cells being contaminated without violating equivariance properties (cf.Davies and Gather [2005]) of the underlying algorithm that prevent BDPs from exceeding 0.5.For example, in machine learning, especially regression, one assumes that there is an underlying model from which the clean data have been generated.Even if the relative fraction of outlying instances exceeds a half, say it is 60%, then it is nevertheless a desirable goal to infer the model which describes the correspondence structure of the responses and the predictors of the clean observations.There is no qualitative hindrance to aim at finding the underlying model due to the standard assumption that outliers do not have structure and just stem from some unknown, arbitrary distribution.
In order to clarify the argumentation, consider the awkward case-wise contamination situation that the outliers had structure, i.e., additionally to the underlying model f : X → Y that relates the responses and the predictors of the clean observations one had another model g : X → Y from which the outlying instances are generated.In this artificial setting, one would indeed try to infer g instead of f and treat the actual clean instances as the outliers, more precisely, as the outliers w.r.t. the model g.
The main problem, even for a Stability Selection, would be that the probability to draw resamples that are sufficiently contaminated to let the applied algorithm break down would become rather high if the fraction of outlying instances already exceeds 0.5 on the original data.Therefore, an additional robustification step is necessary which will be discussed in the next section.

Robustifying Stability Selection
Let us recapitulate the SLTS algorithm from Alfons et al. [2013]: It computes the loss function L (the squared loss) for all instances and defines the "clean subset" that enters the next iteration as the h < n instances with the lowest in-sample losses.By iterating this strategy and by immunizing against running into a local minimum of the trimmed objective h i=1 (L(Y, X, β)) i:n by starting with multiple initial configurations, the SLTS outputs essentially the optimal sparse coefficients according to the empirical trimmed risk.
Regarding Stability Selection, the instances in a resample cannot be treated individually since one aims at aggregating column information from whole resamples.One can however individualize the resamples themselves and the corresponding selected sets of variables.In other words, we aim at lifting the trimming concept from instances in estimation problems to resamples in a model aggregation problem.

Related approaches
A more flexible approach than Bragging (Bühlmann [2012]) is trimmed Bagging (Croux et al. [2007]) where the trimmed mean of the estimators instead of their median is computed.In contrast to Croux et al. [2007], we do not bag classifiers nor other learners but predictor sets themselves.A close approach to our intended Trimmed Stability Selection is the approach in Zhang et al. [2019] where general ensemble variable selection techniques are pruned.Their motivation came from the perspective of strongly correlated members of the ensemble in contrast to the perspective of robustness as in our work.They extend the approach in Zhang et al. [2017] where the members with the highest prediction errors are trimmed in the sense that Zhang et al. [2019] define a so-called variable selection loss which is a squared distance between the aggregated selection frequencies (called "importances" there) and a reference importance vector.Zhang et al. [2019] are aware of the fact that the true importance vector is not known so they approximate it empirically.
A very important related algorithm is the fast and robust Bootstrap (FRB), initially introduced in Salibián- Barrera and Zamar [2002], Salibián-Barrera et al. [2008] for regression MMestimators.The main idea is that standard Bootstrapping of robust estimators would take a long computational time and that a simple uniform Bootstrapping may would result in having resamples with more than half of the instances being contaminated with a considerable probability.They derived a linear update rule for the robust estimator that downweights outlying instances.These weights are derived by a robust MM-estimator which allows for identifying such instances according to the absolute value of their regression residual.Model selection using the FRB has been proposed by Salibián-Barrera and Van Aelst [2008].The idea is to approximate the prediction error of the model built on a particular set of predictors by using FRB and to select the predictor set for which the prediction error was minimal.Their strategy suffers from the lack of scalability for data sets with a large number of predictors since they have to compute the prediction error for all possible models whose total number is 2 p −1, and even their backward strategy become infeasible for high p.

Trimmed Stability Selection
In this paper, we do not consider validation data since assuming that the training data are contaminated while the validation data are clean would be awkward, so we intend to measure the quality of the resample-specific models on an in-sample loss basis, similarly as outlying instances are detected by their individual in-sample loss as for example in Alfons et al. [2013].More precisely, if the contamination of a certain resample has caused the algorithm to select wrong variables, the in-sample loss should be high compared to another resample where a sufficiently well model has been selected which contains enough of the true predictors.(5.3) for all j = 1, ..., p, and let πγ := (π γ j ) p j=1 .The actual identification of the stable set works as usual, based on the πγ j .The Trimmed Stability Selection (TrimStabSel) is described by the following algorithm: Now, we have to analyze the effect of this trimming procedure on the Stab-BDP.We abstain from detailing out each of the cases considered in Subsections 4.3 and 4.4 again but formulate a universal result which can be easily adapted to all the individual cases.

Simulation study
We now investigate the impact of our proposed outlier scheme on model selection and the performance of TrimStabSel.
We consider a variety of scenarios that differ by n, p, the fraction of outlying cells and the signal to noise ratio (SNR).In all scenarios, there are s 0 = 5 relevant variables and the corresponding components of the coefficient β are i.i.d.N (4, 1)-distributed.The cells of the regressor matrix are i.i.d.N (5, 1)-distributed.In the regression settings, the responses are computed by Y i = X i β + i with i ∼ N (0, σ 2 ) i.i.d. for all i where σ 2 is set so that a specific signal to noise ratio is valid.In the classification settings, we compute Xβ and η i := exp(X i β)/(1 + exp(X i β)) where As there is no opportunity to simulate a target SNR, we distinguish three situations by drawing β j ∼ N (µ, 1) so that a higher |µ| corresponds to a higher SNR.As for the outlier configuration, we consider different m ≤ n and for each m, we randomly select m rows of the regressor matrix so that in each of these rows, the value in the cells corresponding to the five relevant variables are replaced by zero.
We evaluate the performance of the Stability Selection variants by computing the mean true positive rate (TPR) over V repetitions where for each v = 1, ..., V , we generate an independent data set.Moreover, we compute the fraction of breakdowns, i.e., where no relevant variable has entered the stable model as well as the fraction of cases where the stable model is perfect, i.e., it consists only of the five relevant variables.These quantities are then plotted against the number m of outlying instances in a single graphic, separately for each SNR.

(Trimmed) Stability Selection with L 2 -Boosting
The scenario specifications are given in Table 1.We consider the SNRs 1, 2 and 5 and for each SNR and each value for m from the set given in Table 1, we generate V = 1000 independent data sets and apply all four Stability Selection variants specified in Table 1.We use the function glmboost from the R-package mboost (Hothorn et al. [2017]) with family=Gaussian(), 100 iterations and a learning rate of 0.1.The results in Fig. 1 and Fig. 2 show the non-surprising facts that the TPR increases with increasing SNR and that the TPR curves and the curves representing the relative frequency of  At small contamination rates, for scenario 1 until around 12% and for scenario 2 until around 7%, TrimStabSel considerably improves model selection, especially for a high SNR.For example, the relative breakdown frequency in scenario 1 with an SNR of 5 and m = 4 is around 10 times as high for the non-trimmed Stability Selection as for the third variant of TrimStabSel while the TPR is three times as high and even more than three times as high for m = 3.The results in scenario 2-4 look similar as for scenario 1.For an SNR of 5, one can observe even near perfect results for at least the third TrimStabSel variant for low contamination rates up to around 3%.

(Trimmed) Stability Selection with LogitBoost
We use the same specifications as in Table 1, but as we cannot targetedly let the data have some specified SNR, we generate the relevant β j according to a N (µ, 1)-distribution with µ ∈ {1, 4, 8} where higher means make, in expectation, the signals stronger and the SNR higher.Again, we use V = 1000.We again use glmboost, here with family=Binomial(link='logit') and let the other hyperparameters be as for L 2 -Boosting.
The results look similarly as in the previous subsection.One can observe a more compressed shape of the TPR curve for the case µ = 1 while the curves corresponding to the different Stability Selection variants show a considerable margin.In our opinion, these behaviours are a direct consequence of the SNRs.The empirical mean SNRs on our data sets, computed as V ar(Xβ)/V ar(Y ) (the reciprocal value of the noise to signal ratio from Friedman et al. [2001]), is between 30 and 40 for µ = 1 and more than 1200 for µ = 8.Although the interpretation of this SNR is not identical to the interpretation of the SNR in the regression setting, the margins between the curves for high values of µ reflect the behaviour from the previous subsection, here even stronger.
We only consider regression scenarios here, i.e., we use family=Gaussian() in the sparseLTS   It should be noted that the TPR starts with lower values for m = 0 than in the previous experiments.In particular, the relative fraction of perfect models is extremely low for scenario 1, even without contamination and with an SNR of 5, in contrast to the experiments with L 2 -Boosting.The reason could be that, similarly as in TrimStabSel where the trimming decreases the evidence and makes the stable set therefore more fragile, trimming instances away in SLTS decreases the evidence of the fitted model and therefore its performance.The loss in efficiency of robust methods seems to carry over to model selection itself.
One can observe that trimming eventually pays off, but due to the fraction of broken models already being very high resp.the mean TPR already being very low, the improvement itself granted by TrimStabSel may no longer be reasonable as the relative fraction of broken models is still very high, for example, in scenario 2 with an SNR of 2, 99% of the models have broken  In order to investigate the impact of the number of subsamples, we additionally propose three further TrimStabSel configurations in scenarios 1 and 2 where B = 1000 and α ∈ {0.75, 0.9, 0.95}, calling these scenarios 1' and 2'.The results are depicted in Fig. 6.One can observe that the mean TPR is considerably higher compared to the mean TPR achieved in scenario 1 and 2, even better than for the non-trimmed Stability Selection for low contamination rates.This is a consequence of the increased evidence as we aggregate 50 up to 250 models in contrast to only 10 up to 50 models as before.Another side effect is the reduced margin between the curves corresponding to the non-trimmed Stability Selection and the TrimStabSel variants, including a smaller robustness gain in regions where the contamination rate even overcomes SLTS.The eye-catching curve of the relative breakdown rate for the third TrimStabSel variant in Fig. 5 however does no longer differ much from that of the non-trimmed Stability Selection in high contamination regions in Fig. 6.This can be explained by solely focusing on a very few number of (non-broken) models in scenarios 1 and 2 for the third TrimStabSel variant.

Conclusion
We intended to make a step towards the unification of sparse model selection, robustness and stability in order to lift the understanding of robustness from the rows of a data matrix to the columns and investigated how contamination can affect model selection.We started with the introduction of the variable selection breakdown point and an outlier scheme which allows a very small number of contaminated cells to completely distort variable selection, making a robustification in the usual sense that provides coefficients whose norm is always bounded obsolete if no relevant variable is considered.
We extended the notion of the resampling breakdown point, which quantifies the relative fraction of outlying instances so that the probability that a resample is contaminated too much exceeds some threshold, by the Stability Selection BDP which we computed for different scenarios where we postulate different effects of outliers onto model selection due to the absence of concrete results in literature.Our analysis reveals that a Stability Selection where the stable set is given by the best q variables for a pre-defined q can be expected to be more robust than the standard threshold-based Stability Selection.We want to emphasize that in our experiments, a robust model selection algorithm seems to show inferior performance than a non-robust model selection algorithm if applied on clean data.
Although it is well-known that robust algorithms are less efficient in terms of asymptotic covariance, this loss in efficiency seems to carry over to variable selection itself.
Future research is necessary in order to study the potential of outliers for targeted variable promotion or suppression further.Although our proposed outlier schemes seem to be artificial so that they most probably would not occur by chance, one has to be aware of the attacking paradigm emerging from the deep learning community.Similarly as popular situations where models have to be inferred before attacks can be crafted (see, e.g., Papernot et al. [2017]), one could intercept a data transfer, try to detect relevant variables and suppress them targetedly or try to detect certainly non-relevant variables (for example by Sure Independence Screening, see Fan and Lv [2008]) in order to targetedly promote them.

Bibliography
This paper is organized as follows.Section 2 compiles relevant notions from Robust Statistics which are contamination models and the breakdown point.We also recapitulate the concept of Stability Selection.Section 3 is devoted to the definition of our variable selection BDP and to a discussion on resampling BDPs which are required for our Stability Selection.Our BDP concept for Stability Selection is introduced in Section 4. Section 5 presents the Trimmed Stability Selection.Section 6 provides a detailed simulation study that compares the performances of the standard Stability Selection and the Trimmed Stability Selection on simulated data with L 2 -Boosting, LogitBoost and SLTS as model selection algorithms.

Definition 3. 1 .
Let D be a data set with n instances and predictor dimension p.Let k ∈ N be the dimension of the responses.a) The case-wise variable selection breakdown point (case-VSBDP) is given by

Proposition 3. 1 .
Let B be the number of resamples and let S n be some estimator, mapping onto an unbounded domain, for a data set with n observations.If the (classical) BDP of the estimator is c, so it is for the bagged estimator.Proof.Since the BDP of the estimator is c, manipulating a relative fraction of c instances in one single resample b suffices to let the estimator S (b) n on this resample to break down, i.e., the norm of the estimated value is fully controlled by the outliers.Then, as the bagged estimator being the empirical mean of all S (b) n , b = 1, .., B, its norm is also fully controlled.2 The proof is rather unusual since it assumes that one can targetedly contaminate a selected resample.Usually, the attacker should only have access to the whole training data set.Then, a probabilistic statement in the spirit of the resampling BDP becomes more appropriate.Proposition 3.2.Let B be the number of resamples and let S n be some estimator, mapping onto an unbounded domain, for a data set with n observations.If the (classical) BDP of the estimator is c, the resampling BDP of the bagged estimator equals (ε ⊥ (c, n sub , B, α)) * for ⊥= Boot for Bootstrap resp.for ⊥= Subs for subsampling.Proof.Evidently, if at least one resample is contaminated to an extent such that the estimator breaks down, due to the unbounded domain and the mean aggregation, the bagged estimator breaks down.By definition of the resampling BDP, this happens with a probability of more than α if the fraction of contaminated rows is (ε ⊥ (c, n sub , B, α)) * . 2 Remark 3.1.a) These results can be trivially extended to the case of cell-wise contamination.
Proposition 3.3.a) Let B be the number of resamples and let S n be some estimator, mapping onto a bounded domain, for a data set with n observations from which no one takes any of the boundary values.If the (classical) BDP of the estimator is c, the resampling BDP of the bagged estimator is Definition 4.1.Let n be the number of instances in a data set and let n sub be the number of instances in each resample.Let B be the number of resamples for the Stability Selection and let R be the resampling distribution.Let S stab (⊥, n sub , B, Z n ) denote the stable set derived from B resamples according to ⊥= Boot or ⊥= Subs with n sub instances from the data set Z n .a) Then the case-wise Stability Selection BDP (case-Stab-BDP) for tolerance level α is given by ε * Stab (⊥, c, n sub , B, α) := min m n P R (∀j ∈ S 0 : j / ∈ S stab (⊥, n sub , B, Z m n )) ≥ α (4.1)where c represents the case-BDP of the underlying model selection algorithm.b) Similarly, the cell-wise Stability Selection BDP (cell-Stab-BDP) for tolerance level α is given by

Theorem 4. 1 .
Let π be the threshold for the Stability Selection based on B resamples of size n sub from a data set with n instances, a fixed selection of m of them contaminated, and let c be the BDP of the applied model selection algorithm.Let π+ j be the aggregated selection frequencies on the original data where π+ j ≥ π ∃j for j ∈ S 0 .Then, in both the optimistic and the pessimistic case-wise scenario, the probability that the Stability Selection breaks down is i) given by 2 with p1 and p2 from Thm. 4.2 or in the interval given in Eq. 4.8 if the resamples are drawn by Bootstrapping.Proof.Parts a), b) and c) directly follow from Thm.s 4.3 and 4.2.Statement d) is more tricky since one has intervals due to the argumentation in part b) of the proof of Thm.4.3.For a concrete data set and a concrete model selection algorithm, one value in the respective intervals is realized so that the probability of a breakdown of the Stability Selection is the minimum. 2 Our trimmed Stability Selection works as follows.We generate B resamples of size n sub from the training data and apply the model selection algorithm, for example, Lasso or Boosting, on each resample which selects a model Ŝ(b) ⊂ {1, ..., p} for b = 1, ..., B and which computes coefficients β(b) .Let I (b) be the index set of the rows that have been selected by the resampling algorithm, i.e., I (b) ∈ {1, ..., n} n sub for Bootstrapping resp.I (b) ⊂ {1, ..., n} with |I (b) | = n sub and I ∀k = l, k, l = 1, ..., n sub , for subsampling.For the b-th resample, we compute the infunction L : Y × Y → [0, ∞[.Although, in regression, losses are usually continuous, wecannot exclude that there are ties, for example, if two resamples are identical or if a trimmed loss is used.In this case, we perform random tie breaking.Note that there is no evidence that the models computed on identical resamples are identical since the model selection algorithm may include stochastic components like the SLTS whose result depends on the randomly chosen initial configurations.For the setI trim (γ) := b ∈ {1, ..., B} B b =1 I(L (b ) ≥ L (b) ) ≤ γB (5.2)with the trimming rate γ ∈ [0, 1[, we compute the trimmed aggregated selection frequencies πγ j = 1 B − γB b∈{1,...,B}\I trim (γ) I(j ∈ Ŝ(b) )

Figure 1 :
Figure 1: Results for scenarios 1 and 2 with L 2 -Boosting as model selection algorithm.Solid lines represent the TPR, dashed lines the relative frequencies of a breakdown and dotted lines the relative frequencies perfect stable models.The black lines correspond to the non-trimmed Stability Selection and the red, green and blue lines to the first, second and third configuration of TrimStabSel, as specified inTable 1, respectively.

Figure 2 :
Figure 2: Results for scenarios 3 and 4 with L 2 -Boosting as model selection algorithm.

Figure 3 :
Figure 3: Results for scenarios 1 and 2 with LogitBoost as model selection algorithm.

Figure 4 :
Figure 4: Results for scenarios 3 and 4 with LogitBoost as model selection algorithm.

Figure 5 :
Figure 5: Results for scenarios 1 and 2 with SLTS as model selection algorithm.

Figure 6 :
Figure 6: Results for scenarios 1' and 2' with SLTS as model selection algorithm.

Figure 7 :
Figure 7: Results for scenarios 3 and 4 with SLTS as model selection algorithm.

Finally, we propose
a Trimmed Stability Selection which considers only the best resamples, based on the in-sample losses, when aggregating the models.A simulation study reveals the potential of this Trimmed Stability Selection to robustify model selection, although it evidently inherits the necessity to find appropriate hyperparameter configurations.The simulations also prove the alarming fragility of variable selection, even for an extremely low number of outlying cells, if the outliers are targetedly placed onto the relevant columns.In particular, regarding the rapid performance decrease of the non-trimmed Stability Selection with L 2 -Boosting as model selection algorithm, one has to keep in mind that even 2 resp.5 outlying instances in scenario 1 and 2 resp.3 and 4 suffice to let nearly no stable model be perfect, accompanied with a somewhat decreased mean TPR, so the cell-wise contamination rates range from 25/100500 in scenario 4 to 10/1275 in scenario 1.Although the further robustification of an SLTS-based Stability Selection in principle workswhich is revealed by our simulations that lead to higher mean TPRs and lower breakdown frequencies, it may not be recommended in practice.First, the trimmed Stability Selection tends to show a decreased performance in the presence of low contamination rates since SLTS can handle these configurations by its own internal trimming, similarly as a robust algorithm loses efficiency in non-contaminated settings.Second, if SLTS breaks down for large contamination rates, slightly larger contamination rates will even overcome a trimmed Stability Selection unless the number of resamples would be infeasibly high.We recommend to consider a trimmed Stability Selection in situations where the contamination rate can be expected to be low.In such settings, the trimmed Stability Selection with a nonrobust model selection algorithm like L 2 -Boosting shows a significant improvement concerning mean TPR, breakdown rate and the relative number of perfect stable models in contrast to the non-trimmed Stability Selection, while being very easy to implement.This avoids the application of robust model selection algorithms which are computationally more expensive and which alone do not follow the stability paradigm which would in fact necessitate to even apply a Stability Selection with a robust model selection algorithm which would be computationally very expensive.
Stability Selection is an ensemble model selection technique that has been introduced in Nogueira et al. [2017b]6]ty of feature selection resp.featureranking has been done inNogueira and Brown [2016],Nogueira et al. [2017a],Nogueira et al. [2017b].Nogueira et al. [2017b]point out that stable feature selection is either represented by a hard subset selection of the candidate variables or by a ranking of the variable or individual weights which, given some threshold for the ranks resp.the weights, eventually leads to a subset of variables.The cited works propose similarity metrics in order to quantify the stability of feature selection resp.feature ranking.
9) for subsampling.b) For Bragging, the resampling BDP becomes inf{ε ∈ {0, 1/n, ..., 1} | P (Bin(B, P (Bin(n sub , ε) ≥ cn sub )) ≥ (B + 1)/2 ) > α} (3.10) for Bootstrapping resp.inf{ε ∈ {0, 1/n, ..., 1} | P (Bin(B, P (Hyp(n, n − εn, n sub ) ≤ (1 − c)n sub )) ≥ (B + 1)/2 ) > α} (3.11) for subsampling.Proof.a) If at least one estimator does not break down, it takes a value in the interior of the domain by assumption.Hence, the aggregated estimated value will also lie in the interior of the domain, so there is no breakdown.A breakdown occurs if and only if all estimators break down 2 This is an extremely strange result since the median aggregation (and any other trimmed aggregation) is less robust than the mean aggregation.This artifact, resulting from a wrong notion of robustness for estimation, shows that a breakdown has to be defined very carefully if the domain is bounded, providing another argument why the concept from the rejoinder of Davies and Gather [2005] to map the boundary values to infinite values is necessary.4 Stability Selection and robustness As Stability Selection does not aggregate coefficients but just indicator functions, the influence of a model computed on a single resample is bounded by 1/B in the sense that the aggregated selection frequencies computed on the original data can be at most distorted by 1/B (for certain j) if one single resample is (sufficiently) contaminated.stage, one draws again resamples from the rows but for each resample one draws columns according to the importance values, computes again the Lasso on each subsample and aggregates the coefficients.Neither the randomization over the columns nor the computation of the importance provides any robustness advantage.One just needs to be able to produce an arbitrarily large coefficient for a non-relevant variable on one single resample.This will cause the importance of this column to be arbitrarily close to 1 in this component and therefore arbitrarily close to 0 in all other components, especially those corresponding to the relevant variables which therefore will not be selected in the second stage.Therefore, the variable selection robustness of Random Lasso can be quantified by the resampling BDP (ε ⊥ (1/n, n sub , B, α)) Wang et al. [2011]m Lasso).Before we analyze the Stab-BDP, we want to emphasize that this robustification effect gets lost if the model aggregation is based on values from unbounded sets.An exemplar for this model type is the Random Lasso fromWang et al. [2011]which, for b = 1, ..., B, does not only draw resamples from the rows but also randomly selects several columns from a uniform distribution over the columns.Then, a Lasso model is computed on each resample, leading to coefficients β(b) .The importance of the columns is then computed by | b β(b) |/B and in a second * for case-wise outliers and by (ε ⊥ (1/(n(p + 1)), n sub , B, α)) * for cell-wise outliers, provided that one outlying instance resp.cell can promote a non-relevant variable.