1 Introduction

Motivation

Feature-selection methods are widely used, for several reasons. By reducing the dimensionality of datasets, they lower the computational cost and the memory requirements of prediction models. Prediction models also tend to generalize better without irrelevant and spurious predictors. While some model types can implicitly select relevant features, others cannot and thus benefit from explicit feature selection. In addition, simpler prediction models improve interpretability [42].

Most conventional feature-selection methods return one feature set only [6]. These methods optimize a criterion of feature-set quality, e.g., prediction performance. However, besides the optimal feature set, there might be other, differently composed feature sets with similar quality. Such alternative feature sets are interesting for users, e.g., to obtain several diverse explanations. Alternative explanations can provide additional insights into predictions, enable users to develop and test different hypotheses, appeal to different kinds of users, and foster trust in the predictions [34, 78].

For example, in a dataset describing physical experiments, feature selection may help to discover relationships between physical quantities. In particular, highly predictive feature sets indicate which input quantities are strongly related to the output quantity. Domain experts may use these feature sets to formulate hypotheses on physical laws. However, if multiple alternative sets of similar quality exist, further analyses and experiments may be necessary to reveal the true underlying physical mechanism. Only knowing one predictive feature set and using it as the only explanation is misleading in such a situation.

Problem statement

This article addresses the problem of alternative feature selection, which we informally define as follows: Find multiple, sufficiently different feature sets that optimize feature-set quality. We provide formal definitions in Sect. 3.2. This problem entails an interesting trade-off: Depending on how different the alternatives should be, one may have to compromise on quality. In particular, a stronger dissimilarity requirement may force the alternatives to include more low-quality features. Users should be able to control the dissimilarity and hence this trade-off.

Related work

In the field of feature selection, only a few methods yield multiple, diverse feature sets [6]. In particular, there are techniques for ensemble feature selection [69, 71] and statistically equivalent feature subsets [39]. These approaches do not guarantee the diversity of the feature sets, nor do they let users control diversity. Obtaining multiple, diverse solutions is also an objective in other fields, e.g., clustering [4, 29, 51], subgroup discovery [76], subspace search [74], and explainable AI [1, 33, 68]. These approaches are not directly applicable or easily adaptable to feature selection, and most of them provide limited or no user control over alternatives, as we will elaborate in Sect. 6.

Contributions

Our contribution is fourfold.

(1) We formalize alternative feature selection as an optimization problem. In particular, we define alternatives via constraints on feature sets. This approach is orthogonal to the feature-selection method itself. As a wide variety of such methods exists [10, 42], users can choose a method according to their needs/preferences. This approach also allows integrating other constraints on feature sets, e.g., to capture domain knowledge [2, 21]. Finally, this approach gives users control over the search for alternatives with two parameters, i.e., the number of alternatives and a dissimilarity threshold.

(2) We analyze the computational complexity of this problem. We show \(\mathcal{N}\mathcal{P}\)-hardness already for a simple notion of feature-set quality. On the other hand, the runtime is polynomial if the sizes of the feature sets and the number of alternatives are fixed.

(3) We study how to solve this optimization problem. Specifically, we describe how to integrate conventional feature-selection methods into the objective function. The resulting optimization problems are either white-box, which one can tackle with an integer-programming optimizer, or black-box, for which we propose a constraint-aware heuristic search procedure.

(4) We evaluate alternative feature selection with comprehensive experiments. We use 30 datasets representing binary-classification problems from the Penn Machine Learning Benchmarks (PMLB) [60, 67] and five feature-selection methods. We focus on the feature-set quality of alternatives relative to the user parameters. We publish all our code and experimental data online.

Experimental results

We observe that several factors influence the quality of alternatives, namely the dataset, the feature-selection method, the notion of feature-set quality, and the user parameters controlling the search for alternatives. We make several observations and recommendations, such as: (a) Quality-wise, the prediction performance of feature sets may not correlate with the quality assigned by feature-selection methods. In particular, seemingly bad alternatives regarding the latter might still be good regarding the former. (b) As feature-set quality tends to decrease with an increasing number of alternatives and with an increasing dissimilarity threshold, these parameters indeed let users control the trade-off between the diversity of feature sets and their quality. If the parameter values are too strict, valid alternatives might not exist at all. (c) Computationally, a sequential search for alternatives has turned out to be significantly faster than searching for multiple alternatives simultaneously while yielding similar quality.

Outline

Section 2 introduces notation and fundamentals. Section 3 describes and analyzes alternative feature selection. Section 4 outlines our experimental design. Section 5 presents the experimental results. Section 6 reviews related work. Section 7 concludes. Appendix A contains technical details and some of the proofs.

2 Fundamentals

In this section, we introduce basic notation (Sect. 2.1) and review different methods to measure the quality of feature sets (Sect. 2.2).

2.1 Notation

\(X \in {\mathbb {R}}^{m \times n}\) is a dataset represented as a matrix. Each row is a data object, and each column is a feature. \({\tilde{F}} = \{f_1, \dots , f_n\}\) is the set of feature names. We assume that categorical features have been encoded numerically, e.g., via one-hot encoding. \(X_{\cdot {}j} \in {\mathbb {R}}^m\) denotes the vector representation of the j-th feature. \(y \in Y^m\) represents the prediction target with domain Y, e.g., \(Y=\{0,1\}\) for binary classification or \(Y={\mathbb {R}}\) for regression. Feature selection makes a binary decision \(s_j \in \{0,1\}\) for each feature, i.e., either selects it or not. The vector \(s \in \{0,1\}^n\) combines all these selection decisions and yields the selected feature set \(F_s = \{f_j \mid s_j=1\} \subseteq {\tilde{F}}\). To simplify notation, we drop the subscript s in definitions where we do not explicitly refer to the value of s but only the set F. The function \(Q(s,X,y)\) denotes the feature-set quality, which, without loss of generality, should be maximized.

2.2 Measuring feature (set) quality

There are different ways to evaluate feature-set quality \(Q(s,X,y)\). We only give a short overview here; see [10, 42, 59] for comprehensive studies and surveys of feature selection. Also, note that we assume a supervised feature-selection scenario, i.e., feature-set quality depending on a prediction target y. In principle, our definitions of alternatives also apply to an unsupervised scenario. Since the prediction target only appears in the function \(Q(s,X,y)\), one could replace \(Q(s,X,y)\) with \(Q(s,X)\), i.e., an unsupervised notion of quality.

A conventional categorization of feature-selection methods distinguishes between filter, wrapper, and embedded methods [24]. Filter methods evaluate feature-set quality without prediction models. Univariate filters assess each feature independently, e.g., using the correlation of the feature with the prediction target. Multivariate filters, like CFS [25, 26], FCBF [81], and mRMR [63], assess feature sets as a whole, also considering interactions like redundancies between features. Wrapper methods [35] search over feature sets with a black-box optimization strategy, e.g., genetic algorithms, and assess feature-set quality with prediction models trained on candidate feature sets. Embedded methods train prediction models with built-in feature selection, e.g., decision trees [7] or random forests [8]. Thus, the criterion for feature-set quality is model-specific.

Besides conventional feature selection, post hoc feature importance methods play a major role in the field of machine-learning interpretability [9, 48]. These methods assess feature importance after training a prediction model and can be instance-specific, like LIME [65] or SHAP [44], or describe the entire dataset, like permutation importance [8] or SAGE [12].

3 Alternative feature selection

In this section, we introduce and analyze alternative feature selection. First, we define the overall structure of the optimization problem, i.e., objective and constraints (Sect. 3.1). Second, we formalize the notion of alternatives via constraints (Sect. 3.2). Third, we discuss different objective functions corresponding to different feature-set quality measures from Sect. 2.2. In particular, we describe how to solve the resulting optimization problem (Sect. 3.3). Fourth, we analyze the computational complexity of the optimization problem (Sect. 3.4).

3.1 Optimization problem

Alternative feature selection has two goals. First, the quality of alternatives should be high. Second, alternative feature sets should differ from each other, i.e., contain different features. There are different ways to combine two goals in an optimization problem: First, one can consider both goals as objectives in a multi-objective problem. Second, one can treat one goal as the objective and enforce the other with constraints. Third, one can define constraints for both goals, searching for feasible solutions instead of optimizing. We stick to the second formulation and maximize feature-set quality subject to feature sets being alternative. This formulation allows us to keep the original objective function of feature selection. Thus, users do not need to specify a threshold on feature-set quality but control how alternative the feature sets must be instead. We obtain the following optimization problem (Eq. 1) to find a single alternative feature set \(F_s\):

$$\begin{aligned} \begin{aligned} \max _s&\quad Q(s,X,y) \\ \text {subject to:}&\quad F_s~\text {being alternative} \end{aligned} \end{aligned}$$
(1)

In the following, we discuss different objective functions \(Q(s,X,y)\) and suitable constraints for being alternative. Additionally, many feature-selection methods limit the feature-set size \(|F_s|\) to a user-defined value \(k \in {\mathbb {N}}\), which adds further, simple constraints.

3.2 Constraints—defining alternatives

We now formalize alternative feature sets via constraints, considering single (Sect. 3.2.1) and multiple alternatives (Sect. 3.2.2). All our definitions are independent of the feature-selection method.

3.2.1 Single alternative

We consider a feature set an alternative to another set if it differs sufficiently. We express this notion with a set-dissimilarity measure [11, 15]. These measures typically quantify how strongly two sets \(F'\), \(F''\) overlap and relate this to their sizes. For instance, a well-known set-dissimilarity measure is the Jaccard distance, which is defined as follows (Eq. 2) for the feature sets \(F'\) and \(F''\):

$$\begin{aligned} \begin{aligned} d_{\text {Jacc}}(F',F'')&= 1 - \frac{|F' \cap F''|}{|F' \cup F''|} \\&= 1 - \frac{|F' \cap F''|}{|F'| + |F''| - |F' \cap F''|} \end{aligned} \end{aligned}$$
(2)

In this article, we use a dissimilarity measure based on the Dice coefficient, as defined by Eq. (3):

$$\begin{aligned} d_{\text {Dice}}(F',F'') = 1 - \frac{2 \cdot |F' \cap F''|}{|F'| + |F''|} \end{aligned}$$
(3)

Other measures are also possible. Our definitions of alternatives only assume non-negativity, i.e., \(d(F',F'') \ge 0\), and symmetry, i.e., \(d(F',F'')=d(F'',F')\). We define single alternatives as follows:

Definition 1

(Single alternative) Given a symmetric, non-negative set-dissimilarity measure \(d(\cdot )\) and a dissimilarity threshold \(\tau \in {\mathbb {R}}_{\ge 0}\), a feature set \(F'\) is an alternative to a feature set \(F''\) (and vice versa) if \(d(F',F'') \ge \tau \).

The threshold \(\tau \) controls how dissimilar alternative feature sets should be. The larger \(\tau \), the more feature-set quality may drop. However, depending on the dataset, the same value of \(\tau \) may have a stronger or weaker effect on feature-set quality. For example, certain datasets may encompass numerous features with similar levels of utility, thereby allowing for numerous alternatives of comparable quality. In contrast, predictions on other datasets may hinge on a few key features. Further, only users can decide which drop in feature-set quality is acceptable as a trade-off for obtaining alternatives. Thus, we leave \(\tau \) as a user parameter. For normalized dissimilarity measures as in Eqs. 2 and 3, the interpretation of \(\tau \) is user-friendly: Setting \(\tau =0\) allows identical alternatives, while \(\tau =1\) forbids any overlap. Users can also experiment with various values, such as performing a binary search across the range of \(\tau \).
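To make these measures concrete, the following minimal Python sketch computes the Jaccard and Dice dissimilarities for two feature sets and checks the condition of Definition 1; the function names and the example sets are ours, purely for illustration.

```python
def jaccard_dissimilarity(f1: set, f2: set) -> float:
    """Jaccard distance between two feature sets (Eq. 2)."""
    if not f1 and not f2:
        return 0.0
    return 1 - len(f1 & f2) / len(f1 | f2)


def dice_dissimilarity(f1: set, f2: set) -> float:
    """Dice dissimilarity between two feature sets (Eq. 3)."""
    if not f1 and not f2:
        return 0.0
    return 1 - 2 * len(f1 & f2) / (len(f1) + len(f2))


def is_alternative(f1: set, f2: set, tau: float, d=dice_dissimilarity) -> bool:
    """Definition 1: two sets are alternatives if their dissimilarity reaches tau."""
    return d(f1, f2) >= tau


# Example: two sets of size 5 sharing two features.
print(dice_dissimilarity({"f1", "f2", "f3", "f4", "f5"},
                         {"f1", "f2", "f6", "f7", "f8"}))  # 0.6
print(is_alternative({"f1", "f2", "f3", "f4", "f5"},
                     {"f1", "f2", "f6", "f7", "f8"}, tau=0.5))  # True
```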

When implementing Definition 1, the following proposition enables the use of a broad range of solvers to tackle the related optimization problem:

Proposition 1

(Linearity of constraints) Using the Dice dissimilarity (Eq. 3), one can express alternative feature sets (Definition 1) with 0–1 integer linear constraints.

Proof

We re-arrange terms in the Dice dissimilarity (Eq. 3) to get rid of the quotient of feature-set sizes, obtaining Eq. (4):

$$\begin{aligned} \begin{aligned} d_{\text {Dice}}(F',F'')&\ge \tau \\ \Leftrightarrow \quad |F' \cap F''|&\le \frac{1 - \tau }{2} \cdot (|F'| + |F''|) \end{aligned} \end{aligned}$$
(4)

Next, we express set sizes in terms of the feature-selection vector s, obtaining Eq. (5):

$$\begin{aligned} \begin{aligned} |F_s|&= \sum _{j=1}^n s_j \\ |F_{s'} \cap F_{s''}|&= \sum _{j=1}^n s'_j \cdot s''_j \end{aligned} \end{aligned}$$
(5)

Finally, we replace each product \(s'_j \cdot s''_j\) with an auxiliary variable \(t_j\), bound by additional constraints, to linearize it [49], obtaining Eq. (6):

$$\begin{aligned} \begin{aligned} t_j&\le s'_j \\ t_j&\le s''_j \\ 1 + t_j&\ge s'_j + s''_j \\ t_j&\in \{0,1\} \end{aligned} \end{aligned}$$
(6)

Only sums of variables and products with constants remain, so the constraints are linear. Further, if one feature set is known, i.e., either \(s'\) or \(s''\) is fixed to particular values, Eq. (5) is already linear without Eq. (6). \(\square \)

As an alternative formulation, one could also encode the 0–1 integer linear constraints into propositional logic (SAT) [75].
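For illustration, the following sketch encodes Eqs. (4)–(6) for two simultaneously optimized feature sets of fixed size k, using the mixed-integer solver interface of Google OR-Tools (the solver setup we also use later, Sect. 4.5); the concrete numbers and variable names are arbitrary, and the objective is omitted since Sect. 3.3 discusses concrete choices.

```python
# Minimal sketch: linearized Dice constraint (Eqs. 4-6) for two feature-selection
# vectors s' and s'' of fixed size k, built with OR-Tools' SCIP interface.
from ortools.linear_solver import pywraplp

n, k, tau = 20, 5, 0.6
solver = pywraplp.Solver.CreateSolver("SCIP")

s1 = [solver.BoolVar(f"s1_{j}") for j in range(n)]  # first feature set
s2 = [solver.BoolVar(f"s2_{j}") for j in range(n)]  # second feature set
t = [solver.BoolVar(f"t_{j}") for j in range(n)]    # t_j linearizes s1_j * s2_j

solver.Add(solver.Sum(s1) == k)  # fixed feature-set sizes
solver.Add(solver.Sum(s2) == k)
for j in range(n):               # Eq. (6): enforce t_j = s1_j * s2_j
    solver.Add(t[j] <= s1[j])
    solver.Add(t[j] <= s2[j])
    solver.Add(1 + t[j] >= s1[j] + s2[j])

# Eq. (4): |F' cap F''| <= (1 - tau) / 2 * (|F'| + |F''|)
solver.Add(solver.Sum(t) <= (1 - tau) / 2 * (2 * k))

print(solver.NumConstraints())  # the model now holds the linearized Dice constraint
```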

If the set sizes \(|F'|\) and \(|F''|\) are constant, e.g., user-defined, Eq. (4) implies that the threshold \(\tau \) exhibits a linear relationship with the maximum number of overlapping features \(|F' \cap F''|\). This correspondence simplifies the interpretation of \(\tau \) and encourages us to use the Dice dissimilarity in the subsequent steps. In contrast, the Jaccard distance exhibits a nonlinear relationship between \(\tau \) and the overlap size. This insight arises from rearranging Eq. (2) in conjunction with Definition 1, obtaining Eq. (7):

$$\begin{aligned} \begin{aligned} d_{\text {Jacc}}(F',F'')&= 1 - \frac{|F' \cap F''|}{|F'| + |F''| - |F' \cap F''|} \ge \tau \\&\Leftrightarrow |F' \cap F''| \le \frac{1 - \tau }{2 - \tau } \cdot (|F'| + |F''|) \end{aligned} \end{aligned}$$
(7)

Further, if \(|F'| = |F''|\), Eq. (4) implies that parameter \(\tau \) represents the share of features in one set that must not be a member of the other one, and vice versa, as expressed by Eq. (8):

$$\begin{aligned} \begin{aligned} d_{\text {Dice}}(F',F'')&\ge \tau \\ \Leftrightarrow \quad |F' \cap F''|&\le (1 - \tau ) \cdot |F'| = (1 - \tau ) \cdot |F''| \end{aligned} \end{aligned}$$
(8)

This correspondence further eases interpretability. In particular, if users are uncertain how to choose \(\tau \), and if \(|F'|\) is reasonably small, users can try out all values of \(\tau \in \{i / |F'|\}\) with \(i \in \{1, \dots , |F'|\}\). These \(|F'|\) unique values of \(\tau \) suffice to produce all possible distinct solutions that one could obtain with an arbitrary \(\tau \in (0,1]\).

3.2.2 Multiple alternatives

One can search for multiple alternatives either sequentially or simultaneously. The number of alternatives \(a \in {\mathbb {N}}_0\) is a parameter for the users. The overall number of feature sets is \(a+1\) since we deem one feature set the ‘original’ one. Table 1 compares the sizes of the optimization problems for these two search methods.

Table 1 Size of the optimization problem by search method, for \(a\) alternatives (\(a + 1\) feature sets overall) and \(n\) features

Sequential alternatives

With sequential search, users obtain alternatives iteratively, with one alternative per iteration. Thus, users can monitor the quality of alternatives and end the search after each iteration. This option frees users from setting the parameter \(a\) a priori. Each feature set has to be an alternative to all previously found ones, which are contained within the set \({\mathbb {F}}\):

Definition 2

(Sequential alternative) A feature set \(F''\) is an alternative to a set of feature sets \({\mathbb {F}}\) (and vice versa) if \(F''\) is a single alternative (Definition 1) to each \(F' \in {\mathbb {F}}\).

Adapting Eq. (1) to Definition 2, we obtain the following optimization problem (Eq. 9) for each iteration of the search:

$$\begin{aligned} \begin{aligned} \max _s&\quad Q(s,X,y) \\ \text {subject to:}&\quad \forall F' \in {\mathbb {F}}:~d(F_s,F') \ge \tau \end{aligned} \end{aligned}$$
(9)

The objective function remains the same as for a single alternative, i.e., optimizes only one feature set \(F_s\) per iteration. In particular, with \({\mathbb {F}} = \emptyset \) in the first iteration, we optimize for the ‘original’ feature set, which is the same as in conventional feature selection without constraints for alternatives. Thus, the number of variables in each optimization is independent of the number of alternatives a, and no linearization variables (Eq. 6) are required. Each alternative only adds one new constraint to the problem. Thus, we expect the runtime of sequential search to scale well with the number of alternatives. Additional runtime improvements can be achieved if the optimizer retains a state between iterations and can warm-start.
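The following minimal sketch illustrates such a sequential search, assuming a linear (univariate) objective as introduced later in Sect. 3.3.1 and a fixed feature-set size k; it uses OR-Tools, and the per-feature qualities q are assumed to be pre-computed.

```python
# Minimal sketch of sequential search (Eq. 9): one solver call per feature set,
# each adding a Dice-overlap constraint against all previously found sets.
from ortools.linear_solver import pywraplp

def find_alternatives_sequentially(q, k, tau, num_alternatives):
    n = len(q)
    found = []                                # previously selected sets (lists of feature indices)
    for _ in range(num_alternatives + 1):     # the first iteration yields the 'original' set
        solver = pywraplp.Solver.CreateSolver("SCIP")
        s = [solver.BoolVar(f"s_{j}") for j in range(n)]
        solver.Add(solver.Sum(s) == k)        # fixed feature-set size
        for prev in found:                    # Definition 2: alternative to all previous sets
            # Overlap with a fixed set is linear (Eq. 5), no auxiliary variables needed.
            solver.Add(solver.Sum([s[j] for j in prev]) <= (1 - tau) / 2 * (k + len(prev)))
        solver.Maximize(solver.Sum([q[j] * s[j] for j in range(n)]))
        if solver.Solve() != pywraplp.Solver.OPTIMAL:
            break                             # stop if no provably optimal solution was found
        found.append([j for j in range(n) if s[j].solution_value() > 0.5])
    return found

# Example: 10 features with decreasing quality, sets of size 3, zero overlap (tau = 1).
print(find_alternatives_sequentially(
    q=[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    k=3, tau=1.0, num_alternatives=2))        # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```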

Simultaneous alternatives

With simultaneous search, users obtain multiple alternatives at once; their number \(a\) must be set beforehand. Once more, we employ pairwise dissimilarity constraints for these alternatives:

Definition 3

(Simultaneous alternatives) A set of feature sets \({\mathbb {F}}\) contains simultaneous alternatives if each feature set \(F' \in {\mathbb {F}}\) is a single alternative (Definition 1) to each other set \(F'' \in {\mathbb {F}}\), \(F' \ne F''\).

Adapting Eq. (1) to Definition 3, we obtain the following optimization problem (Eq. 10):

$$\begin{aligned} \begin{aligned} \max _{s^{(0)}, \dots , s^{(a)}}&\quad \mathop {\text {agg}}\limits _{i \in \{0, \dots , a\}} Q(s^{(i)},X,y) \\ \text {subject to:}&\quad \forall i_1, i_2 \in \{0, \dots , a\},~i_1 \ne i_2: \\&\quad \quad d(F_{s^{(i_1)}},F_{s^{(i_2)}}) \ge \tau \end{aligned} \end{aligned}$$
(10)

In contrast to the sequential case (Eq. 9), the problem requires a modified objective function and \(a+1\) decision vectors \(s^{(i)}\) instead of one. The operator \(\text {agg}(\cdot )\) defines how to aggregate the qualities of the feature sets. In our experiments, we consider the sum as well as the minimum to instantiate \(\text {agg}(\cdot )\), which we refer to as sum-aggregation and min-aggregation. The latter explicitly fosters balanced feature-set qualities. Appendix A.1 discusses these two aggregation operators and additional ideas for balancing qualities in detail.

In terms of runtime, we anticipate simultaneous search to scale worse with the number of alternatives a than sequential search. This is because simultaneous search addresses a single large optimization problem, as opposed to several smaller ones. The number of decision variables grows linearly with a, while the number of constraints increases quadratically. Further, auxiliary variables are necessary to obtain linear constraints (Eq. 6).

Quality-wise, simultaneous search may benefit from optimizing once rather than greedily. Furthermore, the quality may be distributed more evenly across the alternatives compared to sequential search. However, a drawback is that there are no intermediate steps where users can interrupt the search after assessing the current feature-set quality.

3.3 Objective functions—finding alternatives

In this section, we describe how to find alternative feature sets. Specifically, we explain how to solve the optimization problem introduced in Sect. 3.1 for the different categories of feature-set quality measures from Sect. 2.2. We distinguish between white-box (Sect. 3.3.1), black-box (Sect. 3.3.2), and embedding (Sect. 3.3.3) approaches.

3.3.1 White-box optimization

If the feature-set quality function Q(sXy) is sufficiently simple, one can find alternatives with a solver for white-box optimization problems. We already showed that our notion of alternative feature sets can be expressed as 0–1 integer linear constraints (Proposition 1). We now discuss four popular feature-selection methods with objectives that admit formulating a 0–1 integer linear problem. We evaluate all four methods in our experiments later.

Univariate filter feature selection

For univariate filter methods, the objective function is linear by default since the quality of a feature set is the sum of the qualities of the individual features, as defined by Eq. (11):

$$\begin{aligned} Q_{\text {uni}}(s,X,y) = \sum _{j=1}^{n} q(X_{\cdot {}j},y) \cdot s_j \end{aligned}$$
(11)

Here, \(q(\cdot )\) is a bivariate dependency measure to quantify the relationship between each feature and the prediction target, e.g., mutual information [37] or the absolute value of Pearson correlation.

For this objective, Appendix A.2 specifies the complete optimization problem, including the constraints for alternatives we already introduced in Sect. 3.2.
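As an illustration of where the constants \(q(X_{\cdot {}j},y)\) come from, the following sketch pre-computes mutual-information-based feature qualities with scikit-learn on synthetic data and normalizes them; the data and the normalization to a sum of 1 are merely one possible choice (cf. Sect. 4.3.2).

```python
# Minimal sketch: pre-compute univariate feature qualities q(X_j, y) for Eq. (11)
# with scikit-learn's mutual-information estimator on an illustrative dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=15, random_state=25)
q = mutual_info_classif(X, y, random_state=25)  # one quality value per feature
q = q / q.sum()                                 # normalize so all features sum to 1
# These constants enter the linear objective Q_uni(s, X, y) = sum_j q_j * s_j.
print(np.round(q, 3))
```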

Post hoc feature importance

One can also insert post hoc feature-importance scores as univariate feature qualities \(q(X_{\cdot {}j},y)\) into Eq. (11). However, such post hoc importance scores typically evaluate the quality of each feature in the presence of other features. For example, a feature may only be important in subsets where another feature is present, due to feature interaction, but unimportant otherwise, and a post hoc importance method like SHAP [44] may reflect both these aspects. In contrast, Eq. (11) implicitly assumes feature independence and cannot adapt importance scores depending on whether other features are selected. Thus, treating pre-computed post hoc importance scores as univariate feature qualities in the optimization objective can serve as a heuristic but may not faithfully represent the actual feature qualities in a particular selected set.

FCBF

The Fast Correlation-Based Filter (FCBF) [81] is a multivariate filter method based on the notion of predominance: The correlation of each selected feature with the prediction target must exceed a user-defined threshold and must also be higher than the correlation of each other selected feature with the given one. While FCBF originally employs a heuristic search to identify predominant features, we propose a formulation as a constrained optimization problem (Eq. 12) to allow a white-box optimization for alternatives:

$$\begin{aligned} \begin{aligned} \max _s&\quad Q_{\text {FCBF}}(s,X,y) = \sum _{j=1}^{n} q(X_{\cdot {}j},y) \cdot s_j \\ \text {subject to:}&\quad \forall j_1, j_2 \in \{1, \dots , n\},~j_1 \ne j_2, \\&\quad \quad q(X_{\cdot {}j_1},y) \le q(X_{\cdot {}j_2}, X_{\cdot {}j_1}): \\&\quad \quad \quad s_{j_1} + s_{j_2} \le 1 \end{aligned} \end{aligned}$$
(12)

We drop the original FCBF’s threshold on feature-target correlation and maximize the latter instead, as in the univariate-filter case. Further, we keep the constraints of FCBF on feature–feature correlation. Specifically, we prohibit the concurrent selection of two features if their correlation is at least as high as the correlation of either feature with the target variable. Since the condition \(q(X_{\cdot {}j_1},y) \le q(X_{\cdot {}j_2}, X_{\cdot {}j_1})\) in Eq. (12) does not depend on the decision variables s, one can pre-compute whether it holds before formulating the optimization problem and add the corresponding linear constraint \(s_{j_1} + s_{j_2} \le 1\) only for feature pairs where it is needed.
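A minimal sketch of this pre-computation follows, assuming the dependency values have already been estimated and stored in q_target (feature–target) and q_pair (feature–feature); solver and s refer to an existing OR-Tools model and its binary selection variables.

```python
# Minimal sketch: add the FCBF-style constraints of Eq. (12). q_target[j] holds
# q(X_j, y) and q_pair[j1][j2] holds q(X_j1, X_j2); both are assumed to be
# pre-computed (e.g., with mutual information).
def add_fcbf_constraints(solver, s, q_target, q_pair):
    n = len(s)
    for j1 in range(n):
        for j2 in range(n):
            if j1 == j2:
                continue
            # Forbid selecting both features if their mutual dependency is at
            # least as high as the target dependency of feature j1.
            if q_target[j1] <= q_pair[j2][j1]:
                solver.Add(s[j1] + s[j2] <= 1)
```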

mRMR

The multivariate filter method Minimal Redundancy Maximum Relevance (mRMR) [63] combines two criteria, relevance and redundancy. Relevance is the dependency between selected features and the prediction target, as for univariate filter methods. Redundancy, in turn, quantifies the dependency between selected features. Both terms are averaged over the selected features. The objective is to maximize the difference between relevance and redundancy, as defined by Eq. (13):

$$\begin{aligned} \begin{aligned} \max _s \quad Q_{\text {mRMR}}(s,X,y) =&\; \frac{\sum _{j=1}^{n} q(X_{\cdot {}j},y) \cdot s_j}{\sum _{j=1}^{n} s_j} \\&- \frac{\sum _{j_1=1}^{n} \sum _{j_2=1}^{n} q(X_{\cdot {}j_1}, X_{\cdot {}j_2}) \cdot s_{j_1} \cdot s_{j_2}}{\left( \sum _{j=1}^{n} s_j\right) ^2} \end{aligned} \end{aligned}$$
(13)

If one knows the feature-set size \(\sum _{j=1}^{n} s_j\) to be a constant k, the denominators of both fractions are constant, so the objective leads to a quadratic-programming problem [55, 66]. If one additionally replaces each product term \(s_{j_1} \cdot s_{j_2}\) according to Eq. (6), the problem becomes linear. However, there is a more efficient linearization [56, 57], which we use in our experiments, displayed in Eq. (14):

$$\begin{aligned} \begin{aligned} \max _s \quad Q_{\text {mRMR}}(s,X,y)&= \frac{\sum _{j=1}^{n} q(X_{\cdot {}j},y) \cdot s_j}{k} - \frac{\sum _{j=1}^{n} z_j}{k \cdot (k-1)} \\ \text {subject to:} \quad \forall j_1: \quad A_{j_1}&= \sum _{j_2 \ne j_1} q(X_{\cdot {}j_1}, X_{\cdot {}j_2}) \cdot s_{j_2} \\ \forall j: \quad z_j&\ge M \cdot (s_j - 1) + A_j \\ \forall j: \quad z_j&\in {\mathbb {R}}_{\ge 0} \\ \text {with:} \quad j, j_1, j_2&\in \{1, \dots , n\} \end{aligned} \end{aligned}$$
(14)

Here, \(A_{j_1}\) is the sum of all redundancy terms related to the feature with index \(j_1\), i.e., the summed dependency value between this feature and all other selected features. Thus, one can use one real-valued auxiliary variable \(z_j\) for each feature instead of one new binary variable for each pair of features. Since redundancy should be minimized, \(z_j\) assumes the value of \(A_j\) with equality if the feature with index j is selected (\(s_j=1\)) and is zero otherwise (\(s_j=0\)). To this end, M is a large positive value that deactivates the constraint \(z_j \ge A_j\) if \(s_j=0\).

Since Eq. (14) assumes the feature-set size \(k \in {\mathbb {N}}\) to be user-defined before optimization, it requires fewer auxiliary variables and constraints than the more general formulation in [56, 57]. Additionally, in accordance with [55], we assign a value of zero to the self-redundancy terms \(q(X_{\cdot {}j},X_{\cdot {}j})\), effectively excluding them from the objective function. Thus, the redundancy term uses \(k \cdot (k-1)\) instead of \(k^2\) for averaging.
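A sketch of this linearized formulation with OR-Tools follows; the relevance and redundancy values are assumed to be pre-computed and normalized to [0, 1], and the big-M value is a simple upper bound on any \(A_j\) rather than a tuned constant.

```python
# Minimal sketch of the linearized mRMR objective (Eq. 14) with OR-Tools.
# relevance[j] is q(X_j, y); redundancy[j1][j2] is q(X_j1, X_j2) with zero
# self-redundancy; all values are assumed to be pre-computed and in [0, 1].
from ortools.linear_solver import pywraplp

def build_mrmr_model(relevance, redundancy, k):
    n = len(relevance)
    big_m = n  # upper bound on any A_j since redundancy values are in [0, 1]
    solver = pywraplp.Solver.CreateSolver("SCIP")
    s = [solver.BoolVar(f"s_{j}") for j in range(n)]
    z = [solver.NumVar(0, solver.infinity(), f"z_{j}") for j in range(n)]
    solver.Add(solver.Sum(s) == k)  # feature-set size fixed to k
    for j in range(n):
        # A_j: summed redundancy between feature j and all other selected features
        a_j = solver.Sum([redundancy[j][j2] * s[j2] for j2 in range(n) if j2 != j])
        # z_j >= A_j if s_j = 1; the big-M term deactivates the constraint if s_j = 0
        solver.Add(z[j] >= big_m * (s[j] - 1) + a_j)
    relevance_term = solver.Sum([relevance[j] * s[j] for j in range(n)])
    redundancy_term = solver.Sum([z[j] for j in range(n)])
    solver.Maximize((1.0 / k) * relevance_term
                    - (1.0 / (k * (k - 1))) * redundancy_term)
    return solver, s
```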

3.3.2 Black-box optimization

Overview of approaches

If feature-set quality does not have an expression suitable for white-box optimization, one has to treat it as a black-box function when searching for alternatives. This situation applies to wrapper feature-selection methods, which use prediction models to assess feature-set quality. There are several ways to consider the constraints for alternatives in such scenarios: First, one can enumerate and evaluate all feature sets satisfying the constraints, which is inefficient. Second, one can sample from the space of valid feature sets. However, uniform sampling from a constrained space is a computationally hard problem, possibly harder than determining if a valid solution exists or not [17]. Third, one can formulate a multi-objective problem to avoid hard constraints on alternatives. However, as explained in Sect. 3.1, we decided to pursue a single-objective formulation with constraints. Fourth, one can push constraints into a search heuristic; this is what we describe next. One idea is to prevent the heuristic from producing invalid feature sets or to make this less likely, e.g., by adding a penalty to the objective function. Another idea is to ‘repair’ invalid feature sets in the search, e.g., by replacing them with the most similar feature sets satisfying the constraints. Such solver-assisted approaches are popular for finding valid software configurations [22, 28, 79]. In our experiments, we also employ a repair-based approach, which we describe in the following.

Algorithm 1 Greedy Wrapper for alternative feature selection.

Greedy Wrapper

We propose a novel greedy hill-climbing procedure, displayed in Algorithm 1. Unlike standard hill climbing for feature selection [35], our procedure observes constraints. First, the algorithm uses a solver to find one solution that is alternative enough, given the current constraints (Line 1). Thus, it has a valid starting point and can always return a solution unless there are no valid solutions at all. Next, it tries ‘swapping’ two features, i.e., selecting the features if they were deselected or deselecting them if they were selected (Line 7). For simultaneous search, we swap the affected two features in each alternative feature set. This swap might violate cardinality constraints as well as constraints for alternatives. Thus, the algorithm calls the solver again to find one solution \(S'\) containing this swap and satisfying the other constraints. If such a solution \(S'\) exists and its quality \(Q(S',X,y)\) is higher than that of the current solution, the algorithm proceeds with the new solution, attempting again to swap the first and second features (Lines 10–12). Otherwise, it tries to swap the next pair of features (Lines 13–17). Specifically, we assess only one solution per swap before proceeding instead of exhaustively enumerating and evaluating all valid solutions involving the swap.

The algorithm terminates if no swap leads to an improvement or a fixed number of iterations \(max\_iters\) is reached (Line 6). Due to its heuristic nature, the algorithm might get stuck in local optima rather than yielding the global optimum. In particular, \(max\_iters\) is only an upper bound on the iteration count since the algorithm can stop earlier. We define the iteration count as the number of invocations of the solver, i.e., attempts to generate valid alternatives. This number also bounds the number of prediction models trained. However, we only train a model for valid solutions (Line 9), and not all solver calls may yield one.
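The following Python sketch mirrors the structure of Algorithm 1; the helpers solve_valid (ask the solver for any valid solution with the given features flipped relative to the current one) and quality (train and validate a prediction model on a candidate solution) are hypothetical placeholders for the solver and model-training calls described above.

```python
# Minimal sketch of Greedy Wrapper (Algorithm 1) with hypothetical helpers.
def greedy_wrapper(n_features, solve_valid, quality, max_iters=1000):
    current = solve_valid(flip=(), reference=None)   # Line 1: initial valid solution
    if current is None:
        return None                                  # no valid solution exists at all
    current_quality = quality(current)
    iters, j1, j2 = 1, 0, 1
    while iters < max_iters and j1 < n_features - 1:
        candidate = solve_valid(flip=(j1, j2), reference=current)  # Line 7: try a swap
        iters += 1                                   # one iteration per solver call
        if candidate is not None:
            candidate_quality = quality(candidate)   # Line 9: train a model only for valid solutions
            if candidate_quality > current_quality:
                current, current_quality = candidate, candidate_quality
                j1, j2 = 0, 1                        # Lines 10-12: restart with the first pair
                continue
        j2 += 1                                      # Lines 13-17: proceed to the next pair
        if j2 >= n_features:
            j1 += 1
            j2 = j1 + 1
    return current
```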

3.3.3 Embedding alternatives

If feature selection is embedded into a prediction model, there is no general approach for finding alternative feature sets. Instead, one would need to embed the search for alternatives into model training as well. Thus, we leave the formulation of specific approaches open for future work. For instance, one could prevent decision trees from splitting on a feature if the resulting feature set is too similar to a given feature set. As another example, there are various formal encodings of prediction models, e.g., as SAT formulas [54, 70, 82]. In such representations, one might directly add the constraints for alternatives and employ a solver for ‘training’.

3.4 Computational complexity

In this section, we discuss the time complexity of alternative feature selection. We are interested in scalability regarding the number of features \(n \in {\mathbb {N}}\), also taking the feature-set size \(k \in {\mathbb {N}}\) and the number of alternatives \(a \in {\mathbb {N}}_0\) into account. Section 3.4.1 examines exhaustive search, which works for arbitrary feature-selection methods, while Sect. 3.4.2 analyzes the optimization problem with univariate feature qualities (Eq. 11).

3.4.1 Exhaustive search for arbitrary feature-selection methods

An exhaustive search over all possible feature sets is a simple but inefficient approach to finding alternatives. This procedure provides an upper bound for the time complexity of the optimization problem.

Sequential search

Even without considering alternatives, the search space of feature selection grows exponentially with n. In particular, there are \(2^n - 1\) non-empty feature sets of arbitrary size and \(\left( {\begin{array}{c}n\\ k\end{array}}\right) = \frac{n!}{k! \cdot (n-k)!} \le n^k\) sets of a fixed size k. These numbers also hold for sequential search (Eq. 9), which optimizes alternatives one at a time. For each feature set, one must evaluate the objective and check the constraints. The cost of the former depends on the feature-selection method but should usually be polynomial in n. The latter entails a cost of \(O(a \cdot n)\) for each new alternative and \(O(a^2 \cdot n)\) for the whole sequential search with a alternatives. In particular, constraint checking involves iterating over all existing feature sets and features to compute the dissimilarity between sets (Eq. A4). With \(O(n^k)\) feature sets as solution candidates, we obtain the following proposition:

Proposition 2

(Complexity of exhaustive sequential search) Exhaustive sequential search for \(a \in {\mathbb {N}}\) alternative feature sets of size k from n features has a time complexity of \(O(a^2 \cdot n^{k+1})\) without the cost of evaluating the objective function.

This complexity is polynomial in n if k is a small constant independent of n, i.e., \(k \in O(1)\), and if a is at most polynomial in n, i.e., \(a \in O(n^c),~c \in O(1)\). The assumption \(k \in O(1)\) makes sense for feature selection, where one typically wants to obtain small feature sets from a high-dimensional dataset. However, the exponent k might still render an exhaustive search practically infeasible. In terms of parameterized complexity, if a is constant, the problem resides in class \(\mathcal{X}\mathcal{P}\) for the parameter k since the complexity term has the form \(O(f(k) \cdot n^{g(k)})\) [14], with \(f(k) = 1\) and \(g(k) = k+1\).

Simultaneous search

Simultaneous search enlarges the search space by optimizing \(a+1\) feature sets at once (Eq. 10), having \(O((n^k)^{a+1}) = O(n^{k \cdot (a+1)})\) solution candidates. Thus, the following proposition holds:

Proposition 3

(Complexity of exhaustive simultaneous search) Exhaustive simultaneous search for \(a \in {\mathbb {N}}\) alternative feature sets of size k from n features has a time complexity of \(O(a^2 \cdot n^{k \cdot (a+1) + 1})\) without the cost of evaluating the objective function.

The scalability with respect to n is less favorable compared to sequential search, as the number of alternatives a now appears in the exponent. Further, Proposition 3 assumes that the constraints do not use linearization variables (Eqs. 6 and A5), which would enlarge the search space even further. Finally, the complexity remains polynomial in n if a and k are small and independent of n, i.e., \(a \cdot k \in O(1)\):

Proposition 4

(Parameterized complexity of simultaneous search) Simultaneous search for \(a \in {\mathbb {N}}\) alternative feature sets of size k from n features resides in the parameterized complexity class \(\mathcal{X}\mathcal{P}\) for the parameter \(a \cdot k\).

3.4.2 Univariate feature qualities

While the assumption \(a \cdot k \in O(1)\) ensures polynomial runtime regarding n for arbitrary feature-selection methods, the optimization problem can still be hard without this assumption. In the following, we derive complexity results for univariate feature qualities (Eq. 11). This feature-selection method arguably has the simplest objective function, where the quality of a feature set is equal to the sum of the individual qualities of its constituent features. This simplicity eases the transformation from and to well-known \(\mathcal{N}\mathcal{P}\)-hard problems.

In the following complexity analyses, we assume the feature qualities \(q(X_{\cdot {}j},y)\) are given. In particular, one can pre-compute these qualities before searching alternatives and treat them as constants in the optimization problem. The complexity of this computation depends on the particular feature-quality measure and the number of data objects m. However, the number of features n should only affect the complexity linearly due to the univariate setting.

We have previously observed that our optimization problem with univariate feature qualities can be formulated as an integer linear program (Sect. 3.3.1). While integer programming is \(\mathcal{N}\mathcal{P}\)-complete in general [20, 32], our problem is a special case of it and could therefore have lower complexity.

Min-aggregation

We start with two assumptions, both of which we will drop later: First, we use a dissimilarity threshold of \(\tau = 1\), i.e., zero overlap of feature sets. Second, all features must be part of one set. We call the combination of these assumptions, which implies \(n = (a+1) \cdot k\), a complete partitioning. This scenario differs from \(a \cdot k \ll n\), which we assumed in Sect. 3.4.1, where runtime was polynomial in n.

The currently assumed scenario is a variant of an existing problem known as Multi-Way Number Partitioning [40] or Multiprocessor Scheduling [20] in the literature: Partition a set of n integers into a fixed number of subsets such that all subset sums are as equal as possible [36]. A typical application is assigning tasks with different lengths to a fixed number of processors, aiming to balance the load. In our scenario, the tasks correspond to features, the task lengths to univariate feature qualities, and the processors to feature sets. Maximizing the minimum sum is one of several possible objectives in the literature [36, 40]. This objective corresponds to simultaneous search (Eq. 10) with min-aggregation (Eq. A2) in our scenario. Since Multiprocessor Scheduling is \(\mathcal{N}\mathcal{P}\)-complete, even for just two partitions [20], we obtain the following result:

Proposition 5

(Complexity of simultaneous search with min-aggregation, complete partitioning, and unconstrained feature-set size) Assuming univariate feature qualities (Eq. 11), a dissimilarity threshold \(\tau = 1\), unconstrained feature-set sizes, and that all n features have to be selected, simultaneous search for alternative feature sets with min-aggregation is \(\mathcal{N}\mathcal{P}\)-complete.

Dropping several assumptions from Proposition 5, we directly obtain the following, more general proposition:

Proposition 6

(Complexity of simultaneous search with min-aggregation) Simultaneous search for alternative feature sets with min-aggregation is \(\mathcal{N}\mathcal{P}\)-hard.

While Proposition 5 allows arbitrary set sizes, the hardness result remains valid for fixed or bounded set sizes k, known as the Balanced Number Partitioning [46, 84] or K-Partitioning [27, 41] problem in the literature:

Proposition 7

(Complexity of simultaneous search with min-aggregation, complete partitioning, and constrained feature-set size) Assuming univariate feature qualities (Eq. 11), a dissimilarity threshold \(\tau = 1\), desired feature-set size k, and that all n features have to be selected, simultaneous search for alternative feature sets with min-aggregation is \(\mathcal{N}\mathcal{P}\)-complete.

We now allow that some features may not be part of any feature set while we keep the assumption of zero feature-set overlap. The problem of finding such an incomplete partitioning is still \(\mathcal{N}\mathcal{P}\)-complete in general:

Proposition 8

(Complexity of simultaneous search with min-aggregation, incomplete partitioning, and constrained feature-set size) Assuming univariate feature qualities (Eq. 11), a dissimilarity threshold \(\tau = 1\), desired feature-set size k, and that not all n features have to be selected, simultaneous search for alternative feature sets with min-aggregation is \(\mathcal{N}\mathcal{P}\)-complete.

See Appendix A.3 for the proof of this proposition and of the ones that follow.

Next, we also allow \(\tau < 1\), i.e., feature-set overlap, and obtain another \(\mathcal{N}\mathcal{P}\)-hardness result:

Proposition 9

(Complexity of simultaneous search with min-aggregation, \(\tau < 1\), and constrained feature-set size) Assuming univariate feature qualities (Eq. 11), a dissimilarity threshold \(\tau < 1\), and desired feature-set size k, simultaneous search for alternative feature sets with min-aggregation is \(\mathcal{N}\mathcal{P}\)-hard.

Sum-aggregation

In contrast to the previous hardness results, sum-aggregation (Eq. A1) admits polynomial-time algorithms for \(\tau =1\):

Proposition 10

(Complexity of search with sum-aggregation and \(\tau =1\)) Assuming univariate feature qualities (Eq. 11) and a dissimilarity threshold \(\tau = 1\), the search for alternative feature sets with sum-aggregation has a time complexity of O(n) for a complete partitioning of n features and \(O(n \cdot \log n)\) for an incomplete partitioning.

This result applies to both sequential and simultaneous search, whether using complete or incomplete partitioning, for any arbitrary number a of alternatives, and even when feature sets vary in size k. The primary reason for achieving polynomial runtime is that sum-aggregation does not necessitate balancing the feature sets’ qualities. Thus, \(\tau =1\) allows many solutions with the same objective value. While at least one of these solutions also optimizes the objective with min-aggregation, most do not. Hence, it is not a contradiction that optimizing with min-aggregation is considerably harder.
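To illustrate why sum-aggregation with \(\tau = 1\) is easy, consider the special case of an incomplete partitioning where all \(a+1\) feature sets have the same fixed size k: sorting the features by quality and filling the sets with the top \((a+1) \cdot k\) features already maximizes the summed quality, and any disjoint assignment of these features to sets yields the same objective value. A minimal sketch under this assumption:

```python
# Minimal sketch for sum-aggregation with tau = 1 and an incomplete partitioning,
# assuming all a+1 feature sets have the same fixed size k: sort the features by
# quality, keep the top (a+1)*k, and distribute them disjointly; every such
# assignment yields the same summed quality (O(n log n) due to sorting).
def disjoint_sets_sum_aggregation(q, k, a):
    ranked = sorted(range(len(q)), key=lambda j: q[j], reverse=True)
    top = ranked[:(a + 1) * k]
    return [top[i * k:(i + 1) * k] for i in range(a + 1)]

# Example: 8 features, 3 sets of size 2 -> the 6 best features, split disjointly.
print(disjoint_sets_sum_aggregation(
    q=[0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6], k=2, a=2))  # [[1, 3], [5, 7], [6, 2]]
```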

4 Experimental design

In this section, we describe our experimental design. We provide a concise overview of its components and objectives in Sect. 4.1 before delving into a detailed explanation of its components. In particular, we describe evaluation metrics (Sect. 4.2), methods (Sect. 4.3), datasets (Sect. 4.4), and implementation (Sect. 4.5).

4.1 Overview

We conduct experiments with 30 datasets representing binary-classification problems. As evaluation metrics, we take into account feature-set quality and runtime. We compare five feature-selection methods, each representing distinct concepts of feature-set quality. Furthermore, we train prediction models using the resultant feature sets and analyze their prediction performance. In our pursuit of alternatives, we explore both simultaneous and sequential search methods. We systematically vary the number of alternatives a and the dissimilarity threshold \(\tau \).

4.2 Evaluation metrics

Our evaluation focuses on the trade-off between feature-set quality and obtaining alternatives. In addition, we evaluate the runtime of the search for alternatives.

Feature-set quality

We use two metrics for feature-set quality. First, we evaluate the objective functions Q(sXy) of the feature-selection methods and report their objective value. Second, we train prediction models with the feature sets found. We report prediction performance in terms of Matthews correlation coefficient (MCC) [45], which is insensitive to imbalanced prediction targets, reaches 1 for perfect predictions, and is 0 for random guessing.
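For reference, a minimal sketch of how the MCC of a decision tree restricted to a selected feature set can be computed with scikit-learn; the synthetic dataset and the selected feature indices are purely illustrative.

```python
# Minimal sketch: evaluate a selected feature set by the MCC of a decision tree
# trained only on those features (illustrative data and feature indices).
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=25)

selected = [0, 3, 4, 7, 11]  # indices of a (hypothetical) selected feature set
model = DecisionTreeClassifier(criterion="entropy", random_state=25)
model.fit(X_train[:, selected], y_train)
print(matthews_corrcoef(y_test, model.predict(X_test[:, selected])))
```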

To analyze how well feature selection and predictions generalize, we conduct a stratified fivefold cross-validation. Model training and the search for alternatives only have access to the training data. However, we also use the test data to evaluate the quality of each feature set found with the training data. For the test-set objective value, we initialize the objective function with feature qualities computed on the test set but insert the feature selection from the training set. For the test-set prediction performance, we predict on the test set but use a prediction model trained with these features on the training set.

Runtime

Regarding runtime, we first analyze the optimization time. For white-box feature-selection methods, this corresponds to the summed runtime of all solver calls. We exclude the time for feature-quality computations that one can reuse for multiple solver calls. For Greedy Wrapper, we measure the runtime of the entire black-box optimization procedure with multiple solver calls and model trainings. As a second metric, we examine the optimization status, which can take four values. If the solver finished before reaching its timeout, it either found an optimal solution or proved the problem infeasible, i.e., no solution exists. If the solver reached its timeout, it either found a feasible solution whose optimality is unclear or no valid solution (yet), so the problem is not solved.

4.3 Methods

4.3.1 Prediction

As prediction models, we use decision trees [7] since these models allow learning complex, nonlinear dependencies. Preliminary experiments with random forests [8] and a k-nearest-neighbors classifier yielded similar insights. We leave the hyperparameters of the trees at their defaults, except for using information gain instead of Gini impurity as the split criterion, to be consistent with our filter feature-selection methods. The trees may only use features from the alternatives, which allows assessing the quality of alternatives.

4.3.2 Feature selection (objective functions)

We choose five well-known feature-selection methods to provide objective functions for optimization. They cover the different feature-selection categories from Sect. 2.2 except embedded, as explained in Sect. 3.3.3. However, we use feature importance from an embedded method, i.e., decision trees, as post hoc importance scores.

One method (Greedy Wrapper) requires black-box optimization, while the other four methods have white-box objectives. With each feature-selection method, we select \(k \in \{5,10\}\) features, yielding small feature sets. We enforce the desired k with an additional constraint in the optimization, using the feature-set-size expression from Eq. (5).

Filter feature selection

As filter methods, we use the univariate MI (Eq. 11), the multivariate FCBF (Eq. 12), and the multivariate mRMR (Eq. 13). In all three methods, mutual information [37] serves as the dependency measure \(q(\cdot )\). Since mutual information has no fixed upper bound, we normalize its values per dataset and cross-validation fold to improve the comparability of feature-set quality. For FCBF and MI, we normalize the individual features’ qualities such that selecting all features yields a quality of 1 and selecting no feature yields a quality of 0. For mRMR, we min-max-normalize all mutual-information values to [0, 1], so the overall objective is in \([-1,1]\).

Wrapper feature selection

As a wrapper method, we employ our hill-climbing procedure Greedy Wrapper (Algorithm 1) with \(max\_iters = 1000\). To evaluate feature-set quality within the wrapper, we use the MCC score of a decision tree on a 20% validation split of the data.

Post hoc feature importance

As a post hoc importance measure, which we call Model Gain, we use importance scores from scikit-learn's decision trees. There, importance quantifies a feature's contribution toward optimizing the split criterion of the tree, for which we choose information gain. These importances are normalized to sum up to 1 by default. We plug these importances into Eq. (11), i.e., treat them like univariate filter scores. The interpretation is different, though, since the scores originate from trees trained with all features rather than assessing features in isolation.
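A minimal sketch of how such importance scores can be obtained with scikit-learn; the synthetic dataset is illustrative.

```python
# Minimal sketch: obtain 'Model Gain' importance scores from a decision tree
# trained on all features; they sum to 1 and can be plugged into Eq. (11).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=25)
tree = DecisionTreeClassifier(criterion="entropy", random_state=25).fit(X, y)
q = tree.feature_importances_  # normalized importance per feature
print(np.round(q, 3), q.sum())
```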

4.3.3 Alternatives (constraints)

Optimization

We address the optimization problem of alternative feature selection with a solver. Thus, when we speak of sequential search and simultaneous search in the evaluation, we refer to the following solver-based optimization procedures rather than the optimization problems in general: For the four white-box feature-selection methods, we use the solver to exactly solve the underlying optimization problems. Thus, given sufficient solving time, these alternatives are globally optimal. With Greedy Wrapper as the feature-selection method, the search procedure (Algorithm 1) is heuristic and might not cover the entire search space. There, the solver only assists in finding valid solutions but does not optimize.

Competitors

Other approaches are possible to tackle the sequential and simultaneous optimization problems for alternative feature selection. E.g., heuristic approaches may speed up the optimization for the white-box feature-selection methods but are beyond the scope of this article. Another source of competitors could be related work. However, as we discuss in Sect. 6, approaches from related work pursue different objective functions, operate with different notions of alternatives, and may only work for particular feature-selection methods. All these points prevent a meaningful comparison of these approaches to ours. E.g., a feature set considered alternative in related work might violate our constraints for alternatives. Further, within our search approaches, we can still put the feature-set quality into perspective by comparing alternatives to each other. In particular, the quality of the ‘original’ feature set, i.e., obtained by running the feature-selection methods without constraints for alternatives, serves as a natural reference point.

Search parametrization

We analyze sequential (Eq. 9) and simultaneous (Eq. 10) search for alternatives. For the latter, we employ sum-aggregation (Eq. A1) and min-aggregation (Eq. A2) in the objective. In figures and tables, we use the abbreviations seq., sim. (sum), and sim. (min) to denote these search methods.

We vary the parameters of the search systematically: We evaluate \(a \in \{1, \dots , 10\}\) alternatives for sequential search and \(a \in \{1, \dots , 5\}\) for simultaneous search due to the higher runtime of the latter. For the dissimilarity threshold \(\tau \), we analyze all possible sizes of the feature-set overlap in the Dice dissimilarity (Eqs. 3 and 8). Thus, for \(k=5\), we consider \(\tau \in \{0.2, 0.4, 0.6, 0.8, 1.0\}\), corresponding to an overlap of four to zero features. For \(k=10\), we consider \(\tau \in \{0.1, 0.2, \dots , 1.0\}\). We exclude \(\tau = 0\), which would allow returning duplicate feature sets.

Timeout

We employ a solver timeout to make a large-scale evaluation feasible and to account for the high variance of solver runtime. In particular, we grant each solver call 60 s multiplied by the number of feature sets sought. Thus, sequential search conducts multiple solver calls with 60 s timeout each, while simultaneous search conducts one solver call with proportionally more time, e.g., 300 s for five feature sets (i.e., four alternatives). The summed timeout for a fixed number of alternatives is the same for both search methods. For 84% of the feature sets in our evaluation, the solver finished before the timeout.

Table 2 Datasets from PMLB used in our experiments. m denotes the number of data objects and n the number of features
Fig. 1 Standard deviation of feature-set quality within search runs over the number of alternatives a, by search method for alternatives and evaluation metric. Results with MI as feature-selection method and \(k=5\). Y-axes are truncated to improve readability

4.4 Datasets

Selection criteria

We use a variety of datasets from the Penn Machine Learning Benchmarks (PMLB) [60, 67]. To harmonize the evaluation, we only consider binary classification, though alternative feature selection works for regression and multi-class problems as well. We exclude datasets containing fewer than 100 data objects since they might incur a high uncertainty when assessing feature-set quality. Otherwise, the number of data objects should not systematically impact feature-set quality and is unimportant for our evaluation. We also exclude datasets with fewer than 15 features to leave room for alternatives. Next, we exclude one dataset with 1000 features, which would dominate the overall runtime of the experiments. Finally, we manually exclude datasets that appear to be duplicated or modified versions of other datasets from the benchmark.

30 datasets with 106 to 9822 data objects and 15 to 168 features remain. The datasets do not contain any missing values. Categorical features have an ordinal encoding by default. Table 2 provides an overview of the datasets used.

Feature correlation

Table 2 also displays the datasets’ average feature correlation. In particular, we compute Spearman’s rank correlation between each pair of features and take the absolute values to evaluate the strength of dependencies rather than their sign. For the datasets in our study, the average feature correlations are often weak, mostly below 0.3. Generally, correlated features indicate that alternative feature sets may exist. However, there are two caveats. First, rank correlation only captures certain types of dependencies, while our feature-selection criteria and prediction models are more general. Second, for optimal alternatives, the dependency between highly predictive features is crucial, while the dependency between unimportant features matters less. However, the table only shows the mean over all feature pairs.
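A minimal sketch of this computation, assuming the features are available as a pandas DataFrame:

```python
# Minimal sketch: mean absolute pairwise Spearman correlation of a dataset's
# features, as reported in Table 2 (X is a pandas DataFrame of the features).
import numpy as np
import pandas as pd

def mean_abs_spearman(X: pd.DataFrame) -> float:
    corr = X.corr(method="spearman").abs().to_numpy()
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]  # exclude self-correlations
    return float(off_diagonal.mean())
```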

4.5 Implementation and execution

We implemented our experimental pipeline in Python 3.8, using scikit-learn [62] for machine learning and the solver SCIP [5] via the package OR-Tools [64] for optimization. A requirements file in our code specifies the versions of all packages. The experimental pipeline parallelizes over datasets, cross-validation folds, and feature-selection methods, while solver calls and model training are single-threaded. We ran the pipeline on a server with 128 GB RAM and an AMD EPYC 7551 CPU, having 32 physical cores and a base clock of 2.0 GHz. The parallelized pipeline run took 255 h, i.e., about 10.6 days.

5 Evaluation

In this section, we evaluate the results of our experiments. In particular, we discuss the parametrization for searching alternatives: the search method (Sect. 5.1) and the two user parameters (Sect. 5.2), i.e., the number of alternatives a and the dissimilarity threshold \(\tau \).

5.1 Search methods for alternatives

Variance in feature-set quality

As expected, the search method influences how much the training-set objective value Q varies among alternatives within each search run. Figure 1a visualizes this observation for MI, though it applies to all other white-box feature-selection methods as well. Each box in the figure shows how the variance within individual search runs for alternatives is distributed over other experimental settings, e.g., datasets and cross-validation folds. In particular, the quality of multiple alternatives from the same search run varies less for simultaneous than for sequential search and less for min-aggregation than for sum-aggregation. However, this difference in variance largely disappears on the test set, for the objective value (Fig. 1b) as well as prediction performance. This effect might result from overfitting: Even with similar training-set quality, some alternatives may generalize better than others. Thus, the variance in test feature-set quality caused by overfitting could mask the effect of the search method on the variance.

Fig. 2 Mean of feature-set quality within search runs over the number of alternatives a, by search method for alternatives and evaluation metric. Results with MI as feature-selection method and \(k=5\)

Average value of feature-set quality

While obtaining quality-homogeneous alternatives can be one goal of simultaneous search, the main selling point would be obtaining alternatives of higher average quality than sequential search. However, this potential advantage rarely materialized in our experiments. In particular, Fig. 2a compares the distribution of the mean training-set objective in search runs for MI as the feature-selection method. We observe that all three search methods yield similar distributions of training-set objective values. This observation holds for feature-set quality on the test set (Fig. 2b) as well. Further, the other four feature-selection methods besides MI also do not show a general quality advantage of simultaneous search. Finally, depending on the concrete experimental setting, either sequential or simultaneous search can yield significantly higher quality than the other search method, even if the overall mean difference is close to zero.

Table 3 Frequency of optimization statuses (Sect. 4.2) by feature-selection method and search method for alternatives

Optimization status

One reason why simultaneous search fails to consistently beat the quality of sequential search is that search results may be suboptimal. For Greedy Wrapper, the search is heuristic per se and does not cover the entire search space. For all feature-selection methods, the solver can time out. In particular, the corresponding solver status feasible occurs more often for simultaneous than for sequential search, as Table 3 shows. Such timeout-affected simultaneous solutions can be worse than optimal sequential solutions. The optimization status not solved, i.e., not finding a feasible solution without proving infeasibility, did not occur in the displayed results. The feature-selection method mRMR is especially prone to suboptimal solutions, likely because its objective is more complex than those of MI and Model Gain. In contrast, FCBF often results in infeasible optimization problems since its constraints against selecting redundant features can rule out any valid feature set of size k.

Further, the fraction of timeouts strongly depends on the number of alternatives a. E.g., for simultaneous search with \(k=5\) and sum-aggregation, roughly 8% of the white-box searches timed out for one alternative but 20% for three alternatives and 30% for five alternatives. While we grant simultaneous searches proportionally more time to obtain more feature sets, the runtime increases clearly super-proportionally, as we analyze next.
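The following sketch shows how the optimization statuses reported in Table 3 could be read off the OR-Tools solver, together with a per-call time limit; the time-limit value is a placeholder rather than the budget used in our experiments.

```python
# Sketch: mapping OR-Tools solver codes to the optimization statuses of Table 3.
# The time limit of 60 s is a placeholder, not the budget from our experiments.
from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver("SCIP")
solver.SetTimeLimit(60_000)  # time limit in milliseconds
# ... add variables, constraints, and the objective as in the sketch above ...
status = solver.Solve()
status_names = {
    pywraplp.Solver.OPTIMAL: "optimal",        # proven optimal solution
    pywraplp.Solver.FEASIBLE: "feasible",      # valid solution, but optimality not proven (e.g., timeout)
    pywraplp.Solver.INFEASIBLE: "infeasible",  # no valid feature set exists
    pywraplp.Solver.NOT_SOLVED: "not solved",  # no solution found, infeasibility not proven
}
print(status_names.get(status, "other"))
```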

Table 4 Mean optimization time by feature-selection method and search method for alternatives

Optimization time

The optimization time also favors sequential search. As Table 4 shows, the mean optimization time of sequential search is lower for all five feature-selection methods, partly by orders of magnitude. Further, FCBF, MI, and Model Gain experience a dramatic increase in optimization time with parameter a. E.g., for simultaneous search with \(k=5\) and sum-aggregation, MI has a mean optimization time of 0.03 s for \(a=1\), 0.09 s for \(a=2\), 0.31 s for \(a=3\), 3.84 s for \(a=4\), and 122.69 s for \(a=5\). In contrast, the runtime increase is considerably less for sequential search, which shows an approximately linear trend with the number of alternatives.

Another interesting question is how the runtime relates to n, the number of features in the dataset. Our experimental data show a positive correlation, as we expected. However, the observed trend is very noisy, particularly for simultaneous search: Runtimes vary considerably even for a fixed n, and some higher-dimensional datasets show lower average runtimes than lower-dimensional ones. This result indicates that factors other than n influence the runtime. Besides factors related to the datasets and the experimental design, the heuristics used by the solver may also cause the runtime to fluctuate considerably.

Based on all results described in this section, we focus on sequential search in the following, and we also recommend it to users. In particular, it was significantly faster than simultaneous search while yielding similar feature-set quality.

5.2 User parameters a and \(\tau \)

Fig. 3 Mean of feature-set quality, max-normalized per search run for alternatives, over the number of alternatives and dissimilarity threshold \(\tau \), by evaluation metric. Results from sequential search with MI as feature-selection method and \(k=10\)

Feature-set quality

One would expect feature-set quality to decrease with an increasing number of alternatives a and dissimilarity threshold \(\tau \), giving users control over the alternatives. In particular, higher values of the two user parameters introduce more (for a) and stronger (for \(\tau \)) constraints into the optimization problem. Figure 3 shows this decreasing quality trend for MI-based feature selection. Since feature-set quality varies significantly among datasets, we display it normalized to [0, 1] within each search run for alternatives: We first shift the value range of each evaluation metric to [0, 1] and then max-normalize per search run, i.e., the highest feature-set quality in the search run is set to 1, and the other qualities are scaled accordingly.
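This normalization can be expressed compactly with pandas, e.g., for the test-set MCC; the column names and the grouping keys identifying one search run are assumptions for illustration:

```python
# Hypothetical sketch of the per-run max-normalization; column names are assumptions.
import pandas as pd

results = pd.read_csv("results.csv")
run_keys = ["dataset", "fold", "fs_method", "k", "tau"]      # one search run for alternatives
results["test_mcc_01"] = (results["test_mcc"] + 1) / 2       # shift MCC from [-1, 1] to [0, 1]
results["test_mcc_norm"] = results.groupby(run_keys)["test_mcc_01"].transform(
    lambda s: s / s.max() if s.max() > 0 else s)             # best set in the run becomes 1
```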

As visible in Fig. 3, there might be multiple alternatives of similar quality, particularly for low values of \(\tau \). Further, the objective value decreases most from the original, unconstrained feature set, i.e., the zeroth alternative in the figures, to the first alternative, while the decrease between subsequent alternatives is smaller. Note that Fig. 3 averages the normalized feature-set quality over datasets with different dimensionality. In our experiments, datasets with more features n tend to show a smaller decrease in quality over a and \(\tau \). This observation makes sense since higher-dimensional datasets offer more options for alternatives. However, this effect is not guaranteed since datasets with many features could also contain many useless features instead of interesting alternatives.

The overall decrease in quality is slightly less pronounced for the test-set objective value (Fig. 3c) than for the training-set one (Fig. 3a) since overfitting might occur. In particular, the original feature set can even have lower test-set quality than the subsequent alternatives. The change in feature-set quality becomes even less clear when using the alternative feature sets to train prediction models, i.e., decision trees. As Fig. 3e shows, the prediction performance varies little over the number of alternatives and the dissimilarity threshold \(\tau \). In particular, the optimization objective Q may only partially indicate actual prediction performance since the former may use a simplified feature-set quality criterion. Indeed, the overall correlation between optimization objective Q and prediction MCC is only weak to moderate in our experiments. Among the feature-selection methods, this correlation is highest for Greedy Wrapper but also not perfect there. In particular, the search procedure of Greedy Wrapper evaluates feature sets with a validation split of the training set. MCC on this holdout set may not perfectly correspond to MCC on the test set, which is not used in the search.

Fig. 4 Fraction of optimization runs yielding a valid feature set over the number of alternatives and dissimilarity threshold \(\tau \). Results from sequential search with MI as feature-selection method and \(k=10\)

Optimization status

Increasing a and \(\tau \) not only affects quality but can also render the optimization problem infeasible, i.e., no valid alternative might exist. Figure 4 visualizes how the percentage of valid feature sets develops over a and \(\tau \) in our experiments. In our prior analysis of feature-set quality, we excluded infeasible feature sets. For comparison, Figs. 3b, d, f show the same data as Fig. 3a, c, e but with the quality of infeasible feature sets set to zero, i.e., the theoretical minimum quality after shifting the value ranges of the evaluation metrics. In these three figures, the decrease in feature-set quality is noticeably stronger for all evaluation metrics. In particular, test-set prediction performance, which is relatively stable when excluding infeasible feature sets (Fig. 3e), also decreases now (Fig. 3f).

In contrast, if only valid feature sets are considered, the mean quality may even increase over the number of alternatives, as visible in Fig. 3a and Fig. 3c for \(\tau =1.0\). This counterintuitive phenomenon can occur because some datasets run out of valid feature sets sooner than others, so the average quality is computed over a different subset of datasets at each number of alternatives.

Fig. 5 Mean of training-set objective value, max-normalized per search run for alternatives, by feature-selection method and evaluation metric. Infeasible feature sets assigned a quality of 0. Results from sequential search with \(k=10\)

Influence of feature-selection method

While we discussed MI before, the decrease in the objective value over the parameter a occurs for all feature-selection methods in our experiments, as Fig. 5a shows. However, the strength of the decrease varies considerably among the feature-selection methods. For example, Greedy Wrapper and mRMR are least affected by a. The effect of \(\tau \) also depends on the feature-selection method, as Fig. 5b shows. Again, MI and Model Gain, which both use the simple univariate objective (Eq. 11), show a stronger trend than Greedy Wrapper and mRMR. The fifth feature-selection method, FCBF, also exhibits a clear quality decrease over a and \(\tau \), though mainly due to a high rate of infeasible feature sets rather than a quality decrease of the found feature sets. Ultimately, the choice of a particular feature-selection method depends on the use case. However, users should be aware that the effects regarding alternative feature sets may depend on this choice.

6 Related work

In this section, we review related work from the field of feature selection and from other areas relevant to this article. To our knowledge, searching for optimal alternative feature sets in the sense of this article is novel. However, there is literature on optimal alternatives outside the field of feature selection, as well as research on discovering multiple, diverse feature sets.

Feature selection

Most feature-selection methods yield one solution [6], though there are some exceptions [16, 52, 72], e.g., fostering diversity within wrapper methods. Ensemble feature selection [69, 71] combines feature-selection results obtained by different feature-selection methods or on different samples of the data. There, diversity can be an auxiliary goal to improve prediction performance [23, 43, 80]. In addition, the notion of feature-set diversity differs from ours. In particular, these approaches give users less control over alternatives, e.g., they do not enforce alternatives with hard constraints.

Approaches for statistically equivalent feature sets [6, 39] use statistical tests to determine feature sets that are equivalent for predictions. In contrast, we do not require equivalent quality but allow users to control the number and dissimilarity of alternatives directly.

There also is work on constraints for feature selection, e.g., for feature cost [61], feature groups [83], or domain knowledge [2, 21]. Such constraints do not explicitly target alternatives but could be integrated into our integer-programming formulation as well.

Subgroup discovery

Diversity also is an issue in subgroup set discovery, which searches for interesting regions in the data space instead of selecting features as a whole. [76] presents six strategies to facilitate subgroup diversity. Two of them are directly transferable to feature selection, but their criteria for being an alternative differ from ours. In particular, one strategy prunes similar subgroups with exactly the same quality as previous solutions. The other one limits the number of subgroups a feature may occur in but does not constrain the overlap of subgroups per se.

Clustering

Finding alternative solutions has been addressed extensively in the field of clustering. [4] gives a taxonomy and describes algorithms for alternative clustering. Our problem definition is different in several respects. The notion of dissimilarity differs since we strive for differently composed feature sets, while alternative clustering targets different assignments of data objects to clusters. Next, using constraints is only one approach to obtain alternative clusterings. Finally, we optimize feature-set quality in a supervised prediction scenario while clustering is unsupervised.

Subspace search

Finding several useful feature sets also is part of subspace clustering [29, 51] and subspace search [19, 58, 74]. These approaches strive to improve data-mining algorithms by using subspaces, i.e., feature sets, and not all features. Some approaches explicitly try to remove redundancy between subspaces [51, 58] or increase subspace diversity [19, 74]. See [29] for an overview. Alternative feature selection is different in at least one of the following aspects: First, the objective function, i.e., the notion of quality, differs. Second, definitions of subspace redundancy may hinge on feature values instead of binary selection decisions. Third, users typically do not have any control over the dissimilarity. At best, there is a regularization parameter rather than a hard threshold.

Explainable artificial intelligence (XAI)

In XAI, alternative explanations might provide additional insights, allow users to test different hypotheses, and foster trust in the predictions [34, 78]. In contrast, obtaining significantly different explanations for a prediction might raise doubts regarding their meaningfulness [30]. Diversity is a criterion studied for various explainers, e.g., for counterfactuals [13, 31, 47, 50, 68, 77], criticisms [33], or semifactual explanations [1]. There are several approaches to foster diversity, e.g., ensembling explanations [73], considering multiple local minima [77], using a search algorithm that maintains diversity [13], extending the optimization objective [1, 33, 50], or introducing constraints [31, 47, 68]. The last option is similar to the way we enforce alternatives. Of all these approaches, only [1, 47, 50] introduce parameters to control the diversity of solutions; only [47] offers a user-friendly dissimilarity threshold in [0, 1], while the other two approaches regularize the objective. Further, all the previously mentioned XAI techniques provide local explanations, i.e., they target prediction outcomes for individual data objects rather than optimizing the global prediction quality of feature sets.

Rashomon sets

A Rashomon set is a set of prediction models that reach a certain, e.g., close-to-optimal, prediction performance [18]. Despite similar performance, these models may still assign different feature importance scores, leading to different explanations [38]. Thus, Rashomon sets may yield partial information about alternative feature sets. However, approaches for Rashomon sets do not explicitly search for alternative feature sets as a whole, i.e., feature sets satisfying a dissimilarity threshold relative to other sets. Instead, these approaches focus on the range of each feature’s importance over prediction models. Further, our notion of alternatives is not bound to model-based feature importance but encompasses a broader range of feature-selection methods. Finally, we use importance scores from one model instead of multiple models to find importance-based alternatives.

7 Conclusions and future work

Conclusions

Our article has studied alternative feature selection, the problem of identifying diverse feature sets that simultaneously maintain high quality. We formalized alternative feature selection as an optimization problem, using constraints that are independent of the chosen feature-selection method. These constraints can be combined with other constraints on feature sets and offer users control over diversity via two parameters: the number of alternatives and a dissimilarity threshold. We analyzed the computational complexity of this optimization problem, showing its \(\mathcal{N}\mathcal{P}\)-hardness. Additionally, we discussed the integration of various categories of feature-selection methods. Finally, we assessed the effectiveness of alternative feature selection using 30 datasets representing binary-classification problems.

Future work

In the current article, we conducted a broad quantitative evaluation of alternative feature selection on datasets from various domains. Practitioners can employ alternative feature selection in domain-specific case studies and evaluate the alternative feature sets qualitatively, thereby assessing their usefulness for interpretability. Within alternative feature selection, opportunities exist to modify the formulation of the optimization problem, such as by incorporating soft constraints or adopting a multi-objective optimization approach instead of employing hard constraints. Furthermore, one can integrate additional feature-selection methods into the search for alternatives, including embedded methods. Finally, there is potential for enhancing the efficiency of the search for alternatives, particularly in the case of simultaneous search, by developing heuristics instead of seeking exact solutions to the optimization problem.