1 Introduction

Feature selection pursues two major goals: to improve the performance of predictive algorithms such as classification, regression, or clustering models, and to improve data understanding and interpretability. Both aspects are of significant interest in life science fields such as healthcare, where major decisions may be based on data analysis. Here, two sources of information are often available: large-scale collections of data from multiple sources and profound knowledge from domain experts. Previous works tend to handle these sources as opposites, see Cheng et al. (2006), or neglect expert knowledge completely, see Pozzoli (2020). However, a combination of both can be valuable to compensate for underdetermined problem setups arising from high-dimensional datasets, which are prevalent in healthcare data analysis. Moreover, meta-information on the feature set may improve interpretability. Works such as Liu and Zhang (2015) consider constraints between samples but neglect constraints between features. The extension of L1 regularization to the so-called Group Lasso (Yuan & Lin, 2006) and its variants (Ida et al., 2019) accounts for block structure but cannot handle more complex constraint types. Elementary approaches to integrating user knowledge into feature selection include Guan et al. (2009), who suggest manually adding user-defined features to the output of feature selection algorithms. A more advanced model by Brahim and Limam (2014) embeds prior knowledge into three particular feature selection algorithms. However, their work allows neither a direct generalization to other feature selectors nor the integration of more general types of prior knowledge, such as side constraints. Hence, there is a lack of general and sophisticated frameworks for feature selection that combine data-driven methods with user knowledge and deliver transparent results.

Apart from measuring predictive model performance, properties like stability and reproducibility of the feature selector are essential for transparency. A model-independent approach to improving feature selection stability is to deploy ensembles of elementary feature selectors. Recent research by Bose (2021) and Jenul (2021) pursued this idea by utilizing sub-sampling strategies to generate model ensembles, which provide feature stability measures in addition to good predictive performance. Seijo-Pardo et al. (2017) conclude that meta-models composed of elementary feature selectors improve the performance and robustness of the selected feature set in many cases. However, to the best of our knowledge, probabilistic approaches that exploit both a sound statistical framework and the individual benefits of an ensemble of elementary feature selectors are not yet available.

A prominent framework with the capability to combine data and expert knowledge is Bayesian statistics, which has been applied to feature selection in linear models, see O’Hara and Sillanpää (2009). Intentions behind the use of Bayesian methodology vary significantly between authors and do not necessarily involve expert knowledge. Examples include Dalton (2013), who investigates sparsity priors, and Goldstein et al. (2020), who suggest a Bayesian framework to quantify the level of uncertainty in the underlying feature selection model. Other Bayesian approaches to feature selection include Saon and Padmanabhan (2001) and Lyle et al. (2020), but these works do not investigate the use of expert knowledge as a prior. Although the availability of expert knowledge plays an important role in the life sciences, none of these approaches strongly emphasizes domain knowledge about features, nor do they involve specific prior constraints defined by the user.

In this work, we propose a novel Bayesian approach to feature selection that incorporates expert knowledge and maintains considerable model generality. We aim to fill the gap between data-driven feature selection on one side and purely expert-focused feature selection on the other side. Our presented probabilistic approach, UBayFS, combines a generic ensemble feature selection framework with the exploitation of domain knowledge. Hence, it supports interpretability and improves the stability of the results. For this purpose, feature importance votes from independent elementary feature selectors are merged with constraints and feature weights specified by the expert. Constraints may be of a general type, such as selecting a maximum number of features or blocks of features. Both inputs, likelihood and prior, are aggregated in a sound statistical framework, producing a posterior probability distribution over all possible feature sets. We use a Genetic Algorithm for discrete optimization to efficiently determine the optimal posterior feature set, even in high-dimensional datasets. In an extensive experimental evaluation, we analyze UBayFS in a variety of model setups involving prior knowledge and constraints. Results on open-source datasets are benchmarked against state-of-the-art feature selectors in terms of predictive performance and stability, underlining the potential of UBayFS.

Notations  We denote vectors by bold lowercase letters and matrices by bold uppercase letters. Non-bold lowercase letters indicate scalars or functions, and non-bold uppercase letters indicate sets or constants. \(\Vert .\Vert _1\) denotes the L1-norm. [N] abbreviates the index set \(\{1,\dots ,N\}\). The N-dimensional vector of ones is written as \(\varvec{1}_N\). Furthermore, we refer to sets of features by their feature indices, such as \(S\subseteq [N]\), or by a binary membership vector \(\varvec{\delta }^S\in \{0,1\}^N\) with components \((\varvec{\delta }^S)_n = \left\{ \begin{array}{ll} 1 & \text {if}~n\in S, \\ 0 & \text {otherwise.}\end{array}\right.\)
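
As a minimal R sketch of this notation (toy values only), a feature index set can be converted to its membership vector and back:

```r
# Toy example: feature index set S and its binary membership vector delta^S.
N <- 6
S <- c(2, 3, 5)

delta_S <- as.integer(seq_len(N) %in% S)   # (0, 1, 1, 0, 1, 0)
S_back  <- which(delta_S == 1)             # recovers 2, 3, 5
```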

2 User-guided ensemble feature selector

Given a finite set of N features, the goal of UBayFS is to find an optimal subset of feature indices \(S^{\star }\subset [N]\), or, equivalently, \(\varvec{\delta }^{\star } = \varvec{\delta }^{S^{\star }}\in \{0,1\}^N\). We assume that information is available from

  1. Training data to collect evidence by conventional data-driven feature selectors—we denote this as information from data \(\varvec{y}\),

  2. The user’s domain knowledge encoded as subjective beliefs \(\varvec{\alpha }\in \mathbb {R}^N\) about the importance of features, where \(\alpha _n>0\) for all \(n\in [N]\), and

  3. Side constraints, given as inequality system \(\varvec{A}\varvec{\delta }\le \varvec{b}\), to ensure that the obtained feature set conforms with practical requirements and restrictions.

UBayFS assumes a feature importance vector \(\varvec{\theta }\in [0,1]^N\), \(\Vert \varvec{\theta }\Vert _1 = 1\), which is probabilistic and not directly observable, such that evidence about \(\varvec{\theta }\) is collected from data \(\varvec{y}\) and prior weights \(\varvec{\alpha }\). Our model aims to maximize the accumulated importances \(\varvec{\delta }^T\varvec{\theta }\) of the selected features subject to side constraints \(\varvec{A}\varvec{\delta }\le \varvec{b}\). More specifically, we maximize the utility function

$$\begin{aligned} U(\varvec{\delta },\varvec{\theta })=\varvec{\delta }^T\varvec{\theta } - \lambda \kappa (\varvec{\delta }), ~ \lambda > 0, \end{aligned}$$
(1)

where \(\kappa (\varvec{\delta })\) is a non-negative scalar function which penalizes the degree of violation of the constraints. The precise form of \(\kappa (.)\) will be given later. Clearly, we require that \(\kappa (\varvec{\delta }) = 0\) if \(\varvec{A}\varvec{\delta }\le \varvec{b}\) is satisfied. In Eq. 1, \(\lambda >0\) plays the role of a Lagrange parameter; the term \(\lambda \kappa (\varvec{\delta })\) increases the penalization imposed on a feature set violating the constraints. In terms of statistical decision theory, a Bayes decision should maximize the posterior expected utility

$$\begin{aligned} \mathbb {E}_{\varvec{\theta } \vert \varvec{y}}[U(\varvec{\delta },\varvec{\theta }(\varvec{y}))] = \varvec{\delta }^T\mathbb {E}_{\varvec{\theta } \vert \varvec{y}}[\varvec{\theta }(\varvec{y})]-\lambda \kappa (\varvec{\delta }) \longrightarrow \underset{\varvec{\delta }\in \{0,1\}^N}{\max }. \end{aligned}$$
(2)

We denote the optimal feature set according to Eq. 2 by \(\varvec{\delta }^\star\). The importance parameter \(\varvec{\theta }\) is inferred from the results of elementary feature selectors trained on subsets of the dataset, summarized as \(\varvec{y}\), as well as from prior feature importance scores \(\varvec{\alpha }\). Thus, the posterior probability distribution of \(\varvec{\theta }\) given observations \(\varvec{y}\), \(p(\varvec{\theta } \vert \varvec{y})\), is decomposed using Bayes’ theorem into

$$\begin{aligned} p(\varvec{\theta } \vert \varvec{y}) \propto p(\varvec{y} \vert \varvec{\theta }) \cdot p(\varvec{\theta }), \end{aligned}$$
(3)

where \(p(\varvec{y} \vert \varvec{\theta })\) describes the model likelihood (evidence from elementary feature selector model) and \(p(\varvec{\theta })\) describes the density of a prior distribution (user domain knowledge).

The remainder of this Section focuses on determining the missing model components to define the problem stated in Eq. (2), comprising (a) the feature importances \(\varvec{\theta }\), discussed in Sect. 2.1 and 2.2, and (b) the function \(\kappa\), discussed in Sect. 2.3. Finally, Sect. 2.4 suggests the discrete optimization procedure to solve Eq. (2).

2.1 Ensemble feature selection as likelihood

To collect information about feature importances from the given dataset, we train an ensemble of M elementary feature selectors of the same model type on distinct training subsets. The selection of a feature index set \(\varvec{\delta }^{(m)}\), comprising a constant number of \(l = \Vert \varvec{\delta }^{(m)}\Vert _1\) features, in each elementary model m out of a total of M models can be interpreted as the result of drawing l balls from an urn, where each ball has a distinct color representing one feature \(n\in [N]\). Over all elementary models, \(\varvec{y}\) collects the counts of how often each feature is selected, resulting in the count vector

$$\begin{aligned} \varvec{y} = \sum \limits _{m=1}^{M}\varvec{\delta }^{(m)}\in \{0,\dots ,M\}^N. \end{aligned}$$
(4)

Each elementary feature selector delivers a proposal for an optimal feature set. Thus, we let the frequency with which a feature is drawn throughout \(\varvec{\delta }^{(1)},\dots ,\varvec{\delta }^{(M)}\) represent its importance by defining the latent importance parameter vector \(\varvec{\theta } \in [0,1]^N\), \(\Vert \varvec{\theta }\Vert _1 = 1\), as the success probabilities of sampling each feature in an individual urn draw. In a statistical sense, we interpret the result of each elementary feature selector as a realization of a multinomial distribution with parameters \(\varvec{\theta }\) and l. This multinomial setup delivers the likelihood \(p(\varvec{y} \vert \varvec{\theta })\) as the joint probability density

$$\begin{aligned} p(\varvec{y} \vert \varvec{\theta }) = \prod \limits _{m = 1}^{M} f_{\text {mult}}(\varvec{\delta }^{(m)};\varvec{\theta },l), \end{aligned}$$
(5)

where \(f_{\text {mult}}(\varvec{\delta }^{(m)};\varvec{\theta },l)\) denotes the density of a multinomial distribution with success probabilities \(\varvec{\theta }\) and a number of l urn draws. Relevant notations are summarized in Table 1.
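As a minimal R sketch of this construction (the elementary selections are drawn at random here purely for illustration; in UBayFS they come from the trained elementary feature selectors), the count vector of Eq. (4) and the log-likelihood of Eq. (5) can be computed as follows:

```r
set.seed(1)
N <- 10   # number of features
M <- 5    # number of elementary models
l <- 3    # features selected by each elementary model

# Hypothetical elementary selections: each row is one delta^(m) in {0,1}^N
# (random here; in UBayFS they come from trained elementary selectors).
Delta <- t(replicate(M, {
  d <- integer(N)
  d[sample.int(N, l)] <- 1L
  d
}))

y <- colSums(Delta)                    # count vector, Eq. (4)

theta <- rep(1 / N, N)                 # some candidate importance vector
loglik <- sum(apply(Delta, 1, function(d)
  dmultinom(d, size = l, prob = theta, log = TRUE)))   # log of Eq. (5)
```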

Table 1 Notations for likelihood parameters

2.2 Expert knowledge as prior weights

To construct the prior distribution, UBayFS uses expert knowledge as a-priori weights of features. Since the domain of the distribution of feature importances \(\varvec{\theta }\) is a simplex, \(\varvec{\theta }\in \Theta \subset [0,1]^N\), \(\Vert \varvec{\theta }\Vert _1 = 1\), the Dirichlet distribution is a natural choice of prior; it is widely used in data science problems, see, e.g., Nakajima et al. (2014). Thus, we initially assume that, a-priori,

$$\begin{aligned} p(\varvec{\theta }) = f_{\text {Dir}}(\varvec{\theta };\varvec{\alpha }), \end{aligned}$$
(6)

where \(f_{\text {Dir}}(\varvec{\theta };\varvec{\alpha })\) denotes the density of the Dirichlet distribution with positive parameter vector \(\varvec{\alpha } = (\alpha _1,\dots ,\alpha _N)\). Since the Dirichlet distribution is a conjugate prior of the multinomial distribution, the posterior distribution is again of Dirichlet type, see DeGroot (2005). Thus, the posterior density satisfies

$$\begin{aligned} p(\varvec{\theta } \vert \varvec{y})\propto f_{\text {Dir}}(\varvec{\theta }; \varvec{\alpha }^{\circ }), \end{aligned}$$
(7)

where the parameter update is obtained in closed form by

$$\begin{aligned} \varvec{\alpha }^{\circ } = \varvec{\alpha } + \varvec{y}. \end{aligned}$$
(8)

In case of integer-valued prior weights \(\varvec{\alpha }\), they may be interpreted as pseudo-counts in the context of modelling success probabilities in an urn model—comparable to the information gained if the corresponding counts were observed in a multinomial data sample. In UBayFS, we obtain \(\varvec{\alpha }\) as feature weights provided by the user. If no user knowledge is available, the least informative choice is to specify uniform counts with a small positive value, such as \(\varvec{\alpha }_{\text {unif}}=0.01\cdot \varvec{1}_N\).
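Continuing the sketch from Sect. 2.1, the conjugate update of Eq. (8) and the closed-form posterior mean of the standard Dirichlet model read, in R:

```r
alpha      <- rep(0.01, N)        # uninformative prior weights (default)
alpha_post <- alpha + y           # conjugate parameter update, Eq. (8)

# Closed-form posterior mean of the standard Dirichlet model:
# the normalized posterior parameter vector.
theta_post_mean <- alpha_post / sum(alpha_post)
```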

2.2.1 Generalized Dirichlet model

Even though the presented Dirichlet-multinomial model is a popular choice due to its favorable statistical properties, it implicitly assumes that classes (in our case, features) are mutually independent. However, high-dimensional datasets frequently involve complex correlation structures between the features. To account for this aspect, we generalize the setup by replacing the Dirichlet prior with a generalized Dirichlet distribution. The highest level of generalization is achieved by Hankin (2010), who introduced the hyperdirichlet distribution, which can take arbitrary covariance structures into account. The hyperdirichlet distribution maintains the conjugate prior property with respect to the multinomial likelihood, and thus inference is tractable; however, the analytical expression of the expected value involves an intractable normalization constant and therefore requires numerical means such as Markov Chain Monte Carlo (MCMC) methods, which may face computational challenges due to the high dimensionality of the problem.

A compromise between the complexity of the problem and the flexibility of the covariance structure is given by an earlier generalization of the Dirichlet distribution by Wong (1998), which is a special case of the hyperdirichlet setup but more general than the standard Dirichlet distribution. In addition to retaining the properties of the hyperdirichlet distribution, the generalized Dirichlet distribution permits the expected value to be evaluated directly from the distribution parameters. Section 3 provides an experimental evaluation of the proposed variants to account for covariance structures in the UBayFS model.

2.3 Side constraints as regularization

Practical setups may require that a selected feature set fulfills certain consistency requirements. These may involve a maximum number of selected features, a low mutual correlation between features, or a block-wise selection of features. UBayFS enables the feature selection model to account for such requirements via a function \(\kappa\), which incorporates a system of K inequalities restricting the feature set \(\varvec{\delta }\), \(\varvec{A}\varvec{\delta }-\varvec{b}\le 0\), where \(\varvec{A}\in \mathbb {R}^{K\times N}\) and \(\varvec{b}\in \mathbb {R}^{K}\). Each single constraint \(k\in [K]\) can be evaluated via an inadmissibility function \(\kappa _k(.)\), such that

$$\begin{aligned} \kappa _k(\varvec{\delta }) = \left\{ \begin{array}{l l} 0 & \text {if}~ \left( \varvec{a}^{(k)}\right) ^T\varvec{\delta } - b^{(k)} \le 0 \\ 1 & \text {otherwise},\end{array}\right. \end{aligned}$$
(9)

where \(\varvec{a}^{(k)}\) is the k-th row vector of \(\varvec{A}\) and \(b^{(k)}\) the k-th element of \(\varvec{b}\). UBayFS generalizes this setup by relaxing the constraints: if a feature set \(\varvec{\delta }\) violates a constraint, it is assigned a higher penalty rather than being excluded completely. This effect is achieved by replacing \(\kappa _k(.)\) with a relaxed inadmissibility function \(\kappa _{k,\rho }(.)\) based on a logistic function with relaxation parameter \(\rho \in \mathbb {R}^{+}\cup \{\infty \}\):

$$\begin{aligned} \kappa _{k,\rho }(\varvec{\delta }) = \left\{ \begin{array}{l l} 0 & \text {if}~\left( \varvec{a}^{(k)}\right) ^T\varvec{\delta }\le b^{(k)}\\ 1 & \text {if}~ \left( \varvec{a}^{(k)}\right) ^T\varvec{\delta }> b^{(k)} \wedge \rho =\infty \\ \frac{1-\xi _{k,\rho }}{1 + \xi _{k,\rho }} & \text {otherwise}, \end{array} \right. \end{aligned}$$
(10)

with \(\xi _{k,\rho } = \exp \left( -\rho \left( \left( \varvec{a}^{(k)}\right) ^T \varvec{\delta } - b^{(k)}\right) \right)\). Fig. 1 illustrates that a large parameter \(\rho \longrightarrow \infty\) lets the inadmissibility converge pointwise towards the associated hard constraint. A low \(\rho\) flattens the penalization to an almost constant function in a local neighborhood around the decision boundary, such that only a minor difference is made between feature sets that fulfill and those that violate a constraint.
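
As a sketch, the relaxed inadmissibility of Eq. (10) translates directly into R (a_k, b_k, and rho denote one constraint row, its bound, and its relaxation parameter):

```r
# Relaxed inadmissibility kappa_{k,rho} of a feature set delta for one linear
# constraint a_k^T delta <= b_k, following Eq. (10).
kappa_k <- function(delta, a_k, b_k, rho = 1) {
  excess <- sum(a_k * delta) - b_k
  if (excess <= 0) return(0)          # constraint fulfilled: no penalty
  if (is.infinite(rho)) return(1)     # hard constraint: full penalty
  xi <- exp(-rho * excess)
  (1 - xi) / (1 + xi)                 # logistic relaxation
}
```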

Fig. 1: The effect of \(\rho\) on \(\kappa _{k,\rho }\) for soft constraints

Finally, the joint inadmissibility function \(\kappa (.)\) aggregates information from all constraints

$$\begin{aligned} \kappa (\varvec{\delta }) = 1 - \prod \limits _{k=1}^{K} \left( 1 -\kappa _{k,\rho }(\varvec{\delta })\right) , \end{aligned}$$
(11)

which originates from the idea that \(\kappa = 1\) (maximum penalization) if at least one \(\kappa _{k,\rho }=1\), while \(\kappa =0\) (no penalization) if all \(\kappa _{k,\rho }=0\).

Note that different relaxation parameters may be used to prioritize the constraints among each other, hence \(\kappa\) involves a parameter vector \(\varvec{\rho }=(\rho _1,\dots ,\rho _K)\). Notations related to prior parameters and constraints are summarized in Table 2.
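
Building on the kappa_k sketch above and the quantities N and y from the Sect. 2.1 sketch, the joint inadmissibility of Eq. (11) with one relaxation parameter per constraint can be sketched as:

```r
# Joint inadmissibility kappa(delta), Eq. (11), for a constraint system (A, b)
# with one relaxation parameter rho[k] per constraint.
kappa_joint <- function(delta, A, b, rho) {
  k_vals <- vapply(seq_len(nrow(A)), function(k)
    kappa_k(delta, A[k, ], b[k], rho[k]), numeric(1))
  1 - prod(1 - k_vals)
}

# Example: a max-size constraint allowing at most 3 of the N features.
A   <- matrix(1, nrow = 1, ncol = N)
b   <- 3
rho <- 1
kappa_joint(c(1, 1, 1, 1, rep(0, N - 4)), A, b, rho)   # > 0: four features selected
```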

Table 2 Notations used for prior parameters

2.3.1 Feature decorrelation constraints

Commonly, feature sets with low mutual correlations are preferred since they tend to contain less redundant information. A special case of prior constraints can be defined to enforce the selection of such feature sets. We refer to these constraints as decorrelation constraints: pairwise cannot-link constraints between highly correlated features, i.e., features i and j whose correlation coefficient \(\tau _{i,j}\) exceeds a predefined threshold in absolute value, \(\vert \tau _{i,j}\vert > \tau\). For each such pair \(i,j\in [N], i\ne j\), a constraint is added to the constraint system as follows: the vector \(\varvec{a}\) with elements

$$\begin{aligned} a_{n} = \left\{ \begin{array}{ll} 1 & \text {if }n \in \{i,j\} \\ 0 & \text {else,}\end{array}\right. \end{aligned}$$
(12)

and an element \(b = 1\) are appended to \(\varvec{A}\) and \(\varvec{b}\), respectively. We set the shape parameter \(\rho\) to the odds ratio of the absolute correlation coefficient \(\tau _{i,j}\), given as

$$\begin{aligned} \rho = \frac{\vert \tau _{i,j}\vert }{1-\vert \tau _{i,j}\vert }. \end{aligned}$$
(13)

Hence, pairs of features with higher absolute correlations are penalized more strongly, and vice versa. As a result, the selected feature set contains features with lower mutual correlations.
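
A sketch of how the decorrelation rows of (A, b, rho) could be assembled from a correlation matrix Tau (the function name and the default threshold are placeholders), following Eqs. (12) and (13):

```r
# Append one cannot-link constraint per feature pair (i, j) with |tau_ij| > tau:
# the row a has ones at positions i and j, the bound is b = 1, and the shape
# parameter is the odds ratio of the absolute correlation, Eq. (13).
build_decorr_constraints <- function(Tau, tau = 0.4) {
  N <- nrow(Tau)
  A <- NULL; b <- c(); rho <- c()
  for (i in seq_len(N - 1)) {
    for (j in (i + 1):N) {
      if (abs(Tau[i, j]) > tau) {
        a <- numeric(N); a[c(i, j)] <- 1
        A   <- rbind(A, a)
        b   <- c(b, 1)
        rho <- c(rho, abs(Tau[i, j]) / (1 - abs(Tau[i, j])))
      }
    }
  }
  list(A = A, b = b, rho = rho)
}
```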

2.3.2 Feature block priors

User knowledge may also be available for feature blocks rather than for single features. Feature blocks are contextual groups of features, such as those extracted from the same source in a multi-source dataset. It can be desirable to select features from only a few distinct blocks so that the model does not depend on all sources at once. While prior weights can trivially be assigned on the block level, we transfer the concept of side constraints to feature blocks.

Feature blocks are specified via a block matrix \(\varvec{B} \in \{0,1\}^{W\times N}\), where entry \((w,n)\) equals 1 if feature \(n\in [N]\) is part of block \(w\in [W]\), and 0 otherwise. Even though a full partition of the feature set is common, feature blocks are neither required to be mutually exclusive nor exhaustive. Along with the block matrix \(\varvec{B}\), an inequality system between blocks consists of a matrix \(\varvec{A}^{\text {block}}\in \mathbb {R}^{K\times W}\) and a vector \(\varvec{b}^{\text {block}}\in \mathbb {R}^{K}\). To evaluate whether a block is selected by a feature set \(\varvec{\delta }\), we define the block selection vector \(\varvec{\delta }^{\text {block}}\in \{0,1\}^{W}\), given by

$$\begin{aligned} \varvec{\delta }^{\text {block}} = \left( \varvec{B}\varvec{\delta }\ge \varvec{1}_W\right) , \end{aligned}$$
(14)

where \(\ge\) refers to an element-wise comparison of vectors, delivering 1 for a component if the condition is fulfilled and 0 otherwise. In other words, a feature block is selected if at least one feature of the corresponding block is selected. Although block constraints introduce non-linearity into the system of side constraints, they can be used in the same way as linear constraints between features and integrated into the joint inadmissibility function \(\kappa\).
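
A toy R sketch of the block selection vector in Eq. (14), assuming a simple partition of six features into three blocks:

```r
# Block selection vector delta_block = (B delta >= 1), Eq. (14): a block counts
# as selected if at least one of its features is selected.
B <- rbind(c(1, 1, 0, 0, 0, 0),    # block 1: features 1, 2
           c(0, 0, 1, 1, 0, 0),    # block 2: features 3, 4
           c(0, 0, 0, 0, 1, 1))    # block 3: features 5, 6

delta       <- c(1, 0, 0, 0, 1, 1)
delta_block <- as.integer(B %*% delta >= 1)   # (1, 0, 1): blocks 1 and 3 selected
```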

2.4 Optimization

Exploiting the conjugate prior property, the posterior density of \(\varvec{\theta }\) can be expressed as a Dirichlet, generalized Dirichlet, or hyperdirichlet distribution, respectively. The posterior expected value \(\mathbb {E}_{\varvec{\theta }\vert \varvec{y}}[\varvec{\theta }]\) can either be computed in closed form (Dirichlet and generalized Dirichlet), see Wong (1998), or approximated via a sampling procedure (hyperdirichlet), see Hankin (2010). It remains to solve the discrete optimization problem in Eq. (2) as a final step.

Algorithm 1

Since an analytical solution of the resulting knapsack-type problem is not feasible, we determine a numerical optimum \(\varvec{\delta }^{\star }\) via discrete optimization: we deploy the Genetic Algorithm (GA) described by Givens and Hoeting (2012). To accelerate convergence towards an acceptable solution, it is beneficial to provide initial samples that are good candidates for the final solution. For this purpose, we propose a probabilistic sampling algorithm, Alg. 1: in essence, the algorithm creates a random permutation of all features, \(\pi :[N]\rightarrow [N]\), by weighted and ordered sampling without replacement, where the weights are given by the posterior parameter vector \(\varvec{\alpha }^{\circ }\). Then, the algorithm iteratively accepts or rejects feature \(\pi (n)\) with a success probability

$$\begin{aligned} r_{\varvec{\delta }^{\dagger },\varvec{\delta }} = \left\{ \begin{array}{l l} \frac{1-\kappa (\varvec{\delta }^{\dagger })}{1-\kappa (\varvec{\delta })} & \text {if}~\kappa (\varvec{\delta }) <1 \\ 0 & \text {else,} \end{array}\right. \end{aligned}$$
(15)

denoting the ratio of admissibilities of the feature sets with (\(\varvec{\delta }^{\dagger }\)) and without (\(\varvec{\delta }\)) feature \(\pi (n)\). Features with high weights obtain low ranks in the permutation and are therefore more likely to be accepted in the acceptance/rejection step.
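
Since the pseudocode of Alg. 1 is not reproduced here, the following is only a hedged R sketch of the sampling procedure as described above (kappa_joint refers to the sketch in Sect. 2.3; details of the original algorithm may differ):

```r
# Probabilistic sampling of an initial feature set, following the textual
# description of Alg. 1 (sketch; the original pseudocode may differ).
sample_initial <- function(alpha_post, A, b, rho) {
  N     <- length(alpha_post)
  perm  <- sample.int(N, N, replace = FALSE, prob = alpha_post)  # weighted ranking
  delta <- integer(N)
  for (n in perm) {
    delta_new    <- delta
    delta_new[n] <- 1L
    k_without <- kappa_joint(delta, A, b, rho)
    k_with    <- kappa_joint(delta_new, A, b, rho)
    r <- if (k_without < 1) (1 - k_with) / (1 - k_without) else 0   # Eq. (15)
    if (runif(1) < r) delta <- delta_new
  }
  delta
}
```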

The Genetic Algorithm (GA) for discrete optimization is initialized using Algorithm 1. Starting with an initial set of feature membership vectors \(\left\{ \varvec{\delta }^{0}\in \{0,1\}^N\right\}\), GA creates new vectors \(\varvec{\delta }^{t}\in \{0,1\}^N\) as pairwise combinations of two preceding vectors \(\varvec{\delta }^{t-1}\) and \(\tilde{\varvec{\delta }}^{t-1}\) in each iteration \(t\in [T]\). A combination refers to sampling component \(\varvec{\delta }^{t}_n\) from either \(\varvec{\delta }^{t-1}_n\) or \(\tilde{\varvec{\delta }}^{t-1}_n\) in a uniform way and adding minor random mutations to single components. The posterior density serves as fitness when deciding which vectors \(\varvec{\delta }^{t-1}\) and \(\tilde{\varvec{\delta }}^{t-1}\) from iteration \(t-1\) should be combined to \(\varvec{\delta }^{t}\) — the fitter, the more likely to be part of a combination.

The runtime of the GA depends linearly on the population size and the number of iterations. A good trade-off between runtime and convergence properties is important: a small population size, for example, might lead to faster convergence but can get trapped in a local optimum. Further, the runtime depends on the cost of evaluating the fitness function, which in turn depends on the dimensionality of the problem.
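
To make the optimization step concrete, a hedged sketch using the GA R package of Scrucca (2013), which our implementation reports using (see Sect. 3.1); the fitness is the posterior expected utility of Eq. (2), the variables stem from the earlier sketches, and the settings of the actual implementation may differ:

```r
library(GA)   # R package by Scrucca (2013)

lambda <- 1

# Posterior expected utility of Eq. (2) as GA fitness; theta_post_mean,
# kappa_joint, A, b, rho, alpha_post, and sample_initial stem from the sketches above.
fitness <- function(delta)
  sum(delta * theta_post_mean) - lambda * kappa_joint(delta, A, b, rho)

# Initial population suggested by Alg. 1 (here: 20 start vectors).
init <- t(replicate(20, sample_initial(alpha_post, A, b, rho)))

res <- ga(type = "binary", fitness = fitness,
          nBits = length(theta_post_mean),
          popSize = 100, maxiter = 100,
          suggestions = init, monitor = FALSE)

delta_star <- as.integer(res@solution[1, ])   # optimized feature set
```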

3 Experiments and results

Our numerical experiments evaluate the performance, flexibility, and applicability of UBayFS in two parts: first, a study conducted on synthetic datasets demonstrates the properties of the various model parameters, including

  a. The number of elementary models M (1a),

  b. The prior weights \(\varvec{\alpha }\) in a block-wise setup (1b),

  c. The constraint types and their shapes \(\rho\) in a block-wise setup (1c), as well as

  d. The type of prior distribution to account for feature dependencies (1d).

The second part of our experiment is conducted on real-world classification datasets from the life science domain. In a comparison with state-of-the-art ensemble feature selectors, we demonstrate that UBayFS delivers similar model performances. Our setups include ordinary and block feature selection without prior knowledge to ensure a fair comparison. Finally, we conduct a case study with expert knowledge available from biological investigations, and demonstrate how informative priors increase model performance in practice.

3.1 Default parameters

Six types of feature selectors are evaluated as elementary models for UBayFS:

  • Minimum Redundancy Maximum Relevance (mRMR) Ding and Peng (2005),

  • Fisher score Bishop (1995),

  • Decision tree for classification Breiman et al. (1984),

  • Recursive feature elimination (RFE) Guyon et al. (2002),

  • Hilbert-Schmidt Independence Criterion Lasso (HSIC) Yamada et al. (2014),

  • Lasso Tibshirani (1996).

However, the main focus of the present work is to evaluate the generic concept of UBayFS rather than to provide an in-depth analysis of these elementary feature selectors.

Our implementation of UBayFS in R (R Core Team, 2020) uses the Genetic Algorithm package authored by Scrucca (2013) with \(T=100\) and \(Q = 100\); in most cases, convergence is achieved after around ten iterations. By default, each UBayFS setup comprises an uninformative prior with \(\alpha _n=0.01\) for all \(n\in [N]\), and a max-size (MS) constraint restricting the selection to at most \(b_{\text {MS}}\) features, which is determined individually for each dataset. Thus, by default, the constraint system is given as:

$$\begin{aligned} \varvec{A} = (1 ~ 1 ~ \dots ~ 1), \varvec{b} = b_{\text {MS}}, \varvec{\rho } = 1. \end{aligned}$$

No further user knowledge or side constraints are introduced unless stated explicitly in the particular setups. Each setup is executed in \(I = 10\) independent runs \(i \in [I]\), representing distinct random splits of the dataset \(\mathcal {D}\) into train data \(T_{\text {train}}^{(i)}\) and test data \(T_{\text {test}}^{(i)} = \mathcal {D}\setminus T_{\text {train}}^{(i)}\) (stratified 75%/25% split).

3.2 Evaluation metrics

For the synthetic datasets, performance is measured by the F1 score of correctly/incorrectly selected features, since the ground truth about the relevance of features is known from the simulation procedure. For real-world data, F1 scores refer to the predictive results obtained by training a classification model after feature selection and thus judge the feature selection quality indirectly. Furthermore, all experiments evaluate the stability measure by Nogueira et al. (2018) across the I independent feature selection runs. Stability ranges asymptotically in [0, 1], where 1 indicates that the same features are selected in every run (perfectly stable). Runtime refers to the time the model requires to perform feature selection, including elementary model training and optimization, but excluding any predictive model trained on top of the feature selection results. Since prior parameters have only a minor influence on the runtime, runtimes are not provided for experiments investigating these aspects.
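
For reference, a hedged R sketch of the stability estimator as we read it from Nogueira et al. (2018), applied to a binary selection matrix Z whose rows are the I selected feature sets:

```r
# Stability in the sense of Nogueira et al. (2018): Z is an I x N binary matrix
# whose rows are the feature sets selected in the I independent runs.
stability <- function(Z) {
  I <- nrow(Z); N <- ncol(Z)
  p_hat <- colMeans(Z)                          # selection frequency per feature
  s2    <- I / (I - 1) * p_hat * (1 - p_hat)    # unbiased per-feature variance
  k_bar <- mean(rowSums(Z))                     # average number of selected features
  1 - mean(s2) / ((k_bar / N) * (1 - k_bar / N))
}
```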

3.3 Experiment 1: simulation study

To investigate major properties of UBayFS, we simulate four different datasets:

  i. An additive model (experiment 1a) similar to Data1 in Yamada et al. (2014), composed of a \(1000\times 1000\) data matrix with features \(x_1,\dots ,x_{1000}\) simulated from a Gaussian distribution \(N(\varvec{0}_{1000},\varvec{I}_{1000})\), and a binary target variable

    $$f(\varvec{x},\varepsilon )=g(-2\sin (2x_1)+x_2^2+x_3+\exp (-x_4)+\varepsilon ),$$

    where \(x_1,\dots ,x_4\) denote the features 1 to 4 and \(\varepsilon \sim N(0,1)\). The function g transforms z into a class variable by

    $$\begin{aligned} g(z)=\left\{ \begin{array}{ll} 1 & \text {if}~z\ge 0,\\ 0 & \text {otherwise;} \end{array}\right. \end{aligned}$$

  ii. A non-additive model (experiment 1a) similar to Data2 in Yamada et al. (2014), equivalent to the setup of i., except for a multiplicative target variable

    $$f(\varvec{x},\varepsilon )=g(x_1\cdot \exp (2x_2)+x_3^2+\varepsilon );$$

  iii. A simulated dataset (experiments 1b, 1c) with group structure among the features, produced via make_classification (Pedregosa, 2011), delivering a \(512\times 256\) dataset with 8 feature blocks of 32 features each; 4 of these blocks contain relevant features (4 important features per block), and 2 blocks contain redundant features representing arbitrary linear combinations of the relevant features (3 redundant features per block);

  iv. Another dataset simulated via make_classification, comprising 32 features in total (16 important, 16 redundant) without block structure. This smaller dataset (\(64\times 32\)) has a complicated correlation structure due to the high number of redundant features and is used to evaluate UBayFS variants that take feature dependence into account (experiment 1d).

The maximum number of selected features \(b_{\text {MS}}\) is set to the ground-truth number of relevant features, i.e., \(b_\text {MS}=4\) (dataset i.), \(b_\text {MS}=3\) (dataset ii.), and \(b_\text {MS}=16\) (datasets iii. and iv.), respectively. The default shape parameter of the MS constraint is set to \(\rho _{\text {MS}} = 1\). Unless otherwise stated, the prior weights are set to a constant, uninformative value of \(\alpha _n=0.01\) for all features.

In addition to the constraint shape \(\rho\) associated with a single constraint, \(\lambda\) balances the overall impact of the side constraints against the Dirichlet-multinomial model. A small parameter \(\lambda <1\) is not recommended, since a lack of influential constraints (including the MS constraint) results in selecting all features due to an effectively unregularized utility function U. On the other hand, a high \(\lambda\) has a similar effect as setting all shape parameters uniformly to \(\rho =\infty\); thus, all constraints are required to be fulfilled. In this study, \(\lambda\) has only a minor impact on the resulting model metrics and is therefore set to \(\lambda =1\).

3.3.1 Experiment 1a—likelihood parameters

Figure 2 demonstrates the effect of an increasing number of elementary models M used to build the feature selector; M is the main parameter steering the likelihood. Due to their excessive runtimes, HSIC and RFE are computed only for \(M\le 10\), while all other elementary feature selectors are evaluated for up to \(M=200\).

As expected, a higher M contributes largely to the runtime of the model, which increases linearly in M. In contrast, both F1 scores and stability values begin to saturate at around \(M=50\) to \(M=100\) models. Even though large ensembles are intractable with HSIC and RFE, small ensembles with \(M=5\) allow HSIC to retrieve almost all relevant features, whereas simpler elementary feature selectors struggle to achieve high performance and stability even at higher levels of M. We conclude that a large M does not necessarily improve the results but significantly impacts the runtime. Thus, \(M\approx 100\) appears to be a reasonable choice in the subsequent settings, except for HSIC and RFE, where \(M=5\) is set as the default.

Fig. 2: Different numbers of elementary models M

3.3.2 Experiment 1b—“correct” and “incorrect” prior weights

To investigate the effect of prior weights \(\varvec{\alpha }\), we alter the prior weights in dataset iii. by feature block. A constant prior weight \(\alpha _R\) is assigned to all features from relevant blocks, i.e., blocks 1-4 containing informative and non-informative features. In contrast, features from blocks 5-8 (containing only non-informative features) are assigned a constant prior weight \(\alpha _{-R}\)—thereby, we simulate that the expert has approximate, yet not exact beliefs about feature relevance. By assigning higher prior weights \(\alpha _R>\alpha _{-R}\), the experiment simulates an agreement between the expert belief and the ground truth (“correct prior”), while a lower \(\alpha _{R}<\alpha _{-R}\) represents “wrong” prior information (“incorrect prior”). To simulate correct and incorrect prior knowledge at different levels, we increase \(\alpha _{R}\) while setting \(\alpha _{-R}\) to the default value 0.01, and vice versa.

Figure 3 illustrates that, as expected, feature selection performance in terms of F1 scores (evaluated with respect to the ground-truth features) increases for higher \(\alpha _R\) and decreases for higher \(\alpha _{-R}\). Thus, across all elementary feature selectors, an improvement over the uninformative case \(\alpha _{R}=\alpha _{-R}=0.01\) can be achieved by an informative prior if the prior has a reasonable overlap with reality; this holds even though the relevant blocks also contain uninformative features, whose weights are incremented by \(\alpha _R\) as well. On the other hand, erroneous prior knowledge can impact the feature selection results negatively. In contrast to the feature-wise F1 scores, stability remains mostly unaffected by strong prior knowledge on relevant or irrelevant blocks; incorrect prior knowledge merely tends to decrease stability to a minor degree.

Fig. 3: Different prior weights assigned to relevant blocks, \(\alpha _R\), and to non-relevant blocks, \(\alpha _{-R}\)

3.3.3 Experiment 1c—side constraints

We investigate the following two opposing constraint types:

  • Block-max-size (BMS): features are selected from at most \(b_{\text {BMS}}\) distinct blocks, and

  • Max-per-block (MPB): at most \(b_{\text {MPB}}\) features are selected from each block.

BMS is designed to enforce a clustering behavior, where all selected features originate from at most \(b_{\text {BMS}} = 4\) blocks. In contrast, MPB aims to disperse the selection, such that at most \(b_{\text {MPB}}=2\) features per block are selected. The strength of these constraints is steered via the corresponding shape parameters \(\rho _{\text {BMS}}\) and \(\rho _{\text {MPB}}\), respectively, where \(\rho = 0\) indicates that a constraint is omitted. Starting from the default case \(\rho _{\text {BMS}}=\rho _{\text {MPB}}=0\) (no block constraints), we investigate the behavior of UBayFS under one of the two constraints at a time, at increasing levels of \(\rho _{\text {BMS}}\) or \(\rho _{\text {MPB}}\).

Fig. 4 illustrates how the opposing side constraints BMS and MPB affect the model at different levels of the relaxation parameters. Both constraint types have a slightly negative impact on the outcome in terms of F1 score and stability. This is caused by the fact that the “best” feature set has to be determined under a side constraint that is not compatible with the ground truth: the ground truth defines 16 features out of four distinct blocks as relevant, which cannot be covered by any of the constraints. Nevertheless, we observe that UBayFS can handle such scenarios and still deliver appropriate, near-optimal solutions.

Fig. 4: Different prior constraints assigned to blocks: MPB (max-per-block) and BMS (block-max-size) constraint types at distinct levels of \(\rho\). The special case \(\rho =0\) indicates that the corresponding constraint is omitted

3.3.4 Experiment 1d—between-feature correlations

In Sect. 2, multiple variants were discussed to account for datasets with a given correlation structure. On the one hand, the UBayFS framework permits accounting for between-feature correlations via a generalization of the prior distribution; on the other hand, we may enforce that highly correlated features are not selected jointly via a decorrelation constraint. The two variants differ insofar as generalized priors aim to deliver a more appropriate estimate of the expected feature importances by correcting for dependencies in the observed feature sets, while decorrelation constraints directly affect the optimization procedure for \(\varvec{\delta }\).

In this experiment, we investigate both possibilities to account for between-feature correlations, along with their combinations: we set a decorrelation constraint between all pairs of features whose mutual Spearman correlation exceeds the threshold \(\tau =0.4\) in absolute value, as described in Sect. 2.3, such that the joint selection of highly correlated features is penalized. Further, we apply the following prior setups:

  • Dirichlet prior distribution (default),

  • Generalized Dirichlet distribution Wong (1998),

  • Hyperdirichlet distribution Hankin (2010).

Our experiment involves all combinations of the prior setups with and without the decorrelation constraint, executed on dataset iv. To measure the effect of decorrelation, we further evaluate the redundancy rate (RED; Zhao et al., 2010), defined as the average absolute Pearson correlation among the selected features. A small RED is commonly preferred in practical setups.
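
A short R sketch of the redundancy rate as defined above (X denotes a hypothetical data matrix and delta the selected feature set):

```r
# Redundancy rate (RED): average absolute pairwise Pearson correlation
# among the selected features of a data matrix X.
red_rate <- function(X, delta) {
  sel <- which(delta == 1)
  C   <- abs(cor(X[, sel], method = "pearson"))
  mean(C[upper.tri(C)])                  # average over distinct feature pairs
}
```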

The results in Fig. 5 show that neither feature-wise F1 scores nor stabilities change significantly between the prior models. Thus, the default Dirichlet model seems sufficient in practice. However, introducing decorrelation constraints has a slightly negative impact on stability, while yielding a small improvement in F1 scores and RED. Nonetheless, the most significant change between the variants can be observed with respect to runtime, which reflects the high computational burden associated with the hyperdirichlet prior model—even on a small dataset, the runtimes show a significant increase on a logarithmic scale. Thus, higher-dimensional datasets can only be tackled at an enormous computational cost with the hyperdirichlet setup.

Fig. 5: Different setups to account for dependence structures between features

3.4 Experiment 2: real-world datasets

Numerical studies are conducted on eight open-source datasets presenting binary classification problems from the life science domain, see Table 3. For simplicity and due to extensive runtimes, we restrict the choice of elementary feature selectors for UBayFS to mRMR, Fisher score, and decision tree, with an uninformative prior, an MS constraint, and \(M=100\). The number of selected features is specified according to the size of the dataset (\(b_\text {MS}=5\) / 10 / 20 / 100 for datasets with fewer than 100 / between 100 and 1000 / between 1000 and 10000 / more than 10000 features, respectively).

Table 3 Real-world binary classification datasets from the life science domain used for experimental evaluation. For p53, a stratified subset out of \(>16000\) rows was used from the original dataset for this experiment

In addition to conventional feature selection (scenario 1) with the max-size constraint \(b_{\text {MS}}\) specified in Table 3, we evaluate block feature selection (scenario 2) for datasets with a block-wise feature structure. For block feature selection, up to \(b_{\text {MS}}\) features should be selected from at most \(b_{\text {BMS}}\) distinct blocks. Random Forest (RF) Breiman (2001) and RENT Jenul (2021), representing ensemble feature selectors that extend the concepts of decision trees and elastic-net regularized models, respectively, are used as state-of-the-art benchmarks for standard feature selection, while Sparse Group Lasso (GL) Ida et al. (2019) is used as the benchmark for block feature selection. To conform with UBayFS, RENT and RF are adjusted to \(M=100\) elementary models, and all models are tuned to select approximately the same number of features, \(b_{\text {MS}}\). Since RENT and GL cannot be instructed to select exactly \(b_{\text {MS}}\) features, their regularization parameters are determined via bisection, such that the number of selected features is approximately equal to \(b_{\text {MS}}\).
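
A generic sketch of such a bisection (n_selected is a hypothetical wrapper returning the number of features selected at a given regularization strength; bisecting on a logarithmic scale is our assumption, not necessarily the original procedure):

```r
# Generic bisection over a regularization strength so that a selector picks
# approximately b_MS features; n_selected is a hypothetical wrapper returning
# the number of selected features, assumed to decrease with the strength.
tune_regularization <- function(n_selected, b_MS,
                                lower = 1e-4, upper = 1e2, max_iter = 30) {
  for (i in seq_len(max_iter)) {
    mid <- sqrt(lower * upper)           # bisect on a logarithmic scale
    k   <- n_selected(mid)
    if (k == b_MS) break
    if (k > b_MS) lower <- mid else upper <- mid
  }
  mid
}
```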

The selected features cannot be evaluated directly in real-world datasets due to unknown ground truth on the feature relevance. Therefore, we train predictive models on \(T_{\text {train}}^{(i)}\) after feature selection and evaluate the selected features indirectly via the predictive performance on the test instances. To reduce the influence of the predictive model type, we train two distinct classifiers on \(T_{\text {train}}^{(i)}\) after feature selection, and report F1 scores for predictions on \(T_{\text {test}}^{(i)}\) for both. The choice of baseline classifiers to obtain the prediction comprises:

  • generalized linear model: logistic regression (GLM),

  • support vector machine (SVM).

Table 4 UBayFS with three distinct elementary feature selectors (M: mRMR, F: Fisher, T: decision tree) is compared to ensemble feature selectors RF and RENT in a standard feature selection scenario. UBayFS with additional (BMS) constraint is compared to Sparse Group Lasso (GL) for block-feature selection on datasets with block structure. Average F1 scores are given for different predictive models (GLM, SVM). The best scores for each dataset and evaluation metric are marked in bold—standard feature selection and block feature selection are assessed separately
Table 5 Mean stabilities of UBayFS with three distinct elementary feature selectors (M: mRMR, F: Fisher, T: decision tree), compared to ensemble feature selectors RF and RENT in standard feature selection, as well as to GL in block feature selection scenarios. The best scores in each row are marked in bold for each scenario

3.4.1 Results

Tables 4 and 5 present the results of the experiments on real-world data. UBayFS achieves good predictive F1 scores throughout the different datasets, even though, to ensure a fair comparison, no expert knowledge is introduced. In the block feature selection setups, UBayFS benefits from block constraints and shows more flexibility than Sparse Group Lasso. Altogether, UBayFS can keep up with its competitors in terms of predictive performance in a diverse range of scenarios (low-dimensional and high-dimensional data, as well as unconstrained and constrained setups) while providing higher flexibility to introduce additional information or constraints. Overall, the results reflect that a particular strength of UBayFS lies in delivering a good trade-off between stability and predictive performance, compared to competitors such as RF, which deliver high F1 scores but very low stabilities.

Figures 6 and 7 give additional insights into the performance of the UBayFS variants in the standard and block feature selection scenarios, respectively. Differences between the F1 scores obtained with the different elementary feature selectors underline that UBayFS inherits benefits and drawbacks from its underlying elementary model type; in particular, the decision tree and HSIC achieved top results. Nevertheless, building ensembles partly compensates for mediocre stabilities.

Fig. 6: Performance results of UBayFS feature selection on real-world datasets (MS constraint). F1 scores are determined after training and predicting a classifier (GLM or SVM) after feature selection. Results show mean values over \(I=10\) runs along with standard deviations

Fig. 7: Performance results of UBayFS block feature selection on real-world datasets (MS and BMS constraints). F1 scores are determined after training and predicting a classifier (GLM or SVM) after feature selection. Results show mean values over \(I=10\) runs along with standard deviations

3.4.2 Case study with prior knowledge

Our evaluations underlined the applicability of UBayFS in real-world scenarios. However, due to the absence of prior knowledge, these scenarios covered only part of the capabilities of the method. To exploit prior knowledge in practice, we revisit the lung cancer genome dataset (LUNG): in this dataset, eight gene expression features were identified as relevant in biological studies by Guan et al. (2009). Thus, we assign a higher prior weight \(\alpha _{R}\) to the a-priori relevant features, while all other features are assigned the default prior weight \(\alpha _{-R}=0.01\). Our setups include one with a “weak” prior (\(\alpha _{R} = 20\)) and one with a “strong” prior (\(\alpha _{R} = 100\)), in addition to the setup without prior shown in Table 4. The max-size constraint is set to \(b_{\text {MS}}=100\).

As summarized in Table 6, incorporating prior knowledge leads to an improvement of the UBayFS results in most cases. The absolute performance lies in a similarly high range as reported in previous work by Brahim and Limam (2014), who evaluated averaged accuracies in a comparable setup on the same dataset (\(>0.99\) avg. accuracy). However, the comparability of accuracies is limited due to the unbalanced nature of the dataset. Among the UBayFS setups, the weak-prior results are similar to the no-prior results in the case of stable elementary feature selectors (mRMR and Fisher), whereas they resemble the strong-prior results in the case of a non-stable elementary feature selector (decision tree). Thus, a weak prior has a higher impact on the final results if the elementary models are more diverse.

Table 6 Average performance scores delivered by UBayFS on the LUNG dataset with and without prior knowledge

3.4.3 Runtime

Runtimes of all methods and datasets are provided in Table 7. Given a fixed set of model parameters, the major factor influencing the runtime of UBayFS is clearly the number of features (columns) rather than the number of samples (rows). UBayFS runtimes refer to the MS setup; however, experiments showed only minor differences to the runtimes in the block feature selection setup. While RF and GL are more tractable in high-dimensional datasets, RENT seems to suffer from data dimensionality to a greater extent.

Across larger datasets, the main factor influencing the runtime is the number and type of elementary models. For example, on the LUNG dataset (\(>12000\) features), training the 100 mRMR elementary models took 40 minutes (88% of the UBayFS runtime), while optimization using the Genetic Algorithm took 5 minutes (11% of the UBayFS runtime).

Table 7 Average runtime per run [s]

4 Discussion and conclusion

The presented Bayesian feature selector UBayFS has its strength in combining information from a data-driven ensemble model with expert prior knowledge, targeted at life science applications. The generic framework is flexible in the choice of the elementary feature selector type, allowing a broad scope of application scenarios by deploying adequate elementary feature selectors, such as those suggested by Sechidis and Brown (2018) for semi-supervised or Elghazel and Aussem (2015) for unsupervised problems. An extension of the presented experiments to multi-class or multi-label classification problems (where an object is not uniquely assigned to one class) is straightforward as well, provided the elementary feature selector is capable of handling such datasets, see, e.g., Petković et al. (2020).

In general, the choice of the elementary feature selector is a central step when deploying the concept in practice; in particular, the size and structure of the dataset need to be taken into account. This work presented a broad range of elementary models to provide user guidance in practical setups. The option to build ensembles combining different model types, as discussed by Seijo-Pardo et al. (2017), turned out to deteriorate the stability of ensemble feature selectors and is hence not considered in this study.

UBayFS provides two ways to account for feature dependencies: a generalized prior model and a decorrelation constraint. The latter effectively restricts the results such that a simultaneous selection of highly correlated features is penalized. The generalizations of the prior model correct the estimated feature importances for dependencies; in a low-dimensional scenario, the hyperdirichlet variant is the most accurate choice. However, this variant becomes intractable if the dimensionality exceeds a few hundred features, and it requires simulation to determine the expected value in almost any case, preventing analytically exact solutions. Since our experiments showed that the feature importances obtained from each of the three prior setups are numerically similar, a conventional Dirichlet setup seems to deliver a sufficiently accurate approximation for high-dimensional datasets. This observation is also supported by the fact that many elementary feature selectors, such as mRMR or HSIC, can account for between-feature correlations themselves, reducing the need to consider correlations in the meta-model.

Prior information from experts is introduced via prior feature weights and via linking constraints describing between-feature dependencies, represented in a system of side constraints. Via a relaxation parameter, each inadmissibility is transformed into a soft constraint, favoring solutions that fulfill the constraints and penalizing violations. Introducing user knowledge directly into the feature selection process opens new opportunities for data analysis in life science applications. Still, such methodology bears the potential for intentional or unintentional misuse: as demonstrated in the experiments, the integration of unreliable or incorrect user knowledge may distort predictive results. Users have to be aware that UBayFS may contain subjective inputs and should thus take precautions to ensure that prior information is sufficiently verified, e.g., by published research in the field.

Based on the results from extensive experimental evaluations on multiple open-source datasets, a clear benefit of the proposed feature selector lies in the balance between predictive performance and stability. Particularly in the life sciences, where few instances are available in high-dimensional datasets, user-guided feature selection is an opportunity to guide models towards results that would otherwise be unattainable. UBayFS delivers more flexibility to integrate domain knowledge than established state-of-the-art approaches. A practical limitation of UBayFS is that its runtime is arguably higher than that of simpler feature selectors, which becomes an obstacle in very high-dimensional datasets. The use of highly optimized algorithms like the Genetic Algorithm, along with initialization using the suggested Alg. 1, mitigates this issue. However, it cannot compensate for the computational burden of training multiple elementary models.