A User-Guided Bayesian Framework for Ensemble Feature Selection in Life Science Applications (UBayFS)

Feature selection represents a measure to reduce the complexity of high-dimensional datasets and gain insights into the systematic variation in the data. This aspect is of specific importance in domains that rely on model interpretability, such as life sciences. We propose UBayFS, an ensemble feature selection technique embedded in a Bayesian statistical framework. Our approach considers two sources of information: data and domain knowledge. We build a meta-model from an ensemble of elementary feature selectors and aggregate this information in a multinomial likelihood. The user guides UBayFS by weighting features and penalizing specific feature blocks or combinations, implemented via a Dirichlet-type prior distribution and a regularization term. In a quantitative evaluation, we demonstrate that our framework (a) allows for a balanced trade-off between user knowledge and data observations, and (b) achieves competitive performance with state-of-the-art methods.


Introduction
Feature selection pursues two major goals: to improve generalizability and performance of predictive algorithms like classification, regression, or clustering models, and to improve data understanding and interpretability. Both aspects are of significant interest in fields like healthcare, where major decisions may be based on data analysis. Here, two sources of information are available: large-scale collections of data from multiple sources and profound knowledge from domain experts. Previous works tend to handle these sources as opposites, see [4], or neglect expert knowledge completely, see [30]. However, a combination of both can be valuable to compensate for underdetermined problem setups from high-dimensional datasets. Moreover, meta-information on the feature set may improve interpretability. Works such as [21] consider constraints between samples but neglect constraints between features. The extension of L1 regularization to the so-called Group Lasso [43] and its variants [19] accounts for block structure but cannot handle more complex constraint types. There is a lack of sophisticated probabilistic frameworks that tackle this issue and deliver transparent results.
Apart from measuring the influence on model performance, properties like stability and reproducibility of the feature selector are essential to ensure that the user can trust the predictive model. Even though variants to achieve reproducibility are available for certain model types, such as deep neural networks [22], a model-independent approach to stabilizing the feature selection process is to deploy ensembles of elementary feature selectors. Recent research pursued this idea: [20] utilize regularized linear or generalized linear models and involve measures of stability in addition to predictive performance metrics, and [35] conclude that meta-models composed of elementary feature selectors improve the performance and robustness of the selected feature set in many cases. However, to the best of our knowledge, probabilistic approaches that exploit both a sound statistical framework and the individual benefits of an ensemble of elementary feature selectors are not yet available.
A prominent framework with the capability to combine data and expert knowledge is Bayesian statistics, which has been applied for feature selection in linear models, see [27]. Intentions behind the usage of Bayesian methodology vary significantly between authors and do not necessarily involve expert knowledge. Examples include [6], who investigate sparsity priors and [13], who suggest a Bayesian framework to quantify the level of uncertainty in the underlying feature selection model. Other Bayesian approaches for feature selection include [23], and [32], but these works do not investigate the usage of expert knowledge as prior. Although the availability of expert knowledge plays a role in life sciences, none of these approaches strongly emphasize domain knowledge about features, nor do they involve specific prior constraints defined by the user.
In this work, we propose a novel Bayesian approach to feature selection that incorporates expert knowledge and maintains large model generality. We aim to fill the gap between data-driven feature selection on one side and purely expert-focused feature selection on the other side. Our presented probabilistic approach, UBayFS, combines a generic ensemble feature selection framework with the exploitation of domain knowledge, such that it supports interpretability and improves the stability of the results. For this purpose, feature importance votes from independent elementary feature selectors are merged with constraints and feature weights specified by the expert. Constraints may be of a general type, such as a maximum number of features or blocks of features to be selected. Both inputs, likelihood and prior, are aggregated in a sound statistical framework, producing a posterior probability distribution over all possible feature sets. We use a Genetic Algorithm for discrete optimization to efficiently optimize the posterior feature set in high-dimensional datasets. In an extensive experiment section, we analyze UBayFS in a case study covering a variety of potential model constraints and parameter settings. Results on open-source datasets are benchmarked against state-of-the-art feature selectors concerning predictive performance and stability, underlining the potential of UBayFS.
Notations We will denote vectors by bold, uncapitalized, and matrices by bold, capitalized letters. Non-bold, uncapitalized letters indicate scalars or functions, and non-bold, capitalized letters indicate sets or constants. ‖.‖_1 denotes the L1-norm. [N] is an abbreviation of the index set {1, ..., N}. The N-dimensional vector of ones will be written as 1_N. Furthermore, we refer to sets of features by their feature indices, such as S ⊆ [N], or by a binary membership vector δ^S ∈ {0,1}^N with components (δ^S)_n = 1 if n ∈ S, and 0 otherwise.

User-Guided Ensemble Feature Selector
Given a finite set of N features, the goal of UBayFS is to find an optimal subset of feature indices S ⊆ [N], or equivalently a membership vector δ ∈ {0,1}^N. We assume that information is available from
1. training data to collect evidence by conventional data-driven feature selectors; we denote this information from data as ∆,
2. the user's domain knowledge, encoded as subjective beliefs α ∈ R^N about the importance of features, where α_n > 0 for all n ∈ [N], and
3. side constraints Aδ ≤ b to ensure that the obtained feature set conforms with practical requirements and restrictions.
The proposed probabilistic model, UBayFS, builds on the definition of a loss function L, which evaluates the quality of selecting a feature set δ ∈ {0,1}^N in the presence of a vector of feature importances θ ∈ Θ, where Θ = {θ ∈ [0,1]^N : ‖θ‖_1 = 1}. The parameter vector θ is assumed to be probabilistic and not directly observable, such that evidence about θ is collected from data and prior weights. Specifically, L : {0,1}^N × Θ → R_+ links the unknown feature importances to the decision to select a feature set δ. We define L as

L(δ, θ) = θ^T (1_N − δ) + λ · κ(δ),    (1)

where 1_N denotes the N-dimensional vector of ones, κ is a function accounting for violations of the side constraints, and λ > 0 indicates the overall power of the constraints (the purpose of the relaxation parameter ρ will be discussed at a later point along with the formulation of the constraint function κ). Thus, L accumulates the importances of all non-selected features (residual information) and penalizes the violation of side constraints Aδ ≤ b via a regularization term.
In terms of statistical decision theory, decisions should minimize the risk r(δ), which is given as the expected loss over all possible states of nature θ:

r(δ) = E_θ[L(δ, θ) | ∆, α] = E_θ[θ | ∆, α]^T (1_N − δ) + λ · κ(δ).    (2)

To determine E_θ[θ] accordingly, UBayFS evaluates data from elementary feature selectors trained on subsets of the dataset, summarized as ∆, as well as prior feature importance scores α. Thus, the posterior probability distribution over the unknown feature importance parameter θ given the independent data sources ∆ and α, p(θ|∆, α), is decomposed using Bayes' theorem into

p(θ|∆, α) ∝ p(∆|θ) · p(θ|α),    (5)

where p(∆|θ) describes the model likelihood (evidence from elementary feature selector models) and p(θ|α) describes the density of a prior distribution (user knowledge). The core part of UBayFS is to derive parametrizations for likelihood and prior distribution from our model inputs. Due to the convenient representation of the loss function, Eq. 2, it suffices to determine the expected value of the posterior distribution of θ. The optimal feature set is then given by

δ* ∈ arg min_{δ ∈ {0,1}^N} r(δ),

which can be solved numerically via discrete optimization.
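For intuition, the decision rule can be made concrete by brute-force enumeration over all feature sets, which is feasible only for tiny N. The following is an illustrative Python sketch, not the authors' R implementation; the constraint penalty κ is stubbed as a hard max-size indicator, and all numbers are hypothetical:

```python
from itertools import product
import numpy as np

def risk(delta, theta_mean, lam, kappa):
    # expected loss: importance mass of non-selected features + constraint penalty
    delta = np.asarray(delta)
    return theta_mean @ (1 - delta) + lam * kappa(delta)

def best_feature_set(theta_mean, lam, kappa):
    # brute-force argmin over all delta in {0,1}^N (tiny N only)
    N = len(theta_mean)
    return min(product([0, 1], repeat=N),
               key=lambda d: risk(d, theta_mean, lam, kappa))

# toy example: posterior mean importances, max-size constraint of 2 features
theta_mean = np.array([0.4, 0.3, 0.2, 0.1])
kappa = lambda d: float(d.sum() > 2)   # hard MS constraint: penalize |S| > 2
delta_opt = best_feature_set(theta_mean, lam=1.0, kappa=kappa)
print(delta_opt)  # the two most important features are selected
```

In practice, UBayFS replaces this enumeration with a Genetic Algorithm, as described in the Optimization section.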

Ensemble feature selection as likelihood
To collect information about feature importances from the given dataset, we train an ensemble of M elementary feature selectors, each on a subset of the training data. Each elementary feature selector delivers a proposal δ^(m) ∈ {0,1}^N, m ∈ [M], for an optimal feature set of l features. Thus, we let the frequency of drawing a feature throughout δ^(1), ..., δ^(M) represent its importance by defining the latent importance parameter vector θ ∈ [0,1]^N, ‖θ‖_1 = 1, as the success probabilities of sampling each feature in an individual urn draw. In a statistical sense, we interpret the result from each elementary feature selector as a realization from a multinomial distribution with parameters θ and l. This multinomial setup delivers the likelihood p(∆|θ) as joint probability density

p(∆|θ) = ∏_{m=1}^{M} f_mult(δ^(m); θ, l),

where f_mult(δ^(m); θ, l) denotes the density of a multinomial distribution with success probabilities θ and a number of l urn draws. Relevant notations are summarized in Tab. 1.
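Since the multinomial density depends on the ensemble votes δ^(1), ..., δ^(M) only through their sum, the likelihood reduces to per-feature selection counts. A minimal Python sketch with hypothetical votes (not tied to any concrete elementary selector):

```python
import numpy as np

# votes of M = 4 hypothetical elementary selectors over N = 5 features,
# each selecting l = 2 features
votes = np.array([[1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 0, 0],
                  [0, 1, 0, 0, 1]])

counts = votes.sum(axis=0)   # sufficient statistic of the likelihood

def log_likelihood(theta, votes):
    # log p(Delta | theta), up to the multinomial normalization constants
    counts = votes.sum(axis=0)
    return float(np.sum(counts * np.log(theta)))
```

Here `counts` plays the role of the pseudo-count evidence that enters the conjugate posterior update described next.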

Expert knowledge as prior weights
To constitute the prior distribution, UBayFS uses expert knowledge as a-priori weights of features. Since the domain of the distribution of feature importances θ is the simplex Θ ⊂ [0,1]^N with ‖θ‖_1 = 1, the Dirichlet distribution is a natural choice as prior distribution and is widely used in data science problems, such as [25]. Thus, we initially assume that a-priori

p(θ) = f_Dir(θ; α),

where f_Dir(θ; α) denotes the density of the Dirichlet distribution with positive parameter vector α = (α_1, ..., α_N). Since the Dirichlet distribution is a conjugate prior of the multinomial distribution, the posterior distribution is again of Dirichlet type, see [8]. Thus, it holds for the posterior density that

p(θ|∆) = f_Dir(θ; α°),

where the parameter update is obtained in closed form by

α° = α + ∑_{m=1}^{M} δ^(m).

In case of integer-valued prior weights α, they may be interpreted as pseudo-counts in the context of modelling success probabilities in an urn model, comparable to the information gained if the corresponding counts were observed in a multinomial data sample. In UBayFS, we obtain α as feature weights provided by the user. If no user knowledge is available, the least informative choice is to specify uniform counts with a small positive value, such as α_unif = 0.01 · 1_N.
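The conjugate update and the posterior mean required for the risk minimization amount to a few lines; the following Python sketch uses hypothetical selector votes (the actual package is implemented in R):

```python
import numpy as np

def dirichlet_posterior(alpha, votes):
    """Conjugate update: alpha_post = alpha + sum over the ensemble votes."""
    alpha_post = np.asarray(alpha, dtype=float) + votes.sum(axis=0)
    theta_mean = alpha_post / alpha_post.sum()   # E[theta | Delta, alpha]
    return alpha_post, theta_mean

# uninformative prior over N = 5 features, M = 3 hypothetical selector votes
alpha = np.full(5, 0.01)
votes = np.array([[1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 0, 0]])
alpha_post, theta_mean = dirichlet_posterior(alpha, votes)
print(alpha_post)   # prior pseudo-counts plus observed selection counts
```

With the near-zero uninformative prior, the posterior mean is dominated by the ensemble counts; larger user-specified α_n shift importance mass towards the corresponding features.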
Generalized Dirichlet model Even though the presented Dirichlet-multinomial model is a popular choice due to its favorable statistical properties, it implicitly assumes that the classes are mutually independent. However, high-dimensional datasets frequently involve complex correlation structures between the features. To account for this aspect, we generalize the setup by replacing the Dirichlet prior with a generalized Dirichlet distribution. The highest level of generalization is achieved by [16], who introduce the hyperdirichlet distribution, which can take arbitrary covariance structures into account. The hyperdirichlet distribution maintains the conjugate prior property with respect to the multinomial likelihood, and thus inference is tractable; however, the analytical expression of the expected value involves an intractable normalization constant and therefore requires numerical means such as Markov Chain Monte Carlo (MCMC) methods, which may face computational challenges due to the high dimensionality of the problem.
A compromise between the complexity of the problem and the flexibility of the covariance structure is given by an earlier version of the generalized Dirichlet distribution by [40], which is a special case of the hyperdirichlet setup, but more general than the standard Dirichlet distribution. In addition to the properties of the hyperdirichlet distribution, the expected value of the generalized Dirichlet distribution can be directly evaluated from the distribution parameters. Section 3 provides an experimental evaluation of the proposed variants to account for covariance structures in the UBayFS model (details on the generalized prior distributions are provided in Appendix A).

Side constraints as regularization
Practical setups may require that a selected feature set fulfills certain consistency requirements. These may involve a maximum number of selected features, a low mutual correlation between features, or a block-wise selection of features. UBayFS enables the feature selection model to account for such requirements via a system of K inequalities restricting the feature set δ, given as

A δ ≤ b,  with A ∈ R^{K×N} and b ∈ R^K.

Each single constraint k ∈ [K] can be evaluated via an admissibility function ad_k(.), such that

ad_k(δ) = 1 if a^(k) δ ≤ b^(k), and 0 otherwise,

where a^(k) is the k-th row vector of A and b^(k) the k-th element of b. UBayFS generalizes the setup by relaxing the constraints: in case a feature set δ violates a constraint, it is assigned a higher penalty rather than being excluded completely. This effect is achieved by replacing ad_k(.) with a relaxed admissibility function ad_{k,ρ}(.) based on a logistic function with relaxation parameter ρ ∈ R_+ ∪ {∞}:

ad_{k,ρ}(δ) = 1 / (1 + exp(ρ (a^(k) δ − b^(k)))).

Fig. 1 illustrates that a large parameter ρ → ∞ lets the admissibility converge towards the associated hard constraint. A low ρ changes the shape of the penalization to an almost constant function in a local neighborhood around the decision boundary, such that only a minor difference is made between feature sets that fulfill and those that violate a constraint (for a proof, see Appendix A). Finally, the joint admissibility function κ(.) aggregates information from all constraints:

κ(δ) = 1 − ∏_{k=1}^{K} ad_{k,ρ_k}(δ).

Note that different relaxation parameters can be specified to prioritize the constraints among each other; hence κ involves a parameter vector ρ = (ρ_1, ..., ρ_K). Relevant notations for prior parameters are summarized in Tab. 2.
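The logistic relaxation and the joint penalty can be sketched as follows. This is an illustrative Python sketch (the exact functional form of the soft constraint is an assumption consistent with the description above, and the max-size example is hypothetical):

```python
import numpy as np

def relaxed_admissibility(a, b, delta, rho):
    """Soft version of the hard constraint a @ delta <= b.
    rho -> inf recovers the hard indicator."""
    violation = float(np.dot(a, delta) - b)
    if np.isinf(rho):
        return 1.0 if violation <= 0 else 0.0
    return 1.0 / (1.0 + np.exp(rho * violation))

def kappa(A, b, delta, rhos):
    """Joint penalty: 1 minus the product of per-constraint admissibilities."""
    prod = 1.0
    for a_k, b_k, rho_k in zip(A, b, rhos):
        prod *= relaxed_admissibility(a_k, b_k, delta, rho_k)
    return 1.0 - prod

# max-size constraint: select at most 2 out of 4 features
A = np.array([[1.0, 1.0, 1.0, 1.0]])
b = np.array([2.0])
ok = np.array([1, 1, 0, 0])    # fulfills the constraint
bad = np.array([1, 1, 1, 1])   # violates it by 2
assert kappa(A, b, ok, [np.inf]) == 0.0
assert kappa(A, b, bad, [np.inf]) == 1.0
```

With a finite ρ, the same violation yields a penalty strictly between 0 and 1, so violating feature sets remain reachable during optimization but are disfavored.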
Feature decorrelation constraints Commonly, feature sets with low mutual correlations are preferred since they tend to contain less redundant information. A special case of prior constraints can be defined to enforce that such feature sets are selected; we will refer to them as decorrelation constraints. Decorrelation constraints are pairwise cannot-link constraints between features with high pairwise correlation coefficients. For each such pair of features (i, j), this is achieved by appending a row vector a with elements a_i = a_j = 1 and 0 otherwise, along with an element b = 1, to the constraint system, such that δ_i + δ_j ≤ 1. We select the shape parameter ρ_{i,j} for the constraint between features i and j by the odds ratio of the absolute correlation coefficient τ_{i,j}, such that features with an absolute correlation below a threshold τ are not penalized, while higher absolute correlations are assigned penalties that represent the level of correlation. As a result, the selected feature set contains features with lower mutual correlations. We suggest using Spearman's rho as correlation coefficient, since it is robust (in contrast to Pearson's correlation coefficient) and faster to compute than Kendall's tau.
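A sketch of how cannot-link rows could be assembled from a correlation matrix. The odds-ratio choice ρ = t / (1 − t) is an assumption following the prose, and the threshold and data are illustrative:

```python
import numpy as np

def decorrelation_constraints(corr, tau=0.4):
    """For each feature pair with |corr| > tau, append a cannot-link row
    (delta_i + delta_j <= 1) with shape parameter rho set to the odds
    of the absolute correlation (assumption: rho = t / (1 - t))."""
    N = corr.shape[0]
    rows, bs, rhos = [], [], []
    for i in range(N):
        for j in range(i + 1, N):
            t = abs(corr[i, j])
            if t > tau:
                a = np.zeros(N)
                a[i] = a[j] = 1.0
                rows.append(a)
                bs.append(1.0)
                rhos.append(t / (1.0 - t))   # odds of |corr|
    return np.array(rows), np.array(bs), np.array(rhos)

corr = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
A, b, rhos = decorrelation_constraints(corr)
print(A)  # single cannot-link row between features 0 and 1
```

Highly correlated pairs thus receive near-hard constraints (large ρ), while pairs just above the threshold are only mildly penalized.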
Feature block priors User knowledge may as well be available for feature blocks rather than for single features. Feature blocks are contextual groups of features, such as those extracted from the same source in a multi-source dataset. It can be desirable to select features from a few distinct blocks so that the model does not depend on all sources at once. While prior weights can be trivially assigned on block level, we transfer the concept of side constraints to feature blocks.
Feature blocks are specified via a block matrix B ∈ {0,1}^{W×N}, where B_{w,n} = 1 indicates that feature n ∈ [N] is part of block w ∈ [W], and 0 otherwise. Even though a full partition of the feature set is common, feature blocks are neither required to be mutually exclusive nor exhaustive. Along with the block matrix B, an inequality system between blocks consists of a matrix A^block ∈ R^{K×W} and a vector b^block ∈ R^K. To evaluate whether a block is selected by a feature set δ, we define the block selection vector δ^block ∈ {0,1}^W, given by

δ^block = (B δ ≥ 1_W),

where ≥ refers to an element-wise comparison of vectors, delivering 1 for a component if the condition is fulfilled, and 0 otherwise. In other words, a feature block is selected if at least one feature of the corresponding block is selected. Although block constraints introduce non-linearity into the system of side constraints, they can be used in the same way as linear constraints between features and integrated into the joint admissibility function κ.
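The block selection vector follows directly from the block matrix; a minimal Python sketch with an illustrative two-block partition:

```python
import numpy as np

def block_selection(B, delta):
    """delta_block_w = 1 iff at least one feature of block w is selected."""
    return (B @ delta >= 1).astype(int)

# W = 2 blocks over N = 5 features: block 0 = {0, 1, 2}, block 1 = {3, 4}
B = np.array([[1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1]])
delta = np.array([0, 1, 0, 0, 0])    # one feature from block 0 selected
print(block_selection(B, delta))     # block 0 selected, block 1 not
```

Block constraints such as A^block δ^block ≤ b^block can then be evaluated on this vector exactly like feature-level constraints.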

Optimization
Exploiting the conjugate prior property, the posterior density of θ can be expressed as a Dirichlet, generalized Dirichlet, or hyperdirichlet distribution, respectively. Since the expected value E_θ[θ] can be computed either in closed form (Dirichlet or generalized Dirichlet) [40] or simulated via a sampling procedure (hyperdirichlet) [16], it remains to solve the discrete optimization problem in Eq. 2 as a final step.
Algorithm 1 Probabilistic sampling algorithm to initialize GA
1: generate a permutation π on [N] by sampling N times without replacement with probabilities proportional to α°
2: for i = π(1), ..., π(N) do
3:   set δ_i ← 1 with probability equal to the ratio between the joint admissibility of the current feature set including feature i and that excluding feature i
4: end for

Since an analytical minimization is not feasible, we determine a numerical optimum δ by discrete optimization: we deploy the Genetic Algorithm (GA) described by [12]. To guarantee fast convergence towards an acceptable solution, it is beneficial to provide initial samples that are good candidates for the final solution. For this purpose we propose a probabilistic sampling algorithm, Alg. 1: in essence, the algorithm creates a random permutation of all features, π : [N] → [N], by weighted and ordered sampling without replacement. The weights represent the posterior parameter vector α°. Then, the algorithm iteratively accepts or rejects feature π(n) with a success probability given by the admissibility ratio of the feature sets with and without feature π(n). Features with high posterior weights obtain low ranks in the permutation and are therefore considered early, resulting in a higher probability of being accepted in the acceptance/rejection step.
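The initialization sampler can be sketched in Python as follows (the admissibility function, weights, and seed are illustrative assumptions; the paper's implementation is in R):

```python
import numpy as np

def init_sample(alpha_post, admissibility, rng=None):
    """Probabilistic GA initialization (sketch of Alg. 1):
    rank features by weighted sampling without replacement, then
    accept each feature with the admissibility ratio with/without it."""
    rng = np.random.default_rng(rng)
    N = len(alpha_post)
    p = np.asarray(alpha_post, dtype=float)
    perm = rng.choice(N, size=N, replace=False, p=p / p.sum())
    delta = np.zeros(N, dtype=int)
    for i in perm:
        with_i = delta.copy()
        with_i[i] = 1
        ratio = admissibility(with_i) / max(admissibility(delta), 1e-12)
        if rng.random() < min(1.0, ratio):
            delta[i] = 1
    return delta

# toy run: admissibility of a max-size-2 constraint (1 if fulfilled, ~0 else)
adm = lambda d: 1.0 if d.sum() <= 2 else 1e-6
delta0 = init_sample(np.array([5.0, 4.0, 1.0, 0.5]), adm, rng=0)
```

High-weight features land early in the permutation while the constraint is still slack, so they are accepted with high probability; later features that would violate the constraint are almost always rejected.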
The Genetic Algorithm (GA) for discrete optimization is initialized using Alg. 1. Starting with an initial set of feature membership vectors δ^0 ∈ {0,1}^N, GA creates new vectors δ^t ∈ {0,1}^N as pairwise combinations of two preceding vectors δ^{t−1} and δ̃^{t−1} in each iteration t ∈ [T]. A combination refers to sampling each component δ^t_n from either δ^{t−1}_n or δ̃^{t−1}_n in a uniform way and adding minor random mutations to single components. The posterior density serves as fitness when deciding which vectors δ^{t−1} and δ̃^{t−1} from iteration t − 1 should be combined into δ^t: the fitter, the more likely to be part of a combination.

Experiments & Results
Our experiments evaluate the performance, flexibility, and applicability of UBayFS in two parts: first, a study conducted on synthetic datasets demonstrates the properties of the various model parameters, including
a. the number of elementary models M (1a),
b. the prior weights α in a block-wise setup (1b),
c. the constraint types and their shapes ρ in a block-wise setup (1c), as well as
d. the type of prior distribution to account for feature dependencies (1d).
The second part of the experiment is conducted on real-world classification datasets from the life science domain. We demonstrate the advantageous quality of the UBayFS framework in comparison with state-of-the-art ensemble feature selectors. The experiment also includes a block feature selection setup for datasets with block structure.
Nevertheless, the main focus of the present work is to demonstrate the merits of the generic concept of UBayFS rather than to provide an in-depth analysis of the elementary feature selectors.
Our implementation of UBayFS in R [31] uses the Genetic Algorithm package authored by [33] with T = 100 and Q = 100; in most cases, the optimum is reached after around ten iterations. By default, each UBayFS setup comprises an uninformative prior with α_n = 0.01 for all n ∈ [N], and a max-size (MS) constraint instructing to select b_MS features, which is determined individually for each dataset. Each setup is executed in I = 10 independent runs i ∈ [I], representing distinct random splits of the dataset D into train and test data.

Evaluation metrics For the synthetic datasets, performance is measured by the F1 score of correctly / incorrectly selected features, since the ground truth about the relevance of features is known from the simulation procedure. For real-world data, F1 scores of the predictive results are used to judge the feature selection quality indirectly. Furthermore, all experiments use the stability measure by [26] to assess the agreement between results from I independent feature selection runs. Stability ranges asymptotically in [0, 1], where 1 indicates that the same features are selected in every run (perfectly stable). Runtime refers to the time the model requires to perform feature selection, including elementary model training and optimization, but excluding any predictive model trained on top of the feature selection results. Since prior parameters have a minor influence on the runtime, times will not be provided for experiments investigating these aspects.
Synthetic datasets The synthetic study is based on the following datasets:
i. an additive model (experiment 1a) similar to Data1 in [41], composed of a 1000 × 1000 data matrix (x_1, ..., x_1000) simulated from a Gaussian distribution N(0_1000, I_1000), and a target variable for classification, given by y = g(−2 sin(2 x_1) + x_2^2 + x_3 + exp(−x_4) + ε), where x_1, ..., x_4 denote the features 1 to 4 and ε ∼ N(0, 1). The function g transforms z into a class variable by thresholding;
ii. a non-additive model (experiment 1a) similar to Data2 in [41], equivalent to the setup of i., except for a target variable y = g(x_1 · exp(2 x_2) + x_3^2 + ε);
iii. a simulated dataset (experiments 1b, 1c) with group structure among the features, produced via make_classification.

In addition to the constraint shape ρ associated with a single constraint, λ balances the overall impact of side constraints against the Dirichlet-multinomial model. However, a small parameter λ < 1 is not recommended, since a lack of influential constraints (including the MS constraint) results in selecting all features due to a monotonic target function. On the other hand, a high λ has a similar effect as setting all shape parameters uniformly to ρ = ∞; thus, all constraints are required to be fulfilled. Since λ does not significantly impact the resulting model metrics in this study, it is set to λ = 1 and not evaluated further. As expected, a higher M contributes largely to the runtime of the model, which increases linearly. In contrast, both F1 scores and stability values begin to saturate at around M = 50 to M = 100 models. Even though large ensembles are intractable with HSIC and RFE, small ensembles with M = 5 allow HSIC to retrieve almost all relevant features, whereas simpler elementary feature selectors struggle to achieve high performance and stability even at higher levels of M.
We conclude that a large M does not necessarily improve the results but significantly impacts the runtime; thus, M ≈ 100 appears to be a reasonable choice in the subsequent settings, except for HSIC and RFE, where M = 5 is set as default.

Experiment 1a-likelihood parameters
Experiment 1b-block-wise prior weights To investigate the effect of prior weights, we alter the prior weights for the four blocks containing relevant features (according to the simulation of dataset iii.). A constant prior weight α_R is assigned to all features from relevant blocks, i.e., blocks containing relevant features, whereas features from all other blocks are assigned a constant prior weight α_−R; thereby, we simulate that the expert has approximate, yet not exact, beliefs about feature relevance. By assigning higher prior weights α_R > α_−R, the experiment represents an agreement between the expert belief and the ground truth, while α_R < α_−R represents "wrong" prior information. In this experiment, we alternately increase either α_R or α_−R while setting the other to the default value 0.01. Fig. 3 illustrates that, as expected, feature selection performance in terms of F1 scores (evaluated with respect to the ground truth features) increases for higher α_R and decreases for higher α_−R. Thus, across all elementary feature selectors, an improvement over the uninformative case α_R = α_−R = 0.01 can be achieved by an informative prior if the prior has a reasonable overlap with reality - this holds even though the relevant blocks also contain uninformative features, which are incremented by α_R as well. On the other hand, erroneous prior knowledge can impact the feature selection results negatively. In contrast to the feature-wise F1 scores, stability remains mostly unaffected by strong prior weights.

Experiment 1c-constraint parameters Block constraints (block-max-size, BMS, and max-per-block, MPB) are steered via the corresponding shape parameters ρ_BMS and ρ_MPB, respectively. Per default, we indicate ρ = 0 in cases where a constraint is omitted. From the default case of ρ_BMS = ρ_MPB = 0 (no block constraints), we investigate the behavior of UBayFS in both directions, i.e., for an increasing level of ρ_BMS or ρ_MPB. Fig. 3 illustrates how the opposing prior constraints BMS and MPB affect the model at different levels of the relaxation parameters.
Both constraint types have a slightly negative impact on the outcome in terms of F1 score and stability. This is caused by the fact that the "best" feature set has to be determined under a side constraint that is not compatible with the ground truth: the ground truth defines 16 features out of four distinct blocks as relevant, which cannot be covered by any of the constraints. Nevertheless, we observe that UBayFS can handle such scenarios and still delivers appropriate, near-optimal solutions.

Experiment 1d-feature dependence models In Section 2, multiple variants were discussed to account for datasets with correlation structure. On the one hand, the UBayFS framework permits accounting for between-feature correlations via a generalization of the prior distribution; on the other hand, we may enforce that highly correlated features are not selected jointly via a decorrelation constraint. The two variants differ insofar as generalized priors aim to deliver a more appropriate estimate of the expected feature importances by correcting for dependencies in the observed feature sets, while decorrelation constraints directly affect the optimization procedure for δ.
In this experiment, we investigate both possibilities to account for dependencies between features, along with combinations of both: we set a decorrelation constraint between all features with a mutual Spearman correlation τ > 0.4, as described in Section 2.3. Generalizations of the Dirichlet prior setup are denoted as follows:
• Dirichlet prior distribution,
• generalized Dirichlet distribution [40],
• hyperdirichlet distribution [16].
Our experiment involves all combinations of prior setups with and without decorrelation constraint, executed on dataset vi. To measure the effect of decorrelation, we further evaluate the redundancy rate (RED) as suggested in [44]: the redundancy rate of a feature set is defined as the average absolute Pearson correlation between all pairs of distinct features in the selected feature set. A small RED is preferred in many practical setups.
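The redundancy rate can be computed as a short sketch: the average absolute pairwise Pearson correlation over the selected features (the data here are synthetic and illustrative):

```python
import numpy as np

def redundancy_rate(X, selected):
    """Average absolute Pearson correlation over all pairs of distinct
    selected features; X has samples in rows, features in columns."""
    idx = np.flatnonzero(selected)
    if len(idx) < 2:
        return 0.0
    C = np.corrcoef(X[:, idx], rowvar=False)
    upper = np.abs(C[np.triu_indices(len(idx), k=1)])
    return float(upper.mean())

rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
red_corr = redundancy_rate(X, [1, 1, 0])   # near-duplicate features
red_ind = redundancy_rate(X, [1, 0, 1])    # unrelated features
assert red_corr > red_ind
```

A decorrelation constraint should drive the selected set towards the second case, i.e., towards a lower RED.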
The results show that neither feature-wise F1 scores, nor stabilities change significantly between the prior models. Thus, the default Dirichlet model seems sufficient to obtain reasonable results. However, introducing decorrelation constraints has a slightly negative impact on stability, while yielding a small improvement in F1 scores and RED. Nonetheless, the most significant change between the variants can be observed with respect to runtime, which reflects the high computational burden associated with the hyperdirichlet prior model-even on a small dataset, the runtimes show a significant increase on a logarithmic scale. Thus, higher-dimensional datasets cannot be tackled with the hyperdirichlet setup.

Experiment 2: Real-world life sciences datasets
Real-world experiments are conducted on seven open-source datasets presenting binary classification problems from the life science domain, see Tab. 3. For simplicity and due to extensive runtimes, we restrict the choice of elementary feature selectors for UBayFS to mRMR, Fisher, and decision tree, with an uninformative prior, an MS constraint, and M = 100. The number of selected features is specified according to the size of the dataset (b_MS = 5 / 10 / 20 for datasets with fewer than 100 / between 100 and 1000 / more than 1000 features, respectively). A second setup is restricted to datasets with block structure, i.e., more than one block, and evaluates block feature selection: up to b_MS features should be selected from at most b_BMS distinct blocks. Random forests (RF) [2] and RENT [20] (representing ensemble feature selectors that extend the concepts of decision trees and elastic-net-regularized models, respectively) are used as state-of-the-art benchmarks for standard feature selection, while Sparse Group Lasso (GL) [19] is used as the benchmark for block feature selection. To conform with UBayFS, RENT and RF are adjusted to M = 100 elementary models, and all models are tuned to select approximately the same number of features, b_MS. Since RENT and GL cannot be instructed to select b_MS features directly, their regularization parameters are determined via bisection, such that the number of selected features is approximately equal to b_MS.
The selected features cannot be evaluated directly in real-world datasets due to unknown ground truth on the feature relevance. Therefore, we train predictive models on the train data and assess predictive performance on the test data.

Results Tab. 4 and 5 present the results of the experiments on real-world data. UBayFS can keep up with the other approaches and achieves good predictive F1 scores throughout the different datasets, even though only a limited amount of expert knowledge is introduced to ensure a fair comparison. In the block feature selection setups, UBayFS benefits from block constraints and shows more flexibility than Sparse Group Lasso. Altogether, F1 scores are generally in a high range across all methods, suggesting that UBayFS can keep up with or even outperform its competitors in a diverse range of scenarios (low-dimensional and high-dimensional data, as well as unconstrained and constrained setups). Fig. 6 and Fig. 7 provide additional insights into the performance of the UBayFS variants in the standard feature selection and block feature selection scenarios, respectively.
Overall, the results reflect that a particular strength of UBayFS lies in delivering a good trade-off between stability and predictive performance, compared to competitors like RF, which deliver high F1 scores but very low stabilities. Differences between the F1 scores obtained with the different elementary feature selectors underline that UBayFS inherits benefits and drawbacks from its underlying elementary model type; in particular, the decision tree and HSIC achieved top results. Nevertheless, building ensembles partly compensates for mediocre stabilities.
Runtimes of all methods and datasets are provided in Tab. 6. Given a fixed set of model parameters, it becomes obvious that the major factor influencing the runtime of UBayFS is the number of features (columns) rather than the number of samples (rows). UBayFS runtimes refer to the MS setup; however, experiments showed only minor differences to the runtimes in the block feature selection setup. While RF and GL are more tractable in high-dimensional datasets, RENT seems to suffer from data dimensionality to a more considerable extent.

Table 4: UBayFS with three distinct elementary feature selectors (M: mRMR, F: Fisher, T: decision tree) is compared to ensemble feature selectors RF and RENT in a standard feature selection scenario. Further, UBayFS with an additional BMS constraint is compared to Sparse Group Lasso (GL) for block feature selection on datasets with block structure. Average F1 scores are given for different predictive models (GLM, SVM). The best scores in each row are marked in bold for each scenario.

Discussion and Conclusion
The presented Bayesian feature selector UBayFS has its strength in combining information from a data-driven ensemble model with expert prior knowledge, targeted at the life science domain. The generic framework is flexible in the choice of the elementary feature selector type, allowing a broad scope of application scenarios by deploying adequate elementary feature selectors, such as those suggested by [34] for semi-supervised or [11] for unsupervised problems. An extension of the presented experiments to multi-class or multi-label classification problems (where one object is not uniquely assigned to one class) is straightforward as well, provided that the elementary feature selector is capable of handling such datasets, such as [29].
In general, the choice of the elementary feature selector is a central step when deploying the concept in practice; in particular, performance, stability, and runtime need to be taken into consideration, given the size and structure of a dataset. Still, the main focus of the present work is to discuss the conceptual properties of the framework rather than the individual characteristics of distinct elementary feature selectors. Nevertheless, a broad range of elementary models is used in the presented experiments to provide user guidance in practical setups. The option to build ensembles combining different model types, as discussed by [35], turned out to decrease the stability of UBayFS significantly and is therefore not considered in this study.
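The ensemble mechanism itself is agnostic to the concrete elementary selector. As a minimal sketch (all function names are our own, and absolute Pearson correlation merely stands in for an arbitrary elementary selector such as mRMR or a decision tree), the per-feature selection counts that feed the multinomial likelihood of the meta-model can be gathered as follows:

```python
import numpy as np

def ensemble_counts(X, y, M=20, k=3, seed=0):
    # Count how often each feature is selected across M elementary
    # feature selectors trained on bootstrap samples of the data.
    # Illustrative stand-in selector: pick the k features with the
    # largest absolute Pearson correlation to the target.
    rng = np.random.default_rng(seed)
    n, N = X.shape
    counts = np.zeros(N, dtype=int)
    for _ in range(M):
        idx = rng.integers(0, n, n)                     # bootstrap resample
        Xb, yb = X[idx], y[idx]
        score = np.abs(np.corrcoef(Xb.T, yb)[-1, :-1])  # |corr(feature, y)|
        score = np.nan_to_num(score)                    # guard constant columns
        counts[np.argsort(score)[-k:]] += 1             # top-k features get a vote
    return counts  # selection counts that enter the multinomial likelihood
```

Features that are consistently relevant accumulate counts close to M, while irrelevant features receive votes only sporadically through bootstrap noise.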
UBayFS provides two ways to account for feature dependencies: a generalized prior model and a decorrelation constraint. The latter effectively restricts the results, such that a simultaneous selection of highly correlated features is penalized. The generalizations of the prior model correct the estimated feature importances by the dependencies; in a low-dimensional scenario, the hyperdirichlet variant is the most accurate choice. However, this variant becomes intractable if the dimensionality exceeds a few hundred features and requires simulation to determine the expected value in almost any case, preventing analytically exact solutions. Since our experiments showed that the feature importances obtained from each of the three prior setup types are numerically similar, a conventional Dirichlet setup seems to deliver a sufficiently accurate approximation for high-dimensional datasets. This observation is also supported by the fact that many elementary feature selectors, such as mRMR or HSIC, can account for between-feature correlations, thus reducing the need to consider correlations in the meta-model.

Prior information from experts is introduced via prior feature weights and linking constraints describing between-feature dependencies, represented in a system of side constraints. Via a relaxation parameter, the admissibility is transformed into a soft constraint, which favors solutions that fulfill the constraints and penalizes violations. Introducing user knowledge directly into the feature selection process opens new opportunities for data analysis in life science applications. Still, such methodology bears the potential of intentional or unintentional incorrect use: as demonstrated in the experiment, the integration of unreliable or incorrect user knowledge makes the system prone to being steered in a user-defined direction. Users have to be aware that UBayFS may contain subjective inputs to prevent misuse.
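To illustrate the relaxation idea, the following sketch replaces the hard indicator of a linear side-constraint system A·δ ≤ b with a logistic surrogate of the constraint slack. This is a simplified stand-in for the general mechanism, not necessarily the exact functional form used in UBayFS:

```python
import numpy as np

def soft_admissibility(delta, A, b, rho=5.0):
    # Relaxed admissibility of a binary selection vector delta under
    # the linear side constraints A @ delta <= b. Each hard indicator
    # is replaced by a logistic function of the constraint slack; the
    # relaxation parameter rho controls how sharply violations are
    # penalized (rho -> infinity approaches the hard constraint).
    # NOTE: illustrative surrogate, not the paper's exact formula.
    slack = b - A @ delta   # nonnegative where the constraint holds
    return float(np.prod(1.0 / (1.0 + np.exp(-rho * slack))))

# Example: a max-size constraint "select at most 3 of 5 features"
A = np.ones((1, 5))
b = np.array([3.0])
ok  = soft_admissibility(np.array([1, 1, 0, 0, 0]), A, b)  # slack +1, near 1
bad = soft_admissibility(np.array([1, 1, 1, 1, 1]), A, b)  # slack -2, near 0
```

Solutions satisfying the constraints receive admissibility close to one, while violations are down-weighted smoothly rather than excluded outright, which keeps the posterior optimization well-behaved.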
Thus, precautions must ensure that information provided to the system is sufficiently verified if any critical decisions are based on model output. Based on the results from extensive experimental evaluations on multiple open-source datasets, a clear benefit of the proposed feature selector lies in the balance between predictive performance and stability. Particularly in life sciences, where few instances are available in high-dimensional datasets, user-guided feature selection can be an opportunity to guide the model towards tractable and high-quality results. UBayFS delivers more flexibility to integrate domain knowledge than established state-of-the-art approaches.
A practical limitation of UBayFS is its runtime, which is slower than that of other feature selectors and becomes an obstacle in very high-dimensional datasets. The use of highly optimized algorithms like the Genetic Algorithm, along with an initialization using the suggested Alg. 1, mitigates this issue. However, it cannot compensate for the computational burden of training multiple elementary models.
An even more general version is the hyperdirichlet distribution by [16], which characterizes the distribution by the probability density function

f(\boldsymbol{\theta}) \propto \prod_{G \in \mathcal{P}([N])} \Big( \sum_{n \in G} \theta_n \Big)^{F(G)},

where \mathcal{P}(.) denotes the power set and F(G) denotes the parameter for each possible subset G of [N]. Since the closed-form expression of the expected value involves the normalization constant, which is intractable in practical high-dimensional setups, we deploy the Metropolis-Hastings (MH) algorithm implemented in [17] to sample from the hyperdirichlet distribution and determine the expected value empirically from the sample mean.
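Since the implementation from [17] is referenced but not shown, the empirical expected-value computation can be sketched as follows: a Metropolis-Hastings sampler on the probability simplex, targeting the unnormalized hyperdirichlet density with a Dirichlet proposal centred at the current state (all function names and tuning constants here are our own illustrative choices, not those of [17]):

```python
import numpy as np
from scipy.special import gammaln

def log_hyperdirichlet(theta, F):
    # Unnormalized log-density: sum over subsets G of F(G) * log(sum_{n in G} theta_n).
    # F maps frozenset subsets of {0, ..., N-1} to their exponents (our encoding).
    return sum(f * np.log(theta[list(G)].sum()) for G, f in F.items())

def dirichlet_logpdf(x, alpha):
    # Log-density of a Dirichlet(alpha) distribution at x on the simplex.
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + ((alpha - 1.0) * np.log(x)).sum())

def mh_expected_value(F, N, n_iter=20000, burn_in=5000, scale=30.0, seed=0):
    # Metropolis-Hastings on the simplex; the Dirichlet proposal is not
    # symmetric, so the Hastings correction (proposal density ratio) is
    # included in the acceptance probability.
    rng = np.random.default_rng(seed)
    theta = np.full(N, 1.0 / N)          # start at the simplex centre
    total, kept = np.zeros(N), 0
    for t in range(n_iter):
        conc = scale * theta + 1e-2
        prop = rng.dirichlet(conc)
        prop = np.clip(prop, 1e-12, None)
        prop /= prop.sum()               # keep proposals strictly interior
        conc_back = scale * prop + 1e-2
        log_alpha = (log_hyperdirichlet(prop, F) - log_hyperdirichlet(theta, F)
                     + dirichlet_logpdf(theta, conc_back)
                     - dirichlet_logpdf(prop, conc))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        if t >= burn_in:
            total += theta
            kept += 1
    return total / kept                  # empirical expected value
```

As a sanity check, when F contains only singleton subsets with F({n}) = α_n − 1, the hyperdirichlet reduces to an ordinary Dirichlet(α), whose known mean α / Σα the sampler should reproduce.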