Implications on Feature Detection when using the Benefit-Cost Ratio

In many practical machine learning applications, there are two objectives: one is to maximize predictive accuracy and the other is to minimize the costs of the resulting model. These costs of individual features may be financial, but can also refer to other aspects, such as evaluation time. Feature selection addresses both objectives, as it reduces the number of features and can improve the generalization ability of the model. If costs differ between features, feature selection needs to trade off the individual benefit and cost of each feature. A popular trade-off choice is the ratio of both, the benefit-cost ratio (BCR). In this paper we analyze implications of using this measure, with a special focus on the ability to distinguish relevant features from noise. We perform a simulation study for different cost and data settings and obtain detection rates of relevant features and empirical distributions of the trade-off ratio. Our simulation study exposed a clear impact of the cost setting on the detection rate. In situations with large cost differences and small effect sizes, the BCR missed relevant features and preferred cheap noise features. We conclude that a trade-off between predictive performance and costs without a controlling hyperparameter can easily overemphasize very cheap noise features. While the simple benefit-cost ratio offers an easy solution to incorporate costs, it is important to be aware of its risks. Avoiding costs close to 0, rescaling large cost differences, or using a hyperparameter trade-off are ways to counteract the adverse effects exposed in this paper.

features. Depending on the application field, these costs may not only refer to financial aspects, but could also represent the time needed to acquire a feature or harm caused to a patient during sample collection.
The general strategy to incorporate feature costs into a feature selection framework depends on the problem at hand. If a fixed total feature cost limit can be defined, the problem reduces to an additional optimization constraint for the feature selection problem. Many example applications of fixed budget costs can be found [3,7,8,9,10]. For situations without a fixed budget, the goal may be to harmonize costs of features and costs of prediction errors by identifying an optimal trade-off. Research on these flexible solutions can be found, e.g., in Zhou et al. [15] or Bolón-Canedo et al. [1]. A third situation is given when feature acquisition is undertaken sequentially. In such situations, tests can take advantage of intermediate results and reduce total costs by only requesting further features if the benefit justifies the additional cost, see, e.g. [5,13,14].
A common factor for all mentioned tasks is the need to trade off the benefit of a feature against its cost. As these measures are on different scales, the main options to combine them are either to optimize the ratio of both [2,6,8,9,10,11], or to optimize a weighted sum [1,4,5,12,13,14]. In this paper we take a closer look at the first alternative. We analyze the consequences of using a simple benefit-cost ratio (BCR) with respect to the detection of relevant features against noise.
We start by defining the general cost-sensitive feature selection problem and discussing the theoretical implications of using the BCR. In the following section we perform a simulation study to analyze the influence of multiple data parameters and feature cost settings on the feature detection rate. Finally, we present the obtained results, discuss the general applicability of the basic BCR and provide recommendations for alternative trade-off measures.

Problem Definition
Given is a data set with $n$ observations $D_i$, $i = 1, \dots, n$, with $p$ features $x_{ij}$, $j = 1, \dots, p$, and a continuous response $y_i$ for observation $i$. Assume that a true relation between the features and the response exists, in which $p_{rel}$ features have an influence on the response, while all other $p_{noise}$ features are independent of $y$. Then the goal of feature selection is to identify the subset of relevant features.
One obvious approach to ensure finding this optimal subset is an exhaustive search, i.e. to consider all possible subsets. However, this approach is usually not feasible for high-dimensional feature spaces. Thus, heuristic selection algorithms like greedy sequential forward selection (SFS) are used. SFS iteratively adds the single most promising feature to the current result set. A typical way to estimate the importance of a feature $x_j$ when added to a given set $s$ is to calculate the performance gain of a statistical model including the feature, $M(s \cup \{x_j\} \mid D)$, compared to a baseline model without it, $M(s \mid D)$. The feature with the highest gain in performance is then selected. Assuming a performance criterion $Q$, for which the optimal value is the minimal value, we can formulate one feature selection step of SFS by
$$\hat{m} = \arg\max_j \Delta Q_j, \qquad \Delta Q_j = Q(M(s \mid D)) - Q(M(s \cup \{x_j\} \mid D)). \tag{1}$$
In many real-world scenarios, obtaining a feature $x_j$ may cause individual feature costs $c_j$. Cost-sensitive feature selection aims to incorporate these costs into the selection process to find cheap and well-performing models. A popular method is to adapt the problem of Equation (1) to
$$\hat{m} = \arg\max_j \frac{\Delta Q_j}{c_j}. \tag{2}$$
This ratio of benefit and cost leads to a simple trade-off optimization, which relates the importance of a feature to its cost. In the following we describe negative implications of this simple and popular method when discriminating between relevant and noise features.
The true performance gain of a noise feature is a value smaller than or equal to zero, as it has no relation to the response but may create additional uncertainty. The true performance gain of a relevant feature is typically a value greater than 0. Nevertheless, the performance gain estimated on a sample data set does not always equal these true values. It can rather be seen as a random variable following a certain unknown distribution around the true value:
$$\Delta Q_j \sim V_j(\cdot).$$
For a real-world data situation, the theoretical distributions of $\Delta Q_j$ for different $j$ can be assumed to overlap to some extent. That means, for one given sample data set, the estimated performance gain of a noise feature may be higher than that of a relevant feature, and thus an irrelevant feature may be selected.
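To make the ratio criterion concrete, the following minimal Python sketch contrasts selection by estimated performance gain alone with selection by the benefit-cost ratio. The function name `select_feature` and the numeric gains and costs are hypothetical values chosen for illustration; they are not taken from the simulation study.

```python
import numpy as np

def select_feature(gains, costs=None):
    """Return the index of the candidate with the largest score, and all scores.

    gains: estimated performance gains, one per candidate feature.
    costs: optional strictly positive feature costs. If given, the score is
           the benefit-cost ratio gain / cost, otherwise the gain itself.
    """
    scores = np.asarray(gains, dtype=float)
    if costs is not None:
        costs = np.asarray(costs, dtype=float)
        if np.any(costs <= 0):
            raise ValueError("costs must be strictly positive")
        scores = scores / costs
    return int(np.argmax(scores)), scores

# Hypothetical estimates (assumption): feature 0 is relevant but expensive,
# feature 1 is noise whose estimated gain happens to be slightly positive.
gains = [0.20, 0.01]
costs = [100.0, 1.0]

best_by_gain, _ = select_feature(gains)
best_by_ratio, ratio_scores = select_feature(gains, costs)
print(best_by_gain, best_by_ratio, ratio_scores)
```

In this toy setting the plain gain criterion picks the relevant feature 0, while the ratio picks the cheap noise feature 1, because dividing by the cost shrinks the relevant feature's score by a factor of 100.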
When incorporating cost according to Equation (2), the performance gain distribution of feature $x_j$ is scaled by the positive factor $1/c_j$, which increases and broadens $V_j$ if $c_j < 1$, and decreases and narrows it if $c_j > 1$. Increasing and broadening the distribution of a noise feature while not altering that of a relevant feature increases the overlap of both distributions. Therefore the probability of falsely selecting the noise feature increases. In some situations this problem may be negligible. In others, the cost-sensitive feature selection procedure can completely obfuscate any relevant feature. The actual magnitude of the cost influence depends on many factors, including the sample size $n$, the true effect size of relevant features $\beta$, the residual variance $\sigma^2$, the statistical model, and the performance measure $Q$.
The goal of this paper is to analyze this problem and describe multiple parameter settings and their influence on the feature detection rate. We focus on linear regression models and use the root mean squared error (RMSE) on independent data to assess the quality of models. The RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \Big( y_i - \hat{\beta}_0 - \sum_{j} \hat{\beta}_j x_{ij} \Big)^2},$$
with $\hat{\beta}_0$ and $\hat{\beta}_j$ estimated on training data, the inner sum running over the features included in the model, and $x_{ij}$ and $y_i$ denoting observations of an independent test data set. By using such an independent test data set, the RMSE also allows a result of no improvement after adding a feature.
In the following, for ease of presentation, we describe a single feature selection step of SFS from a pool of $p_{rel}$ relevant and $p_{noise}$ noise features. We also define this single step to be the first selection step, i.e. we define our baseline model to be the intercept model and compare the quality of all one-feature models. The final selection result of this one step can either be 'noise selected', 'relevant feature selected', or 'no feature selected'. Similarly to Definition (1), in the following we denote the gain in RMSE for feature $j$ by $\Delta\mathrm{RMSE}_j$. The corresponding distribution $V_j(\cdot)$ has no analytical form. In the following simulation study, we overcome this problem by numerically approximating this distribution and computing selection probabilities on the empirical distribution.
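The RMSE gain of a one-feature model over the intercept model, evaluated on independent test data, can be sketched as follows. This is an illustrative NumPy implementation under our own assumptions: the helper names `holdout_rmse` and `rmse_gain`, the random seed, and the toy data (one relevant feature with effect 0.5, one noise feature) are hypothetical and not part of the original study code.

```python
import numpy as np

def holdout_rmse(X_train, y_train, X_test, y_test):
    """Fit a linear model with intercept on training data, return test RMSE."""
    A_train = np.column_stack([np.ones(len(y_train)), X_train])
    A_test = np.column_stack([np.ones(len(y_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.sqrt(np.mean((y_test - A_test @ coef) ** 2)))

def rmse_gain(x_train, y_train, x_test, y_test):
    """Test RMSE of the intercept model minus test RMSE of the one-feature model."""
    base = holdout_rmse(np.empty((len(y_train), 0)), y_train,
                        np.empty((len(y_test), 0)), y_test)
    full = holdout_rmse(x_train.reshape(-1, 1), y_train,
                        x_test.reshape(-1, 1), y_test)
    return base - full

# Toy data (assumption): column 0 is relevant with effect 0.5, column 1 is noise.
rng = np.random.default_rng(0)
X_tr = rng.standard_normal((100, 2))
X_te = rng.standard_normal((1000, 2))
y_tr = 1.0 + 0.5 * X_tr[:, 0] + rng.standard_normal(100)
y_te = 1.0 + 0.5 * X_te[:, 0] + rng.standard_normal(1000)

gain_rel = rmse_gain(X_tr[:, 0], y_tr, X_te[:, 0], y_te)
gain_noise = rmse_gain(X_tr[:, 1], y_tr, X_te[:, 1], y_te)
print(gain_rel, gain_noise)
```

Because the coefficients are estimated on the training data only, the gain of a noise feature on the test data can be negative, which is what makes the outcome 'no feature selected' possible.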

Simulation Study
The goal of our simulation study is to assess the detection rate of a cost-sensitive feature selection step in multiple parameter settings. Additionally, we aim to analyze the empirical distribution of our performance measure to further illustrate effects of cost scaling. We consider a linear regression scenario. Our response variable, as well as all $p$ features, are assumed to be normally distributed. We define $p_{rel}$ features to be relevant and the remaining $p_{noise} = p - p_{rel}$ features to be noise. The individual costs of features can be seen as a relative scaling between the respective $\Delta\mathrm{RMSE}_j$ values of the features. To simplify our analyses, we do not consider individual costs for all features, but define only one single scaling factor $\theta$ for the relevant features. Hence, we implicitly define equal costs for the group of noise features and equal costs for the group of relevant features. We only differentiate between costs for information and costs for noise. To thoroughly assess the influence on the detection rate, we vary the feature cost scaling factor $\theta$ over 1, 10, 100 and 1000, the number of relevant features $p_{rel}$ over 1, 2, 5 and 10, the number of noise features $p_{noise}$ over 1, 10 and 50, and the effect size of the relevant feature $\beta$ over 0, 0.01, ..., 0.5. For multiple relevant features, we do not vary the effect size and define $\beta_j := \beta$.
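The parameter values listed above can be enumerated as a grid. The sketch below assumes a full factorial design, i.e. that every combination of the four parameters is evaluated; the text does not state this explicitly, and the variable names are our own.

```python
from itertools import product

thetas = [1, 10, 100, 1000]            # cost scaling factor of relevant features
p_rels = [1, 2, 5, 10]                 # number of relevant features
p_noises = [1, 10, 50]                 # number of noise features
betas = [i / 100 for i in range(51)]   # effect sizes 0, 0.01, ..., 0.5

# Assumption: full factorial design over all four parameters.
settings = list(product(thetas, p_rels, p_noises, betas))
print(len(settings))
```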
For each parameter combination, $B = 1000$ training ($n_{train} = 100$) and test data sets ($n_{test} = 1000$) are generated as follows. In a first step, features are drawn from a $p$-dimensional normal distribution with covariance matrix $I_p$, where $I_p$ is the $p$-dimensional identity matrix. Next, the response is drawn from the normal distribution
$$y_i \sim N\Big(\beta_0 + \sum_{j \,\in\, \text{relevant}} \beta_j x_{ij},\; \sigma^2\Big).$$
We set the intercept to $\beta_0 = 1$ and the residual variance to $\sigma^2 = 1$ for all settings.
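A data-generating step of this kind might look as follows in NumPy. This is a sketch under stated assumptions: the function name `simulate_data` is hypothetical, the feature mean vector is assumed to be zero (the text only specifies the identity covariance), and the relevant features are placed in the first $p_{rel}$ columns purely for convenience.

```python
import numpy as np

def simulate_data(n, p_rel, p_noise, beta, rng, beta0=1.0, sigma=1.0):
    """Draw n observations of p = p_rel + p_noise independent normal features
    and a response that depends linearly on the first p_rel features."""
    p = p_rel + p_noise
    # Assumption: zero mean vector; the covariance is the identity matrix.
    X = rng.standard_normal((n, p))
    mean = beta0 + beta * X[:, :p_rel].sum(axis=1)   # beta_j := beta for all relevant j
    y = mean + sigma * rng.standard_normal(n)
    return X, y

rng = np.random.default_rng(1)
X_train, y_train = simulate_data(100, p_rel=2, p_noise=10, beta=0.3, rng=rng)
X_test, y_test = simulate_data(1000, p_rel=2, p_noise=10, beta=0.3, rng=rng)
print(X_train.shape, X_test.shape)
```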
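For every data set obtained in this way, we fit the baseline intercept model and all one-feature models separately and obtain their RMSE values on the test data, $\mathrm{RMSE}(M(\emptyset \mid D))$ and $\mathrm{RMSE}(M(\{x_j\} \mid D))$, $j = 1, \dots, p$. We then compute the gain in RMSE for all features by
$$\Delta\mathrm{RMSE}_j = \mathrm{RMSE}(M(\emptyset \mid D)) - \mathrm{RMSE}(M(\{x_j\} \mid D)).$$
As we are only interested in the question of whether a noise feature or a relevant feature is selected, we define the RMSE gain of noise and relevant features,
$$\Delta\mathrm{RMSE}_{noise} = \max_{j \,\in\, \text{noise}} \Delta\mathrm{RMSE}_j, \qquad \Delta\mathrm{RMSE}_{rel} = \max_{j \,\in\, \text{relevant}} \Delta\mathrm{RMSE}_j,$$
as our target variables. The best $\Delta\mathrm{RMSE}$ value indicates the candidate that is selected from the noise and the relevant features, respectively.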
As described earlier, we define our cost setting by a single factor $\theta$, which scales relevant features. Hence, the assessed measure of RMSE gain for relevant features actually results in $\Delta\mathrm{RMSE}_{rel} / \theta$. The final feature selection on a single data set can lead to three different outcomes $\hat{m}$. We only consider positive RMSE gains. Therefore, if neither relevant nor noise features result in a positive RMSE gain, then no feature is selected:
$$\hat{m} = \arg\max \Big\{ \frac{\Delta\mathrm{RMSE}_{rel}}{\theta},\; \Delta\mathrm{RMSE}_{noise},\; 0 \Big\}.$$
As every setting is repeated 1000 times with newly simulated data sets, we can estimate the probability for each selection result $\hat{m}$ by looking at the relative frequency among those 1000 runs. We can further obtain empirical distributions of $\Delta\mathrm{RMSE}$ for relevant and for noise features in different settings. The results for both of these analyses are presented in the following section.
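The complete selection step and the Monte Carlo estimate of the outcome probabilities can be sketched in a few lines. This is a self-contained illustration, not the original study code: the function names, the zero-mean feature assumption, the placement of relevant features in the first columns, and the use of closed-form simple regression for the one-feature models are our own choices. The example call uses B = 200 instead of 1000 only to keep the run short.

```python
import numpy as np

def one_step_outcome(rng, p_rel, p_noise, beta, theta,
                     n_train=100, n_test=1000, beta0=1.0, sigma=1.0):
    """Simulate one data set and return 'relevant', 'noise' or 'none'.
    Requires p_rel >= 1 and p_noise >= 1."""
    p = p_rel + p_noise

    def draw(n):
        X = rng.standard_normal((n, p))   # assumption: zero-mean features
        y = beta0 + beta * X[:, :p_rel].sum(axis=1) + sigma * rng.standard_normal(n)
        return X, y

    X_tr, y_tr = draw(n_train)
    X_te, y_te = draw(n_test)

    # Baseline intercept model: predict the training mean.
    rmse0 = np.sqrt(np.mean((y_te - y_tr.mean()) ** 2))

    # All one-feature least-squares fits at once (simple regression formulas).
    xc = X_tr - X_tr.mean(axis=0)
    slope = xc.T @ (y_tr - y_tr.mean()) / (xc ** 2).sum(axis=0)
    intercept = y_tr.mean() - slope * X_tr.mean(axis=0)
    pred = intercept + X_te * slope                    # shape (n_test, p)
    rmse_j = np.sqrt(np.mean((y_te[:, None] - pred) ** 2, axis=0))
    gain = rmse0 - rmse_j

    # Best relevant gain (scaled by theta), best noise gain, and 0 for 'none'.
    scores = [gain[:p_rel].max() / theta, gain[p_rel:].max(), 0.0]
    return ("relevant", "noise", "none")[int(np.argmax(scores))]

def selection_probabilities(p_rel, p_noise, beta, theta, B=1000, seed=0):
    """Relative frequency of each outcome among B simulated data sets."""
    rng = np.random.default_rng(seed)
    counts = {"relevant": 0, "noise": 0, "none": 0}
    for _ in range(B):
        counts[one_step_outcome(rng, p_rel, p_noise, beta, theta)] += 1
    return {k: v / B for k, v in counts.items()}

probs = selection_probabilities(p_rel=1, p_noise=1, beta=0.5, theta=1, B=200)
print(probs)
```

Varying `theta` and `p_noise` in the call above gives a quick way to explore how the cost scaling and the number of noise features shift probability mass between the three outcomes.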

Results
This section comprises the analysis of the selection probabilities with main results presented in Figure 1

Feature Detection Rates
The individual plots of Figure 1 illustrate the estimated probabilities for the three selection outcomes 'relevant feature selected', 'noise feature selected' and 'no feature selected' across effect sizes of the true effect β. Rows of the main plot matrix relate to different numbers of noise features, while columns represent the extent of cost scaling applied to the relevant feature. The top-left plot describes a setting with one relevant and one noise feature. No cost scaling is applied, which corresponds to a setting without costs or with equal costs. At an effect size of β = 0, where both features can be considered noise, their selection probability is approximately equal. In almost 70% of the cases, neither of them is selected. When increasing the effect size β, the selection probability for the relevant feature rises, while the probabilities for both other outcomes decrease. From around β = 0.3 onward, the relevant feature is identified in approximately 100% of cases. Increasing the number of noise features (rows 2 and 3) changes this result in multiple ways. The main difference can be seen in the number of times that no feature is selected. This value is reduced greatly for 10 noise features and disappears completely for 50 noise features. The other difference is that the selection curve of the relevant feature starts at a lower value and reaches 100% selection slightly later. These differences are, however, more subtle.
The main focus of our paper lies on the effect of incorporating costs and thus scaling the performance distribution of the relevant features. This scaling factor corresponds to the columns of the plot matrix. When increasing the factor, the decrease in selection probability of noise for higher effect sizes becomes smaller, eventually resulting in an approximately constant noise selection probability at θ = 1000. As the initial selection probability for noise increases with a larger number of noise features, the combined effect results in always selecting noise at the bottom-right plot.
The effect of increasing the number of relevant features is illustrated for a fixed scaling factor $\theta = 10$ and an equal number of noise and relevant features in the additional bottom row of Figure 1. The main observation is that the frequency of selecting no feature decreases with increasing $p_{rel}$, and a noise feature is selected instead. The probability of selecting a relevant feature does not seem to be strongly influenced: it is only slightly pushed back by noise and reaches the area of 100% selection for slightly larger effect sizes. Full illustrations including multiple values of $\theta$ and non-identical $p_{rel}$ and $p_{noise}$ are given in Additional file 1.

Empirical Distribution of ∆RMSE
The empirical distribution of RMSE gain for the relevant features depends on the true effect $\beta$, the cost scaling parameter $\theta$, and the number of relevant features in the model. For noise, it only depends on the numbers of noise and relevant features, as the true effect is 0 and no scaling of noise is performed. A comprehensive illustration of all analyzed distributions for $p_{rel} = 1$ is given in the top plot of Figure 2. A heatmap describes the distributions of RMSE gain for relevant features along different effect sizes. Lighter colors correspond to higher densities. RMSE gains for noise features are depicted by three density curves for settings with 1, 10 and 50 noise features, respectively. A gray area highlights the decision boundary for not selecting any feature.
The given plots provide deeper insight into the selection decisions illustrated previously in Figure 1. Analyzing the noise features, the distribution of RMSE gains of one single noise feature has the great majority of its probability mass within the gray area and would not be selected, regardless of the RMSE gain of the relevant feature. However, when increasing the number of noise features $p_{noise}$, the noise distribution steadily moves out of this area. For the relevant feature, the unscaled distribution (top-left plot) increases superlinearly along $\beta$ and completely passes any noise distribution at around $\beta = 0.4$. Cost scaling, however, lowers the slope of this increase and decreases the variance of the relevant feature distribution. As a consequence of both, surpassing the noise distributions happens notably more slowly. For $\theta = 100$, the relevant feature distribution is shrunk, relative to noise, to a level that makes it almost invisible in the plot. The largest noise distribution is not surpassed at all in our range of $\beta$ values. However, an important observation is that the total density of $\Delta\mathrm{RMSE}_{rel}$ below or equal to zero is constant for any

Discussion and Conclusion
The simulation study revealed multiple consequences of cost-sensitive feature selection when using the popular benefit-cost ratio without a hyperparameter. In Figure 1, we see that cost-scaling the benefit $\Delta\mathrm{RMSE}$ makes the selection probability of noise features more robust, especially for large true effects. With $\theta \to \infty$, this probability becomes independent of $\beta$. However, the frequency of selecting noise does not necessarily approach 1, but converges to a certain limit. For $\theta \to \infty$, this limit is given by $P(\Delta\mathrm{RMSE}_{noise} > 0)$. Values with negative RMSE difference will never be selected, regardless of the scaling. With an increasing number of noise features, the probability that all estimated performance gains are negative decreases. Hence, the described limit for selecting noise rises. The third row of Figure 1 illustrates the consequences of both effects, which eventually result in a noise selection probability of approximately 1 for all $\beta$ values.
The empirical distributions shown in Figure 2 further describe this relation. With higher cost penalization, the slope and variance of the RMSE gain distribution along $\beta$ decrease. The probability regions favoring noise over the relevant feature steadily become larger as $\theta$ increases, yet the probability masses above and below 0 stay constant, further illustrating the probability limit of noise selection.
The effects of increasing the number of relevant features in the true model are more subtle. The selection probability plots mainly show the effects already observed when increasing the number of noise features. The differences in the empirical densities of RMSE gains of relevant features in Figure 2 are the result of two effects. On the one hand, the maximum RMSE gain results in a higher value for a higher number of features. On the other hand, the relative share of a single feature in the total information decreases with higher $p_{rel}$. For small $\beta$, the distribution of $\Delta\mathrm{RMSE}_{rel}$ is very skewed and the first effect dominates. For larger $\beta$, the distribution becomes less skewed and the latter effect has a higher impact. In total, this results in the observed trends with increasing $p_{rel}$.
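The limit of the noise selection probability can be written out in a few steps. The derivation below is a sketch that assumes the gains have continuous distributions. The final line additionally treats the gains of the individual noise features as independent with a common probability $q = P(\Delta\mathrm{RMSE}_j > 0)$; both $q$ and the independence are simplifying assumptions introduced here for illustration, and independence does not hold exactly, since all gains share the same baseline model and test data.

```latex
% Assumption (last line only): noise gains are independent with a common
% probability q = P(\Delta RMSE_j > 0). This is an approximation.
\begin{align*}
P(\hat{m} = \text{noise})
  &= P\Big(\Delta\mathrm{RMSE}_{noise} >
       \max\big\{\tfrac{\Delta\mathrm{RMSE}_{rel}}{\theta},\, 0\big\}\Big) \\
  &\xrightarrow{\;\theta \to \infty\;}
     P\big(\Delta\mathrm{RMSE}_{noise} > 0\big) \\
  &= 1 - P\big(\Delta\mathrm{RMSE}_j \le 0
       \text{ for all noise features } j\big) \\
  &\approx 1 - (1 - q)^{p_{noise}}
\end{align*}
```

Under this approximation the limit grows towards 1 as $p_{noise}$ increases, which is in line with the rising noise selection limit described above.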
Altogether, our paper addressed implications of using the benefit-cost ratio without an additional hyperparameter for cost-sensitive feature selection. As using this ratio is a typical approach to incorporate feature costs, it is important to understand possible problems resulting from it. We provided a thorough problem description and analyzed multiple parameter settings in a simulation study. Results from this study illustrated that a strong cost-scaling, which may result from high relative cost differences between features, can notably influence the detection limit of relevant features. This effect interacts with the number of noise features in the data.
To avoid this problem we recommend using an adapted benefit-cost ratio, such as the ones proposed in Jagdhuber et al. [3] or Min et al. [9]. The main alternative solution to incorporate costs is a weighted linear combination, as mentioned in the introduction of this paper. All these approaches share the idea of introducing a hyperparameter to control the trade-off between benefit and cost. This can reduce the problem, but it comes at the price of an additional estimation step. If the analysis requires the benefit-cost ratio without a hyperparameter, we strongly recommend thoroughly analyzing the cost distribution of the given data set. If relative cost differences are high, transforming costs prior to applying the benefit-cost ratio may be beneficial. In practice, such extreme ratios are likely to occur when some costs are very close to 0, or when cost-free features are set to a cost of ε close to 0, as e.g. recommended in Min et al. [9].
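As a generic illustration of these countermeasures, the sketch below combines a lower bound on costs with a power transformation controlled by an exponent. The function `adjusted_bcr`, its parameters `lam` and `min_cost`, and all numeric values are hypothetical; this is not the specific adapted ratio proposed in [3] or [9].

```python
import numpy as np

def adjusted_bcr(gains, costs, lam=1.0, min_cost=None):
    """Score features by gain / cost**lam (hypothetical illustration).

    lam = 1 gives the plain benefit-cost ratio, lam = 0 ignores costs.
    min_cost, if given, raises all smaller costs to this floor first.
    """
    gains = np.asarray(gains, dtype=float)
    costs = np.asarray(costs, dtype=float)
    if min_cost is not None:
        costs = np.maximum(costs, min_cost)
    if np.any(costs <= 0):
        raise ValueError("costs must be strictly positive")
    return gains / costs ** lam

# Hypothetical estimates: feature 0 relevant and expensive, feature 1 cheap noise.
gains = [0.20, 0.01]

plain = adjusted_bcr(gains, [100.0, 1.0], lam=1.0)    # plain ratio
damped = adjusted_bcr(gains, [100.0, 1.0], lam=0.5)   # compressed cost differences
near_zero = adjusted_bcr(gains, [1.0, 1e-6])          # noise feature almost free
floored = adjusted_bcr(gains, [1.0, 1e-6], min_cost=0.1)

print(int(np.argmax(plain)), int(np.argmax(damped)),
      int(np.argmax(near_zero)), int(np.argmax(floored)))
```

In this toy example the plain ratio and the near-zero cost both favor the noise feature, while the damped exponent and the cost floor restore the relevant feature. The exponent is exactly the kind of hyperparameter that would have to be chosen or tuned in practice.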
The popularity of the benefit-cost ratio shows the need for simple methods to incorporate costs without an additional parameter tuning step. Beyond the scope of this work, it would be of great interest to solve this problem with a comprehensible way to specify the trade-off between costs and performance using expert knowledge, instead of tuning a black-box hyperparameter. This would allow the user to specify the intended relation of costs and performance, which may differ greatly between fields of application. Our work covers a specific task in predictive modelling and tries to raise awareness of the problem. Further research may also consider different model types, performance measures, feature distributions, and additional aspects.