1 Introduction

Feature selection plays a crucial role in many medical applications. It allows us to build models with better generalization and greater predictive power [1, 2]. Furthermore, feature selection is important when predictive capacity alone is not sufficient and there is also a need for explanation, that is, a detailed understanding of the underlying mechanisms and the discovery of dependency structures between the features and the target variable [3]. Recent years have witnessed a rapid and substantial advancement of feature selection methods that deal with the high dimensionality of the data; we refer to the latest reviews [4, 5].

Generally, the assumption behind most existing methods is that all features incur the same cost. This may not be accurate, especially in medical applications, where the acquisition of the selected features may be costly. For example, in medical diagnosis, obtaining administrative data (e.g., age or weight) is inexpensive, but any information extracted by a clinical test is associated with a specific cost. Costs range from low values for simple blood tests to expensive, often invasive diagnostic procedures such as biopsy. Monetary price is not the only factor that can be regarded as a cost. Feature costs may also correspond to the time or difficulty of obtaining administrative data (e.g., due to privacy reasons) [6]. Another example of feature cost is the risk associated with specific medical examinations (such as general anesthesia [7] or diagnostic X-rays [8]). Finally, costs may correspond to the choice of a diagnostic procedure, e.g., the decision between invasive exploratory surgery and a simple blood test. In image analysis, costs may correspond to the difficulty of extracting feature values from images; a representative example is predicting COVID-19 cases based on chest X-ray images [9]. This is particularly important when computationally expensive deep neural network models are used to extract features from images [10]. Without accounting for the different costs, we might produce a powerful model that is nevertheless impossible to use in practice, because making a prediction would be too expensive [11].

Feature selection methods that take into account the costs of features are called cost-sensitive feature selection methods. Importantly, in the considered framework, we assume that all features are available in the training data. Our goal is to select a subset of features to reduce the cost of obtaining a prediction for a new instance for which the feature values have not yet been acquired. Cost-sensitive feature selection is more challenging than traditional feature selection, as we have to take into account two issues: the relevance of a feature and its cost.

It is worth considering in what situations cost-sensitive methods have the opportunity to work better than traditional feature selection methods. To understand the issue more deeply, consider the following simple example, showing situations in which the cost-sensitive method may outperform the traditional one and vice versa. Imagine that we would like to predict the occurrence of a disease Y using features \(X_1,\ldots ,X_p\). There are three relevant features \(X_1,X_2,X_3\), with costs 1, 1, 1. The models based on the subsets \(\{X_1\}\), \(\{X_1,X_2\}\) and \(\{X_1,X_2,X_3\}\) enable prediction of Y with accuracy \(70\%\), \(80\%\) and \(90\%\), respectively, and including additional features does not improve the prediction. For simplicity, assume also that for other combinations of variables (e.g., \(\{X_2,X_3\}\)) we obtain lower accuracy. Moreover, assume that the variables \(X_4,X_5,X_6\) are cheaper counterparts of \(X_1,X_2,X_3\) that yield slightly worse predictions; for example, the models based on the subsets \(\{X_4\}\), \(\{X_4,X_5\}\) and \(\{X_4,X_5,X_6\}\) give accuracy \(65\%\), \(75\%\) and \(85\%\), respectively. The costs of \(X_4,X_5,X_6\) are 0.5, 0.5, 0.5. We can expect that a traditional feature selection method will tend to select \(X_1,X_2,X_3\), while cost-sensitive methods will prefer \(X_4,X_5,X_6\). When our budget B is not limited, e.g., \(B=3\), there is no need to use the cost-sensitive method, as the variables chosen by the traditional method fit the assumed budget and the corresponding model has large predictive power (\(90\%\)). However, when our budget is limited, say \(B=2\), the situation changes. In this case, the cost-sensitive method allows us to obtain accuracy \(85\%\) by choosing \(X_4,X_5,X_6\) (total cost: 1.5), whereas with the traditional method we can obtain only \(80\%\) by choosing \(X_1,X_2\) (total cost: 2). The above example indicates that we can expect an advantage of cost-sensitive methods over traditional methods when the considered budget is limited and when there exist cheaper counterparts of the relevant variables.
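
The budgeted choice in this example can be reproduced by a trivial enumeration over the candidate subsets. The following R sketch (our own illustration; the accuracies and costs are those of the example above) simply picks the most accurate subset whose total cost fits the budget.

```r
# Candidate models from the example: accuracy and total feature cost.
candidates <- data.frame(
  subset   = c("X1", "X1,X2", "X1,X2,X3", "X4", "X4,X5", "X4,X5,X6"),
  accuracy = c(0.70, 0.80, 0.90, 0.65, 0.75, 0.85),
  cost     = c(1.0, 2.0, 3.0, 0.5, 1.0, 1.5)
)

# Most accurate candidate whose total cost fits the budget B.
best_under_budget <- function(candidates, B) {
  feasible <- candidates[candidates$cost <= B, ]
  feasible[which.max(feasible$accuracy), ]
}

best_under_budget(candidates, B = 3)  # {X1,X2,X3}: accuracy 0.90, cost 3
best_under_budget(candidates, B = 2)  # {X4,X5,X6}: accuracy 0.85, cost 1.5
```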

There are many different approaches to feature selection in supervised learning. In this paper, we focus on embedded feature selection methods based on the general penalized empirical risk minimization (ERM) framework. The considered framework accommodates various loss functions. In the experiments, we focus mainly on the logistic loss, but the discussed methods can be combined with different loss functions (such as the squared loss or the hinge loss) and with different models, including complex models such as deep neural networks [12]. Our primary goal is to compare cost-sensitive variants of existing penalty functions such as lasso, adaptive lasso and non-convex penalties.

An important contribution of this work is a novel experimental framework which allows us to validate cost-sensitive methods on real datasets for which cost information is not available. This is important because there is currently a shortage of benchmark datasets with assigned feature costs. In the proposed framework, costs depend on the relevance of the variables, which is the most realistic scenario in the medical domain. In addition to the original features, we create so-called proxy variables. Proxy variables are assigned lower costs than the original variables and are obtained by randomly permuting a fraction of the observations of the original variables. In the proposed scheme, we can control the costs of the proxy features as well as the strength of the dependence between the proxy variables and the target variable.

The main contributions can be summarized as follows.

  • We analyze several cost-sensitive variants of existing penalties. In particular, the cost-sensitive modification of the non-convex MCP penalty has not been considered in the literature and thus can be treated as a novel method.

  • We propose a novel experimental framework based on so-called proxy features, which allows generating artificial costs for datasets for which cost information is not available.

  • We introduce a novel evaluation measure, CSFDR (cost-sensitive false discovery rate), which quantifies the cost spent on irrelevant features.

  • We perform extensive experiments on the large medical database MIMIC, for which the costs are assigned by experts, as well as on other real and artificial datasets.

The paper is structured as follows. In Section 2 we discuss related work, and in Section 3 we describe the penalized empirical risk minimization methods. Section 4 contains the results of the experiments and Section 5 concludes the work.

2 Related work

Cost-sensitive feature selection methods have attracted some attention in the machine learning literature. The task is challenging, as it is necessary to find a trade-off between the relevance of a feature and its cost. Cost-aware modifications exist for information-theoretic filters [13, 14], classical model-based information criteria such as AIC [15], classification trees [16] and Random Forest [17].

Nowadays, penalized ERM methods combined with various penalties, such as lasso [18], elastic net [1] or non-convex penalties (MCP, SCAD) [19], play a leading role among feature selection methods. They are considered the gold standard in a great number of medical applications, especially in tasks where both prediction and feature selection are of main interest. Penalized methods have many advantages. First, they can successfully handle high-dimensional datasets, as opposed to classical feature selection methods based on information criteria (e.g., AIC or BIC), which may fail when the number of variables is large relative to the number of observations. Second, they can be combined with various empirical risk functions corresponding to different loss functions, e.g., the logistic loss, squared loss or hinge loss; therefore, they can be used in different supervised tasks (e.g., regression, binary classification, multi-class classification, survival analysis). Finally, they belong to the class of embedded feature selection methods, which means that feature selection and parameter estimation are performed simultaneously. The last property is a great advantage when one aims to focus on the classification model and avoid external feature selection procedures, such as filters. Penalized ERM methods have been successfully used in many medical applications (it is impossible to review all of them). The most important examples include predicting SARS-CoV-2 pneumonia [20], breast cancer [21], real-time forecasting of endemic infectious diseases [22], prediction of cardiovascular diseases [23], identification of ischemic stroke [24], and prediction of future visual field progression in glaucoma [25], among many others. Moreover, several studies concerning elastic net and non-convex penalties have also focused on disease detection and survival prediction, for example breast cancer survival prediction [26] or penile cancer detection [27].

Regarding cost-sensitive variants of penalized ERM methods, the literature is much more limited. A cost-sensitive modification of the lasso method for logistic regression was considered in [28]; in this work, different penalty factors were assigned to different modalities, such as clinical, gene expression, methylation and copy number variation modalities. Teisseyre et al. proposed a cost-sensitive modification of adaptive lasso [29] and combined it with classifier chains for multi-label classification. The method is based on the idea that the features selected by the previous model in a chain are more likely to be selected by the next model; therefore, the penalties corresponding to the features that have already been chosen are decreased in the current model. An interesting method, called cheap knockoffs, was proposed recently in [30]. The method is based on so-called knockoff features, which are artificially generated features designed to mimic the correlation structure found within the original variables [31, 32]. The key idea of the method is to force the highest-cost features to compete with more knockoffs than the cheaper features. More precisely, for each original feature, multiple knockoffs are constructed, with the more expensive features having more knockoffs. The lasso method is run on the joint set of original features and knockoffs. A feature is selected as relevant only if it beats all of its knockoff counterparts; costlier features thus face more competition. An upper bound on the weighted false discovery proportion of this procedure is derived, which corresponds to the fraction of the feature cost that is wasted on unimportant features. Despite the popularity of penalized ERM methods, the literature lacks an in-depth analysis and comparison of penalized methods taking cost information into account. This paper aims to fill this gap.

Finally, let us mention that in addition to ERM-based methods, there are other approaches to feature selection. An interesting group of methods are nature-inspired metaheuristic algorithms such as cooperative coevolutionary algorithms [33], whale optimization algorithms [34] or grey wolf optimization [35]. Combining these approaches with cost information is an interesting topic for future research.

3 Penalized empirical risk minimization methods

3.1 Feature selection under budget constraint

We assume that each instance in the training data D is described by a pair \((x_i,y_i)\), \(i=1,\ldots ,n\), where \(x_i=(x_{i,1},\ldots ,x_{i,p})\) is the feature vector of the i-th instance and \(y_i\) is the corresponding value of the target variable. Moreover, the costs \(c_1,\ldots ,c_p\) associated with the features are given. The goal is to build a supervised model on the training data that predicts the target variable for new instances. The prediction must be based on a selected subset of variables whose total cost fits within the assumed budget B; in many practical situations, this budget is significantly smaller than the total cost of all the features considered. We focus on the case of a linear predictor, i.e., the prediction is a certain function of the linear combination of the features \(x_i^T\beta \). The quality of the prediction \(\hat{y}_i\) is measured using a loss function \(l(y_i,x_{i}^{T}\beta )\), where \(\beta \in R^p\) is a parameter vector. The most popular loss functions are: the squared loss \(l(y_i,x_i^T\beta )=(y_i-x_i^T\beta )^2\), used in regression; the logistic loss \(l(y_i,x_i^T\beta )=-[y_i\log (\sigma (x_i^T\beta ))+(1-y_i)\log (1-\sigma (x_i^T\beta ))]\), where \(\sigma (s)=(1+\exp (-s))^{-1}\), corresponding to the logistic model; and the hinge loss \(l(y_i,x_i^T\beta )=\max \{0,1-y_ix_i^T\beta \}\), corresponding to the SVM method. A small value of the loss indicates that the predicted value is close to the observed value of the target variable. In the Empirical Risk Minimization (ERM) framework, the objective is to minimize (with respect to \(\beta \)) the empirical risk

$$ \hat{R}(\beta ):=\frac{1}{n}\sum _{i=1}^{n}l(y_i,x_i^{T}\beta ), $$

corresponding to the theoretical risk function \(R(\beta )=E_{y,x}l(y,x^{T}\beta )\). Learning under the budget constraint B can be formulated as a constrained optimization problem in which we solve

$$\begin{aligned} \hat{\beta } = \arg \min _{\beta \in R^p} \hat{R}(\beta ) \quad \text {subject to:} \quad \sum _{j=1}^{p} c_j I \bigl [\beta _j \ne 0 \bigr ] \le B, \end{aligned}$$
(1)

where \(I[\cdot ]\) denotes the indicator function. The constraint in (1) means that the total cost associated with the selected features cannot exceed the assumed budget B. This is equivalent to finding the best model (in terms of empirical risk minimization) subject to a limited budget. When \(c_1=\ldots =c_p\), the above problem reduces to best subset selection [1], in which we find the optimal solution subject to a limited number of features. The optimization problem (1) may be written in the equivalent form

$$\begin{aligned} \hat{\beta } = \arg \min _{\beta \in R^p}\left\{ \hat{R}(\beta )+\lambda \sum _{j=1}^{p}c_jI\bigl [\beta _j\ne 0\bigr ]\right\} , \end{aligned}$$
(2)

where the parameter \(\lambda >0\) corresponds to the budget B. More precisely, for a given B there exists \(\lambda >0\) such that the two problems share the same solution, and vice versa. The second term in (2) is a penalty for the cost. The parameter \(\lambda >0\) controls the balance between the goodness of fit of the model and the cost: the larger the value of \(\lambda \), the cheaper the resulting model. Although this approach seems encouraging, the issue is that (2) is non-convex, which makes it computationally infeasible to solve even for a moderate number of features (the problem is known to be NP-hard) [36]. The difficulty is due to the \(\ell _0\)-type penalty. In the following subsections, we discuss alternative penalty functions which can be successfully used in practice.
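
To make the objective concrete, the following R sketch (our own illustration, not the implementation used later in the experiments) evaluates the empirical risk under the logistic loss and the cost-penalized \(\ell _0\) objective (2) for a given coefficient vector; the data objects X, y and the cost vector are assumed to be given.

```r
# Logistic loss l(y, s) with s = x^T beta and sigma(s) = 1 / (1 + exp(-s)).
logistic_loss <- function(y, s) {
  p <- 1 / (1 + exp(-s))
  -(y * log(p) + (1 - y) * log(1 - p))
}

# Empirical risk R_hat(beta) over the training data (X: n x p matrix, y in {0,1}).
empirical_risk <- function(beta, X, y) mean(logistic_loss(y, X %*% beta))

# Cost-penalized l0 objective (2): empirical risk plus lambda times the
# total cost of the non-zero coefficients; minimizing it exactly is NP-hard.
objective_l0 <- function(beta, X, y, costs, lambda) {
  empirical_risk(beta, X, y) + lambda * sum(costs[beta != 0])
}
```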

3.2 Cost-sensitive lasso

The most natural way to relax the optimization problem (2) is to use an \(\ell _1\)-type penalty instead of the computationally infeasible \(\ell _0\)-type penalty. The resulting method (cost-sensitive lasso) solves:

$$\begin{aligned} \hat{\beta } = \arg \min _{\beta \in R^p}\left\{ \hat{R}(\beta )+\lambda \sum _{j=1}^{p}c_j |\beta _j|\right\} , \end{aligned}$$
(3)

where the penalty is a cost-weighted version of the \(\ell _1\) norm \(\Vert \beta \Vert _1=\sum _{j=1}^{p}|\beta _j|\). Generally, one can consider an \(L_q\)-type penalty with any \(q\ge 0\). The lasso is special in that \(q = 1\) is the smallest value of q (closest to best subset selection) that leads to a convex constraint region and hence a convex optimization problem; in this sense, it is the closest convex relaxation of the best subset selection problem. The method was used in [28], where, instead of feature costs, different penalty factors \(c_j\) were assigned to different modalities, such as clinical, gene expression, methylation and copy number variation modalities.

It is important to choose the optimal value of the parameter \(\lambda \). In traditional feature selection, this is done via cross-validation. For cost-sensitive feature selection, the issue is more subtle. Observe that the cost associated with the chosen subset of features depends on the parameter \(\lambda \). A large enough value of \(\lambda \) sets all coefficients exactly equal to zero, which gives a total cost of zero; this value of \(\lambda \) can be calculated analytically. A small value of \(\lambda \) results in many non-zero coefficients, which yields a large cost. In the experiments, we use the following strategy. We consider a decreasing sequence of regularization parameters \(\lambda _1>\lambda _2>\ldots >\lambda _L\) (we set \(L=100\)) and the corresponding costs \(C(\lambda _1),\ldots ,C(\lambda _L)\) associated with the selected features. We then choose the largest k such that the cost of the solution for \(\lambda _k\) does not exceed the budget B whereas the cost of the solution for \(\lambda _{k+1}\) exceeds B, i.e., \(C(\lambda _k)\le B\) and \(C(\lambda _{k+1})>B\). Finally, from the sequence \(\lambda _1,\ldots ,\lambda _k\) we select the value for which the considered evaluation measure (e.g., AUC) is largest.
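
A minimal sketch of this strategy is given below. It uses the fact that the glmnet package accepts per-feature penalty factors (argument penalty.factor), which is one way the cost weights \(c_j\) can be injected into the \(\ell _1\) penalty; the budget-aware choice of \(\lambda \) along the path follows the rule described above. The objects X, y, X_val, y_val, costs and B are assumed to be given, and the validation step is a simplification of our procedure.

```r
library(glmnet)  # lasso with per-feature penalty factors
library(pROC)

# Cost-sensitive lasso path: the costs c_j enter through penalty.factor.
fit <- glmnet(X, y, family = "binomial", penalty.factor = costs, nlambda = 100)

# Total cost of the support selected at each lambda on the path.
support_cost <- apply(as.matrix(fit$beta) != 0, 2, function(nz) sum(costs[nz]))

# Keep only the lambdas whose selected features fit within the budget B ...
feasible <- which(support_cost <= B)

# ... and among them choose the lambda with the best validation AUC.
auc_at <- function(k) {
  prob <- predict(fit, newx = X_val, s = fit$lambda[k], type = "response")
  as.numeric(auc(roc(y_val, as.numeric(prob), quiet = TRUE)))
}
best_lambda <- fit$lambda[feasible[which.max(sapply(feasible, auc_at))]]
```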

Fig. 1 Lasso and MCP penalties for \(\lambda = 1, \gamma = 3\)

3.3 Cost-sensitive adaptive lasso

The main idea of the adaptive method is to adaptively modify the penalty factors corresponding to the considered features. The method consists of two steps. In the first step, we fit a univariate model for each feature, i.e., for \(j=1,\ldots ,p\) we solve

$$\begin{aligned} \hat{\beta }^{(0)}_j=\arg \min _{\beta _j}\hat{R}_{j}(\beta _j), \quad \text {where } \hat{R}_j(\beta _j)=\frac{1}{n}\sum _{i=1}^{n}l(y_i,x_{i,j}\beta _j). \end{aligned}$$

In the second step, we solve

$$\begin{aligned} \hat{\beta } = \arg \min _{\beta \in R^p}\left\{ \hat{R}(\beta )+\lambda \sum _{j=1}^{p}\frac{c_j}{1+|\hat{\beta }_j^{(0)}|}|\beta _j|\right\} . \end{aligned}$$

If a given feature j is highly correlated with the target variable, then \(|\hat{\beta }^{(0)}_j|\) will be large, and consequently the penalty factor in the second step will be significantly reduced. In the initial step we use the univariate method to compute \(\hat{\beta }^{(0)}_j\), but alternatively other methods can also be used, such as the ordinary least squares (OLS) method when the regression problem is considered [37].
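
A sketch of the two-step procedure under the logistic loss is shown below (our own illustration with assumed objects X, y and costs). The univariate coefficients from the first step enter the penalty factors of the second step together with the costs; note that, for numerical stability, an intercept is included in the univariate fits, although the formulation above omits it.

```r
library(glmnet)

# Step 1: univariate logistic fits, one per feature, giving beta_j^(0)
# (an intercept is included here for numerical stability).
beta0 <- sapply(seq_len(ncol(X)), function(j) {
  coef(glm(y ~ X[, j], family = binomial()))[2]
})

# Step 2: weighted lasso with penalty factors c_j / (1 + |beta_j^(0)|),
# so that strong univariate predictors are penalized less.
fit_adaptive <- glmnet(X, y, family = "binomial",
                       penalty.factor = costs / (1 + abs(beta0)))
```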

Fig. 2 Artificial dataset (\(p=1\)): conditional distributions \(x|y=1\) and \(x|y=0\) for \(\pi =0.8\) and \(\pi =0.5\)

3.4 Cost-sensitive non-convex penalties

We also consider non-convex penalties. A representative and one of the most promising methods in this group is MCP (minimax concave penalty) [19]. In the cost-sensitive version of MCP, we solve

$$\begin{aligned} \hat{\beta } = \arg \min _{\beta \in R^p}\left\{ \hat{R}(\beta )+\sum _{j=1}^{p}c_j P(\beta _j,\lambda ,\gamma )\right\} , \end{aligned}$$

where the penalty function is defined as

$$\begin{aligned} P(\beta _j,\lambda ,\gamma )= {\left\{ \begin{array}{ll} \lambda |\beta _j|-\frac{\beta _j^2}{2\gamma }, \text { if } |\beta _j|\le \gamma \lambda \\ \frac{1}{2}\gamma \lambda ^2 , \text { if } |\beta _j|>\gamma \lambda \\ \end{array}\right. } \end{aligned}$$

and \(\gamma >0\) is a hyper-parameter. Figure 1 shows the penalty functions described above: the \(\ell _0\)-type penalty, the \(\ell _1\) lasso penalty, and the MCP penalty. The lasso penalty deviates significantly from the \(\ell _0\)-type penalty as the absolute value of the coefficient increases. MCP starts out by applying the same rate of penalization as the lasso and then smoothly relaxes the rate down to zero as the absolute value of the coefficient increases. Therefore, MCP can be interpreted as a more accurate approximation of the \(\ell _0\)-type penalty. Our experiments confirm the promising behavior of MCP. In the experiments we also consider the adaptive version of MCP, which works analogously to adaptive lasso, with the difference that MCP is used in the second step instead of lasso. Other non-convex penalties can also be used, for example SCAD (smoothly clipped absolute deviation) [38]; however, in our experiments the results for MCP were slightly better than for SCAD, and thus we only present the results for MCP.
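
The MCP penalty itself is easy to evaluate, and the cost weights can again be supplied as per-feature penalty factors; the ncvreg package used in our experiments (Section 4.5) exposes such an argument. The sketch below (our own, with assumed objects X, y and costs) illustrates both.

```r
# MCP penalty P(beta_j, lambda, gamma) as defined above.
mcp_penalty <- function(b, lambda, gamma) {
  ifelse(abs(b) <= gamma * lambda,
         lambda * abs(b) - b^2 / (2 * gamma),
         0.5 * gamma * lambda^2)
}

# Cost-sensitive MCP: the costs c_j are passed as multiplicative penalty
# factors, and the logistic loss is selected via family = "binomial".
library(ncvreg)
fit_mcp <- ncvreg(X, y, family = "binomial", penalty = "MCP",
                  gamma = 3, penalty.factor = costs)
```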

4 Experiments

The main goal of the experiments is to compare the cost-sensitive feature selection methods based on the penalized empirical risk minimization framework with traditional feature selection that ignores information about feature costs. By the traditional method, we mean the lasso algorithm \(\hat{\beta } = \arg \min _{\beta \in R^p}\left\{ \hat{R}(\beta )+\lambda \Vert \beta \Vert _1\right\} .\)

The standard lasso method is the baseline in our experiments. Furthermore, we analyze the performance of the cheap knockoff method [30], since it is also based on the lasso. To make the comparison fair, we focus only on methods based on penalized empirical risk minimization and do not include other groups of cost-sensitive methods, such as filters; this is motivated by the fact that filters are not associated with a particular classification model. In all considered methods, we use the logistic loss function. Experiments are performed on both artificial and real medical datasets. There are several research questions that we aim to answer in the experiments. First, what is the prediction accuracy of the considered methods when the budget is limited, and which penalty is optimal? Second, how much do we improve on traditional feature selection when cost information is taken into account? Finally, how much cost do we waste on uninformative features?

4.1 Artificial dataset

The main advantage of using artificial datasets in the experiments is that we can control the dependence strength between the target variable y and feature vector x. In this way, we can analyze how the performance of the methods depends on the difficulty of the classification problem.

We use the following scheme. First, we generate the target variable from a Bernoulli distribution, \(y \sim Bernoulli(0.5)\). Then we generate the feature vector x using the conditional distributions \(x |y\). When \(y=0\) we use a multivariate Gaussian distribution, \(x |y=0 \sim N(0,I_{p\times p})\), and when \(y=1\) we generate x from a multivariate mixture of two Gaussian distributions, \(x|y=1 \sim (1-\pi ) N(0,I_{p\times p}) + \pi N(b,I_{p\times p})\), where \(\pi \) is a mixture parameter that varies in the experiments and \(b=(10,\ldots ,10)\). The higher the value of \(\pi \), the easier it is to distinguish the target classes based on the features. On the other hand, for small \(\pi \) the distributions of x in the two classes overlap and the classification task becomes more difficult. Figure 2 shows the probability density functions corresponding to the conditional distributions \(x|y = 0\) (blue line) and \(x|y=1\) (orange line) for the one-dimensional case. Observe that for \(\pi =0.5\) (right-hand panel) the distributions overlap more than for \(\pi =0.8\) (left-hand panel).
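
The generating scheme can be reproduced in a few lines of R; the sketch below uses \(n=1000\) and \(p=30\) as in the experiments, while the seed is an arbitrary choice of ours.

```r
set.seed(1)                       # arbitrary seed
n <- 1000; p <- 30; pi_mix <- 0.8; b <- rep(10, p)

# Target variable: y ~ Bernoulli(0.5).
y <- rbinom(n, size = 1, prob = 0.5)

# Features: x | y = 0 ~ N(0, I); x | y = 1 ~ (1 - pi) N(0, I) + pi N(b, I).
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
shifted <- (y == 1) & (runif(n) < pi_mix)       # mixture component with mean b
X[shifted, ] <- X[shifted, ] + matrix(b, nrow = sum(shifted), ncol = p, byrow = TRUE)
```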

Table 1 Summary of real datasets

4.2 Real datasets

In addition to the artificial datasets, we also analyze the performance of cost-sensitive feature selection methods on real medical datasets. We conducted experiments on six different datasets. In most cases, the feature values are the results of medical diagnostic tests for patients, and the binary target variable indicates the presence of a disease. Table 1 summarizes the analyzed datasets. For some of the datasets, the feature costs were assigned by experts. For the remaining datasets, we generate artificial costs using the strategy described in Section 4.3. In this strategy, we assign costs proportional to the mutual information between the class variable and the features [39]. In this way, features that are strongly marginally dependent on the target variable are assigned higher costs. This strategy is motivated by the fact that features that carry valuable information (such as advanced diagnostic tests) are very often expensive. On the other hand, we are aware that in some situations readily available and cheap features can be important predictive factors, for example the age of the patient. Nevertheless, such a method of generating costs seems more appropriate than, e.g., generating the costs at random.

The first analyzed real dataset is the publicly available clinical database MIMIC [40]. It contains various medical data about patients from intensive care units (ICUs) who are diagnosed according to the coding scheme of the International Classification of Diseases, Ninth Revision (ICD-9). The database contains information about the occurrence of many different diseases. In our experiments, we have chosen three frequently occurring diseases (diabetes, hypertension and liver disease). Among the selected diseases, hypertension is the most common (\(62.25\%\) of patients), whereas liver disease is the rarest (\(5.5\%\) of patients). The dataset contains a very large number of samples, which makes the experiments more reliable. The clinical database contains more than 300 features. Among them we have low-cost administrative data (e.g., age, weight, marital status). Another group of relatively inexpensive features are simple medical measurements that can be obtained during a medical interview (e.g., heart rate). The next group, which contains more expensive features, are blood tests (e.g., potassium or calcium in blood). Furthermore, ICU patients were examined periodically, and therefore, for selected medical tests, multiple values corresponding to different periods are available; we summarize these measurements with simple statistics (mean, median, standard deviation). In our experiments, we selected a subgroup of 30 features that have the highest mutual information with the considered target variables; for each target variable (occurrence of the disease), this may be a different feature subset. The advantage of the MIMIC dataset is that the feature costs were assigned by experts; see the previous work [29]. They are based on the prices of diagnostic tests in laboratories in Poland, the main source being the official price list of the ALAB analytical laboratories. The costs are given in Polish currency; however, the absolute values of the costs are not important, as in most countries the relations between the prices of different diagnostic tests are similar. Figures 5a, 6a and 7a show the costs of the considered features, as well as the mutual information values between the features and the target variables. Interestingly, for some datasets, we did not observe any significant relationship between the costs and the mutual information. For example, the variable age seems to be an important factor in predicting hypertension, although its cost is very low, see Fig. 6a. Similarly, platelet-related variables are highly correlated with the occurrence of liver disease (variables 4-7, Fig. 7a), although the corresponding costs are relatively low. On the other hand, in the case of diabetes, some of the relevant features are associated with slightly higher costs, for example the variable 'Glucose in Serum or Plasma mean', Fig. 5a.

For the three remaining datasets described below, the costs are not known and therefore we use artificially generated costs, as described in Section 4.3. We consider the popular Cleveland heart disease dataset [41], related to predicting the occurrence of heart disease. It contains 13 features that include basic information about the patients (e.g., age and sex), electrocardiographic results, and results of blood tests. The thyroid dataset is related to thyroid disease prediction [42]. It comes from the Garavan Institute in Sydney, Australia, and contains 3772 observations, among which thyroid disease is diagnosed in \(6.12\%\) of patients. In addition to information from a basic medical interview (e.g., age, pregnancy), the dataset contains results of blood diagnostic tests specifically associated with the thyroid (e.g., TSH, TT4). The last dataset used in our experiments comes from the Open Access Series of Imaging Studies (OASIS) [43]. It contains data on magnetic resonance imaging (MRI) diagnostics associated with Alzheimer's disease. The first group of features are administrative data (e.g., age, years of education), and the remaining features are based on MRI. Unfortunately, the dataset contains only a small number of features.

4.3 Experimental framework

Unfortunately, there is a shortage of datasets containing feature costs. One possible solution is to assign costs at random [13, 17]. However, this does not correspond to the real situation, where costs may be correlated with the relevance of the features. The second idea is to work with a group of domain experts and assign the costs as they would probably be assigned in real life. When the features correspond to the results of diagnostic tests, one may use the official price lists for diagnostic tests, which are publicly available; this strategy was used for the MIMIC dataset in a previous paper [29]. However, assigning costs to features is often much more problematic: expert work can be costly and time-consuming, and some medical procedures are difficult to price. The last possible solution is to generate costs artificially through some cost-setting strategy that simulates real scenarios. This solution is simple and allows us to take into account information about feature relevance.

In the proposed framework, we generate two additional sets of features. The first group are the so-called proxy features \(x_1',\ldots ,x_p'\), obtained from the original features \(x_1,\ldots ,x_p\) by randomly permuting \(\rho \cdot n\) values of the original feature, where \(\rho \in [0,1]\) is a parameter that varies in the simulations and n is the size of the dataset. Observe that when \(\rho =1\), we permute all values and thus completely break the dependence between the proxy variable and the target variable y. For \(\rho =0\), the proxy variable matches the original variable. In general, when \(\rho \in (0,1)\), the dependence between the proxy variable \(x_j'\) and the target variable y is weaker than the dependence between the original variable \(x_j\) and y. Proxy variables can be treated as noisy copies of the original variables. In addition to the p proxy variables, we generate p noisy variables \(x_1'',\ldots ,x_p''\) which are independent of y (they are obtained by setting \(\rho =1\) in the above procedure). The latter operation is done to make the feature selection task more challenging. Thus, in total we have 3p features, where p is the number of original ones. When the costs are not provided, we generate them as follows. The costs of the original features depend on their relevance: we compute the mutual information \(I(y,x_j)\) between each original feature and the target variable (the higher the mutual information, the stronger the marginal dependence between the feature and the target variable) and define the cost of the original feature as this value, i.e.

$$\begin{aligned} c_j=I(y,x_j), j=1,\ldots ,p. \end{aligned}$$

The costs of the proxy and noisy features are defined as \(c_j'=c_j''=s\cdot c_j\), where \(s\in (0,1)\) is another parameter that controls the relation between the costs of the original and the proxy/noisy features. For example, \(s=0.5\) means that the cost of a proxy feature is half the cost of the corresponding original feature. Generally, the larger the value of \(\rho \) and the smaller the value of s, the more attractive it becomes to replace the original features with the proxy features. We study how the value of \(\rho \) affects the performance of the considered methods. The above framework mimics a real scenario. For example, in medical diagnosis we can perform an expensive diagnostic test (original feature), which yields the accurate value of the feature, or alternatively we can choose a cheaper diagnostic test (proxy feature), which gives an approximate value of the feature. As an example, consider medical ultrasonography (USG): 3D scans are more effective and precise than traditional 2D scans, but they are also more expensive; the 2D scan can be seen as an approximation of the 3D scan.
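
A sketch of the proxy, noise and cost generation, following the description above, is given below. The plug-in mutual information estimate based on equal-width discretization is only one of several possible estimators and is an assumption on our side; X and y denote the original design matrix and target.

```r
# Proxy feature: randomly permute a fraction rho of the observations.
make_proxy <- function(x, rho) {
  idx <- sample(length(x), size = round(rho * length(x)))
  x[idx] <- x[idx][sample(length(idx))]   # permute the selected positions
  x
}

# Plug-in mutual information I(y, x) after equal-width discretization of x.
mutual_info <- function(y, x, bins = 10) {
  p_xy <- table(y, cut(x, breaks = bins)) / length(y)
  p_x <- colSums(p_xy); p_y <- rowSums(p_xy)
  sum(p_xy * log(p_xy / outer(p_y, p_x)), na.rm = TRUE)
}

rho <- 0.9; s <- 0.1
X_proxy <- apply(X, 2, make_proxy, rho = rho)   # noisy copies of the originals
X_noise <- apply(X, 2, make_proxy, rho = 1)     # fully permuted, independent of y
costs_orig  <- apply(X, 2, function(x) mutual_info(y, x))
costs_proxy <- costs_noise <- s * costs_orig    # cheaper counterparts
```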

Table 2 Mean and standard deviation of computation time in seconds for the MIMIC-II (hypertension) dataset (\(N=100\), \(p=30\), \(\rho =0.9\), \(s=0.1\))

4.4 Evaluation measures

In classification problems, one of the most common evaluation metrics is the AUC score, which evaluates the quality of the prediction. It refers to the area under the receiver operating characteristic (ROC) curve and measures the ability of a classifier to distinguish between classes. Unlike many classical measures (such as accuracy, precision and recall), AUC does not depend on a threshold for the posterior probability and therefore can be treated as a more universal evaluation measure. In addition to the predictive power of the model, we would also like to control the costs spent on irrelevant features. Therefore, we introduce a novel measure called the cost-sensitive false discovery rate (CSFDR). The CSFDR measures how much of the cost of the selected features was wasted on irrelevant variables. Formally, it is defined as follows. Let S be the set of selected features and \(C(S) = \sum _{j \in S} c_j\) be the total cost associated with the features in S. Let B be the budget for which we run the algorithm. Denote by \(N = \{x_1'', \ldots , x_p''\}\) the set of noisy features present in the dataset; the method of generating N is described in Section 4.3. The cost-sensitive false discovery rate (CSFDR) is

$$\begin{aligned} \text {CSFDR}(S) = \frac{C(S\cap N)}{C(S)}. \end{aligned}$$
(4)

A small value of CSFDR indicates that little of the cost was spent on noisy features. On the other hand, \(CSFDR(S)\approx 1\) means that most of the cost was spent on noisy features. Note that it is possible to compute this measure because the set N is known in advance in the experimental framework described in Section 4.3. Importantly, in the case of the artificial dataset, the set N matches exactly the set of irrelevant features, i.e., features which do not affect the target variable. However, in the case of real datasets, we do not know a priori the set of irrelevant features; irrelevant features can still be present among the original features \(x_1,\ldots ,x_p\). In this case, the set N is only a subset of the irrelevant features and thus CSFDR is a lower bound on the true fraction of the cost wasted on irrelevant features. Nevertheless, even in the experiments on real datasets, CSFDR is a useful measure, as it allows us to verify how much cost was spent on the artificially generated noisy features.
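
Given the indices of the selected features, the indices of the artificially generated noisy features, and the cost vector, CSFDR reduces to a one-line computation; a sketch follows (function and variable names are ours).

```r
# Cost-sensitive false discovery rate: the fraction of the spent cost
# that went to the (known) noisy features.
csfdr <- function(selected, noisy, costs) {
  sum(costs[intersect(selected, noisy)]) / sum(costs[selected])
}

# Hypothetical ordering: features 1-30 original, 31-60 proxy, 61-90 noisy.
csfdr(selected = c(2, 5, 63, 70), noisy = 61:90, costs = rep(1, 90))  # 0.5
```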

Fig. 3 Experiment results for artificial dataset for easier classification problem

4.5 Results

4.5.1 Results for artificial datasets

We analyze the results for \(\pi =0.8\) (Fig. 3, Tables 3, 4, and 5) and for \(\pi =0.5\) (Fig. 4, Tables 6, 7, and 8). The value \(\pi =0.8\) corresponds to the easier classification problem, as in this case the conditional distributions \(x\vert y=0\) and \(x\vert y=1\) are clearly separated; for \(\pi =0.5\), the classification problem is more challenging, as the conditional distributions overlap. Indeed, for \(\pi =0.8\) the curves approach \(AUC\approx 0.9\), whereas for \(\pi =0.5\) they stabilize at the level \(AUC\approx 0.75\). We present the results for \(s=0.1\), which means that the cost of the proxy and noisy features is \(10\%\) of the cost of the corresponding original features. For such a scenario, we can expect promising behavior of the cost-sensitive methods. The impact of the parameter \(\rho \) is also investigated; we consider three values: \(\rho =0.7,0.8,0.9\). As expected, the accuracy of cost-sensitive methods increases slightly with \(\rho \), since for larger \(\rho \) the proxy and original variables become indistinguishable and therefore it is possible to replace the original features by cheaper proxy features without accuracy deterioration. We observe significant differences between the methods when the budget is limited, say when it does not exceed \(40\%\) of the total cost. The poor performance of the standard method is due to the fact that it focuses on selecting the expensive original features but cannot include all of them within the assumed budget. On the other hand, cost-sensitive methods work significantly better, as they replace the original features by the cheaper proxy features and are able to fit within the assumed budget. For example, Fig. 3a shows that when the budget is \(B=10\%\) of the total cost, the AUC for the traditional lasso is around \(75\%\), for the cheap knock-off method it is \(85\%\), and it is around \(90\%\) for the remaining cost-sensitive methods. For a larger budget B, all methods are comparable. This is consistent with our expectations, as for larger B it is possible to include all relevant features and there is no need to apply cost-sensitive strategies. To verify whether the differences between the methods are statistically significant, we performed t-tests. In the first test, the null hypothesis corresponds to the equality of the mean AUC values for the winner method and the traditional lasso, whereas the alternative hypothesis states that the mean AUC for the winner method is greater than for the traditional lasso. In the second test, we compare the winner method and the second best method.
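
The tests reduce to one-sided t-tests on the AUC values collected over the \(N=100\) train-test splits; a minimal sketch is given below (shown as unpaired Welch tests, which is an assumption on our side, and the AUC vectors are assumed to be given).

```r
# One-sided tests on the AUC values collected over the N = 100 splits:
# winner vs. traditional lasso (pv 1) and winner vs. second best (pv 2).
pv1 <- t.test(auc_winner, auc_lasso,  alternative = "greater")$p.value
pv2 <- t.test(auc_winner, auc_second, alternative = "greater")$p.value
```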

Table 3 Mean and standard deviation of AUC for artificial dataset (\(N=100\), \(n=1000\), \(p=30\), \(\pi =0.8\), \(\rho =0.7\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 4 Mean and standard deviation of AUC for artificial dataset (\(N=100\), \(n=1000\), \(p=30\), \(\pi =0.8\), \(\rho =0.8\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 5 Mean and standard deviation of AUC for artificial dataset (\(N=100\), \(n=1000\), \(p=30\), \(\pi =0.8\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Fig. 4 Experiment results for artificial dataset for harder classification problem

Table 6 Mean and standard deviation of AUC for artificial dataset (\(N=100\), \(n=1000\), \(p=30\), \(\pi =0.5\), \(\rho =0.7\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 7 Mean and standard deviation of AUC for artificial dataset (\(N=100\), \(n=1000\), \(p=30\), \(\pi =0.5\), \(\rho =0.8\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 8 Mean and standard deviation of AUC for artificial dataset (\(N=100\), \(n=1000\), \(p=30\), \(\pi =0.5\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Fig. 5 Experiment results for MIMIC-II (diabetes)

In addition to the predictive power of the considered methods, we also analyze the costs wasted on selecting irrelevant features. The right-hand panels in Figs. 3 and 4 depict how the cost-sensitive false discovery rate changes with the assumed budget. Obviously, the CSFDR is close to 0 for the traditional method, as it selects very few irrelevant features. Interestingly, we observe that the CSFDR for the cheap knock-off method is usually lower than for the remaining cost-sensitive methods. This is in line with the theoretical result presented in [30], which states that for the cheap knock-off method the fraction of the cost spent on irrelevant features is bounded with high probability. On the other hand, for a small budget B, the CSFDR for the cheap knock-off method is larger than for the remaining cost-sensitive methods. Generally, we observe the highest values of CSFDR for cost-sensitive lasso and adaptive lasso. Interestingly, the CSFDR for cost-sensitive MCP (as well as for its adaptive version) is significantly lower than for cost-sensitive lasso and adaptive lasso. At the same time, MCP works on a par with cost-sensitive lasso in terms of AUC. This suggests that MCP achieves a sensible compromise between AUC maximization and CSFDR minimization.

4.5.2 Results for real datasets

As for the artificial datasets, we present the results for \(s=0.1\) and \(\rho =0.9\), which indicates that the costs of the proxy and noisy features are much lower than those of the original ones and that the proxy features are highly correlated with the original ones. Such a setting is particularly interesting in our experiments, as we can expect promising performance of the cost-sensitive methods. It turns out that most of the conclusions remain the same as for the artificial datasets. First, we observe significant differences between the traditional method and the cost-sensitive methods when the budget is limited, more precisely when it does not exceed \(20\%-40\%\) of the total cost. This is a positive message, as small budgets are precisely the regime in which cost-sensitive methods are needed.

Fig. 6 Experiment results for MIMIC-II (hypertension)

Fig. 7 Experiment results for MIMIC-II (liver)

Among the MIMIC datasets, we report the most significant differences for the liver dataset. For example, when the budget is around \(10\%\), the AUC for the traditional lasso is around \(80\%\), while the AUC for the cost-sensitive methods oscillates around \(88-90\%\). Secondly, the cheap knock-off method works slightly worse than the remaining cost-sensitive methods when the budget is very low. However, when the budget increases, the cheap knock-off method slightly outperforms the other methods for most datasets. Importantly, for larger B, the differences between the methods are usually not very pronounced; see Figs. 5, 6, 7, and 9 and Tables 9, 10, 11, 12, 13, and 14. For most datasets, the AUC curves for cost-sensitive lasso, adaptive lasso, MCP and adaptive MCP reach the plateau faster than those of the remaining methods. In the case of the Alzheimer dataset (Fig. 9c), the traditional lasso and the cheap knock-off method reach the plateau relatively late, when \(B=25\%\) of the total cost. For most of the experiments, we do not see significant differences in AUC between cost-sensitive lasso and adaptive lasso, nor between cost-sensitive MCP and its adaptive version. The most pronounced difference in model performance can be observed for the heart dataset, Fig. 9a and Table 12.

Table 9 Mean and standard deviation of AUC for MIMIC-II (diabetes) dataset (\(N=100\), \(p=30\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 10 Mean and standard deviation of AUC for MIMIC-II (hypertension) dataset (\(N=100\), \(p=30\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 11 Mean and standard deviation of AUC for MIMIC-II (liver) dataset (\(N=100\), \(p=30\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 12 Mean and standard deviation of AUC for heart dataset (\(N=100\), \(p=13\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 13 Mean and standard deviation of AUC for thyroid dataset (\(N=100\), \(p=20\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Table 14 Mean and standard deviation of AUC for alzheimer dataset (\(N=100\), \(p=9\), \(\rho =0.9\), \(s=0.1\)). The last two columns contain the p-values of the tests comparing the winner method with the traditional lasso (pv 1) and the winner method with the second best method (pv 2)
Fig. 8 Cost and relevance for the selected features for MIMIC-II (diabetes) dataset for budget \(10\%\) of all feature costs

Fig. 9 Experiment results for heart, thyroid and alzheimer datasets

The analysis of the CSFDR for the real datasets leads to conclusions similar to those for the artificial datasets. The CSFDR is close to 0 for the traditional method and for the cheap knock-off method; both select very few irrelevant features. The price for this, however, is a significantly lower AUC for small budgets. Among the remaining cost-sensitive methods, cost-sensitive MCP and its adaptive version achieve a significantly smaller CSFDR than cost-sensitive lasso and adaptive lasso. For example, in the case of the heart dataset (Fig. 9), when \(B=0.2\), the CSFDR for MCP is around three times lower than for cost-sensitive lasso. Interestingly, for the MIMIC diabetes dataset, we observe significant differences in CSFDR between lasso and adaptive lasso, as well as between MCP and adaptive MCP. The experiments suggest that cost-sensitive MCP can be recommended as the final method, as it provides a sensible compromise between AUC maximization for small budgets and CSFDR minimization. At the same time, we do not see a clear advantage of adaptive MCP over standard MCP.

To gain deeper insight into how the methods work, we analyze what types of features they are likely to select. Figure 8 shows the selected features, divided into three groups (original, proxy and noisy features), for the MIMIC diabetes dataset and a budget \(B=10\%\) of the total cost. We present the costs of the selected features as well as the mutual information values between the features and the target variable. The traditional lasso selects only two original features, which are expensive. The feature subset chosen by the cheap knock-off method contains three original features and two cheap proxy features. The feature subsets selected by the remaining methods differ substantially: they contain a significant number of proxy features that compensate for the lack of original features. The proxy features are much cheaper than the original ones and, at the same time, are highly correlated with the target variable. In addition, it can be seen that the remaining cost-sensitive methods include some noisy features. In particular, cost-sensitive lasso selects 12 noisy features, whereas MCP selects only 7, which confirms our previous finding of a smaller CSFDR for MCP compared to cost-sensitive lasso.

Furthermore, Table 2 shows the computation times for the considered methods, including the traditional lasso. The table presents the time averaged over 100 trials (train-test splits) of feature selection for the MIMIC-II (hypertension) dataset. For all budgets, the traditional lasso is the fastest, the cost-sensitive lasso is 2 times slower, and the methods based on MCP are 3 times slower. The cheap knockoff method is much slower than the others; this is due to the complexity of knockoff generation. The experiments were completed using the R language on a PC with an 8-core Intel Xeon E3-1270 processor and 32 GB of memory. The most important R libraries used in the experiments are: ncvreg (MCP implementation for penalized regression models), cheapknockoff (implementation of cheap knockoffs), pROC (AUC computation), and base R for the other measures.

5 Conclusions

The main goal of this paper was to compare cost-sensitive variants of existing penalized empirical risk minimization methods. Importantly, modifications of some of the methods (for example, cost-sensitive MCP) have not been explored in previous work. We proposed a novel experimental framework that allows us to analyze the impact of various parameters, such as the dependence strength or the cost ratio between different types of features, and to generate costs artificially. In this way, cost-sensitive methods can be compared on datasets for which cost information is not provided. The experiments performed on artificial and real medical datasets indicate that cost-sensitive methods achieve higher accuracy than traditional feature selection methods such as lasso when the available budget is limited. As expected, we observe a significant advantage of the cost-sensitive methods when there exist proxy features which are highly correlated with the original features but at the same time are much cheaper. In addition, we considered the cheap knockoff method, which has desirable theoretical properties related to controlling the cost-sensitive FDR. However, our results show that when the budget is low, the cheap knockoff method usually achieves lower predictive power than the remaining cost-sensitive approaches. An interesting and important conclusion is that, among the considered penalties, the non-convex functions give the most promising results: they achieve a trade-off between maximizing predictive power and reducing the cost-sensitive FDR. Therefore, these methods (in particular cost-sensitive MCP) can be recommended for real cost-sensitive tasks.