Adaptive Covariate Acquisition for Minimizing Total Cost of Classification

In some applications, acquiring covariates comes at a non-negligible cost. For example, in the medical domain, in order to classify whether a patient has diabetes or not, measuring glucose tolerance can be expensive. Assuming that the cost of each covariate and the cost of misclassification can be specified by the user, our goal is to minimize the (expected) total cost of classification, i.e. the cost of misclassification plus the cost of the acquired covariates. We formalize this optimization goal using the (conditional) Bayes risk and describe the optimal solution using a recursive procedure. Since the procedure is computationally infeasible, we consequently introduce two assumptions: (1) the optimal classifier can be represented by a generalized additive model, (2) the optimal sets of covariates are limited to a sequence of sets of increasing size. We show that under these two assumptions, a computationally efficient solution exists. Furthermore, on several medical datasets, we show that the proposed method achieves, in most situations, the lowest total costs compared to various previous methods. Finally, we weaken the requirement on the user to specify all misclassification costs by allowing the user to specify the minimally acceptable recall (target recall). Our experiments confirm that the proposed method achieves the target recall while minimizing the false discovery rate and the covariate acquisition costs better than previous methods.


Introduction
In some applications, acquiring covariates comes at a non-negligible cost. For example, in the medical domain, in order to classify whether a patient has diabetes or not, measuring glucose tolerance can be expensive. On the other hand, glucose tolerance can be a good indicator for diabetes, i.e. it increases our chances of predicting diabetes (or its absence) correctly.
The example illustrates that in the medical domain we often have to strike a balance between classification accuracy and the cost of acquiring the covariates. A rational criterion for deciding on the best trade-off is to minimize the expected total cost of classification: the expected cost of misclassification plus the cost of the acquired covariates.
In the first part of this article, we formalize the optimization of the expected total cost of classification using the (conditional) Bayes risk and describe the optimal solution using a recursive procedure. However, it turns out that the procedure is computationally infeasible due to two factors: (1) calculating the Bayes risk requires estimating a high-dimensional integral, (2) the number of different covariate acquisition paths is exponential in the number of covariates.
As a consequence, we introduce two assumptions: (1) the optimal classifier can be represented by a generalized additive model (GAM), (2) the optimal sets of covariates are limited to a sequence of sets of increasing size. We show that under these two assumptions, a computationally efficient solution exists.
Our framework requires that the user can specify the cost of misclassification: the false positive cost (cost of wrongly classifying a healthy person as having diabetes), and false negative cost (cost of classifying a diabetes patient as healthy). However, we show that the requirement on the user to specify the false negative cost can be replaced by the specification of a lower bound on the recall. This is motivated by the medical domain where it is more common to specify the minimally acceptable recall (target recall), rather than specifying the false negative cost.
Our main contributions are as follows: 1. We describe the optimal solution for minimizing the expected total cost of classification, which has not been clarified in previous works like (Dulac-Arnold et al., 2012) and (Shim et al., 2018).
2. We prove that for a GAM, the estimation of the (conditional) Bayes risk reduces to a one-dimensional density estimation and integral, which can be computed efficiently.
3. We propose an effective heuristic to estimate an optimal monotone increasing sequence of covariate sets by learning the regression coefficients of GAM with a group lasso penalty.
4. We prove that our framework can be used to guarantee a user-specified recall level, like 95% which is common in the medical domain.
5. We show on four medical datasets that the proposed method can lead to lower total cost of classification than the previous works in (Ji and Carin, 2007; Dulac-Arnold et al., 2012; Xu et al., 2012; Nan and Saligrama, 2017; Shim et al., 2018). Furthermore, evaluation under the requirement of a target recall shows that the proposed method achieves the target recall while minimizing the remaining costs and false discovery rate (FDR) better than previous works.
This article extends our preliminary work in (Andrade and Okajima, 2019) by replacing the linear classifier with GAM, allowing the specification of a minimal recall, determining the sequence of covariate sets by the solution path of a group lasso penalized convex optimization problem, and additional experimental evaluations on two more medical datasets.
In the next section, we formalize the optimal decision procedure to achieve, in expectation, the lowest total cost of classification. In Section 3, we introduce a (non-adaptive) method that minimizes an upper bound of the lowest achievable total cost, which we extend in Section 4 to an adaptive method. In Section 5, we explain two approximations for finding a sequence of monotone increasing covariate sets that is used by the proposed method. In Section 6, we show how the proposed framework can also be used for guaranteeing a target recall. Extensive empirical evaluations of our proposed method and previous methods are provided in Section 7. In Section 8, we give a concise review of related work. Finally, in Section 9, we summarize our findings.

A cost-rational selection criterion
Let L := {l_1, ..., l_c} denote the set of class labels, and c_{y,y*} the cost of classifying a sample as class y*, when the true label is y. A decision procedure δ* : R^p → L for which ∀δ : E_{x,y}[c_{y,δ(x)}] ≥ E_{x,y}[c_{y,δ*(x)}] is called a Bayes procedure. The following procedure δ* is a Bayes procedure (for a proof see, for example, Theorem 6.7.1 in Anderson (2003)):

δ*(x) = argmin_{y* ∈ L} Σ_{y ∈ L} c_{y,y*} p(y | x).   (1)

The expected misclassification cost of the Bayes procedure, i.e. E_{x,y}[c_{y,δ*(x)}], is called the Bayes risk.
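As a small illustration of the rule in Equation (1), the following sketch computes the cost-minimizing label from a given posterior and cost matrix; the toy numbers are hypothetical:

```python
import numpy as np

def bayes_procedure(posterior, cost):
    """Classify by minimizing the expected misclassification cost.

    posterior: shape (c,), the probabilities p(y | x) for the c labels.
    cost: shape (c, c), cost[y, y_star] = c_{y, y*}.
    Returns the index of the cost-minimizing label y*.
    """
    # expected cost of predicting y*:  sum_y p(y | x) * c_{y, y*}
    expected_cost = posterior @ cost
    return int(np.argmin(expected_cost))

# Hypothetical two-class example with asymmetric costs
# (false negatives ten times as expensive as false positives):
posterior = np.array([0.7, 0.3])        # p(y=0 | x), p(y=1 | x)
cost = np.array([[0.0, 1.0],            # c_{0,0}, c_{0,1}
                 [10.0, 0.0]])          # c_{1,0}, c_{1,1}
print(bayes_procedure(posterior, cost))  # predicts 1: 0.3 * 10 > 0.7 * 1
```

Here the expected cost of predicting label 0 is 0.3 · 10 = 3.0, while predicting label 1 costs only 0.7 · 1 = 0.7, so the Bayes procedure predicts 1 despite the higher posterior mass on 0.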
Let us denote by V := {1, ..., p} the index set of the covariates, with V ∩ L = ∅. We denote the Bayes procedure for classifying a sample based only on the covariates S ⊆ V by δ*_S : R^{|S|} → L. That means

δ*_S(x_S) = argmin_{y* ∈ L} Σ_{y ∈ L} c_{y,y*} p(y | x_S).   (2)

When it is clear from the context, we drop the index on δ*_S and just write δ*(x_S) instead of δ*_S(x_S).¹

Optimal Procedure
The classical definition of Bayes procedure does not consider the cost of covariate acquisition, and assumes that all covariates are acquired at once. Therefore, let us first formally extend the definition appropriately. We use the following definition of a decision procedure.
Definition 1. A function of the form

π : R^p × 2^V → L ∪ V

such that, for all x ∈ R^p and S ⊆ V,

π(x, S) = π(x ⊙ 1_S, S),   (3)
π(x, S) ∈ L ∪ (V \ S),   (4)

is called a decision procedure.² The condition in Equation (3) means that a decision procedure uses only the covariates that are indexed by S; the condition in Equation (4) means that a decision procedure cannot select a covariate that is already in S. In summary, the decision procedure π(x, S) either classifies the current sample, or selects a new covariate based on the observations x_S. To simplify the notation, we write π(x_S) instead of π(x, S). Furthermore, we denote the cost of acquiring covariate i by c_i.
If not stated otherwise, we assume that all costs are non-negative, i.e. c i ≥ 0, and c y,y ≥ 0.
Theorem 1. The decision procedure π* defined by the recursion

loss(x_S) := min( min_{y* ∈ L} Σ_{y ∈ L} c_{y,y*} p(y | x_S),  min_{i ∈ V \ S} (c_i + E_{x_i}[loss(x_{S ∪ {i}}) | x_S]) ),   (6)

where π*(x_S) classifies with δ*(x_S) if the first term attains the minimum and otherwise acquires the minimizing covariate i, is a Bayes procedure. That means, for any other decision procedure π, the expected total cost of π is at least that of π*. The proof is given in the appendix. We note that, if the covariates are discrete, we can formulate the problem as a stationary Markov decision process (MDP) where every policy leads to a terminal state (Bayer-Zubek, 2004). The Bayes procedure from Theorem 1 is then equivalent to the optimal policy defined by the Bellman updates with the discounting factor set to 1 (Russell and Norvig, 2003).

² ⊙ denotes the Hadamard product, and 1_S ∈ R^p is the vector that is one in all positions indexed by S, and zero otherwise.
For continuous covariates, implementing the exact decision procedure π* is, in general, intractable. The reason is that in order to recursively evaluate the loss, we need to evaluate a sequence of interleaved minimizations and expectations. Therefore, we propose two relaxations and corresponding methods, named Cost-sensitive Covariate Selection (COS) and Adaptive Cost-sensitive Forward Selection (AdaCOS), in Sections 3 and 4, respectively.

Cost-sensitive Covariate Selection (COS)
Our first relaxation is to pull out all minimizations in the recursion of Equation (6), which leads to the following upper bound:

U(S) := E_{x,y}[c_{y, δ*(x_S)}] + Σ_{i ∈ S} c_i.

In the following, we denote this upper bound by U. However, directly minimizing U over all sets S is still computationally difficult due to the exponential number of possible sets S. Note that this is similar to covariate selection in logistic classification, as in, e.g., Tibshirani (1996); O'Hara et al. (2009). Two important differences are that, in general, the costs associated with the covariates differ from each other, and that the Bayes risk needs to be evaluated for all possible subsets S ⊆ V. The situation is illustrated in Figure 1.
We denote the method selecting the set S* that minimizes U by Cost-sensitive Covariate Selection (COS). In order to approximately find the set S*, we will use the methods described in Section 5. One disadvantage of COS is that it always selects the same set of covariates S* for every sample, although for some samples fewer/more covariates might be sufficient/necessary for good classification accuracy.
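For small p, the set S* minimizing U can be found by brute force; the sketch below, with hypothetical risk estimates supplied as a lookup table, makes the exponential enumeration explicit:

```python
import itertools

def cos_select(bayes_risk, cov_costs):
    """Brute-force COS: return the subset S minimizing the upper bound
    U(S) = estimated Bayes risk using S + sum of covariate costs in S.

    bayes_risk: dict frozenset(S) -> estimated misclassification risk
                (in practice estimated by cross-validation).
    cov_costs: list of per-covariate acquisition costs c_i.
    """
    p = len(cov_costs)
    best_S, best_U = None, float("inf")
    for r in range(p + 1):
        for S in itertools.combinations(range(p), r):
            U = bayes_risk[frozenset(S)] + sum(cov_costs[i] for i in S)
            if U < best_U:
                best_S, best_U = set(S), U
    return best_S, best_U

# Hypothetical risk estimates for p = 2 covariates:
risk = {frozenset(): 5.0, frozenset({0}): 3.0,
        frozenset({1}): 2.0, frozenset({0, 1}): 1.5}
print(cos_select(risk, [1.0, 2.0]))  # ({0}, 4.0)
```

The loop over all 2^p subsets is exactly what makes the exact minimization of U infeasible for larger p and motivates the approximations of Section 5.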

Adaptive Cost-sensitive Forward Selection
The method COS proposed in the previous section is not adaptive, i.e. it does not take into account the actually observed covariates when deciding whether to proceed with the acquisition of additional covariates, or whether to classify based on the covariates observed so far. However, as discussed in Section 2.1, without any additional assumptions, estimating the optimal procedure π* from Theorem 1 is computationally infeasible. We therefore introduce two assumptions:

1. The optimal set S of acquired covariates belongs to S = {S_1, S_2, ..., S_q}, where S_1 ⊂ S_2 ⊂ ... ⊂ S_q ⊆ V.

2. The conditional class probability p(y | x_S), for S ∈ S, belongs to a logistic generalized additive model.

Figure 1: Shows an example of the expected total cost of classification at the beginning, when no covariates have been acquired. Each edge represents one decision: either asking for the value of a covariate, or classifying the sample based on the covariates observed so far using the Bayes procedure. Each leaf shows the expected total cost of classification when using the covariates that were selected on the path from the root to the leaf. If we do not re-evaluate the expected cost after acquiring a new covariate, we will always select the same set of covariates, which corresponds to the method COS.
Before we proceed, let us introduce our definition of future costs. Let A ⊆ V and S ⊆ V \ A; then we define

F_{x_A}(S) := E_{x_S, y}[c_{y, δ*(x_{A ∪ S})} | x_A] + Σ_{i ∈ S} c_i.   (7)

F_{x_A}(S) is the expected total additional cost of classification when we have already acquired the covariates A, and are planning to additionally acquire the covariates S before classifying. In particular, the upper bound U can be expressed as min_{S ⊆ V} F_{x_∅}(S). Our approximation of the Bayes procedure π* from Theorem 1 is given in Algorithm 1. First, we acquire all covariates indexed by S_1, and then check whether acquiring any additional covariates from S_2 \ S_1, ..., S_q \ S_1 reduces the total cost of classification in expectation. If that is the case, we acquire the covariates in S_2 \ S_1, and proceed analogously. If the total cost of classification is not expected to decrease with more covariates, we stop and classify based on the covariates acquired so far. An example of the procedure is shown in Figure 2.
Algorithm 1: Adaptive Cost-sensitive Forward Selection (AdaCOS) for classifying a test sample.
The algorithm is adaptive in the sense that the expected future costs F_{x_A}(S) depend on the covariates x_A observed so far. Therefore, we see that the effectiveness of the algorithm hinges on the non-trivial task of calculating F_{x_A}(S).
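The control flow of the adaptive procedure can be sketched as follows; the interfaces (`future_cost` standing in for an estimate of F_{x_A}(S), and `classify` for the Bayes classification) are assumed callbacks, not the paper's exact pseudocode:

```python
def adacos_classify(x, sets, future_cost, classify):
    """Sketch of AdaCOS (Algorithm 1) for one test sample.

    x: dict index -> value, acting as the acquisition oracle.
    sets: nested covariate index sets S_1 ⊂ S_2 ⊂ ... ⊂ S_q.
    future_cost(x_A, S): estimate of F_{x_A}(S); S = empty set gives
        the expected cost of classifying now.
    classify(x_A): Bayes classification based on observed covariates.
    Returns the predicted label and the set of acquired covariates.
    """
    A = set(sets[0])
    x_A = {i: x[i] for i in A}                      # acquire S_1
    for j in range(1, len(sets)):
        classify_now = future_cost(x_A, set())
        extend = min(future_cost(x_A, set(S) - A) for S in sets[j:])
        if classify_now <= extend:                  # no expected benefit
            break
        A |= set(sets[j])                           # acquire next set
        x_A = {i: x[i] for i in A}
    return classify(x_A), A
```

In the sketch, `future_cost` would be computed from the conditional Bayes risk estimate of Section 4.1 plus the acquisition costs of the covariates in S.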

Bayes Risk Estimation
The main challenge in evaluating the future costs is to estimate the multi-dimensional integral in E_{x_S, y}[c_{y, δ*(x_{A∪S})} | x_A]. By assuming that the conditional class probability p(y | x_{A∪S}) can be modeled by a logistic generalized additive model, we will show that it is possible to reduce the multi-dimensional integral to a one-dimensional one that admits an effective approximation.
The logistic generalized additive model is defined as follows. Given the regression coefficients β = (β_1, β_2, ..., β_p) ∈ R^{s·p} and intercept τ, the conditional class probability p(y = 1 | x, β, τ) is modeled as

p(y = 1 | x, β, τ) = g(τ + Σ_{i=1}^p f_i(x_i)^T β_i),

where g denotes the sigmoid function, and the non-linear transformations f_i : R → R^s are learned from the training data using penalized B-splines, where s is the number of splines (Hastie et al., 2009).³ Furthermore, for simplicity, we assume that there are only two class labels {0, 1}, and c_{0,0} = c_{1,1} = 0.⁴ We then have, writing

z := τ + Σ_{i ∈ A ∪ S} f_i(x_i)^T β_i  and  z* := g^{-1}(c_{0,1} / (c_{0,1} + c_{1,0})),

that δ*(x_{A∪S}) = 1 if and only if z ≥ z*. We see that δ*(x_{A∪S}) depends only on z (a random variable) and z* (fixed). In the following, to simplify notation, let us denote by h(z) the conditional density p(z | x_A). We thus have

E_{x_S, y}[c_{y, δ*(x_{A∪S})} 1_{δ*(x_{A∪S}) = 1} | x_A] = c_{0,1} ∫_{z*}^{∞} h(z) (1 − g(z)) dz,

where we used that c_{1,1} = 0. Analogously, we have

E_{x_S, y}[c_{y, δ*(x_{A∪S})} 1_{δ*(x_{A∪S}) = 0} | x_A] = c_{1,0} ∫_{−∞}^{z*} h(z) g(z) dz.

Thus the remaining task is to evaluate integrals of the form

∫_a^b h(z) g(z) dz,   (8)

where a or b is finite. We assume that h(z) = p(z | x_A) can be well approximated by a normal distribution with mean μ_z and variance σ². We defer the explanation of how to estimate μ_z and σ² to Section 4.1.1. The integral in Equation (8) has no analytic solution. One popular strategy is to approximate the sigmoid function g by the cumulative distribution function of the standard normal distribution Φ, as in Gaussian process classification (Rasmussen and Williams, 2006). However, it turns out that this approximation is not applicable here, since a or b is a finite real number in our case. Instead, we use the fact that the sigmoid function can be well approximated by a small number of linear functions. Standardizing z, we can write the integral in Equation (8) as

∫_{(a − μ_z)/σ}^{(b − μ_z)/σ} φ(u) g(σu + μ_z) du,   (9)

where φ denotes the standard normal density. Let us define the following piece-wise linear approximation of the sigmoid function: for 1 ≤ t ≤ ξ + 1, we set b_t := −10 + (20/ξ)(t − 1), and on each interval [b_t, b_{t+1}], for 1 ≤ t ≤ ξ, g is approximated by the linear function through (b_t, g(b_t)) and (b_{t+1}, g(b_{t+1})); outside [−10, 10], g is approximated by 0 and 1, respectively. Here, ξ is the number of linear pieces, which is, for example, set to 40.
A comparison with the approximation Φ(√(π/8) u) is shown in Figure 3. This means that with relatively few linear pieces, we can achieve an approximation that is more accurate than the Φ-approximation. More importantly, as we show below, this allows for a tractable calculation of the integral in Equation (9), which is not the case when using the Φ-approximation.
Each piece contributes the integral of a linear function against a normal density; using the substitution r := u − μ, each contribution can be expressed in terms of Φ and the standard normal density, and can therefore be well approximated with standard numerical libraries.
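A minimal sketch of the resulting computation: the sigmoid is replaced by chords over ξ pieces on [−10, 10] (chord interpolation for the slopes and intercepts is our assumption), each piece is integrated against the normal density in closed form via Φ, and the tails use g ≈ 0 and g ≈ 1:

```python
import math

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):  # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_sigmoid(mu, sigma, xi=40):
    """Approximate E[g(z)] for z ~ N(mu, sigma^2) using a piecewise
    linear approximation of the sigmoid g on [-10, 10] with xi pieces.
    Each linear piece m*z + c integrates against the normal density in
    closed form via Phi and phi."""
    total = 1.0 - Phi((10.0 - mu) / sigma)     # tail z > 10, where g ~ 1
    bs = [-10.0 + 20.0 * t / xi for t in range(xi + 1)]
    for l, u in zip(bs[:-1], bs[1:]):
        m = (sigmoid(u) - sigmoid(l)) / (u - l)  # slope of the chord
        c = sigmoid(l) - m * l                   # intercept of the chord
        a, b = (l - mu) / sigma, (u - mu) / sigma
        mass = Phi(b) - Phi(a)
        # closed form of integral of (m*z + c) * N(z; mu, sigma^2) on [l, u]
        total += m * (mu * mass + sigma * (phi(a) - phi(b))) + c * mass
    return total
```

For instance, E[g(z)] for z ~ N(0, 1) equals 0.5 by symmetry, which the approximation reproduces closely; the same routine handles the truncated integrals by restricting the set of pieces to [a, b].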

Estimation of μ_z and σ²
Recall that we assumed that p(z | x_A) can be well approximated by a normal density with mean μ_z and variance σ². In order to estimate μ_z and σ², we propose to model z given x_A as a regression problem with additive noise, where z is the response variable and x_A are the explanatory variables. In detail, for learning the regression model from the training data {x^{(k)}}_{k=1}^n, we prepare a collection of response and explanatory variable pairs of the form (z^{(k)}, x_A^{(k)}), with z^{(k)} := τ + Σ_{i ∈ A ∪ S} f_i(x_i^{(k)})^T β_i. We note that for training the regression model, we do not require the class label y. As a consequence, in addition to the class-labeled training data, we could also exploit unlabeled training data (if available).⁵ For our experiments, we use a standard Bayesian linear regression model with a scaled inverse χ² distribution prior on the noise variance (Gelman et al., 2013).
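As a simplified stand-in for the Bayesian linear regression used here (the scaled-inverse-χ² prior is omitted), ordinary least squares already illustrates how μ_z and σ² are obtained from the prepared pairs:

```python
import numpy as np

def fit_z_regression(X_A, z):
    """Sketch: estimate E[z | x_A] and the residual variance by ordinary
    least squares (the paper uses Bayesian linear regression with a
    scaled-inverse-chi^2 prior; this is a simplified stand-in).

    X_A: (n, |A|) observed covariates; z: (n,) linear predictor values.
    Returns the weight vector (intercept first) and the noise variance.
    """
    X = np.column_stack([np.ones(len(z)), X_A])   # add intercept column
    w, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ w
    sigma2 = resid @ resid / max(len(z) - X.shape[1], 1)
    return w, sigma2

def predict_mu_z(w, x_A):
    """mu_z for a new sample with observed covariates x_A."""
    return w[0] + np.asarray(x_A) @ w[1:]
```

Given a test sample's observed covariates x_A, `predict_mu_z` supplies μ_z and `sigma2` supplies σ² for the normal approximation of p(z | x_A).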

Cost-aware non-linear covariate selection
In the previous section, we assumed that the covariates are acquired in a specific sequence of nested sets. In this section, we discuss two different approximation strategies for finding a sequence of subsets S_1 ⊂ S_2 ⊂ ... ⊂ S_q such that the expected total cost of classification tends to be minimal for some set S ∈ S.

Forward Selection
We suggest setting q := p + 1 and using greedy forward selection, as outlined in Algorithm 2.
Note that from the definition in Equation (7), we have

E_{x_S}[F_{x_S}({j})] = E_{x_{S∪{j}}, y}[c_{y, δ*(x_{S∪{j}})}] + c_j,

which we estimate for each cross-validation fold f with held-out samples A_f by

(1 / |A_f|) Σ_{k ∈ A_f} c_{y^{(k)}, δ_f(x^{(k)}_{S∪{j}})} + c_j,

where the model for the conditional probability p(y | x_{S∪{j}}) used by δ_f is trained using the samples in {1, ..., n} \ A_f. The final estimate is acquired by averaging over all folds. An advantage of the above forward-selection procedure is that it uses an unbiased estimate of E_{x_S}[F_{x_S}({j})], and assuming the variance is not too large, we can expect to find a good local minimum.
However, there are several disadvantages. First, if the variance of the estimator is high, we might get stuck in a bad local minimum. Second, the forward-selection procedure is extremely computationally expensive and, as a consequence, infeasible if p is large. Finally, a more subtle disadvantage is that it requires the full specification of the misclassification costs, i.e. the specification of c_{0,1} and c_{1,0}. As a consequence, it is not applicable when we are provided only with c_{0,1}, which we will discuss in Section 6.
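The greedy construction of the nested sequence can be sketched as follows, with the cross-validated estimate of E_{x_S}[F_{x_S}({j})] abstracted into an assumed `score` callback:

```python
def forward_selection_path(p, score):
    """Greedy forward selection: build the nested sequence
    S_1 = {} ⊂ S_2 ⊂ ... ⊂ S_{p+1} = V by repeatedly adding the
    covariate j with the lowest estimated expected future cost.

    score(S, j): estimate of E_{x_S}[F_{x_S}({j})] (in practice
    obtained by cross-validation as described above).
    """
    S, path = set(), [set()]
    while len(S) < p:
        j = min((j for j in range(p) if j not in S),
                key=lambda j: score(S, j))
        S = S | {j}
        path.append(set(S))
    return path
```

Each call to `score` hides a full cross-validated model fit, which is exactly what makes the procedure expensive for large p.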

Group Lasso Penalty
Some of the disadvantages of the forward-selection method can be overcome by jointly training the model for the conditional probability p(y|x) with a sparsity-enforcing penalty on the regression coefficients. This is possible here since we assume a generalized additive model for p(y|x).
In particular, we propose to acquire the sets S_1 ⊂ S_2 ⊂ ... ⊂ S_q by using the solution path of a penalized logistic loss function. In detail, for different values of λ, we solve the following convex optimization problem:

min_{β, τ}  − Σ_{k=1}^n log p(y^{(k)} | x^{(k)}, β, τ) + λ Σ_{i=1}^p c_i ||β_i||_2,   (10)

where β = (β_1, β_2, ..., β_p) ∈ R^{s·p}. The group lasso penalty ensures that the regression coefficients β_i are either all zero or all non-zero (Hastie et al., 2015). Note that in Equation (10) each group is scaled by c_i, which ensures that the regression coefficients of covariates with high cost are penalized more.⁶ As a consequence, in order to be included in the final model, covariates with high cost are required to lower the negative log-likelihood term more than covariates with low cost. By inspecting the solution path for decreasing values λ_1 > λ_2 > ... > λ_q, we acquire the sets S_1, ..., S_q. As before, the conditional class probability p(y = 1 | x, β, τ) is modeled as g(τ + Σ_i f_i(x_i)^T β_i), where the non-linear transformations f_i : R → R^s, with s the number of splines, are learned from the training data using penalized B-splines (Hastie et al., 2009).
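A minimal proximal-gradient sketch of the cost-scaled group lasso in Equation (10) (the fixed step size and plain gradient steps are simplifications; production code would use a dedicated solver):

```python
import numpy as np

def group_lasso_logreg_path(X, y, groups, costs, lambdas, lr=0.1, iters=500):
    """Sketch: cost-scaled group-lasso logistic regression solved by
    proximal gradient descent.

    groups: for each covariate i, the column indices of its basis
            expansion f_i(x_i); each group's penalty is scaled by c_i.
    Returns, for each lambda (given in decreasing order), the set of
    covariates with a non-zero coefficient group.
    """
    n, d = X.shape
    beta, tau, sets = np.zeros(d), 0.0, []
    for lam in lambdas:
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(X @ beta + tau)))
            tau -= lr * np.mean(p - y)            # gradient step, intercept
            beta -= lr * (X.T @ (p - y) / n)      # gradient step, groups
            # proximal step: group soft-thresholding, scaled by cost c_i
            for i, idx in enumerate(groups):
                norm = np.linalg.norm(beta[idx])
                thresh = lr * lam * costs[i]
                beta[idx] = 0.0 if norm <= thresh else beta[idx] * (1 - thresh / norm)
        sets.append({i for i, idx in enumerate(groups)
                     if np.linalg.norm(beta[idx]) > 0})
    return sets
```

As λ decreases, groups enter the model in an order that trades off predictive value against acquisition cost, yielding the (approximately nested) sets of active covariates.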

Extension to Classification With Recall Guarantees
So far, we assumed that both misclassification costs c_{0,1} and c_{1,0} are given. Arguably, the false positive cost c_{0,1} is relatively easy to specify. For example, in the medical domain, it might correspond to the price of a medicine that was unnecessarily prescribed to a healthy patient.
On the other hand, the specification of c_{1,0} is more difficult. For example, specifying the cost of a dead patient (who might have been saved) is difficult. Therefore, in the medical domain, it is more common to try to make a guarantee on the recall.⁷ In particular, it is common practice to require that the recall is 95% (Kanao et al., 2009).
In the following, we show how to estimate the false negative cost c_{1,0}, given the false positive cost c_{0,1} and the requirement that the recall is greater than or equal to some value r.
In the following, we denote by 1_M the indicator function, which is 1 if the expression M is true and 0 otherwise. For a threshold t ∈ [0, 1], let δ_t be the classifier that outputs 1 if and only if p(y = 1 | x) ≥ t, and let

R_δ := E_{x | y=1}[1_{δ(x) = 1}]

denote its recall. In order to emphasize the dependence on t, we write in the following R_t instead of R_δ. In particular, let t* be chosen such that

R_{t*} = r,   (12)

where r is, for example, 0.95. Then the implicitly defined cost c_{1,0} is given by

c_{1,0} = c_{0,1} (1 − t*) / t*.   (13)

It remains to show how t* can be estimated. In general, Equation (12) does not have a solution (in terms of t). We therefore solve the following problem, which always has a solution (since t = 0 fulfills the constraint):

t* = max {t : R_t ≥ r}.   (14)

Since p(x | y = 1) is unknown, we use the empirical training data distribution to estimate R_t:

R̂_t = (1 / n_1) Σ_{k: y^{(k)} = 1} 1_{p^{(−k)}(y=1 | x^{(k)}) ≥ t},   (15)

where n_1 is the number of positive samples (i.e. label y = 1), and p^{(−k)}(y = 1 | x^{(k)}) is the estimate of p(y = 1 | x^{(k)}) of the classifier that was trained without sample k. In practice, since this type of leave-one-out estimation is computationally expensive, we use 10-fold cross-validation instead. Since the expression in Equation (15) is a monotone decreasing step function in t, we can easily solve the problem in (14) by sorting the values p^{(−k)}(y = 1 | x^{(k)}) of the n_1 positive samples in decreasing order.
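The threshold search and the implied false-negative cost can be sketched directly; here t = c_{0,1} / (c_{0,1} + c_{1,0}) is the cost-based classification threshold, so c_{1,0} = c_{0,1}(1 − t*)/t*:

```python
import numpy as np

def threshold_for_recall(pos_scores, r):
    """Return the largest threshold t* such that the empirical recall on
    the positive samples is at least r.

    pos_scores: cross-validated probabilities p(y=1 | x) of the samples
                with true label y = 1.
    """
    s = np.sort(np.asarray(pos_scores))[::-1]      # decreasing order
    k = int(np.ceil(r * len(s))) - 1               # need >= r*n_1 scores above t
    return float(s[k])

def implied_false_negative_cost(t_star, c01):
    """Back out c_{1,0} from the threshold t = c01 / (c01 + c10)."""
    return c01 * (1.0 - t_star) / t_star

# Hypothetical cross-validated scores of ten positive samples:
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
t = threshold_for_recall(scores, 0.8)
print(t, implied_false_negative_cost(t, 1.0))  # 0.2 and c10 = 4.0
```

With t* = 0.2, exactly 8 of the 10 positive scores lie at or above the threshold, so the empirical recall is 0.8 as required.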

Adaptive Covariate Acquisition With Recall Guarantees
So far, we discussed how to estimate c_{1,0} in the situation where only one classifier based on p(y = 1 | x) is used. However, in the adaptive acquisition setting, the situation is more complicated since, in general, for different observed sets of variables, the conditional class probabilities differ. In particular, let S ⊂ S' ⊆ V, where V is the set of all variables. Then, in general, we have

p(y = 1 | x_S) ≠ p(y = 1 | x_{S'}),

which means that, in general, the optimal threshold t* which guarantees recall ≥ r is different for different sets of observed variables S. Furthermore, in the adaptive setting, estimating the recall using Equation (15) is not valid anymore, since the distribution of the samples with label y = 1 for which we select the variable set S is, in general, different from p(x | y = 1), i.e. p(x | y = 1, acquired S) ≠ p(x | y = 1). Nevertheless, we show in the following that it is possible to define the cost c_{1,0} such that the recall requirement is fulfilled. First, let us introduce the following notation. Let S_1 ⊂ S_2 ⊂ ... ⊂ S_q be the sets of variables that are considered for adaptive variable acquisition, i.e. first we acquire S_1, and then we decide whether to classify or whether to additionally acquire the variables in S_2 \ S_1, and so forth. Moreover, let δ_{t,S} be the classifier based on the observed variables S and using threshold t, i.e. δ_{t,S}(x_S) = 1 if and only if p(y = 1 | x_S) ≥ t. We require that the thresholds t_1, ..., t_q are chosen such that

P(δ_{t_1,S_1}(x_{S_1}) = 1, ..., δ_{t_q,S_q}(x_{S_q}) = 1 | y = 1) ≥ r.   (16)
This can be seen as follows. Assume that y = 1 and that some classifier δ_{t_j,S_j} outputs label 0; then an adversarial selection strategy will select this classifier. Otherwise, if all classifiers output label 1, then even an adversarial selection strategy has to select a classifier δ_{t_j,S_j} whose output is 1. By the requirement of Inequality (16), the latter happens with probability at least r.
If we require that all thresholds are the same, i.e.

Experiments
We evaluate our proposed method on four real datasets from the medical domain which are frequently used for cost-sensitive classification: the Pima Diabetes dataset (p = 8, n = 768), the Wisconsin Breast Cancer dataset (p = 10, n = 683), the Heart-disease dataset (p = 13, n = 303), and the PhysioNet dataset (p = 30, n = 12000). The first three datasets are available at the UCI Machine Learning repository⁸; the PhysioNet data is available at https://archive.physionet.org/pn3/challenge/2012/. Note that the PhysioNet data (Goldberger et al., 2000) contains, for each patient, several health check measures like cholesterol, taken at different times during their stay in the intensive care unit. As in (Shim et al., 2018), for each patient we use the last recorded value of each attribute to predict death (y = 1) or survival (y = 0). After filtering attributes which are mostly missing, we obtain a dataset with 12000 patients and 30 attributes.
For Diabetes and Heart-disease we use the covariate costs as defined in (Ji and Carin, 2007), and (Turney, 1994), respectively. For the other datasets, we set the covariate costs uniformly to one.
Note that the Heart-disease and PhysioNet data contain missing values. For methods which cannot handle missing values (including our proposed methods) we assume that all covariates are jointly distributed according to a multivariate normal distribution, where the covariance matrix is estimated from all samples (including missing values) using the method from Lounici et al. (2014).
We compare the proposed method AdaCOS to the following methods:

COS

The proposed method, but fixing the covariate set S ∈ {S_1, S_2, ..., S_q} to the one which minimizes the total costs in expectation, i.e.

S* = argmin_{S ∈ {S_1, ..., S_q}} E_{x,y}[c_{y, δ*(x_S)}] + Σ_{i ∈ S} c_i,

which is estimated using 10-fold cross-validation as in Section 5.1.⁹

Full Model
The logistic generalized additive model which always acquires (and uses) all covariates.

Shim2018
The cost-sensitive classification method based on deep reinforcement learning proposed in (Shim et al., 2018).¹⁰

GreedyMiser

The cost-sensitive tree construction method proposed in (Xu et al., 2012).¹¹

AdaptGbrt

The method proposed in (Nan and Saligrama, 2017), which requires the specification of a high-accuracy classifier, for which we use the Full Model.¹² For all methods we estimate the hyperparameters with 10-fold cross-validation, except where this is too computationally expensive: for Shim2018 we use the hold-out data split as in their provided implementation, and for AdaptGbrt we use 5-fold cross-validation.
As evaluation measure, we use the average total cost of classification, defined as

avg total cost := (1 / n_t) Σ_{k=1}^{n_t} (c_{y^{(k)}, y*^{(k)}} + Σ_{i ∈ S^{(k)}} c_i),

where n_t is the number of test samples; y^{(k)} and y*^{(k)} are the k-th true and predicted test class, respectively; and S^{(k)} is the set of covariates that were used by the prediction model for classifying the k-th sample.
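The evaluation measure can be written as a short helper; the toy inputs below are hypothetical:

```python
def avg_total_cost(y_true, y_pred, used_sets, mis_cost, cov_costs):
    """Average total cost of classification: misclassification cost plus
    the acquisition cost of the covariates actually used per sample.

    mis_cost[y][y_star]: user-specified misclassification costs.
    used_sets: for each sample, the set of acquired covariate indices.
    """
    total = 0.0
    for y, yp, S in zip(y_true, y_pred, used_sets):
        total += mis_cost[y][yp] + sum(cov_costs[i] for i in S)
    return total / len(y_true)

# Two hypothetical test samples: one false negative, one true negative.
print(avg_total_cost([1, 0], [0, 0], [{0}, {0, 1}],
                     [[0.0, 1.0], [10.0, 0.0]], [1.0, 2.0]))  # 7.0
```

Sample 1 incurs the false-negative cost 10 plus covariate cost 1; sample 2 incurs no misclassification cost but covariate costs 1 + 2, giving (11 + 3) / 2 = 7.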

Results
For each dataset we use 5-fold cross-validation and report the mean and standard deviation of the total costs. We evaluate all methods in two settings:

• user-specified false positive and false negative misclassification costs;
• user-specified false positive misclassification cost and target recall.
If not stated otherwise, we use group lasso, as explained in Section 5.2, for acquiring the sets S 1 ⊂ S 2 , . . . S q .

User-specified misclassification costs
In the first setting, we assume that the user specifies the false positive cost in {1, 5, 10, 50, 100, 500, 1000}. The false negative cost is set to 10 times the false positive cost, which reflects that it is more important to detect diseased patients than to avoid wrongly classifying healthy patients.
The total cost of misclassification is shown in the top plot of Figures 4, 5, 6, and 7 for Diabetes, Breast Cancer, PhysioNet and Heart-disease, respectively. We observe that with respect to minimizing the total cost of classification (top plots), our proposed method AdaCOS performs better than all previously proposed methods.
In each of those figures, the middle and bottom plots additionally show the weighted accuracy and the number of acquired covariates, each plotted against the false positive cost (which is set by the user). Since we assume that false negative classifications have 10 times higher cost than false positive classifications, we use the weighted accuracy defined by

weighted accuracy = (true positives · 10 + true negatives) / (number of true labels · 10 + number of false labels).
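The weighted accuracy is equally direct to compute:

```python
def weighted_accuracy(y_true, y_pred, w=10):
    """Weighted accuracy with true positives weighted by w = 10,
    matching the 10:1 false-negative to false-positive cost ratio."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp * w + tn) / (n_pos * w + n_neg)

# Hypothetical predictions: one of two positives missed.
print(weighted_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))  # 12/22 ≈ 0.545
```

The missed positive is penalized ten times as heavily as a missed negative would be, so the score drops to 12/22 rather than the plain accuracy 3/4.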
From the bottom plots, as expected, we see that all methods start acquiring more covariates as the user-specified false positive cost increases. At the same time, all methods, except Shim2018, show an increase in (weighted) accuracy. In particular, Shim2018 underperforms on the smaller datasets Diabetes, Breast Cancer, and Heart-disease, which is likely due to the difficulty of adjusting the hyper-parameters of their deep neural network classifier on small hold-out validation data.
In terms of (weighted) accuracy, the Full Model always performs best, i.e. even for small datasets we do not find any gains in predictive accuracy from using a sparser model. In conclusion, if the covariate costs are zero or negligible, we might just opt for the full model to get the lowest total costs. On the other hand, if the ratio of false-positive cost to covariate cost is less than around 100, the full model performs considerably worse than the proposed method in terms of total cost. The (weighted) accuracies of AdaCOS and COS are similar, while the former achieves the same accuracy with fewer covariates. This demonstrates the effectiveness of estimating the expected cost of misclassification, depending on what we have observed so far, using the conditional Bayes risk as in Equation (7).

Comparison of Group Lasso and Forward Selection
Next, we compare the forward selection strategy (Section 5.1) and group lasso (Section 5.2) when used for acquiring the covariate sets S 1 ⊂ S 2 , . . . S q . For Diabetes, Breast Cancer, and Heart-disease, the total costs of the proposed method AdaCOS with forward selection and group lasso are shown in Figure 8. Due to the high computational costs for large p, it was not feasible to apply forward selection to the PhysioNet dataset. We find that, except for the Breast Cancer dataset, both covariate acquisition strategies lead to similar results. For Breast Cancer, forward selection appears to be superior to group lasso.

Symmetric misclassification costs on Diabetes dataset
In order to compare to the results reported in (Ji and Carin, 2007;Dulac-Arnold et al., 2012), we also evaluate on the Diabetes dataset with symmetric misclassification costs (i.e. false negative and false positive costs are the same), and the cost for correct classification set to −50. The results, shown in Table 1, suggest that also in this setting the proposed method can have an advantage over previously proposed methods. In particular, the proposed method AdaCOS with forward selection has the lowest total costs, though, when using group lasso the proposed method underperforms.

User-specified target recall
Finally, we investigate the setting where the user specifies the target recall r instead of false negative costs. Here, we show the results for r = 0.95; the results for target recall r = 0.99 are similar and given in the supplementary material.
For this setting, we do not consider the method AdaptGbrt, since it does not output class probabilities or scores. For Shim2018 and GreedyMiser, we found that simply using the class probabilities/scores from the validation data to learn thresholds with recall ≥ r tended to lead to recall less than r on the test data, as shown in Figure 13. Therefore, in order to make all results comparable, we show the results for Shim2018 and GreedyMiser at the same recall level as the proposed method AdaCOS. The recall of the proposed method AdaCOS on the test data, as shown in Figure 13, never violates the target recall of 0.95.

Table 1: Shows the total cost of misclassification under the same cost setting as in (Ji and Carin, 2007; Dulac-Arnold et al., 2012): user-specified misclassification costs are symmetric (either 400 or 800), and the cost of correct classification equals −50. The results for the methods DWSM and POMDP are taken from (Dulac-Arnold et al., 2012) and (Ji and Carin, 2007), respectively.
Since no false negative costs are provided, we cannot evaluate in terms of total costs anymore. Instead, we evaluate in terms of average operation costs, defined as the average cost of false positives plus the costs for covariate acquisition:

avg operation costs := (1 / n_t) Σ_{k=1}^{n_t} ((1 − y^{(k)}) · y*^{(k)} · c_{0,1} + Σ_{i ∈ S^{(k)}} c_i).

The results for all datasets are shown in Figures 9, 10, 11, and 12. For completeness, in each of those figures, we also plot the false discovery rate (FDR) against the covariate costs (bottom plots), where the crosses of different sizes mark the standard deviation of the FDR and covariate costs.
For all datasets except Heart-Disease, the proposed method AdaCOS has the smallest operation costs. Furthermore, AdaCOS tends to achieve a lower false discovery rate with fewer covariates used.

Markov Decision Process (MDP) Framework
The MDP formulation and its solution using an action-utility representation (Q-learning) in (Bayer-Zubek, 2004) is closest to our approach. Their method also leads to a Bayes procedure. However, they do not provide a formal proof and consider only discrete covariates. The work in (Dulac-Arnold et al., 2011; Karayev et al., 2013) also uses the MDP framework. However, their proposed methods cannot incorporate the uncertainty about the covariate distributions. The work in (Ji and Carin, 2007) tries to model such uncertainties by formulating the cost-sensitive classification problem as a partially observable Markov decision process (POMDP). However, their POMDP formulation can lead to repeatedly selecting the same covariates, and as a consequence they need to adapt the stopping criterion.

Reinforcement Learning Approaches
Janisch et al. (2017) and Shim et al. (2018) suggest using deep reinforcement learning with Q-learning. In contrast to the MDP approaches above, a discriminative decision maker is learned which does not require an environmental model. Their methods perform promisingly in domains where huge amounts of labeled training data are available. Alternatively, the work in (Benbouzid et al., 2012) suggests the use of SARSA. The method in (Contardo et al., 2016) also addresses this problem with reinforcement learning.

Discriminative Decision Approach The work in  proposes an intriguing method for finding a decision procedure that is guaranteed to converge to the Bayes risk given sufficient training data. Their idea is to create a Bayes optimal classifier for all possible subsets of covariates, connected by a directed acyclic graph. They formulate the problem as an empirical risk minimization (ERM) problem, and show that with infinitely many training samples the loss at each node converges to the Bayes risk. However, in order to allow for scalability, their method requires acquiring covariates in batches. The work in (Trapeznikov and Saligrama, 2013; Wang et al., 2014b) uses a similar framework but is restricted to a fixed sequential order.
Cost-sensitive Tree Construction The work in (Xu et al., 2012; Nan et al., 2015, 2016; Nan and Saligrama, 2017; Peter et al., 2017) learns a random forest subject to budget constraints on the features. In particular, the methods in (Nan and Saligrama, 2017; Peter et al., 2017) are considered state of the art for this task. Their usage of gradient boosted decision trees (Friedman, 2001) makes them particularly effective for very large training data. Cost-sensitive decision trees for discrete covariates are also considered in (Sheng and Ling, 2006), and extended to Bayesian networks in (Bilgic and Getoor, 2007).

Tree of Classifiers
The work in (Kusner et al., 2014; Xu et al., 2013) proposes learning a tree of classifiers that minimizes a convex surrogate loss subject to budget constraints. Wang et al. (2014a) assume a fixed number of pre-trained classifiers, and the goal is to learn a policy that selects one of those classifiers.

Entropy-Based Approaches
The work in (Kanani and Melville, 2008; Gao and Koller, 2011; Kapoor and Horvitz, 2009; Gong et al., 2019) optimizes a criterion that combines the costs of features with an estimate of the class entropy of the resulting classifier. As such, their objective function is different from ours.

Density Estimation via Autoencoders
The work in (Kachuee et al., 2018) suggests acquiring the covariate which has the highest sensitivity with respect to the output prediction y. In order to account for different covariate acquisition costs, the sensitivity scores are re-scaled appropriately. The sensitivity scores are estimated using a denoising autoencoder. Similarly, the work in (Ma et al., 2019) uses the expected Shannon information as objective function, which is estimated via a variational autoencoder. Neither objective function is related to the minimization of the expected total cost.
Others The work in (Greiner et al., 2002) extends the Probably Approximately Correct (PAC) framework to prove the existence of a cost-sensitive classifier that is, with high probability, optimal in the sense of providing minimal average total costs. However, they assume a probability distribution over only discrete covariates. The method in (Lakkaraju and Rudin, 2017) additionally focuses on interpretability and, as a consequence, optimizes an objective function that is different from ours. Imitation learning is also applied to this task by He et al. (2012), but their definition of loss differs from the minimization of the total classification costs that we consider here. The work in (Nan et al., 2014) assumes a margin-based classifier and uses a k-nearest neighbor approach to estimate the accuracy of the classifier.

Conclusions
In this article, we addressed the problem of cost-sensitive classification where the goal is to minimize the total costs, defined as the expected cost of misclassification plus the cost for covariate acquisition.
In Section 2, we rigorously formalized this goal as the minimization of the (conditional) Bayes risk, which can change after the acquisition of a new covariate. However, solving this minimization problem is hard. First, evaluating the conditional Bayes risk requires estimating and integrating over a high-dimensional density. Second, the Bayes risk must be evaluated for all combinations of covariate sets, which is exponential in p, the number of covariates.
In order to overcome the computational difficulties, we introduced two working assumptions: 1. The optimal classifier can be expressed as a generalized additive model (GAM).
2. The optimal sets of covariates can be expressed as a sequence of monotonically increasing sets, namely S_1 ⊂ S_2 ⊂ ... ⊂ S_q.
Using the first assumption, we showed in Section 4 that the evaluation of the conditional Bayes risk reduces to a one-dimensional density estimation and integration problem which can be solved efficiently.
Furthermore, we showed that the sequence S_1 ⊂ S_2 ⊂ ... ⊂ S_q can be acquired in a computationally efficient manner by inspecting the regression coefficient path when penalizing the GAM with a group lasso.
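The idea of reading off a nested covariate sequence from a regularization path can be illustrated with a minimal sketch. Here we use scikit-learn's ordinary lasso path (one covariate per group) as a stand-in for the group-lasso path on the GAM, and the function name is hypothetical:

```python
import numpy as np
from sklearn.linear_model import lasso_path

def nested_covariate_sets(X, y):
    """Order covariates by when they first enter the lasso coefficient
    path, yielding a nested sequence S_1 ⊂ S_2 ⊂ ... of covariate sets."""
    # lasso_path returns alphas in decreasing order, so walking the
    # columns of coefs corresponds to gradually weakening the penalty
    alphas, coefs, _ = lasso_path(X, y)  # coefs: (n_features, n_alphas)
    sets, active = [], []
    for j in range(coefs.shape[1]):
        for feat in np.flatnonzero(coefs[:, j]):
            if feat not in active:          # covariate enters the path
                active.append(int(feat))
                sets.append(list(active))   # each new covariate extends S
    return sets
```

On data where one covariate dominates the signal, that covariate enters the path first, so S_1 contains it alone and each later set extends the previous one.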
We note that some previous methods like Shim2018 (Shim et al., 2018) do not share our working assumptions and instead use a very flexible classifier (a deep neural network) together with a covariate acquisition strategy based on reinforcement learning. However, for small datasets, and even for medium-sized datasets like PhysioNet, we found that a generalized additive model is competitive with or even better than a neural network classifier, and the flexibility of the reinforcement learning approach seems to suffer from high variance.
Finally, we considered the situation where not all misclassification costs are specified by the user. In particular, we considered the situation where the user specifies a target recall instead of the cost of false negative classification. We showed that it is possible to apply the proposed method by estimating the implicitly defined false negative cost. Our experiments showed that the resulting method indeed achieves the desired minimum recall, while minimizing the false discovery rate and covariate acquisition cost.
The source code of the proposed method and for reproducing all results is available at https://github.com/andrade-stats/AdaCOS_public.
Proof sketch. Base case (S = V): Since S = V, π*(x_S) coincides with the Bayes decision for the full covariate set, and therefore π*(x_S) is a Bayes procedure; the same holds analogously for π̃.
Induction step (S ⊂ V): Assume that the induction hypothesis holds for all sets S ∪ {i} with i ∈ V \ S. Let π̃ denote a Bayes procedure. Using the structure of the loss function as defined in Equation (5), we obtain a chain of (in)equalities in which the step marked (1) uses the induction hypothesis and the last step follows from Lemma 1. Since π̃ is a Bayes procedure, equality must hold in the second and fifth lines, and therefore π* is also a Bayes procedure.
The final step distinguishes two cases: (1) if π*(x_S) ∈ L, the claim follows directly from the definition of π*; (2) if π*(x_S) ∉ L, the claim likewise follows from the definition of π*.