New algorithms for budgeted learning
Abstract
We explore the problem of budgeted machine learning, in which the learning algorithm has free access to the training examples’ class labels but has to pay for each attribute that is specified. This learning model is appropriate in many areas, including medical applications. We present new algorithms, based on algorithms for the multi-armed bandit problem, for choosing which attributes of which examples to purchase. In addition, we evaluate a group of algorithms based on the idea of incorporating second-order statistics into decision making. Most of our algorithms are competitive with the current state of the art and perform better when the budget is highly limited (in particular, our new algorithm AbsoluteBR2). Finally, we present new heuristics for selecting an instance to purchase after the attribute is selected, instead of selecting an instance uniformly at random, which is typically done. While experimental results showed some performance improvements when using the new instance selectors, there was no consistent winner among these methods.
Keywords
Budgeted learning, Multi-armed bandit
1 Introduction
Approaches to typical machine learning applications usually operate under the assumption that data are freely available. That is, it is usually taken for granted that numerous fully-specified instances, along with their labels, are available to build a classifier. However, in many real-world applications, this assumption is far from realistic. Instead, collecting and specifying data may be very time-consuming and costly.
An area in machine learning that attempts to address the problem of learning in the absence of labels is active learning (Settles 2009). Rather than a passive learner that simply builds a hypothesis based on the available attribute/label pairs, an active learner must also choose which instances it wants an oracle to label. Many active learning algorithms and applications aim to reduce the real labeling costs (e.g., annotation time or cost of materials to acquire labels) while achieving competitive performance versus passive learning. In contrast, the theoretical literature mainly focuses on the label complexity—how many label purchases are required to learn asymptotically in the case for which all values of the attributes of the datasets are fully specified.
An alternative line of research called budgeted learning or active feature acquisition (Lizotte et al. 2003; Madani et al. 2004; Melville et al. 2004, 2005; Saar-Tsechansky et al. 2009) focuses on a model that is related to active learning. In budgeted learning, a learner considers instances in which the class labels are specified, but some or all of the attribute values are not. Instead, the learner can purchase an attribute value for a specific instance at some fixed cost, subject to an overall budget. The challenge to the learner is to decide which attributes of which instances will provide the best model from which to learn. The original motivation for the budgeted learning model came from medical applications in which the outcome of a treatment, drug trial, or control group (labels of an instance) is known and the features (results of running medical tests) are each available for a price. For example, a project was allocated $2 million to develop a diagnostic classifier for patient cancer subtypes. In this study, a pool of patients with known cancer subtypes was available, as were various diagnostic tests that could be performed, each with a cost. Because each test was expensive and the overall budget was fixed, the learning algorithm had to choose carefully which patients would get which tests in order to learn a good cancer subtype classifier within the given budget.
Some early results of our work appeared in Deng et al. (2007), in which we presented new algorithms for choosing which attributes of which examples to purchase in the budgeted learning model. Several of our algorithms were based on results in the “multi-armed bandit” model. In this model, we have N slot machines that we may play for some fixed number of rounds. At each round, one must decide which single slot machine to play in order to maximize total reward over all rounds. Our first two algorithms were based on the algorithm Exp3 of Auer et al. (2002b) originally designed for the multi-armed bandit problem, and our third is based on the “follow the perturbed leader” approach of Kalai and Vempala (2005). In this paper, we describe this preliminary work and then present three new algorithms, which are variations of Biased Robin (Kapoor and Greiner 2005b), by incorporating second-order statistics into the decision-making process. We refer to our algorithms that use second-order statistics as Biased Robin 2 (BR2) series algorithms. Our results show that most of our proposed algorithms perform well when compared with the existing algorithms, in particular, AbsoluteBR2 (ABR2).
Previous budgeted learning algorithms that use naïve Bayes as the base classifier focus only on selecting attributes to purchase and then choose uniformly at random the instance whose attribute is to be purchased. In other words, such algorithms consider one instance to be as good as any other (given the class label) for querying the chosen attribute. In this paper, we extend our prior work by experimenting with strategies to select instances as well as attributes, choosing instances that are most wrongly predicted or that are least certain in the current model. Melville et al. (2004) and Melville et al. (2008) proposed to use US (Uniform Sampling) and ES (Error Sampling) to consider partial instances instead of all of the instances. Their sampling is done before choosing an (instance, feature) pair. This is different from our row selection procedure, which is applied after a feature has been chosen. While experimental results showed some performance improvements when using the new instance selectors, there was no consistent winner among these methods.
2 Background and related work
Budgeted learning is related to conventional machine learning techniques in that the learner is given a set D of labeled training data and then infers a classifier (hypothesis, or model) to label new examples. The key difference is that, in budgeted learning, there is an overall budget, and the learner has to use this given budget to learn as good a classifier or hypothesis as possible. There is a body of work in the data mining community by the name of “active feature acquisition” (Zheng and Padmanabhan 2002; Melville et al. 2004, 2005; Lomasky et al. 2007; Saar-Tsechansky et al. 2009), in which the features are purchased at a cost and the only difference is that their goal is to learn a hypothesis using as little cost as possible (i.e., no strict budget). For example, Saar-Tsechansky et al. (2009) proposed to use Log Gain—the expected likelihood of the training set—as a smoother measure of goodness of an attribute purchase. This idea was somewhat similar to some of the measures (conditional entropy and GINI index) that we have used for several budgeted learning algorithms. They also consider for purchase only those instances that are misclassified by the learned classifier instead of all of the instances to reduce the search space.
The difference between active feature acquisition and budgeted learning is that budgeted learning usually has a hard budget set up front, while active feature acquisition does not. A minor distinction is that in some applications of active feature acquisition (Zheng and Padmanabhan 2002; Lomasky et al. 2007), individual instance/feature values cannot be bought one at a time. Instead, all the missing attributes of an instance must be synthesized or obtained as a whole.
One of the original applications of budgeted learning was in the medical domain (Madani et al. 2004). In this application, the examples are patients, and the (known) class label of patient x is y∈{−1,+1}, indicating whether or not x responded to a particular treatment. The (initially unknown) attributes of x are the results of tests performed on tissue samples gathered from patient x, e.g., a blood test for the presence of a particular antibody. In this case, any attribute of any instance can be determined. However, each costs time and money, and there is a finite budget limiting the attribute values that one can buy. Further, each attribute can cost a different amount of money, e.g., a blood test may cost less than a liver biopsy.
A second application of budgeted learning is in customer modeling. A company has significant data (attributes) on its own customers but may have the option to pay for other information on the same customers from other companies. Though they do not explicitly refer to it as “budgeted learning,” Zheng and Padmanabhan (2002) discuss this problem as applied to learning models of customers to web services, where a customer’s browsing history at a local site (e.g., Travelocity) is known, but the same customer’s history at other sites (e.g., Expedia) is not, though by assumption it could be purchased from these competing sites. Zheng and Padmanabhan evaluated two heuristic approaches to this problem, both of which are based on the idea of how much additional information the unspecified attributes can provide. Their first algorithm (AVID) imputes, at each iteration, the values of the unspecified features based on the specified ones. It does this multiple times and then considers the feature with the highest variance to be the least certain and thus the best one to purchase. Their second algorithm (GODA) also imputes values of the unspecified features, but then it selects for purchase the unspecified feature that maximizes expected “goodness” based on a specified function (e.g., training error). In their work, they assume that all attributes’ values are purchased at the same time rather than as individual (instance, feature) pairs.
In other work, Veeramachaneni et al. (2006) studied the problem of budgeted learning for the purpose of detecting irrelevant features rather than building a classifier. Specifically, let θ∈Θ parameterize the probability distribution over \(\mathcal{Z} \times\mathcal{X} \times \mathcal{J}\), where \(\mathcal{Z}\) is the space of attributes that are always known to the learner, \(\mathcal{X}\) is the space of attributes that can be queried by the learner, and \(\mathcal{J}\) is the space of labels. Their goal was to accurately learn some function g(θ) (e.g., the true classification error rate of a model) by querying as few unknown attributes as possible. Their algorithm purchased attributes that were expected to induce maximum change in the estimate of g. As with other budgeted learning methods, this approach inherently seeks out attributes that are more relevant to estimating g.
Related to budgeted learning is the learning of “bounded active classifiers,” in which one has fully specified training examples with their labels, but the final hypothesis h must pay for attributes of new examples when predicting a label, spending at most B_{h}. This, of course, can be combined with budgeted learning to yield the budgeted learning of bounded active classifiers. Such a learning algorithm has been termed a “budgeted bounded-active-classifier learner” (bBACl) (Kapoor and Greiner 2005b, 2005a; Greiner et al. 2002). For simplicity, in our work we focus on budgeted learning of unbounded classifiers, leaving the extension of our results to the bBACl problem as future work.
Budgeted learning falls under a general framework of problems that represent a trade-off between exploration and exploitation in an online decision process. In this framework, a learner has several actions it may choose, each action with a corresponding payoff. Initially, the learner starts with little or no knowledge and must spend some time gathering information about which attributes are most relevant, reflecting the need for exploration. At some point, the learner begins exploitation by purchasing more values of attributes that are known to be informative (assuming that a particular attribute is equally informative for all instances), in an attempt to maximize its expected reward. In budgeted learning, the reward is the performance of the final classifier when the budget is exhausted. Clearly, the budgeted learner must purchase a variety of attributes in order to form a more complete model of the underlying problem domain and also figure out which attributes are more informative, thus showing the importance of exploration in budgeted learning. Exploration and exploitation can and often do operate at the same time. In other words, there may be no clear-cut boundaries between these two facets. For example, to explore more efficiently (which is crucial when the budget is limited), we often need to exploit what is already known to decide which attribute to purchase next. On the other hand, the results of exploration, i.e., the rewards received during the process, are valuable for deciding which attributes should be purchased more often in order to build a good final classifier.
Most theoretical studies of budgeted learning are based on related problems within the exploration versus exploitation framework, such as the coins problem and the multi-armed bandit problem (see Sect. 2.3). In the coins problem, one is given N biased coins {c_{1},…,c_{N}}, where c_{i}’s probability of heads is distributed according to some known prior R_{i} (though we say “coins,” the problem can be generalized to non-binary outcomes). In each round, a learner selects a coin to flip at unit cost. After exhausting its budget, the learner must choose a single coin to flip ad infinitum, and its payoff is the number of heads that the coin yields. The goal is to choose the coin that has the highest probability of heads, and performance is measured by the regret incurred by the learner’s choice, i.e., the expected amount that the learner could have done better by choosing the optimal coin. Madani et al. (2004) showed that this problem can be modeled as one of sequential decision making under uncertainty, and as such can be solved via dynamic programming when the number of coins is taken to be constant. However, the complexity of such a dynamic programming solution is exponential, and in fact this problem is NP-hard under specific conditions. They argued that straightforward heuristics such as “Round Robin” (repeatedly cycle through each coin until the budget is exhausted) do not have any constant approximation guarantees. That is, for any constant ℓ there is a problem with minimum regret r^{∗} such that the regret of Round Robin is >ℓr^{∗}. Kapoor and Greiner (2005c) empirically evaluated common reinforcement learning techniques on this problem. Guha and Munagala (2007) were able to design a 4-approximation algorithm for the general coins problem using a linear programming rounding technique. Goel et al. (2009) presented an algorithm that also guaranteed a constant factor approximation. 
Their algorithm was based on “ratio index” (analogous to the Gittins index) such that a single number can be computed for each arm and the arm with the highest index is chosen for experimentation.
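To make the coins problem concrete, the Round Robin heuristic and its regret can be sketched as follows. This is only an illustrative sketch under simplifying assumptions that are not part of the original formulation: the true head probabilities are passed in directly (rather than drawn from the priors R_{i}), the final coin is chosen by highest empirical frequency of heads, and the function name `round_robin_regret` is hypothetical.

```python
import random
from typing import List, Optional


def round_robin_regret(head_probs: List[float], budget: int,
                       rng: Optional[random.Random] = None) -> float:
    """Flip coins in Round Robin order, then commit to one coin.

    Returns the regret of the committed coin, measured here as the gap
    between the best head probability and that of the chosen coin.
    """
    rng = rng or random.Random(0)
    n = len(head_probs)
    flips = [0] * n
    heads = [0] * n
    for t in range(budget):
        i = t % n  # cycle through the coins, one unit-cost flip per round
        flips[i] += 1
        heads[i] += rng.random() < head_probs[i]
    # Assumption: coins that were never flipped get an empirical mean of 0.
    means = [h / f if f else 0.0 for h, f in zip(heads, flips)]
    chosen = max(range(n), key=means.__getitem__)
    return max(head_probs) - head_probs[chosen]
```

With a small budget relative to the number of coins, each coin receives few flips and the committed coin is often suboptimal, which is the weakness of Round Robin noted above.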
Recently, budgeted learning has been studied in the context of sampling as few times as possible to minimize the maximal variance of all the arms in a multi-armed bandit problem (Antos et al. 2008). Further, in Li (2009), the goal is to minimize the expected risk of the parameters in a generative Bayesian network, with the risk chosen to be the expected KL divergence of the parameters from their expectations. Finally, Goel et al. (2006) studied optimization problems with inputs that are random variables, where the only available data are samples of the distribution.
In the following sections, we give detailed descriptions of algorithms on which we base our contributions. Section 2.1 explains Biased Robin (Lizotte et al. 2003). Section 2.2 describes RSFL (Kapoor and Greiner 2005b). Finally, Sect. 2.3 explains bandit-based algorithms (Auer et al. 2002b; Kalai and Vempala 2005).
2.1 Biased Robin
One of the simplest early budgeted learning algorithms was Biased Robin (BR). BR is similar to a Round Robin approach (where first attribute 1 is purchased once, then attribute 2, and so on to attribute N, then repeat), except that attribute i is repeatedly purchased until such purchases are no longer “successful,” and then the algorithm moves on to attribute i+1. Success is measured by a feedback function, examples of which are described later. Despite its simplicity, BR is a solid performer in training naïve Bayes classifiers (Lizotte et al. 2003; Kapoor and Greiner 2005b).
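The control flow of BR can be sketched as follows. This is a minimal sketch rather than the implementation evaluated in the cited work: the callback `purchase` is a hypothetical stand-in for buying one value of an attribute and evaluating the feedback function, returning `True` when the purchase counts as “successful.”

```python
from typing import Callable, List


def biased_robin(num_attributes: int, budget: int,
                 purchase: Callable[[int], bool]) -> List[int]:
    """Return the sequence of attributes purchased by Biased Robin.

    `purchase(i)` (hypothetical) buys one value of attribute i and reports
    whether the feedback function judged the purchase successful.
    """
    order = []
    i = 0
    for _ in range(budget):
        order.append(i)
        if not purchase(i):
            # An unsuccessful purchase moves the algorithm to the next
            # attribute, wrapping around after the last one.
            i = (i + 1) % num_attributes
    return order
```

For example, with three attributes and feedback outcomes success, failure, failure, success, failure, the purchase order is 0, 0, 1, 2, 2.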
2.2 Single-Feature Lookahead
Single Feature Lookahead (SFL) is a method introduced by Lizotte et al. (2003). Using the posterior class distribution of its naïve Bayes model, one can compute the expected loss of any sequence of actions. Given sufficient computational resources and the ability to compute expected loss, one can achieve Bayes optimality by considering all possible future actions contingent on all possible outcomes along the way (Wang et al. 2005). Because this is too computationally intensive, SFL and similar approaches restrict the space of future models they consider, by restricting the class of policies considered. In SFL, each (label, attribute) pair is associated with an expected loss incurred by spending the entire remaining budget on that pair. This expected loss is computed using the current posterior naïve Bayes model, which, given an allocation, gives the distribution over future models. Expected loss is computed over this distribution. The pair with the lowest loss is then purchased once. SFL introduces a lookahead into the state space S of all possible purchases without explicitly expanding the entire state space. At any point in the algorithm’s execution, one has an allocation α: a description of how many feature values have been purchased from instances with a certain class label. Specifically, α_{ijk} is a count of the number of times attribute i has been purchased from an instance with class label j with resulting attribute value k. (Thus, SFL requires the possible attribute values to be discrete.) Given a current allocation, SFL calculates the expected loss of performing all possible single-purchase actions (purchasing an attribute/label pair which results in a specific attribute value) and chooses the action that minimizes the expected loss of the resulting allocation. In a randomized version of SFL called RSFL (Kapoor and Greiner 2005b), the next (label, attribute) pair to purchase is sampled from the Gibbs distribution induced by the SFL losses.
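The randomized selection step of RSFL can be illustrated with the following sketch, which samples a (label, attribute) pair with probability proportional to exp(−loss/T). The temperature parameter `temperature` and the function names are assumptions made for illustration; the expected losses are taken as given, since computing them requires the naïve Bayes posterior described above.

```python
import math
import random
from typing import Dict, Hashable, Optional


def gibbs_probabilities(losses: Dict[Hashable, float],
                        temperature: float = 1.0) -> Dict[Hashable, float]:
    """Gibbs distribution over actions: lower loss means higher probability."""
    lowest = min(losses.values())
    # Subtracting the minimum loss keeps the exponentials numerically stable
    # without changing the normalized probabilities.
    weights = {a: math.exp(-(loss - lowest) / temperature)
               for a, loss in losses.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_pair(losses: Dict[Hashable, float], temperature: float = 1.0,
                rng: Optional[random.Random] = None) -> Hashable:
    """Sample one (label, attribute) pair from the Gibbs distribution."""
    rng = rng or random.Random()
    probs = gibbs_probabilities(losses, temperature)
    pairs = list(probs)
    return rng.choices(pairs, weights=[probs[p] for p in pairs], k=1)[0]
```

As the assumed temperature approaches zero, the sampling concentrates on the minimum-loss pair and the behavior approaches that of deterministic SFL.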
2.3 The multi-armed bandit problem
There are close connections between budgeted learning and the so-called “multi-armed bandit problem” first studied by Robbins (1952). The problem can be stated as follows (Gittins 1979): there are N arms, each having an unknown success probability of emitting a unit reward. The success probabilities of the arms are assumed to be independent of each other. The objective is to pull arms sequentially so as to maximize the total reward. Many policies have been proposed for this problem under the independent-arm assumption (Lai and Robbins 1985; Auer et al. 2002b). The key difference between budgeted learning and the multi-armed bandit problem is that in the latter, one tries to maximize the cumulative rewards over all pulls, whereas with budgeted learning, one simply wants to maximize the accuracy of the resulting classifier or a “simple” reward in some sense.
2.3.1 Exp3 algorithm
Auer et al. proved that under appropriate conditions,^{1} the expected total reward of Exp3 will not differ from the best possible by more than \(2.63 \sqrt{g N \ln N}\), where g is an upper bound on the total reward of the best sequence of choices.
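For reference, the standard form of Exp3 can be sketched as below. This is a minimal sketch of the textbook algorithm, assuming rewards in [0, 1]; the function signature, the `reward_fn` callback, and the weight rescaling (added only for numerical stability, and leaving the probabilities unchanged) are choices made for illustration.

```python
import math
import random
from typing import Callable, List, Optional, Tuple


def exp3(num_arms: int, num_rounds: int, gamma: float,
         reward_fn: Callable[[int, int], float],
         rng: Optional[random.Random] = None) -> Tuple[List[int], List[float]]:
    """Run Exp3; `reward_fn(arm, t)` must return a reward in [0, 1]."""
    rng = rng or random.Random(0)
    weights = [1.0] * num_arms
    pulls = []
    for t in range(num_rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs, k=1)[0]
        reward = reward_fn(arm, t)
        # Importance-weighted estimate; unplayed arms implicitly get 0.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so that weights cannot overflow over many rounds.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls.append(arm)
    return pulls, weights
```

Only the pulled arm's weight changes in each round, so the algorithm needs feedback only for the arm it actually plays.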
2.3.2 FPL algorithm
2.3.3 Other results
Recently, Bubeck et al. (2008) pointed out an interesting link between simple (one-shot) and cumulative (overall) regret, which is the difference in reward of the algorithm in question and that of an optimal, omniscient algorithm. One of the surprising results is that an upper bound on the cumulative regret implies a lower bound on simple regret for all algorithms. In other words, the exploration-exploitation trade-offs are qualitatively different for these two types of regrets. In fact, according to Bubeck et al. (2008), one of the very successful algorithms (Auer et al. 2002a) for cumulative regret would perform worse than the uniform random algorithm when the budget goes to infinity. However, in their simulation study, this was not observed since in order for this to occur, the budget would have to be very large, to the point that the computed regrets would fall below that of the precision of the computer used to run the simulations. Their results do not seem to be directly transferable to budgeted machine learning due to their assumptions. Another important contribution is that they pointed out that exploration (allocation) strategies can be different from the decision (recommendation) strategies in the pure exploratory multi-armed bandit problems.
2.3.4 Discussion
The theoretical guarantees in the work of Auer et al. (2002b) give us good motivation for using similar approaches to the budgeted learning problem. Auer et al. make no assumptions about the underlying distribution of slot machines and their results still hold even when the rewards are dynamic and may depend on previous sequences of rewards and the player’s random draws. Thus, we can plug in our choice of reward function for the slot machines (say, the negative conditional entropy of the class label with respect to an attribute) and their bounds automatically translate into guarantees in the budgeted learning context. However, these bounds only apply to the cumulative regret with respect to the best sequence of arm pulls. For instance, nothing is said by these bounds about the class label’s uncertainty with respect to an attribute upon the last round of purchases. In other words, it remains an open question whether under some conditions, one can bound the resulting training error with respect to the best set of purchases or the best possible error rate.
3 Our algorithms
For simplicity, all of our algorithms focus on a unit-cost model of budgeted learning, i.e., each attribute costs one monetary unit. Sections 3.1 and 3.2 present our two algorithms based on Exp3 and one algorithm based on “follow the perturbed leader” for multi-armed bandit problems. Section 3.3 presents our algorithms based on Biased Robin and second order statistics. Section 3.4 presents our row selection policies.
3.1 Exp3-based algorithms (Exp3C and Exp3CR)
After choosing a column to purchase, a row (instance) must also be selected. After choosing a column i, Exp3C selects a row uniformly at random from all rows that do not yet have column i filled in.
In our second algorithm (Exp3CR, for “Exp3-Column-Row”), we define a distribution over the rows as well as the columns, i.e., we now have two weight vectors instead of one. After choosing a column according to its distribution (which is done the same way as in Exp3C), our algorithm then chooses a row according to the row distribution. Once reinforcement is received, both weight vectors are updated independently of each other. Thus we replace the naïve Bayes assumption with a product distribution over the (column, row) pairs.
As previously mentioned, by directly applying the regret bounds of Auer et al. (2002b), we can easily bound the regret of Exp3C. Specifically, the expected total reward of Exp3C will not differ from that of the best sequence of attribute choices by more than \(2.63 \sqrt{g N \ln N}\), where N is the number of attributes (arms) and g is an upper bound on the total reward of this best sequence. It remains an open problem as to what kinds of bounds follow for Exp3CR.
3.2 Follow the expected leader (FEL)
Our third algorithm is a variation of the “follow the perturbed leader” (FPL) type algorithms due to Kalai and Vempala (2005). As with the previous two algorithms, we treat each attribute as an arm.
The framework for FPL assumes that we have access to the costs x_{i}(t) for all arms at every time step (had they been chosen). However, this assumption is not reasonable in the context of budgeted learning. For this reason, our implementation is a slight variation of the standard FPL called FEL (“follow the expected leader”) from Kalai and Vempala (2005). First, we assume that x_{i}(t) is zero if the arm was not played (the attribute was not chosen as a purchase). Next, let #x_{i}(t) be the number of times attribute i was chosen up to time step t. Now, instead of cumulative cost, we use the average of the perturbed cost (over the trials that an attribute is actually purchased) as the selection criterion. The FEL algorithm is presented as Algorithm 2. Just as with Exp3C and Exp3CR, we measure the cost as the training error on the partially specified training set. While we considered other reward functions (GINI index, expected classification error, and conditional entropy), classification error on the training set tended to work best for our algorithms in terms of accuracy on the test set. Thus the results we present represent those from the top-performing reward function for each algorithm.
3.3 Variance-based biased Robin algorithms
Recall that Biased Robin is similar to a Round Robin approach except that attribute i is repeatedly purchased until such purchases are no longer “successful,” and then the algorithm moves on to attribute i+1. In practice, however, making a decision solely based on the outcome of the last action can be problematic, e.g., due to the fluctuation (instability) of the learning process (Kapoor and Greiner 2005b).
3.4 Instance (row) selection heuristics
Many budgeted learning algorithms (except Exp3CR) only select columns for purchase, implicitly assuming that given a column, any instance (or any instance with a given class label) is equally useful. Thus rows are selected uniformly at random from either the entire training set or from among instances of a particular class. However, it may not always be the case that two instances are equally informative given an attribute. Thus, we refine these algorithms by defining criteria for choosing specific instances from which to purchase an attribute. When using our row selection strategies, after the budgeted learner chooses an attribute and a class label,^{3} the row (instance) chosen will be the one optimizing our selection criteria among those with the same class label and with the selected column yet unpurchased.
3.4.1 Entropy as row selection criterion (EN, en)
3.4.2 Error correction as row selection criterion (EC, ec)
The other row selection heuristic we considered is a greedy “error-correcting” approach. For each training instance m still with missing attributes, we calculate the predicted probability of its true class P_{t}(ℓ_{m}∣x_{m}), where ℓ_{m} stands for the true class of instance m. We then pick \(\operatorname{argmin}_{m} \{P_{t}(\ell_{m} \mid x_{m})\}\), the instance with the smallest estimated probability of its true class. Intuitively, the selected instance is more likely to be misclassified by the classifier, so knowing more about this instance should improve the performance, especially in the early stages of the training process. Melville et al. (2005, 2008) proposed to use US (Uniform Sampling) and ES (Error Sampling) to consider only partial instances instead of all of the instances. Their sampling is done before choosing an (instance, feature) pair. This is different from our row selection, which is applied after a feature has been chosen.
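The error-correction criterion amounts to an argmin over the candidate rows, as the following sketch shows. The data layout (a tuple of row identifier, true label, and predicted class probabilities) and the function name `select_row_ec` are assumptions made for illustration; the candidates are taken to be the rows with the chosen class label whose selected column is still unpurchased.

```python
from typing import Dict, Hashable, Sequence, Tuple

# (row id, true label, predicted class probabilities under the current model)
Candidate = Tuple[int, Hashable, Dict[Hashable, float]]


def select_row_ec(candidates: Sequence[Candidate]) -> int:
    """Return the row whose true class has the smallest predicted probability."""
    row_id, _, _ = min(candidates, key=lambda c: c[2][c[1]])
    return row_id


candidates = [
    (0, "a", {"a": 0.9, "b": 0.1}),
    (1, "b", {"a": 0.7, "b": 0.3}),
    (2, "a", {"a": 0.5, "b": 0.5}),
]
chosen_row = select_row_ec(candidates)  # row 1: its true class "b" has probability 0.3
```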
4 Experimental results
Data set information
Data set | Num. of Instances | Num. of Attributes | Num. of Classes |
---|---|---|---|
breast-cancer | 286 | 9 | 2 |
colic | 368 | 22 | 2 |
kr-vs-kp | 3196 | 35 | 2 |
mushroom | 8124 | 22 | 2 |
vote | 435 | 16 | 2 |
zoo | 101 | 17 | 7 |
We partitioned each data set in 10 different ways. For each partitioning, we used 10-fold cross validation, so the results presented in this section are averages of 100 folds. We ran 10-fold CV on 10 different partitionings because we found that the performance of each algorithm is sensitive to the partitioning used, and repeating the process reduced the variance. In addition to our algorithms (Exp3C, Exp3CR, FEL and the BR2 variations) and those in the literature described earlier (RSFL and BR), as a control, we also considered a random shopper that uniformly at random selects an unpurchased (instance, attribute) pair. We tried various reward functions with each algorithm, and chose the reward function that worked best for each algorithm. For our algorithms, we used the classification accuracy on the partially specified training set as a reward function. For RSFL and BR, we used conditional entropy. When applicable, we ran each algorithm with uniform random selection of rows as well as with the entropy-based and error correction-based approaches of Sect. 3.4.
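The 10×10 evaluation protocol can be sketched as a generator of train/test index splits. This is an illustrative sketch only: the function name, the seeding scheme, and the use of unstratified random partitions are assumptions, since the text does not specify how the partitionings were drawn.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n: int, k: int = 10, repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for `repeats` random k-fold partitionings."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh random partitioning for each repeat
        folds = [indices[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test
```

With the default arguments this yields 100 splits, matching the 100 folds over which the reported results are averaged.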
For the algorithms with adjustable parameters, we report the best results from the parameter values we tested, yielding a “best-case” description of each algorithm’s performance. For Exp3C and Exp3CR, we tried γ∈{0.01,0.05,0.10,0.15,0.20,0.25}, and chose γ=0.15 for Exp3C and γ=0.20 for Exp3CR. For FEL, we tried ϵ∈{0.01,0.05,0.1,0.2,0.5} and chose ϵ=0.1. For AbsoluteBR2, we chose Δ_{a}=0.01 from {0.00001,0.0001,0.001,0.01,0.1,1}, for RelativeBR2, we chose Δ_{r}=1 from {0.01,0.1,1,10,100}, and for WindowBR2 we chose Δ_{w}=2 from {2,3,4,5,6,7,8,9}.
To evaluate the overall behavior of each of the algorithms, we constructed learning curves that reflect the performance of a heuristic by its mean error over the 10×10 folds as more attributes are purchased. On each fold, each algorithm was run up to a budget of B=100. Each purchase was unit cost.
Algorithm abbreviations, full names, and short descriptions
Algorithm Identifier | Full Name | Source |
---|---|---|
BR, br | Biased Robin | Section 2.1 |
ABR2, abr2 | Absolute Biased Robin 2 | Section 3.3 |
RBR2, rbr2 | Relative Biased Robin 2 | Section 3.3 |
WBR2, wbr2 | Window Biased Robin 2 | Section 3.3 |
Exp3C, exp3c | Exp3 Column | Algorithm 1 |
Exp3CR, exp3cr | Exp3 Column Row | Section 3.1 |
RSFL | Random Single Feature Look-ahead | Section 2.2 |
FEL | Follow the Expected Leader | Algorithm 2 |
Rand | Random | Randomly choose a feature and an instance |
algoname.ec | algorithm algoname with Error-Correction row selector | Section 3.4.2 |
algoname.en | algorithm algoname with Entropy row selector | Section 3.4.1 |
algoname.ur | algorithm algoname with Uniform Random row selector | Section 3.4 |
4.1 Summary statistics
Ultimately, the goal in budgeted learning is to reduce the number of attributes one must purchase in order to effectively learn. To summarize and compare learning curves, we use summary statistics.
The first statistic we use is the target mean, which is the mean of the error rates achieved by the random shopper over the final 20 % of the total budget. We define the target budget of an algorithm over the 10×10 folds on a given data set as the minimum budget needed by that algorithm to be competitive with the random shopper. For a given algorithm A and trial t (i.e., the naïve Bayes model trained on the training set after t purchases by A), we compute the mean error rate of A over a window of purchases ending at t, where the window length is 5 % of the total budget. Then the target budget is the smallest t at which this windowed mean error rate reaches the target mean. We use a window size of 5 % of the total budget to reduce the influence of outliers, as the learning curves can have high variance early on. If an algorithm fails to achieve the target mean, its target budget is simply the entire budget B.
We also report each algorithm’s data utilization ratio, which is the algorithm’s target budget divided by the target budget of the random shopper. Thus, a lower data utilization ratio reflects that the algorithm was able to make more useful purchases overall while excluding large changes in performance as the budget is exhausted. This is especially informative because with larger budgets, each algorithm will naturally converge to the baseline, making distinctions between them meaningless as more purchases are made. These metrics are similar to those used by Abe and Mamitsuka (1998), Melville and Mooney (2004), and Culver et al. (2006) in the context of active learning.
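The summary statistics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, `curve[i]` is assumed to hold the mean error over the 10×10 folds after `i + 1` purchases, "achieving" the target mean is read as the windowed mean error being at or below it, and the window is assumed to end at trial t.

```python
from statistics import mean

def target_mean(random_curve, tail_fraction=0.20):
    """Mean error of the random shopper over the final 20% of the budget."""
    tail = max(1, round(len(random_curve) * tail_fraction))
    return mean(random_curve[-tail:])

def target_budget(curve, target, window_fraction=0.05):
    """Smallest t whose trailing-window mean error is at or below the target.

    Returns the full budget B = len(curve) if the target is never reached.
    Assumption: the window covers the purchases ending at trial t.
    """
    budget = len(curve)
    window = max(1, round(budget * window_fraction))
    for t in range(window, budget + 1):
        if mean(curve[t - window:t]) <= target:
            return t
    return budget

def data_utilization_ratio(curve, random_curve):
    """Target budget of an algorithm divided by that of the random shopper."""
    target = target_mean(random_curve)
    return target_budget(curve, target) / target_budget(random_curve, target)
```

Under this reading, an algorithm that never reaches the target mean gets a target budget of B, so its ratio exceeds 1 whenever the random shopper reaches the target earlier. This matches the tabulated values of 1.45 for algorithms with a target budget of 100.0 on breast-cancer, where the random shopper's target budget is 69.0.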
4.2 Comparing attribute (column) selectors and instance (row) selectors
Target budgets and data utilization ratios (DUR, in parentheses) for algorithms with the EC row selector. Total budget was B=100
Dataset | exp3cr | exp3c.ec | fel.ec | rsfl.ec | rand.ec | br.ec | abr2.ec | wbr2.ec | rbr2.ec |
---|---|---|---|---|---|---|---|---|---|
breast-cancer | 42.0 | 35.0 | 100.0 | 100.0 | 50.0 | 46.0 | 35.0 | 100.0 | 60.0 |
DUR | (0.61) | (0.51) | (1.45) | (1.45) | (0.72) | (0.67) | (0.51) | (1.45) | (0.87) |
colic-Nominalized | 78.0 | 83.0 | 78.0 | 100.0 | 89.0 | 100.0 | 93.0 | 98.0 | 51.0 |
DUR | (0.89) | (0.94) | (0.89) | (1.14) | (1.01) | (1.14) | (1.06) | (1.11) | (0.58) |
kr-vs-kp | 90.0 | 80.0 | 90.0 | 100.0 | 91.0 | 100.0 | 82.0 | 74.0 | 100.0 |
DUR | (0.96) | (0.85) | (0.96) | (1.06) | (0.97) | (1.06) | (0.87) | (0.79) | (1.06) |
mushroom | 86.0 | 77.0 | 73.0 | 83.0 | 82.0 | 70.0 | 71.0 | 77.0 | 65.0 |
DUR | (0.91) | (0.82) | (0.78) | (0.88) | (0.87) | (0.74) | (0.76) | (0.82) | (0.69) |
vote | 100.0 | 100.0 | 100.0 | 71.0 | 100.0 | 65.0 | 63.0 | 100.0 | 57.0 |
DUR | (1.14) | (1.14) | (1.14) | (0.81) | (1.14) | (0.74) | (0.72) | (1.14) | (0.65) |
zoo | 82.0 | 77.0 | 86.0 | 76.0 | 75.0 | 77.0 | 68.0 | 85.0 | 77.0 |
DUR | (0.88) | (0.83) | (0.92) | (0.82) | (0.81) | (0.83) | (0.73) | (0.91) | (0.83) |
Mean | 79.67 | 75.33 | 87.83 | 88.33 | 81.16 | 76.33 | 68.67 | 89 | 68.33 |
Mean DUR | 0.9 | 0.85 | 1.02 | 1.03 | 0.92 | 0.86 | 0.77 | 1.04 | 0.78 |
Median DUR | 0.9 | 0.84 | 0.94 | 0.97 | 0.92 | 0.79 | 0.74 | 1.01 | 0.76 |
Target budgets and data utilization ratios (DUR, in parentheses) for algorithms with the EN row selector. Total budget was B=100
Dataset | exp3cr | exp3c.en | fel.en | rsfl.en | rand.en | br.en | abr2.en | wbr2.en | rbr2.en |
---|---|---|---|---|---|---|---|---|---|
breast-cancer | 42.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 59.0 | 50.0 | 46.0 |
DUR | (0.61) | (1.45) | (1.45) | (1.45) | (1.45) | (1.45) | (0.86) | (0.72) | (0.67) |
colic-Nominalized | 78.0 | 100.0 | 100.0 | 95.0 | 100.0 | 100.0 | 68.0 | 96.0 | 82.0 |
DUR | (0.89) | (1.14) | (1.14) | (1.08) | (1.14) | (1.14) | (0.77) | (1.09) | (0.93) |
kr-vs-kp | 90.0 | 82.0 | 100.0 | 100.0 | 98.0 | 82.0 | 100.0 | 100.0 | 69.0 |
DUR | (0.96) | (0.87) | (1.06) | (1.06) | (1.04) | (0.87) | (1.06) | (1.06) | (0.73) |
mushroom | 86.0 | 76.0 | 94.0 | 85.0 | 84.0 | 78.0 | 81.0 | 70.0 | 63.0 |
DUR | (0.91) | (0.81) | (1) | (0.9) | (0.89) | (0.83) | (0.86) | (0.74) | (0.67) |
vote | 100.0 | 58.0 | 79.0 | 51.0 | 87.0 | 57.0 | 55.0 | 76.0 | 85.0 |
DUR | (1.14) | (0.66) | (0.9) | (0.58) | (0.99) | (0.65) | (0.62) | (0.86) | (0.97) |
zoo | 82.0 | 79.0 | 73.0 | 86.0 | 83.0 | 73.0 | 69.0 | 81.0 | 89.0 |
DUR | (0.88) | (0.85) | (0.78) | (0.92) | (0.89) | (0.78) | (0.74) | (0.87) | (0.96) |
Mean | 79.67 | 82.5 | 91 | 86.17 | 92 | 81.67 | 72 | 78.83 | 72.33 |
Mean DUR | 0.9 | 0.96 | 1.06 | 1 | 1.07 | 0.95 | 0.82 | 0.89 | 0.82 |
Median DUR | 0.9 | 0.86 | 1.03 | 0.99 | 1.02 | 0.85 | 0.81 | 0.87 | 0.83 |
Target budgets and data utilization ratios (DUR, in parentheses) for algorithms with the UR row selector. Total budget was B=100
Dataset | exp3cr | exp3c.ur | fel.ur | rsfl.ur | rand.ur | br.ur | abr2.ur | wbr2.ur | rbr2.ur |
---|---|---|---|---|---|---|---|---|---|
breast-cancer | 42.0 | 46.0 | 18.0 | 100.0 | 69.0 | 66.0 | 45.0 | 51.0 | 52.0 |
DUR | (0.61) | (0.67) | (0.26) | (1.45) | (1) | (0.96) | (0.65) | (0.74) | (0.75) |
colic-Nominalized | 78.0 | 89.0 | 82.0 | 100.0 | 88.0 | 100.0 | 90.0 | 93.0 | 82.0 |
DUR | (0.89) | (1.01) | (0.93) | (1.14) | (1) | (1.14) | (1.02) | (1.06) | (0.93) |
kr-vs-kp | 90.0 | 74.0 | 89.0 | 100.0 | 94.0 | 100.0 | 81.0 | 77.0 | 91.0 |
DUR | (0.96) | (0.79) | (0.95) | (1.06) | (1) | (1.06) | (0.86) | (0.82) | (0.97) |
mushroom | 86.0 | 80.0 | 77.0 | 85.0 | 94.0 | 75.0 | 76.0 | 72.0 | 65.0 |
DUR | (0.91) | (0.85) | (0.82) | (0.9) | (1) | (0.8) | (0.81) | (0.77) | (0.69) |
vote | 100.0 | 100.0 | 100.0 | 76.0 | 88.0 | 35.0 | 100.0 | 100.0 | 77.0 |
DUR | (1.14) | (1.14) | (1.14) | (0.86) | (1) | (0.4) | (1.14) | (1.14) | (0.88) |
zoo | 82.0 | 72.0 | 77.0 | 80.0 | 93.0 | 83.0 | 71.0 | 82.0 | 96.0 |
DUR | (0.88) | (0.77) | (0.83) | (0.86) | (1) | (0.89) | (0.76) | (0.88) | (1.03) |
Mean | 79.67 | 76.83 | 73.83 | 90.17 | 87.67 | 76.5 | 77.17 | 79.17 | 77.17 |
Mean DUR | 0.9 | 0.87 | 0.82 | 1.05 | 1 | 0.87 | 0.87 | 0.9 | 0.88 |
Median DUR | 0.9 | 0.82 | 0.88 | 0.98 | 1 | 0.92 | 0.84 | 0.85 | 0.9 |
Mean accuracies of all algorithms using the EC row selector at the minimum target budget
dataset | exp3c.ec | fel.ec | rsfl.ec | rand.ec | br.ec | abr2.ec | wbr2.ec | rbr2.ec |
---|---|---|---|---|---|---|---|---|
breast-cancer | 0.698 | 0.676 | 0.632 | 0.693 | 0.690 | 0.693 | 0.659 | 0.675 |
colic-Nominalized | 0.644 | 0.652 | 0.635 | 0.640 | 0.609 | 0.635 | 0.631 | 0.676 |
kr-vs-kp | 0.569 | 0.567 | 0.558 | 0.563 | 0.558 | 0.572 | 0.575 | 0.555 |
mushroom | 0.832 | 0.831 | 0.817 | 0.820 | 0.834 | 0.837 | 0.809 | 0.841 |
vote | 0.883 | 0.879 | 0.881 | 0.880 | 0.885 | 0.888 | 0.879 | 0.886 |
zoo | 0.769 | 0.765 | 0.771 | 0.776 | 0.781 | 0.790 | 0.778 | 0.774 |
Mean accuracies of all algorithms using the EN row selector at the minimum target budget
dataset | exp3c.en | fel.en | rsfl.en | rand.en | br.en | abr2.en | wbr2.en | rbr2.en |
---|---|---|---|---|---|---|---|---|
breast-cancer | 0.647 | 0.626 | 0.630 | 0.592 | 0.679 | 0.691 | 0.694 | 0.692 |
colic-Nominalized | 0.639 | 0.630 | 0.633 | 0.617 | 0.624 | 0.679 | 0.652 | 0.664 |
kr-vs-kp | 0.566 | 0.550 | 0.566 | 0.553 | 0.565 | 0.554 | 0.557 | 0.575 |
mushroom | 0.821 | 0.811 | 0.814 | 0.817 | 0.817 | 0.819 | 0.823 | 0.843 |
vote | 0.884 | 0.879 | 0.890 | 0.862 | 0.884 | 0.884 | 0.880 | 0.868 |
zoo | 0.762 | 0.784 | 0.748 | 0.752 | 0.783 | 0.792 | 0.775 | 0.769 |
Mean accuracies of all algorithms using the UR row selector at the minimum target budget
dataset | exp3c.ur | fel.ur | rsfl.ur | rand.ur | br.ur | abr2.ur | wbr2.ur | rbr2.ur |
---|---|---|---|---|---|---|---|---|
breast-cancer | 0.682 | 0.692 | 0.660 | 0.670 | 0.674 | 0.667 | 0.677 | 0.681 |
colic-Nominalized | 0.666 | 0.672 | 0.669 | 0.657 | 0.656 | 0.667 | 0.656 | 0.671 |
kr-vs-kp | 0.575 | 0.571 | 0.564 | 0.555 | 0.558 | 0.569 | 0.572 | 0.563 |
mushroom | 0.820 | 0.827 | 0.817 | 0.809 | 0.826 | 0.827 | 0.824 | 0.844 |
vote | 0.869 | 0.872 | 0.878 | 0.839 | 0.889 | 0.872 | 0.865 | 0.875 |
zoo | 0.792 | 0.779 | 0.753 | 0.757 | 0.779 | 0.794 | 0.780 | 0.763 |
Mean accuracies of Exp3C, FEL, RSFL, and Rand with the EC, EN, and UR row selectors at the minimum target budget
dataset | exp3c.ec | exp3c.en | exp3c.ur | fel.ec | fel.en | fel.ur | rsfl.ec | rsfl.en | rsfl.ur | rand.ec | rand.en | rand.ur |
---|---|---|---|---|---|---|---|---|---|---|---|---|
breast-cancer | 0.698 | 0.638 | 0.689 | 0.675 | 0.618 | 0.692 | 0.664 | 0.660 | 0.668 | 0.698 | 0.595 | 0.675 |
colic-Nominalized | 0.672 | 0.647 | 0.667 | 0.673 | 0.630 | 0.668 | 0.669 | 0.673 | 0.670 | 0.672 | 0.632 | 0.668 |
kr-vs-kp | 0.569 | 0.568 | 0.575 | 0.573 | 0.563 | 0.576 | 0.569 | 0.568 | 0.570 | 0.574 | 0.568 | 0.570 |
mushroom | 0.841 | 0.844 | 0.837 | 0.844 | 0.819 | 0.837 | 0.843 | 0.841 | 0.842 | 0.843 | 0.842 | 0.833 |
vote | 0.882 | 0.887 | 0.879 | 0.880 | 0.887 | 0.880 | 0.880 | 0.890 | 0.886 | 0.883 | 0.888 | 0.886 |
zoo | 0.787 | 0.772 | 0.790 | 0.775 | 0.796 | 0.785 | 0.797 | 0.758 | 0.785 | 0.795 | 0.775 | 0.756 |
Mean accuracies of BR, ABR2, WBR2, and RBR2 with the EC, EN, and UR row selectors at the minimum target budget
dataset | br.ec | br.en | br.ur | abr2.ec | abr2.en | abr2.ur | wbr2.ec | wbr2.en | wbr2.ur | rbr2.ec | rbr2.en | rbr2.ur |
---|---|---|---|---|---|---|---|---|---|---|---|---|
breast-cancer | 0.694 | 0.679 | 0.688 | 0.693 | 0.680 | 0.689 | 0.669 | 0.695 | 0.699 | 0.694 | 0.692 | 0.689 |
colic-Nominalized | 0.670 | 0.641 | 0.666 | 0.652 | 0.679 | 0.663 | 0.661 | 0.669 | 0.676 | 0.676 | 0.649 | 0.655 |
kr-vs-kp | 0.559 | 0.573 | 0.556 | 0.573 | 0.562 | 0.579 | 0.575 | 0.557 | 0.572 | 0.558 | 0.575 | 0.555 |
mushroom | 0.844 | 0.832 | 0.838 | 0.847 | 0.835 | 0.835 | 0.826 | 0.843 | 0.841 | 0.840 | 0.843 | 0.839 |
vote | 0.861 | 0.871 | 0.889 | 0.886 | 0.887 | 0.882 | 0.884 | 0.887 | 0.886 | 0.886 | 0.870 | 0.880 |
zoo | 0.786 | 0.790 | 0.781 | 0.790 | 0.791 | 0.784 | 0.785 | 0.787 | 0.795 | 0.791 | 0.775 | 0.768 |
4.3 Comparisons of algorithms using Wilcoxon tests
We then performed a Wilcoxon signed rank test (Wilcoxon 1945) to compare every pair of algorithms from a group of algorithms at their minimum target budget. The null hypothesis is \(H_{0}: \theta = 0\), meaning that there is no difference between the classification accuracies achieved by the two algorithms at a given budget. Each algorithm has 100 achieved accuracies (10 iterations times 10 folds) at a given budget. Let \(A_{i}\) and \(B_{i}\) be the i-th achieved accuracies of the two algorithms, and let \(Z_{i} = A_{i} - B_{i}\) for \(i = 1, \ldots, 100\). The Wilcoxon signed rank test procedure is as follows: (1) Exclude observations with \(Z_{i} = 0\), and let m be the reduced sample size. (2) Sort the absolute values \(|Z_{1}|, \ldots, |Z_{m}|\) in ascending order, and let \(R_{i}\) be the rank of each non-zero \(|Z_{i}|\) (the smallest \(|Z_{i}|\) gets rank 1, and tied values are assigned their mean rank). (3) Mark the positive differences with \(\varphi_{i} = I(Z_{i} > 0)\), where \(I(\cdot)\) is an indicator function: \(\varphi_{i} = 1\) for \(Z_{i} > 0\), and \(\varphi_{i} = 0\) otherwise. (4) The Wilcoxon signed rank statistic \(W_{+}\) is defined as \(W_{+} = \sum_{i=1}^{m} \varphi_{i} R_{i}\). Define \(W_{-}\) similarly by summing the ranks of the negative differences \(Z_{i}\). (5) Calculate S as the smaller of these two rank sums: \(S = \min(W_{+}, W_{-})\). (6) Find the critical value for the given sample size m and the desired significance level. (7) Compare S to the critical value, and reject \(H_{0}\) if S is less than or equal to the critical value.
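Steps (1) through (5) of this procedure can be sketched as follows. This is a minimal pure-Python illustration with a hypothetical function name, not the code used in the experiments. It stops at the statistic S and leaves the critical value lookup of steps (6) and (7) to a table or a statistics package.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, S, m) for paired samples a and b."""
    # Step 1: drop pairs whose difference Z_i is zero; m is the reduced size.
    z = [x - y for x, y in zip(a, b) if x != y]
    m = len(z)

    # Step 2: rank |Z_i| in ascending order, giving tied values their mean rank.
    order = sorted(range(m), key=lambda i: abs(z[i]))
    ranks = [0.0] * m
    start = 0
    while start < m:
        end = start
        while end + 1 < m and abs(z[order[end + 1]]) == abs(z[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1

    # Steps 3 and 4: sum the ranks of positive and of negative differences.
    w_plus = sum(r for d, r in zip(z, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(z, ranks) if d < 0)

    # Step 5: the test statistic is the smaller rank sum.
    return w_plus, w_minus, min(w_plus, w_minus), m
```

As a sanity check, the two rank sums always add up to m(m+1)/2, since every non-zero difference contributes its rank to exactly one of them. Exact equality tests on floating-point accuracies may miss ties that differ only by rounding, so real accuracy data may need rounding before being passed in.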
The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo. “++0−−+” means that the left side algorithm, compared to the top side algorithm, is significantly better, significantly better, no significant difference, significantly worse, significantly worse, and significantly better at a p<0.05 level for the 6 data sets
The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo
The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo
In our second set of experiments using Wilcoxon-based tests, we held each column selector fixed and compared row selectors. For each fixed column selector, on each data set we pairwise-compared via a Wilcoxon test all three row selectors, and counted wins. In most cases, there was an overall winning row selector, but nothing that consistently and dramatically stood out from the other row selectors. (The only exception was RBR2.en, which stood out next to RBR2.ec and RBR2.ur with 7 wins and 0 losses.) Further, the winning row selector varied with the column selector used.
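The win counting described here can be illustrated with a short sketch. It assumes the per-data-set outcome strings use the "+", "−", and "0" encoding of the comparison tables. The function name and the outcome strings in the usage example are hypothetical and are not taken from the paper's results.

```python
from collections import Counter

def tally_wins(outcomes):
    """Count wins and losses per algorithm.

    outcomes maps (left, right) to a string with one symbol per data set:
    '+' if left is significantly better, '-' if significantly worse,
    '0' if there is no significant difference.
    """
    wins, losses = Counter(), Counter()
    for (left, right), symbols in outcomes.items():
        # Accept the typographic minus sign as well as the ASCII hyphen.
        for symbol in symbols.replace("\u2212", "-"):
            if symbol == "+":
                wins[left] += 1
                losses[right] += 1
            elif symbol == "-":
                wins[right] += 1
                losses[left] += 1
    return wins, losses

# Hypothetical outcomes for one column selector "x" over six data sets.
example = {
    ("x.en", "x.ec"): "++0--+",
    ("x.en", "x.ur"): "000000",
    ("x.ec", "x.ur"): "0+0000",
}
wins, losses = tally_wins(example)
```

With three row selectors and six data sets, each row selector takes part in twelve comparisons, so win and loss counts are best read relative to that total.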
The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo
In conclusion, BR and ABR2 stand out among algorithms using the EC row selector; ABR2 and Exp3C stand out among algorithms using the EN row selector; and ABR2 and RSFL stand out among algorithms using the UR row selector. While certain row selectors stood out for specific column selectors, no single row selector consistently stood out from the others in our experiments. Comparing the algorithms which are combinations of column selectors and row selectors, ABR2 with EN performs the best.
Generally speaking, ABR2 stands out among algorithms no matter which row selector is used. ABR2 is based on the belief that if purchasing an (instance, feature) pair causes a significant hypothesis change, then that feature is likely to be useful in building a classifier. Our experimental results showed that this belief is a good heuristic. RBR2 and WBR2 judge whether a hypothesis change is significant by comparing it to previous changes, whereas ABR2 measures the absolute change, and in our experiments the absolute measure was more effective than the relative and window-based ones. Finally, ABR2 with the EN row selector works best by far among all algorithms.
5 Conclusions and future work
We presented new algorithms for the budgeted learning problem (choosing which attribute to purchase at each step), many of which improve on the state of the art, for example ABR2, WBR2, Exp3C, and FEL. We also described variations on methods for selecting the row (instance) whose attribute is purchased, which choose the row about which the current model is most uncertain. Among algorithms using the same row selector, BR and ABR2 stand out with the EC row selector; ABR2 and Exp3C stand out with the EN row selector; and ABR2 and RSFL stand out with the UR row selector. When the column selector is held fixed and the row selector is varied, no single row selector consistently stood out from the others. When comparing all algorithms, ABR2 with all row selectors, WBR2 and Exp3C with the EC row selector, and FEL with the UR row selector perform well. ABR2 with EN was the overall best performer.
There are several directions for future work. First, while there are some theoretical results for the coins problem, there are no learning-theoretic results (e.g., PAC-style results) for the general budgeted learning problem of learning a hypothesis when the features of the training data have to be purchased at a cost, subject to an overall budget. Our application of Auer et al.’s multi-armed bandit algorithms to this problem (Sect. 3.1) may ultimately yield such a result. However, to obtain a PAC-style bound for our algorithm, one needs to relate regret bounds or one-shot bounds such as those in Bubeck et al. (2008) to generalization error bounds of the final model.
Other future work includes extending the basic budgeted learning model in the context of bandit-based algorithms, in particular the budgeted learning of bounded active classifiers (BACs) (Kapoor and Greiner 2005a, 2005b). Further future work is to treat class labels and attribute pairs as bandits by separating the rewards for different class labels. Finally, exploring different base learners other than naïve Bayes (such as support vector machines) is another direction for future research.
These conditions are based only on the choice of parameters, not any statistical properties of the slot machines.
We tried several reward functions: GINI index, expected classification error, and conditional entropy. Classification error on the training set tended to work best for our algorithms, so those are the results that we present in Sect. 4.
If a class label is not chosen at this stage, then the row selection strategy will randomly choose a class label according to the class distribution.
While the best performers in terms of mean and median statistics were often the best in terms of the Wilcoxon statistic, this was not always the case. Inconsistencies occurred when a majority of the accuracies for algorithm X in cross validation were better than those for algorithm Y, but the mean of the accuracies for Y was better than that for X, due to outlying accuracy values.
Acknowledgements
The authors thank the reviewers for their helpful comments. This material is based upon work supported by the National Science Foundation under grant number 0743783.