Machine Learning, Volume 90, Issue 1, pp 59–90

New algorithms for budgeted learning

  • Kun Deng
  • Yaling Zheng
  • Chris Bourke
  • Stephen Scott
  • Julie Masciale

DOI: 10.1007/s10994-012-5299-2

Cite this article as:
Deng, K., Zheng, Y., Bourke, C. et al. Mach Learn (2013) 90: 59. doi:10.1007/s10994-012-5299-2

Abstract

We explore the problem of budgeted machine learning, in which the learning algorithm has free access to the training examples’ class labels but has to pay for each attribute that is specified. This learning model is appropriate in many areas, including medical applications. We present new algorithms for choosing which attributes to purchase of which examples, based on algorithms for the multi-armed bandit problem. We also evaluate a group of algorithms based on the idea of incorporating second-order statistics into decision making. Most of our algorithms are competitive with the current state of the art and performed better when the budget was highly limited (in particular, our new algorithm AbsoluteBR2). Finally, we present new heuristics for selecting an instance to purchase after the attribute is selected, instead of selecting an instance uniformly at random, which is typically done. While experimental results showed some performance improvements when using the new instance selectors, there was no consistent winner among these methods.

Keywords

Budgeted learning · Multi-armed bandit

1 Introduction

Approaches to typical machine learning applications usually operate under the assumption that data are freely available. That is, it is usually taken for granted that numerous fully-specified instances, along with their labels, are available to build a classifier. However, in many real-world applications, this assumption is far from realistic. Instead, collecting and specifying data may be very time-consuming and costly.

An area in machine learning that attempts to address the problem of learning in the absence of labels is active learning (Settles 2009). Rather than a passive learner that simply builds a hypothesis based on the available attribute/label pairs, an active learner must also choose which instances it wants an oracle to label. Many active learning algorithms and applications aim to reduce the real labeling costs (e.g., annotation time or cost of materials to acquire labels) while achieving competitive performance versus passive learning. In contrast, the theoretical literature mainly focuses on the label complexity—how many label purchases are required to learn asymptotically in the case for which all values of the attributes of the datasets are fully specified.

An alternative line of research called budgeted learning or active feature acquisition (Lizotte et al. 2003; Madani et al. 2004; Melville et al. 2004, 2005; Saar-Tsechansky et al. 2009) focuses on a model that is related to active learning. In budgeted learning, a learner considers instances in which the class labels are specified, but some or all of the attribute values are not. Instead, the learner can purchase an attribute value for a specific instance at some fixed cost, subject to an overall budget. The challenge to the learner is to decide which attributes of which instances will provide the best model from which to learn. The original motivation for the budgeted learning model came from medical applications in which the outcome of a treatment, drug trial, or control group (labels of an instance) is known and the features (results of running medical tests) are each available for a price. For example, consider a project that was allocated $2 million to develop a diagnostic classifier for patient cancer subtypes. In this study, a pool of patients with known cancer subtypes was available, as were various diagnostic tests that could be performed, each with a cost. Because each test was expensive, the learning algorithm had to choose carefully which patients would get which tests in order to learn a good cancer subtype classifier within the $2 million budget.

Some early results of our work appeared in Deng et al. (2007), in which we presented new algorithms for choosing which attributes of which examples to purchase in the budgeted learning model. Several of our algorithms were based on results in the “multi-armed bandit” model. In this model, we have N slot machines that we may play for some fixed number of rounds. At each round, one must decide which single slot machine to play in order to maximize total reward over all rounds. Our first two algorithms were based on the algorithm Exp3 of Auer et al. (2002b) originally designed for the multi-armed bandit problem, and our third is based on the “follow the perturbed leader” approach of Kalai and Vempala (2005). In this paper, we describe this preliminary work and then present three new algorithms, which are variations of Biased Robin (Kapoor and Greiner 2005b), by incorporating second-order statistics into the decision-making process. We refer to our algorithms that use second-order statistics as Biased Robin 2 (BR2) series algorithms. Our results show that most of our proposed algorithms perform well when compared with the existing algorithms, in particular, AbsoluteBR2 (ABR2).

Previous budgeted learning algorithms that use naïve Bayes as the base classifier focus only on selecting attributes to purchase and then choose uniformly at random the instance whose attribute is to be purchased. In other words, such algorithms consider one instance to be as good as any other (given the class label) for querying the chosen attribute. In this paper, we extend our prior work by experimenting with strategies to select instances as well as attributes, choosing instances that are most wrongly predicted or that are least certain in the current model. (Melville et al. (2004) and Melville et al. (2008) proposed using US (Uniform Sampling) and ES (Error Sampling) to consider a subset of the instances instead of all of them. Their sampling is done before choosing an (instance, feature) pair, which differs from our row selection procedure, which is applied after a feature has been chosen.) While experimental results showed some performance improvements when using the new instance selectors, there was no consistent winner among these methods.

2 Background and related work

Budgeted learning is related to conventional machine learning techniques in that the learner is given a set D of labeled training data and then infers a classifier (hypothesis, or model) to label new examples. The key difference is that, in budgeted learning, there is an overall budget, and the learner has to use this given budget to learn as good a classifier or hypothesis as possible. There is a body of work in the data mining community by the name of “active feature acquisition” (Zheng and Padmanabhan 2002; Melville et al. 2004, 2005; Lomasky et al. 2007; Saar-Tsechansky et al. 2009), in which the features are purchased at a cost and the only difference is that their goal is to learn a hypothesis using as little cost as possible (i.e., no strict budget). For example, Saar-Tsechansky et al. (2009) proposed to use Log Gain—the expected likelihood of the training set—as a smoother measure of goodness of an attribute purchase. This idea was somewhat similar to some of the measures (conditional entropy and GINI index) that we have used for several budgeted learning algorithms. They also consider for purchase only those instances that are misclassified by the learned classifier instead of all of the instances to reduce the search space.

The difference between active feature acquisition and budgeted learning is that budgeted learning usually has a hard budget set up front, while active feature acquisition does not. A minor distinction is that in some applications of active feature acquisition (Zheng and Padmanabhan 2002; Lomasky et al. 2007), individual instance/feature values cannot be bought one at a time. Instead, all the missing attributes of an instance must be synthesized or obtained as a whole.

One of the original applications of budgeted learning was in the medical domain (Madani et al. 2004). In this application, the examples are patients, and the (known) class label of patient x is y∈{−1,+1}, indicating whether or not x responded to a particular treatment. The (initially unknown) attributes of x are the results of tests performed on tissue samples gathered from patient x, e.g., a blood test for the presence of a particular antibody. In this case, any attribute of any instance can be determined. However, each costs time and money, and there is a finite budget limiting the attribute values that one can buy. Further, each attribute can cost a different amount of money, e.g., a blood test may cost less than a liver biopsy.

A second application of budgeted learning is in customer modeling. A company has significant data (attributes) on its own customers but may have the option to pay for other information on the same customers from other companies. Though they do not explicitly refer to it as “budgeted learning,” Zheng and Padmanabhan (2002) discuss this problem as applied to learning models of customers to web services, where a customer’s browsing history at a local site (e.g., Travelocity) is known, but the same customer’s history at other sites (e.g., Expedia) is not, though by assumption it could be purchased from these competing sites. Zheng and Padmanabhan evaluated two heuristic approaches to this problem, both of which are based on the idea of how much additional information the unspecified attributes can provide. Their first algorithm (AVID) imputes, at each iteration, the values of the unspecified features based on the specified ones. It does this multiple times and then considers the feature with the highest variance to be the least certain and thus the best one to purchase. Their second algorithm (GODA) also imputes values of the unspecified features, but then it selects for purchase the unspecified feature that maximizes expected “goodness” based on a specified function (e.g., training error). In their work, they assume that all attributes’ values are purchased at the same time rather than as individual (instance, feature) pairs.

In other work, Veeramachaneni et al. (2006) studied the problem of budgeted learning for the purpose of detecting irrelevant features rather than building a classifier. Specifically, let θ∈Θ parameterize the probability distribution over \(\mathcal{Z} \times\mathcal{X} \times \mathcal{J}\), where \(\mathcal{Z}\) is the space of attributes that are always known to the learner, \(\mathcal{X}\) is the space of attributes that can be queried by the learner, and \(\mathcal{J}\) is the space of labels. Their goal was to accurately learn some function g(θ) (e.g., the true classification error rate of a model) by querying as few unknown attributes as possible. Their algorithm purchased attributes that were expected to induce maximum change in the estimate of g. As with other budgeted learning methods, this approach inherently seeks out attributes that are more relevant to estimating g.

Related to budgeted learning is the learning of “bounded active classifiers,” in which one has fully specified training examples with their labels, but the final hypothesis h must pay for attributes of new examples when predicting a label, spending at most \(B_h\). This, of course, can be combined with budgeted learning to yield the budgeted learning of bounded active classifiers. Such a learning algorithm has been termed a “budgeted bounded-active-classifier learner” (bBACl) (Kapoor and Greiner 2005b, 2005a; Greiner et al. 2002). For simplicity, in our work we focus on budgeted learning of unbounded classifiers, leaving the extension of our results to the bBACl problem as future work.

Budgeted learning falls under a general framework of problems that represent a trade-off between exploration and exploitation in an online decision process. In this framework, a learner has several actions it may choose, each action with a corresponding payoff. Initially, the learner starts with little or no knowledge and must spend some time gathering information about which attributes are most relevant, reflecting the need for exploration. At some point, the learner begins exploitation by purchasing more values of attributes that are known to be informative (assuming that a particular attribute is equally informative for all instances), in an attempt to maximize its expected reward. In budgeted learning, the reward is the performance of the final classifier when the budget is exhausted. Clearly, the budgeted learner must purchase a variety of attributes in order to form a more complete model of the underlying problem domain and also figure out which attributes are more informative, thus showing the importance of exploration in budgeted learning. Exploration and exploitation can and often do operate at the same time. In other words, there may be no clean-cut boundaries between these two facets. For example, to explore more efficiently (which is crucial when the budget is limited), we often need to utilize/exploit what is already known to decide which attribute to purchase next. On the other hand, the result of exploration, i.e., the rewards received during the process, are valuable for deciding which attributes should be purchased more often in order to build a good final classifier.

Most theoretical studies of budgeted learning are based on related problems within the exploration versus exploitation framework, such as the coins problem and the multi-armed bandit problem (see Sect. 2.3). In the coins problem, one is given N biased coins {c1,…,cN}, where ci’s probability of heads is distributed according to some known prior Ri (though we say “coins,” the problem can be generalized to non-binary outcomes). In each round, a learner selects a coin to flip at unit cost. After exhausting its budget, the learner must choose a single coin to flip ad infinitum, and its payoff is the number of heads that the coin yields. The goal is to choose the coin that has the highest probability of heads, and performance is measured by the regret incurred by the learner’s choice, i.e., the expected amount that the learner could have done better by choosing the optimal coin. Madani et al. (2004) showed that this problem can be modeled as one of sequential decision making under uncertainty, and as such can be solved via dynamic programming when the number of coins is taken to be constant. However, the complexity of such a dynamic programming solution is exponential, and in fact this problem is NP-hard under specific conditions. They argued that straightforward heuristics such as “Round Robin” (repeatedly cycle through each coin until the budget is exhausted) do not have any constant approximation guarantees. That is, for any constant ℓ there is a problem with minimum regret r such that the regret of Round Robin is greater than ℓr. Kapoor and Greiner (2005c) empirically evaluated common reinforcement learning techniques on this problem. Guha and Munagala (2007) were able to design a 4-approximation algorithm for the general coins problem using a linear programming rounding technique. Goel et al. (2009) presented an algorithm that also guaranteed a constant factor approximation. Their algorithm was based on a “ratio index” (analogous to the Gittins index) such that a single number can be computed for each arm and the arm with the highest index is chosen for experimentation.

Recently, budgeted learning has been studied in the context of sampling as few times as possible to minimize the maximal variance of all the arms in a multi-armed bandit problem (Antos et al. 2008). Further, in Li (2009), the goal is to minimize the expected risk of the parameters in a generative Bayesian network, with the risk chosen to be the expected KL divergence of the parameters from their expectations. Finally, Goel et al. (2006) studied optimization problems with inputs that are random variables, where the only available data are samples of the distribution.

In the following sections, we give detailed descriptions of algorithms on which we base our contributions. Section 2.1 explains Biased Robin (Lizotte et al. 2003). Section 2.2 describes RSFL (Kapoor and Greiner 2005b). Finally, Sect. 2.3 explains bandit-based algorithms (Auer et al. 2002b; Kalai and Vempala 2005).

2.1 Biased Robin

One of the simplest early budgeted learning algorithms was Biased Robin (BR). BR is similar to a Round Robin approach (where first attribute 1 is purchased once, then attribute 2, and so on to attribute N, then repeat), except that attribute i is repeatedly purchased until such purchases are no longer “successful,” and then the algorithm moves on to attribute i+1. Success is measured by a feedback function, examples of which are described later. Despite its simplicity, BR is a solid performer in training naïve Bayes classifiers (Lizotte et al. 2003; Kapoor and Greiner 2005b).
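
The control flow of Biased Robin can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the callback `purchase` is a hypothetical stand-in that buys one value of attribute `i` and reports whether the feedback function judged the purchase successful.

```python
from typing import Callable, List


def biased_robin(n_attributes: int, budget: int,
                 purchase: Callable[[int], bool]) -> List[int]:
    """Return the sequence of attribute indices purchased.

    `purchase(i)` is a hypothetical callback: it buys one value of
    attribute i and returns True if the purchase was "successful"
    according to some feedback function.
    """
    history: List[int] = []
    i = 0
    for _ in range(budget):
        history.append(i)
        if not purchase(i):
            # Unsuccessful purchase: move on to the next attribute,
            # wrapping around after the last one.
            i = (i + 1) % n_attributes
    return history
```

With a feedback function that always returns False, the loop reduces to plain Round Robin.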

2.2 Single-Feature Lookahead

Single Feature Lookahead (SFL) is a method introduced by Lizotte et al. (2003). Using the posterior class distribution of its naïve Bayes model, one can compute the expected loss of any sequence of actions. Given sufficient computational resources and the ability to compute expected loss, one can achieve Bayes optimality by considering all possible future actions contingent on all possible outcomes along the way (Wang et al. 2005). Because this is too computationally intensive, SFL and similar approaches restrict the space of future models they consider, by restricting the class of policies considered. In SFL, each (label, attribute) pair is associated with an expected loss incurred by spending the entire remaining budget on that pair. This expected loss is computed using the current posterior naïve Bayes model, which, given an allocation, gives the distribution over future models. Expected loss is computed over this distribution. The pair with the lowest loss is then purchased once. SFL introduces a lookahead into the state space S of all possible purchases without explicitly expanding the entire state space. At any point in the algorithm’s execution, one has an allocation α: a description of how many feature values have been purchased from instances with a certain class label. Specifically, \(\alpha_{ijk}\) is a count of the number of times attribute i has been purchased from an instance with class label j with resulting attribute value k. (Thus, SFL requires the possible attribute values to be discrete.) Given a current allocation, SFL calculates the expected loss of performing all possible single-purchase actions (purchasing an attribute/label pair which results in a specific attribute value) and chooses the action that minimizes the expected loss of the resulting allocation. In a randomized version of SFL called RSFL (Kapoor and Greiner 2005b), the next (label, attribute) pair to purchase is sampled from the Gibbs distribution induced by the SFL losses.

In SFL and RSFL, the expectation is over all belief states that can be reached using the given allocation:
$$\sum_{\alpha'} P(\alpha') \mathrm{Loss}(\alpha') $$
where the loss of a possible new state (represented by the new allocation α′ after the purchase) is weighted by its probability of occurrence.
Several heuristics have been considered for the loss function, including the GINI index (Lizotte et al. 2003) and expected classification error (Kapoor and Greiner 2005b). The GINI index is defined as
$$ \sum_{j\in J}\sum _{\vec{x}\in X} P(\vec{x}) P(j\mid\vec{x}) \bigl(1-P(j\mid\vec{x})\bigr) $$
(1)
where J is the set of all possible labels and \(\vec{x}\) is a feature vector drawn from an instance space X. The expected classification error with respect to an attribute ai is defined as
$$ \sum_k P(a_i=k) \min_{j\in J}\bigl(1-P(j\mid a_i=k)\bigr) $$
(2)
where the sum is taken over all possible attribute values k for attribute ai.
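
As a small illustration of Eqs. (1) and (2), the two loss heuristics can be computed directly from probability tables. This is a sketch under the assumption that the probabilities are already available as plain lists; the function and parameter names are hypothetical and do not come from the original work.

```python
from typing import Sequence


def gini_index(p_x: Sequence[float],
               posteriors: Sequence[Sequence[float]]) -> float:
    """Eq. (1): sum over x and j of P(x) * P(j|x) * (1 - P(j|x)).

    p_x[n] is P(x_n); posteriors[n][j] is P(j | x_n).
    """
    return sum(px * sum(p * (1.0 - p) for p in post)
               for px, post in zip(p_x, posteriors))


def expected_classification_error(
        p_value: Sequence[float],
        posteriors: Sequence[Sequence[float]]) -> float:
    """Eq. (2): sum over k of P(a_i = k) * min_j (1 - P(j | a_i = k)).

    p_value[k] is P(a_i = k); posteriors[k][j] is P(j | a_i = k).
    """
    return sum(pk * min(1.0 - p for p in post)
               for pk, post in zip(p_value, posteriors))
```

Both quantities are zero when every posterior puts all of its mass on one label, and they grow as the posteriors approach uniform.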
In our experiments, we found that Randomized SFL and Biased Robin performed their best when using conditional entropy (as used by Kapoor and Greiner 2005b):
$$ \mathsf{CE}(i,j) = -\sum _k P(a_i = k) \sum_j P(j \mid a_i=k) \log_2\bigl(P(j \mid a_i=k) \bigr) $$
(3)
Conditional entropy measures the uncertainty of the class label given the value of an attribute.
In our experiments, we compared our algorithms against RSFL, a randomized version of SFL (Kapoor and Greiner 2005b) because straight SFL can degenerate into Biased Robin, exhausting its budget on the current best attribute (Kapoor and Greiner 2005b). Such behavior was also observed in the context of the coins problem (Madani et al. 2004) and we experienced similar results in our experiments with pure SFL. In RSFL, the conditional entropy is used to define a Gibbs distribution for which we choose the ith attribute from an instance with class label j with probability
$$\frac{\exp(-\mathsf{CE}(i,j))}{\sum_{i,j} \exp(-\mathsf{CE}(i,j))} $$
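
The conditional entropy of Eq. (3) and the Gibbs distribution above can be sketched as follows. The sketch assumes the conditional entropy scores have already been computed for each (attribute, label) pair and stored in a dictionary; all names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def conditional_entropy(p_value: Sequence[float],
                        posteriors: Sequence[Sequence[float]]) -> float:
    """Eq. (3): -sum_k P(a_i = k) * sum_j P(j|a_i = k) * log2 P(j|a_i = k).

    Terms with zero probability contribute nothing (0 * log 0 := 0).
    """
    total = 0.0
    for pk, post in zip(p_value, posteriors):
        total -= pk * sum(p * math.log2(p) for p in post if p > 0.0)
    return total


def gibbs_distribution(
        ce: Dict[Tuple[int, int], float]) -> Dict[Tuple[int, int], float]:
    """Map scores CE(i, j) to probabilities proportional to exp(-CE(i, j))."""
    weights = {pair: math.exp(-score) for pair, score in ce.items()}
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}
```

A pair can then be drawn with, for example, `random.choices(list(d), weights=list(d.values()))` where `d` is the returned dictionary; pairs with lower conditional entropy are sampled more often.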

2.3 The multi-armed bandit problem

There are close connections between budgeted learning and the so-called “multi-armed bandit problem” first studied by Robbins (1952). The problem can be stated as follows (Gittins 1979): there are N arms, each having an unknown success probability of emitting a unit reward. The success probabilities of the arms are assumed to be independent of each other. The objective is to pull arms sequentially so as to maximize the total reward. Many policies have been proposed for this problem under the independent-arm assumption (Lai and Robbins 1985; Auer et al. 2002b). The key difference between budgeted learning and the multi-armed bandit problem is that in the latter, one tries to maximize the cumulative rewards over all pulls, whereas with budgeted learning, one simply wants to maximize the accuracy of the resulting classifier or a “simple” reward in some sense.

2.3.1 Exp3 algorithm

In the context of approaching budgeted learning as a multi-armed bandit problem, we apply results from Auer et al. (2002b). Their most basic algorithm (Exp3) maintains a weight wi (initialized to 1) for each of the N arms. At each trial, it plays machine i with probability
$$P(i) = \frac{\gamma}{N} + (1-\gamma) \frac{w_i}{\sum_{n=1}^N w_n} $$
where γ is a parameter governing the mixture between the weight-based distribution (controlling exploitation) and the uniform distribution (allowing for exploration). After playing the chosen machine (call it i), the reward r is returned, which is used to update the weight wi by multiplying it by exp(γr/(P(i)N)) (all other weights are unchanged).

Auer et al. proved that under appropriate conditions,\(^{1}\) the expected total reward of Exp3 will not differ from the best possible by more than \(2.63 \sqrt{g N \ln N}\), where g is an upper bound on the total reward of the best sequence of choices.
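
The sampling and update rules of Exp3 described above can be written compactly. The following is a minimal sketch, assuming rewards in [0, 1] supplied by a hypothetical callback `reward_fn`; it is meant to illustrate the two formulas rather than to reproduce any particular implementation.

```python
import math
import random
from typing import Callable, List


def exp3(n_arms: int, n_rounds: int, gamma: float,
         reward_fn: Callable[[int], float],
         rng: random.Random) -> List[float]:
    """Run Exp3 for n_rounds and return the final weights.

    `reward_fn(arm)` is a hypothetical callback returning a reward
    in [0, 1] for the arm that was played.
    """
    weights = [1.0] * n_arms
    for _ in range(n_rounds):
        total = sum(weights)
        # Mixture of the uniform distribution and the weight-based one.
        probs = [gamma / n_arms + (1.0 - gamma) * w / total for w in weights]
        arm = rng.choices(range(n_arms), weights=probs, k=1)[0]
        reward = reward_fn(arm)
        # Only the played arm's weight is updated.
        weights[arm] *= math.exp(gamma * reward / (probs[arm] * n_arms))
    return weights
```

For long runs the weights can grow very large, so a practical version would probably renormalize them periodically; that detail is omitted here.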

2.3.2 FPL algorithm

The “follow the perturbed leader” (FPL) algorithm (Kalai and Vempala 2005), although originally designed as an online expert-based algorithm, is applicable to the multi-armed bandit problem. The idea is to select the most informative arm by selecting the best arm thus far, hence “follow the leader.” At each time step, a cost (or equivalently reward, as they are inversely related) is counted toward the arm selected. At time t, the cumulative cost of each arm can be calculated, and the arm that has incurred the least cost is chosen for the next pull. Without randomization, an adversary can easily trick such deterministic algorithms into wrong decisions. To address this problem, a random perturbation is added to the cost of each arm before making the decision, thus the name “follow the perturbed leader.” Similar to the results of Auer et al. (2002b), it can be shown that, under appropriate conditions, the regret of the perturbed leader algorithm is small relative to the best sequence of choices. Let \(\mbox{min-cost}_T\) be the total cost of the best single arm up to time T in hindsight. Then the expected cost of the perturbed leader can be bounded as
$$E[\mathrm{cost}_{\mbox{\scriptsize\textsc{fpl}}}] \leq(1+\epsilon) \mbox{min-cost}_T + \frac{\mathcal{O}(\log{N})}{\epsilon} $$
where ϵ is a user-specified parameter. The details of its adaptation to the budgeted learning setting are described in Sect. 3.2.

2.3.3 Other results

Recently, Bubeck et al. (2008) pointed out an interesting link between simple (one-shot) and cumulative (overall) regret, where regret is the difference between the reward of the algorithm in question and that of an optimal, omniscient algorithm. One of the surprising results is that an upper bound on the cumulative regret implies a lower bound on simple regret for all algorithms. In other words, the exploration-exploitation trade-offs are qualitatively different for these two types of regrets. In fact, according to Bubeck et al. (2008), one of the very successful algorithms (Auer et al. 2002a) for cumulative regret would perform worse than the uniform random algorithm when the budget goes to infinity. However, in their simulation study, this was not observed since in order for this to occur, the budget would have to be very large, to the point that the computed regrets would fall below the numerical precision of the computer used to run the simulations. Their results do not seem to be directly transferable to budgeted machine learning due to their assumptions. Another important contribution is that they pointed out that exploration (allocation) strategies can be different from the decision (recommendation) strategies in the pure exploratory multi-armed bandit problems.

2.3.4 Discussion

The theoretical guarantees in the work of Auer et al. (2002b) give us good motivation for using similar approaches to the budgeted learning problem. Auer et al. make no assumptions about the underlying distribution of slot machines and their results still hold even when the rewards are dynamic and may depend on previous sequences of rewards and the player’s random draws. Thus, we can plug in our choice of reward function for the slot machines (say, the negative conditional entropy of the class label with respect to an attribute) and their bounds automatically translate into guarantees in the budgeted learning context. However, these bounds only apply to the cumulative regret with respect to the best sequence of arm pulls. For instance, nothing is said by these bounds about the class label’s uncertainty with respect to an attribute upon the last round of purchases. In other words, it remains an open question whether under some conditions, one can bound the resulting training error with respect to the best set of purchases or the best possible error rate.

3 Our algorithms

For simplicity, all of our algorithms focus on a unit-cost model of budgeted learning, i.e., each attribute costs one monetary unit. Sections 3.1 and 3.2 present our two algorithms based on Exp3 and one algorithm based on “follow the perturbed leader” for multi-armed bandit problems. Section 3.3 presents our algorithms based on Biased Robin and second-order statistics. Section 3.4 presents our row selection policies.

3.1 Exp3-based algorithms (Exp3C and Exp3CR)

Our first two algorithms are based on the multi-armed bandit algorithm Exp3 of Auer et al. (2002b) described in Sect. 2.3.1. Our first algorithm (Exp3C, for “Exp3-Column”) treats each of the N attributes (columns) as an arm. Each column has a weight that is initialized to 1. Each purchasing round, Exp3C chooses an attribute (column) of some instance to buy based on the weights. Specifically, Exp3C chooses column i with probability
$$P(i) = \frac{\gamma}{N} + (1-\gamma) \frac{w_i}{\sum_{n=1}^N w_n} $$
where γ is a parameter. After the purchase, we build a new naïve Bayes model on the training set and Exp3C gets as a reward the classification accuracy of the naïve Bayes model evaluated on the partially specified training set.\(^{2}\) The Exp3C algorithm is presented as Algorithm 1.
Algorithm 1: Algorithm Exp3C

After choosing a column to purchase, a row (instance) must also be selected. After choosing a column i, Exp3C selects a row uniformly at random from all rows that do not yet have column i filled in.
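
The uniform row selection step can be illustrated with a short sketch. It assumes, purely for illustration, that the partially specified training set is stored as a list of rows in which an unpurchased attribute value is represented by `None`; this representation and the function name are hypothetical.

```python
import random
from typing import List, Optional


def pick_random_row(table: List[List[Optional[int]]], column: int,
                    rng: random.Random) -> Optional[int]:
    """Return a uniformly random row index whose `column` entry is
    still unpurchased (None), or None if the column is fully specified."""
    candidates = [r for r, row in enumerate(table) if row[column] is None]
    return rng.choice(candidates) if candidates else None
```

Returning `None` for an exhausted column lets the caller decide how to proceed, for example by sampling a different column.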

In our second algorithm (Exp3CR, for “Exp3-Column-Row”), we define a distribution over the rows as well as the columns, i.e., we now have two weight vectors instead of one. After choosing a column according to its distribution (which is done the same way as in Exp3C), our algorithm then chooses a row according to the row distribution. Once reinforcement is received, both weight vectors are updated independently of each other. Thus we replace the naïve Bayes assumption with a product distribution over the (column, row) pairs.

As previously mentioned, by directly applying the regret bounds of Auer et al. (2002b), we can easily bound the regret of Exp3C. Specifically, following the bound stated in Sect. 2.3.1 with the N attributes as arms, the expected total reward of Exp3C will be within \(2.63 \sqrt{g N \ln N}\) of that of the best sequence of attribute choices, where g is an upper bound on the total reward of this best sequence. It remains an open problem as to what kinds of bounds follow for Exp3CR.

3.2 Follow the expected leader (FEL)

Our third algorithm is a variation of the “follow the perturbed leader” (FPL) type algorithms due to Kalai and Vempala (2005). As with the previous two algorithms, we treat each attribute as an arm.

Let xi(t) be the cost of the ith attribute at time step t. At each time step t, FPL computes the sum of all costs of each arm, \(c_{i} = \sum_{q=1}^{t-1} x_{i}(q)\) and adds a perturbation factor (or noise) εi generated uniformly at random from [0,1/ϵ]. FPL then chooses to play the arm
$$\mathop{\mathrm{argmin}}\limits_{1\leq i\leq n} \{ c_i + \varepsilon_i\} $$
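
The FPL selection rule above amounts to a perturbed argmin. A minimal sketch, with hypothetical names and the cumulative costs passed in as a list:

```python
import random
from typing import List


def fpl_choose(cumulative_costs: List[float], epsilon: float,
               rng: random.Random) -> int:
    """Return the index minimizing cumulative cost plus noise, where the
    noise for each arm is drawn uniformly at random from [0, 1/epsilon]."""
    perturbed = [c + rng.uniform(0.0, 1.0 / epsilon)
                 for c in cumulative_costs]
    return min(range(len(perturbed)), key=perturbed.__getitem__)
```

A larger ϵ shrinks the noise range, so the rule approaches a deterministic "follow the leader" choice.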

The framework for FPL assumes that we have access to the costs xi(t) for all arms at every time step (had they been chosen). However, this assumption is not reasonable in the context of budgeted learning. For this reason, our implementation is a slight variation of the standard FPL called FEL (“follow the expected leader”) from Kalai and Vempala (2005). First, we assume that xi(t) is zero if the arm was not played (the attribute was not chosen as a purchase). Next, let #xi(t) be the number of times attribute i was chosen up to time step t. Now, instead of cumulative cost, we use the average of the perturbed cost (over the trials that an attribute is actually purchased) as the selection criterion. The FEL algorithm is presented as Algorithm 2. Just as with Exp3C and Exp3CR, we measure the cost as the training error on the partially specified training set. While we considered other reward functions (GINI index, expected classification error, and conditional entropy), classification error on the training set tended to work best for our algorithms in terms of accuracy on the test set. Thus the results we present represent those from the top-performing reward function for each algorithm.
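
One possible reading of the FEL selection criterion is sketched below. The text does not pin down every detail, so the sketch rests on two labeled assumptions: the perturbed cumulative cost is divided by the purchase count #xi(t), and an attribute that has never been purchased is tried first. All names are hypothetical.

```python
import random
from typing import List


def fel_choose(cumulative_costs: List[float], counts: List[int],
               epsilon: float, rng: random.Random) -> int:
    """Pick the attribute with the smallest average perturbed cost.

    Assumption: the score of attribute i is (c_i + noise_i) / count_i,
    with noise_i uniform on [0, 1/epsilon].
    Assumption: an attribute with count 0 is selected immediately.
    """
    scores = []
    for i, (cost, count) in enumerate(zip(cumulative_costs, counts)):
        if count == 0:
            return i
        noise = rng.uniform(0.0, 1.0 / epsilon)
        scores.append((cost + noise) / count)
    return min(range(len(scores)), key=scores.__getitem__)
```

Averaging keeps frequently purchased attributes from being penalized merely for having accumulated more cost terms than rarely purchased ones.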

3.3 Variance-based biased Robin algorithms

Recall that Biased Robin is similar to a Round Robin approach except that attribute i is repeatedly purchased until such purchases are no longer “successful,” and then the algorithm moves on to attribute i+1. In practice, however, making a decision solely based on the outcome of the last action can be problematic, e.g., due to the fluctuation (instability) of the learning process (Kapoor and Greiner 2005b).

We tested three alternative methods based on second-order statistics to judge whether a further purchase of the same attribute is successful or not. Let the current change of the hypotheses be
$$\delta(t)=\frac{\sum_{m=1}^{M}\sum_{j=1}^{J} | P_t(j \mid x_m) -P_{t-1}(j \mid x_m) |}{M} $$
where \(P_t(j \mid x_m)\) is the probability estimated by the trained classifier (after the tth purchase) that the instance m belongs to the jth class, \(P_{t-1}(j \mid x_m)\) is the probability estimate made by the classifier built after the (t−1)th purchase, M is the number of instances, and J is the number of classes.
All the following heuristics are based on the intuition that no further exploration of an attribute should be continued if the induced hypothesis does not change much at all. For the first method, a purchase is successful when δ(t)>Δa. For the second method, a purchase is successful when δ(t)/δ(t−1)>Δr. For the third method, a purchase is successful when δ(t)>median{δ(t−Δw),δ(t−Δw+1),…,δ(t−1)}. In these algorithms, Δa, Δr, and Δw are parameters. The BR variants using the above three methods judging “successful” are called AbsoluteBR2 (ABR2), RelativeBR2 (RBR2), and WindowBR2 (WBR2), respectively.
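The three "successful purchase" tests above can be sketched as follows. This is a minimal illustration rather than the authors' Java/Weka implementation: the function names, the list-of-lists representation of the probability estimates, and the handling of the first few purchases (when δ(t−1) or a full window is not yet available) are assumptions made here. The default thresholds are the values reported as chosen in Sect. 4.

```python
from statistics import median


def hypothesis_change(p_curr, p_prev):
    """delta(t): summed absolute change in class-probability estimates,
    divided by the number of instances M.

    p_curr[m][j] and p_prev[m][j] hold P_t(j | x_m) and P_{t-1}(j | x_m).
    """
    num_instances = len(p_curr)
    total = sum(
        abs(c - p)
        for row_curr, row_prev in zip(p_curr, p_prev)
        for c, p in zip(row_curr, row_prev)
    )
    return total / num_instances


def abr2_success(deltas, delta_a=0.01):
    """AbsoluteBR2: successful when delta(t) > Delta_a."""
    return deltas[-1] > delta_a


def rbr2_success(deltas, delta_r=1.0):
    """RelativeBR2: successful when delta(t) / delta(t-1) > Delta_r."""
    if len(deltas) < 2 or deltas[-2] == 0:
        # Assumption: with no usable previous change, any change counts as success.
        return deltas[-1] > 0
    return deltas[-1] / deltas[-2] > delta_r


def wbr2_success(deltas, delta_w=2):
    """WindowBR2: successful when delta(t) exceeds the median of the
    previous Delta_w changes."""
    window = deltas[-1 - delta_w:-1]
    if not window:
        # Assumption: the very first purchase of a run counts as success.
        return True
    return deltas[-1] > median(window)
```

Here `deltas` is the history δ(1),…,δ(t) with the most recent change last; a Biased Robin loop would keep purchasing the current attribute while the chosen test returns `True` and move to the next attribute otherwise.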
https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Fig2_HTML.gif
Algorithm 2

Follow the Expected Leader (FEL)

3.4 Instance (row) selection heuristics

Many budgeted learning algorithms (except Exp3CR) only select columns for purchase, implicitly assuming that given a column, any instance (or any instance with a given class label) is equally useful. Thus rows are selected uniformly at random from either the entire training set or from among instances of a particular class. However, it may not always be the case that two instances are equally informative given an attribute. Thus, we refine these algorithms by defining criteria for choosing specific instances from which to purchase an attribute. When using our row selection strategies, after the budgeted learner chooses an attribute and a class label (see footnote 3), the row (instance) chosen will be the one optimizing our selection criteria among those with the same class label and with the selected column yet unpurchased.

3.4.1 Entropy as row selection criterion (EN, en)

Intuitively, given a selected column, one wants to find a row such that the (row, column) purchase gives the most information possible. Put another way, we want to choose an instance whose classification is least certain under the current model. That is, it is best to choose an instance nearest to the current model’s decision boundary. This technique has been very successful in active learning (Lewis and Gale 1994; Campbell et al. 2000; Schohn and Cohn 2000; Tong and Koller 2001). For naïve Bayes this means choosing the instance whose posterior class probability distribution is closest to uniform over the classes. I.e., we choose the instance xm that maximizes the entropy of the posterior class distribution:
$$-\sum_{j=1}^J P_t(j \mid x_m) \log_2{P_t(j \mid x_m)} $$
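A minimal sketch of the entropy criterion, assuming the posteriors are available as a mapping from instance index to a list of class probabilities and that `candidates` already contains only the rows with the chosen class label whose selected attribute is still unpurchased. Both names are illustrative and not taken from the original implementation.

```python
import math


def posterior_entropy(probs):
    """Entropy (in bits) of a posterior class distribution.

    Zero-probability classes contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_row_by_entropy(posteriors, candidates):
    """Return the candidate row whose posterior is closest to uniform,
    i.e. the one with maximum entropy under the current model."""
    return max(candidates, key=lambda m: posterior_entropy(posteriors[m]))


# Example: row 1 has the least certain prediction, so it is selected.
posteriors = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.7, 0.3]}
chosen = select_row_by_entropy(posteriors, candidates=[0, 1, 2])
```

Ties are broken by candidate order here; the source does not state a tie-breaking rule.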

3.4.2 Error correction as row selection criterion (EC, ec)

The other row selection heuristic we considered is a greedy “error-correcting” approach. For each training instance m still with missing attributes, we calculate the predicted probability of its true class \(P_t(\ell_m \mid x_m)\), where \(\ell_m\) stands for the true class of instance m. We then pick \(\operatorname{argmin}_{m} \{P_{t}(\ell_{m} \mid x_{m})\}\), the instance with the smallest estimated probability in its true class. Intuitively, the selected instance is the one most likely to be misclassified by the classifier, so knowing more about this instance should improve the performance, especially in the early stages of the training process. Melville et al. (2005, 2008) proposed to use US (uniform sampling) and ES (Error Sampling) to consider only partial instances instead of all of the instances. The sampling is done before choosing an (instance, feature) pair. This is different from our row selection, which is applied after a feature is chosen first.
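The error-correction criterion can be sketched in the same style. The data layout (a mapping from instance index to class posteriors, plus a mapping to the true class index) and the function name are assumptions for illustration only.

```python
def select_row_by_error_correction(posteriors, true_labels, candidates):
    """Return the candidate row with the smallest estimated probability
    of its own true class under the current model."""
    return min(candidates, key=lambda m: posteriors[m][true_labels[m]])


# Example: row 2 is assigned only 0.2 probability for its true class (index 1),
# so it is the row most in need of "correction".
posteriors = {0: [0.9, 0.1], 1: [0.4, 0.6], 2: [0.8, 0.2]}
true_labels = {0: 0, 1: 1, 2: 1}
chosen = select_row_by_error_correction(posteriors, true_labels, [0, 1, 2])
```

Unlike the entropy criterion, this rule uses the known class label of each training instance, which is available for free in the budgeted learning setting.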

4 Experimental results

In this section we present our experimental results on several UCI data sets (Asuncion and Newman 2009) (see Table 1). All of these data sets have a large number of attributes, which is good for testing budgeted learning algorithms whose essence is to identify those attributes that are more helpful for building the classifier. In order to run RSFL, we chose only data sets that had nominal attributes or could easily be made nominal, so that the number of possible (feature, value) pairs is limited. For the few data sets with missing attributes, the mode was used to fill in that attribute value. All algorithms were written in Java within the Weka machine learning framework (Witten and Frank 2005) and used its naïve Bayes (NB) as the base learner. We chose NB since it was used by related work (Lizotte et al. 2003), is quick to train, and handles missing attribute values.
Table 1

Data set information

| Data set | Num. of Instances | Num. of Attributes | Num. of Classes |
|---|---|---|---|
| breast-cancer | 286 | 9 | 2 |
| colic | 368 | 22 | 2 |
| kr-vs-kp | 3196 | 35 | 2 |
| mushroom | 8124 | 22 | 2 |
| vote | 435 | 16 | 2 |
| zoo | 101 | 17 | 7 |

We partitioned each data set in 10 different ways. For each partitioning, we used 10-fold cross validation, so the results presented in this section are averages of 100 folds. We ran 10-fold CV on 10 different partitionings because we found that the performance of each algorithm is sensitive to the partitioning used, and repeating the process reduced the variance. In addition to our algorithms (Exp3C, Exp3CR, FEL and the BR2 variations) and those in the literature described earlier (RSFL and BR), as a control, we also considered a random shopper that uniformly at random selects an unpurchased (instance, attribute) pair. We tried various reward functions with each algorithm, and chose the reward function that worked best for each algorithm. For our algorithms, we used the classification accuracy on the partially specified training set as a reward function. For RSFL and BR, we used conditional entropy. When applicable, we ran each algorithm with uniform random selection of rows as well as with the entropy-based and error correction-based approaches of Sect. 3.4.

For the algorithms with adjustable parameters, we report the best results from the parameter values we tested, yielding a “best-case” description of each algorithm’s performance. For Exp3C and Exp3CR, we tried γ∈{0.01,0.05,0.10,0.15,0.20,0.25}, and chose γ=0.15 for Exp3C and γ=0.20 for Exp3CR. For FEL, we tried ϵ∈{0.01,0.05,0.1,0.2,0.5} and chose ϵ=0.1. For AbsoluteBR2, we chose Δa=0.01 from {0.00001,0.0001,0.001,0.01,0.1,1}, for RelativeBR2, we chose Δr=1 from {0.01,0.1,1,10,100}, and for WindowBR2 we chose Δw=2 from {2,3,4,5,6,7,8,9}.

To evaluate the overall behavior of each of the algorithms, we constructed learning curves that reflect the performance of a heuristic by its mean error over the 10×10 folds as more attributes are purchased. On each fold, each algorithm was run up to a budget of B=100. Each purchase was unit cost.

In our experiment, more than 20 algorithms were evaluated for each data set. In Table 2 we list algorithm abbreviations, full names, and sources. For the sake of brevity, only learning curves using the EC-based row selection are presented (Figs. 1 and 2), and we only sampled every fifth data point. To keep the plots uncluttered, in Fig. 1 we plot results for Random, Biased Robin (BR), Absolute BR2 (ABR2), Window BR2 (WBR2) and Relative BR2 (RBR2). In Fig. 2, we include ABR2 as a reference against Random, Exp3 Column-Row (Exp3CR), Exp3 Column (Exp3C), Follow the Expected Leader (FEL), and Randomized Single-Feature Lookahead (RSFL). In the following sections, we present a detailed analysis of our results.
https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Fig3_HTML.gif
Fig. 1

Learning curves for Baseline, Random, BR, ABR2, WBR2, and RBR2 on each data set under the Error Correction (EC) row selector. (a) Breast-cancer; (b) colic-nominalized; (c) mushroom; (d) kr-vs-kp; (e) vote; (f) zoo

https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Fig4_HTML.gif
Fig. 2

Learning curves for Baseline, Random, Exp3CR, Exp3C, FEL, RSFL, and ABR2 on each data set under the Error Correction (EC) row selector. (a) Breast-cancer; (b) colic-nominalized; (c) mushroom; (d) kr-vs-kp; (e) vote; (f) zoo

Table 2

Algorithm abbreviations, full names, and short descriptions

| Algorithm Identifier | Full Name | Source |
|---|---|---|
| BR, br | Biased Robin | Section 2.1 |
| ABR2, abr2 | Absolute Biased Robin 2 | Section 3.3 |
| RBR2, rbr2 | Relative Biased Robin 2 | Section 3.3 |
| WBR2, wbr2 | Window Biased Robin 2 | Section 3.3 |
| Exp3C, exp3c | Exp3 Column | Algorithm 1 |
| Exp3CR, exp3cr | Exp3 Column Row | Section 3.1 |
| RSFL | Random Single Feature Look-ahead | Section 2.2 |
| FEL | Follow the Expected Leader | Algorithm 2 |
| Rand | Random | Randomly choose a feature and an instance |
| algoname.ec | algorithm algoname with Error-Correction row selector | Section 3.4.2 |
| algoname.en | algorithm algoname with Entropy row selector | Section 3.4.1 |
| algoname.ur | algorithm algoname with Uniform Random row selector | Section 3.4 |

4.1 Summary statistics

Ultimately, the goal in budgeted learning is to reduce the number of attributes one must purchase in order to effectively learn. To summarize and compare learning curves, we use summary statistics.

The first statistic we use is the target mean, which is the mean of the error rates for the final 20 % of the total budget achieved by the random shopper. We define the target budget of an algorithm over the 10×10 folds on a given data set as the minimum budget needed by an algorithm to be competitive with the random shopper. For a given algorithm A and trial t (i.e., the naïve Bayes model trained on the training set after t purchases by A), we compute the mean error rate of the last 5 % of purchases for A. Then the target budget is the smallest t for which the target mean is achieved by A. We use a window size of 5 % of the total budget to reduce the influence of outliers as the learning curves can have high variance early on. If an algorithm fails to achieve the target mean, its target budget is simply the entire budget B.

We also report each algorithm’s data utilization ratio, which is the algorithm’s target budget divided by the target budget of the random shopper. Thus, a lower data utilization ratio reflects that the algorithm was able to make more useful purchases overall while excluding large changes in performance as the budget is exhausted. This is especially informative because with larger budgets, each algorithm will naturally converge to the baseline, making distinctions between them meaningless as more purchases are made. These metrics are similar to those used by Abe and Mamitsuka (1998), Melville and Mooney (2004), and Culver et al. (2006) in the context of active learning.
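The summary statistics of this section can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `errors[t-1]` is taken to be the mean error rate (over the 10×10 folds) after t purchases, the 5 % window is taken to end at purchase t, and "achieving" the target mean is taken to mean that the windowed mean error is at most the target mean.

```python
def target_mean(random_errors, tail_fraction=0.20):
    """Mean error of the random shopper over the final 20 % of the budget."""
    k = max(1, round(len(random_errors) * tail_fraction))
    tail = random_errors[-k:]
    return sum(tail) / len(tail)


def target_budget(errors, target, window_fraction=0.05):
    """Smallest t whose trailing-window mean error reaches the target mean.

    Falls back to the entire budget B = len(errors) if the target is never reached.
    """
    budget = len(errors)
    w = max(1, round(budget * window_fraction))
    for t in range(w, budget + 1):
        window = errors[t - w:t]
        if sum(window) / w <= target:
            return t
    return budget


def data_utilization_ratio(errors, random_errors):
    """Target budget of the algorithm divided by that of the random shopper."""
    target = target_mean(random_errors)
    return target_budget(errors, target) / target_budget(random_errors, target)


# Toy curves with B = 20: the algorithm reaches the random shopper's
# final error level after 10 purchases, the random shopper after 17.
random_curve = [0.5] * 16 + [0.25] * 4
algo_curve = [0.5] * 9 + [0.25] * 11
dur = data_utilization_ratio(algo_curve, random_curve)
```

A ratio below 1 means the algorithm needed fewer purchases than the random shopper to reach the same error level; an algorithm that never reaches the target mean gets a ratio above 1, which matches the values such as 1.45 that accompany target budgets of 100.0 in Tables 3–5.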

4.2 Comparing attribute (column) selectors and instance (row) selectors

In our first set of experiments, we held the row selector fixed and compared the budgeted learning algorithms (which we sometimes refer to as “attribute selectors” or “column selectors”). Tables 3–5 show the target budgets and data utilization rates for the algorithms for each row selector. In those tables, the suffix of each algorithm’s name denotes which row selector was used: “ec” for Error-Correction, “en” for Entropy, and “ur” for Uniform Random. For example, abr2.ec is the AbsoluteBR2 algorithm run with the Error-Correction row selector.
Table 3

Target budget and data utilization rates (in parentheses) for algorithms with the EC row selector. Total budget was B=100

| Dataset | exp3cr | exp3c.ec | fel.ec | rsfl.ec | rand.ec | br.ec | abr2.ec | wbr2.ec | rbr2.ec |
|---|---|---|---|---|---|---|---|---|---|
| breast-cancer | 42.0 (0.61) | 35.0 (0.51) | 100.0 (1.45) | 100.0 (1.45) | 50.0 (0.72) | 46.0 (0.67) | 35.0 (0.51) | 100.0 (1.45) | 60.0 (0.87) |
| colic-Nominalized | 78.0 (0.89) | 83.0 (0.94) | 78.0 (0.89) | 100.0 (1.14) | 89.0 (1.01) | 100.0 (1.14) | 93.0 (1.06) | 98.0 (1.11) | 51.0 (0.58) |
| kr-vs-kp | 90.0 (0.96) | 80.0 (0.85) | 90.0 (0.96) | 100.0 (1.06) | 91.0 (0.97) | 100.0 (1.06) | 82.0 (0.87) | 74.0 (0.79) | 100.0 (1.06) |
| mushroom | 86.0 (0.91) | 77.0 (0.82) | 73.0 (0.78) | 83.0 (0.88) | 82.0 (0.87) | 70.0 (0.74) | 71.0 (0.76) | 77.0 (0.82) | 65.0 (0.69) |
| vote | 100.0 (1.14) | 100.0 (1.14) | 100.0 (1.14) | 71.0 (0.81) | 100.0 (1.14) | 65.0 (0.74) | 63.0 (0.72) | 100.0 (1.14) | 57.0 (0.65) |
| zoo | 82.0 (0.88) | 77.0 (0.83) | 86.0 (0.92) | 76.0 (0.82) | 75.0 (0.81) | 77.0 (0.83) | 68.0 (0.73) | 85.0 (0.91) | 77.0 (0.83) |
| Mean | 79.67 | 75.33 | 87.83 | 88.33 | 81.16 | 76.33 | 68.67 | 89 | 68.33 |
| Mean DUR | 0.9 | 0.85 | 1.02 | 1.03 | 0.92 | 0.86 | 0.77 | 1.04 | 0.78 |
| Median DUR | 0.9 | 0.84 | 0.94 | 0.97 | 0.92 | 0.79 | 0.74 | 1.01 | 0.76 |


When using EC as the row selector (Table 3), RBR2 performed the best in terms of mean target budget, and ABR2 performed the best for mean and median DUR. When using EN as the row selector (Table 4), ABR2 performed the best in terms of mean target budget and median DUR, and ABR2 and RBR2 performed the best in terms of mean DUR. This is consistent (see footnote 4) with the Wilcoxon signed rank tests of Sect. 4.3, which indicate that BR.ec and ABR2.ec are the best among all algorithms using the EC row selector (Table 11), and ABR2.en is the best among all algorithms using the EN row selector (Table 12). In contrast to EC and EN, when UR was used (Table 5), FEL performed the best in terms of mean target budget and mean DUR, and Exp3C performed the best in terms of median DUR. This result is different from the Wilcoxon signed rank tests of Sect. 4.3, which indicate that ABR2.ur and RSFL.ur are the best among all algorithms using the UR row selector (Table 13).
Table 4

Target budget and data utilization rates (in parentheses) for algorithms with the EN row selector. Total budget was B=100

| Dataset | exp3cr | exp3c.en | fel.en | rsfl.en | rand.en | br.en | abr2.en | wbr2.en | rbr2.en |
|---|---|---|---|---|---|---|---|---|---|
| breast-cancer | 42.0 (0.61) | 100.0 (1.45) | 100.0 (1.45) | 100.0 (1.45) | 100.0 (1.45) | 100.0 (1.45) | 59.0 (0.86) | 50.0 (0.72) | 46.0 (0.67) |
| colic-Nominalized | 78.0 (0.89) | 100.0 (1.14) | 100.0 (1.14) | 95.0 (1.08) | 100.0 (1.14) | 100.0 (1.14) | 68.0 (0.77) | 96.0 (1.09) | 82.0 (0.93) |
| kr-vs-kp | 90.0 (0.96) | 82.0 (0.87) | 100.0 (1.06) | 100.0 (1.06) | 98.0 (1.04) | 82.0 (0.87) | 100.0 (1.06) | 100.0 (1.06) | 69.0 (0.73) |
| mushroom | 86.0 (0.91) | 76.0 (0.81) | 94.0 (1) | 85.0 (0.9) | 84.0 (0.89) | 78.0 (0.83) | 81.0 (0.86) | 70.0 (0.74) | 63.0 (0.67) |
| vote | 100.0 (1.14) | 58.0 (0.66) | 79.0 (0.9) | 51.0 (0.58) | 87.0 (0.99) | 57.0 (0.65) | 55.0 (0.62) | 76.0 (0.86) | 85.0 (0.97) |
| zoo | 82.0 (0.88) | 79.0 (0.85) | 73.0 (0.78) | 86.0 (0.92) | 83.0 (0.89) | 73.0 (0.78) | 69.0 (0.74) | 81.0 (0.87) | 89.0 (0.96) |
| Mean | 79.67 | 82.5 | 91 | 86.17 | 92 | 81.67 | 72 | 78.83 | 72.33 |
| Mean DUR | 0.9 | 0.96 | 1.06 | 1 | 1.07 | 0.95 | 0.82 | 0.89 | 0.82 |
| Median DUR | 0.9 | 0.86 | 1.03 | 0.99 | 1.02 | 0.85 | 0.81 | 0.87 | 0.83 |


Table 5

Target budget and data utilization rates (in parentheses) for algorithms with the UR row selector. Total budget was B=100

| Dataset | exp3cr | exp3c.ur | fel.ur | rsfl.ur | rand.ur | br.ur | abr2.ur | wbr2.ur | rbr2.ur |
|---|---|---|---|---|---|---|---|---|---|
| breast-cancer | 42.0 (0.61) | 46.0 (0.67) | 18.0 (0.26) | 100.0 (1.45) | 69.0 (1) | 66.0 (0.96) | 45.0 (0.65) | 51.0 (0.74) | 52.0 (0.75) |
| colic-Nominalized | 78.0 (0.89) | 89.0 (1.01) | 82.0 (0.93) | 100.0 (1.14) | 88.0 (1) | 100.0 (1.14) | 90.0 (1.02) | 93.0 (1.06) | 82.0 (0.93) |
| kr-vs-kp | 90.0 (0.96) | 74.0 (0.79) | 89.0 (0.95) | 100.0 (1.06) | 94.0 (1) | 100.0 (1.06) | 81.0 (0.86) | 77.0 (0.82) | 91.0 (0.97) |
| mushroom | 86.0 (0.91) | 80.0 (0.85) | 77.0 (0.82) | 85.0 (0.9) | 94.0 (1) | 75.0 (0.8) | 76.0 (0.81) | 72.0 (0.77) | 65.0 (0.69) |
| vote | 100.0 (1.14) | 100.0 (1.14) | 100.0 (1.14) | 76.0 (0.86) | 88.0 (1) | 35.0 (0.4) | 100.0 (1.14) | 100.0 (1.14) | 77.0 (0.88) |
| zoo | 82.0 (0.88) | 72.0 (0.77) | 77.0 (0.83) | 80.0 (0.86) | 93.0 (1) | 83.0 (0.89) | 71.0 (0.76) | 82.0 (0.88) | 96.0 (1.03) |
| Mean | 79.67 | 76.83 | 73.83 | 90.17 | 87.67 | 76.5 | 77.17 | 79.17 | 77.17 |
| Mean DUR | 0.9 | 0.87 | 0.82 | 1.05 | 1 | 0.87 | 0.87 | 0.9 | 0.88 |
| Median DUR | 0.9 | 0.82 | 0.88 | 0.98 | 1 | 0.92 | 0.84 | 0.85 | 0.9 |


Next, we computed the mean classification accuracy of each algorithm after spending budget \(b_d\), where \(b_d\) is the minimum target budget on data set d of all algorithms that use the same row selector. For example, for the EC row selector on data set vote, we considered all algorithms’ mean accuracies after spending a budget of \(b_{\text{vote}}=57.0\) (Table 3). These average accuracies are presented in Tables 6–8. Tables 6 and 7 show that with the EC or EN row selector, ABR2 and RBR2 each have two wins, which is the most for any algorithm. This is partially consistent with the Wilcoxon signed rank tests of Sect. 4.3, which indicate that BR.ec and ABR2.ec are the best among all algorithms using the EC row selector (Table 11), and ABR2.en is the best among all algorithms using the EN row selector (Table 12). In Table 8, we see that for the UR row selector, FEL has the most wins (again, two). Again, this result is different from the Wilcoxon signed rank tests of Sect. 4.3, which indicate that ABR2.ur and RSFL.ur are the best among all algorithms using the UR row selector (Table 13).
Table 6

Mean accuracies of all algorithms using the EC row selector at the minimum target budget

| dataset | exp3c.ec | fel.ec | rsfl.ec | rand.ec | br.ec | abr2.ec | wbr2.ec | rbr2.ec |
|---|---|---|---|---|---|---|---|---|
| breast-cancer | 0.698 | 0.676 | 0.632 | 0.693 | 0.690 | 0.693 | 0.659 | 0.675 |
| colic-Nominalized | 0.644 | 0.652 | 0.635 | 0.640 | 0.609 | 0.635 | 0.631 | 0.676 |
| kr-vs-kp | 0.569 | 0.567 | 0.558 | 0.563 | 0.558 | 0.572 | 0.575 | 0.555 |
| mushroom | 0.832 | 0.831 | 0.817 | 0.820 | 0.834 | 0.837 | 0.809 | 0.841 |
| vote | 0.883 | 0.879 | 0.881 | 0.880 | 0.885 | 0.888 | 0.879 | 0.886 |
| zoo | 0.769 | 0.765 | 0.771 | 0.776 | 0.781 | 0.790 | 0.778 | 0.774 |


Table 7

Mean accuracies of all algorithms using the EN row selector at the minimum target budget

| dataset | exp3c.en | fel.en | rsfl.en | rand.en | br.en | abr2.en | wbr2.en | rbr2.en |
|---|---|---|---|---|---|---|---|---|
| breast-cancer | 0.647 | 0.626 | 0.630 | 0.592 | 0.679 | 0.691 | 0.694 | 0.692 |
| colic-Nominalized | 0.639 | 0.630 | 0.633 | 0.617 | 0.624 | 0.679 | 0.652 | 0.664 |
| kr-vs-kp | 0.566 | 0.550 | 0.566 | 0.553 | 0.565 | 0.554 | 0.557 | 0.575 |
| mushroom | 0.821 | 0.811 | 0.814 | 0.817 | 0.817 | 0.819 | 0.823 | 0.843 |
| vote | 0.884 | 0.879 | 0.890 | 0.862 | 0.884 | 0.884 | 0.880 | 0.868 |
| zoo | 0.762 | 0.784 | 0.748 | 0.752 | 0.783 | 0.792 | 0.775 | 0.769 |


Table 8

Mean accuracies of all algorithms using the UR row selector at the minimum target budget

| dataset | exp3c.ur | fel.ur | rsfl.ur | rand.ur | br.ur | abr2.ur | wbr2.ur | rbr2.ur |
|---|---|---|---|---|---|---|---|---|
| breast-cancer | 0.682 | 0.692 | 0.660 | 0.670 | 0.674 | 0.667 | 0.677 | 0.681 |
| colic-Nominalized | 0.666 | 0.672 | 0.669 | 0.657 | 0.656 | 0.667 | 0.656 | 0.671 |
| kr-vs-kp | 0.575 | 0.571 | 0.564 | 0.555 | 0.558 | 0.569 | 0.572 | 0.563 |
| mushroom | 0.820 | 0.827 | 0.817 | 0.809 | 0.826 | 0.827 | 0.824 | 0.844 |
| vote | 0.869 | 0.872 | 0.878 | 0.839 | 0.889 | 0.872 | 0.865 | 0.875 |
| zoo | 0.792 | 0.779 | 0.753 | 0.757 | 0.779 | 0.794 | 0.780 | 0.763 |


In our second set of experiments, we held the column selector fixed and compared the row selectors. Referring back to Tables 3–5, we set each budget \(b_d\) as the minimum target budget on data set d of all algorithms that use the same column selector. For example, when considering the set of algorithms \(\mathcal{A}=\{\mbox{exp3c.ec, exp3c.en, exp3c.ur}\}\), we have \(b_{\text{zoo}}=\min\{77.0,79.0,72.0\}=72.0\). Next, we computed the mean classification accuracy of each algorithm after spending budget \(b_d\), and reported these values in Tables 9 and 10. From these two tables, we see that the EC row selector stands out when Random is the column selector. While some other row selectors benefited more from some column selectors than other column selectors (EC for BR, EN for ABR2, UR for WBR2, and EC for RBR2), the difference in performance between each “winning” row selector and its competitors was small, and the winning row selector varied with the column selector used. The lack of any consistent, outstanding row selector is similar to our Wilcoxon-based results of Sect. 4.3.
Table 9

Mean accuracies of all algorithms using the EC, EN, and UR row selectors at the minimum target budget

| dataset | exp3c.ec | exp3c.en | exp3c.ur | fel.ec | fel.en | fel.ur | rsfl.ec | rsfl.en | rsfl.ur | rand.ec | rand.en | rand.ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| breast-cancer | 0.698 | 0.638 | 0.689 | 0.675 | 0.618 | 0.692 | 0.664 | 0.660 | 0.668 | 0.698 | 0.595 | 0.675 |
| colic-Nominalized | 0.672 | 0.647 | 0.667 | 0.673 | 0.630 | 0.668 | 0.669 | 0.673 | 0.670 | 0.672 | 0.632 | 0.668 |
| kr-vs-kp | 0.569 | 0.568 | 0.575 | 0.573 | 0.563 | 0.576 | 0.569 | 0.568 | 0.570 | 0.574 | 0.568 | 0.570 |
| mushroom | 0.841 | 0.844 | 0.837 | 0.844 | 0.819 | 0.837 | 0.843 | 0.841 | 0.842 | 0.843 | 0.842 | 0.833 |
| vote | 0.882 | 0.887 | 0.879 | 0.880 | 0.887 | 0.880 | 0.880 | 0.890 | 0.886 | 0.883 | 0.888 | 0.886 |
| zoo | 0.787 | 0.772 | 0.790 | 0.775 | 0.796 | 0.785 | 0.797 | 0.758 | 0.785 | 0.795 | 0.775 | 0.756 |


Table 10

Mean accuracies of all algorithms using the EC, EN, and UR row selectors at the minimum target budget

| dataset | br.ec | br.en | br.ur | abr2.ec | abr2.en | abr2.ur | wbr2.ec | wbr2.en | wbr2.ur | rbr2.ec | rbr2.en | rbr2.ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| breast-cancer | 0.694 | 0.679 | 0.688 | 0.693 | 0.680 | 0.689 | 0.669 | 0.695 | 0.699 | 0.694 | 0.692 | 0.689 |
| colic-Nominalized | 0.670 | 0.641 | 0.666 | 0.652 | 0.679 | 0.663 | 0.661 | 0.669 | 0.676 | 0.676 | 0.649 | 0.655 |
| kr-vs-kp | 0.559 | 0.573 | 0.556 | 0.573 | 0.562 | 0.579 | 0.575 | 0.557 | 0.572 | 0.558 | 0.575 | 0.555 |
| mushroom | 0.844 | 0.832 | 0.838 | 0.847 | 0.835 | 0.835 | 0.826 | 0.843 | 0.841 | 0.840 | 0.843 | 0.839 |
| vote | 0.861 | 0.871 | 0.889 | 0.886 | 0.887 | 0.882 | 0.884 | 0.887 | 0.886 | 0.886 | 0.870 | 0.880 |
| zoo | 0.786 | 0.790 | 0.781 | 0.790 | 0.791 | 0.784 | 0.785 | 0.787 | 0.795 | 0.791 | 0.775 | 0.768 |


4.3 Comparisons of algorithms using Wilcoxon tests

We then performed a Wilcoxon signed rank test (Wilcoxon 1945) to compare every pair of algorithms from a group of algorithms at their minimum target budget. The null hypothesis of the test is \(H_0: \theta=0\), meaning that there is no difference between the classification accuracies achieved by the two algorithms at a given budget. Each algorithm has 100 achieved accuracies (10 iterations times 10 folds) at a given budget. Let \(A_i\) and \(B_i\) be the i-th achieved accuracy of the two algorithms, and let \(Z_i=A_i-B_i\) for i=1,…,100. The Wilcoxon signed rank test procedure is as follows: (1) Observations with \(Z_i=0\) are excluded. Let m be the reduced sample size. (2) Sort the absolute values \(|Z_1|,\ldots,|Z_m|\) in ascending order, and let the rank of each non-zero \(|Z_i|\) be \(R_i\) (the smallest positive \(|Z_i|\) gets the rank of 1, and a mean rank is assigned to tied scores). (3) Denote the positive \(Z_i\) values with \(\varphi_i=I(Z_i>0)\), where I(⋅) is an indicator function: \(\varphi_i=1\) for \(Z_i>0\), otherwise \(\varphi_i=0\). (4) The Wilcoxon signed rank statistic \(W_{+}\) is defined as \(W_{+} = \sum _{i=1}^{m}\varphi_{i} R_{i}\). Define \(W_{-}\) similarly by summing the ranks of the negative differences \(Z_i\). (5) Calculate S as the smaller of these two rank sums: \(S=\min(W_{+},W_{-})\). (6) Find the critical value for the given sample size m and the desired confidence level. (7) Compare S to the critical value, and reject \(H_0\) if S is less than or equal to the critical value.
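Steps (1) through (5) of the procedure can be sketched as follows. This is an illustrative pure-Python version, not the code used for the experiments; the function name is an assumption, and zero differences and ties are detected by exact equality, which may need a tolerance for floating-point accuracies.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, S, m) for paired samples a and b."""
    # Step 1: drop zero differences; m is the reduced sample size.
    z = [x - y for x, y in zip(a, b) if x != y]
    m = len(z)

    # Step 2: rank |Z_i| in ascending order, giving tied values their mean rank.
    order = sorted(range(m), key=lambda i: abs(z[i]))
    ranks = [0.0] * m
    i = 0
    while i < m:
        j = i
        while j + 1 < m and abs(z[order[j + 1]]) == abs(z[order[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1

    # Steps 3-4: sum the ranks of positive and of negative differences.
    w_plus = sum(r for r, d in zip(ranks, z) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, z) if d < 0)

    # Step 5: the test statistic is the smaller rank sum.
    return w_plus, w_minus, min(w_plus, w_minus), m


# Small example: differences are 1, -2, 3, 0, 4; the zero is dropped.
result = wilcoxon_signed_rank([5, 3, 8, 6, 7], [4, 5, 5, 6, 3])
```

Steps (6) and (7) then compare S against a critical value for sample size m. A useful sanity check is that the two rank sums always add up to m(m+1)/2.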

In our first set of experiments using Wilcoxon-based tests, we held the row selector fixed and compared the budgeted learning algorithms (which we sometimes refer to as “attribute selectors” or “column selectors”). Tables 11–13 show the results of our Wilcoxon-based comparison between each algorithm on the left side of the table to each algorithm on the top of the table. When the left side algorithm is significantly better (i.e., reaches a significantly higher classification accuracy) than the top side algorithm at a p<0.05 significance level for a data set, a “+” sign is shown there; when the top side algorithm is significantly better than the left side algorithm at a p<0.05 level for a data set, a “−” sign is shown there; when the left side algorithm and the top side algorithm are not significantly different, a “0” is shown there. Thus, “++0−−+” means the first algorithm, when compared to the second algorithm, is significantly better, significantly better, no significant difference, significantly worse, significantly worse, and significantly better at a p<0.05 level for the 6 data sets.
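The per-data-set symbols can be assembled into a summary string as sketched below. Note the assumptions: the text describes a comparison against a tabulated critical value, whereas this sketch substitutes the standard large-sample normal approximation for the signed rank statistic (without tie or continuity correction) with a two-sided 5 % threshold of 1.96. The function names are hypothetical, and an ASCII "-" stands in for the typeset minus sign.

```python
import math


def outcome_symbol(w_plus, w_minus, m, z_crit=1.96):
    """Map one Wilcoxon result to '+', '-' or '0' (left vs. top algorithm).

    Assumption: normal approximation instead of a critical-value table.
    """
    if m == 0:
        return "0"
    mean = m * (m + 1) / 4
    sd = math.sqrt(m * (m + 1) * (2 * m + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    if z > -z_crit:
        return "0"  # not significant at the chosen level
    # Significant: the left algorithm wins when positive differences dominate.
    return "+" if w_plus > w_minus else "-"


def summary_string(results):
    """Concatenate one symbol per data set, in data-set order."""
    return "".join(outcome_symbol(*r) for r in results)


# Three hypothetical data sets, each with m = 100 non-zero differences.
summary = summary_string([(4000, 1050, 100), (1050, 4000, 100), (2600, 2450, 100)])
```

Counting the "+" and "−" symbols in each such string across all opponents gives the win and loss totals discussed for Tables 11–14.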
Table 11

The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo. “++0−−+” means that the left side algorithm, compared to the top side algorithm, is significantly better, significantly better, no significant difference, significantly worse, significantly worse, and significantly better at a p<0.05 level for the 6 data sets

https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Tab11_HTML.gif
In Table 11, when all algorithms use the EC row selector (except Exp3CR), BR.ec has the largest number of wins and the smallest number of losses, followed closely by ABR2.ec, which has the second largest number of wins and the second smallest number of losses. In Table 12, when all algorithms use the EN row selector (except Exp3CR), ABR2.en has the largest number of wins and the smallest number of losses, followed by Exp3C.en, which has the second largest number of wins and the second smallest number of losses. In Table 13, when all algorithms use the UR row selector (except Exp3CR), ABR2.ur has the largest number of wins and the smallest number of losses, followed by RSFL.ur, which has the same number of wins and the second smallest number of losses.
Table 12

The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo

https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Tab12_HTML.gif
Table 13

The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo

https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Tab13_HTML.gif

In our second set of experiments using Wilcoxon-based tests, we held each column selector fixed and compared row selectors. For each fixed column selector, on each data set we pairwise-compared via a Wilcoxon test all three row selectors, and counted wins. In most cases, there was an overall winning row selector, but nothing that consistently and dramatically stood out from the other row selectors. (The only exception was RBR2.en, which stood out next to RBR2.ec and RBR2.ur with 7 wins and 0 losses.) Further, the winning row selector varied with the column selector used.

In our third experiment using Wilcoxon-based tests, we compared all 25 algorithms. Table 14 shows the results of our Wilcoxon-based comparisons between each algorithm on the left side of the table to each algorithm on the top side of the table. From this table, we can see that ABR2.en has by far the largest number of wins and the smallest number of losses. ABR2 with other row selectors also performed well at the minimum target budget. Also performing well were WBR2 with the EC row selector, Exp3C with the EC row selector, and FEL with the UR row selector.
Table 14

The Wilcoxon-based comparison of pairs of algorithms at the minimum target budget of all the algorithms listed in the table for data sets breast-cancer, colic-Nominalized, kr-vs-kp, mushroom, vote, and zoo

https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Tab14a_HTML.gif
https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Tab14b_HTML.gif
https://static-content.springer.com/image/art%3A10.1007%2Fs10994-012-5299-2/MediaObjects/10994_2012_5299_Tab14c_HTML.gif

In conclusion, BR and ABR2 stand out among algorithms using the EC row selector; ABR2 and Exp3C stand out among algorithms using the EN row selector; and ABR2 and RSFL stand out among algorithms using the UR row selector. While certain row selectors stood out for specific column selectors, no single row selector consistently stood out from the others in our experiments. Comparing the algorithms which are combinations of column selectors and row selectors, ABR2 with EN performs the best.

Generally speaking, ABR2 stands out among algorithms no matter which row selector is used. ABR2 is based on the belief that if purchasing an (instance, feature) pair causes a significant hypothesis change, then the feature is likely to be useful in building a classifier. Our experimental results showed that this belief is a good heuristic. RBR2 and WBR2 judge whether a hypothesis change is significant by comparing it to previous changes; ABR2's absolute threshold proved more effective than these relative and window-based criteria. Finally, ABR2 with the EN row selector works best by far among all algorithms.

5 Conclusions and future work

We presented new algorithms for the budgeted learning problem (choosing which attribute to purchase at each step), many showing improvement over the state of the art, for example ABR2, WBR2, Exp3C, and FEL. We also described heuristics for selecting the row (instance) from which to purchase the chosen attribute, for example selecting the row about which the current model is most uncertain. For different algorithms using the same row selector, BR and ABR2 stand out for all algorithms with the EC row selector; ABR2 and Exp3C stand out for all algorithms with the EN row selector; and ABR2 and RSFL stand out for all algorithms with the UR row selector. When the column selector is held fixed and the row selector is varied, no single row selector consistently stood out from the others. When comparing all algorithms, ABR2 with all row selectors, WBR2 and Exp3C with the EC row selector, and FEL with the UR row selector perform well. ABR2 with EN was the overall best performer.

There are several directions for future work. First, while there are some theoretical results for the coins problem, there are no learning-theoretic results (e.g., PAC-style results) for the general budgeted learning problem of learning a hypothesis when the features of the training data have to be purchased at a cost, subject to an overall budget. Our application of Auer et al.’s multi-armed bandit algorithms to this problem (Sect. 3.1) may ultimately yield such a result. However, in order to get a PAC-style bound for our algorithm, one needs to relate regret bounds or one-shot bounds such as those in Bubeck et al. (2008) to generalization error bounds of the final model.

Other future work includes extending the basic budgeted learning model in the context of bandit-based algorithms, in particular the budgeted learning of bounded active classifiers (BACs) (Kapoor and Greiner 2005a, 2005b). Further future work is to treat class labels and attribute pairs as bandits by separating the rewards for different class labels. Finally, exploring different base learners other than naïve Bayes (such as support vector machines) is another direction for future research.

Footnotes
1. These conditions are based only on the choice of parameters, not any statistical properties of the slot machines.

2. We tried several reward functions: GINI index, expected classification error, and conditional entropy. Classification error on the training set tended to work best for our algorithms, so those are the results that we present in Sect. 4.

3. If a class label is not chosen at this stage, then the row selection strategy will randomly choose a class label according to the class distribution.

4. While the best performers in terms of mean and median statistics were often the best in terms of the Wilcoxon statistic, this was not always the case. Inconsistencies occurred when a majority of the accuracies for algorithm X in cross validation were better than those for algorithm Y, but the mean of the accuracies for Y was better than that for X, due to outlying accuracy values.

Acknowledgements

The authors thank the reviewers for their helpful comments. This material is based upon work supported by the National Science Foundation under grant number 0743783.

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Kun Deng (1)
  • Yaling Zheng (2)
  • Chris Bourke (2)
  • Stephen Scott (2)
  • Julie Masciale (3)

  1. Department of Statistics, University of Michigan, Ann Arbor, USA
  2. Department of Computer Science & Eng., University of Nebraska, Lincoln, USA
  3. Union Pacific, Omaha, USA