Here we compare the outcomes of decisions informed by CIA with the outcomes of decisions informed by alternative approaches. The settings we consider comprise two primary phases: an information acquisition policy periodically selects audits from which to learn the models, and the tax authority then conducts a revenue-maximizing audit campaign informed by the learned models.
We employ data provided by the Texas Comptroller of Public Accounts. The dataset reflects audits performed over a period of two years and is used by the state to produce the response and revenue predictive models. The data include business information, such as taxpayer industry code (Standard Industrial Classification), business type (corporation, partnership, etc.), location, the firm’s capital, gross sales, information on employees and wages, information on sales tax filings from recent years, information on other tax filings, and outcomes of prior audits. The data also include the outcomes of the sales tax audits carried out in the last campaign for each firm: whether or not the firm under-reported its earnings, and the sales tax revenue, if any, recovered by the audit. Particularly informative features for inferring non-compliance include wages and a measure capturing the audit success rate of the relevant industry group. Particularly informative features for estimating audit revenue include the company’s gross sales scaled by the average gross sales of peer firms and the ratio between reported taxable income and gross sales. Table 1 shows summary statistics of these data.
The models are induced from data reflecting only a fraction of the firms in the state: each year, only about 0.7 % of firms are audited. In addition, the state builds the models from a subset of recent audits within a “moving window”; consequently, only about 3 % of the firms in the state are used for modeling. The available set of audits is thus not only small, but also potentially lacking informative audits from the wide variety of firms in the state.
The tax authority’s objective is to conduct audits for which the expected revenue exceeds a given threshold amount. We first apply the CIA policy to identify informative audits from which the response and revenue models are induced, and then use the learned models to identify desirable audits. Based on prior evidence of thresholds used in practice (Micci-Barreca and Ramachandran 2004), we consider two values of the threshold Th: $8,000 and $10,000. Under these thresholds, 27 % and 23.44 % of all firms, respectively, constitute desirable audits. In subsequent evaluations over a different domain, we show that our results are consistent across different settings.
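To make the decision rule concrete, the following minimal Python sketch expresses the audit-selection criterion. It is illustrative only: it assumes scikit-learn-style model objects (`predict_proba`, `predict`), and the function names are ours, not the state’s.

```python
def expected_audit_revenue(response_model, revenue_model, firm_features):
    """Expected audit revenue: P(non-compliant) * E[recovery | non-compliant]."""
    p_noncompliant = response_model.predict_proba([firm_features])[0][1]
    estimated_recovery = revenue_model.predict([firm_features])[0]
    return p_noncompliant * estimated_recovery

def is_desirable_audit(response_model, revenue_model, firm_features, threshold=8000.0):
    """An audit is desirable when its expected revenue exceeds the threshold Th."""
    return expected_audit_revenue(response_model, revenue_model, firm_features) > threshold
```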
Tax audit decision makers require models that are simple, and therefore the empirical evaluations that follow employ comprehensible predictive models used in practice to inform audit decisions. Specifically, we use J48 (Witten and Frank 1999), WEKA’s implementation of C4.5 (Quinlan 1993), to induce the response model, and WEKA’s linear regression to estimate the revenue recovered in the case of non-compliance. In all the experiments we report here, CIA employs 10 bootstrap samples at each acquisition phase.
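Because J48 and WEKA’s linear regression are Java implementations, the sketch below uses scikit-learn’s DecisionTreeClassifier and LinearRegression as rough stand-ins. Only the choice of model families and the 10 bootstrap samples come from the text; everything else is an illustrative assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier    # rough stand-in for WEKA's J48 (C4.5)
from sklearn.linear_model import LinearRegression  # stand-in for WEKA's linear regression

def induce_models(X, noncompliant, revenue):
    """Induce the two models that jointly inform audit decisions."""
    response_model = DecisionTreeClassifier().fit(X, noncompliant)
    # The revenue model is fit only on audits of firms found non-compliant.
    nc = noncompliant == 1
    revenue_model = LinearRegression().fit(X[nc], revenue[nc])
    return response_model, revenue_model

def bootstrap_ensemble(X, noncompliant, revenue, n_boot=10, seed=0):
    """Fit models on 10 bootstrap samples of the training data, as CIA does at
    each acquisition phase, to gauge the variability of the models' estimates."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(induce_models(X[idx], noncompliant[idx], revenue[idx]))
    return models
```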
Because our aim is to improve decisions, we evaluate alternative policies not by the accuracy of the models they produce but by the profits from the decisions those models inform. We report average results over 100 randomized experiments. In each experiment, we randomly partition the data into three mutually exclusive subsets: an initial training set of 100 instances from which the initial models are induced, a pool of 1,000 prospective companies whose audits can be acquired, and a test set of 1,000 firms for evaluating the audit decisions informed by each policy. To reduce experimental variance, the same partitions are used to evaluate all the policies. At each acquisition phase, we let each policy acquire 50 audits from the pool of prospective acquisitions. After each acquisition phase, we induce each policy’s response and revenue models from its augmented training data and subsequently apply the learned models to inform the decision-theoretic optimal choices for companies in the test set.
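The protocol can be summarized as the following loop. This is a sketch under the partition sizes above; the `data.subset` accessor, the `decision_profit` evaluator, and the policy interface are hypothetical names, and `induce_models` is the stand-in defined earlier.

```python
def run_experiment(data, policy, seed, n_init=100, pool_size=1000,
                   test_size=1000, batch=50):
    """One randomized experiment: partition the data, then alternate acquisition
    phases with evaluation of the decisions the learned models inform."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    train = list(idx[:n_init])
    pool = list(idx[n_init:n_init + pool_size])
    test = list(idx[n_init + pool_size:n_init + pool_size + test_size])

    profits = []
    while True:
        models = induce_models(*data.subset(train))
        # Evaluate by the profit of the decisions the models inform, not by accuracy.
        profits.append(decision_profit(models, data.subset(test)))
        if not pool:
            break
        # The policy selects the next 50 audits to acquire from the pool.
        chosen = set(policy.select(models, data.subset(pool), k=batch))
        train += [pool[i] for i in chosen]
        pool = [p for i, p in enumerate(pool) if i not in chosen]
    return profits  # averaged over 100 such experiments in the evaluation
```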
In the figures that follow, the initial performance shown for all policies is identical and reflects the decision performance enabled by the initial training set, before any acquisitions are selected. Each policy then acquires new learning experiences until the pool of prospective acquisitions is exhausted; consequently, all policies also exhibit identical performance once the entire pool has been acquired.
The training data available to the state (and hence to us) for supervised learning is not a representative sample; it comprises only firms that have already been audited by the tax authority. Moreover, we could not request that the state perform new company audits. This limits both the set of company audits from which the active learning policies can select and the set of companies to which optimal decision making is applied. As such, our experiments reflect a scenario in which active learning is used to select or reject audits that have been pre-selected by the state. Nevertheless, an effective policy is likely to identify particularly informative audits even in this setting. Similarly, the profit-maximizing audit decisions for our test set correspond to a scenario in which models learned with active learning are applied either to approve or to forgo audits proposed by the state. A comprehensive evaluation of CIA’s benefits for audit selection can be obtained only once CIA is field-tested by the tax authority, so that audits selected by CIA from the complete set of firms in the state can be acquired.
To compensate for this limitation in our primary data, we complement our empirical results with an evaluation of CIA’s performance for similar decisions in a different domain, where there are no constraints on the selection of informative instances or on the decisions made by the learned models. We consider the problem posed by the 1998 KDD-cup competition (Hettich and Bay 1999), in which the goal was to maximize the profit from a donation campaign by a charitable organization. The data consist of responses to solicitations by donors who had made their last donation between 13 and 24 months prior to the proposed campaign; to our knowledge, the data reflect a representative sample of such donors. The models estimate the expected profit from a donor, allowing the charity to target donors who are profitable in expectation. Table 2 shows descriptive statistics for this dataset. As in sales tax audit decisions, the decision-theoretic optimal choice is inferred from a response model and a revenue model; in the donation-targeting application, the models estimate, respectively, the likelihood of response and the amount a prospective donor will contribute if a donation is made. For this evaluation we apply the same induction techniques as before. We also follow prior work for the choice of predictors and use only instances with actual donations to model the donation amount (Zadrozny and Elkan 2001). We employ CIA to acquire the donor responses from which the predictive models are learned, in order to improve subsequent targeting decisions.

We also aim to explore how different policies perform as the decision parameters change, and we therefore create several versions of the problem by varying the proportion of donors and the risk (loss) incurred by a futile solicitation. In the solicitation task as defined by the 1998 KDD-cup competition, the trivial solution of soliciting all donors performs nearly as well as decision-theoretic targeting, so better estimates of the decision-theoretic choices improve performance only marginally. To capture differences among alternative policies, we evaluate the policies when 20 % and 40 % of solicitations yield a donation, and vary the loss incurred from futile solicitations: we consider increasingly higher solicitation costs of $1, $3, and $5 with 20 % donors, and an additional set of higher costs of $7, $9, and $11 with the higher proportion of 40 % donors.
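The solicitation decision mirrors the audit rule, with the solicitation cost playing the role of the threshold. The sketch below renders the six problem variants and the decision rule; the variant list and cost values come from the text, while the helper names and model interfaces are illustrative assumptions.

```python
# (donor proportion, cost of a futile solicitation) for each problem variant
VARIANTS = [(0.20, c) for c in (1, 3, 5)] + [(0.40, c) for c in (7, 9, 11)]

def solicit(response_model, donation_model, donor_features, cost):
    """Solicit when the expected donation exceeds the cost of the solicitation."""
    p_donate = response_model.predict_proba([donor_features])[0][1]
    estimated_amount = donation_model.predict([donor_features])[0]
    return p_donate * estimated_amount > cost
```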
Alternative information acquisition policies
We compare the benefits from decisions informed by CIA with those obtained by five alternative acquisition policies and a uniform policy that acquires learning experiences uniformly at random.
Three alternative acquisition policies are accuracy-centric and aim to improve the predictive accuracy of one of the models informing the decisions. We apply two of these policies, Uncertainty Sampling (US) (Lewis and Gale 1994) and Query-by-Bagging (QBB) (Abe and Mamitsuka 1998), to improve the response model’s accuracy, and the third policy, proposed by Krogh and Vedelsby (1995) (henceforth, KV), to improve the revenue model’s accuracy. Our fourth alternative acquisition policy, Alternate Selection (hereafter, Alternating), is a type of Multi-Task Active Learning (MTAL) (Reichart et al. 2008) that aims to simultaneously improve the predictive accuracies of multiple classification models. Alternating and the other MTAL policy, Rank Combination, exhibit comparable performance (Reichart et al. 2008); however, the accuracy of each of the two MTAL policies is comparable to or lower than that achieved by a single-model accuracy-centric policy applied to each of the models. In our empirical evaluation we employ a variant of the Alternating policy. The original Alternating policy aimed to improve two classification models by alternating between acquisitions of informative learning experiences for each of the models; we adapt it to alternate between acquisitions of instances that are particularly informative for the response and revenue models, selecting informative instances for the response model with Uncertainty Sampling (Lewis and Gale 1994) and for the revenue model with the KV policy (Krogh and Vedelsby 1995), as sketched below. Note that an MTAL policy implicitly assumes that all improvements in any of the models are equally desirable, regardless of their implications for decisions. The objective of our comparisons is to evaluate whether CIA can improve decisions through its capacity to prefer improvements in one of the models over the other.
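A sketch of the adapted Alternating policy follows. The scoring rules are the standard ones (uncertainty as proximity of the estimated probability to 0.5, and KV-style disagreement as the variance of an ensemble’s predictions); the function names and batch interface are illustrative, not from the cited papers.

```python
import numpy as np

def uncertainty_scores(response_model, X_pool):
    """Uncertainty Sampling: prefer instances whose estimated class
    probability is closest to 0.5."""
    p = response_model.predict_proba(X_pool)[:, 1]
    return -np.abs(p - 0.5)  # higher score = more uncertain

def kv_scores(revenue_models, X_pool):
    """Krogh-Vedelsby-style criterion: prefer instances on which an ensemble
    of revenue models disagrees the most (highest prediction variance)."""
    predictions = np.stack([m.predict(X_pool) for m in revenue_models])
    return predictions.var(axis=0)

def alternating_select(response_model, revenue_models, X_pool, k=50, phase=0):
    """Adapted Alternating (MTAL) policy: even phases target the response
    model, odd phases the revenue model."""
    scores = (uncertainty_scores(response_model, X_pool) if phase % 2 == 0
              else kv_scores(revenue_models, X_pool))
    return np.argsort(scores)[-k:]  # indices of the k highest-scoring instances
```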
Our final alternative policy, GOAL, was the first decision-centric active learner (Saar-Tsechansky and Provost 2007). GOAL pursues acquisitions that improve a classification model’s probability estimation when these improvements are also likely to yield better decisions. However, unlike CIA, GOAL does not consider the implications of prospective acquisitions for the other models informing decisions. Our evaluation explores whether CIA’s consideration of all the models yields better decisions than those enabled by GOAL.
Finally, note that all active learners bias the training set to achieve a given objective. When multiple models are induced from the same data, a bias that improves one model may undermine another, with potentially adverse outcomes for decision making. By contrast, a uniform acquisition policy is bias-free: it favors no model over another when acquiring learning examples. We aim to establish whether the different active learners consistently yield decisions that are at least comparable to decisions informed by a representative sample drawn uniformly at random.