Collaborative information acquisition for data-driven decisions
Data-driven predictive models are routinely used by government agencies and industry to improve the efficiency of their decision making. In many cases, agencies acquire training data over time, incurring both direct and opportunity costs. Active learning can be used to acquire particularly informative training data that improve learning cost-effectively. However, when multiple models are used to inform decisions, prior work on active learning has significant limitations: either it improves the accuracy of predictive models without regard to how accuracy affects decision making, or it addresses only decisions informed by a single predictive model. We propose that decisions informed by multiple models warrant a new kind of Collaborative Information Acquisition (CIA) policy that allows multiple learners to reason collaboratively about informative acquisitions. This paper focuses on tax audit decisions, which affect a vital revenue source for governments worldwide. Because audits are costly to conduct, active learning policies can help identify particularly informative audits to improve future decisions. However, existing active learning policies are poorly suited to audit decisions, because audits are best informed by multiple predictive models. We develop a CIA policy to improve the decisions the models inform, and we demonstrate that CIA can substantially increase sales tax revenues. We also demonstrate that the CIA policy can improve decisions about which individuals to target directly in a donation campaign. Finally, we discuss and demonstrate the risks that naive use of existing active learning policies poses for decision making.
Keywords: Active learning, Predictive modeling, Decision making, Audit selection, Direct targeting, Tax audit
We thank Dr. Daniele Micci-Barreca and Dr. Satheesh Ramachandran for their introduction to the tax audit problem and for their generous collaboration. We also thank anonymous Machine Learning journal reviewers and the editors of this special issue for their insights and for thoughtful comments and suggestions that have improved the paper.