Machine Learning, Volume 95, Issue 1, pp 71–86

Collaborative information acquisition for data-driven decisions



Data-driven predictive models are routinely used by government agencies and industry to improve the efficiency of their decision making. In many cases, agencies acquire training data over time, incurring both direct and opportunity costs. Active learning can be used to acquire particularly informative training data that improve learning cost-effectively. However, when multiple models are used to inform decisions, prior work on active learning has significant limitations: either it improves the accuracy of predictive models without regard to how accuracy affects decision making, or it addresses only decisions informed by a single predictive model. We propose that decisions informed by multiple models warrant a new kind of Collaborative Information Acquisition (CIA) policy that allows multiple learners to reason collaboratively about informative acquisitions. This paper focuses on tax audit decisions, which affect a vital revenue source for governments worldwide. Because audits are costly to conduct, active learning policies can help identify particularly informative audits to improve future decisions. However, existing active learning models are poorly suited to audit decisions, because audits are best informed by multiple predictive models. We develop a CIA policy to improve the decisions the models inform, and we demonstrate that CIA can substantially increase sales tax revenues. We also demonstrate that the CIA policy can improve decisions about which individuals to target directly in a donation campaign. Finally, we discuss and demonstrate the risks that naive use of existing active learning policies poses for decision making.


Keywords: Active learning, Predictive modeling, Decision making, Audit selection, Direct targeting, Tax audit



We thank Dr. Daniele Micci-Barreca and Dr. Satheesh Ramachandran for their introduction to the tax audit problem and for their generous collaboration. We also thank anonymous Machine Learning journal reviewers and the editors of this special issue for their insights and for thoughtful comments and suggestions that have improved the paper.



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Austin, USA
  2. Information, Risk and Operations Management, McCombs School of Business, The University of Texas at Austin, Austin, USA
