Data Mining and Knowledge Discovery, Volume 17, Issue 2, pp 164–206

Quantifying counts and costs via classification

Abstract

Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The ‘cost quantification’ variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
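The bias the abstract describes has a standard remedy in the quantification literature: rather than simply counting the classifier's positive predictions ("classify and count"), invert the observed positive-prediction rate through the classifier's true-positive and false-positive rates (estimated, e.g., by cross-validation on the training set). A minimal sketch, with function and parameter names of my own choosing:

```python
def adjusted_count(predicted_positive_rate, tpr, fpr):
    """Estimate true prevalence from a raw classify-and-count estimate.

    predicted_positive_rate: fraction of test cases the classifier labels positive
    tpr, fpr: the classifier's true/false positive rates (e.g., from cross-validation)

    A classifier labels a fraction  p*tpr + (1-p)*fpr  of cases positive when the
    true prevalence is p; solving for p gives the correction below.
    """
    if tpr == fpr:
        raise ValueError("classifier is uninformative (tpr == fpr)")
    est = (predicted_positive_rate - fpr) / (tpr - fpr)
    # Clip to the valid prevalence range, since sampling noise can push it outside.
    return min(1.0, max(0.0, est))


# Example: a classifier with tpr=0.8, fpr=0.1 flags 24% of test cases positive.
# The corrected prevalence estimate is (0.24 - 0.1) / (0.8 - 0.1) = 0.20.
estimate = adjusted_count(0.24, tpr=0.8, fpr=0.1)
```

Note that this correction can work well even when raw classification accuracy is poor, which is the sense in which the abstract claims quantification remains practical where classification is not.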

Keywords

Supervised machine learning · Classification · Prevalence estimation · Class distribution estimation · Quantification · Research methodology · Detecting and tracking trends · Concept drift · Class imbalance · Text mining

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. Hewlett-Packard Labs, Palo Alto, USA