Data Mining and Knowledge Discovery, Volume 17, Issue 2, pp 225–252

Automatically countering imbalance and its empirical relationship to cost

  • Nitesh V. Chawla
  • David A. Cieslak
  • Lawrence O. Hall
  • Ajay Joshi

Abstract

Learning from imbalanced data sets presents a convoluted problem both from the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common but important criticism: how can the proper amount and type of sampling be discovered automatically? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions such as the f-measure, Area Under the ROC Curve (AUROC), cost, cost curves, and the cost-dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach against cost-sensitive learning methods (MetaCost and the Cost-Sensitive Classifiers) and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, on the KDD-99 Cup intrusion detection data set, we obtained the lowest cost per test example of any result we are aware of.
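To make the wrapper paradigm concrete, below is a minimal sketch of the search the abstract describes: try a grid of re-sampling amounts, score each with a cross-validated evaluation function (here the f-measure), and keep the best. The sampler (SMOTE), the decision-tree base classifier, and the grid of sampling levels are illustrative assumptions for this sketch, not the exact procedure of the paper, which also optimizes AUROC and cost.

```python
# Minimal wrapper sketch: grid-search the minority over-sampling level and
# keep the level that maximizes cross-validated f-measure. The data set,
# sampler, classifier, and grid below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: roughly 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

def cv_f1(sampling_level, n_splits=5):
    """Mean minority-class f-measure with SMOTE applied only inside each
    training fold (the test fold keeps the natural class distribution)."""
    scores = []
    for train, test in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        X_res, y_res = SMOTE(sampling_strategy=sampling_level,
                             random_state=0).fit_resample(X[train], y[train])
        clf = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
        scores.append(f1_score(y[test], clf.predict(X[test])))
    return np.mean(scores)

# Wrapper search: each level is the desired minority/majority ratio after SMOTE.
levels = [0.1, 0.25, 0.5, 0.75, 1.0]
scores = {level: cv_f1(level) for level in levels}
best = max(scores, key=scores.get)
print(f"selected over-sampling level: {best} (f-measure {scores[best]:.3f})")
```

The same loop generalizes to any of the evaluation functions named above: substituting a cost computation under a given cost matrix for f1_score yields a cost-optimizing variant of the search.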

Keywords

Classification · Unbalanced data · Cost-sensitive learning

References

  1. Amor NB, Benferhat S, Elouedi Z (2004) Naive Bayes vs. decision trees in intrusion detection systems. In: Proceedings of the ACM symposium on applied computing, pp 420–424
  2. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP (2005) Ensembles of classifiers from spatially disjoint data. In: Proceedings of the sixth international conference on multiple classifier systems, pp 196–205
  3. Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6(1): 20–29
  4. Blake CL, Newman DJ, Hettich S, Merz CJ (1998) UCI repository of machine learning databases. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html
  5. Bowyer KW, Hall LO, Chawla NV, Moore TE (2000) A parallel decision tree builder for mining very large visualization datasets. In: Proceedings of the IEEE international conference on systems, man and cybernetics
  6. Breiman L (1996) Bagging predictors. Machine Learn 24(2): 123–140
  7. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intel Res 16: 321–357
  8. Chawla NV, Hall LO, Joshi A (2005) Wrapper-based computation and evaluation of sampling methods for imbalanced datasets. In: KDD workshop: utility-based data mining
  9. Chawla NV, Japkowicz N, Kołcz A (eds) (2003) Proceedings of the ICML’2003 workshop on learning from imbalanced data sets
  10. Chawla NV, Japkowicz N, Kołcz A (2004) Editorial: learning from imbalanced datasets. SIGKDD Explorations 6(1): 1–6
  11. Cieslak D, Chawla NV (2006) Calibration and power of PETs on unbalanced datasets. TR 2006-12, Department of Computer Science and Engineering, University of Notre Dame
  12. Cohen WW (1995a) Fast effective rule induction. In: Prieditis A, Russell S (eds) 12th international conference on machine learning, Morgan Kaufmann, Tahoe City, CA, pp 115–123
  13. Cohen WW (1995b) Learning to classify English text with ILP methods. In: 5th international workshop on inductive logic programming, pp 3–24
  14. Dietterich T, Margineantu D, Provost F, Turney P (eds) (2000) Proceedings of the ICML’2000 workshop on cost-sensitive learning
  15. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Knowledge discovery and data mining, pp 155–164
  16. Drummond C, Holte R (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML’03 workshop on learning from imbalanced data sets
  17. Drummond C, Holte RC (2006) Cost curves: an improved method for visualizing classifier performance. Machine Learn 65(1): 95–130
  18. Dumais S, Platt J, Heckerman D, Sahami M (1998) Inductive learning algorithms and representations for text categorization. In: Seventh international conference on information and knowledge management, pp 148–155
  19. Elkan C (1999) Results of the KDD’99 classifier learning contest. http://www.cse.ucsd.edu/~elkan/clresults.html
  20. Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the seventeenth international joint conference on artificial intelligence, pp 973–978
  21. Esposito F, Malerba D, Semeraro G (1994) Multistrategy learning for document recognition. Appl Artif Intel 8: 33–84
  22. Ferri C, Flach P, Orallo J, Lachiche N (eds) (2004) First workshop on ROC analysis in AI. ECAI
  23. Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Machine Learn 30(2–3): 195–215
  24. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann, Nashville, Tennessee, pp 179–186
  25. Lewis D, Ringuette M (1994) A comparison of two learning algorithms for text categorization. In: 3rd annual symposium on document analysis and information retrieval, pp 81–93
  26. Ling C, Li C (1998) Data mining for direct marketing: problems and solutions. In: Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). AAAI Press, New York, NY, pp 73–79
  27. Maloof M (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: Proceedings of the ICML’03 workshop on learning from imbalanced data sets
  28. Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and Naive Bayes. In: ICML, pp 258–267
  29. Provost FJ, Domingos P (2003) Tree induction for probability-based ranking. Machine Learn 52(3): 199–215
  30. Provost FJ, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing induction algorithms. In: Fifteenth international conference on machine learning, pp 445–453
  31. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann
  32. Sabhnani MR, Serpen G (2003) Application of machine learning algorithms to KDD intrusion detection dataset within misuse detection context. In: Proceedings of the international conference on machine learning: models, technologies, and applications, pp 209–215
  33. Thain D, Tannenbaum T, Livny M (2005) Distributed computing in practice: the Condor experience. Concur Comput Pract Exp 17: 323–356
  34. Weiss G, McCarthy K, Zabar B (2007) Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In: DMIN, pp 35–41
  35. Weiss G, Provost F (2003) Learning when training data are costly: the effect of class distribution on tree induction. J Artif Intel Res 19: 315–354
  36. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann
  37. Woods K, Doss C, Bowyer KW, Solka J, Priebe C, Kegelmeyer WP (1993) Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int J Pattern Recog Artif Intel 7(6): 1417–1436
  38. Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 204–213
  39. Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: ICDM, pp 435–442
  40. Zhou Z, Liu X (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowledge Data Eng 18(1): 63–77

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Nitesh V. Chawla (1)
  • David A. Cieslak (1)
  • Lawrence O. Hall (2)
  • Ajay Joshi (2)

  1. Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, USA
  2. Department of Computer Science and Engineering, University of South Florida, Tampa, USA
