
Data Mining for Imbalanced Datasets: An Overview

  • Nitesh V. Chawla

Summary

A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing datasets, and the performance measures that are more appropriate for mining imbalanced datasets.
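The example below is a minimal sketch, not taken from the chapter: the 1%/99% class split and the trivial majority-class predictor are illustrative assumptions. It shows how a classifier can score 99% accuracy on imbalanced data while missing every minority example, which is why measures such as precision, recall, and ROC analysis are preferred in this setting.

```python
# Toy illustration (hypothetical data): accuracy vs. precision/recall
# on an imbalanced two-class problem.
y_true = [1] * 10 + [0] * 990   # 10 minority (positive) and 990 majority (negative) examples
y_pred = [0] * 1000             # a trivial classifier that always predicts the majority class

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)                 # 0.99 -- looks excellent
precision = tp / (tp + fp) if (tp + fp) else 0.0   # 0.0  -- no positive predictions made
recall = tp / (tp + fn) if (tp + fn) else 0.0      # 0.0  -- every minority example is missed

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```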

Key words

imbalanced datasets, classification, sampling, ROC, cost-sensitive measures, precision and recall

Acknowledgements

I would like to thank Larry Hall, Kevin Bowyer and Philip Kegelmeyer for their valuable input during my Ph.D. research in this field. I am also extremely grateful to all my collaborators and co-authors in the area of learning from imbalanced datasets. I have enjoyed working with them and contributing to this field.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, USA