Data Mining pp 159-192

Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers

  • Alexander Liu
  • Cheryl Martin
  • Brian La Cour
  • Joydeep Ghosh
Chapter
Part of the Annals of Information Systems book series (AOIS, volume 8)

Abstract

In this chapter, we examine the relationship between cost-sensitive learning and resampling. We first introduce these concepts, including a new resampling method called “generative oversampling,” which creates new data points by learning parameters for an assumed probability distribution. We then examine theoretically and empirically the effects of different forms of resampling and their relationship to cost-sensitive learning on different classifiers and different data characteristics. For example, we show that generative oversampling used with linear SVMs provides the best results for a variety of text data sets. In contrast, no significant performance difference is observed for low-dimensional data sets when using Gaussians to model distributions in a naive Bayes classifier. Our theoretical and empirical results in these and other cases support the conclusion that the relative performance of cost-sensitive learning and resampling is dependent on both the classifier and the data characteristics.
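
The idea of generative oversampling described in the abstract (learn parameters for an assumed distribution from the minority class, then draw new points from it) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the chapter: the function name `generative_oversample` is hypothetical, the assumed distribution is one independent Gaussian per feature, and the toy data are invented.

```python
import random
import statistics


def generative_oversample(minority, n_new, seed=0):
    """Fit one Gaussian per feature to the minority class, then sample new points.

    Assumption for illustration only: features are modeled as independent
    Gaussians; other distributions could be substituted.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))  # transpose rows into feature columns
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation, i.e. the maximum-likelihood estimate.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]
        for _ in range(n_new)
    ]


# Toy minority class with two features (invented values).
minority = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.3], [1.1, 5.1]]
majority_size = 10  # hypothetical size of the majority class

# Draw enough synthetic points to match the majority class size.
synthetic = generative_oversample(minority, majority_size - len(minority))
balanced_minority = minority + synthetic
```

Unlike oversampling by replication, the synthetic points in this sketch are new draws rather than copies of existing minority examples; how that difference affects a given classifier is the kind of question the chapter examines.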

Keywords

Minority Class, Positive Class, Resampling Method, Laplace Smoothing, Initial Parameter Estimate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Applied Research Labs, University of Texas at Austin, Austin, USA
  2. Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, USA
