K-means Based Local Decomposition for Rare Class Analysis

  • Junjie Wu
Chapter
Part of the Springer Theses book series (Springer Theses)

Abstract

Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, rare class analysis remains a critical challenge, because no natural way has been developed for handling imbalanced class distributions. This chapter fills this crucial void by developing a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes.
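The decomposition step described above can be illustrated with a minimal sketch. It assumes plain Lloyd-style K-means as the clustering routine (suggested by the chapter title) and a fixed number of sub-classes per large class. The function names `kmeans` and `local_decompose`, the parameter `k`, and the toy data are hypothetical and are not taken from the chapter.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means on a list of numeric tuples.

    Returns one cluster index per point. Illustrative only.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct positions as initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def local_decompose(X, y, large_classes, k):
    """Relabel every instance as (original_label, sub_class_id).

    Instances of a class listed in `large_classes` are clustered into k
    sub-classes; all other instances keep sub_class_id 0.
    """
    sub_labels = [(label, 0) for label in y]
    for cls in large_classes:
        idx = [i for i, label in enumerate(y) if label == cls]
        clusters = kmeans([X[i] for i in idx], k)
        for i, c in zip(idx, clusters):
            sub_labels[i] = (cls, c)
    return sub_labels


# Toy data (assumed): a large "normal" class made of two separated groups,
# and a rare "attack" class with only two instances.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
     (2.5, 2.4), (2.6, 2.5)]
y = ["normal"] * 6 + ["attack"] * 2

sub = local_decompose(X, y, large_classes={"normal"}, k=2)
restored = [label for label, _ in sub]  # dropping the sub-class id recovers y
```

In this toy setting the six "normal" instances are split into two sub-classes of three, which are much closer in size to the two-instance rare class than the original class was. How the sub-classes are used afterwards is not stated in the abstract; the sketch only shows that the original label can be recovered by discarding the sub-class index.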

Keywords

Local Cluster; Complex Concept; Linear Classifier; Normal Class; Network Intrusion Detection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Junjie Wu (1)
  1. Department of Information Systems, School of Economics and Management, Beihang University, Beijing, China
