
The Impact of Local Data Characteristics on Learning from Imbalanced Data

  • Jerzy Stefanowski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8537)

Abstract

Problems of learning classifiers from imbalanced data are discussed. First, we look at different data difficulty factors corresponding to complex distributions of the minority class and show that they can be approximated by analysing the local neighbourhood of the learning examples from the minority class. We claim that the results of this analysis can serve as a basis for developing new algorithms. In this paper we demonstrate such possibilities by discussing modifications of the informed pre-processing method LN–SMOTE, as well as by incorporating the identified types of examples into the rule induction algorithm BRACID.
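
As a rough illustration of the neighbourhood analysis referred to above, the sketch below labels each minority-class example as safe, borderline, rare or outlier according to how many of its k nearest neighbours also belong to the minority class. This is a minimal sketch only, not the paper's exact procedure: the k = 5 thresholds, the plain Euclidean distance (such analyses are often carried out with the HVDM metric instead), and the function name label_minority_examples are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def label_minority_examples(X, y, minority_class, k=5):
    """Assign each minority example a type based on its k nearest neighbours.

    Assumed thresholds for k = 5 (an illustrative convention, not taken
    from the abstract): 4-5 minority neighbours -> "safe",
    2-3 -> "borderline", 1 -> "rare", 0 -> "outlier".
    """
    X = np.asarray(X)
    y = np.asarray(y)

    # k + 1 neighbours because the query point itself is returned first.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)

    labels = {}
    for i in np.where(y == minority_class)[0]:
        neighbours = idx[i, 1:]  # drop the point itself
        same = int(np.sum(y[neighbours] == minority_class))
        if same >= 4:
            labels[i] = "safe"
        elif same >= 2:
            labels[i] = "borderline"
        elif same == 1:
            labels[i] = "rare"
        else:
            labels[i] = "outlier"
    return labels


if __name__ == "__main__":
    # Small synthetic check on an imbalanced two-class problem.
    from collections import Counter
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, weights=[0.9, 0.1],
                               random_state=0)
    print(Counter(label_minority_examples(X, y, minority_class=1).values()))
```

The resulting distribution of example types could then drive, for instance, how aggressively an over-sampling method generates synthetic minority examples in borderline or rare regions.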



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jerzy Stefanowski
    Institute of Computing Science, Poznań University of Technology, Poznań, Poland
