Knowledge and Information Systems, Volume 33, Issue 2, pp 245–265

SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory

  • Enislay Ramentol
  • Yailé Caballero
  • Rafael Bello
  • Francisco Herrera
Regular Paper

Abstract

Imbalanced data is a common problem in classification, and it is growing in importance because it appears in most real-world domains. It is especially relevant for highly imbalanced data-sets, where the ratio between class sizes is high. Many techniques have been developed to tackle imbalanced training sets in supervised learning. They fall into two large groups: those at the algorithm level and those at the data level. The data-level techniques that have received the most attention try to balance the training set either by reducing the larger class through the elimination of samples (undersampling) or by enlarging the smaller class through the construction of new samples (oversampling). This paper proposes a new hybrid method for preprocessing imbalanced data-sets that constructs new samples with the Synthetic Minority Oversampling Technique (SMOTE) and applies an editing technique based on Rough Set Theory and the lower approximation of a subset. An experimental study using C4.5 as the learning algorithm validates the proposed method and shows good results.
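The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity relation, its parameters, or the order of the steps, so the distance threshold `radius`, the function names `smote` and `in_lower_approximation`, and the idea of filtering synthetic samples against the original data are all assumptions made for illustration. The first function interpolates between minority samples in the usual SMOTE manner; the second keeps a candidate only if every original sample similar to it belongs to the target class, which is the defining property of a lower approximation.

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours.
    Needs at least two minority samples."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)
    # Pairwise distances inside the minority class; a point is not its own neighbour.
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(minority), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[pick] - minority[base])

def in_lower_approximation(candidates, X, y, target, radius):
    """Boolean mask over candidates. True means that no original sample
    within `radius` (the assumed similarity relation) has a class other
    than `target`. A candidate with no original sample in range also passes."""
    candidates = np.asarray(candidates, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(candidates[:, None, :] - X[None, :, :], axis=2)
    similar = d <= radius
    other_class = (y != target)[None, :]
    return ~(similar & other_class).any(axis=1)
```

A possible use, under the same assumptions, is to generate `synthetic = smote(X[y == 1], n_new)` and keep only `synthetic[in_lower_approximation(synthetic, X, y, target=1, radius=r)]` before training the classifier. The choice of `r` controls how strict the editing is and would need to be tuned per data-set.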

Keywords

Imbalanced data-sets, Classification, Data preparation, Oversampling, Undersampling, Rough sets theory

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Enislay Ramentol (1)
  • Yailé Caballero (1)
  • Rafael Bello (2)
  • Francisco Herrera (3)
  1. Department of Computer Science, University of Camagüey, Camagüey, Cuba
  2. Department of Computer Science, Universidad Central de Las Villas, Santa Clara, Cuba
  3. Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Granada, Spain
