Improving SMOTE with Fuzzy Rough Prototype Selection to Detect Noise in Imbalanced Classification Data

  • Nele Verbiest
  • Enislay Ramentol
  • Chris Cornelis
  • Francisco Herrera
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7637)


In this paper, we present a prototype selection technique for imbalanced data, Fuzzy Rough Imbalanced Prototype Selection (FRIPS), to improve the quality of the artificial instances generated by the Synthetic Minority Over-sampling TEchnique (SMOTE). Fuzzy rough set theory is used to measure the noise level of each instance, and instances whose noise level exceeds a threshold are deleted. The threshold is determined using a wrapper approach that evaluates the training Area Under the Curve (AUC) of candidate subsets. The proposal aims to clean noisy data before SMOTE is applied, so that SMOTE can generate high-quality artificial data.

Experiments on artificial data show that FRIPS in combination with SMOTE outperforms state-of-the-art methods, and that it performs particularly well in the presence of noise.
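
The procedure outlined in the abstract (score each instance for noise, choose a deletion threshold with a wrapper on training AUC, then oversample the cleaned data) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The noise measure below (the mean of the few largest fuzzy similarities to instances of the other class), the leave-one-out 1-NN evaluator, the toy data, and all function names are assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random

def similarity(a, b, ranges):
    """Fuzzy similarity in [0, 1]: 1 minus the mean range-normalised attribute difference."""
    return sum(1 - abs(p - q) / r for p, q, r in zip(a, b, ranges)) / len(a)

def noise_levels(X, y, top=3):
    """ASSUMED noise measure: mean of the `top` largest similarities to
    other-class instances. Requires both classes to be present."""
    ranges = [(max(col) - min(col)) or 1.0 for col in zip(*X)]
    levels = []
    for i, xi in enumerate(X):
        sims = sorted((similarity(xi, xj, ranges)
                       for j, xj in enumerate(X) if y[j] != y[i]), reverse=True)
        levels.append(sum(sims[:top]) / len(sims[:top]))
    return levels

def training_auc(X, y, keep):
    """AUC of leave-one-out 1-NN over all training data, with `keep` as prototypes.
    For hard 0/1 predictions the AUC reduces to (TPR + TNR) / 2."""
    correct = {0: 0, 1: 0}
    for i, xi in enumerate(X):
        nearest = min((j for j in keep if j != i), key=lambda j: math.dist(xi, X[j]))
        correct[y[i]] += y[nearest] == y[i]
    return (correct[1] / y.count(1) + correct[0] / y.count(0)) / 2

def select_prototypes(X, y):
    """Wrapper: try each observed noise level as threshold, keep the best-AUC subset."""
    noise = noise_levels(X, y)
    best_auc, best_keep = -1.0, list(range(len(X)))
    for tau in sorted(set(noise)):
        keep = [i for i, n in enumerate(noise) if n <= tau]  # delete noise > tau
        if len({y[i] for i in keep}) < 2:
            continue  # a usable prototype set needs both classes
        auc = training_auc(X, y, keep)
        if auc >= best_auc:  # on ties, prefer the threshold that deletes fewer instances
            best_auc, best_keep = auc, keep
    return best_keep, best_auc

def smote(minority, n_new, k=3, rng=None):
    """Basic SMOTE: interpolate between a minority instance and one of its
    k nearest minority neighbours."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy data (hypothetical): class 0 is the majority, class 1 the minority.
X = [[0.0, 0.0], [0.3, 0.0], [0.0, 0.3], [0.3, 0.3],  # majority cluster
     [1.0, 1.0], [0.9, 0.8], [0.8, 1.0],              # minority cluster
     [0.15, 0.15]]                                    # minority label inside the majority region
y = [0, 0, 0, 0, 1, 1, 1, 1]

keep, auc = select_prototypes(X, y)
minority = [X[i] for i in keep if y[i] == 1]
n_majority = sum(1 for i in keep if y[i] == 0)
synthetic = smote(minority, n_new=n_majority - len(minority), k=2)
```

On this toy set, the mislabelled-looking point at (0.15, 0.15) receives the highest noise level and is removed before oversampling, so SMOTE interpolates only between the three points of the clean minority cluster. Resolving AUC ties in favour of the larger threshold is a design choice of this sketch: it deletes as few instances as possible.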


Keywords: SMOTE, imbalanced classification, AUC, fuzzy rough set theory





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Nele Verbiest (1)
  • Enislay Ramentol (2)
  • Chris Cornelis (1, 3)
  • Francisco Herrera (3)

  1. Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium
  2. Dept. of Computer Science, University of Camagüey, Cuba
  3. Dept. of Computer Science and AI, University of Granada, Spain
