SMOTE-D a Deterministic Version of SMOTE

  • Fredy Rodríguez Torres
  • Jesús A. Carrasco-Ochoa
  • José Fco. Martínez-Trinidad
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9703)

Abstract

Imbalanced data is a problem of current research interest. It arises when the number of objects in one class is much lower than in the other classes. To address this problem, several methods for oversampling the minority class have been proposed. Oversampling methods generate synthetic objects for the minority class in order to balance the number of objects between classes; among them, SMOTE is one of the most successful and well-known. In this paper, we introduce a modification of SMOTE that generates synthetic objects for the minority class deterministically. The proposed method eliminates the random component of SMOTE and generates a different number of synthetic objects for each object of the minority class. An experimental comparison of the proposed method against SMOTE on standard imbalanced datasets is provided. The experimental results show that the proposed method improves on SMOTE in terms of F-measure.
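
For readers unfamiliar with the baseline, the random component mentioned in the abstract can be illustrated with a minimal sketch of classic SMOTE-style interpolation as commonly described in the literature. This is not the SMOTE-D procedure, whose details are not given in the abstract. The function name `smote_sketch` and its parameters `n_per_object`, `k`, and `seed` are hypothetical names chosen for this illustration, and Euclidean distance on numeric features is an assumption.

```python
import math
import random


def smote_sketch(minority, n_per_object=1, k=5, seed=0):
    """Illustrative classic SMOTE-style oversampling (NOT SMOTE-D).

    minority: list of numeric feature vectors from the minority class.
    Returns a list of synthetic objects, n_per_object for each input object.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority objects")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, x in enumerate(minority):
        # Indices of the k nearest minority neighbours of x (x itself excluded).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]
        for _ in range(n_per_object):
            # The two random choices below are the "random component":
            # which neighbour to use, and where on the segment to land.
            nn = minority[rng.choice(neighbours)]
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_objects = smote_sketch(minority, n_per_object=2, k=2, seed=1)
```

In this sketch every minority object receives the same number of synthetic objects, and the results change with the seed. According to the abstract, the proposed method removes both random choices and lets the number of synthetic objects vary per minority object.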

Keywords

Imbalanced datasets; Oversampling; Supervised classification

Notes

Acknowledgment

This work was partly supported by the National Council of Science and Technology of Mexico under scholarship grant 627301.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Fredy Rodríguez Torres (1)
  • Jesús A. Carrasco-Ochoa (1)
  • José Fco. Martínez-Trinidad (1)
  1. Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
