A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE

  • Research Article - Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

The classification of imbalanced data is recognized as a crucial problem in machine learning and data mining. In an imbalanced dataset, one class has significantly fewer training instances than another, so minority class instances are much more likely to be misclassified. The synthetic minority over-sampling technique (SMOTE) was developed to deal with the classification of imbalanced datasets. It balances the dataset by synthesizing new minority class samples from the existing minority class instances. However, existing SMOTE-based algorithms use the same sampling rate for all minority class instances, which results in sub-optimal performance. To address this issue, we propose a novel genetic algorithm-based SMOTE (GASMOTE) algorithm. GASMOTE uses different sampling rates for different minority class instances and searches for the optimal combination of sampling rates. Experimental results on ten typical imbalanced datasets show that, compared with SMOTE, GASMOTE improves the F-measure by 5.9% and the G-mean by 1.6%, and compared with Borderline-SMOTE, it improves the F-measure by 3.7% and the G-mean by 2.3%. GASMOTE can thus be used as a new over-sampling technique for imbalanced dataset classification. We have also applied GASMOTE to a practical engineering application: rockburst prediction on the VCR rockburst datasets. The experimental results indicate that GASMOTE can accurately predict rockburst occurrence and hence provides guidance for the design and construction of safe deep mining engineering structures.
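
To make the idea of per-instance sampling rates concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard SMOTE interpolation step (a synthetic point is drawn on the segment between a minority instance and one of its k nearest minority neighbours) and takes a hypothetical integer vector `rates`, one entry per minority instance, of the kind a genetic algorithm could search over. The function names, the toy data, and the parameter `k` are illustrative assumptions; the genetic search itself and its fitness function are not shown because the abstract does not specify them.

```python
import math
import random


def nearest_neighbors(points, index, k):
    """Return indices of the k points closest (Euclidean) to points[index]."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def smote_with_rates(minority, rates, k=5, seed=0):
    """Create rates[i] synthetic samples from minority[i] by SMOTE interpolation.

    `rates` is a hypothetical per-instance sampling-rate vector (assumption:
    non-negative integers). Plain SMOTE would use one shared value for all
    minority instances.
    """
    if len(minority) != len(rates):
        raise ValueError("need one sampling rate per minority instance")
    rng = random.Random(seed)
    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, rates)):
        neighbors = nearest_neighbors(minority, i, k)
        if not neighbors:
            continue  # a lone minority point has nothing to interpolate with
        for _ in range(n_new):
            z = minority[rng.choice(neighbors)]
            gap = rng.random()  # position along the segment from x to z
            synthetic.append([a + gap * (b - a) for a, b in zip(x, z)])
    return synthetic


# Toy minority class with a different sampling rate for each instance.
minority = [[1.0, 1.0], [1.2, 0.9], [3.0, 3.1], [0.9, 1.3]]
rates = [2, 0, 3, 1]
new_points = smote_with_rates(minority, rates, k=2)
print(len(new_points))  # equals sum(rates)
```

In this sketch a candidate solution of the genetic algorithm would simply be one `rates` vector; each candidate could be scored by training a classifier on the over-sampled data and measuring, for example, F-measure or G-mean on held-out data, which is an assumption about the fitness rather than something stated in the abstract.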

Author information

Corresponding author

Correspondence to Kun Jiang.

About this article

Cite this article

Jiang, K., Lu, J. & Xia, K. A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE. Arab J Sci Eng 41, 3255–3266 (2016). https://doi.org/10.1007/s13369-016-2179-2

  • DOI: https://doi.org/10.1007/s13369-016-2179-2
