Adaptive Multi-objective Swarm Crossover Optimization for Imbalanced Data Classification

  • Jinyan Li
  • Simon Fong
  • Meng Yuan
  • Raymond K. Wong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10086)

Abstract

Training a classifier with an imbalanced dataset, where the majority class contributes far more data than the minority class, is a well-known problem in the data mining research community. The resultant classifier tends to be under-fitted in recognizing test instances of the minority class and over-fitted to the overwhelming number of mediocre samples from the majority class. Many techniques have been tried, ranging from artificially boosting the number of minority-class training samples (e.g., SMOTE), to downsizing the volume of majority-class samples, to modifying the classification induction algorithm in favour of the minority class. However, finding the ratio of samples between the two classes that yields the most accurate classifier is tricky, owing to the non-linear relationships between the attributes and the class labels. Merely rebalancing the sample sizes of the two classes to exact proportions often does not produce the best result, and a brute-force search for the perfect combination of majority/minority-class samples is NP-hard. In this paper, a unified preprocessing approach is proposed that uses stochastic swarm heuristics to cooperatively optimize the mixture of samples from the two classes by progressively rebuilding the training dataset. Our novel approach is shown to outperform the existing popular methods.
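
The paper itself does not include code; the following is a minimal sketch of the core idea only, under stated assumptions: scikit-learn as the toolkit, a decision tree as the base learner, Cohen's kappa (cf. refs. 33, 34) as a single fitness objective instead of the paper's multi-objective formulation, simple random over/undersampling in place of the paper's progressive cooperative rebuilding, and a plain global-best PSO searching the two class-sample counts. All names and parameters below are illustrative choices, not the authors' implementation.

```python
# Sketch (NOT the authors' method): global-best PSO searching the
# (majority, minority) sample counts that maximize validation kappa.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Imbalanced toy data: roughly 90% majority (class 0), 10% minority (class 1).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
maj, mino = X_tr[y_tr == 0], X_tr[y_tr == 1]

def fitness(p):
    """Kappa of a tree trained on a rebuilt set; p = (n_maj, n_min)."""
    n_maj, n_min = int(p[0]), int(p[1])
    X_maj = resample(maj, replace=False, n_samples=n_maj, random_state=0)   # undersample
    X_min = resample(mino, replace=True, n_samples=n_min, random_state=0)   # oversample
    X_bal = np.vstack([X_maj, X_min])
    y_bal = np.array([0] * n_maj + [1] * n_min)
    clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
    return cohen_kappa_score(y_va, clf.predict(X_va))

# Search bounds: keep at least 50 majority samples; let the minority grow
# (via resampling with replacement) up to the majority size.
lo = np.array([50.0, float(len(mino))])
hi = np.array([float(len(maj)), float(len(maj))])

n_particles, n_iter = 12, 30
pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1 = rng.random((n_particles, 1))
    r2 = rng.random((n_particles, 1))
    # Standard velocity update: inertia + cognitive + social terms.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best (n_maj, n_min):", gbest.astype(int), "kappa: %.3f" % pbest_fit.max())
```

The sketch illustrates why a stochastic search is attractive here: the fitness landscape over sample counts is non-linear and non-differentiable, so the swarm probes mixtures directly rather than assuming an exactly balanced split is optimal.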

Keywords

Class rebalancing · Swarm optimization · Classification

Notes

Acknowledgement

The authors are thankful for the financial support from the Research Grant Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF), Grant no. MYRG2015-00128-FST, offered by the University of Macau, FST, and RDAO.

References

  1. Sun, A., Lim, E.-P., Liu, Y.: On strategies for imbalanced text classification using SVM: a comparative study. Decis. Support Syst. 48(1), 191–201 (2009)
  2. Cao, H., Li, X.L., Woon, D.Y.K., Ng, S.K.: Integrated oversampling for imbalanced time series classification. IEEE Trans. Knowl. Data Eng. 25(12), 2809–2822 (2013)
  3. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: KDD 1998 (1998)
  4. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30(2–3), 195–215 (1998)
  5. Choe, W., Ersoy, O.K., Bina, M.: Neural network schemes for detecting rare events in human genomic DNA. Bioinformatics 16(12), 1062–1072 (2000)
  6. Mazurowski, M.A., et al.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw. 21(2), 427–436 (2008)
  7. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
  8. Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(1), 281–288 (2009)
  9. Guo, H., Viktor, H.L.: Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explor. Newsl. 6(1), 30–39 (2004)
  10. Li, J., Fong, S., Mohammed, S., et al.: Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J. Supercomput. 72, 3708 (2016). doi:10.1007/s11227-015-1541-6
  11. Chawla, N.V.: C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Proceedings of the ICML, vol. 3 (2003)
  12. Stone, E.A.: Predictor performance with stratified data and imbalanced classes. Nat. Methods 11(8), 782–783 (2014)
  13. Chen, Y.-W., Lin, C.-J.: Combining SVMs with various feature selection strategies. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, pp. 315–324. Springer, Heidelberg (2006)
  14. Wallace, B.C., et al.: Class imbalance, redux. In: 2011 IEEE 11th International Conference on Data Mining (ICDM). IEEE (2011)
  15. Liu, A., Ghosh, J., Martin, C.E.: Generative oversampling for mining imbalanced datasets. In: DMIN (2007)
  16. Batuwita, R., Palade, V.: Efficient resampling methods for training support vector machines with imbalanced datasets. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE (2010)
  17. Drummond, C., Holte, R.C.: C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on Learning from Imbalanced Datasets II, vol. 11 (2003)
  18. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: ICML, vol. 97 (1997)
  19. Chawla, N.V., Bowyer, K.W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 341–378 (2002)
  20. Galar, M., et al.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
  21. Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L.: Cost-sensitive learning methods for imbalanced data. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE (2010)
  22. Zhu, X.: Lazy bagging for classifying imbalanced data. In: IEEE ICDM 2007, pp. 763–768 (2007)
  23. Sun, Y., Kamel, M.S., Wang, Y.: Boosting for learning multiple classes with imbalanced class distribution. In: IEEE ICDM 2006, pp. 592–602 (2006)
  24. del Río, S., et al.: On the use of MapReduce for imbalanced big data using random forest. Inf. Sci. 285, 112–137 (2014)
  25. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: SMOTEBoost: improving prediction of the minority class in boosting. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003). doi:10.1007/978-3-540-39804-2_12
  26. Fan, W., et al.: AdaCost: misclassification cost-sensitive boosting. In: ICML, vol. 99 (1999)
  27. Sun, Y., et al.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 40(12), 3358–3378 (2007)
  28. Zadrozny, B., Langford, J., Abe, N.: Cost-sensitive learning by cost-proportionate example weighting. In: IEEE ICDM 2003, pp. 435–442 (2003)
  29. Kennedy, J., et al.: Swarm Intelligence. Morgan Kaufmann, San Francisco (2001)
  30. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intell. 1(1), 33–57 (2007)
  31. Li, J., et al.: Adaptive swarm balancing algorithms for rare-event prediction in imbalanced healthcare data. Comput. Med. Imaging Graph. (2016). http://dx.doi.org/10.1016/j.compmedimag.2016.05.001
  32. Van den Bergh, F., Engelbrecht, A.P.: A cooperative approach to particle swarm optimization. IEEE Trans. Evol. Comput. 8(3), 225–239 (2004)
  33. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)
  34. Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the kappa statistic. Fam. Med. 37(5), 360–363 (2005)
  35. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint handling with evolutionary algorithms. I: a unified formulation. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 28(1), 26–37 (1998)
  36. García, S., Herrera, F.: Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol. Comput. 17(3), 275–306 (2009)
  37. Cieslak, D.A., Chawla, N.V.: Learning decision trees for unbalanced data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008). doi:10.1007/978-3-540-87479-9_34
  38. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: ICML, vol. 96 (1996)
  39. 39.
    Alcalá, J., et al.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple Valued Logic Soft Comput. 17(255-287), 11 (2010)Google Scholar

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Computer and Information Science, University of Macau, Macau SAR, China
  2. School of Computer Science and Engineering, University of New South Wales, Kensington, Australia