Similarity Majority Under-Sampling Technique for Easing Imbalanced Classification Problem

  • Jinyan Li
  • Simon Fong
  • Shimin Hu
  • Raymond K. Wong
  • Sabah Mohammed
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 845)


Imbalanced classification is an active research topic in data mining, machine learning and pattern recognition. When class distributions are imbalanced, a classifier tends to over-fit the majority class, having learned from too many of its samples, and to under-fit the minority class, which it then recognizes poorly. Prior methods ease the problem through sampling techniques that rebalance the class distributions of the dataset. In this paper, we propose a novel approach that under-samples the majority class to adjust the original imbalanced class distribution. The method is called the Similarity Majority Under-sampling Technique (SMUTE). By calculating the similarity of each majority class sample to its surrounding minority class samples, SMUTE effectively separates the majority and minority class samples to increase the recognition power for each class. The experimental results show that SMUTE can outperform current under-sampling methods at the same under-sampling rate.
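
The idea of similarity-based majority under-sampling can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract leaves the details open. The following are all assumptions made for illustration: the function name `similarity_undersample`, the use of cosine similarity, scoring each majority sample by its `k` most similar minority samples, the `keep_fraction` parameter, and the rule of discarding the majority samples that score highest. Rows are assumed to be non-zero vectors.

```python
import numpy as np

def similarity_undersample(X_maj, X_min, keep_fraction=0.5, k=3):
    """Return sorted indices of the majority samples to keep.

    Illustrative only: the similarity measure, neighbourhood size and
    selection rule are assumptions, not taken from the paper.
    """
    X_maj = np.asarray(X_maj, dtype=float)
    X_min = np.asarray(X_min, dtype=float)

    # Cosine similarity between every majority and every minority sample
    # (assumes no all-zero rows, which would divide by zero here).
    maj_unit = X_maj / np.linalg.norm(X_maj, axis=1, keepdims=True)
    min_unit = X_min / np.linalg.norm(X_min, axis=1, keepdims=True)
    sim = maj_unit @ min_unit.T  # shape: (n_majority, n_minority)

    # Score each majority sample by the mean similarity to its k most
    # similar ("surrounding") minority samples.
    k = min(k, sim.shape[1])
    score = np.sort(sim, axis=1)[:, -k:].mean(axis=1)

    # Assumed rule: keep the majority samples least similar to the
    # minority class, discarding those that overlap with it most.
    n_keep = max(1, int(round(keep_fraction * len(X_maj))))
    keep = np.argsort(score, kind="stable")[:n_keep]
    return np.sort(keep)

# Tiny usage example with made-up 2-D points.
X_min = [[1.0, 0.0], [1.0, 0.1]]
X_maj = [[1.0, 0.05], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]]
kept = similarity_undersample(X_maj, X_min, keep_fraction=0.5, k=2)
print(kept)
```

With these toy points, the two majority samples pointing in nearly the same direction as the minority samples are the ones dropped. Reversing the selection rule or substituting another similarity measure changes only the scoring and sorting lines.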


Imbalanced classification, Under-sampling, Similarity measure, SMUTE



The authors are thankful for the financial support of research grant #MYRG2016-00069, titled ‘Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance’, offered by RDAO/FST, University of Macau and the Macau SAR government.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer and Information Science, University of Macau, Taipa, Macau SAR, China
  2. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  3. Department of Computer Science, Lakehead University, Thunder Bay, Canada
