Rare Event Prediction Using Similarity Majority Under-Sampling Technique

  • Jinyan Li
  • Simon Fong
  • Shimin Hu
  • Victor W. Chu
  • Raymond K. Wong
  • Sabah Mohammed
  • Nilanjan Dey
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 788)


Abstract

In data mining it is not uncommon to be confronted by imbalanced classification problems, in which the interesting samples are rare. Training data with too many ordinary but too few rare samples misleads the classifier: it becomes over-fitted by learning too much from the majority class samples, and under-fitted, lacking the power to recognize minority class samples. In this work we study a novel rebalancing technique that under-samples (reduces by sampling) the majority class to subside the imbalanced class distribution without synthesizing extra training samples. This simple method is called the Similarity Majority Under-Sampling Technique (SMUTE). By measuring the similarity between each majority class sample and its surrounding minority class samples, SMUTE effectively discriminates the majority and minority class samples while largely preserving the underlying non-linear mapping between the input variables and the target classes. Two experiments are reported in this paper: one is an extensive performance comparison of SMUTE with state-of-the-art methods on generated imbalanced data; the other uses real data from a natural disaster prevention case, where accident samples are rare. SMUTE is found to compare favourably with the other methods in both cases.


Keywords: Imbalanced classification · Under-Sampling · Similarity measure · SMUTE
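The similarity-guided selection described in the abstract can be sketched in code. The sketch below is an illustrative reading, not the paper's exact algorithm: it assumes cosine similarity (the paper evaluates several similarity measures), a neighbourhood size `k`, and a selection rule that retains the majority samples least similar to their nearest minority samples; the function name `smute` and the `sampling_ratio` parameter are likewise illustrative.

```python
import numpy as np

def smute(X_maj, X_min, sampling_ratio=0.5, k=5):
    """Similarity-based majority under-sampling (illustrative sketch).

    For each majority sample, compute the mean cosine similarity to its
    k most similar minority samples, then keep only the majority samples
    least similar to the minority class (assumed selection rule).
    Assumes no zero-norm rows in either class.
    """
    # Normalize rows so that dot products equal cosine similarities.
    maj = X_maj / np.linalg.norm(X_maj, axis=1, keepdims=True)
    mnr = X_min / np.linalg.norm(X_min, axis=1, keepdims=True)
    sims = maj @ mnr.T                    # (n_maj, n_min) cosine similarities
    k = min(k, mnr.shape[0])
    topk = np.sort(sims, axis=1)[:, -k:]  # k highest similarities per majority sample
    score = topk.mean(axis=1)             # closeness of each majority sample to the minority class
    n_keep = max(1, int(sampling_ratio * X_maj.shape[0]))
    keep_idx = np.argsort(score)[:n_keep]  # lowest scores = farthest from the minority class
    return X_maj[keep_idx]

# Usage: shrink a 100-sample majority class to 30 samples.
rng = np.random.default_rng(0)
X_maj = rng.normal(size=(100, 4))
X_min = rng.normal(size=(10, 4))
X_maj_reduced = smute(X_maj, X_min, sampling_ratio=0.3)
```

Because the kept samples are those farthest from the minority region, the class overlap shrinks without fabricating synthetic points, which is the stated contrast with SMOTE-style over-sampling.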



Acknowledgements

The authors are thankful for the financial support from the research grants: (1) MYRG2016-00069, titled 'Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance', offered by RDAO/FST, University of Macau and the Macau SAR government; and (2) FDCT/126/2014/A3, titled 'A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel', offered by the FDCT of the Macau SAR government. Special thanks go to a Master's student, Jin Zhen, for her kind assistance in programming and experimentation.



Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  • Jinyan Li (1)
  • Simon Fong (1)
  • Shimin Hu (1)
  • Victor W. Chu (2)
  • Raymond K. Wong (3)
  • Sabah Mohammed (4)
  • Nilanjan Dey (5)

  1. Department of Computer Information Science, University of Macau, Macau SAR, China
  2. School of Computer Science and Engineering, Nanyang Technological University, Singapore
  3. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  4. Department of Computer Science, Lakehead University, Thunder Bay, Canada
  5. Department of Information Technology, Techno India College of Technology, Kolkata, India