
MahalCUSFilter: A Hybrid Undersampling Method to Improve the Minority Classification Rate of Imbalanced Datasets

  • Venkata Krishnaveni Chennuru
  • Sobha Rani Timmappareddy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10682)

Abstract

The class imbalance problem has received considerable attention in machine learning research. Among the methods that handle it, undersampling is a data-level approach that preprocesses the data set to reduce the number of majority class instances. Most existing undersampling methods apply either prototype selection or clustering techniques to balance the data set. They are effective and popular, but both processes are complex. Cluster-based undersampling methods have two drawbacks: the quality of the chosen majority class samples varies with the clustering algorithm and the number of clusters, and convergence is difficult. Prototype selection methods must compare each majority instance with its k nearest neighbors to decide which majority class instances to select or discard, which is time-consuming and difficult to implement for large datasets. The proposed undersampling method, Mahalanobis Centroid-based Undersampling with Filter (MahalCUSFilter), overcomes these problems: parameter dependence, complexity and information loss. The proposed method is used in conjunction with the C4.5 and kNN classifiers and is found to improve the minority class classification rate on all datasets, with comparable overall performance on the entire dataset. To the best of our knowledge, this kind of grouping has not been used in undersampling to improve the classification accuracy of imbalanced data sets.
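
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two ingredients the method's name suggests: the Mahalanobis distance of each majority instance to the majority-class centroid, and a reduction of the majority class based on those distances. The function names, the use of a pseudo-inverse, and the "spread evenly over the distance ranking" selection rule are assumptions made for illustration. They are not taken from the paper, and the filter step is not modelled.

```python
import numpy as np


def mahalanobis_to_centroid(X):
    """Mahalanobis distance of each row of X to the centroid (mean) of X."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)
    # Sample covariance of the features; atleast_2d handles the 1-feature case.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Assumption: a pseudo-inverse is used so a singular covariance does not fail.
    cov_inv = np.linalg.pinv(cov)
    diff = X - centroid
    # Quadratic form diff_i^T * cov_inv * diff_i for every row i.
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))


def undersample_majority(X_maj, n_keep):
    """Hypothetical rule: keep rows spread evenly over the distance ranking.

    Returns sorted row indices into X_maj. This is an illustrative stand-in,
    not the selection rule of MahalCUSFilter.
    """
    d = mahalanobis_to_centroid(X_maj)
    order = np.argsort(d)  # nearest to farthest from the centroid
    picks = np.linspace(0, len(order) - 1, num=n_keep).round().astype(int)
    return np.sort(order[np.unique(picks)])
```

A typical use under these assumptions would be to call `undersample_majority(X_maj, len(X_min))`, where `X_min` stands for the minority-class rows, and then train C4.5 or kNN on the retained majority rows together with all minority rows.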


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. SCIS, University of Hyderabad, Hyderabad, India
