Semi Supervised Under-Sampling: A Solution to the Class Imbalance Problem for Classification and Feature Selection

Conference paper

Abstract

Most medical datasets are not balanced in their class labels. Furthermore, in some cases the given class labels do not accurately represent the characteristics of the data record. Most existing classification methods tend to perform poorly on minority class examples when the dataset is extremely imbalanced, because they aim to optimize overall accuracy without considering the relative distribution of each class. The class imbalance problem can also affect the feature selection process. In this paper we propose a cluster-based under-sampling technique that solves the class imbalance problem for our cardiovascular data. Data prepared using this technique shows significantly better performance than data prepared using existing methods. A feature selection framework for unbalanced data is also proposed. The research found that, for data balanced by the proposed under-sampling method, ReliefF can be used to select fewer attributes with no degradation of subsequent classifier performance.
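
To make the idea of cluster-based under-sampling concrete, the following is a minimal sketch of one generic variant, not the procedure proposed in the paper (which the abstract does not specify). It assumes plain k-means clustering of the majority class and a proportional draw from each cluster so that the retained majority examples roughly match the minority count. The function names `kmeans` and `cluster_undersample`, the parameter `k`, and the toy data are all hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_undersample(majority, minority, k=3, seed=0):
    """Keep roughly len(minority) majority rows, drawn proportionally per cluster.

    Assumption: proportional allocation; other schemes are possible.
    Rounding means the result can differ from the target by a row or two.
    """
    rng = random.Random(seed)
    assign = kmeans(majority, k, seed=seed)
    target = min(len(minority), len(majority))
    kept = []
    for c in range(k):
        members = [p for p, a in zip(majority, assign) if a == c]
        quota = round(target * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


if __name__ == "__main__":
    # Toy imbalanced data: 90 majority rows, 10 minority rows.
    rng = random.Random(1)
    majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
    minority = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]
    reduced = cluster_undersample(majority, minority)
    print(len(reduced), len(minority))
```

Sampling within clusters, rather than uniformly at random, is intended to keep the retained majority examples spread across the regions the majority class occupies. In practice a library implementation of clustering would normally replace the hand-written `kmeans` above.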

Keywords

Class imbalance; Clustering; Over sampling; ReliefF; SMOTE; Under sampling

Notes

Acknowledgments

The authors gratefully acknowledge SEED Software in the Department of Computer Science of The University of Hull, UK, for funding this research project.

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Department of Computer Science, University of Hull, Kingston upon Hull, UK