Assessing the Impact of Class-Imbalanced Data for Classifying Relevant/Irrelevant Medline Documents

  • Reyes Pavón
  • Rosalía Laza
  • Miguel Reboiro-Jato
  • Florentino Fdez-Riverola
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 93)

Abstract

Imbalanced data is a common problem in many practical applications of machine learning, and its effect on the performance of standard classifiers is considerable. In this paper we investigate whether the classification of Medline documents using the MeSH controlled vocabulary poses additional challenges when dealing with class-imbalanced prediction. For this task, we evaluate the performance of Bayesian networks using several available strategies to overcome the effect of class imbalance. Our results show both that Bayesian network classifiers are sensitive to class imbalance and that existing techniques can improve their overall performance.
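To illustrate the kind of experiment described in the abstract, the following minimal Python sketch trains a naive Bayes classifier (the simplest form of Bayesian network) on a synthetic, highly imbalanced two-class dataset and compares minority-class F-measure with and without random oversampling, one of the standard strategies for countering class imbalance. The synthetic term-occurrence features, the 9:1 class ratio, and the use of scikit-learn are illustrative assumptions, not the paper's actual corpus or implementation.

# Illustrative sketch (not the paper's pipeline): naive Bayes on a
# synthetic imbalanced corpus, with and without random oversampling.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Synthetic "term occurrence" features: 2000 documents, 50 binary terms,
# with the relevant (minority) class making up ~10% of the corpus.
X = rng.binomial(1, 0.3, size=(2000, 50))
y = np.zeros(2000, dtype=int)
y[:200] = 1
X[y == 1, :10] = rng.binomial(1, 0.7, size=(200, 10))  # minority-class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: train directly on the imbalanced training set.
baseline = BernoulliNB().fit(X_tr, y_tr)
print("F1 (minority), no resampling:", f1_score(y_te, baseline.predict(X_te)))

# Random oversampling: replicate minority examples until classes are balanced.
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_min_os, y_min_os = resample(X_min, y_min, replace=True,
                              n_samples=int((y_tr == 0).sum()), random_state=0)
X_bal = np.vstack([X_tr[y_tr == 0], X_min_os])
y_bal = np.concatenate([y_tr[y_tr == 0], y_min_os])

balanced = BernoulliNB().fit(X_bal, y_bal)
print("F1 (minority), oversampled:", f1_score(y_te, balanced.predict(X_te)))

Other strategies mentioned in the paper's context, such as undersampling the majority class or cost-sensitive learning, could be swapped in at the same point in this pipeline.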

Keywords

document classification, imbalanced data, Medline documents, MeSH terms, Bayesian networks

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Reyes Pavón (1)
  • Rosalía Laza (1)
  • Miguel Reboiro-Jato (1)
  • Florentino Fdez-Riverola (1)
  1. ESEI: Escuela Superior de Ingeniería Informática, University of Vigo, Edificio Politécnico, Ourense, Spain
