Abstract
Imbalanced data is a well-known problem in many practical applications of machine learning, and its effects on the performance of standard classifiers are substantial. In this paper we investigate whether classifying Medline documents using the MeSH controlled vocabulary poses additional challenges for class-imbalanced prediction. For this task, we evaluate the performance of Bayesian networks combined with several available strategies for overcoming the effect of class imbalance. Our results show both that Bayesian network classifiers are sensitive to class imbalance and that existing techniques can improve their overall performance.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pavón, R., Laza, R., Reboiro-Jato, M., Fdez-Riverola, F. (2011). Assessing the Impact of Class-Imbalanced Data for Classifying Relevant/Irrelevant Medline Documents. In: Rocha, M.P., Rodríguez, J.M.C., Fdez-Riverola, F., Valencia, A. (eds) 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011). Advances in Intelligent and Soft Computing, vol 93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19914-1_45
DOI: https://doi.org/10.1007/978-3-642-19914-1_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19913-4
Online ISBN: 978-3-642-19914-1
eBook Packages: Engineering, Engineering (R0)