Abstract
Medical document classification remains a popular research problem within the text classification domain. In this study, the impact of feature selection and feature weighting on medical document classification is analyzed using two datasets containing MEDLINE documents. The performance of two feature selection methods, namely the Gini index and the distinguishing feature selector, and two term weighting methods, namely term frequency (TF) and term frequency-inverse document frequency (TF-IDF), is analyzed using two pattern classifiers: a Bayesian network and a C4.5 decision tree. As this study deals with single-label classification, a subset of the documents in OHSUMED and a self-compiled dataset are used to assess these methods. Because some categories in the self-compiled dataset contain few documents, only documents belonging to 10 disease categories are used in the experiments for both datasets. Experimental results show that the best results are obtained with the combination of the distinguishing feature selector, TF feature weighting, and the Bayesian network classifier.
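As a rough illustration of the two selection scores and the two weighting schemes named in the abstract, the sketch below computes them on a tiny made-up corpus. The abstract gives no formulas, so the code assumes the definitions commonly cited in the literature: the distinguishing feature selector as the sum over classes of P(C|t) / (P(not t|C) + P(t|not C) + 1), the improved Gini index as the sum over classes of P(t|C)^2 * P(C|t)^2, and TF-IDF as raw term count times log(N/df). The toy documents and every function name are hypothetical and are not the authors' implementation or data.

```python
import math
from collections import Counter

# Toy, made-up corpus of (tokens, label) pairs; NOT taken from OHSUMED
# or from the authors' self-compiled dataset.
DOCS = [
    (["insulin", "glucose", "patient"], "diabetes"),
    (["insulin", "therapy", "patient"], "diabetes"),
    (["tumor", "biopsy", "patient"], "cancer"),
    (["tumor", "therapy", "patient"], "cancer"),
]


def class_stats(docs, term):
    """Per class, return (P(t|C), P(C|t), P(t|not C)) from document counts."""
    labels = sorted({label for _, label in docs})
    with_term = [label for tokens, label in docs if term in tokens]
    stats = {}
    for c in labels:
        n_c = sum(1 for _, label in docs if label == c)
        n_not_c = len(docs) - n_c
        n_t_c = with_term.count(c)
        n_t_not_c = len(with_term) - n_t_c
        p_t_given_c = n_t_c / n_c
        p_c_given_t = n_t_c / len(with_term) if with_term else 0.0
        p_t_given_not_c = n_t_not_c / n_not_c if n_not_c else 0.0
        stats[c] = (p_t_given_c, p_c_given_t, p_t_given_not_c)
    return stats


def dfs_score(docs, term):
    """Distinguishing feature selector score (assumed definition, see lead-in)."""
    return sum(
        p_c_t / ((1.0 - p_t_c) + p_t_nc + 1.0)
        for p_t_c, p_c_t, p_t_nc in class_stats(docs, term).values()
    )


def gini_score(docs, term):
    """Improved Gini index score (assumed definition, see lead-in)."""
    return sum(
        (p_t_c ** 2) * (p_c_t ** 2)
        for p_t_c, p_c_t, _ in class_stats(docs, term).values()
    )


def tf_weights(tokens):
    """TF weighting: the raw count of each term in one document."""
    return dict(Counter(tokens))


def tfidf_weights(docs, tokens):
    """TF-IDF weighting: raw count times log(N / document frequency)."""
    n_docs = len(docs)
    weights = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for doc_tokens, _ in docs if term in doc_tokens)
        weights[term] = tf * math.log(n_docs / df) if df else 0.0
    return weights


if __name__ == "__main__":
    for term in ("insulin", "therapy", "patient"):
        print(term, dfs_score(DOCS, term), gini_score(DOCS, term))
    print(tf_weights(DOCS[0][0]))
    print(tfidf_weights(DOCS, DOCS[0][0]))
```

In this toy corpus a term that occurs in every document of exactly one class ("insulin") receives the highest score from both selectors, while a term spread across all classes ("patient") scores lower and gets a TF-IDF weight of zero. Under either selector, the top-scoring terms would then be kept and weighted with TF or TF-IDF before being passed to a classifier.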
Acknowledgements
This work was supported by Anadolu University, Fund of Scientific Research Projects under grant number 1503F136.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Parlak, B., Uysal, A.K. (2018). On Feature Weighting and Selection for Medical Document Classification. In: Rocha, Á., Reis, L. (eds) Developments and Advances in Intelligent Systems and Applications. Studies in Computational Intelligence, vol 718. Springer, Cham. https://doi.org/10.1007/978-3-319-58965-7_19
DOI: https://doi.org/10.1007/978-3-319-58965-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58963-3
Online ISBN: 978-3-319-58965-7
eBook Packages: Engineering, Engineering (R0)