On Feature Weighting and Selection for Medical Document Classification

  • Conference paper

Part of the book series: Studies in Computational Intelligence (SCI, volume 718)

Abstract

Medical document classification remains an active research problem within the text classification domain. This study analyzes the impact of feature selection and feature weighting on medical document classification using two datasets of MEDLINE documents. Two feature selection methods, the Gini index and the distinguishing feature selector, and two term weighting methods, term frequency (TF) and term frequency-inverse document frequency (TF-IDF), are evaluated with two pattern classifiers: a Bayesian network and a C4.5 decision tree. Since the study deals with single-label classification, the methods are assessed on a subset of OHSUMED documents and on a self-constructed dataset. Because some categories in the self-compiled dataset contain few documents, only documents belonging to 10 disease categories are used in the experiments for both datasets. Experimental results show that the best result is obtained with the combination of the distinguishing feature selector, TF feature weighting, and the Bayesian network classifier.
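
To make the selection and weighting steps named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the commonly cited form of the distinguishing feature selector, DFS(t) = sum over classes C_i of P(C_i|t) / (P(not t|C_i) + P(t|not C_i) + 1), estimated from document-level presence counts, and a plain TF-IDF variant, tf(t) * ln(N / df(t)). The function names, the toy corpus, and the choice of natural logarithm are all illustrative assumptions.

```python
import math
from collections import Counter


def dfs_scores(docs, labels):
    """Score every term with the distinguishing feature selector (DFS).

    Assumed form: DFS(t) = sum_i P(C_i|t) / (P(not t|C_i) + P(t|not C_i) + 1),
    with probabilities estimated from how many documents contain the term.
    """
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_size = Counter(labels)
    # df[c][t] = number of documents of class c that contain term t
    df = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        df[c].update(set(tokens))
    vocab = {t for tokens in docs for t in tokens}

    scores = {}
    for t in vocab:
        df_t = sum(df[c][t] for c in classes)  # documents containing t
        score = 0.0
        for c in classes:
            p_c_given_t = df[c][t] / df_t
            p_not_t_given_c = 1 - df[c][t] / class_size[c]
            n_other = n_docs - class_size[c]
            p_t_given_not_c = (df_t - df[c][t]) / n_other if n_other else 0.0
            score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
        scores[t] = score
    return scores


def tf_idf(tokens, doc_freq, n_docs):
    """TF-IDF weights for one document: tf(t) * ln(N / df(t))."""
    tf = Counter(tokens)  # raw term frequency is the TF weighting itself
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}


# Hypothetical toy corpus: two disease categories, two documents each.
docs = [
    ["insulin", "glucose", "patient"],
    ["glucose", "diet", "patient"],
    ["tumor", "biopsy", "patient"],
    ["tumor", "chemotherapy", "patient"],
]
labels = ["diabetes", "diabetes", "cancer", "cancer"]

scores = dfs_scores(docs, labels)
# Keep the highest-scoring terms as the selected feature set.
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]

doc_freq = Counter(t for tokens in docs for t in set(tokens))
weights = tf_idf(docs[0], doc_freq, len(docs))
```

On this toy corpus, "patient" occurs in every document, so it receives the lowest DFS score (0.5) and a TF-IDF weight of 0, while "glucose", which occurs in every diabetes document and in no cancer document, receives the highest DFS score (1.0).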


Acknowledgements

This work was supported by Anadolu University, Fund of Scientific Research Projects under grant number 1503F136.

Author information

Corresponding author

Correspondence to Bekir Parlak.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Parlak, B., Uysal, A.K. (2018). On Feature Weighting and Selection for Medical Document Classification. In: Rocha, Á., Reis, L. (eds) Developments and Advances in Intelligent Systems and Applications. Studies in Computational Intelligence, vol 718. Springer, Cham. https://doi.org/10.1007/978-3-319-58965-7_19

  • DOI: https://doi.org/10.1007/978-3-319-58965-7_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-58963-3

  • Online ISBN: 978-3-319-58965-7

  • eBook Packages: Engineering, Engineering (R0)
