Comparative Study of Feature Selection Methods for Medical Full Text Classification

  • Carlos Adriano Gonçalves
  • Eva Lorenzo Iglesias
  • Lourdes Borrajo
  • Rui CamachoEmail author
  • Adrián Seara Vieira
  • Célia Talma Gonçalves
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11466)


There is a lot of work in text categorization using only the title and abstract of the papers. However, in a full paper there is a much larger amount of information that could be used to improve the text classification performance. The potential benefits of using full texts come with an additional problem: the increased size of the data sets.

To overcome the increased the size of full text data sets we performed an assessment study on the use of feature selection methods for full text classification. We have compared two existing feature selection methods (Information Gain and Correlation) and a novel method called k-Best-Discriminative-Terms. The assessment was conducted using the Ohsumed corpora. We have made two sets of experiments: using title and abstract only; and full text.

The results achieved by the novel method show that the novel method does not perform well in small amounts of text like title and abstract but performs much better for the full text data sets and requires a much smaller number of attributes.


Text classification Feature selection Medical texts corpus 



This work was supported by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding of ED431C2018/55-GRC Competitive Reference Group. This work was also partially funded by the ERDF through the COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT as part of project UID/EEA/50014/2013.


  1. 1.
    Gonçalves, C.A., Iglesias, E.L., Borrajo, L., Camacho, R., Vieira, A. S., Gonçalves, C.T.: LearnSec: a framework for full text analysis. In: de Cos Juez, F. et al. (eds) Hybrid Artificial Intelligent Systems HAIS 2018, vol. 10870, pp. 502–513. Springer, Cham (2018). Scholar
  2. 2.
    Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517 (2007)CrossRefGoogle Scholar
  3. 3.
    Markov, A.A., Nitussov, A.Y., Voropai, L., Link, D., Custance, G., Mahoney, M.S.: Classical Text in Translation: An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains (2006)Google Scholar
  4. 4.
    Borasem, P.N., Kinariwala, S.A.: Image re-ranking using information gain and relative consistency through multigraph learning (2016)Google Scholar
  5. 5.
    Vieira, A.S., Iglesias, E.L., Borrajo, L.: An HMM-based text classier less sensitive to document management problems. Bioinformatics 11, 503–515 (2016)Google Scholar
  6. 6.
    Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and Naive Bayes. In: 16th International Conference on Machine Learning (ICML), pp. 258–267. Morgan Kaufmann Publishers, San Francisco (1999)Google Scholar
  7. 7.
    Yang, Y., Pedersen, J. O.: A comparative study on feature selection in text categorization. In: Fourteenth International Conference on Machine Learning, pp. 412–420. Morgan Kaufmann Publishers Inc., San Francisco (1997)Google Scholar
  8. 8.
    Parlak, B., Uysal, A. K.: The impact of feature selection on medical document classification. In: 11th Iberian Conference on Information Systems and Technologies (CISTI), pp. 1–5 (2016)Google Scholar
  9. 9.
    Imambi, S.S., Sudha, T.: Article: a novel feature selection method for classification of medical documents from pubmed. Int. J. Comput. Appl. 26(9), 29–33 (2011)Google Scholar
  10. 10.
    Monta, E., Ranilla, J., Fernandez, J., Combarro, E.F., Diaz, I.: Scoring and selecting terms for text categorization. IEEE Intell. Syst. 20, 40–47 (2005)Google Scholar
  11. 11.
    Forman, G.: Feature selection for text classification. In: Liu, H., Motoda, H. (eds.) Computational Methods of Feature Selection, Data Mining and Knowledge Discoveries Series, pp. 257–276. Chapman and Hall/CRC, Boca Raton (2007)CrossRefGoogle Scholar
  12. 12.
    Hall, M.A., Smith, L.A.: Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In: Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pp. 235–239. AAAI Press (1999)Google Scholar
  13. 13.
    Hersh, W.R., Buckley, C., Leone, T.J., Hickam, D.H.: Ohsumed: an interactive retrieval evaluation and new large test collection for research. In: 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press (1994)Google Scholar
  14. 14.
    Zdravevski, E., Lameski, P., Kulakov, A., Filiposka, S., Trajanov, D., Boro, J.: Parallel computation of information gain using Hadoop and MapReduce. In: Federated Conference on Computer Science and Information Systems (2015)Google Scholar
  15. 15.
    Shang, C., Li, M., Feng, S., Jiang, Q, Fan, J.: Feature selection via maximizing global information gain for text classification. J. Know.-Based Syst. 54, 298–309 (2013)CrossRefGoogle Scholar
  16. 16.
    Wang, F., Li, C., Wang, J., Xu, J., Li, L.: A two-stage feature selection method for text categorization by using category correlation degree and latent semantic indexing. J. Shanghai Jiaotong Univ. (Sci.) 20(1), 44–50 (2015)CrossRefGoogle Scholar
  17. 17.
    Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)CrossRefGoogle Scholar
  18. 18.
    Xu, Y., Wang, B., Li, J.T., Jing, H.: An extended document frequency metric for feature selection in text categorization. In: Li, H., Liu, T., Ma, W.-Y., Sakai, T., Wong, K.-F., Zhou, G. (eds.) AIRS 2008. LNCS, vol. 4993, pp. 71–82. Springer, Heidelberg (2008). Scholar
  19. 19.
    Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)CrossRefGoogle Scholar
  20. 20.
    Talma Gonçalves, C., Camacho, R., Oliveira, E.: BioTextRetriever: a tool to retrieve relevant papers. Int. J. Knowl. Discov. Bioinform. 2(3), 21–36 (2011)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Carlos Adriano Gonçalves
    • 1
    • 3
  • Eva Lorenzo Iglesias
    • 1
  • Lourdes Borrajo
    • 1
  • Rui Camacho
    • 2
    • 3
    Email author
  • Adrián Seara Vieira
    • 1
  • Célia Talma Gonçalves
    • 4
    • 5
  1. 1.Computer Science DepartmentUniversity of Vigo, Escola Superior de Enxeñería InformáticaOurenseSpain
  2. 2.Faculdade de Engenharia da Universidade do PortoPortoPortugal
  3. 3.LIAAD - INESC TECPortoPortugal
  4. 4.CEOS.PP/ISCAP-P.PORTOPortoPortugal
  5. 5.LIACCPortoPortugal

Personalised recommendations