Development of a Machine Learning Framework for Biomedical Text Mining

  • Ruben Rodrigues
  • Hugo Costa
  • Miguel Rocha
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 477)


Biomedical text mining (BTM) aims to create methods for searching and structuring knowledge extracted from the biomedical literature. Named entity recognition (NER), a BTM task, seeks to identify mentions of biological entities in texts. Dictionaries, regular expressions, natural language processing and machine learning (ML) algorithms are used for this task. Over the last few years, @Note2, an open-source software framework that includes user-friendly interfaces for important BTM tasks, has been developed, but it did not include ML-based methods. This work proposes BioTML, a framework that includes a number of ML-based approaches for NER, to fill the gap between @Note2 and state-of-the-art ML approaches. BioTML was integrated into @Note2 as a novel plug-in, in which Hidden Markov Models, Conditional Random Fields and Support Vector Machines were implemented to address NER tasks, working with a set of over 60 feature types used to train ML models. The implementation builds on open-source software such as MALLET, LibSVM, ClearNLP and OpenNLP. Several manually annotated corpora were used to validate BioTML. The results are promising, although there is room for improvement.
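To illustrate what token-level feature types for ML-based NER can look like, the following is a minimal Python sketch. It is not BioTML's implementation (which the abstract describes as building on Java libraries such as MALLET and LibSVM), and the feature names, the functions `token_features` and `sentence_features`, and the boundary markers `<BOS>` and `<EOS>` are hypothetical choices made for this example only. Feature dictionaries of this kind are a common input format for sequence labellers such as Conditional Random Fields.

```python
import re


def token_features(tokens, i):
    """Return a dict of illustrative (hypothetical) features for tokens[i]."""
    tok = tokens[i]
    # Word shape: uppercase -> "A", lowercase -> "a", digit -> "0".
    shape = re.sub(r"[A-Z]", "A", tok)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        "shape": shape,
    }
    # Context features: neighbouring tokens, with sentence-boundary markers.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens):
    """Build one feature dict per token in a tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features(["IL-2", "activates", "T", "cells"]):
        print(feats)
```

Orthographic features such as word shape, digits and hyphens are included here because biomedical entity names (for example gene or protein symbols) often differ from ordinary words in exactly these surface properties.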


Biomedical text mining; Named entity recognition; Machine learning





Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Centre of Biological Engineering, University of Minho, Braga, Portugal
  2. Silicolife, Lda, Braga, Portugal
