Development of a Machine Learning Framework for Biomedical Text Mining

  • Conference paper
10th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2016)

Abstract

Biomedical text mining (BTM) aims to create methods for searching and structuring knowledge extracted from the biomedical literature. Named entity recognition (NER), a BTM task, seeks to identify mentions of biological entities in texts. Dictionaries, regular expressions, natural language processing and machine learning (ML) algorithms are used in this task. Over the last few years, @Note2, an open-source software framework that includes user-friendly interfaces for important BTM tasks, has been developed, but it did not include ML-based methods. This work proposes BioTML, a framework that includes a number of ML-based approaches for NER, to fill the gap between @Note2 and state-of-the-art ML approaches. BioTML was integrated into @Note2 as a novel plug-in, in which Hidden Markov Models, Conditional Random Fields and Support Vector Machines were implemented to address NER tasks, working with a set of over 60 feature types used to train ML models. The implementation relies on open-source software, such as MALLET, LibSVM, ClearNLP and OpenNLP. Several manually annotated corpora were used to validate BioTML. The results are promising, although there is room for improvement.
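
To make the NER setting in the abstract more concrete, the following minimal sketch shows how annotated text might be turned into per-token labels and feature dictionaries before training a sequence model such as an HMM, CRF or SVM. It is an illustration only: the regex tokenizer, the BIO labelling scheme, the example sentence with its annotations, and the handful of features are all assumptions chosen for brevity. They are not BioTML's actual feature types or its API, and the sketch does not use MALLET, LibSVM, ClearNLP or OpenNLP.

```python
import re

# Simple regex tokenizer (an assumption; real pipelines use NLP toolkits).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def bio_labels(tokens, spans):
    """Map annotated (start, end, type) character spans to B-/I-/O labels."""
    labels = []
    for _, start, end in tokens:
        label = "O"
        for s, e, etype in spans:
            if start >= s and end <= e:
                # The first token of a span gets B-, later tokens get I-.
                label = ("B-" if start == s else "I-") + etype
                break
        labels.append(label)
    return labels


def token_features(tokens, i):
    """Build a small, purely illustrative feature dict for token i."""
    word = tokens[i][0]
    return {
        "lower": word.lower(),
        "is_capitalized": word[0].isupper(),
        "is_upper": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "suffix3": word[-3:].lower(),
        "prev": tokens[i - 1][0].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1][0].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Hypothetical sentence and annotations, used only for illustration.
text = "Mutations in BRCA1 increase breast cancer risk."
spans = [(13, 18, "GENE"), (28, 41, "DISEASE")]

tokens = tokenize(text)
labels = bio_labels(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]

for (word, _, _), label in zip(tokens, labels):
    print(f"{word}\t{label}")
```

In a full pipeline, the `features` and `labels` sequences would be passed to a sequence-labelling trainer; which features matter most depends on the corpus and entity types.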

Corresponding author

Correspondence to Ruben Rodrigues.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Rodrigues, R., Costa, H., Rocha, M. (2016). Development of a Machine Learning Framework for Biomedical Text Mining. In: Saberi Mohamad, M., Rocha, M., Fdez-Riverola, F., Domínguez Mayo, F., De Paz, J. (eds) 10th International Conference on Practical Applications of Computational Biology & Bioinformatics. PACBB 2016. Advances in Intelligent Systems and Computing, vol 477. Springer, Cham. https://doi.org/10.1007/978-3-319-40126-3_5

  • DOI: https://doi.org/10.1007/978-3-319-40126-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40125-6

  • Online ISBN: 978-3-319-40126-3

  • eBook Packages: Engineering, Engineering (R0)
