Development of a Machine Learning Framework for Biomedical Text Mining

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 477)

Abstract

Biomedical text mining (BTM) aims to create methods for searching and structuring knowledge extracted from the biomedical literature. Named entity recognition (NER), a BTM task, seeks to identify mentions of biological entities in texts. Dictionaries, regular expressions, natural language processing and machine learning (ML) algorithms are used for this task. Over the last few years, @Note2, an open-source software framework that includes user-friendly interfaces for important BTM tasks, has been developed, but it did not include ML-based methods. This work proposes BioTML, a framework comprising a number of ML-based approaches for NER, to fill the gap between @Note2 and state-of-the-art ML approaches. BioTML was integrated into @Note2 as a novel plug-in in which Hidden Markov Models, Conditional Random Fields and Support Vector Machines were implemented to address NER tasks, working with a set of over 60 feature types used to train ML models. The implementation builds on open-source software such as MALLET, LibSVM, ClearNLP and OpenNLP. Several manually annotated corpora were used to validate BioTML. The results are promising, although there is room for improvement.
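
To make the ML view of NER concrete, the following is a minimal sketch of how a sentence might be turned into per-token feature dictionaries and BIO labels, the kind of input that sequence models such as HMMs and CRFs are typically trained on. It is not BioTML's implementation: the function names, the handful of features shown, the example sentence and its annotation are all assumptions made for illustration, and the abstract does not list the actual 60+ feature types.

```python
def token_features(tokens, i):
    """Return a few illustrative (hypothetical) features for tokens[i]."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        # Context features: neighbouring tokens, with boundary markers.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def bio_labels(tokens, entity_spans):
    """Encode (start, end_exclusive, type) spans as BIO labels."""
    labels = ["O"] * len(tokens)
    for start, end, etype in entity_spans:
        labels[start] = "B-" + etype          # first token of the mention
        for j in range(start + 1, end):
            labels[j] = "I-" + etype          # remaining tokens of the mention
    return labels


# Made-up sentence and annotation, used only to exercise the functions.
tokens = "The IL-2 receptor alpha gene is expressed in T cells".split()
spans = [(1, 5, "DNA"), (8, 10, "cell_type")]

labels = bio_labels(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]

for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

Each (features, label) pair would then be handed to a sequence learner; a classifier such as an SVM would instead treat every token as an independent instance, usually with the context features standing in for the sequence structure.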

Keywords

  • Biomedical text mining
  • Named entity recognition
  • Machine learning



Author information

Corresponding author

Correspondence to Ruben Rodrigues.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Rodrigues, R., Costa, H., Rocha, M. (2016). Development of a Machine Learning Framework for Biomedical Text Mining. In: Saberi Mohamad, M., Rocha, M., Fdez-Riverola, F., Domínguez Mayo, F., De Paz, J. (eds) 10th International Conference on Practical Applications of Computational Biology & Bioinformatics. PACBB 2016. Advances in Intelligent Systems and Computing, vol 477. Springer, Cham. https://doi.org/10.1007/978-3-319-40126-3_5

  • DOI: https://doi.org/10.1007/978-3-319-40126-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40125-6

  • Online ISBN: 978-3-319-40126-3

  • eBook Packages: Engineering, Engineering (R0)