Abstract

Identifying the most relevant articles for a given task among a rapidly growing number of publications is highly time-consuming for researchers. To help with this, a package called BioTMPy (https://github.com/BioSystemsUM/biotmpy) was developed to implement a complete pipeline for classifying biomedical literature using state-of-the-art Deep Learning (DL) models. The package is divided into distinct modules that can be used at different steps of a pipeline, either together or independently. To validate BioTMPy, the package was used to compare several pre-trained embeddings on a dataset from a BioCreative challenge, where BioWordVec performed slightly better than GloVe, PubMed vectors and “pubmed_ncbi” embeddings. Additionally, we implemented and compared several state-of-the-art DL models encompassing recurrent and convolutional layers, as well as transformers with attention mechanisms, including those from the BERT family. We obtained an improvement of over 7% in average precision and 3% in F1-score compared to the challenge’s best submission.
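The two metrics reported above can be illustrated with a minimal, dependency-free sketch. It is not taken from BioTMPy or from the challenge's evaluation script; the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the challenge may define average precision slightly differently (for example in how ties are ranked).

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), computed from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k where a relevant document appears."""
    # Rank documents by descending score (ties keep their input order).
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy data, invented for illustration only (not results from the paper).
y_true = [1, 0, 1, 1, 0]            # 1 = relevant article
scores = [0.9, 0.8, 0.7, 0.3, 0.2]  # hypothetical classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = f1_score(y_true, y_pred)
ap = average_precision(y_true, scores)
print(f"F1 = {f1:.3f}, average precision = {ap:.3f}")
```

F1 depends on the chosen decision threshold, whereas average precision depends only on how the documents are ranked, which is why the two are often reported together for document triage tasks.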

Notes

  1. https://biotmpyppi.bio.di.uminho.pt/

Acknowledgements

This research has been supported by FCT - Fundação para a Ciência e Tecnologia through the DeepBio project - ref. NORTE-01-0247-FEDER-039831, funded by Lisboa 2020, Norte 2020, Portugal 2020 and FEDER - Fundo Europeu de Desenvolvimento Regional.

Author information

Corresponding author

Correspondence to Nuno Alves.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alves, N., Rodrigues, R., Rocha, M. (2022). BioTMPy: A Deep Learning-Based Tool to Classify Biomedical Literature. In: Rocha, M., Fdez-Riverola, F., Mohamad, M.S., Casado-Vara, R. (eds) Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021). PACBB 2021. Lecture Notes in Networks and Systems, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-86258-9_12
