Abstract
Identifying the most relevant articles for a given task among a rapidly growing number of publications is highly time-consuming for researchers. To help with this task, we developed BioTMPy (https://github.com/BioSystemsUM/biotmpy), a package implementing a complete pipeline to classify biomedical literature with state-of-the-art deep learning models. The package is divided into distinct modules that can be combined across the steps of a pipeline or used independently. To validate BioTMPy, we compared several pre-trained embeddings on a dataset from a BioCreative challenge, where BioWordVec performed slightly better than GloVe, PubMed vectors and the “pubmed_ncbi” embeddings. Additionally, we implemented and compared several state-of-the-art deep learning models encompassing recurrent and convolutional layers, as well as transformers with attention mechanisms, including models from the BERT family. We obtained an improvement of over 7% in average precision and 3% in F1-score over the challenge’s best submission.
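The pipeline the abstract describes — training a classifier on labelled abstracts and ranking unseen documents by relevance — can be sketched in a few lines. This is only an illustrative baseline, not BioTMPy's actual API: the package uses deep learning models (recurrent, convolutional and BERT-family architectures), whereas the sketch below substitutes a TF-IDF plus logistic regression classifier so it stays self-contained, and the toy documents and labels are invented for the example.

```python
# Minimal document-triage sketch: fit a relevance classifier on labelled
# abstracts, then rank unseen abstracts by predicted relevance probability.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training corpus (labels: 1 = relevant to the task, 0 = irrelevant).
train_docs = [
    "protein kinase interaction detected in mutant cells",
    "novel mutation alters protein binding affinity",
    "annual report on laboratory equipment procurement",
    "conference schedule and registration information",
]
train_labels = [1, 1, 0, 0]

# Vectorize text and train the classifier as a single pipeline object.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

# Score an unseen abstract; column 1 is the probability of relevance.
probs = clf.predict_proba(["protein mutation detected in cells"])[:, 1]
```

In BioTMPy the vectorization step would instead use pre-trained embeddings (e.g. BioWordVec) and the classifier a deep learning model, but the overall shape — preprocess, train, score, rank — is the same.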
Acknowledgements
This research has been supported by FCT - Fundação para a Ciência e Tecnologia through the DeepBio project - ref. NORTE-01-0247-FEDER-039831, funded by Lisboa 2020, Norte 2020, Portugal 2020 and FEDER - Fundo Europeu de Desenvolvimento Regional.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alves, N., Rodrigues, R., Rocha, M. (2022). BioTMPy: A Deep Learning-Based Tool to Classify Biomedical Literature. In: Rocha, M., Fdez-Riverola, F., Mohamad, M.S., Casado-Vara, R. (eds) Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021). Lecture Notes in Networks and Systems, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-86258-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86257-2
Online ISBN: 978-3-030-86258-9
eBook Packages: Intelligent Technologies and Robotics (R0)