Abstract
The aim of this study is to systematically examine the performance of transformer-based models for the detection of tumor morphology mentions in clinical documents in Spanish. For this purpose, we analyzed 3 transformer models supporting the Spanish language, namely multilingual BERT, BETO and XLM-RoBERTa. By means of a transfer-learning-based approach, the models were first pretrained on a collection of real-world oncology clinical cases with the goal of adapting transformers to the distinctive features of the Spanish oncology domain. The resulting models were further fine-tuned on the Cantemist-NER task, addressing the detection of tumor morphology mentions as a multi-class sequence-labeling problem. To evaluate the effectiveness of the proposed approach, we compared the obtained results by the domain-specific version of the examined transformers with the performance achieved by the general-domain version of the models. The results obtained in this paper empirically demonstrated that, for every analyzed transformer, the clinical version outperformed the corresponding general-domain model on the detection of tumor morphology mentions in clinical case reports in Spanish. Additionally, the combination of the transfer-learning-based approach with an ensemble strategy exploiting the predictive capabilities of the distinct transformer architectures yielded the best obtained results, achieving a precision value of 0.893, a recall of 0.887 and an F1-score of 0.89, which remarkably surpassed the prior state-of-the-art performance for the Cantemist-NER task.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78. Association for Computational Linguistics, Minneapolis, June 2019
Baumann, L.A., Baker, J., Elshaug, A.G.: The impact of electronic health record systems on clinical documentation times: a systematic review. Health Policy 122(8), 827–836 (2018)
Bronnert, J.: Preparing for the CAC transition. J. AHIMA 82(7), 60–1; quiz 62 (2011)
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: Practical ML for Developing Countries Workshop@ ICLR 2020 (2020)
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. arXiv [cs.CL], November 2019
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv [cs.CL], October 2018
García-Pablos, A., Perez, N., Cuadros, M.: Vicomtech at CANTEMIST 2020. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), pp. 489–498. CEUR Workshop Proceedings (2020)
Hughes, M., Li, I., Kotoulas, S., Suzumura, T.: Medical text classification using convolutional neural networks. Stud. Health Technol. Inform. 235, 246–250 (2017)
Johnson, M., et al.: Google’s multilingual neural machine translation system: enabling Zero-Shot translation. Trans. Assoc. Comput. Linguist. 5, 339–351 (2017)
Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv [cs.CL], August 2018
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning. ICML 2001, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco, June 2001
Lample, G., Conneau, A.: Cross-lingual language model pretraining. arXiv [cs.CL] (2019)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv [cs.CL] (2019)
López-García, G., Jerez, J.M., Ribelles, N., Alba, E., Veredas, F.J.: ICB-UMA at CANTEMIST 2020: automatic ICD-O coding in Spanish with BERT. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), pp. 468–476. CEUR Workshop Proceedings (2020)
Miranda-Escalada, A., Farré-Maduell, E., Krallinger, M.: Named entity recognition, concept normalization and clinical coding: overview of the cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In: Iberian Languages Evaluation Forum (IberLEF 2020), pp. 303–323. CEUR Workshop Proceedings, Málaga, Spain (2020)
Mujtaba, G., et al.: Clinical text classification research trends: systematic literature review and open issues. Expert Syst. Appl. 116, 494–520 (2019)
National Cancer Institute: How Cancer Is Diagnosed (2019). https://www.cancer.gov/about-cancer/diagnosis-staging/diagnosis. Accessed 23 Apr 2021
Qiu, J.X., Yoon, H.J., Fearn, P.A., Tourassi, G.D.: Deep learning for automated extraction of primary sites from cancer pathology reports. IEEE J. Biomed. Health Inform. 22(1), 244–251 (2018)
Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol. 11, pp. 157–176. Springer, Dordrecht (1999). https://doi.org/10.1007/978-94-017-2390-9_10
Ribelles, N., et al.: Galén: Sistema de información para la gestión y coordinación de procesos en un servicio de oncología. RevistaeSalud 6(21), 1–12 (2010)
Si, Y., Wang, J., Xu, H., Roberts, K.: Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 26(11), 1297–1304 (2019)
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT: a Web-based Tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107. Association for Computational Linguistics, Avignon, April 2012
Sung, H., et al.: Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: Cancer J. Clin. 71, 209–249 (2021)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017)
Vítores, D.F.: El español: una lengua viva. Informe 2020. Instituto Cervantes (2020)
Wolf, T., et al.: HuggingFace’s transformers: state-of-the-art natural language processing. arXiv [cs.CL], October 2019
Xiong, Y., Huang, Y., Chen, Q., Wang, X., Nic, Y., Tang, B.: A joint model for medical named entity recognition and normalization. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), pp. 499–504. CEUR Workshop Proceedings (2020)
Yang, X., Bian, J., Hogan, W.R., Wu, Y.: Clinical concept extraction using transformers. J. Am. Med. Inform. Assoc. 27(12), 1935–1942 (2020)
Zhu, F., et al.: Biomedical text mining and its applications in cancer research. J. Biomed. Inform. 46(2), 200–211 (2013)
Acknowledgments
This work was partially supported by the project PID2020-116898RB-I00, Ministerio de Ciencia e Innovación, Plan Nacional de I+D+i, the project UMA-CEIATECH-01, Andalucía TECH, and the I Plan Propio de Investigación, Transferencia y Divulgación Científica of the Universidad de Málaga.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
López-García, G., Jerez, J.M., Ribelles, N., Alba, E., Veredas, F.J. (2021). Detection of Tumor Morphology Mentions in Clinical Reports in Spanish Using Transformers. In: Rojas, I., Joya, G., Català, A. (eds) Advances in Computational Intelligence. IWANN 2021. Lecture Notes in Computer Science(), vol 12861. Springer, Cham. https://doi.org/10.1007/978-3-030-85030-2_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-85030-2_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85029-6
Online ISBN: 978-3-030-85030-2
eBook Packages: Computer ScienceComputer Science (R0)