Detection of Tumor Morphology Mentions in Clinical Reports in Spanish Using Transformers

López-García, Guillermo; Jerez, José M.; Ribelles, Nuria; Alba, Emilio; Veredas, Francisco J.

doi:10.1007/978-3-030-85030-2_3

Conference paper
First Online: 21 August 2021

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12861)

Abstract

The aim of this study is to systematically examine the performance of transformer-based models for detecting tumor morphology mentions in Spanish clinical documents. To this end, we analyzed three transformer models that support Spanish: multilingual BERT, BETO and XLM-RoBERTa. Following a transfer-learning approach, the models were first pretrained on a collection of real-world oncology clinical cases in order to adapt them to the distinctive features of the Spanish oncology domain. The resulting models were then fine-tuned on the Cantemist-NER task, which frames the detection of tumor morphology mentions as a multi-class sequence-labeling problem. To evaluate the effectiveness of the proposed approach, we compared the results obtained by the domain-specific versions of the examined transformers with those achieved by the general-domain versions of the same models. The results show empirically that, for every transformer analyzed, the clinical version outperformed the corresponding general-domain model on the detection of tumor morphology mentions in Spanish clinical case reports. Additionally, combining the transfer-learning approach with an ensemble strategy that exploits the predictive capabilities of the distinct transformer architectures yielded the best results, with a precision of 0.893, a recall of 0.887 and an F1-score of 0.89, surpassing the prior state-of-the-art performance on the Cantemist-NER task.
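The precision, recall and F1-score reported in the abstract can be checked against each other, since F1 is the harmonic mean of precision and recall. The short sketch below only recomputes that relationship from the two reported values; the function name f1_score and the variable names are illustrative and are not taken from the paper or from any evaluation script used by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the ensemble of domain-adapted models.
reported_precision = 0.893
reported_recall = 0.887

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # prints 0.89
```

Rounded to two decimals, the recomputed value agrees with the F1-score of 0.89 stated in the abstract.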

Notes

1. https://sklearn-crfsuite.readthedocs.io/

Acknowledgments

This work was partially supported by the project PID2020-116898RB-I00, Ministerio de Ciencia e Innovación, Plan Nacional de I+D+i, the project UMA-CEIATECH-01, Andalucía TECH, and the I Plan Propio de Investigación, Transferencia y Divulgación Científica of the Universidad de Málaga.

Author information

Authors and Affiliations

Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071, Málaga, Spain
Guillermo López-García, José M. Jerez & Francisco J. Veredas
Unidad de Gestión Clínica Intercentros de Oncología, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospitales Universitarios Regional y Virgen de la Victoria, 29010, Málaga, Spain
Nuria Ribelles & Emilio Alba

Corresponding author

Correspondence to Guillermo López-García.

Editor information

Editors and Affiliations

University of Granada, Granada, Spain
Ignacio Rojas
University of Málaga, Málaga, Spain
Gonzalo Joya
Technical University of Catalonia, Barcelona, Spain
Andreu Català

Cite this paper

López-García, G., Jerez, J.M., Ribelles, N., Alba, E., Veredas, F.J. (2021). Detection of Tumor Morphology Mentions in Clinical Reports in Spanish Using Transformers. In: Rojas, I., Joya, G., Català, A. (eds) Advances in Computational Intelligence. IWANN 2021. Lecture Notes in Computer Science, vol 12861. Springer, Cham. https://doi.org/10.1007/978-3-030-85030-2_3

DOI: https://doi.org/10.1007/978-3-030-85030-2_3
Published: 21 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85029-6
Online ISBN: 978-3-030-85030-2
eBook Packages: Computer Science, Computer Science (R0)
