Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts

de Oliveira, Lucas Ferro Antunes; e Oliveira, Lucas Emanuel Silva; Gumiel, Yohan Bonescki; Carvalho, Deborah Ribeiro; Moro, Claudia Maria Cabral

doi:10.1007/s42600-020-00067-7

Original Article
Published: 19 June 2020

Research on Biomedical Engineering, volume 36, pages 267–276 (2020)

Lucas Ferro Antunes de Oliveira¹,
Lucas Emanuel Silva e Oliveira ORCID: orcid.org/0000-0003-1811-5087¹,
Yohan Bonescki Gumiel¹,
Deborah Ribeiro Carvalho¹ &
Claudia Maria Cabral Moro¹

Abstract

Purpose

Natural language processing (NLP) techniques are essential for unlocking patients’ data from electronic health records. An important NLP task is recognizing morphosyntactic information in texts, a process called part-of-speech (POS) tagging. Neural network architectures are currently the state-of-the-art method, but there is a lack of studies applying this approach to Brazilian Portuguese clinical texts. The objective of this study is to define a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts.

Methods

We reviewed multiple neural network-based POS-tagging algorithms and selected the Flair tool for its exceptional performance in the journalistic domain, as there is no algorithm specific to Portuguese clinical texts. We executed a normalization process on available corpora from multiple domains (two journalistic, one biomedical, one clinical, and a new corpus composed of all three of these). The Flair algorithm was trained on each corpus, generating five models, and every model was evaluated on all domains.
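The normalization step can be pictured as rewriting each corpus's native tags into one shared tagset before the corpora are combined. The following is a minimal sketch under stated assumptions: the corpus names, the tag labels, and the helpers `normalize_sentence` and `build_merged_corpus` are hypothetical and chosen only for illustration. They are not the mapping or the code used by the authors, whose normalization repository is listed under Notes.

```python
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # a sentence as (token, tag) pairs

# Hypothetical mappings from each corpus's native tags to one shared tagset.
# Corpus names and tag labels are invented for illustration only.
TAG_MAPS: Dict[str, Dict[str, str]] = {
    "journalistic": {"N": "NOUN", "V": "VERB", "ART": "DET", "PREP": "ADP"},
    "clinical": {"subst": "NOUN", "verbo": "VERB", "art": "DET", "prep": "ADP"},
}


def normalize_sentence(
    sentence: Sentence, tag_map: Dict[str, str], unknown: str = "X"
) -> Sentence:
    """Rewrite every tag through tag_map; unmapped tags fall back to `unknown`."""
    return [(token, tag_map.get(tag, unknown)) for token, tag in sentence]


def build_merged_corpus(corpora: Dict[str, List[Sentence]]) -> List[Sentence]:
    """Normalize each corpus with its own map and concatenate the results."""
    merged: List[Sentence] = []
    for name, sentences in corpora.items():
        merged.extend(normalize_sentence(s, TAG_MAPS[name]) for s in sentences)
    return merged


corpora = {
    "journalistic": [[("O", "ART"), ("governo", "N"), ("anunciou", "V")]],
    "clinical": [[("paciente", "subst"), ("refere", "verbo"), ("dor", "subst")]],
}
merged = build_merged_corpus(corpora)
for sentence in merged:
    print(sentence)
```

Falling back to a catch-all tag for unmapped labels is one possible design choice; raising an error instead would surface gaps in the mapping earlier.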

Results

The clinical model achieved 92.39% accuracy (the previous clinical POS-tagging work reached 91.5%), and the biomedical model achieved 97.9% accuracy. Each model was assessed on its own test set.
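Accuracy for POS tagging is commonly computed per token: the share of tokens whose predicted tag equals the gold tag. The sketch below shows that calculation and a model-by-domain evaluation grid. It is an illustration only: the two toy taggers, the tiny test sets, and the function names `token_accuracy` and `evaluation_grid` are hypothetical stand-ins, not the trained Flair models or the corpora from the study.

```python
from typing import Callable, Dict, List, Tuple

Sentence = List[Tuple[str, str]]            # (token, gold tag) pairs
Tagger = Callable[[List[str]], List[str]]   # tokens in, predicted tags out


def token_accuracy(tagger: Tagger, test_set: List[Sentence]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for sentence in test_set:
        tokens = [token for token, _ in sentence]
        gold = [tag for _, tag in sentence]
        predicted = tagger(tokens)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


def evaluation_grid(
    models: Dict[str, Tagger], test_sets: Dict[str, List[Sentence]]
) -> Dict[str, Dict[str, float]]:
    """Score every model on every domain's test set."""
    return {
        model_name: {
            domain: token_accuracy(tagger, sentences)
            for domain, sentences in test_sets.items()
        }
        for model_name, tagger in models.items()
    }


# Toy stand-ins for trained models (hypothetical).
def always_noun(tokens: List[str]) -> List[str]:
    return ["NOUN"] * len(tokens)


def lexicon_tagger(tokens: List[str]) -> List[str]:
    lexicon = {"o": "DET", "paciente": "NOUN", "refere": "VERB", "dor": "NOUN"}
    return [lexicon.get(token.lower(), "NOUN") for token in tokens]


test_sets = {
    "clinical": [[("Paciente", "NOUN"), ("refere", "VERB"), ("dor", "NOUN")]],
    "journalistic": [
        [("O", "DET"), ("governo", "NOUN"), ("anunciou", "VERB"), ("medidas", "NOUN")]
    ],
}
grid = evaluation_grid({"baseline": always_noun, "lexicon": lexicon_tagger}, test_sets)
for model_name, scores in grid.items():
    print(model_name, scores)
```

Reading the grid along its diagonal gives each model's score on its own domain, while the off-diagonal cells show how well a model transfers to other domains.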

Conclusion

We developed a new state-of-the-art modeling environment for POS tagging of Brazilian Portuguese clinical texts and achieved results comparable to those of other state-of-the-art studies in journalistic contexts.

Notes

https://catalog.ldc.upenn.edu/LDC99T42
http://opennlp.sourceforge.net
https://github.com/flairNLP/flair
https://github.com/HAILab-PUCPR/pos-tagging-tagset-normalization
https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
https://colab.research.google.com/
https://github.com/HAILab-PUCPR/portuguese-clinical-pos-tagger

References

Afonso S, Bick E, Haber R, Santos D. Floresta sintá(c)tica: A treebank for Portuguese. Proc. 3rd Int. Conf. Lang. Resour. Eval. Lr. 2002, Paris; 2002, p. 1698–703.
Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. Proc. 27th Int. Conf. Comput. Linguist., Santa Fe, New Mexico, USA: Association for Computational Linguistics; 2018, p. 1638–49.
Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Front Med. 2019;6:1–23. https://doi.org/10.3389/fmed.2019.00066.
Collobert R. Deep learning for efficient discriminative parsing. Proc Fourteenth Int Conf Artif Intell Stat. 2011;15:224–32.
Dalianis H. Clinical text mining. Cham: Springer International Publishing; 2018. https://doi.org/10.1007/978-3-319-78503-5.
Duarte JM, Areco K, Goihman S, Birelo E, De Domenico L. Corpora Analysis: Journalistic and Scientific. 2018;10:71–8.
Fernandes ER, Rodrigues IM, Milidiu RL. Portuguese Part-of-Speech Tagging with Large Margin Structure Learning. 2014 Brazilian Conf. Intell. Syst., IEEE; 2014, p. 25–30. https://doi.org/10.1109/BRACIS.2014.16.
Fonseca ERG, Rosa JL, Aluísio SM. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. J Brazilian Comput Soc. 2015;21. https://doi.org/10.1186/s13173-014-0020-x.
Freitas C, Trugo LF, Chalub F, Paulino-Passos G, Rademaker A. Tagsets and Datasets: Some Experiments Based on Portuguese Language. Comput. Process. Port. Lang. - 12th Int. Conf. PROPOR 2018, 2018, p. 459–69. https://doi.org/10.1007/978-3-319-99722-3_46.
Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. 2013 IEEE Int. Conf. Acoust. Speech Signal Process., IEEE; 2013, p. 6645–9. https://doi.org/10.1109/ICASSP.2013.6638947.
Hirsch JS, Tanenbaum JS, Gorman SL, Liu C, Schmitz E, Hashorva D, et al. HARVEST, a longitudinal patient record summarizer. J Am Med Informatics Assoc. 2015;22:263–74. https://doi.org/10.1136/amiajnl-2014-002945.
Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR. 2015.
Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13:395–405. https://doi.org/10.1038/nrg3208.
Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second. Prentice Hall; 2008.
Kenei J, Opiyo TOE, Oboko R, Moso J. Clinical documents summarization using text visualization technique. Int J Comput Inf Technol. 2018;7:139–56.
Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. 2016 Conf North Am Chapter Assoc Comput Linguist Hum Lang Technol NAACL HLT 2016 - Proc Conf. 2016:260–70.
Oleynik M, Nohama P, Cancian PS, Schulz S. Performance analysis of a POS tagger applied to discharge summaries in portuguese. Stud Health Technol Inform. 2010;160:959–63. https://doi.org/10.3233/978-1-60750-588-4-959.
Oliveira LESE, Gumiel YB, ABV DS, LMM C, Carvalho DR, Hasan SA, et al. Learning Portuguese Clinical Word Embeddings: A Multi-Specialty and Multi-Institutional Corpus of Clinical Narratives Supporting a Downstream Biomedical Task. Stud Health Technol Inform. 2019;264:123–7. https://doi.org/10.3233/SHTI190196.
Peters AC, Oleynik M, Pacheco EJ, Moro CMC, Schulz S, Nohama P. Elaboração de um Corpus Médico baseado em Narrativas Clínicas contidas em Sumários de Alta Hospitalar. An Do XII Congr Bras Informática Em Saúde 2010. https://doi.org/10.13140/RG.2.1.4412.7441.
dos Santos CN, Zadrozny B. Learning character-level representations for part-of-speech tagging. ICML’14 Proc. 31st Int. Conf. Mach. Learn. 2014;32:1818–26.
Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE J Biomed Heal Informatics. 2018;22:1589–604. https://doi.org/10.1109/JBHI.2017.2767063.
de Sousa RCC, Lopes H. Portuguese POS Tagging Using BLSTM Without Handcrafted Features. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11896 LNCS, 2019, p. 120–30. https://doi.org/10.1007/978-3-030-33904-3_11.
de Souza JVA, Gumiel YB, Oliveira LES de, Moro CMC. Named Entity Recognition for Clinical Portuguese Corpus with Conditional Random Fields and Semantic Groups. An. do XIX Simpósio Bras. Comput. Apl. à Saúde, Niterói: Sociedade Brasileira de Computação; 2019, p. 318–23.
Taylor A, Marcus M, Santorini B. The Penn Treebank: An Overview. 2003:5–22. https://doi.org/10.1007/978-94-010-0201-1_1.
Taylor C. Structured vs. Unstructured Data. 2018. [cited 2019 August 14]. Available from: https://www.datamation.com/big-data/structured-vs-unstructured-data.html.

Funding

We would like to thank Philips Research North America and Fundação Araucária for financing this research.

Author information

Authors and Affiliations

Graduate Program in Health Technology, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155, Curitiba, Paraná, Brazil
Lucas Ferro Antunes de Oliveira, Lucas Emanuel Silva e Oliveira, Yohan Bonescki Gumiel, Deborah Ribeiro Carvalho & Claudia Maria Cabral Moro

Corresponding author

Correspondence to Lucas Emanuel Silva e Oliveira.

Ethics declarations

The UNI, MAC and BOS corpora are freely available for download, since they do not make use of sensitive data. The PUC clinical corpus had already been anonymized and approved for research use in its original project (Peters et al., 2010), under protocol number 0375.084.000-10 of the National Commission for Research Ethics and by the Research Ethics Committee of the Pontifical Catholic University of Paraná (0004422/10).

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

de Oliveira, L.F.A., e Oliveira, L.E.S., Gumiel, Y.B. et al. Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts. Res. Biomed. Eng. 36, 267–276 (2020). https://doi.org/10.1007/s42600-020-00067-7

Received: 07 October 2019
Accepted: 16 May 2020
Published: 19 June 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s42600-020-00067-7
