Natural language processing (NLP) techniques are essential for unlocking patients’ data from electronic health records. An important NLP task is recognizing morphosyntactic information in texts, a process called part-of-speech (POS) tagging. Neural network architectures are currently the state-of-the-art method, but few studies have applied this approach to Brazilian Portuguese clinical texts. The objective of this study is to define a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts.
We reviewed multiple neural network-based POS-tagging algorithms and selected the Flair tool because of its exceptional performance in the journalistic domain, as there is no algorithm specific to Portuguese clinical texts. We normalized available corpora from multiple domains (two journalistic, one biomedical, and one clinical) and built a new corpus combining all three domains. The Flair algorithm was trained on each corpus, generating five models, which were evaluated on all domains.
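The normalization and merging step described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the tag names, the `TAG_MAP` table, and the `normalize` and `combine` helpers are hypothetical and only show the general idea of mapping corpus-specific tags onto one shared tagset before building a combined corpus.

```python
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # one sentence as (token, tag) pairs

# Hypothetical mappings from corpus-specific tags to a shared tagset.
# These labels are illustrative assumptions, not the tagsets of the real corpora.
TAG_MAP: Dict[str, Dict[str, str]] = {
    "journalistic": {"N": "NOUN", "V": "VERB", "ART": "DET"},
    "clinical": {"SUBST": "NOUN", "VB": "VERB", "ARTIGO": "DET"},
}

UNKNOWN_TAG = "X"  # fallback for tags missing from the mapping


def normalize(sentences: List[Sentence], domain: str) -> List[Sentence]:
    """Rewrite every tag of a corpus into the shared tagset."""
    mapping = TAG_MAP[domain]
    return [
        [(token, mapping.get(tag, UNKNOWN_TAG)) for token, tag in sentence]
        for sentence in sentences
    ]


def combine(corpora: Dict[str, List[Sentence]]) -> List[Sentence]:
    """Normalize each corpus and concatenate them into one combined corpus."""
    merged: List[Sentence] = []
    for domain in sorted(corpora):  # sorted for a deterministic order
        merged.extend(normalize(corpora[domain], domain))
    return merged


if __name__ == "__main__":
    corpora = {
        "journalistic": [[("O", "ART"), ("governo", "N"), ("anunciou", "V")]],
        "clinical": [[("Paciente", "SUBST"), ("nega", "VB"), ("febre", "SUBST")]],
    }
    for sentence in combine(corpora):
        print(sentence)
```

Falling back to a single unknown tag keeps unmapped labels visible for later inspection instead of silently dropping tokens.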
The clinical model achieved 92.39% accuracy (previous clinical POS-tagging work reached 91.5%); the biomedical model achieved 97.9% accuracy. Each model was assessed on its own test set.
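For readers unfamiliar with the metric, POS-tagging accuracy is usually the fraction of tokens whose predicted tag matches the gold tag. The sketch below assumes that definition (the abstract does not spell it out) and shows how several models could be scored against several test sets. The `Tagger` type, `token_accuracy`, and `evaluation_matrix` are hypothetical names, and the toy tagger stands in for a trained model.

```python
from typing import Callable, Dict, List, Tuple

Sentence = List[Tuple[str, str]]           # (token, gold_tag) pairs
Tagger = Callable[[List[str]], List[str]]  # hypothetical: tokens -> predicted tags


def token_accuracy(tagger: Tagger, test_set: List[Sentence]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in test_set:
        tokens = [token for token, _ in sentence]
        gold = [tag for _, tag in sentence]
        predicted = tagger(tokens)
        if len(predicted) != len(gold):
            raise ValueError("tagger must return exactly one tag per token")
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    if total == 0:
        raise ValueError("test set contains no tokens")
    return correct / total


def evaluation_matrix(
    taggers: Dict[str, Tagger], test_sets: Dict[str, List[Sentence]]
) -> Dict[str, Dict[str, float]]:
    """Score every model on every domain's test set."""
    return {
        model: {domain: token_accuracy(tagger, data) for domain, data in test_sets.items()}
        for model, tagger in taggers.items()
    }


if __name__ == "__main__":
    # Toy stand-in for a trained model: tags every token as a noun.
    always_noun: Tagger = lambda tokens: ["NOUN"] * len(tokens)
    clinical_test = [[("Paciente", "NOUN"), ("nega", "VERB"), ("febre", "NOUN")]]
    print(evaluation_matrix({"baseline": always_noun}, {"clinical": clinical_test}))
```

Counting per token rather than per sentence means long sentences weigh more, which is the usual convention for reporting POS-tagging accuracy.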
We developed a new state-of-the-art modeling environment for POS tagging of Brazilian Portuguese clinical texts and achieved results comparable to those of other state-of-the-art studies in journalistic contexts.
Afonso S, Bick E, Haber R, Santos D. Floresta sintá(c)tica: A treebank for Portuguese. Proc. 3rd Int. Conf. Lang. Resour. Eval. (LREC 2002), Paris; 2002, p. 1698–703.
Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. Proc. 27th Int. Conf. Comput. Linguist., Santa Fe, New Mexico, USA: Association for Computational Linguistics; 2018, p. 1638–49.
Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Front Med. 2019;6:1–23. https://doi.org/10.3389/fmed.2019.00066.
Collobert R. Deep learning for efficient discriminative parsing. Proc Fourteenth Int Conf Artif Intell Stat. 2011;15:224–32.
Dalianis H. Clinical text mining. Cham: Springer International Publishing; 2018. https://doi.org/10.1007/978-3-319-78503-5.
Duarte JM, Areco K, Goihman S, Birelo E, De Domenico L. Corpora Analysis: Journalistic and Scientific. 2018;10:71–8.
Fernandes ER, Rodrigues IM, Milidiu RL. Portuguese Part-of-Speech Tagging with Large Margin Structure Learning. 2014 Brazilian Conf. Intell. Syst., IEEE; 2014, p. 25–30. https://doi.org/10.1109/BRACIS.2014.16.
Fonseca ERG, Rosa JL, Aluísio SM. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. J Brazilian Comput Soc. 2015;21. https://doi.org/10.1186/s13173-014-0020-x.
Freitas C, Trugo LF, Chalub F, Paulino-Passos G, Rademaker A. Tagsets and Datasets: Some Experiments Based on Portuguese Language. Comput. Process. Port. Lang. - 12th Int. Conf. PROPOR 2018, 2018, p. 459–69. https://doi.org/10.1007/978-3-319-99722-3_46.
Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. 2013 IEEE Int. Conf. Acoust. Speech Signal Process., IEEE; 2013, p. 6645–9. https://doi.org/10.1109/ICASSP.2013.6638947.
Hirsch JS, Tanenbaum JS, Gorman SL, Liu C, Schmitz E, Hashorva D, et al. HARVEST, a longitudinal patient record summarizer. J Am Med Informatics Assoc. 2015;22:263–74. https://doi.org/10.1136/amiajnl-2014-002945.
Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR. 2015.
Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13:395–405. https://doi.org/10.1038/nrg3208.
Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. Prentice Hall; 2008.
Kenei J, Opiyo TOE, Oboko R, Moso J. Clinical documents summarization using text visualization technique. Int J Comput Inf Technol. 2018;7:139–56.
Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. 2016 Conf North Am Chapter Assoc Comput Linguist Hum Lang Technol NAACL HLT 2016 - Proc Conf. 2016:260–70.
Oleynik M, Nohama P, Cancian PS, Schulz S. Performance analysis of a POS tagger applied to discharge summaries in Portuguese. Stud Health Technol Inform. 2010;160:959–63. https://doi.org/10.3233/978-1-60750-588-4-959.
Oliveira LESE, Gumiel YB, ABV DS, LMM C, Carvalho DR, Hasan SA, et al. Learning Portuguese Clinical Word Embeddings: A Multi-Specialty and Multi-Institutional Corpus of Clinical Narratives Supporting a Downstream Biomedical Task. Stud Health Technol Inform. 2019;264:123–7. https://doi.org/10.3233/SHTI190196.
Peters AC, Oleynik M, Pacheco EJ, Moro CMC, Schulz S, Nohama P. Elaboração de um Corpus Médico baseado em Narrativas Clínicas contidas em Sumários de Alta Hospitalar. An Do XII Congr Bras Informática Em Saúde 2010. https://doi.org/10.13140/RG.2.1.4412.7441.
dos Santos CN, Zadrozny B. Learning character-level representations for part-of-speech tagging. ICML’14 Proc. 31st Int. Conf. Mach. Learn. 2014;32:1818–26.
Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE J Biomed Heal Informatics. 2018;22:1589–604. https://doi.org/10.1109/JBHI.2017.2767063.
de Sousa RCC, Lopes H. Portuguese POS Tagging Using BLSTM Without Handcrafted Features. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11896 LNCS, 2019, p. 120–30. https://doi.org/10.1007/978-3-030-33904-3_11.
de Souza JVA, Gumiel YB, Oliveira LES de, Moro CMC. Named Entity Recognition for Clinical Portuguese Corpus with Conditional Random Fields and Semantic Groups. An. do XIX Simpósio Bras. Comput. Apl. à Saúde, Niterói: Sociedade Brasileira de Computação; 2019, p. 318–23.
Taylor A, Marcus M, Santorini B. The Penn Treebank: An Overview. 2003:5–22. https://doi.org/10.1007/978-94-010-0201-1_1.
Taylor C. Structured vs. Unstructured Data. 2018. [cited 2019 August 14]. Available from: https://www.datamation.com/big-data/structured-vs-unstructured-data.html.
We would like to thank Philips Research North America and Fundação Araucária for financing this research.
The UNI, MAC and BOS corpora are freely available for download because they do not contain sensitive data. The PUC clinical corpus had already been anonymized and approved for research use in its original project (Peters et al., 2010), through protocol number 0375.084.000-10 of the National Commission for Research Ethics, and by the Research Ethics Committee of the Pontifical Catholic University of Paraná (0004422/10).
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
de Oliveira, L.F.A., e Oliveira, L.E.S., Gumiel, Y.B. et al. Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts. Res. Biomed. Eng. 36, 267–276 (2020). https://doi.org/10.1007/s42600-020-00067-7
- Natural language processing
- POS tagging
- Clinical narratives
- Neural networks