Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts



Natural language processing (NLP) techniques are essential for unlocking patients’ data from electronic health records. An important NLP task is recognizing morphosyntactic information in texts, a process called part-of-speech (POS) tagging. Neural network architectures are currently the state-of-the-art method, but few studies have applied this approach to Brazilian Portuguese clinical texts. The objective of this study is to define a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts.


We reviewed multiple neural network-based POS-tagging algorithms and selected the Flair tool for its exceptional performance in the journalistic domain, since no algorithm is specific to Portuguese clinical texts. We normalized the available corpora from multiple domains (two journalistic, one biomedical, one clinical, and a new corpus composed of all three of these). The Flair algorithm was trained on each corpus, generating five models, which were evaluated on all domains.
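The normalization step can be pictured as mapping each corpus's own tags onto one shared tagset before the corpora are merged. The sketch below is only an illustration under stated assumptions: the tag names, the `TAG_MAP` table, and the `normalize` and `merge` helpers are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]

# Hypothetical per-corpus mappings onto a shared tagset.
# These tag names are illustrative, not the study's actual tagsets.
TAG_MAP: Dict[str, Dict[str, str]] = {
    "journalistic": {"ART": "DET", "N": "NOUN", "V": "VERB"},
    "clinical": {"ARTIGO": "DET", "SUBST": "NOUN", "VB": "VERB"},
}


def normalize(corpus: List[Sentence], mapping: Dict[str, str]) -> List[Sentence]:
    """Rewrite every tag through the mapping; raise on an unmapped tag."""
    normalized: List[Sentence] = []
    for sentence in corpus:
        new_sentence: Sentence = []
        for token, tag in sentence:
            if tag not in mapping:
                raise KeyError(f"unmapped tag {tag!r} for token {token!r}")
            new_sentence.append((token, mapping[tag]))
        normalized.append(new_sentence)
    return normalized


def merge(*corpora: List[Sentence]) -> List[Sentence]:
    """Concatenate already-normalized corpora into one combined corpus."""
    merged: List[Sentence] = []
    for corpus in corpora:
        merged.extend(corpus)
    return merged


if __name__ == "__main__":
    news = normalize([[("o", "ART"), ("paciente", "N"), ("nega", "V")]],
                     TAG_MAP["journalistic"])
    notes = normalize([[("dor", "SUBST")]], TAG_MAP["clinical"])
    print(merge(news, notes))
```

Raising on an unmapped tag, rather than passing it through silently, keeps inconsistent labels out of the combined corpus.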


The clinical model achieved 92.39% accuracy (previous POS-tagging clinical work reached 91.5%); the biomedical model achieved 97.9% accuracy. All the models were assessed on their own test set.
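Accuracy figures like these are usually computed per token: the share of tokens whose predicted tag matches the gold tag. The following is a minimal sketch of that metric, assuming token-level scoring; the `token_accuracy` function and the sample tags are hypothetical and do not reproduce the study's evaluation code.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[Sequence[str]],
                   pred: Sequence[Sequence[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("sentence length mismatch between gold and pred")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    if total == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    return correct / total


if __name__ == "__main__":
    gold = [["DET", "NOUN", "VERB"], ["NOUN", "ADJ"]]
    pred = [["DET", "NOUN", "NOUN"], ["NOUN", "ADJ"]]
    # 4 of the 5 tokens match.
    print(f"{token_accuracy(gold, pred):.2%}")
```

Calling the same function with each model's predictions on each domain's test set would produce the cross-domain comparison described in the methods.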


We developed a new state-of-the-art modeling environment for POS tagging of Brazilian Portuguese clinical texts and achieved results comparable to those of other state-of-the-art studies in journalistic contexts.


  1. Afonso S, Bick E, Haber R, Santos D. Floresta sintá(c)tica: A treebank for Portuguese. Proc. 3rd Int. Conf. Lang. Resour. Eval. Lr. 2002, Paris; 2002, p. 1698–703.

  2. Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. Proc. 27th Int. Conf. Comput. Linguist., Santa Fe, New Mexico, USA: Association for Computational Linguistics; 2018, p. 1638–49.

  3. Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Front Med. 2019;6:1–23.

  4. Collobert R. Deep learning for efficient discriminative parsing. Proc Fourteenth Int Conf Artif Intell Stat. 2011;15:224–32.

  5. Dalianis H. Clinical text mining. Cham: Springer International Publishing; 2018.

  6. Duarte JM, Areco K, Goihman S, Birelo E, De Domenico L. Corpora Analysis: Journalistic and Scientific. 2018;10:71–8.

  7. Fernandes ER, Rodrigues IM, Milidiu RL. Portuguese Part-of-Speech Tagging with Large Margin Structure Learning. 2014 Brazilian Conf. Intell. Syst., IEEE; 2014, p. 25–30.

  8. Fonseca ERG, Rosa JL, Aluísio SM. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. J Brazilian Comput Soc. 2015;21.

  9. Freitas C, Trugo LF, Chalub F, Paulino-Passos G, Rademaker A. Tagsets and Datasets: Some Experiments Based on Portuguese Language. Comput. Process. Port. Lang. - 12th Int. Conf. PROPOR 2018, 2018, p. 459–69.

  10. Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. 2013 IEEE Int. Conf. Acoust. Speech Signal Process., IEEE; 2013, p. 6645–9.

  11. Hirsch JS, Tanenbaum JS, Gorman SL, Liu C, Schmitz E, Hashorva D, et al. HARVEST, a longitudinal patient record summarizer. J Am Med Informatics Assoc. 2015;22:263–74.

  12. Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR. 2015.

  13. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13:395–405.

  14. Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second. Prentice Hall; 2008.

  15. Kenei J, Opiyo TOE, Oboko R, Moso J. Clinical documents summarization using text visualization technique. Int J Comput Inf Technol. 2018;7:139–56.

  16. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. 2016 Conf North Am Chapter Assoc Comput Linguist Hum Lang Technol NAACL HLT 2016 - Proc Conf. 2016:260–70.

  17. Oleynik M, Nohama P, Cancian PS, Schulz S. Performance analysis of a POS tagger applied to discharge summaries in Portuguese. Stud Health Technol Inform. 2010;160:959–63.

  18. Oliveira LESE, Gumiel YB, ABV DS, LMM C, Carvalho DR, Hasan SA, et al. Learning Portuguese Clinical Word Embeddings: A Multi-Specialty and Multi-Institutional Corpus of Clinical Narratives Supporting a Downstream Biomedical Task. Stud Health Technol Inform. 2019;264:123–7.

  19. Peters AC, Oleynik M, Pacheco EJ, Moro CMC, Schulz S, Nohama P. Elaboração de um Corpus Médico baseado em Narrativas Clínicas contidas em Sumários de Alta Hospitalar. An Do XII Congr Bras Informática Em Saúde 2010.

  20. dos Santos CN, Zadrozny B. Learning character-level representations for part-of-speech tagging. ICML’14 Proc. 31st Int. Conf. Int. Conf. Mach Learn. 2014;32:1818–26.

  21. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE J Biomed Heal Informatics. 2018;22:1589–604.

  22. de Sousa RCC, Lopes H. Portuguese POS Tagging Using BLSTM Without Handcrafted Features. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11896 LNCS, 2019, p. 120–30.

  23. de Souza JVA, Gumiel YB, Oliveira LES de, Moro CMC. Named Entity Recognition for Clinical Portuguese Corpus with Conditional Random Fields and Semantic Groups. An. do XIX Simpósio Bras. Comput. Apl. à Saúde, Niterói: Sociedade Brasileira de Computação; 2019, p. 318–23.

  24. Taylor A, Marcus M, Santorini B. The Penn Treebank: An Overview. 2003:5–22.

  25. Taylor C. Structured vs. Unstructured Data. 2018. [cited 2019 August 14]. Available from:


We would like to thank Philips Research North America and Fundação Araucária for financing this research.

Author information



Corresponding author

Correspondence to Lucas Emanuel Silva e Oliveira.

Ethics declarations

The UNI, MAC, and BOS corpora are freely available for download because they contain no sensitive data. The PUC clinical corpus had already been anonymized and approved for research use in its original project (Peters et al., 2010), under protocol number 0375.084.000-10 of the National Commission for Research Ethics and by the Research Ethics Committee of the Pontifical Catholic University of Paraná (0004422/10).

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

de Oliveira, L.F.A., e Oliveira, L.E.S., Gumiel, Y.B. et al. Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts. Res. Biomed. Eng. 36, 267–276 (2020).


  • Natural language processing
  • POS tagging
  • Clinical narratives
  • Neural networks