A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese

  • Conference paper
Intelligent Systems (BRACIS 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12319)

Abstract

Deep language models such as ELMo, BERT and GPT have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general-domain text and then trained with supervision on downstream tasks. An optional intermediate step is to finetune the language model on a large intradomain corpus of unlabeled text before training it on the final task. This step is not well explored in the current literature. In this work, we investigate its impact on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language model architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement in performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks to determine whether these improvements are due to domain similarity or simply to more training data. The results indicate that finetuning on a legal-domain corpus hurts performance on the general-domain NER tasks, which points to domain similarity rather than additional data as the source of the gains. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on LeNER-Br, a Portuguese NER corpus for the legal domain.

Notes

  1. https://github.com/netoferraz/acordaos-tcu

  2. The public agency for law enforcement and prosecution of crimes in the Brazilian state of Mato Grosso do Sul.

  3. https://allennlp.org/elmo

  4. https://spacy.io/

  5. https://www.kaggle.com/ferraz/acordaos-tcu

  6. https://allennlp.org/

  7. https://github.com/chakki-works/seqeval

  8. https://www.clips.uantwerpen.be/conll2002/ner/bin/conlleval.txt

  9. The results presented in the LeNER-Br paper are based on token-level evaluation, which is not standard in the literature and yields much higher numbers.

  10. https://github.com/huggingface/transformers
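
The remark in note 9 about token-level versus entity-level evaluation can be illustrated with a small sketch. The code below is not the authors' evaluation code, nor the seqeval or conlleval implementation; it is a minimal pure-Python example that assumes BIO tagging and a single sentence. The helper names (`extract_spans`, `entity_f1`, `token_f1`) are hypothetical, and the token-level score shown is only one plausible definition (micro-averaged over non-`O` tokens), which may differ from the one used in the LeNER-Br paper.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # Close the open span unless this tag continues it.
        if start is not None and (prefix != "I" or label != kind):
            spans.add((kind, start, i))
            start, kind = None, None
        # Open a new span on "B-", or on an "I-" with no span open (lenient).
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, label
    return spans


def f1(tp, n_pred, n_gold):
    """F1 from true positives and the predicted and gold counts."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold, pred):
    """Entity-level F1: a prediction counts only if type and boundaries match."""
    g, p = extract_spans(gold), extract_spans(pred)
    return f1(len(g & p), len(p), len(g))


def token_f1(gold, pred):
    """Token-level F1 (assumed definition): each non-"O" token is scored alone."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return f1(tp, n_pred, n_gold)


# Toy sentence: the predicted ORG entity is one token too short.
gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-PER"]

print(f"entity-level F1: {entity_f1(gold, pred):.3f}")
print(f"token-level F1:  {token_f1(gold, pred):.3f}")
```

In this toy case the truncated ORG entity earns no credit under entity-level scoring but still earns credit for two of its three tokens under token-level scoring, so the token-level score is higher. This is the kind of gap note 9 refers to.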

Author information

Corresponding author

Correspondence to Luiz Henrique Bonifacio.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Bonifacio, L.H., Vilela, P.A., Lobato, G.R., Fernandes, E.R. (2020). A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_46

  • DOI: https://doi.org/10.1007/978-3-030-61377-8_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61376-1

  • Online ISBN: 978-3-030-61377-8

  • eBook Packages: Computer Science (R0)
