A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese

Bonifacio, Luiz Henrique; Vilela, Paulo Arantes; Lobato, Gustavo Rocha; Fernandes, Eraldo Rezende

doi:10.1007/978-3-030-61377-8_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12319)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

Deep language models, such as ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general-domain text and then trained with supervision on downstream tasks. An optional intermediate step is to finetune the language model on a large intradomain corpus of unlabeled text before training it on the final task. This step is not well explored in the current literature. In this work, we investigate its impact on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios involving two deep language model architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement in performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether these improvements are really due to domain similarity or simply due to more training data. The results indicate that finetuning on a legal-domain corpus hurts performance on the general-domain NER tasks, suggesting that the gains on the legal tasks stem from domain similarity rather than from the additional data. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese-language NER corpus for the legal domain.
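
The intermediate finetuning step can be made concrete with a small sketch. The abstract does not state which training objective is used during intradomain finetuning; the code below assumes the standard masked language modeling objective from the original BERT paper (select roughly 15% of tokens; replace 80% of those with a mask symbol, 10% with a random token, and leave 10% unchanged). ELMo uses a different, bidirectional language modeling objective that is not shown here. The function name `mask_tokens`, the toy vocabulary and the example sentence are hypothetical and serve only as illustration.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # BERT's mask symbol


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Build one masked-LM training example from an unlabeled sentence.

    Returns (inputs, labels). labels[i] holds the original token at the
    positions the model must predict and None everywhere else.
    """
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Hypothetical unlabeled sentence from a legal corpus, already tokenized.
sentence = ["o", "tribunal", "negou", "o", "recurso", "do", "réu"]
toy_vocab = sorted(set(sentence))
inputs, labels = mask_tokens(sentence, toy_vocab, random.Random(0))
print(inputs)
print(labels)
```

Because the labels come from the text itself, this step needs no annotation, which is why it can be run on a large unlabeled legal corpus before the supervised NER training.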

Notes

1. https://github.com/netoferraz/acordaos-tcu
2. The public agency for law enforcement and prosecution of crimes in the Brazilian state of Mato Grosso do Sul.
3. https://allennlp.org/elmo
4. https://spacy.io/
5. https://www.kaggle.com/ferraz/acordaos-tcu
6. https://allennlp.org/
7. https://github.com/chakki-works/seqeval
8. https://www.clips.uantwerpen.be/conll2002/ner/bin/conlleval.txt
9. The results presented in the LeNER-Br paper are based on token-level evaluation, which is not standard in the literature and yields much higher numbers.
10. https://github.com/huggingface/transformers
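
The difference between the two evaluation protocols mentioned in note 9 can be illustrated with a minimal sketch. It assumes IOB2 tags and exact-match span scoring in the spirit of the tools in notes 7 and 8; it does not reproduce their code, and the token-level score shown (F1 over non-O tokens) is only one plausible reading of "token-level evaluation". The function names and the example tags are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, exclusive end index)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a sequence of IOB2 tags."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        # A stray I- tag is treated as the start of a new span.
        start, label = (None, None) if tag == "O" else (i, tag[2:])
    return spans


def f1(correct: int, predicted: int, gold: int) -> float:
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    if correct == 0 or predicted == 0 or gold == 0:
        return 0.0
    precision, recall = correct / predicted, correct / gold
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a span counts only if type and boundaries match."""
    g, p = extract_spans(gold), extract_spans(pred)
    return f1(len(g & p), len(p), len(g))


def token_f1(gold: List[str], pred: List[str]) -> float:
    """Token-level F1 over non-O tokens: each tag is scored on its own."""
    correct = sum(1 for a, b in zip(gold, pred) if a == b and a != "O")
    return f1(
        correct,
        sum(1 for t in pred if t != "O"),
        sum(1 for t in gold if t != "O"),
    )


gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER", "I-PER"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-PER", "I-PER"]
print(f"entity-level F1: {entity_f1(gold, pred):.3f}")  # 0.500
print(f"token-level F1:  {token_f1(gold, pred):.3f}")   # 0.889
```

In this example the prediction truncates one of two entities by a single token. Entity-level scoring counts that entity as wrong, while token-level scoring still credits the two tokens that were tagged correctly, which is why token-level numbers tend to be higher.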

Author information

Authors and Affiliations

Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
Luiz Henrique Bonifacio, Paulo Arantes Vilela & Eraldo Rezende Fernandes
Ministério Público do Estado de Mato Grosso do Sul, Campo Grande, Brazil
Paulo Arantes Vilela & Gustavo Rocha Lobato

Corresponding author

Correspondence to Luiz Henrique Bonifacio.

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Ricardo Cerri
Federal University of ABC, Santo Andre, Brazil
Ronaldo C. Prati

About this paper

Cite this paper

Bonifacio, L.H., Vilela, P.A., Lobato, G.R., Fernandes, E.R. (2020). A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_46

DOI: https://doi.org/10.1007/978-3-030-61377-8_46
Published: 13 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61376-1
Online ISBN: 978-3-030-61377-8
eBook Packages: Computer Science, Computer Science (R0)
