
Combining prompt-based language models and weak supervision for labeling named entity recognition on legal documents

  • Original Research
  • Published in: Artificial Intelligence and Law

Abstract

Named entity recognition (NER) is a highly relevant task for text information retrieval in natural language processing (NLP). Most recent state-of-the-art NER methods require humans to annotate and provide useful data for model training. However, identifying, circumscribing, and labeling entities manually can be very expensive in terms of time, money, and effort. This paper investigates the use of prompt-based language models (OpenAI’s GPT-3) and weak supervision in the legal domain. We apply both strategies as alternatives to the traditional human-based annotation method, relying on computer power instead of human effort for labeling, and then compare the performance of models trained on computer-generated versus human-generated data. We also introduce combinations of the three methods (prompt-based labeling, weak supervision, and human annotation), aiming to maintain high model performance at low annotation cost. We show that, although human labeling still yields the best overall performance, the alternative strategies and their combinations are valid options, achieving similar model scores at lower cost. Final results show that models trained on the alternative labels preserve, on average, 74.0% of the human-trained models’ scores for GPT-3, 95.6% for weak supervision, 90.7% for the GPT + weak supervision combination, and 83.9% for the GPT + 30% human-labeling combination.
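The weak-supervision strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation (which relies on the skweak library over legal-domain documents); here, three hypothetical labeling functions vote on each token's entity tag, and a simple majority vote aggregates the votes:

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical labeling functions: each inspects a token and either
# votes for an entity tag or abstains by returning None.
def lf_gazetteer(token: str) -> Optional[str]:
    # Tiny illustrative gazetteer of organization names.
    return "ORG" if token in {"UnB", "FAPDF", "CAPES"} else None

def lf_uppercase_acronym(token: str) -> Optional[str]:
    # Heuristic: all-caps tokens of two or more characters are often organizations.
    return "ORG" if token.isupper() and len(token) >= 2 else None

def lf_law_number(token: str) -> Optional[str]:
    # Heuristic: tokens like "8.112/1990" look like statute numbers.
    return "LAW" if any(c.isdigit() for c in token) and "/" in token else None

LFS: list[Callable[[str], Optional[str]]] = [
    lf_gazetteer, lf_uppercase_acronym, lf_law_number
]

def weak_label(token: str) -> str:
    """Aggregate labeling-function votes by majority; 'O' means no entity."""
    votes = [tag for lf in LFS if (tag := lf(token)) is not None]
    return Counter(votes).most_common(1)[0][0] if votes else "O"

labels = [weak_label(t) for t in ["CAPES", "funds", "8.112/1990"]]
# → ["ORG", "O", "LAW"]
```

Note that skweak itself aggregates labeling-function outputs with a hidden Markov model rather than a majority vote; the majority vote above is a simplification for illustration.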



Notes

  1. https://ai.stanford.edu/blog/weak-supervision/.

  2. https://nido.unb.br/.

  3. https://github.com/UnB-KnEDLe.

  4. https://huggingface.co/pierreguillou/bert-base-cased-pt-lenerbr.

  5. https://huggingface.co/adalbertojunior/distilbert-portuguese-cased.

  6. The Critical Difference (CD) is a metric established by Demšar (2006) that determines whether two or more learning algorithms, in a specific domain, are statistically different. The CD value is computed from the algorithms' results and represents a threshold: if the difference between two algorithms' average ranks exceeds the CD, they can be declared statistically different.
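As a sketch, the Nemenyi critical difference commonly used with the Friedman test (the setting of Demšar 2006) is CD = q_α · sqrt(k(k+1)/(6N)), where k is the number of algorithms, N the number of datasets, and q_α a studentized-range critical value; the numbers below (k = 4, N = 10) are illustrative, not taken from the paper:

```python
import math

def critical_difference(q_alpha: float, k: int, n: int) -> float:
    """Nemenyi critical difference: two algorithms differ significantly
    if their average ranks differ by more than this threshold."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Example: k = 4 algorithms compared over N = 10 datasets,
# with q_0.05 ≈ 2.569 from Demšar's (2006) table.
cd = critical_difference(2.569, 4, 10)  # ≈ 1.483
```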

References

  • Bach SH, Rodriguez D, Liu Y et al (2019) Snorkel DryBell: a case study in deploying weak supervision at industrial scale. In: Proceedings of the 2019 international conference on management of data, SIGMOD ’19. Association for Computing Machinery, New York, NY, USA, pp 362–375. https://doi.org/10.1145/3299869.3314036

  • Brown TB, Mann B, Ryder N et al (2020) Language models are few-shot learners. arXiv:2005.14165

  • Chowdhary K (2020) Natural language processing. In: Fundamentals of artificial intelligence. Springer, New Delhi, pp 603–649

  • Dai H, Song Y, Wang H (2021) Ultra-fine entity typing with weak supervision from a masked language model. arXiv:2106.04098

  • Dale R (2021) GPT-3: what’s it good for? Nat Lang Eng 27(1):113–118. https://doi.org/10.1017/S1351324920000601


  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30


  • Dozier C, Kondadadi R, Light M et al (2010) Named entity recognition and resolution in legal text. In: Semantic processing of legal texts. Springer, pp 27–43

  • Eddy SR (2004) What is a hidden Markov model? Nat Biotechnol 22(10):1315–1316


  • Floridi L, Chiriatti M (2020) GPT-3: its nature, scope, limits, and consequences. Minds Mach 30(4):681–694


  • Fredriksson T, Mattos DI, Bosch J et al (2020) Data labeling: an empirical investigation into industrial challenges and mitigation strategies. In: Product-focused software process improvement: 21st international conference, PROFES 2020, Proceedings 21, Turin, Italy, November 25–27, 2020. Springer, pp 202–216

  • Giri R, Porwal Y, Shukla V et al (2017) Approaches for information retrieval in legal documents. In: 2017 tenth international conference on contemporary computing (IC3). IEEE, pp 1–6

  • Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5):602–610. https://doi.org/10.1016/j.neunet.2005.06.042


  • Karamanolakis G, Mukherjee S, Zheng G et al (2021) Self-training with weak supervision. arXiv:2104.05514

  • Lison P, Hubin A, Barnes J et al (2020) Named entity recognition without labelled data: a weak supervision approach. arXiv:2004.14723

  • Lison P, Barnes J, Hubin A (2021) skweak: weak supervision made easy for NLP. arXiv:2104.09683

  • Liu Y, Ott M, Goyal N et al (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692

  • Liu P, Yuan W, Fu J et al (2023) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 55(9):1–35. https://doi.org/10.1145/3560815


  • Luz de Araujo PH, de Campos TE, de Oliveira RR et al (2018) LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In: International conference on computational processing of the Portuguese language. Springer, pp 313–323

  • Maiya AS (2020) ktrain: a low-code library for augmented machine learning. arXiv:2004.10703

  • Marrero M, Urbano J, Sánchez-Cuadrado S et al (2013) Named entity recognition: fallacies, challenges and opportunities. Comput Stand Interfaces 35(5):482–489


  • Meyer S, Elsweiler D, Ludwig B et al (2022) Do we still need human assessors? Prompt-based GPT-3 user simulation in conversational AI. In: Proceedings of the 4th conference on conversational user interfaces, CUI ’22. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3543829.3544529

  • Nasar Z, Jaffry SW, Malik MK (2021) Named entity recognition and relation extraction: state-of-the-art. ACM Comput Surv 54(1):1–39


  • Ratner A, Bach SH, Ehrenberg H et al (2020) Snorkel: rapid training data creation with weak supervision. VLDB J 29(2):709–730


  • Ratner AJ, De Sa CM, Wu S et al (2016) Data programming: creating large training sets, quickly. In: Advances in neural information processing systems 29

  • Sakhaee N, Wilson MC (2021) Information extraction framework to build legislation network. Artif Intell Law 29(1):35–58


  • Smith LN (2015) Cyclical learning rates for training neural networks. arXiv:1506.01186

  • Souza F, Nogueira R, Lotufo R (2020) BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Brazilian conference on intelligent systems. Springer, pp 403–417

  • Sun C, Qiu X, Xu Y et al (2019) How to fine-tune BERT for text classification? In: China national conference on Chinese computational linguistics. Springer, Cham, pp 194–206

  • Torfi A, Shirvani RA, Keneshloo Y et al (2020) Natural language processing advancements by deep learning: a survey. arXiv:2003.01200

  • Vardhan H, Surana N, Tripathy B (2021) Named-entity recognition for legal documents. In: International conference on advanced machine learning technologies and applications. Springer, pp 469–479

  • Vasiliev Y (2020) Natural Language Processing with Python and spaCy: a practical introduction. No Starch Press, San Francisco


  • Wang S, Liu Y, Xu Y et al (2021) Want to reduce labeling cost? GPT-3 can help. arXiv:2108.13487

  • Wang S, Sun X, Li X et al (2023) GPT-NER: named entity recognition via large language models. arXiv:2304.10428

  • Wei X, Cui X, Cheng N et al (2023) Zero-shot information extraction via chatting with ChatGPT. arXiv:2302.10205

  • Zamani H, Croft WB (2018) On the theory of weak supervision for information retrieval. In: Proceedings of the 2018 ACM SIGIR international conference on theory of information retrieval, ICTIR ’18. Association for Computing Machinery, New York, NY, USA, pp 147–154. https://doi.org/10.1145/3234944.3234968

  • Zhang S, He L, Dragut E et al (2019) How to invest my time: Lessons from human-in-the-loop entity extraction. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2305–2313

  • Zhou ZH (2018) A brief introduction to weakly supervised learning. Natl Sci Rev 5(1):44–53



Acknowledgements

The authors would like to thank Fundação de Apoio à Pesquisa do Distrito Federal (FAPDF), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP - process number 2023/10100-4), and project KnEDLe-UNB.

Author information


Corresponding author

Correspondence to Ricardo Marcacini.



About this article


Cite this article

Oliveira, V., Nogueira, G., Faleiros, T. et al. Combining prompt-based language models and weak supervision for labeling named entity recognition on legal documents. Artif Intell Law (2024). https://doi.org/10.1007/s10506-023-09388-1

