
Data Augmentation and Large Language Model for Legal Case Retrieval and Entailment

Article · The Review of Socionetwork Strategies

Abstract

The Competition on Legal Information Extraction and Entailment (COLIEE) is an annual international competition that applies machine learning techniques to the analysis and understanding of legal documents. Its two principal tasks are legal information retrieval and entailment. In legal text analysis, the scarcity of annotated data poses a significant challenge for training robust models. To address this limitation, we employ data augmentation methods that artificially expand the training dataset, improving the model's ability to generalize across diverse legal contexts. We further leverage a state-of-the-art large language model to extract nuanced legal information and improve entailment predictions. We evaluate our methodology on the competition datasets and achieve competitive results.
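The full text details the methods; as a rough, self-contained illustration of the two ingredients named above, the sketch below pairs back-translation (one common data augmentation technique, which may differ from the paper's own choices) with a generic pretrained natural language inference model for the entailment decision. The checkpoints and the augment/entails helpers are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: back-translation for data augmentation plus an
# off-the-shelf NLI model for entailment. Checkpoints and helper names are
# assumptions; the paper's actual pipeline may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# (1) Data augmentation via back-translation (EN -> DE -> EN): each annotated
# sentence yields a paraphrase, artificially expanding the training set.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def augment(sentence: str) -> str:
    german = to_de(sentence)[0]["translation_text"]
    return to_en(german)[0]["translation_text"]

# (2) Entailment: does a case fragment (premise) entail the query claim
# (hypothesis)? Scored here with a generic MNLI-finetuned classifier.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = nli(**enc).logits.argmax(dim=-1).item()
    return nli.config.id2label[pred] == "ENTAILMENT"

claim = "The defendant owed the plaintiff a duty of care."
extra = augment(claim)                                # one augmented example
print(extra, entails(claim, "A duty of care existed."))
```

Back-translation is useful here because it preserves meaning while varying surface form, which is exactly what expanding a scarce annotated set requires; in practice the augmented pairs would be added to the entailment model's fine-tuning data rather than scored zero-shot as above.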


Data availability

Ensuring sufficient data availability remains crucial for further advances in this domain.



Author information

Corresponding author

Correspondence to Minh-Quan Bui.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bui, MQ., Do, DT., Le, NK. et al. Data Augmentation and Large Language Model for Legal Case Retrieval and Entailment. Rev Socionetwork Strat 18, 49–74 (2024). https://doi.org/10.1007/s12626-024-00158-2

