
Data Augmentation and Large Language Model for Legal Case Retrieval and Entailment

Article · The Review of Socionetwork Strategies

Abstract

The Competition on Legal Information Extraction and Entailment (COLIEE) is an annual international competition that applies machine learning techniques to the analysis and understanding of legal documents. Its two principal tasks are legal information retrieval and entailment. In legal text analysis, the scarcity of annotated data poses a significant challenge for training robust models. To address this limitation, we employ data augmentation methods that artificially expand the training dataset, improving the model's ability to generalize across diverse legal contexts. We further leverage a state-of-the-art large language model to extract nuanced legal information and improve entailment predictions. We evaluate our methodology on the competition datasets and achieve competitive results.
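The full text details the methods; as a rough, self-contained illustration of the two ingredients named above, the sketch below pairs back-translation (one common data augmentation technique, which may differ from the paper's own choices) with a generic pretrained natural language inference model for the entailment decision. The checkpoints and the augment/entails helpers are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: back-translation for data augmentation plus an
# off-the-shelf NLI model for entailment. Checkpoints and helper names are
# assumptions; the paper's actual pipeline may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# (1) Data augmentation via back-translation (EN -> DE -> EN): each annotated
# sentence yields a paraphrase, artificially expanding the training set.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def augment(sentence: str) -> str:
    german = to_de(sentence)[0]["translation_text"]
    return to_en(german)[0]["translation_text"]

# (2) Entailment: does a case fragment (premise) entail the query claim
# (hypothesis)? Scored here with a generic MNLI-finetuned classifier.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = nli(**enc).logits.argmax(dim=-1).item()
    return nli.config.id2label[pred] == "ENTAILMENT"

claim = "The defendant owed the plaintiff a duty of care."
extra = augment(claim)                                # one augmented example
print(extra, entails(claim, "A duty of care existed."))
```

Back-translation is useful here because it preserves meaning while varying surface form, which is exactly what expanding a scarce annotated set requires; in practice the augmented pairs would be added to the entailment model's fine-tuning data rather than scored zero-shot as above.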


Data availability

Ensuring sufficient data availability remains crucial for further advances in this domain.



Author information

Corresponding author

Correspondence to Minh-Quan Bui.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bui, MQ., Do, DT., Le, NK. et al. Data Augmentation and Large Language Model for Legal Case Retrieval and Entailment. Rev Socionetwork Strat 18, 49–74 (2024). https://doi.org/10.1007/s12626-024-00158-2

