A topic discovery approach for unsupervised organization of legal document collections

  • Original Research
Artificial Intelligence and Law

Abstract

Technology has substantially transformed the way legal services operate in many countries. With large and complex collections of digitized legal documents, judiciary systems worldwide present a promising scenario for the development of intelligent tools. In this work, we tackle the challenging task of organizing and summarizing constantly growing collections of legal documents by uncovering hidden topics, or themes, that can later support tasks such as legal case retrieval and legal judgment prediction. Our approach relies on topic discovery techniques combined with a variety of preprocessing techniques and learned vector representations of text, such as Doc2Vec and BERT-like models. The proposed method was validated on four datasets of short and long legal documents in Brazilian Portuguese, ranging from legal decisions to chapters of legal books. An analysis conducted by a team of legal specialists showed that the approach effectively uncovers unique and relevant topics from large collections of legal documents. These topics serve several purposes, such as supporting legal case retrieval tools and giving legal specialists a tool that can accelerate their work of labeling and tagging legal documents.
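The general idea of topic discovery over document vectors can be illustrated with a small sketch. This is not the authors' pipeline, which is not shown in this preview: as a stand-in for Doc2Vec or BERT-like embeddings it uses plain TF-IDF vectors, and as a stand-in for the clustering step it uses a toy k-means on cosine similarity. The four English documents and all function names (`tfidf_vectors`, `cluster`, `top_terms`) are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and drop very short tokens."""
    return [tok for tok in text.lower().split() if len(tok) > 2]


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed inverse document frequency; an assumption, not the paper's choice.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def cluster(vectors, k, n_iter=10):
    """Toy k-means on cosine similarity with deterministic farthest-first seeds."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the document least similar to its closest existing seed.
        far = min(range(len(vectors)),
                  key=lambda i: max(cosine(vectors[i], c) for c in centroids))
        centroids.append(vectors[far])
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


def top_terms(vectors, labels, n_terms=3):
    """Describe each cluster by the highest-weighted terms of its centroid."""
    topics = {}
    for lab in sorted(set(labels)):
        centroid = mean_vector([v for v, l in zip(vectors, labels) if l == lab])
        ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
        topics[lab] = [term for term, _ in ranked[:n_terms]]
    return topics


# Hypothetical toy corpus with two themes (tenancy and tax).
docs = [
    "tenant lease eviction rent dispute",
    "rent increase lease contract tenant",
    "tax evasion penalty revenue audit",
    "audit tax assessment penalty appeal",
]
vectors = tfidf_vectors(docs)
labels = cluster(vectors, k=2)
topics = top_terms(vectors, labels)
print(labels, topics)
```

In a realistic setting the TF-IDF step would be replaced by learned document embeddings and the number of clusters would not be fixed in advance; the sketch only shows the shape of the embed, cluster, and describe steps.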

Funding

Funding was partially provided by Jusbrasil Postdoctoral Fellowship Program, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Grant No. 001), Amazonas State Research Support Foundation - FAPEAM - through the POSGRAD project, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (Grant No. 307248/2019-4).

Author information

Corresponding author

Correspondence to Daniela Vianna.

About this article

Cite this article

Vianna, D., de Moura, E.S. & da Silva, A.S. A topic discovery approach for unsupervised organization of legal document collections. Artif Intell Law (2023). https://doi.org/10.1007/s10506-023-09371-w

  • DOI: https://doi.org/10.1007/s10506-023-09371-w