Abstract
Technology has substantially transformed how legal services operate in many countries. With large and complex collections of digitized legal documents, judicial systems worldwide offer a promising setting for the development of intelligent tools. In this work, we tackle the challenging task of organizing and summarizing constantly growing collections of legal documents by uncovering hidden topics, or themes, that can later support tasks such as legal case retrieval and legal judgment prediction. Our approach relies on topic discovery techniques combined with a variety of preprocessing techniques and learning-based vector representations of words, such as Doc2Vec and BERT-like models. The proposed method was validated on four datasets composed of short and long legal documents in Brazilian Portuguese, ranging from legal decisions to chapters in legal books. An analysis conducted by a team of legal specialists showed that the proposed approach is effective at uncovering unique and relevant topics in large collections of legal documents. These topics serve several purposes, such as supporting legal case retrieval tools and giving the team of legal specialists a tool that can accelerate their work of labeling and tagging legal documents.
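To make the general idea of embedding-based topic discovery concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy four-document corpus invented for illustration, uses plain bag-of-words counts as a stand-in for Doc2Vec or BERT-like embeddings, and uses a small cosine-based k-means as a stand-in for whatever clustering step the method actually employs. All function names (`embed`, `cluster`, `top_terms`) are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from the paper's datasets).
DOCS = [
    "contrato de aluguel rescisao contrato multa",
    "aluguel imovel contrato locador locatario",
    "pensao alimenticia guarda filhos familia",
    "guarda compartilhada filhos pensao familia",
]


def embed(docs):
    """Bag-of-words count vectors: an assumed stand-in for learned embeddings."""
    vocab = sorted({w for d in docs for w in d.split()})
    vectors = []
    for d in docs:
        counts = Counter(d.split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, k, iterations=10):
    """Group vectors into k clusters with a simple cosine-based k-means."""
    # Deterministic seeding: start from the first document, then repeatedly
    # add the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(
            candidates,
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    centroids = [list(vectors[s]) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_terms(docs, labels, topic, n=2):
    """Describe a topic by its n most frequent words (ties broken alphabetically)."""
    counts = Counter(
        w for d, lab in zip(docs, labels) if lab == topic for w in d.split()
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]


vocab, vectors = embed(DOCS)
labels = cluster(vectors, k=2)
for topic in sorted(set(labels)):
    print(topic, top_terms(DOCS, labels, topic))
```

In a realistic setting the `embed` step would be replaced by a trained document embedding model, and the frequency-based topic description by a more discriminative term weighting; the sketch only shows how documents, clusters, and topic words relate to one another.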
Funding
Funding was partially provided by the Jusbrasil Postdoctoral Fellowship Program, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Grant No. 001), the Amazonas State Research Support Foundation (FAPEAM) through the POSGRAD project, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (Grant No. 307248/2019-4).
About this article
Cite this article
Vianna, D., de Moura, E.S. & da Silva, A.S. A topic discovery approach for unsupervised organization of legal document collections. Artif Intell Law (2023). https://doi.org/10.1007/s10506-023-09371-w
DOI: https://doi.org/10.1007/s10506-023-09371-w