A topic discovery approach for unsupervised organization of legal document collections

Vianna, Daniela; de Moura, Edleno Silva; da Silva, Altigran Soares

doi:10.1007/s10506-023-09371-w

Original Research
Published: 19 July 2023

Artificial Intelligence and Law

Daniela Vianna¹ (ORCID: orcid.org/0000-0003-2943-5211), Edleno Silva de Moura¹,² & Altigran Soares da Silva¹

Abstract

Technology has substantially transformed the way legal services operate in many countries. With large and complex collections of digitized legal documents, judiciary systems worldwide present a promising scenario for the development of intelligent tools. In this work, we tackle the challenging task of organizing and summarizing constantly growing collections of legal documents by uncovering hidden topics, or themes, that can later support tasks such as legal case retrieval and legal judgment prediction. Our approach relies on topic discovery techniques combined with a variety of preprocessing techniques and learning-based vector representations of words, such as those produced by Doc2Vec and BERT-like models. The proposed method was validated on four datasets composed of short and long legal documents in Brazilian Portuguese, ranging from legal decisions to chapters of legal books. An analysis conducted by a team of legal specialists showed that the proposed approach effectively uncovers unique and relevant topics from large collections of legal documents. These topics serve many purposes, such as supporting legal case retrieval tools and giving legal specialists a tool that accelerates their work of labeling and tagging legal documents.
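
As a rough illustration of the kind of pipeline the abstract describes (represent each document as a vector, group similar documents, then describe each group by its most distinctive words), the sketch below uses only the Python standard library. It is not the authors' implementation: plain term counts stand in for Doc2Vec or BERT-like embeddings, a greedy cosine-similarity grouping stands in for a real clustering algorithm, and the function name `discover_topics`, the `threshold` and `top_n` parameters, and the English toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def distinctiveness(term, counts, all_counts):
    """Term frequency in one cluster, damped when many clusters share the term."""
    containing = sum(1 for c in all_counts if term in c)
    return counts[term] * math.log(1 + len(all_counts) / containing)


def discover_topics(docs, threshold=0.3, top_n=3):
    """Group documents greedily by cosine similarity and label each group.

    Hypothetical helper for illustration only: term counts replace learned
    embeddings, and the greedy pass replaces a proper clustering step.
    """
    vectors = [Counter(doc.lower().split()) for doc in docs]

    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            centroid = sum((vectors[j] for j in members), Counter())
            if cosine(vec, centroid) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([i])

    all_counts = [sum((vectors[j] for j in m), Counter()) for m in clusters]
    topics = []
    for counts in all_counts:
        ranked = sorted(
            counts,
            key=lambda t: (-distinctiveness(t, counts, all_counts), t),
        )
        topics.append(ranked[:top_n])
    return clusters, topics


# Toy documents (assumed for this example, not taken from the paper's datasets).
docs = [
    "tenant eviction lease contract rent",
    "lease rent tenant dispute deposit",
    "employee dismissal severance labor claim",
    "labor claim overtime employee wages",
]
clusters, topics = discover_topics(docs)
for members, words in zip(clusters, topics):
    print(members, words)
```

With these toy documents the two housing texts fall into one group and the two labor texts into another. On a real collection, the vector representation and the similarity threshold would both need tuning.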

Funding

Funding was partially provided by Jusbrasil Postdoctoral Fellowship Program, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Grant No. 001), Amazonas State Research Support Foundation - FAPEAM - through the POSGRAD project, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (Grant No. 307248/2019-4).

Author information

Edleno Silva de Moura and Altigran Soares da Silva have contributed equally to this work.

Authors and Affiliations

Instituto de Computação, Universidade Federal do Amazonas, Manaus, Amazonas, Brazil
Daniela Vianna, Edleno Silva de Moura & Altigran Soares da Silva
Jusbrasil, Salvador, Bahia, Brazil
Edleno Silva de Moura

Corresponding author

Correspondence to Daniela Vianna.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vianna, D., de Moura, E.S. & da Silva, A.S. A topic discovery approach for unsupervised organization of legal document collections. Artif Intell Law (2023). https://doi.org/10.1007/s10506-023-09371-w

Accepted: 04 July 2023
Published: 19 July 2023
DOI: https://doi.org/10.1007/s10506-023-09371-w
