Abstract
The common law system relies on precedent, that is, previous court decisions, to resolve current cases. As more legal documents have become available in digital form, the sheer volume of data has made it harder for legal professionals to identify relevant past cases manually. To address this, researchers have developed automated systems that measure the similarity between legal documents. Our research explores several representations of a legal document and proposes a novel paragraph filtering process that uses legal citation information to identify key paragraphs and remove unnecessary ones without altering the substance of the document. We evaluate the proposed approach on document comparison using techniques such as TF-IDF, BERT, Legal-BERT, Doc2Vec, and Legal-Longformer. The results show that a model trained on the filtered paragraphs can achieve better results than a model trained on the complete text, while the filtering shortens the document by over 40%. The proposed filtering strategy could be particularly helpful for models such as BERT, where the maximum token length is fixed.
About this article
Cite this article
Makawana, M., Mehta, R.G. A novel network-based paragraph filtering technique for legal document similarity analysis. Artif Intell Law (2023). https://doi.org/10.1007/s10506-023-09375-6
DOI: https://doi.org/10.1007/s10506-023-09375-6