A novel network-based paragraph filtering technique for legal document similarity analysis

Artificial Intelligence and Law

Abstract

The common law system values precedent, that is, previous court decisions, in resolving current cases. As more legal documents have become available in digital form, the sheer volume of data has made it harder for legal professionals to identify relevant past cases manually. To address this issue, researchers have developed automated systems that determine the similarity between legal documents. Our research explores various representations of a legal document and proposes a novel paragraph filtering process that uses legal citation information to identify key paragraphs and remove unnecessary ones without disturbing the central concept of the document. We evaluate the proposed approach on document comparison using state-of-the-art techniques: TF-IDF, BERT, Legal-BERT, Doc2Vec, and Legal-Longformer. A model trained on the filtered paragraphs can achieve better results than a model trained on the complete text, and the filtering can also shorten the document by over 40%. The proposed filtering strategy could be helpful for models like BERT, where the maximum token length is fixed.
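As a rough illustration of the pipeline the abstract outlines (filter paragraphs, then compare documents), the following minimal Python sketch may help. It is not the authors' method: the abstract does not detail the network-based filter, so a simple regular-expression citation cue stands in for it here, and TF-IDF with cosine similarity is only one of the representations named above. The names `CITATION_RE`, `filter_paragraphs`, `tfidf_vectors`, and `cosine`, as well as the sample paragraphs, are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical citation cue: "v." in a case name, or a year in brackets.
# An illustrative assumption, not the paper's citation-network filter.
CITATION_RE = re.compile(r"\bv\.\s|\(\d{4}\)|\[\d{4}\]")

def filter_paragraphs(paragraphs):
    """Keep only paragraphs that contain a citation cue."""
    return [p for p in paragraphs if CITATION_RE.search(p)]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two made-up case documents, each given as a list of paragraphs.
case_a = [
    "The appellant relies on Smith v. Jones (1990) on the duty of care.",
    "The hearing was adjourned twice for administrative reasons.",
]
case_b = [
    "In Smith v. Jones (1990) the court defined the duty of care.",
    "Costs were reserved.",
]

# Filter each document, then compare the shortened texts.
docs = [" ".join(filter_paragraphs(c)) for c in (case_a, case_b)]
vec_a, vec_b = tfidf_vectors(docs)
similarity = cosine(vec_a, vec_b)
```

In this toy setting each document keeps only its citing paragraph, so the comparison runs on shorter text. For a fixed-length model such as BERT, the same filtering step would run before tokenization.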

Author information

Corresponding author

Correspondence to Mayur Makawana.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Makawana, M., Mehta, R.G. A novel network-based paragraph filtering technique for legal document similarity analysis. Artif Intell Law (2023). https://doi.org/10.1007/s10506-023-09375-6

  • DOI: https://doi.org/10.1007/s10506-023-09375-6
