Legal sentence boundary detection using hybrid deep learning and statistical models

Sheik, Reshma; Ganta, Sneha Rao; Nirmala, S. Jaya

doi:10.1007/s10506-024-09394-x

Original Research
Published: 14 March 2024

Artificial Intelligence and Law

Abstract

Sentence boundary detection (SBD) is an important first step in natural language processing, since the accuracy of sentence boundaries significantly affects downstream applications. Detecting sentence boundaries in legal texts is a distinct and challenging problem because of their structural and linguistic features. Our approach uses deep learning models that take each delimiter and its surrounding context as input, enabling precise detection of sentence boundaries in English legal texts. We evaluate various deep learning models, including domain-specific transformer models such as LegalBERT and CaseLawBERT. To assess the efficacy of our deep learning models, we compare them with a state-of-the-art domain-specific statistical conditional random field (CRF) model. After considering model size, F1-score, and inference time, we identify the convolutional neural network (CNN) model as the top-performing deep learning model. To further enhance performance, we integrate the features of the CNN model into the subsequent CRF model, creating a hybrid architecture that combines the strengths of both models. Our experiments demonstrate that the hybrid model outperforms the baseline model, achieving a 4% improvement in the F1-score. Additional experiments show that the hybrid model also outperforms open-source SBD libraries on an out-of-domain test set. These findings underscore the importance of efficient SBD in legal texts and the advantages of deep learning models and hybrid architectures for this task.
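
The idea of classifying each delimiter from its surrounding context, and of scoring the result with F1, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the delimiter set, the character window width, and the function names `candidate_windows` and `f1_score` are assumptions made for illustration, since the abstract does not specify them. The sketch only extracts candidate windows and scores a set of predicted boundary positions; the classifier that would sit in between (CNN, CRF, or the hybrid) is left out.

```python
from typing import List, Set, Tuple

# Hypothetical delimiter set; the abstract does not list the one actually used.
DELIMITERS = {".", "?", "!", ";", ":"}


def candidate_windows(text: str, width: int = 5) -> List[Tuple[int, str, str, str]]:
    """Return (position, left_context, delimiter, right_context) per candidate.

    Every delimiter character is a candidate boundary; a downstream model
    would decide from the two context strings whether it ends a sentence.
    """
    samples = []
    for i, ch in enumerate(text):
        if ch in DELIMITERS:
            left = text[max(0, i - width):i]       # up to `width` chars before
            right = text[i + 1:i + 1 + width]      # up to `width` chars after
            samples.append((i, left, ch, right))
    return samples


def f1_score(predicted: Set[int], gold: Set[int]) -> float:
    """F1 over boundary positions: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    text = "See 42 U.S.C. 1983. The court agreed."
    for sample in candidate_windows(text):
        print(sample)
    # A naive splitter that treats every delimiter as a boundary,
    # compared against the two true sentence ends in this example.
    naive = {pos for pos, _, _, _ in candidate_windows(text)}
    print(f1_score(naive, gold={18, 36}))
```

In this example the periods inside the citation "U.S.C." are candidates that a naive splitter would wrongly accept, which is the kind of error that context-aware models for legal text aim to avoid.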

Data availability

The dataset we used in our work is open for research purposes. The code used in this study was developed using open-source tools and libraries and is open for researchers to reproduce the experiments and build upon the research findings, potentially leading to new insights.

Author information

Authors and Affiliations

National Institute of Technology, Trichy, Tiruchirappalli, Tamil Nadu, India
Reshma Sheik, Sneha Rao Ganta & S. Jaya Nirmala

Corresponding author

Correspondence to Reshma Sheik.

Ethics declarations

Conflict of interest

All authors certify that they have no involvement in any firm or entity with any financial or non-financial interest in the materials covered in this document.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheik, R., Ganta, S.R. & Nirmala, S.J. Legal sentence boundary detection using hybrid deep learning and statistical models. Artif Intell Law (2024). https://doi.org/10.1007/s10506-024-09394-x

Accepted: 13 February 2024
Published: 14 March 2024
DOI: https://doi.org/10.1007/s10506-024-09394-x
