Legal sentence boundary detection using hybrid deep learning and statistical models

  • Original Research
Artificial Intelligence and Law

Abstract

Sentence boundary detection (SBD) is an important first step in natural language processing because the accuracy of sentence boundaries significantly affects downstream applications. Detecting sentence boundaries in legal texts is particularly challenging because of their distinct structural and linguistic features. Our approach feeds each candidate delimiter, together with its surrounding context, to deep learning models that detect sentence boundaries in English legal texts. We evaluate several deep learning models, including domain-specific transformer models such as LegalBERT and CaseLawBERT, and compare them with a state-of-the-art domain-specific statistical conditional random field (CRF) model. Considering model size, F1-score, and inference time, we identify the convolutional neural network (CNN) model as the top-performing deep learning model. To further improve performance, we integrate the features of the CNN model into the subsequent CRF model, creating a hybrid architecture that combines the strengths of both. Our experiments show that the hybrid model outperforms the baseline model, achieving a 4% improvement in F1-score. Additional experiments show that the hybrid model also outperforms open-source SBD libraries on an out-of-domain test set. These findings underscore the importance of efficient SBD in legal texts and the advantages of deep learning models and hybrid architectures for this task.
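To make the input representation concrete, the following is a minimal sketch of how a candidate delimiter and its surrounding context could be extracted, and how a neural score could sit alongside hand-written cues in a CRF-style feature dictionary. It is an illustration only, not the authors' implementation: the delimiter set, the window width, the feature names, and the `boundary_prob` argument (a stand-in for whatever the CNN outputs) are all assumptions, since the abstract does not specify them.

```python
from typing import List, NamedTuple

# Hypothetical delimiter set and window width; the abstract does not give the real values.
DELIMITERS = {".", "?", "!", ";", ":"}
WINDOW = 6


class Candidate(NamedTuple):
    position: int   # index of the delimiter in the text
    delimiter: str  # the delimiter character itself
    left: str       # up to `window` characters before the delimiter
    right: str      # up to `window` characters after the delimiter


def extract_candidates(text: str, window: int = WINDOW) -> List[Candidate]:
    """Return one context sample per potential sentence-ending delimiter."""
    candidates = []
    for i, ch in enumerate(text):
        if ch in DELIMITERS:
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            candidates.append(Candidate(i, ch, left, right))
    return candidates


def to_crf_features(cand: Candidate, boundary_prob: float) -> dict:
    """Combine simple surface cues with a neural score in one feature dict.

    `boundary_prob` is assumed to come from a separately trained classifier;
    here it is just passed through as a real-valued feature.
    """
    return {
        "delimiter": cand.delimiter,
        "next_is_upper": cand.right.strip()[:1].isupper(),
        "prev_is_digit": cand.left[-1:].isdigit(),
        "nn_boundary_prob": boundary_prob,
    }


if __name__ == "__main__":
    sample = "See 42 U.S.C. 1983. The court agreed."
    for cand in extract_candidates(sample):
        print(cand)
```

In the sample sentence, the periods inside "U.S.C." are followed by capital letters just as a true boundary would be, which is why surface cues alone tend to be unreliable on legal citations and why a learned score over the context window is a plausible complement.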

Data availability

The dataset we used in our work is open for research purposes. The code used in this study was developed using open-source tools and libraries and is open for researchers to reproduce the experiments and build upon the research findings, potentially leading to new insights.

Notes

  1. https://github.com/jsavelka/sbd_adjudicatory_dec.

  2. https://case.law/.

  3. https://www.chokkan.org/software/crfsuite/.

  4. https://huggingface.co/.

  5. https://pandas.pydata.org/.

  6. https://numpy.org/.

  7. https://www.statsmodels.org/stable/index.html.

  8. https://matplotlib.org/.

  9. https://code.visualstudio.com/.

  10. https://pypi.org/project/sklearn-crfsuite/.

  11. https://github.com/researchnlp/SBDLegal.

  12. https://pypi.org/project/syntok/.

Author information

Corresponding author

Correspondence to Reshma Sheik.

Ethics declarations

Conflict of interest

All authors certify that they have no involvement in any firm or entity with any financial or non-financial interest in the materials covered in this document.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheik, R., Ganta, S.R. & Nirmala, S.J. Legal sentence boundary detection using hybrid deep learning and statistical models. Artif Intell Law (2024). https://doi.org/10.1007/s10506-024-09394-x

  • DOI: https://doi.org/10.1007/s10506-024-09394-x
