Abstract
Sentence boundary detection (SBD) is an important first step in natural language processing, since the accuracy of the detected boundaries directly affects downstream applications. Detecting sentence boundaries in legal texts is particularly challenging because of their distinct structural and linguistic features. Our approach uses deep learning models that take each delimiter and its surrounding context as input to detect sentence boundaries in English legal texts. We evaluate a range of deep learning models, including domain-specific transformer models such as LegalBERT and CaseLawBERT, and compare them with a state-of-the-art domain-specific statistical conditional random field (CRF) model. Weighing model size, F1-score, and inference time, we identify the convolutional neural network (CNN) model as the best-performing deep learning model. To further improve performance, we feed the features of the CNN model into a subsequent CRF model, creating a hybrid architecture that combines the strengths of both. Our experiments show that the hybrid model outperforms the baseline model, achieving a 4% improvement in F1-score. Additional experiments show that the hybrid model also outperforms open-source SBD libraries on an out-of-domain test set. These findings underscore the importance of efficient SBD in legal texts and the advantages of deep learning models and hybrid architectures for this task.
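To make the input representation concrete, the following minimal sketch shows one way a delimiter and its surrounding context could be turned into classification samples. It is an illustration under stated assumptions, not the authors' implementation: the delimiter set, the character-level window, the window size, and the names `DELIMITERS`, `Candidate`, and `extract_candidates` are all hypothetical, since the abstract does not specify them.

```python
from typing import List, NamedTuple

# Hypothetical delimiter set; the abstract does not list the actual delimiters.
DELIMITERS = {".", "?", "!", ";", ":"}


class Candidate(NamedTuple):
    """One potential sentence boundary together with its local context."""
    index: int       # character offset of the delimiter in the text
    delimiter: str   # the delimiter character itself
    left: str        # up to `window` characters before the delimiter
    right: str       # up to `window` characters after the delimiter


def extract_candidates(text: str, window: int = 6) -> List[Candidate]:
    """Collect every delimiter occurrence with a character window on each side.

    Each candidate would later be labeled as boundary or non-boundary by a
    classifier; this function only prepares the inputs. The window size is an
    assumed value chosen for illustration.
    """
    candidates = []
    for i, ch in enumerate(text):
        if ch in DELIMITERS:
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            candidates.append(Candidate(i, ch, left, right))
    return candidates


# A legal citation contains several periods that are not sentence boundaries.
sample = "See 42 U.S.C. 1983. The court agreed."
for c in extract_candidates(sample):
    print(c.index, repr(c.left), c.delimiter, repr(c.right))
```

In this sample, the periods inside "U.S.C." yield candidates that a classifier would need to reject, which is why the context on both sides of the delimiter is part of the input rather than the delimiter alone.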
Data availability
The dataset used in this work is available for research purposes. The code was developed with open-source tools and libraries and is available to researchers who wish to reproduce the experiments and build on the findings.
Ethics declarations
Conflict of interest
All authors certify that they have no involvement in any firm or entity with any financial or non-financial interest in the materials covered in this document.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sheik, R., Ganta, S.R. & Nirmala, S.J. Legal sentence boundary detection using hybrid deep learning and statistical models. Artif Intell Law (2024). https://doi.org/10.1007/s10506-024-09394-x
Accepted:
Published:
DOI: https://doi.org/10.1007/s10506-024-09394-x