Abstract
In a Common Law system, legal practitioners need frequent access to prior case documents that discuss relevant legal issues. Case documents are generally lengthy and contain complex sentence structures, so reading them in full is strenuous even for legal practitioners. A concise overview of these documents can relieve legal practitioners of the task of reading the complete case statements. Legal catchphrases are (multi-word) phrases that provide a concise overview of the contents of a case document, and automated generation of catchphrases is a challenging problem in legal analytics. In this paper, we propose a novel supervised neural sequence tagging model for the extraction of catchphrases from legal case documents. Specifically, we show that combining document-specific information with a sequence tagging model improves the performance of catchphrase extraction. We perform experiments over a set of Indian Supreme Court case documents, for which the gold-standard catchphrases (annotated by legal practitioners) are obtained from a popular legal information system. We compare the proposed method with several existing supervised and unsupervised methods, and show empirically that it outperforms all of these baselines.
Code availability
The implementation of the proposed model (D2V-BiGRU-CRF) is available at https://github.com/amarnamarpan/D2V-BiGRU-CRF.
Notes
Accuracy is a well-known set-based evaluation metric for classification algorithms; it measures what fraction of instances are correctly classified by a model. In the present context, accuracy can be used to measure what fraction of catchphrases are correctly identified by a method.
The dataset is available online at https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports.
The GitHub URL of our noun phrase extractor is https://github.com/amarnamarpan/NNP-extractor.
The complete list of POS tags can be found at www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.
Available at https://pypi.org/project/python-crfsuite/.
To get Viterbi accuracy scores in pyCRFsuite, one can use the ‘-i’ option while tagging.
Available at http://radimrehurek.com/gensim/index.html.
Available online at https://keras.io/.
Available at https://www.cs.waikato.ac.nz/ml/weka/.
To compute the ROUGE recall score, we use the implementation found at https://pypi.org/project/rouge-score/.
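The two evaluation measures mentioned in the notes above can be illustrated with a minimal sketch. It is not the evaluation code used in the paper. The function names are hypothetical, and several choices are assumptions: the accuracy denominator is taken to be the set of gold-standard catchphrases, phrases are matched after lowercasing and trimming, and the ROUGE variant is assumed to be ROUGE-1 with plain whitespace tokenization (the rouge-score package applies its own tokenization, so its values may differ).

```python
from collections import Counter


def catchphrase_accuracy(predicted, gold):
    """Fraction of gold catchphrases found among the predicted ones.

    Assumption: the denominator is the gold set, and phrases are
    compared case-insensitively after stripping whitespace.
    """
    gold_set = {phrase.strip().lower() for phrase in gold}
    pred_set = {phrase.strip().lower() for phrase in predicted}
    if not gold_set:
        return 0.0
    return len(gold_set & pred_set) / len(gold_set)


def rouge1_recall(candidate, reference):
    """Unigram recall of a candidate text against a reference text.

    Assumption: ROUGE-1 with whitespace tokenization; each overlap
    count is clipped by the count in the candidate.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[token])
                  for token, count in ref_counts.items())
    return overlap / total


if __name__ == "__main__":
    # Illustrative inputs only, not taken from the dataset.
    print(catchphrase_accuracy(["Land Acquisition", "writ petition"],
                               ["land acquisition", "compensation"]))  # 0.5
    print(rouge1_recall("the court allowed the appeal",
                        "the appeal was allowed"))  # 0.75
```

In the first call, one of the two gold phrases is recovered, giving 0.5. In the second, three of the four reference unigrams appear in the candidate, giving 0.75.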
Acknowledgements
The authors acknowledge faculty members from The West Bengal National University of Juridical Sciences (www.nujs.edu), and Rajiv Gandhi School of Intellectual Property Law (www.iitkgp.ac.in/department/IP) for insightful discussions. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal Analytics’. The first author is supported by the Visvesvaraya PhD scheme from the Ministry of Electronics and Information Technology (Grant No. VISPHDMEITY-1570).
Funding
The first author is supported by the Ministry of Electronics and Information Technology, Government of India, through the fellowship “Visvesvaraya PhD Scheme for Electronics and IT”. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal Analytics’.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Mandal, A., Ghosh, K., Ghosh, S. et al. A sequence labeling model for catchphrase identification from legal case documents. Artif Intell Law 30, 325–358 (2022). https://doi.org/10.1007/s10506-021-09296-2
DOI: https://doi.org/10.1007/s10506-021-09296-2