Abstract
Large-scale public datasets are vital for driving the progress of abstractive summarization, especially in law, where documents have highly specialized jargon. However, the available resources are English-centered, limiting research advancements in other languages. This paper introduces LAWSUIT, a collection of 14K Italian legal verdicts with expert-authored abstractive maxims drawn from the Constitutional Court of the Italian Republic. LAWSUIT presents an arduous task with lengthy source texts and evenly distributed salient content. We offer extensive experiments with sequence-to-sequence and segmentation-based approaches, revealing that the latter achieve better results in full and few-shot settings. We openly release LAWSUIT to foster the development and automation of real-world legal applications.
1 Introduction
Text summarization (Sharma and Sharma 2023) is a persistent pursuit of natural language processing (NLP). Recently, there has been a growing interest in abstractive summarization (AS), which involves paraphrasing the essential details of textual documents in succinct and accessible language (Zhang et al. 2022). This surge in interest is primarily attributed to the availability of large pretrained language models (Lewis et al. 2020; Guo et al. 2022; Moro et al. 2022) and publicly accessible datasets spanning various domains (Cohan et al. 2018; Narayan et al. 2018). One particularly impactful domain in real-world applications is law, where documents often consist of thousands of words filled with jargon and intricate expressions. The complexity of these documents makes their comprehension a time-consuming and labor-intensive process, even for legal experts (Kanapala et al. 2019). Therefore, legal AS (Moro et al. 2023) is a practical, useful, and essential task to promote knowledge acquisition. Unfortunately, current legal summarization corpora are almost entirely devoted to English. There are as yet no Italian datasets for legal AS, which limits research on legal texts and their implications and restricts their accessibility to Italian law practitioners.
To fill this gap, we present the first large-scale Italian legal AS dataset, LAWSUIT,Footnote 1 consisting of 14,000 source documents with expert-authored summaries (Fig. 1). LAWSUIT allows the community to study the AS of legal verdicts in a critical application setting found in the Constitutional Court of the Italian Republic (CCIR). As the highest court in Italy for constitutional law matters, the CCIR maintains a comprehensive record of legal verdicts, accessible through an open-access data portal (https://dati.cortecostituzionale.it). In particular, highly qualified legal experts meticulously crafted and reviewed each ruling and the accompanying maxims, i.e., the synopses that clarify the events and core decisions. Beyond its potential to expand summarization capabilities, benefiting legal NLP benchmarks and tangible uses, LAWSUIT boasts several key features:
-
The average number of source and target words is significantly higher than that contemplated by existing Italian summarization datasets (+269% and +589%, respectively) (Casola and Lavelli 2021), encouraging long document AS for Italian.
-
In contrast to existing English legal benchmarks (Kornilova and Eidelman 2019; Huang et al. 2021), the salient content in the input is more uniformly distributed, and summary-worthy words are not concentrated in specific sections of the text. This characteristic poses a unique challenge for summarization tasks, requiring comprehensive processing of the entire source document rather than relying on localized content.
-
Unlike many summarization datasets that undergo automatic construction processes (Cohan et al. 2018; Grusky et al. 2018; Sharma et al. 2019; Huang et al. 2021), our inputs and targets are authored by experts. Specifically, university law professors and magistrates are responsible for drafting the verdicts, and the corresponding maxims are compiled by the supervisory office. The supervisory office oversees the formal control of the texts in collaboration with the study assistants of the President. This meticulous procedure ensures a high level of quality control and supervision, mitigating the risk of model hallucination (Maynez et al. 2020), which refers to the generation of unfaithful outputs due to training on targets that contain facts that are not supported by the source text.
We benchmark LAWSUIT using various extractive and abstractive summarization solutions, including a segmentation-based pipeline that demonstrates superior performance in both full and few-shot summarization scenarios, namely training models with all or just a few dozen instances.
2 Related work
Natural Language Processing for Legal Texts Legal NLP has been the subject of extensive research in various legal tasks, including information retrieval (Chalkidis et al. 2018; Hendrycks et al. 2021; Sansone and Sperlí 2022), question answering (Ravichander et al. 2019; Huang et al. 2020; Kien et al. 2020; Zhong et al. 2020), text classification (Chalkidis et al. 2019; Tuggener et al. 2020; Chalkidis et al. 2021, 2022; Feng et al. 2022), and automatic text summarization (Duan et al. 2019; Zhong et al. 2019; Bhattacharya et al. 2021; Elaraby and Litman 2022; Moro and Ragazzi 2022; Moro et al. 2023). Moreover, recent endeavors have increasingly shifted towards non-English applications (Metsker et al. 2019; Wang et al. 2019; Malik et al. 2021; Xiao et al. 2021; Bakker et al. 2022; Qin et al. 2022; Niklaus et al. 2023), including Italian (Bellandi et al. 2022; Galli et al. 2022; Licari and Comandé 2022; Tagarelli and Simeri 2022), thus stimulating research in low-resource language contexts. These studies focus on fetching past court decisions and predicting outcomes. To the best of our knowledge, we pioneer the exploration of Italian legal document summarization grounded in a non-common law system. This is achieved by releasing the first large-scale legal abstractive summarization dataset derived from the CCIR.
Legal Document Summarization Previous studies on automatic summarization of court proceedings have mainly relied on extractive approaches, where the predicted summary consists of exact sentences taken directly from the source material. This ranges from unsupervised methods (Farzindar and Lapalme 2004; Saravanan et al. 2006; Polsley et al. 2016; Zhong et al. 2019) to supervised methods (Liu and Chen 2019). In contrast, our work is centered on AS, where the output is a rewording of the input. Abstraction is more closely aligned with the actual conditions of legal practices (Kornilova and Eidelman 2019; Sharma et al. 2019; Huang et al. 2021; Shen et al. 2022).
Legal Summarization Datasets Given the crucial social role of the legal domain and the growing demand for summarization tools (Jain et al. 2021), numerous datasets have been introduced, covering various types of documents. These include case reports (Greenleaf et al. 1995), judgments (Grover et al. 2004), legislative bills (Kornilova and Eidelman 2019), patents (Sharma et al. 2019), government reports (Huang et al. 2021), and federal civil rights lawsuits (Shen et al. 2022). This diversity has enabled the development of large language models pretrained on legal text (Chalkidis et al. 2020; Zheng et al. 2021). Our dataset presents unique challenges, as it consists of lengthy domain-specialized documents that are inherently difficult to summarize. Challenges arise due to (i) the scattered distribution of summary-worthy information throughout the input and (ii) the occasional presence of formulaic expressions in the targets. Previous works have introduced Italian summarization datasets featuring short documents, such as those in the news (Landro et al. 2022) and articles related to Wikipedia (Ladhak et al. 2020; Casola and Lavelli 2021). Instead, LAWSUIT comprises longer texts (refer to Table 1), establishing itself as the first dataset for the Italian long document summarization task. Notably, the dataset includes gold summaries, diverging from the cases where summaries are automatically generated using the first sentence (Ladhak et al. 2020) or by concatenating the title with a description (Landro et al. 2022), a procedure that can compromise the factual consistency of models trained on such data (Maynez et al. 2020). In terms of legal contributions, LAWSUIT establishes the first large-scale legal resource, distinguishing itself from smaller datasets (Aumiller et al. 2022) and those designed exclusively for extractive summarization (Licari and Comandé 2022).
Italian Legal Language Models Since 2017, legal text analysis has been revolutionized by transformer-based architectures. Despite these advancements, accurately training machines to understand legal language remains a significant challenge. Legal language models, often benefitting from specialized pretraining (Chalkidis et al. 2020), currently achieve state-of-the-art results on various benchmarks (Zheng et al. 2021; Chalkidis et al. 2022). However, public generative models pretrained on legal corpora are scarce, forcing reliance on general models instead (Hwang et al. 2022; Shen et al. 2022). An extensive literature review (Katz et al. 2023) shows that English dominates open-source Legal NLP (56%), followed by Chinese (\(\approx \)10%), with models usually requiring extensive training hardware (Song et al. 2023). The main challenge in applying current language models to Italian documents is their inadequate training in comprehending instructions in that language.
Some contributions have explored Italian encoder-only models. UmBERTo (110 M) (Parisi et al. 2020) is the result of continual pretraining on top of RoBERTa using whole-word masking with filtered resources from Wikipedia and CommonCrawl. In the legal domain, Licari and Comandé (2022) introduced Italian-Legal-BERT (111 M), which applies continual pretraining to a general-domain Italian BERT model on civil-law corpora, alongside a from-scratch pretrained variant based on CamemBERT (111 M) (Martin et al. 2020), with distilled and long-document versions. However, these works fall outside of our scope, which is instead concerned with generative architectures.
In this sense, Mattei et al. proposed GePpeTto (117 M) (Mattei et al. 2020), a GPT-2 model fine-tuned on Italian Wikipedia and the ItWac corpus (Baroni et al. 2009), mainly aimed at text completion. Sarti and Nissim devised IT5 (60 M, 220 M, 738 M) (Sarti and Nissim 2024), a family of encoder-decoder transformer models pretrained on a cleaned version of the Italian mC4 corpus,Footnote 2 a web-crawled text collection that includes more than 40 billion words. La Quatra and Cagliero introduced BART-IT (Quatra and Cagliero 2023), an Italian version of BART trained on the same data mixture as IT5. Santilli and Rodolà released Camoscio (7B) (Santilli and Rodolà 2023), an instruction-tuned LLaMA model trained with low-rank adaptation (LoRA) on an Italian (ChatGPT-translated) version of Stanford Alpaca (Taori et al. 2023).
Regarding conversational objectives, Bacciu et al. (2023) presented Fauno (7B/13B), a LoRA fine-tuned version of Baize (Xu et al. 2023) on heterogeneous synthetic Italian datasets. LLaMantino (7B, 13B, 70B) (Basile et al. 2023) is a family of Italian-adapted LLaMA-2 models, trained using QLoRA on the IT5 data mixture. Maestrale (7B)Footnote 3 is a Mistral model specialized in Italian through continual pretraining and instruction fine-tuning. Zefiro (7B)Footnote 4 is a port of the Mistral model to the Italian language, obtained through continual pretraining on a random subset of Oscar and Wikipedia data, supervised fine-tuning on UltraChat-ITA (silver translation), and DPO alignment (Rafailov et al. 2023) with the ultrafeedback preference dataset (silver translation). Minerva (350 M, 1B, 3B)Footnote 5 is a family of large language models pretrained from scratch on 660B tokens (330B in Italian, 330B in English). DanteLLM (Bacciu et al. 2024) is a QLoRA fine-tuned version of Mistral-Instruct (7B), trained on the Italian SQuAD dataset (Croce et al. 2018), 25K sentences from the Europarl dataset (Koehn 2005), Fauno's Quora dataset, and the Camoscio dataset. Notably, as underscored by the current leaderboard dedicated to Italian language modeling available on HuggingFace,Footnote 6 the results achievable by language models pretrained from scratch on the Italian language are significantly inferior compared to those achievable by foundational models that have undergone extensive pretraining on larger multilingual corpora.
Taking LAWSUIT as a testbed, we fairly compare the effectiveness and efficiency of available Italian-adapted or multi-lingual encoder–decoder models with million-scale parameters, which offer significant advantages in hardware-constrained scenarios. We examine their adaptability to different tasks, languages, and amounts of labeled training data.
3 LAWSUIT
LAWSUIT is a large-scale Italian AS dataset that collects CCIR-sourced legal verdicts, serving as a new and demanding benchmark for the NLP community. The corpus comprises 14,000 long texts from 1956 to 2022, classified into orders and judgments (see Fig. 2 for statistics based on the year), each meticulously paired with a set of maxims (concatenated to form the target summary). The term order denotes a legal ruling declared during the judicial proceeding to settle questions and disputes verified during the trial, while judgment refers to a legal ruling declared by the judicial body at the end of the trial. The maxims summarize the judicial process by encapsulating key details about the ruling, general legal characteristics, references, and the final provision. Each source consists of three informative sections (Fig. 1):
-
Epigraph: the introduction detailing the main gist of the ruling and the context in which the request is addressed.
-
Text: the core content that highlights the legal extremes.
-
Decision: the concluding segment of the ruling that contains the final provisions of the Court.
Additional details on data access are provided in Appendix 6.
3.1 LAWSUIT Processing
To construct the LAWSUIT dataset, we started with 21,331 instances obtained from the CCIR open data, discarding verdicts that were too recent and lacked an associated maxim.
Size Filtering We retained records with summary lengths between 100 and 2000 words and source texts between 1000 and 20,000 words, resulting in 14,072 instances. This step aimed to remove unbalanced texts (i.e., outliers) that do not reflect the typical characteristics of these legal documents. Specifically, excessively short texts often lack sufficient context, while very long texts can introduce redundancy. Therefore, retaining only texts within the specified length ranges ensures that the model is exposed to a more homogeneous and representative sample of the data, leading to better generalization and performance.
Duplicate Data Removal To identify and eliminate duplicate instances, we employed an approach similar to Kornilova and Eidelman (2019), resulting in 14,054 instances. Technically, the process involved (i) removing stop words and the 30 most common terms (e.g., article, law, court), (ii) vectorizing texts using scikit-learn’s CountVectorizer, (iii) computing average cosine similarity between the texts and the summaries for each pair of verdicts, and (iv) iteratively adding verdicts while discarding instances highly similar (>96%) to any verdicts already included. Duplicates were often orders on related subjects pronounced in close time frames or written as corrections to previous orders with drafting errors; we kept the most recent version of the document.
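The deduplication procedure above can be sketched in plain Python. The stop-word list below is a placeholder, the averaging of text and summary similarities follows our reading of the procedure, and `vectorize`, `cosine`, and `deduplicate` are illustrative helpers rather than the released implementation (which relies on scikit-learn's CountVectorizer):

```python
from collections import Counter
from math import sqrt

# Placeholder stop-word list; the real pipeline also removed the 30 most
# common corpus terms (e.g., article, law, court).
STOP = {"the", "of", "and"}

def vectorize(text):
    """Bag-of-words counts after stop-word removal."""
    return Counter(t for t in text.lower().split() if t not in STOP)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(records, threshold=0.96):
    """Iteratively keep a (text, summary) record only if its average
    text/summary cosine similarity to every kept record is <= threshold."""
    kept, kept_vecs = [], []
    for i, (text, summary) in enumerate(records):
        tv, sv = vectorize(text), vectorize(summary)
        if not any((cosine(tv, kt) + cosine(sv, ks)) / 2 > threshold
                   for kt, ks in kept_vecs):
            kept.append(i)
            kept_vecs.append((tv, sv))
    return kept  # indices of retained records
```

Because records are processed in order, keeping the first occurrence of each near-duplicate group, the "most recent version" policy can be obtained by sorting records from newest to oldest before calling `deduplicate`.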
Compression Ratio Filtering The compression ratio quantifies how much a document is condensed to produce its summary. This metric is defined as the ratio between the number of words in the input and its corresponding target (Grusky et al. 2018). As for the size filtering procedure, we aimed to create a high-quality homogeneous dataset without outliers. Thus, due to the considerable variation in the sizes of both sources and summaries, we retained verdicts with a compression ratio between 2 and 70, ending up with 14,000 instances.
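As a minimal sketch, the compression-ratio filter reduces to a word-count ratio and a range check (the helper names are ours):

```python
def compression_ratio(source, summary):
    """Grusky et al. (2018): number of source words / number of summary words."""
    return len(source.split()) / len(summary.split())

def keep_instance(source, summary, lo=2, hi=70):
    """Retain only verdicts whose compression ratio lies within [lo, hi]."""
    return lo <= compression_ratio(source, summary) <= hi
```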
Quality Control and Text Cleaning Besides standard operations (e.g., removing extra spaces and newline characters), we implemented a preprocessing pipeline aimed at ensuring the textual quality of the dataset. The steps involved were as follows:
-
Removal of epigraph and decision prefixes containing personal names, such as those of the president, editors, and directors;
-
Explicit delineation of the epigraph and the main text, given that several instances lacked a clear structural separation between these sections;
-
Elimination of duplicated notes found at the end of maxims, which were deemed irrelevant due to versioning management;
-
Replacement of apostrophes in vowels with correct accents by applying UTF-8 encoding;
-
Removal of publisher, judge, and reviewer information at the end of the decision;
-
Deletion of backslashes in ruling codes to address encoding errors present in the original JSON files.
On the other hand, certain elements, recognized for their high frequency and factuality role, were intentionally retained: (i) cf., bibliographic citations pointing to external references; (ii) artt., legal jargon signifying the citation of multiple articles; (iii) personal names, except for publisher, judge, and reviewer.
Train-Test Split Following prior work (Cripwell et al. 2023), we employ a dataset split size of 90-5-5 to ensure sufficient training data while allowing for adequate validation and testing. Therefore, the dataset was divided into train (90%, 12,600 samples), validation (5%, 700), and test (5%, 700) sets. We carried out a proportional stratified random sampling without replacement, considering the categorization and lengths of the sources. To be precise, we (i) evenly distributed the orders and judgments in the splits to have the same percentage of each type in each split and (ii) divided them equally based on their lengths (tertiles are calculated to assign \(\{\text {short}, \text {medium}, \text {long}\}\) classes). Table 2 shows the equal distribution of documents among the splits, specifying fine-grained statistics about the number of words within the three source sections (i.e., epigraph, text, and decision).
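The stratified split above can be sketched as follows. The tertile computation, the per-stratum rounding, and the `stratified_split` helper are our assumptions for illustration, not the released code:

```python
import random

def length_class(n_words, t1, t2):
    """Map a source length to {short, medium, long} via tertile boundaries."""
    return "short" if n_words <= t1 else "medium" if n_words <= t2 else "long"

def stratified_split(docs, seed=42):
    """docs: list of (doc_id, category, n_words) with category in
    {order, judgment}. Each (category, length-class) stratum is split
    90-5-5 into train/validation/test."""
    lengths = sorted(n for _, _, n in docs)
    t1, t2 = lengths[len(lengths) // 3], lengths[2 * len(lengths) // 3]
    strata = {}
    for doc_id, cat, n in docs:
        strata.setdefault((cat, length_class(n, t1, t2)), []).append(doc_id)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for ids in strata.values():
        rng.shuffle(ids)
        n_eval = max(1, round(0.05 * len(ids)))  # 5% each for val and test
        val.extend(ids[:n_eval])
        test.extend(ids[n_eval:2 * n_eval])
        train.extend(ids[2 * n_eval:])
    return train, val, test
```

Sampling without replacement inside each stratum guarantees the splits are disjoint while preserving the category and length proportions.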
3.2 Dataset characterization
Table 1 offers a comparative analysis of key statistics between LAWSUIT and other relevant text summarization datasets. Concretely, we present corpus sizes and the average number of words and sentences in both source documents and target summaries, calculated using the NLTK library (Bird 2006). Additionally, we furnish information on the average coverage, density, and compression ratio of extractive fragments in terms of words and sentences, as defined by Grusky et al. (2018). In particular, LAWSUIT exhibits longer source texts and target summaries than existing datasets, except for GovReport, where the targets contain more source-related tokens, indicating greater coverage. Moreover, we observe a slightly smaller frequency of vocabulary words w.r.t. corpora with a higher number of documents, suggesting that while our dataset is more concise, it still captures the essential linguistic diversity, maintaining a robust and representative vocabulary distribution. In terms of legal contributions, it is noteworthy that LAWSUIT represents the first dataset composed exclusively of Italian documents, distinguishing it from multilingual datasets that include only limited subsets of Italian texts.
Summary Abstractiveness Compared to previous contributions, we observe that the summaries within LAWSUIT exhibit substantial coverage (0.92).Footnote 7 This implies that the target generations contain fewer unsupported entities and facts, ensuring faithfulness while mitigating the risk of hallucinations, an imperative consideration in legal applications. Simultaneously, we note that the density, which represents the average length of the extractive fragments, is the highest among the datasets, suggesting that the summaries in LAWSUIT might have an extractive nature. To dispel this assumption, following established methodologies (See et al. 2017; Chen and Bansal 2018; Sharma et al. 2019), we compute abstractiveness as the fraction of novel n-grams in the summary that do not appear in the input source. Figure 3 illustrates the percentage of novel sentences and n-grams with \(n \in [1,10]\), indicating that many summary details are not verbatim extractions from sources but rather abstractive, despite the high density.
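The novel n-gram computation follows the standard definition from the works cited above; this sketch operates on unique n-grams and whitespace tokens, which may differ slightly from the paper's exact tokenization:

```python
def ngrams(tokens, n):
    """Set of unique n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_fraction(source, summary, n):
    """Fraction of unique summary n-grams absent from the source: higher
    values indicate a more abstractive summary."""
    tgt_ngrams = ngrams(summary.lower().split(), n)
    if not tgt_ngrams:
        return 0.0
    return len(tgt_ngrams - ngrams(source.lower().split(), n)) / len(tgt_ngrams)
```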
Coverage increment and section informativeness Given the obstacles presented by lengthy sources in identifying salient content for inclusion in the summary, we examine the coverage increment of summary-worthy unigrams in the input. To achieve this, we divide each source into ten equal partitions, according to Huang et al. (2021). Specifically, we count the number of unique unigrams that also appear in the target, accumulated from the document’s start to the end of each partition. Figure 4 illustrates that relevant information is spread throughout documents, with novel salient unigrams being covered more uniformly as more content is consumed. This means that LAWSUIT exhibits less positional bias and requires a comprehensive reading of the entire input. To further elucidate this aspect, we break down the informativeness across the three sections (i.e., epigraph, text, decision) by computing the percentage of unique salient unigrams occurring in each text span. Figure 5 demonstrates that the core content of a summary is generally concentrated in the text section of the ruling to which it refers. However, through a deeper qualitative investigation (Appendix 10), we discover that the epigraph and the decision are essential at both ends of the generation, where the maxim is likely to mention references and final court judgments, briefly rephrasing and aggregating them.
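The coverage-increment analysis can be sketched as follows, assuming equal token-level partitions and whitespace tokenization (the original implementation may differ in these details):

```python
def coverage_increment(source_tokens, summary_tokens, n_partitions=10):
    """Cumulative count of unique salient unigrams (source tokens that also
    occur in the summary) from the document start to the end of each of
    n_partitions equal slices."""
    salient = set(summary_tokens)
    size = len(source_tokens) / n_partitions
    seen, counts = set(), []
    for p in range(1, n_partitions + 1):
        chunk = source_tokens[round((p - 1) * size):round(p * size)]
        seen.update(t for t in chunk if t in salient)
        counts.append(len(seen))
    return counts
```

A near-linear growth of these counts across partitions is precisely the uniform distribution of salient content discussed above; a curve that plateaus early would instead signal lead bias.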
Summary Formulaicness Legal summaries often incorporate common expressions and shared standard structures, enabling models to learn patterns during training without a deep understanding of the input. To quantify this phenomenon, we analyze the formulaicness of summaries in the training set by calculating the longest common subsequence (LCS) (Lin 2004). Technically, we compute the LCS for each subset by taking 5 non-overlapping subsets of 100 random samples.Footnote 8 Figure 6 highlights that summaries in LAWSUIT have a lower occurrence of structural patterns across targets than related English legal datasets, especially BillSum, despite the latter having shorter summaries. In fact, the longer the summaries, the higher the chance that words overlap.
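The LCS statistic underlying the formulaicness analysis is the classic dynamic-programming recurrence over token sequences:

```python
def lcs_length(a, b):
    """Longest common (non-contiguous) subsequence length of two token
    sequences, via the standard dynamic-programming recurrence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]
```

Averaging `lcs_length` over pairs of summaries within each random subset yields the per-subset formulaicness estimate.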
4 Experiments
Our goal with LAWSUIT is to establish a novel and challenging benchmark to advance Legal NLP in real-world applications. Therefore, our experiments with LAWSUIT delve into two research questions.
-
RQ1: can current models effectively summarize Italian legal verdicts to support legal practitioners and automate downstream applications?
-
RQ2: given the high cost of human annotation for creating labeled examples, can models be configured to produce useful summaries in real-world scenarios with only a handful of training instances?
To answer them, we set up the following tasks.
-
Full summarization: this involves training models with the entire set of available instances in LAWSUIT, totaling 12,600 samples.
-
Few-shot summarization: this simulates a scenario marked by data scarcity for model supervision due to the high cost of labeling. To replicate this setting, models are provided with only the first 10 and 100 training samples, aligning with previous works (Zhang et al. 2020; Chen and Shuai 2021; Moro and Ragazzi 2022).Footnote 9
4.1 Models
We investigate the performance of multiple extractive and abstractive solutions on LAWSUIT.
Extractive Baselines For upper-bound performance, we consider an oracle: Oracle-Opt selects, for each of the k gold summary sentences—extracted with the NLTK library—the input sentence that maximizes the average ROUGE-{1,2,L} F1 score. LexRank-PLM is a graph-based unsupervised extractive summarizer that leverages LexRank’s eigenvector centrality (Erkan and Radev 2004) and a pretrained language model (paraphrase-multilingual-MiniLM-L12-v2) to enhance sentence representation during text encoding. Epi, Text, and Dec select the first n sentences from the epigraph, text, and decision, respectively. Cat concatenates the first \(\nicefrac {n}{3}\) sentences from the three sections, maintaining the occurrence order in the source document.Footnote 10
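A minimal sketch of the Oracle-Opt selection, using ROUGE-1 F1 only with clipped unigram counts (the paper averages ROUGE-1/2/L; the helper names are ours):

```python
from collections import Counter

def rouge1_f1(pred_tokens, ref_tokens):
    """Unigram-overlap F1 with clipped counts (ROUGE-1)."""
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if not overlap:
        return 0.0
    p, r = overlap / len(pred_tokens), overlap / len(ref_tokens)
    return 2 * p * r / (p + r)

def oracle_opt(source_sentences, summary_sentences):
    """For each gold summary sentence, select the source sentence with the
    highest ROUGE-1 F1 score."""
    return [max(source_sentences,
                key=lambda s: rouge1_f1(s.split(), gold.split()))
            for gold in summary_sentences]
```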
Abstractive Baselines mBART (Liu et al. 2020; Tang et al. 2020) is a sequence-to-sequence model largely pretrained on multiple languages using BART’s denoising objective (Lewis et al. 2020); it can process inputs up to 1024 tokens.Footnote 11 IT5 (Sarti and Nissim 2022) is a text-to-text model centered on T5 (Raffel et al. 2020) and pretrained on Italian corpora; it is unbounded in the input dimension thanks to its positional embedding mechanism. mT5 (Xue et al. 2021) is a T5-based model pretrained on multiple languages. We employ the small (s), base (b), and large (l) model checkpoints (see Table 8 for technical details).
Segmentation-based Pipeline Inspired by the necessity of (i) comprehensively processing the entire input source without overlooking details, (ii) minimizing the risk of model hallucination through careful consideration of small, highly correlated source–target pairs, and (iii) generating precise summaries in scenarios with limited data availability, we introduce a straightforward yet powerful language-agnostic segmentation-based approach. Let \(\mathcal {D}=\{d_1, \dots , d_n\}\) be the long input document, where each \(d_i\) is a sentence; this solution divides \(\mathcal {D}\) into non-overlapping chunks (i.e., sets of consecutive sentences), each containing a maximum of \(\mathcal {M}\) tokens. Specifically, we start with an empty chunk c and iteratively add sentences until \(\mathcal {M}\) is reached. To train our solution, we assign each summary sentence—selected with NLTK—to the chunk that maximizes the ROUGE-1 precision metric—creating small, highly correlated training pairs (\(c_i, t_i\))—as defined by Moro and Ragazzi (2022). On the other hand, at inference time, the chunks are summarized, and their predictions are concatenated in the order of occurrence in the source document to produce the final summary. We refer to this approach as SegSumm, depicted in Fig. 7.
This pipeline is related to but differs from Moro and Ragazzi (2022) because the segmentation is model-agnostic—and thus language-agnostic—making it applicable to multiple languages, including Italian (see Sect. 4.4.2 for experiments on English legal texts).
Note: when small values of \(\mathcal {M}\) are used, the document is divided into multiple chunks. Consequently, if the number of summary sentences is fewer than the number of chunks, the chunks without corresponding target sentences are discarded during the training process. In other words, the above target-matching algorithm does not ensure \(t_i \ne \emptyset \), which is evident if the number of chunks is greater than the number of target sentences. However, the summaries in LAWSUIT have, on average, more sentences (see Table 1) than the hypothetical number of source chunks.
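The SegSumm chunking and target-matching procedure can be sketched as follows. The greedy segmentation, the reading of ROUGE-1 precision as target-unigram coverage, and the helper names are our assumptions for illustration, not the authors' exact implementation:

```python
from collections import Counter

def target_coverage(chunk_tokens, target_tokens):
    """Fraction of the summary-sentence unigrams covered by the chunk
    (one reading of the ROUGE-1 precision criterion)."""
    overlap = sum((Counter(chunk_tokens) & Counter(target_tokens)).values())
    return overlap / len(target_tokens) if target_tokens else 0.0

def make_chunks(sentences, max_tokens):
    """Greedy segmentation: extend the current chunk until adding the next
    sentence would exceed max_tokens."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(current)
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(current)
    return chunks

def build_training_pairs(source_sentences, summary_sentences, max_tokens):
    """Assign each summary sentence to its best-matching chunk; chunks left
    without any target sentence are discarded."""
    chunks = make_chunks(source_sentences, max_tokens)
    targets = [[] for _ in chunks]
    for gold in summary_sentences:
        gold_tok = gold.split()
        best = max(range(len(chunks)),
                   key=lambda i: target_coverage(" ".join(chunks[i]).split(),
                                                 gold_tok))
        targets[best].append(gold)
    return [(" ".join(c), " ".join(t)) for c, t in zip(chunks, targets) if t]
```

At inference time, each chunk is fed to the summarizer and the per-chunk predictions are concatenated in source order, so no target matching is needed.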
4.2 Implementation and hardware
For abstractive summarizers, we fine-tune the models using the PyTorch (Paszke et al. 2019) implementation from the HuggingFace library (Wolf et al. 2019), leveraging publicly available checkpoints. The models are trained for 3 epochs on a single NVIDIA GeForce RTX 3090 GPU (24GB VRAM) from an internal cluster, with a learning rate of 5e-5. In the decoding process, we apply beam search with 4 beams and n-gram repetition blocks for n>5, using 1024 as the maximum summary length. The seed is fixed at 42 for reproducibility. Additional details are available in Appendix 7.
4.3 Evaluation setup
To provide a comprehensive evaluation, we conduct both quantitative and qualitative analyses with automatic metrics and human annotators, respectively.
Automatic ROUGE-{1,2,L} F1 (Lin 2004) and BERTScore F1 (BS) (Zhang et al. 2020) are used to calculate the lexical overlap and estimated semantic overlap between the generated and the gold summaries, respectively. For BS, we use the bert-base-multilingual-cased model and set rescale_with_baseline=True.
Human Given the potential failure of automatic metrics to act as reliable proxies for summary quality dimensions, we perform an in-depth human evaluation. Following previous work (Narayan et al. 2018; Fabbri et al. 2019; Moro et al. 2023), we use Best-Worst Scaling (Louviere and Woodworth 1991; Louviere et al. 2015), which is more trustworthy and less expensive than rating scales (Kiritchenko and Mohammad 2017). Specifically, we provide 3 legal expert evaluators with the source document and the artificial summaries from the best-performing models. We ask them to rank predictions according to informativeness, fluency, and factuality. The assessment is done on 30 randomly selected documents from LAWSUIT’s test set by comparing all the possible summary pair combinations, i.e., 90 binary preference annotations per participant. We randomize the order of pairs and per-example sources to guard the rating against being gamed. Elicited labels are used to establish a score in \([-1,1]\) for each summary source s: \(\%_{best}(s) - \%_{worst}(s)\). The annotation process takes \(\approx \)6 h per judge, 18 in total. Appendix 9 illustrates our setup.
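The Best-Worst Scaling score can be sketched as follows, assuming each binary annotation is reduced to a (preferred, dispreferred) pair of system names; `bws_scores` is an illustrative helper:

```python
def bws_scores(annotations, systems):
    """Best-Worst Scaling: score(s) = %best(s) - %worst(s), computed over
    all pairwise comparisons in which system s appears; each annotation is
    a (preferred, dispreferred) pair of system names."""
    appearances = {s: 0 for s in systems}
    best = dict(appearances)   # per-system win counts
    worst = dict(appearances)  # per-system loss counts
    for winner, loser in annotations:
        appearances[winner] += 1
        appearances[loser] += 1
        best[winner] += 1
        worst[loser] += 1
    return {s: (best[s] - worst[s]) / appearances[s] if appearances[s] else 0.0
            for s in systems}
```

A system preferred in every comparison scores 1, one dispreferred in every comparison scores -1, matching the \([-1,1]\) range above.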
4.4 Results and discussion
4.4.1 Dataset
Italian Legal Ruling Summarization Table 3 presents the performance of each baseline in LAWSUIT, where the summarizers are tasked with extracting and synthesizing crucial information from lengthy sources, utilizing varying numbers of training samples (Table 4). Table 5 presents the transfer learning performance of IT5-small when trained and tested across orders and judgments. In terms of abstractive summarizers, models that allow long inputs (IT5) perform better than input-constrained models (mBART) on all tasks, underscoring the utility of an extensive input context. Longer input also brings consistent performance gains for IT5 across tasks. Interestingly, SegSumm significantly exceeds the baselines (p-value \(< 0.05\) with a Student's t-test) in full and few-shot summarization. Human evaluation results are reported in Table 6. SegSumm is rated the best in all dimensions. These findings demonstrate that existing language models can effectively support Italian legal summarization, particularly when equipped with segmentation capabilities (RQ1). Indeed, text segmentation allows the model to process the entire document without truncating information that exceeds the maximum input size permitted by its architecture. In plausible few-shot scenarios, SegSumm emerges as the sole model offering satisfactory effectiveness (RQ2). We provide examples of summaries generated under few-shot and full-data training in Appendix 12.
Generating summaries from sections To further explore the importance of reading the entire input source in LAWSUIT, we train summarizers on individual sections (i.e., epigraph, text, decision) to generate the summary.
As shown in Table 4, the model trained on the three concatenated sections reveals significant improvements compared to processing only the epigraph and the decision. As the text section is longer, models processing only that part are marginally less efficient. However, this analysis indicates that all the source sections are sufficiently informative to produce a comprehensive summary. This further underscores the importance of avoiding the truncation of longer texts due to context limitations and instead leveraging segmentation-based approaches.
4.4.2 Method
Generality of SegSumm Since the SegSumm approach is language independent, we assess its generality by testing whether it can also improve legal applications in other languages. Specifically, we experiment with the BillSum dataset under low-resource conditions (Moro et al. 2023a, b, c), simulating a real-world legal scenario. We compare with existing solutions focused on few-shot summarization: Pegasus (Zhang et al. 2020), a transformer-based model pretrained with a summarization-specific objective that allows fast adaptation from few labeled samples; MTL-ABS (Chen and Shuai 2021), a meta-transfer learning approach that augments training data with similar corpora; Se3 (Moro and Ragazzi 2022), a segmentation-based solution equipped with metric learning; Athena (Moro and Ragazzi 2023), a segmentation-based model with a dynamically learned chunk size; and LW-ML (Huh and Ko 2022), a meta-learning algorithm that inserts a lightweight module into the attention mechanism of a pretrained language model. As for our solution, we run SegSumm on top of Bart-base (Lewis et al. 2020). Table 7 shows that SegSumm largely outperforms previous models, confirming the usefulness of text segmentation for legal texts.
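SegSumm's full pipeline is not reproduced here; the following is a minimal sketch of the core idea behind segmentation-based summarization: pack a long source into chunks that fit the encoder budget, summarize each chunk, and concatenate the partial summaries. The `summarize` callable and the token budget are illustrative assumptions, not the paper's implementation.

```python
def split_into_chunks(sentences, max_tokens):
    """Greedily pack consecutive sentences into chunks under a token budget."""
    chunks, current, size = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk when adding this sentence would overflow it;
        # a single over-budget sentence still forms its own chunk.
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sent)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_long(sentences, summarize, max_tokens=512):
    """Summarize each chunk independently and join the partial summaries."""
    return " ".join(summarize(c) for c in split_into_chunks(sentences, max_tokens))

# Toy usage: a stub "summarizer" that keeps the first three tokens of a chunk.
first_words = lambda chunk: " ".join(chunk.split()[:3])
doc = ["First sentence of the ruling.", "Second sentence.", "Third sentence here."]
print(summarize_long(doc, first_words, max_tokens=6))
```

In a real pipeline, `summarize` would invoke a pretrained seq2seq model, and chunk boundaries would ideally respect semantic units rather than a raw token count.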
5 Conclusion
In this paper, we introduced LAWSUIT, the first large-scale dataset for the abstractive summarization of long Italian legal verdicts. The challenges presented by LAWSUIT include lengthy sources, the uniform distribution of relevant information throughout the input, and the lower presence of formulaic patterns in the targets. Through an extensive series of experiments, we found that a text segmentation pipeline significantly outperforms other methods in both few-shot and full summarization. We anticipate that LAWSUIT will contribute to the development of real-world legal summarization systems and stimulate research towards effective long-range solutions for Italian legal documents. Future work will extend LAWSUIT to new tasks, such as cross-domain and in-domain ruling classification (Domeniconi et al. 2014a, b, 2015, 2016, 2017; Moro et al. 2018), legal reasoning (Guha et al. 2023; Moro et al. 2024), open-domain question answering (Frisoni et al. 2024), corpus-level knowledge extraction (Frisoni and Moro 2020), and lay summarization (Ragazzi et al. 2024). By representing the source document as a graph (Moro et al. 2023), researchers could explore efficient segmentation and summarization techniques based on graph sparsification (Domeniconi et al. 2014, 2016; Zaheer et al. 2020), eventually using distributed algorithms (Lodi et al. 2010; Cerroni et al. 2015) to handle a large number of nodes and edges.
6 Limitations
As there are no publicly available Italian datasets specifically designed for summarizing long legal documents, we conducted a comparison between LAWSUIT and existing English legal datasets. However, it is crucial to acknowledge that English and Italian differ not only as languages but also in vocabulary and style, potentially introducing linguistic biases into the compared statistics. While SegSumm serves as a baseline, it requires generating at least one sentence for each chunk during inference. Although this is suitable for extensive summaries, such as those found in typical long-document summarization datasets like LAWSUIT and GovReport, it might be less scalable for concise summaries. Regarding the low-resource experiments, our method follows published top-tier work, but we recognize that the sample selection process could significantly impact the final results. Hence, future contributions should explore various subsets of the training set to gain a more comprehensive understanding.
Notes
The dataset is available at: https://disi-unibo-nlp.github.io/publications-site/lawsuit/.
Coverage is defined as the average fraction of token spans that can be jointly identified in both the source and target. For example, a coverage of 0.92 indicates that 92% of the summary words appear in extractive source fragments.
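The coverage statistic above can be computed with a greedy longest-match procedure over shared token spans (in the spirit of Grusky et al. 2018); a minimal sketch over whitespace tokens:

```python
def extractive_fragments(source, summary):
    """Greedily extract the longest token spans shared by summary and source."""
    frags, i = [], 0
    while i < len(summary):
        best = 0
        for j in range(len(source)):
            k = 0
            while (i + k < len(summary) and j + k < len(source)
                   and summary[i + k] == source[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            frags.append(summary[i:i + best])
            i += best
        else:
            i += 1  # token not found in the source
    return frags

def coverage(source_text, summary_text):
    """Fraction of summary tokens that lie inside extractive fragments."""
    src = source_text.lower().split()
    tgt = summary_text.lower().split()
    frags = extractive_fragments(src, tgt)
    return sum(len(f) for f in frags) / len(tgt)

print(coverage(
    "the court rejects the appeal filed by the defendant",
    "the court rejects the appeal",
))  # → 1.0
```

A fully extractive summary yields coverage 1.0; novel words in the summary lower the score, which is why abstractive targets such as LAWSUIT's maxims sit well below 1.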
Random subsets avoid the need for computing all combinations of targets in the dataset, thereby alleviating space and time complexity.
We use the same sample size also for the validation set.
Preliminary experiments led us to prefer mBART over BART-IT, as the latter showed extremely low quality in the legal domain.
We used the same training settings but set 100 and 300 as the minimum and maximum summary lengths and 3 as the no-repeat n-gram size.
References
Aumiller D, Chouhan A, Gertz M (2022) Eur-lex-sum: a multi- and cross-lingual dataset for long-form summarization in the legal domain. In: Goldberg Y, Kozareva Z, Zhang Y (eds.) EMNLP, pp 7626–7639. ACL. https://aclanthology.org/2022.emnlp-main.519
Bacciu A, Campagnano C, Trappolini G, Silvestri F (2024) DanteLLM: let’s push Italian LLM research forward! In: Calzolari N, Kan M-Y, Hoste V, Lenci A, Sakti S, Xue N (eds.) Proceedings of the 2024 Joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), pp 4343–4355. ELRA and ICCL, Torino, Italia. https://aclanthology.org/2024.lrec-main.388
Bacciu A, Trappolini G, Santilli A, Rodolà E, Silvestri F (2023) Fauno: the Italian large language model that will leave you senza parole! In: Nardini FM, Tonellotto N, Faggioli G, Ferrara A (eds) Proceedings of the 13th Italian information retrieval workshop (IIR 2023), Pisa, Italy, June 8–9, 2023. CEUR Workshop Proceedings, vol. 3448, pp 9–17. CEUR-WS.org. https://ceur-ws.org/Vol-3448/paper-24.pdf
Bakker R, van Drie RAN, de Boer M, van Doesburg R, et al. (2022) Semantic role labelling for Dutch law texts. In: LREC, pp 448–457. European Language Resources Association, Marseille, France. https://aclanthology.org/2022.lrec-1.47
Baroni M, Bernardini S, Ferraresi A, Zanchetta E (2009) The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Lang Resour Eval 43(3):209–226. https://doi.org/10.1007/S10579-009-9081-4
Basile P, Musacchio E, Polignano M, Siciliani L, Fiameni G, Semeraro G (2023) Llamantino: Llama 2 models for effective text generation in italian language. arXiv:2312.09993
Bellandi V, Castano S, Ceravolo P, Damiani E, et al. (2022) Knowledge-based legal document retrieval: a case study on Italian civil court decisions. In: EKAW. CEUR Workshop proceedings, vol. 3256. CEUR-WS.org. http://ceur-ws.org/Vol-3256/km4law2.pdf
Bhattacharya P, Poddar S, Rudra K, Ghosh K, et al. (2021) Incorporating domain knowledge for extractive summarization of legal case documents. In: ICAIL, pp 22–31. ACM. https://doi.org/10.1145/3462757.3466092
Bird S (2006) NLTK: the natural language toolkit. In: ACL. The Association for Computer Linguistics. https://doi.org/10.3115/1225403.1225421
Casola S, Lavelli A (2021) WITS: wikipedia for italian text summarization. In: CLiC-it. CEUR workshop proceedings, vol. 3033. CEUR-WS.org. http://ceur-ws.org/Vol-3033/paper65.pdf
Cerroni W, Moro G, Pasolini R, Ramilli M (2015) Decentralized detection of network attacks through P2P data clustering of SNMP data. Comput Secur 52:1–16. https://doi.org/10.1016/J.COSE.2015.03.006
Chalkidis I, Androutsopoulos I, Aletras N (2019) Neural legal judgment prediction in English. In: ACL, pp 4317–4323. ACL, Florence, Italy. https://doi.org/10.18653/v1/P19-1424
Chalkidis I, Androutsopoulos I, Michos A (2018) Obligation and prohibition extraction using hierarchical RNNs. In: ACL, pp 254–259. ACL, Melbourne, Australia. https://doi.org/10.18653/v1/P18-2041
Chalkidis I, Fergadiotis M, Androutsopoulos I (2021) Multieurlex - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In: EMNLP, pp 6974–6996. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.559
Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, et al (2020) LEGAL-BERT: The muppets straight out of law school. In: EMNLP, pp 2898–2904. ACL, Online. https://doi.org/10.18653/v1/2020.findings-emnlp.261
Chalkidis I, Jana A, Hartung D, Bommarito M, et al. (2022) LexGLUE: a benchmark dataset for legal language understanding in English. In: ACL, pp 4310–4330. ACL, Dublin, Ireland https://doi.org/10.18653/v1/2022.acl-long.297
Chen Y-C, Bansal M (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In: ACL, pp 675–686. ACL, Melbourne, Australia. https://doi.org/10.18653/v1/P18-1063
Chen Y, Shuai H (2021) Meta-transfer learning for low-resource abstractive summarization. In: AAAI, pp 12692–12700. AAAI Press. https://ojs.aaai.org/index.php/AAAI/article/view/17503
Cohan A, Dernoncourt F, Kim DS, Bui T, et al. (2018) A discourse-aware attention model for abstractive summarization of long documents. In: NAACL, pp 615–621. ACL, New Orleans, Louisiana. https://doi.org/10.18653/v1/N18-2097
Cripwell L, Legrand J, Gardent, C (2023) Simplicity level estimate (SLE): a learned reference-less metric for sentence simplification. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing, EMNLP 2023, Singapore, December 6–10, 2023, pp 12053–12059. Association for Computational Linguistics. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.739
Croce D, Zelenanska A, Basili R (2018) Neural learning for question answering in italian. In: Ghidini C, Magnini B, Passerini A, Traverso P (eds) AI*IA 2018 - advances in artificial intelligence - XVIIth international conference of the Italian Association for artificial intelligence, Trento, Italy, November 20–23, 2018, proceedings. Lecture Notes in Computer Science, vol. 11298, pp 389–402. Springer. https://doi.org/10.1007/978-3-030-03840-3_29
Domeniconi G, Masseroli M, Moro G, Pinoli P (2016) Cross-organism learning method to discover new gene functionalities. Comput Methods Programs Biomed 126:20–34. https://doi.org/10.1016/J.CMPB.2015.12.002
Domeniconi G, Masseroli M, Moro G, Pinoli P (2014) Discovering new gene functionalities from random perturbations of known gene ontological annotations. In: Fred ALN, Filipe J (eds) KDIR 2014 - Proceedings of the international conference on knowledge discovery and information retrieval, Rome, Italy, 21–24 October, 2014, pp 107–116. SciTePress. https://doi.org/10.5220/0005087801070116
Domeniconi G, Moro G, Pagliarani A, Pasolini R (2015) Markov chain based method for in-domain and cross-domain sentiment classification. In: Fred ALN, Dietz JLG, Aveiro D, Liu K, Filipe J (eds) KDIR 2015 - Proceedings of the international conference on knowledge discovery and information retrieval, part of the 7th international joint conference on knowledge discovery, knowledge engineering and knowledge management (IC3K 2015), Volume 1, Lisbon, Portugal, November 12–14, 2015, pp 127–137. SciTePress. https://doi.org/10.5220/0005636001270137
Domeniconi G, Moro G, Pagliarani A, Pasolini R (2017) On deep learning in cross-domain sentiment classification. In: Fred ALN, Filipe J (eds.) Proceedings of the 9th International joint conference on knowledge discovery, knowledge engineering and knowledge management - (Volume 1), Funchal, Madeira, Portugal, November 1–3, 2017, pp 50–60. SciTePress. https://doi.org/10.5220/0006488100500060
Domeniconi G, Moro G, Pasolini R, Sartori C (2014) Cross-domain text classification through iterative refining of target categories representations. In: Fred ALN, Filipe J (eds) KDIR 2014 - proceedings of the international conference on knowledge discovery and information retrieval, Rome, Italy, 21–24 October, 2014, pp 31–42. SciTePress. https://doi.org/10.5220/0005069400310042
Domeniconi G, Moro G, Pasolini R, Sartori C (2014) Iterative refining of category profiles for nearest centroid cross-domain text classification. In: Fred ALN, Dietz JLG, Aveiro D, Liu K, Filipe J (eds) Knowledge discovery, knowledge engineering and knowledge management - 6th international joint conference, IC3K 2014, Rome, Italy, October 21–24, 2014, Revised Selected Papers. Communications in Computer and Information Science, vol. 553, pp 50–67. Springer. https://doi.org/10.1007/978-3-319-25840-9_4
Domeniconi G, Semertzidis K, López V, Daly EM, Kotoulas S, Moro G (2016) A novel method for unsupervised and supervised conversational message thread detection. In: Francalanci C, Helfert M (eds) DATA 2016 - Proceedings of 5th international conference on data management technologies and applications, Lisbon, Portugal, 24–26 July, 2016, pp 43–54. SciTePress. https://doi.org/10.5220/0006001100430054
Duan X, Zhang Y, Yuan L, Zhou X, et al. (2019) Legal summarization for multi-role debate dialogue via controversy focus mining and multi-task learning. In: CIKM, pp 1361–1370. ACM. https://doi.org/10.1145/3357384.3357940
Elaraby M, Litman D (2022) ArgLegalSumm: improving abstractive summarization of legal documents with argument mining. In: COLING, pp 6187–6194. International Committee on Computational Linguistics, Gyeongju, Republic of Korea. https://aclanthology.org/2022.coling-1.540
Erkan G, Radev DR (2004) Lexrank: graph-based lexical centrality as salience in text summarization. J Artif Intell Res 22:457–479. https://doi.org/10.1613/jair.1523
Fabbri A, Li I, She T, Li S, et al (2019) Multi-news: a large-scale multi-document summarization dataset and abstractive hierarchical model. In: ACL, pp 1074–1084. ACL, Florence, Italy. https://doi.org/10.18653/v1/P19-1102
Farzindar A, Lapalme G (2004) Legal text summarization by exploration of the thematic structure and argumentative roles. In: Text Summarization branches out, pp 27–34. ACL, Barcelona, Spain. https://aclanthology.org/W04-1006
Feng Y, Li C, Ng V (2022) Legal judgment prediction via event extraction with constraints. In: ACL, pp 648–664. ACL, Dublin, Ireland https://doi.org/10.18653/v1/2022.acl-long.48
Frisoni G, Cocchieri A, Presepi A, Moro G, Meng Z (2024) To generate or to retrieve? On the effectiveness of artificial contexts for medical open-domain question answering. arXiv:2403.01924
Frisoni G, Moro G (2020) Phenomena explanation from text: Unsupervised learning of interpretable and statistically significant knowledge. In: Hammoudi S, Quix C, Bernardino J (eds) Data management technologies and applications - 9th international conference, DATA 2020, Virtual Event, July 7–9, 2020, Revised Selected Papers. Communications in Computer and Information Science, vol. 1446, pp 293–318. Springer. https://doi.org/10.1007/978-3-030-83014-4_14
Galli F, Grundler G, Fidelangeli A, Galassi A, et al. (2022) Predicting outcomes of italian VAT decisions. In: JURIX. Frontiers in artificial intelligence and applications, vol. 362, pp 188–193. IOS Press. https://doi.org/10.3233/FAIA220465
Greenleaf G (1995) Public access to law via internet: the Australasian legal information institute. Paper presented at the Asian Pacific Specials, Health and Law Librarians Conference (6th, Sydney). J Law Inf Sci 6(1):49–69
Grover C, Hachey B, Hughson I (2004) The HOLJ corpus. supporting summarisation of legal texts. In: LINC, pp 47–54. COLING, Geneva, Switzerland (2004). https://aclanthology.org/W04-1907
Grusky M, Naaman M, Artzi Y (2018) Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In: NAACL, pp 708–719. ACL, New Orleans, Louisiana. https://doi.org/10.18653/v1/N18-1065
Guha N, Nyarko J, Ho DE, Ré C, Chilton A, K A, Chohlas-Wood A, Peters A, Waldon B, Rockmore DN, Zambrano D, Talisman D, Hoque E, Surani F, Fagan F, Sarfaty G, Dickinson GM, Porat H, Hegland J, Wu J, Nudell J, Niklaus J, Nay JJ, Choi JH, Tobia K, Hagan M, Ma M, Livermore MA, Rasumov-Rahe N, Holzenberger N, Kolt N, Henderson P, Rehaag S, Goel S, Gao S, Williams S, Gandhi S, Zur T, Iyer V, Li Z (2023) Legalbench: a collaboratively built benchmark for measuring legal reasoning in large language models. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (eds) Advances in neural information processing systems 36: annual conference on neural information processing systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html
Guo M, Ainslie J, Uthus D, Ontanon S, et al. (2022) LongT5: efficient text-to-text transformer for long sequences. In: NAACL, pp 724–736. ACL, Seattle, United States . https://doi.org/10.18653/v1/2022.findings-naacl.55
Hendrycks D, Burns C, Chen A, Ball S (2021) CUAD: an expert-annotated NLP dataset for legal contract review. In: NeurIPS. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/6ea9ab1baa0efb9e19094440c317e21b-Abstract-round1.html
Huang L, Cao S, Parulian N, Ji H, et al. (2021) Efficient attentions for long document summarization. In: NAACL, pp 1419–1436. ACL. https://doi.org/10.18653/v1/2021.naacl-main.112
Huang W, Jiang J, Qu Q, Yang M (2020) AILA: a question answering system in the legal domain. In: IJCAI, pp 5258–5260. ijcai.org. https://doi.org/10.24963/ijcai.2020/762
Huh T, Ko Y (2022) Lightweight meta-learning for low-resource abstractive summarization. In: SIGIR, pp. 2629–2633. ACM. https://doi.org/10.1145/3477495.3531908
Hwang W, Lee D, Cho K, Lee H, et al (2022) A multi-task benchmark for korean legal language understanding and judgement prediction. In: NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/d15abd14d5894eebd185b756541d420e-Abstract-Datasets_and_Benchmarks.html
Jain D, Borah MD, Biswas A (2021) Summarization of legal documents: where are we now and the way forward. Comput Sci Rev 40:100388. https://doi.org/10.1016/j.cosrev.2021.100388
Kanapala A, Pal S, Pamula R (2019) Text summarization from legal documents: a survey. Artif Intell Rev 51(3):371–402. https://doi.org/10.1007/s10462-017-9566-2
Katz DM, Hartung D, Gerlach L, Jana A, et al (2023) Natural language processing in the legal domain. arXiv:2302.12039
Kien PM, Nguyen H-T, Bach NX, Tran V, et al. (2020) Answering legal questions by learning neural attentive text representation. In: COLING, pp. 988–998. International Committee on Computational Linguistics, Barcelona, Spain (Online). https://doi.org/10.18653/v1/2020.coling-main.86
Kiritchenko S, Mohammad S (2017) Best-worst scaling more reliable than rating scales: a case study on sentiment intensity annotation. In: ACL, pp 465–470. ACL, Vancouver, Canada. https://doi.org/10.18653/v1/P17-2074
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of machine translation summit X: Papers, MTSummit 2005, Phuket, Thailand, September 13–15, 2005, pp 79–86. https://aclanthology.org/2005.mtsummit-papers.11
Kornilova A, Eidelman V (2019) BillSum: a corpus for automatic summarization of US legislation. In: Proceedings of the 2nd workshop on new frontiers in summarization, pp 48–56. ACL, Hong Kong, China. https://doi.org/10.18653/v1/D19-5406
Ladhak F, Durmus E, Cardie C, McKeown, K (2020) WikiLingua: a new benchmark dataset for cross-lingual abstractive summarization. In: EMNLP, pp 4034–4048. ACL. https://doi.org/10.18653/v1/2020.findings-emnlp.360
Landro N, Gallo I, La Grassa R, Federici E (2022) Two new datasets for Italian-language abstractive text summarization. Information 13(5). https://doi.org/10.3390/info13050228
Lewis M, Liu Y, Goyal N, Ghazvininejad M, et al. (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL, pp 7871–7880. ACL. https://doi.org/10.18653/v1/2020.acl-main.703
Lhoest Q, Villanova del Moral A, Jernite Y, Thakur A, et al (2021) Datasets: a community library for natural language processing. In: EMNLP, pp 175–184. ACL, Online and Punta Cana, Dominican Republic. https://doi.org/10.18653/v1/2021.emnlp-demo.21
Licari D, Comandé G (2022) ITALIAN-LEGAL-BERT: a pre-trained transformer language model for italian law. In: Symeonidou D, Yu R, Ceolin D, Poveda-Villalón M, Audrito D, Caro LD, Grasso F, Nai R, Sulis E, Ekaputra FJ, Kutz O, Troquard N (eds) Companion proceedings of the 23rd international conference on knowledge engineering and knowledge management, Bozen-Bolzano, Italy, September 26–29, 2022. CEUR workshop proceedings, vol. 3256. CEUR-WS.org. https://ceur-ws.org/Vol-3256/km4law3.pdf
Lin C-Y (2004) ROUGE: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81. ACL, Barcelona, Spain. https://aclanthology.org/W04-1013
Liu Y, Gu J, Goyal N, Li X et al (2020) Multilingual denoising pre-training for neural machine translation. TACL 8:726–742. https://doi.org/10.1162/tacl_a_00343
Liu C, Chen K (2019) Extracting the gist of Chinese judgments of the supreme court. In: ICAIL, pp 73–82. ACM. https://doi.org/10.1145/3322640.3326715
Lodi S, Moro G, Sartori C (2010) Distributed data clustering in multi-dimensional peer-to-peer networks. In: Shen HT, Bouguettaya A (eds) Database technologies 2010, twenty-first Australasian database conference (ADC 2010), Brisbane, Australia, 18–22 January, 2010, proceedings. CRPIT, vol. 104, pp 171–178. Australian Computer Society. http://portal.acm.org/citation.cfm?id=1862264
Louviere JJ, Flynn TN, Marley AAJ (2015) Best-worst scaling: theory, methods and applications. Cambridge University Press
Louviere JJ, Woodworth GG (1991) Best-worst scaling: a model for the largest difference judgments. Working paper
Malik V, Sanjay R, Nigam SK, Ghosh K, et al. (2021) ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation. In: ACL, pp 4046–4062. ACL. https://doi.org/10.18653/v1/2021.acl-long.313
Martin L, Muller B, Ortiz Suárez PJ, Dupont Y, Romary L, de la Clergerie É, Seddah D, Sagot B (2020) CamemBERT: a tasty French language model. In: Jurafsky D, Chai J, Schluter N, Tetreault J (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, pp 7203–7219. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.645
Mattei LD, Cafagna M, Dell’Orletta F, Nissim M, Guerini M (2020) Geppetto carves italian into a language model. In: Monti J, Dell’Orletta F, Tamburini F (eds) Proceedings of the Seventh Italian conference on computational linguistics, CLiC-it 2020, Bologna, Italy, March 1–3, 2021. CEUR Workshop Proceedings, vol. 2769. CEUR-WS.org. https://ceur-ws.org/Vol-2769/paper_46.pdf
Maynez J, Narayan S, Bohnet B, McDonald R (2020) On faithfulness and factuality in abstractive summarization. In: ACL, pp 1906–1919. ACL. https://doi.org/10.18653/v1/2020.acl-main.173
Metsker OG, Trofimov E, Grechishcheva S (2019) Natural language processing of russian court decisions for digital indicators mapping for oversight process control efficiency: disobeying a police officer case. In: EGOSE. communications in computer and information science, vol. 1135, pp 295–307. Springer. https://doi.org/10.1007/978-3-030-39296-3_22
Moro G, Ragazzi L (2023) Align-then-abstract representation learning for low-resource summarization. Neurocomputing 548:126356. https://doi.org/10.1016/J.NEUCOM.2023.126356
Moro G, Piscaglia N, Ragazzi L, Italiani P (2023) Multi-language transfer learning for low-resource legal case summarization. Artif Intell Law. https://doi.org/10.1007/s10506-023-09373-8
Moro G, Ragazzi L, Valgimigli L, Frisoni G, Sartori C, Marfia G (2023) Efficient memory-enhanced transformer for long-document summarization in low-resource regimes. Sensors 23(7):3542. https://doi.org/10.3390/S23073542
Moro G, Pagliarani A, Pasolini R, Sartori C (2018) Cross-domain & in-domain sentiment analysis with memory-based deep neural networks. In: Fred ALN, Filipe J (eds) Proceedings of the 10th international joint conference on knowledge discovery, knowledge engineering and knowledge management, IC3K 2018, Volume 1: KDIR, Seville, Spain, September 18–20, 2018, pp 125–136. SciTePress. https://doi.org/10.5220/0007239101270138
Moro G, Ragazzi L (2022) Semantic self-segmentation for abstractive summarization of long documents in low-resource regimes. In: Thirty-sixth AAAI conference on artificial intelligence, AAAI 2022, thirty-fourth conference on innovative applications of artificial intelligence, IAAI 2022, the twelfth symposium on educational advances in artificial intelligence, EAAI 2022, Virtual Event, February 22–March 1, 2022, pp 11085–11093. AAAI Press. https://doi.org/10.1609/AAAI.V36I10.21357
Moro G, Ragazzi L, Valgimigli L (2023) Carburacy: summarization models tuning and comparison in eco-sustainable regimes with a novel carbon-aware accuracy. In: Williams B, Chen Y, Neville J (eds.) Thirty-seventh AAAI conference on artificial intelligence, AAAI 2023, Thirty-fifth conference on innovative applications of artificial intelligence, IAAI 2023, Thirteenth symposium on educational advances in artificial intelligence, EAAI 2023, Washington, DC, USA, February 7–14, 2023, pp 14417–14425. AAAI Press. https://doi.org/10.1609/AAAI.V37I12.26686
Moro G, Ragazzi L, Valgimigli L (2023) Graph-based abstractive summarization of extracted essential knowledge for low-resource scenarios. In: Gal K, Nowé A, Nalepa GJ, Fairstein R, Radulescu R (eds) ECAI 2023 - 26th European conference on artificial intelligence, September 30–October 4, 2023, Kraków, Poland—Including 12th conference on prestigious applications of intelligent systems (PAIS 2023). Frontiers in Artificial Intelligence and Applications, vol. 372, pp 1747–1754. IOS Press. https://doi.org/10.3233/FAIA230460
Moro G, Ragazzi L, Valgimigli L, Freddi D (2022) Discriminative marginalized probabilistic neural method for multi-document summarization of medical literature. In: Muresan S, Nakov P, Villavicencio A (eds) Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp 180–189. Association for Computational Linguistics. https://doi.org/10.18653/V1/2022.ACL-LONG.15
Moro G, Ragazzi L, Valgimigli L, Molfetta L (2023) Retrieve-and-rank end-to-end summarization of biomedical studies. In: Pedreira, O., Estivill-Castro, V (eds) Similarity search and applications - 16th international conference, SISAP 2023, A Coruña, Spain, October 9–11, 2023, proceedings. Lecture Notes in Computer Science, vol. 14289 pp 64–78. Springer. https://doi.org/10.1007/978-3-031-46994-7_6
Moro G, Ragazzi L, Valgimigli L, Vincenzi F, Freddi D (2024) Revelio: Interpretable long-form question answering. In: The second tiny papers track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net. https://openreview.net/pdf?id=fyvEJXsaQf
Narayan S, Cohen SB, Lapata M (2018) Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In: EMNLP, pp 1797–1807. ACL, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1206
Niklaus J, Matoshi V, Rani P, Galassi A, et al. (2023) LEXTREME: a multi-lingual and multi-task benchmark for the legal domain. arXiv:2301.13126
Parisi L, Francia S, Magnani P (2020) UmBERTo: an Italian Language Model trained with Whole Word Masking. GitHub
Paszke A, Gross S, Massa F, Lerer A, et al (2019) Pytorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp 8024–8035. https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
Polsley S, Jhunjhunwala P, Huang R (2016) CaseSummarizer: a system for automated summarization of legal texts. In: COLING, pp 258–262. COLING, Osaka, Japan. https://aclanthology.org/C16-2054
Qin R, Huang M, Luo Y (2022) A comparison study of pre-trained language models for chinese legal document classification. In: ICAIBD, pp 444–449. https://doi.org/10.1109/ICAIBD55127.2022.9820466
La Quatra M, Cagliero L (2023) BART-IT: an efficient sequence-to-sequence model for Italian text summarization. Future Internet 15(1):15. https://doi.org/10.3390/FI15010015
Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C (2023) Direct preference optimization: your language model is secretly a reward model. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (eds) Advances in neural information processing systems 36: annual conference on neural information processing systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html
Raffel C, Shazeer N, Roberts A, Lee K et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21(140):1–67
Ragazzi L, Italiani P, Moro G, Panni M (2024) What are you token about? differentiable perturbed top-\(k\) token selection for scientific document summarization. In: Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, August 11–16, 2024, pp 9427–9440. Association for Computational Linguistics. https://aclanthology.org/2024.findings-acl.561
Ravichander A, Black AW, Wilson S, Norton T, et al. (2019) Question answering for privacy policies: Combining computational and legal perspectives. In: EMNLP-IJCNLP, pp 4947–4958. ACL, Hong Kong, China. https://doi.org/10.18653/v1/D19-1500
Sansone C, Sperlí G (2022) Legal information retrieval systems: state-of-the-art and open issues. Inf Syst 106:101967. https://doi.org/10.1016/j.is.2021.101967
Santilli A, Rodolà E (2023) Camoscio: an Italian instruction-tuned llama. In: Boschetti F, Lebani GE, Magnini B, Novielli N (eds) Proceedings of the 9th Italian conference on computational linguistics, Venice, Italy, November 30–December 2, 2023. CEUR workshop proceedings, vol. 3596. CEUR-WS.org. https://ceur-ws.org/Vol-3596/paper44.pdf
Saravanan M, Ravindran B, Raman S (2006) Improving legal document summarization using graphical models. In: JURIX. Frontiers in artificial intelligence and applications, vol. 152, pp 51–60. IOS Press. http://www.booksonline.iospress.nl/Content/View.aspx?piid=2367
Sarti G, Nissim M (2022) IT5: large-scale text-to-text pretraining for italian language understanding and generation. arXiv:2203.03759
Sarti G, Nissim M (2024) IT5: text-to-text pretraining for italian language understanding and generation. In: Calzolari N, Kan M, Hoste V, Lenci A, Sakti S, Xue N (eds) Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation, LREC/COLING 2024, 20–25 May, 2024, Torino, Italy, pp 9422–9433. ELRA and ICCL. https://aclanthology.org/2024.lrec-main.823
See A, Liu PJ, Manning CD (2017) Get to the point: summarization with pointer-generator networks. In: ACL, pp 1073–1083. ACL, Vancouver, Canada. https://doi.org/10.18653/v1/P17-1099
Sharma G, Sharma D (2023) Automatic text summarization methods: a comprehensive review. SN Comput Sci 4(1):33. https://doi.org/10.1007/s42979-022-01446-w
Sharma E, Li C, Wang L (2019) BIGPATENT: a large-scale dataset for abstractive and coherent summarization. In: ACL, pp 2204–2213. ACL, Florence, Italy. https://doi.org/10.18653/v1/P19-1212
Shen Z, Lo K, Yu L, Dahlberg N, et al (2022) Multi-lexsum: real-world summaries of civil rights lawsuits at multiple granularities. In: NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/552ef803bef9368c29e53c167de34b55-Abstract-Datasets_and_Benchmarks.html
Song P, et al (2023) LaWGPT. https://github.com/pengxiao-song/LaWGPT/tree/main
Tagarelli A, Simeri A (2022) Unsupervised law article mining based on deep pre-trained language representation models with application to the italian civil code. Artif Intell Law 30(3):417–473. https://doi.org/10.1007/s10506-021-09301-8
Tang Y, Tran C, Li X, Chen P, et al (2020) Multilingual translation with extensible multilingual pretraining and finetuning. arXiv:2008.00401
Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, Guestrin C, Liang P, Hashimoto TB (2023) Stanford Alpaca: an instruction-following LLaMA model. GitHub
Tuggener D, von Däniken P, Peetz T, Cieliebak M (2020) LEDGAR: a large-scale multi-label corpus for text classification of legal provisions in contracts. In: LREC, pp 1235–1241. European Language Resources Association, Marseille, France. https://aclanthology.org/2020.lrec-1.155
Wang Z, Wang B, Duan X, Wu D, et al (2019) IFlyLegal: a Chinese legal system for consultation, law searching, and document analysis. In: EMNLP-IJCNLP, pp 97–102. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-3017
Wolf T, Debut L, Sanh V, Chaumond J, et al (2019) Huggingface’s transformers: state-of-the-art natural language processing. arXiv:1910.03771
Xiao C, Hu X, Liu Z, Tu C et al (2021) Lawformer: a pre-trained language model for chinese legal long documents. AI Open 2:79–84. https://doi.org/10.1016/j.aiopen.2021.06.003
Xue L, Constant N, Roberts A, Kale M, Al-Rfou R, Siddhant A, Barua A, Raffel C (2021) mt5: a massively multilingual pre-trained text-to-text transformer. In: Toutanova K, Rumshisky A, Zettlemoyer L, Hakkani-Tür D, Beltagy I, Bethard S, Cotterell R, Chakraborty T, Zhou Y (eds) Proceedings of the 2021 conference of the North American Chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2021, Online, June 6–11, 2021, pp 483–498. Association for Computational Linguistics. https://doi.org/10.18653/V1/2021.NAACL-MAIN.41
Xu C, Guo D, Duan N, McAuley J (2023) Baize: an open-source chat model with parameter-efficient tuning on self-chat data. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing, pp 6268–6278. Association for Computational Linguistics, Singapore. https://doi.org/10.18653/v1/2023.emnlp-main.385
Zaheer M, Guruganesh G, Dubey KA, Ainslie J, Alberti C, Ontañón S, Pham P, Ravula A, Wang Q, Yang L, Ahmed A (2020) Big bird: transformers for longer sequences. In: Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H (eds) Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, Virtual. https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html
Zhang T, Kishore V, Wu F, Weinberger KQ, et al (2020) Bertscore: evaluating text generation with BERT. In: ICLR. OpenReview.net. https://openreview.net/forum?id=SkeHuCVFDr
Zhang J, Zhao Y, Saleh M, Liu PJ (2020) PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: ICML. Proceedings of machine learning research, vol. 119, pp 11328–11339. PMLR. http://proceedings.mlr.press/v119/zhang20ae.html
Zhang M, Zhou G, Yu W, Huang N, et al. (2022) A comprehensive survey of abstractive text summarization based on deep learning. Comput Intell Neurosci 2022
Zheng L, Guha N, Anderson BR, Henderson P, et al (2021) When does pretraining help? Assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings. In: ICAIL, pp 159–168. ACM. https://doi.org/10.1145/3462757.3466088
Zhong H, Xiao C, Tu C, Zhang T, et al. (2020) JEC-QA: a legal-domain question answering dataset. In: AAAI, pp 9701–9708. AAAI Press. https://ojs.aaai.org/index.php/AAAI/article/view/6519
Zhong L, Zhong Z, Zhao Z, Wang S, et al. (2019) Automatic summarization of legal decisions using iterative masking of predictive sentences. In: ICAIL, pp 163–172. ACM. https://doi.org/10.1145/3322640.3326728
Acknowledgements
This research is partially supported by (i) Artificial Intelligence for Public Administration Connected (AI-PACT): https://disi-unibo-nlp.github.io/projects/aipact/, (ii) the Complementary National Plan PNC-I.1, “Research initiatives for innovative technologies and pathways in the health and welfare sector” D.D. 931 of 06/06/2022, DARE—DigitAl lifelong pRevEntion initiative, code PNC0000002, CUP B53C22006450001, (iii) the PNRR, M4C2, FAIR—Future Artificial Intelligence Research, Spoke 8 “Pervasive AI,” funded by the European Commission under the NextGeneration EU program. We thank the Maggioli Group (https://www.maggioli.com/who-we-are/company-profile) for granting the Ph.D. scholarship to Luca Ragazzi from November 2020 to January 2024.
Funding
Open access funding provided by Alma Mater Studiorum - Università di Bologna within the CRUI-CARE Agreement.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Ethical approval
The data used to create LAWSUIT are released under a license that grants us the right to transform the material and share the dataset publicly.Footnote 13 Regarding the experimental methods, given the high societal impact of legislation, experts should verify the quality of the inferred summaries before the proposed solutions are deployed in real-world applications.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: LAWSUIT release
Accessing The dataset files are stored in JSON format and will be uploaded to Google Drive and GitHub upon acceptance. We will also integrate our dataset into the HuggingFace Datasets library (Lhoest et al. 2021).
License LAWSUIT is distributed under the CC-BY-SA 3.0 IT license, while the sources and summaries are already in the public domain. The authors assume all responsibility in the event of a breach of rights and accept the dataset licenses.
Maintenance The authors intend to provide long-term support for LAWSUIT, monitor usage, and produce necessary updates.
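Because the release is plain JSON, a split can be loaded with the standard library alone. The sketch below is illustrative only: the file name and the `source`/`summary` field names are assumptions, since the released schema is not specified here.

```python
import json

def load_split(path):
    """Load a LAWSUIT split from a JSON file.

    Returns a list of (verdict_text, maxim_summary) pairs.
    NOTE: the "source"/"summary" keys are assumed field names;
    consult the released schema for the actual keys.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [(r["source"], r["summary"]) for r in records]

# Example with an in-memory record instead of a file on disk:
example = [{"source": "Testo della sentenza...", "summary": "Massima..."}]
pairs = [(r["source"], r["summary"]) for r in example]
```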
Appendix B: Implementation details
We trained abstractive summarization models using the Adam optimizer with \(\beta _1=0.9\) and \(\beta _2=0.99\), setting the learning rate to 5e-5 with linear scheduling. We evaluated performance on the validation set at the end of each epoch, using only the first 100 samples to save time. We then evaluated on the test set the checkpoint that performed best on the validation set. Table 8 lists the model checkpoints used for pretrained models. Table 9 reports the batch size used during training.
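The optimization setup above can be sketched in pure Python. The Adam betas and the peak learning rate match the text; the warmup and total step counts are illustrative assumptions, not values from the paper.

```python
# Adam hyperparameters stated in the text.
ADAM_CONFIG = {"betas": (0.9, 0.99), "lr": 5e-5}

def linear_schedule(step, total_steps, peak_lr=5e-5, warmup_steps=0):
    """Linearly decay the learning rate from peak_lr to 0 over training.

    An optional linear warmup is included for completeness; the paper
    does not state whether warmup was used.
    """
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Example: learning rate halfway through 10,000 steps with no warmup.
lr_mid = linear_schedule(step=5_000, total_steps=10_000)
```

In practice this corresponds to combining `torch.optim.Adam` (or AdamW) with a linear decay scheduler such as the one provided by the HuggingFace Transformers library.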
Appendix C: Insights on extractive results
Table 10 reports results with extractive baselines by varying the number of extracted sentences.
Appendix D: Human evaluation
The interface with human evaluation instructions is sketched in Fig. 8.
Appendix E: The role of epigraph and decision
Table 11 shows how the gold summaries compress the ruling epigraph and the decision information.
Appendix F: Data example
Figure 9 reports the original Italian-language LAWSUIT version of the example depicted in Fig. 1.
Appendix G: Examples of generated summaries
Tables 12 and 13 present qualitative examples from two distinct instances within the LAWSUIT test set. In particular, we provide the summaries generated by the best-performing baselines, IT5-small-8192 and IT5-small-512-SegSumm, with varying numbers of training examples.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ragazzi, L., Moro, G., Guidi, S. et al. LAWSUIT: a LArge expert-Written SUmmarization dataset of ITalian constitutional court verdicts. Artif Intell Law (2024). https://doi.org/10.1007/s10506-024-09414-w
DOI: https://doi.org/10.1007/s10506-024-09414-w