1 Introduction

Text summarization (Sharma and Sharma 2023) is a persistent pursuit of natural language processing (NLP). Recently, there has been a growing interest in abstractive summarization (AS), which involves paraphrasing the essential details of textual documents in succinct and accessible language (Zhang et al. 2022). This surge in interest is primarily attributed to the availability of large pretrained language models (Lewis et al. 2020; Guo et al. 2022; Moro et al. 2022) and publicly accessible datasets spanning various domains (Cohan et al. 2018; Narayan et al. 2018). One particularly impactful domain in real-world applications is law, where documents often consist of thousands of words filled with jargon and intricate expressions. The complexity of these documents makes their comprehension a time-consuming and labor-intensive process, even for legal experts (Kanapala et al. 2019). Therefore, legal AS (Moro et al. 2023) is a practical, useful, and essential task to promote knowledge acquisition. Lamentably, current legal summarization corpora are almost entirely devoted to English. There are as yet no Italian datasets for legal AS, which limits research on, access to, and analysis of legal texts and their implications for Italian law practitioners.

To fill this gap, we present the first large-scale Italian legal AS dataset, LAWSUIT,Footnote 1 consisting of 14,000 source documents with expert-authored summaries (Fig. 1). LAWSUIT allows the community to study the AS of legal verdicts in a critical application setting found in the Constitutional Court of the Italian Republic (CCIR). As the highest court in Italy for constitutional law matters, the CCIR maintains a comprehensive record of legal verdicts, accessible through an open-access data portal (https://dati.cortecostituzionale.it). In particular, highly qualified legal experts meticulously crafted and reviewed each ruling and the accompanying maxims, i.e., the synopses that clarify the events and core decisions. Beyond its potential to expand summarization capabilities, benefiting legal NLP benchmarks and tangible uses, LAWSUIT boasts several key features:

  • The average number of source and target words is significantly higher than in existing Italian summarization datasets (+269% and +589%, respectively) (Casola and Lavelli 2021), encouraging long document AS for Italian.

  • In contrast to existing English legal benchmarks (Kornilova and Eidelman 2019; Huang et al. 2021), the salient content in the input is more uniformly distributed, and summary-worthy words are not concentrated in specific sections of the text. This characteristic poses a unique challenge for summarization tasks, requiring comprehensive processing of the entire source document rather than relying on localized content.

  • Unlike many summarization datasets that undergo automatic construction processes (Cohan et al. 2018; Grusky et al. 2018; Sharma et al. 2019; Huang et al. 2021), our inputs and targets are authored by experts. Specifically, university law professors and magistrates are responsible for drafting the verdicts, and the corresponding maxims are compiled by the supervisory office. The supervisory office oversees the formal control of the texts in collaboration with the study assistants of the President. This meticulous procedure ensures a high level of quality control and supervision, mitigating the risk of model hallucination (Maynez et al. 2020), which refers to the generation of unfaithful outputs due to training on targets that contain facts that are not supported by the source text.

We benchmark LAWSUIT using various extractive and abstractive summarization solutions, including a segmentation-based pipeline that demonstrates superior performance in both full and few-shot summarization scenarios, namely training models with all or just a few dozen instances.

Fig. 1: Sample legal ruling in LAWSUIT (English translated). The input comprises three sections: epigraph, text, and decision. The original Italian version is given in Appendix 10.

2 Related work

Natural Language Processing for Legal Texts Legal NLP has been the subject of extensive research in various legal tasks, including information retrieval (Chalkidis et al. 2018; Hendrycks et al. 2021; Sansone and Sperlí 2022), question answering (Ravichander et al. 2019; Huang et al. 2020; Kien et al. 2020; Zhong et al. 2020), text classification (Chalkidis et al. 2019; Tuggener et al. 2020; Chalkidis et al. 2021, 2022; Feng et al. 2022), and automatic text summarization (Duan et al. 2019; Zhong et al. 2019; Bhattacharya et al. 2021; Elaraby and Litman 2022; Moro and Ragazzi 2022; Moro et al. 2023). Moreover, recent endeavors have increasingly shifted towards non-English applications (Metsker et al. 2019; Wang et al. 2019; Malik et al. 2021; Xiao et al. 2021; Bakker et al. 2022; Qin et al. 2022; Niklaus et al. 2023), including Italian (Bellandi et al. 2022; Galli et al. 2022; Licari and Comandé 2022; Tagarelli and Simeri 2022), thus stimulating research in low-resource language contexts. These studies focus on fetching past court decisions and predicting outcomes. To the best of our knowledge, we pioneer the exploration of Italian legal document summarization grounded in a non-common law system. This is achieved by releasing the first large-scale legal abstractive summarization dataset derived from the CCIR.

Legal Document Summarization Previous studies on automatic summarization of court proceedings have mainly relied on extractive approaches, where the predicted summary consists of exact sentences taken directly from the source material. This ranges from unsupervised methods (Farzindar and Lapalme 2004; Saravanan et al. 2006; Polsley et al. 2016; Zhong et al. 2019) to supervised methods (Liu and Chen 2019). In contrast, our work is centered on AS, where the output is a rewording of the input. Abstraction is more closely aligned with the actual conditions of legal practices (Kornilova and Eidelman 2019; Sharma et al. 2019; Huang et al. 2021; Shen et al. 2022).

Legal Summarization Datasets Given the crucial social role of the legal domain and the growing demand for summarization tools (Jain et al. 2021), numerous datasets have been introduced, covering various types of documents. These include case reports (Greenleaf et al. 1995), judgments (Grover et al. 2004), legislative bills (Kornilova and Eidelman 2019), patents (Sharma et al. 2019), government reports (Huang et al. 2021), and federal civil rights lawsuits (Shen et al. 2022). This diversity has enabled the development of large language models pretrained on legal text (Chalkidis et al. 2020; Zheng et al. 2021). Our dataset presents unique challenges, as it consists of lengthy domain-specialized documents that are inherently difficult to summarize. Challenges arise due to (i) the scattered distribution of summary-worthy information throughout the input and (ii) the occasional presence of formulaic expressions in the targets. Previous works have introduced Italian summarization datasets featuring short documents, such as those in the news (Landro et al. 2022) and articles related to Wikipedia (Ladhak et al. 2020; Casola and Lavelli 2021). Instead, LAWSUIT comprises longer texts (refer to Table 1), establishing itself as the first dataset for the Italian long document summarization task. Notably, the dataset includes gold summaries, diverging from the cases where summaries are automatically generated using the first sentence (Ladhak et al. 2020) or by concatenating the title with a description (Landro et al. 2022), a procedure that can compromise the factual consistency of models trained on such data (Maynez et al. 2020). In terms of legal contributions, LAWSUIT establishes the first large-scale Italian legal resource, distinguishing itself from smaller datasets (Aumiller et al. 2022) and those designed exclusively for extractive summarization (Licari and Comandé 2022).

Italian Legal Language Models Since 2017, legal text analysis has been revolutionized by transformer-based architectures. Despite these advancements, accurately training machines to understand legal language remains a significant challenge. Legal language models, often benefiting from specialized pretraining (Chalkidis et al. 2020), currently achieve state-of-the-art results on various benchmarks (Zheng et al. 2021; Chalkidis et al. 2022). However, public generative models pretrained on legal corpora are scarce, forcing reliance on general models instead (Hwang et al. 2022; Shen et al. 2022). An extensive literature review (Katz et al. 2023) shows that English dominates open-source Legal NLP (56%), followed by Chinese (\(\approx \)10%), with models usually requiring extensive training hardware (Song et al. 2023). The main challenge in applying current language models to Italian documents is their limited training on Italian, which hampers their ability to comprehend instructions in that language.

Some contributions have explored Italian encoder-only models. UmBERTo (110 M) (Parisi et al. 2020) is the result of continual pretraining on top of RoBERTa using whole-word masking with filtered resources from Wikipedia and CommonCrawl. In the legal domain, Licari and Comandé introduced Italian-Legal-BERT (111 M) (Licari and Comandé 2022), which continues the pretraining of a general-domain Italian BERT model on civil law corpora and also pretrains a CamemBERT-based model (111 M) (Martin et al. 2020) from scratch, with distilled and long-document variants. However, these works fall outside of our scope, which is instead concerned with generative architectures.

In this sense, Mattei et al. proposed GePpeTto (117 M) (Mattei et al. 2020), a GPT-2 model fine-tuned on Italian Wikipedia and the ItWac corpus (Baroni et al. 2009), mainly aimed at text completion. Sarti and Nissim devised IT5 (60 M, 220 M, 738 M) (Sarti and Nissim 2024), a family of encoder-decoder transformer models pretrained on a cleaned version of the Italian mC4 corpus,Footnote 2 a web-crawled text collection that includes more than 40 billion words. La Quatra and Cagliero proposed BART-IT (Quatra and Cagliero 2023), an Italian version of BART trained on the same data mixture as IT5. Santilli and Rodolà released Camoscio (7B) (Santilli and Rodolà 2023), an instruction-tuned LLaMA model trained with low-rank adaptation (LoRA) on an Italian (ChatGPT-translated) version of Stanford Alpaca (Taori et al. 2023).

Regarding conversational objectives, Bacciu et al. (2023) presented Fauno (7B/13B), a LoRA fine-tuned version of Baize (Xu et al. 2023) on heterogeneous synthetic Italian datasets. LLaMantino (7B, 13B, 70B) (Basile et al. 2023) is a family of Italian-adapted LLaMA-2 models, trained using QLoRA on the IT5 data mixture. Maestrale (7B)Footnote 3 is a Mistral model specialized in Italian through continual pretraining and instruction fine-tuning. Zefiro (7B)Footnote 4 is a porting of the Mistral model to the Italian language, obtained through continual pretraining on a random subset of Oscar and Wikipedia data, supervised fine-tuning on UltraChat-ITA (silver translation), and DPO alignment (Rafailov et al. 2023) with the UltraFeedback preference dataset (silver translation). Minerva (350 M, 1B, 3B)Footnote 5 is a family of large language models pretrained from scratch on 660B tokens (330B in Italian, 330B in English). DanteLLM (Bacciu et al. 2024) is a QLoRA fine-tuned version of Mistral-Instruct (7B), trained on the Italian SQuAD dataset (Croce et al. 2018), 25K sentences from the Europarl dataset (Koehn 2005), Fauno’s Quora dataset, and the Camoscio dataset. Notably, as underscored by the current leaderboard dedicated to Italian language modeling available on HuggingFace,Footnote 6 the results achievable by language models pretrained from scratch on the Italian language are significantly inferior compared to those achievable by foundational models that have undergone extensive pretraining on larger multilingual corpora.

Taking LAWSUIT as a testbed, we fairly compare the effectiveness and efficiency of available Italian-adapted or multi-lingual encoder–decoder models with million-scale parameters, which offer significant advantages in hardware-constrained scenarios. We examine their adaptability to different tasks, languages, and amounts of labeled training data.

Table 1 Comparison of LAWSUIT to other related datasets

3 LAWSUIT

LAWSUIT is a large-scale Italian AS dataset that collects CCIR-sourced legal verdicts, serving as a new and demanding benchmark for the NLP community. The corpus comprises 14,000 long texts from 1956 to 2022, classified into orders and judgments (see Fig. 2 for statistics based on the year), each meticulously paired with a set of maxims (concatenated to form the target summary). The term order denotes a legal ruling declared during the judicial proceeding to settle questions and disputes verified during the trial, while judgment refers to a legal ruling declared by the judicial body at the end of the trial. The maxims summarize the judicial process by encapsulating key details about the ruling, general legal characteristics, references, and the final provision. Each source consists of three informative sections (Fig. 1):

  • Epigraph: the introduction detailing the main gist of the ruling and the context in which the request is addressed.

  • Text: the core content that highlights the legal extremes.

  • Decision: the concluding segment of the ruling that contains the final provisions of the Court.

Additional details on data access are provided in Appendix 6.

Fig. 2: Graphical and tabular representation of the statistics of orders (O.) and judgments (J.) based on the year. In the figure, dotted boxes refer to document numbers, whereas solid boxes refer to word numbers. The table provides the exact values. An increase in the length of legal verdicts is observable over the years.

3.1 LAWSUIT Processing

To construct the LAWSUIT dataset, we started with 21,331 instances obtained from the CCIR open data, discarding verdicts that were too recent and thus lacked an associated maxim.

Size Filtering We retained records with summary lengths between 100 and 2000 words and source texts between 1000 and 20,000 words, resulting in 14,072 instances. This step aimed to remove unbalanced texts (i.e., outliers) that do not reflect the typical characteristics of these legal documents. Specifically, excessively short texts often lack sufficient context, while very long texts can introduce redundancy. Therefore, retaining only texts within the specified length ranges ensures that the model is exposed to a more homogeneous and representative sample of the data, leading to better generalization and performance.

Duplicate Data Removal To identify and eliminate duplicate instances, we employed an approach similar to Kornilova and Eidelman (2019), resulting in 14,054 instances. Technically, the process involved (i) removing stop words and the 30 most common terms (e.g., article, law, court), (ii) vectorizing texts using scikit-learn’s CountVectorizer, (iii) computing average cosine similarity between the texts and the summaries for each pair of verdicts, and (iv) iteratively adding verdicts while discarding instances highly similar (>96%) to any verdicts already included. Duplicates were often orders on related subjects pronounced in close time frames or written as corrections to previous orders with drafting errors; we kept the most recent version of the document.
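A minimal sketch of this deduplication step is shown below. The dictionary field names (source, summary, date) and the averaging of source–source and summary–summary similarities are our reading of the procedure, not the exact release pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(verdicts, stop_words, threshold=0.96):
    """verdicts: list of dicts with (assumed) 'source', 'summary', and 'date' keys.
    stop_words: Italian stop words plus the 30 most frequent legal terms."""
    vectorizer = CountVectorizer(stop_words=stop_words)
    src_vecs = vectorizer.fit_transform([v["source"] for v in verdicts])
    sum_vecs = vectorizer.transform([v["summary"] for v in verdicts])

    # Average of source-source and summary-summary cosine similarity per verdict pair
    sim = (cosine_similarity(src_vecs) + cosine_similarity(sum_vecs)) / 2

    kept = []
    # Scan from the most recent verdict so that the newest duplicate is the one retained
    for i in sorted(range(len(verdicts)), key=lambda i: verdicts[i]["date"], reverse=True):
        if all(sim[i, j] <= threshold for j in kept):
            kept.append(i)
    return [verdicts[i] for i in sorted(kept)]
```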

Compression Ratio Filtering The compression ratio quantifies how much a document is condensed to produce its summary. This metric is defined as the ratio between the number of words in the input and its corresponding target (Grusky et al. 2018). As for the size filtering procedure, we aimed to create a high-quality homogeneous dataset without outliers. Thus, due to the considerable variation in the sizes of both sources and summaries, we retained verdicts with a compression ratio between 2 and 70, ending up with 14,000 instances.
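The two length-based filters can be expressed compactly as follows; word counts here use simple whitespace splitting as a stand-in for the tokenization actually used.

```python
def keep_instance(source: str, summary: str) -> bool:
    """Apply the size and compression-ratio filters (whitespace word counts as a proxy)."""
    n_src, n_tgt = len(source.split()), len(summary.split())
    if not (1000 <= n_src <= 20000 and 100 <= n_tgt <= 2000):
        return False                 # size filtering
    ratio = n_src / n_tgt            # compression ratio (Grusky et al. 2018)
    return 2 <= ratio <= 70          # compression-ratio filtering
```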

Quality Control and Text Cleaning Besides traditional operations (e.g., removing extra spaces and newline characters), we implemented a preprocessing pipeline aimed at ensuring the textual quality of the dataset. The steps involved were as follows:

  • Removal of epigraph and decision prefixes containing personal names, such as those of the president, editors, and directors;

  • Given that several instances lacked a clear structural separation between the epigraph and the main text, we have explicitly delineated these sections to enhance overall structuring;

  • Elimination of duplicated notes found at the end of maxims, which were deemed irrelevant due to versioning management;

  • Replacement of apostrophes following vowels with the correct accented characters (UTF-8 encoding);

  • Removal of publisher, judge, and reviewer information at the end of the decision;

  • Deletion of backslashes in ruling codes to address encoding errors present in the original JSON files.

On the other hand, certain elements, recognized for their high frequency and factuality role, were intentionally retained: (i) cf., bibliographic citations pointing to external references; (ii) artt., legal jargon signifying the citation of multiple articles; (iii) personal names, except for publisher, judge, and reviewer.
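Below is a hedged sketch of a few of the cleaning rules above (whitespace normalization, backslash removal, and apostrophe-to-accent replacement); the actual rules used to build LAWSUIT are more extensive and tuned to the CCIR export format.

```python
import re

# Illustrative mapping from apostrophe-marked vowels to accented characters;
# the accent type (acute vs. grave) depends on the word and is simplified here.
ACCENT_MAP = {"a'": "à", "e'": "è", "i'": "ì", "o'": "ò", "u'": "ù"}

def clean(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra spaces and newlines
    text = text.replace("\\", "")              # drop backslashes left by JSON encoding errors
    for wrong, right in ACCENT_MAP.items():    # apostrophes after vowels -> accented letters
        text = re.sub(rf"(?<=\w){re.escape(wrong)}(?=\s|$)", right, text)
    return text
```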

Train-test Split Following prior work (Cripwell et al. 2023), we employ a 90-5-5 dataset split to ensure sufficient training data while allowing for adequate validation and testing. Therefore, the dataset was divided into train (90%, 12,600 samples), validation (5%, 700), and test (5%, 700) sets. We carried out proportional stratified random sampling without replacement, considering the categorization and lengths of the sources. To be precise, we (i) evenly distributed the orders and judgments in the splits to have the same percentage of each type in each split and (ii) divided them equally based on their lengths (tertiles are calculated to assign \(\{\text {short}, \text {medium}, \text {long}\}\) classes). Table 2 shows the equal distribution of documents among the splits, specifying fine-grained statistics about the number of words within the three source sections (i.e., epigraph, text, and decision).
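A minimal sketch of this stratified 90-5-5 split, assuming a pandas DataFrame with hypothetical type and source columns and a hypothetical file path:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_json("lawsuit.jsonl", lines=True)   # hypothetical path and schema

# Length classes from tertiles of the source word count
df["n_words"] = df["source"].str.split().str.len()
df["length_class"] = pd.qcut(df["n_words"], q=3, labels=["short", "medium", "long"])

# Stratum = ruling type (order/judgment) x length class
df["stratum"] = df["type"].astype(str) + "_" + df["length_class"].astype(str)

train, rest = train_test_split(df, test_size=0.10, stratify=df["stratum"], random_state=42)
val, test = train_test_split(rest, test_size=0.50, stratify=rest["stratum"], random_state=42)
```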

Table 2 LAWSUIT’s train-test splits

3.2 Dataset characterization

Table 1 offers a comparative analysis of key statistics between LAWSUIT and other relevant text summarization datasets. Concretely, we present corpus sizes and the average number of words and sentences in both source documents and target summaries, calculated using the NLTK library (Bird 2006). Additionally, we furnish information on the average coverage, density, and compression ratio of extractive fragments in terms of words and sentences, as defined by Grusky et al. (2018). In particular, LAWSUIT exhibits longer source texts and target summaries than existing datasets, except for GovReport, where the targets contain more source-related tokens, indicating greater coverage. Moreover, we observe a slightly smaller vocabulary than corpora with a higher number of documents, suggesting that while our dataset is smaller, it still captures the essential linguistic diversity, maintaining a robust and representative vocabulary distribution. In terms of legal contributions, it is noteworthy that LAWSUIT represents the inaugural dataset composed exclusively of Italian documents, distinguishing it from multilingual datasets that include only limited subsets of Italian texts.

Summary Abstractiveness Compared to previous contributions, we observe that the summaries within LAWSUIT exhibit substantial coverage (0.92).Footnote 7 This implies that the target generations contain fewer unsupported entities and facts, ensuring faithfulness while mitigating the risk of hallucinations, an imperative consideration in legal applications. Simultaneously, we note that the density, which represents the average length of the extractive fragments, is the highest among the datasets, suggesting that the summaries in LAWSUIT might have an extractive nature. To test this assumption, following established methodologies (See et al. 2017; Chen and Bansal 2018; Sharma et al. 2019), we compute abstractiveness as the fraction of novel n-grams in the summary that do not appear in the input source. Figure 3 illustrates the percentage of novel sentences and n-grams with \(n \in [1,10]\), indicating that many summary details are not verbatim extractions from sources but rather abstractive, despite the high density.
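The novel n-gram computation can be sketched as follows (NLTK tokenization; the exact preprocessing, e.g., lowercasing, is an assumption):

```python
from nltk import ngrams, word_tokenize  # requires the 'punkt' tokenizer models

def novel_ngram_ratio(source: str, summary: str, n: int) -> float:
    """Fraction of summary n-grams that never appear in the source (abstractiveness)."""
    src = set(ngrams(word_tokenize(source.lower(), language="italian"), n))
    tgt = list(ngrams(word_tokenize(summary.lower(), language="italian"), n))
    return sum(g not in src for g in tgt) / len(tgt) if tgt else 0.0
```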

Fig. 3: \(\%\) of novel n-grams in the summaries compared to BillSum (BS) and GovReport (GR). S indicates the novelty at the sentence level. The results show the abstractiveness of the summaries in LAWSUIT.

Coverage increment and section informativeness Given the obstacles presented by lengthy sources in identifying salient content for inclusion in the summary, we examine the coverage increment of summary-worthy unigrams in the input. To achieve this, we divide each source into ten equal partitions, following Huang et al. (2021). Specifically, we count the number of unique unigrams that also appear in the target, accumulated from the document’s start to the end of each partition. Figure 4 illustrates that relevant information is spread throughout documents, with novel salient unigrams being covered more uniformly as more content is consumed. This means that LAWSUIT exhibits less positional bias and requires a comprehensive reading of the entire input. To further elucidate this aspect, we break down the informativeness across the three sections (i.e., epigraph, text, decision) by computing the percentage of unique salient unigrams occurring in each text span. Figure 5 demonstrates that the core content of a summary is generally concentrated in the text section of the ruling to which it refers. However, through a deeper qualitative investigation (Appendix 10), we discover that the epigraph and the decision are essential at both ends of the generation, where the maxim is likely to mention references and final court judgments, briefly rephrasing and aggregating them.
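A sketch of the coverage-increment computation over ten equal partitions (unigram sets built with NLTK; details such as stop-word handling are assumptions):

```python
from nltk import word_tokenize

def coverage_increment(source: str, summary: str, n_bins: int = 10):
    """Cumulative fraction of summary unigrams covered after each tenth of the source."""
    src_tokens = word_tokenize(source.lower(), language="italian")
    tgt_unigrams = set(word_tokenize(summary.lower(), language="italian"))
    bin_size = max(1, len(src_tokens) // n_bins)
    coverage = []
    for b in range(1, n_bins + 1):
        seen = set(src_tokens[: b * bin_size])
        coverage.append(len(seen & tgt_unigrams) / max(1, len(tgt_unigrams)))
    return coverage
```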

Fig. 4: \(\%\) of unique salient unigrams accumulated across the input. The summary-relevant details are spread over the sources in LAWSUIT, emphasizing the importance of understanding the entire input.

Fig. 5: \(\%\) of unique and total unigrams for each ruling section in LAWSUIT (informativeness). Despite the concentration in the central text part, the epigraph and decision n-grams are crucial to briefly reporting the references and conclusions inside the summaries.

Summary Formulaicness Legal summaries often incorporate common expressions and shared standard structures, enabling models to learn patterns during training without a deep understanding of the input. To quantify this phenomenon, we analyze the formulaicness of summaries in the training set by calculating the longest common subsequence (LCS) (Lin 2004). Technically, we take 5 non-overlapping subsets of 100 random samples and compute the LCS within each subset.Footnote 8 Figure 6 highlights that summaries in LAWSUIT have a lower occurrence of structural patterns across targets than related English legal datasets, especially BillSum, despite the latter having shorter summaries. In fact, the longer the summaries, the higher the chance that words overlap.
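A sketch of this formulaicness measurement is given below; the token-level LCS and the within-subset pairwise averaging reflect our reading of the procedure, and any normalization applied in the paper may differ.

```python
import random
from nltk import word_tokenize

def lcs_len(a, b):
    """Token-level longest common subsequence via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def formulaicness(summaries, n_subsets=5, subset_size=100, seed=42):
    """Average pairwise LCS length within non-overlapping random subsets of summaries."""
    random.seed(seed)
    pool = random.sample(summaries, n_subsets * subset_size)
    averages = []
    for s in range(n_subsets):
        subset = [word_tokenize(t.lower(), language="italian")
                  for t in pool[s * subset_size:(s + 1) * subset_size]]
        pairs = [lcs_len(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        averages.append(sum(pairs) / len(pairs))
    return sum(averages) / len(averages)
```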

Fig. 6: Average summary formulaicness. LAWSUIT (L-IT) has fewer occurrences of structural patterns than BillSum (BS) and GovReport (GR).

4 Experiments

Our goal with LAWSUIT is to establish a novel and challenging benchmark to advance Legal NLP in real-world applications. Therefore, our experiments with LAWSUIT delve into two research questions.

  • RQ1: can current models effectively summarize Italian legal verdicts to support legal practitioners and automate downstream applications?

  • RQ2: given the high cost of human annotation for creating labeled examples, can models be configured to produce useful summaries in real-world scenarios with only a handful of training instances?

To answer them, we set up the following tasks.

  • Full summarization: this involves training models with the entire set of available instances in LAWSUIT, totaling 12,600 samples.

  • Few-shot summarization: this simulates a scenario marked by data scarcity for model supervision due to the high cost of labeling. To replicate this setting, models are provided with only the first 10 and 100 training samples, aligning with previous works (Zhang et al. 2020; Chen and Shuai 2021; Moro and Ragazzi 2022).Footnote 9

4.1 Models

We investigate the performance of multiple extractive and abstractive solutions on LAWSUIT.

Extractive Baselines For upper-bound performance, we consider an oracle: Oracle-Opt selects, for each of the k gold summary sentences—extracted with the NLTK library—the input sentence that maximizes the average ROUGE-{1,2,L} F1 score. LexRank-PLM is a graph-based unsupervised extractive summarizer that leverages LexRank’s eigenvector centrality (Erkan and Radev 2004) and a pretrained language model (paraphrase-multilingual-MiniLM-L12-v2) to enhance sentence representation during text encoding. Epi, Text, and Dec select the first n sentences from the epigraph, text, and decision, respectively. Cat concatenates the first \(\nicefrac {n}{3}\) sentences from the three sections, maintaining the occurrence order in the source document.Footnote 10
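A minimal sketch of the Oracle-Opt selection, assuming the rouge_score package as a stand-in for the actual ROUGE implementation:

```python
from nltk import sent_tokenize
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

def oracle_opt(source: str, summary: str) -> str:
    """For each gold sentence, pick the source sentence maximizing mean ROUGE-{1,2,L} F1."""
    src_sents = sent_tokenize(source, language="italian")
    selected = []
    for gold in sent_tokenize(summary, language="italian"):
        def mean_f1(candidate):
            scores = scorer.score(gold, candidate)
            return sum(s.fmeasure for s in scores.values()) / len(scores)
        selected.append(max(src_sents, key=mean_f1))
    return " ".join(selected)
```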

Table 3 Performance of small (s), base (b), and large (l) models on LAWSUIT with 10, 100, and full (12,600) training instances

Abstractive Baselines mBART (Liu et al. 2020; Tang et al. 2020) is a sequence-to-sequence model largely pretrained on multiple languages using BART’s denoising objective (Lewis et al. 2020); it can process inputs up to 1024 tokens.Footnote 11 IT5 (Sarti and Nissim 2022) is a text-to-text model centered on T5 (Raffel et al. 2020) and pretrained on Italian corpora; it is unbounded in input length thanks to its relative positional embedding mechanism. mT5 (Xue et al. 2021) is a T5-based model pretrained on multiple languages. We employ the small (s), base (b), and large (l) model checkpoints (see Table 8 for technical details).

Fig. 7: The overview of SegSumm. The segmentation ensures that each chunk has a max length \(\le \mathcal {M}\), corresponding to the model’s max input size. The dashed gray module is only used at training time.

Segmentation-based Pipeline Inspired by the necessity of (i) comprehensively processing the entire input source without overlooking details, (ii) minimizing the risk of model hallucination through careful consideration of small, highly correlated source–target pairs, and (iii) generating precise summaries in scenarios with limited data availability, we introduce a straightforward yet powerful language-agnostic segmentation-based approach. Let \(\mathcal {D}=\{d_1, \dots , d_n\}\) be the long input document, where each \(d_i\) is a sentence; this solution divides \(\mathcal {D}\) into non-overlapping chunks (i.e., a set of consecutive sentences), each containing a maximum of \(\mathcal {M}\) tokens. Specifically, we start with an empty chunk c and iteratively add sentences until the \(\mathcal {M}\)-token budget is reached. To train our solution, we assign each summary sentence—selected with NLTK—to the chunk that maximizes the ROUGE-1 precision metric—creating small, highly correlated training pairs (\(c_i, t_i\))—as defined by Moro and Ragazzi (2022). On the other hand, at inference time, the chunks are summarized, and their predictions are concatenated in their order of occurrence in the source document to produce the final summary. We refer to this approach as SegSumm, depicted in Fig. 7.

This pipeline is related to but differs from Moro and Ragazzi (2022) because the segmentation is model-agnostic—and thus language-agnostic—making it applicable to multiple languages, including Italian (see Sect. 4.4.2 for experiments on English legal texts).

Note: when small values of \(\mathcal {M}\) are used, the document is divided into multiple chunks. Consequently, if the number of summary sentences is fewer than the number of chunks, the chunks without corresponding target sentences are discarded during the training process. In other words, the above target-matching algorithm does not ensure \(t_i \ne \emptyset \), which is evident if the number of chunks exceeds the number of target sentences. However, the summaries in LAWSUIT have, on average, more sentences (see Table 1) than the hypothetical number of source chunks.
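The segmentation and target-matching procedure can be sketched as follows; the word-level token budget and the orientation of the ROUGE-1 precision criterion are simplifying assumptions on our part.

```python
from nltk import sent_tokenize
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

def segment(document, max_tokens, count=lambda text: len(text.split())):
    """Greedy non-overlapping chunking: append sentences until the budget M is reached.
    `count` is a word-level stand-in for the backbone model's tokenizer."""
    chunks, current = [], []
    for sent in sent_tokenize(document, language="italian"):
        if current and count(" ".join(current + [sent])) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_training_pairs(document, summary, max_tokens):
    """Assign each summary sentence to the chunk maximizing ROUGE-1 precision,
    read here as the fraction of its unigrams covered by the chunk."""
    chunks = segment(document, max_tokens)
    targets = [[] for _ in chunks]
    for tgt_sent in sent_tokenize(summary, language="italian"):
        best = max(range(len(chunks)),
                   key=lambda i: scorer.score(chunks[i], tgt_sent)["rouge1"].precision)
        targets[best].append(tgt_sent)
    # Chunks with no assigned target sentence are discarded during training
    return [(c, " ".join(t)) for c, t in zip(chunks, targets) if t]
```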

4.2 Implementation and hardware

For abstractive summarizers, we fine-tune the models using the PyTorch (Paszke et al. 2019) implementation from the HuggingFace library (Wolf et al. 2019), leveraging publicly available checkpoints. The models are trained for 3 epochs on a single NVIDIA GeForce RTX 3090 GPU (24GB VRAM) from an internal cluster, with a learning rate of 5e-5. In the decoding process, we apply beam search with 4 beams and n-gram repetition blocks for n>5, using 1024 as the maximum summary length. The seed is fixed at 42 for reproducibility. Additional details are available in Appendix 7.
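For illustration, a decoding call mirroring this configuration with the HuggingFace API; the checkpoint name and input text are placeholders rather than the fine-tuned LAWSUIT models, and no_repeat_ngram_size=5 is one reading of the n-gram blocking described above.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

torch.manual_seed(42)  # fixed seed, as in the paper

checkpoint = "gsarti/it5-base"  # example checkpoint, not the fine-tuned LAWSUIT model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

verdict_text = "..."  # an Italian legal ruling (or a single SegSumm chunk)
inputs = tokenizer(verdict_text, return_tensors="pt", truncation=True)
summary_ids = model.generate(
    **inputs,
    num_beams=4,              # beam search with 4 beams
    no_repeat_ngram_size=5,   # one reading of "n-gram repetition blocks for n > 5"
    max_length=1024,          # maximum summary length
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```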

4.3 Evaluation setup

To provide a comprehensive evaluation, we conduct both a quantitative and a qualitative analysis, using automatic metrics and human annotators, respectively.

Automatic ROUGE-{1,2,L} F1 (Lin 2004) and BERTScore F1 (BS) (Zhang et al. 2020) are used to calculate the lexical overlap and estimated semantic overlap between the generated and the gold summaries, respectively. For BS, we use the bert-base-multilingual-cased model and set rescale_with_baseline=True.
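A hedged evaluation sketch using the rouge_score and bert_score packages (the official scoring scripts and preprocessing may differ):

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(predictions, references):
    """Average ROUGE-{1,2,L} F1 and rescaled multilingual BERTScore F1."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
    rouge = [scorer.score(ref, pred) for pred, ref in zip(predictions, references)]
    avg = lambda key: sum(s[key].fmeasure for s in rouge) / len(rouge)
    _, _, f1 = bert_score(predictions, references, lang="it",
                          model_type="bert-base-multilingual-cased",
                          rescale_with_baseline=True)
    return {"rouge1": avg("rouge1"), "rouge2": avg("rouge2"),
            "rougeL": avg("rougeL"), "bertscore": f1.mean().item()}
```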

Human Given the potential failure of automatic metrics to act as reliable proxies for summary quality dimensions, we perform an in-depth human evaluation. Following previous work (Narayan et al. 2018; Fabbri et al. 2019; Moro et al. 2023), we use Best-Worst Scaling (Louviere and Woodworth 1991; Louviere et al. 2015), which is more trustworthy and less expensive than rating scales (Kiritchenko and Mohammad 2017). Specifically, we provide 3 legal expert evaluators with the source document and the artificial summaries from the best-performing models. We ask them to rank predictions according to informativeness, fluency, and factuality. The assessment is done on 30 randomly selected documents from LAWSUIT’s test set by comparing all the possible summary pair combinations, i.e., 90 binary preference annotations per participant. We randomize the order of pairs and per-example sources to guard the rating against being gamed. Elicited labels are used to establish a score in \([-1,1]\) for each summary source s: \(\%_{best}(s) - \%_{worst}(s)\). The annotation process takes \(\approx \)6 h per judge, 18 in total. Appendix 9 illustrates our setup.
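The per-system Best-Worst Scaling score can be computed from the collected binary preferences as follows (a minimal sketch; the system names in the example are hypothetical):

```python
from collections import Counter

def bws_scores(judgments):
    """judgments: list of (best_system, worst_system) tuples, one per binary preference.
    Returns score(s) = %best(s) - %worst(s) over the comparisons involving s."""
    best, worst, seen = Counter(), Counter(), Counter()
    for b, w in judgments:
        best[b] += 1
        worst[w] += 1
        seen[b] += 1
        seen[w] += 1
    return {s: (best[s] - worst[s]) / seen[s] for s in seen}

# Example with hypothetical system names
print(bws_scores([("SegSumm", "mBART"), ("SegSumm", "IT5"), ("IT5", "mBART")]))
```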

4.4 Results and discussion

4.4.1 Dataset

Italian Legal Ruling Summarization Table 3 presents the performance of each baseline on LAWSUIT, where the summarizers are tasked with extracting and synthesizing crucial information from lengthy sources, utilizing varying numbers of training samples. Table 5 presents the transfer learning performance of IT5-small when trained and tested across orders and judgments. In terms of abstractive summarizers, models that allow long inputs (IT5) perform better than input-constrained models (mBART) on all tasks, underscoring the utility of an extensive input context. Longer input also brings consistent performance gains for IT5 across tasks. Interestingly, SegSumm significantly exceeds the baselines (p-value \(< 0.05\), Student's t-test) in full and few-shot summarization. Human evaluation results are reported in Table 6. SegSumm is rated the best in all dimensions. These findings demonstrate that existing language models can effectively support Italian legal summarization, particularly when equipped with segmentation capabilities (RQ1). Indeed, text segmentation allows the model to process the entire document without truncating information that exceeds the maximum input size permitted by its architecture. In plausible few-shot scenarios, SegSumm emerges as the sole model offering satisfactory effectiveness (RQ2). We provide examples of the summaries generated with few-shot training and with training on the entire dataset in Appendix 12.

Table 4 ROUGE F1 performance of IT5-small (8192) in full summarization for generating summaries from individual sections and their combination
Table 5 Transfer learning performance
Table 6 Human evaluation ranking

Generating summaries from sections To further explore the importance of reading the entire input source in LAWSUIT, we train summarizers on individual sections (i.e., epigraph, text, decision) to generate the summary.

As shown in Table 4, the model trained on the three concatenated sections achieves significant improvements compared to processing only the epigraph and the decision. Because the text section is longer, models processing only that section are marginally less efficient. However, this analysis indicates that all the source sections are sufficiently informative to produce a comprehensive summary. This further underscores the importance of avoiding the truncation of longer texts due to context limitations and instead leveraging segmentation-based approaches.

Table 7 Performance of models on BillSum in few-shot summarization

4.4.2 Method

Generality of SegSumm Due to the language independence of the SegSumm approach, we assess its generality by analyzing whether it can improve legal applications in other languages. Specifically, we experiment with the BillSum dataset under low-resource conditions (Moro et al. 2023), simulating a real-world legal scenario.Footnote 12 We compare with existing solutions concentrating on few-shot summarization: Pegasus (Zhang et al. 2020), a transformer-based model pretrained with a summarization-specific objective that allows for fast adaptation with few labeled samples. MTL-ABS (Chen and Shuai 2021), a meta-transfer learning approach that augments training data with similar corpora. Se3 (Moro and Ragazzi 2022), a segmentation-based solution equipped with metric learning. Athena (Moro and Ragazzi 2023), a segmentation-based model with a dynamic learned size of the chunks. LW-ML (Huh and Ko 2022), a meta-learning algorithm that inserts a lightweight module into the attention mechanism of a pretrained language model. Regarding our solution, we test SegSumm on top of BART-base (Lewis et al. 2020). Table 7 shows that SegSumm largely outperforms previous models, confirming the usefulness of text segmentation for legal texts.

5 Conclusion

In this paper, we introduced LAWSUIT, the first large-scale dataset for the abstractive summarization of long Italian legal verdicts. The challenges presented by LAWSUIT include lengthy sources, the uniform distribution of relevant information throughout the input, and the lower presence of formulaic patterns in the targets. Through an extensive series of experiments, we found that a text segmentation pipeline significantly outperforms other methods in both few-shot and full summarization. We anticipate that LAWSUIT will contribute to the development of real-world legal summarization systems and stimulate research towards effective long-range solutions for Italian legal documents. Future work will extend LAWSUIT to new tasks, such as cross-domain and in-domain ruling classification (Domeniconi et al. 2014a, b, 2015, 2016, 2017; Moro et al. 2018), legal reasoning (Guha et al. 2023; Moro et al. 2024), open-domain question answering (Frisoni et al. 2024), corpus-level knowledge extraction (Frisoni and Moro 2020), and lay summarization (Ragazzi et al. 2024). By representing the source document as a graph (Moro et al. 2023), researchers could explore efficient segmentation and summarization techniques based on graph sparsification (Domeniconi et al. 2014, 2016; Zaheer et al. 2020), possibly using distributed algorithms (Lodi et al. 2010; Cerroni et al. 2015) to handle a large number of nodes and edges.

6 Limitations

As there are no publicly available Italian datasets specifically designed to summarize long legal documents, we conducted a comparison between LAWSUIT and existing English legal datasets. However, it is crucial to acknowledge that English and Italian differ not only in language but also in vocabulary and style, potentially introducing linguistic biases when comparing statistics. While SegSumm serves as a baseline, it requires the generation of at least one sentence for each chunk during inference. Although this is suitable for extensive summaries, such as those found in typical long document summarization datasets like LAWSUIT and GovReport, it might be less scalable for concise summaries. Regarding low-resource experiments, our method is guided by published top-tier work, but we recognize that the sample selection process could significantly impact the final results. Hence, future contributions should explore various subsets of the training set to gain a more comprehensive understanding.