1 Introduction

Automatic text summarization is one of the most valuable deep learning applications for natural language processing (NLP), especially when document analysis is time-consuming for humans. Reading and evaluating legal case reports is labor-intensive for judges and lawyers, who usually base their choices on report abstracts, legal principles, and commonsense reasoning. Legal report abstracts are scarce, and writing them requires legal experts and considerable time. Furthermore, few datasets are publicly available for text summarization in the legal domain, especially when corpora are needed in a specific language, since legislation is drafted in the language of its country. Therefore, the challenge of this work is to build an automatic summarization system for legal reports that speeds up human productivity and overcomes the dearth of available legal corpora.

Two main approaches have been proposed in the literature for the text summarization task: extractive and abstractive. Extractive summarization selects the most salient sentences in a text, producing a summary composed of sentences taken verbatim from the original document. The abstractive approach, instead, summarizes an input text by rephrasing it, possibly using words that do not appear in the source text.

In this work, we tackle both extractive and abstractive summarization by proposing a transfer learning approach that copes with the lack of labeled legal summarization datasets (i.e., legal corpora without human-written abstracts), a typical low-resource scenario. Our method generates abstractive summaries starting only from tagged catchphrases within legal reports (Fig. 1). The catchphrases are meant to present the essential legal concepts of a case. To this end, we first select and extract the relevant sentences in the text, if not already tagged, using a lightweight neural model composed of CNN and GRU layers. Then, we pass the extracted sentences to GPT-2 (Radford et al. 2019) as the reference summary. We chose GPT-2 because it is a generative decoder-only transformer that is more efficient than typical sequence-to-sequence summarization models, whose separate encoder and decoder layers roughly double the memory footprint and training time.

Fig. 1
figure 1

The overview of our solution for the abstractive summarization of a legal case report. First, our extractive summarizer (CNN+GRU) retrieves relevant sentences from the document. Next, these sentences are concatenated with the whole legal case report in the following order: (i) report full text, (ii) extracted sentences. Then, a transformer language model (GPT-2 small) is trained on legal case reports organized in this format. During inference, the full text of each new legal case report to summarize is given to the language model as context to generate a summary

Our approach can be considered a general solution to overcome the absence of abstractive reference summaries written by human legal experts. Indeed, only catchphrase tags are required, which are much less time-consuming to create than human-written abstracts because they can be obtained in multiple ways, e.g., by applying an unsupervised extractive algorithm (e.g., TextRank (Mihalcea and Tarau 2004)) or by manually tagging the sentences of a few documents and afterward fine-tuning a pre-trained model.

We experiment with the Australian Legal Case Reports dataset, evaluating our model’s effectiveness in summarizing legal case reports in several languages (i.e., English, Italian, German, Spanish, Danish, French), whose translations are produced automatically with the Google Translate API as in prior works (Feng et al. 2022). We also test the cross-language performance by applying the model trained on the English-written texts to all the benchmarked languages. We finally assess the factual consistency (Kryscinski et al. 2020) of the generated abstractive summaries.

The paper is organized as follows. Section 2 analyzes the literature on text summarization and delves into the related works on the Australian Legal Case Reports dataset. Section 3 presents our transfer learning approach and the models employed for the summarization tasks. Section 4 describes the multi-language experiments with extractive and abstractive techniques. Lastly, Sect. 5 sums up the work with final thoughts.

2 Related work

In order to achieve state-of-the-art (SOTA) results in automatic text summarization, many advancements have been made in neural network architectures (Cho et al. 2014; Sutskever et al. 2014; Bahdanau et al. 2015; Vinyals et al. 2015; Vaswani et al. 2017; Moro and Valgimigli 2021; Moro et al. 2022), pretraining and transfer learning (Domeniconi et al. 2015; Domeniconi et al. 2016; McCann et al. 2017; Peters et al. 2018; Devlin et al. 2019), and the availability of large-scale supervised datasets (Sandhaus 2008; Nallapati et al. 2016; Grusky et al. 2018; Narayan et al. 2018; Sharma et al. 2019). These advances allowed deep learning-based approaches (Domeniconi et al. 2017) to dominate the field, also for biomedical tasks (Frisoni et al. 2021; Frisoni et al. 2022; Frisoni et al. 2023) and multi-modal settings (Moro and Salvatori 2022; Moro et al. 2023), and to address low-resource summarization scenarios (Moro and Ragazzi 2022; Moro et al. 2023a, b; Moro and Ragazzi 2023). SOTA solutions leverage attention layers (Liu 2019; Liu and Lapata 2019; Zhang et al. 2019), copying mechanisms (See et al. 2017; Cohan et al. 2018), and multi-objective training strategies (Guo et al. 2018; Pasunuru and Bansal 2018), including reinforcement learning (RL) techniques (Kryscinski et al. 2018; Dong et al. 2018; Wu and Hu 2018).

SOTA extractive summarization includes BERT-based models, such as BertSum (Liu and Lapata 2019), BERT + RL (Bae et al. 2019), and PnBert (Zhong et al. 2019). Conversely, abstractive summarization includes transformer-based models for sequence-to-sequence learning, such as BART (Lewis et al. 2020), PEGASUS (Zhang et al. 2020), T5 (Raffel et al. 2020), and ProphetNet (Qi et al. 2020). Further, within the recent research on efficient transformers with linear attention complexity, the SOTA models for long document summarization are Longformer Encoder-Decoder (Beltagy et al. 2020), BigBird-Pegasus (Zaheer et al. 2020), and Hepos (Huang et al. 2021).

The summarization task on the Australian Legal Case Reports dataset was first tackled by Galgani and Hoffmann (2010). They presented a new knowledge-based approach to legal citation classification and created an extensive training and test corpus from Australian court decision reports. Their later work (Galgani et al. 2012) presents the challenges and possibilities of automatically generating catchphrases for legal documents. The authors developed a corpus of human-generated legal catchphrases, which let them compute statistics useful for automatic catchphrase extraction. Afterward, Galgani et al. (2012a) presented an approach to assigning categories and generating catchphrases for legal case reports. They describe a knowledge acquisition framework that lets them quickly build classification rules, using a small number of features to assign general labels to cases. They show how the resulting knowledge base outperforms machine learning models that use either the designed features or a traditional bag-of-words representation. In the same year, Galgani et al. (2012b) described a hybrid approach in which several different summarization techniques are combined in a rule-based system built through manual knowledge acquisition. Here, human intuition, supported by data, specifies attributes and algorithms and the contexts where these are best used. Lastly, Galgani et al. (2012c) presented an approach that uses both incoming and outgoing citation information to automatically generate catchphrases for legal case reports. Specifically, they created a corpus of cases, catchphrases, and citations and performed a ROUGE-based evaluation (Lin et al. 2004), which showed the superiority of their citation-based methods over full-text-only methods.

Mandal et al. (2017) proposed an unsupervised approach for extracting and ranking catchphrases from the same court case documents by focusing on noun phrases. They compared the proposed approach with several unsupervised and supervised baselines, showing that the proposed methodology achieves statistically significantly better performance than all the baselines.

In the latest published work on Australian legal cases, Tran et al. (2018) presented a method of automatic catchphrase extraction. They utilized deep neural networks to construct a scoring model of their extraction system and achieved comparable performance without using citation information.

Other recent works focused on benchmarking different extractive and abstractive approaches (Shukla et al. 2022) without considering catchphrase extraction, or experimented with catchphrase extraction on different datasets (Koboyatshwene et al. 2017; Bhargava et al. 2017; Kayalvizhi and Thenmozhi 2020).

Our work focuses on the Australian Legal Case Reports dataset for two main reasons. First, several studies have already been performed on this dataset. Second, it is suitable for simulating the lack of human-crafted abstracts, unlike the BillSum dataset (Kornilova and Eidelman 2019), for example. Further, this work has been partially inspired by a recent approach proposed by Pilault et al. (2020), which combines extractive and abstractive summarization to increase model performance. Nevertheless, since we do not have abstracts as reference summaries, we fine-tune a pre-trained transformer-based model using extractive summaries as references to overcome that absence.

3 Method

Fig. 2
figure 2

The model architecture used in the extractive summarization experiments

In this work, we combine extractive and abstractive summarization techniques in a transfer learning approach that generates abstractive summaries starting from tagged catchphrases. In particular, BERT (Devlin et al. 2019) (base, multilingual cased) and a deep neural network composed of CNN and GRU layers have been used for the extraction phase, whereas the abstractive phase has been performed with the GPT-2 transformer-based model.

3.1 Extractive summarization

In order to generate contextualized word embeddings, we apply the BERT-Base-Multilingual-Cased pre-trained model for two main reasons: (i) BERT has achieved SOTA results in various NLP tasks, and (ii) the multilingual model allows us to overcome the absence of multi-lingual legal case datasets in the cross-language experiments. We obtained the sentence embeddings by computing the mean of the word embeddings belonging to the same sentence.

For the binary classification task, our model (Fig. 2) comprises:

  1. One 1D CNN layer (LeCun et al. 1999), with a kernel size of 1 and 1024 filters, followed by a MaxPooling1D operation.

  2. One bidirectional GRU layer (Cho et al. 2014) followed by a GlobalMaxPooling1D operation.

  3. Four fully connected layers of decreasing dimensionality.

All the main layers are interleaved with Dropout layers. For performance reasons, the Conv1D, MaxPooling1D, and Dropout layers have only been applied when the mean of word embeddings has been used as the sentence embedding building method.
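A minimal Keras sketch of the architecture in Fig. 2 is given below. It is an illustrative reconstruction rather than the authors' code: the GRU hidden size, the dense layer widths, and the treatment of the input as a sequence of embedding vectors are assumptions, and, as noted above, the Conv1D/MaxPooling1D/Dropout blocks may be kept or dropped depending on the sentence-embedding strategy.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

EMB_DIM = 768  # hidden size of BERT-Base-Multilingual-Cased

def build_extractive_classifier(max_tokens: int = 128) -> tf.keras.Model:
    # one sentence = a sequence of contextualized embedding vectors (assumed input layout)
    inputs = layers.Input(shape=(max_tokens, EMB_DIM))
    x = layers.Conv1D(filters=1024, kernel_size=1, activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)  # hidden size assumed
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dropout(0.1)(x)
    # four fully connected layers of decreasing dimensionality (widths assumed)
    for units in (512, 128, 32):
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # relevant vs. non-relevant sentence
    model = models.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.99),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```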

The word/sentence embeddings have been produced with the Flair NLP library (Akbik et al. 2019).Footnote 1 This framework lets us choose among different embedding methods and generate word/sentence tensors from the corresponding strings.
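As an illustration, the snippet below shows one way to obtain mean-pooled sentence embeddings with Flair; the exact classes used in our pipeline are not spelled out above, so the TransformerWordEmbeddings/DocumentPoolEmbeddings combination and the example sentence are assumptions.

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings, DocumentPoolEmbeddings

# BERT-Base-Multilingual-Cased word embeddings, mean-pooled into a sentence embedding
word_embeddings = TransformerWordEmbeddings("bert-base-multilingual-cased")
sentence_embedder = DocumentPoolEmbeddings([word_embeddings], pooling="mean")

sentence = Sentence("The applicant seeks judicial review of the decision.")
sentence_embedder.embed(sentence)
vector = sentence.embedding  # torch tensor of size 768, used as the classifier input
print(vector.shape)
```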

We train the extractive summarization model to minimize the categorical cross-entropy loss, as follows:

$$\begin{aligned} \mathcal {L}_{es} = -\sum _{i=1}^{N} y_i \cdot \log (\hat{y}_i) \end{aligned}$$
(1)

where N is the number of samples and \(y_i\) and \(\hat{y}_i\) are the gold and predicted labels, respectively.

3.2 Abstractive summarization

In the abstractive summarization scenario, GPT-2 has been used to generate legal abstractive summaries. To this end, two main steps are required:

  1. Fine-tuning data (Fig. 3): each instance of our GPT-2 fine-tuning data is sequentially composed of the input text, a <summarize> tag, and the input text summary. In our case, the input text is the original text of each legal case report, whereas the summary is the set of extracted sentences labeled as relevant for that report.

  2. Inference phase (Fig. 4): each test instance is composed of the original text of a new legal report followed by the <summarize> tag.

Fig. 3
figure 3

The structure of each fine-tuning instance used in the abstractive summarization with the GPT-2 language model. Each instance is composed of: (i) the legal case report full text, (ii) a <summarize> tag, and (iii) the report summary obtained via the extractive summarization process

Fig. 4
figure 4

The inference phase of the abstractive summarization process performed via GPT-2. For each summary to produce, the full text of the legal case report is concatenated with the <summarize> tag. The GPT-2 model iterates for a specified number of steps, producing one token at a time. The report summary consists of the tokens generated after the <summarize> tag

The embedding process is integrated into the GPT-2 architecture: the tokenizer maps word sequences to token IDs, which the model's embedding layer transforms into numeric vectors.
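The sketch below illustrates, under our assumptions, how such fine-tuning and inference instances could be assembled and tokenized with the Hugging Face GPT-2 tokenizer; the <summarize> separator follows the format in Figs. 3 and 4, while the truncation length is a hypothetical choice.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<summarize>"]})

def build_finetuning_instance(report_text: str, extracted_sentences: list) -> list:
    # (i) full report text, (ii) <summarize> tag, (iii) extractive reference summary
    reference = " ".join(extracted_sentences)
    sequence = f"{report_text} <summarize> {reference}"
    return tokenizer.encode(sequence, truncation=True, max_length=1024)

def build_inference_prompt(report_text: str) -> list:
    # at inference time, the summary part is left for the model to generate
    return tokenizer.encode(f"{report_text} <summarize>", truncation=True, max_length=1024)
```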

We train the abstractive summarization model using the standard cross-entropy loss, which requires the model to predict the next token \(y_i\) of the target \(\mathcal {Y}\) given \(\mathcal {X}\) and the previous target tokens \(y_{1:i-1}\), as follows:

$$\begin{aligned} \mathcal {L}_{as} = -\sum _{i=1}^{|\mathcal {Y}|} \log p(y_i |y_{1:i-1}, {\mathcal {X}}) \end{aligned}$$
(2)

where p is the predicted probability over the vocabulary.

4 Experiments

This section first introduces the dataset, experimental setup, and evaluation metrics. Then, we delve into the in-domain and cross-language experiments for the extractive and abstractive approaches.

4.1 Dataset

The dataset used in our experiments is the Australian Legal Case Reports, and it represents a textual corpus of around 4000 legal cases for automatic summarization and citation analysis (Galgani et al. 2012c). The dataset contains Australian legal cases from the Federal Court of Australia (FCA) from 2006 to 2009, downloaded from AustLII.Footnote 2 For each document, the authors collected catchphrases, citation sentences, citation catchphrases, and citation classes that indicate the type of treatment given to the cases cited by the present case.

The dataset is structured in three directories:

  • fulltext: it contains the full text and the catchphrases of all the cases from the FCA. Each document (<case>) contains:

    • <name>: the name of the case.

    • <AustLII>: the link to the page from where the document was taken.

    • <catchphrases>: a list of <catchphrase> elements.

    • <sentences>: a list of <sentence> elements.

  • citations_summ: it contains the citation elements of each case with the following fields:

    • <name>: the name of the case.

    • <AustLII>: the link to the page from where the document was taken.

    • <citphrases>: a list of <citphrase> elements, i.e., catchphrases from a case that is cited by or cites the current one. The attributes are id, type (cited or citing), and from (the case from which the catchphrase is taken).

    • <citances>: a list of <citance> elements, i.e., sentences from a later case that mention the current case. They also have the from attribute.

    • <legistitles>: a list of <title> elements that are titles of a piece of legislation cited by the current case.

  • citations_class: it contains for each case a list of labeled citations with the following fields:

    • <name>: the name of the case.

    • <AustLII>: the link to the page from where the document was taken.

    • <citations>: a list of <citation> elements. They contain several attributes, such as the <class> of the citation as indicated in the document (considered, followed, cited, applied, notfollowed, referred to, etc.), the name of the cited case (<tocase>), the link to the document of the cited case (<AustLII>), and the <text> paragraphs in the cited case where the current case is mentioned.

No missing values have been found in the fulltext and citations_class directory files, whereas some values are missing in the citations_summ documents.

The XML files contain many HTML entity characters. These break XML parsing, since the parser interprets “&” as the start of an XML entity, while the entities in question are HTML ones that XML does not define; therefore, they had to be removed.

The data used to perform the analysis have been selected from the fulltext directory. For each legal case (i.e., an XML file), <name>, <catchphrase>, and <sentence> have been used. To parse the text correctly, HTML special entities have been replaced with their corresponding textual representation. Some legal case reports have been truncated because they are not encoded as UTF-8 strings.

In order to create the target variable (i.e., the feature representing the class) of our extractive summarization experiments, each legal sentence needs a label indicating whether it should be included in a case report summary; the class variable is therefore binary. Since this information is not directly specified in the metadata, it has been generated with the following annotation process: for each sentence of a legal case, we check whether at least one of the catchphrases of that legal case is included in the sentence under examination. If this condition holds, the sentence is labeled True, otherwise False (a sketch of this procedure is shown below).
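A minimal sketch of this labeling rule follows; treating "included" as case-insensitive substring containment is our interpretation, and the example sentences are hypothetical.

```python
def label_sentences(sentences, catchphrases):
    """Return one boolean per sentence: True if the sentence contains a catchphrase."""
    lowered = [cp.lower() for cp in catchphrases]
    return [any(cp in s.lower() for cp in lowered) for s in sentences]

# Example: only the second sentence contains the catchphrase "judicial review".
labels = label_sentences(
    ["The matter was adjourned.", "The applicant seeks judicial review of the decision."],
    ["judicial review"],
)
print(labels)  # [False, True]
```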

Table 1 Statistics of the legal case reports translated into all the evaluation languages

The legal sentences of each case report have been added to a common dataframe, where each row represents a sentence. The rows have been balanced by class (represented by the is_catchphrase attribute). Afterward, the data were shuffled by groups of sentences belonging to the same legal case report, so that the model never classifies sentences from case reports already seen at training time. Table 1 shows statistics of all datasets (i.e., the Australian Legal Case Reports translated into all the evaluation languages), reporting the number of words and sentences in source and target texts, the source-target compression ratio (the number of source words divided by the number of target words), and the percentage of relevant sentences containing catchphrases.
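A possible pandas implementation of this preparation step is sketched below; the column names (case_id, sentence, is_catchphrase) and the undersampling strategy are assumptions, since the text only states that instances were class-balanced and shuffled by report.

```python
import pandas as pd

def balance_and_group_shuffle(df: pd.DataFrame, seed: int = 7) -> pd.DataFrame:
    # df has one row per sentence with (assumed) columns: case_id, sentence, is_catchphrase
    n_minority = df["is_catchphrase"].value_counts().min()
    balanced = (
        df.groupby("is_catchphrase", group_keys=False)
          .apply(lambda g: g.sample(n=n_minority, random_state=seed))
    )
    # shuffle at the case level, keeping each report's sentences together
    case_order = balanced["case_id"].drop_duplicates().sample(frac=1, random_state=seed)
    rank = {case: i for i, case in enumerate(case_order)}
    balanced["case_rank"] = balanced["case_id"].map(rank)
    return balanced.sort_values("case_rank").drop(columns="case_rank").reset_index(drop=True)
```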

4.2 Experimental setup

Similar to previous contributions (Zhang et al. 2020), all the analyses have been performed on datasets of 100 and 1000 reports, simulating real-world scenarios characterized by the dearth of data. Precisely, one result table has been generated for (i) each summarization technique adopted and (ii) each language into which the dataset has been translated. As commonly happens in realistic organizations because of the high cost of data labeling, all extractive and abstractive models have been trained on 70% of the dataset sampled in each experiment and tested on the remaining samples, similar to Bajaj et al. (2021).

Regarding the multi-lingual experiments, we translate each legal case report with the Google Translate API into the following languages: Italian, German, Spanish, Danish, and French.
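For illustration, a translation call could look like the snippet below, which uses the google-cloud-translate v2 client; the exact client library and project setup behind our experiments are not detailed here, so this is only an assumption.

```python
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be configured

def translate_report(text: str, target_language: str = "it") -> str:
    # translate an English legal case report into the target language
    result = client.translate(text, source_language="en", target_language=target_language)
    return result["translatedText"]
```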

4.2.1 Implementation details

We implemented the models with Keras for tensor computation, setting the seed to 7 for reproducibility. We trained the extractive summarization model for 100 epochs with a batch size of 50 and a learning rate of 1e-4, employing Adam as the optimizer with \(\beta _1=0.9\) and \(\beta _2=0.99\) and a weight decay of 1e-6. We used L1 and L2 regularization penalties of 0 and 0.001, respectively, and set the dropout to 0.1. Regarding abstractive summarization, we trained the model for 4 epochs with a batch size of 4 and a learning rate of 5e-5, using Adam as the optimizer with \(\beta _1=0.9\) and \(\beta _2=0.99\) and a weight decay of 1e-6. For decoding, we utilized top-p nucleus sampling with top_p and temperature set to 0.9 and 1, respectively.
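As a sketch of the decoding step, the snippet below applies the stated nucleus-sampling parameters with the Hugging Face generate API; this is an illustrative reconstruction, and the checkpoint name, prompt truncation length, and output length are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")          # GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def summarize(report_text: str, max_new_tokens: int = 200) -> str:
    prompt = f"{report_text} <summarize>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=800)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            do_sample=True,        # top-p nucleus sampling
            top_p=0.9,
            temperature=1.0,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # the summary is everything generated after the <summarize> tag
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
```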

4.3 Evaluation metrics

The extractive experiments have been evaluated with the F1 and ROUGE scores (Lin et al. 2004). The F1 score is calculated between the sentences classified as relevant and the gold ones containing the catchphrases. The F1 metric has been computed to account for both recall (i.e., the ratio of relevant sentences retrieved by the model out of all relevant sentences in the dataset) and precision (i.e., the percentage of salient sentences retrieved out of all the sentences in the produced summary). Further, we chose the F1 metric because neither the training nor the test set is perfectly class-balanced since, after the class-balancing operation, the data are grouped by case report and shuffled. By doing so, we also simulate a production environment where an entirely new legal case report is passed as input to our classifier. Conversely, the ROUGE scores are calculated between the concatenation of the classified sentences and the concatenation of the gold-relevant ones.
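For reference, the metrics can be computed as in the sketch below; the rouge-score and scikit-learn packages are our choice of implementation, not necessarily the one used for the reported results, and the toy inputs are hypothetical.

```python
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

def evaluate_extractive(gold_labels, pred_labels, gold_sentences, pred_sentences):
    f1 = f1_score(gold_labels, pred_labels)
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    # ROUGE between the concatenations of gold and predicted relevant sentences
    rouge = scorer.score(" ".join(gold_sentences), " ".join(pred_sentences))
    return f1, rouge

f1, rouge = evaluate_extractive(
    gold_labels=[1, 0, 0],
    pred_labels=[1, 0, 1],
    gold_sentences=["application for native title determination"],
    pred_sentences=["application for native title determination", "costs reserved"],
)
print(f1, rouge["rouge1"].fmeasure)
```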

The evaluation metrics used for the abstractive summarization task are ROUGE and FactCC (Kryscinski et al. 2020). The latter is used to evaluate the factual consistency of the summaries w.r.t. the related report’s original texts. In particular, the F1 score and balanced accuracy metrics have been calculated. Technically, we used the authors’ data generation scripts to generate positive and negative examples from a JSONL file. Negative examples are created by applying some syntactic transformations to the original texts.

Table 2 The F1 scores obtained in the evaluation of several word embedding methods in the extractive summarization scenario

4.4 Extractive summarization

In order to evaluate the numerous embedding methods proposed in the literature, several extractive summarization experiments were carried out (Table 2). BERT-Large is the best performer when embeddings of single words are created without the surrounding context, with a 78.33% F1. The intuition behind the gap between the BERT-Base-Multilingual and BERT-Large models is that the latter has been trained only on English text and has a larger dimension. However, we chose BERT-Base-Multilingual because it has a similar performance and can be used as a baseline to compare the results in multi-lingual settings. On the other hand, models performed better when sequences are taken into account for creating contextualized word embeddings, because word order is not lost with such a sentence representation, leading to higher F1 scores. In this case, BERT-Base-Multilingual is the best performer; thus, it has been chosen since it reaches the highest performance and can be used as a baseline to compare the results of the experiments involving more than one language.

4.4.1 In-domain single-language experiments

The detailed results of the in-domain extractive summarization tasks on the original Australian Legal Case Reports dataset are shown in Table 3. As expected, the ROUGE scores obtained in the experiments with 1000 legal case reports are the highest. These ROUGE scores outperform, on average, the ones produced by the latest work on the same dataset, which used a neural network for the catchphrase extraction task (Tran et al. 2018).

Table 3 The ROUGE (Precision, Recall, F1) and F1 scores obtained in the extractive summarization experiments using the original English-written reports and the two different strategies as the building method of sentence embeddings. The best scores are in bold
Table 4 The ROUGE-F1 and F1 scores obtained in the extractive summarization experiments using 100 and 1000 Australian Legal Case Reports, including the original ones (written in English) and those translated into several languages (Italian, German, Spanish, Danish, French)
Fig. 5
figure 5

The ROUGE-F1 and F1% scores obtained in the extractive summarization experiments using 100 Australian legal case reports translated into multiple languages. The mean of word embeddings has been used as the building method for sentence embeddings

Fig. 6
figure 6

The ROUGE-F1 and F1% scores obtained in the extractive summarization experiments using 1000 Australian legal case reports translated into multiple languages. The mean of word embeddings has been used as the building method for sentence embeddings

4.4.2 In-domain multi-language experiments

Table 4 shows the results of the extractive summarization performed on 100 and 1000 legal case reports translated into Italian, German, Spanish, Danish, and French. Regarding 100 samples, the experiment with the Spanish dataset achieved the best results in almost all metrics except for the F1 score, which was obtained with the Danish dataset. Considering 1000 samples, the best performances were found in the German translation scenario except for the F1 score, which was achieved with the Italian dataset.

Figures 5 and 6 sum up the results of the extractive experiments. Tables 5 and 6 compare the multi-lingual results of other extractive summarization approaches, revealing the better performance of our solution. In detail, we compare with MemSum (Gu et al. 2022), BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and DistilBERT (Sanh et al. 2019). For all languages, we use the corresponding multi-lingual model checkpoint (except for MemSum, which is available only for English, and RoBERTa, which does not include a Danish version).

Table 5 The comparison with extractive summarization models on the multi-language experiment with 100 labeled samples. Best scores are bolded
Table 6 The comparison with extractive summarization models on the multi-language experiment with 1000 labeled samples. Best scores are bolded
Table 7 The ROUGE (F1) and F1 scores of the cross-language experiments, where our extractive model trained on English reports has been applied in two experiments: (i) it has been tested without fine-tuning on different languages (EN \(\longrightarrow\) LN); (ii) it has been fine-tuned using the translated legal case reports and then tested (EN, LN \(\longrightarrow\) LN)

4.4.3 Cross-language experiments

Table 7 shows the results of the extractive summarization task in the cross-domain scenario. The fine-tuning technique has been used as the transfer learning approach. The following experiments have been conducted:

  • Regarding the experiments with 100 samples, (1) we tested the extractive model trained on 70 English reports directly on 30 cases of a different language; (2) we fine-tuned the extractive model (already trained on 70 English documents) with 70 reports of a different language and then tested it on 30 cases of that language.

  • Regarding the experiments with 1000 samples, (1) we tested the model trained on 700 English samples directly on 30 cases of a different language; (2) we fine-tuned the model (trained on 700 English reports) with 70 documents of a different language and tested it on 30 cases of that language.

Our models show a good generalization capability across different language domains. The application of the fine-tuning technique has boosted the considered evaluation metrics, showing how transfer learning can be used to overcome the lack of a labeled dataset in a specific language, as in the legal domain.

Table 8 The ROUGE (Precision, Recall, F1) obtained in the abstractive summarization using the original English-written reports. The best scores are in bold
Table 9 The ROUGE-F1 scores obtained in the abstractive summarization experiments using 100 and 1000 Australian Legal Case Reports, including the original ones (written in English) and those translated into several languages (Italian, German, Spanish, Danish, French). The best scores are bolded

4.5 Abstractive summarization

4.5.1 In-domain single-language experiments

The results of the abstractive summarization tasks performed on the original Australian Legal Case Reports dataset are shown in Table 8. As expected, in the scenario with 1000 samples, the results are much higher for each metric than the ones obtained in the experiment with 100 reports. The results are similar to those obtained in the latest catchphrase extraction work proposed in the literature on the same dataset (Tran et al. 2018). This gives us the intuition that our abstractive model can produce abstracts with a certain degree of lexical and syntactic correctness.

4.5.2 In-domain multi-language experiments

Table 9 shows the results of the abstractive summarization performed on 100 and 1000 translated reports of the Australian Legal Case Reports dataset. Regarding 100 samples, the best ROUGE-3 score is achieved with the French dataset, whereas the best results for the other metrics were obtained with the Spanish dataset. Considering 1000 documents, the best scores are achieved in the Spanish translation scenario for all metrics except ROUGE-3 and ROUGE-4, where the experiment with the French dataset ranked first.

Figures 7 and 8 sum up the results of the abstractive experiments.

Fig. 7
figure 7

The ROUGE-F1 and F1% Scores obtained in the abstractive summarization experiments using 100 Australian legal case reports translated into multiple languages

Fig. 8
figure 8

The ROUGE-F1 and F1% Scores obtained in the abstractive summarization experiments using 1000 Australian legal case reports translated into multiple languages

Table 10 The balanced accuracy and F1 scores of the FactCC model assessment after the fine-tuning using the Australian Legal Case Reports dataset (sampling 100 and 3000 reports, respectively)

4.5.3 FactCC assessment

Table 10 shows the results of the FactCC model evaluation on the 60 and 1800 Australian legal case reports scenarios, respectively. The number of legal case reports is doubled since their approach applies transformations that create another set containing negative examples. As expected, the model fine-tuned with more examples (2100) achieves the best results between the two experiments, with a much higher balanced accuracy and F1 score. This suggests that the model needs a large amount of training data to replicate or surpass the results obtained in the original paper (Kryscinski et al. 2020). As the FactCC model has performed well on legal cases, it has been used as the metric for evaluating our generated abstractive summaries.

We aim to evaluate our abstractive summaries by running the previously trained FactCC model (with 2100 legal training reports). The latter has been used to classify 100 and 1000 abstractive machine-generated summaries as CORRECT or INCORRECT. If a summary is evaluated as CORRECT, we have reasonable assurance that it is fluently and coherently written. The training setup is BERT-base-uncased fine-tuning for 8 epochs with the default parameters of the FactCC model. Table 11 shows the ratio of report summaries classified as CORRECT out of all the evaluated summaries. In the evaluation of 650 abstractive summaries, nearly 50% of them are classified as consistent.

Despite the limits of the evaluation method, we believe this result shows how GPT-2 can produce abstracts with a certain degree of consistency even in the legal domain. In addition, this evaluation represents a baseline, which may be improved by future enhancements to summarization techniques.

Table 11 The percentage of inferred abstractive summaries classified as CORRECT (i.e., consistent with the related original full text) by our fine-tuned FactCC model in 3 experiments with different reports sampling (30, 300, and 650 machine-generated summaries via abstraction)

4.6 Evaluation

4.6.1 Extractive summarization

In order to compare our results with the SOTA, we searched all the works which used the Australian Legal Case Reports dataset. The latest one we found is Tran et al. (2018) (Table 12), and it has been used as the baseline for the extractive summarization tasks because it has analogies with our work:

  • It used the same data.

  • It did not involve citation data in the training process.

  • It only used sentences and words from catchphrases as the target data.

Table 12 The ROUGE-1 and ROUGE-W\(-\)1.2 (Precision, Recall, F1) scores from “Automatic Catchphrase Extraction from Legal Case Documents via Scoring using Deep Neural Networks” by Tran et al. (2018)

Comparing our results, it can be stated that our model achieved excellent performance in syntactic terms. In particular, the ROUGE-1 and ROUGE-W-1.2 scores obtained in our experiments are much higher. The primary motivation could be the use of BERT as the word/sentence embedding builder: as explained earlier, choosing a good contextualized embedding model is crucial for better performance. Another reason could be the use of a more expressive classification model: we combined recurrent (GRU) layers with CNNs, whereas only CNNs have been applied in their work. Extractive summarization of translated reports achieves better results for almost all metrics than the English scenario. In particular, the best scores have been obtained by the experiment with the Spanish dataset for the ROUGE metrics and the Danish dataset for the F1 score. This gives us the intuition that the model benefits from using the BERT multilingual model as the embedding builder, which lets us generate expressive contextualized word embeddings and allows us to work with different languages. In Table 13, we showcase a representative qualitative instance for each of the languages analyzed thus far. The efficacy of our solution in extracting sentences that include catchphrases is readily apparent.

Table 13 Qualitative examples of extractive summarization in the multilingual setting. Catchphrases are highlighted in italics

4.6.2 Abstractive summarization

Since no similar works for abstractive summarization on the same dataset have been found, the extractive summarization results have been used as the baseline. As expected, the ROUGE scores of the abstractive summarization tasks are much lower than the extractive ones. Indeed, the ROUGE score is a purely lexical measure. In the extractive scenario, if the model succeeds in classifying one sentence as relevant, then all the words of that sentence overlap between the generated and the reference summary. This does not apply to the abstractive summarization task by definition because here the goal is to produce a new summary, also using words that do not exist in the input text. However, the ROUGE scores of the abstractive summarization experiments are similar to those of Tran et al. (2018), so we have the intuition that our abstractive model has been able to generate text that is pertinent to the input, even though its consistency and fluency still had to be verified. To do that, the FactCC model has been applied, and it turned out that about 46% of 300 machine-generated reports have been classified as CORRECT. Even though our fine-tuned FactCC model has a 77% F1 score (i.e., it is affected by errors), this result gives us the intuition that our abstractive summaries have a certain degree of fluency and coherency w.r.t. their related legal report original texts, which is not reflected in the ROUGE rating. Abstractive summarization of translated reports achieves results similar to the English scenario, and even better for the Spanish and French languages. This gives us the intuition that the model keeps working well in languages other than English and could be applied to many use cases.

5 Conclusion

In this work, we tackled the automatic summarization of Australian legal case reports by presenting extractive and abstractive techniques. The abstractive solution can be considered a general approach to generating summaries despite lacking human-crafted references. Our method only requires catchphrase tags that can be obtained in several ways: (i) by applying an unsupervised extractive summarization algorithm or (ii) by manually tagging the sentences of a few documents and afterward fine-tuning a pre-trained model.

We showed that our extractive summarization results surpass the ones produced by the latest work on the same dataset in the literature (Tran et al. 2018), whereas our abstractive summarization results led to ROUGE scores similar to theirs. In addition, we assessed the consistency of our abstractive summaries using the FactCC model. Precisely, even though our fine-tuned FactCC model can make inference errors on test data (77% F1), the results suggest that our abstractive summaries have a certain degree of fluency and coherency w.r.t. their related legal sources, which is not reflected in the ROUGE rating.

Moreover, a translation step has been carried out to train our models and evaluate their ability to understand and summarize texts in several languages other than English. It turned out that the summarization of translated reports achieves better results than the English report scenario for some languages. In particular, Spanish and French generally perform better in the abstractive summarization case, whereas the summarization of German reports achieves the best results in the extractive summarization scenario with 1000 reports. Hence, our models summarize several languages effectively and could be applied to other legal case reports. Such experiments are supported by the Google Translate API and the BERT multilingual embedding model (only used in the extractive summarization scenario). Finally, it turned out that our models can also generalize in a cross-language scenario, where the model trained on English reports is applied directly to all the different languages.

The main challenges will be improving the quality of the machine-generated abstractive summaries and their evaluation. Automatic evaluation methods like FactCC (Kryscinski et al. 2020) have been proposed in the literature and represent an improvement over previous solutions to this problem, even though they still have limitations.

Future work will replicate the cross-language experiments using token embedding sequences as input, instead of applying the mean of word embeddings to build sentence embeddings, since the former led to higher results in the in-domain experiments with the original Australian legal reports. Furthermore, the methods proposed in this work could be expanded by adding more advanced data techniques to FactCC to improve the evaluation of abstractive summaries, and by repeating the experiments with other SOTA large language models (e.g., ChatGPT) to improve the abstraction quality. Finally, as presented for communication networks (Lodi et al. 2010; Moro and Monti 2012; Cerroni et al. 2013; Cerroni et al. 2015), propagating knowledge refinements (Domeniconi et al. 2014), also with entity relationship acquisition (Frisoni et al. 2020; Frisoni and Moro 2021) and event extraction (Frisoni et al. 2021), could be key when modeling complex long legal documents.