1 Introduction

The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual international competition held in conjunction with the International Conference on Artificial Intelligence and Law (ICAIL) and the International Workshop on Juris-informatics (JURISIN) [1, 5, 6, 7, 8, 9, 12, 13, 14]. COLIEE 2023 consists of four tasks: Tasks 1 and 2 are case law tasks that use datasets from the Canadian Federal Court, while Tasks 3 and 4 are statute law tasks that use the Japanese legal bar exam. In Task 3, a participant system is given a problem text and asked to retrieve the articles of the Japanese Civil Code relevant to solving it. In Task 4, a participant system is given a problem text together with its relevant articles and asked to determine whether the articles entail the problem text, answering Yes or No. We participated in Task 4.

An analysis of problem types in previous COLIEE tasks [13] showed that the COLIEE dataset contains diverse types of problems. Some are relatively easy to solve because the texts in a pair are very similar, while others are complex and difficult, requiring parsing, semantics, anaphora resolution, logic, and so on. Previous Task 4 participant systems have included rule-based and deep learning-based systems such as BERT [19], ELECTRA [11], and GNN [17]. However, previous systems have not performed well on problems that require inferences about person roles.

In this paper, we focus on person name resolution for problems in which person names/roles are represented by alphabetical letters. We propose a system that extends our previous system in COLIEE 2022, which achieved the highest accuracy among all submissions using data augmentation. Our proposed system provides two main contributions. First, while we keep the ensemble of a rule-based component and a deep learning-based component, we adopt LUKE, an entity-aware model based on RoBERTa that is trained with named entity information, as the deep learning-based component instead of BERT. Second, we fine-tune the pretrained LUKE model in multiple ways, comparing models fine-tuned on training data that contains alphabetical person names and an ensemble of differently fine-tuned models. Our formal run results show that LUKE and our fine-tuning approach for alphabetical person names are effective.

2 Related Work

LUKE [18] is a language model based on RoBERTa [10], which is in turn a derivative of BERT [2]. BERT is a deep learning model commonly used in various NLP tasks and utilizes the encoder part of the Transformer [16] architecture. LUKE additionally uses a mechanism called entity-aware self-attention: it treats not only words but also entities as independent tokens and computes intermediate and output representations for all tokens using the Transformer (Fig. 1). Because entities are treated as tokens, LUKE can directly model the relationships between entities. In this paper, we focus on person-type problems that include named entities of persons, so LUKE is expected to work well on them. Furthermore, at the time of its release, LUKE achieved the highest accuracy on several NLP tasks. We adopt LUKE as the base model and fine-tune the pretrained LUKE model.

Fig. 1 Architecture of LUKE using the input sentence “Beyonce lives in Los Angeles.” LUKE outputs contextualized representation for each word and entity in the text. The model is trained to predict randomly masked words (e.g., lives and Angeles in the figure) and entities (e.g., Los Angeles in the figure). Downstream tasks are solved using its output representations with linear classifiers. Cited from [18]
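As a concrete illustration of this entity-as-token design, the following minimal sketch feeds the sentence from Fig. 1 to LUKE through the Hugging Face transformers library. The English studio-ousia/luke-base checkpoint and the entity spans are purely illustrative; our system uses a Japanese checkpoint (see Sect. 3.1).

```python
# Minimal sketch: LUKE receives word tokens and explicit entity spans,
# and returns separate contextualized representations for both.
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyonce lives in Los Angeles."
# Character spans of the entity mentions to be treated as entity tokens.
entity_spans = [(0, 7), (17, 28)]  # "Beyonce", "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

word_repr = outputs.last_hidden_state            # contextualized word tokens
entity_repr = outputs.entity_last_hidden_state   # contextualized entity tokens
print(word_repr.shape, entity_repr.shape)
```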

Hoshino et al. [4] describes our previous work presented at COLIEE 2019: a rule-based system that parses sentences into clauses based on their original definition. The parsing results were used to extract a set of clauses, each consisting of a subject, predicate, and object, and these clause sets were then compared. They developed several modules, such as the Precise Match module, which compared the clause set of the relevant Civil Code articles with the clause set of the problem text and answered Yes if all elements of the clause sets matched. Fujita et al. [3] is our more recent work at COLIEE 2022, which proposed an ensemble of Hoshino et al.’s rule-based system and a BERT-based system. This system achieved the highest accuracy in the formal run of COLIEE 2022 Task 4. To address the issue of limited training data, we performed data augmentation such as logical inversion, replacement of person terms, and replacement of article numbers. In this paper, we extend our previous system by replacing BERT with LUKE and by modifying the ensemble method to build different fine-tuned models depending on the type of problem.

3 System

3.1 System Overview

Our system comprises a rule-based component and a LUKE-based component. The LUKE-based component uses a LUKE model fine-tuned on three different datasets: all of the training data provided by COLIEE, and two training subsets extracted according to problem type. The rule-based and LUKE-based components are integrated through an ensemble that performs binary classification, predicting Yes or No according to the higher probability value. In the COLIEE Task 4 dataset, alphabetical characters are used to represent persons in the problem text, as illustrated in Fig. 2, which shows an example of a problem involving alphabetical person characters. It is necessary to determine the relationship between each person indicated by an alphabetical character and the person role described in the Civil Code text. In the example, A in the problem text represents a person who contracted as an agent of another person, B represents a different person, and C corresponds to a counterparty, as defined in the Civil Code text. Such problems are considered to be among the most challenging to solve automatically.

We focus on problems that involve alphabetical person names and create separate LUKE models, one trained on such problems and one trained on the other problems. For the LUKE-based part, we prepare three LUKE models for comparison: a LUKE model trained on all data (LUKE-all), a LUKE model trained on problems with alphabetical person names (LUKE-person), and a LUKE model trained on problems without alphabetical person names (LUKE-nonperson).

While our previous system [4] had several modules with different matching methods for the clause sets, our previous study [3] showed that the Precise Match module, which answers Yes only when all pairs of subjects, objects, and predicates match, was the most effective. We therefore adopt the Precise Match module as our rule-based part. For the LUKE-based part, we fine-tuned a publicly available LUKE model (studio-ousia/luke-japanese-base-lite), pretrained on Wikipedia articles, to output binary probabilities of Yes or No given a problem text and a relevant Civil Code article as input.

Fig. 2 An example of a problem in which alphabetical person characters appear

In this section, we describe the design of our system as follows. First, we create additional training data from the Civil Code articles (3.2). Second, after preprocessing the data, we select the Civil Code article most relevant to solving a given problem statement, based on the similarity of their texts (3.3). Third, we expand the training data by performing logical inversion and replacing person terms (3.4). Fourth, we fine-tune the LUKE model on these datasets, splitting them by year and creating multiple models for all possible combinations of training and validation data (3.5).

Based on the methods above, we created three different submission models for our formal run: KIS1, KIS2, and KIS3, which were designed for different types of problems (3.6). Among the three formal run submissions, KIS2 was our proposed system. KIS1 was an ensemble of the rule-based system and a LUKE-based model trained on all of the training data. KIS2 was an ensemble of KIS1 and a model trained specifically for problems in which alphabetical person names appear. KIS3 was an ensemble of a model trained specifically for problems in which alphabetical person names appear and a model trained specifically for problems in which they do not. Figure 3 illustrates these relationships. We applied our article selection preprocess (3.3) to the formal run test dataset as well.

Fig. 3 System overview

3.2 Create Training Data from Article(s)

To increase the size of the official training dataset, we created an additional training dataset from the Civil Code articles alone, without problem texts. In this subsection, we refer to the relevant articles as the premise (t1) and the problem text as the hypothesis (t2) to avoid confusion, since in this additional dataset both sides of a pair are taken from the articles. First, we divided the distributed Civil Code articles into sections and created pairs of identical sections, setting their correct answer labels to Yes. For example, “A minor must obtain the consent of his/her legal representative to perform a legal act. However, this shall not apply to acts merely to obtain rights or to be relieved of obligations.” (Civil Code Article 5) is paired with the same paragraph and labeled Yes. If the text of an article contains an exception sentence or proviso, such as “Provided, however, [...], this shall not apply.”, we divide the original article text into the text before the exception sentence (the principle part) and the exception sentence itself (the proviso part). If the proviso describes an act, person, or right, we manually replace the corresponding act, person, or right in the principle part with the one in the proviso part, and then invert the logic of the predicate as described in Sect. 3.4. In the example in Fig. 4, the proviso of Civil Code Article 5, “However, this shall not apply to acts by which a minor merely acquires a right or is relieved of a duty.”, was rewritten as “A minor need not obtain the consent of his or her legal representative to commit an act merely to obtain a right or to be relieved of a duty.” The subject normally appears in the principle part, but it sometimes appears in the proviso part. In that case, we invert the affirmation/negation of the principle part using the method described later (Sect. 3.4) and add it to the training dataset, sharing the same original premise (t1). Figure 5 shows an example.

Fig. 4 Divide into principle and exception

Fig. 5 \(<\text{t1}>\), \(<\text{t2}>\) pairs created using exceptions
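The principle/proviso split described above can be sketched roughly as follows, under the assumption of a simple textual marker. The actual articles are Japanese, where the proviso begins with “ただし”; the English marker is used here only for readability, and the manual replacement of acts, persons, and rights is not automated in this sketch.

```python
import re

# Hypothetical sketch of the principle/proviso split described in Sect. 3.2.
PROVISO_MARKER = re.compile(r"\s*(Provided, however,|However,)", re.IGNORECASE)

def split_principle_proviso(article_text: str):
    """Return (principle, proviso); proviso is None if the article has no exception."""
    match = PROVISO_MARKER.search(article_text)
    if match is None:
        return article_text.strip(), None
    principle = article_text[:match.start()].strip()
    proviso = article_text[match.start():].strip()
    return principle, proviso

principle, proviso = split_principle_proviso(
    "A minor must obtain the consent of his/her legal representative to perform "
    "a legal act. However, this shall not apply to acts merely to obtain rights "
    "or to be relieved of obligations."
)
# The proviso is then merged back into the principle with its predicate
# logically inverted (see Sect. 3.4) to create an additional <t1, t2> pair.
```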

3.3 Preprocessing and Article Selection

First, we apply the following preprocessing steps to the articles and then select the relevant ones. A problem statement may have multiple related articles. If we concatenated the texts of all these articles as input, the input to the model could become too long, exceeding the upper limit (512 tokens in our case), and important parts could be lost by truncation. To address this issue, we split the relevant articles into sections (each article consists of one or more sections) and then create all possible combinations of the divided sections (Fig. 6). We discard any combination in which the total number of tokens of the combined sections and the given problem text exceeds the upper limit.

Fig. 6 An example of combination reconstruction
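A minimal sketch of this combination step is shown below; the function name and the tokenizer object are assumptions made for illustration.

```python
from itertools import combinations

MAX_TOKENS = 512  # input limit of the model

def candidate_article_texts(sections, problem_text, tokenizer):
    """Enumerate every non-empty combination of article sections whose
    concatenation with the problem text fits within the input limit."""
    candidates = []
    for r in range(1, len(sections) + 1):
        for combo in combinations(sections, r):
            combined = " ".join(combo)
            n_tokens = len(tokenizer.encode(combined, problem_text))
            if n_tokens <= MAX_TOKENS:   # discard over-length combinations
                candidates.append(combined)
    return candidates
```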

If the generated text contains reference notations such as “preceding paragraph” or “Article XX”, we search the given relevant articles for the referred article and replace the reference notations with the text from the referred article (as shown in Fig. 7). The replaced version is then added to the training dataset. Notations such as “listed below” are substituted with the specified items in the article. Figure 8 provides an example of this process.

Fig. 7 An example of article reference

Fig. 8 An example of substituting each item for “listed below”
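The reference resolution can be sketched roughly as follows. The regular-expression patterns are simplified English stand-ins for the Japanese notations our system actually handles, and the helper names are hypothetical.

```python
import re

def resolve_references(section_text, article_lookup, preceding_paragraph):
    """Replace 'Article XX' and 'preceding paragraph' notations with the
    referred text, when it is available among the given relevant articles."""
    def article_repl(match):
        number = match.group(1)
        # Keep the original notation if the referred article is not available.
        return article_lookup.get(number, match.group(0))

    text = re.sub(r"Article (\d+)", article_repl, section_text)
    text = text.replace("the preceding paragraph", preceding_paragraph)
    return text
```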

As shown in Fig. 4, the proviso part of an article describes an exceptional situation in which the principle part does not apply. To understand the meaning of the proviso part, we need to include the principle part as well. Therefore, we concatenate the proviso part with its principle part, inverting the affirmation/negation of the latter. If the proviso part includes an act, person, or right, we replace the corresponding item in the principle part with the one in the proviso part. Among these preprocessed articles, we select the article most relevant to solving the given problem by the similarity scores of the vectors obtained with Sentence LUKE (sonoisa/sentence-luke-japanese-base-lite). Sentence LUKE is a tool for creating sentence vectors using the LUKE model (in other words, a LUKE version of Sentence-BERT [15]), trained on Japanese Wikipedia with a Siamese network. Before computing similarities, we remove the suffixes of predicates, which may contain negation expressions, because we search for the most similar content regardless of whether it is affirmative or negative. Figure 9 shows an example.

Fig. 9 An example of article selection
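A hedged sketch of this selection step is shown below. Loading the Sentence LUKE checkpoint with AutoModel and applying mean pooling is an assumption made for brevity (the model card ships its own wrapper class), and the predicate-suffix removal described above is omitted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sonoisa/sentence-luke-japanese-base-lite")
model = AutoModel.from_pretrained("sonoisa/sentence-luke-japanese-base-lite")

def embed(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)        # mean pooling

def select_article(problem_text, candidate_articles):
    """Return the candidate article whose vector is most similar to the problem."""
    vecs = embed([problem_text] + candidate_articles)
    sims = torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:])
    return candidate_articles[int(sims.argmax())]
```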

3.4 Data Augmentation

The data augmentation in our previous COLIEE 2022 system [3] consisted of two expansions, negation expansion and person term replacement, which we describe below. In this year’s formal run, we added more negative words and person terms to our manual dictionary. For negation expansion, we create a new sample by reversing the logic at the end of a sentence, together with its Yes or No answer, using a predefined list of affirmative and negative expression pairs. We apply this expansion both to the pairs created from the Civil Code articles as described in the previous sections and to the given problem texts. However, we do not apply it to problems with a gold standard answer of No, since negating the end of a sentence does not always turn a No into a Yes. The COLIEE problems sometimes use alphabetical characters, such as A or B, to represent person names. Our person term replacement expansion addresses this by creating a dataset from the training data in which person names are replaced with alphabetical characters. We assign the letters in order of appearance, mapping identical person names to identical characters.
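The two expansions can be sketched as follows; PERSON_TERMS and NEGATION_PAIRS are small English stand-ins for our manually curated Japanese dictionaries.

```python
import string

PERSON_TERMS = ["the seller", "the buyer", "the agent", "the principal"]
NEGATION_PAIRS = [("may not", "may"), ("must not", "must")]

def replace_person_terms(text):
    """Replace person terms with A, B, C, ... in order of first appearance,
    so that identical terms always map to the same letter."""
    mapping, letters = {}, iter(string.ascii_uppercase)
    for term in PERSON_TERMS:
        if term in text and term not in mapping:
            mapping[term] = next(letters)
    for term, letter in mapping.items():
        text = text.replace(term, letter)
    return text

def invert_negation(text, label):
    """Create a logically inverted sample; applied only to Yes-labelled problems,
    since inverting a No-labelled problem does not reliably yield a Yes."""
    if label != "Yes":
        return None
    for neg, aff in NEGATION_PAIRS:
        if neg in text:
            return text.replace(neg, aff, 1), "No"
        if aff in text:
            return text.replace(aff, neg, 1), "No"
    return None  # no applicable expression at the end of the sentence
```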

3.5 Combinatorial Split of Training and Validation Dataset

To fully utilize the COLIEE official training dataset, we created multiple models trained on different parts of it. We split the official dataset in several patterns following a cross-validation scheme: each 2-year period is selected in turn as the validation dataset, and the rest of the official dataset is used as the corresponding training dataset. After fine-tuning a model for each pattern, we applied an ensemble of these multiple models. We chose 2 years as the splitting unit because splitting by single years would produce too many combinations. Figure 10 illustrates this split method.

Fig. 10 A conceptual figure of the training data split
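A minimal sketch of the 2-year split is shown below; the list of year identifiers is illustrative and does not exactly reproduce the official dataset coverage.

```python
# Illustrative year labels in the COLIEE naming scheme (assumed range).
YEARS = ["H18", "H19", "H20", "H21", "H22", "H23", "H24", "H25",
         "H26", "H27", "H28", "H29", "H30", "R01", "R02", "R03"]

def two_year_splits(years):
    """Yield (train_years, valid_years) pairs, holding out each consecutive
    2-year block once as the validation set."""
    for i in range(0, len(years), 2):
        valid = years[i:i + 2]
        train = [y for y in years if y not in valid]
        yield train, valid

# One model is fine-tuned per split; their predictions are later ensembled.
for train_years, valid_years in two_year_splits(YEARS):
    print(valid_years, "held out")
```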

3.6 Fine-Tune for Alphabetical Person Names

When alphabetical letters are used as person names in the given problem text, a different approach is required, because it becomes necessary to determine which person role in the relevant Civil Code article each alphabetical character corresponds to. We therefore fine-tune a model specifically for such problems, and another model for problems in which alphabetical person names do not appear. Each model is internally an ensemble of the combinatorially split fine-tuned models described in Sect. 3.5, and the preprocessing steps described in Sects. 3.1 to 3.4 are applied before fine-tuning. We regard a problem as an alphabetical person name type problem if it contains any single alphabetical character (the original text is in Japanese except for these characters). As mentioned earlier, KIS2 and KIS3 use the model fine-tuned on problems containing alphabetical characters, while KIS1 uses the model fine-tuned without them. For binary classification, a fully connected linear transformation is applied to the output of the last layer at the position of the “<s>” token (the “[CLS]” token in the case of BERT), producing scores for Yes and No; the larger score determines the answer. During fine-tuning, the scores are converted into label probabilities with the softmax function, and the loss is computed with cross-entropy.
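The classification head can be sketched as follows; this is a hand-written equivalent of a standard sequence-classification head, not our exact implementation, and the checkpoint name is the one cited in Sect. 3.1.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class YesNoClassifier(nn.Module):
    """Binary Yes/No head on top of the "<s>" representation."""
    def __init__(self, model_name="studio-ousia/luke-japanese-base-lite"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.linear = nn.Linear(self.encoder.config.hidden_size, 2)  # Yes / No

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_repr = hidden[:, 0]              # "<s>" token ("[CLS]" for BERT)
        logits = self.linear(cls_repr)       # classification scores
        if labels is None:
            return logits
        loss = nn.functional.cross_entropy(logits, labels)  # softmax + CE
        return loss, logits
```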

3.7 Ensemble Prediction

Finally, we perform an ensemble of our rule-based part and our LUKE-based part. The rule-based part (the Precise Match module) is the same as in our previous work; it has high precision but can answer only a small number of problems. Therefore, we first apply the rule-based part when it is applicable, and fall back to the LUKE-based part otherwise. For the LUKE-based part, we prepared three models: LUKE-all (fine-tuned on all of our datasets), LUKE-person (fine-tuned on problems with alphabetical person names), and LUKE-nonperson (fine-tuned on problems without alphabetical person names). KIS3 applies LUKE-person when the problem includes alphabetical person names and LUKE-nonperson when it does not. KIS2 applies LUKE-person in the same way but uses LUKE-all when the problem does not include any alphabetical person names. KIS1 always applies LUKE-all when the rule-based part is not applicable.
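The routing logic of the three submissions can be sketched as follows. The predictor callables are placeholders for the components described above, and applying the rule-based stage first for all three submissions is a simplification of Sects. 3.1 and 3.7.

```python
def is_person_type(problem_text: str) -> bool:
    """A problem is 'person type' if it contains any single Latin letter
    (the original text is Japanese except for these role symbols)."""
    return any(ch.isascii() and ch.isalpha() for ch in problem_text)

def predict(problem, article, predictors, submission="KIS2"):
    """predictors: dict of callables 'rule', 'all', 'person', 'nonperson';
    'rule' returns 'Yes'/'No' or None when it cannot answer."""
    answer = predictors["rule"](problem, article)   # high precision, low coverage
    if answer is not None:
        return answer
    if submission == "KIS1":
        return predictors["all"](problem, article)
    if is_person_type(problem):                     # KIS2 and KIS3
        return predictors["person"](problem, article)
    key = "all" if submission == "KIS2" else "nonperson"
    return predictors[key](problem, article)
```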

4 Experiments and Results

4.1 Fine-Tune Parameters

We performed fine-tuning with the following parameters: a maximum token length of 512, a batch size of 32, a learning rate of 1e-5, and a maximum of 10 epochs, with training terminated early by early stopping.
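A sketch of this configuration with the Hugging Face Trainer is shown below. The dataset objects are assumed to be tokenized elsewhere with max_length=512, the early-stopping patience is an assumption (it is not stated above), and exact argument names may differ slightly across transformers versions.

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

def accuracy_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def build_trainer(train_dataset, valid_dataset):
    """Trainer configured with the hyperparameters listed above."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "studio-ousia/luke-japanese-base-lite", num_labels=2)
    args = TrainingArguments(
        output_dir="coliee-task4-luke",
        per_device_train_batch_size=32,
        learning_rate=1e-5,
        num_train_epochs=10,              # upper bound; early stopping ends sooner
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,      # examples tokenized with max_length=512
        eval_dataset=valid_dataset,
        compute_metrics=accuracy_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumed value
    )
```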

4.2 COLIEE 2023 Formal Run Results

Table 1 shows the results of all teams in the COLIEE 2023 Task 4’s formal run, where KIS is our team name.

Table 1 COLIEE 2023 Task 4’s formal run results for each participant’s submission. # represents the number of correct answers; Acc represents Accuracy. The submission IDs in bold are our submissions

4.3 Previous COLIEE Formal Run Results

Table 2 shows the results of our experiments using previous formal runs of COLIEE 2019, 2020, and 2021 (test datasets are H30, R01, and R02, respectively) as required by the organizers.

Table 2 Numbers of correct answers and accuracies in previous formal run datasets

4.4 Comparison of BERT and LUKE

Table 3 shows the results of experiments on the formal run and the past formal runs using BERT and LUKE. Each cell shows the number of correct answers and the total number of problems for the datasets from H30 to R04: the all column counts all problems, the person column counts problems containing alphabetical person name characters, and the nonperson column counts problems without them (Table 4). The results show that LUKE answers more problems correctly than BERT on H30 and R04. In particular, on R04, LUKE improved performance on the alphabetical person name problems. On the other hand, BERT performed better on R01 and comparably on R02.

Table 3 The number of correct answers by problem type (person: person-type problems, nonperson: others, all: both person and nonperson) for BERT and LUKE
Table 4 The number of correct answers by problem type (other than person) for BERT and LUKE

4.5 Evaluation of Fine-Tuned Models Without Ensemble Using Previous Formal Runs

Table 5 shows the evaluation results of the individual fine-tuned LUKE models on the formal run of COLIEE 2023 and the formal runs of the past three years. Each fine-tuned model was evaluated independently, without any ensemble. We evaluated the models separately on the problems with alphabetical person names (person) and the others (nonperson). The results show that the person model, which is fine-tuned on person-type problems, worked better than the other models on all of the datasets.

Table 5 Number of correct answers for the three fine-tuned LUKE models (all, person, and nonperson) on each training/test dataset (H30, R01, R02, and R04), divided into person-type problems (P) and others (N)

5 Discussion

The individual results of the fine-tuned models (Table 5) demonstrate that fine-tuning was effective for the corresponding type of problems but not for the other types. Our team’s formal run results (Table 1) and our experiments using past formal runs (Table 2) also showed that KIS2, the ensemble using the model fine-tuned for alphabetical person names, achieved the highest score.

Table 3 shows that LUKE and BERT differ in their percentages of correct answers. We analyzed the patterns in which only one of LUKE or BERT answered a problem correctly. Figure 11 shows an example problem that can be answered without analyzing the alphabetical person names, even though such names appear in the problem text; problems of this kind could be answered correctly by BERT. As shown in Fig. 12, R04–08-A is an example of a person name problem where LUKE was correct and BERT was not. In this problem, the gold label is “No”, because “B consented to this” in the problem text differs from “a third party consented to this” in the article: B is an agent, and C is the third party. LUKE was able to predict the label “No” for this problem. This example suggests that LUKE may understand personal relationships better than BERT. While LUKE by itself only slightly improved performance compared with BERT (Table 3), it works significantly better when fine-tuned on person-type problems, which corresponds to the highlighted cells in Table 5: in every case, the person-type problems (P) were better solved by the person fine-tuned model than by the other models. By manually checking the problems, we found that among the 13 problems that were answered correctly by LUKE and its person fine-tuned model but not by BERT, 11 were of the type described above, requiring analysis of the alphabetical person names.

Fig. 11 An example problem which can be solved without analyzing the alphabetical person names

Fig. 12 An example of a problem where LUKE provided the correct answer

We analyzed the results of our article selection by Sentence LUKE and found an unsuccessful example, shown in Fig. 13. In this example, our system selected Article 5, “A minor shall obtain the consent of his/her legal representative in order to perform a legal act. Any legal act contrary to the provisions of the preceding paragraph may be revoked”, while Article 124-2, item 2 was required to solve the problem. The non-relevant article selected by our system shares tokens such as “minor” and “consent” with the problem text, but the relevant article also shares these tokens. The failure may instead stem from abstract paraphrases like “Any legal act contrary to the provisions of the preceding paragraph may be revoked”, which inflate the cosine similarity. Pretraining and fine-tuning on legal documents, or paraphrasing such expressions into everyday language as preprocessing, may help mitigate this issue.

Fig. 13 Examples of article selection failures

Next, we compare the extent to which our three data extension methods contributed to improving the accuracy of the model. The three methods applied for training data augmentation are: (i) the data created from the Civil Code articles described in Sect. 3.2, (ii) the negation expansion, and (iii) the person term replacement described in Sect. 3.4. As a comparative analysis, we fine-tuned the BERT model after applying each of the three data extension methods individually. We also compare the fine-tuned BERT models with all three extensions (our proposed setting) and without any of them, giving five patterns in total. We use the dataset of each of the years H30, R01, R02, and R04 for evaluation and the datasets of the years prior to the evaluation year for training; these training-evaluation pairs correspond to the past formal run settings. Within each training dataset, we performed 11-fold cross-validation, resulting in 11 fine-tuned models, and the final predictions are decided by majority vote among these 11 models; a sketch of this vote appears after Table 7. Using our manually created problem type classification, we counted the number of correctly answered problems for each fine-tuned model and each problem type (Table 6 reports the counts for H30, R01, R02, and R04, and Table 7 shows the totals for each model and problem type). When expanding the data using articles, accuracy improvements were observed for many problem types. The negation expansion contributed significantly to problems involving negation, as expected. Data augmentation by person replacement was expected to contribute to the Person problem type (where person names are represented by alphabetical symbols such as A and B); H30 and R01 showed a positive contribution, while we could not observe positive contributions in the other years. These results suggest that, even after augmentation by person replacement, the training data for the Person problem type are still insufficient, as the alphabetical symbols can stand for a variety of different roles.

Table 6 Number of correctly answered problems per problem type, by year and model
Table 7 Number of correctly answered problems per problem type, by model
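The majority vote over the 11 cross-validation models mentioned above amounts to the following sketch, which assumes string labels.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of 'Yes'/'No' answers, one per fine-tuned model."""
    return Counter(predictions).most_common(1)[0][0]

# With 11 models the vote cannot tie, e.g.:
# majority_vote(['Yes', 'No', 'Yes', 'Yes', ...]) -> 'Yes'
```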

6 Conclusion and Future Work

We extended our previous system from COLIEE 2022 by performing an ensemble of a rule-based part and a LUKE-based part for COLIEE 2023 Task 4. We divided the problems into two types according to whether they include alphabetical person names, and fine-tuned models on three different datasets: the two problem types and all problems. We confirmed that our model fine-tuned for alphabetical person names improved the accuracy on those types of problems, and our system achieved an accuracy of 0.69 in the formal run of COLIEE 2023 Task 4. Our future work includes improving the data split method, handling other types of problems, and improving the accuracy of article selection.