1 Introduction

A legal textual entailment task checks whether a given statement is entailed by the relevant law article(s); it is one of the oldest tasks in the Competition on Legal Information Extraction/Entailment (COLIEE) [3, 4, 11, 12, 16]. In COLIEE 2021 [10], Task 4 is a legal textual entailment task that uses Japanese bar exam questions and civil code articles. Figure 1 shows an example of such a question and article pair.

Fig. 1 Example of a question and article pair for COLIEE legal textual entailment tasks

To solve this entailment problem, it is necessary to recognize two different relationships between the questions and the given articles. One is a relationship based on the logical structure, and the other is a semantic relationship between the words used in the questions and the ones used in the articles. In the early stage [4], the analysis results of natural language processing (NLP) tools (e.g., morphological parsers and syntactic parsers) were used with pattern-based or machine-learning-based systems to identify the former relationship. For handling the latter relationship, machine-readable thesauruses such as WordNet [7] and distributed representations of words such as Word2Vec [6] were used. Recently, the development of deep-learning-based NLP tools has made it possible to handle both relationships at once. For example, bidirectional encoder representations from transformers (BERT) [1] is a tool that can analyze semantic context and solve classification problems such as the identification of entailment and of the relevancy between queries and documents. One of the characteristics of BERT is that it provides a general semantic analysis model that can be fine-tuned for a particular task. In COLIEE 2020 [11], BERT-based systems achieved the best performance in the statute law information retrieval task [13] and the entailment task [9]. However, because of the small size of the training data, the best-performing system for the entailment task [9] used a large volume of legal texts that were not directly related to the statute law task.

To solve this training data size problem, we propose a data-augmentation method that increases the training data using the original civil law articles. This method generates article-and-question pairs systematically. For COLIEE 2021, we also propose a BERT-based ensemble system for legal textual entailment that is trained on the training data provided by the organizers together with our augmented data. This system achieved the best accuracy (0.7037) in COLIEE 2021 Task 4. We also conduct additional experiments to understand the characteristics of the system.

The rest of this paper is organized as follows. Section 2 introduces related work, and Sect. 3 introduces our data-augmentation method and BERT-based ensemble legal textual entailment system. Section 4 evaluates our system using the COLIEE Task 4 submission results and additional experiments, and Sect. 5 concludes the paper.

2 Related Works

Because bar exam questions include questions about real use cases of articles, it is necessary to handle the correspondence between the concepts used in the articles and those real use cases. In the early stage of COLIEE, several attempts were made to utilize resources for such semantic matching, such as machine-readable thesauruses and distributed representations of terms. For example, Kim et al. [5] used Word2Vec [6] as a resource for distributed representation, and Taniguchi et al. [15] proposed a method that utilizes WordNet [7] as a machine-readable thesaurus. However, because those methods cannot use context to determine the meaning of terms, they did not exploit such resources effectively.

Recently, Devlin et al. [1] proposed BERT, a deep-learning-based NLP tool that is pretrained for general semantic recognition tasks with large corpora (such as the whole contents of Wikipedia). Based on this training process, BERT can handle the meaning (distributed representation) of a word in a sentence by considering its context. In addition, BERT can be adapted to various tasks through a fine-tuning process that requires comparatively small amounts of training data. Because a pretrained BERT model contains rich information about word semantics, the fine-tuned models may be able to handle semantic information even when the words themselves are not included in the fine-tuning data.

In COLIEE 2020, a BERT-based system (JNLP [9]) achieved the best performance on the legal textual entailment task. The authors proposed a lawfulness classification approach that classifies the appropriateness of a legal statement; it was trained on a large collection of legal sentences, including the bar exam questions provided by the organizers, without considering the given relevant articles. This approach worked well for COLIEE 2020 because of the large amount of training data. In addition, they pointed out that it was difficult to select an appropriate model for unseen questions using validation data because of the significant variability of the questions.

To increase the size of the training data, the data-augmentation approach is widely used in the field of image recognition [14]. However, few studies on data augmentation have been conducted for legal textual entailment tasks. Min et al. [8] proposed a syntactic data-augmentation method to increase the robustness of natural language inference. Their method systematically creates positive and negative data from a correct inference sentence by syntactic operations such as passivization and the inversion of subject and object. Evans et al. [2] proposed a data-augmentation method for logical entailment. In their framework, negative and positive data are increased by modifying logical inference rules using symbolic vocabulary permutation, including an operation that makes implication rules whose condition and derived parts share the same contents. These approaches are useful references for designing data-augmentation methods for legal textual entailment.

3 BERT-based Ensemble Legal Textual Entailment System

Based on the discussion of the previous best-performing system (JNLP [9]), we propose a system with the following two characteristics.

  1. Textual entailment approach with data augmentation

     We assume that the reason why the lawfulness classification approach outperformed the textual entailment approach in the last COLIEE is the size of the training data. Therefore, if we provide larger training data through data augmentation, the textual entailment approach may outperform the lawfulness classification approach because it uses the most important information (the relevant articles).

  2. Ensemble of multiple BERT-based model outputs

     As discussed above, it is difficult to select appropriate models for the task by evaluating only the validation data. From our preliminary experiment (the details are discussed in Sect. 4.2), we confirmed that the characteristics of the fine-tuned BERT-based models differ and that the accuracy on the validation data is not directly related to that on the test data. We assume that this result reflects the different characteristics of each model and that an appropriate selection of the generated models for the ensemble may improve the performance on unseen questions.

3.1 Data Augmentation using Articles

In the deep learning framework, it is common to enlarge the training data by modifying existing data (data augmentation). However, it is important to define an appropriate data-augmentation method to obtain the best results. Related to the legal textual entailment task, data-augmentation methods have been used for natural language and logical inference, as introduced in Sect. 2. However, it is difficult to apply these methods directly to the legal textual entailment data.

In this study, we assume that there are two types of mismatch to consider when judging whether an article entails a given question. One is a semantic mismatch, and the other is a logical mismatch (the appropriateness of the juridical decision).

For example, consider training data based on the following article (a part of Article 9): “A juridical act performed by an adult ward is voidable.”

  1. “A juridical act performed by an adult is voidable.”

     The article does not entail this question because of a semantic mismatch (“adult” is not “adult ward”).

  2. “A juridical act performed by an adult ward is not voidable.”

     The article does not entail this question because of the inappropriateness of the juridical decision (“voidable” vs. “not voidable”).

  3. “A juridical act performed by an adult is not voidable.”

     We cannot judge whether this question is true (it may require another article). However, the given article does not entail the question.

For the semantic mismatch case (1), it is difficult to select appropriate term pairs (such as “adult” and “adult ward”) whose replacement produces such a semantically mismatched sentence. For case (3), which combines both mismatches, it is also difficult to make such data and to use them as negative examples, because the type of error behind the entailment judgment cannot be identified.

By contrast, if we pair correct sentences with their logically mismatched counterparts (case (2)), the examples may help the model learn the importance of comparing the juridical decision of the article with that of the question.

Among the bar exam questions, there are questions that check whether candidates understand an article correctly. This type of question uses a sentence that is almost identical to the article, or one in which the juridical decision of the original sentence is flipped. This suggests that augmented data can easily be constructed automatically from the sentences about juridical decisions in the articles. By contrast, it is not as easy to use the gold standard data (question and article pairs). For the negative examples (No: not entail), it is not easy to create the corresponding positive (Yes: entail) questions, because the negative examples include all three types (1–3) of mismatch. For the positive examples, even if we create negative examples by flipping the juridical decision as in case (2), the result can behave as mismatch case (3) when the system fails to map the condition part of the question to that of the article. Therefore, we decided to use only the article data as a resource for data augmentation.

For the automatic construction of augmented data from articles, simply splitting the text into sentences using end-of-sentence markers (“。”: period) and paragraph boundaries is not appropriate, because some sentences refer to previous juridical decisions, to conditions, or to lists of cases to which the previous juridical decisions apply. Thus, we propose a method that generates sentence pairs from articles in the following three steps.

  1. Identify the logical structure of a sentence

     Most articles contain texts representing juridical decisions and lists of conditions for the cases where such decisions apply. This process separates the sentence parts from the condition-list parts.

  2. Generate sentences for augmentation

     Texts for juridical decisions are split into sentences using end-of-sentence markers and paragraph information. Sentences that refer to juridical conditions and/or decisions are expanded using information from the previous sentence. A condition-list part is used to expand the sentence that refers to it.

  3. Make augmentation data

     All generated sentences are used for making augmented data. For the positive data, each generated sentence is used for both the question and article parts. For the negative data, we use sentences with the juridical decision flipped for the question parts.

Details of this process can be summarized as follows.

  1. Identification of logical structures

     Most articles contain texts representing juridical decisions and lists of conditions for the cases where such decisions apply. Figure 2 shows an example of an article used to explain this identification process. Most of the texts are segmented into sentences by splitting on end-of-sentence markers and paragraph information (“A”–“F”). A list of cases is described as an enumerated list using numbers (“一” (i), “二” (ii), “三” (iii), …). As a result, we obtain three sentences (“A”–“C”) and one condition list (“D”–“F”). The condition-list parts are merged with the sentence that refers to the condition list. In this case, Sentence C (just before the condition list) refers to the condition list using the phrase “次に掲げるとき” (“any of the following cases apply”).

  2. Generating sentences for data augmentation

     In this step, we generate sentences that contain both juridical decisions and conditions. Most sentences already have both parts, but several exceptional cases refer to other parts of the text. The following are the types of such omissions, with an explanation of how the system modifies such sentences so that each generated sentence contains its decision and condition parts.

    • Omission of the juridical decision for another or exceptional case

      Some sentences refer to the juridical decision of the previous sentence with a particular style. One is “同様とする” (“the same applies”), referring to the same decision; the other is “この限りでない” (“this does not apply”), referring to the flipped decision. These parts are replaced using the juridical decision of the previous sentence. For the latter case, the flipped decision is used for the replacement ((A+B) in Fig. 3).

    • Omission of the conditions for a juridical decision

      Some sentences refer to the conditions of previous sentences. There are two types of references. One refers to previous texts using an article number or the previous sentence, and the other refers to the list of conditions. For the former type, we replace a reference to the previous sentence (“この場合において” (“In this case”)) using only the condition parts of the previous sentence. For the latter case (“次に掲げる場合” (“In the following cases”)), we replace these parts using the referred condition parts. When there are two or more conditions (e.g., a reference to the condition list), we generate a sentence for each condition ((C'), (C''), (C''') in Fig. 3).

     In addition, to make the sentences simpler and the training data better, additional information such as “*の規定にかかわらず” (“Notwithstanding ...”) and information in brackets (e.g., metadata and paraphrases of terms) is removed from the sentences.

  3. Making training data

     For the positive data, every sentence generated in the previous step is used to make a positive sample by using it for both the question and article parts. For the negative examples, we flip the juridical decision parts of the sentences using manually constructed pattern matching (e.g., adding or removing “ない” (not) for the verbs, and replacing terms with their antonyms (“有効” (effective) and “無効” (ineffective))). When there is no appropriate expression for the juridical decision in a sentence, no negative example is generated. A pair of a negative sentence and the original sentence is used for the question and article parts, respectively.

Using this procedure, we constructed 3,351 training examples (positive: 1,687, negative: 1,664) for data augmentation. However, these data are not exactly the same as the data used for the COLIEE 2021 submission because of bugs in the data-augmentation program used for the submission. The total number of augmented examples used for the submission was 3,331 (positive: 1,677, negative: 1,654), of which 2,540 are identical to the newly generated ones. The most significant difference between the data generated by this procedure and the submitted data is the treatment of antonyms. For example, the previous program generated “無効としない” (“is not ineffective”) as a negative example for “無効とする” (“is ineffective”), whereas the new program generates “有効とする” (“is effective”) by using an antonym dictionary. The former sentences are unnatural for a human reader, but this may not be a serious problem for the BERT model used in the submission system when identifying the logical mismatch.
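To make the decision-flipping step concrete, the following is a minimal sketch of pattern-based negative-example generation. The negation patterns and the antonym pairs shown here are illustrative assumptions, not the complete rule set used for the submission; the actual program is available from the URL below.

```python
# Minimal sketch of flipping juridical decisions to create negative examples.
# The patterns and the antonym pairs below are illustrative assumptions.

ANTONYMS = {"有効": "無効", "無効": "有効"}  # effective <-> ineffective

NEGATION_PATTERNS = [  # add or remove "ない" (not) at the decision part
    ("ことができる。", "ことができない。"),
    ("ことができない。", "ことができる。"),
    ("とする。", "としない。"),
    ("としない。", "とする。"),
]


def flip_decision(sentence):
    """Return the sentence with its juridical decision flipped, or None
    when no decision expression is found (no negative example is made)."""
    for term, antonym in ANTONYMS.items():
        if term in sentence:
            return sentence.replace(term, antonym, 1)
    for pattern, flipped in NEGATION_PATTERNS:
        if sentence.endswith(pattern):
            return sentence[: -len(pattern)] + flipped
    return None


def make_augmented_pairs(sentences):
    """Build (question, article, label) triples from the generated sentences."""
    pairs = []
    for s in sentences:
        pairs.append((s, s, 1))             # positive: sentence paired with itself
        negative = flip_decision(s)
        if negative is not None:
            pairs.append((negative, s, 0))  # negative: flipped question vs. original article
    return pairs
```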

The programs for generating the augmented data and other information related to this experiment, including the experimental settings, are available from https://www-kb.ist.hokudai.ac.jp/COLIEE-DA/.

Fig. 2 Examples of identification of sentences (Japanese/English)

Fig. 3 Examples of generating sentences for data augmentation (Japanese/English)

3.2 Ensemble Methods using BERT-based Entailment Systems

We implemented a BERT-based entailment system using the ordinary BERT fine-tuning process proposed in [1]. We concatenated the question and the article using the sentence-separator token (\(\left[ SEP\right]\)) and fed the result into the BERT model to estimate whether the article entails the question (positive: 1) or not (negative: 0). We use the BERT-base model of BERT-Japanese.

A fine-tuned model accepts a pair of a question statement and (an) article(s) as input and returns a score (0 to 1) for the probability that the article(s) entail the statement. When the score is larger than 0.5, the system judges the pair as positive (Yes: entail); otherwise, the pair is judged as negative (No: not entail).
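The following is a minimal sketch of this inference step using the Hugging Face transformers library. The checkpoint name is an assumed identifier for BERT-Japanese; in practice, the fine-tuned checkpoint produced as described in Sect. 4.1 would be loaded instead of the raw pretrained model.

```python
# Minimal sketch of the entailment model at inference time.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cl-tohoku/bert-base-japanese"  # assumed identifier for BERT-Japanese

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def entailment_score(question: str, article: str) -> float:
    """Probability that the article entails the question.

    The question and article are encoded as one input separated by the
    [SEP] token ("[CLS] question [SEP] article [SEP]").
    """
    inputs = tokenizer(question, article, truncation=True,
                       max_length=256, return_tensors="pt")  # 256: see Sect. 4.1
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


def predict(question: str, article: str) -> str:
    # Scores above 0.5 are treated as positive (Yes: entail).
    return "Yes" if entailment_score(question, article) > 0.5 else "No"
```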

From the preliminary experiments (the details are discussed in Sect. 4.2), we confirmed that there is no significant correlation between the accuracy on the validation data and the accuracy on the test data, reflecting the variability of the questions (the difference between the validation and test sets). Therefore, it is difficult to select a single model that estimates the unseen questions well.

To reduce the effect of this variability, we propose to use an ensemble learning framework to merge the system outputs. Because of the nondeterministic characteristics of the BERT fine-tuning process and the different data sets, we expected that the trained models would focus on different features when analyzing the texts. As a result, each model may have question types for which it estimates answers with high confidence (scores close to 0 or 1) and others for which it has little confidence (scores close to 0.5). Therefore, for the final output of the ensemble, we use the average of the scores calculated by the models instead of simple majority voting, because the score reflects each model's confidence in its answer.
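A minimal sketch of this score-averaging decision, illustrating how it can differ from majority voting, is shown below.

```python
# Minimal sketch of the score-averaging ensemble: the decision is based on
# the mean of the per-model scores, so confident models (scores near 0 or 1)
# outweigh unsure ones (scores near 0.5), unlike simple majority voting.
from statistics import mean


def ensemble_predict(scores):
    """scores: one entailment score per model for the same question."""
    return "Yes" if mean(scores) > 0.5 else "No"


# Two unsure "No" votes are overruled by one confident "Yes":
print(ensemble_predict([0.45, 0.48, 0.95]))  # -> "Yes" (mean is about 0.63)
```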

In such a framework, selecting an appropriate ensemble model set corresponds to selecting a set of models that work in a complementary manner (the ensemble mostly relies on the estimates made with higher confidence). Therefore, we propose to use an additional validation data set to select such model sets.

Based on this discussion, we propose an ensemble method that uses multiple BERT-based entailment systems trained with different settings, in the following two steps.

  1. Training of BERT-based entailment models with different training and validation data

     Considering the variability of the questions, we constructed multiple BERT models using different, randomly constructed training and validation data.

  2. Selection of an appropriate ensemble model set using additional validation data

     To select appropriate model sets as combinations of the previously constructed models, we use additional validation data. Each model produces estimated results for the additional validation data, and ensemble results for all possible combinations of these models are generated by calculating the average score of the models in each combination. These sets are evaluated on the validation data, and the ones with the best performance (or the top-ranked ones) are selected as candidate ensemble model sets, as illustrated in the sketch below.
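The selection step can be sketched as follows, under the assumption that `model_scores[m][q]` holds the score of model m for additional-validation question q and `labels[q]` holds the gold label (1: entail, 0: not entail); restricting the search to sets of three or more models follows the submission setting described in Sect. 4.2.

```python
# Minimal sketch of the ensemble model-set selection on additional validation data.
from itertools import combinations
from statistics import mean


def accuracy(model_set, model_scores, labels):
    correct = 0
    for q, label in labels.items():
        avg = mean(model_scores[m][q] for m in model_set)
        prediction = 1 if avg > 0.5 else 0
        correct += int(prediction == label)
    return correct / len(labels)


def select_ensembles(model_scores, labels, min_size=3, top_k=5):
    """Return the top-k model sets containing at least min_size models."""
    model_ids = sorted(model_scores)
    candidates = []
    for size in range(min_size, len(model_ids) + 1):
        for subset in combinations(model_ids, size):
            candidates.append((accuracy(subset, model_scores, labels), subset))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:top_k]
```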

4 Legal Textual Entailment Experiment

4.1 COLIEE 2021 Submission System

For the legal textual entailment task of COLIEE 2021, the organizers provided training data constructed from 14 years (2006–2019) of bar exam questions. From this data set, we decided to use the data from the latest year (2019: 111 questions) as validation data for analyzing appropriate ensemble model sets, because this data set was used for the evaluation of COLIEE 2020; therefore, we can easily compare the performance of our proposed system with the systems developed for COLIEE 2020. For training the BERT-based entailment system, we used the remaining 13 years of data (695 questions).

This original data set was randomly split into 90% (625 questions) for training and 10% (70 questions) for validation. We constructed 10 different training and validation sets to build different BERT models. In addition to the training data constructed from the original data, all augmented data (3,331 examples for the submission system) were merged into the training data. As a result, we used 3,956 examples for training and 70 for validation. We also made training sets without the augmented data (625 training and 70 validation examples) to compare system performance without augmentation.
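A minimal sketch of this split construction is shown below, assuming that the original and augmented examples are given as lists of (question, article, label) triples; the function name and parameters are illustrative.

```python
# Minimal sketch of building the 10 training/validation splits: each split
# uses 90% of the original question-article pairs for training (merged with
# all augmented pairs) and the remaining 10% for validation.
import random


def make_splits(original_pairs, augmented_pairs, n_splits=10,
                train_ratio=0.9, seed=0):
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = original_pairs[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train = shuffled[:cut] + augmented_pairs  # augmented data only in training
        valid = shuffled[cut:]
        splits.append((train, valid))
    return splits
```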

One important parameter for BERT is the maximum sequence length. If the sequence length is too short, it is difficult to obtain appropriate results because information that appears beyond the maximum sequence length cannot be used for the analysis. Therefore, we analyzed the sequence lengths of all question and article pairs to determine an appropriate value.

Figure 4 shows the distribution of sequence lengths for the COLIEE 2021 training data and the augmented data (4,137 pairs in total). The examined sequences include the [CLS] token and the [SEP] token, which are used for representing the class (Yes: 1 and No: 0) and the separator (splitting the text into the question and article parts), respectively.

From this graph, we confirmed that most pairs (98.38%) have a sequence length of less than 256. Consequently, we decided to use a maximum sequence length of 256, considering the efficiency of the training process.
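The length check behind this choice can be sketched as follows; the tokenizer identifier is an assumption for BERT-Japanese, and `pairs` is assumed to be a list of (question, article) strings.

```python
# Minimal sketch of the sequence-length analysis used to choose max length 256:
# count how many question/article pairs fit within the limit when tokenized
# (the encoded length already includes [CLS] and [SEP]).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")


def fraction_within(pairs, max_len=256):
    within = sum(
        1 for question, article in pairs
        if len(tokenizer(question, article)["input_ids"]) <= max_len
    )
    return within / len(pairs)
```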

Fig. 4 Training data sequence length

In addition, the fine-tuning of the BERT model uses Adam as the optimizer, cross entropy as the loss function, a training batch size of 12, and a learning rate of 1e-5. The validation loss is calculated at each epoch, and training stops when the validation loss increases; we use the model with the minimal validation loss. Details of the experimental settings, including the parameters for the training process and the list of questions used for each training run, are also available from https://www-kb.ist.hokudai.ac.jp/COLIEE-DA/.
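A minimal sketch of this fine-tuning loop with the stated hyperparameters is shown below; it assumes the data sets yield dictionaries of tensors (input_ids, attention_mask, labels, ...) accepted by a Hugging Face sequence-classification model, whose .loss is the cross-entropy loss, and omits data preparation.

```python
# Minimal sketch of fine-tuning with Adam, batch size 12, learning rate 1e-5,
# early stopping when the validation loss increases, and keeping the model
# with the minimal validation loss.
import copy
import torch
from torch.utils.data import DataLoader


def fine_tune(model, train_dataset, valid_dataset, device="cuda", max_epochs=20):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    train_loader = DataLoader(train_dataset, batch_size=12, shuffle=True)
    valid_loader = DataLoader(valid_dataset, batch_size=12)

    best_loss, best_state = float("inf"), None
    for _ in range(max_epochs):  # in our setting, one epoch was usually enough
        model.train()
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # cross-entropy for sequence classification
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(
                model(**{k: v.to(device) for k, v in b.items()}).loss.item()
                for b in valid_loader
            ) / len(valid_loader)

        if val_loss >= best_loss:  # stop as soon as the validation loss increases
            break
        best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return model
```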

The fine-tuning process converged quickly, and in most cases the validation loss was smallest at the end of the first epoch. One epoch took less than 160 seconds, so training was fast. This short computation time is advantageous when increasing the number of models in the ensemble and the amount of training data.

4.2 Preliminary Experiments

In the first experiment, we compared the performance of the system that used both the original training data and the augmented data with that of the system that used the original training data only. For the original training data, we used the 10 different training and validation sets constructed in Sect. 4.1 and evaluated the performance using the R1 data set (the 111 questions used as validation data for selecting appropriate ensemble model sets) as test data.

Table 1 shows the evaluation results on the validation and test data for the different models. Numbers in bold and italics show the best and worst results among the 10 models, respectively. As shown in the table, the augmented data help to improve the performance of the BERT-based entailment systems. A detailed comparison of these two systems is given in Sect. 4.6. In addition, we confirmed that the validation accuracy and loss were not closely related to the test accuracy. We assume that these results reflect the variability of the question set.

Table 1 Evaluation results of the 10 models

These 10 models with augmented data were used for selecting the appropriate ensemble model sets. For the COLIEE 2021 submission, we evaluated the model sets that used three or more models. Table 2 shows the accuracy of these ensemble model sets.

Table 2 Evaluation results of the ensemble models

There were large differences in accuracy among the ensemble cases. The most accurate ensemble used seven models, and the worst used all 10 models. Most of the sets that used three to five models estimated the results with good accuracy.

All of the top-ranked sets contained model 2, the best-performing single model. They also contained model 1, even though its accuracy was the lowest among the 10 models. This suggests that using a complementary set of models with different characteristics is important for improving the overall performance of the ensemble.

4.3 Submitted Results

Based on the results of the preliminary experiments, we submitted to COLIEE 2021 the following three results, which used different model sets for the ensemble (Table 3).

Table 3 Ensemble model settings for the submission

HUKB-1 and HUKB-2 were the best- and second-best-performing systems, respectively, when using the R1 data as validation data. HUKB-3 selected the five best models using the validation loss information.

Table 4 shows the final evaluation results of our submitted runs and the best run of each team; among these, HUKB-2 achieved the highest accuracy.

Table 4 Final evaluation results

4.4 Detailed Analysis of the Submitted Results

To understand the effect of the ensemble method, we compared the performance of the ensemble results with that of each individual model. Table 5 shows the evaluation results for the 10 models. This year, the individual models performed well, and the best of them were almost equivalent to the ensembles. However, an appropriate selection of models (HUKB-2) made the ensemble results better than those of any individual model.

Table 5 Evaluation results of the 10 models for the test data

These results justify the appropriateness of using the ensemble method by selecting an appropriate ensemble set using validation data.

Table 6 shows the number of questions classified by the agreement level among the models used. “Agree”, “Majority”, and “Other” represent “all models return the same result”, “the final result is the same as majority voting”, and all other cases, respectively. From these results, we can confirm that the average-score ensemble method is better than majority voting, because for the “Other” questions the number of correct answers is larger than the number of wrong ones. For the “Agree” questions, the best-performing system (HUKB-2) had the largest number because of the small number of models used (three), but the accuracy of HUKB-1 (using seven models) on the “Agree” questions was better than that of HUKB-2. However, the accuracy of HUKB-3 was lower than that of HUKB-2, suggesting that selecting an appropriate set of models for the ensemble is also effective for maintaining the accuracy on the “Agree” questions.

Table 6 Number of questions classified by the ensemble results

4.5 Additional Experiment with Different Sizes of Fine-Tuned Models

One parameter that was not discussed in detail for the COLIEE 2021 submission system is the number of candidate fine-tuned models (10 for the submission system). We therefore discuss the effect of this number through additional experiments that vary this value. In these experiments, we used the augmented data generated by the procedure explained in Sect. 3.1 (3,351 examples) for fine-tuning the BERT models.

We made 20 randomly split training and validation sets using the same method as in Sect. 4.1 and conducted experiments using the same procedure. Table 7 shows the accuracy of these models evaluated on the R1 data set (corresponding to the test accuracy in Table 1) and the R2 data set (corresponding to the test accuracy in Table 5). The rows are sorted by the accuracy on Test (R1). The best and worst performances on Test (R2) are shown in bold and italics, respectively. The average accuracy on Test (R2) (the submission data) is 0.6432, which is equivalent to the second and third ranks in the submission. However, because of the variability of the questions and the nondeterministic characteristics of BERT fine-tuning, there is a model that is better than our best submitted system (model 11) and one that is below the baseline. It is therefore better to have a method that obtains more stable results.

Table 7 Evaluation results of the 20 models

Table 8 shows selected results among all possible combinations of the 20 models used as candidate ensemble model sets. The best-performing set (models 3, 6, 8, 15) achieved 0.6543 on Test (R2), which is slightly worse than our submitted system and lower than the result of the second-best team (UA) in Table 4. However, the average Test (R2) performance of the second-best sets (five sets) is 0.6765, which is similar to the score of the submitted results, and the highest accuracy among them (0.7160) was better than the submitted one. Nevertheless, because correct answers can occur by chance and it is difficult to select the best-performing model set in advance, this highest value may not mean much. Compared with the single-model cases, the average of the ensemble results is better and more stable (0.6296–0.7160). In addition, the ensemble of all 20 models performs well on Test (R2) but not so well on Test (R1). Because these accuracy values may reflect the variability of the questions in R1 and R2, it is better to select an appropriate set of models rather than ensembling all of them.

Based on this analysis, we confirmed that adding more models may increase the chance of including models with consistently higher accuracy and/or sets of models that work in a complementary manner. However, increasing the number of such candidates may lead to over-fitting to the data used for selecting the ensemble model sets (R1 in this case). For example, among the second-best performing sets, the ensemble set with six models (3, 8, 10, 13, 16, 18) does not work well on Test (R2). Such effects must be taken into account when choosing the settings for generating these models and selecting the appropriate ensemble model set.

Table 8 Evaluation results of the ensemble results using 20 models

4.6 Discussion of the Characteristics of Our Proposed System

First, we discuss the effectiveness and side effects of our data-augmentation method. As explained in Sect. 4.2, the performance of the system with data augmentation is better than that of the system without it. To understand the characteristics of this augmentation method, we selected the questions whose performance improved or degraded significantly. The questions shown in Figs. 1 and 5 are typical examples of improved and degraded cases, respectively. From these examples, we confirm that our system tends to focus on logical matching; i.e., the system tends to answer positive (Yes: entail) for question and article pairs that share the same juridical decision and negative (No: not entail) for flipped pairs. These general characteristics work well when the most relevant article is selected for the question.

Fig. 5 Examples of a degraded question

We also analyze the characteristics of our system through failure analysis. As the Task 4 organizers indicated, there are two types of questions. One is a question about the article(s) that mostly uses the vocabulary of the articles, and the other is a question about a use case that uses anonymized symbols (such as “A” and “B”) to refer to persons or organizations. The latter questions are comparatively more difficult than the others. The R2 test data set contains 35 questions with such symbols.

Table 9 shows the number of models (out of the 20 generated models) that produced the correct answer, classified by question type. The average numbers of models with correct answers are 11.8 for the anonymized questions and 13.7 for the others.

Table 9 Number of models with correct answer classified by types of question

To analyze the difficulty of the questions, it is necessary to take into account the effect of answering correctly by chance. Even when a system cannot find any good clue to estimate whether a statement is true or false, it still answers positive or negative and may be correct by chance. Therefore, it is difficult to analyze the results for questions whose number of correct-answer models is around half (8–12).

In this paper, we analyze questions with a small number of correct-answer models to understand where the systems consistently tend to answer incorrectly.

The following question (Fig. 6: correctly answered by one model) is a typical question that almost no model answers correctly. The article has a sentence about exceptional cases (“ただし、〜ときは、この限りでない” (“however, this does not apply when ...”)). To interpret the article correctly, it is necessary to check whether the exceptional condition is satisfied; if it is, the juridical decision must be flipped. Because our data-augmentation method does not use such exceptional-case sentences for training, as described in Sect. 3.1, the system has difficulty handling them. As a result, the systems tend to answer positive (entail) for this question, because many terms appear in both the question and the first sentence and there is no explicit expression of a flipped decision (i.e., “対抗することができる” (“may duly assert”)). Because several articles contain such exceptional cases, it would be better to design a data-augmentation method that handles them.

Fig. 6 Examples of the failure for a difficult question (1)

The next failure example (Fig. 7: correctly answered by zero models) is related to a logical expression and a semantic mismatch. The article says “the other party to the contract gives consent” (consent is required), but the question says “regardless of whether A consents.” Because the augmented data contain no patterns for handling such logical mismatches, it is comparatively difficult for the system to identify this type of mismatch. In addition, the vocabulary used for the related persons is totally different: “A,” “B,” and “E” are used in the question, whereas “one of the party,” “the other party,” and “the third party” are used in the article. It is also difficult for the system to estimate the relationships among them.

In such a case, the system tends to compare the juridical decision parts of the question and the article. Here, the question and the article have the same decision, “移転する” (transferred), and therefore the system tends to answer positive (Yes: entail). This may be a bias introduced by the augmented training data.

Fig. 7 Examples of the failure for a difficult question (2)

The last failure example (Fig. 8: correctly answered by six models) shows another type of problem related to logical expressions (quantifiers). The article says “together with the obligee” (two or more), but the question says “independently” (single). It is not easy to design a simple data-augmentation method for handling this type of logical mismatch.

Fig. 8 Examples of the failure for a difficult question (3)

These failure examples show that our system tends to generate its final answer (positive or negative) using logical matching information, i.e., whether the juridical decisions of the question and the article are the same or flipped. One reason why our system achieves better performance in this COLIEE task is that the most relevant article is given for each question. Because the article is the most relevant one, most of its conditions are satisfied, and comparing the juridical decisions is the key to answering the question appropriately. Comparing juridical decisions is also an important part of the entailment process in general, so our data-augmentation method should be applicable to other logical entailment tasks of this type.

For future work, more attention should be paid to the comparison of the condition parts. For example, to handle articles with exceptional cases (Fig. 6) appropriately, it is necessary to check whether the exceptional condition is satisfied. For such issues, it would be better to propose a data-augmentation method for semantic matching cases. However, there are various kinds of semantic mismatches in entailment analysis, including insufficient conditions (a part of the conditions is satisfied, but not all of them) and semantic mismatches (the description of the condition does not match the concept used in the article). Compared with augmentation for the logical mismatch case, it is difficult to systematically generate well-balanced positive and negative cases. The gold standard data (question and article pairs) may be helpful for discussing these types of mismatches.

In addition, our system relies greatly on the quality of the relevant article. For real use cases, it is not easy to select the most relevant article, so it would be better to have a mechanism for assessing the similarity between the conditions of the question and those of different articles. Such a mechanism is important for the real use case (Task 5: utilization of retrieved articles instead of the most relevant article) to determine the most relevant article from the retrieved results.

We also would like to analyze our system in depth by considering the information used for calculating the final output. For example, visualization of attention information may be the next step to analyze the results.

5 Conclusion

In this paper, we proposed a data-augmentation method for legal textual entailment using the original civil law articles. The augmented data support the BERT fine-tuning process by increasing the number of training examples that characterize the logical mismatch. We implemented a BERT-based ensemble legal textual entailment system using these augmented data; the system selects an appropriate ensemble model set using validation data. We confirmed the effectiveness of the system on COLIEE 2021 Task 4 (the textual entailment task), where its accuracy of 0.7037 was the best among all submitted runs. We also discussed the characteristics of the method and future work based on failure analysis and additional experiments.