1 Introduction

The objective of the Competition on Legal Information Extraction/Entailment (COLIEE) is to build a research community and establish the state of the art for information retrieval and entailment using legal texts. It is usually co-located with JURISIN, the Juris-Informatics workshop series, which was created to promote community discussion on both fundamental and practical issues in legal information processing, with the intention of embracing various disciplines, including law, social sciences, information processing, logic and philosophy, as well as the existing conventional “AI and law” area. In alternate years, COLIEE is organized as a workshop with the International Conference on AI and Law (ICAIL), which was the case in 2017, 2019, and again in 2021. Until 2017, COLIEE consisted of two tasks: information retrieval (IR) and entailment using Japanese Statute Law (civil law). In COLIEE 2018, IR and entailment tasks using Canadian case law were introduced, and the 2021 edition included a fifth task (entailment in statute law text without relying on previously retrieved data).

Task 1 is a legal case retrieval task: it involves reading a query case and extracting, from the provided case law corpus, supporting cases hypothesized to be relevant to the query case. Task 2 is the legal case entailment task, which involves identifying a paragraph or paragraphs from existing cases that are hypothesized to entail a given fragment of a new case. For the information retrieval task (Task 3), based on the analysis of previous COLIEE IR tasks, we modified the evaluation measure of the final results and asked participants to submit ranked lists of relevant articles, which allows the results to be analyzed relative to the difficulty of the questions. For the entailment task (Task 4), we performed categorized analyses to expose different issues in the problems and characteristics of the submissions, in addition to reporting evaluation accuracy as in previous COLIEE editions. Task 5 is similar to Task 4, but competitors cannot rely on previously retrieved statute data.

The rest of the paper is organized as follows: Sections 2, 3, 4, and 5 describe each task, presenting their definitions, datasets, the approaches submitted by the participants, and the results attained. Section 6 presents some final remarks.

2 Task 1—Case Law Retrieval

2.1 Task Definition

The Case Law Retrieval Task consists of finding which cases should be “noticed”Footnote 1 with respect to a given query case. More formally, given a set of cases C, a set of query cases Q, a set of true noticed cases N, and a set of false noticed cases F, such that \(C = Q \cup N \cup F\), the task is to find the set of answers \(A = \{A_1, A_2, \ldots, A_n\}\), such that \(n = |Q|\) and each \(A_i \subseteq N\) contains all and only the true noticed cases with respect to the query case \(q_i \in Q\).

2.2 Dataset

The dataset comprises 4415 case law files. A labelled training set of 650 cases is provided, together with a total of 3311 true noticed cases. At first glance, the task may seem simple, as one could think competitors only need to identify the 3311 cases among the 4415 total cases. However, the task actually requires competitors to identify the noticed cases for each given query case. On average, there are approximately five noticed cases per query case in the provided training dataset, which must be identified among the 4415 cases. To prevent merely using citations of past cases, citations are suppressed from the case contents and replaced by a “FRAGMENT_SUPPRESSED” tag indicating that a fragment was removed.

A test set is given with 250 query cases and a total of 900 true noticed cases, which means there are on average 3.6 noticed cases per query case in the test dataset. In future editions, we intend to ensure that the training and test datasets have similar distributions. The gold labels for the test set are initially not provided to competitors.

2.3 Approaches

We received 15 submissions from 7 different teams for Task 1, but only 5 teams submitted papers describing their approaches. Their methods are briefly described below. Please refer to the corresponding papers for further details.

  • Li et al. [11] (team name: siat) propose a pipeline method based on statistical features and semantic understanding models, combining a recall-oriented retrieval stage with semantic ranking. siat’s best submission had an f1-score of 0.030.

  • Schilder et al. [21] (team name: TR) apply a two-phase approach for Task 1: first, they generate a candidate set which tentatively contains all true noticed cases but eliminates some of the false candidates (i.e., this step is optimized for recall). The second step is a binary classifier which receives as input the pair \((query\ case, candidate\ case)\) and predicts whether it represents a true noticed relationship.

  • Rosa et al. [20] (team name: NM) present a vanilla application of BM25 to the case law retrieval problem. They do that by first indexing all base and candidate cases contained in the dataset. Before indexing, each document is split into segments of text using a context window of 10 sentences with overlapping strides of five sentences (called “candidate case segments”). BM25 is then used to retrieve candidate case segments for each base case segment. The relevance score for a \((base\ case, candidate\ case)\) pair is the maximum score among all of their base case segment and candidate case segment pairs; a minimal sketch of this segmentation and max-score scheme is given after this list. The candidates are then ranked and selected according to threshold-based heuristics. The NM team submitted only one run, which ranked second place among all submissions with an f1-score of 0.0937.

  • Ma et al. [13] (team name: TLIR) was the top ranked team for Task 1. They apply two methods: the first is a traditional language model for IR (LMIR) [2], which consists of an application of LMIR on a pre-processed version of the dataset. The TLIR team did not use the full case contents, but cleverly made use of the tags inserted in the text to indicate that a fragment has been suppressed in order to heuristically identify the potentially most relevant text fragments (a hedged sketch of this kind of tag-based heuristic is also given after this list). The fact that this approach ranked first among all Task 1 competitors indicates that traditional IR methods can achieve good results in the case law retrieval task. The second approach is a transformer-based method, BERT-PLI, which factors a document into paragraphs and then computes measures of the interactions between paragraphs using BERT. Compared with other neural models, BERT-PLI can take long text representations as input without truncating them at some threshold. Yet, the results attained with this approach in COLIEE 2021 were not as good as those of the simpler IR-based approach, ranking third and fifth among all submissions with f1-scores of 0.0456 and 0.0330, respectively.

  • Althammer et al. [1] (team name: DSSIR) combine retrieval methods with neural re-ranking methods using contextualized language models like BERT. Since the cases are typically long documents exceeding BERT’s maximum input length, the authors adopt a two-phase approach. The first phase combines lexical and dense retrieval methods at the paragraph level of the cases. They then re-rank the candidates by summarizing the cases and applying a fine-tuned BERT re-ranker to those summaries. Their best ranking submission attained fourth place overall, with an f1-score of 0.0411.
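To make the NM team’s segmentation-and-aggregation scheme concrete, the following is a minimal sketch assuming the third-party rank_bm25 package; the helper names (segment_case, pair_score) and the tokenization are illustrative assumptions rather than the team’s actual implementation, and the threshold-based selection heuristics mentioned above are omitted.

```python
from rank_bm25 import BM25Okapi


def segment_case(sentences, window=10, stride=5):
    """Split a case (a list of sentences) into overlapping text segments."""
    segments = []
    for start in range(0, len(sentences), stride):
        segments.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return segments


def pair_score(base_sentences, candidate_sentences):
    """Score a (base case, candidate case) pair as the maximum BM25 score
    over all (base segment, candidate segment) combinations."""
    candidate_segments = segment_case(candidate_sentences)
    bm25 = BM25Okapi([seg.lower().split() for seg in candidate_segments])
    return max(
        max(bm25.get_scores(base_seg.lower().split()))
        for base_seg in segment_case(base_sentences)
    )
```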
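Similarly, the kind of tag-based heuristic described for TLIR can be illustrated as follows; the context window size and the function name are assumptions for illustration only, and the team’s exact heuristic is described in their paper.

```python
import re

TAG = "FRAGMENT_SUPPRESSED"


def fragments_around_tags(case_text, context_chars=300):
    """Return the text snippets surrounding each suppressed-citation tag,
    which are likely locations of potentially relevant content."""
    snippets = []
    for match in re.finditer(TAG, case_text):
        start = max(match.start() - context_chars, 0)
        end = min(match.end() + context_chars, len(case_text))
        snippets.append(case_text[start:end].replace(TAG, " ").strip())
    return snippets
```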

2.4 Results and Discussion

Table 1 shows the results of all submissions received for Task 1 in COLIEE 2021. A total of 15 submissions from 7 different teams were received. It can be seen that the f1-scores were, in general, much lower than in previous editions, reflecting the fact that the task is now more challenging than under its previous formulation. The best performing team in Task 1 in the 2020 edition, for example, achieved an f1-score of 0.6774. For more information on the previous task formulation and approaches, please see the COLIEE 2020 summary [16].

Most of the participating teams applied traditional IR techniques such as BM25, transformer-based methods such as BERT, or a combination of both. The best performing team was TLIR, with an f1-score of 0.1917, with an approach that combined traditional IR methods with simple heuristics to identify the most relevant fragments in a case. Also worth mentioning is the NM team, whose approach was a vanilla application of BM25 and achieved second place overall.

Table 1 Task 1 results

For future editions of COLIEE, we intend to make the distributions of the training and test datasets more similar with respect to the average and standard deviation of the number of noticed cases. Besides that, we will fix a few minor issues which were found in the dataset, such as two different files with the exact same contents (i.e., the same case represented as two separate cases). This is a problem with the original dataset from which the competition’s data is drawn, and knowing that the source dataset presents those issues, we will improve our collection methods to correct them. Fortunately, those issues were rare and did not have an impact on the final results.

A known issue with the dataset is that the tags inserted to indicate suppression of fragments provide an artificial clue as to where potentially highly relevant content is located. That aspect was exploited by the winning team in COLIEE 2021. While that is not a problem with that team’s approach, we would like our datasets to represent real-world problems as accurately as possible, so options to improve such datasets will be explored in future editions.

3 Task 2—Case Law Entailment

3.1 Task Definition

Task 2 is a legal case entailment task: it involves the identification of a paragraph from existing cases that can be claimed to entail the decision of a new case. Given a decision Q of a new case and a relevant case R, the challenge is to identify a specific paragraph in R that entails the decision Q. Using some examples, the organizers have confirmed that the answer paragraph cannot be identified merely by information retrieval techniques. Because the case R is relevant to Q, many paragraphs in R could be relevant to Q regardless of whether they entail it. This task requires one to identify a paragraph which entails the decision of Q, so a specific entailment method is required that compares the meaning of each paragraph in R with the decision Q. The data are drawn from an existing collection of predominantly Federal Court of Canada case law documents. The evaluation measures are precision, recall and F-measure.

For COLIEE 2021, the Task 2 training and testing sets contain 426 and 100 base cases respectively. Table 2 shows the dataset information for Task 2.

Training data are provided in the form of triples, each consisting of a query, a noticed case, and the number of the paragraph in the noticed case that entails the decision of the query. Here, “noticed case” means a case relevant to the query. An example is shown in Table 3.

Table 2 Dataset information in Task 2

3.2 Approaches

Seven teams participated in Task 2, and a total of 17 results were submitted (an average of 2.43 results per team). Each team was allowed to submit a maximum of three results. Table 4 shows the approaches that teams used in Task 2. Althammer et al. [1] (team name: DSSIR) used either BM25 or a DPR [8] model, trained on the entailing paragraph pairs, to rank each paragraph in the noticed case given the query paragraph, producing their first two results. They also combined the rankings of BM25 and DPR as their third result.
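The paper does not specify how the BM25 and DPR rankings were fused; as a hedged illustration only, the sketch below uses reciprocal rank fusion (RRF), a standard technique for combining ranked lists. The constant k=60 is the common RRF default, not a value taken from the team’s system.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists of paragraph ids (best first) into one
    ranking by summing reciprocal-rank scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example usage: fused = reciprocal_rank_fusion([bm25_ranking, dpr_ranking])
```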

Schilder et al. [21] (team name: TR) used hand-crafted similarity features and a classical random forest classifier. Using n-gram vectors, universal sentence encoder vectors, and averaged word embedding vectors, they computed the similarity between each paragraph in the noticed case and the decision fragment in the query. After selecting the k most similar paragraphs, they trained a random forest classifier on these features.
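As a rough sketch of this kind of similarity-feature pipeline, under assumptions (only a TF-IDF n-gram similarity feature is shown, and all names here are illustrative rather than the team’s configuration), one could compute per-paragraph features and feed them to a scikit-learn random forest:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_features(query, paragraphs):
    """One n-gram cosine-similarity feature per candidate paragraph."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit([query] + paragraphs)
    query_vec = vectorizer.transform([query])
    para_vecs = vectorizer.transform(paragraphs)
    return cosine_similarity(para_vecs, query_vec)  # shape (n_paragraphs, 1)


# Training stacks feature rows X with entailment labels y, e.g.:
# clf = RandomForestClassifier(n_estimators=100).fit(X, y)
# scores = clf.predict_proba(similarity_features(query, paragraphs))[:, 1]
```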

Kim et al. [9] (team name: UA) use BERT pre-trained on a large general-purpose corpus and fine-tune it on the provided training dataset. If the tokenization step produces more than the 512-token limit, they apply another transformer-based model to generate a summary of the input text, and then process the pair again. Since the input text often includes passages in French, they apply a simple language detection model based on a naive Bayesian filter to remove those fragments. There are usually very few actual entailing paragraphs in a case (by far, most cases have only one entailing paragraph), so in the post-processing step they establish limits on the maximum number of outputs allowed per case. At the same time, they enforce a minimum score in an attempt to reduce the number of false positives.

Table 3 Training data example in Task 2

Li et al. [11] (team name: siat) proposed a pre-training task on BERT (BERT-base-uncased) with dynamic n-gram masking, to obtain a specialized BERT model with legal knowledge (BERTLegal). They utilized n-gram masking to generate masked inputs for what they call “masked language model” targets. The length of each n-gram mask is randomly selected among 1, 2, and 3. They also performed data augmentation and used a Fast Gradient method.

Nguyen et al. [14] (team name: JNLP) used a supporting model and a lexical model for two submissions, and in the last submission they used a neighbouring structures fingerprint (NSFP) model.

Rosa et al. [19] (team name: NM) used monoT5-zero-shot, monoT5 and DeBERTa [7]. They also evaluated an ensemble of their monoT5 and DeBERTa models. The model monoT5-zero-shot is a sequence-to-sequence adaptation of the T5 [17] model applied without fine-tuning on the target task.

We were not able to identify the approach of the team MAN01 as there was no corresponding paper submission.

Table 4 Approaches in Task 2

3.3 Evaluation Measure

Task 2 uses micro-average precision, recall and F1-measure as evaluation metrics, which are formulated as follows:

$$\begin{aligned} \text {Precision}&= \frac{N_{\text {TP}}}{N_{\text {TP}}+N_{\text {FP}}}, \end{aligned}$$
(1)
$$\begin{aligned} \text {Recall}&= \frac{N_{\text {TP}}}{N_{\text {TP}}+N_{\text {FN}}}, \end{aligned}$$
(2)
$$\begin{aligned} F1&= \frac{2*\text {Precision}*\text {Recall}}{\text {Precision}+\text {Recall}}, \end{aligned}$$
(3)

where \(N_{\text {TP}}\) denotes the number of true positive predictions over all queries, \(N_{\text {TP}}+N_{\text {FP}}\) is the total number of positive predictions over all queries, and \(N_{\text {TP}}+N_{\text {FN}}\) is the number of ground-truth positive cases.
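A small self-contained helper matching Eqs. (1)–(3) might look as follows; the dictionary-of-sets input format is an assumption for illustration.

```python
def micro_prf(predictions, gold):
    """Micro-averaged precision, recall and F1.

    predictions, gold: dicts mapping a query id to the set of predicted
    (resp. ground-truth) entailing paragraph ids for that query.
    """
    tp = sum(len(predictions.get(q, set()) & gold[q]) for q in gold)
    n_pred = sum(len(predictions.get(q, set())) for q in gold)
    n_gold = sum(len(gold[q]) for q in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```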

3.4 Results and Discussion

Table 5 shows the Task 2 results. The NM team’s three submissions ranked first through third. In particular, their ensemble of DeBERTa and monoT5 showed the best performance, with an F1 score of 0.6912. As shown in Table 6, the systems of the winning team (NM) show balanced performance between precision and recall. This task is to find the paragraph(s) that entail the decision of the query, and in most cases only one paragraph is the correct answer, so systems are likely to show better precision than recall. An interesting observation in Table 6 is that the monoT5 system showed better recall than precision.

Most of the systems combined the traditional BM25 information retrieval algorithm with the BERT transformer language model. They showed that the traditional BM25 system is still useful in legal information retrieval and entailment. To address dataset imbalance, some teams tried data augmentation. In addition, some approaches tried to extract semantic relationships between paragraphs using BERT. Finally, one approach used LEGAL-BERT, a BERT model optimized for the legal domain, but its performance was not promising.

Participants have stated that the extreme class imbalance of the problem and the limited data size make it challenging to train an efficient and generalizable classification model. Because of the limited data size, the winning team (NM) adopted zero-shot models, and they showed that zero-shot models can have at least equivalent performance to models that have been fine-tuned on a legal case entailment task. They also confirmed a counter-intuitive result: models with little or no adaptation to the target task can be more robust to changes in the data distribution than models that have been carefully fine-tuned to the task at hand.

Table 5 Task 2 official results
Table 6 Task 2 winning team’s detailed performance

4 Task 3—Statute Law Information Retrieval

4.1 Task Definition

Task 3 requires the retrieval of an appropriate subset (\(S_1\), \(S_2\),..., \(S_n\)) of Japanese Civil Code Articles from the Civil Code texts dataset, used for answering a Japanese legal bar exam question Q.

An appropriate subset is one for which an entailment system can judge whether the statement Q is true, \(Entails(S_1, S_2, \ldots, S_n, Q)\), or false, \(Entails(S_1, S_2, \ldots, S_n, \lnot Q)\).

4.2 Dataset

For Task 3, questions related to Japanese civil law were selected from the Japanese bar exam. Since the Japanese Civil Code was updated in April 2020, we revised the text database and its English translation to reflect this revision. However, since an English translation is not available for a portion of the code, we excluded those untranslated parts from the civil code text, along with their related questions. As a result, the number of civil code articles used in the dataset is 768, about half of the number used in previous COLIEE competitions. Training data (question and relevant article pairs) were constructed using previous COLIEE data (806 questions). In these data, questions related to revised articles were reexamined, and those related to excluded articles were removed. For the test data, new questions were selected from the 2020 bar exam (81 questions).

The number of questions classified by the number of relevant articles is listed in Table 7.

Table 7 Number of questions classified by number of relevant articles

4.3 Approaches

The following six teams submitted results (18 runs in total). We describe the approaches of each team below, using a header format of the form Team Name (number of submitted runs). All teams had experience submitting results in previous competitions. Because the best performing system [22] of COLIEE 2020 uses BERT [5], most of the teams (HUKB, JNLP, OvGU, and TR) use BERT, and some ensemble the results with an ordinary IR system (HUKB and OvGU). One characteristic feature proposed in this year’s task is the extension of training data for training BERT-based IR systems. OvGU proposed a method to extend the contents of the original articles using text data related to the article (metadata and text from the web). JNLP proposed a method to select the part of the article corresponding to the query using a sliding window mechanism. HUKB proposed a method to add detailed information from the referred articles. Other common techniques used in the systems were well-known IR mechanisms such as BM25, TF-IDF, Indri [23], and Word Mover’s Distance (WMD) [10].

  • HUKB (three runs) [27] uses a BERT-based IR system and Indri as IR modules, and compares the outputs of the two systems to create the final results. They construct new article databases of two types: one expands the detailed information using the referred articles, and the other splits the text so that each part describes one judicial decision. They submitted three runs with almost identical settings, and the best run is HUKB-3.

  • JNLP (three runs) [14] uses BERT-based IR models that combine multiple BERT models for generating results. They also construct training data of relevant articles by selecting the most relevant part of each article using a sliding window (a hedged sketch of this idea is given after this list). The best run is JNLP.CrossLMultiLThreshlod, which uses an ensemble of three different systems’ outputs, selecting the highest result among them.

  • LLNTU (three runs) did not submit a paper describing their methods.

  • OvGU (three runs) [25] uses a variety of BERT models with different data enrichment techniques. The best run is OvGU_run1, which uses sentence-BERT embeddings [18] with TF-IDF, enriching the articles in the training data with metadata, web text related to the articles, and relevant queries from the training data.

  • TR (three runs) [21] submitted three runs; the best run is TR_HB, which uses a Word Mover’s Distance (WMD) approach to calculate the similarity between queries and articles.

  • UA (three runs) [9] uses ordinary IR modules for generating results. The best run is BM25.UA, which uses BM25 as the IR module.
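As referenced in the JNLP entry above, the sliding-window idea for constructing training data can be sketched as follows; the window size, stride, and TF-IDF similarity used here are illustrative assumptions, not the team’s actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def best_window(article_tokens, query, window=64, stride=32):
    """Return the article window most lexically similar to the query, to be
    used as the positive training passage for that (query, article) pair."""
    windows = [
        " ".join(article_tokens[start:start + window])
        for start in range(0, max(len(article_tokens) - window, 0) + 1, stride)
    ]
    vectorizer = TfidfVectorizer().fit(windows + [query])
    sims = cosine_similarity(vectorizer.transform(windows),
                             vectorizer.transform([query]))
    return windows[int(sims.argmax())]
```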

4.4 Results and Discussion

Table 8 shows the evaluation results of the submitted runs. The official evaluation measures used in this task were the macro averages (the average of each evaluation measure value per query over all queries) of the F2 measure, precision, and recall (see Appendix 1 for the definitions of those measures).

We also calculate the mean average precision (MAP) and recall at k (R\(_k\): recall computed using the top k ranked documents as the returned documents) using the long ranking list (100 articles); these results are also included in Table 8Footnote 2.
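For concreteness, the measures described above can be computed roughly as follows (a sketch only; the official definitions are in Appendix 1, and the variable names here are illustrative):

```python
def f2(retrieved, relevant):
    """Per-query F2: recall is weighted four times as heavily as precision."""
    tp = len(set(retrieved) & set(relevant))
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 5 * precision * recall / (4 * precision + recall)


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant articles found in the top-k ranked articles."""
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)


def macro_average(values):
    """Average of a per-query evaluation measure over all queries."""
    return sum(values) / len(values)
```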

This year, OvGU produced the best run among all submissions; JNLP achieved an almost identical score with a higher MAP. The ordinary IR model BM25 also achieved good performance at finding a single relevant article for a question. From these results, we confirm the effectiveness of deep learning technology such as BERT for this task.

Table 8 Evaluation results of submitted runs (Task 3)

Figures 1, 2, and 3 show the averages of the evaluation measures over all submitted runs. As we can see from Fig. 1, there are many easy questions for which almost all systems can retrieve the relevant article. The easiest question is R02-10-E, “An underground space or airspace may be established as the subject matter of superficies for ownership of structures, through the specification of upper and lower extents.”, whose relevant article (Article 269-2) contains the same sentence in its text.

However, there are five queries for which none of the systems could retrieve the relevant articles. All of these questions (R02-9-E, R02-15-I, R02-15-U, R02-15-E, and R02-23-E) are based on use cases of articles, which require semantic matching and the handling of anonymized symbols such as “A” and “B” referring to persons or other entities. For example, question R02-9-E is “B obtained A’s bicycle by fraud. In this case, A may demand the return of the bicycle against B by filing an action for recovery of possession.” A related article is “Article 192 A person that commences the possession of movables peacefully and openly by a transactional act acquires the rights that are exercised with respect to the movables immediately if the person possesses it in good faith and without negligence.”\(^{3}\) It is necessary to recognize the following semantic relationships: “bicycle” as “movables”, “A” and “B” as persons, and the conflict between “by fraud” and “peacefully”. This semantic interpretation of the statute statements is an instance of the greater challenge of identifying relationships between abstract statutes and specific texts.

Fig. 1 Averages of precision, recall, F2, MAP, R_5, and R_30 for easy questions with a single relevant article

Fig. 2 Averages of precision, recall, F2, MAP, R_5, and R_30 for non-easy questions with a single relevant article

Fig. 3 Averages of precision, recall, F2, MAP, R_5, R_10, and R_30 for questions with multiple relevant articles

4.5 Discussion

Since the statute law retrieval task is one of the oldest tasks of COLIEE, it is appropriate to discuss which kinds of issues have been addressed over the course of its development. As discussed below, there are three different types of questions for which we can describe the challenges.

One characteristic of this year’s difficult questions is the use of anonymized symbols as pronouns or placeholders, such as “A” and “B” referring to persons or other entities. In the COLIEE 2021 test set, 35 questions contain such anonymized symbols, and 27 of those 35 questions have a single relevant article.

Table 9 shows the number of queries with a single relevant article, classified by average F2 score and by whether the query contains anonymized symbols. Table 10 shows the corresponding numbers for queries with multiple relevant articles.

Table 9 Number of questions classified by F2 score and query type (single relevant articles)

From Table 9, we confirm that the relevant articles for most questions without anonymized symbols can be retrieved by most of the submitted systems (there is no such question whose average F2 measure is lower than 0.6). However, it is still difficult for the systems to retrieve relevant articles for questions with anonymized symbols (16 out of 27 such questions have an average F2 measure lower than 0.6).

This result reflects the different characteristics of questions with and without anonymized symbols. In most cases, questions with anonymized symbols describe use cases of the articles and therefore require handling the semantic relationships discussed in Sect. 4.4. In contrast, most questions without anonymized symbols do not require handling such relationships. In addition, since deep-learning-based NLP models such as BERT can handle contextual information, they are helpful for selecting the appropriate relevant articles from among those that use a similar vocabulary. However, term similarity in the legal domain may not be the same as in ordinary texts. For example, “jewelry,” “car” and “paintings” are similar terms in the legal context of valuable movables, but they do not occur in similar contexts in ordinary texts. Using legal-BERT [25] is one possible solution to this problem, but its performance was not as good as that of the best run. It is necessary to investigate appropriate transformer models (including BERT and other variations) for this task.

Table 10 Number of questions classified by F2 score and query type (multiple relevant articles)

For questions with multiple relevant articles, it is still difficult to retrieve all of the relevant articles (Table 10). This is because most of the systems treat this problem as a simple rank-based retrieval problem. For example, the best performing system, OvGU [25], and the second best team, JNLP [14], both use a thresholding approach to select relevant articles. These selection processes can be interpreted as deciding the number of relevant documents from rank-based retrieval results.

However, from the legal perspective, it is better to consider the relationships among statute law articles using article reference information. HUKB [27] tried to identify the relationships among articles based on the reference information, combined with a rank-based retrieval approach. However, their performance is not currently as good as expected.

Based on this discussion, we can confirm that conventional IR methods can successfully retrieve relevant articles for simple questions that do not concern use cases and that have a single relevant article. However, questions about use cases and questions with multiple relevant articles remain difficult to handle.

As a possible future direction, it is necessary to propose a framework that encourages participants to tackle these problems.

5 Tasks 4 and 5—Statute Law Entailment and Question Answering

5.1 Task Definition

Task 4 is a task to determine textual entailment relationships between a given problem sentence and relevant article sentences. Competing systems should answer “yes” or “no” given the problem sentence and the relevant article sentences. Until COLIEE 2016, the competition had only pure entailment tasks, where t1 (relevant article sentences) and t2 (problem sentence) were given. Due to the limited number of available problems, COLIEE 2017 and 2018 did not retain this style of task. In Task 4 of COLIEE 2019 and 2020, we returned to the pure textual entailment task to attract more participants, which produced more focused analyses. In COLIEE 2021, we revived the question answering task as Task 5 and retained the textual entailment task as Task 4; Task 5 requires a system to answer “yes” or “no” given only the problem sentence(s). Participants may use any external data, provided they do not use the test dataset.

5.2 Dataset

Our training dataset and test dataset are the same as for Task 3. Questions related to Japanese civil law were selected from the Japanese bar exam. The organizers provided a dataset used in previous campaigns as training data (806 questions), and new questions selected from the 2020 bar exam as test data (81 questions). The Task 5 dataset is the same as that of Task 4. We performed Task 5 before Task 4 in order not to reveal the gold standard article labels, which are included in the Task 4 dataset.

5.3 Approaches

All teams submitted three runs for each of Tasks 4 and 5, except that the OvGU and HUKB teams participated in Task 4 only.

  • HUKB (three runs) [26] used an ensemble architecture of BERT methods with data augmentation. They prepared an ensemble of 10 models. Their data augmentation extracts judicial decision sentences and then constructs positive/negative examples from the articles.

  • JNLP (three runs) [15] uses bert-base-japanese-whole-word-masking with TF-IDF-based data augmentation for Task 4. Their models are trained with different numbers of pre-training/fine-tuning epochs (JNLP.Enss5a and JNLP.Enss5b), together with an ensemble of these two models (JNLP.EnssBest). For Task 5, they use their proposed Next Foreign Sentence Prediction (JNLP.NFSP), which is trained to determine whether two sentences in different languages correspond to two consecutive sentences in a document, and Neighbor Multilingual Sentence Prediction (JNLP.NMSP), which adds pairs of same-language sentences in the two languages to the bilingual pairs of NFSP, together with the original multilingual BERT (JNLP.BERT_Multilingual).

  • KIS (three runs) [6] extended their previous work using a classic NLP approach, designed to be explainable, based on predicate-argument structure analysis, an original legal dictionary, negation detection, and an ensemble of modules with different thresholds and combinations of these features.

  • OvGU (three runs) [25] employed an ensemble of graph neural networks in which each node represents either a query or an article, with sentences embedded by a pre-trained paraphrase-distilroberta-base-v1 model (OvGU_run1), and LEGAL-BERT based on legal-bert-base-uncased with different training phases (OvGU_run2 and OvGU_run3).

  • TR (three runs) [21] uses existing models: TR-Ensemble, a T5 [17]-based ensemble, TR-MTE, using Multee [24], and TR_Electra, using Electra [4], for Task 4; and TRDistill-Roberta, using a distilled version of RoBERTa [12], TRGPT3Davinci, using the largest GPT-3 [3] model, and TRGPT3Ada, using a smaller one, for Task 5.

  • UA (three runs) [9] uses BERT (UA_dl), and BERT augmented with semantic information (using Kadokawa thesaurus concept numbers) (UA_parser).

5.4 Results and Discussion

Tables 11 and 13 show the evaluation results of Tasks 4 and 5, respectively. Tables 12 and 14 show our categorization results for Tasks 4 and 5, respectively. Because an entailment task is essentially a complex composition of different subtasks, we manually categorized our test data into linguistic categories, depending on what sort of technical issues require resolution. As this is a composite task, overlap is allowed between categories. Our categorization is based on the original Japanese version of the legal bar exam. The BL column in Table 12 shows the correct answer ratio for each category when answering the majority answer “No” to all problems. Interestingly, all runs are below this baseline in the Negation category, which was expected to be easier to answer than other categories. This comparison supports the view that the task is a complex and composite one, and that a result is not simply better because its overall score is better.

The characteristics of the test dataset do not seem to be consistent across the years of the COLIEE series. For example, we observe more problems that require handling anonymized symbols such as “A” and “B” referring to persons (also discussed in the Task 3 sections) than in previous years. Such problems should still be very difficult for any NLP method to solve, unless similar patterns are sufficiently covered by some external training dataset. The Anaphora rows of Tables 12 and 14 reflect this difficulty. The best team in Task 4 appears to have solved the “easier” problems well, while the remaining “difficult” linguistic issues are left for future work.

Table 11 Evaluation results of submitted runs (Task 4)
Table 12 Task 4’s Linguistic category statistics of problems, and correct answers of submitted runs for each category in numbers of counts and percentages
Table 13 Evaluation results of submitted runs (Task 5)
Table 14 Task 5’s Linguistic category statistics of problems, and correct answers of submitted runs for each category in numbers of counts and percentages

6 Conclusion

We have summarized the systems and their performance as submitted to the COLIEE 2021 competition. For Task 1, TLIR was the best performing team, with an F1 score of 0.1917; their approach applied a combination of LMIR and a BERT-based method. In Task 2, the winning team (NM) ensembled DeBERTa and monoT5 and achieved an F1 score of 0.6912. For Task 3, the top ranked team (OvGU) employed sentence-BERT embeddings and augmented the training data with metadata, web data related to the articles, and relevant queries from the training data, achieving an F2 score of 0.73. HUKB was the Task 4 winner, with an accuracy of 0.7037; they applied an ensemble of BERT models and data augmentation. In Task 5, JNLP was the best performing team, applying a variety of BERT-based models and achieving an accuracy of 0.6049.

In this edition, we introduced a new task on statute law question answering (Task 5) and a new formulation for the case law retrieval task (Task 1). We intend to further improve the quality of the datasets in future editions of COLIEE so that the tasks more accurately represent real-world problems.