Japanese Tort-case Dataset for Rationale-supported Legal Judgment Prediction

This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task identifies the court's accepting arguments from alleged arguments by plaintiffs and defendants, which is a novel task in the field. JTD is constructed based on annotated 3,477 Japanese Civil Code judgments by 41 legal experts, resulting in 7,978 instances with 59,697 of their alleged arguments from the involved parties. Our baseline experiments show the feasibility of the proposed two tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions of the LJP research.


Introduction
Legal information processing aims to provide computational aid in legal procedures, including predicting the outcome of a case through legal judgment prediction (LJP).LJP is beneficial to both legal professionals and the general public.It allows them to predict litigation outcomes in advance, and they can take better actions based on Undisputed Facts () Tort Prediction ()  1: Rationale-supported legal judgment prediction featuring a tort case expectations.For example, they can get faster conciliation and smoother negotiations, resulting in more efficient legal services.LJP also reduces the cost of legal services and improves their access.Easier access to justice is important for those with limited or no access to traditional legal services.

Rationale Extraction Model
LJP has been a longstanding research topic in artificial intelligence and, like other domains, has adopted machine learning (ML) techniques.The ML techniques require large datasets for training and evaluating ML models.Xiao et al. [1] proposed a dataset of 2.6M Chinese criminal cases annotated with applicable laws, charges, and prison terms.Chalkidis et al. [2] presented a dataset of 11.5K cases from the European Court of Human Rights, which is designed for violated article detection and case importance prediction.Katz et al. [3] constructed a dataset of 28K cases from the Supreme Court of the United States.Semo et al. [4] released the LJP dataset focused on class action cases in the United States.Chalkidis et al. [5] proposed a collection of datasets to evaluate model performance across different legal tasks, including LJP tasks in English.In contrast, there is no LJP dataset employing real judgment documents in the Japanese jurisdiction.The LJP tasks and their datasets should be designed to reflect differences in jurisdictions.Against this backdrop, we construct the first dataset of LJP, Japanese Tort-case Dataset (JTD), for the Japanese LJP research.
In JTD, we deal with judgment on civil cases about torts (Civil Code, Art.709) 1 .Tort is an important and popular topic in civil cases.Japanese law affirms a tort as a negligent or intentional infringement of rights or legal interests that cause a plaintiff to suffer loss or harm.In modern society, torts play an important role in disputes on the internet, for example, cases of defamation and privacy infringement on social media.In such cases, tort law is often used to determine liability since there is usually no explicit contract between the parties.

False
It is possible to identify the subject of this posting as the plaintiff.

True
We do not admit all the allegations from plaintiff.  Fig. 2: An example about a defamation case.Translations are ours and modified for presentation purpose.R P g and R D g represent gold labels for R P and R D , respectively.
Figure 1 shows an overview of our two tasks: Tort Prediction (TP) and Rationale Extraction (RE).A tort case involves two parties: plaintiffs and defendants.Plaintiffs are claimants of the case, arguing that a defendant's action is a tort, while defendants contest plaintiffs' arguments.TP predicts whether a tort is affirmed (T , a Boolean value), given undisputed facts (U ) and arguments from both parties (P from plaintiffs and D from defendants).Undisputed facts are not disputed by any parties or agreed upon by both parties.They provide the LJP model with context to validate the parties' arguments.The final decision on a tort (T ) should be based on the arguments that are accepted by the judge.Thus the accepted arguments can be considered rationales for the final decision (T ).RE identifies the accepted arguments (R P for plaintiffs and R D for defendants, both are sequences of Boolean values, denoting accepted arguments as True) in the parties' arguments (P and D).To summarise, our tasks take (U, P, D) as input and output (T, R P , R D ). Figure 2 shows an example of an instance.There is an undisputed fact (U ), four claims from the plaintiff (P ), and one claim from the defendant (D).A gold standard for the tort prediction task is false (T gold ), meaning the subject of this instance is not considered a tort.Gold standard labels for the rationale extraction task are {T rue, T rue, F alse, F alse} for the plaintiff (R P g ) and {F alse} for the defendant (R D g ).
Our main contributions are the following.We propose new tasks for the Japanese LJP, which consist of judicial decision prediction and identification of their rationales.We conducted a large-scale annotation with 41 legal experts.In the annotation, the annotators captured direct causal relations between the court decisions and arguments from the parties, allowing multiple court decisions on multiple subject matters in a case.From the 3,477 annotated documents, we built JTD consisting of 7,978 tortrelated instances.To establish baseline performance with the dataset, we conduct experiments employing hierarchical Transformer architecture and multi-task learning approaches.Moreover, we perform a detailed error analysis by legal experts to identify sources of errors and suggest future research directions.Our dataset will be available for researchers on a website.

Related Work and Background
ML-based approaches have been popular in LJP research.ML-based systems automatically learn how judges make decisions from a large number of cases (e.g., judgment documents).These models take fact descriptions as input and predict outcomes or relevant laws.European Court of Human Rights cases are popular sources of datasets for LJP [2,[7][8][9].Galli et al. [10] constructed a dataset for an outcome prediction task for Italian Value Added Tax decisions.Katz et al. [3] built a dataset of cases from the Supreme Court of the United States to train their models.Semo et al. [4] constructed another LJP dataset of class action cases in the United States.LJP on Chinese Criminal cases is another big venue of ML-based LJP models [11][12][13][14][15].A Japanese dataset for legal tasks is available for Competition on Legal Information Extraction/Entailment (COLIEE) [16].However, COLIEE is designed for legal entailment and information retrieval on the Japanese bar exam, and its data size is limited.
Due to the lack of a large reliable dataset, the Japanese LJP research has hardly employed the ML-based approach.Instead, the symbolic approach has been popular for Japanese LJP.The symbolic systems predict outcomes of legal reasoning by rules and logic [17,18].Although the symbolic systems require human experts' intervention in development, we can easily interpret their behaviour.PROLEG [19] demonstrated that a logic programming system could work for the Japanese legal system.PROLEG is a legal reasoning system based on Prolog, implementing a decision-making theory used in civil litigation in Japan.However, there still needs to be a solution to extracting logical clauses from natural language text [20].
Moreover, in the recent era of rapid socio-economic change, the importance of a type of provision known as general clause is growing in legal practice; the general clauses enable legislators to deal with various unforeseeable situations and to apply the law fairly and appropriately.They do not specify the physical or social facts that must be proven to determine whether specific legal requirements have been met.Instead, they only provide abstractive concepts as requirements.The fulfilment of those requirements will be determined through the comprehensive assessment or evaluation of the relevant individual facts in concrete cases.Determining their fulfilment can hardly be implemented by rule-based or logic programming-based approaches.On the other hand, the ML-based approach can learn the standards of those requirements from many precedents and perform better.Therefore, it is essential to construct a large-scale dataset of Japanese judgment documents to facilitate the ML-based approach for Japanese LJP tasks.We use real Civil Code judgment documents, specifically, torts cases, since the basic rule of torts in the Japanese Civil Code (Art.709) is a good example of the above-mentioned general clause.
The success of deep learning methods in many areas raises concerns about the lack of explainability [21] behind their output.LJP outputs are expected to align with Justice, and LJP systems should be trusted by society.Therefore, explainability is more important and serious in the legal domain than in other domains.Even if LJP systems are used as assistant tools for legal consulting, they can affect people's behaviours and indirectly influence their social status and assets.Thus, the LJP system has to explain the reason for predictions.To accommodate the needs, recent LJP studies introduced explanation tasks, including court view generation [22], rationale paragraph extraction [23], and case features extraction as rationales [24,25].Following the prior work, we design our explanation task as rationale extraction similar to Chalkidis et al. [23], but our extraction task is at span level instead of paragraph level.In addition, our target of rationale extraction is argumentative claims from the parties (e.g., plaintiff's factual allegations) instead of submitted fact descriptions.

Data Source
Our data source of the judgment documents is a legal database "Hanreihisho2 " provided by LIC Co., Ltd.We curated judgment documents from the first instances of Civil Code cases from lower courts.We retrieved the documents from the database using keyword-based queries shown in Table 1.We retrieved documents that describe general tort cases of defamation, privacy infringement and reputation injury using query A. As tort disputes on the Internet are often discussed in Disclosure of Identification Information of the Sender (DIIS) cases, we also retrieved DIIS cases using query B. As they might include non-tort cases, we manually excluded the irrelevant documents later in the human annotation process.
When writing a judgment document, the judges often comply with a particular guideline for writing civil case judgments [26].This leads to high similarities in structure across judgments.The structure starts with Main Text briefly describing the final decision, followed by Facts and Reasons containing sections: a summary of the case, undisputed facts, arguments from the parties, and the court's decisions (detailed judicial decisions).
As our target documents describe court cases, the documents can contain personal information, sensitive information of parties or legally protected information such as trade secrets.Leins et al. [27] sheds light on potential ethical issues in constructing datasets from a sensitive data source like judgment documents.In the Japanese legal system, it is guaranteed that anyone can access judgment documents by law (Code of Civil Procedure, Art.91), but the parties may request to opt out of giving others access to their judgment documents (Code of Civil Procedure, Art.92).Therefore, sensitive secrets should not be contained in the database we use.Moreover, the providers of documents and databases pseudonymise the documents before publishing a case in journals or databases.

Annotation Scheme
To obtain a set of tuples (U, P, D, T g , R P g , R D g ) for our task inputs and outputs, we annotate the following information based on the work of Yamada et al. [28].Here, subscripts g denote task answers (golds).Annotators extract spans at the character level according to the following definitions below.Annotators may not identify any spans for a type if there is no corresponding text in a document.
The Court Decisions (CD) span describes a judge's decision on tort, which should be found in the judicial decision section.The CD span has a Decision (@D) attribute, indicating whether judges affirmed the tort (True) or not (False).
The Claim (CL) span describes important parties' claims which are relevant to the decision on torts3 .The CL span has two attributes: Accepted (@AC) indicating whether judges accepted the claim (True) or not (False), and Who (@W) indicating their claimant, i.e. one of Plaintiff, Defendant and Other4 (e.g., a third party with interest in the outcome).
The Undisputed Facts (UF) fact describes facts that are undisputed by any parties.The UF spans should be identified in sections other than the judicial decision.The UF span has no attribute.
A judgment document might have multiple CD spans.Note that CD spans in a document can have different values since they concern different decisions, though the decisions can be related to each other.The annotators associate each non-CD annotated span (CL and UF spans) with its relevant CD span.A non-CD annotated span can be associated with multiple CD spans.
Our annotation scheme follows Chalkidis et al. [23], but with more precise annotation granularity.First, we aim to annotate each individual subject (CD span) on trial instead of annotating an entire judgment document, which is a more precise task design in terms of simulating judicial decision-making.Also, our scheme allows annotators to extract rationale spans at the character level instead of the paragraph level.Moreover, our scheme captures argumentative claims from the parties (factual allegations) in addition to submitted fact descriptions.While the annotation of the argumentative claims was implemented in previous work [10], ours additionally implements labels indicating whether they are accepted by the court or not.

Annotation Procedure
Annotators get 16 pages of guidelines and five sample annotated documents with commentaries.The guidelines are available as a part of our dataset package.First, the annotators read through a judgment document to understand the argument flow.They discard the document if it does not concern torts.After the screening, the annotators annotate spans (CD, CL and UF) and assign necessary attributes to each span.Because the information of CL and UF spans are used as inputs to a model, they are not annotated in the judicial decision section.The annotators may refer to the judicial decision section only for assigning attribute @D to CD and @AC to CL.Finally, the annotators associate every non-CD annotated span with its corresponding CD span.To balance the workload between annotators, we let annotators skip a document which contains more than 15 CD spans.

Annotation Study
We assessed the reproducibility of the annotation scheme with five annotators, including three lawyers, one law school graduate and one undergraduate.We asked them to annotate 25 documents independently.
As our tasks identify appropriate units in text, which are meaningful and annotatable, in other words, "unitising" tasks, we use Krippendorff's α U [29] as the main metric for the inter-annotator agreement (IAA) 5 .We calculate α U using the character offset of spans and their "labels".The span types6 are used as a label for the spans, while the attribute values are a label for the attributes.In calculating α U for the span association, a label for each non-CD annotated span is its associated CD span.As the boundaries of a CD span might be different between annotators, we merged overlapped spans from different annotators into a single CD span taking the union of them.
We observed good agreement overall.The annotators identified 377.4 spans on average from the 25 documents, achieving 0.654 of α U over all span types.The α U of attributes are at 0.629 (@AC), 0.641 (@W) and 0.608 (@D) showing reproducible anntotions.α U of span association shows a reasonable score at 0.430, but it is still lower than that of span extraction due to the error propagation from the span extraction.

Production Annotation
To increase instances, we deployed the annotation scheme to more annotators with legal knowledge and experience.Each document was annotated by an annotator.We Graduating from law school or passing the preliminary exam is a prerequisite for taking the Japanese Bar exam.All the undergraduates were going to take the Bar exam, and 8 of them had already passed the preliminary exam when the annotation had started.We also checked whether the annotator complied with the annotation guideline in the middle of and after the annotation period and excluded severe violators.During their annotation, the annotators could ask questions and discuss edge cases with other annotators via a text-based communication workspace.

Dataset Construction
We construct the JTD from the annotated judgment documents (Figure 3).JTD consists of a set of tuples (U, P, D, T g , R P g , R D g ); we call each an instance.U is made by text annotated as UF.P and D are sequences of text, which are annotated as CL, from the plaintiffs and the defendants, respectively.Their corresponding gold labels for the RE task are R P g and R D g .Each of them is a sequence of Boolean flags imported from attributes @AC.The orders of elements in those sequences correspond to their appearance in the judgment document.T g is a single Boolean value made by attribute @D which is annotated on a CD span.Table 2 shows an overview of JTD.3,477 documents were annotated, which resulted in 7,978 instances.39.8% of the instances are labelled as True for the T (Table 4).Table 5 shows basic statistics for Claims.In total, 59,697 Claims are available for the rationale extraction task.51.0% of Claims are labelled as accepted (True), and others as rejected (False).Table 6 gives numbers according to parties (plaintiff or defendant).The rate of true-labelled Claims is consistent across parties, at 51.7% for the plaintiff and 50.1% for the defendant.Out of 59,697 Claims, 52.5% are from the plaintiff side, and others come from the defendant side.Table 7 summarises statistics on the length of Claims and Undisputed Facts.The maximum length of Claims from plaintiff and defendant can be outliers given that the 99th percentiles are much lower than them.Table 3 shows the split of our dataset.

Models
To establish an LJP baseline using JTD, we employ a hierarchical Transformer architecture, which achieves competitive performance in various legal NLP tasks [5].Our    Fact embedding E f is shared across all claims, while party-type embeddings E p and E d are exclusive to the plaintiff's and defendant's claims, respectively.E ps n is a position embedding for the n-th claim.
plaintiff's claims and the defendant's claims by assigning different embeddings according to who submitted the claim.IST also considers positions of claims in inputs via position embeddings.As an auxiliary input, IST takes undisputed facts.To inject features from undisputed facts into every encoded claim, IST independently encodes all the concatenated undisputed facts into a single vector, namely fact embeddings.An input representation to the span-level encoder, which corresponds to n-th claim by a plaintiff, is the sum of a claim vector from the word-level encoder, fact embeddings (E f ), party-type embeddings of the plaintiff (E p ), and position embeddings (E ps n ).As word-level encoders, we utilise two different types of pretrained BERT [31] models, BERT-base-Japanese (BERTja) 7 and Japanese-LegalBERT (JLBERT) [32].Both BERTja and JLBERT use the same model architecture as the original BERTbase model with 12 layers, 768 dimensions of hidden states, and 12 attention heads.We feed a vector corresponding to the [CLS] token from BERT output into a linear layer and use its output as a claim vector.While BERTja is pretrained only with the Japanese Wikipedia corpus, JLBERT is adapted to the Japanese legal domain.JLBERT is firstly pretrained with the Japanese Wikipedia corpus and then further pretrained with Japanese judgment documents in Civil code cases.
Our JTD also uses Japanese judgment documents as the source of the dataset, and we manually checked the number of overlapping documents between the JLBERT pre-training data and the JTD test set.We found 97.4 % of documents from the JTD test in the pre-training data as well.Nevertheless, we decided to employ JLBERT in our experiments because of the following three reasons: 1) JLBERT is the only available pretrained language model in the Japanese legal domain, 2) JLBERT was pretrained on the proprietary judgment document corpus so that we cannot pre-train our version of JLBERT model pretrained without documents in the JTD test set.3) As our dataset and tasks are produced with manually annotated judgment documents, masked language modelling cannot steal a look into our two LJP tasks.Our JTD test set is still unseen data for BERTja-based models.We can perform a sanity check by conducting the same experiments with BERTja.By comparing BERTja-based and JLBERT-based models, we avoid overestimating the performance of JLBERTbased models and provide an informative oversight of how the domain-adopted models behave in the Japanese legal judgment tasks.

Rationale Extraction (RE)
Rationale Extraction task identifies accepted arguments in the parties' arguments.Inputs are undisputed facts (U ) and arguments from both parties (P from plaintiffs and D from defendants).Outputs are two sequences of Boolean values, R P for plaintiffs and R D for defendants, denoting accepted arguments as True.
The models used for RE are as follows.

Tort Prediction (TP)
Tort Prediction predicts whether a tort is affirmed (T , a Boolean value), given undisputed facts (U ) and arguments from both parties (P and D) as inputs.
The following are models used for TP.• TP-random: A random baseline, similar to RE-random.
• TP-RF-meta: A randomforest [33] classifier.This model takes non-textual features as inputs: the year of a case, the court to which a case belongs and the number of claims from each party8 .This baseline shows how well a model can predict court outcomes using meta-level features.• TP-RF-gold: A similar model to TP-RF-meta, but this model uses the acceptance rates of each claim type instead of their number.The acceptance rates are calculated according to the gold labels of the rationale extraction task in JTD.Note that the target task here is TP, and this model is still blind against the gold labels of TP during validation and testing.This baseline provides a milestone for the TP task.• TP-RF-cascaded: A classifier similar to TP-RF-gold.It uses the predicted rationales instead of the golds to calculate acceptance rates.
• TP-IST: An IST-based classifier.It outputs a Boolean value for TP, taking all claims from plaintiffs and defendants as inputs.This model also has two variants: TP-IST-BERTja and TP-IST-JLBERT.• TP-IST-gold: An IST-based classifier whose input is only accepted gold claims.This is an IST version of TP-RF-gold.There are TP-IST-gold-BERTja and TP-IST-gold-JLBERT.• TP-IST-cascaded: Its architecture is identical to TP-IST-gold.This model takes only predicted rationales by RE as inputs.

Multi-task Approach
We also take a multi-task approach with the IST architecture, in which the model learns its parameters jointly for both RE and TP tasks using the combined loss function (1).
The multi-task IST takes a sequence of claims and outputs a sequence of Boolean predictions for RE and a Boolean value for TP.There are two variants, Multi-IST-BERTja and Multi-IST-JLBERT.

Experimental Settings
In the training phase of all neural network-based models, we used the AdamW [34] optimiser with a linear scheduler whose warmup step was 10% of total training steps.Epochs were fixed at 30.The maximum length of the word-level encoders is 512.IST models accept up to 64 claims.We chose the best-performing checkpoint according to accuracy scores in the development set.The parameters of word-level encoders (BERTja and JLBERT) are not frozen and fine-tuned.
We employed an optimisation framework Optuna [35] to search optimal hyperparameters for each neural network-based model.All hyperparameters were tuned using 3,000 instances of the training data.For the RE-BERT models, we performed a grid search to find the optimal learning rate where the total number of trials was four.For the IST models, we conducted more extensive searches.We utilised the Treestructured Parzen Estimator algorithm [36] for IST as their search spaces were much larger than RE-BERT's.The total number of trials was 112.Table 8 shows their search spaces.The "TRenc" hyperparameters are for the Transformer module implemented as the span-level encoder in the IST."Use UF" is a flag whether a model utilises fact embeddings or not.α is a weight of the loss function (Eq (1)) for the multi-task models.
We adopt accuracy for the evaluation metrics for both RE and TP.We trained and tested each model five times with different random seeds and averaged the scores.We also performed the permutation test to assess the statistical significance between models (significance level is at p < 0.05, two-tailed test).The target metric of the permutation test is the accuracy score.

Cascaded Model Settings
In the cascaded models, an RE model first predicts claims to be accepted, which are then fed to a TP model.We employ the outputs from the best-performing RE model.To be concrete, we chose Multi-IST-JLBERT and Multi-IST-BERTja for the RE task in the cascaded model.We notate the cascaded models using these RE models as follows: TP-RF-cascaded-BERTja, TP-RF-cascaded-JLBERT, TP-IST-cascaded-BERTja, and TP-IST-cascaded-JLBERT.

Rationale Extraction
Table 9 shows experimental results for the RE task.The "All" column shows accuracy scores calculated on all claims.We use "All" scores as the target metric for the permutation test.The "Plaintiff" and "Defendant" columns show accuracy scores of claims from plaintiff and defendant, respectively.The "Doc-level" column shows accuracy scores at the document level, where the scores are first calculated per document and then averaged over documents.
According to "All" scores, IST showed significant improvement from RE-BERT; capturing the context between input claims helped the task.Comparing BERTja and JLBERT, we find the model using JLBERT always performs numerically better than that with BERTja, confirming the statistically significant difference except for pairs of multi-task approaches.The multi-task models (Multi-IST- * ) always showed better accuracy than their corresponding single-task models (RE-IST- * ).The improvement by the multi-task model is more significant when using BERTja than JLBERT.The improvement from RE-IST-BERTja to Multi-IST-BERTja is statistically significant.
When we see scores according to parties, the overall trend is the same as "All" scores.An exception is "Plaintiff" where RE-IST-JLBERT achieved the best score among all the models.
The "Doc-level" scores share the same trend with "All".

Tort Prediction
Table 10 shows experimental results for the TP task.All TP-IST and Multi-IST models were significantly better than TP-RF-meta in accuracy.Gold models ( * -gold- * ) showed approximately 0.88 in accuracy, suggesting an upper bound of the TP models given the perfect RE output.The multi-task models showed the best performance in TP as well as RE.The multi-task models are always numerically better than their corresponding single-task models in accuracy, and we observed statistical significance between Multi-IST-BERTja and TP-IST-BERTja.
The cascaded models showed improvement from their corresponding single-task models; however, even the best model (TP-IST-cascaded-BERTja) did not come close to the Multi-IST models.

Discussion
In both tasks, the multi-task models show the best performance, suggesting they could leverage interaction between the tort judgment and its supporting arguments.This result is aligned with legal experts' behaviour in interpreting legal cases.They do not simply conclude in a bottom-up manner.Rather, they make inferences moving between consideration of subordinate arguments and the conclusion.We assume the multi-task models that jointly learned both TP and RE could model the legal prediction better.
The best accuracy we achieved in the TP task was 0.683 with JLBERT as a wordlevel encoder and 0.680 with BERTja.Those scores are far from perfect and still much lower than 0.880, which is the milestone achieved by the gold models.Our scores are comparable with an accuracy score of 0.668 reported in the class action LJP task from the United States [4] 9 .
The high accuracy of the gold TP models indicates the importance of RE for TP.The best RE accuracy was 0.674 with JLBERT and 0.666 with BERTja.There is a big room to be improved from those baseline models yet.JTD implements claimlevel rationale extraction for LJP while previous work employed paragraph-level [23].Challenges of JTD stem from finer-grained targets to be classified where more precise inference is required.
In the RE task, we observed that IST models ( * -IST- * ) showed higher accuracy for the plaintiff's claims than the defendant's claim.Differences in the number of instances between the plaintiff and the defendant could have caused this result.JTD has more claims from the plaintiff (52.5%) than the defendant (47.5%) (Table 6).
JLBERT is generally superior to BERTja for the same architecture for both tasks except for the cascaded models.However, the difference is small for the best models (Multi-IST- * ).As our tasks are specific to the legal domain and JLBERT was trained by the judgment documents that are the source of JTD, we expected better performance from the JLBERT-based models.In reality, however, the effectiveness of JLBERT was limited.This indicates that the span-level encoder and its following layers play more important roles than the word-level encoders in our tasks.

Error Analysis
We conducted a detailed error analysis employing human legal experts on the tort prediction outputs to identify the source of prediction errors.

Analysis Setup
Four experts (the authors of this paper) participated in the analysis.They are all legal professionals, including three professors of Law at Japanese universities and a Japanese lawyer who has eight years of experience.
We consider the outputs from the well-performed four models: TP-IST-BERTja, TP-IST-JLBERT, Multi-IST-BERTja and Multi-IST-JLBERT.We merged the TP outputs by majority voting from five runs of each model.We analysed instances where all four models failed in prediction.We obtained 139 out of 811 instances from our test set.Due to limited human resources, we randomly selected 60 from 139 instances.Two of the experts got 20 instances each, and the other two got 10 each.The experts were asked to fill out a form given the same inputs as the models.They were allowed

Human Confidence Score
The experts were asked to rate how confidently they could make tort predictions with the given input.The score scale is 0, 1, 2, 3, where 3 means a human legal expert can confidently judge whether an instance is a tort or not, 2 means a human legal expert can predict with uncertainty, 1 means a human legal expert can only predict a tendency of its outcome and 0 means even a human legal expert cannot predict its outcome at all.

External Knowledge
We hypothesised that a certain type of error occurred because of missing necessary information in the input texts, i.e., legal domain-specific knowledge and facts described in an external document other than a judgment document.We asked the experts to identify what kind of external knowledge was necessary to make a correct prediction.There are four options, "General knowledge", "Legal knowledge", "Insufficient input", and "Other"."Insufficient input" is chosen when an expert believes essential information specific to the instance is missing from input claims or undisputed facts, while "General knowledge" and "Legal knowledge" would be chosen when missing information is independent of the instance."Other" is chosen when the above three do not fit.The necessary information is detailed for the "Other" option in the free text.The experts could choose multiple options from the four.
In addition to the two questions above, the experts could submit their feedback and comments.

Result
Figure 5 shows human confidence scores on 60 instances.There are 23 instances in which even human experts cannot predict their outcomes at all, while the experts make only four confident predictions.Scores 0 and 1 make more than 70% of the analysed instances, which is considered unconfident.The results suggest those wrongly predicted instances by models are also difficult for a human legal expert.
52 out of 60 instances are flagged with one or more types of external knowledge.Table 11 shows the distribution of instances across the missing information types.Only two instances require "General knowledge".One of them is about a sender's information disclosure, which requires missing information "the same IP address does not necessarily mean that the same person posted the messages".The experts found 15 instances require "Legal knowledge" to predict their outcome correctly.The legal knowledge includes specific requirements of certain laws, civil code procedures, and heuristics knowledge, which a legal expert can obtain through their experience.
The dominant category was "Insufficient inputs" (35 instances), which stem from the nature of the judgment document format.Judgment documents often refer to external documents for detailed evidence.In such cases, descriptions and claims about the evidence in the judgment document itself may not be specific.Often, those external documents are not publicly available.Thus, the annotated claims and undisputed facts from the judgment document might be insufficient for confident prediction.Also, in some judgments, concrete descriptions of evidence or even factual claims from the parties can be pseudonymised or censored to protect private and confidential information.
Another cause of insufficient inputs comes from the annotation process.For example, an annotator failed to extract necessary claims and facts during the annotation process.Our annotation guideline prohibits, in principle, the extraction of claims and facts from the "court's decision" section, which contains information corresponding to the answer in the TP task (3.1).Therefore, the necessary facts were not fully annotated when they were described only in the "court's decision" section.
Eight out of 60 instances are not flagged with any external knowledge.However, five are rated with a confidence score of 1, suggesting that they are not necessarily easy to predict.The experts reported that those instances were from controversial cases, including a case whose decision was flipped in a higher court.Such a case is considered challenging for a machine.
Many comments and feedback from the experts are on instances extracted from complicated cases with multiple tortious actions to be judged.Our annotation guideline instructs annotators to distinguish one tortious action from others.When there are many tortious actions in the same documents, there are many related claims and facts, so argumentative relations between them are complicated.For example, a single claim can be related to multiple tortious actions.Thus, an annotator can miss necessary facts or fail to exclude unnecessary claims in the instance, resulting in instances with insufficient inputs.Moreover, even if instances in such cases are annotated perfectly, they are still challenging to predict their outcomes because of the tangled argumentative relations between claims.
Counterclaims make instances more complicated, which allows defendants to assert a new claim against plaintiffs during the same lawsuit that the plaintiffs initially filed.The experts identified three out of 60 instances where counter-claims made the instances difficult.In annotating counterclaim instances, our annotation guideline explicitly instructed annotators to extract the original plaintiffs' claims as defendant claims and the original defendants' claims as plaintiff claims to make argumentative relations between claims consistent with non-counterclaim instances.Nevertheless, we found the counterclaim instances still confusing to a machine.For example, the counterclaim's plaintiff was expressed with "defendant" in the input and vice versa.Although these notations were correct for a counterclaim instance, they can confuse a machine since most instances are non-counterclaim.
Another remarkable feedback is about instances from medical cases.In those cases, detailed medical procedures and expert opinions, which are crucial to making a legal decision, may be prepared separately from the main judgment documents.As a result, our annotators could not extract the necessary information for such instances.Therefore, these medical cases can be difficult to predict.

Limitations and Ethical Considerations
We clarify limitations and ethical considerations here to prevent misuse of our dataset and misunderstanding of our findings.
The task design for LJP reflects essential elements of the Japanese jurisdiction.However, there are differences from real-world conditions.The inputs in our tasks are only Undisputed Facts, Plaintiff's Claims, and Defendant's Claims, which are obtained from judgment documents.In real court cases, there are other documents, including third-party expert opinion letters, detailed evidence documents, and other undisclosed documents (e.g., private information and confidential patent information).They are intermediate outputs or auxiliary inputs in court cases but are still important.This difference limits the model's capability, as shown by manual error analysis results (6.2).The difference often stems from the limited scope of publicly available information.We suspect such limitation also applies to other jurisdictions that do not disclose the whole court document sets.As many LJP datasets are constructed from only judgment documents, we must recognise that the current LJP task design can be limited and reflect only a part of legal decision-making.
The annotation scheme for our dataset was validated through the preliminary experiment with 25 documents annotated by five legal experts.We note that the number of documents used in the preliminary experiment is lower than that of our final dataset, i.e. 3,477 documents, and we did not perform dual-coder annotation for the final dataset.Although we took alternative measures (extended tutorials, and annotator selection via dry-run) to ensure the quality of annotation, we acknowledge that the final dataset may contain inevitable human errors.
We also acknowledge that our dataset is limited in its scope.We must point out that the dataset contains tort-related cases, which is a part of the whole legal judgement in the Japanese Civil jurisdiction.Thus, our findings from the experimental results cannot directly apply to the general Japanese LJP tasks.However, we believe our baseline results provide informative clues to other civil legal judgement tasks.Another limitation is the quality of the annotations.While we have made an effort to keep the annotation quality as high as possible, there is still a chance of errors and misunderstandings by the annotators.Although the agreement study shows reasonable performance, we plan to update and expand our JTD on a long-term schedule continuously.
Given the limitations, we emphasise that one should not rely solely on a model trained on JTD in one's legal decision-making.Our recommended use case of the JTDtrained models is legal assistance services, but they should not be fully automated.We recommend using the models with legal professionals such as lawyers 10 .This humanmachine hybrid approach will improve the lead time of case handling and correct potential errors from models' false predictions.
Moreover, we emphasise that JTD is not intended to develop a judgment system.In other words, we did not intend to replace judges or courts with JTD-trained models.Important intellectual activities of judges include the conceptualisation of new rights and the update of law interpretations in response to age.However, the current LJP tasks in JTD do not cover such aspects.Therefore, we argue only human judges should take responsibility for such authority.

Conclusion and Future Work
We proposed a novel dataset for the Japanese legal judgment prediction featuring tort cases under the Japanese Civil Code.We specifically targeted two tasks: tort prediction and rational extraction.This is the first dataset consisting of real Japanese court cases with human expert annotation.Although we still have to continue to improve the annotation quality, our annotation procedure will be a good starting point to produce reliable datasets in the Japanese legal domain.
We conducted a feasibility study of the two tasks with the baseline models.We also compared the performance between single-task and multi-task approaches.The baseline experiments confirmed the feasibility of our tasks, and we found that the multi-task approach performed better.Moreover, we manually analysed the outputs from the IST models with experienced legal specialists to diagnose the errors.The results suggest that there are difficult cases even for human experts.
Our dataset and extended expert analysis only made these novel findings possible.We believe our dataset provides a useful resource in legal NLP for not just Japanese LJP research but also comparative experiments and analysis across different languages and jurisdictions.
In future work, we plan to extend the size of our dataset and increase the types of cases in addition to tort cases.Additionally, adapting our annotation scheme and tasks to other jurisdictions is an interesting direction for further research.
Fig. 1: Rationale-supported legal judgment prediction featuring a tort case

Fig. 4 :
Fig. 4: Architecture of the IST models for legal judgment prediction with rationale extraction, where there are N claims in total and M out of N claims are from Plaintiff.Fact embedding E f is shared across all claims, while party-type embeddings E p and E d are exclusive to the plaintiff's and defendant's claims, respectively.E ps n is a position embedding for the n-th claim.

Fig. 5 :
Confidence scores by human expert to see the original judgment documents, if necessary.There are two major questions in the form.

Table 1 :
Queries used in our document retrieval.Translations in square brackets are ours.

Table 2 :
Japanese Tort-case Dataset overview

Table 3 :
Dataset split.We split the dataset according to the number of instances.

Table 4 :
Label (tort or not) distribution of instances.

Table 5 :
Label (accepted or not) distribution of claims.

Table 6 :
Label (accepted or not) distribution of claims by parties.

Table 7 :
Length statistics of claims and undisputed facts.Mean, standard deviation, and percentiles.IST takes a sequence of spans as input, where each span corresponds to a claim from the plaintiff or defendant side.Its outputs are a single Boolean flag for tort judgment prediction and a sequence of Boolean flags for rationale extraction.These final outputs are obtained through a linear layer just after the output from the span-level encoder.IST accepts party-type embeddings to distinguish between the hierarchical Transformer-based models (Figure4) are designed to capture both wordlevel context and span-level context by implementing a span-level Transformer encoder on top of word-level Transformer encoders.We call this model Inter-Span Transformer (IST).
• RE-random: A random baseline that produces prediction based on the ratio of labels in the training set.• RE-BERT: A BERT-based binary classifier.It classifies each input claim independently without regarding context across claims.There are two variants: RE-BERTja and RE-JLBERT depending on the BERT model used.• RE-IST: An IST-based classifier.It outputs a sequence of Boolean values for RE, taking a sequence of claims from plaintiffs and defendants as inputs.There are RE-IST-BERTja and RE-IST-JLBERT.

Table 8 :
Hyperparameters search space

Table 9 :
Experimental results of RE (accuracy with standard deviation).

Table 10 :
Experimental results of TP (accuracy with standard deviation).

Table 11 :
Missing information for the TP task